The old neural network architecture labels one frame (2D) from a single camera and trains a neural network on the last two frames (2.5D: 2D (x,y) + 0.5D (~time)) to predict where objects are in the image (2D), then does some neural network magic to get that into bird's-eye view (~2.5D).
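As a rough sketch of that input setup (all shapes here are made up for illustration), the "2.5D" input is just the last two 2D frames stacked along an extra axis, where that extra axis is the thin slice of time:

```python
import numpy as np

H, W = 4, 6                      # tiny image size, purely illustrative
frame_prev = np.zeros((H, W))    # previous camera frame (2D)
frame_curr = np.ones((H, W))     # current camera frame (2D)

# Stack the last two frames along a new leading axis: 2D (x, y) plus a
# "0.5D" of time, i.e. the 2.5D input the old architecture trains on.
net_input = np.stack([frame_prev, frame_curr], axis=0)
print(net_input.shape)   # (2, 4, 6)
```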
The new neural network architecture will take a video feed, generate a point cloud of all the static objects (3D) and of the moving objects (4D), and train a neural network to predict where static objects are (3D) and where dynamic objects will be (4D), based on the current image frames and recurrent information from the neural network at the previous timestep.
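A minimal toy sketch of that recurrent loop (every size, weight, and function name here is a made-up stand-in, not the real architecture): each step takes features from the current frames plus the hidden state carried over from the previous timestep, and emits both a static (3D) and a dynamic (4D) prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

FEAT = 16      # hypothetical feature size extracted from the current frames
HIDDEN = 32    # recurrent state carried over from the previous timestep
STATIC = 8     # stand-in for the static-object (3D) output
DYNAMIC = 12   # stand-in for the dynamic-object (4D) output

# Toy random weights standing in for a trained network
W_in = rng.normal(size=(FEAT, HIDDEN)) * 0.1
W_rec = rng.normal(size=(HIDDEN, HIDDEN)) * 0.1
W_static = rng.normal(size=(HIDDEN, STATIC)) * 0.1
W_dynamic = rng.normal(size=(HIDDEN, DYNAMIC)) * 0.1

def step(frame_feat, h_prev):
    """One recurrent step: current-frame features + previous hidden state
    -> new hidden state, static (3D) and dynamic (4D) predictions."""
    h = np.tanh(frame_feat @ W_in + h_prev @ W_rec)
    return h, h @ W_static, h @ W_dynamic

h = np.zeros(HIDDEN)
for t in range(5):                      # feed a short "video" of 5 frames
    frame_feat = rng.normal(size=FEAT)  # stand-in for per-frame CNN features
    h, static_pred, dynamic_pred = step(frame_feat, h)

print(static_pred.shape, dynamic_pred.shape)   # (8,) (12,)
```

The point of the sketch is just the data flow: the hidden state `h` is the "recurrent information from the previous timestep" that lets the dynamic predictions depend on motion across frames, not on a single image.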
I think he is referring to how the neural network will internally represent the information. If it needs to think in 4D, it will start to think in 4D. To predict where a moving car will be in 4D space (which is needed, for example, to predict what the next frame will look like; that isn't strictly necessary, but it's a good way to augment the network), the neural network will find an internal representation in vector form for this. See this video at 22:47.