The bollard is seen but the ground is nowhere in view.
Sonar works like a very simple tape measure. You roll the tape out and read off the distance. The sensor emits an ultrasound pulse, and the time the echo takes to bounce back translates directly into distance.
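To show how little math the "tape measure" needs, here is a minimal sketch of the echo-time calculation. The function name and the 343 m/s speed of sound (roughly dry air at 20 °C) are my own assumptions for illustration, not anything taken from a real sensor's firmware.

```python
# Assumed value: speed of sound in dry air at about 20 degrees C.
SPEED_OF_SOUND_M_S = 343.0


def echo_distance_m(round_trip_s: float, speed_m_s: float = SPEED_OF_SOUND_M_S) -> float:
    """Return the distance to an obstacle from the round-trip echo time."""
    if round_trip_s < 0:
        raise ValueError("round-trip time must be non-negative")
    # The pulse travels out and back, so the obstacle is at half the path length.
    return speed_m_s * round_trip_s / 2.0


if __name__ == "__main__":
    # An echo that returns after 6 ms corresponds to roughly one metre.
    print(f"{echo_distance_m(0.006):.3f} m")  # prints "1.029 m"
```

That one multiplication is the whole measurement, which is why ultrasonic parking sensors are so easy to work with.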
It's harder with cameras. As you said, the system can use cues like the bollard and the ground, but it first has to translate pixels into meaningful objects (labeling). It's not a simple ruler. It takes a very complicated computation over the pixels to get what you want, in this case distance.
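For a feel of how the ground cue turns into a distance, here is a rough sketch using the textbook flat-ground pinhole model. It is not how any production system does it. It assumes a level camera, flat ground, no lens distortion, and that the labeling step has already found the pixel row where the object touches the ground. All names and numbers are hypothetical.

```python
def ground_distance_m(
    pixel_row: float,
    principal_row: float,
    focal_px: float,
    camera_height_m: float,
) -> float:
    """Estimate forward distance to a ground-contact point from its image row.

    Assumes a level pinhole camera above flat ground. Rows increase downward,
    so ground points appear below the principal row (the horizon line here).
    """
    rows_below_horizon = pixel_row - principal_row
    if rows_below_horizon <= 0:
        raise ValueError("contact point must lie below the horizon row")
    # Similar triangles: camera_height / distance == rows_below_horizon / focal_px.
    return camera_height_m * focal_px / rows_below_horizon


if __name__ == "__main__":
    # Hypothetical camera: 1.2 m high, 1000 px focal length, horizon at row 400.
    print(ground_distance_m(pixel_row=600, principal_row=400,
                            focal_px=1000, camera_height_m=1.2))  # prints 6.0
```

Note that the formula needs the contact point with the ground. If the ground is out of view, this cue is gone and the distance has to come from something else.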
Right, it sees the object before pulling in and then uses vehicle and scene kinematics to keep tracking it. The Occupancy Network is the hard part. Once that works, USS (ultrasonic sensor) behavior is easy to code.
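The tracking idea can be sketched as simple dead reckoning: once a static object has been located, its position relative to the car can be updated from the car's own motion even after it leaves the camera's view. This is only an illustration of the principle, not the Occupancy Network. The function name, the frame convention (x forward, y left) and the example numbers are all assumptions.

```python
import math


def update_obstacle_in_vehicle_frame(
    obstacle_xy: tuple[float, float],
    ego_translation_xy: tuple[float, float],
    ego_yaw_change_rad: float,
) -> tuple[float, float]:
    """Re-express a static obstacle in the vehicle frame after the car moves.

    obstacle_xy and ego_translation_xy are given in the OLD vehicle frame
    (x forward, y left). A positive yaw change is a left turn.
    """
    # Shift the origin to the car's new position.
    dx = obstacle_xy[0] - ego_translation_xy[0]
    dy = obstacle_xy[1] - ego_translation_xy[1]
    # Rotate by the inverse of the car's yaw change to get new-frame coordinates.
    c, s = math.cos(ego_yaw_change_rad), math.sin(ego_yaw_change_rad)
    return (c * dx + s * dy, -s * dx + c * dy)


if __name__ == "__main__":
    # Bollard seen 2 m ahead; the car then creeps 0.5 m straight forward.
    print(update_obstacle_in_vehicle_frame((2.0, 0.0), (0.5, 0.0), 0.0))  # prints (1.5, 0.0)
```

Repeating this update every time step with odometry gives a running distance to an object the cameras can no longer see, and at that point the familiar USS-style warning logic is just a threshold on that distance.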