Due to LIDAR limitations, vision must be solved for L5.
You must have standalone vision working to multiple nines on a LIDAR vehicle. If the LIDAR knows that a light exists but does not know its state, then vision must determine the state all by itself. If LIDAR knows there's a square block of stucco at an intersection but cannot tell whether a stop sign has been painted on it, then vision must be solved for L5 to be deployed.
The inverse is not true. You do not need LIDAR deployed to solve for vision.
So LIDAR clearly provides a "cheat" that propels self-driving forward maybe 60% of the way, but vision must be solved for the system to be put into production for regular end users.
You need vision to be at 99.99999% accuracy on ONE TASK when using a LIDAR/radar car. This is different from needing 99.99999% accuracy on 100+ TASKS when you have a vision-only system. It's currently impossible for a camera-only system to achieve the 99.99999% accuracy that's needed, in autonomous driving or in any other area of deep learning. It's about levels of accuracy and rates of failure. How many miles can you go without a serious perception failure? You need to be able to go millions of miles.
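To make the "millions of miles" framing concrete, here's a rough back-of-the-envelope sketch. The decisions-per-mile figure is purely an illustrative assumption (not a real number from any AV program), and failures are assumed independent per decision:

```python
# Back-of-the-envelope: how per-decision accuracy maps to miles between failures.
# All numbers here are illustrative assumptions, not measured figures.

def miles_between_failures(accuracy: float, decisions_per_mile: float) -> float:
    """Expected miles driven before one perception failure,
    assuming failures are independent across decisions."""
    failure_rate = 1.0 - accuracy               # failure probability per decision
    failures_per_mile = failure_rate * decisions_per_mile
    return 1.0 / failures_per_mile

# Hypothetical: 100 safety-relevant perception decisions per mile.
print(miles_between_failures(0.9999999, 100))   # "seven nines" per decision
print(miles_between_failures(0.99, 100))        # "two nines" per decision
```

Under these toy assumptions, seven nines per decision buys you on the order of a hundred thousand miles between failures, while two nines gives you roughly one mile, which is why the per-decision accuracy bar is so brutal.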
If vision is solved to multiple nines, then LIDAR is just an expensive extra wart. Another tool, of course, but not particularly useful when combined with a solved vision system.
Our latest efforts in deep learning are not close. Take the fact that there are billions of Android phones and hundreds of millions of Google Homes, and every time you use their voice app they get all of the raw audio data (not a below-~0.1% sample, as in Tesla's case). Yet we haven't gotten past 99% in any category of machine learning, and we are eight years in. Take voice recognition, transcription, or translation: it's still subpar in the real world, even with us speaking slowly and calmly, and even with what is basically a trillion voice samples. Google still hasn't solved voice recognition, and we are nowhere close to solving computer vision. Same with Amazon's Alexa (hundreds of millions of devices): it still falls short.
So the noise that 'all you need is a lot of data' is simply misleading.
This is why having different sensor modalities that fail and excel in different ways, so that they complement each other, is the key. It solves the accuracy problem because radar doesn't fail in the same scenarios as cameras and lidars, lidar doesn't fail in the same scenarios as cameras and radars, and vice versa.
So instead of needing ~99.99999% for one sensor suite, you only need ~99.99% for each modality. It's not lidar vs. camera: if someone had a lidar-only system, they would also need to reach ~99.99999% accuracy with just lidars, which is also not currently possible.
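The redundancy argument above rests on the modalities failing independently, which is a strong assumption (fog or sensor mounting issues can hit several at once). Granting it for illustration, the failure probabilities simply multiply:

```python
# Sketch of the redundancy argument: if camera, radar, and lidar fail
# independently (a strong assumption), their failure probabilities multiply.
# The ~99.99% figure is the per-modality claim from the discussion above.

per_modality_accuracy = 0.9999               # ~99.99% for each sensor modality
per_modality_failure = 1 - per_modality_accuracy

# Probability that all three modalities fail on the same scenario at once:
combined_failure = per_modality_failure ** 3
combined_accuracy = 1 - combined_failure

print(f"{combined_accuracy:.12f}")  # well beyond 99.99999%, if independence holds
```

Three modalities at four nines each land around twelve nines combined, which is why fusing "merely" 99.99%-grade sensors can in principle clear the 99.99999% bar that a single modality cannot.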
I really think Karpathy's talk shed a lot of light on this. They've made tremendous progress, even though it's hard to see right now from outside their labs.
The talk simply highlighted what others are doing in state-of-the-art computer vision research. If you have been following the field, you would know that Tesla is just reaping the benefits; they are not inventing anything. In fact, the models, training data, and network architecture code below are freely available to download.