Tesla.com - "Transitioning to Tesla Vision"

Everyone has their opinion on how they think AP works, but what I've told you are facts regardless of opinion. My background is actually in computer vision: I have been working on dedicated computer vision projects since 2016 and have personal friends who are part of the AP team at Tesla. Convolutional neural networks take in frames; you can create a pipeline that starts with stitched frames (e.g. video) and has them parsed out on the fly, which makes it seem like you are working with "video" only. The "training" that occurs is a supervised approach in which annotators look over parsed video frames and label objects of interest. The labeling process creates an XML file with the coordinate points of what you labeled and the classification. This is what then goes into training, and why this is a supervised approach. At inference time (e.g. when you are driving your car on Autopilot), the video is read in, parsed into frames, and run through the layers of the convolutional neural network, and inferences on object coordinates are made. Fast forward this to 20 FPS of processing through the GPU and you have live "video" inferencing.
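To make that concrete, here is a minimal sketch of that kind of per-frame loop. Everything here is a placeholder (the video file and the detect_objects stub are hypothetical, not Tesla's actual pipeline); the point is only that each frame is run through the CNN on its own.

```python
import cv2  # OpenCV, used here only to split a video file into individual frames


def detect_objects(rgb_frame):
    """Hypothetical single-frame CNN detector.

    Stands in for whatever network is being described above; a real one would
    return a list of (class_label, bounding_box, confidence) tuples.
    """
    return []  # placeholder


cap = cv2.VideoCapture("dashcam.mp4")  # placeholder clip, not a real camera feed
while cap.isOpened():
    ok, frame = cap.read()  # the "parsed into frames" step
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    detections = detect_objects(rgb)  # per-frame inference: classes + box coordinates
    # Repeat this ~20 times per second and it looks like live "video" inferencing,
    # even though each frame is processed in isolation.
cap.release()
```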
The NN is still working with only one frame at a time in your example. Anyway, instead of me paraphrasing, here is what Elon has said about the distinction between what they are doing with their latest software vs. what they have been doing previously:
"The actual major milestone that is happening right now is really transition of the autonomy systems of cars, like AI, if you will, from thinking about things like 2.5D, things like isolated pictures… and transitioning to 4D, which is videos essentially. That architectural change, which has been underway for some time… really matters for full self-driving. It’s hard to convey how much better a fully 4D system would work. This is fundamental, the car will seem to have a giant improvement. Probably not later than this year, it will be able to do traffic lights, stops, turns, everything."
Tesla Video Data Processing Supercomputer ‘Dojo’ – Building a 4D System Aiming at L5 Self-Driving | Synced

The important distinction is that the NN now takes into account previous frames when analyzing the current frame (and likely also factors in the time between frames). That's why it's more like "video" (where compression methods typically treat certain frames as key frames, then encode the differences between frames), while the previous approach, which Elon describes as "2.5D", is more like a series of images treated in isolation.
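As a loose analogy for the "key frame plus differences" idea (this is about how video codecs work, not literally what the NN does), frame differencing looks like this:

```python
import numpy as np

key_frame = np.zeros((480, 640), dtype=np.uint8)   # stand-in "key frame"
next_frame = key_frame.copy()
next_frame[200:220, 300:340] = 255                 # one small region "moved"

delta = next_frame.astype(np.int16) - key_frame.astype(np.int16)
print(np.count_nonzero(delta), "of", delta.size, "pixels changed")
# A codec stores the key frame once and then mostly just these sparse differences.
# The point above is analogous: the new approach lets the NN see the relationship
# between frames, instead of treating each frame as a brand-new, unrelated image.
```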
 
Nah, you can have a master's and still be wrong. And yes, you are wrong.
LOL, my last post; this reminds me why I don't really do social media. When someone tries to explain something to you with knowledge beyond Medium articles, Wikipedia, and videos aimed at educating at the 30k-foot level, you should listen, observe, learn and STFU. Remember, what AK releases to you via YouTube is the same *sugar* he shows people out of their depth within Tesla. Anyway, back to actually creating these vision systems.
 
err... you realize that AK video was a stream from CVPR, not something AK "released on YouTube", and that it is from one of the most important annual conferences in the field of computer vision and pattern recognition... not some rando investor presentation... right?
 
He has a master's.
 
Convolutional neural networks take in frames; you can create a pipeline that starts with stitched frames (e.g. video) and has them parsed out on the fly, which makes it seem like you are working with "video" only.
I'm not sure what you mean by this.
In computer vision, the term stitching is typically used for methods that combine semi-overlapping images of (spatial) regions of the same scene; creating panoramas from multiple images is an example. Here, however, you seem to be referring to temporal stitching (combining images recorded at different times by the same camera). How does that work?
 
Are @powertoold and @mikes_fsd trying to say that video is not similar to pictures in a sequence?

There's a difference between training a NN on images and training it on video. Yes, a video is a sequence of images, but almost all camera-based applications of NNs are trained on images (for example, hot dog vs. not hot dog, or simple object recognition). These applications take in a sequence of images (video) and then run the NN model on each frame. The output prediction is usually "hot dog" + bounding box + some confidence %.

The difference with video is that you want output predictions such as velocity, acceleration, or "future trajectory." These types of predictions are not possible by training with only images. For example, you can't just use 1 million labeled images of cars with their correct velocities and expect the NN to output the correct velocity of a new car image. That's because velocity requires a change in position over time. The NN has to be trained on related image sequences (video with moving cars) in order to output these predictions.
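A toy numeric example of why that is (nothing Tesla-specific, just the arithmetic): a single frame only gives you a position, while two time-stamped frames give you a velocity.

```python
# Hypothetical detections of the same car's position in two consecutive frames,
# already projected into metres in the ego vehicle's frame of reference.
dt = 1.0 / 20.0                     # 50 ms between frames at 20 FPS

pos_frame_n = (12.0, 3.5)           # (x, y) of the car in frame n
pos_frame_n1 = (11.4, 3.5)          # the same car, one frame later

vx = (pos_frame_n1[0] - pos_frame_n[0]) / dt   # -12.0 m/s: closing on us
vy = (pos_frame_n1[1] - pos_frame_n[1]) / dt   #   0.0 m/s: no lateral motion
print(f"estimated velocity: ({vx:.1f}, {vy:.1f}) m/s")

# From frame n alone there is nothing to difference, so a network trained purely on
# isolated labeled images has no signal from which to learn velocity.
```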

Training NNs with video to derive velocity and acceleration is actually a difficult problem. That's why in the current talk, Karpathy said that people in the CV industry don't believe cameras can do rangefinding with the accuracy needed to drive cars.
 
Well then, I don't see why Isidro JR's explanation of how the "4D" tech Musk talked about works is controversial?
I don't think there's anything here that really rises to the level of "controversial", but I don't think Isidro JR really did any explaining about what is actually different between the previous "2.5D" system and the current "4D" system in Tesla Vision. His statement seems to imply the only difference is that it's fusing the different views (from 8 images), and that other than that there is no difference (it's still doing everything frame by frame).

I can't say I know all the details either (like everyone here, we can only go by what Tesla is willing to make public), but from Elon's statement and from watching the latest video in full, my understanding is different. For example, Andrej talks about how, over the last few months of working on this, they were able to determine it was possible for their synthetic vision NNs to get depth, velocity, and acceleration sensing with vision alone that is just as good as, if not better than, radar. That implies their previous solution was not doing that (at least not in a very reliable way).

It makes more sense if you consider the NNs as treating the data more like "video", instead of isolated "frames" or "images". It's still possible to do monocular depth sensing from a single frame or image, but getting velocity/acceleration is much harder (although not impossible: there are ways to estimate velocity from motion blur in a single frame, for example). However, once you consider multiple frames, determining velocity by comparing the differences between two or more frames is obviously much easier.

And from the diagram shown in this post (around 19 minutes into the video), the part I would be interested in is what is being spit out on the line that goes to each "head". Note that the line takes input from multiple points in time in the video tensor queue, and that it goes to a "video module" in each head (not an "image module"). Andrej says they first fuse all the images across all the cameras, then they fuse the information across time (temporally), either with a recurrent neural network, a transformer, or 3D convolutions.
[Diagram: video queue of fused camera features feeding the "video module" in each head]

Tesla.com - "Transitioning to Tesla Vision"
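For what it's worth, here is a very rough PyTorch-style sketch of the overall shape I read off that diagram: per-camera backbones, a cross-camera fusion step, a temporal "video" module over a queue of timesteps, and per-task heads. Every layer size, the choice of a 3D convolution for the temporal part, and the two heads are my own placeholders, not Tesla's architecture.

```python
import torch
import torch.nn as nn


class ToyVisionNet(nn.Module):
    """Sketch only: per-camera backbone -> cross-camera fusion -> temporal module -> heads."""

    def __init__(self, n_cameras=8, feat=64):
        super().__init__()
        self.backbone = nn.Sequential(                  # shared per-image feature extractor
            nn.Conv2d(3, feat, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
        )
        self.fuse_cameras = nn.Conv2d(n_cameras * feat, feat, 1)          # fuse the 8 views
        self.temporal = nn.Conv3d(feat, feat, kernel_size=3, padding=1)   # the "video module"
        self.depth_head = nn.Linear(feat * 8 * 8, 1)     # e.g. distance to lead object
        self.velocity_head = nn.Linear(feat * 8 * 8, 2)  # e.g. (vx, vy)

    def forward(self, frames):          # frames: (batch, time, cameras, 3, H, W)
        b, t, c, _, h, w = frames.shape
        x = self.backbone(frames.reshape(b * t * c, 3, h, w))     # per-image features
        x = x.reshape(b * t, -1, 8, 8)                            # stack cameras as channels
        x = self.fuse_cameras(x)                                  # fuse across cameras
        x = x.reshape(b, t, -1, 8, 8).permute(0, 2, 1, 3, 4)      # (b, feat, time, 8, 8)
        x = self.temporal(x)                                      # mix information across time
        x = x[:, :, -1].flatten(1)                                # features at the latest timestep
        return self.depth_head(x), self.velocity_head(x)


net = ToyVisionNet()
clip = torch.randn(1, 12, 8, 3, 96, 160)   # 12 timesteps x 8 cameras of 96x160 RGB
depth, velocity = net(clip)
```

The real system is presumably far more sophisticated (and Karpathy mentions RNNs and transformers as alternatives to 3D convolutions for the temporal part), but the data flow is the part that matters here.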

I also found the older discussion from ScaledML in February 2020, before Tesla Vision. You can see they were already fusing the different cameras using a Fusion layer. There was talk about a temporal module with the role of "smoothing out" predictions (discussed at around 20 minutes in the linked video). This ties in with Elon's explanation of how the "2.5D" system worked during the July 2020 earnings call: "Well, the actual major milestone that's happening right now is really a transition of the autonomy system of the cars like AI, if you will, from thinking about things in, I call it like, 2.5D. It's like -- think of taking like isolated pictures and doing image recognition on pictures that are harshly correlated in time but not very well and transitioning to kind of a 4D where it's like -- which is video essentially." So they weren't completely ignoring time even in the 2.5D approach, but it was only being taken into account in a very rough way.
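Just to illustrate what "smoothing out" per-frame predictions could mean in its crudest form (my own toy example; I don't claim this is what Tesla's temporal module actually did), think of an exponential moving average over frame-by-frame outputs:

```python
def smooth(per_frame_values, alpha=0.2):
    """Exponential moving average over a stream of per-frame predictions,
    e.g. the estimated distance to the lead car. This only reduces jitter;
    it doesn't let the network reason about motion the way a true video
    ("4D") architecture can."""
    smoothed, state = [], None
    for value in per_frame_values:
        state = value if state is None else alpha * value + (1 - alpha) * state
        smoothed.append(state)
    return smoothed


raw = [30.1, 29.4, 31.0, 28.8, 29.9]   # noisy frame-by-frame distance estimates (m)
print(smooth(raw))
```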

And in this video, it's all about recognizing objects/road features in a static context; there is no mention of detecting the velocity and acceleration of such objects using vision.
[Diagram: BEV net from the ScaledML 2020 talk, including the camera Fusion layer]

How Tesla uses neural network at scale in production
Edited Transcript of TSLA.OQ earnings conference call or presentation 22-Jul-20 9:30pm GMT

I don't think people really care about the mechanism by which the video stream is split into individual frames before being fed into a NN, only that the NN now takes into account data from more than one frame at once (not just the current frame it is processing), which allows detection of the velocity and acceleration of objects (especially important now in the absence of radar, which previously could do at least some of this).

Edit: I wrote my comment before reading @powertoold's post; he said basically the same thing in a more succinct way.
 
The new Honda Civic is camera only too. I bet many other manufacturers will follow suit.
The standard Honda Sensing® suite of active safety and driver assistive technologies uses a new single-camera system that provides a wider field of view than the previous radar-and-camera based system. Combined with software advances and a new, more powerful processor, the system is also capable of more quickly and accurately identifying pedestrians, bicyclists and other vehicles, along with road lines and road signs.

Honda Sensing® has been further enhanced with expanded driver-assistive functionality. The system now adds Traffic Jam Assist, and the new camera-based system improves on existing functionality, such as more natural brake application and quicker reactions when using Adaptive Cruise Control (ACC). It also has more linear and natural steering action when using the Lane Keeping Assist System (LKAS). With the addition of eight sonar sensors, Civic, for the first time, features Low-Speed Braking Control, and front and rear false-start prevention.
 
Everyone has their opinion on how they think AP works, but what I've told you are facts regardless of opinion. My background is actually in computer vision: I have been working on dedicated computer vision projects since 2016 and have personal friends who are part of the AP team at Tesla. Convolutional neural networks take in frames; you can create a pipeline that starts with stitched frames (e.g. video) and has them parsed out on the fly, which makes it seem like you are working with "video" only. The "training" that occurs is a supervised approach in which annotators look over parsed video frames and label objects of interest. The labeling process creates an XML file with the coordinate points of what you labeled and the classification. This is what then goes into training, and why this is a supervised approach. At inference time (e.g. when you are driving your car on Autopilot), the video is read in, parsed into frames, and run through the layers of the convolutional neural network, and inferences on object coordinates are made. Fast forward this to 20 FPS of processing through the GPU and you have live "video" inferencing.
It is true that convolutional neural networks (CNNs) typically process 2-dimensional input data or "frames", but there is no reason why additional dimensions could not be added. Also, even if we were to stick with 2-dimensional frames, those frames can span any two parameters; they are not limited to the horizontal and vertical pixels we typically think of in camera images. One dimension can certainly be time, for example. Video is nothing more than sampled time, in a sense.
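As a concrete (and purely illustrative) example of treating time as just another input dimension, a 3D convolution slides its filter over height, width, and time at once:

```python
import torch
import torch.nn as nn

# A clip of 16 greyscale frames, 64x64 pixels: (batch, channels, time, height, width).
clip = torch.randn(1, 1, 16, 64, 64)

# The kernel spans 3 frames x 3 px x 3 px, so each output value mixes information from
# neighbouring pixels AND neighbouring instants; time is just one more axis to convolve over.
conv3d = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=3, padding=1)

features = conv3d(clip)
print(features.shape)   # torch.Size([1, 8, 16, 64, 64])
```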
 
Does anyone know if there is any NHTSA testing or re-certification activity underway for the no-radar cars? Or any way to find out short of investigatory activity?

I believe Elon's tweet said something about "next week", two weeks ago, and the website says features will be restored in "coming weeks".

Outside of the usual doubt vs. belief, does anyone have info or hints regarding this? Or regarding the procedures to get NHTSA to perform the testing & certification activity when Tesla is ready?
Update on this: NHTSA restored the LDW checkmark after discussions with Tesla on Thursday (which answers the question people raised previously about how its removal didn't make sense, given that LDW depends on the cameras detecting the lane, with radar playing no role).
U.S. safety agency probes 10 Tesla crash deaths since 2016
You can see that the check mark has been restored.
Vehicle Detail Search | NHTSA

On the testing:
Separately, NHTSA has conducted no new tests after it withdrew its designation last month of some newer Tesla Model 3 and Model Y vehicles as having four advanced safety features after the automaker said it was removing radar sensors to transition to a camera-based Autopilot system.

The agency said Thursday that after discussions with Tesla it restored the lane departure warning designation after Tesla confirmed the technology was unaffected.

NHTSA said in a statement it has "not yet finalized the list of Model Year 2022 vehicles" for testing.

Contrary to what some claimed upthread, NHTSA didn't suggest in their statement to the media that Tesla could restore the checkmarks based on Tesla's own testing of features that were affected by the production change. They only commented that they have not yet finalized which vehicles to test for the 2022 model year (which seems to imply we might not get resolution until then).

The procedures for mid-year changes may not necessarily be the same as for model-year changes either. As noted previously, when Tesla lost check marks related to AP2, they were never restored for the "later release" cars, even though the features were added back in software, and the following model year had them listed. The exact same thing may happen here.

FCW and LDW checkmarks were present for the 2016 Model S "Early Release" (NHTSA did not give check marks for AEB until 2018):
Vehicle Detail Search | NHTSA
FCW and LDW checkmarks were removed for the 2016 "Later Release" due to the AP 2.0 changes and were never added back:
Vehicle Detail Search | NHTSA
FCW and LDW checkmarks came back for 2017 model year:
Vehicle Detail Search | NHTSA
 
Wow I missed the riveting discussion of what "video" is! o_O

Remember, previous "progress" in Autopilot was moving to feeding in 2 images at a time (instead of just one), allowing the loosest definition of "video", because technically that is a sequence of images!

Obviously, the more frames per second and the more seconds of frames you feed into the model, the more reliable the model predictions should be in theory. The problem is that the more frames you require, the heavier the model becomes and the harder it is to train (fighting both convergence and computational resources). I believe this is especially true with RNNs and LSTMs, which have very recently been replaced by transformers.

Convolutional neural networks are actually pretty simple: they are basically filters. Filters can be applied both to images (spatially) and to time series (temporally). [I personally work in the area of physiological time series, so I do a lot of CNNs.] CNNs and transformers basically find out which filters work best for extracting information from the signal. This is pretty powerful, because before you would have to pre-determine which filters to try, but now that is part of the optimization process.
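A tiny example of "a CNN is basically a learned filter" applied temporally (my own toy code, not tied to any particular dataset):

```python
import torch
import torch.nn as nn

# One single-channel time series, 100 samples long: (batch, channels, time).
signal = torch.randn(1, 1, 100)

# A 1D convolution is literally a sliding filter over the time axis; its 7 weights are
# what training optimizes, instead of us hand-picking a filter (e.g. a band-pass) up front.
temporal_filter = nn.Conv1d(in_channels=1, out_channels=4, kernel_size=7, padding=3)

filtered = temporal_filter(signal)
print(filtered.shape)   # torch.Size([1, 4, 100]) -> 4 learned filter responses
```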

But while powerful, these models are still pretty stupid. While the filters are great at identifying key pieces of information, CNNs aren't great at understanding their relative positioning. In images, for instance, CNNs may identify eyes and noses, but not fully understand the relative positioning of those attributes (a nose always sits in the middle of a face, etc.). The same happens in time series.

This is where transformers come in, as they help overcome these issues. (Note I haven't done any work with transformers, but I need to soon!)

So previously, Tesla was using basic CNNs on mostly images without much time series knowledge. With less compute power.

Contrast that to now - Tesla is feeding a time series of images into transformers. With more compute power. So not only will they have temporal information ("memory" of objects), they are also using an architecture that allows understanding of the relative positioning of where those objects occur in both space and time.
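A bare-bones sketch of "feed a time series of per-frame features into a transformer" (the feature size, the number of layers, and the omitted positional encoding are all placeholders; this is not Tesla's model):

```python
import torch
import torch.nn as nn

# Pretend each of the last 10 frames has already been reduced to a 256-dim feature
# vector by some CNN backbone: shape (sequence_length, batch, feature_dim).
frame_features = torch.randn(10, 1, 256)

# Self-attention lets every timestep attend to every other timestep, which is the
# "relative positioning in space and time" benefit described above. A real model
# would also add positional encodings so the frame ordering is explicit.
encoder_layer = nn.TransformerEncoderLayer(d_model=256, nhead=8)
temporal_model = nn.TransformerEncoder(encoder_layer, num_layers=2)

fused = temporal_model(frame_features)   # (10, 1, 256): each timestep is now context-aware
print(fused.shape)
```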

Suffice it to say, this could create significant leaps in their perception module.
 
The new Honda Civic is camera only too. I bet many other manufacturers will follow suit.

Or maybe Honda feels that's "good enough" and cuts costs for a car starting at just $21k, *and* has zero ambition to ever be L3... Don't get me wrong, camera-only might work, but I have yet to see a manufacturer claiming L3 is around the corner while removing radar / not using lidar at the same time...
 