Neural Networks

When the new v9 NN processes two frames, does it always do two new frames (i.e. 0 and 1, 2 and 3, ...) or does it always use the previous frame (0 and 1, 1 and 2, 2 and 3) ?

If the latter, I'm curious whether there's some further optimization that could be done: rather than processing the previous frame a second time (first as 'new', then as 'old'), could it just cache the results and have continuous integration across frames without the work of processing each frame twice?
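For illustration, here's a minimal Python sketch of the caching idea. The function names are made up for the example; this isn't a claim about how Tesla's pipeline actually works:

```python
# Hypothetical sketch of the caching idea: if the network consumes overlapping
# frame pairs (0-1, 1-2, 2-3, ...), the per-frame feature extraction could be
# done once per frame and reused for the next pair.
from collections import deque

def extract_features(frame):
    # Stand-in for the expensive per-frame convolutional work.
    return [px * 0.5 for px in frame]

def fuse_pair(prev_feats, curr_feats):
    # Stand-in for whatever combines the two frames (e.g. motion cues).
    return [a + b for a, b in zip(prev_feats, curr_feats)]

cache = deque(maxlen=1)          # holds features of the previous frame only

def process(frame):
    curr = extract_features(frame)
    out = fuse_pair(cache[0], curr) if cache else None
    cache.append(curr)           # the previous frame is never re-extracted
    return out

for i, frame in enumerate([[1, 2], [3, 4], [5, 6]]):
    print(i, process(frame))
```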
 
So basically we are driving data collection devices. It’s great to be part of the beginning of the next generation of automobiles, but kind of scary at the same time. Even V9 has its shortcomings, which I’ve experienced and reported via “bug report” and recorded on “dash cam”.
 

That’s an interesting thought: I wonder if a Tesla will eventually use the dash cam footage in conjunction with AP disengagement/“Take control immediately” warnings to flag clips for itself to send to Tesla for review when the car is eventually parked and connected to WiFi?
 
NN Changes in V9 (2018.39.7)

Have not had much time to look at V9 yet, but I thought I’d share some interesting preliminary analysis. Please note that network size estimates here are spreadsheet calculations derived from a large number of raw kernel specifications. I think they’re about right and I’ve checked them over quite carefully, but it’s a lot of math and there might be some errors.

First, some observations:

Like V8 the V9 NN (neural net) system seems to consist of a set of what I call ‘camera networks’ which process camera output directly and a separate set of what I call ‘post processing’ networks that take output from the camera networks and turn it into higher level actionable abstractions. So far I’ve only looked at the camera networks for V9 but it’s already apparent that V9 is a pretty big change from V8.

---------------
One unified camera network handles all 8 cameras

Same weight file being used for all cameras (this has pretty interesting implications and previously V8 main/narrow seems to have had separate weights for each camera)

Processed resolution of 3 front cameras and back camera: 1280x960 (full camera resolution)

Processed resolution of pillar and repeater cameras: 640x480 (1/2x1/2 of camera’s true resolution)

all cameras: 3 color channels, 2 frames (2 frames also has very interesting implications)

(was 640x416, 2 color channels, 1 frame, only main and narrow in V8)
------------

Various V8 versions included networks for pillar and repeater cameras in the binaries but AFAIK nobody outside Tesla ever saw those networks in operation. Normal AP use on V8 seemed to only include the use of main and narrow for driving and the wide angle forward camera for rain sensing. In V9 it’s very clear that all cameras are being put to use for all the AP2 cars.

The basic camera NN (neural network) is an Inception V1 type CNN with an L1/L2/L3ab/L4abcdefg layer arrangement (architecturally similar to the V8 main/narrow camera network up to the end of the inception blocks, but much larger)
  • about 5x as many weights as comparable portion of V8 net
  • about 18x as much processing per camera (front/back)

For perspective, the V9 camera network is 10x larger and requires 200x more computation when compared to Google’s Inception V1 network, from which V9 gets its underlying architectural concept. That’s processing *per camera* for the 4 front and back cameras. Side cameras are 1/4 the processing due to having 1/4 as many total pixels. With all 8 cameras being processed in this fashion it’s likely that V9 is straining the compute capability of the APE. The V8 network, by comparison, probably had lots of margin.
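To make the size comparison concrete, here's a quick back-of-envelope in Python using the figures above (raw input element counts only; this says nothing about Tesla's actual memory layout or FLOPs):

```python
# Input sizes implied by the specs listed above.
front_back = 1280 * 960 * 3 * 2      # full-res cameras: 7,372,800 values (~7.3M)
side       = 640 * 480 * 3 * 2       # pillar/repeater cameras: 1,843,200 values (~1.8M)
v8_main    = 640 * 416 * 2 * 1       # old V8 main camera input: 532,480 values (~0.5M)

print(front_back / side)             # 4.0   -> side cameras see 1/4 the pixels
print(front_back / v8_main)          # ~13.8 -> roughly 13x more input data than V8
```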

network outputs:
  • V360 object decoder (multi level, processed only)
  • back lane decoder (back camera plus final processed)
  • side lane decoder (pillar/repeater cameras plus final processed)
  • path prediction pp decoder (main/narrow/fisheye cameras plus final processed)
  • “super lane” decoder (main/narrow/fisheye cameras plus final processed)

Previous V8 aknet included a lot of processing after the inception blocks - about half of the camera network processing was taken up by non-inception weights. V9 only includes inception components in the camera network and instead passes the inception processed outputs, raw camera frames, and lots of intermediate results to the post processing subsystem. I have not yet examined the post processing subsystem.
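For readers less familiar with this kind of structure, here's a toy PyTorch sketch of the general "one shared camera network feeding several decoders" idea described above. The layer sizes, names, and heads are invented for illustration and are not Tesla's actual architecture:

```python
# Minimal sketch: one shared convolutional backbone applied to every camera,
# with several decoder heads consuming its output. All shapes are toy values.
import torch
import torch.nn as nn

class SharedBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-in for the Inception-style camera network.
        self.features = nn.Sequential(
            nn.Conv2d(6, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        return self.features(x)

class MultiHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = SharedBackbone()          # one weight set for all cameras
        self.lane_head = nn.Conv2d(64, 8, 1)      # e.g. a lane decoder
        self.object_head = nn.Conv2d(64, 16, 1)   # e.g. an object decoder

    def forward(self, x):
        feats = self.backbone(x)
        return {"lanes": self.lane_head(feats), "objects": self.object_head(feats)}

model = MultiHead()
frames = torch.randn(1, 6, 96, 128)  # 2 RGB frames stacked -> 6 channels (toy size)
out = model(frames)
print({k: tuple(v.shape) for k, v in out.items()})
```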

And now for some speculation:

Input changes:

The V9 network takes 1280x960 images with 3 color channels and 2 frames per camera from, for example, the main camera. That’s 1280x960x3x2 as an input, or 7.3MB. The V8 main camera processing frame was 640x416x2 or 0.5MB - 13x less data. The extra resolution means that V9 has access to smaller and more subtle detail from the camera, but the more interesting aspect of the change to the camera interface is that camera frames are being processed in pairs. The two frames in each pair are likely time-offset by some small delay - 10ms to 100ms I’d guess - allowing each processed camera input to see motion. Motion can give you depth, separate objects from the background, help identify objects, predict object trajectories, and provide information about the vehicle’s own motion. It's a pretty fundamental improvement to the basic perception of the system.
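As a rough illustration of what a 2-frame, 3-channel input looks like, here's a small NumPy sketch (the 10-100ms offset is just my guess from above, not a known Tesla parameter, and the channel-stacking layout is an assumption):

```python
# Two RGB frames from the same camera, captured a short time apart, stacked
# along the channel axis so a network processing them can see motion.
import numpy as np

H, W = 960, 1280
frame_now  = np.random.randint(0, 256, (H, W, 3), dtype=np.uint8)  # current frame
frame_past = np.random.randint(0, 256, (H, W, 3), dtype=np.uint8)  # ~10-100 ms earlier

pair = np.concatenate([frame_past, frame_now], axis=-1)  # shape (960, 1280, 6)
print(pair.shape, pair.size)  # (960, 1280, 6) 7372800 -> the ~7.3M figure above
```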

Camera agnostic:

The V8 main/narrow network used the same architecture for both cameras, but by my calculation it was probably using different weights for each camera (probably 26M each for a total of about 52M). This makes sense because main/narrow have very different FOVs, which means the precise shape of objects they see varies quite a bit - especially towards the edges of frames. Training each camera separately is going to dramatically simplify the problem of recognizing objects since the variation goes down a lot. That means it’s easier to get decent performance with a smaller network and less training. But it also means you have to build separate training data sets, evaluate them separately, and load two different networks alternately during operation. It also means that your network can learn some bad habits because it always sees the world in the same way.

Building a camera agnostic network relaxes these problems and simultaneously makes the network more robust when used on any individual camera. Being camera agnostic means the network has to have a better sense of what an object looks like under all kinds of camera distortions. That’s a great thing, but it’s very, *very* expensive to achieve because it requires a lot of training, a lot of training data and, probably, a really big network. Nobody builds them so it’s hard to say for sure, but these are probably safe assumptions.

Well, the V9 network appears to be camera agnostic. It can process the output from any camera on the car using the same weight file.

It also has the fringe benefit of improved computational efficiency. Since you just have the one set of weights you don’t have to constantly be swapping weight sets in and out of your GPU memory and, even more importantly, you can batch up blocks of images from all the cameras together and run them through the NN as a set. This can give you a multiple of performance from the same hardware.
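A toy example of that batching win, assuming a PyTorch-style setup (the tiny placeholder network and shapes are mine, not Tesla's):

```python
# With one camera-agnostic weight set, all cameras can be stacked into a single
# batch and pushed through the network in one pass, instead of swapping weight
# sets per camera.
import torch
import torch.nn as nn

shared_net = nn.Sequential(nn.Conv2d(6, 16, 3, padding=1), nn.ReLU())

# 8 cameras, each contributing a 2-frame RGB stack (6 channels), toy resolution.
cameras = [torch.randn(6, 96, 128) for _ in range(8)]
batch = torch.stack(cameras)           # shape (8, 6, 96, 128)

with torch.no_grad():
    features = shared_net(batch)       # one forward pass covers every camera
print(features.shape)                  # torch.Size([8, 16, 96, 128])
```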

I didn’t expect to see a camera agnostic network for a long time. It’s kind of shocking.

Considering network size:

This V9 network is a monster, and that’s not the half of it. When you increase the number of parameters (weights) in an NN by a factor of 5 you don’t just get 5 times the capacity and need 5 times as much training data. In terms of expressive capacity increase it’s more akin to a number with 5 times as many digits. So if V8’s expressive capacity was 10, V9’s capacity is more like 100,000. It’s a mind boggling expansion of raw capacity. And likewise the amount of training data doesn’t go up by a mere 5x. It probably takes at least thousands and perhaps millions of times more data to fully utilize a network that has 5x as many parameters.

This network is far larger than any vision NN I’ve seen publicly disclosed and I’m just reeling at the thought of how much data it must take to train it. I sat on this estimate for a long time because I thought that I must have made a mistake. But going over it again and again I find that it’s not my calculations that were off, it’s my expectations that were off.

Is Tesla using semi-supervised training for V9? They've gotta be using more than just labeled data - there aren't enough humans to label this much data. I think all those simulation designers they hired must have built a machine that generates labeled data for them, but even so.

And where are they getting the datacenter to train this thing? Did Larry give Elon a warehouse full of TPUs?

I mean, seriously...

I look at this thing and I think - oh yeah, HW3. We’re gonna need that. Soon, I think.

Omnidirectionality (V360 object decoder):

With these new changes the NN should be able to identify every object in every direction at distances up to hundreds of meters and also provide approximate instantaneous relative movement for all of those objects. If you consider the FOV overlap of the cameras, virtually all objects will be seen by at least two cameras. That provides the opportunity for downstream processing to use multiple perspectives on an object to more precisely localize and identify it.

General thoughts:

I’ve been driving V9 AP2 for a few days now and I find the dynamics to be much improved over recent V8. Lateral control is tighter and it’s been able to beat all the V8 failure scenarios I’ve collected over the last 6 months. Longitudinal control is much smoother, traffic handling is much more comfortable. V9’s ability to prospectively do a visual evaluation on a target lane prior to making a change makes the auto lane change feature a lot more versatile. I suspect detection errors are way down compared to V8 but I also see that a few new failure scenarios have popped up (offramp / onramp speed control seem to have some bugs). I’m excited to see how this looks in a couple of months after they’ve cleaned out the kinks that come with any big change.

Being an avid observer of progress in deep neural networks my primary motivation for looking at AP2 is that it’s one of the few bleeding edge commercial applications that I can get my hands on and I use it as a barometer of how commercial (as opposed to research) applications are progressing. Researchers push the boundaries in search of new knowledge, but commercial applications explore the practical ramifications of new techniques. Given rapid progress in algorithms I had expected near future applications might hinge on the great leaps in efficiency that are coming from new techniques. But that’s not what seems to be happening right now - probably because companies can do a lot just by scaling up NN techniques we already have.

In V9 we see Tesla pushing in this direction. Inception V1 is a four year old architecture that Tesla is scaling to a degree that I imagine Inception’s creators could not have expected. Indeed, I would guess that four years ago most people in the field would not have expected that scaling would work this well. Scaling computational power, training data, and industrial resources plays to Tesla’s strengths and involves less uncertainty than potentially more powerful but less mature techniques. At the same time Tesla is doubling down on their ‘vision first / all neural networks’ approach and, as far as I can tell, it seems to be going well.

As a neural network dork I couldn’t be more pleased.

So, it is good;)
 
And yet even with this fantastic new NN, V9 on Autosteer still tried to murder me twice during my commute today, plus side-swipe some cars on local roads where admittedly I had no business using Autosteer.

Actually, maybe the more intelligent NN is making it a more competent murderer...
So, do you ever log into TMC from your car? Just wondering if it can read your posts.
 
My understanding of all this is minimal, but I've found these posts to be really interesting.

This may seem silly, but regardless I'm posing the question... It's mentioned many times that training the NN requires a ton of human time, software simulation time, and more. As Tesla is starting to add games in V9, could they possibly "game" the process of training so that owners could identify things relevant to training the NN and help the process along?
We can currently sit and play Asteroids while charging. What if, instead, while waiting we could help train the AI, similar to how Google and other companies use captchas as a technique to train their AI?

[Image: Google reCAPTCHA street-sign selection example]


I forget the name of the project, but isn't there a space observation thing going on where people visually inspect images looking for certain characteristics that could be a type of galaxy or something similar? They then use that data both for science and also to help train an NN to better identify such space objects for research purposes.

/edit. I think it was Zooniverse for the space thing, using real people to help research into space observations (Galaxy Zoo) and other earth-based stuff (Shakespeare's World).
 
My understanding of all this is minimal, but I've found these posts to be really interesting.
I forget the name of the project, but isn't there a space observation thing going on where people visually inspect images looking for certain characteristics that could be a type of galaxy or something similar?..../edit. I think it was Zooniverse for the space thing, using real people to help research into space observations (Galaxy Zoo) and other earth-based stuff (Shakespeare's World).
this one perhaps?
SETI@home (or work) (some bad person installed it on _every_ Learningtree computer in the Wash, DC area ;))
A Brief History of SETI@Home - The Atlantic
 
For one thing the frame sizes and kernel depths are sufficiently bigger (2x or more) that normally you'd expect the behavior to move into a new regime, because the expressive power of the network is going to be qualitatively greater. Usually when that happens you have to make other changes as well, and you will start seeing behavior that can't be easily extrapolated from the original system.

What does that mean — for the “behaviour to move into a new regime”? Is there an example of that you can point to?
 
In his talks, Karpathy emphasizes how much training that image labellers need, and how long the instruction manual has been growing to cover edge cases. With Google captchas, I almost don’t trust the labelling because it’s an annoyance people are just trying to get past as fast as possible.
Also you can't just dump random images from Google Earth into the captcha service and train it using the general public. Captcha needs to know there's a sign in that image and which tiles are covered by the sign in order for their service to work. Implying that this metadata is already there labeled manually.
 
I believe that captcha providers will resend an image to someone else for labeling if the person that labelled it earlier (first?) does something suggesting they may not be labeling things accurately.
 
I guess my point was: is there a way that owners could somehow help the process? Or is this stuff already at such a high level that there is nothing we could contribute that would provide any advantage to Tesla? I mean, hundreds of owners showed up to assist with deliveries at the end of Q3; I'm sure just as many if not more would help contribute to the NN project if there was a way to, especially if the end result would be better-performing Autopilot/FSD features for the cars they own.
 
In his talks, Karpathy emphasizes how much training that image labellers need, and how long the instruction manual has been growing to cover edge cases. With Google captchas, I almost don’t trust the labelling because it’s an annoyance people are just trying to get past as fast as possible.

Weren't you the one who did some math saying it would take an average of 5 seconds to label an image? I responded that you clearly hadn't done your research, and you called me a troll and condescending for saying that.

Now you are arguing against part of your own thesis again, only because you saw a presentation from Tesla.

Which again is what I previously said: the only people some consider experts are people from Tesla, which is very shortsighted.

It proves my point that some people only adjust their views to the beat of Tesla's drum. No matter how much logic or how many facts you present them, it means nothing unless Tesla says it.

Some even called the work of thousands of SDC engineers, and hardware, sensor, and safety industry engineers around the world BS.

This is a problem when people process a report from @jimmy_d. This NN, which is frankly being blown out of proportion, is simply the perception system, an area where Tesla is clearly playing catch-up.

Yet when people view this info, looking through half the comments on Reddit and Electrek, I quote: "This proves Tesla is 5 years ahead of the competition."

It's simply dumbfounding!
 
What does that mean — for the “behaviour to move into a new regime”? Is there an example of that you can point to?

When you make relatively small changes to the size of a network - input resolution, kernel channels, incremental increase in layer widths or number of layers - the general characteristics of the network stay the same but you get incremental changes to the performance, to the optimal hyperparameters, to the training time and so forth. Larger changes are likely to cross boundaries where the optimal hyperparameters, training behavior, accuracy and so forth see big enough changes that the qualitative utility of some kinds of output changes.

An example might be - very generally - speech recognition accuracy versus network size and the resulting utility of the network in real world recognition applications. If word recognition accuracy is much below a certain threshold then context correction doesn't work and the tool makes so many mistakes that nobody will want to use it. Above that threshold the ability to correct errors from context rapidly improves and suddenly voice recognition is much more useful in real applications. You won't cross that threshold if you increase the network by 10%, but if you increase it by 100% you might. There are lots of such qualitative behavior thresholds - some matter a lot, some not so much, but all of them push the behavior of the network into a new regime where the developer has to reexamine how the network is used from scratch rather than just tweak the settings used to configure and operate the smaller version.
 
Probably from location metadata. Pictures taken near intersections are more likely to include traffic signals and/or signage.
That would require very reliable metadata, down to where in the picture the sign is located. Otherwise the captcha service, which needs to be close to 100% reliable, would not be anywhere near that. And with that kind of metadata there would be no reason to run this through the captcha service at all; you could just train your system on the data directly.