Neural Networks

Suppose you use the TensorFlow model that Intel demoed early last year (90% per-light accuracy). You would still have a 99.999997% chance of correctly detecting at least one light (on average) within a tenth of a second at 30 fps, assuming an average of 2.5 lights per direction — that is, 1 - ((.1 ^ 2.5) ^ 3), where .1 is the per-light miss rate, 2.5 is the number of lights, and 3 is the number of frames in a tenth of a second.

So unless Intel's algorithm is just way too slow for real-time use, chances are the 98% accuracy is *not* over several seconds.

If we assume instead that 98% is per-frame, per-light, then with an average of 2.5 traffic lights per intersection you'd have about a 99.994% chance of detecting at least one correctly in any given frame, and a 99.999999999982% chance of detecting one within a tenth of a second at only 30 fps. At 60 fps, Google's calculator can't even work with numbers that small, though the odds of failure come out to roughly one in 31 septillion. You are 10 billion times more likely to win a billion dollars or more in both Powerball AND Mega Millions than for that to fail.
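For anyone who wants to sanity-check the arithmetic, here's a quick back-of-the-envelope script. The 90% and 98% per-light accuracies, the 2.5 lights per direction, and the frame counts are just the assumptions above, and the whole thing treats lights and frames as independent:

Code:
# Back-of-the-envelope check of the per-intersection detection odds above.
# Assumes per-light, per-frame detections are independent.

def p_at_least_one(per_light_acc, lights=2.5, frames=1):
    miss_per_light = 1 - per_light_acc
    miss_per_frame = miss_per_light ** lights   # miss every light in one frame
    return 1 - miss_per_frame ** frames         # hit at least one light, once

print(p_at_least_one(0.90, frames=3))   # ~0.99999997   (90%, 0.1 s at 30 fps)
print(p_at_least_one(0.98))             # ~0.99994      (98%, single frame)
print(p_at_least_one(0.98, frames=3))   # ~0.99999999999982   (0.1 s at 30 fps)
print((1 - 0.98) ** (2.5 * 6))          # ~3e-26: the 1-in-31-septillion case at 60 fps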

So a 98% detection rate is probably good enough. Not necessarily, but probably. :)

The most straightforward way to interpret Elon’s comment is that when a dev mode Tesla drives through an intersection, it has a 98% chance of correctly recognizing whether the traffic light governing its lane is red, yellow, or green.

If it were already at 99.99999%+ accuracy per intersection (less than 1 error per 10 million intersections), Elon’s comment wouldn’t make sense. I think he would just say traffic light recognition is solved.
 

There are publicly available models with 98% detection per image floating around on GitHub. If they aren't doing at least as well as those, then I'm amazed these cars are even staying on the road.

At only a 98% chance per intersection, none of this makes any sense. I mean, let's say that you have a five-second window to detect it. At 30 fps, that's 150 frames. If you have a 2.5% chance of detecting it in a frame, you have a 97.5% chance of not detecting it. So your chance of not detecting it in two frames is 97.5% of 97.5%. Your chance of not detecting it in 150 frames, then, is .975 ^ 150, or about 2.2%. So to have only a 98% chance per intersection, your per-frame detection rate would have to be only about 2.5%.
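Working that backwards, under the same independence assumption, as a quick sanity check:

Code:
frames = 150                              # 5 s at 30 fps
p_frame = 1 - (1 - 0.98) ** (1 / frames)  # per-frame rate implied by 98% per intersection
print(p_frame)                            # ~0.026, i.e. about 2.6% per frame
print(1 - 0.975 ** 150)                   # ~0.978, i.e. 2.5% per frame gives roughly 98%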

I'm pretty sure I could do significantly better than a 2.5% detection rate per frame with a simple brute force, non-neural-net algorithm that just looks for yellow areas with three areas of red, yellow, green, or black inside them. In a hundred lines of code or less.
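Something in this spirit, say (a rough OpenCV sketch, untested on real footage; the color thresholds and size checks are guesses that would need tuning, and red hue actually wraps around 180 in HSV, which this ignores):

Code:
import cv2

# Crude non-NN traffic light finder: look for tall, dark housing-shaped blobs,
# then check which lens color is lit inside each one. Thresholds are guesses.

def detect_lights(bgr_frame):
    hsv = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2HSV)

    # Candidate housings: dark regions (housings are mostly black or dark yellow).
    dark = cv2.inRange(hsv, (0, 0, 0), (180, 255, 60))
    contours, _ = cv2.findContours(dark, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    results = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if h < 20 or h < 2 * w:          # housings are tall and narrow
            continue
        roi = hsv[y:y + h, x:x + w]

        # Count brightly lit pixels in each color band inside the housing.
        masks = {
            "red":    cv2.inRange(roi, (0, 120, 150), (10, 255, 255)),
            "yellow": cv2.inRange(roi, (20, 120, 150), (35, 255, 255)),
            "green":  cv2.inRange(roi, (45, 80, 150), (95, 255, 255)),
        }
        counts = {color: cv2.countNonZero(m) for color, m in masks.items()}
        best = max(counts, key=counts.get)
        if counts[best] > 0.02 * w * h:  # a lit lens should cover a few % of the box
            results.append((best, (x, y, w, h)))
    return results

Obviously that would get fooled by tail lights, reflections, and so on, but the point is just that even a crude color/shape heuristic should clear 2.5% per frame by a wide margin.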

No, that has to be per-frame, and probably per light, per-frame. That's the only way such a low number can possibly make sense.

I mean, perhaps it really does have only a 98% chance of correctly detecting the color of the signal light, but if that's true, then maybe it wasn't a good idea using greyscale + red cameras instead of proper full-color cameras. :)

Also note that with the exception of sideways lights, if you can't tell the color of the light, you're doing something very wrong, because they are always in the same order. :D
 

I think these probability calculations assume the detection events are independent. Given that each frame is quite similar to the previous one (and recognition is highly dependent on location/surroundings), it is more likely that they are dependent events, so you cannot just multiply the failure rates together and subtract from one to get the pass rate.

Flipping a coin: independent.
Registering the correct stop light when it:
  • Aligns with the light at the next intersection
  • Is mounted to the underside of a reflective skywalk
  • Is blown half a lane to the side due to wind
  • Is actually a reflection off a window
These are things that don't generally improve just by giving the NN more frames; you need to train and test the NN for those cases (see the sketch below).
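To put a toy number on that, here's a rough Monte Carlo sketch (all the rates are made up): if the misses come from a condition that persists across the whole approach, extra frames buy you almost nothing.

Code:
import random

FRAMES = 150            # 5 s approach at 30 fps
P_FRAME_MISS = 0.02     # per-frame miss rate in the independent model
P_BAD_SCENE = 0.02      # fraction of intersections with a persistent confuser
TRIALS = 100_000

def missed_independent():
    # Every frame fails independently with probability P_FRAME_MISS.
    return all(random.random() < P_FRAME_MISS for _ in range(FRAMES))

def missed_correlated():
    # A "bad" intersection (reflection, odd mounting, ...) fools the detector
    # in every frame; a normal intersection behaves like the independent model.
    if random.random() < P_BAD_SCENE:
        return True
    return all(random.random() < P_FRAME_MISS for _ in range(FRAMES))

print(sum(missed_independent() for _ in range(TRIALS)) / TRIALS)  # ~0 (0.02^150)
print(sum(missed_correlated() for _ in range(TRIALS)) / TRIALS)   # ~0.02

Same 2% figure in both models, wildly different per-intersection outcomes.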
 

You're right, of course, but:
  • Lights aligning with other lights can be differentiated by parallax from the multiple cameras (and, ideally, should be fixed in some way to hide the back lights).
  • Reflective skywalks would be a design safety issue that needs to be fixed.
  • Wind issues are also design safety issues that need to be fixed. (That's what the bottom wire is for, or, better, actual post mounting.)
  • Windows are not typically within the field of view of the front camera, but even if they are, they are so far out from the lanes that they can be ignored programmatically.
The things that I would expect to cause problems are more like:
  • Detecting low traffic lights when there's something unusual behind them that matches the colors on the light.
  • Reliably distinguishing red traffic lights from tail lights on taxis.
  • Detecting traffic lights when partially obstructed by other vehicles.
most of which do change from frame to frame (more often than not) because the camera angle changes.

These are things that don't generally improve just by giving the NN more frames; you need to train and test the NN for those cases.

Yeah, but those NNs reached 98% when trained on only 600 images. Tesla cars have likely seen billions of disengagements caused by traffic lights. If Tesla performs some fairly straightforward data processing, they should be able to figure out hot spots near intersections where disengagement occurs frequently. If they also see vehicles not disengaging and going straight through those same intersections without stopping, those intersections should almost always have traffic lights.
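The hot-spot part of that is pretty pedestrian data work. A toy sketch of the idea (the event format and thresholds are hypothetical, and real lat/lon binning would need to account for longitude scaling):

Code:
from collections import Counter

GRID = 0.001  # roughly 100 m of latitude per grid cell

def cell(lat, lon):
    return (round(lat / GRID), round(lon / GRID))

def traffic_light_candidates(disengagements, passthroughs, min_events=25):
    # disengagements / passthroughs: iterables of (lat, lon) event locations.
    dis = Counter(cell(lat, lon) for lat, lon in disengagements)
    thru = Counter(cell(lat, lon) for lat, lon in passthroughs)
    # Cells where many drivers take over, but plenty of other drives also go
    # straight through without stopping, look like signalized intersections
    # rather than stop signs.
    return [c for c, n in dis.items() if n >= min_events and thru[c] >= min_events]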

From there, they should be able to take all of the video data of disengagements at all of those locations and test the NN on them, manually tagging the images from any intersections where the NN fails to detect a reasonable number of traffic lights in those images.

In fact, by mining that data properly, Tesla should be able to create an accurate database of the location of every single permanent traffic light in the country and fairly quickly update it as new lights appear and temporary lights disappear. Then, Tesla can use that database to substantially increase safety by slowing down if it cannot successfully detect a traffic light, eventually bringing the vehicle to a stop before the intersection if it fails to do so, forcing the user to intervene manually.

So really, I can't see how they could only be getting 98% accuracy still, even for single images, much less for entire intersections. That number just seems way too low to be plausible unless they're literally just taking the existing NN with its 600-image training data and using it as-is.
 

Yeah, but how much test data? This is called overfitting to your dataset.
 
I just don’t see how your interpretation can be consistent with what Elon said. If Tesla’s neural network has a less than 1 in 30 septillion chance of failing to recognize a traffic light, why wouldn’t Elon say the problem is solved? Why is anyone at Tesla still wasting their time with traffic lights when the existing solution is — for all intents and purposes — perfect?

When Elon said “98%” and “99.999%”, he wasn’t giving actual, precise figures. He said “I guess like, I don't know, 98%” and “like 99.999%”. So I don’t think we should read too much into those numbers. It sounded like just some off-handed numbers for the purposes of explanation.

I also think there is something logically wrong with the idea that a neural network has uncorrelated probabilities of correctly classifying an object with each successive frame an object appears in. Say you trained a neural network to recognize dogs, and it only had a 0.1% chance of correctly classifying an image of a dog. Well, that’s easy! Just show the same picture to the network 1,000 times, and it will have a 100% chance of getting it right. Presto! I have solved computer vision!

The problem with this way of thinking is that the neural network will simply make the same error 1,000 times. With video, each successive frame is a slightly new image, and you get new angles as the car moves. But that doesn’t by itself solve the problem.

Let’s say that 2 seconds before reaching the light, the neural network detects the light and correctly classifies it as red. But then just as it reaches the light, the network either fails to detect the light or misclassifies it as yellow or green. The car will then run the red light. So it’s not enough to classify a red light in a single frame among the hundreds of frames leading up to an intersection. The network has to consistently classify the light correctly.
 
Let’s say that 2 seconds before reaching the light, the neural network detects the light and correctly classifies it as red. But then just as it reaches the light, the network either fails to detect the light or misclassifies it as yellow or green. The car will then run the red light. So it’s not enough to classify a red light in a single frame among the hundreds of frames leading up to an intersection. The network has to consistently classify the light correctly.

I would expect the neural net to only need to correctly identify that a traffic light facing the camera exists and provide its approximate bounding box. Everything after that can and probably should be done in procedural code, because that approach would be more testable.
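Something like this is the shape I have in mind. A minimal sketch, with the per-frame detection format and the thresholds invented purely for illustration:

Code:
from collections import Counter, deque

class LightStateFilter:
    # Procedural logic layered on top of a per-frame detector: keep a short
    # history of classified colors and only report a color once it has enough
    # votes. Missed frames simply don't vote, so they can't erase an answer.

    def __init__(self, window_frames=30, min_votes=10):
        self.history = deque(maxlen=window_frames)   # last ~1 s at 30 fps
        self.min_votes = min_votes

    def update(self, detections):
        # detections: list of (color, confidence) tuples for lights seen this frame.
        if detections:
            color, _ = max(detections, key=lambda d: d[1])
            self.history.append(color)
        votes = Counter(self.history)
        if votes:
            color, count = votes.most_common(1)[0]
            if count >= self.min_votes:
                return color          # stable answer
        return None                   # not confident yet

The nice part is that the "what do we do about a flaky frame" policy lives in plain code you can unit-test against recorded sequences, instead of inside the network.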
 
Do you have a link to it? I recall he said that 98% line on twitter, but I can't seem to find it.

A Google search of "Traffic light detection is at 98% elon musk twitter" is giving me nothing, and neither are variations on it.

He didn’t tweet the 98% figure. On Twitter, you can search “from:elonmusk 98%” or “from:elonmusk traffic lights” and confirm this. The 98% figure is from Tesla’s Q4 2018 earnings call. Transcript here: https://seekingalpha.com/article/42...musk-q4-2018-results-earnings-call-transcript

The transcript has a lot of mistakes, but you can also listen to the earnings call and use the transcript to help you find the right time code: Tesla, Inc. Q4 2018 Financial Results and Q&A Webcast | Tesla, Inc.

I would expect the neural net to only need to correctly identify that a traffic light facing the camera exists and provide its approximate bounding box. Everything after that can and probably should be done in procedural code, because that approach would be more testable.

What if it correctly identifies it in one frame, and can’t identify in the next 59 frames?
 
At only a 98% chance per intersection, none of this makes any sense. I mean, let's say that you have a five-second window to detect it. At 30 fps, that's 150 frames. If you have a 2.5% chance of detecting it in a frame, you have a 97.5% chance of not detecting it. So your chance of not detecting it in two frames is 97.5% of 97.5%. Your chance of not detecting it in 150 frames, then, is .975 ^ 150, or about 2.2%. So to have only a 98% chance per intersection, your per-frame detection rate would have to be only about 2.5%.

First, the math is way off. It's p = 1 - (1 - .975)^150. Also, the errors are not random: if you fail to detect it in one frame, whether you detect it in the next is not random either; the likelihood of failing again increases greatly.
 
First, the math is way off. It's p = 1 - (1 - .975)^150. Also, the errors are not random: if you fail to detect it in one frame, whether you detect it in the next is not random either; the likelihood of failing again increases greatly.

I'm pretty sure you're doing that upside down. If the odds of correctly detecting something in any given frame are the same for each frame, then the odds of correctly detecting it in at least one of two frames must be greater than the odds of finding it in a single frame. With your math, the odds go the opposite direction, towards being less able to detect it in multiple frames than in one.

If we were talking about the odds of finding the light in every frame, then you would be correct, but that's a non-goal when it comes to determining what color the light is. More to the point, any such attempt would be fundamentally impossible unless you could guarantee that your car had continuous line-of-sight to the light, without any possibility of cars coming between you and the light, even momentarily.

The only strict requirement is that you know the color at least once before the point where you can no longer safely stop for the light, but not so far ahead that a green light would not guarantee your ability to get into (and, ideally, through) the intersection before it turns red. So there's a time window in which you have to get at least one frame, or else you can't figure out what to do, but you shouldn't need more than one frame ever, in practice, as long as you correctly detect the color in at least one frame within that interval.
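For a sense of scale, that window is fairly generous. A rough kinematics sketch, ignoring perception latency and reaction time, and assuming a distance at which the light becomes legible:

Code:
SPEED = 20.0           # m/s, roughly 45 mph
DECEL = 3.0            # m/s^2, comfortable braking
LEGIBLE_RANGE = 120.0  # m, assumed distance at which the light is readable
FPS = 30

stop_distance = SPEED ** 2 / (2 * DECEL)   # ~67 m needed to stop comfortably
window_m = LEGIBLE_RANGE - stop_distance   # ~53 m of usable approach
window_s = window_m / SPEED                # ~2.7 s
print(window_s, window_s * FPS)            # ~2.7 s, i.e. ~80 frames in which to
                                           # get at least one good read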

What if it correctly identifies it in one frame, and can’t identify in the next 59 frames?

That's perfectly fine, so long as that one frame is within the time window noted above. If it isn't, then it is useless. But really, if the odds are billions-to-one against not being able to find it in any of the frames, the odds are still millions-to-one against not being able to find it in dozens of frames, assuming the frames are sufficiently independent. (See previous discussions about how this isn't strictly true, but that they're likely closer to being independent than dependent, i.e. closer to failing one in 10^24 times than to failing one in 50 times.)
 
I have a feeling the research that is gonna come out of this will greatly improve the performance of neural networks for computer vision:

And I think Tesla are in a great spot to augment their data. Hopefully this will help with traffic light detection!

More reading:
Out of shape? Why deep learning works differently than we thought

After training a deep neural network on thousands and thousands of these images with arbitrary textures, we found that it actually acquired a shape bias instead of a preference for textures! A cat with elephant skin is now perceived as a cat by this new shape-based network. Moreover, there were a number of emergent benefits. The network suddenly got better than its normally-trained counterpart at both recognizing standard images and at locating objects in images; highlighting how useful human-like, shape-based representations can be. Our most surprising finding, however, was that it learned how to cope with noisy images (in the real world, this could be objects behind a layer of rain or snow) — without ever seeing any of these noise patterns before! Simply by focusing on object shapes instead of easily distorted textures, this shape-based network is the first deep neural network to approach general, human-level noise robustness.
 
Wow, that is so interesting about the NN with a shape bias! Great post, heltok, kudos.

Karpathy tweeted about the texture bias thing and jimmy_d had some interesting thoughts on it.

Karpathy and jimmy_d both argue that neural networks are not inherently texture biased; the ImageNet dataset just happens to be texture biased, and neural networks trained on this dataset internalize the bias.
 
I'm pretty sure you're doing that upside down. If the odds of correctly detecting something in any given frame are the same for each frame, then the odds of correctly detecting it in at least one of two frames must be greater than the odds of finding it in a single frame. With your math, the odds go the opposite direction, towards being less able to detect it in multiple frames than in one.

I feel like this is also wrong. (I mean, image recognition is not actually a statistically independent event, since frames are nearly identical within 1-2 seconds. So at this point even attempting to calculate odds is purely academic and has no connection to the real world.)

…But for the sake of correct math, we're asking: what are the odds that there won't be at least one correct frame in 150 frames, assuming a per-frame 'correct' rate of 98%?

So the rule of 'at least one' probabilities means we flip the question and ask: what are the odds that every single frame is the 2% result? So (1 - .98) ^ 150 consecutive frames = a 0.000000000000000000000000000...% chance of not getting at least one correct result. If there is a 2% chance of failure per frame, the chance of 150 consecutive failures is essentially 0%. But... that also assumes you have no false positives, blah diddy blah, and all the other reasons this line of reasoning doesn't really mean anything. :D
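For what it's worth, Python will happily put a number on that (same caveat about independence as above):

Code:
p_fail_every_frame = (1 - 0.98) ** 150
print(p_fail_every_frame)       # ~1.4e-255, effectively zero
print(1 - p_fail_every_frame)   # prints 1.0; the difference is below float precision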