Welcome to Tesla Motors Club

v12 is not e2e AI

OK, granted, this title is a little click-baity, but hear me out.

While no one knows for sure how v12 works, Tesla (I suppose mainly Elon) has claimed that it is 'photons in, controls out'. This implies that there are *no* heuristic rules anywhere in the stack: every action taken by v12 is the result of learning how (good) humans drive.

This is clearly true in many situations, where v12 seems surprisingly human-like. However, two counter-examples make it clear that there are still heuristics in place, and these heuristics overrule whatever the NN might want to do:
  1. Stop signs. No way they trained this thing on millions of examples of humans stopping at stop signs, and it learned to come to a complete stop at all times. (Yes, I know this was requested by the NHTSA. But my point remains - that hard-coded rule must still be in place today).
  2. Creeping at intersections. v12, just like earlier versions, stops way back from the actual intersection first and then begins a very gradual creep. No human drives like that. I understand why: it's clear that the B-pillar cams are the Achilles' heel of FSD. But even a driver with that same limited visibility wouldn't stop that far back first and then begin gradually creeping.
I'm not suggesting this is necessarily a bad thing - I'm actually glad that they still have the ability to apply heuristic rules on top of a (seemingly) non-deterministic NN. But I do wonder how many lines of 'software 1.0' still remain in the stack.
 
I too bought into what Elon was saying about the "photons in, controls out" statement (I should have known better), but that was clearly an aspirational statement. If you look closely at the release notes, it does not say that it's end-to-end NN, but rather that it replaced something like 300K lines of C++ code, implying that there is still code in the system. And likely will be for the foreseeable future.
 
I'm pretty sure Elon specifically said in his original introductory V12 live demo something along the lines of "there are no lines of code saying 'this is what a stop sign looks like' and no lines of code telling the car what to do at a stop sign". I haven't been back to watch the video again, but that was my memory of it.

If it behaves the same awkward way at every stop sign then I guess it's possible they have been forced by the NHTSA to train it that way.

He also mentioned at some point that they are using heuristic code to go through the training data and select the desired training videos automatically, instead of using humans to do that. So there might be lines of code in the training centre selecting the clips where real drivers actually stop at stop signs and discarding the rest. If they need a million training videos and only 1% of real drivers actually stop at stop signs, then maybe they needed to start with 100 million videos. That 1% of real drivers may be the sole source of training, so the car ends up driving like they do, and this is what we end up with! As noted above, maybe simulated training is also used.
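To make the idea concrete, that kind of curation heuristic could be sketched as below. Everything here is an illustrative assumption (the clip fields, the 0.5 mph "full stop" threshold, the `curate` helper); nothing reflects Tesla's actual pipeline.

```python
# Hypothetical sketch of training-data curation: keep only clips where the
# driver actually came to a complete stop near a stop sign.
# All field names and thresholds are made-up assumptions for illustration.

FULL_STOP_MPH = 0.5  # treat anything below this as a complete stop

def came_to_full_stop(clip):
    """A clip is a dict with per-frame speeds (mph) and a stop-sign flag."""
    if not clip["has_stop_sign"]:
        return False
    return min(clip["speed_mph"]) < FULL_STOP_MPH

def curate(clips):
    """Keep the small fraction of drivers who genuinely stopped."""
    return [c for c in clips if came_to_full_stop(c)]

clips = [
    {"has_stop_sign": True,  "speed_mph": [25, 12, 4, 0.0, 3, 15]},  # real stop
    {"has_stop_sign": True,  "speed_mph": [25, 14, 6, 3.0, 8, 20]},  # rolling stop
    {"has_stop_sign": False, "speed_mph": [30, 31, 29]},             # no sign
]
print(len(curate(clips)))  # 1 of 3 clips survives
```

The point being: a filter this aggressive would explain both the 100-million-to-1-million ratio and why the trained behavior ends up so uniformly conservative.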

From the videos I've watched, part of the problem seems to be the way roads are built in the USA, and the authorities' attitude towards it: you have a stop sign with a stop line painted on the road well short of the intersection, and the authorities say you have to stop there. But you can't see from there, so then you have to creep. It's no wonder real drivers don't comply.

In Australia we don't have many stop signs - usually "Give Way (Yield)" signs at most intersections, with the stop signs reserved for intersections with limited visibility on approach. And we don't have 4-way stops at all - one road always has priority. But we also don't have these lines painted way short of the intersection - if there is a line painted it will be actually at the intersection. For that reason, most drivers here do tend to obey most stop signs because they are actually sensible and not overused.
 
It's end-to-end AI, just with carefully curated (and/or simulated) training data, particularly for stop signs. I'm not sure what the stop-10-feet-too-early-at-intersections thing is about though; it seems unlikely they would deliberately program or train that in? Perhaps it's an unintended side-effect of the overly cautious stop-sign training. (The training is clearly not perfect, for stop signs or anything else.) Either way, I expect intersection creep to improve a lot in v12.4.

What's not end-to-end, though, is the FSD visualization, which shows the intermediate 3D scene representation to the user. (Not photons, not controls.) It's not clear whether this is somehow extracted from the larger E2E network, or whether it's a separate parallel network. I'm guessing the latter, because occasionally they don't agree; e.g. sometimes the car will take a path or lane that differs from the one shown in the visualization, or will ignore a lane marking (e.g. left-turn-only) that does show up in the visualization. But to the extent Tesla needs to show the intermediate 3D scene representation to the user, they will need to continue training a ground-truth occupancy network or equivalent. This may be why they're still gobbling up millions of dollars' worth of Lidar units, even if those are technically not needed for training the primary E2E (photons in, controls out) network per se.

In any case, yes, there will always be heuristics. They've just been moved, from C++ control code to training set curation code.
 
I'm not sure what the stop-10-feet-too-early-at-intersections thing is about though; it seems unlikely they would deliberately program or train that in? Perhaps it's an unintended side-effect of the overly cautious stop-sign training.

Stop sign maneuvers as a whole (both in terms of stopping point and speed profile) feel like a bit of a temporary "solution" to me. It's so conservative that it feels like they just wanted to quickly make something that was obviously compliant with stopping requirements (but kind of sucks) while they work on a better version.

If the stopping point was easily tunable via a runtime parameter controlling C++ code why wouldn't they have made it stop a foot further forward? On the other hand, if that behavior is actually trained into the E2E neural net it could be extremely expensive to recollect and/or regenerate the relevant training video (it sure is well trained though, it's so consistent).
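For contrast, here's roughly what "easily tunable via a runtime parameter" would look like if stop placement were plain heuristic code. This is purely an illustrative sketch; the parameter name, the setback value, and the function are all invented, and nothing here reflects Tesla's actual planner.

```python
# Illustrative only: if stop placement were a heuristic, moving the stop
# point forward would be a one-line parameter change, not a retraining job.
# All names and values below are made up for the sake of the argument.

STOP_SETBACK_M = 3.0  # tunable: how far before the stop line to halt

def stop_target(distance_to_stop_line_m):
    """Return how far ahead of the car's current position to stop."""
    return max(distance_to_stop_line_m - STOP_SETBACK_M, 0.0)

# With a 3 m setback the car halts 3 m short of the line; dropping
# STOP_SETBACK_M toward zero would move the stop point to the line itself.
print(stop_target(20.0))  # 17.0
```

If the behavior really were this cheap to adjust, the fact that it hasn't been adjusted suggests it's baked into the net (or into the training data) instead.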

What I'm hoping to see:
- faster approach with higher deceleration from 2 to 0 mph
- stop at the stop line (or level with the stop sign if no line)
- creep a bit faster (positioning is generally good though)
 
Stop sign maneuvers as a whole (both in terms of stopping point and speed profile) feel like a bit of a temporary "solution" to me. It's so conservative that it feels like they just wanted to quickly make something that was obviously compliant with stopping requirements (but kind of sucks) while they work on a better version.

If the stopping point was easily tunable via a runtime parameter controlling C++ code why wouldn't they have made it stop a foot further forward? On the other hand, if that behavior is actually trained into the E2E neural net it could be extremely expensive to recollect and/or regenerate the relevant training video (it sure is well trained though, it's so consistent).

What I'm hoping to see:
- faster approach with higher deceleration from 2 to 0 mph
- stop at the stop line (or level with the stop sign if no line)
- creep a bit faster (positioning is generally good though)
It's sort of incongruous that NHTSA is just fine with allowing the driver to override the speed limit via preferences (e.g. you can set FSD's top speed to "Speed Limit + 15%"), but they're absolutely opposed to overriding stop sign behavior via preferences, even though the former is arguably just as illegal and unsafe as the latter. (And particularly given that FSD only ever actually rolled through stops when there were no pedestrians or cross traffic nearby.) Perhaps it's because rolling through a stop sign feels like a more black-and-white legal/illegal choice, while top speed is more continuous? I suppose there's an argument to be made that keeping up with the flow of traffic [if everyone is speeding] is safer than rigidly observing the speed limit, but there's no equivalent argument for rolling through stop signs, other than conceivably to reduce the risk of getting rear-ended.

It's more understandable that fully autonomous driverless cars would be required to completely stop at stop signs, but given that FSD drivers can already override the stop sign behavior on a case-by-case basis (by goosing the accelerator), it would be that much nicer if we could override it by default. Oh well, one can dream :)
 
Perhaps it's because rolling through a stop sign feels like a more black-and-white legal/illegal choice, while top speed is more continuous? I suppose there's an argument to be made that keeping up with the flow of traffic [if everyone is speeding] is safer than rigidly observing the speed limit, but there's no equivalent argument for rolling through stop signs

Agreed. We'll probably never know exactly what NHTSA's thinking is on this, but your explanation makes sense.

One other factor is that speed-limit-sign detection still isn't always accurate (eight years after HW2 launched and they brought that in-house, no less...). There's a decent argument that you should be able to override it to correct a bad speed limit detection. Maybe NHTSA considers that a legit reason?

Previously the system limited how much you could speed, I think to 5 mph over the limit on undivided highways. Interestingly, that constraint seems to be gone nowadays, so they've actually gotten more lenient. It was mildly annoying when everyone was cruising at 72-75 mph and the car only allowed 70.
 
This whole stop sign discussion takes me back to driver's ed and my driver's test.

My driver's ed instructor instructed me to stop at the stop line/crosswalk, look for people crossing the street, and then creep up to get a better view of oncoming traffic. Of course I didn't actually drive this way: I drove the way my parents and everyone else did: as you pull up to the intersection you are already on the lookout for pedestrians, so unless there are any pedestrians, pull up to where you have a good view of oncoming traffic, stop there, and then go.

But I took my driver's ed instructor's teaching as gospel, so when it came time for my driver's test, I figured I should replicate what my driver's ed teacher told me.

The only points I got off on my driver's test were for hesitant behavior at 4-way stops!
 
The only points I got off on my driver's test were for hesitant behavior at 4-way stops!

Lol, you can't win them all.

It's surprisingly hard to be:
- technically compliant with making full stops and stopping at the stop line
- a reasonably assertive driver
- smooth

Tesla has a pretty hard challenge given they're trying to follow rules that humans don't need to follow.
 
What's not end-to-end, though, is the FSD visualization, which shows the intermediate 3D scene representation to the user. (Not photons, not controls.) It's not clear whether this is somehow extracted from the larger E2E network, or whether it's a separate parallel network.

This is something else I’ve been thinking about. Elon’s definition of ‘end to end’ might not mean just a single neural network with ‘photons in, controls out’.

I’m wondering whether the perception neural nets are still used to map out the vehicle’s surroundings (and other vehicles/objects) in vector space, and then this data is what is fed to the neural network(s) for planning/control. This would still be ‘end to end’ in that there is no procedural code, but it’s not a single monolithic neural network.
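That two-stage idea can be sketched as below. Both "nets" are stand-in functions and every name, object kind, and threshold is an assumption for illustration; this is the speculation in this thread, not Tesla's disclosed design.

```python
# Sketch of the speculated two-stage architecture: a perception net emits a
# vector-space scene, and a separate learned planner consumes it. Stand-in
# functions only; all names and values are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class SceneObject:
    kind: str   # "car", "pedestrian", "stop_sign", ...
    x_m: float  # position ahead of the ego vehicle, meters
    y_m: float  # lateral offset, meters

def perception_net(camera_frames):
    """Stand-in for perception NNs: photons in, vector space out."""
    return [SceneObject("stop_sign", 18.0, 2.0),
            SceneObject("car", 40.0, -1.5)]

def planner_net(scene):
    """Stand-in for a learned planner: vector space in, controls out."""
    if any(o.kind == "stop_sign" and o.x_m < 20.0 for o in scene):
        return {"accel": -2.0, "steer": 0.0}  # brake for the nearby sign
    return {"accel": 0.5, "steer": 0.0}

controls = planner_net(perception_net(camera_frames=[]))
print(controls)  # {'accel': -2.0, 'steer': 0.0}
```

Notice that the visualization could simply render the intermediate `scene` list; if v12 were one monolithic net, that representation wouldn't exist to render, which is exactly the puzzle raised above.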

Have we heard anything from Tesla to point in either direction?
 
Have we heard anything from Tesla to point in either direction?
It was observed pretty early on that the visualization and the behavior of V12 don't seem to match up. That is, the visualization will show things that V12 will completely ignore. Interestingly, the visualization sometimes hallucinates a pedestrian, which V12 would happily drive through. The new autopark software appears to be using the V11 perception stuff because when it hallucinates a pedestrian in the parking space, autopark will refuse to move towards it.

Part of the reason for going to "end-to-end" is to avoid introducing human concepts into the process of moving from photons to control outputs. The training process will come up with the most efficient way of doing that because it doesn't first have to come up with an intermediate representation, which isn't needed for driving the car.