I'm just going to have to disagree with you there on multiple counts for multiple reasons. Adding lidar does NOT reduce the compute requirements, because you STILL need camera vision and the corresponding neural networks to do things that lidar is completely and utterly useless for: reading signs, identifying and determining the color of stoplights, lane markings, etc. So with Tesla's approach, you get all the info you need from a single set of cameras. With Lucid's approach, you need those same neural networks to read signs, identify stoplight colors, etc., but then you also have to do all the lidar processing and the radar processing, and you need to localize and fuse that information into one coherent picture of the world around you.
The size of the input does not determine the required size of the network. Look at AlphaGo: tiny input (just a few hundred bits), but a massive network required to process it. Part of the heavy lifting Tesla's network currently does is to effectively construct the lidar point cloud from the visual inputs. (Not literally, but roughly informationally equivalent.) Feeding the lidar data in directly removes the need for the rest of the network to do that part of the information-processing job, so the overall network becomes much simpler for the same quality of final result, despite more bits being fed in. A tiny network is sufficient to read stoplight colors from an image; the difficult part is fusing all the inputs into the 3D map. (Obviously the "control" part of the network remains roughly the same in terms of what sorts of processing it needs to do, and of course with E2E all these "parts" are entangled together.)
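To illustrate the first point with a toy sketch (assuming PyTorch; the little network below is invented purely for illustration): a fully-convolutional network has exactly the same parameter count whether you feed it a Go-board-sized input or a full camera frame, so input size and network size really are independent knobs.

```python
# Toy sketch (PyTorch assumed): the parameter count is fixed by the
# architecture, not by the size of the input you feed it.
import torch
import torch.nn as nn

# A small fully-convolutional trunk, invented for illustration.
trunk = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, kernel_size=1),
)

print(sum(p.numel() for p in trunk.parameters()))  # same number either way

tiny = torch.randn(1, 3, 19, 19)        # Go-board-sized input
large = torch.randn(1, 3, 960, 1280)    # camera-frame-sized input
print(trunk(tiny).shape, trunk(large).shape)
```

AlphaGo is the same observation from the other direction: the work the network has to do, not the number of input bits, is what sets its size.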
(I do point cloud processing of billions of 3D points from precision measurement hardware for a living, so I have a little bit of experience in this area.)

I'm willing to bet the incremental cost of 5 radars, N lidars, and the processing sufficient to do the above for an autonomous driving application is going to be significantly north of $1000.
The major cost component is one RoboSense lidar unit (not 'N'), with an apparent list price of around $1900. Lucid undoubtedly gets it at a volume discount, and the price should decrease rapidly over time. I believe the rest of the sensors are much cheaper. Even if it's currently a $5k premium over Tesla's sensor array (for a much more "premium" car), it's a fairly minor factor in Lucid's overall financial situation, and again the sensor costs should decrease rapidly over time and with economies of scale.
Yes, of course Lucid is losing money on each car for a whole host of reasons--that's why I said part of. One of Rawlinson's biggest mistakes was that he wanted to get back at Musk, so he tried to out-Tesla Tesla with the car. Fine, if you disregard the cost of production you can make a car that outperforms a Tesla. But ultimately you have to make money on the thing, and Rawlinson suffered from the same flaw as most CEOs who try to start a car company: the inability to recognize that making a profit on the car is key to survival.
It's impossible to make a profit from day one. A multi-year period of substantial losses is expected for any startup in the EV space. Tesla didn't turn its first full-year profit until 2020, twelve years after delivery of its first cars. Lucid's first deliveries were in 2021, so if they're profitable by 2032 they will be ahead of Tesla by that metric.
But the proof is in the pudding. Can't wait to see Lucid's autonomous driving solution in a few years and compare it to Tesla's. That is, if the Saudis haven't decided to stop funding the black hole they're throwing their money into by that point.
Yes, that's the question. (Disagreed though that it's a black hole; Lucid's cars are some of the most compelling out there. Faraday Future, on the other hand...)
 
I'll just toss my observations in here for fun:

The cameras can absolutely see quite a long way. This is easily observable by pinch-zooming out at an intersection and watching the car track all the traffic in all four directions.

The distance estimation is also pretty accurate. This is also observable in the visualization when you are stopped at a light or behind another car. You can see how far back the car stops from the line, and how close other cars are around you on the screen. If the accuracy were off significantly, there would be discrepancies here--for example, you stop 10 feet from the line at an intersection, but the visualization shows you at the line. This does not happen.

As indicated by others above, most of the problems are in the planner, not in the cameras' ability to detect objects and distances.

I'll also add another anecdotal observation. When I'm in my local automated car wash, the exit faces a busy street. While the car is covered in water, the visualization still shows, accurately I might add, traffic moving both ways on the street. The only time it fails is when the windshield is completely covered by soap. BUT, even then, I've occasionally seen some cars show up--I'm guessing from the B-pillar cameras looking forward.
 
Yes, it's a long way off, and you can't skip the intermediate human+robot driver step. Though I imagine it could be done within 20 to 30 years; there are probably not many cars on the road older than that at any given time. I imagine it would take just a few years to draft and begin implementing a protocol for inter-operation of autonomous vehicles. The government could mandate that all new vehicles incorporate the system. The system would add a modest cost to the car, but consumers could be incentivized with credits (similar to our current EV consumer incentives), or buy a used car in their price range. In 30 years the number of running cars without the system would be minimal. Those cars would need to be prohibited from being driven so the switch could be made to autonomous-only roads.
This may eventually happen, and perhaps sooner than 50 years in wealthy regions, although you will probably still have to pry the ability to drive non-autonomous "classic" cars from Jay Leno's cold dead hands. In rural regions or developing countries, it will take quite a bit longer for the tech and requirements to trickle down. But my guess is that it will also be a moot point by then, because ASI (artificial superintelligence) may well become a reality in the next couple decades, and that will change everything completely.
The thing is, there will always be advancements in hardware (better cameras, more cameras, new sensors, better placement of sensors), and it doesn't seem possible to future-proof the neural nets. Any time a sensor is added or enhanced, the NN will need to be rebuilt. For example, let's say Tesla adds 3 forward-facing cameras near the front bumper to enhance cross-traffic detection. Under the previous approach, only the NN responsible for perception would need to be retrained. The layer responsible for making decisions based on output from the perception layer would not need any adjustments; it would simply have a more detailed map of its environment with which it could make better decisions. With E2E, is there even a separate perception layer anymore? As I understand it, there is one continuous NN that would need to be retrained with a brand new set of training data featuring the new sensors. And that new data is not going to cover identical scenarios if you are re-capturing real-world footage, which means the new NN might introduce regressions for some situations.
Presumably Tesla will have (and already to some degree has) a robust mechanism in place to identify and minimize regressions. True that the NN will have to be independently retrained for each sensor + computer combination; evidently the variations between S/3/X/Y are small enough that they can share the same training, although Cybertruck may be too different for that to work. (And they will eventually want to train Semi and Roadster and Model 2 and Robotaxi...) That's why they need so much training compute. But once they work out how to successfully train one of these networks, then training the rest is simply a matter of brute force computation. Most of the training clips will remain the same from one software dot-release to the next, and the synthetic ones should be straightforward to regenerate for different sensor suites.
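To make the modular-vs-end-to-end distinction concrete, here is a hedged toy sketch (PyTorch assumed; the module names and sizes are invented, not Tesla's actual architecture). In a split design, adding cameras only forces a retrain of the perception encoder, because the planner consumes a fixed-size world state either way; in a monolithic end-to-end network the sensor change touches every layer.

```python
# Toy sketch of the modular-vs-end-to-end trade-off; names are invented.
import torch
import torch.nn as nn

class PerceptionNet(nn.Module):
    """Maps flattened sensor features to a fixed-size 'world state' vector."""
    def __init__(self, n_sensor_features: int, state_dim: int = 256):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Linear(n_sensor_features, 512), nn.ReLU(),
            nn.Linear(512, state_dim),
        )

    def forward(self, x):
        return self.encode(x)

class PlannerNet(nn.Module):
    """Maps the world state to [steering, acceleration]; sensor-agnostic."""
    def __init__(self, state_dim: int = 256):
        super().__init__()
        self.decide = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 2),
        )

    def forward(self, state):
        return self.decide(state)

planner = PlannerNet()  # trained once; untouched when sensors change

# Hypothetical sensor suites: 8 cameras vs. 8 + 3 new bumper cameras,
# each camera contributing 64 features in this made-up example.
perception_8cam = PerceptionNet(n_sensor_features=8 * 64)
perception_11cam = PerceptionNet(n_sensor_features=11 * 64)

for enc in (perception_8cam, perception_11cam):
    state = enc(torch.randn(1, enc.encode[0].in_features))
    print(planner(state).shape)  # torch.Size([1, 2]) in both cases
```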
Furthermore, since we are using machine learning, would adding sensors even help? The current placement of the cameras approximates what the human driver can see from the driver's seat. You could add 3 front bumper cameras, but since the human driver cannot see from that vantage point, whatever information is picked up by those cameras wouldn't influence a human driver's decisions, and therefore wouldn't influence the behavior demonstrated in the training footage. Though I suppose this could be addressed with simulated footage.
True, in order to drive with superhuman skill you will need a training set that goes beyond human captures. That's what the synthetic training is for; it can generate and throw extremely challenging virtual situations at the car, and still train it to not crash. The sensor suite already has numerous advantages and disadvantages relative to a human, and it has already learned to use some of those advantages. (E.g. for blind-spot awareness.) Bumper cameras would simply be another similar advantage. And matching skilled human performance should be enough for L4/Robotaxi; the question of how to achieve truly superhuman performance will not be critical to solve for at least the next few years.
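A minimal sketch of what synthetic scenario generation can mean in practice (purely illustrative; the parameter names and ranges below are invented, and a real pipeline renders full sensor data rather than a handful of numbers): randomize scenario parameters well beyond what human drivers routinely encounter, then generate one training clip per sample.

```python
# Hedged sketch: sample randomized, deliberately difficult scenario
# parameters that could drive a simulator. All names/ranges are invented.
import random

def sample_scenario(rng: random.Random) -> dict:
    return {
        "cut_in_gap_m": rng.uniform(2.0, 8.0),           # aggressive cut-ins
        "pedestrian_speed_mps": rng.uniform(0.5, 4.0),
        "visibility_m": rng.choice([30, 60, 120, 300]),  # fog / night / rain
        "road_friction": rng.uniform(0.2, 1.0),          # ice to dry asphalt
    }

rng = random.Random(42)
for scenario in (sample_scenario(rng) for _ in range(3)):
    print(scenario)  # each dict would parameterize one rendered training clip
```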
 
Maybe, but 2 certainly will give us insight.

Odd that it is taking so long for 12.4, when Elon spoke about how great it was before 12.3 went wide. How much "cleaning up" needs to be done on each version before it's released? (obviously rhetorical).

I think the assumption was that releases, without as much hard coding, would be produced and deployed rapidly--probably our/my mistake in interpretation.
Many years ago Google (now Waymo) conducted a study and determined that the better an autonomous system is, the less diligent the driver is about monitoring it. It's common sense: if the system sucks, people are going to watch it like a hawk. As it gets better, people get complacent--so while mistakes are less common, they can be more severe because drivers aren't ready to intervene.

I feel like v12 is getting into that nebulous area where people are just starting to get complacent. I'm sure the Autopilot team is well aware of this and perhaps is doing significantly more safety testing before releasing to the fleet as it gets really good.
 
Many years ago Google (now Waymo) conducted a study and determined that the better an autonomous system is, the less diligent the driver is about monitoring it. It's common sense: if the system sucks, people are going to watch it like a hawk. As it gets better, people get complacent--so while mistakes are less common, they can be more severe because drivers aren't ready to intervene.

I feel like v12 is getting into that nebulous area where people are just starting to get complacent. I'm sure the Autopilot team is well aware of this and perhaps is doing significantly more safety testing before releasing to the fleet as it gets really good.
Tesla drivers were complacent with Autopilot 2, so this isn't something recent. It's just that we expect more now, but complacency has been an issue for many, many years.
 
Many years ago Google (now Waymo) conducted a study and determined that the better an autonomous system is, the less diligent the driver is about monitoring it. It's common sense: if the system sucks, people are going to watch it like a hawk. As it gets better, people get complacent--so while mistakes are less common, they can be more severe because drivers aren't ready to intervene.

I feel like v12 is getting into that nebulous area where people are just starting to get complacent. I'm sure the Autopilot team is well aware of this and perhaps is doing significantly more safety testing before releasing to the fleet as it gets really good.
Ironically, this is a positive side-effect of the car driving "robotically"; it does enough odd things (not necessarily dangerous things) that it keeps the supervising driver on their toes, so when it does do dangerous things (like swerve into a short turnout lane at 80 mph on the highway, which for some reason it LOVES to do), the driver is more likely to be paying attention. It's also trained me to pay immediate close attention when the car initiates a merge, or does anything other than drive in a straight line.

Agreed that the uncanny valley is coming soon, and it will be very treacherous to cross. My hope is that improvements to the current well-known problem areas will automagically help with the "long tail" situations as well, though this remains to be seen.
 
uncanny valley
The only valley coming is the Trough of Disillusionment.

Sorry to be a Negative Nancy, but this is so over. For this hardware anyway, and I'm not even sure whether better sensors & a modest amount more processing will matter.

No one knows for sure, of course. But it's really tough and discouraging to see the guy in charge high on his own supply.

I’m optimistic for the first 9 on Chuck’s Turn though! Betting against it for 12.4 for consistency. Sadly no beer at stake. 🍺 🫃🏻
 
Many years ago Google (now Waymo) conducted a study and determined that the better an autonomous system is, the less diligent the driver is about monitoring it. It's common sense: if the system sucks, people are going to watch it like a hawk. As it gets better, people get complacent--so while mistakes are less common, they can be more severe because drivers aren't ready to intervene.

I feel like v12 is getting into that nebulous area where people are just starting to get complacent. I'm sure the Autopilot team is well aware of this and perhaps is doing significantly more safety testing before releasing to the fleet as it gets really good.
Good point. I have to constantly remind myself to be vigilant, since I haven't had a critical safety issue in a long time and it's easy to get lazy. Disengagements, sure--but situations where I actually felt FSD would have been in an accident if I hadn't disengaged? Not since I've had V12. With V11 it happened frequently.
 
The only valley coming is the Trough of Disillusionment.

Sorry to be a Negative Nancy, but this is so over. For this hardware anyway, and I'm not even sure whether better sensors & a modest amount more processing will matter.
I agree that Robotaxi / L4 will not be achievable until at least HW5, probably HW6. That's why it's so unfortunate that HW3/HW4 cars have no computer upgrade path (that we know of). But HW3/HW4 will likely be capable of getting us well into the problematic zone where it feels perfect for tens or hundreds of miles, until suddenly it's dangerously not.
No one knows for sure, of course. But it's really tough and discouraging to see the guy in charge high on his own supply.

I’m optimistic for the first 9 on Chuck’s Turn though! Betting against it for 12.4 for consistency. Sadly no beer at stake. 🍺 🫃🏻
The first 9 is the easiest, and if v12.4 succeeds, it will have taken them eight years (since AP1) to do it. How long until they can achieve the next eight 9's? (Two weeks, right?)
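For anyone not steeped in the "nines" shorthand: each additional 9 of reliability cuts the allowed failure rate by a factor of ten, which is why the first 9 is cheap and the ninth is brutal. A toy calculation (the attempt count is made up):

```python
# Toy illustration: each added "9" means 10x fewer allowed failures.
attempts = 1_000_000  # hypothetical unprotected left turns

for nines in range(1, 10):
    success = 1 - 10 ** (-nines)
    failures = attempts * 10 ** (-nines)
    print(f"{nines} nine(s): {success:.8%} success, "
          f"~{failures:,.0f} failures per {attempts:,} attempts")
```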
 
That's true! The distance is essentially 0 once it is detected by the camera.
Active vs. passive detection. Think SONAR: One can send out a sound pulse, look for reflections, and Detect Things.

One can, with the same receiver array, listen for machine noises, propeller burbles, and such, and get targets that way.

One might think that SONAR can give range but passive listening can't: One would be wrong. For serious detection purposes one can either drag an array of sensors or simply have several emplaced sensors; correlation between physically disparate receivers can give range, thank-you-very-much.
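A toy illustration of that correlation trick (NumPy assumed; the numbers are invented): the same broadband noise heard at two separated receivers is cross-correlated, and the lag of the correlation peak gives the time-difference of arrival. Repeat across several baselines and you can pin down bearing and range.

```python
# Toy passive-ranging sketch: correlate one noise source as heard at two
# physically separated receivers to recover the time-difference of arrival.
import numpy as np

fs = 48_000                        # sample rate, Hz
c = 1500.0                         # rough speed of sound in seawater, m/s
rng = np.random.default_rng(0)

n = int(0.1 * fs)                  # 100 ms of broadband "propeller" noise
source = rng.standard_normal(n)

true_delay_s = 0.0125              # extra travel time to receiver B
d = int(round(true_delay_s * fs))

rx_a = source
rx_b = np.concatenate([np.zeros(d), source])[:n]

# Cross-correlate and find the lag with the strongest match.
lags = np.arange(-n + 1, n)
xcorr = np.correlate(rx_b, rx_a, mode="full")
est_delay_s = lags[np.argmax(xcorr)] / fs

print(f"estimated delay: {est_delay_s * 1e3:.2f} ms, "
      f"path-length difference ~{est_delay_s * c:.1f} m")
```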

Going back to light: cameras tend to give one spectrally separated information (i.e., colors). LIDAR is more like Morse code, where a blast of light goes out and what comes back is sensed for amplitude and time from transmit, but not for spectral information like the color of what gets reflected. Further, in no particular order, I can easily see where LIDAR has Issues:
  • For decent ranging performance, LIDAR doesn't generally send out a blast in all directions: one has to aim the transmitter. Hence the shape of a LIDAR unit, which is somewhat like a globe, sending out pulses on a repeating basis all over the area of interest--much like the search RADAR one sees outside of airports. All those pulses coming back are received by the same physical hardware, and the 3D spatial inventory (this is not an image) has to be synthesized by the hardware and compute (a toy sketch of that geometry follows this list). Unlike that airport RADAR, whose background is either open air or fast-moving targets identifiable by their Doppler shift, LIDAR has Real Problems with Clutter--all that junk that's not moving and doesn't matter, like trees--where RADAR can differentiate the Stuff That Doesn't Matter fairly easily.
  • Cameras also have the world to work with, but changes in pixel/image size are ONE problem, not multiple ones. So, if one is going to solve driving-around-in-the-world, not having to contend with a completely separate compute path for LIDAR/RADAR/USS/whatever simplifies things to a single problem set, rather than multiple problem sets.
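A toy sketch of the geometry referred to above (NumPy assumed; the numbers are made up): each return is just a round-trip time plus the scan angles at which the pulse went out, and the "inventory" is the set of 3D points you get by converting those to Cartesian coordinates.

```python
# Toy sketch: turning a lidar return (time-of-flight + scan angles) into a
# 3D point in the sensor frame. Values are invented for illustration.
import numpy as np

C = 299_792_458.0  # speed of light, m/s

def tof_to_range(round_trip_s):
    """Range is half the round-trip time multiplied by the speed of light."""
    return 0.5 * np.asarray(round_trip_s) * C

def to_xyz(range_m, azimuth_rad, elevation_rad):
    """Convert spherical scan coordinates to Cartesian points."""
    x = range_m * np.cos(elevation_rad) * np.cos(azimuth_rad)
    y = range_m * np.cos(elevation_rad) * np.sin(azimuth_rad)
    z = range_m * np.sin(elevation_rad)
    return np.stack([x, y, z], axis=-1)

# One made-up return: ~0.67 microseconds round trip is roughly a 100 m range.
r = tof_to_range([6.67e-7])
print(r, to_xyz(r, azimuth_rad=np.array([0.1]), elevation_rad=np.array([0.02])))
```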
Yeah, in principle, with unlimited money multiple sensor suites can give a better view of Everything. But that's unlimited money. Since everything one typically needs is in those cameras, whose passive reception is perfectly capable of figuring out range, why not just solve the problem once?

Finally: see the weird atom symbol with wings? That means I really did work on this stuff. For a living. And, while this field wasn't my primary means of acquiring bread after college, you had better bet that I took a lot of pretty easy electives in the field for fun--and some not-so-easy electives, while I was at it, on antennas and microwaves.
 
Questions for the experts:
1. When radar, lidar, and camera data are obtained, the AI system needs to spend time identifying what they represent (car, human, ...). Is that correct?

2. Can radar/Lidar detect the small text on the road sign? Ex: No turn on red, Mon - Fri 6 AM - 3 PM.
 
Questions for the experts:
1. When radar, lidar, and camera data are obtained, the AI system needs to spend time identifying what they represent (car, human, ...). Is that correct?

2. Can radar/Lidar detect the small text on the road sign? Ex: No turn on red, Mon - Fri 6 AM - 3 PM.
Totally not an expert, but I can at least answer #2: radar and lidar cannot read road signs. They just see the sign as a flat plane.
 
I switched off my sonar for now as vision only appears to be more accurate.
Tesla Vision shows blobs, where it's up to the user to determine closeness. USS shows distances and squiggly lines, and it's up to the user to decide whether they believe the reported distance. I've also turned off USS in park assist. After ignoring the frantic USS warnings when backing into my garage, I can now try out another system. It seems just as frantic as USS. I just say to the car: relax, it'll be ok ... I'm using the mirrors.
 
Totally not an expert, but I can at least answer #2: radar and lidar cannot read road signs. They just see the sign as a flat plane.
It can if the light and dark parts of the sign have different reflectivity. But why wouldn't you just use the cameras? Every AV has cameras.
cameras, whose passive reception is perfectly capable of figuring out range, why not just solve the problem once?
The problem is nobody has figured out how to get the performance necessary out of cameras. Tesla is still hitting curbs.
 
Totally not an expert, but I can at least answer #2: radar and lidar cannot read road signs. They just see the sign as a flat plane.
There are some lidars that can read road signs if the signs are made multilayer (as in the lettering or background is raised). They can read road markings in the same way. That said, it's not going to be reliable enough for all scenarios (some signs are printed flat, with no depth difference), so you would always need a camera anyway.
 