IMO we exaggerate the differences, so no - I don't think NN needs to be trained for each city separately. As I said - we all drive in different cities and states - and millions take road trips every year without problems. Of course, good maps are a prerequisite ...

Well, I think that is one reason why companies use maps. Mobileye says that they embed those unwritten rules and driving preferences into their REM maps. So no, you don't need to train the NN separately for each city. Instead, what you can do is train the NN once for general driving everywhere and just have the map provide info to the car about any differences.
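As a rough sketch of that split (purely illustrative - the LocalRules fields and function names below are my own invention, not Mobileye's REM schema or Tesla's architecture), one general driving policy consumes per-region hints from the map as inputs, so regional differences don't require retraining:

```python
# Illustrative sketch only: one general driving policy plus per-region hints from a map.
# The LocalRules fields and names below are hypothetical, not any vendor's real schema.
from dataclasses import dataclass

@dataclass
class LocalRules:
    """Per-road-segment hints supplied by the map layer."""
    speed_limit_mps: float
    right_on_red_allowed: bool
    typical_following_time_s: float  # an "unwritten rule" style local preference

def plan(perception_features: dict, rules: LocalRules) -> dict:
    """Same policy everywhere; regional differences arrive as inputs, not as retraining."""
    target_speed = min(perception_features["free_flow_speed_mps"], rules.speed_limit_mps)
    follow_gap_s = max(1.5, rules.typical_following_time_s)  # never below a safety floor
    return {"target_speed_mps": target_speed, "follow_gap_s": follow_gap_s}

# Example: the same plan() call, different cities, different map hints.
print(plan({"free_flow_speed_mps": 20.0}, LocalRules(15.6, True, 2.0)))
```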
 
Hi, EVNow --

> we all drive in different cities and states - and millions take roadtrips every year without problems.

We all have NGI -- Natural General Intelligence. That may or may not be important here.

Yours,
RP
NN is supposed to mimic "General Intelligence," which is a misnomer according to the father of NNs (see the AI podcast). You don't need AGI for driving as such, and when they talk about AGI in the ChatGPT context, they don't include driving anyway.
 
You could say the same thing about transformers from 2017, general attention mechanisms from years earlier, or, more broadly, modern back-propagation from decades before that, but it wasn't until last year that people saw practical usage of these with ChatGPT, based on OpenAI's engineering efforts to train at scale as well as to fine-tune towards human preferences. Sure, Tesla is applying "old" techniques to "old" problems, but you seem to have already decided that Tesla's engineering efforts won't be able to get end-to-end control working for FSD.
It’s not like I missed GPT-1 or GPT-2 before GPT-3 and then ChatGPT. Nor did I miss AlexNet, AlphaGo, AlphaZero or AlphaFold. But hey, welcome to the field!

I have no doubt Tesla will get the planner "working" with v12. I am also very confident that their system won't get to autonomy from the current state overnight.

It's engineering - not magic. And at the present time I am not a huge believer in camera-only autonomy, nor do I have huge faith in Tesla's team outpacing other teams such as Waymo.

Speaking of LLMs… You might find this interesting:

Happy new year! Keep "dreaming" :)
 
you can SEE that they still have it since the car can still show the NN world-view in the few demo videos we have seen
The framerate of other-vehicle visualizations in the v12 livestream and 12.1 videos seems on average closer to 10 frames per second, whereas 11.4.9 videos seem closer to 20 fps. At least the earlier 12.x versions were described as running end-to-end control at the maximum 36 fps matching the camera inputs, so potentially the existing neural networks are kept around to periodically update the visualization. If the existing perception is reused as an input to end-to-end control, it could be running at a higher framerate for decisions but a lower one for visualization. Or it could be used purely as a visualization aid to provide some context for the blue control-path visualization, without being used directly for control.
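If that guess is right, one way to picture it is two loops driven off the same camera frames at different rates. This is only a sketch of the idea, not Tesla's actual firmware structure; the 4-frame visualization divisor is an assumption chosen to match the roughly 9-10 fps seen in the videos:

```python
# Sketch only: end-to-end control at camera rate, legacy perception reused just to
# refresh the on-screen world-view at a lower rate. Hypothetical structure.
CAMERA_FPS = 36
VIZ_EVERY_N = 4  # assumed divisor -> roughly 9 visualization updates per second

def run(frames, e2e_policy, legacy_perception, actuate, display):
    for i, frame in enumerate(frames):
        actuate(e2e_policy(frame))            # control decision every frame (36 fps)
        if i % VIZ_EVERY_N == 0:              # world-view refresh only every Nth frame
            display.update(legacy_perception(frame))
```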
 
The man with two watches never knows what time it is.

That's a bad analogy. A better analogy would be when I take the meeting minutes at my job: I use a physical recorder and a recording app on my phone to record the audio of the meeting. That way I have two audio recordings of the same meeting. If either device somehow fails to record properly, or the audio is not very clear in one recording, I have another recording to fall back on. The odds of both recordings failing at the same time are much lower than the odds of just one recording failing. So having two recordings is more reliable than having just one.
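The arithmetic behind that intuition, assuming the two devices fail independently (the failure rates below are made up purely for illustration):

```python
# Two independent recorders: the meeting is only lost if BOTH fail.
p_recorder = 0.05  # assumed chance the handheld recorder fails
p_phone = 0.08     # assumed chance the phone app fails

p_both_fail = p_recorder * p_phone       # 0.004 -> 0.4%
p_at_least_one_ok = 1 - p_both_fail      # 0.996
print(f"single recorder ok: {1 - p_recorder:.3f}, redundant pair ok: {p_at_least_one_ok:.3f}")
```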
 
That's a bad analogy. A better analogy would be when I take the meeting minutes at my job: I use a physical recorder and a recording app on my phone to record the audio of the meeting. That way I have two audio recordings of the same meeting. If either device somehow fails to record properly, or the audio is not very clear in one recording, I have another recording to fall back on. The odds of both recordings failing at the same time are much lower than the odds of just one recording failing. So having two recordings is more reliable than having just one.
That's a bad analogy: you are talking about a clear device failure (on a device assumed to be 100% accurate when working), or noise that is clearly identifiable in only one of them, not a qualitative threshold decision.

What if both devices are recording someone who mumbles?
 
I don't think NN needs to be trained for each city separately
The neural networks probably want to be trained on diverse data from around the world so that they can hopefully understand the potential behaviors of other vehicles. These differences might not be explicit perception outputs of the existing 11.x stack, but if the 12.x networks can learn them internally, it might even lead to control behaving differently based on what it sees other vehicles doing. Similar to how large language models will complete a sentence differently based on the subject, e.g., Sally vs. Bob vs. Fido, and how some assistants are tuned to be less biased, end-to-end control might actually be quite sensitive to perceived context and could require debiasing to, say, keep following at 3 seconds even if everybody else is tailgating.
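One way to picture that kind of debiasing (entirely hypothetical, not anything Tesla has described): keep the learned output, but clamp it with a hard prior so that imitating tailgaters can never pull the gap below a floor.

```python
# Hypothetical post-hoc debiasing of a learned following-gap output.
MIN_FOLLOW_TIME_S = 3.0  # assumed floor, per the "3 seconds" example above

def debiased_follow_time(policy_follow_time_s: float) -> float:
    """Keep the learned behavior, but never let imitated tailgating drop below the floor."""
    return max(policy_follow_time_s, MIN_FOLLOW_TIME_S)

print(debiased_follow_time(1.2))  # learned 1.2 s (tailgating context) -> clamped to 3.0 s
```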
 
That's a bad analogy. A better analogy would be when I take the meeting minutes at my job: I use a physical recorder and a recording app on my phone to record the audio of the meeting. That way I have two audio recordings of the same meeting. If either device somehow fails to record properly, or the audio is not very clear in one recording, I have another recording to fall back on. The odds of both recordings failing at the same time are much lower than the odds of just one recording failing. So having two recordings is more reliable than having just one.
It's actually appropriate. Unless one of the two stacks crashes or has a detectable failure, how do you determine which of the two is wrong? Maybe they are both acceptable, in different ways. Or, maybe they are both wrong.

The two-watch issue comes about because the wearer has no way to determine which of the two watches is closest to the 'correct' time, so he is always unsure which to use.
 
The man with two watches never knows what time it is.
You are talking about Segal's law, which is about the problem of having too much or too little information. The man who has a single watch does not in fact know whether his time is accurate, and the man with two watches does not know which one is more accurate.

When you are talking about safety-critical systems in engineering, you use redundancy to reduce the likelihood of safety-critical events. Having a multimodal sensor suite is not too much information, and sensor fusion is a well-understood problem. Having various NNs running simultaneously is also one way to reduce safety-critical events. Mobileye's system, for example, works like that.
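"Well understood" here usually means something like inverse-variance weighting: two independent measurements of the same quantity are combined with weights proportional to how much you trust each, and the fused estimate always has lower variance than either sensor alone. A minimal sketch, with the noise figures invented for illustration:

```python
# Minimal inverse-variance fusion of two independent range measurements.
def fuse(z1: float, var1: float, z2: float, var2: float):
    """Weight each measurement by 1/variance; fused variance is always <= min(var1, var2)."""
    w1, w2 = 1.0 / var1, 1.0 / var2
    estimate = (w1 * z1 + w2 * z2) / (w1 + w2)
    variance = 1.0 / (w1 + w2)
    return estimate, variance

# e.g. camera says 42.0 m (var 4.0), radar says 40.5 m (var 1.0) -> fused ~40.8 m, var 0.8
print(fuse(42.0, 4.0, 40.5, 1.0))
```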
 
That's a bad analogy: you are talking about a clear device failure (on a device assumed to be 100% accurate when working), or noise that is clearly identifiable in only one of them, not a qualitative threshold decision.

What if both devices are recording someone who mumbles?
Or, you have two people taking notes. One writes, "The CEO wants donuts, not bagels, at the next meeting." The other writes, "The CEO wants bagels, not donuts, at the next meeting." You weren't at the meeting but are responsible for refreshments. Which is correct? Should you start looking for a new job now?
 
You are talking about Segal's law, which is about the problem of having too much or too little information. The man who has a single watch does not in fact know whether his time is accurate, and the man with two watches does not know which one is more accurate.

When you are talking about safety-critical systems in engineering, you use redundancy to reduce the likelihood of safety-critical events. Having a multimodal sensor suite is not too much information, and sensor fusion is a well-understood problem. Having various NNs running simultaneously is also one way to reduce safety-critical events. Mobileye's system, for example, works like that.
Which is why the Cybertruck's steer-by-wire system has three steering wheel sensors and dual rack position sensors with different measurement methods.
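Three sensors is what lets you actually resolve the "two watches" problem: with two you can only detect a disagreement, but with three you can out-vote the outlier. A toy 2-of-3 voter (the tolerance value is made up):

```python
# Toy 2-of-3 voter for redundant angle sensors: pick the median and flag any outlier.
def vote(a: float, b: float, c: float, tol_deg: float = 0.5):
    readings = sorted([a, b, c])
    median = readings[1]
    outliers = [r for r in (a, b, c) if abs(r - median) > tol_deg]
    return median, outliers          # non-empty outliers -> report a faulted sensor

# Two sensors agree, one drifts: the median ignores the drifted reading.
print(vote(10.1, 10.2, 14.0))        # -> (10.2, [14.0])
```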
 
Or, you have two people taking notes. One writes, "The CEO wants donuts, not bagels, at the next meeting." The other writes, "The CEO wants bagels, not donuts, at the next meeting." You weren't at the meeting but are responsible for refreshments. Which is correct? Should you start looking for a new job now?

Uh that is not how it works. I am the meeting secretary. I am the only one responsible for taking notes and writing the minutes from those notes. That is why I have two audio recordings so I can go back and listen to what was actually said to make sure my minutes are 100% accurate.

And meeting secretary is an additional responsibility I volunteered for. It is not my main job.
 
Uh that is not how it works. I am the meeting secretary. I am the only one responsible for taking notes and writing the minutes from those notes. That is why I have two audio recordings so I can go back and listen to what was actually said to make sure my minutes are 100% accurate.

And meeting secretary is an additional responsibility I volunteered for. It is not my main job.
I wasn't necessarily referring to your specific meetings, only setting up a hypothetical where the output is processed data, not a recording straight from a sensor. BTW, there are many workplaces where audio/video recording is forbidden.
 
Having an end-to-end training ensemble makes it easier to program the Tesla Bot. No way is Tesla going to try to hand-code that, because they don't want to code for 8 years like FSD and then fail.
The tweet below mentions end-to-end training for the bot:
Just about the only thing I love about this company (bar management) as a tech enthusiast is the willingness to move fast, break things, keep iterating, and take risks. Can't wait to see the progress 2024 holds.
 
The framerate of other-vehicle visualizations in the v12 livestream and 12.1 videos seems on average closer to 10 frames per second, whereas 11.4.9 videos seem closer to 20 fps. At least the earlier 12.x versions were described as running end-to-end control at the maximum 36 fps matching the camera inputs, so potentially the existing neural networks are kept around to periodically update the visualization. If the existing perception is reused as an input to end-to-end control, it could be running at a higher framerate for decisions but a lower one for visualization. Or it could be used purely as a visualization aid to provide some context for the blue control-path visualization, without being used directly for control.
As I noted, it's unlikely that they are running two stacks like that. It's far too expensive given they are already at the limits of HW3 capacity.