
Monolithic versus Compound AI System


spacecoin

Mobileye is not a believer in pure e2e. I think they argue their position quite well.


Back in November 2022, the release of ChatGPT garnered widespread attention, not only for its versatility but also for its end-to-end design. This design involved a single foundational component, the GPT 3.5 large language model, which was enhanced through both supervised learning and reinforcement learning from human feedback to support conversational tasks. This holistic approach to AI was highlighted again with the launch of Tesla’s latest FSD system, described as an end-to-end neural network that processes visual data directly “from photons to driving control decisions,” without intermediary steps or “glue code.”

While AI models continue to evolve, the latest generation of ChatGPT has moved away from the monolithic E2E approach.
 
Last edited:
Mobileye is not a believer in pure e2e. I think they argue their position quite well.


The discussion of bias vs variance was very interesting. I learned some new things.

I thought their example of the long tail was especially interesting. In their example, it would take a fleet of 1M cars about 3 years to finish the last "9" in the march of 9s. They acknowledge that nobody knows how long the "long tail" actually is. We could be overestimating or underestimating the long tail. But I think it still shows that it will likely take years for an e2e approach to get to "eyes off". It is not going to happen quickly.
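
Just to put rough numbers on that (the per-car mileage and event rate below are my own assumptions, not Mobileye's actual figures), even a million-car fleet only surfaces a handful of examples per year of a sufficiently rare event:

Code:
# Rough arithmetic on how slowly even a huge fleet surfaces very rare events.
# Fleet size is from the example; per-car mileage and event rate are assumptions.
fleet_size = 1_000_000
miles_per_car_per_year = 12_000            # assumed average annual mileage
event_rate = 1 / 1_000_000_000             # assumed: one edge-case event per 1B miles

fleet_miles_per_year = fleet_size * miles_per_car_per_year     # 1.2e10 miles/year
events_surfaced_per_year = fleet_miles_per_year * event_rate   # ~12 examples/year
print(f"{events_surfaced_per_year:.0f} examples of this edge case per year")

At that rate it takes years just to collect enough examples of each new edge case to train on, which is the heart of the march-of-9s argument.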

I like how they debunk the idea that e2e does not require any "glue code". While the stack itself may not have glue code, they explain how there will still be glue code in the training.

Lastly, I assumed that Tesla did not need to validate ground truth anymore since e2e does not have a separate perception stack. So their explanation of why Tesla still needs to validate ground truth, and why the lidar purchase is likely for that purpose, was interesting to me.
 
  • Like
Reactions: Ben W and spacecoin
Mobileye is not a believer in pure e2e. I think they argue their position quite well.


Back in November 2022, the release of ChatGPT garnered widespread attention, not only for its versatility but also for its end-to-end design. This design involved a single foundational component, the GPT 3.5 large language model, which was enhanced through both supervised learning and reinforcement learning from human feedback to support conversational tasks. This holistic approach to AI was highlighted again with the launch of Tesla’s latest FSD system, described as an end-to-end neural network that processes visual data directly “from photons to driving control decisions,” without intermediary steps or “glue code.”

While AI models continue to evolve, the latest generation of ChatGPT has moved away from the monolithic E2E approach.
This was a really interesting read, thanks for sharing!

The blog post does make one very questionable assumption, which is that “long tail” events are independent and uncorrelated. To give an illustrative example, suppose two events with 10^-4 probability are “struck by lightning” and “attacked by a great white shark”, while a long-tail event with super-low 10^-8 probability might be “struck by lightning while simultaneously being attacked by a great white shark”.
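
To put numbers on that, here is the multiplication the blog's framing relies on, versus what happens if the two events share a common cause (the figures are the made-up ones from my example, plus a hypothetical correlation):

Code:
# Joint probability under independence vs. under (hypothetical) correlation.
p_lightning = 1e-4            # P(struck by lightning) -- illustrative figure only
p_shark = 1e-4                # P(attacked by a great white shark)

p_both_if_independent = p_lightning * p_shark    # 1e-8: the blog's long-tail arithmetic
# If both events share a common cause (say, swimming offshore during a storm),
# the conditional probability can be much larger than the marginal one:
p_shark_given_lightning = 1e-2                   # hypothetical strong correlation
p_both_if_correlated = p_lightning * p_shark_given_lightning   # 1e-6, 100x more likely
print(p_both_if_independent, p_both_if_correlated)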

It’s likely that by the time this super-rare event is encountered in the wild, the network will be familiar enough with its core concepts, as well as the general idea of how to handle multiple unusual things happening at once, to be able to handle this super-rare event successfully without ever having to have been explicitly trained on it.

In other words, generalization and extrapolation from the earlier part of the long tail might be sufficient to solve the VAST majority of the latter part of the long tail. That’s how it works for humans, anyway. And if it works for autonomous vehicles, then fully E2E L4 may be more achievable than Mobileye expects.
 
Last edited:
In other words, generalization and extrapolation from the earlier part of the long tail might be sufficient to solve the VAST majority of the latter part of the long tail. That’s how it works for humans, anyway. And if it works for autonomous vehicles, then fully E2E L4 may be more achievable than Mobileye expects.
Completely agree on this. The AI approach is not the typical one of searching a database for inputs and proposing outputs from that database. The AI is basically a highly complex and evolved algorithm that will produce results it was not even trained for. That is how humans operate, and that is why it takes so long for a human to mature compared to most other animals, which come pretty much preprogrammed to do their activities and are unable to do anything beyond that.
 
  • Like
Reactions: Ben W
This was a really interesting read, thanks for sharing!

The blog post does make one very questionable assumption, which is that “long tail” events are independent and uncorrelated. To give an illustrative example, suppose two events with 10^-4 probability are “struck by lightning” and “attacked by a great white shark”, while a long-tail event with super-low 10^-8 probability might be “struck by lightning while simultaneously being attacked by a great white shark”.

It’s likely that by the time this super-rare event is encountered in the wild, the network will be familiar enough with its core concepts, as well as the general idea of how to handle multiple unusual things happening at once, to be able to handle this super-rare event successfully without ever having to have been explicitly trained on it.

In other words, generalization and extrapolation from the earlier part of the long tail might be sufficient to solve the VAST majority of the latter part of the long tail. That’s how it works for humans, anyway. And if it works for autonomous vehicles, then fully E2E L4 may be more achievable than Mobileye expects.

Koopman has a neat little video that explains the long tail.


I think it raises the question: how much of the long tail actually needs to be solved for AVs to be safe/good enough? In other words, when does an edge case become so rare that we don't care anymore? Koopman makes the point that as edge cases become more rare, it will take more and more data to find the next one. So I would suggest that eventually we might get to the point where we have solved enough edge cases to be "good enough". Put differently, the next rare edge case requires so much time to find that it is not worth it anymore.

Koopman has another little video that explains how many miles you need to statistically validate safety greater than humans.


To get statistical significance, you need to drive 1B miles with fewer than 10 safety critical interventions in order to show you are safer than human drivers. With Tesla's fleet, they can get to 1B miles in a reasonable timeframe. So with end-to-end, in theory, it might be as "simple" as: push out a version of FSD, wait until the fleet has driven 1B miles, and count how many safety critical interventions occurred. Retrain the end-to-end network on the collected data. Push out another version. Repeat the cycle until the number of safety critical interventions is fewer than 10 per 1B miles. Once Tesla hits that benchmark, remove the driver supervision requirement. In fact, I would argue that is what Tesla is doing. The question is how long it will take to get to fewer than 10 safety critical interventions per 1B miles.
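
For anyone who wants the rough math, here's a minimal one-sided Poisson sketch of that mileage requirement. The human baseline rate and the 95% confidence level are my assumptions, not Koopman's exact figures:

Code:
# How many fleet miles are needed before "k or fewer safety-critical events"
# supports the claim, at ~95% confidence, of beating an assumed human rate.
from scipy.stats import poisson

human_rate = 1 / 100_000_000      # assumed baseline: 1 critical event per 100M miles
confidence = 0.95

def miles_needed(k_observed: int) -> float:
    """Smallest mileage m where observing <= k_observed events rejects
    'rate >= human_rate' in a one-sided Poisson test."""
    m = 1e6
    while poisson.cdf(k_observed, human_rate * m) > 1 - confidence:
        m *= 1.01
    return m

print(f"0 events: ~{miles_needed(0):.1e} miles")   # ~3e8 miles
print(f"9 events: ~{miles_needed(9):.1e} miles")   # ~1.6e9 miles, i.e. the ~1B-mile ballpark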
 
  • Like
Reactions: Ben W
Koopman has a neat little video that explains the long tail.


I think it raises the question: how much of the long tail actually needs to be solved for AVs to be safe/good enough? In other words, when does an edge case become so rare that we don't care anymore? Koopman makes the point that as edge cases become more rare, it will take more and more data to find the next one. So I would suggest that eventually we might get to the point where we have solved enough edge cases to be "good enough". Put differently, the next rare edge case requires so much time to find that it is not worth it anymore.

Koopman has another little video that explains how many miles you need to statistically validate safety greater than humans.


To get statistical significance, you need to drive 1B miles with fewer than 10 safety critical interventions in order to show you are safer than human drivers. With Tesla's fleet, they can get to 1B miles in a reasonable timeframe. So with end-to-end, in theory, it might be as "simple" as: push out a version of FSD, wait until the fleet has driven 1B miles, and count how many safety critical interventions occurred. Retrain the end-to-end network on the collected data. Push out another version. Repeat the cycle until the number of safety critical interventions is fewer than 10 per 1B miles. Once Tesla hits that benchmark, remove the driver supervision requirement. In fact, I would argue that is what Tesla is doing. The question is how long it will take to get to fewer than 10 safety critical interventions per 1B miles.
A useful definition of intelligence is: "The ability to solve problems that one hasn't encountered before."

The video makes the same questionable assumption I mentioned earlier: that the FSD network can't be good enough to extrapolate solutions to not-yet-seen edge cases from knowledge of other edge cases. (It basically states as common knowledge that humans can do this, but computers can't. I disagree; it's inevitable that computers will eventually gain humanlike capability in this regard.)

As neural networks grow in size and complexity (and training), they increase in both knowledge and "intellect". The former is related to the class of situations that they directly know (or have been directly trained) how to solve. The latter is related to the class of situations that they haven't yet seen or been trained for directly, but could still solve on the fly via "reasoning" from the rest of their knowledge base.

It's still an open question whether HW3/HW4 will ever be capable of running (or whether Tesla is capable of designing and training) a network of sufficient knowledge and intellect to "solve" L4 that will still fit on the chip. As hardware improves, the task gets easier; the problem will certainly be solved on the most advanced hardware first (my guess is HW6 around 2030-2032), and then it may trickle down over time to earlier hardware, though I doubt it will ever trickle all the way back to HW3. (My guess is that HW5 might eventually gain full L4 capability, but not HW4.)

The limiting factors are both compute and sensor suite robustness, which is why I expect and hope that Tesla will eventually add radar and Lidar back to the mix. L4 that can be foiled by a pigeon with good aim is just sparkling L2.
 
A useful definition of intelligence is: "The ability to solve problems that one hasn't encountered before."
and makes sure that the problem is no longer one without a solution, by training itself on it, whether by failing or by succeeding. In other words, it can learn from experiencing it, regardless of the result (success or failure) of that experience.

As I say, learning from failures is key to success.
 
Last edited:
A useful definition of intelligence is: "The ability to solve problems that one hasn't encountered before."

The video makes the same questionable assumption I mentioned earlier: that the FSD network can't be good enough to extrapolate solutions to not-yet-seen edge cases from knowledge of other edge cases. (It basically states as common knowledge that humans can do this, but computers can't. I disagree; it's inevitable that computers will eventually gain humanlike capability in this regard.)

As neural networks grow in size and complexity (and training), they increase in both knowledge and "intellect". The former is related to the class of situations that they directly know (or have been directly trained) how to solve. The latter is related to the class of situations that they haven't yet seen or been trained for directly, but could still solve on the fly via "reasoning" from the rest of their knowledge base.

It's still an open question whether HW3/HW4 will ever be capable of running (or whether Tesla is capable of designing and training) a network of sufficient knowledge and intellect to "solve" L4 that will still fit on the chip. As hardware improves, the task gets easier; the problem will certainly be solved on the most advanced hardware first (my guess is HW6 around 2030-2032), and then it may trickle down over time to earlier hardware, though I doubt it will ever trickle all the way back to HW3. (My guess is that HW5 might eventually gain full L4 capability, but not HW4.)

The limiting factors are both compute and sensor suite robustness, which is why I expect and hope that Tesla will eventually add radar and Lidar back to the mix. L4 that can be foiled by a pigeon with good aim is just sparkling L2.
Hi, Ben --

> it assumes that the FSD network can't be good enough to extrapolate solutions to not-yet-seen edge cases from knowledge of other edge cases. [...] situations that they haven't yet seen or been trained for directly, but could still solve on the fly via "reasoning" from the rest of their knowledge base.

I'm going to take that as being close enough to AGI. Do you believe that neural nets alone are enough to get AGI?

Yours,
RP
 
  • Like
Reactions: enemji
Hi, Ben --

> it assumes that the FSD network can't be good enough to extrapolate solutions to not-yet-seen edge cases from knowledge of other edge cases. [...] situations that they haven't yet seen or been trained for directly, but could still solve on the fly via "reasoning" from the rest of their knowledge base.

I'm going to take that as being close enough to AGI. Do you believe that neural nets alone are enough to get AGI?

Yours,
RP
AGI will require a neural net that can take its own output into account. In other words it must be able to iteratively process information in order to "think" about it, and must have a memory of its own thought process. (ChatGPT sort of does this; I'm not sure if the current FSD architecture does.) But yes, with an architecture like that, and with general-domain input (not just driving clips), it will eventually get to AGI. However, if a system is only fed driving clips as input, that's too narrow and not enough for AGI. Ironically, FSD will have to know about a lot more than just driving in order to solve all driving-related tasks properly.
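
As a toy sketch of what "taking its own output into account" could look like: an iterative loop where each step's output is appended to a scratchpad memory and fed back in as input. This is purely illustrative; "model" here is a stand-in, not ChatGPT's or FSD's actual architecture.

Code:
# Toy "think in a loop" wrapper: output is appended to a memory and re-fed as input.
from typing import Callable, List

def iterative_think(model: Callable[[str], str], problem: str, steps: int = 3) -> str:
    scratchpad: List[str] = []                 # memory of the system's own thought process
    for _ in range(steps):
        context = problem + "\n" + "\n".join(scratchpad)
        thought = model(context)               # the model sees its own earlier outputs
        scratchpad.append(thought)
    return scratchpad[-1]                      # final answer after iterating

# Usage with a dummy stand-in model:
echo_model = lambda ctx: f"refined ({len(ctx)} chars of context seen)"
print(iterative_think(echo_model, "Plan an unprotected left turn"))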
 
  • Like
Reactions: enemji
AGI will require a neural net that can take its own output into account. In other words it must be able to iteratively process information in order to "think" about it, and must have a memory of its own thought process. (ChatGPT sort of does this; I'm not sure if the current FSD architecture does.) But yes, with an architecture like that, and with general-domain input (not just driving clips), it will eventually get to AGI. However, if a system is only fed driving clips as input, that's too narrow and not enough for AGI. Ironically, FSD will have to know about a lot more than just driving in order to solve all driving-related tasks properly.
No one knows what ”AGI will require” tbh. There isn’t even a clear definition of what people mean when they say “AGI”.
 
The video makes the same questionable assumption I mentioned earlier, which is that it assumes that the FSD network can't be good enough to extrapolate solutions to not-yet-seen edge cases from knowledge of other edge cases. (It basically states as common knowledge that humans can do this, but computers can't. I disagree; it's inevitable that computers will eventually gain humanlike capability in this regard.)

As neural networks grow in size and complexity (and training), they increase in both knowledge and "intellect". The former is related to the class of situations that they directly know (or have been directly trained) how to solve. The latter is related to the class of situations that they haven't yet seen or been trained for directly, but could still solve on the fly via "reasoning" from the rest of their knowledge base.

Shashua talks about the ability of AI to "reason" outside of its training data. He says we are not quite there yet.


As I understand it, the way FSD E2E training works is that sensor input is mapped to a control output. Simple examples might be something like "red light" is mapped to "stop", "yellow light" is mapped to "slow down", "slow lead car" is mapped to "lane change", etc... The more data, the more complex the connections. But I don't think the NN is able to reason per se. It simply maps a certain input to a certain output. So when our cars drive, the NN takes in the sensor input which connects to a certain control output. More data means a bigger NN, which means more connections, more driving cases mapped to the correct driving output.

But if the NN were to encounter a situation outside of its training data, I don't think it would necessarily know what to do, since it does not have any output mapped to that input. The NN would probably "hallucinate" an output, i.e., it would guess based on the closest input it has. In some cases, we might get lucky and the guess turns out to be the correct output. In other cases, the guess might produce a bad output. And that is where you retrain the NN on that new edge case so that it has the correct output mapped for that edge case. Now, it might depend on how close the edge case is to the training set. If the edge case is very similar to the training set, the NN might be more likely to guess the correct output. But if the edge case is something completely different from anything in the training set, the NN is probably less likely to guess the correct output. That is why you want to reduce variance with a generalized and diverse training set, so that there is a good chance that any edge case will be close enough to the training set. I could see a situation where, if the E2E NN is big enough that say 99.99% of driving cases are mapped to the correct output, that is good enough, because any remaining edge cases will either be close enough to the training set that the NN can guess the correct output or the edge case is so rare it does not matter.
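
To caricature that mental model in code: a "policy" that just returns the action of the closest training example will always produce something for unseen input, but it's only a guess based on the nearest thing it has seen. (This is an analogy for the argument above, not how Tesla's network actually works.)

Code:
# Nearest-neighbor caricature of "input mapped to output": unseen inputs still get
# an answer, but it's just the action of whatever training example is closest.
import math

training_set = {
    (1.0, 0.0): "stop",          # made-up feature vectors, e.g. (red_light, lead_car_slow)
    (0.0, 1.0): "lane_change",
    (0.0, 0.0): "proceed",
}

def policy(observation):
    nearest = min(training_set, key=lambda x: math.dist(x, observation))
    return training_set[nearest]     # a "guess" when the observation is far from everything

print(policy((0.9, 0.1)))   # close to training data -> sensible output
print(policy((5.0, 5.0)))   # far outside training data -> still answers, may be wrong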

I guess the real question is how big the E2E NN has to be to "solve" FSD. If you map say 99.9% of driving cases to the correct output, is that good enough? Can the NN "reason" well enough on the remaining 0.1%? It is possible that a big enough NN will be good enough to "solve" L5, but we might not have the onboard compute yet to handle that.
 
While AI models continue to evolve, the latest generation of ChatGPT has moved away from the monolithic E2E approach.
Personally I think this is essentially a case of starting with your conclusion and working back from it. OpenAI literally released a new E2E version of ChatGPT 2 days before this blog post, claiming vastly superior performance with it.[0] In light of this, I don't even see a need to keep reading this post when the argument they try to start with was made obsolete before the post even went live.

Plus I mean, the whole point of having these E2E systems is to be able to have a model that has greater generalization abilities. (i.e. Teslabot or w/e stupid name it's called).
 
Personally I think this is essentially a case of starting with your conclusion and working back from it. OpenAI literally released a new E2E version of ChatGPT 2 days before this blog post, claiming vastly superior performance with it.[0] In light of this, I don't even see a need to keep reading this post when the argument they try to start with was made obsolete before the post even went live.

Plus I mean, the whole point of having these E2E systems is to be able to have a model that has greater generalization abilities. (i.e. Teslabot or w/e stupid name it's called).

Yes, the blog is wrong about the latest GPT model not being E2E. But I think you are missing the blog's argument. The blog is not arguing that E2E is not capable or not generalized. The argument is about safety: AVs are safety critical applications, ChatGPT is not. So yes, E2E has proven to be more capable and more generalized, but that is not the only metric for AVs. AVs must also be safe. And ultimately, being super generalized is not enough for AVs if the MTBF is not high enough. I think Mobileye is trying to make the case that E2E is not the best approach to achieve the very high MTBF goal of 10^7 hours, which they claim is needed for "eyes off" autonomy.
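
For scale, converting that 10^7-hour MTBF figure into miles (the average speed is my assumption):

Code:
# Convert the claimed 1e7-hour MTBF target into miles between failures.
mtbf_hours = 1e7                 # Mobileye's claimed "eyes off" target, per the blog
avg_speed_mph = 30               # assumed average driving speed
mtbf_miles = mtbf_hours * avg_speed_mph
print(f"~{mtbf_miles:.0e} miles between safety-critical failures")   # ~3e8 miles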
 
As I understand it, the way FSD E2E training works is that sensor input is mapped to a control output. Simple examples might be something like "red light" is mapped to "stop", "yellow light" is mapped to "slow down", "slow lead car" is mapped to "lane change", etc... The more data, the more complex the connections. But I don't think the NN is able to reason per se. It simply maps a certain input to a certain output. So when our cars drive, the NN takes in the sensor input which connects to a certain control output. More data means a bigger NN, which means more connections, more driving cases mapped to the correct driving output.
If what you say is correct, then that is no better than, or may even just be, a HUGE IF ELSEIF WHEN DO LOOP module, which I don't want to believe it is.
 
Yes, the blog is wrong about the latest GPT model not being E2E. But I think you are missing the blog's argument. The blog is not arguing that E2E is not capable or not generalized. The argument is about safety: AVs are safety critical applications, ChatGPT is not. So yes, E2E has proven to be more capable and more generalized, but that is not the only metric for AVs. AVs must also be safe. And ultimately, being super generalized is not enough for AVs if the MTBF is not high enough. I think Mobileye is trying to make the case that E2E is not the best approach to achieve the very high MTBF goal of 10^7 hours, which they claim is needed for "eyes off" autonomy.

I mean, I get the point of their article.

It's to reassure their investors that they are on the right track to solve this and Tesla won't ever, because (1) that's their value prop as a supplier of systems rather than a direct manufacturer, and (2) it's an easy way to score free advertising. It's why they keep claiming everything needs large data sets for E2E, when Tesla can poll the fleet for specific cases and also create simulated data for these scenarios, just like Waymo.

I'll highlight part of it as an example here.
The rolling stop events had to be taken out of the training data so that the E2E system would not adopt bad behavior (as it is supposed to imitate humans). In other words, the glue-code (and bugs) are shifting from the system code to data curation code. Actually, it may be easier to detect bugs in system code (at least there are existing methodologies for that) than to detect bugs in data curation.
This is trying to somehow claim that the "glue code (and bugs)" is now in the training data set. First of all, the rolling stop isn't a "bug" in any sense of the word; it's a ticket I would personally close with the tag "as designed." Tesla designed the system to imitate humans, as Mobileye correctly claims, and then suddenly, because of that, it's "a bug." Is it "a bug" when humans roll a stop sign...? Changing targets for something (especially due to a govt. investigation) isn't a bug.

They then start claiming "it may be easier to detect bugs in system code" when Tesla has... so much data available to poll the cars for specific scenarios. Here, I'll even lay out the logic for detecting this event.

Did the car sense a stop sign? Did the car come to a complete stop (speed = 0 mph) before proceeding?
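
A minimal sketch of that trigger logic, with hypothetical clip fields like saw_stop_sign and min_speed_mph (Tesla's real telemetry schema isn't public, so these names are made up):

Code:
# Toy fleet-data trigger for curating rolling-stop clips out of the training set.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Clip:
    saw_stop_sign: bool      # hypothetical field: perception flagged a stop sign
    min_speed_mph: float     # hypothetical field: minimum speed near the sign

def is_rolling_stop(clip: Clip) -> bool:
    return clip.saw_stop_sign and clip.min_speed_mph > 0.0

def curate(clips: List[Clip]) -> Tuple[List[Clip], List[Clip]]:
    keep, exclude = [], []
    for clip in clips:
        (exclude if is_rolling_stop(clip) else keep).append(clip)
    return keep, exclude     # train on the full stops, review/drop the rolling stops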

Wow, such a complex scenario. Meanwhile, I've spent 2 weeks personally debugging code that wasn't working the way it should, and it all came down to an engineer using the wrong system clock (which internal automated tools are supposed to flag as wrong).

Also for funsies, here's GPT-4o's rebuttal to the article:

The argument favoring CAIS over E2E due to the bias-variance tradeoff can be contested on several grounds:
  1. Advancements in E2E Learning: Recent advancements in neural network architectures and training techniques have significantly improved the generalization capabilities of E2E systems. Techniques such as transfer learning, data augmentation, and more sophisticated neural networks can reduce the generalization error without requiring prohibitively large datasets.
  2. Richness of Data Representation: While CAIS systems may intentionally simplify the representation of reality through a "sensing state," this abstraction can overlook critical nuances present in raw data. E2E systems can capture richer and more detailed representations directly from the raw data, potentially leading to more accurate decision-making.
  3. Data Utilization Efficiency: The assumption that E2E systems need exponentially more data than CAIS systems might be overstated. Advanced data synthesis, simulation environments, and efficient data collection strategies can provide diverse and comprehensive datasets for E2E training without an impractical increase in data volume.
  4. Real-World Performance: In practice, E2E systems have shown remarkable performance in complex tasks by learning directly from large, diverse datasets. For example, E2E systems have excelled in areas like image recognition and natural language processing, suggesting that similar approaches can be effectively adapted to autonomous driving.
  5. Adaptability and Scalability: E2E systems can adapt more readily to new scenarios and environments as they do not rely on pre-defined abstractions that might need extensive re-engineering. This adaptability can lead to faster deployment and scaling in varied real-world conditions.
In conclusion, while the CAIS approach emphasizes managing the bias-variance tradeoff through architectural constraints, the evolving capabilities of E2E systems challenge the notion that such constraints are necessary or beneficial. With continuous improvements in data handling and model training, E2E systems can potentially offer a more accurate and flexible solution for self-driving technology.
 
  • Like
Reactions: enemji
Wow, such a complex scenario. Meanwhile, I've spent 2 weeks personally debugging code that wasn't working the way it should, and it all came down to an engineer using the wrong system clock.
Reminds me of a time when I was debugging some German code. Out of curiosity, I took the comment next to it, which was written in German, and translated it. It basically said: "I don't know why I put this code in here. It does not make any sense, but I was told to do so, and I did."

Oh well. Life is fun.
 
AGI will require a neural net that can take its own output into account. In other words it must be able to iteratively process information in order to "think" about it, and must have a memory of its own thought process. (ChatGPT sort of does this; I'm not sure if the current FSD architecture does.) But yes, with an architecture like that, and with general-domain input (not just driving clips), it will eventually get to AGI. However, if a system is only fed driving clips as input, that's too narrow and not enough for AGI. Ironically, FSD will have to know about a lot more than just driving in order to solve all driving-related tasks properly.
Hi, Ben --

> AGI will require a neural net that can take its own output into account.

A lot of people think that recursion is the secret of consciousness. I'm not sure a neural net that can take its own output into account really counts as a neural net anymore. I've programmed computers but claim no more than a layman's knowledge of recent advances in AI. Still, I think what you're hinting at would require a fundamentally different architecture, where the hard part isn't the "neural nets", if they're even involved at all.

Yours,
RP
 
Last edited:
  • Like
Reactions: enemji
If what you say is correct, then that is no better than, or may even just be, a HUGE IF ELSEIF WHEN DO LOOP module, which I don't want to believe it is.
Hi, Enemji --

> that is no better or even is just a HUGE IF ELSEIF WHEN DO LOOP module,

Again, I have a background in CS but only a layman's knowledge of AI. One of the main insights of CS (insight isn't quite the right word, and I doubt I'll be able to explain this well) is that if you can turn one thing/problem into another thing/problem, then they're the same thing. I think of NNs as just being another kind of solver. You use solvers for problems you're too dumb to figure out on your own; for example, you might not know how to take a square root, but if you've got Newton-Raphson you can get arbitrarily close.
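
For example, here's a quick Newton-Raphson square root, just to show what I mean by a solver that gets arbitrarily close without you needing to know a closed-form answer:

Code:
# Newton-Raphson iteration for sqrt(a): solve f(x) = x^2 - a = 0.
def newton_sqrt(a: float, tol: float = 1e-12) -> float:
    x = a if a > 1 else 1.0                  # any positive starting guess works
    while abs(x * x - a) > tol:
        x = x - (x * x - a) / (2 * x)        # x_{n+1} = x_n - f(x_n) / f'(x_n)
    return x

print(newton_sqrt(2.0))   # ~1.4142135623730951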

So... the use of NNs here isn't that there's any kind of magic to them; it's just that you can get *close enough* to a solution that you're too dumb to figure out on your own. At some level, though, beneath the hood, there is no difference between an NN and your "HUGE IF ELSEIF WHEN DO LOOP".

Yours,
RP