Monolithic versus Compound AI System

Hi, Enemji --

> that is no better or even is just a HUGE IF ELSEIF WHEN DO LOOP module,

Again, I have a background in CS but only a layman's knowledge of AI. One of the main, I don't know, insights isn't the right word, I doubt I'll be able to explain this, of CS is that if you can turn one thing/problem into another thing/problem, then they're the same thing. I think of NNs as just being another kind of solver. You use solvers for problems you're too dumb to figure out on your own; like, you might not know how to take a square root, but if you've got Newton-Raphson you can get arbitrarily close.

So .... the use of NNs here isn't that there's any kind of magic to them, it's just that you can get *close enough* to a solution that you're too dumb to figure out on your own. At some level though, beneath the hood, there is no difference between an NN and your "HUGE IF ELSEIF WHEN DO LOOP".
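
For concreteness, here is a minimal Python sketch of that "solver" idea, using Newton-Raphson for square roots (purely illustrative):

def newton_sqrt(a, tolerance=1e-12):
    """Approximate sqrt(a) by repeating the update x <- (x + a/x) / 2."""
    x = a if a > 1 else 1.0           # any positive starting guess will do
    while abs(x * x - a) > tolerance:
        x = 0.5 * (x + a / x)
    return x

print(newton_sqrt(2.0))   # ~1.41421356..., as close as the tolerance allows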

Yours,
RP
RP, it is my understanding that the implementation is more like that of an equation that spits out an output based upon the values of the input variables x, y, z, a, b, c, etc. This is what makes AI/NN so much faster to respond, as it is not running through a series of statements full of all these IF/DO/WHEN kinds of commands.

Tesla is using what are known as DNNs & CNNs to that end. It is very challenging to understand how they arrive at specific decisions, which is why Tesla is using LiDAR in its training fleet to validate those decisions during training.
 

RP, it is my understanding that the implementation is more like that of an equation that spits out an output based upon the values of the input variables x, y, z, a, b, c, etc. This is what makes AI/NN so much faster to respond, as it is not running through a series of statements full of all these IF/DO/WHEN kinds of commands.

Tesla is using what are known as DNNs & CNNs to that end. It is very challenging to understand how they arrive at specific decisions, which is why Tesla is using LiDAR in its training fleet to validate those decisions during training.
Hi, Enemji --

Not a hill I'm willing to die on, because I honestly don't know what I'm talking about, but I believe it should be possible to "decompile" a neural net into, say, C code, thereby showing that they're functionally equivalent.

Yours,
Bowd
 
Hi, Enemji --

Not a hill I'm willing to die on, because I honestly don't know what I'm talking about, but I believe it should be possible to "decompile" a neural net into, say, C code, thereby showing that they're functionally equivalent.

Yours,
Bowd
From how I understand CNNs work, they decompose/deconstruct the incoming image into layers of photons, build a pattern, and record the human inputs against these patterns. The algorithm (equation) is essentially built to take incoming light from the cameras, feed it as a pattern input to the algorithm to identify the closest response it was trained on, and produce the output that will control the car, i.e. brake, steer, accelerate, turn, et al.
 
The way I have always envisioned FSD operating is that it communicates these failures back to the Tesla AI hub, where the incident will be used for training so that handling it becomes possible.
Training the next iteration of the software will always (or at least for the foreseeable future) be centralized. The software will be identical for each car for each drive; it won't "learn" as it goes. Otherwise you'd end up with millions of cars, each gradually developing its own distinct "personality", which would be an absolute quality-control nightmare!

So yes, the failures (safety-critical interventions or crashes) are communicated back to Tesla HQ, where they are used as guidance to curate the next set of training data, which will hopefully selectively "fix" the observed problems for the next software iteration.
 
No one knows what "AGI will require", tbh. There isn't even a clear definition of what people mean when they say "AGI".
Loosely, "AGI" means the cognitive ability to solve most problems that a typical human can solve. It may or may not include the ability to "learn" (self-training of its own neural network), or to have a "mind", or "consciousness", though those terms are also quite ill-defined.

True that no one knows what this will require. The latest LLMs are getting close in some ways, but are not quite there yet, and it's unclear whether the LLM approach can fundamentally scale to get all the way there, or if some other "secret sauce" or architectural framework might turn out to be required. I do expect that within a decade or two the most powerful systems will be widely considered to have reached "AGI". Then we'll all start having the same discussion about "ASI" (Artificial Super-Intelligence), which will undoubtedly keep us occupied for the next few decades!
 
Shashua talks about the ability for AI to "reason" outside of its training data. He says we are not quite there yet.
We are there for simple reasoning. Complex reasoning, not yet.
As I understand it, the way FSD E2E training works is that sensor input is mapped to a control output. Simple examples might be something like "red light" is mapped to "stop", "yellow light" is mapped to "slow down", "slow lead car" is mapped to "lane change", etc... The more data, the more complex the connections. But I don't think the NN is able to reason per se. It simply maps a certain input to a certain output. So when our cars drive, the NN takes in the sensor input which connects to a certain control output. More data means a bigger NN, which means more connections, more driving cases mapped to the correct driving output.
This sounds like a "lookup-table", which is not really a good analogy to how modern neural networks work. Rather, a neural network synthesizes its vast amount of training data (which could be exabytes, orders of magnitude more information than the network could explicitly contain) into a much smaller "understanding and reasoning" procedure that can reliably reconstruct the patterns it finds in the training data. There is no explicitly programmed notion of e.g. "stop sign" or "red light"; the training data is completely unlabelled. The system learns for itself the correlations between the corresponding visual patterns and the desired behavior. There may be a "stop sign" neuron inside the neural network, or there may not be; it doesn't matter, and it's not essential to the correct functioning of the network.

The hope is that a system like this, by being forced to economize in an information-theoretic sense, will inherently discover the "essence" of the ideas encoded in the training data. This will allow it not only to reproduce all the examples in the training set, but also to successfully generalize/extrapolate to unseen cases as well. This is how humans can solve problems we've never seen before: we can intuitively break them down into simpler components that may each have some individual familiarity, and then apply already-learned techniques to deal with incorporating the novel and uncertain aspects of the situation to come up with a reasonable response.
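
As a toy illustration of that compression-and-generalization idea (made-up numbers, nothing to do with driving): a large pile of noisy examples gets distilled into just two learned parameters, which then extrapolate to inputs far outside the training range.

import numpy as np

# 10,000 noisy training examples are condensed into two learned numbers
# (slope, intercept), which can reconstruct the underlying pattern even
# for inputs the model has never seen.
rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, 10_000)
y = 3.0 * x - 2.0 + rng.normal(0, 0.5, 10_000)   # hidden "rule" plus noise

slope, intercept = np.polyfit(x, y, 1)            # the "training" step
print(slope, intercept)                           # roughly 3.0 and -2.0
print(slope * 42 + intercept)                     # extrapolates to x = 42,
                                                  # far outside the data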
But if the NN were to encounter a situation outside of its training data, I don't think it would necessarily know what to do, since it does not have any output mapped to that input. The NN would probably "hallucinate" an output, i.e., it would guess based on the closest input it has. In some cases, we might get lucky and the guess turns out to be the correct output. In other cases, the guess might produce a bad output. And that is where you retrain the NN on that new edge case so that it has the correct output mapped for that edge case. Now, it might depend on how close the edge case is to the training set. If the edge case is very similar to the training set, the NN might be more likely to guess the correct output. But if the edge case is something completely different from anything in the training set, the NN is probably less likely to guess the correct output. That is why you want to reduce variance with a generalized and diverse training set, so that there is a good chance that any edge case will be close enough to the training set. I could see a situation where, if the E2E NN is big enough that say 99.99% of driving cases are mapped to the correct output, that is good enough, because any remaining edge cases will either be close enough to the training set that the NN can guess the correct output or the edge case is so rare it does not matter.
"There is nothing new under the sun." It is likely that if a network is trained on the 10,000 most common cases, it would be able to generalize (with simple reasoning) to handle most of the 1,000,000 most common cases as well. If further trained on just the exceptions in that dataset, it should be capable of further generalizing to the vast majority of the billion most common cases, which should be enough 9's of reliability to handle L4 autonomy. This is the hope, at least. Also, the meta-idea of "how to deal with weird unknown things" should be implicitly learned by the system through the training process, since any part of the network that begins to develop that ability would be highly reinforced and rewarded.
I guess the real question is how big the E2E NN has to be to "solve" FSD. If you map say 99.9% of driving cases to the correct output, is that good enough? Can the NN "reason" well enough on the remaining 0.1%? It is possible that a big enough NN will be good enough to "solve" L5, but we might not have the onboard compute yet to handle that.
The first NN to "solve" L5 (or even L4) for pure vision would undoubtedly be gigantic, far too large to fit on HW4 or HW3. It may take a long time to streamline it down to fit on real-world autonomous hardware. Adding sensor-suite "crutches" such as radar/Lidar/ultrasonic might dramatically shrink the required size of the network and improve its latency, making it feasible on given compute hardware much sooner than a vision-only solution. (As well as providing weather robustness, increasing the ODD and reliability.) This is why I hope Tesla changes its tune and decides to build out the sensor suite more fully for HW5/Robotaxi; it should get them to full autonomy much sooner.
 
Again, I have a background in CS but only a layman's knowledge of AI. One of the main, I don't know, insights isn't the right word, I doubt I'll be able to explain this, of CS is that if you can turn one thing/problem into another thing/problem, then they're the same thing. I think of NNs as just being another kind of solver. You use solvers for problems you're too dumb to figure out on your own; like, you might not know how to take a square root, but if you've got Newton-Raphson you can get arbitrarily close.

So .... the use of NNs here isn't that there's any kind of magic to them, it's just that you can get *close enough* to a solution that you're too dumb to figure out on your own. At some level though, beneath the hood, there is no difference between an NN and your "HUGE IF ELSEIF WHEN DO LOOP".
True that any Von Neumann machine is at a very abstract level equivalent to any other. The difference is that an "IF ELSEIF WHEN DO LOOP" is by its nature very serial; all the processing goes through a single tiny bottleneck on the chip that does the testing and branching. A neural network, on the other hand, is massively (embarrassingly) parallel, and branchless. Millions of computations are done at once by an extremely parallel processor ("TPU"), rather than processing one instruction at a time by a single serial "CPU". You could simulate these massively parallel networks on CPU, per computational equivalence, but they would run millions of times slower.
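
As a toy illustration of that contrast (hypothetical rules and made-up weights, nothing Tesla actually runs): the first function decides through serial branches; the second gets a comparable decision out of a single branchless matrix-vector product that a parallel processor can evaluate all at once.

import numpy as np

def rule_based(speed_delta, light_is_red):
    # Serial "IF ELSEIF" style: one test at a time through a single branch unit.
    if light_is_red:
        return "brake"
    elif speed_delta < -5:
        return "slow down"
    else:
        return "maintain"

# Branchless, parallel style: every "rule" is evaluated at once as one
# matrix-vector product; the decision falls out of an argmax, with no
# if/else in the data path. (In a real network the weights are learned.)
W = np.array([[ 0.0,  5.0],
              [-1.0,  0.0],
              [ 0.2,  0.0]])
b = np.array([0.0, -5.0, 1.0])
actions = ["brake", "slow down", "maintain"]

def nn_style(speed_delta, light_is_red):
    x = np.array([float(speed_delta), float(light_is_red)])
    scores = W @ x + b
    return actions[int(np.argmax(scores))]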
 
Mobileye is not a believer in pure E2E. I think they argue their position quite well.


Back in November 2022, the release of ChatGPT garnered widespread attention, not only for its versatility but also for its end-to-end design. This design involved a single foundational component, the GPT 3.5 large language model, which was enhanced through both supervised learning and reinforcement learning from human feedback to support conversational tasks. This holistic approach to AI was highlighted again with the launch of Tesla’s latest FSD system, described as an end-to-end neural network that processes visual data directly “from photons to driving control decisions," without intermediary steps or “glue code."

While AI models continue to evolve, the latest generation of ChatGPT has moved away from the monolithic E2E approach.
This article linked by another member suggests Tesla's current E2E system is not a monolithic E2E system, but rather a modular E2E system (which explains how they were able to get it out relatively quickly with relatively few regressions, while keeping the UI mostly the same).
Breakdown: How Tesla will transition from Modular to End-To-End Deep Learning
FSD v12.x (end to end AI)

"And we’ve got the final piece of the puzzle, which is to have the control part of the car transition from about 300,000 lines of C++ code to a [full-scale] neural network, so the whole system will be neural network, photons into controls out"
Elon Musk recaps Tesla FSD neural network at All-In Summit 2023
Note this is the quote being referenced. But what Elon said is still consistent with a modular E2E system. Basically, the last piece of the puzzle was that the planning part was still C++, but Tesla switched that to NNs. After doing that, everything would be NN (other than some relatively minor code to connect the various modules). Then they also feed the final control output back into training the perception network, which is what makes it "E2E", as opposed to previously, where the perception network was trained completely independently and only cared about its own output, not whether it resulted in correct control output.
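
A rough conceptual sketch of the modular-versus-monolithic distinction being described (hypothetical function names only; not Tesla's actual architecture or code):

def modular_e2e(camera_frames, perception_net, planner_net, controller_net):
    scene = perception_net(camera_frames)   # intermediate scene representation
    plan = planner_net(scene)               # replaced the ~300k lines of C++ planning rules
    controls = controller_net(plan)         # steering / accel / brake targets
    return controls                         # "E2E" in the sense that control results
                                            # also feed back into training perception

def monolithic_e2e(camera_frames, single_net):
    return single_net(camera_frames)        # "photons in, controls out", one network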
 
Tesla is using what are known as DNNs & CNNs to that end. It is very challenging to understand how they arrive at specific decisions, which is why Tesla is using LiDAR in its training fleet to validate those decisions during training.
I don't think this is what Tesla is using the LiDAR for. Neural nets are extremely opaque; even with LiDAR ground truth it's essentially impossible to "know why" the neural net makes a particular decision.

Rather, they were likely using LiDAR to train v11, which had an intermediate "occupancy network" layer that required LiDAR ground truth to produce accurate training data. They might currently be using it to validate the camera calibration process. They might also be using it to enhance the accuracy of their synthetic data visualizations for training. And of course, if future vehicles incorporate LiDAR into their autonomous sensor suites (which I hope happens), Tesla will need to have real-world LiDAR data captures to train the E2E models with.
 
Hi, Enemji --

Not a hill I'm willing to die on, because I honestly don't know what I'm talking about, but I believe it should be possible to "decompile" a neural net into, say, C code, thereby showing that they're functionally equivalent.

Yours,
Bowd
Correct, you could. Of course, the C code would be absolutely unintelligible gibberish (or at best, a sequence of gigantic matrix multiplies with random-looking numbers), even if you could compile and run it and get the same results as the neural network. Put another way, you could show that they're functionally equivalent, but you wouldn't gain any insight into the neural network by doing so.
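
To make that concrete, here is a toy "decompiled" network, written here in Python rather than C for brevity, with made-up constants; a real network would be the same idea with millions of equally meaningless numbers.

import math

# A 2-input, 2-hidden-unit, 1-output network unrolled into straight-line
# arithmetic with its (invented) weights baked in as literals.
def unrolled_net(x0, x1):
    h0 = math.tanh( 0.4173 * x0 - 1.0921 * x1 + 0.0337)
    h1 = math.tanh(-0.7745 * x0 + 0.2279 * x1 - 0.5110)
    return 1.3062 * h0 - 0.9418 * h1 + 0.0729

# Deterministic and functionally equivalent to the original weights, but it
# offers no more insight into "why" it answers the way it does than they did.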
 
From how I understand CNNs work, they decompose/deconstruct the incoming image into layers of photons, build a pattern, and record the human inputs against these patterns. The algorithm (equation) is essentially built to take incoming light from the cameras, feed it as a pattern input to the algorithm to identify the closest response it was trained on, and produce the output that will control the car, i.e. brake, steer, accelerate, turn, et al.
Sort of. The network doesn't maintain any independent memory of the inputs; it just synthesizes them all together into a combined network. So there's no such thing as "identify the closest response that it was trained on". Rather, through the training on millions of examples, it's learned to recognize the fundamental patterns that were present in the various inputs, and then infers the information-processing patterns for how those inputs tend to produce outputs. So at inference time (while the car is driving), it's merely following those combined patterns, and not looking back to specific training examples.
 
Correct, you could. Of course, the C code would be absolutely unintelligible gibberish (or at best, a sequence of gigantic matrix multiplies with random-looking numbers), even if you could compile and run it and get the same results as the neural network. Put another way, you could show that they're functionally equivalent, but you wouldn't gain any insight into the neural network by doing so.
I don't think this is generally correct. You could do it for some amount of sample data (i.e. the test cases that you use as proof), but you can't guarantee functional equality for all cases of input, since we can't explain why NNs produce the output they do.

Perhaps you could potentially represent the statistics-based "reasoning" in the NN in C code as you say, but it's not going to be rules-based or understandable by humans, so it will still be a black box and a pointless exercise.
 
So there's no such thing as "identify the closest response that it was trained on".
I agree that it does not do that through a database search; rather, the algorithm does it, given that every pattern captured by a Tesla on the road is different yet similar. That, however, is fuzzy logic and not acceptable in a court of law, as there are no specific heuristics that can be traced or traversed. That is where devices and sensors specifically designed to be deterministic, i.e. LiDAR, are used to measure distance and ensure that during training the fuzzy logic is proposing the right outcomes.
 
I don't think this is what Tesla is using the LiDAR for. Neural nets are extremely opaque; even with LiDAR ground truth it's essentially impossible to "know why" the neural net makes a particular decision.
I would tend to agree with you if Tesla had stereoscopic vision for each camera it has. But it does not. Technically, without using a complex sensor-pixel-based focusing system, I see no other way Tesla would be able to validate distances of objects when it deconstructs and reconstructs a pattern for the CNN.
 
I don't think this is generally correct. You could do it for some amount of sample data (i.e. the test cases that you use as proof), but you can't guarantee functional equality for all cases of input, since we can't explain why NNs produce the output they do.

Perhaps you could potentially represent the statistics-based "reasoning" in the NN in C code as you say, but it's not going to be rules-based or understandable by humans, so it will still be a black box and a pointless exercise.
"Can't explain" is only in the semantic sense, not in the arithmetic sense. The neural network is deterministic and will always produce exactly the same output for exactly the same input. This can be mirrored in equivalent C code that is guaranteed to have functional equality (though drastically slower runtime).
 
I agree that it does not do that through a database search; rather, the algorithm does it, given that every pattern captured by a Tesla on the road is different yet similar. That, however, is fuzzy logic and not acceptable in a court of law, as there are no specific heuristics that can be traced or traversed. That is where devices and sensors specifically designed to be deterministic, i.e. LiDAR, are used to measure distance and ensure that during training the fuzzy logic is proposing the right outcomes.
LiDAR is closer to ground truth than depth inferred from pure vision, but neither is exact or deterministic (due to sensor noise, etc). The real world is far too complicated to have a formal process for a "right outcome" (control-wise) for any given input, otherwise the C++ coding approach would have been sufficient. But it's true that for training just the occupancy network component (if the "E2E" v12 approach is still modular that way), LiDAR is useful to ensure accurate training data. Even so, we still have no guarantee or semantic understanding why the occupancy network works, or whether it will work in any specific not-yet-tested case. We can only say that it empirically gets the correct answer about X% of the time based on N trials.
 
Let’s consider a simple example where you have a dataset of fruits characterized by their weight and texture, and you want to classify new fruits as either 'Apple' or 'Banana'.

Training Data:​

Weight (grams) | Texture (0=Smooth, 1=Coarse) | Label
150 | 0 | Apple
170 | 0 | Apple
130 | 1 | Banana
180 | 1 | Banana

Test Data:​

  • New fruit: Weight = 160 grams, Texture = 0

k-NN Operation (k=3):​

  1. Calculate distance (Euclidean) between the new fruit and all training data points.
  2. Identify the closest three fruits (let's assume k=3 for this example).
  3. Classify based on majority vote among the k-nearest neighbors.
If two of the three closest fruits are labeled 'Apple', the new fruit will also be classified as 'Apple'.

Key Parameters:​

  • k (Number of Neighbors): Choosing the right value of k is crucial. A smaller k makes the model sensitive to noise, while a larger k makes it computationally expensive and possibly less precise.
  • Distance Metric: The most common metric is Euclidean distance, but others like Manhattan or Minkowski can also be used depending on the dataset.
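
For reference, here is a runnable sketch of the walkthrough above (assuming NumPy; the training fruits, test fruit, and k=3 are taken straight from the example):

import numpy as np

# k-NN with k = 3, Euclidean distance, majority vote.
train_x = np.array([[150, 0], [170, 0], [130, 1], [180, 1]], dtype=float)
train_y = np.array(["Apple", "Apple", "Banana", "Banana"])

def knn_classify(fruit, k=3):
    dists = np.linalg.norm(train_x - np.asarray(fruit, dtype=float), axis=1)
    nearest = train_y[np.argsort(dists)[:k]]             # labels of the k closest
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]                     # majority vote

print(knn_classify([160, 0]))   # "Apple": two of its three nearest neighbours are Apples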
 
I would tend to agree with you if Tesla had stereoscopic vision for each camera it has. But it does not. Technically, without using a complex sensor-pixel-based focusing system, I see no other way Tesla would be able to validate distances of objects when it deconstructs and reconstructs a pattern for the CNN.
The point of the pure E2E approach (photons in, control out) is that there's no need for the network to explicitly calculate distances at all. (Unless it's a two-stage modular network with a human-designed occupancy layer in the middle.) Elon's statement about "LiDAR no longer being necessary" for training suggests that this two-stage approach is no longer the case, perhaps from v12.4 or v12.5 forward.

For the monolithic E2E approach, there is no need to either calculate or validate distances, which completely removes the need for LiDAR (ground-truth depth) input for training purposes. The only need for the occupancy network would then be the on-screen FSD visualization, which could be done with a smaller parallel network, which in itself would still benefit from LiDAR for training.

It's also possible that the primary E2E network is using the secondary occupancy network's output as one of its inputs, thus indirectly benefiting from the LiDAR training without requiring the LiDAR data stream from each of its training clips per se.
 
The point of the pure E2E approach (photons in, control out) is that there's no need for the network to explicitly calculate distances at all. (Unless it's a two-stage modular network with a human-designed occupancy layer in the middle.) Elon's statement about "LiDAR no longer being necessary" for training suggests that this two-stage approach is no longer the case, perhaps from v12.4 or v12.5 forward.

For the monolithic E2E approach, there is no need to either calculate or validate distances, which completely removes the need for LiDAR (ground-truth depth) input for training purposes. The only need for the occupancy network would then be the on-screen FSD visualization, which could be done with a smaller parallel network, which in itself would still benefit from LiDAR for training.

It's also possible that the primary E2E network is using the secondary occupancy network's output as one of its inputs, thus indirectly benefiting from the LiDAR training without requiring the LiDAR data stream from each of its training clips per se.
Agreed, but I am coming from a legal perspective. The day one of the RT/AVs gets into a collision, how are they going to prove they are not at fault, and that the Tesla RT knew how far away the other vehicle/person etc. was? Does the NN provide that answer? Curious.

This is the same situation as when S. Ramanujan would dream of equations that were almost perfect, and Prof. G. H. Hardy had to push him to validate those equations and provide written evidence of their accuracy.
 
Let’s consider a simple example where you have a dataset of fruits characterized by their weight and texture, and you want to classify new fruits as either 'Apple' or 'Banana'.

Training Data:​

Weight (grams) | Texture (0=Smooth, 1=Coarse) | Label
150 | 0 | Apple
170 | 0 | Apple
130 | 1 | Banana
180 | 1 | Banana

Test Data:​

  • New fruit: Weight = 160 grams, Texture = 0

k-NN Operation (k=3):​

  1. Calculate distance (Euclidean) between the new fruit and all training data points.
  2. Identify the closest three fruits (let's assume k=3 for this example).
  3. Classify based on majority vote among the k-nearest neighbors.
If two of the three closest fruits are labeled 'Apple', the new fruit will also be classified as 'Apple'.

Key Parameters:​

  • k (Number of Neighbors): Choosing the right value of k is crucial. A smaller k makes the model sensitive to noise, while a larger k makes it computationally expensive and possibly less precise.
  • Distance Metric: The most common metric is Euclidean distance, but others like Manhattan or Minkowski can also be used depending on the dataset.
This describes a classical database lookup, not a neural network. A neural network does not maintain a database of its inputs; all the information in the input training set gets irreversibly (and lossily) condensed and mashed together into the neural network, with no way to retrieve or extract or compare against any specific example that was used for training. A neural network does not have any explicit "distance calculation", "identify closest inputs", or even "classification" (unless its final output is classification, but that's not the case for FSD E2E).
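
A tiny sketch of that distinction (made-up weights, purely illustrative): after training, only the parameters remain; there is no stored training set to search through or compare against at inference time.

import numpy as np

w = np.array([-0.01, 4.0])   # pretend these were learned from the fruit data,
b = -1.0                     # which has since been discarded

def classify(weight_grams, texture):
    # One fixed formula; no distance search over stored examples.
    score = w @ np.array([float(weight_grams), float(texture)]) + b
    return "Banana" if score > 0 else "Apple"

print(classify(160, 0))   # "Apple"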
 