
FSD tweets

Another crazy, incorrect Elon prediction, perhaps designed to pump up the stock price: Elon says AGI next year. This prediction will pan out about as well as his predictions, starting in 2016 and repeated nearly every year since, that FSD was two years away.
In 2017, Elon said SpaceX would fly private citizens around the Moon in 2018. He is consistently about 10x too optimistic on short-to-medium-term timeframes.
 
A short post on the best architectures for real-time image and video processing. TL;DR: use convolutions with stride or pooling at the low levels, and stick self-attention circuits at higher levels, where feature vectors represent objects.
PS: ready to bet that Tesla FSD uses convolutions (or perhaps more complex *local* operators) at the low levels, combined with more global circuits at higher levels (perhaps using self-attention). Transformers on low-level patch embeddings are a complete waste of electrons.

I'm not saying ViTs are not practical (we use them).
I'm saying they are way too slow and inefficient to be practical for real-time processing of high-resolution images and video.
[Also, @sainingxie's work on ConvNeXt has shown that ConvNets are just as good as ViTs if you do it right. But whatever].
You need at least a few Conv layers with pooling and stride before you stick self-attention circuits.
Self-attention is equivariant to permutations, which is completely nonsensical for low-level image/video processing (having a single strided conv at the front-end to 'patchify' also doesn't make sense).
Global attention is also nonsensical (and not scalable), since correlations are highly local in images and video.
At high level, once features represent objects, then it makes sense to use self-attention circuits: what matters is the relationships and interactions between objects, not their positions.
This type of hybrid architecture was inaugurated by the DETR system by @alcinos26 and collaborators.
As I've said since the DETR work, my favorite family of architectures is conv/stride/pooling at the lower levels, and self-attention circuits at the higher levels.
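A quick way to see the permutation-equivariance point above: a self-attention layer with no positional information treats its inputs as an unordered set, so shuffling the tokens just shuffles the outputs the same way. A minimal PyTorch check (my own illustration, not from the thread):

```python
import torch
import torch.nn as nn

# Self-attention without positional encodings is permutation-equivariant:
# permuting the input tokens permutes the output identically, i.e. the
# layer by itself has no notion of spatial order.
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
attn.eval()

tokens = torch.randn(1, 16, 64)   # 16 "patch" tokens of dimension 64
perm = torch.randperm(16)

with torch.no_grad():
    out, _ = attn(tokens, tokens, tokens)
    shuffled = tokens[:, perm]
    out_perm, _ = attn(shuffled, shuffled, shuffled)

# Output of the permuted input equals the permuted output of the original.
print(torch.allclose(out[:, perm], out_perm, atol=1e-5))  # True
```

That is fine once tokens represent objects (an object list has no inherent order), but it throws away exactly the local spatial structure that matters at the pixel level.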
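And a rough sketch of the hybrid family LeCun is advocating: a convolutional stem with stride and pooling shrinks the image into a small grid of object-level feature vectors before any self-attention runs. This is my own toy illustration of the idea, not Tesla's or DETR's actual architecture, and all layer sizes are arbitrary:

```python
import torch
import torch.nn as nn

class ConvThenAttention(nn.Module):
    """Hypothetical conv-low / attention-high hybrid (illustrative only)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # Low levels: strided convolutions + pooling, cheap and local.
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # High levels: self-attention over the (much smaller) feature grid.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):                      # x: (B, 3, H, W)
        f = self.stem(x)                       # (B, dim, H/16, W/16)
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)  # (B, h*w, dim), one token/cell
        return self.encoder(tokens)

model = ConvThenAttention()
out = model(torch.randn(1, 3, 256, 256))
print(out.shape)  # torch.Size([1, 256, 256]) -> 16x16 = 256 tokens
```

The point is cost: self-attention here runs over 256 tokens instead of 65,536 pixels, so the quadratic attention budget is spent where features already describe objects rather than raw patches.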
 

It seems that E2E is still prone to the same pattern of improvements, regressions, improvements, regressions, just like the old stack. In my own experience, V12.3.6 has had regressions from V12.2, like the new behavior of wobbling in the lane before making a turn. I guess we will see if the overall trend of V12 is still faster improvement (i.e., fewer regressions, or regressions that get fixed faster).
 
I'm with Amnon Shashua on this one. He says e2e is doomed to failure.
Mobileye says FSD E2E approach is doomed to failure.

I don't think it will be too hard for Tesla to switch to an ensemble of neural networks once they realize their predicament. It is also possible that Tesla already runs an ensemble rather than a single e2e neural network, since Elon is loose with his terminology.
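For what it's worth, here is how I'd picture the distinction, as a toy sketch with invented module names (not Tesla's or Mobileye's actual stack). An "ensemble"/modular design wires task-specific networks together with explicit code, while e2e fuses the whole mapping into one trained module:

```python
import torch
import torch.nn as nn

# Modular design: separate networks per sub-task, joined by ordinary code,
# so each piece can be tested, debugged, and swapped independently.
perception = nn.Sequential(          # pixels -> scene features
    nn.Conv2d(3, 16, 3, stride=4), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
planner = nn.Linear(16, 2)           # scene features -> [steer, throttle]

def modular_drive(frames):
    scene = perception(frames)       # intermediate output is inspectable
    return planner(scene)            # failures can be localized per module

# E2E design: the same mapping fused into one opaque trained module.
end_to_end = nn.Sequential(perception, planner)

frames = torch.randn(1, 3, 128, 128)
print(modular_drive(frames).shape)   # torch.Size([1, 2])
print(end_to_end(frames).shape)      # same output, different engineering
```

The functional output is identical here; the argument is about engineering properties, i.e. whether you can localize a regression to one module or have to retrain the whole thing.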
 

And some more details from the same user:

[attached screenshot]
 