Is this a broadly agreed-upon community consensus?
I really thought they were trying to train one single end-to-end neural network that takes in video (and other sensor inputs) and spits out a path and velocity plan.
A previous post makes the point that if they're reusing the existing perception layer, they could be building V12 more incrementally.
Another downside I see is that separate layers for perception and planning imply a defined interface between them in the middle (think "public API" in software engineering terminology). I think people have used the term "Vector Space" for the output of the perception stack: locations of vehicles, locations of VRUs, lane lines, drivable space, signs, signals, metadata about these objects, etc.
The downside I see is that these are all human-picked and curated concepts. By building planning on top of them, the planner can only know about the concepts that humans decided to build into the perception layer. Combining all of it into one neural net, end to end, eliminates the need to even pick what is represented in the intermediary "Vector Space", or how it's represented.
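To make the "public API" analogy concrete, here's a toy sketch of what such an interface might look like. All names and fields here are hypothetical illustrations of the idea, not anything from Tesla's actual stack: the point is just that the planner can only reason about whatever fields humans chose to put in the schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical "Vector Space" interface between perception and planning.
# Every field is a human-chosen concept; the planner sees nothing else.
# (All names here are illustrative, not Tesla's.)

@dataclass
class TrackedObject:
    kind: str                      # e.g. "vehicle", "pedestrian" (a VRU)
    position_m: Tuple[float, float]   # (x, y) in the ego frame, meters
    velocity_mps: Tuple[float, float] # (vx, vy), meters per second

@dataclass
class VectorSpace:
    objects: List[TrackedObject] = field(default_factory=list)
    lane_lines: List[list] = field(default_factory=list)       # polylines
    drivable_space: List[tuple] = field(default_factory=list)  # polygon vertices
    traffic_signals: List[str] = field(default_factory=list)   # e.g. "red"

def plan(scene: VectorSpace) -> dict:
    """Toy planner: it can only consume what the schema above defines.
    Anything perception never encoded (say, a hand wave from a traffic
    cop, or eye contact from a pedestrian) is invisible at this layer."""
    stop = "red" in scene.traffic_signals or any(
        obj.kind == "pedestrian" for obj in scene.objects
    )
    return {"target_speed_mps": 0.0 if stop else 15.0}

scene = VectorSpace(
    objects=[TrackedObject("pedestrian", (4.0, 1.0), (0.0, 0.0))],
    traffic_signals=["green"],
)
print(plan(scene))  # {'target_speed_mps': 0.0}
```

An end-to-end network dissolves this boundary: nothing forces the intermediate activations to correspond to a human-readable schema like the one above, so the network is free to carry whatever cues help it plan.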