Quick newbie question, I don't understand the Tesla graph below. From Wikipedia, the top computer in the world is 1.6 ExaFLOPS. In 2024, is Tesla aiming to be ~100x more powerful, at 100 ExaFLOPS? And if you have 1/3 of the GPUs in early 2024, are you at ~33 ExaFLOPS? That doesn't match the top 5 in the world (approx. 1 ExaFLOPS).
That list is based on the LINPACK (HPL) benchmark, which measures sustained double-precision (FP64) performance on traditional HPC workloads. Machine learning FLOPS figures are typically quoted at much lower precision (FP16/BF16), so it's not a useful comparison for machine learning systems.
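To make the mismatch concrete, here's a back-of-the-envelope sketch. The precision ratio below is an assumption for illustration (modern accelerators often run low-precision math roughly an order of magnitude faster than FP64), not a published Tesla or TOP500 figure:

```python
# Illustrative arithmetic only: the 16x ratio is an assumption, not a
# published spec. It shows why FP64 LINPACK numbers and low-precision
# ML FLOPS targets measure different things.
hpl_fp64_exaflops = 1.6      # top TOP500 system, FP64 LINPACK (Rmax)
ml_target_exaflops = 100.0   # Tesla chart target, low-precision ML FLOPS

assumed_fp64_to_lowprec_ratio = 16  # hypothetical per-chip speedup at BF16
fp64_equivalent = ml_target_exaflops / assumed_fp64_to_lowprec_ratio

print(f"Rough FP64-equivalent of a 100 EFLOPS ML target: ~{fp64_equivalent} EFLOPS")
```

Under that (assumed) ratio, the two headline numbers end up in the same ballpark, which is why comparing them directly is misleading.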
CPUs are built for general-purpose workloads, handling diverse instruction types: arithmetic, logical, control flow, and I/O operations. They are highly optimized for sequential execution and low-latency memory access.
Traditional compute work is limited by algorithmic complexity: much of it is inherently serial, so scaling the system often doesn't yield proportional improvements in performance (Amdahl's law).
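The diminishing returns from scaling serial-heavy workloads can be sketched with Amdahl's law (the numbers below are just an example, not tied to any specific system):

```python
def amdahl_speedup(parallel_fraction: float, n_workers: int) -> float:
    """Ideal speedup when only part of a workload parallelizes (Amdahl's law)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_workers)

# Even with 95% of the work parallelizable, 1000 workers give under 20x:
print(amdahl_speedup(0.95, 1000))  # ~19.6
```

This is the core reason throwing 1000x the hardware at a traditional workload rarely gives anywhere near 1000x the results.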
On the other hand, in machine learning, better results are often achieved with more data and larger models (there are other factors involved, but this generally holds for very complex problems). Scaling supercomputers for machine learning tasks can therefore provide significant improvements: the parallel and distributed nature of the workload allows near-unlimited scaling (in theory).
Machine learning computers use GPUs or accelerators, which are excellent at the massive-scale matrix multiplication operations that dominate machine learning workloads. Accelerators and GPUs have thousands of cores that can process these specific arithmetic operations in parallel.
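To see why matrix multiplication parallelizes so well, count the work: every output element is an independent dot product, so all of them can be computed at once. A quick sketch with hypothetical layer sizes:

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    """FLOPs for an (m x k) @ (k x n) multiply: one multiply + one add per term."""
    return 2 * m * k * n

# Hypothetical transformer-style layer: a single 4096x4096 @ 4096x4096 multiply
# is ~137 billion FLOPs, spread over 16.7 million independent output elements.
print(matmul_flops(4096, 4096, 4096))
```

That independence is exactly what thousands of accelerator cores exploit, and why FLOPS scale with core count for ML in a way they don't for serial workloads.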
Now, going on a bit of a tangent...
Not all theoretical FLOPS on accelerators are created equal. The systems surrounding the accelerators (memory, interconnect, host I/O) play a critical role in the overall system's utilization, occupancy, and energy consumption. Dojo was designed specifically for massive video training, so its performance should be evaluated in the context of that intended purpose; it may not do well on some arbitrary training benchmarks the industry uses to compare training hardware.
On a small benchmark model, Dojo's performance is on par with the A100.
However, Dojo excels on large, complex models with high-intensity arithmetic workloads. These models hit critical data-transfer bottlenecks and see diminishing returns when training is scaled out on the Nvidia stack; on them, Dojo is 3.2x and 4.4x faster than the A100.
Dojo V2 will be more general-purpose (Autopilot, Bots, AGI, and potentially opening it to everyone as a pay-as-you-go service).