Say your inference computer can process N weights per time slice
You divide the problem into 4 steps of size A, B, C, and D
You have M pieces of training data
They are subdivided into sets for each problem step: a, b, c, d
For ease of discussion, assume the sub-nets and data subsets are all the same size: A = B = C = D = N/4 and a = b = c = d = M/4
When you train, you run training 4 times, once for each sub-step. Say it takes 100 million rounds for a good output.
Total training: 4 steps * 100 million rounds * M/4 cases * N/4 weights = 1/4 * 100 million * M * N, or
25 million * N * M
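If it helps to see the bookkeeping spelled out, here's a minimal Python sketch; the function name and the idea of counting cost in weight-updates are just conveniences for illustration, and N and M are left as inputs:

    # Cost of the split approach, counted in weight-updates (a rough proxy for compute).
    def split_compute(N, M, rounds=100_000_000, steps=4):
        # 4 sub-nets of N/4 weights, each trained on M/4 cases for 100 million rounds
        return steps * rounds * (M / steps) * (N / steps)
    # algebraically: 4 * 100 million * (M/4) * (N/4) = 25 million * N * M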
Then you realize that breaking the problem up into discrete steps loses a lot of context, and you would be better off with one full-sized net training on all the data.
1 step * 100 million * M cases * N weights = 100 million * N * M
Presto, you just quadrupled the amount of training compute needed.
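Same sketch for the unified net, to show where the quadrupling comes from (again, the name and the weight-update metric are just for illustration):

    # Cost of one full-sized net trained on all the data, same 100 million rounds.
    def unified_compute(N, M, rounds=100_000_000):
        return rounds * M * N
    # ratio vs the split approach: 100 million / 25 million = 4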
HOWEVER!
Remember how your training data was case-specific and you needed 100 million rounds of that subset of cases on a subset of the full NN to get good results? Yeah, that didn't go away. You are now tweaking 4x the weights each run and may need to run each test case 4x (or 16x) more times to move the parameters sufficiently (due to the adjustment coefficient changing and the number of parameters impacted per run).
Say it's only 4x more; now your compute requirements are 16 times greater than before. If it's 16x, that's a 64x total increase.
Even at the plain 4x from unification, that's a lot more compute for the same training speed, and each additional training case now needs 4x the compute it did before. Plus, you can no longer just split the task across 4 independent training clusters.
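Folding the extra rounds into the same sketch (the 4x or 16x round multiplier is the assumption from the paragraph above, not a measured number):

    # Unified net, but each case needs extra_rounds times more rounds to move all N weights.
    def unified_compute_scaled(N, M, rounds=100_000_000, extra_rounds=4):
        return extra_rounds * rounds * M * N
    # extra_rounds = 4  -> 4 * 100 million / 25 million = 16x the split cost
    # extra_rounds = 16 -> 64x the split cost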
Time to enter the Dojo.