Pedantically, FSD did see it, but it didn't have enough relevant examples in its training set to know what, if anything, to do with that information. The whole point (and beauty) of v12-style end-to-end neural networks is that there is no intermediate hand-engineered "programming" layer in which a Tesla engineer would have made an explicit decision about how to respond to hand-waving signals from other drivers. If the training set contains enough examples of other drivers waving (or not) and the car responding appropriately, the network will learn the relevance and the correct behavior implicitly.
I agree with you that v12 doesn't yet seem to be well enough trained on hand signals to know what to do with them. The reason Tesla wants to gather billions of training examples from the fleet is that as the training set grows, it will inevitably include such examples, without anyone at Tesla having to explicitly select or look for them. This is particularly critical for the "long tail" of edge cases, which is far too long and varied to curate manually. A large enough training set should eventually allow FSD to respond appropriately to truly oddball situations that only come up once in a million miles.
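To put rough numbers on the long-tail argument, here is a back-of-the-envelope sketch in Python. It assumes (my simplification, not anything Tesla has published) that a rare situation occurs independently at a fixed rate per mile, and the function names are made up for illustration.

```python
import math

def expected_examples(fleet_miles, rate_per_mile):
    """Expected number of times a rare situation shows up in the collected data.

    Assumes a constant, independent per-mile rate (a simplification).
    """
    return fleet_miles * rate_per_mile

def prob_at_least_one(fleet_miles, rate_per_mile):
    """Probability the data contains at least one example: 1 - (1 - p)^N.

    Uses log1p/expm1 so the result stays accurate when p is tiny.
    """
    return -math.expm1(fleet_miles * math.log1p(-rate_per_mile))

if __name__ == "__main__":
    once_in_a_million = 1e-6  # "once in a million miles"
    for miles in (1e6, 1e8, 1e9):
        print(miles,
              expected_examples(miles, once_in_a_million),
              prob_at_least_one(miles, once_in_a_million))
```

Under those assumptions, a once-in-a-million-miles situation is expected about a thousand times in a billion fleet miles, which is why sheer volume can stand in for manual curation. Whether a thousand examples is enough for the network to learn the right response is a separate question that this sketch doesn't address.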
One major limitation of Tesla's current approach, as I understand it, arises when the appropriate action depends on context over a broader time horizon. For instance, a few weeks ago I was in stop-and-go traffic on an interstate because of road construction ahead. Two lanes were going my direction; periodically one lane would move a bit, then the other. FSD (with its short time horizon) kept interpreting the stopped car in front of me as an "obstacle" and kept trying to dart into the other lane. To realize that the car in front isn't an "obstacle" and that staying in its own lane is the right thing to do, the car would need to "understand" the broader context. Perhaps Tesla can overcome this by finding a way to include much longer clips (10 minutes, say) in its training data. Understanding such broader context, over a wide range of timescales (from seconds to years or even decades), is an essential part of general intelligence, I think.
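Here is a toy Python sketch of why the length of the context window matters in that stop-and-go scenario. To be clear, this is not how FSD works: the hand-written rule, the function names, and the traffic pattern are all invented for illustration. It only shows that the same stopped car looks like an obstacle or like ordinary traffic depending on how far back the decision can look.

```python
def lead_car_speeds(n_steps, period=20):
    """Toy stop-and-go traffic: the lead car moves for the first half of
    each period and sits still for the second half."""
    return [1.0 if (t % period) < period // 2 else 0.0 for t in range(n_steps)]

def decide(speeds, t, window):
    """Hypothetical rule: if the lead car has not moved at all within the
    last `window` steps, treat it as an obstacle and try to change lanes."""
    recent = speeds[max(0, t - window + 1): t + 1]
    return "change_lane" if max(recent) == 0.0 else "stay"

def count_lane_change_urges(speeds, window):
    """Number of time steps at which the rule wants to leave the lane."""
    return sum(decide(speeds, t, window) == "change_lane"
               for t in range(len(speeds)))

if __name__ == "__main__":
    speeds = lead_car_speeds(100)
    print("short window:", count_lane_change_urges(speeds, window=5))
    print("long window: ", count_lane_change_urges(speeds, window=30))
```

With a 5-step window, the rule wants to change lanes at 30 of the 100 steps, every time the lead car has been stopped for a few steps. With a 30-step window it never does, because the window always contains a stretch where the lead car was moving. A learned policy is far more complicated than this, but it faces the same constraint: it can't use context that falls outside what it is shown.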