It's interesting to consider that diffusion models could be more compute-efficient than transformer models for certain discriminative problems. In the example you provided, the task is to determine a pedestrian's orientation and likely walking direction in a crosswalk. We can hypothesize a few reasons why diffusion models could be more efficient in such scenarios:
- Simpler feature extraction: Diffusion models trained for denoising or inpainting learn to recover structure from noisy or partially occluded images, so their internal features may already encode cues such as body shape and posture. For pedestrian orientation, the model might therefore reach the crucial features with less computation than a transformer that applies self-attention across the entire scene.
- Local context: In some cases, the information needed to solve the task lies in a local region, such as the pedestrian and their immediate surroundings. When the denoising network is largely convolutional, as is common for diffusion models, it captures this local context at a cost that grows with the number of pixels, which may require less computation than a transformer that also models global context and long-range dependencies in the scene.
- Task-specific optimization: The diffusion model in question may have been trained or fine-tuned specifically for pedestrian orientation and walking direction estimation. Training on a large dataset of pedestrian images covering many orientations and poses would let it learn the relevant features and patterns more efficiently than a general-purpose model.
- Lower complexity: In some cases, the complexity of transformer models could be a disadvantage. The cost of self-attention grows quadratically with the number of tokens, which becomes substantial for high-resolution images. Diffusion models, on the other hand, might have a lower per-pass complexity, depending on the denoiser architecture and on how many denoising steps the task actually needs, allowing them to process images more quickly and with less computation.
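To make the local-context and complexity points concrete, here is a rough back-of-the-envelope sketch in Python. It counts multiply-accumulate operations (MACs) for one self-attention layer and one convolutional layer as the image grows. The function names, the patch size of 16, the embedding width of 768, and the 3x3 convolution with 64 channels are all illustrative assumptions, not measurements of any particular model. The convolutional layer only stands in for a convolutional denoiser block; many real diffusion models also contain attention layers, and this count ignores depth, softmax, normalization, and the number of denoising steps.

```python
def attention_macs(num_tokens: int, dim: int) -> int:
    """Approximate MACs for one self-attention layer (hypothetical helper)."""
    # Q, K, V and output projections: four (num_tokens x dim) @ (dim x dim) products.
    projections = 4 * num_tokens * dim * dim
    # Attention scores (Q @ K^T) plus the weighted sum over V: both quadratic in tokens.
    pairwise = 2 * num_tokens * num_tokens * dim
    return projections + pairwise


def conv_macs(height: int, width: int, kernel: int, c_in: int, c_out: int) -> int:
    """Approximate MACs for one stride-1, same-padded convolution (hypothetical helper)."""
    return height * width * kernel * kernel * c_in * c_out


def num_patches(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch tokens for a ViT-style tokenizer."""
    return (height // patch) * (width // patch)


# Assumed, illustrative settings: 16x16 patches, 768-wide tokens, 3x3 conv with 64 channels.
for side in (224, 448, 896):
    tokens = num_patches(side, side, 16)
    attn = attention_macs(tokens, 768)
    conv = conv_macs(side, side, 3, 64, 64)
    print(f"{side}x{side}: tokens={tokens}, attention MACs={attn:,}, conv MACs={conv:,}")
```

Under these assumptions, doubling the image side quadruples the token count, so the pairwise attention term grows 16-fold while the convolution cost grows only 4-fold. Which model is cheaper in absolute terms still depends on depth, width, and how many passes each one needs.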
Keep in mind that the performance and efficiency of these models vary greatly with the specific problem, dataset, and model architecture. While diffusion models might have an advantage in the given example, transformers have shown great success in many discriminative tasks, such as object recognition and image segmentation. The choice between the two model types will depend on the specific problem and the available computational resources.