Understanding human actions in video is essential for developing intelligent systems capable of interpreting, reacting to, and anticipating human behavior. However, relying exclusively on visual information can be limiting, particularly in complex, ambiguous, or noisy environments. Different modalities capture different aspects of human actions: language provides semantic context, depth conveys spatial structure, and temporal cues help model progression over time. In this thesis, we explore multiple levels of multimodal action understanding. By integrating complementary sources of information, we contribute to three key tasks: online action detection, hierarchical action recognition, and action anticipation.

First, we introduce a method for online action detection guided by textual information. By integrating vision-language models, our approach not only improves detection accuracy but also enables zero-shot and few-shot learning. Moreover, text-guided architectures support open-world detection, as they are not limited to a fixed set of action classes.

Next, we explore hierarchical action recognition, highlighting the importance of modeling actions at different levels of granularity. We propose a vision-language transformer architecture trained to jointly classify coarse- and fine-grained actions. To support this, we introduce the Hierarchical TSU dataset, the first activities-of-daily-living (ADL) dataset with multi-level annotations and contextual information, and show that hierarchical and contextual cues significantly improve recognition performance. To scale hierarchical modeling further, we introduce HierADL, a framework that automatically generates action hierarchies from semantic and visual features. Combined with a dual-level classifier, HierADL outperforms flat classification methods across multiple ADL benchmarks.

Finally, we address action anticipation with AAG, a method that relies on a single frame, replacing video aggregation with depth information and summaries of past actions. This approach significantly reduces the computational overhead of video-based methods while still achieving competitive performance.
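To make the single-frame anticipation idea concrete, the sketch below shows one way a single frame's visual features, a depth feature, and an embedding of a past-action summary might be fused to predict the next action. This is a minimal illustration under stated assumptions: the module name `SingleFrameAnticipator`, the feature dimensions, and the late-fusion design are hypothetical, not the architecture proposed in the thesis.

```python
# Illustrative sketch only: a minimal single-frame anticipation model in the
# spirit of AAG. The fusion strategy and all dimensions here are assumptions.
import torch
import torch.nn as nn

class SingleFrameAnticipator(nn.Module):
    def __init__(self, rgb_dim=768, depth_dim=256, text_dim=512,
                 hidden_dim=512, num_actions=48):
        super().__init__()
        # Project each modality into a shared hidden space.
        self.rgb_proj = nn.Linear(rgb_dim, hidden_dim)
        self.depth_proj = nn.Linear(depth_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Simple late fusion of the three modality features.
        self.fuse = nn.Sequential(
            nn.Linear(3 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),  # next-action logits
        )

    def forward(self, rgb_feat, depth_feat, summary_feat):
        # rgb_feat:     features of the single current frame    (B, rgb_dim)
        # depth_feat:   features of the corresponding depth map (B, depth_dim)
        # summary_feat: embedding of a past-action summary      (B, text_dim)
        z = torch.cat([self.rgb_proj(rgb_feat),
                       self.depth_proj(depth_feat),
                       self.text_proj(summary_feat)], dim=-1)
        return self.fuse(z)

# One frame per prediction, so no temporal aggregation over a clip.
model = SingleFrameAnticipator()
logits = model(torch.randn(2, 768), torch.randn(2, 256), torch.randn(2, 512))
print(logits.shape)  # torch.Size([2, 48])
```

Because such a model consumes a single frame per prediction rather than an aggregated clip, its per-prediction cost is independent of video length, which illustrates where the computational savings described above would come from.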