Understanding human actions in video is essential for developing intelligent systems capable of interpreting, reacting to, and anticipating human behavior. However, relying exclusively on visual information can be limiting, particularly in complex, ambiguous, or noisy environments. Different modalities capture different aspects of human actions: language provides semantic context, depth conveys spatial structure, and temporal cues help model progression over time. In this thesis, we explore multiple levels of multimodal action understanding. By integrating complementary sources of information, we contribute to three key tasks: online action detection, hierarchical action recognition, and action anticipation.

First, we introduce a method for online action detection guided by textual information. By integrating vision-language models, our approach not only improves detection accuracy but also enables zero-shot and few-shot learning. Moreover, text-guided architectures support open-world detection, as they are not limited to a fixed set of action classes.

Next, we explore hierarchical action recognition, highlighting the importance of modeling actions at different levels of granularity. We propose a vision-language transformer architecture trained to jointly classify coarse- and fine-grained actions. To support this, we introduce the Hierarchical TSU dataset, the first activities-of-daily-living (ADL) dataset with multi-level annotations and contextual information, and show that hierarchical and contextual cues significantly improve recognition performance. To scale hierarchical modeling further, we introduce HierADL, a framework that automatically generates action hierarchies from semantic and visual features. Combined with a dual-level classifier, HierADL outperforms flat classification methods across multiple ADL benchmarks.

Finally, we address action anticipation with AAG, a method that relies on a single frame, replacing video aggregation with depth information and summaries of past actions. This approach significantly reduces the computational overhead of video-based methods while still achieving competitive performance.
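To make the single-frame anticipation idea concrete, the sketch below shows one way a single frame's visual features, a depth feature, and an embedding of a past-action summary might be fused to predict the next action. This is a minimal illustration under stated assumptions: the module name `SingleFrameAnticipator`, the feature dimensions, and the late-fusion design are hypothetical, not the architecture proposed in the thesis.

```python
# Illustrative sketch only: a minimal single-frame anticipation model in the
# spirit of AAG. The fusion strategy and all dimensions here are assumptions.
import torch
import torch.nn as nn

class SingleFrameAnticipator(nn.Module):
    def __init__(self, rgb_dim=768, depth_dim=256, text_dim=512,
                 hidden_dim=512, num_actions=48):
        super().__init__()
        # Project each modality into a shared hidden space.
        self.rgb_proj = nn.Linear(rgb_dim, hidden_dim)
        self.depth_proj = nn.Linear(depth_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Simple late fusion of the three modality features.
        self.fuse = nn.Sequential(
            nn.Linear(3 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),  # next-action logits
        )

    def forward(self, rgb_feat, depth_feat, summary_feat):
        # rgb_feat:     features of the single current frame    (B, rgb_dim)
        # depth_feat:   features of the corresponding depth map (B, depth_dim)
        # summary_feat: embedding of a past-action summary      (B, text_dim)
        z = torch.cat([self.rgb_proj(rgb_feat),
                       self.depth_proj(depth_feat),
                       self.text_proj(summary_feat)], dim=-1)
        return self.fuse(z)

# One frame per prediction, so no temporal aggregation over a clip.
model = SingleFrameAnticipator()
logits = model(torch.randn(2, 768), torch.randn(2, 256), torch.randn(2, 512))
print(logits.shape)  # torch.Size([2, 48])
```

Because such a model consumes a single frame per prediction rather than an aggregated clip, its per-prediction cost is independent of video length, which illustrates where the computational savings described above would come from.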