Learning of meaningful visual representations for continuous lip-reading

  • Author: Adriana Fernández López
  • Thesis supervisor: Federico Mateo Sukno
  • Defended: at the Universitat Pompeu Fabra (Spain) in 2021
  • Language: Spanish
  • Thesis committee: Glòria Haro Ortega (chair), Adrià Ruiz Ovejero (secretary), Oriol Vinyals Deulofeu (member)
  • Doctoral programme: Doctoral Programme in Information and Communication Technologies at the Universitat Pompeu Fabra
  • Subjects:
  • Links
    • Open-access thesis at: TDX
  • Abstract
    • In recent decades, there has been increasing interest in decoding speech exclusively from visual cues, i.e. mimicking the human capability to perform lip-reading, which has led to Automatic Lip-Reading (ALR) systems.

      However, it is well known that access to speech through the visual channel is subject to many more limitations than access through the audio channel: it has been argued that humans can actually read only around 30% of the information from the lips, while the rest is filled in from context. Thus, one of the main challenges in ALR resides in the visual ambiguities that arise at the word level, which highlights that not all the sounds we hear can be easily distinguished by observing the lips.

      In the literature, early ALR systems addressed simple recognition tasks such as alphabet or digit recognition, but progressively shifted to more complex and realistic settings, leading to several recent systems that target continuous lip-reading. To a large extent, these advances have been possible thanks to powerful systems based on deep learning architectures, which have quickly started to replace traditional systems. Although the recognition rates for continuous lip-reading may appear modest compared to those achieved by audio-based systems, the field has undeniably taken a step forward. Interestingly, an analogous effect can be observed when humans try to decode speech: given sufficiently clean signals, most people can effortlessly decode the audio channel but would struggle to perform lip-reading, since the ambiguity of the visual cues makes further context necessary to decode the message.

      In this thesis, we explore the appropriate modeling of visual representations with the aim of improving continuous lip-reading. To this end, we present different data-driven mechanisms to handle the main challenges in lip-reading related to the ambiguity and speaker dependency of visual cues.

      Our results highlight the benefits of a proper encoding of the visual channel, for which the most useful features are those that encode corresponding lip positions in a similar way, independently of the speaker. This fact opens the door to i) lip-reading in many different languages without requiring large-scale datasets, and ii) increasing the contribution of the visual channel in audio-visual speech systems. On the other hand, our experiments identify the modeling of temporal context as the key to advancing the field, where there is a need for ALR models trained on datasets comprising large speech variability at several context levels. In this thesis, we show that both a proper modeling of visual representations and the ability to retain context at several levels are necessary conditions to build successful lip-reading systems.
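
      To make the two ingredients above concrete, the sketch below shows what a typical continuous lip-reading pipeline looks like in PyTorch: a spatiotemporal visual front-end that encodes mouth crops into per-frame features, followed by a recurrent temporal model that retains context over the sequence, producing per-frame logits that could be trained with a CTC-style objective. This is a minimal illustrative sketch, not the architecture proposed in the thesis; all layer sizes, the class inventory, and the input resolution are assumptions.

      import torch
      import torch.nn as nn

      class LipReader(nn.Module):
          def __init__(self, num_classes: int, feat_dim: int = 256):
              super().__init__()
              # Visual front-end: 3D convolution over (time, height, width), so that
              # corresponding lip positions map to similar features across speakers.
              self.frontend = nn.Sequential(
                  nn.Conv3d(1, 32, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
                  nn.BatchNorm3d(32),
                  nn.ReLU(),
                  nn.AdaptiveAvgPool3d((None, 4, 4)),  # keep the time axis, pool space
              )
              self.project = nn.Linear(32 * 4 * 4, feat_dim)
              # Temporal model: a bidirectional GRU retains context over the sequence.
              self.temporal = nn.GRU(feat_dim, feat_dim, num_layers=2,
                                     batch_first=True, bidirectional=True)
              self.classifier = nn.Linear(2 * feat_dim, num_classes)

          def forward(self, x: torch.Tensor) -> torch.Tensor:
              # x: (batch, 1, frames, height, width) grayscale mouth crops
              f = self.frontend(x)                     # (B, 32, T, 4, 4)
              f = f.permute(0, 2, 1, 3, 4).flatten(2)  # (B, T, 32*4*4)
              f = self.project(f)                      # (B, T, feat_dim)
              h, _ = self.temporal(f)                  # (B, T, 2*feat_dim)
              return self.classifier(h)                # per-frame class logits

      # Illustrative usage with hypothetical sizes: 2 clips of 75 frames at 88x88 pixels
      # and a 40-symbol output inventory (e.g. characters or phonemes).
      model = LipReader(num_classes=40)
      clips = torch.randn(2, 1, 75, 88, 88)
      logits = model(clips)  # shape: (2, 75, 40)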

