Lagrangian duality for efficient large-scale reinforcement learning

  • Author: Joan Bas Serrano
  • Thesis supervisor: Gergely Neu (supervisor)
  • Defence: at Universitat Pompeu Fabra (Spain) in 2022
  • Language: Spanish
  • Thesis committee: Anders Jonsson (chair), Volkan Cevher (secretary), Olivier Pietquin (member)
  • Doctoral programme: Doctoral Programme in Information and Communication Technologies, Universitat Pompeu Fabra
  • Subjects:
  • Links
    • Open-access thesis at: TDX
  • Abstract
    • Reinforcement learning is an expanding field in which the high performance of the algorithms is often at odds with their poor theoretical justification. For this reason, there is a need for algorithms that are well grounded in theory, come with strong mathematical guarantees, and are efficient at solving large-scale problems. In this work we explore the linear programming approach to optimal control in MDPs. To develop novel reinforcement learning algorithms, we apply tools from constrained optimization to this linear programming framework. We propose a variety of new algorithms using techniques such as constraint relaxation, regularization, and Lagrangian duality. We provide a formal performance analysis for all of these algorithms and evaluate them on a range of benchmark tasks.
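
      As a point of reference, the linear-programming approach mentioned above can be sketched as follows (a standard textbook formulation in assumed notation, not taken verbatim from the thesis): in a discounted MDP with transition kernel $P$, reward vector $r$, discount factor $\gamma$, and initial-state distribution $\nu_0$, optimal control can be written as a linear program over state-action occupancy measures $\mu$:

      \[
      \max_{\mu \ge 0} \; \langle \mu, r \rangle
      \quad \text{subject to} \quad
      E^{\mathsf{T}} \mu = (1-\gamma)\,\nu_0 + \gamma\, P^{\mathsf{T}} \mu,
      \]

      where $E$ is the matrix mapping state-action pairs to their states; an optimal policy can then be recovered by normalizing the optimal occupancy measure, $\pi(a\,|\,x) \propto \mu(x,a)$.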

      Concretely, the first set of results (Chapter 4) is based on a linearly relaxed version of a saddle-point problem that characterizes the optimal solution of MDPs. We first introduce the bilinear saddle-point formulation of the MDP optimization problem and present a linearly parameterized version of it that reduces the dimensionality of the problem. We characterize a set of assumptions that allow a reduced-order saddle-point representation of the optimal policy and propose an algorithm whose convergence guarantees show the sufficiency of these assumptions.
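
      A schematic form of the bilinear saddle point referred to above, in the same assumed notation (the exact relaxation and parameterization used in Chapter 4 may differ): dualizing the flow constraints of the occupancy-measure LP with a multiplier $V$ gives

      \[
      \max_{\mu \ge 0} \; \min_{V} \;
      \langle \mu, r \rangle
      + \big\langle V,\; (1-\gamma)\,\nu_0 + \gamma\, P^{\mathsf{T}} \mu - E^{\mathsf{T}} \mu \big\rangle,
      \]

      and a linear parameterization such as $\mu = \Phi \lambda$ and $V = \Psi \theta$ (with hypothetical feature matrices $\Phi$ and $\Psi$) restricts both players to low-dimensional subspaces, which is what makes a reduced-order representation possible.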

      The second set of results (Chapter 5) is based on a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs. We first present the constrained optimization problem we aim to solve, from which we derive a new loss function for policy evaluation that serves as an alternative to the widely used squared Bellman error. We then use this new loss function, which we call the logistic Bellman error, to build a new algorithmic scheme called Q-REPS. We also analyze the error propagation of Q-REPS. After that, we provide a practical saddle-point algorithm (with two variants) and derive bounds on their performance. Finally, we demonstrate the effectiveness of our method on a range of benchmark problems.
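
      To convey the flavour of the loss, here is a schematic contrast between the squared Bellman error and a log-sum-exp relaxation of it (an illustrative sketch in assumed notation; the precise logistic Bellman error and its regularization parameters are defined in Chapter 5). Writing the Bellman error of an action-value function $Q$ as $\Delta_Q(x,a) = r(x,a) + \gamma \langle P(\cdot\,|\,x,a), V_Q \rangle - Q(x,a)$, the two losses take the form

      \[
      \mathcal{L}_{\mathrm{sq}}(Q) = \sum_{x,a} \mu_0(x,a)\, \Delta_Q(x,a)^2,
      \qquad
      \mathcal{L}_{\mathrm{log}}(Q) = \frac{1}{\eta} \log \sum_{x,a} \mu_0(x,a)\, e^{\eta\, \Delta_Q(x,a)},
      \]

      where $\mu_0$ is a reference distribution over state-action pairs and $\eta > 0$ is a temperature: for small $\eta$ the log-sum-exp term approaches the average Bellman error under $\mu_0$, while for large $\eta$ it approaches the largest Bellman error.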

