Lagrangian duality for efficient large-scale reinforcement learning

  • Author: Joan Bas Serrano
  • Thesis supervisor: Gergely Neu (supervisor)
  • Defence: at Universitat Pompeu Fabra (Spain) in 2022
  • Language: Spanish
  • Thesis committee: Anders Jonsson (chair), Volkan Cevher (secretary), Olivier Pietquin (member)
  • Doctoral programme: Doctoral Programme in Information and Communication Technologies, Universitat Pompeu Fabra
  • Subjects:
  • Links
    • Open-access thesis at: TDX
  • Abstract
    • Reinforcement learning is an expanding field in which the high performance of the algorithms is often at odds with their poor theoretical justification. For this reason, there is a need for algorithms that are well grounded in theory, come with strong mathematical guarantees, and are efficient at solving large-scale problems. In this work we explore the linear programming approach to optimal control in MDPs. To develop novel reinforcement learning algorithms, we apply tools from constrained optimization to this linear programming framework. We propose a variety of new algorithms using techniques such as constraint relaxation, regularization, and Lagrangian duality. We provide a formal performance analysis for all of these algorithms and evaluate them on a range of benchmark tasks.
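
      As a point of reference, the linear-programming approach mentioned above can be sketched as follows (a standard textbook formulation in assumed notation, not taken verbatim from the thesis): in a discounted MDP with transition kernel $P$, reward vector $r$, discount factor $\gamma$, and initial-state distribution $\nu_0$, optimal control can be written as a linear program over state-action occupancy measures $\mu$:

      \[
      \max_{\mu \ge 0} \; \langle \mu, r \rangle
      \quad \text{subject to} \quad
      E^{\mathsf{T}} \mu = (1-\gamma)\,\nu_0 + \gamma\, P^{\mathsf{T}} \mu,
      \]

      where $E$ is the matrix mapping state-action pairs to their states; an optimal policy can then be recovered by normalizing the optimal occupancy measure, $\pi(a\,|\,x) \propto \mu(x,a)$.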

      Concretely, the first set of results (Chapter 4) is based on a linearly relaxed version of a saddle-point problem that characterizes the optimal solution of MDPs. We first introduce the bilinear saddle-point formulation of the MDP optimization problem and present a linearly parameterized version of it that reduces the dimensionality of the problem. We characterize a set of assumptions that allow a reduced-order saddle-point representation of the optimal policy and propose an algorithm whose convergence guarantees show the sufficiency of these assumptions.
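
      A schematic form of the bilinear saddle point referred to above, in the same assumed notation (the exact relaxation and parameterization used in Chapter 4 may differ): dualizing the flow constraints of the occupancy-measure LP with a multiplier $V$ gives

      \[
      \max_{\mu \ge 0} \; \min_{V} \;
      \langle \mu, r \rangle
      + \big\langle V,\; (1-\gamma)\,\nu_0 + \gamma\, P^{\mathsf{T}} \mu - E^{\mathsf{T}} \mu \big\rangle,
      \]

      and a linear parameterization such as $\mu = \Phi \lambda$ and $V = \Psi \theta$ (with hypothetical feature matrices $\Phi$ and $\Psi$) restricts both players to low-dimensional subspaces, which is what makes a reduced-order representation possible.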

      The second set of results (Chapter 5) is based on a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs. We first present the constrained optimization problem we aim to solve, from which we derive a new loss function for policy evaluation that serves as an alternative to the widely used squared Bellman error. We then use this new loss function, which we call the logistic Bellman error, to build a new algorithmic scheme called Q-REPS. We also analyze the error propagation of Q-REPS. After that, we provide a practical saddle-point algorithm (with two variants) and derive bounds on their performance. Finally, we demonstrate the effectiveness of our method on a range of benchmark problems.
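
      To convey the flavour of the loss, here is a schematic contrast between the squared Bellman error and a log-sum-exp relaxation of it (an illustrative sketch in assumed notation; the precise logistic Bellman error and its regularization parameters are defined in Chapter 5). Writing the Bellman error of an action-value function $Q$ as $\Delta_Q(x,a) = r(x,a) + \gamma \langle P(\cdot\,|\,x,a), V_Q \rangle - Q(x,a)$, the two losses take the form

      \[
      \mathcal{L}_{\mathrm{sq}}(Q) = \sum_{x,a} \mu_0(x,a)\, \Delta_Q(x,a)^2,
      \qquad
      \mathcal{L}_{\mathrm{log}}(Q) = \frac{1}{\eta} \log \sum_{x,a} \mu_0(x,a)\, e^{\eta\, \Delta_Q(x,a)},
      \]

      where $\mu_0$ is a reference distribution over state-action pairs and $\eta > 0$ is a temperature: for small $\eta$ the log-sum-exp term approaches the average Bellman error under $\mu_0$, while for large $\eta$ it approaches the largest Bellman error.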

