Dialnet


Data preprocessing and quality diagnosis in deep learning-based in silico bioactivity prediction

  • Author: Ángela López del Río
  • Thesis supervisor: Alexandre Perera Lluna
  • Defended: at the Universitat Politècnica de Catalunya (UPC), Spain, in 2021
  • Language: Spanish
  • Subjects:
  • Full text not available
  • Abstract
    • Drug discovery is a time- and resource-consuming process that involves identifying a target and exploring suitable drug candidates for it. To streamline drug discovery, computational techniques help identify molecular candidates with desirable properties by modeling their interactions with the target. These techniques are constantly improving thanks to the development of algorithms, increasing computational power and the growth of public molecular databases. Specifically, machine learning approaches provide predictive models of biochemical properties and target-ligand binding activity.

      Deep learning is a machine learning approach that automatically extracts multiple levels of representation from the data. Within the last ten years, deep learning has outperformed classical prediction models in most domains, including drug discovery. Common use cases include molecular property prediction, de novo compound generation, protein secondary structure prediction and target-compound binding prediction.

      However, studies point out that the reported performance of deep learning bioactivity prediction models could be a consequence of data bias rather than generalization capability. Efforts are being made to address this problem, but it persists in a state of the art that rewards novelty over critical assessment. Moreover, the flexibility of deep learning results in a lack of consensus on how to represent the input spaces, making it difficult to compare models on a common benchmark. Bioactivity data has limited availability because of its associated costs and is often imbalanced, which hampers the model learning process. Diagnosing these problems is not straightforward, since deep learning models are considered black boxes, and this hinders their adoption as the de facto solution in computer-aided drug discovery.

      This thesis aims to improve deep learning models for computational drug discovery, focusing on input representation, data bias control, data imbalance correction and model diagnosis.

      First, this thesis assesses the effect of different validation strategies on binding classification models, aiming to find the most realistic performance estimates. The strategy based on clustering molecules, which avoids having similar compounds in both the training and test sets, proved to be the closest to a prospective validation, and thus more consistent than random cross-validation (over-optimistic) or an external test set from another database (over-pessimistic).
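
      As an illustration only, a cluster-based split can be sketched as follows. The sketch assumes each compound already carries a cluster label (how the thesis clusters molecules is not stated here); the function name `cluster_split` and the example labels are hypothetical.

```python
import random
from collections import defaultdict


def cluster_split(cluster_ids, test_fraction=0.2, seed=0):
    """Split sample indices so that no cluster appears in both sets."""
    # Group sample indices by their cluster label.
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)

    # Shuffle whole clusters (not individual samples) reproducibly.
    clusters = sorted(members)
    random.Random(seed).shuffle(clusters)

    target = test_fraction * len(cluster_ids)
    train, test = [], []
    for cid in clusters:
        # Fill the test set cluster by cluster until it reaches the target size.
        if len(test) < target:
            test.extend(members[cid])
        else:
            train.extend(members[cid])
    return sorted(train), sorted(test)


# Hypothetical cluster labels for ten compounds.
cluster_ids = ["a", "a", "b", "b", "b", "c", "d", "d", "e", "e"]
train_idx, test_idx = cluster_split(cluster_ids, test_fraction=0.3)
```

      Because whole clusters are assigned to one side, the test set may end up somewhat larger than the requested fraction; that is the price of keeping similar compounds together.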

      Second, this thesis focuses on the padding of sequential inputs. Padding establishes a common sequence length by adding zeros to each sequence. These zeros are usually added at the end of the sequence, without formal justification. Here, classical and novel padding strategies were compared in an enzyme classification task. Results showed that the padding position affects the performance of deep learning models, so it should be tuned as an additional hyperparameter.
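
      To make the idea of padding position concrete, here is a minimal sketch with three placements. The abstract does not list which strategies were compared, so the `"pre"`, `"post"` and `"both"` options, the function name `pad_sequence` and the integer-encoded example are assumptions made for illustration.

```python
def pad_sequence(seq, length, position="post", pad_value=0):
    """Pad an encoded sequence to a fixed length at the chosen position."""
    if len(seq) > length:
        raise ValueError("sequence is longer than the target length")
    n_pad = length - len(seq)
    if position == "post":
        # Padding at the end of the sequence (the usual default).
        return list(seq) + [pad_value] * n_pad
    if position == "pre":
        # Padding at the start of the sequence.
        return [pad_value] * n_pad + list(seq)
    if position == "both":
        # Padding split between both ends; the extra zero goes to the right.
        left = n_pad // 2
        return [pad_value] * left + list(seq) + [pad_value] * (n_pad - left)
    raise ValueError(f"unknown padding position: {position!r}")


# Hypothetical integer-encoded sequence of three residues.
encoded = [3, 1, 4]
post = pad_sequence(encoded, 7, "post")  # [3, 1, 4, 0, 0, 0, 0]
pre = pad_sequence(encoded, 7, "pre")    # [0, 0, 0, 0, 3, 1, 4]
both = pad_sequence(encoded, 7, "both")  # [0, 0, 3, 1, 4, 0, 0]
```

      Exposing `position` as a parameter is what allows it to be searched alongside the other hyperparameters.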

      Third, this thesis studies the effect of data imbalance on protein-compound activity classification models and its mitigation through resampling techniques. Model performance was assessed for different combinations of minority-class oversampling and clustering. Results showed that the proportion of actives predicted by the model was explained by the actual data balance in the test set. Data clustering, followed by data resampling in the training and validation sets, emerged as the best-performing strategy that leaves the test set unaltered.
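
      A minimal sketch of minority-class oversampling follows, assuming simple random duplication with replacement (the abstract does not say which resampling technique was used). The function name `oversample_minority` and the toy data are hypothetical; in line with the text, it would be applied to the training and validation sets only.

```python
import random
from collections import Counter


def oversample_minority(samples, labels, seed=0):
    """Randomly duplicate samples until every class matches the largest one."""
    rng = random.Random(seed)

    # Group samples by class label.
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw extra copies with replacement; the majority class gets none.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(items + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Hypothetical training set: six inactives (0) and two actives (1).
train_x = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"]
train_y = [0, 0, 0, 0, 0, 0, 1, 1]
balanced_x, balanced_y = oversample_minority(train_x, train_y)
class_counts = Counter(balanced_y)
```

      The test set is deliberately left out of this step so that its class balance still reflects the data the model would see in practice.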

      To accomplish the three points above, this thesis provides a systematic way to diagnose deep learning models, identifying the factors that govern the model predictions and performance. Specifically, explanatory linear models enabled informed, quantitative decisions regarding input preprocessing. This ultimately leads to more consistent deep learning target-compound binding prediction models.

