The impact of schizophrenia misdiagnosis rates on machine learning models performance

Daniel Martins; Conceição Egas; Joel P. Arrais

Daniel Martins ^[1]; Conceição Egas ^[1]; Joel P. Arrais ^[1]
1. [1] Universidade de Coimbra, Coimbra (Sé Nova), Portugal
Published in: Practical applications of computational biology and bioinformatics, 17th International Conference (PACBB 2023) / Miguel Rocha (ed.), Florentino Fernández Riverola (ed.), Mohd Saberi Mohamad (ed.), Ana Belén Gil González (ed.), 2023, ISBN 978-3-031-38078-5, pp. 3-13
Language: English
Abstract
- Schizophrenia is a complex disease with severely disabling symptoms. No single gene has been consistently identified as the leading cause of disease onset, and there is no consensus on the etiology or diagnosis of the disease. Sweden is a paradigmatic case, where relatively high misdiagnosis rates (19%) have been reported.
  
  A large-scale case-control dataset based on the Swedish population was reduced to its most representative variants, and the distinction between cases and controls was further scrutinized through gene-annotation-based machine learning (ML) models.
  
  The intra-group differences within cases and within controls were accentuated by training the model on the entire dataset. The cases and controls with a higher likelihood of being misclassified, and hence more likely to have been misdiagnosed, were excluded from subsequent analysis. The model was then conventionally trained on the reduced dataset, and the performance of the two models was compared.
  
  The results indicate that the reported prevalence and misdiagnosis rates for schizophrenia may carry over to case-control cohorts, thereby reducing the performance of association studies based on such datasets. After the sample-filtering procedure, a simple ML model reached a performance more consistent with the schizophrenia heritability estimates in the literature.
  
  Sample selection in large-scale datasets sequenced for association studies may enable ML approaches and strategies to be adapted to complex disease research.
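
The sample-filtering procedure outlined in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the data are synthetic, a nearest-centroid classifier stands in for the gene-annotation-based model, the 19% label-flip rate simply mirrors the misdiagnosis rate quoted above, and dropping every sample the full-data model misclassifies is a simplification of "higher likelihood to be misclassified". All function and variable names are hypothetical.

```python
import random

def fit_centroids(X, y):
    """Mean feature vector per class (0 = control, 1 = case)."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Assign x to the class whose centroid is nearest (squared Euclidean)."""
    dist = {lab: sum((a - b) ** 2 for a, b in zip(x, c))
            for lab, c in centroids.items()}
    return min(dist, key=dist.get)

def accuracy(centroids, X, y):
    """Fraction of samples whose predicted class matches the recorded label."""
    return sum(predict(centroids, x) == lab for x, lab in zip(X, y)) / len(y)

def holdout_accuracy(X, y, rng, train_frac=0.7):
    """Conventional evaluation: fit on a random training split, score on the rest."""
    idx = list(range(len(y)))
    rng.shuffle(idx)
    cut = int(train_frac * len(idx))
    train, test = idx[:cut], idx[cut:]
    model = fit_centroids([X[i] for i in train], [y[i] for i in train])
    return accuracy(model, [X[i] for i in test], [y[i] for i in test])

# Synthetic cohort (assumption): two Gaussian classes, with a fraction of the
# recorded labels flipped to imitate misdiagnosis.
rng = random.Random(0)
n, d, flip_rate = 400, 5, 0.19
truth = [i % 2 for i in range(n)]
X = [[rng.gauss(1.0 if t else -1.0, 1.0) for _ in range(d)] for t in truth]
y = [1 - t if rng.random() < flip_rate else t for t in truth]

# Step 1: train on the entire dataset.
full_model = fit_centroids(X, y)

# Step 2: keep only the samples that this model classifies in agreement with
# their recorded label; the rest are treated as likely misdiagnosed.
keep = [i for i in range(n) if predict(full_model, X[i]) == y[i]]
X_f = [X[i] for i in keep]
y_f = [y[i] for i in keep]

# Step 3: train conventionally on each dataset and compare the performances.
acc_full = holdout_accuracy(X, y, random.Random(1))
acc_filtered = holdout_accuracy(X_f, y_f, random.Random(1))
print(f"samples kept: {len(keep)} of {n}")
print(f"hold-out accuracy, full dataset:     {acc_full:.3f}")
print(f"hold-out accuracy, filtered dataset: {acc_filtered:.3f}")
```

Because the filter removes exactly the samples the model finds hardest, some gain on the filtered set is expected by construction. In a real study, the filtered model would need to be judged against an independent reference, such as the heritability estimates mentioned in the abstract.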