Abstract of Improvement of sample classification and metabolite profiling in 1H-NMR by a machine learning-based modelling of signal parameters

Daniel Cañueto Rodriguez

Metabolomics is the study of small molecules, called metabolites, present in cells, tissues, organs and biofluids from organisms. An important analytical technique to characterize the metabolite levels present in a mixture is nuclear magnetic resonance (NMR). NMR spectra of metabolomics samples contain a sum of Lorentzian signals distributed along the spectrum which correspond to the metabolites present in the mixture. The area below a metabolite signal is proportional to the metabolite concentration. This quantitative property gives NMR high-throughput potential for the quantification of metabolite concentrations. 1H-NMR remains the default choice when acquiring NMR spectra of samples. The recommended approach to quantify metabolites in NMR spectra is the profiling of metabolites through the deconvolution of the metabolite signals. In signal deconvolution, any metabolite signal is constructed from three parameters: the chemical shift (i.e., the location in the spectrum), the half bandwidth (i.e., the half-width at half-height) and the intensity. Theoretically, a metabolite signal should always have the same chemical shift, a half bandwidth perfectly collinear with the half bandwidths of the other signals, and an intensity perfectly collinear with the intensities of the other signals from the same metabolite. These constraints greatly ease the search by optimization algorithms for the combination of signals that best fits the spectrum lineshape. As a result, several 1H-NMR automatic profiling tools have appeared in order to improve the quality and reduce the duration of the metabolite profiling process compared to options such as manual profiling or fingerprinting.
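
As an informal illustration of the lineshape model underlying deconvolution, the following R sketch builds a small spectrum region as a sum of Lorentzian signals from the three parameters named above; the function and variable names are illustrative and do not belong to any specific profiling tool.

    # Minimal sketch: a 1H-NMR spectrum region modelled as a sum of Lorentzian signals.
    # Each signal is defined by three parameters: chemical shift (position, ppm),
    # half bandwidth (half-width at half-height, ppm) and intensity (peak height).

    lorentzian <- function(ppm, shift, hwhm, intensity) {
      # Lorentzian lineshape: maximum 'intensity' at 'shift', half of it at 'shift' +/- 'hwhm'
      intensity * hwhm^2 / ((ppm - shift)^2 + hwhm^2)
    }

    ppm <- seq(1.20, 1.40, by = 1e-4)            # spectral axis (ppm)

    # Illustrative doublet: two signals sharing the same half bandwidth,
    # with intensities in a fixed 1:1 ratio (the theoretical constraints above)
    signals <- data.frame(
      shift     = c(1.31, 1.33),
      hwhm      = c(0.0012, 0.0012),
      intensity = c(1.0, 1.0)
    )

    spectrum <- rowSums(mapply(lorentzian,
                               shift = signals$shift,
                               hwhm = signals$hwhm,
                               intensity = signals$intensity,
                               MoreArgs = list(ppm = ppm)))

    # The area below each signal is proportional to the metabolite concentration;
    # for a Lorentzian it equals pi * intensity * hwhm.
    areas <- pi * signals$intensity * signals$hwhm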

However, the chemical shift and the half bandwidth of a signal are determined by the chemical environment of the nucleus mediating the signal. This chemical environment is affected by the high variance in sample and matrix properties (e.g., pH, ionic strength, presence of macromolecules) and in sample storage (freezing) and preparation (e.g., thawing, buffering, dilution) prevalent in the study of complex matrices. There are also differences in the spectrum acquisition output depending on the spectrometer used. As a result, the chemical shift and half bandwidth (and in some cases even the relative intensity) might show relevant variability in complex matrices even after buffering. This variability in the signal parameters can distort the assumptions necessary to perform a high-quality optimization of lineshape fitting and challenge the performance of NMR automatic profiling tools. Default strategies to circumvent these limitations are based on reducing the search space during optimization by enforcing restrictions during sample preparation and spectrum acquisition, restricting the tool to specific matrices, or implementing bioinformatic solutions based on empirical observation. However, these strategies do not provide a full solution to the original problem: how to model the variability present in the signal parameters when the sources of variability present in the sample to be analysed cannot be fully parameterized. Modelling the signal parameters without the need for prior information is a necessary step towards an automatic profiling that is as sample-, protocol- and matrix-independent as possible. This need becomes even more pressing considering the ongoing sensitivity improvements in NMR, which will mean a higher number of metabolites to profile and to monitor during automatic profiling.

The current lack of effective modelling of the variability in the signal parameters also hinders the harnessing of the information encoded in them. The signal parameters encode in their variability patterns specific properties of the sample analysed (pH, ionic strength, temperature, the chemical environment of the proton mediating the signal). Therefore, the information present in these variability patterns might be exploited to maximize the quality and quantity of information extracted from the samples. For example:

- Metabolomics studies do not exploit the information present in the chemical shift or the half bandwidth to try to improve the discrimination between different kinds of samples related to differences in pH or ionic strength.

- The modelling of the signal parameters should enable the comparison between the expected and the obtained signal parameters during metabolite profiling in order to detect wrong annotations and suboptimal quantifications.

- Signals show collinear variability patterns in chemical shift and half bandwidth, as the nuclei that mediate these signals can have similar chemical environments. This collinearity might help in the identification of unknown metabolites through the analysis of how their signals behave similarly to the signals of known metabolites (see the sketch after this list).
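
To make the collinearity idea concrete, the following R sketch clusters signals by the covariation of their chemical shifts across samples; the data, the signal names and the simulated pH effect are hypothetical and only illustrate the kind of analysis described above.

    # Minimal sketch (simulated data): screening for collinear chemical shift
    # variability between signals across samples. Signals whose shifts move together
    # are likely to come from nuclei with similar chemical environments, which can
    # support the identification of unknown signals.

    set.seed(1)
    n_samples <- 50
    ph_effect <- rnorm(n_samples, sd = 0.005)

    # Hypothetical matrix of chemical shifts (rows = samples, columns = signals);
    # in practice these values would be exported by a profiling tool.
    shifts <- cbind(
      citrate_2.53 = 2.53 + 1.00 * ph_effect + rnorm(n_samples, sd = 5e-4),
      citrate_2.65 = 2.65 + 0.90 * ph_effect + rnorm(n_samples, sd = 5e-4),
      unknown_2.58 = 2.58 + 0.95 * ph_effect + rnorm(n_samples, sd = 5e-4),
      lactate_1.33 = 1.33 + rnorm(n_samples, sd = 5e-4)
    )

    # Correlation of shift variability: the unknown signal clusters with the
    # citrate signals, suggesting a similarly pH-sensitive chemical environment.
    round(cor(shifts), 2)
    hc <- hclust(as.dist(1 - cor(shifts)))
    plot(hc, main = "Clustering of signals by chemical shift covariation")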

A possible reason why strategies to exploit the benefits of modelling the signal parameters have not yet been implemented is the prior need to collect all the signal parameters of several datasets in order to evaluate and implement such strategies. To overcome this challenge, it is first necessary to develop a tool able to collect and export the signal parameter values for their analysis. Likewise, the development of strategies to model the signal parameters and exploit the benefits of their modelling calls for machine learning (ML) algorithms able to handle the non-linearities and noise present in the signal parameter datasets. The development of a tool in an open-source statistical language such as R, with state-of-the-art machine learning algorithms and implementations, therefore seems a necessary step.

The development of this new open-source tool, with the implementation of ML-based solutions to solve some challenges of complex matrices, was the first goal of the thesis. Some additional achievements fulfilled during this goal were the following:

- A metabolite signal identification tool based on the unsupervised analysis of clusters of chemical shifts which behave similarly to the signal analysed (and, therefore, should come from metabolites with similar structures).

- Another metabolite identification tool to minimize wrong annotations (e.g., of metabolites not typical of the matrix analysed), based on the data mining of open-source HMDB information about the reported concentration and presence of each metabolite in each matrix and about the parameters of each metabolite signal.

- The row-wise dimensionality reduction of spectra datasets to help during exploratory analysis, thanks to the selection of exemplars of spectra clusters able to efficiently represent the variance present in a spectra dataset (see the sketch after this list).

- The generation of indicators of possible wrong annotations and improvable quantifications to help review individual quantifications. These indicators are based on the study of the difference between the signal parameter value expected from random forest-based regressions and the obtained one.

- The creation of the first public reproducible 1H-NMR metabolite profiling workflows of metabolomics studies, to enhance the reproducibility of such studies.
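
As an illustration of the exemplar-based reduction mentioned above, the following R sketch selects representative spectra with k-medoids clustering (PAM); the abstract does not name the clustering method used by the tool, so this algorithm choice and the simulated spectra matrix are assumptions.

    # Minimal sketch of exemplar selection for exploratory analysis, assuming a
    # k-medoids (PAM) approach; the method choice is illustrative only.
    library(cluster)

    set.seed(1)
    # Hypothetical spectra matrix: rows = spectra, columns = spectral data points
    spectra <- matrix(rnorm(200 * 500), nrow = 200, ncol = 500)

    k <- 10                                    # number of exemplars to keep
    fit <- pam(spectra, k = k)                 # partition around medoids

    exemplar_idx <- fit$id.med                 # row indices of the exemplar spectra
    exemplars    <- spectra[exemplar_idx, ]    # reduced dataset for exploratory analysis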

The second goal was to harness the potential of the signal parameters collected with the new profiling tool to enhance the discrimination of different metabolomics sample types. After exploratory analysis, only chemical shift information was added to metabolite concentration information to help increase the performance of sample classification in three public metabolomics datasets. To avoid the problems related to the use of noisy information such as chemical shift and to low sample sizes, the ML-based (concretely, random forest-based) sample classification workflows incorporated solutions to avoid overfitting (e.g., bootstrap) and to perform feature selection. These solutions were based on open-source implementations of ML algorithms and enabled overcoming the existing obstacles to harnessing the potential of chemical shifts. In two of the three datasets studied, chemical shift information helped provide an AUROC value higher than 0.9 during sample classification. In the other dataset, the chemical shift also showed discriminant potential (AUROC 0.831). These results were consistent with the pH imbalance characteristic of the condition studied in the datasets and show that it is possible to use chemical shift information to enhance the diagnostic and predictive properties of NMR. In addition, it was demonstrated that the signal misalignment produced by chemical shift variability, if caused by changes in the sample class, might distort the results of metabolomics studies performed by fingerprinting approaches (an alternative to profiling).
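
The following R sketch, on simulated data, shows the general idea of combining metabolite concentrations with chemical shift features in a random forest classifier and evaluating discrimination by AUROC; the package choices (randomForest, pROC), the simulated pH effect and all variable names are assumptions, not the exact workflow of the thesis.

    # Minimal sketch: adding chemical shift features to metabolite concentrations
    # for random forest-based sample classification, evaluated by AUROC.
    library(randomForest)
    library(pROC)

    set.seed(1)
    n <- 120
    concentrations <- matrix(rnorm(n * 20), nrow = n,
                             dimnames = list(NULL, paste0("conc_", 1:20)))
    shifts <- matrix(rnorm(n * 20, sd = 0.002), nrow = n,
                     dimnames = list(NULL, paste0("shift_", 1:20)))
    class <- factor(rep(c("control", "case"), each = n / 2))

    # Simulate a pH-related effect: cases show a small systematic shift displacement
    shifts[class == "case", 1:5] <- shifts[class == "case", 1:5] + 0.004

    features <- data.frame(concentrations, shifts)
    rf <- randomForest(x = features, y = class, ntree = 500)

    # Out-of-bag vote fractions give an overfitting-resistant AUROC estimate
    oob_prob <- rf$votes[, "case"]
    auc(roc(class, oob_prob))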

The third goal was the modelling of the signal parameters, already explored during the development of the open-source profiling tool to identify suboptimal quantifications, in order to maximize the quality of automatic profiling. The objective was to achieve a solution to monitor the variability in the signal parameters which was as matrix-, sample- and protocol-independent as feasible and which could be implemented in any tool. To achieve this objective, the strategy consisted of inferring the specific properties of each sample from the values of the signal parameters collected during a first profiling iteration. There is extensive multicollinearity in the signal parameter information, and this multicollinearity can be exploited to generate narrow and accurate predictions of the expected parameter values of a signal from the values present in the collinear signals. During the first profiling iteration, because of the need for a wider search space during the optimization of lineshape fitting, suboptimal quantifications and wrong annotations are expected, and they add noise to the datasets of signal parameters. This noise, if not adequately controlled, can produce regression models of the expected values which overfit and do not generalize well. To overcome this limitation, a robust implementation was designed, based on the removal of noisy features, the selection of relevant features through random forest and the enrichment of the dataset with PCA-based information. In addition, to handle the non-linearities present in certain signal parameters, the random forest algorithm was chosen to predict the expected signal parameters. Lastly, to avoid overfitting, 0.632 bootstrap was performed. The predictions achieved during bootstrap were also used to generate prediction intervals for the predicted signal parameters. These prediction intervals were then used as new value ranges during the optimization of lineshape fitting in a second iteration of automatic profiling in two datasets of complex matrices (human faecal extract and human serum). Results show that the use of these prediction intervals helped maximize the quality of automatic profiling in the two datasets. Preliminary results also suggested benefits of its implementation in human urine. In addition, it was demonstrated that the predictions of signal parameters can be a better indicator of profiling quality than currently used methods (e.g., the fitting error). The use of these predictions to parameterize the quality of metabolite profiling consists of averaging the difference between the parameter values of the profiled signal and the expected values.
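
A minimal R sketch of this strategy is given below, assuming simulated first-iteration chemical shifts: the expected value of one signal is predicted from collinear signals with random forest regressions fitted on bootstrap resamples, and the spread of the bootstrap predictions provides the prediction interval and the quality indicator described above. The 0.632 estimator, the feature selection and the PCA-based enrichment steps of the thesis are simplified away here.

    # Minimal sketch: predict the expected chemical shift of one signal from the
    # shifts of collinear signals (first profiling iteration), and turn the spread
    # of bootstrap predictions into a prediction interval for the second iteration.
    library(randomForest)

    set.seed(1)
    n <- 80
    # Hypothetical first-iteration chemical shifts (rows = samples, columns = signals)
    ph_effect <- rnorm(n, sd = 0.005)
    shifts <- data.frame(
      target   = 2.53 + ph_effect + rnorm(n, sd = 1e-3),       # signal to constrain
      signal_a = 2.65 + 0.9 * ph_effect + rnorm(n, sd = 5e-4),
      signal_b = 3.05 + 1.1 * ph_effect + rnorm(n, sd = 5e-4)
    )

    n_boot <- 50
    boot_pred <- matrix(NA_real_, nrow = n, ncol = n_boot)
    for (b in seq_len(n_boot)) {
      idx <- sample(n, replace = TRUE)                 # bootstrap resample
      rf <- randomForest(target ~ signal_a + signal_b,
                         data = shifts[idx, ], ntree = 200)
      boot_pred[, b] <- predict(rf, newdata = shifts)  # predictions for every sample
    }

    expected <- rowMeans(boot_pred)                    # expected chemical shift per sample
    interval <- t(apply(boot_pred, 1, quantile, probs = c(0.025, 0.975)))

    # 'interval' can act as the allowed chemical shift range in the second profiling
    # iteration; the averaged difference between the profiled value and 'expected'
    # can serve as a quality indicator of the profiling.
    quality_indicator <- mean(abs(shifts$target - expected))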

