Erola Pairó Castiñeira
Gene expression is a complex and highly regulated process. Most of this regulation is controlled by short DNA sequences that are bound by proteins called transcription factors (TFs). By binding to these sites, transcription factors can start the transcription of mRNA, stop it, or control the amount of mRNA produced. The DNA binding sites of transcription factors share some specific characteristics: (1) they are short sequences, (2) they can be located anywhere in the genome, and (3) they are degenerate, which means that some mutations in the binding site sequence do not alter its functionality. These characteristics make it impossible to search for one specific sequence in one specific region, and create the need to model the binding sites in order to detect them. Because of the importance of gene expression in the study of cell differentiation and its implication in some genetic diseases, many computational models and experimental procedures have appeared to model binding site motifs and then find them in a genome. The computational models can be divided into two main groups: motif discovery methods, which try to find binding sites within a set of co-regulated sequences without previous knowledge, and motif search methods, which use previously known sites to create a model and then try to locate binding sequences that fit this model. Most algorithms for binding site detection (both discovery and search) are based on position weight matrices (PWMs), which are matrices of the frequency of each nucleotide at each position and which assume that positions are independent. Some other methods take interdependences into account, but they need many training sequences and long computation times. The focus of this thesis is to use the conversion of DNA from symbolic to numerical form, together with previous knowledge of binding site sequences, to construct models for DNA motifs.
In this context, well-known multivariate signal processing techniques can be ideal tools for constructing models that take interdependences into account without needing a large number of sequences or a long computation time. To characterize the transcription factors, the TF-protein relationships were studied, showing that most transcription factors regulate the expression of 5-10 genes and that, at the same time, most proteins are regulated by more than one TF. The study of interdependences between positions showed that more than 90% of the binding sites have significant interdependences, but that the percentage of interdependences is not enough to classify TFs according to structure. Converting DNA motif matrices into numerical matrices allows the use of Principal Component Analysis (PCA) to model the binding sites; PCA captures the information on interdependences in the covariance, a second-order statistic. Under the hypothesis that binding sites fit the PCA model better than genomic sequences do, the Q-residuals can be used to detect binding sites within the genome. Compared with PWMs, the Q-residuals detector performs at least as well, and the improvement in detection is significantly correlated with the percentage of positions with interdependences. The disadvantage of these PCA models is that they are difficult to interpret. Converting the symbolic DNA matrix into a numerical DNA cube allows the calculation of PARAFAC models, which are easier to interpret. Since PARAFAC models have unique solutions, their scores can be combined with the PARAFAC Q-residuals to construct a quadratic detector that also performs better than PSSM (position-specific scoring matrix) models. When the numerical detectors are compared with detectors that take interdependences into account, they perform better when few sequences are available, but they are more sensitive to the number of positions.
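As an illustration of the Q-residuals idea described above, the following is a minimal sketch and not the implementation used in the thesis. It assumes a one-hot (indicator) encoding for the symbolic-to-numerical conversion, fits PCA through a singular value decomposition, and scores each genomic window by the squared norm of the part of the encoded window that the PCA model does not explain. The function names, the toy binding sites, and the toy genome are all hypothetical and chosen only for the example.

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a flat vector with 4 indicators per position.
    (Assumed encoding; the thesis may use a different numerical conversion.)"""
    mat = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        mat[i, ALPHABET.index(base)] = 1.0
    return mat.ravel()

def fit_pca(sites, n_components):
    """Fit a PCA model to known binding sites of equal length.
    Returns the mean vector and the loadings (one column per component)."""
    X = np.array([one_hot(s) for s in sites])
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components].T

def q_residual(seq, mean, loadings):
    """Squared norm of the part of the encoded sequence outside the PCA subspace."""
    x = one_hot(seq) - mean
    residual = x - loadings @ (loadings.T @ x)
    return float(residual @ residual)

def scan(genome, mean, loadings, width):
    """Q-residual of every window of the given width; low values suggest a site."""
    return [q_residual(genome[i:i + width], mean, loadings)
            for i in range(len(genome) - width + 1)]

# Hypothetical toy data, for illustration only.
sites = ["TGACTCA", "TGAGTCA", "TGACTAA", "TTACTCA"]
mean, loadings = fit_pca(sites, n_components=3)
scores = scan("CCCCTGACTCACCCC", mean, loadings, width=7)
best_start = int(np.argmin(scores))  # window that fits the model best
```

A window is reported as a candidate binding site when its Q-residual is small, following the hypothesis that true sites fit the model better than background sequence. In practice a threshold on the Q-residual would have to be chosen, for example from the residuals of the training sites; that choice is left out of this sketch.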