Ayuda
Ir al contenido

Dialnet


Storage format selection and optimization for materialized intermediate results in data-intensive flows

  • Autores: Rana Faisal Munir
  • Directores de la Tesis: Albert Abelló Gamazo (dir. tes.), Wolfgang Lehner (codir. tes.), Oscar Romero Moral (codir. tes.)
  • Lectura: En la Universitat Politècnica de Catalunya (UPC) ( España ) en 2019
  • Idioma: español
  • Tribunal Calificador de la Tesis: Torben Bach Pedersen (presid.), Antonio Corral Liria (secret.), Norbert Ritter (voc.)
  • Programa de doctorado: Programa de Doctorado Erasmus Mundus en Tecnologías de la Información para la Inteligencia Empresarial / Information Technologies for Business Intelligence por la Universidad Politécnica de Catalunya; Aalborg Universitet(Dinamarca); Politechnika Poznanska(Polonia); Technische Universität Dresden(Alemania) y Université Libre de Bruxelles(Bélgica)
  • Materias:
  • Texto completo no disponible (Saber más ...)
  • Resumen
    • Modern organizations produce and collect large volumes of data, that need to be processed repeatedly and quickly for gaining business insights. For such processing, typically, Data-intensive Flows (DIFs) are deployed on distributed processing frameworks. The DIFs of different users have many computation overlaps (i.e., parts of the processing are duplicated), thus wasting computational resources and increasing the overall cost. The output of these computation overlaps (known as intermediate results) can be materialized for reuse, which helps in reducing the cost and saves computational resources if properly done. Furthermore, the way such outputs are materialized must be considered, as different storage layouts (i.e., horizontal, vertical, and hybrid) can be used to reduce the I/O cost.

      In this PhD work, we first propose a novel approach for automatically materializing the intermediate results of DIFs through a multi-objective optimization method, which can tackle multiple and conflicting quality metrics. Next, we study the behavior of different operators of DIFs that are the first to process the loaded materialized results. Based on this study, we devise a rule-based approach, that decides the storage layout for materialized results based on the subsequent operation types. Despite improving the cost in general, the heuristic rules do not consider the amount of data read while making the choice, which could lead to a wrong decision. Thus, we design a cost model that is capable of finding the right storage layout for every scenario. The cost model uses data and workload characteristics to estimate the I/O cost of a materialized intermediate results with different storage layouts and chooses the one which has minimum cost. The results show that storage layouts help to reduce the loading time of materialized results and overall, they improve the performance of DIFs.

      The thesis also focuses on the optimization of the configurable parameters of hybrid layouts. We propose ATUN-HL (Auto TUNing Hybrid Layouts), which based on the same cost model and given the workload and characteristics of data, finds the optimal values for configurable parameters in hybrid layouts (i.e., Parquet).

      Finally, the thesis also studies the impact of parallelism in DIFs and hybrid layouts. Our proposed cost model helps to devise an approach for fine-tuning the parallelism by deciding the number of tasks and machines to process the data. Thus, the cost model proposed in this thesis, enables in choosing the best possible storage layout for materialized intermediate results, tuning the configurable parameters of hybrid layouts, and estimating the number of tasks and machines for the execution of DIFs.


Fundación Dialnet

Dialnet Plus

  • Más información sobre Dialnet Plus

Opciones de compartir

Opciones de entorno