Ayuda
Ir al contenido

Dialnet


Resumen de Layout inference and table detection in spreadsheet document

Elvis Koci

  • Spreadsheet applications have evolved to be a tool of great importance for businesses, open data, and scientific communities. Using these applications, users can perform various transformations, generate new content, analyze and format data such that they are visually comprehensive. The same data can be presented in different ways, depending on the preferences and the intentions of the user.

    These functionalities make spreadsheets user-friendly, but not as much machine-friendly. When it comes to integrating with other sources, the free-for-all nature of spreadsheets is disadvantageous. It is rather difficult to algorithmically infer the structure of the data when they are intermingled with formatting, formulas, layout artifacts, and textual metadata. Therefore, user involvement is often required, which results in cumbersome and time-consuming tasks. Overall, the lack of automatic processing methods limits our ability to explore and reuse a great amount of rich data stored into partially-structured documents such as spreadsheets.

    In this thesis, we tackle this open challenge, which so far has been scarcely investigated in literature. Specifically, we are interested in extracting tabular data from spreadsheets, since they hold concise, factual, and to a large extend structured information. It is easier to process such information, in order to make it available to other applications. For instance, spreadsheet (tabular) data can be loaded into databases. Thus, these data would become instantly available to existing or new business processes. Furthermore, we can eliminate the risk of losing valuable company knowledge, by moving data or integrating spreadsheets with other more sophisticated information management systems.

    To achieve the aforementioned objectives and advancements, in this thesis, we develop a spreadsheet processing pipeline. The requirements for this pipeline were derived from a large scale empirical analysis of real-world spreadsheets, from business and Web settings. Specifically, we propose a series of specialized steps that build on top of each other with the goal of discovering the structure of data in spreadsheet documents. Our approach is bottom-up, as it starts from the smallest unit (i.e., the cell) to ultimately arrive at the individual tables of the sheet.

    Additionally, this thesis makes use of sophisticated machine learning and optimization techniques. In particular, we apply these techniques for layout analysis and table detection in spreadsheets. We target highly diverse sheet layouts, with one or multiple tables and arbitrary arrangement of contents. Moreover, we foresee the presence of textual metadata and other non-tabular data in the sheet. Furthermore, we work even with problematic tables (e.g., containing empty rows/columns and missing values). Finally, we bring flexibility to our approach. This not only allows us to tackle the above-mentioned challenges but also to reuse our solution for different (spreadsheet) datasets.


Fundación Dialnet

Dialnet Plus

  • Más información sobre Dialnet Plus