A framework to support run-time adaptation in reconfigurable multi-accelerator systems

Alfonso Rodríguez Medina


Authors: Alfonso Rodríguez Medina
Thesis supervisors: Eduardo Torre Arnanz (supervisor)
Defended: at the Universidad Politécnica de Madrid (Spain) in 2020
Language: Spanish
Thesis committee: Javier Uceda Antolín (chair), José Andrés Otero Marnotes (secretary), Michael Hüebner (member), Ángel de Castro Martín (member), Roberto Sarmiento (member)
Doctoral program: Doctoral Program in Electrical and Electronic Engineering of the Universidad de Oviedo and the Universidad Politécnica de Madrid
Subjects:
- Technological sciences
  - Electronic technology
Links
- Open-access thesis at: Archivo Digital UPM
Abstract
- High-performance embedded computing systems have experienced a surge in both application complexity and overall system requirements. In order to keep pace with these fast changes, traditional computing platforms and paradigms have also been forced to evolve and adapt themselves to unprecedented scenarios. As a consequence, it is fairly common nowadays to see parallel computing platforms (e.g., multi-core CPUs, general-purpose GPUs, or FPGAs) at the core of high-performance embedded computing solutions. However, even though execution performance still plays an important role in the platform selection process, additional criteria, such as energy efficiency or fault tolerance, need to be taken into account as well. In this context, the use of run-time reconfigurable SRAM-based FPGAs as standalone processing elements has already proven to be the best option in multiple application scenarios, since they provide the best of both hardware (computing performance, energy efficiency) and software (programmability, execution flexibility) worlds.
  
  However, the use of these devices presents several problems, the most relevant one still being their lack of accessibility for the general public. In most cases, developers are expected to have extensive low-level knowledge of the underlying technology, either to describe and implement their algorithms, or to execute and manage the generated hardware accelerators. In order to overcome these issues, it is mandatory for reconfigurable FPGA-based parallel computing frameworks to provide three key components: computing architectures to create a proper acceleration infrastructure, design-time support tools to enable high-level entry points for accelerator description and automated system implementation, and run-time support mechanisms to hide low-level reconfiguration and execution management details. The framework proposed in this Thesis is an integrated solution that covers all three components.
  
  The framework is built around the ARTICo3 architecture, a hardware-based reconfigurable multi-accelerator infrastructure inspired by general-purpose GPUs and OpenCL abstractions. As such, it is capable of using dynamic partial reconfiguration (DPR) to exploit both task-level parallelism (loading different application-specific hardware accelerators) and data-level parallelism (loading multiple copies of the same application-specific hardware accelerator). In addition, it features a configurable datapath that can be dynamically modified during application execution to enable run-time tradeoffs between computing performance, energy efficiency, and fault tolerance. In this regard, the architecture supports three operation modes when multiple copies of the same application-specific hardware accelerator are present on the FPGA fabric: parallel (each accelerator gets different input data and executes in SIMD-like fashion), redundant (each accelerator belongs to a triple or dual modular redundancy (TMR/DMR) group, receives the same input data, and sends its results back through a configurable voter unit to enforce fault tolerance), and reduction-oriented (similar to parallel, but with a final reduction operation before writing results back to main memory). To simplify its programming, ARTICo3 uses a memory-mapped virtualization of the hardware accelerators, and provides a register-based interface for configuration purposes. The ARTICo3 architecture also features an embedded monitoring infrastructure to keep track of relevant execution metrics (e.g., accelerator latency, number of faults per reconfigurable slot).
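
The three operation modes can be illustrated with a small software model. This is only a sketch under stated assumptions: the function names (`run_parallel`, `run_redundant`, `run_reduction`, `vote`) are hypothetical and not part of ARTICo3, accelerators are modeled as plain Python callables, and the voter is assumed to be a strict element-wise majority voter.

```python
from collections import Counter
from functools import reduce


def _dispatch(kernel, data, n_acc):
    """Split data into n_acc contiguous chunks and run one kernel copy per chunk."""
    chunk = -(-len(data) // n_acc)  # ceiling division
    return [kernel(data[i * chunk:(i + 1) * chunk]) for i in range(n_acc)]


def run_parallel(kernel, data, n_acc):
    """Parallel mode: each copy gets different input; outputs are concatenated."""
    return [y for out in _dispatch(kernel, data, n_acc) for y in out]


def vote(candidates):
    """Element-wise strict majority vote over the outputs of redundant copies."""
    result = []
    for values in zip(*candidates):
        value, count = Counter(values).most_common(1)[0]
        if 2 * count <= len(values):
            raise ValueError("no majority: fault detected but not correctable")
        result.append(value)
    return result


def run_redundant(kernels, data):
    """Redundant mode: every copy gets the same input; results go through the voter."""
    return vote([kernel(data) for kernel in kernels])


def run_reduction(kernel, data, n_acc, op):
    """Reduction mode: like parallel, but per-copy outputs are combined element-wise."""
    outputs = _dispatch(kernel, data, n_acc)
    return [reduce(op, values) for values in zip(*outputs)]
```

In this model, three copies in redundant mode mask a single faulty copy, whereas two copies can only detect a mismatch, which matches the usual TMR/DMR distinction.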
  
  The ARTICo3 framework also comes with a toolchain to automatically generate reconfigurable multi-accelerator systems with DPR-compatible floorplans. In order to further simplify FPGA-based system implementation, two different entry points for hardware accelerator specification are supported: low-level RTL descriptions in VHDL/Verilog (for users with some knowledge of hardware design) and high-level algorithmic descriptions in C/C++ (for users with no previous experience in hardware design), which are then implemented using a commercial high-level synthesis (HLS) engine under the hood. Regardless of the selected entry point, accelerators are integrated in the main ARTICo3 infrastructure by instantiating the input user logic in a standard wrapper, which features configurable memory and register banks, and a fixed interface to simplify the implementation of the static and reconfigurable partitions on the FPGA. The toolchain is implemented in a modular way, enabling users to extend its functionality by creating custom templates for multi-accelerator system deployment.
  
  The last component of the ARTICo3 framework is a runtime library that enables a data-parallel execution model similar to the one present in the OpenCL specification. This library is accessible from user applications through a lightweight API that hides low-level reconfiguration and accelerator management details. In fact, accelerator scheduling is transparently performed using a DMA-friendly memory model and a scalable execution scheme. In parallel, the framework includes a lightweight model for hardware accelerator execution that can be used to perform run-time estimations of execution performance and power consumption.
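
As a rough illustration of an OpenCL-like data-parallel execution model, the sketch below splits a global work size into work-groups and dispatches them in rounds over the available accelerator copies. The function name `schedule_rounds` and the round-based policy are assumptions made for this example; the abstract does not specify the actual scheduling scheme.

```python
def schedule_rounds(global_size, local_size, n_acc):
    """Assign work-groups to accelerator copies in successive rounds.

    global_size: total number of work-items
    local_size:  work-items processed by one accelerator invocation
    n_acc:       number of accelerator copies currently loaded
    Returns a list of rounds; each round lists the work-group ids run together.
    """
    if local_size <= 0 or n_acc <= 0:
        raise ValueError("local_size and n_acc must be positive")
    if global_size % local_size != 0:
        raise ValueError("global_size must be a multiple of local_size")
    n_groups = global_size // local_size
    return [
        list(range(start, min(start + n_acc, n_groups)))
        for start in range(0, n_groups, n_acc)
    ]
```

For instance, 1024 work-items with a local size of 128 form eight work-groups, which three accelerator copies would process in three rounds, the last one only partially filled.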
  
  In order to validate the proposed framework, three different approaches have been considered: using optimized custom VHDL-based hardware accelerators, using optimized custom HLS-based hardware accelerators, and using standard HLS-based benchmarks. In the first scenario, the different execution profiles in ARTICo3 have been assessed, showing the different behavior of memory-bounded applications (data transfer time exceeds computing time) and computing-bounded applications (computing time exceeds data transfer time). In the second scenario, it is shown that ARTICo3-based hardware acceleration does indeed outperform alternative high-performance embedded computing platforms, showing speedups of up to 14x and energy efficiency ratios of up to 10x when compared with a software-based implementation of the same algorithm. Finally, the benchmark evaluation has been used to identify which type of application can truly benefit from using the framework proposed in this Thesis, associating different computation/communication patterns to memory- and computing-bounded execution profiles.
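
The difference between the two execution profiles can be made concrete with a toy timing model. This is a simplified sketch, not the estimation model of the Thesis: it assumes that data transfers to the accelerators of one round are serialized on a shared bus while their computation runs fully in parallel, and the names `estimate_time` and `profile` are hypothetical.

```python
def estimate_time(n_groups, n_acc, t_transfer, t_compute):
    """Estimate total time to process n_groups work-groups on n_acc accelerators.

    Assumption: in each round, transfers are serialized (active * t_transfer)
    and computation overlaps across accelerators (a single t_compute).
    """
    total = 0
    remaining = n_groups
    while remaining > 0:
        active = min(n_acc, remaining)
        total += active * t_transfer + t_compute
        remaining -= active
    return total


def profile(t_transfer, t_compute):
    """Classify a kernel following the definition given in the text above."""
    return "memory-bounded" if t_transfer > t_compute else "computing-bounded"
```

Under these assumptions, a kernel with `t_transfer = 1` and `t_compute = 10` drops from 88 to 28 time units when going from one to four accelerators, while a kernel with the values swapped only drops from 88 to 82, which is the saturation expected for a memory-bounded profile.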
  
  Although the baseline ARTICo3 framework provides an accessible and transparent way for software programmers to implement reconfigurable multi-accelerator systems, it constrains them to use a single programming model. However, as the benchmark-based validation has confirmed, the native data-parallel programming model of the ARTICo3 framework may not yield adequate acceleration in certain application scenarios. Using plain C/C++ code for accelerator specification might also be a limitation, especially when considering that current programming trends tend to raise the level of abstraction. As a consequence, the ARTICo3 programming and execution models have been extended with transparent hardware/software multithreading capabilities and dataflow-based accelerator specification support.
  
  Hardware/software multithreading has been enabled by a direct integration with ReconOS, another hardware-oriented framework for reconfigurable multi-accelerator systems that provides the required architectural and OS-level extensions to allow hardware accelerators to behave as regular software threads. Additionally, an evolvable SA system has been implemented as an ARTICo3 accelerator to support circuit adaptation using learn-by-imitation procedures. This enables multi-grain reconfiguration, combining the coarse-grained DPR used in ARTICo3 slots with a fine-grained DPR mechanism to change functional units at LUT level.
  
  Dataflow-based extensions have been enabled by a direct integration with MDC, a tool to generate hardware accelerators with embedded CGR substrates using model-based specifications written in CAPH. This enables a hybrid reconfiguration approach that combines DPR (to perform slow but complete functional changes) with register-based CGR (to perform fast but parametric-only modifications in the accelerator logic). In addition, the integrated MDC-ARTICo3 toolchain allows designers to generate highly optimized structures that heavily enforce resource reuse, since MDC identifies and merges common functional units in the internal accelerator datapath.
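
The hybrid reconfiguration idea can be sketched as a simple decision rule: use the fast register-based switch when the requested function is already merged into the loaded accelerator, and fall back to DPR otherwise. Everything in this example (the `Slot` class, `switch_function`, and the `library` mapping from bitstream names to merged functions) is hypothetical and only models the decision, not the actual MDC or ARTICo3 interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    bitstream: str        # accelerator currently loaded in the slot
    functions: frozenset  # functions merged into that accelerator
    active: str           # function currently selected via registers
    log: list = field(default_factory=list)


def switch_function(slot, target, library):
    """Select target on slot, preferring fast CGR over slow DPR.

    library maps each available bitstream name to the set of functions
    merged into it (an assumption made for this sketch).
    """
    if target in slot.functions:
        # Fast path: parametric change through configuration registers.
        slot.active = target
        slot.log.append(("CGR", target))
        return "CGR"
    for name, funcs in library.items():
        if target in funcs:
            # Slow path: load a different accelerator into the slot.
            slot.bitstream = name
            slot.functions = frozenset(funcs)
            slot.active = target
            slot.log.append(("DPR", name))
            return "DPR"
    raise KeyError(f"no bitstream provides {target!r}")
```

The more functions are merged into one accelerator, the more often the fast path applies, at the cost of a larger datapath in the slot.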
  
  Finally, the ARTICo3 framework has been evaluated under a real-world application scenario for high-performance embedded computing: on-board hyperspectral data processing. Remote sensing applications that rely on hyperspectral imaging sensors produce large amounts of data that need to be further processed to extract relevant information. In this regard, there are two alternatives: on the one hand, data can be acquired, compressed on board, and sent to on-Earth processing facilities; on the other hand, data can be directly processed on board. In this Thesis, a run-time adaptive implementation of a lossless hyperspectral data compressor based on the CCSDS 123 standard has been proposed. A novel hardware-friendly data partitioning algorithm compatible with the standard has been applied to properly exploit data-level parallelism. Experimental results show performance and energy efficiency levels comparable to those of alternative solutions available in the literature. The second scenario (i.e., on-board data processing without compression) has also been addressed in this Thesis, proposing a run-time adaptive implementation of a linear unmixing chain for hyperspectral images. This particular implementation relies on a novel hardware-friendly data partitioning and reduction algorithm to ensure proper data-level parallelism and execution scalability, which has been addressed in both single- and multi-FPGA (using a small Ethernet-based computing cluster) contexts.
  
  The main contributions of this Thesis can be classified and summarized as follows:
  
  • [Architecture] A hardware-based processing architecture for adaptive high-performance embedded computing based on run-time tradeoffs between computing performance, energy efficiency, and fault tolerance.
  
  • [Design] An automated design methodology to generate custom reconfigurable multi-accelerator systems from either low-level RTL descriptions or high-level algorithmic descriptions and HLS.
  
  • [Runtime] A runtime library to transparently manage FPGA reconfiguration and computation offloading in multi-accelerator scenarios.
  
  • [Validation] A Dwarf-based characterization and validation strategy for the ARTICo3 framework based on HLS benchmarks.
  
  • [Runtime] A multi-paradigm programming approach for reconfigurable multi-accelerator systems that combines SIMD-like data-parallel execution with transparent hardware/software multithreading.
  
  • [Design] An integrated toolchain to automatically generate reconfigurable multi-accelerator systems from high-level dataflow descriptions.
  
  • [Application] A run-time adaptive FPGA implementation of a low-complexity on-board CCSDS 123 lossless multispectral and hyperspectral compressor with selectable performance and energy efficiency levels.
  
  • [Application] A run-time adaptive and multi-FPGA implementation of a low-complexity on-board linear hyperspectral unmixing chain with selectable performance and energy efficiency levels.
  
  This Thesis is organized in 5 chapters. Chapter 1 discusses the motivation behind the developed work, establishes the main goals to be achieved, and presents a brief overview of the technology background (reconfigurable computing and parallel computing). In Chapter 2, the three main components of the ARTICo3 framework (hardware-based computing architecture, automated toolchain, and runtime library) are presented and validated using custom designs (i.e., in-house VHDL and HLS-based accelerators) and an HLS benchmark suite. The reference abstractions provided by the ARTICo3 framework are extended in Chapter 3 to also support transparent hardware/software multithreading using ReconOS and dataflow-oriented accelerator specification using MDC. Chapter 4 showcases the ARTICo3 framework in two state-of-the-art high-performance embedded computing applications: hyperspectral data compression and hyperspectral linear unmixing. Finally, Chapter 5 closes the Thesis, discussing the conclusions drawn from the developed work, summarizing its main contributions, analyzing its impact with quantitative measurements such as its related publications, and presenting the future lines of work.
