Memscale: scalable memory aggregation for clusters

  • Author: Hector Montaner Mas
  • Thesis supervisors: Holger Fröning (supervisor), Federico Silla Jiménez (supervisor)
  • Defended at: Universitat Politècnica de València (Spain) in 2013
  • Language: English
  • Thesis examination committee: José Francisco Duato Marin (chair), Antonio Robles Martínez (secretary), Francisco J. Alfaro Cortés (member), Tor Skeie (member), Juan Manuel Orduña Huertas (member)
  • Subjects:
  • Links
    • Open-access thesis at: TESEO
  • Abstract
    • This PhD thesis addresses the needs of shared-memory applications with very large memory requirements. The problem is that current computers cannot hold enough memory in a cost-effective way to efficiently execute such memory-hungry applications. This is because, although memory capacity and computing power increase over time, there is always a maximum configuration imposed by hardware manufacturers at any given moment. For example, at the moment of writing this manuscript, the maximum amount of memory in a mainstream x86 server is 2 TB, and this configuration is extremely expensive.

      An alternative architecture, the cluster of computers, offers better scalability: more nodes can be added to the cluster to obtain virtually unlimited aggregated memory (and computing power). However, this architecture has a drawback: processors can only access the memory installed in their own node. The consequence is a set of isolated address spaces. Thus, shared-memory applications cannot be natively executed beyond a single node, and other programming paradigms, like message passing, must be used. Unfortunately, message-passing programming has several problems, such as higher complexity, load imbalance, data replication, and coarse-grained communication.

      Software and hardware distributed shared-memory (DSM) systems avoid the message-passing paradigm by aggregating all the resources in the entire cluster into a single system image. However, other problems arise, such as the communication overhead of software solutions and, in both scenarios, the limited scalability caused by the overhead of the coherency protocol.

      In this PhD we propose a hardware solution, similar to a hardware DSM in the sense that we physically extend the addressing capabilities of the processors. As a result, processors in the cluster have direct access to main memory installed at any node, so the entire memory of the cluster becomes available to a single shared-memory application. However, in a first stage of this PhD, we restrict this feature so that a memory location can only be accessed by a single node. This means that applications can use as much memory in the cluster as needed, but their execution must be restricted to the processors of a single node. This configuration limits which cache memories can hold a copy of a given memory value: after a write operation, only the cache memories in the local node have to be informed. We call this hardware-based system Memscale.
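
      As a purely illustrative sketch (an assumption made for this text, not the actual Memscale address layout), the following C fragment shows one way a global physical address can be split into a home-node identifier and a local offset; the field widths and helper names are invented for the example.

      #include <stdint.h>
      #include <stdio.h>

      /* Hypothetical layout: the upper bits of a global physical address select
       * the home node, the lower bits are the offset inside that node's memory.
       * The field width below is an assumption for illustration only. */
      #define OFFSET_BITS 40u   /* up to 1 TB of memory per node (assumed) */

      static inline uint32_t home_node(uint64_t gpa)    { return (uint32_t)(gpa >> OFFSET_BITS); }
      static inline uint64_t local_offset(uint64_t gpa) { return gpa & ((1ULL << OFFSET_BITS) - 1); }

      int main(void) {
          uint64_t gpa = ((uint64_t)3 << OFFSET_BITS) | 0x1234ULL;   /* node 3, offset 0x1234 */
          printf("node %u, offset 0x%llx\n",
                 (unsigned)home_node(gpa), (unsigned long long)local_offset(gpa));
          return 0;
      }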

      Contrary to the aforementioned alternatives, our system decouples memory from processors. This design is based on the observation that many applications benefit from large memory configurations but cannot make use of a high number of processors. The advantage of this approach is that the coherency protocol does not need to be maintained in the inter-node space. This key characteristic allows a level of scalability not found in traditional shared-memory systems.

      We have implemented our idea in a prototype cluster to prove its feasibility. In this way, we are able to test our proposal with real shared-memory applications and compare its performance with other approaches. When compared to traditional solutions, our system behaves especially well for applications with low memory locality: applications that require random access to the dataset benefit from the fine granularity (cache line) and low latency of our hardware-based system.

      In a second stage, we also explore the possibility of sharing memory among different nodes. In this scenario, we need a mechanism to solve the potential problem of persistent stale copies. In a first approach, we focus on read-only databases and run a MySQL server across the entire cluster. On the one hand, we take advantage of the global memory pool to load tables into main memory and thus achieve low execution times for queries (one order of magnitude faster than SSD-based servers, even with our slow FPGA prototype). On the other hand, the possibility of leveraging every processor in the cluster yields high throughput in terms of queries per time unit, with linear scalability up to the maximum tested configuration (16 nodes). Although this use case does not suffer from the lack of coherency, a mechanism is needed to manage the write operations required in a commercial deployment. We adopted the eventual consistency model as a way of dealing with this issue.
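
      As a toy illustration of what eventual consistency means here (a single-threaded C sketch written for this text, not the mechanism implemented in the thesis), a write is applied to the local copy immediately and only propagated to other copies later, so a remote reader may briefly observe a stale value.

      #include <stdio.h>

      /* Toy, single-threaded model of eventual consistency (illustrative only). */
      static int local_copy  = 0;   /* value as seen by the writing node      */
      static int remote_copy = 0;   /* value as seen by another node          */
      static int pending     = 0;   /* is there an update not yet propagated? */

      static void write_value(int v) { local_copy = v; pending = 1; }
      static void propagate(void)    { if (pending) { remote_copy = local_copy; pending = 0; } }

      int main(void) {
          write_value(42);
          printf("before propagation: local=%d remote=%d (stale)\n", local_copy, remote_copy);
          propagate();
          printf("after propagation:  local=%d remote=%d (converged)\n", local_copy, remote_copy);
          return 0;
      }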

      Additionally, in this work we analyze the use of low-power processors along with our solution. As the use of remote memory inevitably increases the average memory latency in comparison with regular main memory, decreasing the frequency of the processors may not harm the execution time of applications, while yielding substantial savings in power and energy consumption. By halving the processor frequency we save 18% of the energy consumed by the system, while execution time increases by only 2%.
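
      Assuming energy is simply average power times execution time, these two figures imply the reduction in average power draw (a back-of-the-envelope derivation from the numbers above, not a measurement reported in the thesis): since E = P · t, we have E_half / E_full = (P_half / P_full) · (t_half / t_full); with E_half / E_full = 0.82 and t_half / t_full = 1.02, it follows that P_half / P_full ≈ 0.82 / 1.02 ≈ 0.80, i.e., halving the frequency cuts average power by roughly 20% in this setup.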

      Finally, we extend the applicability of our non-coherent shared-memory proposal to general-purpose applications. To do so, we leverage the release consistency model and the synchronization primitives already present in shared-memory code to build the required safety net. We extend the functionality of these primitives, barriers and locks, to behave as points of coherency. In this way, we show how regular parallel code can be directly migrated to a non-coherent environment. Additionally, we adapt the implementation of the lock and barrier algorithms to make the most of the low-latency communication in our system. For instance, the optimized barrier synchronizes 1024 threads in 15 µs, faster than any other implementation reported to the best of our knowledge.
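
      The following C sketch conveys the general idea of a lock acting as a coherency point under release consistency; it is a conceptual illustration written for this text, not the thesis' implementation, and the two cache-maintenance helpers are placeholders for platform-specific operations over the shared region.

      #include <pthread.h>

      /* Placeholders for platform-specific cache maintenance of remote-backed lines. */
      static void writeback_dirty_remote_lines(void)   { /* e.g. flush dirty lines */ }
      static void invalidate_cached_remote_lines(void) { /* e.g. drop stale lines  */ }

      typedef struct { pthread_mutex_t m; } coherency_lock_t;

      /* Acquire: after taking the lock, discard possibly stale cached copies of
       * remote memory so that subsequent reads fetch fresh data. */
      static void coh_lock_acquire(coherency_lock_t *l) {
          pthread_mutex_lock(&l->m);
          invalidate_cached_remote_lines();
      }

      /* Release: before handing the lock over, write local modifications back so
       * that the next acquirer can observe them. */
      static void coh_lock_release(coherency_lock_t *l) {
          writeback_dirty_remote_lines();
          pthread_mutex_unlock(&l->m);
      }

      int main(void) {
          coherency_lock_t l = { PTHREAD_MUTEX_INITIALIZER };
          coh_lock_acquire(&l);
          /* ... critical section touching remote shared memory ... */
          coh_lock_release(&l);
          return 0;
      }

      A barrier can be treated analogously, with the write-back performed before entering the barrier and the invalidation after leaving it.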

      Note that this PhD is the result of a joint development between the Technical University of Valencia (Spain) and the University of Heidelberg (Germany), with funds provided by the PROMETEO program of the Generalitat Valenciana.

