Summary of Correctable and uncorrectable errors using large scale DRAM DIMMs in replacement network servers

Sanghyeon Baeg, Mirza Qasim, Junhyeong Kwon, Tan Li, Nilay Gupta, ShiJie Wen, Satyadev Kolli

  • This paper investigated DRAM DIMM errors using field records from replacement network servers. A large sample of about 40 K DIMMs was collected over a 2.5-year period from 23 different server types. It included DIMMs from three different DRAM manufacturers, with densities between 4 and 128 GB and speeds between 1066 and 2400 Mbps. Errors that occurred during system operation were classified as either correctable errors (CEs) or uncorrectable errors (UEs), based on the error correction code (ECC) schemes built into the servers. Of the collected DIMMs, 24% had recorded errors, where CE-only, UE-only, and UE and CE together comprised 28%, 43%, and 29% of recorded errors, respectively. Since UEs can cause large-scale failures, systems are replaced upon any UE occurrence. Approximately half of the UE-only DIMMs had a single UE. In contrast, many DIMMs had billions of CEs, where a faulty location may be accessed repeatedly. Such drastic differences in UE and CE counts help explain the importance of ECC and error mitigation schemes. Comparative analyses of errors were made across manufacturers and operating speeds. After reasonable adjustments for repeated error counts, failure in time (FIT) differed by up to 38% across manufacturers. Higher-speed DIMMs generally had higher FIT, with 2400 Mbps DIMMs exhibiting 6.7 times the FIT of 1066 Mbps DIMMs.
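
The FIT figures above can be made concrete with a minimal sketch. FIT is conventionally expressed as failures per 10^9 device-hours. The abstract does not say how the authors adjusted for repeated error counts, so the per-DIMM deduplication below is only an assumption chosen for illustration. The function names, the log format, and all numbers are hypothetical and are not taken from the paper.

```python
def count_failing_dimms(error_log):
    """Count each DIMM at most once, however many errors it logged.

    Assumption: one plausible way to adjust for repeated counts, since a
    single faulty location may be hit billions of times.
    error_log is an iterable of (dimm_id, error_type) tuples (hypothetical).
    """
    return len({dimm_id for dimm_id, _ in error_log})


def fit_rate(failures, devices, hours_per_device):
    """Return failures per 10^9 device-hours (the usual FIT definition)."""
    device_hours = devices * hours_per_device
    if device_hours <= 0:
        raise ValueError("device-hours must be positive")
    return failures * 1e9 / device_hours


# Made-up records: DIMM "A" logs two CEs, DIMM "B" logs one UE.
log = [("A", "CE"), ("A", "CE"), ("B", "UE")]
failing = count_failing_dimms(log)            # 2 distinct failing DIMMs
print(fit_rate(failing, 1000, 10000.0))       # 200.0
```

Counting raw error events instead of distinct DIMMs would let one heavily accessed faulty cell dominate the rate, which is why some form of adjustment is needed before comparing manufacturers or speeds.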

