Abstract: Studies on DRAMs in high-performance computing clusters and servers have concluded that memory errors in the field are dominated by hard faults, which cannot be fixed by scrubbing. Our accelerated life testing of embedded DRAM products shows that bits can degrade gradually. We propose improving system availability by performing in-field repair at the chip level. One page at a time, user data is copied to a temporary page, and the original page is then stress tested with a long effective refresh time. Degrading memory cells are detected and repaired before any errors occur. Our 576 Mb embedded DRAM, running at 1.5 GHz in a 40 nm CMOS technology with 8 metal layers, achieves improved resilience to both aging memory cells and cells with variable retention time (VRT). Uninterrupted user access of 6 billion 72-bit read and write operations per second is maintained during background repair.
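
The page-by-page procedure described above can be sketched as a small software simulation. This is only an illustrative model, not the chip's actual implementation: the function names, the retention and refresh values, and the spare-cell repair mechanism are all assumptions introduced here for illustration.

```python
# Hypothetical parameters, in arbitrary time units (not taken from the paper).
NORMAL_REFRESH = 1.0      # refresh interval during normal operation
STRESS_REFRESH = 4.0      # longer effective refresh time used only while testing
HEALTHY_RETENTION = 100.0 # retention assumed for a spare (replacement) cell


def stress_test(retention, stress_refresh):
    """Return indices of cells whose retention is shorter than the stress refresh time."""
    return [i for i, r in enumerate(retention) if r < stress_refresh]


def repair_pass(pages, retention, spares):
    """Walk memory one page at a time, testing and repairing weak cells.

    pages     -- list of pages, each a list of user words
    retention -- parallel list of per-cell retention times (assumed model)
    spares    -- number of spare cells available (assumed repair resource)
    Returns the list of (page, cell) repairs made and the spares left.
    """
    repaired = []
    for p in range(len(pages)):
        # Copy user data to a temporary page; in this model, user accesses
        # would be served from the copy while the original page is tested.
        temp = list(pages[p])

        # Stress the original page with the long effective refresh time.
        weak = stress_test(retention[p], STRESS_REFRESH)

        # Repair each weak cell by swapping in a spare, while spares remain.
        for i in weak:
            if spares == 0:
                break
            retention[p][i] = HEALTHY_RETENTION
            spares -= 1
            repaired.append((p, i))

        # Copy the user data back to the tested (and repaired) page.
        pages[p] = temp
    return repaired, spares
```

In this model, a cell with retention between `NORMAL_REFRESH` and `STRESS_REFRESH` still holds data correctly in normal operation, so it is caught and replaced before it causes any user-visible error.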

Conference proceedings to be available soon at IEEE Xplore.