A New Method of Dynamic Reliability Management for Chip Multi-Processors
Abstract
This work presents a new dynamic reliability management controller that extends the expected lifetime of Chip Multi-Processors (CMPs). It does so by migrating tasks within the CMP, which reduces core wear and temperature. Although migration decreases performance, the results show that the performance penalty stays below 10% while the expected lifetime increases by more than 30%. Lifetime is estimated by using a full-system simulator to obtain execution, power and temperature traces, and then feeding this data to the REliability eSTimation (REST) tool. REST uses a Monte Carlo-based algorithm to estimate the Mean Time To Failure (MTTF) of the CMP according to the aging mechanisms that affect its transistors.
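
To illustrate the idea of a Monte Carlo MTTF estimate, the following is a minimal sketch and not the REST algorithm itself. It assumes, purely for illustration, that each core's time to failure follows a Weibull distribution and that the CMP fails as soon as its first core fails. The function name `estimate_mttf`, the `core_scales` parameter and all numeric values are hypothetical; in a real flow the per-core parameters would be derived from the power and temperature traces.

```python
import random


def estimate_mttf(core_scales, shape=2.0, trials=20000, seed=0):
    """Monte Carlo estimate of system MTTF (arbitrary time units).

    Assumptions (illustrative only, not taken from REST):
      - each core's lifetime is Weibull(scale, shape) distributed;
      - the CMP fails when its first core fails (series system).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Draw one failure time per core; the earliest one ends the trial.
        # random.weibullvariate(alpha, beta): alpha is scale, beta is shape.
        total += min(rng.weibullvariate(scale, shape) for scale in core_scales)
    return total / trials


# Hypothetical comparison: one heavily stressed core versus evenly spread wear.
unbalanced = estimate_mttf([5.0, 10.0, 10.0, 10.0])
balanced = estimate_mttf([8.0, 8.0, 8.0, 8.0])
print(f"unbalanced MTTF estimate: {unbalanced:.2f}")
print(f"balanced MTTF estimate:   {balanced:.2f}")
```

Under these assumed parameters, spreading wear evenly yields a higher estimated MTTF than leaving one core heavily stressed, which is the intuition behind using task migration to extend lifetime.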