Fault Tolerant and Adaptive Systems with Focus on Networks-On-Chips
Abstract
The first step in designing a reliable system is being able to evaluate its reliability. This step matters because design techniques that proactively improve the lifetime reliability of systems on chip (SoCs) require some form of redundancy, and the cost of that redundancy (area, power, and design-time overheads) makes it impractical to build systems that are resilient to every type of failure mechanism. To address this problem, we developed an accurate reliability evaluation algorithm that identifies the vulnerable subblocks of a system. The proposed evaluation methodology can also be used to develop a new lifetime-aware floorplanning strategy that identifies the most reliable floorplan for a given design. We consider this an essential step toward a design approach in which reliability is a primary objective.
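To make the idea of locating vulnerable subblocks concrete, the following is a minimal sketch and not the evaluation algorithm developed in this work. It assumes a series system in which every subblock has an exponentially distributed lifetime, and the subblock names and failure rates (in FIT, failures per 10^9 hours) are hypothetical values chosen for illustration.

```python
import math

# Hypothetical per-subblock failure rates in FIT (failures per 1e9 hours).
# These numbers are illustrative assumptions, not measured data.
FAILURE_RATES_FIT = {
    "alu": 120.0,
    "register_file": 80.0,
    "l1_cache": 300.0,
    "router": 150.0,
}


def system_reliability(rates_fit, hours):
    """Reliability of a series system at time `hours`.

    Assumes independent, exponentially distributed subblock lifetimes,
    so the failure rates simply add up.
    """
    total_rate_per_hour = sum(rates_fit.values()) / 1e9
    return math.exp(-total_rate_per_hour * hours)


def system_mttf_hours(rates_fit):
    """Mean time to failure of the series system, in hours."""
    return 1e9 / sum(rates_fit.values())


def rank_vulnerable(rates_fit):
    """Subblocks with their share of the total failure rate, highest first."""
    total = sum(rates_fit.values())
    shares = ((name, rate / total) for name, rate in rates_fit.items())
    return sorted(shares, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, share in rank_vulnerable(FAILURE_RATES_FIT):
        print(f"{name}: {share:.1%} of total failure rate")
    print(f"MTTF: {system_mttf_hours(FAILURE_RATES_FIT):.0f} hours")
```

Under these assumptions, the subblock at the top of the ranking is the first candidate for redundancy, since it contributes most to the system failure rate. A realistic evaluation would replace the constant rates with models of the individual failure mechanisms.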
Recent advances in CMOS technology and the integration of multiple processing elements on a single chip have also made on-chip communication a challenge in the design of multiprocessor SoCs (MPSoCs). Networks on chip (NoCs) have been introduced as a communication medium in response to the rising need for a new communication structure for MPSoCs. While the NoC has proven to be an efficient communication structure for SoCs, the same failure mechanisms and processing faults that adversely affect processing elements can also render an NoC inoperable. To address this challenge, we proposed a new multi-layered reliable design methodology for NoCs: a hybrid solution composed of multiple layers of fault-tolerant design techniques. The proposed structure addresses hard failures across three levels of abstraction. In the first layer (the software layer), a reliability-aware mapping algorithm assigns application tasks to the NoC such that network reliability is improved. In the second and third layers (the architecture and network-routing layers), we design an NoC architecture that uses self-repairable links and distributed routing. Combining these techniques in the proposed layered approach helps provide better performance and a better tradeoff between the improvement in reliability and the cost of the required redundancy and extra logic.
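The software-layer idea of reliability-aware mapping can be illustrated with a small sketch. This is not the mapping algorithm proposed in this work. It assumes a 2x2 mesh, dimension-ordered (XY) routing, an exhaustive search over placements, and hypothetical link reliabilities and traffic volumes. The score of a mapping is the traffic-weighted average reliability of the paths its flows use.

```python
from itertools import permutations

MESH_W, MESH_H = 2, 2  # assumed 2x2 mesh of tiles addressed as (x, y)

# Hypothetical reliabilities of the undirected mesh links (illustrative only).
LINK_RELIABILITY = {
    frozenset({(0, 0), (1, 0)}): 0.99,
    frozenset({(0, 1), (1, 1)}): 0.90,
    frozenset({(0, 0), (0, 1)}): 0.95,
    frozenset({(1, 0), (1, 1)}): 0.98,
}

# Hypothetical communication volumes between application tasks.
TRAFFIC = {("t0", "t1"): 100, ("t1", "t2"): 10, ("t2", "t3"): 50}


def xy_route(src, dst):
    """Links traversed by XY routing: move along x first, then along y."""
    links = []
    x, y = src
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        links.append(frozenset({(x, y), (nx, y)}))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        links.append(frozenset({(x, y), (x, ny)}))
        y = ny
    return links


def path_reliability(src, dst):
    """Probability that every link on the XY path works (independent links)."""
    reliability = 1.0
    for link in xy_route(src, dst):
        reliability *= LINK_RELIABILITY[link]
    return reliability


def mapping_score(mapping):
    """Traffic-weighted average path reliability of a task-to-tile mapping."""
    total_volume = sum(TRAFFIC.values())
    weighted = sum(
        volume * path_reliability(mapping[a], mapping[b])
        for (a, b), volume in TRAFFIC.items()
    )
    return weighted / total_volume


def best_mapping(tasks):
    """Exhaustively search all placements; feasible only for tiny meshes."""
    tiles = [(x, y) for y in range(MESH_H) for x in range(MESH_W)]
    candidates = (
        dict(zip(tasks, placement))
        for placement in permutations(tiles, len(tasks))
    )
    return max(candidates, key=mapping_score)


if __name__ == "__main__":
    mapping = best_mapping(["t0", "t1", "t2", "t3"])
    print(mapping, round(mapping_score(mapping), 5))
```

With these example values, the search places the heaviest flows on the more reliable links and leaves the least reliable link unused. Exhaustive search grows factorially with the number of tiles, so a realistic mapper would need a heuristic or an optimization formulation instead.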