NSTC2007

Title of Presentation: Fault Tolerant Microprocessors for Space Missions

Primary (Corresponding) Author: Daniel Sorin

Organization of Primary Author: Duke University

Abstract: In this project, we are developing microprocessors that can operate autonomously in the presence of permanent physical faults. NASA relies on microprocessors to control critical equipment, and these microprocessors must work correctly for long stretches of time without the possibility of user repair. With smaller transistors, smaller wires, and faster clocks, the incidence of permanent faults is increasing, and NASA's microprocessors must tolerate them. Moreover, due to cost and power constraints, this fault tolerance cannot be achieved by simply replicating all of the microprocessors.

This research has followed four primary thrusts. First, we designed a low-cost mechanism for diagnosing permanent faults in various components of microprocessors, including functional units (adders, multipliers, etc.) and entries in the tables that coordinate the scheduling of instructions. Our diagnosis mechanism adds little hardware and power overhead to the baseline microprocessor, and it enables the microprocessor to deconfigure faulty components. Second, we developed the first mechanism for diagnosing wear-out in microprocessor components. Our mechanism detects "delay faults", which occur when the correct value is computed but takes too long to compute. Delay faults often indicate the beginning of wear-out, and we want to diagnose them before they become catastrophic failures. Third, we have designed a fault-tolerant, reconfigurable multiplier. Because most microprocessors have only one multiplier, deconfiguring it when it contains a permanent fault is not acceptable. Because multipliers are large and consume a lot of power, it is often infeasible to replicate them. Thus, we created a class of multipliers that can detect errors, diagnose the locations of the underlying faults, and reconfigure around these faults. Fourth, we proposed a new metric for comparing different fault-tolerant designs. There has been considerable research into appropriate metrics that apply to transient errors, and we extended one of these ("architectural vulnerability factor") to also apply to permanent faults. This metric provides considerably more insight into the relative values of competing fault-tolerant design options.
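
To make the notion of a delay fault concrete, the following is a minimal illustrative sketch, not the mechanism developed in this project (the abstract does not describe the detection circuit). It assumes a generic double-sampling scheme: the output of a unit is captured once at the clock edge and once again a short margin later, and a disagreement suggests that the correct value arrived late. All names (UnitOutput, sample, has_delay_fault) and the timing numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UnitOutput:
    """Hypothetical model of one operation's output on a functional unit."""
    correct_value: int      # value the logic eventually settles to
    stale_value: int        # value present on the output before settling
    settle_time_ns: float   # time at which the output becomes correct


def sample(output: UnitOutput, at_time_ns: float) -> int:
    """Return what a latch would capture if it sampled at the given time."""
    if at_time_ns >= output.settle_time_ns:
        return output.correct_value
    return output.stale_value


def has_delay_fault(output: UnitOutput, clock_period_ns: float, margin_ns: float) -> bool:
    """Flag a delay fault when the clock-edge sample and a later sample disagree.

    Assumption: the output settles within clock_period_ns + margin_ns.
    A delay longer than the margin would go unnoticed by this simple check.
    """
    main = sample(output, clock_period_ns)
    shadow = sample(output, clock_period_ns + margin_ns)
    return main != shadow


# A healthy unit settles before the clock edge; a worn unit settles just after it.
healthy = UnitOutput(correct_value=42, stale_value=7, settle_time_ns=0.8)
worn = UnitOutput(correct_value=42, stale_value=7, settle_time_ns=1.1)
```

In this model, calling has_delay_fault(healthy, 1.0, 0.3) reports no fault, while has_delay_fault(worn, 1.0, 0.3) reports one, because the worn unit still computes the correct value but only after the clock edge has passed.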