Fault-tolerant computer for reconfigurable hardware devices

Inventors

LaMeres, Brock J.; Major, Christopher Michel

Assignees

Resilient Computing LLC; Montana State University Bozeman

Publication Number

US-11966284-B1

Publication Date

2024-04-23

Expiration Date

2043-10-10

Abstract

A fault-tolerant computer system includes a plurality of processors configured to simultaneously execute identical sets of processor-executable instructions, each of the plurality of processors containing a processor core including one or more registers and a local memory, an arbiter configured to read each of the registers of the plurality of processors, detect incorrect register values, and overwrite the registers containing the incorrect register values with corrected register values, and a memory scrubber configured to read each address of the local memories of the plurality of processors, detect incorrect memory values, and overwrite addresses containing the incorrect memory values with corrected memory values. In various embodiments, the computer system may be implemented using one or more field programmable gate arrays (FPGAs).

Core Innovation

The invention relates to a fault-tolerant computer system comprising multiple processors, each with its own core, registers, and local memory, which execute identical instructions in parallel. The system includes an arbiter that monitors all processor registers, detects incorrect values, and overwrites any erroneous registers with corrected values. Additionally, a memory scrubber reads each address in the processors’ local memories, detects incorrect memory values, and overwrites addresses containing errors with the correct data. This approach can be implemented using field programmable gate arrays (FPGAs) and extends resilience mechanisms to both processor registers and memory.

The problem addressed is the vulnerability of computer systems in space and other high-radiation environments to radiation-induced failures, specifically total ionizing dose (TID) and single event effects (SEE), including single event upsets (SEU) and single event functional interrupts (SEFI). Existing solutions, such as radiation hardening by process or design, redundancy strategies like triple modular redundancy (TMR), and error-correcting codes or memory scrubbing, each have limitations related to cost, complexity, resource consumption, or inability to fully prevent system failure under accumulated faults.

The disclosed system uses a combination of hardware-level redundancy (four-modular redundancy), backdoor access for error correction, and continuous background scrubbing. The arbiter employs majority voting to correct register errors rapidly, within a single instruction cycle. The memory scrubber continuously iterates through memory addresses to detect and repair faults as a background process, offering resilience without significant interruption to the processors’ operations.
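The register arbitration described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the patent describes a hardware arbiter (for example on an FPGA), while the code below models register files as Python lists. The names `vote` and `arbitrate` are hypothetical, and the choice to raise an error when no strict majority exists is an assumption, not something the summary specifies.

```python
from collections import Counter


def vote(values):
    """Return (majority_value, indices_that_disagree).

    Raises ValueError when no value is held by a strict majority
    (assumed behavior; the source does not say what happens then).
    """
    winner, count = Counter(values).most_common(1)[0]
    if count * 2 <= len(values):
        raise ValueError("no strict majority among redundant copies")
    bad = [i for i, v in enumerate(values) if v != winner]
    return winner, bad


def arbitrate(register_files):
    """Compare the same register across all processors and repair outliers.

    register_files: one list of register values per processor (hypothetical
    software stand-in for the hardware registers). Modified in place.
    Returns a list of (processor_index, register_index) pairs that were fixed.
    """
    repaired = []
    for reg in range(len(register_files[0])):
        values = [rf[reg] for rf in register_files]
        winner, bad = vote(values)
        for proc in bad:
            # Overwrite the incorrect register with the majority value.
            register_files[proc][reg] = winner
            repaired.append((proc, reg))
    return repaired
```

Calling `arbitrate` on three or four register files where one copy holds a flipped value overwrites that copy and reports which processor and register were repaired.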

Claims Coverage

There are two independent claims in the patent, covering a fault-tolerant computer system and a method of operating such a system. The main inventive features are summarized below.

Fault-tolerant computer system with redundant processors, arbiter, and memory scrubber

The system includes a plurality of processors, each having a processor core with one or more registers and a local memory. All processors simultaneously execute identical sets of processor-executable instructions.

An arbiter is configured to:
- Read each register of the plurality of processors
- Detect incorrect register values
- Overwrite registers with incorrect values using corrected values based on majority voting

A memory scrubber is configured to:
- Read each address in the local memories of the processors
- Detect incorrect memory values
- Overwrite addresses containing incorrect values with corrected values from the majority

Method for operating a fault-tolerant computer system with register and memory correction

The method includes the following main steps:
1. Simultaneously executing identical sets of processor-executable instructions on a plurality of processors, each comprising a processor core with registers and a local memory.
2. Reading each of the processors’ registers, detecting incorrect register values, and overwriting incorrect registers with corrected values.
3. Reading each address in the processors’ local memories, detecting incorrect memory values, and overwriting addresses containing the incorrect memory values with corrected values.
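The memory-correction step can be sketched in the same spirit. The model below is a rough illustration, not the patented implementation: it assumes the scrubber checks one address per call so that it can be interleaved with normal execution as a background task, and that it wraps around to address 0 after the last address. The class name `MemoryScrubber`, the `step` method, and the decision to leave an address untouched when no strict majority exists are all hypothetical.

```python
from collections import Counter


class MemoryScrubber:
    """Software model of a scrubber walking redundant local memories."""

    def __init__(self, memories):
        # memories: one list of words per processor, all the same length
        # (hypothetical stand-in for the processors' local memories).
        self.memories = memories
        self.size = len(memories[0])
        self.address = 0  # next address to check

    def step(self):
        """Check one address, repair minority copies, advance the pointer.

        Returns (address_checked, processor_indices_repaired).
        """
        addr = self.address
        words = [mem[addr] for mem in self.memories]
        winner, count = Counter(words).most_common(1)[0]
        fixed = []
        if count * 2 > len(words):  # only repair on a strict majority
            for proc, word in enumerate(words):
                if word != winner:
                    self.memories[proc][addr] = winner
                    fixed.append(proc)
        # Wrap around so scrubbing runs continuously.
        self.address = (addr + 1) % self.size
        return addr, fixed
```

Calling `step` repeatedly sweeps the whole address range, so a corrupted word is repaired at the latest one full pass after it appears.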

In summary, the inventive features reside in a fault-tolerant architecture enabling majority-voting-based detection and correction of errors in both processor registers and memory, as well as a method implementing these corrections during parallel execution of identical instructions across redundant processors.

Stated Advantages

Provides increased reliability and radiation resistance for computing systems in high-radiation environments by combining redundant processors, register arbitration, and continuous memory scrubbing.

Offers high-performance, radiation-resistant computing suitable for spacecraft, supporting improved mean time to failure and reliability compared to previous triple modular redundancy systems.

Allows rapid error detection and correction in processor registers within a single instruction cycle, minimizing interruption and repair time.

Supports background memory scrubbing, mitigating memory errors without significant impact on processor performance.

Enables partial and full reconfiguration of processor tiles (in FPGAs), allowing continued operation while repairs are made, unlike systems requiring full resets.

Documented Applications

Computing systems for space exploration missions, addressing demands for high reliability and performance in harsh radiation environments.

Aerospace hardware for spacecraft, including applications such as high-altitude balloons, sounding rockets, CubeSats, International Space Station missions, and lunar surface demonstrations.
