The Wide-band Advanced Recorder Processor (WARP) went into an anomalous state at the end of the day on June 21, 2001. The WARP was reporting Error Detection and Correction (EDAC) uncorrectable errors which continued until a reset of the WARP was implemented on the afternoon of June 29, 2001. The errors went undetected by the Flight Operations Team because limit checking on the EDAC uncorrectable counter had been eliminated. The presence of uncorrectable errors on the WARP caused the science data taken during the anomaly to be mostly unusable.
Early on June 29, WARP engineers came up with the following preliminary diagnosis:
- The errors occur only on Memory Card #2 (outermost card).
- The errors occur on all 6 of the 4-Mbit arrays.
- The errors are recurrent.
- There are over 200,000 errors per playback set (about 20% of the data).
- The errors appear to occur in 80-byte blocks.
The engineers prepared an operations instruction to read the WARP Memory Mask register. This test showed that there were no signs of corruption in the register value. Next they ran a WARP Memory Built-In-Test on Memory Card #2 in the range mode. During the course of this test, the WARP memory is reformatted. After this, they ran the DCE Self-Test (RS-422 Card Data Injection) which generates card test data. It then became evident that the WARP problem had disappeared and the WARP had returned to a nominal state with no uncorrectable errors. The WARP team filled the entire memory (48 Gbits) and monitored the EDAC errors on playback to prove the return to nominal state.
The WARP team attributed the cause of the problem to a stuck bit within a state machine inside one of the memory boards. The team reasoned that the problem did not appear to be hardware related, so if the problem occurred again, it would probably not occur in the same way or the same location. In fact, there are no mechanisms within the WARP to isolate the state of state machines at failure, so it would be almost impossible to plan a diagnostic set of dumps to be taken in the event of another failure.
Fortunately, the problem did not reoccur.
An internal review of all the WARP telemetry parameters and limit settings was conducted to ensure that the correct parameters were in fact being monitored with the right limit values. A limit of 1 was instituted for the uncorrectable EDAC error counter (i.e., if the value = 1, the limit is violated and reported to the console operators).
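The limit check described above can be illustrated with a minimal sketch. It is not the ground system's actual implementation: the class, the function, and the mnemonic WARP_EDAC_UNCORR_CNT are hypothetical names introduced here, and only the limit value of 1 comes from the text. The sketch also assumes that any count at or above the limit is a violation, so that a counter which jumps past 1 between samples is still reported.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LimitCheck:
    """A high-limit check on one telemetry parameter (hypothetical structure)."""
    mnemonic: str      # hypothetical parameter name
    high_limit: int    # value at or above which the limit is violated

    def is_violated(self, value: int) -> bool:
        # Assumption: any count at or above the limit is a violation,
        # so a counter that skips past 1 between samples is still caught.
        return value >= self.high_limit


def check_samples(check: LimitCheck, samples: list[int]) -> list[str]:
    """Return one console message per sample that violates the limit."""
    return [
        f"LIMIT VIOLATION {check.mnemonic}: value={v} limit={check.high_limit}"
        for v in samples
        if check.is_violated(v)
    ]


# Hypothetical mnemonic; the limit of 1 is the value stated in the text.
EDAC_UNCORRECTABLE = LimitCheck(mnemonic="WARP_EDAC_UNCORR_CNT", high_limit=1)

if __name__ == "__main__":
    # Simulated counter samples: nominal, nominal, first error, more errors.
    for message in check_samples(EDAC_UNCORRECTABLE, [0, 0, 1, 5]):
        print(message)
```

With a limit of 1, the first uncorrectable error raises a console message, which is the behavior that was missing while limit checking on this counter was disabled.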
It was originally thought that a routine reformat of the WARP memory should be performed as a precaution (either once per day or after every sequence of predetermined DCEs). However, this would preclude troubleshooting if the error reoccurred. The nominal operational mode used since early in the mission, deletion of files after downlink from the ATS load, was therefore retained.
After the anomaly occurred, the WARP was reformatted three times: once during the ACE safehold anomaly (9/14/01); once when an operator error resulted in entry into low power mode by the WARP internal protection mechanisms; and once for the weekend shutdown surrounding the Leonids meteor shower. The operator error was documented in a Root Cause and Corrective Action report and resolved by additional training.
A full accounting of this anomaly is contained in the WARP Anomaly Resolution Report.