Tejas' readings: Algorithm based Fault Tolerance

Algorithm-based fault tolerance (ABFT) is an error detection, location and correction scheme which uses redundant computations within the algorithms to detect and correct errors caused by permanent or transient failures in the hardware, concurrently with normal operation.

The key idea of the ABFT technique is to encode the data at a higher level using checksum schemes and redesign algorithms to operate on the encoded data. ABFT can be used to detect and correct errors with a low overhead. ABFT is generally not applicable to all problems.

Systematic way of converting a concurrent computation into a ABFT

In an ABFT system, the input data is encoded. The algorithm is modified to work on the modified input. The computation steps of the algorithm are the distributed among the computation units. It is important to allocate the computational steps on different processors so that an error in one processor affects as few data elements as possible. The encoded output produced by the algorithm is decoded to check the correctness of the results. An ABFT system uses redundant computations within the algorithm to detect and correct errors caused by permanent or transient failures. ABFT is not a general mechanism as other approaches, it varies from algorithm to algorithm. This is captured in the figure below.

ABFT system uses redundant computations within the algorithm to detect and correct errors caused by permanent or transient failures in the hardware. It is not a general approach and varies from algorithm to algorithm. When the modified algorithm is actually executed on a multiprocessor architecture, the overheads are required to be minimum in comparison to TMR. The modified algorithm takes more time to operate on the encoded data when compared to the original algorithm.

ABFT approach for matrix operations

Let PE_i indicate a processing element in the system.

DS(i) is the set of data items affected by PEi. System has checks C₁, C₂, .... , C_p which check the correctness of the data produced by the PEs.

A check C_khas a associated error set ES(k) indicating the data elements checked by C_k.

A (g,h) check, where g is the cardinality of the error set, is evaluated as:

· pass, output = 0 if no error in data items

· failed, output = 1 if more than h errors

· unpredictable if there are more than h errors in the system.

If A is a n*n matrix, we define row checksum matrix (A_r) and column checksum matrix (A_c) as:

A_r is a n*(n+1) matrix = [A Ae]

e, is a n *1 vector, = [1 1 ...... 1]^T

A_c =

	A
	Ae^T

where Ae^Tis the column summation vector. The full checksum matrix A_f is given as:

	A	Ae
	Ae^T	e^TAe

The exact location of one error can be determined by intersecting the inconsistent rows and columns. The use of either row checksum/column checksum matrices makes the system 1-error detectable. A full checksum matrix system is also 1-error detectable.

Tejas' readings

Sunday, March 24, 2013

Algorithm based Fault Tolerance

No comments:

Post a Comment