Sunday, March 24, 2013

Algorithm based Fault Tolerance

Algorithm-based  fault  tolerance  (ABFT)  is  an  error detection,  location  and  correction  scheme  which  uses redundant  computations  within  the  algorithms  to  detect and  correct  errors  caused  by  permanent  or  transient  failures in the hardware, concurrently with normal operation.

The key idea of the ABFT technique is to encode the data at a higher level using checksum schemes and redesign algorithms to operate on the encoded data. ABFT can be used to detect and correct errors with a low overhead. ABFT is generally not applicable to all problems.

Systematic way of converting a concurrent computation into a ABFT

In  an  ABFT  system,  the  input  data  is  encoded.  The  algorithm  is  modified to  work on the  modified input.  The  computation  steps  of the  algorithm  are  the distributed  among the  computation  units.  It  is  important  to allocate  the  computational  steps  on  different  processors  so  that  an  error  in one  processor affects  as  few  data  elements as  possible.  The encoded  output  produced  by  the  algorithm is  decoded  to check  the  correctness  of the  results. An  ABFT  system  uses  redundant  computations  within  the algorithm  to  detect  and  correct errors  caused  by  permanent or  transient  failures.  ABFT  is  not  a  general  mechanism  as  other approaches,  it  varies  from  algorithm  to algorithm. This is captured in the figure below.

ABFT  system  uses  redundant computations  within  the algorithm  to  detect  and  correct errors  caused  by  permanent or  transient  failures  in  the  hardware.  It  is not a general approach and varies  from  algorithm  to  algorithm.  When  the  modified  algorithm  is  actually executed on  a multiprocessor architecture,  the  overheads  are  required  to be  minimum  in  comparison  to  TMR.  The  modified  algorithm  takes  more  time  to  operate  on  the encoded  data  when  compared  to  the  original  algorithm. 


ABFT approach for matrix operations

Let PE­i indicate a processing element in the system.
DS(i) is the set of data items affected by PEi. System has checks C1, C2, .... , Cp which check the correctness of the data produced by the PEs.
A check Ck has a associated error set ES(k) indicating the data elements checked by Ck.

A (g,h) check, where g is the cardinality of the error set,  is evaluated as:
·         pass, output = 0 if no error in data items
·         failed, output = 1 if more than h errors
·         unpredictable if there are more than h errors in the system.

If A is a n*n matrix, we define row checksum matrix (Ar) and column checksum matrix (Ac) as:
Ar is a n*(n+1) matrix = [A Ae]
e, is a n *1 vector, = [1 1 ...... 1]T

A =

A


AeT

   



where AeT is the column summation vector. The full checksum matrix Af is given as:


A
Ae


AeT
eTAe


The exact location of one error can be determined by intersecting the inconsistent rows and columns. The use of either row  checksum/column checksum matrices makes  the system 1-error detectable. A full checksum matrix system is also 1-error detectable. 

No comments:

Post a Comment