The key idea of the
ABFT technique is to encode the data at a higher level using checksum schemes
and redesign algorithms to operate on the encoded data. ABFT can be used to
detect and correct errors with a low overhead. ABFT is generally not applicable
to all problems.
Systematic way of
converting a concurrent computation into a ABFT
In an ABFT
system, the input
data is encoded.
The algorithm is
modified to work on the modified input. The
computation steps of the
algorithm are the distributed among the
computation units. It
is important to allocate
the computational steps
on different processors
so that an
error in one processor affects as
few data elements as
possible. The encoded output
produced by the
algorithm is decoded to check
the correctness of the
results. An ABFT system
uses redundant computations
within the algorithm to
detect and correct errors caused
by permanent or transient
failures. ABFT is
not a general
mechanism as other approaches, it
varies from algorithm
to algorithm. This is captured in the figure below.
ABFT system uses
redundant computations
within the algorithm to
detect and correct errors caused
by permanent or transient
failures in the
hardware. It is not a general approach and varies from
algorithm to algorithm.
When the modified
algorithm is actually executed on a multiprocessor architecture, the
overheads are required
to be minimum in
comparison to TMR.
The modified algorithm
takes more time
to operate on the
encoded data when
compared to the
original algorithm.
ABFT approach for
matrix operations
Let PEi
indicate a processing element in the system.
DS(i) is the
set of data items affected by PEi. System has checks C1, C2, .... , Cp which
check the correctness of the data produced by the PEs.
A check Ck has a associated error set ES(k) indicating the data
elements checked by Ck.
A (g,h) check, where g is the cardinality of the error set, is evaluated as:
·
pass, output = 0 if no error in data items
·
failed, output = 1 if more than h errors
·
unpredictable if there are more than h errors in
the system.
If A is a n*n matrix, we define row checksum matrix (Ar) and
column checksum matrix (Ac)
as:
Ar is a n*(n+1) matrix =
[A Ae]
e, is a n *1 vector, = [1 1 ...... 1]T
Ac =
A
| ||
AeT
|
where AeT is the column summation vector. The full checksum matrix Af is given as:
A
|
Ae
|
||
AeT
|
eTAe
|
The exact location of one error can be determined by
intersecting the inconsistent rows and columns. The use of either row checksum/column checksum matrices makes the system 1-error detectable. A full
checksum matrix system is also 1-error detectable.
No comments:
Post a Comment