
Correcting Column Erasure and Element Error

In document Coding for Information Storage (Page 103-106)


6.3 Correcting Column Erasure and Element Error

In array codes for storage systems, data is arranged in a 2D array. Each column of the array is typically stored on a separate disk and is called a node, and each entry of the array is called an element. In the conventional error model, a disk failure corresponds to an erasure or an error of an entire node. Therefore, array codes are usually designed to correct such entire node failures.

However, in other applications, such as when flash memory is used for the storage nodes, element errors are also possible. In other words, we may encounter only a few errors within a column in addition to entire node erasures. For an MDS array code with two parities, the minimum Hamming distance is 3, so it is not possible to correct a node erasure and a node error at the same time. However, since the zigzag code has very long columns, we ask: is it capable of correcting a node erasure and some element errors?

Given a $(k+2, k)$ zigzag code generated by distinct binary vectors $T = \{v_0, v_1, \dots, v_{k-1}\}$, the following algorithm corrects a node erasure and an element error. Here we assume that the erasure and the error are in different columns, and that there is only a single element error, located in the systematic part of the array. The code has two parities and $2^m$ rows, and the zigzag permutations are $f_j = v_j$, $j \in [0, k-1]$, that is, $f_j$ maps row $x$ to row $x + v_j$. The original array is denoted by $(a_{i,j})$ and the erroneous array by $(\hat{a}_{i,j})$. The row coefficients are all ones, and the zigzag coefficients are $\beta_{i,j}$. Let $x_0, x_1, \dots, x_{p-1} \in \mathbb{F}$. Denote

$$f^{-1}(x_0, x_1, \dots, x_{p-1}) = (x_{f^{-1}(0)}, x_{f^{-1}(1)}, \dots, x_{f^{-1}(p-1)})$$

for a permutation $f$ on $[0, p-1]$.

Algorithm 6.5 Suppose column $t$ is erased, and there is at most one element error in the remaining array. Compute for all $i \in [0, 2^m-1]$ the syndromes:

$$s_{i,0} = \sum_{j \neq t} \hat{a}_{i,j} - r_i,$$

$$s_{i,1} = \sum_{j \neq t} \beta_{f_j^{-1}(i),j}\, \hat{a}_{f_j^{-1}(i),j} - z_i.$$

Let the syndromes be $S_0 = (s_{0,0}, s_{1,0}, \dots, s_{2^m-1,0})$ and $S_1 = (s_{0,1}, s_{1,1}, \dots, s_{2^m-1,1})$.

Compute $x_i = \beta_{i,t} s_{i,0}$ for all $i \in [0, 2^m-1]$. Let $X = (x_0, \dots, x_{2^m-1})$, $Y = f_t^{-1}(S_1)$, and $W = X - Y$.

- If $W = 0$, there is no element error. Assign column $t$ as $-S_0$.
- Else, there will be two rows $r, r'$ such that $w_r, w_{r'}$ are nonzero. Find $j$ such that $v_j = r + r' + v_t$. The error is in column $j$.
  - If $\frac{w_r}{w_{r'}} = -\frac{\beta_{r,t}}{\beta_{r,j}}$, then the error is at row $r$, and assign $a_{r,j} = \hat{a}_{r,j} - \frac{w_r}{\beta_{r,t}}$.
  - Else if $\frac{w_r}{w_{r'}} = -\frac{\beta_{r',j}}{\beta_{r',t}}$, then the error is at row $r'$, and assign $a_{r',j} = \hat{a}_{r',j} - \frac{w_{r'}}{\beta_{r',t}}$.
  - Else there is more than one error.
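The steps of Algorithm 6.5 can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: it hard-codes the $(5, 3)$ code of Figure 6.1 over $\mathbb{F}_3$, and the names `V`, `BETA`, `encode` and `correct` are hypothetical. Two assumptions are made that the algorithm statement leaves implicit: row indices are treated as bit masks so that $f_j(x) = x + v_j$ becomes an XOR, and after an element error is fixed the erased column is recomputed from the row parities.

```python
P = 3  # all arithmetic is in the prime field F_3

# Assumed encoding of Figure 6.1: v_0 = 0, v_1 = e_1 = (1,0), v_2 = e_2 = (0,1),
# stored as bit masks so that f_j(x) = x XOR V[j] (each f_j is its own inverse).
V = [0b00, 0b10, 0b01]
# BETA[i][j]: zigzag coefficient of the element in row i, column j (read off Figure 6.1).
BETA = [[1, 1, 1], [1, 1, 2], [1, 2, 2], [1, 2, 1]]
ROWS, K = len(BETA), len(V)


def inv(x):
    """Multiplicative inverse of a nonzero x in F_P (Fermat's little theorem)."""
    return pow(x, P - 2, P)


def encode(a):
    """Return the row parities r and zigzag parities z of the systematic array a."""
    r = [sum(a[i]) % P for i in range(ROWS)]
    z = [sum(BETA[i ^ V[j]][j] * a[i ^ V[j]][j] for j in range(K)) % P
         for i in range(ROWS)]
    return r, z


def correct(a_hat, r, z, t):
    """Recover the array when column t is erased and at most one
    systematic element outside column t is wrong."""
    others = [j for j in range(K) if j != t]
    s0 = [(sum(a_hat[i][j] for j in others) - r[i]) % P for i in range(ROWS)]
    s1 = [(sum(BETA[i ^ V[j]][j] * a_hat[i ^ V[j]][j] for j in others) - z[i]) % P
          for i in range(ROWS)]
    x = [BETA[i][t] * s0[i] % P for i in range(ROWS)]
    y = [s1[i ^ V[t]] for i in range(ROWS)]          # Y = f_t^{-1}(S_1)
    w = [(x[i] - y[i]) % P for i in range(ROWS)]

    fixed = [row[:] for row in a_hat]
    nonzero = [i for i in range(ROWS) if w[i]]
    if nonzero:
        if len(nonzero) != 2:
            raise ValueError("more than one element error")
        vj = nonzero[0] ^ nonzero[1] ^ V[t]          # v_j = r + r' + v_t
        if vj not in V or V.index(vj) == t:
            raise ValueError("more than one element error")
        j = V.index(vj)
        for row, other in (nonzero, nonzero[::-1]):
            # Cross-multiplied test w_row / w_other = -beta[row][t] / beta[row][j].
            if w[row] * BETA[row][j] % P == (-BETA[row][t] * w[other]) % P:
                e = w[row] * inv(BETA[row][t]) % P
                fixed[row][j] = (a_hat[row][j] - e) % P
                break
        else:
            raise ValueError("more than one element error")
    # Assumption: rebuild column t from the row parities of the corrected array
    # (this equals -S_0 when there was no element error).
    for i in range(ROWS):
        fixed[i][t] = (r[i] - sum(fixed[i][j] for j in others)) % P
    return fixed


# Usage: erase column 0 and corrupt the element in row 0 of column 1.
data = [[1, 2, 0], [2, 0, 1], [0, 1, 2], [1, 1, 1]]
r, z = encode(data)
received = [[0, b, c] for _, b, c in data]           # column 0 is unknown
received[0][1] = (received[0][1] + 1) % P
assert correct(received, r, z, t=0) == data
```

The two ratio tests are written in cross-multiplied form so that no division by $w_{r'}$ is needed; only the final error value requires a field inverse.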

Theorem 6.6 The above algorithm can correct a node erasure and a systematic element error.

Proof: Suppose column $t$ is erased and there is an error at column $j$ and row $r$. Define $r' = r + v_t + v_j$. Let $\hat{a}_{r,j} = a_{r,j} + e$. It is easy to see that $x_i = y_i = -\beta_{i,t} a_{i,t}$ except when $i = r, r'$. Since the binary vectors $\{v_0, v_1, \dots, v_{k-1}\}$ are distinct, we know that the error is in column $j$. Moreover, we have

$$x_r = -\beta_{r,t} a_{r,t} + \beta_{r,t} e, \qquad y_r = -\beta_{r,t} a_{r,t},$$

$$x_{r'} = -\beta_{r',t} a_{r',t}, \qquad y_{r'} = -\beta_{r',t} a_{r',t} + \beta_{r,j} e.$$

Therefore, the difference between $X$ and $Y$ is

$$w_r = x_r - y_r = \beta_{r,t} e,$$

$$w_{r'} = x_{r'} - y_{r'} = -\beta_{r,j} e.$$

And we can see that no matter what $e$ is, we always have

$$\frac{w_r}{w_{r'}} = -\frac{\beta_{r,t}}{\beta_{r,j}}.$$

Similarly, if the error is at row $r'$, we will get

$$\frac{w_r}{w_{r'}} = -\frac{\beta_{r',j}}{\beta_{r',t}}.$$

By the MDS property of the code, we know that $\beta_{r,t}\beta_{r',t} \neq \beta_{r,j}\beta_{r',j}$ (see the remark after the proof of the finite field size 3), so the two ratios are never equal. Therefore, we can distinguish between the two cases of an error in row $r$ and in row $r'$.
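As a sanity check of the inequality used in the last step of the proof, the following short sketch tests $\beta_{r,t}\beta_{r',t} \neq \beta_{r,j}\beta_{r',j}$ for every pair of systematic columns of the $(5, 3)$ code in Figure 6.1. The tables `V` and `BETA` and the function `distinguishable` are hypothetical names, and the coefficients are read off the figure with row indices treated as bit masks; it illustrates the condition for this one code and does not replace the general MDS argument.

```python
P = 3  # the code of Figure 6.1 is over F_3

# Assumed encoding of Figure 6.1: generating vectors as bit masks and
# BETA[i][j] = zigzag coefficient of the element in row i, column j.
V = [0b00, 0b10, 0b01]
BETA = [[1, 1, 1], [1, 1, 2], [1, 2, 2], [1, 2, 1]]


def distinguishable(t, j):
    """True if, for erased column t and erroneous column j, every row pair
    (r, r' = r + v_t + v_j) satisfies the inequality from the proof."""
    for r in range(len(BETA)):
        r2 = r ^ V[t] ^ V[j]
        lhs = BETA[r][t] * BETA[r2][t] % P
        rhs = BETA[r][j] * BETA[r2][j] % P
        if lhs == rhs:
            return False
    return True


# Check all ordered pairs of distinct systematic columns.
results = {(t, j): distinguishable(t, j)
           for t in range(3) for j in range(3) if t != j}
assert all(results.values())
```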

Example 6.7 Consider the zigzag code in Figure 6.1. Suppose all of column 0 is erased, and suppose there is an error in the 0-th element of column 1. Namely, the erroneous symbol we read is $\hat{b}_0 = b_0 + e$ for some error $e \neq 0 \in \mathbb{F}_3$; see Figure 6.2. We can simply compute the syndromes, locate this error, and recover the original array. Since the erased column corresponds to the zero vector and all the coefficients in column 0 are ones, the algorithm is simplified. For $i \in [0, 3]$, we compute the syndromes and subtract them, and we get zeros in all places except rows 0 and 2, which satisfy $0 + 2 = (0, 0) + (1, 0) = (1, 0) = e_1$. Therefore, we know that the error is located in column 1, in row 0 or row 2. But since $W_0 = -W_2$, we know the error is in $\hat{b}_0$ (if $W_0 = W_2$, the error is in $\hat{b}_2$).

In practice, when we are confident that there are no element errors besides the node erasure, we can use the optimal rebuilding algorithm in Section 4.2.2 and access only half of the array to rebuild the failed node. However, we can also rebuild this node by accessing the other half of the array, which gives two recovered versions of the same node. If they are equal to each other, there are no element errors; if not, there are element errors. Thus, we have the flexibility of achieving the optimal rebuilding ratio or correcting extra errors.

|   | 0 | 1 | 2 | R | Z |
|---|---|---|---|---|---|
| 0 | $a_0$ | $b_0$ | $c_0$ | $r_0 = a_0 + b_0 + c_0$ | $z_0 = a_0 + 2b_2 + 2c_1$ |
| 1 | $a_1$ | $b_1$ | $c_1$ | $r_1 = a_1 + b_1 + c_1$ | $z_1 = a_1 + 2b_3 + c_0$ |
| 2 | $a_2$ | $b_2$ | $c_2$ | $r_2 = a_2 + b_2 + c_2$ | $z_2 = a_2 + b_0 + c_3$ |
| 3 | $a_3$ | $b_3$ | $c_3$ | $r_3 = a_3 + b_3 + c_3$ | $z_3 = a_3 + b_1 + 2c_2$ |

Figure 6.1: $(5, 3)$ zigzag code generated by the standard basis and the zero vector. All elements are over $\mathbb{F}_3$.

|   | 0 | 1 | 2 | R | Z | $S_0$ | $S_1$ | $W = S_0 - S_1$ |
|---|---|---|---|---|---|---|---|---|
| 0 | (erased) | $b_0 + e$ | $c_0$ | $r_0$ | $z_0$ | $-a_0 + e$ | $-a_0$ | $e$ |
| 1 | (erased) | $b_1$ | $c_1$ | $r_1$ | $z_1$ | $-a_1$ | $-a_1$ | $0$ |
| 2 | (erased) | $b_2$ | $c_2$ | $r_2$ | $z_2$ | $-a_2$ | $-a_2 + e$ | $-e$ |
| 3 | (erased) | $b_3$ | $c_3$ | $r_3$ | $z_3$ | $-a_3$ | $-a_3$ | $0$ |

Figure 6.2: An erroneous array of the $(5, 3)$ zigzag code. There is a node erasure in column 0 and an element error in column 1. All the other elements are not corrupted. $S_0$, $S_1$ are the syndromes.
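The idea of rebuilding the same node from two different halves of the array and comparing the results can be sketched for the code of Figure 6.1 as follows. The sketch rebuilds column 1 twice, once from rows 0 and 1 and once from rows 2 and 3 of the surviving columns. The access patterns are derived directly from the parity equations in Figure 6.1 and are an assumption of this example; they are not claimed to be the exact procedure of Section 4.2.2. The function names are hypothetical.

```python
P = 3      # arithmetic in F_3
INV2 = 2   # inverse of 2 in F_3, since 2 * 2 = 4 = 1 (mod 3)


def encode(a, b, c):
    """Row and zigzag parities exactly as listed in Figure 6.1."""
    r = [(a[i] + b[i] + c[i]) % P for i in range(4)]
    z = [(a[0] + 2 * b[2] + 2 * c[1]) % P,
         (a[1] + 2 * b[3] + c[0]) % P,
         (a[2] + b[0] + c[3]) % P,
         (a[3] + b[1] + 2 * c[2]) % P]
    return r, z


def rebuild_b_from_top(a, c, r, z):
    """Rebuild column 1 reading only rows 0 and 1 of columns 0, 2, R and Z."""
    b0 = (r[0] - a[0] - c[0]) % P
    b1 = (r[1] - a[1] - c[1]) % P
    b2 = INV2 * (z[0] - a[0] - 2 * c[1]) % P
    b3 = INV2 * (z[1] - a[1] - c[0]) % P
    return [b0, b1, b2, b3]


def rebuild_b_from_bottom(a, c, r, z):
    """Rebuild column 1 reading only rows 2 and 3 of columns 0, 2, R and Z."""
    b2 = (r[2] - a[2] - c[2]) % P
    b3 = (r[3] - a[3] - c[3]) % P
    b0 = (z[2] - a[2] - c[3]) % P
    b1 = (z[3] - a[3] - 2 * c[2]) % P
    return [b0, b1, b2, b3]


a, b, c = [1, 2, 0, 1], [2, 0, 1, 1], [0, 1, 2, 1]
r, z = encode(a, b, c)

# Without element errors the two versions agree with each other and with b.
assert rebuild_b_from_top(a, c, r, z) == rebuild_b_from_bottom(a, c, r, z) == b

# With an element error in a_0 the two versions disagree, which flags the error.
a_bad = [(a[0] + 1) % P] + a[1:]
assert rebuild_b_from_top(a_bad, c, r, z) != rebuild_b_from_bottom(a_bad, c, r, z)
```

Each rebuild touches two of the four rows in every surviving column, so a mismatch between the two versions costs one extra half-array read and signals that the full syndrome-based correction should be run.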

When there is one node erasure and more than one element error in column $j$, in rows $R = \{r_1, r_2, \dots, r_l\}$, following the same techniques, it is easy to see that the code is able to correct systematic errors if

$$R \cup (R + v_j) \neq R' \cup (R' + v_i)$$

for any set of rows $R'$ and any other column index $i$, and $r_i \neq r_t + v_j$ for any $i, t \in [l]$.

When the code has more than two parities, the zigzag code can again correct element errors beyond the bound given by the Hamming distance. To detect errors, one can either compute the syndromes or rebuild the erasures multiple times by accessing different $e/r$ parts of the array.

Finally, it should be noted that if a node erasure and a single error happen in a parity column, then we cannot correct this error in the $(k+2, k)$ code.
