6.3 Correcting Column Erasure and Element Error

In array codes for storage systems, data is arranged in a 2D array. Each column in the array is typically stored on a separate disk and is called a node, and each entry in the array is called an element. In the conventional error model, a disk failure corresponds to an erasure or an error of an entire node. Therefore, array codes are usually designed to correct such entire-node failures.

However, in other applications, such as flash memory used as storage nodes, element errors are also possible. In other words, we may encounter a few errors within a column as well as entire node erasures. For an MDS array code with two parities, the minimum Hamming distance is 3; therefore, it is not possible to correct a node erasure and a node error at the same time. However, since zigzag codes have very long columns, we ask: can they correct a node erasure and some element errors?

Given a $(k+2, k)$ zigzag code generated by distinct binary vectors $T = \{v_0, v_1, \dots, v_{k-1}\}$, the following algorithm corrects a node erasure and an element error. Here we assume that the erasure and the error are in different columns, and that there is only a single element error, located in the systematic part of the array. The code has two parities and $2^m$ rows, and the zigzag permutations are $f_j = v_j$, meaning $f_j(x) = x + v_j$ for a row index $x$ viewed as a binary vector, $j \in [0, k-1]$. The original array is denoted by $(a_{i,j})$ and the erroneous array by $(\hat{a}_{i,j})$. The row coefficients are all ones, and the zigzag coefficients are $\beta_{i,j}$. Let $x_0, x_1, \dots, x_{p-1} \in \mathbb{F}$. Denote

$$f^{-1}(x_0, x_1, \dots, x_{p-1}) = (x_{f^{-1}(0)}, x_{f^{-1}(1)}, \dots, x_{f^{-1}(p-1)})$$

for a permutation $f$ on $[0, p-1]$.

Algorithm 6.5 Suppose column $t$ is erased, and there is at most one element error in the remaining array. Compute for all $i \in [0, 2^m-1]$ the syndromes:

$$s_{i,0} = \sum_{j \neq t} \hat{a}_{i,j} - r_i,$$

$$s_{i,1} = \sum_{j \neq t} \beta_{f_j^{-1}(i),j}\, \hat{a}_{f_j^{-1}(i),j} - z_i.$$

Let the syndromes be $S_0 = (s_{0,0}, s_{1,0}, \dots, s_{2^m-1,0})$ and $S_1 = (s_{0,1}, s_{1,1}, \dots, s_{2^m-1,1})$.

Compute for all $i \in [0, 2^m-1]$, $x_i = \beta_{i,t} s_{i,0}$. Let $X = (x_0, \dots, x_{2^m-1})$, $Y = f_t^{-1}(S_1)$, $W = X - Y$.

- If $W = 0$, there is no element error. Assign column $t$ as $-S_0$.

- Else, there will be two rows $r, r'$ such that $w_r, w_{r'}$ are nonzero. Find $j$ such that $v_j = r + r' + v_t$. The error is in column $j$.

  - If $\frac{w_r}{w_{r'}} = -\frac{\beta_{r,t}}{\beta_{r,j}}$, then the error is at row $r$, and assign $a_{r,j} = \hat{a}_{r,j} - \frac{w_r}{\beta_{r,t}}$.

  - Else if $\frac{w_r}{w_{r'}} = -\frac{\beta_{r',j}}{\beta_{r',t}}$, then the error is at row $r'$, and assign $a_{r',j} = \hat{a}_{r',j} - \frac{w_{r'}}{\beta_{r',t}}$.

  - Else there is more than one error.

Theorem 6.6 The above algorithm can correct a node erasure and a systematic element error.

Proof: Suppose column $t$ is erased and there is an error at column $j$ and row $r$. Define $r' = r + v_t + v_j$. Let $\hat{a}_{r,j} = a_{r,j} + e$. It is easy to see that $x_i = y_i = -\beta_{i,t} a_{i,t}$ except when $i = r, r'$. Since the binary vectors $\{v_0, v_1, \dots, v_{k-1}\}$ are distinct, the relation $v_j = r + r' + v_t$ determines $j$ uniquely, so we know that the error is in column $j$. Moreover, we have

$$x_r = -\beta_{r,t} a_{r,t} + \beta_{r,t} e, \qquad y_r = -\beta_{r,t} a_{r,t},$$

$$x_{r'} = -\beta_{r',t} a_{r',t}, \qquad y_{r'} = -\beta_{r',t} a_{r',t} + \beta_{r,j} e.$$

Therefore, the difference between $X$ and $Y$ is

$$w_r = x_r - y_r = \beta_{r,t} e,$$

$$w_{r'} = x_{r'} - y_{r'} = -\beta_{r,j} e.$$

We can see that no matter what $e$ is, we always have

$$\frac{w_r}{w_{r'}} = -\frac{\beta_{r,t}}{\beta_{r,j}}.$$

Similarly, if the error is at row $r'$, we will get

$$\frac{w_r}{w_{r'}} = -\frac{\beta_{r',j}}{\beta_{r',t}}.$$

By the MDS property of the code, we know that $\beta_{r,t}\beta_{r',t} \neq \beta_{r,j}\beta_{r',j}$ (see the remark after the proof of the finite field size 3). Therefore, we can distinguish between the two cases of an error in row $r$ and in row $r'$.
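As a quick sanity check of this inequality (an illustration, not part of the original proof), take the code of Figure 6.1 with $t = 0$, $j = 1$, $r = 0$ and $r' = 2$. The coefficients below are read off the zigzag column of that figure, where every element of column 0 has coefficient 1, $b_0$ has coefficient 1 and $b_2$ has coefficient 2:

$$\beta_{r,t}\beta_{r',t} = 1 \cdot 1 = 1, \qquad \beta_{r,j}\beta_{r',j} = \beta_{0,1}\beta_{2,1} = 1 \cdot 2 = 2 \neq 1 \quad \text{in } \mathbb{F}_3,$$

$$-\frac{\beta_{r,t}}{\beta_{r,j}} = -1 = 2, \qquad -\frac{\beta_{r',j}}{\beta_{r',t}} = -2 = 1 \quad \text{in } \mathbb{F}_3.$$

The two candidate ratios differ, so for this pair of rows at most one of the two ratio tests in Algorithm 6.5 can succeed.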

Example 6.7 Consider the zigzag code in Figure 6.1. Suppose all of column 0 is erased, and suppose there is an error in the 0-th element of column 1. Namely, the erroneous symbol we read is $\hat{b}_0 = b_0 + e$ for some error $e \neq 0 \in \mathbb{F}_3$; see Figure 6.2. We can simply compute the syndromes, locate this error, and recover the original array. Since the erased column corresponds to the zero vector and all the coefficients in column 0 are ones, the algorithm simplifies. For $i \in [0, 3]$, we compute the syndromes and subtract them. We get zeros in all places except rows 0 and 2, which satisfy $0 + 2 = (0, 0) + (1, 0) = (1, 0) = e_1$. Therefore, we know the error is in column 1, in row 0 or row 2. But since $w_0 = -w_2$, we know the error is in $\hat{b}_0$ (if $w_0 = w_2$, the error is in $\hat{b}_2$).
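The example can be replayed numerically. The following Python sketch follows the steps of Algorithm 6.5 on the $(5, 3)$ code of Figure 6.1. It is an illustration under stated assumptions, not the author's implementation: the names `P`, `V`, `BETA`, `inv`, `encode` and `correct` are hypothetical, row indices are encoded as 2-bit integers so that $f_j(i)$ becomes `i ^ V[j]`, the coefficient table is read off Figure 6.1, and the final step of rebuilding column $t$ from the corrected $S_0$ is added for convenience.

```python
"""Sketch of Algorithm 6.5 on the (5, 3) zigzag code of Figure 6.1 over F_3."""

P = 3  # field size (prime, so field arithmetic is arithmetic mod P)

# Assumption: row i in [0, 3] is the binary vector (i >> 1, i & 1), so that
# row 2 = (1, 0) as in Example 6.7, and f_j(i) = i XOR V[j].
V = [0b00, 0b10, 0b01]  # v_0 = 0, v_1 = e_1, v_2 = e_2

# BETA[i][j]: zigzag coefficient of element (row i, column j), from Figure 6.1.
BETA = [[1, 1, 1],
        [1, 1, 2],
        [1, 2, 2],
        [1, 2, 1]]


def inv(x):
    """Multiplicative inverse of a nonzero x in F_P (Fermat's little theorem)."""
    return pow(x, P - 2, P)


def encode(data):
    """Return (row parities r, zigzag parities z) for a 4 x 3 systematic array."""
    rows, cols = range(len(data)), range(len(V))
    r = [sum(data[i]) % P for i in rows]
    z = [sum(BETA[i ^ V[j]][j] * data[i ^ V[j]][j] for j in cols) % P
         for i in rows]
    return r, z


def correct(recv, r, z, t):
    """Repair erased column t and at most one systematic element error.

    recv[i][t] is never read. Raises ValueError when the syndromes are not
    consistent with a single element error.
    """
    n = len(recv)
    others = [j for j in range(len(V)) if j != t]
    s0 = [(sum(recv[i][j] for j in others) - r[i]) % P for i in range(n)]
    s1 = [(sum(BETA[i ^ V[j]][j] * recv[i ^ V[j]][j] for j in others) - z[i]) % P
          for i in range(n)]
    x = [BETA[i][t] * s0[i] % P for i in range(n)]
    y = [s1[i ^ V[t]] for i in range(n)]  # Y = f_t^{-1}(S_1); XOR is an involution
    w = [(x[i] - y[i]) % P for i in range(n)]

    fixed = [row[:] for row in recv]
    bad = [i for i in range(n) if w[i]]
    if bad:
        if len(bad) != 2:
            raise ValueError("syndromes do not match a single element error")
        target = bad[0] ^ bad[1] ^ V[t]  # v_j = r + r' + v_t
        if target not in V or V.index(target) == t:
            raise ValueError("no systematic column matches the error pattern")
        j = V.index(target)
        for a, b in ((bad[0], bad[1]), (bad[1], bad[0])):
            # Error at row a gives w_a = beta_{a,t} e and w_b = -beta_{a,j} e,
            # i.e. w_a * beta_{a,j} = -w_b * beta_{a,t}.
            if w[a] * BETA[a][j] % P == (-w[b] * BETA[a][t]) % P:
                e = w[a] * inv(BETA[a][t]) % P
                fixed[a][j] = (recv[a][j] - e) % P
                s0[a] = (s0[a] - e) % P  # remove the error from the row syndrome
                break
        else:
            raise ValueError("more than one error")
    for i in range(n):
        fixed[i][t] = (-s0[i]) % P  # column t is -S_0 (row coefficients are ones)
    return fixed


if __name__ == "__main__":
    data = [[1, 2, 0], [2, 2, 1], [0, 1, 1], [1, 0, 2]]  # arbitrary test array
    r, z = encode(data)
    recv = [[0] + row[1:] for row in data]  # erase column 0 (placeholder zeros)
    recv[0][1] = (recv[0][1] + 1) % P       # b_0 is read as b_0 + e with e = 1
    assert correct(recv, r, z, 0) == data
```

In this run the difference vector is $W = (e, 0, -e, 0)$ with $e = 1$, matching the last column of Figure 6.2. The loop over both orderings of the two nonzero rows covers the two ratio tests of the algorithm with a single comparison, since the second test is the first with the roles of $r$ and $r'$ exchanged.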

In practice, when we are confident that there are no element errors besides the node erasure, we can use the optimal rebuilding algorithm in Section 4.2.2 and access only half of the array to rebuild the failed node. However, we can also try to rebuild this node by accessing the other half of the array. Thus we will have two recovered versions of the same node. If they are equal to each other, there are no element errors; if not, there are element errors. Thus, we have the flexibility of achieving the optimal rebuilding ratio or correcting extra errors.

| Row | 0 | 1 | 2 | R | Z |
|---|---|---|---|---|---|
| 0 | $a_0$ | $b_0$ | $c_0$ | $r_0 = a_0 + b_0 + c_0$ | $z_0 = a_0 + 2b_2 + 2c_1$ |
| 1 | $a_1$ | $b_1$ | $c_1$ | $r_1 = a_1 + b_1 + c_1$ | $z_1 = a_1 + 2b_3 + c_0$ |
| 2 | $a_2$ | $b_2$ | $c_2$ | $r_2 = a_2 + b_2 + c_2$ | $z_2 = a_2 + b_0 + c_3$ |
| 3 | $a_3$ | $b_3$ | $c_3$ | $r_3 = a_3 + b_3 + c_3$ | $z_3 = a_3 + b_1 + 2c_2$ |

Figure 6.1: $(5, 3)$ zigzag code generated by the standard basis and the zero vector. All elements are over $\mathbb{F}_3$.

| Row | 0 | 1 | 2 | R | Z | $S_0$ | $S_1$ | $W = S_0 - S_1$ |
|---|---|---|---|---|---|---|---|---|
| 0 | (erased) | $b_0 + e$ | $c_0$ | $r_0$ | $z_0$ | $-a_0 + e$ | $-a_0$ | $e$ |
| 1 | (erased) | $b_1$ | $c_1$ | $r_1$ | $z_1$ | $-a_1$ | $-a_1$ | $0$ |
| 2 | (erased) | $b_2$ | $c_2$ | $r_2$ | $z_2$ | $-a_2$ | $-a_2 + e$ | $-e$ |
| 3 | (erased) | $b_3$ | $c_3$ | $r_3$ | $z_3$ | $-a_3$ | $-a_3$ | $0$ |

Figure 6.2: An erroneous array of the $(5, 3)$ zigzag code. There is a node erasure in column 0 and an element error in column 1. All the other elements are not corrupted. $S_0, S_1$ are the syndromes.

When there is one node erasure and more than one element error, in column $j$ and rows $R = \{r_1, r_2, \dots, r_l\}$, following the same techniques, it is easy to see that the code is able to correct the systematic errors if

$$R \cup (R + v_j) \neq R' \cup (R' + v_i)$$

for any set of rows $R'$ and any other column index $i$, and $r_i \neq r_t + v_j$ for any $i, t \in [l]$.

When the code has more than two parities, the zigzag code can again correct element errors beyond the bound given by the Hamming distance. To detect errors, one can either compute the syndromes or rebuild the erasures multiple times by accessing different parts of the array.

Finally, it should be noted that if there is a node erasure and the single element error occurs in a parity column, then we cannot correct this error in the $(k+2, k)$ code.

In document Coding for Information Storage (Page 103-106)