Basic Error Detection Techniques - From experiment to design

This section reviews the basic concepts and techniques used in fault detection. Many fundamental fault detection techniques are designed for memory and communication channels. Parity (even or odd) is a simple technique that can detect a single-bit error. The concepts behind the Hamming code are used to design single- or multi-bit error detection techniques in a memory-space-efficient way. A Hamming code trades space overhead (i.e., the storage for the code bits) against the coverage it provides. To detect and correct many more bit errors, an ECC technique needs much more extra memory to hold the ECC bits [BSS08]. This relative space overhead goes down as the size of the protected data unit increases.
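
The parity check and the shrinking relative overhead can be illustrated with a minimal sketch. It assumes a textbook Hamming code extended with one overall parity bit (SEC-DED); the function names are illustrative and do not come from any cited work.

```python
def even_parity_bit(word: int) -> int:
    """Return the bit that makes the total number of 1s even."""
    return bin(word).count("1") % 2


def parity_ok(word: int, parity: int) -> bool:
    """True if the stored parity bit still matches the data word."""
    return even_parity_bit(word) == parity


def secded_check_bits(data_bits: int) -> int:
    """Check bits of a Hamming SEC-DED code protecting `data_bits` bits.

    A Hamming code needs the smallest r with 2**r >= data_bits + r + 1;
    one extra overall parity bit upgrades it to SEC-DED.
    """
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1


if __name__ == "__main__":
    data = 0b10110010
    p = even_parity_bit(data)
    print(parity_ok(data, p))                # True: no error
    print(parity_ok(data ^ 0b00000100, p))   # False: one flipped bit is detected
    print(parity_ok(data ^ 0b00010100, p))   # True: two flipped bits go unnoticed
    for m in (8, 16, 32, 64, 128):
        c = secded_check_bits(m)
        print(f"{m:3d} data bits: {c} check bits ({100 * c / m:.1f}% overhead)")
```

Under these assumptions, 8 data bits need 5 check bits (62.5% overhead), whereas 64 data bits need 8 (12.5%), which is the trend described above.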

Advanced ECC⁵ organizes ECC bits so that the same ECC can provide higher coverage for multi-bit errors. For example, if an ECC data word is composed of bits that are physically noncontiguous, a single particle strike cannot corrupt multiple bits in the same ECC data word. Thus, an SEC-DED technique can still be used and tolerates multi-bit errors. Measurements show that advanced ECC can reduce the uncorrectable error rate by a factor of 3 to 8 [SG06]: the uncorrectable error rate was 0.25% to 0.4% per DIMM per year for SEC-DED-protected platforms and 0.05% to 0.08% for Chipkill-protected platforms.
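
The effect of physically noncontiguous ECC words can be sketched as follows. The round-robin bit layout and the `interleave` parameter are assumptions made for illustration; real memory organizations differ in detail.

```python
from typing import Dict, Tuple


def physical_to_logical(phys_bit: int, interleave: int) -> Tuple[int, int]:
    """Map a physical bit position to (ECC word index, bit index in that word).

    With an interleave factor of `interleave`, physically adjacent cells
    rotate through `interleave` different ECC words.
    """
    return phys_bit % interleave, phys_bit // interleave


def bits_hit_per_word(first_bit: int, burst_len: int, interleave: int) -> Dict[int, int]:
    """Count how many bits of each ECC word a contiguous burst corrupts."""
    hits: Dict[int, int] = {}
    for phys in range(first_bit, first_bit + burst_len):
        word, _ = physical_to_logical(phys, interleave)
        hits[word] = hits.get(word, 0) + 1
    return hits


if __name__ == "__main__":
    # A 4-bit burst with 4-way interleaving: one bit per word, all correctable.
    print(bits_hit_per_word(first_bit=10, burst_len=4, interleave=4))
    # The same burst without interleaving: four bits in a single word.
    print(bits_hit_per_word(first_bit=10, burst_len=4, interleave=1))
```

In this sketch, a burst no longer than the interleave factor leaves at most one corrupted bit per ECC word, so a single-error-correcting code suffices; a longer burst puts two or more errors into at least one word.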

Advanced ECC does not completely remove uncorrectable errors. One cause is the accumulation of latent memory faults (e.g., two particle strikes) [YKI09]. In such a case, multiple bits of a physically separated ECC data word can be corrupted and escape the protection provided by an SEC-DED technique.

Scrubbing addresses such latent fault problems by reducing the fault latency and removing the latent faults. The scrubbing technique periodically scans the memory data so that the ECC technique in use can automatically check the integrity of the data [SSP90]. Scrubbing is implemented in either hardware or software. The hardware-based implementation provides better performance than the software-based one (e.g., a scrubbing rate of 1 gigabyte per 45 minutes in deployed machines [SG06]).
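
Why scrubbing helps against accumulated faults can be shown with a small software model. This is a sketch only: it assumes a textbook Hamming(7,4) single-error-correcting code in place of a real SEC-DED memory code, and `scrub` is a hypothetical helper, not an interface of any cited system.

```python
from typing import List

DATA_POSITIONS = (3, 5, 6, 7)    # 1-based codeword positions holding data bits
PARITY_POSITIONS = (1, 2, 4)     # 1-based codeword positions holding check bits


def syndrome(codeword: List[int]) -> int:
    """XOR of the 1-based positions of all set bits; 0 means consistent."""
    s = 0
    for pos, bit in enumerate(codeword, start=1):
        if bit:
            s ^= pos
    return s


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits into a 7-bit Hamming codeword."""
    cw = [0] * 7
    for i, pos in enumerate(DATA_POSITIONS):
        cw[pos - 1] = (nibble >> i) & 1
    s = syndrome(cw)
    for pos in PARITY_POSITIONS:
        cw[pos - 1] = 1 if s & pos else 0
    return cw


def decode(codeword: List[int]) -> int:
    """Extract the 4 data bits (no correction is applied here)."""
    return sum(codeword[pos - 1] << i for i, pos in enumerate(DATA_POSITIONS))


def scrub(memory: List[List[int]]) -> int:
    """Scan every codeword, repair single-bit errors, return the repair count."""
    corrected = 0
    for cw in memory:
        s = syndrome(cw)
        if s:
            cw[s - 1] ^= 1   # the syndrome names the position of a single flipped bit
            corrected += 1
    return corrected


if __name__ == "__main__":
    memory = [encode(n) for n in (0x3, 0xA, 0xF)]
    memory[1][2] ^= 1            # first fault
    scrub(memory)                # repaired before a second fault arrives
    memory[1][5] ^= 1            # second fault in the same word
    scrub(memory)
    print(decode(memory[1]) == 0xA)   # True: data survived both faults
```

If both faults hit the same word between two scrub passes, the syndrome points at the wrong bit and the word is miscorrected, which is the latent-fault accumulation described above.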

Regarding the presented recoverability-driven memory protection, we suggested the possibility of recovering code memory using storage files for static memory [YKI09], but the size of this area was not measured. In on-chip caches, a finding similar to ours (i.e., a large portion of zero-filled pages in memory) is reported, and this characteristic is used to design a cost-efficient compressed cache [YZG00]. The low error sensitivity of visual data is reported [NPB08], but the memory size of these data in an integrated system is not measured.

⁵ Advanced ECC refers to an ECC technique that can protect against single- and multi-bit errors in memory by, for example, scattering the data bits protected by an ECC word across multiple memory chips. Its commercial names include IBM Chipkill, Intel SDDC, and the former Sun Microsystems Extended ECC.

Error detection is similar to outlier or novelty detection. The existing outlier detection techniques are generally classified into four types.

(i) Distribution-based. This approach draws on statistical theory. It fits the given data samples to a standard distribution function (e.g., normal or Poisson) [BL94]. Discordancy tests are then used to detect outlier samples; these tests support both univariate distributions and some multivariate distributions (e.g., normal [But83]). For example, a discordancy test declares a sample an outlier if it lies ≥3 standard deviations from the mean, assuming that the data follow a normal distribution [FPP78]. This type of technique assumes that the underlying distribution of the sample data is known and is similar to a standard distribution. However, the distributions of the runtime signature data of computer software are not always known, and not all signature data follow standard distributions when mixed workloads run together.
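
The three-standard-deviation test mentioned above can be sketched as follows. The sketch assumes that the mean and standard deviation are estimated from a fault-free reference profile and that the data are roughly normal; the function names and sample values are illustrative.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def fit_normal(reference: Sequence[float]) -> Tuple[float, float]:
    """Estimate mean and sample standard deviation from reference data."""
    return mean(reference), stdev(reference)


def is_discordant(x: float, mu: float, sigma: float, k: float = 3.0) -> bool:
    """Declare x an outlier if it lies at least k standard deviations from mu."""
    return abs(x - mu) >= k * sigma


if __name__ == "__main__":
    # Hypothetical signature values collected during fault-free runs.
    reference = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
    mu, sigma = fit_normal(reference)
    print(is_discordant(10.4, mu, sigma))   # False: within three sigma
    print(is_discordant(11.0, mu, sigma))   # True: beyond three sigma
```

Estimating the parameters from separate reference data keeps a single extreme sample from inflating the standard deviation and masking itself.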

(ii) Depth-based. This approach [Tuk77] maps each data sample to a point in a k-dimensional (k-d) space and computes the depth of each point. A point with the smallest depth is the most likely to be an outlier. One simple definition for k = 1 (i.e., a one-dimensional sample space) takes the minimum of the number of samples to the left of a sample x and the number of samples to the right of x [RR96]. Many different depth definitions exist that show high detection accuracy. The computation of depth relies on the computation of k-d convex hulls (i.e., the smallest volume in the k-d space that contains all sample points and the straight lines between any pair of sample points), which has a lower time-complexity bound of Ω(n^⌊k/2⌋) for n samples, making this approach impracticable for k > 4 and large n.
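
The one-dimensional depth definition can be written down directly. The following is a minimal sketch with a naive quadratic count; the function names are illustrative.

```python
from typing import List, Sequence


def depth_1d(samples: Sequence[float]) -> List[int]:
    """Depth of each sample: min(# samples strictly left, # strictly right)."""
    depths = []
    for x in samples:
        left = sum(1 for y in samples if y < x)
        right = sum(1 for y in samples if y > x)
        depths.append(min(left, right))
    return depths


def shallowest(samples: Sequence[float], count: int = 1) -> List[float]:
    """Return the `count` samples with the smallest depth (outlier candidates)."""
    depths = depth_1d(samples)
    order = sorted(range(len(samples)), key=lambda i: depths[i])
    return [samples[i] for i in order[:count]]


if __name__ == "__main__":
    data = [5, 1, 9, 4, 6]
    print(depth_1d(data))        # [2, 0, 0, 1, 1]
    print(shallowest(data, 2))   # [1, 9]
```

Note that in one dimension the minimum and the maximum always have depth 0, so depth ranks outlier candidates rather than deciding on its own whether a sample is an outlier.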

(iii) Distance-based. This approach declares a sample x an outlier if at least p percent of the samples lie at a distance greater than D from x. If the total population is P, this condition is equivalent to requiring that the distance between x and its k-th nearest neighbor (kNN) [RRK00] be more than D, where k ≈ (1 - p/100)·P. This simplified condition can provide similar detection coverage if D is chosen properly. For example, the exact value of D can be computed to check for samples that lie ≥3 standard deviations from the mean, assuming the samples follow a normal distribution. Simple algorithms with a time complexity of O(kN^2) exist, where k is here the dimension of the sample space and N is the sample count. A nested-loop algorithm compares the distances of each pair of samples (N^2 pairs in total) and stops this process for a sample x as soon as more than (100 - p) percent of the samples are found within distance D of x, since x can then no longer be an outlier. This simple algorithm can be tuned to maximize memory locality. A cell-based algorithm [KN98] with a time complexity of O(c^k + N) scales well to large data (i.e., when the sample data is kept in storage) by grouping nearby samples into cells and increasing memory locality. Also, there is a technique that profiles the normal behaviors of a system (e.g., system call sequences [FHS+96]).
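
The nested-loop idea with early termination can be sketched as follows. This is a simplified in-memory version that counts x itself among its neighbors and uses the Euclidean distance; it omits the block-wise locality tuning and is not the cited implementation.

```python
import math
from typing import List, Sequence


def distance_outliers(samples: Sequence[Sequence[float]], p: float, d: float) -> List[int]:
    """Indices of samples for which at least p percent of all samples
    lie farther away than distance d."""
    n = len(samples)
    # Largest number of samples allowed within distance d of an outlier.
    max_close = n - math.ceil(p * n / 100)
    outliers = []
    for i, x in enumerate(samples):
        close = 0
        for y in samples:
            if math.dist(x, y) <= d:
                close += 1
                if close > max_close:
                    break          # too many neighbours: x cannot be an outlier
        else:
            outliers.append(i)     # inner loop finished without exceeding the limit
    return outliers


if __name__ == "__main__":
    points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
    print(distance_outliers(points, p=80, d=2))   # [5]
```

For samples inside a dense cluster the inner loop stops after a few comparisons, so the quadratic worst case is mostly paid for the rare outlier candidates.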

(iv) Classification-based. Anomaly-based detection is widely used in intrusion detection systems (IDSs) to detect malware and intruders in computer nodes and networks [FHS+96][LSC97][LSM99]. These techniques monitor a well-selected set of system-level software events (e.g., system calls) and hardware events (e.g., hardware performance counters). They then apply various classification techniques and train the classifiers in an efficient way (e.g., k-nearest neighbor, local outlier factor [AMA+11], and Bayesian networks for software faults). Many of these existing techniques use offline training because of the large amount of computing power needed for training and the difficulty of online diagnosis of security attacks. These techniques are implemented as part of the software of the monitored system (e.g., the OS kernel or a hypervisor [ALL06]) or in an external device.
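
As an illustration of the classification step, the following sketch applies a plain k-nearest-neighbor majority vote to labeled event-rate vectors collected offline. The two features (system calls and cache misses per millisecond), the labels, and all values are hypothetical and are not taken from the cited systems.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Labeled = Tuple[Sequence[float], str]


def knn_classify(training: List[Labeled], sample: Sequence[float], k: int = 3) -> str:
    """Label `sample` by majority vote among its k nearest training vectors."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], sample))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical offline training set: (system calls/ms, cache misses/ms).
    training = [
        ((100, 5), "normal"), ((110, 6), "normal"), ((95, 4), "normal"),
        ((400, 50), "anomalous"), ((420, 55), "anomalous"), ((390, 48), "anomalous"),
    ]
    print(knn_classify(training, (105, 5)))    # normal
    print(knn_classify(training, (410, 52)))   # anomalous
```

In practice the features would need to be normalized so that no single event counter dominates the distance, and the training set would be far larger, which is one reason training is typically done offline.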

In document From experiment to design – fault characterization and detection in parallel computer systems using computational accelerators (Page 40-43)