
Chapter 4. Data-Type-Aware Fault Injection on Multiple Computer Systems

4.5. Experiment: Sensitivity

4.5.3. Static vs Dynamic Memory

Fault injection is used to characterize the fault and error sensitivity of dynamic memory. Over 52,000 faults (single-bit errors) were injected into a dynamic memory region (the slab region) of a Linux-based system. The target was monitored for 1 minute after each fault injection. For kernel dynamic memory, faults were injected into the slab objects of the most frequently used data types. The number of data types examined was chosen to cover at least 80% of the memory space of the region. A small number of data types or caller signatures (e.g., <10) typically account for >80% of the used memory space (see Table 4.4). The reason is that the kernel has a fixed number of data types, which are specified by humans (with limited memory). We selected the 8 most frequently used data types, which cover >80% of the slab region space.
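The selection rule described above can be sketched as a greedy cumulative-coverage computation. This is only an illustration: the shares are the "Percent" values listed in Table 4.4, while the function name select_targets and the 80% default threshold parameter are assumptions made for this sketch, not part of the original tooling.

```python
# Share of the slab region per slab cache, taken from the "Percent" column of Table 4.4.
SLAB_SHARE_PERCENT = {
    "buffer_head": 33.7,
    "ext3_inode_cache": 14.9,
    "dentry_cache": 12.2,
    "radix_tree_node": 9.7,
    "journal_head": 3.4,
    "inode_cache": 3.1,
    "filp": 1.9,
    "vm_area_struct": 1.9,
}


def select_targets(share_percent, threshold=80.0):
    """Pick the largest caches first until their combined share reaches the threshold.

    Hypothetical helper: returns the selected cache names and the covered share.
    """
    selected, covered = [], 0.0
    ranked = sorted(share_percent.items(), key=lambda kv: kv[1], reverse=True)
    for name, share in ranked:
        if covered >= threshold:
            break
        selected.append(name)
        covered += share
    return selected, covered


targets, coverage = select_targets(SLAB_SHARE_PERCENT)
print(len(targets), round(coverage, 1))
```

With the Table 4.4 shares, the first seven caches cover 78.9% of the slab region, so the eighth is needed to pass the 80% mark, which matches the choice of 8 data types.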

(i) Fault sensitivity. Figure 4.5 shows the fault sensitivity of the most frequently used dynamic memory objects. The y-axis in Figure 4.5 is truncated at 30% because the remaining 70% corresponds to unactivated faults. The failure type overwritten denotes activated faults for which the first access was a memory write; not manifested denotes activated but benign faults for which the first access was a memory read; and disk corruption denotes file system corruption.
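The outcome categories defined above can be expressed as a small classification routine. The sketch below is a hedged illustration: the function classify_run, its two arguments, and the sample runs are hypothetical and do not come from the experimental framework; only the category names follow the text.

```python
from collections import Counter


def classify_run(first_access, failure_type=None):
    """Map one injection run to an outcome category (hypothetical helper).

    first_access: None if the faulty location was never accessed during
                  monitoring, otherwise "read" or "write".
    failure_type: None if no failure was observed, otherwise a label such
                  as "disk corruption".
    """
    if first_access is None:
        return "not activated"
    if first_access == "write":
        # The corrupted value is overwritten before it can be consumed.
        return "overwritten"
    if first_access != "read":
        raise ValueError(f"unknown access type: {first_access!r}")
    # Activated by a read: either benign or a manifested failure.
    return failure_type if failure_type else "not manifested"


# Hypothetical runs: (first access, observed failure type).
runs = [
    (None, None),
    ("write", None),
    ("read", None),
    ("read", "disk corruption"),
    (None, None),
]
tally = Counter(classify_run(access, failure) for access, failure in runs)
print(tally)
```

Checking the write case before the failure label is a design choice of this sketch; it mirrors the definition of overwritten as depending only on the first access.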

Observation 4.5. The fault sensitivity of dynamic memory is much higher than that of static memory (e.g., 5.83% vs. 0.32%), mainly because of the higher fault activation ratio of dynamic memory (e.g., 16.7% vs. 0.5%).

For comparison purposes, we compute the fault sensitivity of static memory based on our earlier work on error characterization of a Linux kernel [GKI04]. In that study, we measured a fault activation ratio of 0.5% when we injected faults into the kernel data segment. For faults in dynamic memory, analyzed in this chapter, we observed a much higher fault activation ratio of 16.7% on average. The reason is that dynamic memory pages are managed by a locality-aware (e.g., LRU-variant) cache replacement algorithm, while static memory is placed in a fixed memory location regardless of its access frequency. That high fault activation ratio leads to a fault sensitivity of dynamic memory that is 18 times higher than that of static memory (5.83% vs. 0.32%).

(ii) Error sensitivity. In dynamic memory, 34.8% of activated faults manifest, while in static memory, 65.9% of activated faults manifest (based on [GKI04]). Furthermore, in dynamic memory, 44.7% of not manifested faults have a memory-write operation as their first access. After excluding those overwritten faults, we find that 49.2% (= 5.8%/11.8%) of faults activated by read operations caused failures. That is lower than the ratio measured in static memory (65.9%); note that the read/write ratio of not manifested faults in static memory was not reported in [GKI04].

Observation 4.6. The error sensitivities of static and dynamic memories are more similar (e.g., 49.2% vs. <65.9%) than their fault sensitivities are.
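The percentages quoted in this subsection can be cross-checked with a few lines of arithmetic. The sketch below assumes that fault sensitivity equals the fault activation ratio multiplied by the fraction of activated faults that manifest; that relation is an assumption of this illustration (it is consistent with the reported numbers up to rounding), and all variable names are hypothetical.

```python
# Values reported in the text, in percent.
PA_DYNAMIC = 16.7         # fault activation ratio, dynamic memory
PA_STATIC = 0.5           # fault activation ratio, static memory [GKI04]
MANIFEST_DYNAMIC = 34.8   # share of activated faults that manifest, dynamic
MANIFEST_STATIC = 65.9    # share of activated faults that manifest, static
FS_DYNAMIC = 5.83         # reported fault sensitivity, dynamic memory
FS_STATIC = 0.32          # reported fault sensitivity, static memory
OVERWRITTEN_SHARE = 44.7  # share of not manifested faults first accessed by a write

# Assumed relation: fault sensitivity = activation ratio x manifestation ratio.
fs_dynamic_est = PA_DYNAMIC * MANIFEST_DYNAMIC / 100
fs_static_est = PA_STATIC * MANIFEST_STATIC / 100

# Ratio of the reported fault sensitivities (the "18 times higher" figure).
fs_ratio = FS_DYNAMIC / FS_STATIC

# Error sensitivity restricted to faults activated by a read.
not_manifested = PA_DYNAMIC - FS_DYNAMIC               # activated but benign
overwritten = OVERWRITTEN_SHARE * not_manifested / 100  # first access was a write
read_activated = PA_DYNAMIC - overwritten               # first access was a read
es_read = FS_DYNAMIC / read_activated * 100

print(round(fs_dynamic_est, 2), round(fs_static_est, 2))
print(round(fs_ratio, 1), round(read_activated, 1), round(es_read, 1))
```

The estimates land close to the reported 5.83% and 0.32%, and the read-activated share and error sensitivity reproduce the 11.8% and 49.2% used in the text.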

The relatively large variation in the error sensitivities of different dynamic memory objects makes it difficult to directly compare the average error sensitivities of dynamic memory and static memory. Specifically, we injected 800 faults into the inode structure one by one. There was a clear difference in the error sensitivities depending on the data types of the variables. For example, pointer variables for linked lists or associated objects (e.g., the next, a_ops, and backing_dev_info fields) had error sensitivities higher than 50%. Furthermore, lock variables (e.g., the break_lock field of the rwlock_t data type in the inode structure) had similarly high error sensitivities. The fact that the error sensitivities of such control data are not close to 100% indicates that not only the data type but also how the data are used is an important factor in determining the error sensitivity. The reason is that not all corrupted (and activated) pointers cause failures, and break_lock is highly error sensitive despite being an integer variable. File system metadata (e.g., buffer_head) can cause serious failures if corrupted. For example, a single-bit error in a buffer_head object can switch the file system to read-only mode and corrupt the file system, which can potentially be recovered via fsck at the next boot-up of the system.

Figure 4.5. Fault sensitivity of static vs. dynamic memory space. (Axes: data type and failure type; fault sensitivity from 0% to 30%.)

4.5.4. Modeling

We model the error sensitivity (i.e., the probability that an error is benign Pbenign(flo)) as a function of the fault location.

We observe that measuring fault sensitivity is simpler than measuring error sensitivity. Below we derive an analytical expression, Equation (4.3), that gives the error sensitivity (ES) as a function of (a) the fault sensitivity (FS), (b) the probability of fault activation (Pa), and (c) the ratio of the read access count to the total read/write access count (Pr). Equation (4.3) also captures the scenario in which a fault is activated but does not manifest on the first read access, and is then reactivated and manifests on a subsequent read access.

Table 4.4. Most frequently used slab caches.

Slab Cache Name     Data Type                             Slab Cache Size
                    Data Type Name            Data Size   Avg.      Stddev.   Percent
buffer_head         struct buffer_head        52B         3.39 MB   3.20      33.7%
ext3_inode_cache    struct ext3_inode_info    544B        1.50 MB   0.27      14.9%
dentry_cache        struct dentry             136B        1.23 MB   0.53      12.2%
radix_tree_node     struct radix_tree_node    276B        0.97 MB   0.39      9.7%
journal_head        struct journal_head       52B         0.34 MB   0.94      3.4%
inode_cache         struct inode              380B        0.32 MB   0.06      3.1%
filp                struct file               192B        0.20 MB   0.02      1.9%
vm_area_struct      struct vm_area_struct     88B         0.18 MB   0.11      1.9%

(4.3)

To explore the possibility of statically deriving Pa and Pr, we analyzed the variations of the Pa and Pr of different object types in dynamic memory over time.

(i) Variations of Pa. We found a relatively large variance in the activation ratio between different object instances of the same data type. For example, the standard deviations of the activation ratios of the radix_tree_node and inode slab objects in Linux OS were 11.6% and 7.3%, respectively. The activation ratio of radix_tree_node objects varied from 49% to 17% over time.
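As an illustration of how such a spread could be quantified, the sketch below computes per-instance activation ratios and their mean and standard deviation. The instance names and counts are made-up placeholders, not measurements from this study, and the helper activation_ratios is hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical (injected, activated) fault counts per object instance
# of one data type; placeholders only, not measured values.
samples = {
    "instance_a": (200, 70),
    "instance_b": (200, 20),
    "instance_c": (200, 44),
    "instance_d": (200, 26),
}


def activation_ratios(counts):
    """Return the activation ratio (activated / injected) for each instance."""
    return {
        name: activated / injected
        for name, (injected, activated) in counts.items()
        if injected > 0
    }


ratios = activation_ratios(samples)
pa_mean = mean(ratios.values())
pa_std = pstdev(ratios.values())  # population standard deviation across instances
print(f"mean Pa = {pa_mean:.1%}, stddev = {pa_std:.1%}")
```

A standard deviation that is large relative to the mean, as in this made-up sample, is the kind of signal that a single averaged Pa may not represent individual objects well.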

(ii) Variations of Pr. The read ratio also had a large variation between object types. The objects belonging to the radix_tree_node data type had a high read ratio (Pr = 93.6%) and thus a high fault sensitivity (FS = 12.39%). Using those two parameters with the activation ratio (Pa = 19.6%), we can calculate the error sensitivity of radix_tree_node as 62.4%. The read ratio also had a large variation over time.

Observation 4.7. Different variables in the same type of memory object have a large variation in fault sensitivity, mainly because of large variations in the fault activation ratio and in the probability that the first access is a read.

Those two findings indicate that one must be cautious when estimating error sensitivity using statically derived values of Pa and Pr. For example, using simple averages may not be the best approach, and hence experimental evaluation as discussed here remains a trustworthy alternative way to derive such parameters.

In document From experiment to design – fault characterization and detection in parallel computer systems using computational accelerators (Page 90-93)