Metrics - Evaluation of Pedometer Performance Across Multiple Gait Types Using Video for Ground Truth

2.1 Methods

2.1.5 Metrics

There is a tendency to evaluate performance or effectiveness with a single number. For example, academic journals are often reduced to an impact factor, cars to a reliability score, and a computer’s effectiveness to a Windows Experience Index. These values do not provide enough information to fully evaluate effectiveness. A similar phenomenon can be seen in pedometer evaluation. Running count accuracy (RCA), a long-term measure of the number of steps detected compared to the actual number of steps taken, is the only evaluation metric used in the vast majority of pedometer studies. While RCA can provide some useful information about a pedometer algorithm’s effectiveness, multiple metrics are needed to fully understand where algorithms are accurate or inaccurate. This work considers two metrics for evaluation: running count accuracy (RCA) and step detection accuracy (SDA).

While RCA can be useful for providing a rough estimate of a pedometer algorithm’s effectiveness, it does not provide an accurate assessment if the false positives and false negatives produced by an algorithm are both numerous but roughly equal in number, because the two kinds of error cancel in the total count. This could be a problem, for example, if in the Lego building activity a pedometer algorithm treats many of the motions used when building with the Lego pieces as steps, but does not identify the steps taken while the participant moves to pick up additional pieces. While the overall running count accuracy may be near 1, the algorithm was not effective at detecting steps, and this problem may only manifest itself in another activity.

RCA is the value most commonly reported in prior work, however, so it is reported as an evaluation metric in this work as well. RCA is calculated according to Equation 2.2. This metric does not always yield a value between 0 and 1: if the number of detected steps is greater than the number of ground truth steps, the resulting value is greater than 1. Thus, a value of 1 is ideal, a value less than 1 indicates that an algorithm under-detects steps, and a value greater than 1 indicates that an algorithm over-detects steps.

\[ \mathrm{RCA} = \frac{\#\,\text{DetectedSteps}}{\#\,\text{GroundTruthSteps}} \tag{2.2} \]
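The ratio in Equation 2.2 can be sketched in a few lines of Python. The function name and the step counts below are hypothetical, chosen only to illustrate the under- and over-detection cases; they are not results from this work.

```python
def running_count_accuracy(detected_steps: int, ground_truth_steps: int) -> float:
    """Ratio of detected steps to ground truth steps (Equation 2.2)."""
    if ground_truth_steps <= 0:
        # The ratio is undefined when no ground truth steps were recorded.
        raise ValueError("ground_truth_steps must be positive")
    return detected_steps / ground_truth_steps


# Hypothetical counts, for illustration only.
print(running_count_accuracy(90, 100))   # below 1: under-detection
print(running_count_accuracy(110, 100))  # above 1: over-detection
print(running_count_accuracy(100, 100))  # exactly 1: counts agree
```

Note that the last case returns 1 whenever the totals match, regardless of whether the individual detections line up with real steps.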

Step detection accuracy (SDA) provides a more detailed assessment of pedometer accuracy. SDA examines the ability of an algorithm to identify individual steps. This allows evaluation through classification of steps as true positives, false positives, and false negatives. A true positive is identified when a step detected by an algorithm occurs within 1/2 second of a ground truth step. These two detections are paired and excluded from further evaluation (e.g., if the next step detected by the pedometer algorithm is within 1/2 second of the ground truth step paired with the previous detection, the ground truth step cannot be paired again). Any step detected by the algorithm which cannot be paired with a ground truth step is considered a false positive. At the end of this process, any ground truth step not paired with a step detected by the algorithm is recorded as a false negative. True negatives are not considered because the sample rate is high relative to the frequency of steps, so nearly every sample is a non-step and each algorithm would report high accuracy. For example, in our regular gait activity, which has the highest frequency of steps, the sample rate is 15 Hz and the frequency of steps is less than 2 Hz.
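The pairing procedure described above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the function name `classify_steps` is hypothetical, and two details the text leaves open are assumptions here, namely that each detection is paired with the nearest unpaired ground truth step and that a gap of exactly 1/2 second counts as a match. The timestamps are made up for the example.

```python
from typing import List, Tuple


def classify_steps(
    detected: List[float],
    ground_truth: List[float],
    tolerance: float = 0.5,
) -> Tuple[int, int, int]:
    """Return (true_positives, false_positives, false_negatives).

    Times are in seconds. Each ground truth step can be paired at most once.
    """
    truth = sorted(ground_truth)
    paired = [False] * len(truth)
    true_positives = 0
    false_positives = 0

    for t in sorted(detected):
        best_index = None
        best_gap = None
        for i, g in enumerate(truth):
            if paired[i]:
                continue  # already matched to an earlier detection
            gap = abs(t - g)
            # Assumption: choose the nearest unpaired ground truth step.
            if gap <= tolerance and (best_gap is None or gap < best_gap):
                best_index, best_gap = i, gap
        if best_index is None:
            false_positives += 1
        else:
            paired[best_index] = True
            true_positives += 1

    # Ground truth steps left unpaired were missed by the algorithm.
    false_negatives = paired.count(False)
    return true_positives, false_positives, false_negatives


# Hypothetical timestamps: the detection at 1.2 s is a false positive because
# the ground truth step at 1.1 s is already paired with the detection at 1.0 s.
print(classify_steps([1.0, 1.2, 2.1, 5.0], [1.1, 2.0, 3.0]))  # (2, 2, 1)
```

The nested loop keeps the sketch short; for long recordings, a two-pointer sweep over the sorted lists would avoid the quadratic cost.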

Because true negatives are not considered, two metrics serve to evaluate accuracy. Positive predictive value (PPV), also known as precision, is calculated according to Equation 2.3 and measures how likely a step detection made by the algorithm is to correspond to a step actually taken. Sensitivity, also referred to as recall, is calculated according to Equation 2.4 and measures how likely an algorithm is to detect a step when one is taken. An algorithm should provide good results for both measures in order to be effective, and the F1 score is a standard measure that combines PPV and sensitivity through Equation 2.5. The F1 score is the harmonic mean of PPV and sensitivity: it behaves like an average, but compared to the arithmetic mean, the score is decreased when the two metrics are significantly different.

For example, a PPV of 0.1 and a sensitivity of 0.9 would average to 0.5, but the F1 score is only 0.18. On the other hand, a PPV of 0.5 and a sensitivity of 0.5 would average to 0.5 and also have an F1 score of 0.5. Essentially, this penalizes algorithms which lean too heavily toward either extreme: counting all data points as steps or counting no data points as steps.

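\[ \mathrm{PPV} = \frac{\text{TruePositives}}{\text{TruePositives} + \text{FalsePositives}} \tag{2.3} \]

\[ \text{Sensitivity} = \frac{\text{TruePositives}}{\text{TruePositives} + \text{FalseNegatives}} \tag{2.4} \]

\[ F_1 = \frac{2 \times \mathrm{PPV} \times \text{Sensitivity}}{\mathrm{PPV} + \text{Sensitivity}} \tag{2.5} \]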
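Equations 2.3 through 2.5 translate directly into code. The sketch below uses hypothetical function names and reproduces the two worked F1 values from the text. Returning 0 when a denominator is 0 is an assumed convention, since the equations leave that case undefined.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """Equation 2.3: fraction of detections that match a real step."""
    detections = true_positives + false_positives
    # Assumed convention: report 0 when the algorithm made no detections.
    return true_positives / detections if detections else 0.0


def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Equation 2.4: fraction of real steps that were detected."""
    actual_steps = true_positives + false_negatives
    # Assumed convention: report 0 when there were no ground truth steps.
    return true_positives / actual_steps if actual_steps else 0.0


def f1_score(ppv: float, sens: float) -> float:
    """Equation 2.5: harmonic mean of PPV and sensitivity."""
    total = ppv + sens
    # Assumed convention: report 0 when both inputs are 0.
    return 2 * ppv * sens / total if total else 0.0


# The two worked examples from the text.
print(round(f1_score(0.1, 0.9), 2))  # 0.18
print(f1_score(0.5, 0.5))            # 0.5
```

Both pairs have an arithmetic mean of 0.5, so the gap between 0.18 and 0.5 is entirely due to the imbalance between PPV and sensitivity.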

Evaluations using these two metrics allow for a more thorough understanding of when, where, and how pedometer algorithms fail. RCA is a limited metric because, regardless of how it is calculated, if the numbers of false positives and false negatives are similar, the reported RCA will always be near 1. In an extreme example, imagine a participant wears a pedometer throughout their full day, but the pedometer does not detect any steps until they lie down to go to sleep. The hypothetical pedometer then detects this action and registers 7,000 steps. If the participant walked 7,000 steps in the day and does not look at the pedometer until after lying down, they would claim that the pedometer is accurate, and according to RCA they would be correct. In reality, however, this is an extremely ineffective algorithm. In the same situation, SDA would correctly indicate that the algorithm is extremely inaccurate because there would be 7,000 false positives, 7,000 false negatives, and 0 true positives.

RCA does, however, provide some useful information regarding algorithm performance. Based on our method of calculating RCA, a value significantly greater than 1 quickly shows that the algorithm is, overall, significantly overestimating steps, and a value significantly less than 1 indicates a significant underestimation of steps. A value near 1 does not necessarily indicate accuracy, but does indicate that false positives and false negatives are roughly equal in number. SDA, represented by an F1 score, provides information about how effective the algorithm is at detecting individual steps, but, as a single measure, it does not provide information about how much an algorithm is over- or underestimating the long-term step count. Based on this information, this work considers SDA a more effective measure of pedometer algorithm accuracy, while using RCA to interpret how the inaccuracies present in the algorithm affect the long-term step count.

Figure 2.19: An example of ground truth steps identified in the regular gait activity. The peak in acceleration between each right step (red solid line) and left step (blue dashed line) is indicative of the forward motion of the left leg.
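As a worked check of the hypothetical 7,000-step example, the two metrics can be evaluated by hand. This assumes that every one of the 7,000 detections falls more than 1/2 second from every ground truth step, so that there are 0 true positives, 7,000 false positives, and 7,000 false negatives.

\[
\mathrm{RCA} = \frac{7000}{7000} = 1, \qquad
\mathrm{PPV} = \frac{0}{0 + 7000} = 0, \qquad
\text{Sensitivity} = \frac{0}{0 + 7000} = 0
\]

With PPV and sensitivity both equal to 0, Equation 2.5 reduces to 0/0 and is formally undefined. A common convention, assumed here rather than stated in the text, is to report an F1 score of 0 in that case, which matches the intuition that no step was correctly detected even though RCA is a perfect 1.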
