Data analysis session one - Why your back hurts : Finding an efficient way to measure and evalu

During the data analysis it was discovered that a mistake was made in the way the data was gathered. Because all the data needed to be processed using the straight posture data, there was no straight posture data left over to train the machine learning model. To solve this problem, extra data had to be gathered. This extra data gathering focussed mainly on gathering straight posture data. Because of a lack of time, the data was gathered by placing the sensor suit on myself. After obtaining the extra correct posture data, and some extra incorrect posture data, the data could be processed properly. However, this meant that the correct posture data available to train the model was relatively limited and all based on one person.

6.1.1 Training the random forest model using unprocessed data

To test the need for the baseline correct posture data the random forest model was first trained using the “raw”, original, data. This data does not contain the extra data gathered from the researcher himself. Eighty percent of the data was used to train the model, while the other 20 percent was used to test the model. A model was produced using both twenty estimators and one-hundred estimators to test the difference in accuracy. This model was tested using five different training and testing splits. These training and testing splits where generated using the Scikit learn “train_test_split” function. The average accuracy of the 20-sample model and the 100-sample model was virtually the same, with the 20-sample model having an accuracy of 0.90000 out of 1 and the 100-sample model having an accuracy of 0.90625. See Figure 17 for the accuracies for each testing/training split.

Figure 17, the accuracy of the random forest model using different training and testing splits, “raw” data.

All Figure 20 does show which predictions where wrong and which were right. A confusion matrix shows the spread of the predictions. Adding the confusion matrices from all training splits together results in the confusion matrix in Figure 18. This shows that most incorrect predictions are wrongly classified correct posture data, with only one incorrect prediction being a misclassified as correct posture. This is most likely due to the difference between the amount of data gathered on wrong posture versus the amount of data gathered on correct postures. This difference resulted from gathering data for each of the postures identified in Figure 5 to 6. These figures contain more bad postures then good postures. 0,7 0,75 0,8 0,85 0,9 0,95 1 0 1 2 3 4 accuracy o f t h e p re d ic tion s

random seed used for splitting the testing and training data 20 samples 100 samples

36 wrong correct

wrong 140 1

correct 15 4

Figure 18, all confusion matrices of the 20 tree models combined, unprocessed data. Green indicates the real class the data belongs to, yellow the predicted class.

6.1.2 Training the random forest model using data relative to the correct posture

As described before, processing the data required gathering extra correct posture data. This data was gathered by placing the sensor suit on the researcher instead of using independent test subjects. This means that all the correct posture data is from the same person, making it less representative of people’s posture as a whole.

As expected, using posture data that is relative to the correct posture results in a significantly higher accuracy of the predictions. Just like before, both the 20-sample model and the 100-sample model have similar average accuracies. With the 20-sample model having an accuracy of 0,985 and the 100-sample model having an accuracy of 0,975. Figure 19 shows the accuracies of each

testing/training split. Two out of the five training testing splits reached an accuracy of 1.0, indicating that they were one-hundred percent accurate.

Figure 19, the accuracy of the random forest model using different training and testing splits, data relative to correct posture.

wrong correct

wrong 189 0

correct 3 8

Figure 20, all confusion matrices of the 20 tree models added up, data relative to correct posture. Green indicates the real class the data belongs to, yellow the predicted class.

Similarly, to the unprocessed data, most of the wrong predictions made while using the processed data are correct postures labelled as incorrect. Indicating a need for more correct posture data.

0,92 0,93 0,94 0,95 0,96 0,97 0,98 0,99 1 1,01 0 1 2 3 4 accuracy o f t h e p re d ic tion s

random seed used for splitting the testing and training data 20 samples 100 samples

6.1.3 Analysing variable importance.

The build in variable importance measures of the random forest algorithm can give an idea of what sensors are the most important. Figure 21 shows the top five most important variables for the train/test splits that produce the most accurate models using 20 trees. All data used to create the models is relative to the correct posture.

Seed random split 4 2 3 0

Most important variable 11 11 10 11 Second most important variable 10 3 11 9 Third most important variable 9 4 16 10 Fourth most important variable 24 16 9 24 Fifth most important variable 15 10 3 4

Figure 21, top 5 most important variables for different train test splits.

Especially variable 11, 10, and 9 seem important according to Figure 21. These variables correspond to the three axes of the accelerometer on the right shoulder. When training the random forest model using only the data from this sensor, an accuracy of 0,97 is achieved. See Figure 22 for the

corresponding confusion matrix. Figure 25 shows that a lot of mistakes are made when identifying correct posture. The confusion matrix seems to have a similar distribution to the one using all sensors, but with a lower accuracy.

In retrospect, the importance of the right shoulder’s data seems to make sense. Movement of the upper body and arms is almost always going to have an effect on the position of the position of the shoulder, allowing for the shoulder position to be an important indicator of sitting position. This explanation, however, would imply that the left shoulder’s variables should be just about as important. One would also expect to see a higher importance of the gyroscopes attached to both shoulders. Of course, the variable importance measures are just estimating the real importance of the variables and, as discussed in chapter 2.3.2, they are not always accurate.

wrong correct

wrong 189 0

correct 6 5

Figure 22, all confusion matrices of the 20 tree models combined, data relative to the correct posture, only data from the accelerometer on the right shoulder. Green indicates the real class the data belongs to, yellow the predicted class. When training the model using only the gyroscope data from the right shoulder, the predictive ability of the model becomes near zero. Figure 23 Shows the confusion matrices of the models added up. Only one of the correct postures was correctly categorized as such. One might argue that the low predictive capabilities of the model are partly caused by the lack of “correct posture” data. But that does not explain why the model using the accelerometer does so much better.

wrong correct

wrong 189 0

correct 10 1

Figure 23, all confusion matrices of the 20 tree models combined, data relative to the correct posture, only data from the gyroscope on the right shoulder. Green indicates the real class the data belongs to, yellow the predicted class.

To test, and compare, the predictive capabilities of the accelerometer on the left shoulder compared to the right shoulder, a random forest model was trained using the data from this sensor as well, see Figure 24.

wrong correct

wrong 189 0

correct 11 0

Figure 24, all confusion matrices of the 20 tree models combined, data relative to the correct posture, only data from the accelerometer on the left shoulder. Green indicates the real class the data belongs to, yellow the predicted class. The data in figure 27 shows that the gyroscope on the left shoulder performs just as badly as the accelerometer on the right shoulder. In an attempt to further understand the difference between the right shoulder and left shoulder data, a closer look was taken the data. This showed that in the new data, meant to increase the amount of “correct posture” data available, a weird spike was present in the right shoulder data. This could explain the superior predictive abilities of the right shoulder

accelerometer. This spike in the data does not seem representative for the data as a whole, indicating a mistake in the measurement procedures. The “corrupted” data was removed from the dataset. In order to get a better idea of the usefulness of the different sensors, more data needs to be gathered. Both to replace the “corrupted” right shoulder data, but also to increase the amount of “correct posture” data.

In document Why your back hurts : Finding an efficient way to measure and evaluate sitting posture using a combination of body sensors placed on the body and machine learning (Page 35-38)