Data Analysis session two - Why your back hurts : Finding an efficient way to measure and evalu

The goal of the second session was to expand upon the data of the first measuring session. This was done by gathering data from three more test subjects. These data gathering sessions focussed on obtaining extra correct posture data. This data was added to the data from the first data gathering session, resulting in one large dataset.

6.2.1 Training the random forest model using data relative to the correct posture.

In the same way as in the first measuring session, the data was processed using a line of data recorded while the test person was sitting in a correct posture. This baseline was subtracted from the other data, resulting in a delta between the correct baseline posture and the other postures. This data was then used to train and test the random forest model in scikit-learn. Eighty per cent of the data was used to train the model, while the remaining twenty per cent was used to test its accuracy. Just like in session one, the confusion matrices from different models using different training and testing sets were added together in an attempt to obtain a reliable picture of the model's accuracy. See Figure 25 for the combined confusion matrix of the data from the 20 tree models.

                  Predicted wrong   Predicted correct
Real wrong              215                 1
Real correct             13                26

Figure 25, all confusion matrices of the 20 tree models added up, data relative to correct posture data. Rows (green) indicate the real class the data belongs to, columns (yellow) the predicted class.
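The procedure described above can be sketched roughly as follows. This is a minimal illustration, not the code used in this project: the synthetic data, the number of variables (30), the number of splits (5), and the helper names `make_relative` and `summed_confusion` are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split


def make_relative(readings, baseline):
    """Subtract one correct-posture baseline row from every sensor reading."""
    return readings - baseline


def summed_confusion(X, y, n_trees=20, seeds=range(5)):
    """Add up the confusion matrices of models trained on different 80/20 splits."""
    total = np.zeros((2, 2), dtype=int)
    for seed in seeds:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=seed
        )
        model = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
        model.fit(X_train, y_train)
        # Rows are the real class, columns the predicted class.
        total += confusion_matrix(y_test, model.predict(X_test), labels=[0, 1])
    return total


# Synthetic stand-in data (hypothetical): 30 variables, 20% correct postures.
rng = np.random.default_rng(0)
y = np.repeat([1, 0], [50, 200])      # 1 = correct posture, 0 = wrong posture
raw = rng.normal(size=(250, 30))
raw[y == 0, 24] += 3.0                # wrong postures deviate on one variable
baseline = raw[y == 1][0]             # one line of correct-posture data

X = make_relative(raw, baseline)
matrix = summed_confusion(X, y)
print(matrix)
```

Each split tests on 50 of the 250 synthetic samples, so the summed matrix covers 250 predictions in total. Changing `n_trees` to 100 gives the larger forests discussed next.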

Increasing the number of trees to 100 does not have a large effect on the accuracy of the model, see Figure 26.

                  Predicted wrong   Predicted correct
Real wrong              216                 0
Real correct             14                25

Figure 26, all confusion matrices of the 100 tree models added up, data relative to correct posture. Rows (green) indicate the real class the data belongs to, columns (yellow) the predicted class.

Again, the machine learning model has more difficulty predicting correct postures than incorrect postures. This could indicate a lack of correct posture data, even with the extra data gathered during this measuring session, since only about twenty per cent of the data is from correct postures. Still, most of the correct postures are categorized correctly.
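The difference between the two classes can be made explicit by computing the share of each real class that was predicted correctly. The sketch below uses the numbers from Figures 25 and 26 and reads the rows as the real class, which is an assumption based on the row totals (216 and 39) being identical in every confusion matrix of this section. The function names are illustrative only.

```python
def per_class_recall(matrix):
    """Return, per real class (row), the fraction that was predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


def accuracy(matrix):
    """Return the fraction of all samples that lie on the diagonal."""
    hits = sum(matrix[i][i] for i in range(len(matrix)))
    return hits / sum(sum(row) for row in matrix)


# Rows: real wrong, real correct. Columns: predicted wrong, predicted correct.
figure_25 = [[215, 1], [13, 26]]   # 20 tree models
figure_26 = [[216, 0], [14, 25]]   # 100 tree models

for name, m in (("Figure 25", figure_25), ("Figure 26", figure_26)):
    wrong_recall, correct_recall = per_class_recall(m)
    print(f"{name}: accuracy={accuracy(m):.3f}, "
          f"wrong={wrong_recall:.3f}, correct={correct_recall:.3f}")
# Figure 25: accuracy=0.945, wrong=0.995, correct=0.667
# Figure 26: accuracy=0.945, wrong=1.000, correct=0.641
```

Both forests classify 241 of the 255 test samples correctly. About two thirds of the correct postures are recognised, against nearly all of the wrong postures.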

6.2.2 Analysing variable importance

Again, the built-in variable importance measures of the random forest algorithm are used to get an idea of which sensors are the most important for posture measurement. See Figure 27 for the most important variables across various random test/train splits. The data in Figure 27 uses the 100 tree models to ensure that all variables are used often enough to give an accurate estimate of their importance.

Seed random split                 0    1    2    3    4
Most important variable          24   24   24   24   24
Second most important variable    3    3    3    3    3
Third most important variable     5    5    5    5    5
Fourth most important variable   19    1   28    0    4
Fifth most important variable    10   28    0    4   27

Figure 27, top 5 most important variables for different train test splits, using 100 tree models.

Figure 27 clearly shows the high importance of variables 24, 3, and 5. The variable importance measures of the random forest models consistently place these variables in the top three of the variable importance lists. The fourth and fifth spots are more random, with variables 28, 0 and 4 being the most common.
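A ranking like the one in Figure 27 can be obtained from the `feature_importances_` attribute of a fitted scikit-learn random forest. The following is a sketch under assumptions: the data is synthetic, with three variables made informative on purpose, and `top_variables` is a hypothetical helper rather than the function used in this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def top_variables(X, y, seed, n_trees=100, top_k=5):
    """Train on one 80/20 split and return the top_k variable indices by importance."""
    X_train, _, y_train, _ = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    model.fit(X_train, y_train)
    # argsort is ascending, so reverse it to list the most important first.
    order = np.argsort(model.feature_importances_)[::-1]
    return order[:top_k].tolist()


# Synthetic stand-in data (hypothetical): only variables 24, 3 and 5 carry signal.
rng = np.random.default_rng(1)
y = np.repeat([1, 0], [50, 200])      # 1 = correct posture, 0 = wrong posture
X = rng.normal(size=(250, 30))
X[y == 0, 24] += 4.0
X[y == 0, 3] += 3.5
X[y == 0, 5] += 3.0

ranking = {seed: top_variables(X, y, seed) for seed in range(5)}
for seed, variables in ranking.items():
    print(f"seed {seed}: {variables}")
```

On data like this the informative variables should occupy the first three places for every seed, while the remaining places are filled by noise variables that change from split to split, which mirrors the pattern in the fourth and fifth rows of Figure 27.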

6.2.2.1 Accuracy of variables 3, 5 and 24

Variables 24, 3, and 5 correspond to the yaw of the upper back gyroscope and the x and z axes of the left shoulder gyroscope, respectively. Just like in the first data gathering session, one of the shoulder accelerometers is again estimated to be of relatively high importance. In order to test the usefulness of variables 3, 5 and 24, a machine learning model was trained using only these variables. The confusion matrix belonging to this model can be seen in Figure 28. Comparing this confusion matrix to the one in Figure 25, which uses all sensors, shows only a small sacrifice in accuracy when using only variables 24, 3 and 5.

                  Predicted wrong   Predicted correct
Real wrong              212                 4
Real correct             12                27

Figure 28, all confusion matrices of the 20 tree models added up, data relative to correct posture, using only variables 3, 5, and 24. Rows (green) indicate the real class the data belongs to, columns (yellow) the predicted class.
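Restricting the model to a few variables amounts to selecting the matching columns before training. A minimal sketch, again on synthetic data and with the hypothetical helper `confusion_for_columns`, could look like this:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

SELECTED = [3, 5, 24]   # variable indices taken from Figure 27


def confusion_for_columns(X, y, columns, seed=0, n_trees=20):
    """Train and test a forest that only sees the given columns."""
    X_train, X_test, y_train, y_test = train_test_split(
        X[:, columns], y, test_size=0.2, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    model.fit(X_train, y_train)
    # Rows are the real class, columns the predicted class.
    return confusion_matrix(y_test, model.predict(X_test), labels=[0, 1])


# Synthetic stand-in data (hypothetical): signal only in the selected variables.
rng = np.random.default_rng(2)
y = np.repeat([1, 0], [50, 200])      # 1 = correct posture, 0 = wrong posture
X = rng.normal(size=(250, 30))
for col in SELECTED:
    X[y == 0, col] += 3.0

full = confusion_for_columns(X, y, list(range(X.shape[1])))
reduced = confusion_for_columns(X, y, SELECTED)
print("all variables:\n", full)
print("variables 3, 5 and 24 only:\n", reduced)
```

Because both calls use the same seed, they are evaluated on the same test samples, so the two matrices can be compared cell by cell.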

6.2.2.2 Accuracy of the model using only the left shoulder sensor

In an attempt to further reduce the number of sensors needed, a machine learning model was trained using only variables from the left shoulder sensor. First a model was trained using only the data from the left shoulder accelerometer, then a model was trained using both the accelerometer and the gyroscope of said shoulder, see Figures 29 and 30. Again, a fairly small decrease in accuracy was detected.

                  Predicted wrong   Predicted correct
Real wrong              209                 7
Real correct             15                24

Figure 29, all confusion matrices of the 20 tree models added up, data relative to the correct posture, using only the accelerometer from the left shoulder. Rows (green) indicate the real class the data belongs to, columns (yellow) the predicted class.

                  Predicted wrong   Predicted correct
Real wrong              211                 5
Real correct             13                26

Figure 30, all confusion matrices of the 20 tree models added up, data relative to the correct posture, using only the gyroscope and accelerometer from the left shoulder. Rows (green) indicate the real class the data belongs to, columns (yellow) the predicted class.
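To put the size of the decrease in numbers, the 20 tree confusion matrices of this section can be compared directly. The sketch below only re-uses the values from Figures 25, 28, 29 and 30. The short labels are chosen for this example, and the rows are assumed to be the real class.

```python
# Rows: real wrong, real correct. Columns: predicted wrong, predicted correct.
matrices = {
    "all variables (Fig. 25)":        [[215, 1], [13, 26]],
    "variables 3, 5, 24 (Fig. 28)":   [[212, 4], [12, 27]],
    "left shoulder acc. (Fig. 29)":   [[209, 7], [15, 24]],
    "left shoulder acc. + gyro (Fig. 30)": [[211, 5], [13, 26]],
}


def summarize(matrix):
    """Return (number of misclassified samples, accuracy) for a 2x2 matrix."""
    total = sum(sum(row) for row in matrix)
    errors = matrix[0][1] + matrix[1][0]
    return errors, (total - errors) / total


summary = {name: summarize(m) for name, m in matrices.items()}
for name, (errors, acc) in summary.items():
    print(f"{name}: {errors} errors, accuracy {acc:.3f}")
# all variables (Fig. 25): 14 errors, accuracy 0.945
# variables 3, 5, 24 (Fig. 28): 16 errors, accuracy 0.937
# left shoulder acc. (Fig. 29): 22 errors, accuracy 0.914
# left shoulder acc. + gyro (Fig. 30): 18 errors, accuracy 0.929
```

Out of 255 test samples, the reduced models make between 2 and 8 more errors than the model that uses all sensors.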
