Cross-validation results - Quantifying the spatio-temporal temperature dynamics of Greater Lond

5.4 Results

5.4.3 Cross-validation results

Results from the two-fold and leave-one-out tests are shown in Table 5.8. The two-fold test gives an RMSE of 5.03, indicating that the model predicts air temperature with reasonable accuracy, although this error is greater than the variations between neighbourhood temperatures reported in the literature, which have been shown to have a significant impact on the variability of heat hazard exposure (e.g. <5°C, Harlan et al. (2006)). As previously described, the two-fold cross-validation value was used as the benchmark against which to evaluate the response of the data to the other cross-validation tests.

The leave-one-out cross-validation test gives an RMSE that differs from the two-fold result by only 0.01 (Table 5.8). The minimal difference between the two tests suggests that observations are evenly weighted: when 50% of the data are removed at random during two-fold cross-validation, the remaining construction sample still predicts values with the same magnitude of error as the leave-one-out test, which removes observations individually, one at a time.

Table 5.9 shows the results of the cross-validation tests when individual stations are removed. In contrast to the two-fold and leave-one-out tests, the RMSE values from station removal are all lower, by up to 2.33 relative to the two-fold test. The lower RMSE values can be attributed to these tests removing a smaller validation sample, leaving a greater number of measurements to act as the construction sample. For example, in the two-fold test 50% of the data are randomly removed (1062 measurements), while each station test removes between ~17% and ~31% of all data (371-672 measurements). Each station test therefore had a much larger construction sample available, giving a more robust line of best fit.
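The two-fold and leave-one-out procedures described above can be illustrated with a minimal sketch. It assumes a simple least-squares line of air temperature on EST, which the text implies ("line of best fit") but does not specify in this section. The data are synthetic, and the function names (`fit_line`, `two_fold_rmse`, `leave_one_out_rmse`) are hypothetical, not taken from the thesis.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(xs, ys, slope, intercept):
    """Root-mean-square error of the fitted line on the pairs (xs, ys)."""
    squared = sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(squared / len(xs))


def two_fold_rmse(xs, ys, rng):
    """Fit on a random half (construction sample), validate on the other half."""
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    half = len(idx) // 2
    train, test = idx[:half], idx[half:]
    slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
    return rmse([xs[i] for i in test], [ys[i] for i in test], slope, intercept)


def leave_one_out_rmse(xs, ys):
    """Refit with each observation held out in turn and pool the squared errors."""
    squared = 0.0
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        squared += (slope * xs[i] + intercept - ys[i]) ** 2
    return math.sqrt(squared / len(xs))


# Synthetic EST/air temperature pairs (assumption: NOT the thesis data).
rng = random.Random(0)
est = [rng.uniform(5.0, 40.0) for _ in range(400)]
air = [0.6 * s + 8.0 + rng.gauss(0.0, 3.0) for s in est]

rmse_two_fold = two_fold_rmse(est, air, rng)
rmse_loo = leave_one_out_rmse(est, air)
print(f"two-fold RMSE: {rmse_two_fold:.2f}, leave-one-out RMSE: {rmse_loo:.2f}")
```

With evenly weighted synthetic observations the two RMSE values come out close to each other, which is the behaviour the comparison in Table 5.8 is used to check.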

In Table 5.9 it is also apparent that the RMSE values for the station tests differ very little (maximum RMSE variation of 0.24). This demonstrates that, despite the inter-station variability in the relationship between air and surface temperatures described in Section 5.3.1, the stations all have approximately equal capability within the model. For each of the four stations removed to act as the validation sample, data from the other three stations can estimate temperatures at the validation station with the same magnitude of error (0.45-1.67°C), which is lower than in both the two-fold and leave-one-out cross-validation tests. The small variation in RMSE between stations appears to correspond with the number of measurements at each station. LHR and LWC, which have the greatest numbers of observations (so that removing them leaves fewer observations in the construction sample, resulting in a model with a poorer fit), have the highest RMSE values (2.92 and 2.94 respectively). This suggests that the model may be more representative of temperatures recorded at these two stations than at NTH and SJP, which could lead to over-prediction of temperatures in urban green-space such as that found at SJP.
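The station-removal test can be sketched as a leave-one-group-out loop. The example below again assumes a simple least-squares line and uses synthetic values; only the station codes and validation sample sizes are taken from Table 5.9. The function name `leave_group_out_rmse` and the record layout are hypothetical.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def leave_group_out_rmse(records):
    """records: list of (group, est, air). Returns {group: (n_validation, rmse)}."""
    results = {}
    for held_out in sorted({group for group, _, _ in records}):
        # Construction sample: every record NOT in the held-out group.
        train = [(x, y) for group, x, y in records if group != held_out]
        # Validation sample: every record in the held-out group.
        test = [(x, y) for group, x, y in records if group == held_out]
        slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
        squared = sum((slope * x + intercept - y) ** 2 for x, y in test)
        results[held_out] = (len(test), math.sqrt(squared / len(test)))
    return results


# Validation sample sizes from Table 5.9; temperatures are synthetic (assumption).
sizes = {"LWC": 662, "SJP": 371, "LHR": 672, "NTH": 419}
rng = random.Random(1)
records = []
for station, n in sizes.items():
    for _ in range(n):
        est = rng.uniform(5.0, 40.0)
        records.append((station, est, 0.6 * est + 8.0 + rng.gauss(0.0, 3.0)))

total = len(records)
results = leave_group_out_rmse(records)
for station, (n_val, err) in results.items():
    share = 100.0 * n_val / total
    print(f"{station}: held out {n_val} ({share:.1f}%), RMSE {err:.2f}")
```

The same function applies to any other grouping key, for example the year of observation, by placing that key in the first element of each record.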

Table 5.10 shows the results of cross-validation for different summers. The highest RMSE (3.11) occurs when observations from 2006 are removed, a summer with a mean air temperature of 24.31°C. The lowest RMSE (2.51) is found for 1995 and 2002, with mean air temperatures of 25.67°C and 19.64°C respectively. These results show no apparent relationship between the RMSE and the mean air temperature of each summer, even for summers such as 1995 and 2003 which contained known heatwave events (Koppe et al., 2004; Johnson et al., 2005; Kovats et al., 2006). It would be reasonable to assume that model prediction would be worse for summers containing heatwaves than for non-heatwave summers, as extreme events account for only a small number of measurements in the original data, and these lie furthest from the line of best fit. However, this test shows that the model is capable of predicting temperatures for heatwave summers with the same magnitude of error as for other, non-extreme summers.

Cross-validation method   Construction sample   Validation sample       RMSE
Two-fold                  1062                  1062                    5.03
Leave-one-out             2124                  1 (iterated over all)   5.04

Table 5.8: Results of the two-fold and leave-one-out cross-validation tests for the EST and air temperatures (May-September 06:00-21:00).

Station removed   Validation sample   RMSE   x̄_air (°C)   σ_air (°C)   x̄_surface (°C)   σ_surface (°C)   x̄_air − x̄_surface (°C)
LWC               662                 2.94   21.44        4.95         20.99            7.54             0.45
SJP               371                 2.70   22.97        4.76         20.81            6.32             1.55
LHR               672                 2.92   21.72        5.26         21.02            7.94             0.69
NTH               419                 2.79   22.32        4.98         20.65            6.57             1.67

Table 5.9: Results of station cross-validation tests for EST and air temperatures (May-September 06:00-21:00). Standard deviation and mean temperatures are for the station removed. Stations; LWC: London Weather Centre, SJP: St James’s Park, LHR: London Heathrow, NTH: Northolt.

Year removed   Validation sample   RMSE   x̄_air (°C)   σ_air (°C)   x̄_surface (°C)   σ_surface (°C)   x̄_air − x̄_surface (°C)
1989           113                 2.75   20.68        4.32         19.66            7.14             -1.02
1990           120                 2.73   21.7         5.83         18.98            8.7              -2.72
1995           102                 2.51   25.67        3.41         27.9             4.68             2.23
2002           176                 2.51   19.64        5.22         16.95            6.74             -2.69
2003           271                 2.94   23.45        5.33         23.95            6.7              0.14
2006           179                 3.11   24.31        4.24         24.2             5.32             -0.11

Table 5.10: Results of summer cross-validation analysis for the surface and air temperatures (May-September 06:00-21:00). Standard deviation and mean temperatures are for the year removed. Stations; LWC: London Weather Centre, SJP: St James’s Park, LHR: London Heathrow, NTH: Northolt.

5.4.4 Testing for systematic error

Figure 5.11 shows morning paired air and surface temperature measurements, grouped by early, late and on-time satellite overpass times. If the assumptions stated in the methodology (Section 5.3.4) are correct, then air temperatures paired with early overpasses would be warmer than those paired with on-time overpasses. In contrast, air temperatures paired with late overpasses would be cooler than those paired with on-time overpasses. Therefore, the differences between the early/late and on-time trend lines in Figure 5.11 could indicate a systematic error introduced by the change in temperature between observations with a time-delta of ±15 minutes or greater.

The differences between the trend lines in Figure 5.11, although minimal (≤2.0°C), correspond with the hypothesis of temperature change between the times of the two measurements. The line of best fit for paired measurements where AVHRR EST was captured 15-30 minutes after air temperature indicates lower air temperatures in these pairings than in EST-air temperature measurements taken within 15 minutes of each other. This could be the result of surface warming (leading to higher EST) in the 15-30 minute interval after the air temperature measurement. Furthermore, the trend line for paired measurements where EST was captured 15-30 minutes before air temperature shows warmer air temperatures than for measurements captured within 15 minutes of each other. This could also indicate warming in the 15-30 minute interval before the air temperature measurement, leading to a cooler EST being paired with a warmer air temperature.
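The grouping of paired measurements by overpass offset can be sketched as follows. This is an illustration only: the sign convention (offset = overpass time minus air temperature observation time, in minutes), the treatment of exactly 15 minutes as outside the on-time group, and the names `classify_offset` and `group_pairs` are assumptions, and the sample pairs are invented.

```python
def classify_offset(delta_minutes):
    """Label a pair by overpass offset (overpass time minus air observation time).

    Assumed convention: |offset| < 15 is "on-time"; an overpass 15 minutes or
    more before the air observation is "before"; 15 minutes or more after is
    "after".
    """
    if abs(delta_minutes) < 15:
        return "on-time"
    return "before" if delta_minutes < 0 else "after"


def group_pairs(pairs):
    """pairs: iterable of (delta_minutes, est, air). Returns {group: [(est, air), ...]}."""
    groups = {"before": [], "on-time": [], "after": []}
    for delta, est, air in pairs:
        groups[classify_offset(delta)].append((est, air))
    return groups


# Invented example pairs: (offset in minutes, EST in °C, air temperature in °C).
pairs = [
    (-22, 18.0, 20.5),
    (-15, 15.0, 18.2),
    (-3, 19.0, 19.8),
    (4, 25.0, 23.9),
    (17, 27.0, 23.0),
    (29, 30.0, 25.1),
]
groups = group_pairs(pairs)
for name, members in groups.items():
    print(name, len(members))
```

A separate line of best fit can then be fitted to each group and the three trend lines compared, as in Figure 5.11.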

Figure 5.12 shows the corresponding plot for afternoon observations. However, these data exhibit no clear distinction between the early and late groups. This is likely a result of too few measurements in the afternoon data: evaporation is greatest in the afternoon, reducing the number of AVHRR scenes available because of cloud contamination. Given that the afternoon plot showed no indication of systematic error, afternoon observations were not investigated for a temporal offset between satellite overpass and air temperature measurement.

The Mann-Whitney U test between early and late morning temperatures (see Section 5.3.4) found no significant difference between the two groups. The slight variation seen in the trend lines is therefore not statistically significant at the 95% confidence level. The trend lines show an apparent difference between early and late measurements, possibly as a result of the offset of up to ±30 minutes between temporally paired EST and air temperatures; however, the statistical testing indicates that this does not cause a significant systematic error between the early and late measurement groups, and it is therefore unlikely to affect the validity of the regression model.
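For readers unfamiliar with the test, the following is a minimal sketch of a two-sided Mann-Whitney U test using the normal approximation with a tie correction and no continuity correction. It is not the implementation used in the thesis, which is not described in this section. The two samples are synthetic, and the choice of air temperature as the compared variable is an assumption.

```python
import math
import random


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(a, b):
    """Return (U, two-sided p) using the normal approximation with tie correction."""
    n1, n2 = len(a), len(b)
    combined = list(a) + list(b)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0
    u2 = n1 * n2 - u1
    n = n1 + n2
    # Tie correction: sum of (t^3 - t) over groups of tied values.
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    variance = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u1 - n1 * n2 / 2.0) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return min(u1, u2), p_value


# Synthetic "early" and "late" samples drawn from the same distribution.
rng = random.Random(2)
early = [rng.gauss(20.0, 4.0) for _ in range(60)]
late = [rng.gauss(20.0, 4.0) for _ in range(60)]

u_stat, p_value = mann_whitney_u(early, late)
significant = p_value < 0.05  # 95% confidence level, as in the text
print(f"U = {u_stat:.1f}, p = {p_value:.3f}, significant: {significant}")
```

A p-value at or above 0.05 corresponds to the outcome reported above: no significant difference between the early and late groups at the 95% confidence level.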

[Figure 5.11 plot. Title: Scatter plots showing morning and afternoon surface-air temperatures with AVHRR overpass +/- 15-29 minutes before/after air temperature observation. Panel: Morning (before 14:00). Axes: Estimated Surface Temperature (°C) and Air temperature (°C), both 0-40. Legend: AVHRR overpass within +/- 15 minutes; AVHRR ≥ 15 minutes before; AVHRR ≤ 15 minutes after; on-time, before and after lines of best fit.]

Figure 5.11: Scatter plot of morning (before 14:00) early, late and on-time paired EST and air temperatures.

[Figure 5.12 plot. Panel: Afternoon (after 14:00). Axes: Estimated Surface Temperature (°C) and Air temperature (°C), both 0-40. Legend: AVHRR overpass within +/- 15 minutes; AVHRR ≥ 15 minutes before; AVHRR ≤ 15 minutes after; on-time, before and after lines of best fit.]


Figure 5.12: Scatter plot of afternoon (after 14:00) early, late and on-time paired EST and air temperatures.

In document Quantifying the spatio-temporal temperature dynamics of Greater London using thermal Earth observation (Page 184-190)