4.2 Results
4.2.2 Models based on single probability distributions
From now on we use the following notation for all the models based on logistic regression:
\[
\mathrm{logit}(p) = \log\left(\frac{p}{1-p}\right) = a + b_{in}\theta_{in} + b_{out}\theta_{out} + b_{\varphi}\varphi_{out} + b_{R} f_{R} + b_{WS} f_{WS} + b_{WD} f_{WD}, \tag{4.1}
\]
where $a$ and $b_i$ are the regression parameters and $f_R$, $f_{WS}$ and $f_{WD}$ are dummy variables (based on a crude discretisation, see Section 2.2) for, respectively, rain presence, wind speed levels and wind direction sectors.
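For reference, Equation (4.1) can be inverted to give the probability explicitly. This is the standard inverse of the logit transform rather than a result specific to this study; the symbol $\eta$ is a shorthand introduced here for the linear predictor and is not part of the original notation.

```latex
\[
\eta = a + b_{in}\theta_{in} + b_{out}\theta_{out} + b_{\varphi}\varphi_{out}
       + b_{R} f_{R} + b_{WS} f_{WS} + b_{WD} f_{WD},
\qquad
p = \frac{e^{\eta}}{1 + e^{\eta}} = \frac{1}{1 + e^{-\eta}} .
\]
```

Models using only a subset of the predictors correspond to setting the remaining coefficients to zero.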
Univariate logistic models
In this section, we present results from separate logistic regressions on each available independent variable, together with some possible transformations of these variables.
Models with untransformed variables. The regression curves are presented in Figures 4.2(a)-4.2(f) and the regression parameters are given in Table 4.2. From these we observe that each of the variables tested is statistically significant (p < 0.001). The model with $\theta_{out}$ has the largest likelihood ratio statistic, implying that it best describes the variations of our outcome variable.
However, as discussed in Section 3.2.3, statistical significance by itself does not necessarily provide clear-cut conclusions concerning the model's capacity to correctly explain our outcome variable. We therefore give in Table 4.3 a summary of the goodness-of-fit criteria for each of these models. According to these criteria, the model with $\theta_{out}$ once again offers the best fit among all variables. We thus conclude that $\theta_{out}$ should be integrated in a final model, possibly in conjunction with other variables if their contributions are statistically significant and improve the quality of adjustment. The implications of this superiority of $\theta_{out}$ as a predictive variable are discussed in Section 4.3.1.
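As a reading aid for Table 4.3, the sketch below shows how three of the tabulated indicators can be computed from observed outcomes and predicted probabilities. It assumes, following common usage, that AUC is the area under the ROC curve, that $D_{xy}$ is Somers' rank correlation, related to it by $D_{xy} = 2(\mathrm{AUC} - 0.5)$, and that $B$ is the Brier score; the definitions actually used in this work are those of Section 3.2.3. The function names and the toy data are illustrative only.

```python
import numpy as np

def auc_rank(y, p):
    """Area under the ROC curve via the Mann-Whitney rank formula."""
    y = np.asarray(y, dtype=int)
    p = np.asarray(p, dtype=float)
    order = np.argsort(p, kind="mergesort")
    ranks = np.empty(len(p), dtype=float)
    ranks[order] = np.arange(1, len(p) + 1)
    # Tied predictions share the average of their ranks
    for value in np.unique(p):
        tied = p == value
        ranks[tied] = ranks[tied].mean()
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def somers_dxy(auc):
    """Somers' Dxy expressed through the AUC (assumed relation, see lead-in)."""
    return 2.0 * (auc - 0.5)

def brier_score(y, p):
    """Mean squared difference between predicted probability and outcome."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.mean((p - y) ** 2))

# Toy data (hypothetical): 1 = window open, 0 = window closed
y_obs = [0, 0, 1, 1]
p_hat = [0.1, 0.4, 0.35, 0.8]
auc = auc_rank(y_obs, p_hat)
print(auc, somers_dxy(auc), brier_score(y_obs, p_hat))
```

Under these assumptions, the tabulated values for the $\theta_{out}$ model are mutually consistent: 2(0.782 - 0.5) = 0.564, against a tabulated $D_{xy}$ of 0.563, a difference attributable to rounding of the AUC.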
Models using polynomial logits. We noticed in Figure 4.2(b) that a linear logit does not predict well the observed proportions of windows open at high outdoor temperatures. To account for this phenomenon, a possible refinement is to use a polynomial of degree $q$ for the logit of the probability. In this case:
\[
\mathrm{logit}(p) = a + b_1\theta_{out} + b_2\theta_{out}^2 + \dots + b_q\theta_{out}^q, \tag{4.2}
\]
where we use stepwise logistic regression to determine the highest significant order. This procedure determines that a fourth-degree polynomial is appropriate, with regression parameters $a = -2.387 \pm 0.005$, $b_1 = (5.55 \pm 0.15) \cdot 10^{-2}$, $b_2 = (1.73 \pm 0.21) \cdot 10^{-3}$, $b_3 = (2.88 \pm 0.11) \cdot 10^{-4}$, $b_4 = (-9.64 \pm 0.19) \cdot 10^{-6}$. We see in Figure 4.2(b) that the associated probability distribution (blue curve) fits the observed proportions better, particularly for high values of $\theta_{out}$. Furthermore, regressions with polynomials of lower degree do not offer clear improvements over the linear logit model. Although the goodness-of-fit indicators are not much improved (Table 4.3), this is the best model using a single predictor. Similarly, a limited but significant improvement is obtained when using a polynomial logit with $\theta_{in}$ as predictor (Figure 4.2(a)).
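To make Equation (4.2) concrete, the following sketch evaluates the fourth-degree polynomial logit with the central parameter estimates quoted above and converts it to a probability. The function names are illustrative, the quoted uncertainties are ignored, and temperatures are assumed to be expressed in degrees Celsius.

```python
import math

# Central estimates quoted in the text: a, b1, b2, b3, b4 (uncertainties ignored)
COEFFS = (-2.387, 5.55e-2, 1.73e-3, 2.88e-4, -9.64e-6)

def polynomial_logit(theta_out, coeffs=COEFFS):
    """Evaluate a + b1*t + ... + bq*t**q using Horner's scheme."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * theta_out + c
    return result

def probability_open(theta_out, coeffs=COEFFS):
    """Invert the logit to obtain the probability of windows being open."""
    return 1.0 / (1.0 + math.exp(-polynomial_logit(theta_out, coeffs)))

# Outdoor temperatures assumed to be in degrees Celsius
for t in (0.0, 10.0, 20.0, 28.0, 30.0):
    print(f"theta_out = {t:5.1f} -> p = {probability_open(t):.3f}")
```

With these coefficients the negative fourth-order term makes the logit level off and then decrease at the upper end of the temperature range, which is how the polynomial form can follow the observed proportions at high $\theta_{out}$ where a linear logit cannot.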
Although these models appear to better emulate observed trends, models with polynomial terms tend to be criticised for the lack of interpretability of their regression coefficients; other approaches are therefore often preferred. Given the structure of the observed proportions, viable possibilities include a non-parametric estimation of the probability, or the fitting of two linear logistic models for the distinct domains of $\theta_{out}$ where the observed behaviours are different.
Models based on deviations from comfort temperature. Another possible choice of driving variable is the deviation of $\theta_{in}$ or $\theta_{out}$ from a comfort temperature $\theta_{in,comf}$, defined for example by the adaptive comfort model of the CEN or ASHRAE standards (see Section 6.1.1). We perform logistic regression with $(\theta - \theta_{in,comf})$ as the driving variable, taking in turn $\theta = \theta_{in}$ and $\theta = \theta_{out}$. The corresponding results are given in Figure 4.3 and Table 4.2 (bottom).
In this case, we obtain a slightly lower goodness of fit and likelihood ratio; that is, the quality of adjustment is somewhat lower than when using the raw thermal variables. It is however worth noting that the proportion of windows open reaches a maximum near $\theta_{out} = \theta_{in,comf}$. Using the equations given by either the CEN or the ASHRAE standard produces similar results.
Multivariate logistic models
Following from these univariate models, we proceed to consider models with several variables and assess their increased predictive value. We first determine the best model containing two variables, subject to the significance of the added variable and the stability of the primary variable, and then continue this procedure with further predictors until no added significance is obtained. This procedure is known as forward selection (see Section 3.2.4).

| Variables | LR | AUC | $D_{xy}$ | $\Gamma$ | $\tau_a$ | $R^2_N$ | $B$ |
|---|---|---|---|---|---|---|---|
| $\theta_{out}$ | 330873 | 0.782 | 0.563 | 0.565 | 0.240 | 0.273 | 0.168 |
| $\theta_{in}$ | 71243 | 0.632 | 0.264 | 0.269 | 0.113 | 0.064 | 0.202 |
| $\varphi_{out}$ | 32566 | 0.590 | 0.179 | 0.181 | 0.076 | 0.030 | 0.208 |
| $v_{wind}$ | 10992 | 0.538 | 0.077 | 0.078 | 0.033 | 0.010 | 0.212 |
| $p_{atm,red}$ | 9556 | 0.559 | 0.118 | 0.122 | 0.050 | 0.009 | 0.212 |
| $\alpha_{wind}$ | 4153 | 0.531 | 0.063 | 0.065 | 0.027 | 0.004 | 0.213 |
| $D_{prec}$ | 202 | 0.493 | -0.013 | -0.112 | -0.006 | 0.000 | 0.213 |
| $f_{WD}$ | 27756 | 0.579 | 0.157 | 0.211 | 0.067 | 0.025 | 0.209 |
| $f_{WS}$ | 19126 | 0.566 | 0.133 | 0.176 | 0.057 | 0.017 | 0.211 |
| $f_{R}$ | 1065 | 0.507 | 0.014 | 0.119 | 0.006 | 0.001 | 0.213 |
| $\theta_{out}$ (polyn.) | 349191 | 0.783 | 0.566 | 0.568 | 0.241 | 0.287 | 0.166 |
| $\theta_{in}$ (polyn.) | 91047 | 0.637 | 0.274 | 0.281 | 0.117 | 0.081 | 0.200 |
| $\theta_{out} - \theta_{comf}^{(CEN)}$ | 309681 | 0.774 | 0.548 | 0.550 | 0.234 | 0.258 | 0.171 |
| $\theta_{out} - \theta_{comf}^{(ASHRAE)}$ | 308337 | 0.774 | 0.548 | 0.549 | 0.234 | 0.257 | 0.171 |
| $\theta_{in} - \theta_{comf}^{(CEN)}$ | 47692 | 0.603 | 0.206 | 0.208 | 0.088 | 0.043 | 0.206 |
| $\theta_{in} - \theta_{comf}^{(ASHRAE)}$ | 44142 | 0.602 | 0.203 | 0.205 | 0.087 | 0.040 | 0.207 |
| $\theta_{out}$, $\theta_{in}$ | 343507 | 0.785 | 0.570 | 0.571 | 0.243 | 0.283 | 0.167 |
| $\theta_{out}$, $\varphi_{out}$ | 342396 | 0.785 | 0.569 | 0.570 | 0.243 | 0.283 | 0.167 |
| $\theta_{out}$, $\alpha_{wind}$ | 334066 | 0.783 | 0.566 | 0.567 | 0.241 | 0.276 | 0.168 |
| $\theta_{out}$, $v_{wind}$ | 331803 | 0.782 | 0.564 | 0.566 | 0.241 | 0.274 | 0.168 |
| $\theta_{out}$, $D_{prec}$ | 331616 | 0.782 | 0.564 | 0.566 | 0.240 | 0.274 | 0.168 |
| $\theta_{out}$, $f_{WD}$ | 332478 | 0.782 | 0.565 | 0.566 | 0.241 | 0.275 | 0.168 |
| $\theta_{out}$, $f_{WS}$ | 331683 | 0.782 | 0.564 | 0.565 | 0.240 | 0.274 | 0.168 |
| $\theta_{out}$, $f_{R}$ | 331325 | 0.782 | 0.564 | 0.566 | 0.240 | 0.274 | 0.168 |
| $\theta_{out}$, $\theta_{in}$, $\varphi_{out}$ | 354434 | 0.789 | 0.577 | 0.578 | 0.246 | 0.291 | 0.165 |

Table 4.3: Goodness-of-fit estimators for logistic models including one or several variables
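The forward selection procedure described above can be sketched as follows. This is a minimal illustration and not the implementation used for the results reported here: it fits each candidate model by Newton-Raphson using NumPy only, admits one continuous predictor at a time when its likelihood ratio gain exceeds the chi-square critical value for one degree of freedom at p = 0.001, and omits the check on the stability of the primary variable. The function names and the synthetic data are hypothetical; multi-level factors such as $f_{WD}$ would need a threshold with more degrees of freedom.

```python
import numpy as np

CHI2_1DF_P001 = 10.828  # chi-square critical value, 1 degree of freedom, p = 0.001

def fit_logistic(X, y, n_iter=50, tol=1e-10):
    """Fit a logistic regression by Newton-Raphson.

    X must already contain a column of ones for the intercept.
    Returns (coefficients, maximised log-likelihood).
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = p * (1.0 - p)
        step = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    eta = X @ beta
    loglik = float(np.sum(y * eta - np.logaddexp(0.0, eta)))
    return beta, loglik

def forward_selection(columns, y, threshold=CHI2_1DF_P001):
    """Add predictors one by one while the likelihood ratio gain is significant.

    columns maps a predictor name to a 1-D array; returns names in order of entry.
    """
    design = np.ones((len(y), 1))          # intercept-only model to start with
    _, current_ll = fit_logistic(design, y)
    remaining = dict(columns)
    selected = []
    while remaining:
        logliks = {
            name: fit_logistic(np.column_stack([design, col]), y)[1]
            for name, col in remaining.items()
        }
        best = max(logliks, key=logliks.get)
        if 2.0 * (logliks[best] - current_ll) < threshold:
            break                          # no remaining predictor is significant
        design = np.column_stack([design, remaining.pop(best)])
        current_ll = logliks[best]
        selected.append(best)
    return selected

# Synthetic data (hypothetical): two informative predictors and one pure-noise column
rng = np.random.default_rng(0)
n = 5000
data = {name: rng.normal(size=n) for name in ("x1", "x2", "noise")}
eta_true = -0.5 + 1.5 * data["x1"] + 0.8 * data["x2"]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-eta_true))).astype(float)
print(forward_selection(data, y))
```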
Models with two variables. Based on logistic regressions for models combining $\theta_{out}$ with each other available variable, we observe (Table 4.3) that the model with $\theta_{out}$ and $\theta_{in}$ ($a = 0.794 \pm 0.030$, $b_{out} = 0.14760 \pm 0.00031$, $b_{in} = -0.1541 \pm 0.0013$) has the highest statistical significance, according to the likelihood ratio statistic. Furthermore, among the two-variable models this is the one that fits the data best according to all our statistical criteria, although the improvement in these indicators from adding $\theta_{in}$ is rather modest. However, a plot of the observed proportions of windows open versus $\theta_{out}$ and $\theta_{in}$, with regression surface levels (Figure 4.4(a)), shows that observed variations are better accounted for and thus confirms the existence of an independent contribution of each variable. Finally, the stability of the slope associated with $\theta_{out}$ is preserved, as its standard error remains extremely low, which shows that the correlation between $\theta_{in}$ and $\theta_{out}$ is not problematic for this model.
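As a small illustration of how the retained two-variable model is evaluated, the sketch below applies the central parameter estimates quoted above. The function name is illustrative, the quoted standard errors are ignored, and temperatures are assumed to be in degrees Celsius.

```python
import math

# Central estimates quoted in the text for the model with theta_out and theta_in
A, B_OUT, B_IN = 0.794, 0.14760, -0.1541

def probability_open_two_var(theta_out, theta_in):
    """Probability of windows open from the two-variable linear logit."""
    eta = A + B_OUT * theta_out + B_IN * theta_in
    return 1.0 / (1.0 + math.exp(-eta))

# Vary theta_in at a fixed theta_out (temperatures assumed in degrees Celsius)
for theta_in in (20.0, 22.0, 24.0):
    p = probability_open_two_var(20.0, theta_in)
    print(f"theta_out = 20.0, theta_in = {theta_in:4.1f} -> p = {p:.3f}")
```

Because the two slopes have opposite signs in the fitted model, the predicted probability rises with $\theta_{out}$ at fixed $\theta_{in}$ and falls with $\theta_{in}$ at fixed $\theta_{out}$.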
Models with three or more variables. Having retained the model including $\theta_{out}$ and $\theta_{in}$, we check for the significance of the inclusion of a third variable. Based on regression results for the models with a third variable, the best model includes the external relative humidity $\varphi_{out}$, and this inclusion is statistically significant (p < 0.001). However, the goodness-of-fit criteria increase only very slightly (Table 4.3); that is, the added predictive accuracy from the inclusion of $\varphi_{out}$ is marginal. Some other parameters in models with four or five variables were also found to be statistically significant, but without any increase in the goodness-of-fit indicators. For the sake of parsimony, it seems sensible to keep the model with just the two variables $\theta_{out}$ and $\theta_{in}$.
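The contrast between statistical significance and predictive gain can be read directly from Table 4.3. The sketch below assumes that the LR column is each model's likelihood ratio statistic against the same intercept-only model on the same data, so that the difference between two nested models is the likelihood ratio statistic of the added variable; the dictionary keys are illustrative labels.

```python
# Values copied from Table 4.3
LR = {
    ("theta_out",): 330873,
    ("theta_out", "theta_in"): 343507,
    ("theta_out", "theta_in", "phi_out"): 354434,
}
AUC = {
    ("theta_out",): 0.782,
    ("theta_out", "theta_in"): 0.785,
    ("theta_out", "theta_in", "phi_out"): 0.789,
}
CHI2_1DF_P001 = 10.828  # chi-square critical value, 1 degree of freedom, p = 0.001

def lr_gain(smaller, larger):
    """Likelihood ratio statistic of the variable added to a nested model."""
    return LR[larger] - LR[smaller]

models = list(LR)
for smaller, larger in zip(models, models[1:]):
    gain = lr_gain(smaller, larger)
    delta_auc = AUC[larger] - AUC[smaller]
    print(f"adding {larger[-1]}: LR gain = {gain}, "
          f"significant = {gain > CHI2_1DF_P001}, AUC change = {delta_auc:+.3f}")
```

Each added variable clears the significance threshold by a very wide margin, while the AUC moves only in the third decimal place, which is the situation described in the text and the reason for preferring the more parsimonious model.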
Other factors. Inspired by the results of Herkel et al. [81, 82], we attempted to include a factor with twelve levels, one for each month of the year, in order to check for an additional effect of season on window actions. This was not found to bring any significant improvement; that is, we observe almost the same regression parameters based on $\theta_{out}$ for every month.