
4.3 Multilevel modelling

4.3.1 Introduction

Many datasets used in different sectors (e.g. education, social science, medicine, transportation) (Woltman, 2012) have a nested or clustered structure and are described as hierarchical data (Figure 4.5). Well-known forms of nested data can be found in meta-analytic research (e.g. subjects, procedures and results are nested within each experiment in the analysis) and in repeated-measures research, where data (panel data) collected at different times and/or under different conditions are nested within each study participant (Osborne, 2000). More specific examples of nested data that have been studied in the literature are:

• Children within classrooms within schools;

• Patients in a medical study grouped within doctors within different clinics;

• Children within families within communities;

• Employees within departments within business locations;

• Airline passengers within flights within airports;

• Traffic measurements (speed, acceleration etc.) within trips within drivers;

• Accidents within geographic regions;

• Pilots nested within crews, which are nested within fleets.

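[Figure 4.5: diagram of a three-level hierarchical data structure, in which units A, B and C each contain sub-units I, II and III, and each sub-unit contains observations 1...N. Level labels in the figure: LEVEL 1, LEVEL 2, LEVEL 3.]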


The analysis of hierarchical data is challenging because the most appropriate methodological approach must be selected (O’Connell and McCoach, 2004). The underlying reason is that observations belonging to the same group are correlated: they tend to be more similar to each other and share common characteristics. Nested data are therefore not statistically independent. Most statistical analysis techniques require independence of observations as a primary assumption, which makes them inappropriate for data with a hierarchical structure. If one of these methods is used nevertheless, it will produce standard errors that are too small, which leads to a higher probability of rejecting a null hypothesis, i.e. an inflated Type I error rate (Beaubien et al., 2001; O’Connell and McCoach, 2004).
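The size of this underestimation can be illustrated with the design effect for equal-sized clusters, 1 + (m - 1) × ICC, where m is the cluster size and ICC is the intraclass correlation. The following minimal sketch is not part of the analysis in this work: the function names and the numbers (50 observations per cluster, an ICC of 0.10, 5000 observations) are hypothetical, and the formula strictly applies to the standard error of a mean under equal cluster sizes.

```python
import math


def design_effect(cluster_size: float, icc: float) -> float:
    """Kish design effect for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1.0 + (cluster_size - 1.0) * icc


def effective_sample_size(n_obs: int, cluster_size: float, icc: float) -> float:
    """Number of independent observations carrying the same information."""
    return n_obs / design_effect(cluster_size, icc)


def corrected_standard_error(naive_se: float, cluster_size: float, icc: float) -> float:
    """Inflate a standard error that was computed assuming independence."""
    return naive_se * math.sqrt(design_effect(cluster_size, icc))


# Hypothetical numbers, for illustration only.
deff = design_effect(cluster_size=50, icc=0.10)
n_eff = effective_sample_size(n_obs=5000, cluster_size=50, icc=0.10)
se = corrected_standard_error(naive_se=0.02, cluster_size=50, icc=0.10)

print(f"design effect: {deff:.2f}")
print(f"effective sample size: {n_eff:.0f} of 5000")
print(f"naive SE 0.020, corrected SE {se:.3f}")
```

With these illustrative values, even a modest intraclass correlation reduces the information in 5000 correlated observations to that of fewer than 900 independent ones, so a standard error computed under independence would be too small by a factor of more than two.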

The above-mentioned limitation of traditional approaches to analysing nested data can be overcome by applying multilevel models. Multilevel regression, also called hierarchical linear regression, is designed for multilevel (hierarchical) data structures, as it accounts for the statistical dependence among observations in the same group (Goldstein, 2003). Moreover, multilevel models can handle unbalanced data as well as measurement occasions that, in practice, often vary across individuals. Multilevel regression is an extension of ordinary regression in which the parameters are given a probability model, i.e. they are allowed to vary, and random effects other than the overall error term can be included. The two key parts of a multilevel model are the varying coefficients and a model for those varying coefficients (Gelman and Hill, 2007).
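As an illustration of these two parts, a two-level model with a single predictor can be written as follows. This is a generic textbook form rather than the model fitted in this work, and the notation is chosen here: i indexes observations, j indexes groups, the γ terms are fixed effects and the u terms are group-level random effects.

```latex
\begin{align}
  y_{ij} &= \beta_{0j} + \beta_{1j}\, x_{ij} + \varepsilon_{ij},
    & \varepsilon_{ij} &\sim N(0, \sigma^{2}_{\varepsilon}) \\
  \beta_{0j} &= \gamma_{00} + u_{0j}, \\
  \beta_{1j} &= \gamma_{10} + u_{1j},
    & (u_{0j}, u_{1j})^{\top} &\sim N(\mathbf{0}, \boldsymbol{\Sigma}_{u})
\end{align}
```

The first line is the regression with varying coefficients; the second and third lines are the model for those coefficients. Fixing u_{1j} = 0 for all groups reduces this to a random-intercept model.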

Since its inception in the 1970s, multilevel regression has been widely used for analysing hierarchical data and has been developed simultaneously across many fields. It has therefore come to be known by several names, including hierarchical, multilevel, mixed-level, mixed linear, mixed-effects, random-effects, random-coefficient (regression) and (complex) covariance-components modelling. As mentioned above, multilevel regression can be used to handle clustered or grouped data, or data in which the measurements vary from subject to subject. It simultaneously investigates relationships within hierarchical levels (within-group variation, e.g. the variance due to differences between individuals in the same group) and between them (between-group variation, e.g. the variance due to differences between one group and another). Consequently, it is more efficient in accounting for variance among variables at different levels than other existing analysis methods (Woltman, 2012).
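The distinction between within-group and between-group variation can be made concrete with a simple descriptive decomposition of the total sum of squares. The sketch below is illustrative only: the data are invented, the function name is hypothetical, and a fitted multilevel model estimates variance components rather than computing these raw sums.

```python
from statistics import mean


def decompose_variation(groups: dict[str, list[float]]) -> tuple[float, float]:
    """Return the (between-group, within-group) sums of squares."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)
    # Between-group part: how far each group mean lies from the grand mean.
    between = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    # Within-group part: how far each observation lies from its own group mean.
    within = sum(
        (v - mean(values)) ** 2
        for values in groups.values()
        for v in values
    )
    return between, within


# Hypothetical outcome values for three groups, for illustration only.
observations = {
    "group_1": [2.0, 2.4, 2.2],
    "group_2": [3.0, 3.4, 3.2],
    "group_3": [1.0, 1.4, 1.2],
}

between_ss, within_ss = decompose_variation(observations)
share_between = between_ss / (between_ss + within_ss)
print(f"between-group SS: {between_ss:.2f}")
print(f"within-group SS:  {within_ss:.2f}")
print(f"share of variation between groups: {share_between:.2%}")
```

In this constructed example most of the variation lies between groups, which is the situation in which ignoring the grouping is most misleading.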

Other approaches to the analysis of hierarchical data are the disaggregation of data, the aggregation of data and the inclusion of dummy variables in a single-level model. They are presented in Table 4.3 along with their challenges (Beaubien et al., 2001; Goldstein, 2003; O’Connell and McCoach, 2004; Gelman and Hill, 2007; Woltman, 2012). Disaggregation deals with hierarchical data by ignoring the structure and treating all relationships between variables as situated at level 1 of the hierarchy (i.e. at the individual level). By bringing level-2 data down to level 1, disaggregation ignores possible between-group variation. Aggregation takes the opposite approach: instead of ignoring higher-level group differences, it ignores lower-level individual differences. In aggregated statistical models, within-group variation is ignored and individuals are treated as homogeneous entities by using the average for each group.
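What each of these two strategies discards can be seen in a small constructed example. The sketch below uses invented data and a hypothetical helper, `ols_slope`; it compares the slope obtained by pooling all observations (disaggregation), the slope obtained from group averages (aggregation) and the slope inside the groups.

```python
from statistics import mean


def ols_slope(x: list[float], y: list[float]) -> float:
    """Slope of a simple least-squares regression of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical data: group -> (x values, y values).
data = {
    "A": ([1, 2, 3], [3, 2, 1]),
    "B": ([4, 5, 6], [6, 5, 4]),
    "C": ([7, 8, 9], [9, 8, 7]),
}

# Disaggregation: pool all observations and ignore the grouping.
pooled_x = [xi for x, _ in data.values() for xi in x]
pooled_y = [yi for _, y in data.values() for yi in y]
disaggregated = ols_slope(pooled_x, pooled_y)

# Aggregation: keep only one averaged observation per group.
aggregated = ols_slope(
    [mean(x) for x, _ in data.values()],
    [mean(y) for _, y in data.values()],
)

# Within-group relationship: centre x and y on their group means.
centred_x = [xi - mean(x) for x, _ in data.values() for xi in x]
centred_y = [yi - mean(y) for _, y in data.values() for yi in y]
within = ols_slope(centred_x, centred_y)

print(f"disaggregated slope: {disaggregated:.2f}")
print(f"aggregated slope:    {aggregated:.2f}")
print(f"within-group slope:  {within:.2f}")
```

In this example the pooled slope and the slope of the group means are both positive, while the relationship inside every group is negative, so neither single-level strategy describes both levels at once.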

Table 4.3: Strategies to deal with nested data

Strategy: Fit a single-level model and ignore structure (disaggregation)

Consequences:

• the importance of context will not be measured;

• standard errors are too small, leading to incorrect inferences.

Strategy: Include a set of dummy variables for groups (a fixed-effects model)

Consequences:

• a large number of groups leads to a large number of additional parameters to estimate;

• the effects of group-level predictors cannot be estimated simultaneously with group residuals.

Strategy: Fit a single-level model with group-level predictors (aggregation)

Consequences:

• standard errors of coefficients of group-level predictors may be severely underestimated;

• no estimate of the between-group variance that remains unaccounted for.

Strategy: Multilevel modelling (random effects)

Consequences:

• correct standard errors and an estimate of between-group variance.


From the table above, it is notable that the other strategies have considerable difficulty in dealing with nested data. Multilevel modelling is more suitable for this type of data, although it is not without disadvantages. The motivations for using this method are:

✓ it allows the effect of a variable to vary. In many applications, it is not the overall effect of x that is of interest, but how this effect varies in the population;

✓ it relaxes assumptions of traditional statistical models (i.e. independence of errors, homogeneity of regression slopes), since it allows within- and between-subject heterogeneity;

✓ prediction is more accurate when the data vary by group. A model that ignores group effects (classical regression) will tend to understate the error in predictions for new groups;

✓ it does not require the same data structure for each level component, so it can better handle missing and unbalanced data; and

✓ it makes use of the data for every observation or time point, increasing the power of the analysis.

On the other hand, there are some difficulties in using multilevel modelling:

✓ It is a time-consuming method. It can accommodate any number of hierarchical levels, but the workload increases exponentially with each added level;

✓ it requires a different understanding of how the data are structured;

✓ some procedures may require specialized software; and

✓ the outcome variable(s) of interest must be situated at the lowest level of analysis.

As far as this research is concerned, the objective is to use a statistical model that can explain the relationship between deceleration events under normal driving conditions and the factors affecting them. Three types of factors are considered: (1) driver factors (e.g. age, gender and miles driven per year), (2) trip factors (e.g. trip duration, car type, road type) and (3) factors related to the deceleration event (e.g. cause of braking, traffic density at the specific location). Since each driver in the datasets used made several trips and each trip contained multiple deceleration events, the data have a hierarchical structure (the deceleration events are nested within the trips and the trips are nested within the drivers). Therefore, deceleration behaviour can be modelled using a three-level analysis, i.e. the driver level, the trip level and the event level, as can be seen in Figure 4.6.

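Figure 4.6: The hierarchical structure of the data of this work

[Figure content: Driver 1, Driver 2, ..., Driver j at the top level; Trip 1, Trip 2, ... within each driver; Deceleration event 1, Deceleration event 2, ... within each trip.]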
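Mixed-model software typically expects such nested data in "long" format, with one row per lowest-level unit and an identifier column for each higher level. The sketch below shows one possible way to flatten the driver, trip and event hierarchy; the record layout, the column names and the values are hypothetical and do not reproduce the datasets of this work.

```python
# Hypothetical nested records: driver -> trip -> list of peak decelerations (m/s^2).
nested = {
    "driver_1": {"trip_1": [2.1, 3.4, 1.8], "trip_2": [2.9]},
    "driver_2": {"trip_1": [1.5, 2.2]},
}


def to_long_format(records):
    """Flatten driver -> trip -> events into one row per deceleration event."""
    rows = []
    for driver_id, trips in records.items():
        for trip_id, events in trips.items():
            for event_no, deceleration in enumerate(events, start=1):
                rows.append({
                    "driver_id": driver_id,
                    # Trip labels repeat across drivers, so make them unique.
                    "trip_id": f"{driver_id}/{trip_id}",
                    "event_no": event_no,
                    "deceleration": deceleration,
                })
    return rows


rows = to_long_format(nested)
for row in rows:
    print(row)
```

The example is deliberately unbalanced: the drivers have different numbers of trips and the trips have different numbers of events, which the long format represents without padding or missing cells.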

Deceleration events from the same driver may share some common characteristics; for instance, an aggressive driver is more likely to decelerate hard and jerkily (large deceleration value and short duration). In addition, the deceleration events are nested within trips, which may indicate some correlation among the events from the same trip (i.e. within-cluster correlation). On the other hand, there may be variation between deceleration events from different drivers and/or different trips (i.e. between-cluster variation). A statistical model is therefore needed that jointly accounts for both within- and between-cluster variation. As described above, the most suitable model for this purpose is the multilevel mixed-effects linear regression model, and specifically a three-level random-intercept and random-coefficient model, which will be described in detail in the following section.
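One generic way to write a three-level random-intercept and random-coefficient model for an outcome y measured on event i in trip j of driver k is sketched below. The form is an assumption made here for illustration, including the choice of a single predictor x whose coefficient varies at the driver level only; the γ terms denote fixed effects, v driver-level random effects and u trip-level random effects. The specification actually used is the one given in the following section.

```latex
\begin{align}
  y_{ijk} &= \gamma_{000} + \gamma_{100}\, x_{ijk}
             + v_{0k} + v_{1k}\, x_{ijk}
             + u_{0jk} + \varepsilon_{ijk}, \\
  (v_{0k}, v_{1k})^{\top} &\sim N(\mathbf{0}, \boldsymbol{\Sigma}_{v}), \qquad
  u_{0jk} \sim N(0, \sigma^{2}_{u}), \qquad
  \varepsilon_{ijk} \sim N(0, \sigma^{2}_{\varepsilon})
\end{align}
```

Here v_{0k} and u_{0jk} are the random intercepts for drivers and trips, and v_{1k} is the random coefficient that lets the effect of x differ between drivers.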

The multilevel model offers a more comprehensive, appropriate and powerful analysis of the specific datasets than simple regression models. The mixed model allows full exploitation of the data acquired from three different studies, making use of the structure of the data and exploring as many factors as possible. It allows for dependency of deceleration characteristics for the same driver and within the same trip, and it examines the variation of deceleration characteristics between different drivers and between different trips conducted by the same driver. It also handles unbalanced data, since not all drivers made the same number of trips and not every trip contains the same number of deceleration events. However, it is much more demanding in terms of software and statistical knowledge. In this work, multilevel modelling was applied using the STATA software for the two Field operational projects and the R programming language for the UDRIVE project.