
4.2.2.4 Logistic regression

Logistic regression extends regression by allowing the researcher to build models that predict categorical outcomes based on past data (e.g., will a specific user finish the task at hand?). Multiple predictors (independent variables) can be used to predict this outcome, and these can be either continuous or categorical. Logistic regression is used across a large number of domains, including healthcare [57] (e.g., given these variables, is this patient’s tumour benign or cancerous?). These are life-saving techniques; however, they can also be adopted in other domains to answer critical questions such as: will users abandon the task (give up) when they see this enrolment page? This kind of assessment requires a rigorous empirical effort to understand the different user groups, their behaviour and attitudes, while determining and selecting techniques to build representative models.

4.2. Building User Group Behavioural Models for Reuse 115

Logistic regression can predict a binary output (binary logistic regression, e.g., yes or no) or a categorical output with more than two categories (multinomial logistic regression, e.g., morning, afternoon, evening or night). In binary logistic regression, models are built to help the researcher understand the probability of an outcome occurring given the values of the independent variables. Binary logistic regression is also adopted when the underlying data departs from the linearity assumption of simple or multiple regression: when the outcome variable is binary (e.g., true or false), a linear relationship with the predictor variables is not possible. Berry and Feldman (cited in [57]) suggest the use of non-linear transformations on variables and provide the logarithmic transformation as an example by which one could express non-linear relationships in a linear way [57]. The following equation outlines this transformation.

$$f(x) = \frac{1}{1 + e^{-x}} \tag{4.7}$$
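As a minimal sketch of equation 4.7, the function below evaluates the logistic transformation for a few inputs. The function name `logistic` and the sample inputs are illustrative assumptions rather than part of the original study; the two-branch form is only a common way to avoid overflow for large negative inputs.

```python
import math


def logistic(x: float) -> float:
    """Logistic function of equation 4.7: f(x) = 1 / (1 + e^-x)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # Algebraically identical form that avoids overflow when x is very negative.
    z = math.exp(x)
    return z / (1.0 + z)


# Illustrative inputs (assumed): any real value is mapped into the range 0..1.
for x in (-4, -1, 0, 1, 4):
    print(f"x = {x:+d}  ->  f(x) = {logistic(x):.3f}")
```

Whatever the size of the input, the output stays between 0 and 1, which is what allows it to be read as a probability.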

For this reason, and assuming multiple predictors, the logistic regression equation takes the following form.

$$P(\hat{y}_i) = \frac{1}{1 + e^{-(b_0 + b_1 X_{1i} + b_2 X_{2i} + b_3 X_{3i} + \dots + b_n X_{ni})}} \tag{4.8}$$

$P(\hat{y}_i)$ is the probability of $\hat{y}_i$ occurring, $e$ is the base of natural logarithms (approx. 2.718), $b_0$ is the y intercept for the model and $b_n$ is the coefficient (weight) for the $n$th predictor ($X_{ni}$). $X_{ni}$ is the value of the $n$th predictor for the $i$th observation. $b_0$ determines the outcome ($P(\hat{y}_i)$) when all predictors are set to 0 ($X_{ni} = 0$), whereas $b_n$ adjusts the rate of change in the probability of $\hat{y}_i$ occurring when $X_{ni}$ is incremented. An outcome close to 0 means that the event is unlikely to occur, whereas an outcome close to 1 means that it is very likely to occur.

Model fitting is based on maximum-likelihood estimation (MLE), whereby a good fit is obtained when the values predicted by the model given specific predictors are closest to the actual observations. This is an iterative process that terminates when convergence has been reached, that is, when minor or no improvements on the previous model have been found. If convergence is not reached, it might indicate that the given predictors are not good enough to predict an outcome (e.g., high levels of collinearity making it difficult to assess the effects of individual predictors).

The log-likelihood statistic can be used to determine the model’s goodness of fit. The log-likelihood test “is analogous to the residual sum of squares” used in multiple regression, indicating the extent of unexplained information in a given model [57]. MLE is obtained by maximizing the log-likelihood statistic for a model [111]. Log-likelihood compares the actual outcomes (data points in the data set) with the values generated through the model ($P(\hat{y}_i)$). A log-likelihood statistic that is large in magnitude (far from 0) indicates a weak model (poor fit), and a value closer to 0 indicates few or no unexplained observations (with 0 corresponding to a perfect fit).
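The relationship between equation 4.8 and the log-likelihood statistic can be sketched as follows. The log-likelihood is computed here with the standard formula, the sum over observations of y·ln(p) + (1 − y)·ln(1 − p); the function names, the observed outcomes and both sets of predicted probabilities are hypothetical values chosen only to show that a better fit gives a log-likelihood closer to 0.

```python
import math


def predict_probability(intercept, coefficients, predictors):
    """Equation 4.8: probability of the outcome for one observation."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, predictors))
    return 1.0 / (1.0 + math.exp(-linear))


def log_likelihood(outcomes, probabilities):
    """Sum of y*ln(p) + (1 - y)*ln(1 - p) over all observations."""
    return sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(outcomes, probabilities)
    )


# Hypothetical observed outcomes (1 = event occurred, 0 = it did not).
observed = [1, 0, 1, 1]

# Hypothetical predictions from two candidate models.
close_fit = [0.9, 0.1, 0.8, 0.7]  # predictions near the observed outcomes
weak_fit = [0.5, 0.5, 0.5, 0.5]   # uninformative predictions

print("close fit LL:", round(log_likelihood(observed, close_fit), 3))
print("weak fit LL: ", round(log_likelihood(observed, weak_fit), 3))
```

Both values are negative; the model whose predictions sit closer to the observed outcomes produces the value nearer to 0.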

Log-likelihood values are not meaningful on their own, but are used to compare models during model fitting iterations until convergence is achieved. Convergence based on the log-likelihood statistic can be specified as the point at which the percentage difference in log-likelihood between iterations falls below a specific value, unless a maximum number of iterations is specified [39], in which case the iterative process stops (assuming the use of a statistical package such as SPSS).
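The stopping rule described above can be sketched with a small fitting loop. This is an illustration under stated assumptions, not the procedure used by SPSS: statistical packages typically rely on Newton-type iterations, whereas plain gradient ascent is used here for brevity. The data, learning rate, tolerance and iteration cap are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(b0, b1, xs, ys):
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


def fit(xs, ys, learning_rate=0.1, tolerance=1e-6, max_iterations=50_000):
    """Gradient ascent on the log-likelihood for one predictor.

    Stops when the relative change in log-likelihood between two
    iterations falls below `tolerance`, or when `max_iterations` is hit.
    """
    b0, b1 = 0.0, 0.0
    n = len(xs)
    previous = log_likelihood(b0, b1, xs, ys)
    for iteration in range(1, max_iterations + 1):
        residuals = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        # Gradient of the mean log-likelihood with respect to b0 and b1.
        b0 += learning_rate * sum(residuals) / n
        b1 += learning_rate * sum(r * x for r, x in zip(residuals, xs)) / n
        current = log_likelihood(b0, b1, xs, ys)
        relative_change = abs((current - previous) / previous)
        previous = current
        if relative_change < tolerance:
            return b0, b1, current, iteration, True
    return b0, b1, previous, max_iterations, False


# Hypothetical data: one predictor and a binary outcome.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 1, 0, 1]
b0, b1, ll, iterations, converged = fit(xs, ys)
print(f"b0={b0:.3f} b1={b1:.3f} LL={ll:.3f} iterations={iterations} converged={converged}")
```

A tighter tolerance gives estimates closer to the maximum-likelihood solution at the cost of more iterations, which is the trade-off the convergence threshold controls.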

Model fitting in binary logistic regression requires particular attention. Since the outcome variable is binary (i.e., true or false), the base model (against which the quality of subsequently generated models is measured) cannot be the mean of the outcome values, unlike in linear regression. In binary logistic regression the base model, or best guess, is the outcome category that occurred more often in the observed data. The base model assumes no predictors and is made up of the y intercept only ($b_0$): the best guess when no predictors are available.
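A base model of this kind can be sketched as below. With no predictors, the maximum-likelihood intercept is the log-odds of the observed proportion of events, so the model predicts that proportion for every observation. The function name and the outcome counts (30 events out of 100 observations) are hypothetical.

```python
import math
from collections import Counter


def baseline_model(outcomes):
    """Intercept-only model for a binary outcome coded as 0/1.

    Assumes both categories occur at least once; otherwise the
    log-odds are undefined.
    """
    n = len(outcomes)
    events = sum(outcomes)
    proportion = events / n
    intercept = math.log(proportion / (1 - proportion))  # b0 as log-odds
    ll = events * math.log(proportion) + (n - events) * math.log(1 - proportion)
    best_guess = Counter(outcomes).most_common(1)[0][0]  # most frequent category
    return intercept, ll, best_guess


# Hypothetical sample: 30 observations with the event, 70 without.
outcomes = [1] * 30 + [0] * 70
b0, ll_base, best_guess = baseline_model(outcomes)
print(f"b0 = {b0:.4f}, baseline LL = {ll_base:.3f}, best guess = {best_guess}")
```

The baseline log-likelihood returned here is the reference value against which models with predictors are compared.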

Predictors are then introduced to the base model while monitoring for improvements using the following equation.

$$\chi^2 = 2\left[LL(\text{new model}) - LL(\text{base model})\right] \tag{4.9}$$
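Equation 4.9 can be evaluated as in the sketch below. The log-likelihood values are hypothetical, and the p-value helper only covers the case where the new model adds a single parameter (one degree of freedom), for which the chi-square tail probability equals erfc(sqrt(x / 2)).

```python
import math


def likelihood_ratio_chi_square(ll_new, ll_base):
    """Equation 4.9: improvement of the new model over the base model."""
    return 2.0 * (ll_new - ll_base)


def p_value_one_df(chi_square):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi_square / 2.0))


# Hypothetical log-likelihoods for a base model and a model with one added predictor.
ll_base = -68.03
ll_new = -60.12

chi_square = likelihood_ratio_chi_square(ll_new, ll_base)
print(f"chi-square = {chi_square:.2f}, p = {p_value_one_df(chi_square):.5f}")
```

A small p-value suggests that the added predictor improves the fit by more than would be expected by chance.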

Various techniques exist to introduce predictors: (1) forced entry, introducing blocks of predictors in one go, or (2) stepwise entry, starting off with the constant and introducing predictors gradually, as determined by a score (forward stepwise). Alternatively, the process can start off with all the predictors in the model and then remove those that have the least impact on the model’s fit to the data (backward stepwise). Some predictors might not be significant and may be excluded to improve the overall model, leaving only those that make a significant contribution to the model’s predictive power (using the likelihood ratio test). The Wald statistic ($b / SE_b$) can be used to measure the utility of each predictor in improving the model and its outcome predictions, although it may be misleading, especially when the regression coefficients are large: large coefficients inflate the standard error, which is in turn used to compute the Wald statistic [57]. The likelihood ratio is the most computationally expensive statistic, but it is more reliable than the Wald statistic.

The odds ratio (Exp(B)) can also be used to determine the importance of a predictor in its contribution towards predicting outcomes. It indicates the change in odds (of an outcome occurring) given a unit change in a predictor ($X_n$). If Exp(B) for age is 1.5, then a one-year increase in age increases the odds of the outcome by 50% ((OR − 1) × 100). However, if the odds ratio for height is 0.99, a unit increase in height would have only a negligible impact on the chance of the outcome occurring (a 1% decrease in odds, i.e., (0.99 − 1) × 100). When OR is greater than 1, an increase in the predictor results in an increase in the odds of the outcome occurring; when OR is less than 1, the odds decrease.
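The odds-ratio arithmetic above, together with the Wald ratio, can be sketched as follows. The function names are illustrative, and the coefficient and standard error passed to `wald_statistic` are hypothetical values, not estimates from the study.

```python
import math


def odds_ratio(coefficient):
    """Exp(B): multiplicative change in odds for a unit increase in the predictor."""
    return math.exp(coefficient)


def percent_change_in_odds(ratio):
    """(OR - 1) x 100, as used in the age and height examples."""
    return (ratio - 1.0) * 100.0


def wald_statistic(coefficient, standard_error):
    """Wald ratio b / SE_b for a single predictor."""
    return coefficient / standard_error


print(percent_change_in_odds(1.5))    # age example: odds rise by about 50%
print(percent_change_in_odds(0.99))   # height example: odds fall by about 1%

# Hypothetical coefficient and standard error.
b, se = 0.405, 0.150
print(f"Exp(B) = {odds_ratio(b):.3f}, Wald = {wald_statistic(b, se):.2f}")
```

Because the standard error sits in the denominator, an inflated standard error shrinks the Wald ratio, which is the weakness noted above for large coefficients.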

According to Field [57], backward stepwise methods are well suited to exploratory work on new sets of data, as opposed to data for which previous research exists and can be used as a basis for hypothesis testing. Furthermore, backward stepwise is preferable to the forward method [107, 57], mainly because all independent variables are initially included in the model, so a variable that is highly significant only in the presence of another variable will not be excluded from the final model. Forward stepwise might exclude such important variables (i.e., suppressor effects).

Measures such as Hosmer and Lemeshow’s $R^2_L$, Cox and Snell’s $R^2_{CS}$ and Nagelkerke’s $R^2_N$ provide a “gauge of the substantive significance of the model” [57]. These measures produce different values; however, their interpretation is conceptually consistent with the goodness-of-fit measure ($R^2$) used in linear regression. The Nagelkerke statistic can be interpreted as the extent to which a model explains variability in the data. For instance, a value of .469 indicates that the model can explain 46.9% of the variability. There are various statistics to measure goodness of fit; however, these measures, including both Cox and Snell’s and Nagelkerke’s, are pseudo $R^2$ statistics since they are only conceptually analogous to the $R^2$ goodness-of-fit measure used in linear regression [22].
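As a rough illustration, the three pseudo R² measures can be computed from the base and new model log-likelihoods using their commonly quoted definitions. The formulas below are those standard definitions rather than anything stated in this section, and the log-likelihood values and sample size are hypothetical.

```python
import math


def pseudo_r_squared(ll_base, ll_new, n):
    """Return (R2_L, R2_CS, R2_N) for a model fitted to n observations.

    R2_L  (Hosmer and Lemeshow): proportional reduction in -2LL.
    R2_CS (Cox and Snell):       1 - exp(2 * (LL_base - LL_new) / n).
    R2_N  (Nagelkerke):          R2_CS divided by its maximum attainable value.
    """
    r2_l = (ll_base - ll_new) / ll_base
    r2_cs = 1.0 - math.exp(2.0 * (ll_base - ll_new) / n)
    r2_n = r2_cs / (1.0 - math.exp(2.0 * ll_base / n))
    return r2_l, r2_cs, r2_n


# Hypothetical log-likelihoods and sample size.
r2_l, r2_cs, r2_n = pseudo_r_squared(ll_base=-68.0, ll_new=-60.0, n=100)
print(f"R2_L = {r2_l:.3f}, R2_CS = {r2_cs:.3f}, R2_N = {r2_n:.3f}")
```

Nagelkerke’s measure rescales Cox and Snell’s so that it can reach 1, which is why it is always the larger of the two for the same model.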

Consider a model that explains the user’s willingness to enrol for and use an e-service given a particular enrolment process designed according to the factors outlined in Table 4.1. Table 4.4 outlines the parameter estimates for a logistic regression model involving a subset of (significant) predictors (Items to Generate (ItG), Items to Recall (ItR) and Type of Service (ToS)) for a categorical outcome variable (willingness to complete task (WCT)).

Table 4.4: Example parameter estimates for the willingness to complete task outcome variable

| Parameter | Coefficient | Sig. |
|---|---|---|
| Intercept ($b_0$) | -3.201 | .000 |
| ItG ($b_1$) | 0.878 | .000 |
| ItR ($b_2$) | -0.224 | .000 |
| ToS 1 ($b_3$) | 2.635 | .000 |
| ToS 2 ($b_4$) | 1.646 | .000 |
| ToS 3 ($b_5$) | 0.119 | .808 |

$R^2$ for this model is .23 (Cox & Snell) and .33 (Nagelkerke). According to the Nagelkerke statistic, the binary logistic regression model with these predictors therefore explains 33% of the variability in the data.

Logistic regression was carried out using the backward stepwise (likelihood ratio) method. The Delays (D) and Interruptions (I) predictors were excluded from this model (at the final iteration) for this particular set of observations (retrieved from a study involving a group of undergraduate students). Table 4.6 shows how the model performs in comparison to actual observations, contrasting predicted outcomes with actual readings. The predicted outcome follows the internal encoding shown in Table 4.5 (SPSS uses these codes internally).

Table 4.5: Dependent variable encoding (willingness to complete task)

| Original Value | Internal Value |
|---|---|
| Yes | 0 |
| No | 1 |

Table 4.6: A small sample of actual observations together with their respective modelled outcomes (expected). Workings for values in bold are shown in equations 4.10 and 4.11.

| ItG | ItR | D | I | ToS | Complete Task? (observed) | P(Outcome) | Complete Task? (modelled) |
|---|---|---|---|---|---|---|---|
| 3 | 8 | 2 | 2 | 0 | No | .56819 | No |

Modelled outcomes are generated using equation 4.8 and worked examples are given below.

$$P(WCT) = \frac{1}{1 + e^{-(-3.201 + (0.878 \times 3) + (-0.224 \times 0) + 2.635)}} = 0.887 \tag{4.10}$$

$$P(WCT) = \frac{1}{1 + e^{-(-3.201 + (0.878 \times 3) + (-0.224 \times 0) + 0)}} = 0.362 \tag{4.11}$$

Logistic regression allows the researcher to build models for data with categorical outcomes. This, however, requires careful assessment of the nature of both the dependent and independent variables in order to adopt appropriate modelling techniques for optimum model fitting.
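The worked examples in equations 4.10 and 4.11 can be checked numerically with a short sketch that plugs the Table 4.4 estimates into equation 4.8. The function and constant names are illustrative, and treating any Type of Service value other than 1, 2 or 3 as a reference category that contributes 0 is an assumption inferred from the final term of equation 4.11.

```python
import math

# Parameter estimates taken from Table 4.4.
INTERCEPT = -3.201
B_ITG = 0.878
B_ITR = -0.224
# Assumption: ToS is dummy coded and the reference category contributes 0.
B_TOS = {1: 2.635, 2: 1.646, 3: 0.119}


def p_wct(items_to_generate, items_to_recall, type_of_service):
    """Equation 4.8 applied to the willingness to complete task model."""
    linear = (
        INTERCEPT
        + B_ITG * items_to_generate
        + B_ITR * items_to_recall
        + B_TOS.get(type_of_service, 0.0)
    )
    return 1.0 / (1.0 + math.exp(-linear))


# Inputs of equation 4.10 (ItG = 3, ItR = 0, ToS 1) and equation 4.11 (reference ToS).
print(f"equation 4.10: {p_wct(3, 0, 1):.4f}")
print(f"equation 4.11: {p_wct(3, 0, 0):.4f}")
```

The results agree with the reported 0.887 and 0.362 to within rounding in the third decimal place.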

In document Designing for experience - a requirements framework for enrolment based and public facing e-government services (Page 114-118)