Method of Data analysis - The effect of crime in the community on becoming not in education, em

The statistical analysis will involve descriptive statistics and logistic regression analysis. The analysis begins with an examination of the data, including frequency tables and graphs for the main activity of young people, the indices of deprivation and the other variables in the analysis. Cross-tabulations are used to describe the relationship between NEET status and the variables in the model. The observed percentages given in cross-tabulations only offer a description of the relationship between NEETs and the other variables, and therefore additional steps are taken to draw further conclusions about these relationships. It is of particular interest to explore further the relationship between area deprivation and NEET status at ages 18-19. The Pearson chi-square test of association is used to assess these relationships. The chi-square statistic is used to test the null hypothesis that NEET status is not related to a given variable in the model, in other words that the two variables are independent. The observed significance level (* p < 0.05, ** p < 0.01, *** p < 0.001) of the chi-square statistic determines whether the null hypothesis can be rejected and whether there is a significant association between NEETs and the other variables in the model.
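As an illustration of the test of independence described above, the following minimal Python sketch computes the Pearson chi-square statistic for a cross-tabulation and attaches the significance stars. The counts are hypothetical and are not taken from the data analysed in this thesis; the function names are likewise illustrative. The p-value shortcut applies only to a table with one degree of freedom (a 2 x 2 table).

```python
import math

def pearson_chi_square(table):
    """Return (statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

def p_value_one_dof(statistic):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    # A chi-square variable with 1 dof is a squared standard normal variable.
    return math.erfc(math.sqrt(statistic / 2.0))

def stars(p):
    """Significance markers: * p < 0.05, ** p < 0.01, *** p < 0.001."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""

# Hypothetical counts: rows = more / less deprived area, columns = NEET / not NEET.
table = [[30, 70],
         [15, 85]]

statistic, dof = pearson_chi_square(table)
p = p_value_one_dof(statistic)
print(f"chi-square = {statistic:.3f}, dof = {dof}, p = {p:.4f} {stars(p)}")
```

With these illustrative counts the null hypothesis of independence would be rejected at the 5% level but not at the 1% level.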

After summarising the relationship between NEETs and the other variables in the analysis, the use of a regression model will allow a clearer picture of the relationship between NEETs and the variables in the model to emerge. The logistic regression model was selected for this analysis to model the probability that a young person will become NEET. When the outcome variable is dichotomous, multiple linear regression is not appropriate for studying the relationship between the outcome (NEET) and the independent variables, because such data cannot satisfy the assumptions of linear regression: a binary variable cannot be normally distributed with a constant variance. The logistic regression model is a flexible option for studying the relationship between a binary variable and a set of variables that are continuous or categorical.

There are two main differences between a linear regression model and a logistic regression model. The first difference lies in the choice of parametric model and the assumptions (Hosmer and Lemeshow [91], 2000). This difference is related to the nature of the relationship between the outcome (NEET) and the independent variables. In linear regression analysis, the key quantity is the conditional mean, which is the mean value of the outcome variable given the value of the independent variable. It is expressed as $E(Y \mid x)$, where $Y$ denotes the outcome variable and $x$ denotes the value of the independent variable, and is read as the expected value of $Y$ given the value of $x$. A linear equation

$$E(Y \mid x) = \beta_0 + \beta_1 x$$

implies that $E(Y \mid x)$ can take any value as $x$ ranges between $-\infty$ and $+\infty$. With dichotomous data, the conditional mean must be greater than or equal to zero and less than or equal to 1, that is, $0 \le E(Y \mid x) \le 1$. The fact that the logistic function ranges between 0 and 1 allows the researcher to estimate the probability of a young person being NEET given their characteristics. This makes it possible to estimate the probability that one of two events will occur, namely that a young person will be NEET or not, based on the values of the set of independent variables of the Compositional Model of Neighbourhood Effects. Thus, for the logistic model, the estimated probability can never be above 1 or below 0. In addition, the change in $E(Y \mid x)$ per unit change in $x$ becomes progressively smaller as the conditional mean gets closer to 0 or 1. This results in an elongated S-shaped curve. The S-shape indicates that the effect of the $x$ variables on becoming NEET is minimal for low $x$ values until some threshold is reached. The probability then increases over intermediate $x$ values and levels off close to 1 for high $x$ values.
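The contrast between the two conditional means can be made concrete with a short Python sketch. The coefficient values below are assumptions chosen purely for illustration, not estimates from this thesis. The sketch evaluates the linear mean and the logistic mean at several values of x, together with the slope of the logistic curve, which equals the slope coefficient multiplied by p(1 - p) and therefore shrinks as the probability approaches 0 or 1.

```python
import math

def linear_mean(x, b0, b1):
    """Conditional mean under a linear model: unbounded in x."""
    return b0 + b1 * x

def logistic_mean(x, b0, b1):
    """Conditional mean under a logistic model: always strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def logistic_slope(x, b0, b1):
    """Change in the logistic mean per unit change in x: b1 * p * (1 - p)."""
    p = logistic_mean(x, b0, b1)
    return b1 * p * (1.0 - p)

# Hypothetical coefficients, for illustration only.
b0, b1 = 0.0, 1.0

for x in (-6, -3, 0, 3, 6):
    print(
        f"x = {x:>2}: linear = {linear_mean(x, b0, b1):6.2f}, "
        f"logistic = {logistic_mean(x, b0, b1):.4f}, "
        f"slope = {logistic_slope(x, b0, b1):.4f}"
    )
```

The linear mean leaves the interval from 0 to 1 for large positive or negative x, whereas the logistic mean flattens out at both ends, which produces the S-shaped curve.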

Remark 6.4.1. In this thesis the logistic regression model is used to estimate the probability of a young person being NEET at ages 18-19. The analysis begins with a binary logistic regression model, given by

$$L = \ln(o) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X + \varepsilon, \tag{6.1}$$

where $p$ is the probability of an event taking place, $o$ is the odds of the event, $\beta_0$ and $\beta_1$ are the $Y$-intercept and the slope respectively, and $\varepsilon$ is the random error. $Y$ is the binary outcome (being NEET or not), and $X$ represents area deprivation characteristics. At the second stage of the analysis, multiple logistic regression analysis is employed to fit a model with more than one independent variable. More precisely, a vector of covariates representing each of the four pathways introduced by the Compositional Model of Neighbourhood Effects is included in the analysis. Recall that the multiple logistic regression function is given by

$$L = \ln(o) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n + \varepsilon, \tag{6.2}$$

where $X_i$ for $i \in \{1, \ldots, n\}$ are the independent variables.
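To show how the log-odds scale of the model relates to odds and probabilities, the following Python sketch evaluates the systematic part of the multiple logistic regression function for one hypothetical young person. The intercept, coefficients and covariate values are invented for illustration and are not estimates from this thesis; the random error term is left out, so the result is a fitted value rather than an observation.

```python
import math

def log_odds(intercept, coefficients, covariates):
    """Linear predictor: intercept plus the sum of coefficient * covariate."""
    return intercept + sum(b * x for b, x in zip(coefficients, covariates))

def probability_from_log_odds(value):
    """Invert ln(p / (1 - p)) to recover the probability p."""
    return 1.0 / (1.0 + math.exp(-value))

# Hypothetical values, for illustration only.
intercept = -2.0
coefficients = [0.5, 0.8]   # assumed slopes for two covariates
covariates = [1.0, 1.0]     # assumed covariate values for one person

L = log_odds(intercept, coefficients, covariates)
odds = math.exp(L)
p = probability_from_log_odds(L)

# exp(coefficient) is the odds ratio for a one-unit increase in that covariate,
# holding the other covariates fixed.
odds_ratios = [math.exp(b) for b in coefficients]

print(f"log-odds = {L:.3f}, odds = {odds:.3f}, probability = {p:.3f}")
print("odds ratios:", [round(r, 3) for r in odds_ratios])
```

Because the model is linear on the log-odds scale, each coefficient has a constant multiplicative effect on the odds, while its effect on the probability depends on the starting value of p.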

The second difference between linear and logistic regression models is related to the conditional distribution of the outcome variable. In a linear regression, an observation of the outcome variable can be expressed as $y = E(Y \mid x) + \varepsilon$, where $\varepsilon$ is called the error term and denotes an observation's deviation from the conditional mean. Under the usual assumptions, the conditional distribution of the outcome variable given $x$ is normal with mean $E(Y \mid x)$ and a constant variance. With a dichotomous outcome variable such as NEET status, this does not hold: the conditional distribution of the outcome variable is binomial, with probability given by the conditional mean.
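The reason the constant-variance assumption fails for a binary outcome can be seen from a short worked derivation. Here $p(x)$ is shorthand, introduced only for this illustration, for the conditional mean $E(Y \mid x)$ of an outcome coded 0 or 1.

```latex
% Writing y = p(x) + \varepsilon with y \in \{0, 1\}, the error can take only two values:
\varepsilon =
\begin{cases}
1 - p(x) & \text{with probability } p(x) \quad (y = 1), \\
-p(x)    & \text{with probability } 1 - p(x) \quad (y = 0).
\end{cases}

% Its mean is zero:
E(\varepsilon) = \bigl(1 - p(x)\bigr)\, p(x) - p(x)\,\bigl(1 - p(x)\bigr) = 0 .

% Its variance depends on x through p(x), so it is not constant:
\operatorname{Var}(\varepsilon)
  = \bigl(1 - p(x)\bigr)^{2} p(x) + p(x)^{2} \bigl(1 - p(x)\bigr)
  = p(x)\,\bigl(1 - p(x)\bigr) .
```

The variance is largest when $p(x) = 0.5$ and shrinks towards zero as $p(x)$ approaches 0 or 1.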

There are also differences between the linear and logistic regression models in terms of how they are estimated. In linear regression, the likelihood equations are obtained by differentiating the sum of squared deviations function and are linear in the unknown parameters. In logistic regression, the likelihood equations are not linear and can be solved using an iterative weighted least squares procedure (McCullagh and Nelder [118], 1989). Maximum likelihood and least squares estimation are different approaches that give the same results in regression analyses when the errors are normally distributed. Under the linear regression assumptions, we estimate the parameters of the model using the principle of least squares. The idea of least squares is that we choose the parameter estimates that minimise the average squared difference between observed and predicted values, so that the fitted model is as close as possible to the data. Least squares is not appropriate as an estimation method for logistic regression analysis, because with a dichotomous outcome its estimators lose the desirable statistical properties they have under the linear regression assumptions. Instead, the maximum likelihood method provides the basis of estimation in logistic regression models. The maximum likelihood method finds the set of parameter values that best explain the data, in the sense that they maximise the probability of obtaining the data that were observed. For many years maximum likelihood estimation was not widely used because software to carry out the complex calculations was not available. In recent years, new programs have made maximum likelihood estimation more popular. The advantage of the maximum likelihood method over the least squares method is that it can be applied to the estimation of complex non-linear models, and it is therefore the preferred estimation method in logistic regression.
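The iterative nature of maximum likelihood estimation can be illustrated with a minimal Python sketch for a logistic model with an intercept and a single covariate. It uses Newton-Raphson updates, which for logistic regression coincide with iteratively reweighted least squares. The data are hypothetical counts invented for illustration, the function names are illustrative, and the sketch is not the software routine used for the analysis in this thesis.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b0, b1, xs, ys):
    """Bernoulli log-likelihood of the data given the parameters."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def fit_logistic(xs, ys, max_iterations=25, tolerance=1e-10):
    """Maximum likelihood estimates of (b0, b1) by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iterations):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # information matrix X'WX
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)    # weight: variance of the binary outcome
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Solve the 2 x 2 system (X'WX) * step = gradient.
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if max(abs(step0), abs(step1)) < tolerance:
            break
    return b0, b1

# Hypothetical data: x = 1 for a more deprived area, y = 1 for NEET.
xs = [1] * 100 + [0] * 100
ys = [1] * 30 + [0] * 70 + [1] * 15 + [0] * 85

b0_hat, b1_hat = fit_logistic(xs, ys)
print(f"b0 = {b0_hat:.4f}, b1 = {b1_hat:.4f}, odds ratio = {math.exp(b1_hat):.4f}")
```

With a single binary covariate the estimates can be checked by hand: the intercept equals the log-odds of the outcome in the group with x = 0, and the slope equals the log of the odds ratio between the two groups.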

In document The effect of crime in the community on becoming not in education, employment or training (NEET) at 18-19 years in England (Page 159-163)