3.3.2 Regression Models

The Global Ordinary Least Squares (OLS) Model was used in this study to analyze how well the explanatory variables account for the dependent variable. The dependent variable in this study is crime, more specifically larceny theft crime incidents.

The independent variable is the presence of gentlemen’s clubs. The control variables, or explanatory variables, used in this study consist of socioeconomic and demographic variables that define the characteristics of a neighborhood and indicate whether a neighborhood’s traits are associated with higher levels of crime. The Exploratory Regression was used to assist in the selection of the explanatory variables. The socioeconomic and demographic data assessed in the selection of the explanatory variables are at the census block group level. The OLS Regression and the Geographically Weighted Regression (GWR) Model were conducted once the explanatory variables had been selected. The OLS Regression is used to examine the explanatory power of the variables on crime. The OLS works by modeling the dependent variable in relation to a set of explanatory variables. The GWR Model follows this analysis if spatial autocorrelation of the residuals is present, and contributes to its findings by incorporating spatial relationships into the analysis (Cahill and Mulligan, 2007). The GWR incorporates spatial relationships by estimating local regression coefficients, weighting data points by their proximity to each of the locations of the estimates (Waller et al., 2007). The GWR considers the spatial variations of relationships between the explanatory variables and crime (Zhang and Wei Song, 2014). The GWR analyzes the variations in the relationships between crime and each explanatory variable, and will be used if the OLS regression residuals are spatially autocorrelated, which will be determined using the Global Moran’s I.
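The decision rule described above depends on the Global Moran’s I of the OLS residuals. The following is a minimal sketch of that statistic, not the ArcGIS implementation used in this study. The function name `morans_i`, the four-area layout, and the binary contiguity weights are illustrative assumptions.

```python
def morans_i(values, weights):
    """Global Moran's I for values indexed 0..n-1.

    weights is a dict of dicts: weights[i][j] is the spatial weight
    between areas i and j (hypothetical binary contiguity here).
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # Sum of all spatial weights.
    s0 = sum(w for row in weights.values() for w in row.values())
    # Weighted cross-products of deviations for every neighbor pair.
    num = sum(w * dev[i] * dev[j]
              for i, row in weights.items()
              for j, w in row.items())
    den = sum(d * d for d in dev)
    return (n / s0) * (num / den)


# Four hypothetical block groups arranged in a row: 0-1-2-3.
W = {0: {1: 1}, 1: {0: 1, 2: 1}, 2: {1: 1, 3: 1}, 3: {2: 1}}

clustered = [2.0, 1.0, -1.0, -2.0]    # similar residuals sit side by side
alternating = [1.0, -1.0, 1.0, -1.0]  # neighbors always differ in sign

print(round(morans_i(clustered, W), 3))    # 0.4
print(round(morans_i(alternating, W), 3))  # -1.0
```

A positive value indicates that over and under predictions are clustered, which is the condition under which the GWR would follow the OLS. The sketch omits the z-score and p-value that accompany the index in practice.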

Exploratory Regression was used to determine the possible combinations of variables that can explain the dependent variable. Exploratory Regression analyzes the data and tests for bias, redundancy and multicollinearity (highly correlated variables), significance (which variables are consistently significant), completeness, and performance (ESRI: “Exploratory Regression”, 2013). Passing models are tested for spatial autocorrelation, as the spatial distribution of the model residuals should be random. Areas of over and under predictions should not be clustered. Exploratory Regression was used in this analysis to determine which variables could explain the spatial distribution of crime. In addition to demographic and socioeconomic variables from the Census Bureau, spatial variables were created and included in the Exploratory Regression.
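The core idea of trying every combination of candidate variables can be sketched as follows. This is an illustration only: it ranks models by Adjusted R-squared and leaves out the other checks described above (multicollinearity, bias, spatial autocorrelation). The function names, variable names, and data are hypothetical.

```python
from itertools import combinations


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def adjusted_r2(columns, y):
    """Fit OLS with an intercept and return the Adjusted R-squared."""
    n, k = len(y), len(columns)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(X[r][a] * X[r][b] for r in range(n)) for b in range(k + 1)]
           for a in range(k + 1)]
    xty = [sum(X[r][a] * y[r] for r in range(n)) for a in range(k + 1)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in X]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def explore(candidates, y, max_vars=2):
    """Score every combination of up to max_vars candidates, best first."""
    results = []
    for size in range(1, max_vars + 1):
        for names in combinations(sorted(candidates), size):
            score = adjusted_r2([candidates[nm] for nm in names], y)
            results.append((score, names))
    return sorted(results, reverse=True)


# Hypothetical data: y depends on x1 only; x2 is unrelated noise.
candidates = {"x1": [1, 2, 3, 4, 5, 6], "x2": [3, 1, 4, 1, 5, 9]}
y = [3, 5, 7, 9, 11, 13]
ranked = explore(candidates, y)
```

Each entry of `ranked` pairs a score with the variable names in that model, so weak combinations such as `("x2",)` fall to the bottom of the list.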

Two spatial variables were created in an attempt to further explain the spatial distribution of larceny theft crimes. One of the spatial variables measures the distance from each larceny theft crime incident to the closest gentlemen’s club. The other spatial variable measures the distance from each larceny theft crime incident to the closest centroid, a point marking the center of one of the two areas that make up the study area. This measurement was created because city centers are densely populated, meaning that there are more potential offenders and victims. Before the distance between each larceny theft crime and the nearest centroid could be measured, the centroids had to be created. Since the study area comprises two areas, metropolitan Los Angeles and the San Fernando Valley, two centroids were created. The latitude and longitude of the center of the San Fernando Valley were calculated, as well as the central coordinates of metropolitan Los Angeles. The display x, y function was used to construct points, referred to as centroids, located in the center of each of these two areas. One centroid is located in the center of the San Fernando Valley and the other is located in the center of metropolitan Los Angeles. The two centroid layers were merged into one layer. This layer was used to create the spatial variable measuring the distance from each larceny theft crime to the nearest centroid, which indicates where these crimes are located in relation to the center of the city.

The spatial variables were calculated using two methods: the Near Analysis and the creation of an Origin Destination (OD) Cost Matrix using Network Analyst. The significance levels reported by the Exploratory Regression for the spatial variables derived from the two methods were compared to determine whether the two methods produce different results. The Near Analysis calculates the distance from each feature in one feature class to the closest feature in another feature class. The measurement is a straight-line distance and is therefore the shortest distance between the two features; it does not take the roads network into consideration. The Near Analysis was conducted twice. It was conducted once to calculate the distance between each larceny theft crime incident and the nearest centroid, in order to determine how far from the center of the study area each crime occurred. It was then used to calculate the distance between each larceny theft crime incident and the closest gentlemen’s club, creating a spatial variable that measures the distance between the crimes and the establishments being questioned in the analysis.
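The straight-line calculation performed by the Near Analysis can be illustrated with a short sketch. It assumes planar (projected) coordinates in a common unit. The function name `near` and the sample coordinates are hypothetical and do not come from the study data.

```python
import math


def near(points, targets):
    """For each point, return (index of nearest target, straight-line distance)."""
    out = []
    for px, py in points:
        dists = [math.hypot(px - tx, py - ty) for tx, ty in targets]
        idx = min(range(len(targets)), key=dists.__getitem__)
        out.append((idx, dists[idx]))
    return out


# Hypothetical crime incidents and club locations in projected coordinates.
crimes = [(0.0, 0.0), (10.0, 0.0)]
clubs = [(3.0, 4.0), (10.0, 2.0)]

print(near(crimes, clubs))  # [(0, 5.0), (1, 2.0)]
```

The same function could be called a second time with the two centroids as `targets` to obtain the distance-to-centroid variable.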

The second method of calculating the two spatial variables used the roads network for a more accurate distance calculation, as it is based on the actual travel distance rather than straight lines between the points. A roads network was created from a roads layer encompassing all of the roads in Los Angeles County. This new roads network was used to construct an OD Cost Matrix. The OD Cost Matrix creates a matrix of distances by calculating the distance from each feature in one feature class to every feature in another feature class. In this case, two OD Cost Matrices were created. One OD Cost Matrix calculated the distance between each larceny theft crime incident and each strip club. The other calculated the distance between each larceny theft crime incident and each centroid. Each distance calculation is assigned a rank, in which the closest feature is given a rank of one, the second closest feature a rank of two, and so on. Since the OD Cost Matrix is being used as a comparable measurement to the measurements calculated by the Near Analysis, only the distance calculations assigned a rank of one were selected. A new layer was created containing the distance between each larceny theft crime and the closest strip club, as well as a new layer containing the distance between each larceny theft crime and the closest centroid.
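The ranking logic of the OD Cost Matrix can be sketched on a toy road graph. This is not the Network Analyst solver; it uses Dijkstra's shortest-path algorithm as a stand-in, and the node names, edge lengths, and function names are all hypothetical.

```python
import heapq


def undirected(edges):
    """Build an adjacency dict from (node_a, node_b, length) road segments."""
    graph = {}
    for a, b, length in edges:
        graph.setdefault(a, {})[b] = length
        graph.setdefault(b, {})[a] = length
    return graph


def network_distances(graph, source):
    """Dijkstra shortest-path lengths from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, length in graph.get(node, {}).items():
            nd = d + length
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist


def od_cost_matrix(graph, origins, destinations):
    """One row per origin-destination pair, ranked by travel distance."""
    rows = []
    for o in origins:
        dist = network_distances(graph, o)
        reachable = sorted((dist[d], d) for d in destinations if d in dist)
        for rank, (cost, d) in enumerate(reachable, start=1):
            rows.append({"origin": o, "destination": d, "cost": cost, "rank": rank})
    return rows


# Hypothetical network: crimes and clubs attached to junctions "a" and "b".
roads = undirected([
    ("crime1", "a", 1.0), ("a", "club1", 2.0), ("a", "b", 1.0),
    ("b", "club2", 5.0), ("crime2", "b", 1.0),
])
matrix = od_cost_matrix(roads, ["crime1", "crime2"], ["club1", "club2"])

# Keep only the closest destination per origin, as done for comparison
# with the Near Analysis distances.
nearest = [row for row in matrix if row["rank"] == 1]
```

Filtering on `rank == 1` leaves one row per crime, which mirrors the selection step described above.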

Crime rate is also thought to be influenced by the proximity to highways, as highways provide a quick and efficient method for entering and exiting communities.

Edwin McDowell stated, “Statistics are hard to come by, but incidents like the one on I-88 are far from isolated. Highway patrol officers, criminologists, district attorneys and other experts say more and more criminals are discovering that highways provide an abundant source of potential victims and an easy avenue of escape for crimes from car theft and armed robbery to rape and murder” (1992). The line vector layer displaying major highways in the study area appeared to intersect many of the statistically significant larceny theft hot spots produced by the Optimized Hot Spot Analysis. This map is shown in Figure 9.

Highways also invite the construction of densely packed housing, which increases the number of people, including both potential criminals and victims.

In an attempt to further explain the spatial pattern of the larceny theft incidents, a third spatial variable was created, measuring the distance from each crime point to the nearest highway. The Near Analysis, which was previously used to create the other spatial variables, calculated the straight-line distance from each larceny crime point to the nearest highway feature. The distance to highways variable was calculated using only the Near Analysis, as the highways line layer could not be used in the OD Cost Matrix, which requires point data. This spatial variable was used as an explanatory variable to assess whether proximity to highways influences the crime rate and strengthens the model used in the regressions.
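Measuring the straight-line distance from a point to a line feature differs slightly from the point-to-point case, because the closest location may fall in the middle of a segment. The sketch below illustrates this under the assumption of planar coordinates. The function names and the two sample highways are hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest straight-line distance from point p to segment a-b."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0:
        return math.hypot(px - ax, py - ay)  # degenerate segment
    # Project p onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len2
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def distance_to_nearest_line(p, polylines):
    """Minimum distance from p to any segment of any polyline."""
    return min(
        point_segment_distance(p, line[i], line[i + 1])
        for line in polylines
        for i in range(len(line) - 1)
    )


# Two hypothetical highways, each a list of vertices.
highways = [
    [(0.0, 0.0), (10.0, 0.0)],
    [(0.0, 5.0), (0.0, 20.0)],
]

print(distance_to_nearest_line((4.0, 3.0), highways))   # 3.0
print(distance_to_nearest_line((12.0, 0.0), highways))  # 2.0
```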

Figure 9: Highways Overlaying Larceny Theft Hot Spots

Once the spatial variables were created, they were joined to the census data. Since two versions of the spatial variables were created, the census data layer was copied so that there would be one census file containing the Near Analysis distances and another census file containing the Network Analyst distances. However, as the distance to highways was only calculated using the Near Analysis method, the Near Analysis distance to highways calculations were also added to the census file containing the Network Analyst distance calculations of the other two spatial variables. The distance data were joined to the census block group layer using a spatial join in which the distances of all crime points located within each block group were averaged. Each census block group layer now contains three distance fields: one containing the average distance from the crime points within each census block group to the nearest strip club, another containing the average distance from the crime points to the nearest centroid, and a third containing the average distance from the crime points to the nearest highway. The census block group layers now contain demographic, socioeconomic, and spatial variables, which were analyzed using an Exploratory Regression. The final field added to the two versions of the census file was the dependent variable to be analyzed in the regressions. Since the study is analyzing crime, the larceny crime incidents contained within each census block group were counted. The new field containing the number of larceny incidents within each census block group was used as the dependent variable in the multiple regressions.
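The aggregation performed by the spatial join can be sketched as a group-and-average step. The sketch assumes each crime point has already been assigned to its containing block group, which is the geometric part of the join. The field names and values are hypothetical.

```python
from collections import defaultdict


def summarize_by_block_group(crimes, distance_fields):
    """Count crimes and average each distance field per block group."""
    groups = defaultdict(list)
    for crime in crimes:
        groups[crime["block_group"]].append(crime)
    summary = {}
    for bg, members in groups.items():
        row = {"larceny_count": len(members)}  # dependent variable
        for field in distance_fields:
            row["avg_" + field] = sum(m[field] for m in members) / len(members)
        summary[bg] = row
    return summary


# Hypothetical crime points with precomputed distances.
crime_points = [
    {"block_group": "A", "dist_club": 100.0, "dist_highway": 50.0},
    {"block_group": "A", "dist_club": 300.0, "dist_highway": 150.0},
    {"block_group": "B", "dist_club": 80.0, "dist_highway": 20.0},
]
summary = summarize_by_block_group(crime_points, ["dist_club", "dist_highway"])
print(summary["A"])
# {'larceny_count': 2, 'avg_dist_club': 200.0, 'avg_dist_highway': 100.0}
```

Block groups that contain no crime points do not appear in `summary`; in a full workflow they would need a count of zero and no average distance.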

Multiple Exploratory Regressions were conducted on the census block group layer containing the Near Analysis distance calculations as well as the census block group layer containing the Network Analyst distance calculations. This was done to determine the best combination of variables, with the aim of producing a model that explains the spatial distribution of larcenies, as the Exploratory Regression determines the strength of the explanatory power of each variable. The first Exploratory Regressions acted as a trial run in which variables that either consistently resulted in extremely low significance or were multicollinear were excluded from further analysis. These variables were excluded even though many of them, such as female-headed households, are assumed under Social Disorganization Theory to be characteristics of neighborhoods with crime. Over successive Exploratory Regression attempts, the number of candidate variables being tested was decreased and the number of variables included in each model was increased, to observe whether the strength of the model increased.

The multiple Exploratory Regressions conducted failed to produce a passing model. All models failed the Spatial Autocorrelation test, denoting that the residuals exhibit clustering and that a key explanatory variable may be missing. Failing the Spatial Autocorrelation test was expected, as the analysis included spatial variables as well as demographic and socioeconomic variables, which are used to analyze a population, its characteristics, and its environment. The models failed to pass the Jarque-Bera (JB) test as well. The JB test checks whether the model residuals are normally distributed, and statistically significant JB scores signify that there is bias within the model. This could result from outliers in the data and nonlinear relationships (ESRI: “Interpreting Exploratory Regression Results”, 2013). The Adjusted R-squared value is used to assess the strength of the model produced. Higher Adjusted R-squared values indicate that the explanatory variables explain more of the variation in the dependent variable, while low values mean the model did not explain the dependent variable. The Adjusted R-squared value for all of the models also failed, as it was too low to explain the dependent variable (number of larceny crimes per census block group). The only test that every model passed was the Variance Inflation Factor (VIF), which tests for multicollinearity. All VIF values must be less than 7.5 in order to pass the multicollinearity test. The results of the Exploratory Regression are included in Appendix F. A summary of the highest-performing models produced in the final Exploratory Regression for both the Near Analysis block group layer and the Network Analyst block group layer is displayed in Table 2 and Table 3.
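The Jarque-Bera statistic can be computed directly from the skewness and kurtosis of the residuals. The sketch below uses the standard textbook form of the test, with the p-value taken from a chi-square distribution with two degrees of freedom. It is not the ArcGIS implementation, and the residual values are hypothetical.

```python
import math


def jarque_bera(residuals):
    """Return (JB statistic, p-value) for a list of model residuals."""
    n = len(residuals)
    mean = sum(residuals) / n
    # Central moments of the residuals.
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
    # Survival function of a chi-square distribution with 2 degrees of freedom.
    p_value = math.exp(-jb / 2)
    return jb, p_value


symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
skewed = [1.0] * 18 + [20.0, 40.0]  # a few large outliers

jb_sym, p_sym = jarque_bera(symmetric)
jb_skew, p_skew = jarque_bera(skewed)
print(p_sym > 0.05, p_skew < 0.05)  # True True
```

A statistically significant result, as with the skewed sample, corresponds to the failing JB test reported for the models in this study.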

Table 2: Results From Near Analysis Layer Exploratory Regression

Table 3: Results From Network Analyst Layer Exploratory Regression

The Exploratory Regression resulted in extremely low Adjusted R-squared values for both datasets. In addition to the low strength of the models, Akaike’s Information Criterion (AIC) was very high. The AIC is a relative measure of how well a model fits the observations, and it is used to compare models of the same dependent variable. Lower AIC scores are better and are associated with more trustworthy models. The highest Adjusted R-squared for the census layer containing the distance calculations conducted using the Near Analysis was 0.10, meaning that the model only explained about ten percent of the variation in the data. This model resulted in an AIC score of 57751.42. The highest Adjusted R-squared for the census layer containing the distance calculations achieved using Network Analyst was 0.06, which is lower than that of the Near Analysis census layer. The resulting AIC score was 58002.89, which is greater than that of the Near Analysis layer. This indicates that although the distance fields calculated using the local roads network are more accurate, the Near Analysis version of the layer produced the more trustworthy of the two models.

The Exploratory Regression analyzes the significance of each individual variable included in the analysis, as well as whether there is a positive or negative relationship between the explanatory variables and the dependent variable. The two differently calculated distance fields yielded very different results. For the distance to gentlemen’s club calculation, the Near Analysis method determined that the significance of the variable was 62.67 percent, while the Network Analyst method assigned a significance of 91.67 percent. The more accurate calculation based on the local roads network therefore produced a much higher significance than the straight-line distance measurements. Both methods found a positive relationship between the number of crimes per block group and the average distance of each crime point to the nearest gentlemen’s club. In other words, the number of larceny crimes increases as the distance between each crime and the nearest gentlemen’s club increases. This relationship was not expected, as it suggests that the presence of gentlemen’s clubs might not be affecting larceny theft incidents. The results of the variables analyzed in the final Exploratory Regression for the Near Analysis block group layer and the Network Analyst block group layer are included in Table 4 and Table 5, in which the variables are ordered from the highest significance to the lowest significance.

This data was further analyzed in an attempt to correct the model or find a model that better explains the spatial distribution.

Table 4: Variables From Near Analysis Layer Used in Exploratory Regression

Table 5: Variables From Network Analyst Layer Used in Exploratory Regression

After these initial Exploratory Regressions, which resulted in incredibly low Adjusted R-squared values, steps were taken in an attempt to increase the strength of the models.

Any census block group containing zero population or zero households was eliminated from the analysis under the assumption that these locations are likely to be outliers, such as LAX and the City of Industry. As the models produced by the Exploratory Regression failed all of the tests except for the VIF test, the normality of the dependent and explanatory variables was tested using SPSS. The Descriptive Statistics Explore analysis in SPSS applies the Kolmogorov-Smirnov (KS) test to analyze the normality of the data, and provides graphs and histograms depicting the distribution of the data. Scatter plots are also included, which display whether a relationship is linear or nonlinear. This is very important, as the regression model used in this analysis models linear relationships.

The KS test works by standardizing the values and comparing the distribution of these values to that of a normal distribution. Normally distributed data follow the normal distribution, which appears as a bell curve when graphed in a histogram. The KS test was statistically significant for every variable analyzed, meaning that none of the variables were normally distributed. Upon analyzing the skewed histograms produced during the test, it became apparent that the data needed to be transformed in order to produce a more meaningful model. The results of the KS test conducted on the Near Analysis layer are included in Appendix G.
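The comparison against a normal distribution can be sketched by computing the KS statistic, the largest gap between the empirical distribution of the standardized values and the normal curve. This sketch computes only the statistic and does not reproduce the significance value reported by SPSS. The function names and sample data are hypothetical.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def ks_statistic(values):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    z = sorted((v - mean) / sd for v in values)  # standardized values
    d = 0.0
    for i, zi in enumerate(z):
        cdf = normal_cdf(zi)
        # Compare the normal CDF with the empirical CDF just after
        # and just before this observation.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


symmetric_sample = [-2.0, -1.0, 0.0, 1.0, 2.0]
skewed_sample = [1.0] * 18 + [20.0, 40.0]  # positively skewed

d_sym = ks_statistic(symmetric_sample)
d_skew = ks_statistic(skewed_sample)
print(d_skew > d_sym)  # True
```

The positively skewed sample departs much further from the normal curve, which is the pattern the histograms in this analysis revealed.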

The data was transformed using Excel to add and calculate new fields containing the transformed values. Data transformations allow the data to be scaled so that proportional differences are preserved, and can lead to more meaningful results if outliers are present in the data, or if the model is predicting well for low values but not for high values. Data transformations are also useful if the data is nonlinear, which is the case for this analysis, and the regressions, such as the OLS regression, are based on linear relationships. The normality tests revealed that many of the variables in the analysis were positively skewed.

Logarithmic transformations, such as the natural log, help make positively skewed distributions more normal and help fit the variable into the model (Princeton University, 2008). Because transformations should be applied uniformly across the data, all variables considered for inclusion in the regression were transformed. The natural log (LN) function was used to transform the data, and all values equal to 0 were replaced by 0.5 prior to logging, as the natural log of zero is undefined.
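The transformation, which was carried out in Excel in this study, can be expressed in a few lines. The sketch below follows the rule stated above: zeros are replaced by 0.5 before the natural log is taken. The function name and the sample counts are hypothetical.

```python
import math


def ln_transform(values, zero_replacement=0.5):
    """Natural-log transform, substituting zero_replacement for zeros."""
    return [math.log(zero_replacement if v == 0 else v) for v in values]


# Hypothetical larceny counts per block group, positively skewed.
counts = [0, 1, 2, 5, 40, 300]
logged = ln_transform(counts)
print([round(v, 2) for v in logged])
# [-0.69, 0.0, 0.69, 1.61, 3.69, 5.7]
```

The transform compresses the large values far more than the small ones, which pulls in the long right tail. Negative inputs would raise an error, so the sketch assumes non-negative data such as counts and distances.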

Eighteen variables were transformed, including seventeen possible explanatory variables and the dependent variable for both the Near Analysis dataset and the Network Analyst dataset. The transformed data was then tested for normality in SPSS to determine if the transformations improved the distributions of the data. The results of the second KS test revealed that the transformed data still did not pass, and was not normally distributed.

The histograms, however, showed that the data transformations had smoothed the data, and that the data would be normally distributed if the values that were zero were excluded. The census block groups, in which the variable being graphed is equal to zero,
