**ABS House, Canberra, Australia**

**6-7 June 2005**

**Safety, crime and justice: from data to policy**

Australian Institute of Criminology Conference

**CONFERENCE PAPER:**

**POSSIBILITIES AND PITFALLS IN THE ANALYSIS OF SURVEY DATA**

### Terry Rawnsley and Sandra Fairbairn

### Australian Bureau of Statistics

This conference was organised by the Australian Institute of Criminology in conjunction with the Australian Bureau of Statistics.

**Abstract**

The Australian Bureau of Statistics (ABS) National Crime and Safety Survey provides information on different
types of crime and the characteristics of victims, offenders and incidents. The survey data can support
relatively simple analysis of crime among different groups in the population. For example, analysis can be
undertaken by sex or age or state. However, the survey data also have the potential to support even more
sophisticated (and insightful) analysis that may be used to inform policy to increase reporting rates and thus
increase citizen engagement in crime prevention. To help support more sophisticated analysis the ABS has
set up a Remote Access Data Laboratory (RADL™), which allows users to gain greater access to survey unit record files.

This paper will demonstrate a popular statistical method known as logistic regression, using the propensity of people to report crimes to the police as an example. Logistic regression allows the relationship between reporting crime to police and socioeconomic and demographic variables to be observed. The paper will also explain how the survey sample design used to collect the data can have a significant effect on the outcome of the analysis.

**1. Introduction and background**

**1.1 Introduction**

The Australian Bureau of Statistics (ABS) National Crime and Safety Survey (NCSS) collected information on
a number of different types of personal and household crimes. The survey covered categories of more
serious crimes that affect the largest number of people. These included assault, sexual assault, robbery,
household break-in, attempted break-in and motor vehicle theft. The survey collected information about the
incident and the characteristics of victims and non-victims, as well as characteristics of the offenders. These
results from the survey are presented in *Crime and Safety, Australia*, ABS cat. no. 4509.0.

The statistics in this publication can support relatively simple analyses of crime among different groups in the population. For example, analysis can be undertaken by sex or age or state or if the crime was reported to police. However, the survey data also have the potential to support even more sophisticated (and possibly insightful) analyses.

This paper will demonstrate a statistical method known as a logistic regression. Logistic regression can be used to measure the association between a dependent variable and a set of explanatory variables (risk factors). The dependent variable used in this analysis is the propensity (how likely) of people to report crimes to the police. The logistic regression model allows us to observe the relationship between reporting a crime to police and other variables such as socio-economic factors, the characteristics of the crime, the relationship of the offender to the victim and location of the crime. The results from the logistic model can help better understand reporting patterns associated with particular groups and with particular types of crime. The paper will also briefly explore the impact of different survey designs on logistic regression results. Multistage clustered sample surveys such as the NCSS are conducted as they are less expensive (in terms of time and money) and are more efficient to conduct. However, a side effect of this type of survey is that the sample is no longer a simple random sample. The use of a complex sample design can have a significant effect on the outcome of regression analysis.

The areas of interest to this paper (logistic regression and sample design) can be technically complex. This paper is aimed at readers with limited technical knowledge, but should provide references and additional information for the more advanced reader. A glossary is provided to help users understand some of the more technical terms used in the paper.

The remainder of this introductory section will provide some additional background to accessing ABS data and the NCSS. Section two will outline the logistic regression model and implications of the multistage clustered sample of the NCSS. The results from the logistic regression on the propensity of people to report crimes to the police will be presented to highlight the method and issues. Section three will present some concluding remarks.

**1.2 Accessing ABS Unit Record Data**

The ABS has been collecting and disseminating crime and justice related statistics since the mid 1970s. This has usually been in the form of publications which present a summary of findings from various collections (sample surveys or administrative data). It may also have taken the form of a special data request, when a tabulation not contained in the summary publication is produced.

However, as the speed and power of computers have increased, there has been increasing demand among users for access to the unit record files underlying the publications. Previously, the size of unit record files meant that most users were not able to store, let alone manipulate and analyse, the files on their computers. This increased computing power has also been accompanied by growth in user friendly statistical software packages. Many of these packages have menu driven functionality that does not require users to learn a programming language.

To meet this user demand for increased access the ABS releases Confidentialised Unit Record Files (CURF) on CD-ROM. This allows users to conduct more detailed analysis. To confidentialise a unit record file means that information is disclosed in a manner that is not likely to enable the identification of the particular person or organisation to which it relates.

To further enhance users' access to more detailed microdata the ABS has developed the Remote Access Data Laboratory (RADL™). The RADL™ provides access to CURF data through a secure online data query service. Authorised users submit commands from either the SAS or SPSS statistical packages (via the RADL™ web interface) to operate against the CURFs that are kept within the ABS environment. The results of the queries are checked for confidentiality and then made available to the users via a secure webpage.

The process of converting data collected by the ABS into policy is a long and, at times, very difficult task. By improving users' access to unit record files via the RADL™ the ABS is attempting to make this task somewhat easier. The 2002 NCSS CURF is currently available through the RADL™, as are other crime related surveys such as the 1996 Women's Safety Survey.

For more information about becoming an authorised user of the RADL™, please refer to the category 'CURFs' under 'Services we provide' on the ABS website at www.abs.gov.au.

The next section provides some examples of topics for which the NCSS CURF can be used.

**1.3 The National Crime and Safety Survey**

The NCSS has been conducted in 1975, 1983, 1993, 1998 and 2002. The survey covered the more serious crimes that affect the largest number of people, in two broad categories: personal and household crime. In 2002 for personal crimes, information was sought from approximately 54,400 persons, of whom 41,200 (76%) responded. Data pertaining to households were sought from approximately 27,100 households and of these 20,400 (75%) responded.

The respondents were all persons aged 15 years and over, who were the usual residents of private dwellings. The exception to this was sexual assault; only persons aged 18 and over were asked sexual assault questions. Residents of non-private dwellings (hospitals, motels and prisons) were excluded from the whole sample, as were those living in remote or sparsely settled areas, tourists and persons under the age of 15.

The NCSS involved a self-administered questionnaire. Respondents were asked to complete the relevant questionnaires and return them by mail. Each respondent was provided with an individual mail back envelope, to minimise the likelihood of other members of the household seeing another person's form. Males and females aged 18 years and over were supplied with a separate questionnaire about sexual assault. Completion of this sexual assault form was voluntary.

The survey collects data on the following crimes:

Assault - an incident other than a robbery involving the use, attempted use, or threat of force or violence against the victim.

Sexual assault - an incident which was of a sexual nature involving physical contact, including rape, attempted rape, indecent assault, and assault with intent to sexually assault. Sexual harassment was excluded.

Robbery - an incident where someone had stolen (or tried to steal) property from a respondent by physically attacking them or threatening them with violence.

Break-in - an incident where the respondent’s home had been broken into. The home included their garage or shed, but excluded their car and garden.

Attempted break-in - an incident where there were signs that an attempt was made to break into the respondent’s home. The home included their garage or shed, but excluded their car and garden.

Motor vehicle theft - an incident where a motor vehicle was stolen from any member of the household. It includes privately owned motor vehicles as well as business/company vehicles used exclusively by members of the household.

The survey also collects information on victim, offender and incident characteristics, such as:

Victim's age, sex, marital status, employment status, level of education, occupation and country of birth.

Number of incidents.

How long ago the most recent incident occurred.

Whether the most recent incident was reported to police, how the police were told, reasons the incident was not reported to police.

What the offender(s) did in the most recent incident.

Whether a weapon was used (for personal crimes).

Whether physically injured (for personal crimes).

Sex, age and number of offenders in most recent incident (for personal crimes).

Whether the offender was known and how they were known (for personal crimes).

Where the incident occurred (for personal crimes).

In addition, information was also collected on respondents' perceptions of problems in their neighbourhood and their feelings of safety. The comprehensive nature of the NCSS enables it to support analysis on a wide range of topics. For example, the factors associated with repeat and/or multiple victimisation, the propensity to be a victim of crime, the propensity to report a crime to police and understanding public perception of crime.

In addition to this, the NCSS is also helpful with benchmarking, program and policy evaluation and in addressing key performance indicators set by government.

Section 2 of this paper will use the 2002 NCSS data to analyse factors affecting the propensity to report crime to the police. The crime rate as collected by the police is equal to the prevalence rate multiplied by the reporting rate of crimes.

However, the actual crime rate includes reported as well as unreported crimes. The analysis demonstrated in this paper may be used to understand the different factors, such as victim, offender and incident characteristics, that affect reporting rates for assault and break-in.
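To illustrate the relationship between the recorded crime rate, the prevalence rate and the reporting rate described above, consider a worked example. The figures are hypothetical and chosen only for illustration (a prevalence rate of 5% and a reporting rate of 30%); they are not NCSS estimates.

$$\text{recorded crime rate} = \text{prevalence rate} \times \text{reporting rate} = 0.05 \times 0.30 = 0.015$$

Under these assumed figures, police records would show a crime rate of 1.5% even though 5% of the population were victims, which is why factors that change the reporting rate can change recorded crime without any change in actual crime.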

For further information please refer to *Crime and Safety, Australia*, ABS cat. no. 4509.0. For a full list of data items and information on the CURF please refer to *Expanded Confidentialised Unit Record File, National Crime and Safety Survey, 2002*, ABS cat. no. 4524.0.55.001.

**2. Methodology**

**2.1 Logistic Regression**

When many explanatory variables are involved, simple two-way tables are not adequate because each table provides a test of the relationship between the variable of interest and only one other explanatory variable. No account can be taken of the effect of one explanatory variable after allowing for another. Thus we may see that a relationship exists between lung cancer and whether the person smokes, or between lung cancer and whether a parent of the person also had lung cancer. However, such comparisons do not allow us to consider the effect of smoking after adjusting for the effect of having a parent with lung cancer. It is in these situations that regression analysis should be used.

The goal of regression analysis is to develop the simplest statistical model that can be used to explain the variation in the values of a dependent variable based on the values of at least one explanatory variable. More specifically, simple linear regression uses a single numerical independent variable *x* to predict the numerical dependent variable *y*, whereas multiple regression models use several numerical explanatory variables (*x1, x2, ..., xk*) to predict a numerical dependent variable *y*. Essentially, linear regression attempts to determine a model which explains any systematic variation in the dependent variable.

This section will provide a brief overview of a type of regression known as a logistic regression. There are certain cases where the dependent variable may be a data item taking on the value of 1 (yes) or 0 (no), as is the case for the NCSS question "did you report a crime to police?".

A linear regression model is not appropriate for binary data because the values it predicts from one or more explanatory variables are not restricted to the range 0 to 1. Logistic regression, however, has been specially designed to take binary variables (1, 0) into account.

In its simplest form, the logistic regression model can be described as follows, where the $P_i$ are determined by the regression:

$$\operatorname{logit}(P_i) = \log\left\{\frac{P_i}{1-P_i}\right\} = a + b_1 x_{i1} + \dots + b_k x_{ik}$$

where $P_i$ is the probability of the outcome of interest occurring (e.g. the probability of reporting a crime to police), $a$ is the intercept parameter, $b_1, \dots, b_k$ are regression parameters, and $x_{i1}, \dots, x_{ik}$ are the values of the $k$ explanatory variables for unit $i$.

In our case, where all the explanatory variables are binary, the left hand side of this equation (i.e. $\log\{P_i/(1-P_i)\}$) is referred to as the log-odds, which is commonly used as a measure of the likelihood of a unit having the characteristic of interest relative to not having it.
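The link between the log-odds and the probability can be sketched in a few lines of Python. This is a minimal illustration only: the intercept and regression parameters below are hypothetical values chosen for the example, not estimates from the NCSS, and the function names are our own.

```python
import math

def logit(p):
    """Log-odds of a probability p, for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inverse_logit(log_odds):
    """Probability implied by a log-odds value; always lies between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def predicted_probability(a, b, x):
    """P_i from logit(P_i) = a + b1*x_i1 + ... + bk*x_ik."""
    linear_predictor = a + sum(bj * xj for bj, xj in zip(b, x))
    return inverse_logit(linear_predictor)

# Hypothetical parameters, chosen only for illustration (not NCSS estimates).
a = -1.0            # intercept
b = [0.2, 0.9]      # regression parameters for two binary explanatory variables
x = [1, 1]          # a unit that has both characteristics

p = predicted_probability(a, b, x)
print(f"predicted probability: {p:.3f}")
```

Whatever values the parameters take, the predicted probability stays between 0 and 1, which is the property that a linear regression lacks.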

Many statistical software programs are able to perform a logistic regression. However, to some degree running the regression is the easy part. More effort is required in model specification and to interpret and understand the results from the model. The next section provides some examples of how a logistic regression can be specified and the results interpreted.

**2.2 Example of result from logistic regression**

The following section provides an example of a logistic regression. The models have been simplified to demonstrate the technique rather than provide a full overview of all factors which are related to a person’s propensity to report crime to police.

The logistic regression is being used to examine the relationship between the likelihood of a victim reporting an assault (given the crime has occurred) to police and the victim's sex, age, and whether or not they suffered an injury. In this case reporting to police is the dependent variable and sex, age and injury are the explanatory variables.

Odds ratios have been estimated relative to a reference category which has been assigned for each explanatory variable. In table 2.1, females have been assigned as the reference group (that is, this group has a value of 1.00) for the sex variable. So an odds ratio of 1.24 for the male variable means that the odds of men reporting an assault to police are 24% higher than the odds for women; this is described below, more loosely, as men being 24% more likely to report (note that the reference group is not shown for variables that have only two categories).

Similarly, the injury odds ratio of 2.53 indicates that if a person is injured in an assault they are 153% more likely to report the assault to police than if they were not injured. The 95% confidence interval indicates that a person injured during an assault is between 102% and 213% more likely to report the assault to police than a person who was not injured.
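To show where an odds ratio and its confidence interval come from, the sketch below computes them from a two-way table of counts. It relies on a standard result: for a model with a single binary explanatory variable and no weights, the logistic regression odds ratio equals the cross-product ratio of the table, with a Wald interval calculated on the log scale. The counts are hypothetical and are not NCSS data, so the output will not match table 2.1, which also adjusts for sex and age.

```python
import math

def odds_ratio_with_ci(exposed_yes, exposed_no, unexposed_yes, unexposed_no, z=1.96):
    """Odds ratio and Wald confidence interval from a 2x2 table of counts."""
    odds_ratio = (exposed_yes * unexposed_no) / (exposed_no * unexposed_yes)
    # Standard error of the log odds ratio.
    se_log_or = math.sqrt(
        1 / exposed_yes + 1 / exposed_no + 1 / unexposed_yes + 1 / unexposed_no
    )
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical counts, invented for illustration (not NCSS data).
injured_reported, injured_not_reported = 120, 180
uninjured_reported, uninjured_not_reported = 150, 550

odds_ratio, lower, upper = odds_ratio_with_ci(
    injured_reported, injured_not_reported,
    uninjured_reported, uninjured_not_reported,
)
print(f"odds ratio {odds_ratio:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```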

The age variables have a slightly different interpretation. Those people aged 35 years and over are used as the reference group (that is, this group has a value of 1.00) to which the other two odds are compared. So a person aged 25-34 years is 16% less likely (1-0.84) to report a crime to police than a person aged 35 years and over. The 95% confidence interval runs from 34% less likely (1-0.66) to 7% more likely (1.07-1) to report the assault to police. Therefore, as the confidence interval includes 1.00, there is no statistically significant difference between people aged 25-34 years and people aged 35 years and over in their propensity to report an assault to police. A person aged 15-24 years is 53% (1-0.47) less likely to report an assault to police.
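The arithmetic used in this interpretation can be written out as a short sketch. The odds ratios and confidence limits are those reported in table 2.1; the helper function and its name are our own and simply restate the rules described above (percentage change is the odds ratio minus one, and an interval that includes 1.00 indicates no statistically significant difference at the 5% level).

```python
def describe(odds_ratio, lower, upper):
    """Percentage change relative to the reference group, and whether the
    95% confidence interval excludes 1.00."""
    change = (odds_ratio - 1.0) * 100.0
    direction = "higher" if change > 0 else "lower"
    significant = not (lower <= 1.0 <= upper)
    return round(abs(change)), direction, significant

# (odds ratio, lower 95% limit, upper 95% limit) as reported in table 2.1
table_2_1 = {
    "15-24 years": (0.47, 0.38, 0.61),
    "25-34 years": (0.84, 0.66, 1.07),
    "Men": (1.24, 1.02, 1.52),
    "Injured": (2.53, 2.02, 3.13),
}

for variable, values in table_2_1.items():
    percent, direction, significant = describe(*values)
    note = "significant" if significant else "not significant"
    print(f"{variable}: {percent}% {direction} than reference group ({note} at 5%)")
```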

**Table 2.1 Logistic regression on the propensity to report assault to police**

| Variable | Odds ratio | Lower confidence interval (95%) | Upper confidence interval (95%) | Significance (P value) |
|---|---|---|---|---|
| 15-24 years | 0.47 | 0.38 | 0.61 | <.0001 |
| 25-34 years | 0.84 | 0.66 | 1.07 | 0.160 |
| 35 years and over\* | 1.00 | N.A. | N.A. | N.A. |
| Men | 1.24 | 1.02 | 1.52 | 0.036 |
| Injured | 2.53 | 2.02 | 3.13 | <.0001 |

\*Reference group

The p-value is also a useful statistic to examine. The odds ratios for the 15-24 years age group, sex and injury variables are statistically significant at the 5% level. That is, if the variable had no real effect on the propensity to report an assault to police, there would be less than a 5% chance of observing an odds ratio this far from 1.00. The odds ratio for the 25-34 years age group has a p-value of 0.16. So, even if this group reported at the same rate as the 35 years and over age group, a difference at least this large would be expected in about 16% of samples.

Generally, 5% is the cut-off for the p-value used to establish a variable's statistical significance. So we draw the conclusion that there is no statistically significant difference between the people aged 25-34 years and 35 years and over in their propensity to report crime to police.
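The p-value and the confidence interval carry closely related information. As a rough sketch, and assuming the published intervals are Wald intervals that are symmetric on the log-odds scale (an assumption on our part, since the paper does not state how they were computed), an approximate p-value can be recovered from an odds ratio and its 95% limits. Because the published figures are rounded, the results only approximate the p-values in table 2.1.

```python
import math

def p_value_from_ci(odds_ratio, lower, upper, z_crit=1.959964):
    """Approximate two-sided p-value from an odds ratio and its 95% limits,
    assuming a Wald interval on the log scale."""
    std_error = (math.log(upper) - math.log(lower)) / (2.0 * z_crit)
    z = math.log(odds_ratio) / std_error
    # Two-sided tail area of the standard normal: 2 * (1 - Phi(|z|)).
    return math.erfc(abs(z) / math.sqrt(2.0))

# Values as reported in table 2.1
p_25_34 = p_value_from_ci(0.84, 0.66, 1.07)
p_men = p_value_from_ci(1.24, 1.02, 1.52)

print(f"25-34 years: p is approximately {p_25_34:.3f}")
print(f"Men: p is approximately {p_men:.3f}")
```

Under this assumption the recovered values are close to the reported 0.160 and 0.036, which shows that a confidence interval including 1.00 and a p-value above 0.05 are two views of the same result.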

So in summary this model tells us that:

Men are more likely than women to report an assault.

People aged 15-24 years are less likely to report than older people.

People injured in an assault are more likely to report it than people who were not injured.

In this case, all of the explanatory variables are binary. If a continuous variable was used as an explanatory variable, then the interpretation of the results is slightly more difficult. See Appendix A for more details.

Model specification is, however, an important aspect of regression analysis. Incorrect specification of the model can lead to erroneous conclusions. Assuming correct specification, a slightly different model specification can lead to more detailed analysis and interpretation of the data. In the second example (see table 2.2), age by sex variables have been created. In this case all of the variables are relative to the group of women aged 35 years and over.

**Table 2.2 Logistic regression on the propensity to report assault to police**

| Variable | Odds ratio | Lower confidence interval (95%) | Upper confidence interval (95%) | Significance (P value) |
|---|---|---|---|---|
| Men aged 15-24 years | 0.56 | 0.41 | 0.79 | <.0001 |
| Men aged 25-34 years | 1.09 | 0.77 | 1.54 | 0.624 |
| Men aged 35 years and over | 1.21 | 0.91 | 1.62 | 0.204 |
| Women aged 15-24 years | 0.49 | 0.34 | 0.72 | 0.000 |
| Women aged 25-34 years | 0.78 | 0.54 | 1.10 | 0.161 |
| Women aged 35 years and over\* | 1.00 | N.A. | N.A. | N.A. |
| Injured | 2.53 | 2.01 | 3.13 | <.0001 |

\*Reference group

So men aged 15-24 years are 44% (1-0.56) less likely to report an assault to police than women aged 35 years and over. Women aged 15-24 years are 51% (1-0.49) less likely to report assault to police than women aged 35 years and over. The differences between the reference group and the other age by sex groups are not statistically significant at the 5% level. This model gives the same result as the previous model for people injured in an assault, namely that they are about 2.5 times more likely to report the crime.

In summary, the level of detail obtained from logistic regression analysis will depend on the type of model specified.
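The age by sex variables used in table 2.2 can be thought of as a set of 0/1 indicator variables, one for each group other than the reference group. The sketch below shows one way such variables might be constructed; the function and category labels are our own and are not taken from the NCSS CURF.

```python
AGE_GROUPS = ["15-24", "25-34", "35 and over"]
SEXES = ["Men", "Women"]
REFERENCE = ("Women", "35 and over")   # reference group, as in table 2.2

def age_by_sex_indicators(sex, age_group):
    """Return 0/1 indicators, one per non-reference age by sex group."""
    if sex not in SEXES or age_group not in AGE_GROUPS:
        raise ValueError("unknown sex or age group")
    indicators = {}
    for s in SEXES:
        for a in AGE_GROUPS:
            if (s, a) == REFERENCE:
                continue  # the reference group gets no indicator of its own
            indicators[f"{s} aged {a}"] = 1 if (s, a) == (sex, age_group) else 0
    return indicators

# A man aged 15-24 has exactly one indicator switched on.
print(age_by_sex_indicators("Men", "15-24"))
# A woman aged 35 and over (the reference group) has all indicators at 0.
print(age_by_sex_indicators("Women", "35 and over"))
```

Because the reference group has every indicator set to 0, each estimated odds ratio compares one group directly with women aged 35 years and over.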

This section has provided a very brief overview of logistic regression. Some additional information, including ways to measure the explanatory power of different models, is provided in Appendix A. For further information on the method the authors recommend Hosmer and Lemeshow (2000).

**2.3 Informative Sample Design**

This section will provide some brief background to the way surveys are designed and the effect this design can have on regression models. The simplest method for taking a survey is a simple random sample. Using this method, a sub-population is selected at random from the in-scope population, and this sample is used to represent that population. The same weight is then applied to each of the sampled units to get an estimate for the population.
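A simple random sample and its equal weights can be sketched as follows. The population here is hypothetical (10,000 people of whom 500 are victims of crime) and the function name is our own; the point is only that every sampled unit carries the same weight, equal to the population size divided by the sample size.

```python
import random

def simple_random_sample(population, n, seed=0):
    """Draw n units at random without replacement; every unit gets the same weight."""
    rng = random.Random(seed)
    sample = rng.sample(population, n)
    weight = len(population) / n   # each sampled unit represents this many people
    return sample, weight

# Hypothetical population: 1 = victim of crime, 0 = not a victim.
population = [1] * 500 + [0] * 9500

sample, weight = simple_random_sample(population, n=1000)
estimated_total = weight * sum(sample)
estimated_rate = sum(sample) / len(sample)

print(f"weight per sampled person: {weight}")
print(f"estimated number of victims: {estimated_total:.0f} (true value 500)")
print(f"estimated victimisation rate: {estimated_rate:.3f}")
```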

The advantage of this kind of sampling is that it is easy to conduct and to extrapolate the results to the population (any percentages calculated from the sample can be said to apply to the population). However, a simple random sample needs to be fairly large to be representative of all groups in the population, which adds significantly to the cost of conducting such a survey.

Multistage sampling involving clustering is an integral part of the design of ABS household surveys. In a multistage sample design, initially a sample of first stage units is randomly selected from the population framework. In urban areas, first stage units are collection districts (CDs) and these are often selected with probability proportional to their size (in terms of population).

Within each selected first stage unit (CD), a sample of second stage units is then selected. In ABS surveys these are commonly blocks of dwellings, with usually only one block being selected per CD, again with probability proportional to size. At the third stage, a sample of dwellings is selected from each selected block by taking a systematic skip through the list of all dwellings in the block.

Separate stratification and selection methods can be used at each stage of selection. In order to make the sample more efficient with respect to cost, the sample is further clustered in non-urban areas by introducing an additional stage of selection as the first stage.

In this new first stage, CDs are grouped into broader areas called Primary Selection Units (PSUs). After selecting a sample of PSUs, selection for the remaining stages proceeds as explained above. In areas where travelling costs are very high this ensures that the resulting selected dwellings are geographically closer to each other.

The relative size of the samples selected at each stage is carefully chosen so as to give the best trade-off between the cost of enumerating (collecting) the sample and the variances, or measures of statistical accuracy, for estimates calculated from the sample.
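The three urban stages described above (CDs, then a block within each CD, then a systematic skip through the dwellings in the block) can be sketched in Python. This is a simplified illustration rather than the ABS selection procedure: the frame is invented, the probability proportional to size draw is made with replacement for brevity, and stratification and the extra non-urban stage are left out.

```python
import random

def select_pps(units, sizes, n, rng):
    """Select n units with probability proportional to size.
    Simplification: draws are made with replacement."""
    return rng.choices(units, weights=sizes, k=n)

def systematic_skip(dwellings, n, rng):
    """Select n dwellings by taking a fixed skip through the list from a random start."""
    skip = len(dwellings) / n
    start = rng.random() * skip
    last = len(dwellings) - 1
    return [dwellings[min(int(start + i * skip), last)] for i in range(n)]

rng = random.Random(1)

# Hypothetical frame: 10 CDs, each with 4 blocks of 20 to 35 dwellings.
frame = {
    f"CD{c}": {
        f"CD{c}-B{b}": [f"CD{c}-B{b}-D{d}" for d in range(20 + 5 * b)]
        for b in range(4)
    }
    for c in range(10)
}
cd_sizes = {cd: sum(len(d) for d in blocks.values()) for cd, blocks in frame.items()}

selected_dwellings = []
# Stage 1: select CDs with probability proportional to their number of dwellings.
for cd in select_pps(list(frame), [cd_sizes[c] for c in frame], n=3, rng=rng):
    blocks = frame[cd]
    # Stage 2: select one block per CD, again proportional to size.
    block = select_pps(list(blocks), [len(blocks[b]) for b in blocks], n=1, rng=rng)[0]
    # Stage 3: systematic skip through the dwellings in the selected block.
    selected_dwellings.extend(systematic_skip(blocks[block], n=5, rng=rng))

print(len(selected_dwellings), "dwellings selected")
```

The selected dwellings fall in a small number of blocks, which is what keeps enumeration costs down and also what makes the sample clustered.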

The survey weights take into account the selection probability of each unit and also adjust for any non-response which the survey experiences. So, when deriving design based estimates of totals, averages or quintiles from any ABS survey, as long as the design based weights are applied correctly, the estimates obtained will reflect the population (subject to the sampling errors).
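As a rough sketch of how such weights work, the code below builds each weight as the inverse of the selection probability, inflated for non-response, and then uses the weights to estimate a proportion. The respondents, probabilities and response rates are hypothetical, and actual ABS weighting involves further steps that are not shown here.

```python
def design_weight(selection_probability, response_rate=1.0):
    """Inverse of the selection probability, inflated to adjust for non-response."""
    return 1.0 / (selection_probability * response_rate)

def weighted_total(values, weights):
    return sum(v * w for v, w in zip(values, weights))

def weighted_proportion(values, weights):
    return weighted_total(values, weights) / sum(weights)

# Hypothetical respondents:
# (reported to police 1/0, selection probability, response rate in their group)
respondents = [
    (1, 0.002, 0.80),
    (0, 0.002, 0.80),
    (1, 0.001, 0.70),
    (0, 0.001, 0.70),
    (0, 0.001, 0.70),
]

reported = [r[0] for r in respondents]
weights = [design_weight(prob, rate) for _, prob, rate in respondents]

unweighted = sum(reported) / len(reported)
weighted = weighted_proportion(reported, weights)
print(f"unweighted proportion reporting: {unweighted:.3f}")
print(f"weighted proportion reporting:   {weighted:.3f}")
```

People selected with a lower probability, or from a group with more non-response, stand in for more of the population and so receive a larger weight.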

A side effect of this complex design is that the sample is no longer a simple random sample. When regression models are fitted to simple random samples there is no need to weight the data in order to ensure that the parameters and inferences also apply to the population. However, when a regression model is fitted to data from a complex survey design (such as a multistage design) it may be necessary to take account of the survey design when making inferences about the population, which is usually the analysis goal.

When a regression model is fitted to sample data without taking the survey design into account, we may well end up with different model parameters (coefficients) and inferences (significance levels for the estimated parameters, prediction error etc.) compared with the corresponding results from fitting the same model to the population.

For example, in a business collection the survey sample has a large proportion of large businesses, to reduce the large variance on the estimate that would occur if the inclusion of these units was subject to chance. Coles-Myer and Woolworths, for instance, would always be selected in the ABS Retail Business Survey. Small businesses, because of their similarity to each other, are selected with low probability, which still gives a low sampling error.

However, in the population the reverse is true. Small businesses vastly outnumber the large businesses. So fitting a model to the sample data without taking this into account will result in model parameter estimates and variances that are very different from those obtained from a model fitted directly to the population.
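The business example can be made concrete with a small sketch. For simplicity it compares a mean rather than regression parameters, but the same logic applies to a fitted model. All figures are hypothetical: a population of 10 large and 990 small businesses, with every large business selected and one in ten small businesses selected.

```python
# Hypothetical population: turnover of 1000 for each large business, 10 for each small one.
large_turnover, small_turnover = 1000.0, 10.0
n_large_pop, n_small_pop = 10, 990

population_mean = (
    n_large_pop * large_turnover + n_small_pop * small_turnover
) / (n_large_pop + n_small_pop)

# Sample: all large businesses (probability 1.0), one in ten small ones (probability 0.1).
sample = [(large_turnover, 1.0)] * 10 + [(small_turnover, 0.1)] * 99

values = [turnover for turnover, _ in sample]
weights = [1.0 / probability for _, probability in sample]

unweighted_mean = sum(values) / len(values)
weighted_mean = sum(v * w for v, w in zip(values, weights)) / sum(weights)

print(f"population mean:        {population_mean:.1f}")
print(f"unweighted sample mean: {unweighted_mean:.1f}")
print(f"weighted sample mean:   {weighted_mean:.1f}")
```

The unweighted sample mean is dominated by the over-represented large businesses, while the weighted mean recovers the population value.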

In household surveys this problem may occur if for example we are trying to measure disease prevalence and it happens to be geographically clustered, like our survey design. If the selected sample happens to overlap with population disease clusters, then the model will be fitted to sample data which reflects high rates of disease. However, the reality is that other areas not selected don’t have such high disease rates. Therefore, in this case making inferences about the population from an unadjusted model fitted to sample data will be incorrect.

In summary, when the sample design is not a simple random sample, such as the multistage clustered sample used in the NCSS, then the survey design should be accounted for when a regression model is used. In the last ten years statistical methods have become available that enable us to adjust the parameters of models fitted to survey data to correctly account for design informativeness, although most software packages do not as yet include these methods.

**2.4 What can be done to account for survey design?**

There are a number of different ways in which the sample design can be taken into account. The simpler methods include the variables used in the survey design among the explanatory variables in the model. These may include stratification variables, identifiers for each multistage unit, variables that determine the measure of size or variables that determine the level of clustering. More complicated methods incorporate adjustments into the model estimation that take account of the net effect of the design weights after discounting for any information the explanatory variables may contain about the survey design. This paper will outline very briefly some methods for addressing this issue.

If a survey is based on a simple random sample, then no adjustment has to be made when conducting regression analysis. However, if the survey has a more complex design then an adjustment may have to be made.

Pfeffermann and Sverchkov (1999) show how to go about adjusting model estimation for design informativeness and compare a “gold standard” method (proposed by Pfeffermann and Sverchkov) against alternatives such as no adjustment and using the full design weight, under several simulated survey designs. In the examples they use, they show that the no adjustment approach gives the same result as the gold standard when the sample design is uninformative (e.g. a simple random sample).

Using the full design weight is the correct thing to do when the design is informative and the explanatory variables used in the model contain no information whatsoever about the survey design. However, for situations in between these two extremes the gold standard method should be used to determine the correct adjustment. Various tests for design informativeness are also given in Pfeffermann and Sverchkov (1999).

A replicate method takes a number of subsamples from the original sample, with each replicate based on a particular modification of the original survey sample. For example, if the original sample has 10,000 people, then a subsample of 9,000 people would be taken and new (replicate) weights created. Then a second subsample of 9,000 people is taken, this time including the 1,000 people excluded from the first subsample and excluding a different 1,000 people. Once again a set of new (replicate) weights is created. This process is repeated ten times, each time with a different 1,000 people excluded from the sample. The original weights and the replicate weights are then used to calculate a variance and confidence intervals.
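The replicate procedure described above can be sketched as a delete-a-group jackknife. Several details below are assumptions on our part rather than the ABS method: people are assigned to the ten groups at random (in practice groups are usually formed to reflect the sample design), the remaining weights are scaled up by 10/9, and the variance uses the usual delete-a-group jackknife formula. The data are simulated.

```python
import math
import random

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def make_replicate_weights(weights, n_groups, rng):
    """For each group g, set the weights of that group to zero and scale up the rest."""
    groups = [i % n_groups for i in range(len(weights))]
    rng.shuffle(groups)  # assumption: random assignment of people to groups
    scale = n_groups / (n_groups - 1)
    return [
        [0.0 if grp == g else w * scale for w, grp in zip(weights, groups)]
        for g in range(n_groups)
    ]

def jackknife_variance(values, weights, replicates):
    """Delete-a-group jackknife variance of the weighted mean."""
    full_estimate = weighted_mean(values, weights)
    g = len(replicates)
    squared_diffs = sum(
        (weighted_mean(values, rep) - full_estimate) ** 2 for rep in replicates
    )
    return (g - 1) / g * squared_diffs

rng = random.Random(2005)
n = 10_000
# Simulated data: 1 = reported to police, with hypothetical survey weights.
values = [1 if rng.random() < 0.3 else 0 for _ in range(n)]
weights = [rng.uniform(200.0, 600.0) for _ in range(n)]

replicates = make_replicate_weights(weights, n_groups=10, rng=rng)
estimate = weighted_mean(values, weights)
std_error = math.sqrt(jackknife_variance(values, weights, replicates))
ci = (estimate - 1.96 * std_error, estimate + 1.96 * std_error)

print(f"estimate {estimate:.3f}, 95% CI {ci[0]:.3f} to {ci[1]:.3f}")
```

Each replicate drops 1,000 of the 10,000 people, and the spread of the ten replicate estimates around the full-sample estimate gives the variance.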

**2.5 What effect does the sample design have on the model?**

The following example will demonstrate the different logistic model results achieved by using different methods of accounting for sample design. In this example the model estimated was whether the victim reported a break-in (Yes/No) against four explanatory variables: Metro (a capital city) (Yes/No), Living Alone (Yes/No), Living near pub/hotel/club/licensed premises (Yes/No), and whether the house can be seen from the street at all (Yes/No). This is a logistic regression model, which was estimated using no weight, the survey weights, and replicate weights, to illustrate the effects of including different versions of the weight. The results of the three methods are shown in tables 2.3-2.5.

**Table 2.3 Logistic regression for break-in using no weight**

| Variable | Odds ratio | Lower confidence interval (95%) | Upper confidence interval (95%) | Significance (P value) |
|---|---|---|---|---|
| Living alone | 1.17 | 1.01 | 1.35 | 0.035 |
| Metro | 1.31 | 1.15 | 1.36 | <.0001 |
| House close to pub | 1.37 | 1.14 | 1.66 | 0.001 |
| House not visible from street | 0.54 | 0.35 | 0.84 | 0.006 |

The odds ratios have not changed appreciably between any of the methods. What has changed markedly is the statistical significance of the variables between the different methods.

**Table 2.4 Logistic regression for break-in using the survey weight**

| Variable | Odds ratio | Lower confidence interval (95%) | Upper confidence interval (95%) | Significance (P value) |
|---|---|---|---|---|
| Living alone | 1.16 | 1.15 | 1.17 | <.0001 |
| Metro | 1.37 | 1.36 | 1.38 | <.0001 |
| House close to pub | 1.39 | 1.37 | 1.40 | <.0001 |
| House not visible from street | | | | |

With no weight used in the regression all of the variables are significant at the 5% level. Using the original survey weights all of the variables are statistically significant at the 1% level. In the third method only Metro is still significant at the 1% level. The house being close to a pub is statistically significant at the 5% level and the remaining two variables are statistically significant at the 10% level.

**Table 2.5 Logistic regression for break-in using replicate weights**

| Variable | Odds ratio | Lower confidence interval (95%) | Upper confidence interval (95%) | Significance (P value) |
|---|---|---|---|---|
| Living alone | 1.16 | 1.07 | 1.26 | 0.076 |
| Metro | 1.37 | 1.25 | 1.50 | 0.002 |
| House close to pub | 1.39 | 1.25 | 1.57 | 0.012 |
| House not visible from street | 0.58 | 0.43 | 0.73 | 0.087 |

So if your criterion was to include only variables that are statistically significant at the 5% level, then using the original weights would lead you to include all of the variables, when in fact the use of the replicate weights shows that you may wish to exclude House not visible from street and Living alone from your break-in model.
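The variable selection rule described here can be sketched as a simple check of each p-value against the conventional cut-offs. The p-values are those reported in tables 2.3 and 2.5; the value reported as <.0001 is entered as 0.0001, which is an upper bound rather than the exact figure, and the function name is our own.

```python
def significance_level(p_value):
    """Smallest conventional level (1%, 5% or 10%) at which the p-value is significant."""
    for cutoff, label in ((0.01, "1%"), (0.05, "5%"), (0.10, "10%")):
        if p_value < cutoff:
            return label
    return "not significant"

variables = ["Living alone", "Metro", "House close to pub", "House not visible from street"]

# P values as reported; "<.0001" is entered as its upper bound of 0.0001.
no_weight = dict(zip(variables, [0.035, 0.0001, 0.001, 0.006]))          # table 2.3
replicate_weights = dict(zip(variables, [0.076, 0.002, 0.012, 0.087]))   # table 2.5

for name in variables:
    print(
        f"{name}: no weight {significance_level(no_weight[name])}, "
        f"replicate weights {significance_level(replicate_weights[name])}"
    )

# Variables that would be kept under a 5% criterion using replicate weights.
kept = [name for name in variables if replicate_weights[name] < 0.05]
print("kept at 5% with replicate weights:", kept)
```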

The replicate weight method is preferred as the estimates being derived from the survey are the most closely related to the estimates we would gain if we had data on the whole population. For more information on the replicate method readers can refer to Tanton, Jones and Lubulwa (2001).

The field of accounting for design informativeness in household surveys is still a very new one. The Pfeffermann and Sverchkov (1999) method is seen as the gold standard. However, as this method is very complex and is still relatively new, statistical packages are yet to incorporate it into their regression modules. So it is difficult for most users to implement the method.

This section has provided a basic overview of the effect of survey sample design. It is an issue which users should be aware of when conducting regression analysis. However, accounting for the survey sample design is still a very experimental area. Users should seek professional statistical advice when specifying and interpreting the results from a model based on a survey with a complex survey design.

For further information on the methods the authors recommend Chambers and Skinner (2003). A list of other useful references is provided in the Bibliography.

**3. Conclusion**

The process of converting data collected by the ABS into policy is a long and at times difficult task. By improving users' access to unit record files via the RADL™, the ABS is attempting to make this task easier.

Making use of a logistic model can help extract information from data to help inform users and develop policy. This paper has applied a logistic model to further understand which groups or factors increase or decrease the propensity of people to report crimes to the police.

This has been done to demonstrate the usefulness of the logistic model in understanding relationships between variables. Interpreting these results can be complicated by the sample design used to collect the survey data.

Several methods have been suggested in this paper to account for the effect of sample design. Some are simple fixes while others are quite complex in nature. We highlight the differences in the results from using different approaches. Users should seek professional statistical advice when specifying and interpreting the results from a model based on a survey with a complex survey design.

**Acknowledgements **

The authors would like to thank Karen Gelb, Julie Cole, Jan Schumacher, Chris Libreri, Robert Tanton, Marion McEwin, Daniel Elazar, Glenys Bishop and Sally Goodspeed for their helpful comments and assistance with this research project.

The content and presentation of the paper are much improved as a result of their input. Responsibility for any errors or omissions remains solely with the authors.

**BIBLIOGRAPHY **

Australian Bureau of Statistics (2002) *Crime and Safety, Australia*, ABS cat. no. 4509.0, Canberra.
Australian Bureau of Statistics (2002) *Expanded Confidentialised Unit Record File, National Crime and Safety Survey, 2002*, ABS cat. no. 4524.0.55.001.

An, A., Watts, D & Stokes, M. (1999) 'SAS Procedures for Analysis of Sample Survey Data', Survey Statistician, December 1999, pp. 10 - 13.

Chambers, R. & Skinner, C., Eds (2003) 'Analysis of Survey Data', Wiley, New York.

Cohen, S. (1997) 'An evaluation of alternative PC-Based software packages developed for the analysis of complex survey data'. The American Statistician, Vol. 51, No. 3, pp. 285 - 292.

Fuller, W. (1975) 'Regression for sample survey', Sankhya: The Indian Journal of Statistics, Vol. 37, Series C, Pt. 3, pp. 117 - 132.

Greene, W. (2003) 'Econometric Analysis 5th Edition', Prentice Hall, Sydney.

Gujarati, D. (1996) 'Basic Econometrics 3rd Edition', McGraw Hill, Singapore.

Hosmer, D. and Lemeshow, S. (2000) 'Applied Logistic Regression 2nd Edition', Wiley, New York.

Korn, E. & Graubard, B. (1995) 'Analysis of Large Health Surveys: Accounting for the Sampling Design', *Journal of the Royal Statistical Society A*, Vol. 158, Part 2, pp. 263 - 295.

Kott, P. (1990) 'What does performing a linear regression on survey data mean?', ASA Proceedings of the Survey Research Methods Section, pp. 337 - 341.

Kott, P. (1991) 'A model-based look at Linear Regression with Survey Data', *The American Statistician*, Vol.
45, No. 2, pp. 107 - 112.

Maddala, G. (1992) ‘Introduction to Econometrics 2nd Edition’, Prentice Hall, London.

Magee, L. (1998) 'Improving survey-weighted least squares regression', *Journal of the Royal Statistical Society B*, Vol. 60, Part 1, pp. 115 - 126.

Magee, L., Robb, A. and Burbidge, J (1998) 'On the use of sampling weights when estimating regression
models with survey data', *Journal of Econometrics*, Vol. 84, pp. 251 - 271.

Pfeffermann, D. (1993) 'The role of sampling weights when modelling Survey data', International Statistical Review, Vol 61, No. 2, pp. 317 - 337.

Pfeffermann, D. and LaVange, L. (1999) 'Regression models for stratified multi-stage cluster samples', in Analysis of Complex Surveys (Eds. C. J. Skinner, D. Holt & T. M. F. Smith), pp. 237 - 260, London: Wiley.

Pfeffermann, D. and Sverchkov, M. (1999) 'Parametric and semi-parametric estimation of models fitted to survey data', *The Indian Journal of Statistics*, Vol. 61, pp. 166 - 186.

Rao, J., Wu, C and Yue, K (1992) 'Some recent work on resampling methods for complex surveys', Survey Methodology, Vol. 18, No. 2, pp. 209 - 217.

Roberts, G., Rao, J., and Kumar, S. (1987) 'Logistic regression analysis of sample survey data', *Biometrika*,
Vol. 74, No. 1, pp. 1 - 12.

Scott, A. (1986) 'Logistic Regression with Survey data', *ASA Proceedings of the Survey Research Methods Section*, pp. 25 - 30.

Tanton, R., Jones, R and Lubulwa G., (2001) Modelling Crime Victimisation and the Propensity to Report Crime to Police, Methodology Advisory Committee Paper, Australian Bureau of Statistics. This paper is available free of charge from www.abs.gov.au.

Ten Cate, A. (1986) 'Regression Analysis using Survey data with Endogenous Design', *Survey Methodology*, Vol. 12, No. 2, pp. 121 - 138.

Thompson, K and Sigman, R. (2000) 'Estimation and Replicate Variance Estimation of Median Sales Prices of Sold Houses', Survey Methodology, Vol. 26, No. 2, pp. 153 - 162.

**Appendix A: Additional information on logistic regression**

The logistic model for estimating the probability of an event ($P_i$) can be expressed by the equations given below:

$$Y_i \sim \text{Binomial}(P_i)$$

$$\log\left\{\frac{P_i}{1-P_i}\right\} = a + b_1 x_{i1} + \dots + b_k x_{ik}$$

where $Y_i$ follows a binomial distribution, $a$ is the intercept parameter, $b_1, \dots, b_k$ are the $k$ regression parameters, $x_{i1}, \dots, x_{ik}$ are the $k$ explanatory variables and $i$ indexes each person. The logistic model has the useful property that the predicted probabilities it produces are constrained to lie between 0 and 1 (Gujarati (1995)). The odds of success (e.g. of reporting a crime to police) for the $i$th person are as follows:

$$\frac{P_i}{1-P_i} = \exp(a + b_1 x_{i1} + \dots + b_k x_{ik})$$
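As a minimal numerical sketch of the two equations above, the log-odds can be converted to odds and to a probability as follows. The intercept and coefficient values are made up for illustration and are not estimates from the survey; the function names are likewise hypothetical.

```python
import math

def log_odds(a, b, x):
    """Linear predictor a + b1*x1 + ... + bk*xk for one person."""
    return a + sum(bj * xj for bj, xj in zip(b, x))

def odds(a, b, x):
    """Odds of the event, P/(1-P) = exp(a + b1*x1 + ... + bk*xk)."""
    return math.exp(log_odds(a, b, x))

def probability(a, b, x):
    """Probability of the event implied by the log-odds (between 0 and 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds(a, b, x)))

# Hypothetical parameters: an intercept and two binary explanatory variables.
a = -0.5
b = [0.3, -1.2]
person = [1, 0]   # first characteristic present, second absent

p = probability(a, b, person)
print("probability:", round(p, 3))
print("odds:", round(odds(a, b, person), 3))
```

Whatever values the explanatory variables take, the implied probability stays between 0 and 1, which is the property noted above.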

The following section provides some useful statistical diagnostics which analysts may wish to use in assessing a logistic model they have specified.

**Akaike Information Criterion **

This statistic and the Schwarz Criterion give an indication of how well the explanatory variables included in the model explain variation in the dependent variable. Both are based on the log-likelihood, penalised for the number of parameters, and are commonly used to identify which independent variables to include in the model (Maddala (1992)).
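The paper does not state the formulas, so the sketch below uses the standard textbook definitions: AIC = -2LL + 2p and Schwarz Criterion = -2LL + p ln(n), where p is the number of estimated parameters and n the number of observations. The log-likelihood values and model names are hypothetical. Smaller values indicate the preferred model.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: -2LL + 2p."""
    return -2.0 * log_likelihood + 2.0 * n_params

def schwarz(log_likelihood, n_params, n_obs):
    """Schwarz Criterion: -2LL + p*ln(n)."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

# Hypothetical candidate models: (log-likelihood, number of parameters).
candidates = {
    "intercept only": (-520.0, 1),
    "two variables": (-495.0, 3),
    "five variables": (-493.5, 6),
}
n_obs = 800

for name, (ll, p) in candidates.items():
    print(name, round(aic(ll, p), 1), round(schwarz(ll, p, n_obs), 1))

best_by_aic = min(candidates, key=lambda m: aic(*candidates[m]))
best_by_sc = min(candidates, key=lambda m: schwarz(*candidates[m], n_obs))
print("preferred by AIC:", best_by_aic)
print("preferred by Schwarz Criterion:", best_by_sc)
```

In this made-up example the three extra variables in the largest model improve the log-likelihood too little to offset the penalty, so both criteria prefer the two-variable model.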

**Likelihood Ratio **

The likelihood ratio is used for testing a hypothesis about the model parameters, e.g. whether the model parameters are jointly different from zero, or different from those of another model (Greene (2003)).

$$\mathrm{LR} = -2\,\{LL(a) - LL(a, b)\}$$

$$\phantom{\mathrm{LR}} = -2\,\{\text{Log Likelihood (Model A)} - \text{Log Likelihood (Model B)}\}$$

The likelihood ratio statistic cannot be negative, although the two log-likelihoods themselves need not be positive. Under the null hypothesis the likelihood ratio is chi-squared distributed with $k$ (the number of explanatory variables) degrees of freedom. The model chi-square statistic can be used to determine if there is a statistically significant difference between Model A and Model B.
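A short sketch of the test follows, assuming Model A is the intercept-only model and Model B adds k = 4 explanatory variables. The log-likelihood values are hypothetical, and the test is carried out at the 5% level using standard chi-squared critical values rather than an exact P value.

```python
# 5% critical values of the chi-squared distribution for 1 to 4 degrees of freedom.
CHI2_CRITICAL_5PCT = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

def likelihood_ratio(ll_model_a, ll_model_b):
    """LR = -2 {LL(Model A) - LL(Model B)}, with Model A nested in Model B."""
    return -2.0 * (ll_model_a - ll_model_b)

ll_a = -520.0   # hypothetical log-likelihood, intercept-only model
ll_b = -495.0   # hypothetical log-likelihood, model with k explanatory variables
k = 4

lr = likelihood_ratio(ll_a, ll_b)
significant = lr > CHI2_CRITICAL_5PCT[k]
print("LR statistic:", lr)
print("Model B significantly better at the 5% level:", significant)
```

Both log-likelihoods are negative here, yet the statistic is positive because the larger model always fits at least as well as the model nested within it.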

**Pseudo R²**

In logistic regression there is no equivalent statistic to the R² used in a linear regression, which estimates the proportion of variance in the dependent variable that is explained by the variance in the independent variables. There are, however, several "Pseudo R²" statistics. One of these is McFadden's statistic, sometimes referred to as the likelihood ratio index (LRI) (Wooldridge (2000)).

The LRI depends on the ratio of the beginning and ending log-likelihood functions, which makes it very difficult to "maximise the R²" in logistic regression. Another measure is given by:

$$1 - \frac{LL_{\text{estimated}}}{LL_{\text{coefficient only}}}$$
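The log-likelihood ratio measure given above can be computed directly from two fitted models. The sketch assumes that the "coefficient only" log-likelihood refers to a baseline model containing only the intercept; the log-likelihood values and the function name are hypothetical.

```python
def pseudo_r2(ll_estimated, ll_baseline):
    """1 - LL(estimated model) / LL(baseline model)."""
    return 1.0 - ll_estimated / ll_baseline

ll_baseline = -520.0    # hypothetical, intercept-only model (assumed baseline)
ll_estimated = -495.0   # hypothetical, model with explanatory variables

value = pseudo_r2(ll_estimated, ll_baseline)
print("pseudo R-squared:", round(value, 3))
```

The measure is 0 when the explanatory variables add nothing to the baseline model and moves towards 1 as the fitted log-likelihood approaches zero. Values are typically much lower than the R² of a linear regression and should not be read on the same scale.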

This measure of goodness of fit should be used in conjunction with the statistical significance measures of the explanatory variables.

**Percent Correctly Predicted **

This is another goodness of fit measure. It is based on the number of times the predicted yi equals the actual yi, expressed as a percentage of all observations. It is a useful measure, but it is possible to obtain a high percentage without the model being very useful, which can make it a misleading indicator in some cases.
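The following sketch shows how a high percentage can arise from an uninformative model. It assumes the common convention of classifying a predicted probability of 0.5 or more as a predicted event; the data are made up, with 10 events among 100 people.

```python
def percent_correct(actual, predicted_prob, cutoff=0.5):
    """Percentage of cases where the predicted outcome equals the actual one."""
    predicted = [1 if p >= cutoff else 0 for p in predicted_prob]
    hits = sum(1 for y, y_hat in zip(actual, predicted) if y == y_hat)
    return 100.0 * hits / len(actual)

# Hypothetical outcomes: 90 people without the event, 10 with it.
actual = [0] * 90 + [1] * 10

# An uninformative model that gives everyone the same low probability.
flat_probs = [0.1] * 100

score = percent_correct(actual, flat_probs)
print("percent correctly predicted:", score)
```

The flat model never predicts the event, yet it is right for every person who did not experience it. This is why the measure needs to be read alongside the other diagnostics.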

*Interpreting the coefficient of a continuous explanatory variable *

If a continuous variable is used then the interpretation of the results is slightly more difficult. For example, suppose age is an explanatory variable for a certain event.

Age has an odds ratio of 0.978, which relates to a change in age of one year. Thus the odds of the event at a particular age are 0.978 times the odds of the event at one year younger.

It is useful to compare two representative ages, say 20 and 40 (here $b_0$ denotes the intercept and $b_1$ the coefficient on age):

$$\frac{\text{odds}(20)}{\text{odds}(40)} = \frac{\exp(b_0 + 20\,b_1)}{\exp(b_0 + 40\,b_1)} = \exp(-20\,b_1)$$

To calculate, $\exp\{(-20)(-0.023)\} \approx 1.57$.

Therefore, for this event, the odds for 20-year-olds are 1.57 times the odds for 40-year-olds.
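The same comparison can be made for any pair of ages with a small helper, sketched below. The function name is hypothetical and the coefficient is the rounded value of -0.023 quoted in the text, so the output differs slightly in the second or third decimal place from the figures quoted above, which presumably come from an unrounded coefficient.

```python
import math

def odds_ratio_between(x_from, x_to, coefficient):
    """Odds at x_from relative to odds at x_to: exp(coefficient * (x_from - x_to))."""
    return math.exp(coefficient * (x_from - x_to))

b_age = -0.023   # rounded coefficient on age, as quoted in the text

one_year = odds_ratio_between(21, 20, b_age)     # effect of one extra year
twenty_vs_forty = odds_ratio_between(20, 40, b_age)

print("odds ratio for one extra year of age:", round(one_year, 3))
print("odds for age 20 relative to age 40:", round(twenty_vs_forty, 2))
```

The intercept cancels in the ratio, so only the coefficient and the difference in ages are needed.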

**Appendix B: Glossary**

**Binary variable:** A variable which can take only one of two values (0, 1).

**Collection district (CD):** The smallest geographic area defined in the Australian Standard Geographical Classification (ASGC). It has been designed for use in the Census of Population and Housing. CDs serve as the basic building block in the ASGC and are used for the aggregation of statistics to larger census geographic areas.

**Confidence interval:** The range associated with a certain statistical estimate. For example, an estimate of 100 with a 95% confidence interval of 80 to 120 means that we can be 95% confident that the true population value lies between 80 and 120.

**Dependent variable:** The variable which is being predicted or explained. It is usually denoted by y in a regression equation.

**Explanatory variable:** The variable(s) doing the predicting or explaining. It is usually denoted by x in a regression equation.

**Intercept parameter:** The starting value for the regression model when all the explanatory variables are 0.

**Log:** A mathematical transformation which is used to scale back widely dispersed data.

**Log-odds ratio:** The odds of a given event is the ratio of the probability of its occurrence to the probability of its non-occurrence. For example, the odds of rolling a six on a die are one to five.

**Logistic regression:** A modelling technique used to investigate the relationship between a certain event which takes a binary form (for example, reporting crime to police) and a set of explanatory variables.

**Regression parameter:** Determines the effect that an explanatory variable will have on the dependent variable. Also known as a regression coefficient.

**Reporting rate:** The total number of most recent incidents of an offence that were reported to police expressed as a percentage of the total victims of that offence.

**Stratified:** Dividing the population into homogeneous groups of units called strata such that each unit belongs to one and only one stratum. For example, strata may be geographical areas in household surveys or state and industry in business surveys.