3.2 Data issues

3.2.2 Problems due to the non-experimental nature of the data

Using survey data for regression analysis raises a number of difficulties in applying the standard econometric tools; these difficulties are described in the following. Since detailed textbook treatments are available and because the formal estimation procedures employed are presented below, I will confine myself to a brief outline. References with a particular emphasis on survey data are BEHRMAN and OLIVER (2000) and DEATON (1997).

The classical regression model assumes that the magnitude of an observed outcome variable Y (the regressand) can be explained by a set of independent variables X (the regressors) plus an error term (see e.g. GREENE 2000, pp. 213-223).

The error term is assumed to be independently and identically distributed conditional on the X’s, with a conditional mean of zero. Usually the interest focuses on one or several particular regressors (sometimes called ‘treatment variables’), while the remaining regressors play the same role as the control group in an experiment: they allow for differences in Y that are not caused by the treatment and thus isolate the relationship of interest (DEATON 1997, pp. 92-93). All remaining statistical ‘disturbances’ (e.g. due to measurement error or unobserved explanatory variables) are captured by the error term. The efficiency and consistency of a regression analysis that aims at recovering the coefficients of the regressors hinge fundamentally on the assumptions concerning this disturbance term. Unfortunately, in survey data sets it is often implausible that the standard assumptions hold, due to the non-experimental nature of the data. The most common problems are described below, together with possible avenues for dealing with them.
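In matrix notation, the model and the assumptions described above can be written compactly as follows. The symbols β (coefficient vector), ε (error term), σ² (error variance) and I (identity matrix) are notation introduced here for illustration and are not taken from the original text.

```latex
Y = X\beta + \varepsilon, \qquad
\mathrm{E}[\varepsilon \mid X] = 0, \qquad
\mathrm{Var}[\varepsilon \mid X] = \sigma^{2} I
```

Read this way, each of the problems discussed in the following violates one of the two conditions on ε: either the zero conditional mean or the common variance.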

Simultaneity

Simultaneity causes the conditional mean of the disturbance term to differ from zero. The problem can be made intuitive by considering a system of simultaneous regression equations, which could for example arise in the framework of a household model (see section 2.3.1, pp. 77 et seq.). As soon as a regressor of a first equation appears as regressand in another equation, there will be a correlation between the disturbance term and this regressor in the first equation. An example is a (structural) production function which contains variable inputs as regressors. The same inputs also occur as regressands in the (reduced-form) input demand function of the household, and are hence subject to simultaneity. The problem could also be regarded as one of an omitted variable. In the production function example, the exogenous determinants of the variable input quantity are not included in the equation, which leads to simultaneity (or endogeneity) bias. As a result, the regression analysis no longer yields consistent estimates of the coefficients of interest. Similar effects are produced by feedback processes, reverse causality, or measurement errors (see DEATON 1997, pp. 92-105; BÖRSCH-SUPAN and KÖKE 2002).
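The production function example can be illustrated with a small simulation. This is a minimal sketch under invented assumptions: the coefficient values, the normal distributions, and the names `simulate_production_data` and `ols_slope` are hypothetical and not taken from the study. The input quantity x reacts to the same shock u that enters the production function, so OLS attributes part of the shock to the input.

```python
import random


def ols_slope(x, y):
    """Slope of a simple OLS regression of y on x (with intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    return cov_xy / var_x


def simulate_production_data(n=20000, beta=0.5, seed=1):
    """Hypothetical data: the input x responds to the output shock u."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]  # exogenous shifter of input demand
    u = [rng.gauss(0, 1) for _ in range(n)]  # disturbance of the production function
    # Input demand (reduced form): depends on z and on the same shock u
    x = [zi + 0.8 * ui + rng.gauss(0, 0.5) for zi, ui in zip(z, u)]
    # Production function (structural): output depends on the input and on u
    y = [beta * xi + ui for xi, ui in zip(x, u)]
    return z, x, y


true_beta = 0.5
z, x, y = simulate_production_data(beta=true_beta)
beta_ols = ols_slope(x, y)
print(f"true beta = {true_beta}, OLS estimate = {beta_ols:.3f}")
```

Under these assumptions the OLS estimate lies well above the true coefficient, and increasing the sample size does not remove the gap, which is what inconsistency means in practice.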

A possible way to address the simultaneity bias is to employ additional information in the form of variables which are correlated with the explanatory variable but not with the disturbance term. These variables are called ‘instruments’, and the procedure is hence called instrumental variable (IV) regression. A common difficulty is to find instruments that have the properties just mentioned (see DEATON 1997, pp. 111-116), which is why the problem is frequently ignored (e.g. by CARTER and WIEBE 1990, KLAIBER 1988, THIJSSEN 1988, to mention a few studies cited earlier).
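The idea of IV regression can be sketched for the simplest case of one endogenous regressor and one instrument. The data-generating process below is invented for illustration (the coefficients, distributions and the helper names `covariance` and `iv_slope` are assumptions, not part of the study): z shifts the regressor x but is unrelated to the disturbance u, so the ratio of covariances recovers the coefficient that OLS misses.

```python
import random


def covariance(a, b):
    """Sample covariance of two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)


def iv_slope(z, x, y):
    """Simple IV estimator: one instrument z for one regressor x."""
    return covariance(z, y) / covariance(z, x)


# Hypothetical data: x is correlated with the disturbance u, z is not.
rng = random.Random(2)
n = 20000
true_beta = 0.5
z = [rng.gauss(0, 1) for _ in range(n)]
u = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + 0.8 * ui + rng.gauss(0, 0.5) for zi, ui in zip(z, u)]
y = [true_beta * xi + ui for xi, ui in zip(x, u)]

beta_ols = covariance(x, y) / covariance(x, x)  # inconsistent under simultaneity
beta_iv = iv_slope(z, x, y)                     # consistent if z is a valid instrument
print(f"true beta = {true_beta}, OLS = {beta_ols:.3f}, IV = {beta_iv:.3f}")
```

The sketch works only because z is valid by construction. With real survey data, validity has to be argued, which is the difficulty noted above.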

Selectivity

Selectivity occurs when the regression is based on observations which are not a random subsample of the population under investigation. It can be regarded as a special type of simultaneity, since the problem is again the correlation between disturbance term and explanatory variable. The present research provides an illustrative example. It will be seen that the farm population investigated here falls into two subsamples, borrowers and non-borrowers. The relation between, say, credit and output can hence only be estimated for the borrower subgroup. However, farmers do not become borrowers by chance but most likely have certain characteristics that influence the volume of credit obtained (being more innovative, having more collateral available, etc.). In a regression that relates output to credit and does not account for these systematic characteristics, their effects will be captured by the disturbance term, which in turn will be correlated with the credit variable. The estimator will therefore be inconsistent (see DEATON 1997, pp. 101-105).

Among others, Nobel laureate James J. HECKMAN has developed tools to address the selectivity problem (e.g. 1979; 1990). The strategy of his approach is to model the selection rule by a first-stage Probit equation, which can then be used to correct the regression of interest.
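The logic of this two-step correction can be sketched as follows, under the textbook assumption of jointly normal disturbances. The notation is introduced here for illustration and does not appear in the original text: s is the selection indicator, Z the regressors of the selection rule with coefficients γ, v its standard normal disturbance, ρ the correlation between u and v, σ_u the standard deviation of u, and λ the inverse Mills ratio built from the standard normal density φ and distribution function Φ.

```latex
s = \mathbf{1}[Z\gamma + v > 0], \qquad
Y = X\beta + u \quad \text{observed only if } s = 1

\mathrm{E}[Y \mid X, Z, s = 1] = X\beta + \rho\,\sigma_{u}\,\lambda(Z\gamma),
\qquad \lambda(c) = \frac{\phi(c)}{\Phi(c)}
```

The first-stage Probit provides an estimate of γ. Adding the estimated λ as a regressor in the second stage absorbs the part of the disturbance that is correlated with selection, provided the normality assumption is a reasonable approximation.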

Heteroscedasticity

In the presence of heteroscedasticity, the assumption of an identical distribution of the error term conditional on the explanatory variables is violated. This may be the case, for example, if large farms show a greater variation in output than small farms, even after accounting for farm size. Heteroscedasticity hence reflects an inherent heterogeneity of the population, which is often observed in survey data. In an ordinary least squares (OLS) regression, heteroscedasticity will lead to inefficiency and incorrect standard errors. However, test procedures and robust estimators of the variance-covariance matrix which can address this issue in the framework of OLS are available.

Much more serious is the problem if limited dependent variable models are used for regression analysis (for example the Tobit model). In this case, heteroscedasticity can make the estimator inconsistent, and sometimes the only solution will be to rely on non-parametric methods (see DEATON 1997, pp. 78-92).
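For the OLS case, the difference between the classical and a heteroscedasticity-robust standard error can be illustrated with a short sketch. It uses the White (HC0) variance estimator as one common choice of robust estimator; the simulated farm data, the variance pattern and the function name `ols_with_standard_errors` are assumptions made for illustration and are not taken from the study.

```python
import math
import random


def ols_with_standard_errors(x, y):
    """Simple OLS of y on x; returns slope, classical SE and White (HC0) SE."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [xi - mean_x for xi in x]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (yi - mean_y) for d, yi in zip(dx, y)) / sxx
    intercept = mean_y - slope * mean_x
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    # Classical SE: assumes one common error variance for all observations
    sigma2 = sum(e * e for e in resid) / (n - 2)
    se_classical = math.sqrt(sigma2 / sxx)
    # White (HC0) SE: each observation enters with its own squared residual
    se_robust = math.sqrt(sum((d * e) ** 2 for d, e in zip(dx, resid))) / sxx
    return slope, se_classical, se_robust


# Hypothetical farms: the spread of output grows with farm size.
rng = random.Random(3)
farm_size = [rng.uniform(1, 10) for _ in range(5000)]
output = [2.0 + 3.0 * s + rng.gauss(0, 0.2 * s ** 2) for s in farm_size]

slope, se_classical, se_robust = ols_with_standard_errors(farm_size, output)
print(f"slope = {slope:.3f}, classical SE = {se_classical:.4f}, robust SE = {se_robust:.4f}")
```

In this setup the classical formula understates the uncertainty of the slope, while the point estimate itself is unaffected. This robust correction applies to OLS only and does not carry over to the Tobit case discussed above.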

Sample design

The way the data was collected has a distinct influence on the results of a regression analysis. This influence may be due to a specific weighting procedure employed in the survey, or to clustering or stratification during the data collection. To illustrate, consider the calculation of means from a sample of observations that had varying but known probabilities of being selected. It is quite intuitive that the computation must be corrected by using the inverses of the known probabilities as weights if the mean of the underlying population is to be calculated. It might be suggested that this directly extends to the use of weights in regression analysis. However, there exists no straightforward procedure to incorporate weights into the regression since weighted least squares (WLS) is not guaranteed to be consistent (for a discussion see DEATON 1997, pp. 68-73). The standard approach (also followed here) is therefore to ignore the weights, which implies the assumption that behaviour across observations is homogeneous.
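The correction of the mean mentioned above can be made concrete with a small worked example. The population, the selection probabilities and the function name `weighted_mean` are hypothetical and chosen only so that the arithmetic is easy to follow; the example covers the calculation of means, not the regression case.

```python
def weighted_mean(values, selection_probs):
    """Estimate a population mean, weighting by inverse selection probability."""
    weights = [1.0 / p for p in selection_probs]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical population: 900 small farms (output 10) and 100 large farms
# (output 50), so the true population mean is (900 * 10 + 100 * 50) / 1000 = 14.
# Large farms are oversampled: selection probability 0.10 versus 0.01.
sample_values = [10.0] * 9 + [50.0] * 10
sample_probs = [0.01] * 9 + [0.10] * 10

unweighted = sum(sample_values) / len(sample_values)
weighted = weighted_mean(sample_values, sample_probs)
print(f"unweighted mean = {unweighted:.2f}, weighted mean = {weighted:.2f}")
```

The unweighted sample mean (about 31) is pulled towards the oversampled large farms, whereas the weighted mean recovers the population value of 14.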

Clustering leads to an underestimation of standard errors in regressions because clusters tend to be more homogeneous than the overall population. However, clustering was not used in the data collection of the present study, which employed a stratification approach. It might be reasonable to pay attention to structural differences between strata and use a stratawise regression procedure (DEATON 1997, pp. 67-68). However, effects of stratification might also safely be ignored since stratification generally improves the precision of estimates (pp. 12-15; 49-51; 71). In the following analyses, the effect of sample design is thus not taken into account, which means that the sample is treated as if it were a simple random sample.

Data availability

This last section of the subchapter makes the point that (even) survey data can rarely fulfil all data wishes of the researcher, and sooner or later any empirical research reaches its limits due to a lack of data. The reasons for unavailable data may either be that, due to financial or time constraints, it simply was not collected, or even that it cannot be collected in principle, for example because respondents were not able or willing to recall the necessary information. A particular problem in cross-sectional data sets is that the variation of certain variables will be quite low (for example price data), and that lagged variables (which may often be useful as instruments) by definition are not collected.

Another issue is missing data for a subgroup of respondents, a phenomenon commonly encountered in survey data. As soon as data are missing systematically due to some (possibly unobserved) characteristics of respondents, the missing data will introduce a bias in any estimates. Although the non-response rates in the Poland farm survey 2000 were rather low (PETRICK 2001, p. 14), a certain loss of observations in the regressions was unavoidable (see below). The general lesson is hence that “data sets do not have to be perfect – they never are – and in fact much insight has been gained from data that are far from perfect in quality” (SINGH et al. 1986b, p. 66).

In document Credit rationing of Polish farm households: a theoretical and empirical analysis (Page 157-160)