CHAPTER 4. UNFOLDING THE BIAS IN FARM NUTRIENT MAN-
4.3 Data and Model
4.3.4 Multilevel regression model
The previous section briefly described the regression methods, the implications of the assumptions behind each, and why they are unfit for modeling the collected subjective yield expectations data. Comparing the pooled OLS regression model with the individual farmer- and field-specific OLS regressions reveals a significant difference in the underlying assumptions about the correlation structure in the data. The individual farmer- and field-specific OLS regressions do not acknowledge that observations can be pooled across units to exploit the correlation structure in the data. Pooling brings in information about the higher-level observed explanatory variables, which are constant within any given unit but vary across units. For example, farmers’ beliefs about nitrogen may differ between a farmer farming on leased land and a farmer farming on owned land. Although land ownership is the same across the nitrogen treatments for any given unit, pooling the data makes it possible to use the differences in land ownership across farmers. On the other hand, the pooled OLS regression ignores the correlation among the observations of the same unit by treating them as independent. Therefore, while the individual OLS regression models do not fully exploit the information contained in the data, the pooled OLS model acts as if the data contained more information than they do, because it treats every observation as independent. For example, if the field CSR is 80 across the 5 nitrogen treatments for a given field, pooled OLS assumes that there are 5 fields, each with a CSR of 80, when in fact there is a single field.
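The overcounting in the CSR example can be illustrated with the standard design-effect approximation from survey sampling, 1 + (m - 1)ρ, where m is the number of observations per unit and ρ is the intraclass correlation. This is a minimal sketch and not part of the original analysis: the function names and the numbers of fields are hypothetical, and the formula assumes equal cluster sizes.

```python
def design_effect(cluster_size, icc):
    """Variance inflation from clustering: 1 + (m - 1) * rho (equal cluster sizes assumed)."""
    return 1.0 + (cluster_size - 1) * icc


def effective_sample_size(n_units, cluster_size, icc):
    """Number of independent observations that carry the same information."""
    total_obs = n_units * cluster_size
    return total_obs / design_effect(cluster_size, icc)


if __name__ == "__main__":
    # Hypothetical setting: 100 fields, 5 nitrogen treatments per field.
    # A field-level variable such as CSR is identical within a field (rho = 1),
    # so the 500 rows carry the information of only 100 fields.
    print(effective_sample_size(100, 5, 1.0))
    # If observations within a field were truly independent (rho = 0),
    # the pooled OLS view of 500 independent rows would be correct.
    print(effective_sample_size(100, 5, 0.0))
```

For a variable that is constant within a unit, the intraclass correlation equals one and the effective sample size collapses to the number of units, which is the sense in which pooled OLS counts one field as five.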
A multilevel linear model (Raudenbush and Bryk, 2002; Snijders and Bosker, 2011) is used to model the subjective yields. A multilevel linear model consists of fixed and random components at different levels, based on the nesting structure of groups in the data. The regression coefficient of an explanatory variable of interest is modeled as a random variable that follows a pre-specified distribution. The mean of the distribution of the random regression coefficient is the fixed component of the regression model. The fixed component in the multilevel model should not be confused with the fixed effects used in the pooled OLS regression. While the fixed effects in the pooled OLS regression capture the unit-specific intercept and slope, the fixed component in the multilevel model is the deterministic part of the regression coefficient of an explanatory variable. The deterministic parts of the regression coefficients (including those of explanatory variables that do not have a random coefficient) together form the fixed component of the model, which is common to all units in the data. The heterogeneity in the data and the appropriate correlation structure are accounted for through the introduction of random coefficients. The random coefficients significantly reduce the number of parameters to be estimated and at the same time account for the unobserved unit-specific effects, which in a way also controls for clustering of the data at various levels. However, the multilevel model imposes structure on the random effects by requiring that they be drawn from a parametric distribution (usually a normal distribution), which is not an unreasonable assumption when theory says little about the distribution in the sample. Moreover, in practice it is not possible to include as many random effects as desired. Therefore, the effect of any variable specified without a random component is assumed to be the same across all units. This is again not an unreasonable assumption if the choice of variables that receive random effects is guided by theory or by the research question.
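The claim about the reduced number of parameters can be made concrete with a simple count. The sketch below is illustrative only: it assumes unit-specific OLS estimates a separate set of coefficients for every unit, and that the multilevel model uses an unstructured covariance matrix for the random effects plus a single residual variance. The function names and the number of units are hypothetical.

```python
def n_params_unit_specific(n_units, n_coefs):
    """Regression coefficients when every unit gets its own OLS fit."""
    return n_units * n_coefs


def n_params_multilevel(n_fixed, n_random):
    """Fixed coefficients, plus the unique elements of the random-effects
    covariance matrix, plus one residual variance."""
    n_cov = n_random * (n_random + 1) // 2
    return n_fixed + n_cov + 1


if __name__ == "__main__":
    # Hypothetical case: 100 units, each with an intercept and one slope.
    print(n_params_unit_specific(100, 2))
    # Same intercept and slope treated as random: 2 means, 2 variances,
    # 1 covariance, and 1 residual variance.
    print(n_params_multilevel(2, 2))
```

The multilevel count does not grow with the number of units, because the unit-specific deviations are summarized by the variances and covariances of their assumed distribution rather than estimated one by one.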
The use of a multilevel model also allows contextual effects to be identified in the regression model. In cognitive psychology, contextual effects are defined as the influence that the environment or surroundings of an individual (or a group-level variable) have on the effect of the individual-level independent variable on the dependent variable (Diez, 2002). For example, in the subjective yield model the perceived effect of nitrogen on yield is an integral part of the model, whereas a contextual effect may be defined as how that perception differs across farmers who are more educated or who have adopted a delayed planting date, since these form the environment (context) in which the farmer has reported his beliefs.
A multilevel model is formed by specifying the regression equation at the lowest level and then modeling the regression coefficients of that equation as functions of higher-level explanatory variables. A three-level hierarchical model is generally described by the following equations:
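$$\text{Level 3}: \quad Y_{ijk} = \gamma_{0jk} + \gamma_{1jk} \cdot X_{ijk} + \varepsilon_{ijk} \tag{4.6}$$

$$\text{Level 2}: \quad \gamma_{0jk} = \phi_{00k} + \phi_{01k} \cdot G_{jk} + \tau_{0jk}; \qquad \gamma_{1jk} = \phi_{10k} + \phi_{11k} \cdot G_{jk} + \tau_{1jk} \tag{4.7}$$

$$\text{Level 1}: \quad \phi_{00k} = \theta_{000} + \theta_{001} \cdot D_{k} + \eta_{00k}; \qquad \phi_{01k} = \theta_{010} + \theta_{011} \cdot D_{k} \tag{4.8}$$

$$\phantom{\text{Level 1}: \quad} \phi_{10k} = \theta_{100} + \theta_{101} \cdot D_{k} + \eta_{10k}; \qquad \phi_{11k} = \theta_{110} + \theta_{111} \cdot D_{k} \tag{4.9}$$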
The above equations represent a general form of the three-level hierarchical linear model, where $X_{ijk}$ is the vector of level 3 explanatory variables, $Y_{ijk}$ is the dependent variable, and $\varepsilon_{ijk}$ is the error term at level 3. It can be seen that the intercept and the coefficient of $X_{ijk}$ are functions of the level 2 set of variables, denoted by $G_{jk}$, and of the level 2 residuals (random components) $\tau_{0jk}$ and $\tau_{1jk}$. Similarly, the coefficients of the level 2 explanatory variables are functions of the level 1 variables and the level 1 residuals. Substituting the level 2 and level 1 equations into equation 4.6, the full multilevel model can be written as
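$$\begin{aligned}
Y_{ijk} ={}& \beta_{0} + \beta_{1} \cdot D_{k} + \beta_{2} \cdot G_{jk} + \beta_{3} \cdot X_{ijk} + \beta_{4} \cdot (G_{jk} \times D_{k}) + \beta_{5} \cdot (X_{ijk} \times D_{k}) \\
&+ \beta_{6} \cdot (X_{ijk} \times G_{jk}) + \beta_{7} \cdot (X_{ijk} \times G_{jk} \times D_{k}) + \tau_{0jk} + (\tau_{1jk} \times X_{ijk}) \\
&+ \eta_{00k} + (\eta_{10k} \times X_{ijk}) + \varepsilon_{ijk}
\end{aligned} \tag{4.10}$$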
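where the symbol ‘×’ denotes an interaction between two variables. $\tau_{0jk}$ and $\tau_{1jk}$ are the level 2 random intercept and random slope terms, respectively, whereas $\eta_{00k}$ and $\eta_{10k}$ are the level 1 random intercept and random slope. It is assumed that $\varepsilon_{ijk} \sim N(0, \sigma^{2})$, $(\tau_{0jk}, \tau_{1jk}) \sim N(0, \Gamma_{\tau})$ and $(\eta_{00k}, \eta_{10k}) \sim N(0, \Gamma_{\eta})$, where $\Gamma_{\tau}$ and $\Gamma_{\eta}$ are the symmetric variance-covariance matrices of the random components:

$$\Gamma_{\tau} = \begin{pmatrix} \sigma_{\tau 0}^{2} & \upsilon_{12} \\ \upsilon_{21} & \sigma_{\tau 1}^{2} \end{pmatrix}, \qquad \Gamma_{\eta} = \begin{pmatrix} \sigma_{\eta 0}^{2} & \rho_{12} \\ \rho_{21} & \sigma_{\eta 1}^{2} \end{pmatrix}$$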
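The substitution that leads from equations 4.6 to 4.9 to the combined form in equation 4.10 implies the mapping β0 = θ000, β1 = θ001, β2 = θ010, β3 = θ100, β4 = θ011, β5 = θ101, β6 = θ110 and β7 = θ111. The following sketch checks this numerically by generating each observation twice, once through the nested equations and once through the combined equation. It is an illustration under stated assumptions, not the estimation procedure used in this chapter: all θ values, variances, covariate ranges and group sizes are hypothetical, and for simplicity the random effects are drawn independently (diagonal covariance matrices).

```python
import random

# Hypothetical fixed-effect values; keys follow the theta subscripts in (4.8)-(4.9).
THETA = {"000": 150.0, "001": 5.0, "010": 0.4, "011": 0.05,
         "100": 20.0, "101": -2.0, "110": 0.1, "111": 0.02}


def simulate(theta, n_k=4, n_j=3, n_i=5, seed=1):
    """Return (y_nested, y_combined) pairs for every observation ijk."""
    rng = random.Random(seed)
    pairs = []
    for _k in range(n_k):
        d = rng.choice([0.0, 1.0])        # D_k, illustrative binary covariate
        eta00 = rng.gauss(0.0, 3.0)       # level 1 random intercept
        eta10 = rng.gauss(0.0, 1.0)       # level 1 random slope
        # Equations (4.8) and (4.9)
        phi00 = theta["000"] + theta["001"] * d + eta00
        phi01 = theta["010"] + theta["011"] * d
        phi10 = theta["100"] + theta["101"] * d + eta10
        phi11 = theta["110"] + theta["111"] * d
        for _j in range(n_j):
            g = rng.uniform(60.0, 90.0)   # G_jk, illustrative covariate
            tau0 = rng.gauss(0.0, 2.0)    # level 2 random intercept
            tau1 = rng.gauss(0.0, 0.5)    # level 2 random slope
            # Equation (4.7)
            gamma0 = phi00 + phi01 * g + tau0
            gamma1 = phi10 + phi11 * g + tau1
            for _i in range(n_i):
                x = rng.uniform(0.0, 2.0)  # X_ijk, illustrative covariate
                eps = rng.gauss(0.0, 5.0)  # level 3 error term
                # Equation (4.6)
                y_nested = gamma0 + gamma1 * x + eps
                # Equation (4.10) with beta0..beta7 replaced by the thetas
                y_combined = (theta["000"] + theta["001"] * d
                              + theta["010"] * g + theta["100"] * x
                              + theta["011"] * g * d + theta["101"] * x * d
                              + theta["110"] * x * g + theta["111"] * x * g * d
                              + tau0 + tau1 * x + eta00 + eta10 * x + eps)
                pairs.append((y_nested, y_combined))
    return pairs


def max_gap(pairs):
    """Largest absolute difference between the two formulations."""
    return max(abs(a - b) for a, b in pairs)


if __name__ == "__main__":
    result = simulate(THETA)
    print(len(result), max_gap(result))
```

The two formulations agree up to floating-point rounding, which confirms that the interaction terms in equation 4.10 arise solely from letting lower-level coefficients depend on higher-level variables.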