Test of hypothesis - Statistical techniques

Figure 2.1: Proposed framework for the analysis of infant and child mortality

2.3 Statistical techniques

2.3.2 Test of hypothesis

In this study, univariate analysis will be carried out as a first step in applying the logit model. Its purpose is to identify the association of each independent variable of interest with the dichotomised dependent variable of this study. Following this, a multivariate technique, the logit model, will be used to estimate the parameters on the right-hand side of the additive equations discussed in the preceding section. To test whether the hypothesis described by the model fits the data, the expected frequencies $F_{ijk}$ under the hypothesis will be estimated. These estimates will then be compared with the corresponding observed frequencies $f_{ijk}$ by calculating the likelihood ratio statistic ($L^2$):

$$L^2 = 2 \sum_{i,j,k} f_{ijk} \ln\left(\frac{f_{ijk}}{F_{ijk}}\right) \tag{8}$$

where $i = 1, 2, \ldots, I$; $j = 1, 2, \ldots, J$; $k = 1$ if the child is dead and $k = 2$ if the child is alive; $f_{ijk}$ = the observed frequency in the $(i, j, k)$-th cell of the contingency table; $\ln$ = the natural logarithm; and $F_{ijk}$ = the expected frequency corresponding to $f_{ijk}$.
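As a minimal sketch of how the statistic is computed, the snippet below evaluates the likelihood ratio statistic in its usual form, $L^2 = 2 \sum f \ln(f / F)$, alongside the Pearson statistic for comparison. The table is hypothetical (a two-level factor cross-classified by dead/alive), the expected frequencies are those of simple independence rather than of the logit models fitted in this study, and the function names are illustrative only.

```python
import math

def likelihood_ratio_statistic(observed, expected):
    """L^2 = 2 * sum(f * ln(f / F)); cells with f = 0 contribute 0 by convention."""
    total = 0.0
    for f, F in zip(observed, expected):
        if f > 0:
            total += f * math.log(f / F)
    return 2.0 * total

def pearson_statistic(observed, expected):
    """Pearson chi-square = sum((f - F)^2 / F) over cells with F > 0."""
    return sum((f - F) ** 2 / F for f, F in zip(observed, expected) if F > 0)

# Hypothetical 2 x 2 table, flattened row by row: (dead, alive) for two groups.
observed = [30.0, 70.0, 20.0, 80.0]
# Expected frequencies under independence: row totals 100 and 100,
# column totals 50 and 150, N = 200, so F = row total * column total / N.
expected = [25.0, 75.0, 25.0, 75.0]

l2 = likelihood_ratio_statistic(observed, expected)
x2 = pearson_statistic(observed, expected)
print(f"L^2 = {l2:.4f}, Pearson chi-square = {x2:.4f}")
```

With moderately large cell counts the two statistics are close, which is consistent with their shared large-sample behaviour.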

The likelihood ratio statistic (referred to hereafter as $L^2$ or LR$\chi^2$) has large-sample properties similar to those of chi-square. Thus, $L^2$ is approximately distributed as a chi-square random variable, and the approximation becomes increasingly accurate as $N$ increases (Haberman, 1978: 5). One of the properties of maximum likelihood estimates is that they are asymptotically normally distributed. This leads to the asymptotic chi-square distribution of the test statistics that are used to test the goodness of fit of a model to a set of observed counts (Fienberg, 1977: 129).
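The chi-square approximation is what turns a computed statistic into a significance level. The sketch below is a pure-Python illustration, not the procedure used in the study: it evaluates the upper-tail chi-square probability for integer degrees of freedom using the standard recurrence for the regularised upper incomplete gamma function. The function name `chi2_upper_tail` and the example value are assumptions made for illustration; in practice a library routine such as `scipy.stats.chi2.sf` would normally be used.

```python
import math

def chi2_upper_tail(x, df):
    """P(X >= x) for X ~ chi-square with a positive integer df.

    Uses Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1), with
    z = x / 2, starting from Q(1, z) = exp(-z) for even df or
    Q(1/2, z) = erfc(sqrt(z)) for odd df.
    """
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    target = df / 2.0
    while a < target - 1e-12:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

# Hypothetical example: a statistic of 2.68 on 1 degree of freedom.
p_value = chi2_upper_tail(2.68, 1)
print(f"p = {p_value:.4f}")
```

A p-value above the chosen significance level means the model's expected frequencies do not differ significantly from the observed ones, that is, the model fits.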

Although the goodness of fit of logit models can be assessed using either the usual Pearson chi-square statistic ($\chi^2$) or the corresponding chi-square based on the likelihood ratio statistic (Goodman, 1978: 15), in this study $L^2$ is selected to test the fit of the model. This is because Knoke and Burke (1980: 30) noted that $L^2$ is preferable to $\chi^2$ on the grounds that the expected frequencies are estimated by the maximum likelihood method and that $L^2$ can be partitioned uniquely for more powerful tests of conditional independence in a multiway table. Additionally, although the significance of interaction factors or the relative fit of two models can be tested using the F-statistic (Majumder, 1989: 53), the chi-square statistic is suggested to be more powerful than the F-test, particularly when the degrees of freedom are small (Little, 1978: 36).
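The partitioning property can be illustrated with one simple, exact decomposition. It is offered as a sketch and is not necessarily the partition Knoke and Burke have in mind: in a three-way table, the likelihood ratio statistic for mutual independence of A, B and C equals the statistic for the model in which C is independent of the joint AB classification plus the statistic for independence of A and B in the marginal AB table. The counts below are hypothetical.

```python
import math

def g2_term(f, F):
    """Contribution of one cell to the likelihood ratio statistic (0 if f = 0)."""
    return 0.0 if f == 0 else 2.0 * f * math.log(f / F)

# Hypothetical 2 x 2 x 2 table: counts[i][j][k] for factors A (i), B (j), C (k).
counts = [[[10, 40], [15, 35]],
          [[20, 60], [5, 65]]]
I, J, K = 2, 2, 2
cells = [(i, j, k) for i in range(I) for j in range(J) for k in range(K)]

# Marginal totals needed for the closed-form expected frequencies.
n = sum(counts[i][j][k] for i, j, k in cells)
n_i = [sum(counts[i][j][k] for j in range(J) for k in range(K)) for i in range(I)]
n_j = [sum(counts[i][j][k] for i in range(I) for k in range(K)) for j in range(J)]
n_k = [sum(counts[i][j][k] for i in range(I) for j in range(J)) for k in range(K)]
n_ij = [[sum(counts[i][j]) for j in range(J)] for i in range(I)]

# Mutual independence of A, B and C: F = n_i * n_j * n_k / n^2.
l2_mutual = sum(g2_term(counts[i][j][k], n_i[i] * n_j[j] * n_k[k] / n ** 2)
                for i, j, k in cells)
# C independent of the joint AB classification: F = n_ij * n_k / n.
l2_ab_c = sum(g2_term(counts[i][j][k], n_ij[i][j] * n_k[k] / n)
              for i, j, k in cells)
# Independence of A and B in the marginal AB table: F = n_i * n_j / n.
l2_ab_marginal = sum(g2_term(n_ij[i][j], n_i[i] * n_j[j] / n)
                     for i in range(I) for j in range(J))

print(l2_mutual, l2_ab_c + l2_ab_marginal)  # the two values agree
```

The Pearson statistic does not decompose exactly in this way, which is one reason the likelihood ratio statistic is convenient for comparing nested models.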

One of the problems in analysing multidimensional contingency tables is the appearance of cells with zero entries. Observed zero frequencies arise in two circumstances and are known as sampling zeros and fixed (also called logical or structural) zeros. Sampling zeros occur because some categories have small probabilities when several variables are cross-tabulated. Such a zero entry does not mean that such cases do not exist in the population. A fixed zero, by contrast, arises because the specified combination is logically unavailable in the population. This type of zero is also known as a true zero (Knoke and Burke, 1980: 33). In this study, for example, certain cells may be zero either because no one died during the specified period or because no one was exposed to the risk of death. In the case of a structural zero, the zero frequency found in a cell is a true zero. One of the most powerful properties of the log-linear model and its method of estimation, including the logit model used in this study, is that cells with zero entries due to sampling variation can have a non-zero expected value (Fienberg, 1977: 108). However, Fienberg further noted that zeros appearing in the marginal totals should be handled in a special way. In this context he suggested:

In order to test the goodness-of-fit of a model that uses an observed set of marginal totals with at least one zero entry, we must reduce the degrees of freedom associated with the test statistic. The reason for this is quite simple. If an observed marginal entry is zero, both the expected and the observed entries for all cells included in that total must be zero, and so the fit of the model for those cells is known to be perfect once it is observed that the marginal entry is zero. As a result, we must delete these degrees of freedom associated with the fit of the zero cell values (Fienberg, 1977: 109).
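The distinction between a zero cell and a zero marginal total can be seen in a small sketch. It uses hypothetical 2 x 2 tables and the closed-form expected frequencies of the independence model (row total times column total divided by N), not the logit models of this study. A sampling zero inside the table receives a positive expected value, whereas a zero marginal total forces every expected value in that margin to zero.

```python
def independence_expected(table):
    """Expected frequencies under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Hypothetical table with one zero cell but no zero marginal total.
sampling_zero_table = [[0, 5],
                       [10, 20]]
expected_cell_zero = independence_expected(sampling_zero_table)
print(expected_cell_zero[0][0])  # positive: 5 * 10 / 35

# Hypothetical table whose first row total is zero.
zero_margin_table = [[0, 0],
                     [10, 20]]
expected_margin_zero = independence_expected(zero_margin_table)
print(expected_margin_zero[0])  # both expected values in the first row are zero
```

In the second table the fit for the first row is perfect by construction, which is why the associated degrees of freedom have to be removed from the test.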

A general formula for computing the degrees of freedom when some of the fitted margins contain sampling zeros, suggested by Fienberg (1977: 110), is

$$\text{d.f.} = (T_e - Z_e) - (T_p - Z_p)$$

where $T_e$ = the number of cells in the table that are being fitted, $T_p$ = the number of parameters fitted by the model, $Z_e$ = the number of cells containing a zero estimated expected value and $Z_p$ = the number of parameters that cannot be estimated because of zero marginal totals.
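The adjustment can be written as a one-line function. The sketch assumes the formula takes the form usually attributed to Fienberg, d.f. = (Te - Ze) - (Tp - Zp); the function name and the counts in the example are hypothetical.

```python
def adjusted_degrees_of_freedom(cells_fitted, params_fitted,
                                zero_expected_cells, inestimable_params):
    """d.f. = (Te - Ze) - (Tp - Zp), adjusting for zero marginal totals."""
    return (cells_fitted - zero_expected_cells) - (params_fitted - inestimable_params)

# Hypothetical example: 24 cells, 10 fitted parameters, 4 cells with a zero
# expected value, and 2 parameters that cannot be estimated.
unadjusted = adjusted_degrees_of_freedom(24, 10, 0, 0)
adjusted = adjusted_degrees_of_freedom(24, 10, 4, 2)
print(unadjusted, adjusted)
```

When there are no zero marginal totals, Ze and Zp are both zero and the formula reduces to the usual cells-minus-parameters count.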

The main purpose of the statistical model is to define precisely the conditions under which the associated analysis is the best possible (Little, 1978: 19). Of the three types of model (the fully saturated model, the unsaturated model and the main effects model), the fully saturated model, which appears to fit best, is of very little interest. This is because it reproduces expected cell frequencies that are exactly the same as the observed cell frequencies (Trussell and Hammerslough, 1983: 5). Majumder (1989: 54) noted that adding two-factor or higher-order interactions may improve the explanatory power of the model. On these grounds he suggested that the selected model should ideally include all effects that possess significant explanatory power. It is also argued, however, that although models with a large number of parameters usually fit the data more closely than simpler models, a simple model is often preferred (Fienberg, 1977: 47). Similarly, Trussell and Hammerslough (1983: 10-11) appealed to the principle of parsimony and suggested selecting the simpler model over the more complicated one if the simpler model is understandable and leads to similar conclusions. They further argued that even statistically significant interaction effects are sometimes unimportant in practice. On these grounds they suggested that a main effects model may still be preferred as a trade-off between simplicity and goodness of fit.

For this study, the main task is to identify the optimal model for explaining the observed relationship between the independent variables and infant and child mortality in Nepal. Because interaction effects are reported to be very complicated to interpret, and are considered unimportant in comparison with the main effects, the analysis in this study is confined to the main effects model. However, the significance of certain two-factor interaction effects was also tested and is discussed.
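One common way to test whether an interaction term adds significantly to a main effects model is to compare the two nested models through the difference in their likelihood ratio statistics, which is referred to the chi-square distribution with degrees of freedom equal to the difference in the models' degrees of freedom. The sketch below assumes that approach. The fit statistics are hypothetical, the function name is illustrative, and the 5 per cent critical values are the standard tabulated ones for 1 to 4 degrees of freedom.

```python
# Standard 5 per cent chi-square critical values for 1 to 4 degrees of freedom.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

def compare_nested_models(l2_simple, df_simple, l2_complex, df_complex):
    """Return (keep_simple, delta_l2, delta_df) for two nested models.

    keep_simple is True when the extra terms of the complex model do not
    reduce L^2 significantly at the 5 per cent level.
    """
    delta_l2 = l2_simple - l2_complex
    delta_df = df_simple - df_complex
    if delta_df not in CRITICAL_05:
        raise ValueError("critical value not tabulated for this difference in df")
    return delta_l2 < CRITICAL_05[delta_df], delta_l2, delta_df

# Hypothetical fits: a main effects model and the same model plus one
# two-factor interaction that uses 2 additional parameters.
keep_simple, delta_l2, delta_df = compare_nested_models(12.4, 9, 10.1, 7)
print(keep_simple, round(delta_l2, 2), delta_df)
```

In this hypothetical case the interaction reduces the statistic by less than the critical value, so the main effects model would be retained on grounds of parsimony.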

In document Infant and child mortality in Nepal : socio-economic, demographic and cultural factors (Page 60-63)