Data Management and Preliminary Analysis

Sanitation Sector

D. Bid Distribution

4.6 Data Management and Preliminary Analysis

While the survey is progressing in the field, the analyst should also begin developing a coding sheet and template for data entry. Once good quality data is generated through the survey, the analyst should make sure errors will not appear at the data entry and management stages. The management and analysis of data from the survey often proceeds in three sets of sub-tasks corresponding to: (i) data entry and processing, (ii) calculation of descriptive statistics, and (iii) cross-tabulation of summary statistics.

4.6.1 Data Entry and Processing

The data recorded on the questionnaires should be transferred by the analyst into the selected data management software (e.g., Microsoft Excel, Access, Fox, or CSPro developed by the United States Census Bureau, Macro International, and Serpro, S.A.) using codes developed during the survey design. Three quality assurance and quality control procedures can be employed: range check, intra-record check, and final consistency check (Munoz 2003). Range and intra-record checks should be undertaken during data entry. By building a proper data entry template, the operator is only allowed to proceed to the next question if the data for the current question falls within the allowable range of responses for each question. An intra-record consistency check can be administered immediately after entry of each questionnaire. For example, family size reported by the household head should equal the number of family members listed in the family roster.

A final scan for overall consistency should be conducted when all questionnaires have been entered. This final consistency check will ensure that values from one question are consistent with values from another question. In addition, it is helpful to conduct spot checks on the data entry operation, to double-enter 10% of the full survey, and to double-enter 100% of critical modules such as the WTP elicitation response, in order to ensure that there are no discrepancies between the hard copies and the electronic data set.25

4.6.2 Descriptive Statistics

The household and community surveys combined with supplementary administrative data can add up to a large data set. While some variables are of independent interest, others must be combined to produce policy relevant statistics. In general, such data should be described at two levels. The analyst should compute descriptive statistics (e.g., mean, median, standard deviations, and range) to understand and describe all of the variables in the data set. Examining the descriptive

25 _{“Double-enter” means that a different person would enter the data from the selected questionnaires}

(or modules) onto a separate Excel sheet. The two Excel sheets (original and double-entered) would be compared electronically to determine whether the original sheet accurately records the responses from the questionnaires.

statistics will serve as an additional quality assurance and quality control measure because the analyst will be able to identify anomalies, outliers, and improbable values. If outliers are found, the analyst should check against the questionnaires by matching respondent identification numbers. If necessary, the enumerator who administered that particular questionnaire should be sent again to clarify and to confirm with the respondent.

It can be useful to identify subpopulations of policy interest for further estimation of descriptive statistics. Typically, these are identified by: (i) WSS user type (e.g., private water connection or public tap); (ii) socioeconomic group (households in different consumption quintiles); and (iii) physical subregions of the study area (e.g., administrative units). In most cases, the analyst may be particularly interested in the subpopulation of households living in poverty, for which the definition of poverty is critical. Data on household consumption expenditure can be validated against income, wealth and assets, caloric intakes, demographics, and housing quality to generate a statistically robust and economically meaningful poverty indicator. The set of variables one has to use in characterizing households will also vary with the study site and specific objectives of a study.

4.6.3 Cross-tabulation of Summary Statistics

Cross-tabulations by socioeconomic, geographic, and current use status should be calculated for all important variables including demand for WSS improvements, water quality perceptions, and consumption. This process will provide descriptive statistics on subpopulations of interest such as the poor, the unconnected, and the marginalized zones of the study area. The descriptive statistics will provide a profile of the socioeconomic characteristics of the typical households as well as a typical household belonging to a subgroup, such as poor unconnected households. These cross-tabulations offer preliminary evidence on underlying relationships between socioeconomic conditions and types of service needs (Deaton 1997).

The story presented by the descriptive statistics should tally with the initial informal characterization of the existing WSS service. Percentage distributions of income and water sources should show similar patterns

of distribution with the study population. Cross-tabulations of key selected variables should generally indicate (positive or negative) correlations based on expectations (e.g., the higher the income level, the higher the number of assets owned). A serious disconnect between the initial understanding and the description provided by the data indicates some errors, perhaps during the data entry process. If cross-tabulation of WTP values with key variables (expected to be correlated with WTP values) is generally consistent with hypothesized patterns, and tallies to a reasonable extent with initial characterization, the analyst can move to the next step: estimation of WTP and policy simulations.

In document COST-BENEFIT ANALYSIS FOR DEVELOPMENT. A Practical Guide (Page 122-125)