1
1. Data and information
-Random sampling application
-Information base and data collection
2
References
Data sampling:Cochran, W.G. (1977) Sampling Techniques. John Wiley&Sons, Inc. New York and others
Lohr, S.L. (1999) Sampling: Design and Analysis. Brooks/Cole Publishing Company
1.Data and information
3
Random Sampling of Farming Systems
Total populations are far from being infinite.
The design of random sampling plans may depend more on ensuring equal probabilitiesfor survey units to be chosen than on decreasing variancesin strata. Lists about all units (farms, households) in a total population do rarely exist (in particular in developing countries)
Sampling has to cope with reality rather than with controlled conditions (= trial, laboratory). Certain violations of the rules for random sampling may be unavoidable.
The application of justified samplingfor gaining representative knowledge is usually affected by the low pre-knowledge on existing types of farming systems.
1.Data and information
4
Basics in data sampling on farming systems
Select a sample survey plan and identify the respective formulae for estimators before you start the survey.
• Calculate the required or possible size of your sample and the optimal distribution between the strata under consideration of your resources.
• Plan the steps of your survey as precise as possible in advance. This includes the methodological parts (intended analyses, questionnaire, data base structure, sampling plan) as well as the logistic issues (budget, time, staff, transport, required software, data input).
• Plan and conduct the survey as closely as possible to the requirements of the theory of your chosen type of sample survey, but stay pragmatic. Try to adapt the theory to reality and document unavoidable divergence (= limitations of inferences from the sample on the total population).
Low resources are no excuse for a bad planning of a survey! 1.Data and information
5
Sampling Plan
• The choice of the sampling plan (simple sampling, stratified sampling, more complex sampling schemes) depends on the specific situation. Criteria are:
• available pre-knowledge
• possibilities to assure controlled probabilities of selection
• possibilities to minimize variance between surveyed units (e.g. farming systems) • capacity limits (funding, time, transport, staff etc.)
• The chosen sampling plandecides on the required estimators(= formulae for calculating estimates from the surveyed units on the total population
1.Data and information
6
Estimators 1
Sample Total Population
information Estimator estimation
• An estimator is a function (aformula) that allows the estimation of the situation in the total population by using information from a sample.
1.Data and information
7
Estimators 2
• The correct estimators (=formulae) depends on the applied sampling plan
•parameters of location: • Estimators are required for:
-
"averages" (e.g. mean values)-
totals-
frequencies and proportions•parameters of spread (e.g. variance, standard deviation)
1.Data and information
8
Estimators 3: Example for correct estimators,
quantitative data
variance mean value
estimators estimate:
stratified sampling (STS)
simple random sampling (SRS)
Sampling Plan:
y N y
N vp z z vp =
∑
N s N s z z z vp∑
∑ = 2 222 1 ) ( 2 2 − =
∑
− n y ysi i i
whereby:
yi = Information from farm i
y si,i2 = mean and variance (SRS)
y sz, z2 = Mean and variance in stratum z
y svp,vp2= Mean and variance (STS)
n = Number of interviewed farmers Nz= Total number of farmers in
stratum z
Nvp= Total number of farmers in study region
n y yi=
∑
iy si,i2
y sz,z2
9
Sample size (1)
Criteria for sample size:1. Variance of the desired information within the total population 2. Acceptable non-systematic sampling error
3. Available resources for the survey (logistics, time, etc.)
Two basic approaches:
1. Calculating the sample size based on the accepted error 2. Determining the sample size by available resources and
calculating the obtained error in retrospect
1.Data and information
10
Sample size (2)
Calculating the required sample size:• The required sample size has to be calculated for every single criterion under research.
• Attention: Calculating the required sample size by the following formulae presupposes that the concerned criterion is normal distributed in the total population .
• The variance of concerned criteria in the total population has to be estimated in advance. This may happen by applying general characteristics of a normal distribution
Example for quantitative criteria: standard deviation = 1/6 of range [= maximal value - minimal value], variance = (range/6)²
1.Data and information
11
Formulae for sample size´:
Quantitative criteria
small ratio sample/ total population:
large ratio sample/ total population (n0> 5% of N):
whereby:
n, n0= sample size, t = quantile of t-distribution,
s² = variance according to chosen sampling plan, N = total population,
e = accepted error in units of the criterion (±)
2 2 2 1 0
e
s
t
n
=
−α1
0 0−
−
=
N
n
N
n
n
1.Data and information
12
Formulae for sample size:
Qualitative criteria (proportions,
frequencies)
small ratio sample/total population:
large ratio sample/total population (n0> 5% of N):
2 2 2 / 1 0
e
pq
t
n
=
−αwhereby:
n, n0= sample size, t = quantile of t-distribution, p = proportion of total population with criterion, q = proportion of total population without criterion, pq = variance according to chosen sampling plan, N = total population, e = accepted error in % (±)
N.B.:These formulae do not consider budget restrictions. Respective considerations (example: Neyman-Tschupow formula) deal with the minimization of error under a given budget in stratified sampling.
1
0 0−
−
=
N
n
N
n
n
1.Data and information
13 PC-Exercise: file C31sampDB.xls
· Copy file sampling1.xls from this subdirectory to your partition (usually H:)
1. Exercises with Spreadsheet "sample size" · explain calculation and interpret results
· Open file sampling1.xls (resp. H:\[your subdirectory]\sampling1.xls) · explain structure of excel-files (multiple worksheets)
· parametrize data in the example and allowed errors and discuss the resulting impacts on the sample size · Open Windows explorer
Introduction to PC in general and Excel (if required) Create a copy of the exercise file
Start Excel
· switch to partition "user of <name of PC-room>", usually partition U: · choose subdirectory: kurs\M3129
1.Data and information
14
Determination of obtained precision
small ratio sample/total population:
large ratio sample/ total population: whereby:
n = sample size,
t = quantile of t-distribution,
s = standard deviation according to chosen sampling plan, N = total population
Confidence limits of estimators (see 1.1b) express the obtained precision. The
calculation of confidence limit corresponds to the calculation of the sample size with the difference, that nis known and eis required.
n
s
t
−α±
11
1
− − ± −
N n N n
s t α
Quantitative criteria:
1.Data and information
15
Determination of obtained precision
Qualitative criteria (proportions, frequencies):large total populations:
small total populations:
whereby: n= sample size,
t = quantile of t-distribution,
p = proportion of total population with criterion, q = proportion of total population without criterion, pq = variance according to chosen sampling plan
1
1
− −
± −
N n N n
pq t α
n
pq
t
1−α/2±
1.Data and information
16
1. Transfer of data from questionnaires and data sheets to files
Data entry and editing
3. Check for data entry errors and correction
4. Identify data problems (extreme values, missing data)
2. Coding of alphanumeric information
5. Solve data problems and provide operational data base
17
Identification of outliers - exploratory data analysis
20 N =
REVPV 1400000
1200000 1000000 800000 600000 400000 200000 0 -200000
8
Median
Upper hinge
Lower hinge
outlier
1,5 x h-spread
}
h-spread
}
Rules for identification of outliers in box-and-whisker plots:
all values
above upper hinge + 1,5 h-spread and
below lower hinge - 1,5 h-spread
are “extreme” observations
18
Missing values
•
The
mechanisms, that lead to missing data
decide on the
possible solutions for further analyses of concerned data sets
• The use of imputation values requires that these mechanisms are
ignorable
, i.e. not linked to the information content
Imputation procedures:
Advantage
: use of standard methods for calculations possible
Disadvantage
: no consideration of added uncertainty
Method
: replacement of individual missing values by values
that are derived from complete sets of the sample
19
Generation of Imputation Values
use of mean values, medians or modes
variance-neutral imputation
regression values (if covariate data available)
hot deck imputation (random choice of values from
comparable cases in the same survey)
cold deck imputation (random choice of values from other
sources)
nearest-neighbour imputation (value from next record)
20
2. Classifications ( Cluster analyses )
21
References
Bi- and Multivariate Analysis:
Aldendorfer, M.S., Blackfield R.K (1984) Cluster Analysis, West Hilcrest Backhaus, K.; Erichson, B.; Plinke, W.; Weiber, R. (1994) Multivariate Analysemethoden. Springer Verlag, Berlin u.a. (German language)
Henze, A. (1994) Marktforschung. UTB 1792, Ulmer Verlag Stuttgart (German language)
Tukey, J.W. (1977) Exploratory Data Analysis. Addison-Wesley Publishing Company, Inc., Reading
SPSS online help – SPSS vers. 10.0 upward
2. Classification ( Cluster analyses )
22
The computer exercises follow the most common sequence of application in reality rather than the steps in learning econometrics
Steps in the Computer Application
• univariate classification– ordering and identifying the best class boundaries • multivariate classification– checking for too high relationships between selected
classification criteria and application of cluster procedures (cluster algorithm + distance measure)
• testing for differences– statistical tests for checking significant differences between classes (χ²-test, nonparametric tests) and between observations in time (z-test, t-test)
• models of linear dependencies– linear regression, multiple regression, probit- and logit models
2. Classification ( Cluster analyses )
23
Univariate classification
• order the selected (quantitative) classification criterion in ascending order
• determine differences (∆) from one value to the next within this order
PC-Exercise: file C32class.xls, sheet: univ, Software: Excel
• check for over proportionally large steps (graphically and/or numerically) -> preliminary class borders
• determine homogeneity (coefficient of variation within classes = standard deviation / mean value) and heterogeneity (distance between class means) of the preliminary classes
• check if moving border cases improves the measures of heterogeneity and homogeneity
2. Classification ( Cluster analyses )
24
Multivariate classification
• import "multi" in SPSS, check for linear correlations and exclude one of each two too highly correlated variables from the further process
• set up cluster procedure (selection of distance measure, cluster algorithm and standardization of classification variables)
PC-Exercise: file C32class.xls, sheet: multi (+ derivates), Software: Excel, SPSS
• The exploratory approach - interpret results (dendrogrammes) from different sets of procedures (seize and number of clusters, development of homogeneity within clusters
2. Classification ( Cluster analyses )
25
Testing for differences
• Checking significant differences between clusters of selected results
-nonparametric tests for quantitative variables, χ²-test for qualitative variables in SPSS (tables for statistical tests inbuilt)
• Interpretation of test results towards a description of the assumed clusters
PC-Exercise: file C32class.xls, sheet: multi (+ derivates), Software: Excel, SPSS
• Checking significant differences between the current sample and information from the past (z-test, t-test) in Excel (use of printed test-value table on the standard normal distribution)
1,4395 92,5%
1,2816 90%
1,9600 97,5%
1,6449 95%
test value probability
test value probability
26
3. Modeling and impact analyses
-Design of family models
-Application of family models
-Uncertainty and risk
-Gap analysis and interpretation
27
References
Modelling in General:
Dantzig, G. B. (1963) "Linear Programming and Extensions", Princeton University Press, Princeton, N.J
France, J.; Thornley, J.H.M. (1984) Mathematical Models in Agriculture. Butterworth, London
http://www.solver.com/tutorial.htm
MOTAD:
Hazell, P.B.R. (1971) A linear alternative to quadratic and semivariance
programming for farm planning under uncertainty. American Journal of Agricultural Economics, 53, pp.53-62
Doppler, W.; Salman, A. Z.and Al-Karablieh, E. K., Wolff, H.-P.: The impact of water price strategies on the allocation of irrigation water - the case of the Jordan Valley. Agricultural Water Management, 55 (2002), Elsevier Science Ltd., pp.171-182
3. Modeling and impact analyses
28
LP-Models in EXCEL
• A LP-Matrix can be set up in several ways in EXCEL. The one used within M3129 is just one alternative. The required elements are, however, always the same.
• SOLVER is provided as a standard add-in to EXCEL, which is why it is used in the module. Other software (e.g. XA or GAMS) is suited as well - in some regards even better – but requires the purchase and the learning of how this software works.
3. Modeling and impact analyses
29
Set-Up of the LP Matrix
• run basic model and discuss resulting reports on the (1)the optimal solution and (2)the sensitivity analysis
PC-Exercise: file C33mod1.xls, sheet: LP1, Software: Excel
· explain the use of ranges and the SUMPRODUCT command · transform planning matrix for the use with EXCEL solver · explain reports on solution, sensitivity and limits · explain EXCEL-Solver settings: cells, constraints etc. · set-up of a planning matrix
Exercises with Spreadsheet "LP (1)"
3. Modeling and impact analyses
30
Parameterization
• run model with changed parameters and discuss resulting reports on the (1)
the optimal solution and (2)the sensitivity analysis
PC-Exercise: file C33mod1.xls, sheet: LP2, Software: Excel
Exercises with Spreadsheet "LP (2)"
· explain and demonstrate the approach of parametrizing
· explain and apply integer constraints
· add additional activities and constraints
3. Modeling and impact analyses
31
A brief introduction to MOTAD models
• The MOTAD approach is a linear approximation of the (
µ
,
σ
-)- criterion
(which is refered to in literature also as E-V model, cf. also lecture chapter
6)
• MOTAD = Minimization Of Total Absolute Deviation (i.e. uses deviation
measure rather than variance to measure variability of return)
3. Modeling and impact analyses
• Advantage over quadratic programming (E-V-models): solution of the
model requires linear algorithm only.
PC-Exercise: file C33mod2.xls, Software: Excel, Solver
32
Required data for a MOTAD models
3. Modeling and impact analyses
• Available capacities, required capacities per realized unit of activities,
contribution of alternative activities to the objective function (= data
requirements of a E-model, i.e. a model based on an expected value/avtivity
only)
• The distribution (=variation) of the altenatives' contributions to the objective
function
• A "realistic" idea about the desired total expected value (e.g. total gross
margin). "Realistic" means = or < than the maximum return from an LP
model that is based on expected values.
33
Applied Method
3. Modeling and impact analyses
• Set up your basic LP model, but formulate the contribution of your activities
to the objective function as a constraint that is forced to yield the expected
total gross margin (respective cells in the objective function stay 0)
• Add constraints for the absolute deviation of the values in your time series
(=mean value of total time series – observed value in t
n). These constraints
must be larger than their RHS value of 0 in the optimal solution
• Add adjustment activities (columns) that allow for a stepwise (1 step = 1)
reduction of the absolute deviation and deliver 1 "unit of variation" to the
objective function.
34
Results
3. Modeling and impact analyses