Optimal Allocation in Stratified Randomized Response Model

(1)

Sat Gupta

Department of Mathematical Sciences, University of North Carolina at Greensboro, 383 Bryan Building Greensboro, NC 27402, USA

Abstract

A Warner (1965) randomized response model based on stratification is used to determine the allocation of samples. Both linear and log-linear cost functions are discussed under uni and double stratification. It observed that by using a log-linear cost function, one can get better allocations.

Key Words: Randomized response, linear and log-linear functions, cost function, stratification.

Corresponding address: Department of Statistics, Quaid-i-Azam University, Islamabad 45320, Pakistan

Introduction

Various randomized response techniques have developed to obtain truthful answers since the pioneering work by Warne (1965). The significant contribution is by Greenberg et al. (1969), Moors (1971), Mangat and Singh (1990), Mangat (1994) and Singh et al. (2000). Hong et al. (1994) used the proportional allocation for obtaining the response more accurately. Recently Kim and Warde (2003) discussed the Warner’s model and used the optimum allocation in stratification. It is no doubt that optimal allocation gives better results as compared to proportional allocation, as also discussed by Cochran (1977). Kim and Warde (2003) mainly focused of getting the truthful response. They did not obtain the allocation of samples and cost function. Khare (1987) discussed the allocation of samples in the presence of non-response. In this paper our emphasize is to look the allocation of samples in relation to cost function in randomized response model. We used both linear and loglinear cost functions and also study is extended to double stratification.

Warner’s Stratified Model

Let y_h and x_h (h=1, 2,…..,L) be the response and auxiliary variables of stratum h, having strata, L. A population of size

∑

= = L

h h N N

1

(2)

various strata according to some particular characteristics that may be age, height or martial status etc. or some other suitable similar characteristics. A

sample of size

∑

= = L

h h n n

1

is obtained from each stratum to measure the

response y. Each selected respondent from each stratum is instructed to use the randomized device before giving the response in the form of yes or no. We assumed that the number of observations in each stratum is known. The Warner (1965) randomized device consists of sensitive and its negative statements with probability P_h and (1−P_h) respectively. The probability of yes answer under the assumption that response will be true is given by

) 1 )( 1

( _h _h

h h

h P P

G = _π + − −_π , h=1, 2,…..,L ,

where G_h is the proportion of yes answer in stratum h, P_h is the probability that the respondent in the sample of stratum h has a sensitive question card and _π_h is the proportion of respondent with yes answer in stratum h. The

maximum likelihood estimate of _π_h is given by

1 2

) 1 ( ˆ ˆ

− − − =

h h h

h

P P G

π , Ph ≠0.5

where Gˆ_h is the proportion of yes answer in the sample of stratum h. SinceGˆ_h

follows the binomial distribution B(n_h, G_h) and selection is made independently in different strata. Therefore maximum likelihood estimate of _π

is

∑

= = L

h

h h W

1

ˆ

ˆ _π

π , where W_h = N_h/N is the stratum weight. The πˆ is an

unbiased estimate of _π i.e. π =

∑

π =

∑

π =

∑

π =π =

= =

L

h

h h L

h

h h L

h

h W E W

W E E

1 1

1

) ˆ ( ]

ˆ [

) ˆ

( and

its variance is given by

] ˆ [

) ˆ (

1

∑

=

= L

h

h h W Var

Var _π _π

∑

= = L

h

h hVar W

1

2 ₍_π_ˆ ₎_,

∑

= −

− + − = L

h h

h h h h h h

P P P n

W Var

1 2

2

] ) 1 2 (

) 1 ( ) 1 ( [ )

ˆ

(π π π (2.1)

Cost Functions

We discuss the linear and loglinear cost functions in stratified randomized response

Linear Cost Function Define:

∑

= +

= L

h h hn C c

C

1

(3)

where C is the total cost, c₀ is the fixed cost, C_h is the cost per unit in stratum h and δ is a constant. We select a sample of size n_h to minimize the

V

V(πˆ)= (say) for a specified cost or to minimize the cost for a specified variance.

For obtaining n_h, we define a function _ψ by using the equations (2.1) and (3.1) as

∑

= = − − + − − + − = L h h L h h h h h h h h

h _C _n _C _c

P P P n W 1 0 1 2 2 )] ( [ ] ) 1 2 ( ) 1 ( ) 1 ( [π π λ δ

ψ (3.2)

where λ is the lagrangian multiplier

Now solving (3.2) for.n_h, we get

) 1 /( 1

2

2 (2 1)

) 1 ( ) 1 ( +                         − − + − = δ δλ π π h h h h h h h P P P W

n . (3.3)

Taking summation of (3.3) and

∑

= = L h h n n 1

, we get

) 1 /( 1 1 2 2 ) 1 /( 1 2 2 ] / } ) 1 2 ( ) 1 ( ) 1 ( { [ ] / } ) 1 2 ( ) 1 ( ) 1 ( { [ + = +

∑

− + −₋ − − + − = δ δ π π π π L h h h h h h h h h h h h h h h h C P P P W C P P P W n n

. (3.4)

Case 1:

If cost is fixed then by (3.2) and (3.4), we have

δ δ δ δ δ δ π π π π / 1 ) 1 /( 1 / 1 2 2 ) 1 /( 1 2 1 2 / 1 0 ] ] ) }( ) 1 2 ( ) 1 ( ) 1 ( { [ [ ] / } ) 1 2 ( ) 1 ( ) 1 ( { [ ) ( + = + =

∑

− − + − − − + − − = _L h h h h h h h h h h h h h h L h h C P P P W C P P P W c C

n . (3.5)

Case 2:

If variance is fixed then by using (2.1) and (3.4), we have

] ] } ) 1 2 ( ) 1 ( ) 1 ( { [

[ 1/ /( 1)

1 2 2 + =

∑

− + −₋ = _π _π δ δ δ h L h h h h h h h C P P P W n ×

¦

= + − − + − L h h h h h h h

h C V

P P P W 1 ) 1 /( 1 2 2 / ] ] / } ) 1 2 ( ) 1 ( ) 1 ( { [

[ π π δ . (3.6)

Using (3.6), we get optimum variance

] ] } ) 1 2 ( ) 1 ( ) 1 ( { [ [ ) ˆ

( 1/ /( 1)

(4)

× C n P P P W L h h h h h h h

h }/ ] ]/

) 1 2 ( ) 1 ( ) 1 ( { [ [ 1 ) 1 /( 1 2 2

¦

₌ + − − + −_π δ

π . (3.7)

Log-linear Cost Function Define:

∑

₌ + = L h h h n C c C 1

0 log δ >0, (3.8)

By (2.1) and (3.8), we have

)] ( log [ } ) 1 2 ( ) 1 ( ) 1 ( { 1 0

1 2 2

2

∑

= = − − + − − + − = L h h h L h h h h h h h

h _C _n _C _c

P P P n

W

V π π λ . (3.9)

Solving (3.9) for n_h, we get

∑

= − − + − − − + − = _L h h h h h h h h h h h h h h h h C P P P W C P P P W n n 1 2 2 2 2 / } ) 1 2 ( ) 1 ( ) 1 ( { / } ) 1 2 ( ) 1 ( ) 1 ( { π π π π

. (3.10)

Case 1:

If cost is fixed then by (3.8) and (3.10), we have

                    − − + − − − =

∑

= = L h h h h h h h h h L h h C C P P P W C c C n 1 2 2 1

0 }/

) 1 2 ( 1 ( ) 1 ( { log ) ( exp π π × _      − − −

∑

_hL₌ h h h h h h h C P P P W 1 2

2 _}_/

) 1 2 ( ) 1 ( ) 1 (

{_π _π (3.11)

Case 2:

If variance is fixed, then by (2.1) and (3.10), we have

V C P P P W C n L h h h h h h h L h h

h }/ ]/

) 1 2 ( ) 1 ( ) 1 ( { ][ [

1 1 2

2

∑ ∑

= = − − + −

= π π (3.12)

From (3.12), the optimum variance is

n C P P P W C

V L _h

h h h h h h L h h h

opt }/ ]/

) 1 2 ( ) 1 ( ) 1 ( { ][ [ ) ˆ (

1 1 2

2

∑ ∑

= = − − + − = π π

π . (3.13)

Numerical Illustration

The following data is used to determine the sample size and expected cost. We obtain the followings (i), (ii) and (iii), linear cost functions by substitution of

(5)

(i)

∑

= +

= L

h h hn C c

C

1 2

0 , (ii)

∑

= +

= L

h h hn C c

C

1

0 ,

(iii)

∑

= +

= L

h

h h n C c

C

1

0 and (iv)

∑

= +

= L

h

h n

C c

C

1

0 log( ).

Table1: Artificial data for two groups

Stratum _π_h P_h W_h C_h

1 2

0.4 0.6

0.6 0.7

0.3 0.7

4 9 1

2

0.2 0.4

0.6 0.7

0.3 0.7

9 4

Table 2: Fixed cost

95 )

(C−c₀ = units (C−c₀)=100 units

Cases _n

1

n n₂ n n₁ n₂

(i) (ii) (iii) (iv)

6 15 115 3396

3 9 67 2120

3 7 48 1276

6 17 142 6332

3 6 45 1549

3 11 96 4783

Table 3: Fixed variance 1 =

V V =0.175

Cases _n

1

n n₂ E. cost n n₁ n₂ E. cost (i)

(ii) (iii) (iv)

3 3 3 3

2 2 2 2

1 1 1 1

22 17 14 3

15 16 16 3

6 6 5 1

9 10 11 2

662 90 34 2

From Tables 2 and 3, it is observed that the log-linear cost function is better as compared to ordinary cost function. The expected cost is decreasing with increase of _π_h, P_h and W_h.

Double Stratification

To estimate the proportion of yes answers from respondent y, it is reasonable to stratify the population on the basis of auxiliary variable x, but when such information on x is lacking or stratum weights, W_hare unknown, then we use the double sampling technique, (see Cochran, 1977; Rao, 1973).

(6)

(ii) The collected samples are arranged into L strata on the basis of x. Let

h

n′ be the number of units falling in stratum h, i.e.

∑

=

′ =

′ L

h h

n n

1

.

(iii) A sub-sample of size n_h ≤n′_h(=ν_hn_h′), 0<νh ≤1, where νh is known for

each strata, is selected with replacement and the information on y_h in stratum h is collected from respondents using the randomized device to estimate the proportion _π. Also n_h′ /n′=w_h is an unbiased estimate of

h

h N W

N / = .

As _πˆ is an unbiased estimate of _π, therefore its variance ignoring the finite population correction (fpc) is given by

) 1 1 ]( ) 1 2 (

) 1 ( ) 1 ( [ ]

) 1 2 (

) 1 ( ) 1 ( [ 1 ) ˆ (

1 2

2

2 ₋ −

− + − ′

+ − − + − ′

=

∑

= h

L

h h

h h h h h

P P P n

W P

P P n

Var

ν π

π π

π

π ,

(5.1) where

∑

= = L

h h

P P

1

, and

∑

= = L

h 1 h

π

π .

Linear Cost Function Define:

∑

₌

+ ′ ′ +

= L

h h hn C n

c c C

1

0 ( )δ , (5.2)

where c′ be the cost of classification per unit, c₀ is the fixed cost and C_h be the cost of measuring unit in stratum h.

Since n_h is random variable therefore the expected cost is.

∑

= ′ + ′ ′ + =

= L

h h h h

W C n n c c C C E

1 0

* ₍ ₎

)

( δ ν _. _(5.3)

Here we choose n′ and _ν_h to minimize Var(πˆ)=V(say) for a specified expected cost or to minimize the expected cost for specified variance (see Cochran, 1977).

Now we define a function ψ by using a constant λ as By (5.1) and (5.3), we have

) 1 1 ]( ) 1 2 (

) 1 ( ) 1 ( [ ]

) 1 2 (

) 1 ( ) 1 ( [ 1

1 2

2

2 ₋ −

− + − ′

+ − − + − ′

=

∑

= h

L

h h

h h h h h

P P P n

W P

P P

n π π π π _ν

ψ

)] (

) (

[ * ₀

1

c C W C n n

c L

h h h h

− − ′

+ ′ ′

+

∑

= ν

λ δ _._(5.4)

] ) 1 2 (

) 1 ( ) 1 ( [ 1

1 2

2

∑

= −

− + − ′

+ ′

= L

h h

h h h h h h B

P P P n

W

n π _ν π π

(7)

)] (

[ * ₀

1 c C W C n n c L h h h

h − −

′ + ′ ′ +

∑

= ν

λ δ _{, (5.5)}

where ]

) 1 2 ( ) 1 ( ) 1 ( [ ] ) 1 2 ( ) 1 ( ) 1 ( [ 1 2 2 2

∑

= − − + − − − − + − = L h h h h h h h B P P P W P P P π π π π π .

Now solving (5.5) for n′ and _ν_h, we get

) 1 /( 1 2 ] [ + ′ = ′ δ δλ π c

n B _(5.6)

and ₂ ₁_/( ₁₎ ₁_/₂

) 1 /( 1 2 / 1 2 ) ( ) ( ] [ ] ) 1 2 ( ) 1 ( ) 1 ( [ h B h h h h h h C c P P P λ π δλ π π ν _δ δ + + ′ − − + −

= (5.7)

Case 1:

If cost is fixed, then substituting of n′ and _ν_h from (5.6) and (5.7) in (5.3), we get 0

) (

)

( * ₀

2 3 ) 1 /( 1 2

1A + +A A − C −c =

A δ δ , (5.8)

where ) 1 /( 1 2

1 =[( )δ ′] δ+

δ

π _c

A B _, ₁_/_λ

2 =

A and

2 / 1 2 1 3 ] ) 1 2 ( ) 1 ( ) 1 ( [ − − + − =

∑

= h h h h h h L h h P P P W C

A _π _π

Case 2:

If variance is fixed, then substituting the values of n′ and _ν_h from (5.6) and (5.7) in (5.1), we get

λ λ δ δ 3 ) 1 /( 1

1( ) A

A

V = + + _,_(5.9)

Log-linear Cost Function Define:

∑

₌ ′ + ′ ′ + = L h h h h W C n n c c C 1 0

* _log _ν _._(5.10)

By (5.1) and (5.12), we have

) 1 1 ]( ) 1 2 ( ) 1 ( ) 1 ( [ ] ) 1 2 ( ) 1 ( ) 1 ( [ 1 1 2 2

2 ₋ −

− + − ′ + − − + − ′ =

∑

= h L h h h h h h h P P P n W P P P

n π π π π _ν

ψ

)] (

log

[ * ₀

1 c C W C n n c L h h h

h − −

′ + ′ ′ +

∑

= ν

λ . (5.11)

Solving (5.11) for n′ and _ν_h, we have

c A

n′=_π_B2 ₂ / ′_and

(8)

Case 1:

If cost is fixed, then substituting n′ and _ν_h in (5.10), we get 0

) (

) log( )

/

log( * ₀

2 3 2

2 _′ ₊ _′ ₊ ₋ ₋ ₌

′ c c A A A C c

c _π_B (5.12)

Case 2:

If variance is fixed, then substituting n′ and _ν_h in (5.1), we get 0

) / ( ) /

(c′ A₂ + A₃ A₂ −V = . (5.13)

Conclusion

It is observed that by using a log-linear cost function, one can select more samples for a given fixed cost in Table 2 and in Table 3, the expected cost is much lower for a fixed variance. So generally it is preferable to use the log-linear cost function for selecting the samples in stratified randomized response survey.

Acknowledgement:

The first author wishes to thanks the facilities provided by the University of Southern Maine, USA during his post-doctoral research work in 2003.

References

1. Cochran W. G., 1977, Sampling Techniques, (3rd edt.), Wiley and Sons: New York.

2. Greenberg B. G; Abul-Ela; Abdel-Latif A; Simmons W. R., and Horvitz D. G.,

1969, The unrelated question RR model-theoretical frame-work, J. Amer. Statist. Assoc., 64, 520-539.

3. Hong, K; Yum, J and Lee, H., 1994, A stratified randomized response

technique, Korean J. App. Statist., 7, 141-147.

4. Khare B. B., 1987, Allocation in stratified random sampling in presence of nonresponse, Metrika, XLV, 1-2, 2132-221.

5. Kim J-M. and Warde W. D., 2003, A stratified Warner’s randomized response

model, J. Statist. Plann. Infer., 1-12 (To be appear).

6. Mangat N. S., 1994, An improved randomized response strategy, J. Roy.

Statist. Soc, B, 56(1), 93-95.

7. Mangat N. S and Singh R., 1990, An alternative randomized response

procedure, Biometrika, 77, 439-442.

8. Moors, J. J. A., 1971, Optimization of the unrelated question randomized

response model, J. Amer. Statist. Assoc. 66, 627-629.

9. Rao J. N. K., 1973, On double sampling for stratification and analytical

survey, Biometrika, 60, 125-133.

10. Singh, S; Singh R and Mangat, N. S., 2000, Some alternative strategies to Moor’s model in randomized response model, J. Statist. Plann. Infer., 83, 243-255.

11. Warner, S. L., 1965, Randomized response: a survey technique for