Retrospective Theses and Dissertations
Iowa State University Capstones, Theses and
Dissertations
2000
Variance estimation after imputation
Jae-Kwang Kim
Iowa State University
This Dissertation is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Retrospective Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository. For more information, please [email protected].
Recommended Citation
Kim, Jae-Kwang, "Variance estimation after imputation " (2000). Retrospective Theses and Dissertations. 12693. https://lib.dr.iastate.edu/rtd/12693
Variance estimation after imputation
by
Jae-Kwang Kim
A dissertation submitted to the graduate faculty in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
Major: Statistics
Major Professor: Wayne A. Fuller
Iowa State University
Ames, Iowa
2000
Graduate College
Iowa State University
This is to certify that the Doctoral dissertation of Jae-Kwang Kim has met the dissertation requirements of Iowa State University
Major Professor
For the Major Program
For the Graduate College
Signature was redacted for privacy.
Signature was redacted for privacy.
TABLE OF CONTENTS
GENERAL INTRODUCTION
1 Basic Problem
2 Estimation in the presence of nonresponse
3 Dissertation Organization
References
LITERATURE REVIEW
1 Preliminaries
2 Replication Variance Estimation Without Nonresponse
2.1 The jackknife method
2.2 Balanced repeated replication
3 Hot Deck Imputation Methods
4 The Multiple Imputation Approach
4.1 Bayesian Justification
4.2 Randomization Validity
4.3 Fay's Example
5 Quasi-Randomization Approach
6 Other approaches
6.1 Kalton and Kish's approach
6.2 Sarndal's approach
6.3 Tollefson and Fuller's approach
6.4 Fay's approach
References
VARIANCE ESTIMATION AFTER IMPUTATION
Abstract
1 Introduction
2 A Variance Estimation Method
3 Extensions To Random Imputation
4 Jackknife Method
5 Complex Survey Designs
5.1 Deterministic imputation
5.2 Random imputation
6 Concluding Remarks
Acknowledgments
References
Appendix A
Appendix B
INFERENCE PROCEDURES FOR HOT DECK IMPUTATION
Abstract
1 Introduction
2 Models for Imputation
2.1 Notation
2.2 Population Model Approach
2.3 Response Model Approach
2.4 Hot deck imputation
3 Estimation After Hot Deck Imputation: Population Model Approach
5 Variance Estimation
6 Simulation Studies
6.1 Experiment One
6.2 Experiment two
Appendix
References
REPLICATION VARIANCE ESTIMATION FOR MULTI-PHASE STRATIFIED SAMPLING
Abstract
1 Introduction
2 Asymptotic Properties
3 A Replication Method for the Reweighted Expansion Estimator
4 Replication Variance Estimation for the Double Expansion Estimator
5 Replication Variance Estimator for Multi-Phase Sampling
6 Application to the 2000 US Census
6.1 Introduction
6.2 Point Estimation
6.3 Variance Estimation
Appendix
A. Notation
B. Properties of the proposed variance estimator
C. Derivation of two phase variance when the second phase is stratified Poisson sampling
References
GENERAL CONCLUSIONS
References
LIST OF TABLES
Table 1 Illustration of the pseudo data set for Rao and Shao method
Table 2 Illustration of the proposed pseudo data set
Table 3 Mean, variance, and standardized variance of the point estimators under the four different imputation schemes in experiment one (10,000 samples)
Table 4 Mean, relative bias, variance, standardized variance of the variance estimator in experiment one (10,000 samples)
Table 5 Standardized mean C.I. width, standardized variance of C.I. width, and coverage in experiment one (10,000 samples)
Table 6 Population parameters for simulation in experiment two
Table 7 Mean, variance, and standardized variance of the point estimators under different imputation schemes in experiment two (10,000 samples of size 100)
Table 8 Mean, relative bias, variance, standardized variance of the variance estimator in experiment two (10,000 samples)
Table 9 Standardized mean C.I. width, standardized variance of C.I. width, and coverage in experiment two (10,000 samples)
Table 10 Mean and variance of the point estimator of $\theta_x$ under the three different imputation schemes in experiment two when Z is always observed
Table 11 Mean and variance of the variance estimator, the mean length of 95% C.I. (C.I. width), the coverage of 95% C.I., and the relative bias of the variance estimator in experiment two when Z is always observed
GENERAL INTRODUCTION
1 Basic Problem
In many sample surveys, some of the units contacted do not respond. Other units may respond to some but not all questions being asked. The problem of missing data in survey sampling is called the problem of nonresponse.

The problems created by nonresponse are well explained by Rubin (1987):

These missing values intended by survey design to be observed not only mean less efficient estimates because of the reduced size of data base but also that standard complete-data methods cannot be immediately used to analyze the data. Moreover, possible biases exist because the respondents are often systematically different from the nonrespondents; of particular concern, these biases are difficult to eliminate since the precise reasons for nonresponse are usually not known. (p. 1)

It is common practice to distinguish between unit nonresponse, when some of the units contacted do not respond because of not-at-homes, refusals, inability to participate, and untraced units, and item nonresponse, when some but not all of the responses are available. Item nonresponse arises because of item refusals, "don't knows", omission, and answers deleted in editing. The problem of variance estimation in the presence of item nonresponse will be addressed in this work.
2 Estimation in the presence of nonresponse
The literature on the estimation problem in the presence of nonresponse is comparatively recent; review papers include Oh and Scheuren (1983), Kott (1994), and Brick and Kalton (1996). Methods proposed in this literature can be roughly grouped into the following categories (not mutually exclusive):

(i) Procedures Based on Completely Recorded Units

When some variables are not recorded for some of the units, a simple expedient is to discard incompletely recorded units and to analyze only the units with complete data. Deletion of units is easy to carry out and may be satisfactory with small amounts of missing data. Bias will occur in the point estimate as well as in the variance-covariance estimate, unless the population mean for respondents is equal to that of nonrespondents (see, e.g., Kalton, 1983, p. 7). Also, the fraction of discarded units will be non-negligible when the number of items in the questionnaire is large.
(ii) Weighting Adjustment

In weighting adjustment for the nonresponse problem, the weights of specified respondents are increased so that they represent the nonrespondents. Weighting adjustment is primarily used to compensate for unit nonresponse. The main objective of the weighting adjustment is to reduce bias in survey estimates by making each respondent represent the correct fraction of the target population.
(iii) Imputation Procedures

Imputation means inserting values for missing items. Imputation is useful in dealing with nonresponse because it:

(a) Uses the same survey weights for all items, unlike separate weighting adjustment for each item.

(b) Retains all the reported data for use in multivariate analysis, unlike the complete case approach.
There are several imputation methods used in practice. Hot deck imputation is the imputation procedure in which the value assigned for a missing item is taken from respondents in the current sample. Many of the hot deck imputation procedures start with a division of the sample into cells based on auxiliary variables known for both the respondents and nonrespondents. The cell is called the imputation cell. One of the commonly used hot deck imputation methods is simple random hot deck imputation, where nonrespondents are assigned values from respondents in the same imputation cell with equal probabilities of selection. In spite of its convenience, treating the imputed values as if they are true values and making inference using standard formulas should be done with caution. The standard variance estimators, in particular, lead to underestimation because the additional variability due to missing values and imputation is not taken into account.

We will review existing methods of variance estimation for imputed data and suggest alternative methods.
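
The underestimation described above can be illustrated with a small simulation. The following is a minimal sketch, not a procedure taken from this dissertation: the function names, the normal population, the single imputation cell, and the 60% response rate are all assumptions chosen for illustration. It applies simple random hot deck imputation and compares the Monte Carlo variance of the imputed sample mean with the average of the naive estimates $s^2/n$ computed as if the imputed values were observed.

```python
import random
import statistics

def hot_deck_impute(values, cells, rng):
    """Simple random hot deck: fill each missing value (None) with a donor
    drawn with equal probability from respondents in the same cell."""
    donors = {}
    for y, g in zip(values, cells):
        if y is not None:
            donors.setdefault(g, []).append(y)
    return [y if y is not None else rng.choice(donors[g])
            for y, g in zip(values, cells)]

def naive_variance_of_mean(y):
    """Standard formula s^2 / n that treats every value as observed."""
    return statistics.variance(y) / len(y)

rng = random.Random(2000)
n, response_rate, reps = 50, 0.6, 4000   # illustrative settings (assumed)
means, naive = [], []
for _ in range(reps):
    y = [rng.gauss(10.0, 2.0) for _ in range(n)]
    cells = [0] * n                       # a single imputation cell
    observed = [v if rng.random() < response_rate else None for v in y]
    if sum(v is not None for v in observed) < 2:
        continue                          # skip degenerate samples
    completed = hot_deck_impute(observed, cells, rng)
    means.append(statistics.mean(completed))
    naive.append(naive_variance_of_mean(completed))

true_var = statistics.variance(means)     # Monte Carlo variance of the imputed mean
avg_naive = statistics.mean(naive)        # average naive variance estimate
print(f"Monte Carlo variance: {true_var:.4f}, mean naive estimate: {avg_naive:.4f}")
```

Under these assumed settings the naive estimate is expected to fall well below the Monte Carlo variance, because it neither reflects the reduced number of respondents nor the extra randomness of the donor draws.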
3 Dissertation Organization
The dissertation consists of three research papers. The dissertation is organized as follows:

• Chapter 2
Existing methods of variance estimation after imputation, including the approach of Rubin (1987) and the approach of Rao (1996), are reviewed. The pros and cons of the existing methods are discussed.

• Chapter 3
A variance estimation method based on single imputation is proposed in this paper. This is basically an extension of the approach of Rao (1996). The proposed method is useful for estimating the variance of a smooth function of linear estimators.

• Chapter 4
In this paper, a variance estimation method based on two or more imputed values for each missing item is proposed. In particular, a procedure called fully efficient fractional imputation is proposed and variance estimation for the procedure is presented.

• Chapter 5
In this paper, a variance estimation method for multi-phase sampling is presented and applied to the 2000 US Census of Population.

• Chapter 6
Conclusions are made.
References

Brick, J. M. and Kalton, G. (1996). "Handling missing data in survey research." Statistical Methods in Medical Research, 215-238.

Kalton, G. (1983). Compensating for missing survey data. Institute of Social Research.

Kott, P. S. (1994). "A note on handling nonresponse in sample surveys." Journal of the American Statistical Association.

Oh, H. L. and Scheuren, F. J. (1983). "Weighting adjustments for unit non-response." In Incomplete Data in Sample Surveys, Volume 2, Theory and Bibliographies. W. G. Madow, I. Olkin, and D. B. Rubin (eds.). New York: Academic Press, 143-184.

Rao, J. N. K. (1996). "On variance estimation with imputed survey data." Journal of the American Statistical Association, 91, 499-506.
LITERATURE REVIEW
1 Preliminaries
A population of $N$ identifiable elements is denoted by $U = \{1, 2, \ldots, N\}$. A subset of the population is selected and called a sample. The selection of samples uses a set of probability rules called the sampling mechanism. Let $A$ denote the set of indices in the sample. Define the sample selection indicator function

$$I_j = \begin{cases} 1 & \text{if } j \in A \\ 0 & \text{if } j \notin A \end{cases} \tag{1.1}$$

and

$$\mathbf{B} = (I_1, \ldots, I_N). \tag{1.2}$$

Let $p(\mathbf{B})$ denote a sampling mechanism that assigns probabilities, summing to one, to the $2^N$ possible $\mathbf{B}$ vectors. Associated with the $j$-th element of the population is a vector of characteristics denoted by $y_j$, and the population of vectors is denoted by

$$\mathbf{Y} = (y_1, \ldots, y_N). \tag{1.3}$$

Let the population quantity of interest be $\theta_N = \theta(y_1, \ldots, y_N)$ and let $\hat\theta$ be an estimator of $\theta_N$ based on the sample. The traditional survey sampling approach treats $\mathbf{Y}$ as fixed. An estimator $\hat\theta$ is called design unbiased for $\theta_N$ if

$$E(\hat\theta \mid \mathcal{F}) = \theta_N, \tag{1.4}$$

where $\mathcal{F} = \{y_1, \ldots, y_N\}$, $E(\hat\theta \mid \mathcal{F}) = \sum_{\mathbf{B}} \hat\theta\, p(\mathbf{B})$, and $\sum_{\mathbf{B}}$ denotes the summation over all possible $\mathbf{B}$.
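
Design unbiasedness can be checked directly for a small population by summing the estimator over every possible sample, weighted by its selection probability. The sketch below is an illustration under assumed inputs, not part of the original text: the population values, the sample size, and the function names are hypothetical, and the sampling mechanism is taken to be simple random sampling without replacement, so every sample of size $n$ has the same probability.

```python
from itertools import combinations
from fractions import Fraction

def design_expectation(y, n, estimator):
    """Average an estimator over all equally likely SRS samples of size n."""
    N = len(y)
    samples = list(combinations(range(N), n))
    p = Fraction(1, len(samples))  # p(B) is constant under SRS without replacement
    return sum(p * estimator([y[j] for j in A]) for A in samples)

def sample_mean(ys):
    return Fraction(sum(ys), len(ys))

y_pop = [3, 7, 8, 12, 20]            # hypothetical fixed population values
theta_N = Fraction(sum(y_pop), len(y_pop))
expected = design_expectation(y_pop, n=2, estimator=sample_mean)
print(expected == theta_N)
```

Exact rational arithmetic is used so that the comparison does not depend on floating-point rounding. Replacing `sample_mean` with a statistic such as `max` gives an expectation that differs from the population maximum, which is one way to see that design unbiasedness is a property of the estimator and design together.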
Another mode of inference assumes that the population vector $\mathbf{Y}$ is a random sample from an infinite superpopulation. The model-based approach in survey sampling makes inferences based on the conditional distribution of $\mathbf{Y}$ given the sample outcome $A$. Note that this conditional distribution is determined by the sampling mechanism as well as by the distribution of the variable $\mathbf{Y}$. The dependence on the sampling mechanism can be avoided if the sampling mechanism is ignorable. We formalize the concept in Definition 1.1.
Definition 1.1
Let the distribution of $\mathbf{Y}$ be denoted by $\mathcal{L}(\mathbf{Y})$ and called the superpopulation model. Let $p(\mathbf{B})$ be the sampling mechanism. Then, $p(\mathbf{B})$ is ignorable under the superpopulation model if and only if

$$\mathcal{L}(\mathbf{Y} \mid A) = \mathcal{L}(\mathbf{Y}), \tag{1.5}$$

where $\mathcal{L}(\mathbf{Y} \mid A)$ is the conditional distribution of $\mathbf{Y}$ given the sample outcome $A$.
Let $\mathbf{X} = (x_1, \ldots, x_N)$ be a vector of values for a second variable, where the true vector $\mathbf{X}$ is known for the population. A sufficient condition for the ignorability of the sampling mechanism is that it can be described by the conditional independence of $\mathbf{Y}$ and $\mathbf{B}$ given $\mathbf{X}$. In Dawid's (1979) notation,

$$\mathbf{Y} \perp \mathbf{B} \mid \mathbf{X} \tag{1.6}$$

means that the variable $\mathbf{Y}$ is independent of the sample selection indicator variable $\mathbf{B}$, given $\mathbf{X}$. If the sampling design is stratified random sampling and the auxiliary variable $\mathbf{X}$ is the indicator vector for strata, then the condition (1.6) holds because, in the same stratum, the probability of sample selection is the same for all elements. Hence, the sampling mechanism is independent of the value of $Y$ in the stratum. Rubin (1976), Scott and Smith (1973), and Sugden and Smith (1984) discuss ignorability.

Let us assume that the finite population $U$ is made up of $G$ imputation cells. Within each cell $g$, $g = 1, \ldots, G$, the elements are identically and independently distributed with mean $\mu_g$ and variance $\sigma_g^2$, i.e.

$$y_i \overset{iid}{\sim} (\mu_g, \sigma_g^2), \quad i \in U_g, \tag{1.7}$$

where $U_g$ denotes the set of indices for the $g$-th imputation cell. We call the model (1.7) the imputation cell model.
Lemma 1.1
Assume condition (1.6) with the auxiliary variable $\mathbf{X}$ being the indicator vector for imputation cells and assume that

$$0 < \Pr(I_i = 1) < 1, \quad i = 1, \ldots, N. \tag{1.8}$$

Then the sampling mechanism is ignorable under superpopulation model (1.7).

Proof. Let $S$ be any measurable set in the sigma-field $\sigma(Y)$ generated by the random variable $Y$. Then, by the definition of conditional independence, for $i \in U_g$,

$$\Pr(y_i \in S, I_i = 1 \mid i \in U_g) = \Pr(y_i \in S \mid i \in U_g)\, \Pr(I_i = 1 \mid i \in U_g).$$

Also,

$$\Pr(y_i \in S \mid i \in U_g, I_i = 1) = \frac{\Pr(y_i \in S, I_i = 1 \mid i \in U_g)}{\Pr(I_i = 1 \mid i \in U_g)} = \Pr(y_i \in S \mid i \in U_g)$$

under (1.8). Hence,

$$\mathcal{L}(y_i \mid i \in U_g, I_i = 1) = \mathcal{L}(y_i \mid i \in U_g).$$

Similarly,

$$\mathcal{L}(y_i \mid i \in U_g, I_i = 0) = \mathcal{L}(y_i \mid i \in U_g).$$

So, the result follows. •
Lemma 1.1 implies that the distribution of the sampled part is the same as that of the non-sampled part. That is,

$$y_i \mid A \overset{iid}{\sim} (\mu_g, \sigma_g^2), \quad i \in U_g, \tag{1.9}$$

for each cell $g$, $g = 1, \ldots, G$.
Given nonresponse, the original sample $A$ is decomposed into the set of respondents, $A_R$, and the set of nonrespondents, $A_M$. Define the response indicator function

$$R_i = \begin{cases} 1 & \text{if } y_i \text{ responds if sampled} \\ 0 & \text{if } y_i \text{ does not respond if sampled} \end{cases} \qquad i = 1, \ldots, N \tag{1.10}$$

and the associated vector

$$\mathbf{R} = (R_1, \ldots, R_N). \tag{1.11}$$
The distribution of $\mathbf{R}$ is called the response mechanism. Note that the response mechanism is usually unknown and is specified by the model. Conditional inference for $\mathbf{Y}$ given $\mathbf{R}$ requires the specification of the response mechanism. Ignorability of the response mechanism is defined in Definition 1.2.

Definition 1.2
Let $\mathcal{L}(\mathbf{Y} \mid A, A_R)$ be the conditional distribution of $\mathbf{Y}$ given the realized sample $A$ and the realized respondents $A_R$. Then, the response mechanism is ignorable under the model if

$$\mathcal{L}(\mathbf{Y} \mid A, A_R) = \mathcal{L}(\mathbf{Y} \mid A). \tag{1.12}$$
Rubin (1976, p. 582) defined the response mechanism to be a missing at random (MAR) mechanism if

$$\mathbf{R} \perp \mathbf{Y} \mid (\mathbf{X}, \mathbf{B}), \tag{1.13}$$

where the notation (1.13) means that the response indicator variable $\mathbf{R}$ is independent of the study variable $\mathbf{Y}$, conditional on the auxiliary variable $\mathbf{X}$ and the sample $\mathbf{B}$.
Lemma 1.2
Let the auxiliary variable $\mathbf{X}$ be the imputation cell variable defined in Lemma 1.1. Assume that

$$0 < \Pr(R_i = 1) < 1, \quad i \in A, \tag{1.14}$$

and that the MAR condition (1.13) holds. Then the response mechanism is ignorable under the model (1.7).

Proof. The proof is quite similar to that of Lemma 1.1. Let $S$ be any measurable set in the sigma-field $\sigma(Y)$ generated by the random variable $Y$. Then, by the definition of conditional independence, for $i \in U_g$,

$$\Pr(y_i \in S, R_i = 1 \mid i \in U_g, I_i = 1) = \Pr(y_i \in S \mid i \in U_g, I_i = 1) \times \Pr(R_i = 1 \mid i \in U_g, I_i = 1)$$

and, by the definition of $R_i$,

$$\Pr(R_i = 1 \mid i \in U_g, I_i = 1) = \Pr(R_i = 1 \mid i \in U_g).$$

Therefore,

$$\Pr(y_i \in S \mid i \in U_g, I_i = 1, R_i = 1) = \frac{\Pr(y_i \in S, R_i = 1 \mid i \in U_g, I_i = 1)}{\Pr(R_i = 1 \mid i \in U_g, I_i = 1)} = \Pr(y_i \in S \mid i \in U_g, I_i = 1)$$

under (1.14). Hence, the result follows. •

Given Lemma 1.1 and Lemma 1.2, we are able to say that model (1.7), together with an ignorable sampling mechanism and an ignorable response mechanism, produces observations in an imputation cell that are distributed identically and independently. That is,

$$y_i \mid (A, A_R) \overset{iid}{\sim} (\mu_g, \sigma_g^2). \tag{1.15}$$
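
The practical meaning of an ignorable response mechanism within imputation cells can be seen in a small simulation. The sketch below is illustrative only: the two cell means, the response probabilities, and the sample sizes are assumed values, not quantities from this dissertation. Response depends on the cell but not on the study variable, so respondents within a cell behave like a random subsample of that cell, while the pooled respondent mean is pulled toward the cell that responds more often.

```python
import random
import statistics

rng = random.Random(1976)
mu = {0: 5.0, 1: 15.0}        # hypothetical cell means
resp_prob = {0: 0.9, 1: 0.4}  # response depends on the cell only (MAR given X)

cell_means = {0: [], 1: []}
overall = []
for _ in range(2000):
    # 50 sampled units per cell, unit variance within each cell
    sample = [(g, rng.gauss(mu[g], 1.0)) for g in (0, 1) for _ in range(50)]
    resp = [(g, y) for g, y in sample if rng.random() < resp_prob[g]]
    for g in (0, 1):
        ys = [y for h, y in resp if h == g]
        if ys:
            cell_means[g].append(statistics.mean(ys))
    overall.append(statistics.mean(y for _, y in resp))

for g in (0, 1):
    print("cell", g, "respondent mean:", round(statistics.mean(cell_means[g]), 2))
print("all respondents pooled:", round(statistics.mean(overall), 2))
```

In this assumed setup the within-cell respondent means should stay close to the cell means, whereas the pooled respondent mean should fall noticeably below the overall mean of 10, which is why the cell structure matters when donors are chosen.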
2 Replication Variance Estimation Without Nonresponse
In this section, we consider the case when there is no nonresponse in the sample. In many surveys, data are collected from individuals or units sampled using complex sample designs that include varying probabilities and non-independent selections. One approach to estimating the standard error of the estimator is to linearize the estimator using a Taylor series expansion and then use a standard sample survey variance estimation method to estimate the precision of the linearized statistic. An advantage of the linearization method is that it is applicable to general sampling designs, but a disadvantage is that it involves the derivation of a separate variance estimation formula for each statistic. An alternative approach is to use a replication method. Two popular methods in survey sampling are the jackknife and balanced repeated replication (BRR). Wolter (1985) and Rust and Rao (1996) provide good reviews of the replication literature as applied to complex sample surveys.

Let $\theta$ be the population parameter of interest and let $\hat\theta$ be the estimator of $\theta$ based on the full sample. To estimate sampling errors, subsamples from the sample are drawn and $\hat\theta$ is computed from each subsample. Different ways of subsampling from the full sample correspond to different replication methods. The subsamples are called replicate samples and the statistics calculated from these replicates are called replicate estimates.
The variance of the sample estimator $\hat\theta$ is estimated from the replicate estimates by

$$\hat V(\hat\theta) = \sum_{k=1}^{L} c_k \left(\hat\theta^{(k)} - \hat\theta\right)^2, \tag{2.16}$$

where $\hat\theta^{(k)}$ is the $k$-th estimate of $\theta$ based on the observations included in the $k$-th replicate, $L$ is the number of replicates, and $c_k$ is a factor associated with replicate $k$ and determined by the replication method. When the original estimator $\hat\theta$ is a linear estimator of the form

$$\hat\theta = \sum_{i \in A} w_i y_i, \tag{2.17}$$

where $w_i = w_i(A)$, the $k$-th replicate of $\hat\theta$ can be written as

$$\hat\theta^{(k)} = \sum_{i \in A} w_i^{(k)} y_i, \tag{2.18}$$

where $w_i^{(k)}$ denotes the replicate weight for the $i$-th unit of the $k$-th replicate.
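
For a linear estimator, the replication variance estimator needs only the full-sample weights, the replicate weights, and the replicate factors. The following sketch shows one way to organize that computation; the function names and the three-unit toy data are assumptions made for illustration, and the delete-one weights used in the example are just one possible choice of replicates.

```python
def linear_estimate(weights, y):
    """theta_hat = sum_i w_i y_i for a linear estimator."""
    return sum(w * v for w, v in zip(weights, y))

def replication_variance(weights, replicate_weights, factors, y):
    """V_hat = sum_k c_k (theta_hat^(k) - theta_hat)^2."""
    theta = linear_estimate(weights, y)
    return sum(c * (linear_estimate(wk, y) - theta) ** 2
               for c, wk in zip(factors, replicate_weights))

# Toy input (assumed): three units, equal weights, delete-one replicates.
y = [2.0, 4.0, 9.0]
w = [1 / 3] * 3
w_rep = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
c = [2 / 3] * 3
v_hat = replication_variance(w, w_rep, c, y)
print(v_hat)
```

With these toy inputs the result agrees with $s^2/n = 13/3$ for the sample mean, which is a quick sanity check that the factors and replicate weights were set up consistently. Keeping the replicate weights as data rather than code is the design choice that lets the same function serve different replication methods.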
The following lemma provides a necessary condition for a variance estimator to be unbiased.

Lemma 2.1
Let the original estimator $\hat\theta$ be a linear estimator of the form in (2.17). If a replication variance estimator $\hat V$ in (2.16) is design unbiased for the (design) variance of $\hat\theta$ and always takes nonnegative values, then we have

$$\sum_{i \in A} w_i^{(k)} z_i = \sum_{i \in A} w_i z_i, \quad k = 1, 2, \ldots, L, \tag{2.19}$$

for all $z_i$ satisfying

$$\mathrm{Var}\left(\sum_{i \in A} w_i z_i \,\Big|\, \mathcal{F}\right) = 0. \tag{2.20}$$

Proof. Let $\hat\theta_z = \sum_{i \in A} w_i z_i$ and

$$\hat V(\hat\theta_z) = \sum_{k=1}^{L} c_k \left(\sum_{i \in A} w_i^{(k)} z_i - \sum_{i \in A} w_i z_i\right)^2.$$

By (2.20) and the unbiasedness of $\hat V$, we have

$$E\{\hat V(\hat\theta_z) \mid \mathcal{F}\} = 0.$$

Since $\Pr(\hat V(\hat\theta_z) < 0 \mid \mathcal{F}) = 0$, we have

$$\hat V(\hat\theta_z) = 0$$

for all samples. Since the $c_k$ are all positive, by the nonnegativeness of $\hat V(\hat\theta_z)$ again, (2.19) follows. •
The replicate factor $c_k$ is chosen so that $\sum_{k=1}^{L} c_k (\hat\theta^{(k)} - \hat\theta)^2$ estimates $\mathrm{Var}(\sum_{i} w_i y_i I_i)$. Under strict unbiasedness of $\hat V(\hat\theta)$, we have

$$\sum_{k=1}^{L} c_k \left(w_i^{(k)} - w_i\right)^2 \Pr(i \in A) = w_i^2 \Pr(i \in A)\,[1 - \Pr(i \in A)]. \tag{2.21}$$

Equality (2.21) holds because the left side of (2.21) is the expected value of $\hat V$ for the particular population $\mathcal{F}$ whose $y$ values are zeros for all units except for the $i$-th element. The right side of (2.21) is the design variance of $\hat\theta$ for the population $\mathcal{F}$.
2.1 The jackknife method
The jackknife method, which originally was designed to estimate the bias of an estimator by deleting one datum from the original data set and recalculating the estimator based on the rest of the data, has become a valuable tool for variance estimation since the work of Tukey (1958). In an infinite population context, Tukey (1958) suggested that each replicate estimate might be regarded as an independent and identically distributed random variable, which in turn suggests a very simple variance estimator. In the finite population sampling context, each jackknife replicate deletes one unit and modifies the weights of others.
Example 2.1
1. Under simple random sampling of size $n$ from a finite population of size $N$, the jackknife variance estimator is defined by equation (2.10) with $c_k = n^{-1}(n-1)(1 - n/N)$ and $w_i^{(k)} = (n-1)^{-1}$ for $i \neq k$ and $w_k^{(k)} = 0$. These values satisfy (2.10) with … $= 1$ and (2.21).
2. For stratified random sampling, let $y_{hi}$ be the value of the $i$-th element in stratum $h$. Let $c_k = n_h^{-1}(n_h - 1)$ when the unit $(hi)$ is deleted for the $k$-th replicate and let
$$w_{gj}^{(k)} = \begin{cases} w_{gj} & \text{if a unit in stratum } h \text{ is deleted, } g \neq h \\ (n_h - 1)^{-1} n_h\, w_{hj} & \text{if unit } (hi) \text{ is deleted, } j \neq i \\ 0 & \text{if unit } (hj) \text{ is deleted.} \end{cases}$$
T h t n f 2 . 1 9 ) h o l d s , win n c/,, = ( : / , , i . • • • . r/.,//) w i t h ^ [ if h = ij. a n d zi,,., = 0 o t h i r u ' i s t .
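The simple random sampling case of Example 2.1 can be checked numerically. The sketch below is an illustration only: the function names and the data are assumptions, and the weight identity (2.21) is read here as $\sum_k c_k (w_i^{(k)} - w_i)^2 \Pr(i \in A) = w_i^2 \Pr(i \in A)[1 - \Pr(i \in A)]$ with $w_i = n^{-1}$ and $\Pr(i \in A) = n/N$. It computes the delete-one jackknife variance estimator of the sample mean with $c_k = n^{-1}(n-1)(1 - n/N)$ and compares it with the usual estimator $(1 - n/N)s^2/n$.

```python
from fractions import Fraction


def jackknife_variance_srs(y, N):
    """Delete-one jackknife variance estimator of the sample mean under
    simple random sampling without replacement (integer data, exact)."""
    n = len(y)
    theta_hat = Fraction(sum(y), n)
    c = Fraction(n - 1, n) * (1 - Fraction(n, N))  # replicate factor c_k
    total = Fraction(0)
    for k in range(n):
        # Replicate k: unit k gets weight 0, the others get 1 / (n - 1).
        theta_k = Fraction(sum(y) - y[k], n - 1)
        total += c * (theta_k - theta_hat) ** 2
    return total


def textbook_variance_srs(y, N):
    """Usual estimator (1 - n/N) * s^2 / n for the sample mean."""
    n = len(y)
    ybar = Fraction(sum(y), n)
    s2 = sum((v - ybar) ** 2 for v in y) / (n - 1)
    return (1 - Fraction(n, N)) * s2 / n


def weight_identity_sides(n, N):
    """Left and right sides of the weight identity for one sampled unit i."""
    c = Fraction(n - 1, n) * (1 - Fraction(n, N))
    w = Fraction(1, n)       # full-sample weight for the mean
    pr = Fraction(n, N)      # inclusion probability under SRS
    # n - 1 replicates keep unit i (weight 1/(n-1)); one replicate deletes it.
    lhs = c * ((n - 1) * (Fraction(1, n - 1) - w) ** 2 + (0 - w) ** 2) * pr
    rhs = w ** 2 * pr * (1 - pr)
    return lhs, rhs


sample = [3, 1, 4, 1, 5]     # hypothetical data
population_size = 20         # hypothetical N
v_jk = jackknife_variance_srs(sample, population_size)
v_tb = textbook_variance_srs(sample, population_size)
lhs, rhs = weight_identity_sides(len(sample), population_size)
```

Exact rational arithmetic is used so that the two estimators can be compared for equality rather than up to rounding.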
2.2 Balanced repeated replication
Balanced repeated replication (BRR) was first proposed by McCarthy (1969) for the case where two clusters per stratum are sampled with replacement in the first stage of sampling. In the two-cluster-per-stratum design with $H$ strata, a minimal set of $L$ balanced half-samples may be constructed from an $L \times L$ Hadamard matrix (see, e.g., Wolter, 1985) by choosing any $H$ columns excluding the column of all $+1$'s, where $H < L \le H + 4$. Let $\delta_{kh}$ be the element of the Hadamard matrix, satisfying
$$\sum_{k=1}^{L} \delta_{kh} = 0 \tag{2.22}$$
for all $h$ and
$$\sum_{k=1}^{L} \delta_{kh}\,\delta_{kh'} = 0 \quad \text{for all } h \neq h'. \tag{2.23}$$
The $k$-th replicate of the linear estimator of the form $\hat\theta = \sum_{h=1}^{H}\sum_{i=1}^{2} w_{hi}\, y_{hi}$ can be written as … (2.24)
Since the BRR uses only half of the original sample, it may produce very unstable estimates for some nonlinear statistics in relatively small samples. To avoid anomalies, Fay (1984) suggested using … where $0 \le \epsilon < 1$.
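The balance conditions on the Hadamard columns can be illustrated with a short sketch. It assumes the Sylvester construction, which only yields orders that are powers of two, so the number of replicates it returns is not always the minimal $L$; the function names are hypothetical. The sketch builds the matrix, drops the column of all $+1$'s, keeps $H$ of the remaining columns, and the checks that follow confirm that each kept column sums to zero and that distinct columns are orthogonal.

```python
def sylvester_hadamard(order):
    """Hadamard matrix of the given order (a power of two) with +1/-1 entries."""
    if order < 1 or order & (order - 1):
        raise ValueError("order must be a power of two")
    h = [[1]]
    while len(h) < order:
        # Doubling step: [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def balanced_half_samples(num_strata):
    """Return an L x H array of +1/-1 indicators, one row per replicate.

    L is the smallest power of two exceeding H (an assumption of this
    sketch). Column 0 of the Sylvester matrix is all +1 and is excluded.
    """
    L = 1
    while L <= num_strata:
        L *= 2
    h = sylvester_hadamard(L)
    return [[row[j] for j in range(1, num_strata + 1)] for row in h]


delta = balanced_half_samples(5)   # 5 strata, so 8 replicates here
column_sums = [sum(row[h] for row in delta) for h in range(5)]
cross_products = [
    sum(row[h] * row[g] for row in delta)
    for h in range(5) for g in range(h + 1, 5)
]
```

In replicate $k$, a value of $+1$ in column $h$ would select one of the two clusters in stratum $h$ and $-1$ the other; that coding convention is also an assumption of the sketch.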
3 Hot Deck Imputation Methods
There are a variety of imputation methods used in practice, as noted by Kalton and Kasprzyk (1986). The hot deck imputation method starts with the division of the sample into several imputation cells. Many hot deck imputation methods assign the value from a record with a response to a record with a missing value on that item. These records will be called the donor and recipient, respectively. Often, the values for a set of related missing items are taken from the same donor, to preserve some of the multivariate relationships.
Hot deck imputation methods can be roughly classified into the following categories:
(i) Sequential Hot Deck Imputation
Some hot deck imputation procedures impute the value from the record in the same cell that was last read by the computer. This is partly based on a belief that, if the data are arranged in some geographic order, adjacent units in the cell will tend to be more similar than randomly chosen units in the cell. One problem with the sequential hot deck imputation is that it may easily make multiple uses of donors, a feature that leads to a loss of precision in survey estimates.
(ii) Random Hot Deck Imputation
A respondent is chosen at random within an imputation cell, and the selected respondent's value is assigned to the nonrespondent. To preserve multivariate relationships, values from the same donor are used for all missing items of a record. The selection of donors can be performed either with-replacement or without-replacement. Furthermore, one may have more than one imputed value for each missing item.
(iii) Nearest-Neighbor Hot Deck Imputation
This hot deck method assigns a nonrespondent the value of the "nearest" respondent, where "nearest" is defined in terms of a distance function of the auxiliary variables.
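A minimal sketch of random hot deck imputation within cells, category (ii) above, is given here for a single item. The function name, the use of `None` to mark a missing value, and the data are assumptions made for illustration.

```python
import random


def random_hot_deck(values, cells, rng, with_replacement=True):
    """Impute missing entries (None) by drawing donors at random within cells.

    Returns the completed list and a dict mapping recipient index to donor index.
    """
    imputed = list(values)
    donor_of = {}
    for cell in sorted(set(cells)):
        members = [i for i, c in enumerate(cells) if c == cell]
        donors = [i for i in members if values[i] is not None]
        recipients = [i for i in members if values[i] is None]
        if not recipients:
            continue
        if not donors:
            raise ValueError(f"cell {cell!r} has no respondents")
        if with_replacement:
            chosen = [rng.choice(donors) for _ in recipients]
        else:
            if len(recipients) > len(donors):
                raise ValueError(f"cell {cell!r} has too few donors")
            chosen = rng.sample(donors, len(recipients))
        for rec, don in zip(recipients, chosen):
            imputed[rec] = values[don]
            donor_of[rec] = don
    return imputed, donor_of


y = [10, None, 12, None, 7, 8, None]            # hypothetical item values
cell = ["a", "a", "a", "a", "b", "b", "b"]      # hypothetical imputation cells
completed, donor_of = random_hot_deck(
    y, cell, random.Random(1), with_replacement=False
)
```

With `with_replacement=False` no donor is used twice within a cell, which is the design feature that limits multiple use of donors.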
Random hot deck imputation involves random selection of donors. This random selection mechanism introduces what is termed imputation variance, and this imputation variance reduces the precision of the survey estimates.
As reviewed by Brick and Kalton (1996), there are two main methods for reducing imputation variance. One is through a sample design for selecting donors within each imputation cell. For instance, selecting donors by simple random sampling without replacement is preferable to simple random sampling of donors with replacement. By minimizing the multiple use of donors, the without-replacement design leads to a lower imputation variance.
A second approach is to use fractional imputation, which involves dividing nonrespondents' records into parts and imputing separately to each part. For example, each nonrespondent might be divided into three parts, each of which is allocated a weight of one-third of the nonrespondent's original weight. Then separate donors are chosen for each part. If we have only one imputed value for each nonrespondent, then we will call the procedure single (hot deck) imputation.
Example 3.1
Suppose, in a simple random sample of size $n$, $r$ units respond and $m$ do not respond to item $y$. The imputed value for missing unit $i$ is denoted by $y_i^*$. The imputed estimator of the population mean is
$$\bar y_I = n^{-1}\Big(\sum_{i \in A_R} y_i + \sum_{i \in A_M} y_i^*\Big),$$
where $A_R$ and $A_M$ denote the sets of responding and nonresponding units.
If the with-replacement hot deck imputation is used, then the variance of $\bar y_I$ is, conditional on $r$,
$$\mathrm{Var}(\bar y_I) = \mathrm{Var}(\bar y_r) + \frac{m}{n^2}\left(1 - r^{-1}\right) E(s_r^2), \tag{3.26}$$
where $s_r^2 = (r-1)^{-1}\sum_{i \in A_R}(y_i - \bar y_r)^2$ and $\bar y_r = r^{-1}\sum_{i \in A_R} y_i$.
If the without-replacement hot deck imputation is used with $r > m$, then the variance of $\bar y_I$, conditional on $r$, is
$$\mathrm{Var}(\bar y_I) = \mathrm{Var}(\bar y_r) + \frac{m}{n^2}\left(1 - \frac{m}{r}\right) E(s_r^2). \tag{3.27}$$
For fractional imputation with the number of imputations equal to $c$, where the with-replacement hot deck imputation is used independently $c$ times, the variance of $\bar y_I$ is, conditional on $r$,
$$\mathrm{Var}(\bar y_I) = \mathrm{Var}(\bar y_r) + \frac{m}{c\,n^2}\left(1 - r^{-1}\right) E(s_r^2). \tag{3.28}$$
Hence, without-replacement hot deck imputation and fractional imputation reduce the imputation variance relative to with-replacement imputation.
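The imputation-variance terms in Example 3.1 can be verified by exact enumeration for a small data set. The sketch below fixes the respondent values, enumerates every equally likely donor assignment, and compares the variance of the imputed mean over the imputation mechanism with $(m/n^2)(1 - r^{-1})s_r^2$ for with-replacement selection, $(m/n^2)(1 - m/r)s_r^2$ for without-replacement selection, and $(m/(cn^2))(1 - r^{-1})s_r^2$ for fractional imputation with $c$ independent with-replacement draws. The data values and function name are illustrative assumptions.

```python
from fractions import Fraction
from itertools import permutations, product


def imputation_variance(resp, m, draws, c=1):
    """Exact variance of the imputed mean over equally likely donor draws.

    Each draw lists the donated values; with c > 1 every nonrespondent
    receives the average of c donated values (fractional imputation).
    """
    n = len(resp) + m
    means = [(sum(resp) + Fraction(sum(d), c)) / n for d in draws]
    centre = sum(means) / len(means)
    return sum((x - centre) ** 2 for x in means) / len(means)


resp = [1, 2, 4, 7]   # hypothetical respondent values
m = 2                 # number of nonrespondents
r = len(resp)
n = r + m
ybar_r = Fraction(sum(resp), r)
s2_r = sum((y - ybar_r) ** 2 for y in resp) / (r - 1)

var_with = imputation_variance(resp, m, list(product(resp, repeat=m)))
var_without = imputation_variance(resp, m, list(permutations(resp, m)))
c = 2
var_fractional = imputation_variance(
    resp, m, list(product(resp, repeat=m * c)), c=c
)

formula_with = Fraction(m, n ** 2) * (1 - Fraction(1, r)) * s2_r
formula_without = Fraction(m, n ** 2) * (1 - Fraction(m, r)) * s2_r
formula_fractional = formula_with / c
```

These quantities are conditional on the realized respondents, so $s_r^2$ appears in place of its expectation.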
Fractional imputation is discussed by Kalton and Kish (1984) and Fay (1996). A different, but related, approach is multiple imputation, which is discussed in the next section.
4 The Multiple Imputation Approach
Multiple imputation, proposed by Rubin (1978), is a procedure for handling missing data that allows the data analyst to use standard techniques of analysis designed for complete data, while at the same time providing a method to estimate the uncertainty due to the missing data.
A comprehensive description of multiple imputation is given in Rubin (1987). Rubin (1987) devotes a good deal of Chapter 3 to specifying requirements for the validity of multiple imputation inference under the model-based approach. His arguments in that chapter are for the Bayesian approach, where inferences are made using the posterior mean and the posterior variance. Rubin (1987) devotes Chapter 4 to conditions for the validity of multiple imputation in the randomization framework.
Multiple imputation can be characterized by the method of generating the imputed values and by the variance formula. The variance formula directly uses the complete-sample variance estimator so that it can be implemented easily using the existing software.
Let $\hat\theta_n$ be the complete sample estimator of the parameter $\theta$ and $\hat V_n = \hat V(Y_{com})$ be the complete sample variance estimator of $\hat\theta_n$. The full sample $Y_{com}$ is decomposed as $Y_{com} = (Y_{obs}, Y_{mis})$, where $Y_{obs}$ is the part of $Y_{com}$ with $R_i = 1$ and $Y_{mis}$ is the part of $Y_{com}$ with $R_i = 0$.
Multiple imputation involves repeating the imputation process independently $M$ times. The imputed values are generated from the posterior distribution of $Y_{mis}$ given $Y_{obs}$. After multiple imputation, we have $M$ data sets. Thus we can construct $M$ separate statistics and $M$ estimators of variance based on the augmented sample. Let the statistics be $\hat\theta_{(1),n}, \ldots, \hat\theta_{(M),n}$ and $\hat V_{(1),n}, \ldots, \hat V_{(M),n}$ for the estimator and estimator of variance, respectively. Then, the multiple imputation estimator of $\theta$ is
$$\hat\theta_{M,n} = M^{-1}\sum_{t=1}^{M} \hat\theta_{(t),n} \tag{4.29}$$
and the associated variance estimator is
$$\hat T_{M,n} = W_{M,n} + \left(1 + M^{-1}\right) B_{M,n}, \tag{4.30}$$
where
$$W_{M,n} = M^{-1}\sum_{t=1}^{M} \hat V_{(t),n} \tag{4.31}$$
and
$$B_{M,n} = (M-1)^{-1}\sum_{t=1}^{M}\left(\hat\theta_{(t),n} - \hat\theta_{M,n}\right)^2. \tag{4.32}$$
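The combining step can be sketched in a few lines. The function below takes the $M$ completed-data estimates and their variance estimates and returns the multiple imputation point estimate, the within-imputation component (the average of the variance estimates), the between-imputation component (the sample variance of the point estimates), and the total, taken here as within plus $(1 + M^{-1})$ times between. The function name and the input numbers are assumptions for illustration.

```python
from fractions import Fraction


def combine_multiple_imputations(estimates, variances):
    """Combine M completed-data estimates and variance estimates.

    Returns (theta_M, W_M, B_M, T_M).
    """
    M = len(estimates)
    if M < 2 or len(variances) != M:
        raise ValueError("need M >= 2 estimates with matching variances")
    theta_M = sum(estimates) / M
    W_M = sum(variances) / M                                   # within
    B_M = sum((e - theta_M) ** 2 for e in estimates) / (M - 1)  # between
    T_M = W_M + (1 + Fraction(1, M)) * B_M                     # total
    return theta_M, W_M, B_M, T_M


# Hypothetical results from M = 3 imputed data sets.
theta_M, W_M, B_M, T_M = combine_multiple_imputations(
    [Fraction(10), Fraction(12), Fraction(14)],
    [Fraction(4), Fraction(5), Fraction(6)],
)
```

Floating-point inputs work as well; `Fraction` inputs are used here only to keep the arithmetic exact.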
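The typical assumptions associated with multiple imputation are
$$\lim_{n \to \infty} E\left(\hat\theta_{\infty,n} - \theta\right) = 0 \tag{4.33}$$
and
$$\lim_{n \to \infty} n\left[E\left(T_{\infty,n}\right) - \mathrm{Var}\left(\hat\theta_{\infty,n} - \theta\right)\right] = 0, \tag{4.34}$$
where
$$\hat\theta_{\infty,n} = \lim_{M \to \infty} \hat\theta_{M,n} \quad \text{and} \quad T_{\infty,n} = \lim_{M \to \infty} \hat T_{M,n}.$$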
In the Bayesian approach, the distribution used in (4.33) and (4.34) is the conditional distribution of $\theta$ given $Y_{obs}$ under the assumed model. In the classical model-based approach, the distribution is the distribution of $Y_{obs}$ under the assumed model. In the randomization approach, the distribution used in (4.33) and (4.34) is the joint distribution of the sampling mechanism and the response mechanism.
4.1 Bayesian Justification
Chapter 3 of Rubin (1987) deals with the validity of multiple imputation in the Bayesian framework. To review that approach, we assume that the estimator $\hat\theta_n$ based on the complete sample is the posterior mean of $\theta$ under the assumed Bayesian model $f$ (including both the likelihood and the prior density). That is,
$$\hat\theta_n = E_f\left(\theta \mid Y_{com}\right). \tag{4.35}$$
Also, let $\hat V_n$ be the posterior variance of $\theta$ under the model $f$. That is,
$$\hat V_n = V_f\left(\theta \mid Y_{com}\right). \tag{4.36}$$
According to Meng (1994, p. 543), a Bayesian model $f$ satisfying (4.35) and (4.36) is said to be congenial to the analysis using $(\hat\theta_n, \hat V_n)$. Using the terminology of congeniality, we summarize the main results in Chapter 3 of Rubin (1987).
Result 4.1 Assume that
(i) for the complete sample, the Bayesian model $f$ is congenial to the analysis using $(\hat\theta_n, \hat V_n)$, and
(ii) the imputed value is drawn from the conditional distribution $\mathcal{L}(Y_{mis} \mid Y_{obs})$ of $Y_{mis}$ given $Y_{obs}$ under the Bayesian model $f$.
Then, under nonresponse, the Bayesian model $f$ is congenial to the analysis using the imputed pair $(\hat\theta_{M,n}, \hat T_{M,n})$ calculated from (4.29) and (4.30), as $M \to \infty$. That is,
$$\hat\theta_{\infty,n} = E_f\left(\theta \mid Y_{obs}\right) \tag{4.37}$$
and
$$T_{\infty,n} = V_f\left(\theta \mid Y_{obs}\right). \tag{4.38}$$
Proof. To show the equalities (4.37) and (4.38), note that
$$E_f\left(\theta \mid Y_{obs}\right) = E_f\left\{E_f\left(\theta \mid Y_{obs}, Y_{mis}\right) \mid Y_{obs}\right\},$$
$$V_f\left(\theta \mid Y_{obs}\right) = E_f\left\{V_f\left(\theta \mid Y_{obs}, Y_{mis}\right) \mid Y_{obs}\right\} + V_f\left\{E_f\left(\theta \mid Y_{obs}, Y_{mis}\right) \mid Y_{obs}\right\},$$
and that the imputed values $Y_{mis}^{(t)}$ are independent draws from $\mathcal{L}(Y_{mis} \mid Y_{obs})$ for $t = 1, 2, \ldots, M$. So, by the law of large numbers,
$$E_f\left(\theta \mid Y_{obs}\right) = \lim_{M \to \infty} M^{-1}\sum_{t=1}^{M} E_f\left(\theta \mid Y_{obs}, Y_{mis}^{(t)}\right)$$
and
$$V_f\left(\theta \mid Y_{obs}\right) = \lim_{M \to \infty} M^{-1}\sum_{t=1}^{M} V_f\left(\theta \mid Y_{obs}, Y_{mis}^{(t)}\right) + \lim_{M \to \infty} (M-1)^{-1}\sum_{t=1}^{M}\left\{E_f\left(\theta \mid Y_{obs}, Y_{mis}^{(t)}\right) - \hat\theta_{M,n}\right\}^2$$
almost surely. ∎
By Result 4.1, we have the desired relations (4.33) and (4.34) under the posterior distribution of $\theta$ given $Y_{obs}$. In Result 4.1, there are two models involved. The first model is called the analyst's model, which is used in (4.35) and (4.36). The second model is called the imputer's model, which is used in calculating $\mathcal{L}(Y_{mis} \mid Y_{obs})$. Result 4.1 requires that the two models be the same. As is observed in Fay (1991, 1992), Meng (1994), Rubin (1996), and Schafer (1997), if the analyst's model is different from that of the imputer, then the multiple imputation estimator may be biased, not only for variance estimation but also for point estimation.
4.2 Randomization Validity
Multiple imputation, which is based on the Bayesian paradigm, can be evaluated under the frequentist paradigm, where the population values are treated as fixed and inferences are based on the sampling distribution generated by repetitions of the sample selection procedure and a model for response probabilities.
Rubin (1987, p. 118) gave the definition of proper imputation, which is a key concept for the randomization validity of multiple imputation. The definition of a proper imputation procedure treats the complete sample $Y_{com}$ as fixed, and the response indication vector $R$ as the random variable. For complete sample statistics $\hat\theta_n$ and $\hat V_n$, an imputation method is called proper under the assumed response mechanism if
$$E_R\left\{\hat\theta_{\infty,n} \mid Y_{com}\right\} = \hat\theta_n, \tag{C1}$$
$$E_R\left\{W_{\infty,n} \mid Y_{com}\right\} = \hat V_n, \tag{C2}$$
and
$$E_R\left\{B_{\infty,n} \mid Y_{com}\right\} = V_R\left\{\hat\theta_{\infty,n} \mid Y_{com}\right\}. \tag{C3}$$
The subscript $R$ is used here to emphasize that the reference distribution is with respect to the assumed response mechanism on $R$.
The main conclusion regarding randomization validity with proper imputation is well summarized in Rubin (1987):
Result 4.1: If the complete-data inference is randomization valid and the multiple-imputation procedure is proper, then the infinite-m repeated imputation inference is randomization-valid under the posited response mechanism. (p. 119)
The conditions of proper imputation are difficult to verify. One imputation procedure is to generate the missing part $Y_{mis}$ from the conditional distribution $\mathcal{L}(Y_{mis} \mid Y_{obs})$ of … proper imputation. Bayesianly proper imputation is not sufficient for proper imputation. The following theorem attempts to clarify the relationships.
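Theorem 4.1
If a multiple imputation generated from $\mathcal{L}(Y_{mis} \mid Y_{obs})$ using the Bayesian model $f$ satisfies (C1), then it also satisfies (C3).
Proof. Let $\hat\theta_n$ be the given full sample estimator of $\theta$. By the law of large numbers,
$$\hat\theta_{\infty,n} = \lim_{M \to \infty} M^{-1}\sum_{t=1}^{M}\hat\theta_{(t),n} = E_f\left(\hat\theta_n \mid Y_{obs}\right)$$
and
$$B_{\infty,n} = \lim_{M \to \infty} (M-1)^{-1}\sum_{t=1}^{M}\left(\hat\theta_{(t),n} - \hat\theta_{M,n}\right)^2 = V_f\left(\hat\theta_n \mid Y_{obs}\right).$$
Now,
$$E_R\left(B_{\infty,n} \mid Y_{com}\right) = E_R\left[V_f\left(\hat\theta_n \mid Y_{obs}\right) \mid Y_{com}\right] = V_R\left[\hat\theta_n - E_f\left(\hat\theta_n \mid Y_{obs}\right) \mid Y_{com}\right]. \tag{4.39}$$
Furthermore,
$$V_R\left(\hat\theta_{\infty,n} \mid Y_{com}\right) = V_R\left[E_f\left(\hat\theta_n \mid Y_{obs}\right) \mid Y_{com}\right] = V_R\left[\hat\theta_n - E_f\left(\hat\theta_n \mid Y_{obs}\right) \mid Y_{com}\right], \tag{4.40}$$
where the equality (4.39) follows from the decomposition
$$V_R\left[Q \mid Y_{com}\right] = V_R\left[E_f\{Q \mid Y_{obs}\} \mid Y_{com}\right] + E_R\left[V_f\{Q \mid Y_{obs}\} \mid Y_{com}\right]$$
with $Q = \hat\theta_n - E_f(\hat\theta_n \mid Y_{obs})$ and so $E_f[Q \mid Y_{obs}] = 0$. The equality (4.40) holds because $E_R[E_f(\hat\theta_n \mid Y_{obs}) \mid Y_{com}] = \hat\theta_n$ by assumption (C1). ∎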
Some authors, for example Fay (1992), have questioned the validity of multiple imputation under the frequentist response probability model. We will study this in the next subsection.
4.3 Fay's Example
Fay (1991, 1992) used a Bernoulli model to illustrate the difficulty of creating proper imputations as a general purpose methodology. We suppress the subscript $n$ in the estimators to simplify the notation. Suppose we have a simple random sample of size $n$ for variable $Y$ taking only 0 or 1. Assume that, for simplicity, the first $r$ elements are observed and the parameter of interest is $\theta_N = N^{-1}\sum_{i=1}^{N} Y_i$. Assume the uniform response mechanism and use the Bernoulli model to create the multiple imputation. Then we use $\hat\theta_M$ with … (4.41) as the estimator of $\theta_N$ and use $\hat T_M$ with … (4.42) and … to estimate the variance of $\hat\theta_M$. Then, letting $n^{-1} r \to p \in (0, 1)$ and $N^{-1} n \to 0$, we have
$$E\left(\hat\theta_\infty \mid Y\right) \doteq \theta_N,$$
$$\mathrm{Var}\left(\hat\theta_\infty \mid Y\right) \doteq r^{-1}\theta_N\left(1 - \theta_N\right),$$
$$E\left(T_\infty \mid Y\right) \doteq r^{-1}\theta_N\left(1 - \theta_N\right),$$
because the respondents can be regarded as a simple random sample from the population. Hence,
$$E\left(T_\infty \mid Y\right) \doteq \mathrm{Var}\left(\hat\theta_\infty \mid Y\right).$$
Now, assume that for each unit $i$, we have $X_i$ taking either $a$ or $b$ as possible values. We want to estimate $\theta_a = \Pr(Y = 1, X = a)$ and $\theta_b = \Pr(Y = 1, X = b)$. Let $n = n_a + n_b$ and $r = r_a + r_b$. If we have complete response, then we will use
0., = f / , ; '
^ v;/( . V , = </1
1 = 1
0 , = = M
1=1
a n d t h e variance-covariance e s t i m a t o r for {O.t.Oi,^ is
\ =
( i - y , . ) 00 ( i - t f . )
w i t h d„ = tfu + Oh.
If we have missing data and impute using the approximate Bayesian bootstrap method proposed by Rubin and Schenker (1986), then we use
M { ' ) 0..M = . 1 / : i (=1
4
, u = (=12(i
WIKTC
jc) _ { Y . y j i \ , = " ) + T .v;"7(.v, = </
1=1 i = r+ 1 ijU) "h.i ==
E
> • ; " ' / ( . V , = b ) 1=1 l = r+l a n d t l u ' varinii(»'-c-u\ariaiu-(' e s t i m a t o r for is/
l \ , = ; r V\/ + (1 + i (1 + .\/ ' I Ihi.ih + M ) li\i :rr\/ + (1 +wluTf I \/ is dcliiK'd in ( l. l'i) a n d
m I h l : . = I h l . i h = Ihi.bh — •*' * f = | (=1
riuMi. since tiie resj)onileiits a r e a siinpli- r a n d o m sam|)U' of size r .
/
K { T . . ) = 0 s ( 1 - 0 S
_L -L ^ y i (" i i - ' . i ) () i
"u 'T, V "u / '" V ".i / \ / >•
C "II-'•u ^ f ^ i ±. J . J . f '•l.-' l.y' i \ V / V " b / ' • " f . " i l V " ( . / >• ljUt I '«;•
/
\
V / = t f v ( l 0 , s -£u.+ i ^ "f, r V ri„ y r V M„ m j i ( l _ + i [ i _ ( i t ) -r \ lia " h / "f, r \ " 6 /Hence. overestimates t h e variances of a n d tf5,x. u n d e r s t i m a t e s t h e c o v a r i a n c e o f
Rubin (1996) explained that if the auxiliary variable $X$ is not used by the imputer to create the multiple imputation, but is used by the ultimate analyst to define estimands, then the imputation may not be proper. In the above example, the ABB imputation is proper for $\theta = \theta_a + \theta_b$, but is improper for $\theta = \theta_a - \theta_b$. Rubin (1996) argues that the multiple imputation is still confidence-proper in the sense that it produces a variance estimate that is too large.
5 Quasi-Randomization Approach
The randomization approach, which treats the population vector $Y$ as fixed, has played a dominant role in the design and analysis of sample surveys. Randomization inference requires that units be selected by probability sampling, which is characterized by the following two properties:
1. The sampling distribution is determined by the sampler before any $y$ values are known.
2. Every unit has a positive (known) probability of selection.
The key ingredient of the randomization approach, a known probability of selection, is lost when some of the data are missing.
Given the existence of nonrespondents, one approach is to regard the respondents as the second phase sample in a two-phase sample design. This is made possible by treating the $R_i$'s as random variables. It is necessary to specify a probability model for $R$. The randomization version of inference for imputation is called the quasi-randomization approach, a term suggested by Oh and Scheuren (1983).
There are two main differences between the sampling distribution and the response distribution in the quasi-randomization approach. First, the sampling distribution is determined by the sampler before any observations are taken. On the other hand, we may have different response mechanisms for different items. Second, the sampling distribution is known, under the control of the survey statistician. On the other hand, the response distribution is unknown and needs some form of modelling.
If the finite population is partitioned into $G$ imputation cells, the usual quasi-randomization approach assumes the following response mechanism.
(R.1) For each cell $g = 1, \ldots, G$, all items $\{R_i;\ i \in U_g\}$ are assumed to have the same response probability, $p_g = \Pr(R_i = 1 \mid i \in U_g)$, where $U_g$ denotes the set of indices for the $g$-th imputation cell.
(R.2) For every $i = 1, \ldots, N$, $\Pr(R_i = 1) > 0$.
We will call this the uniform response mechanism in the imputation cell.
Rao and Shao used weighted hot deck imputation, which selects donors with replacement with the probability of selection being proportional to the sampling weights. This produces an unbiased estimator under assumptions (R.1) and (R.2). While the procedure is often said to be design unbiased, unbiasedness requires the response probability model assumptions (R.1) and (R.2).
The adjusted jackknife variance estimator for weighted hot deck imputation, proposed by Rao and Shao, is constructed by changing every imputed value for the
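The donor selection step of weighted hot deck imputation can be sketched as follows for a single imputation cell. The function names and the data are hypothetical. The first function draws donors with replacement with probability proportional to the sampling weights. The second computes, exactly, the expectation of the imputed weighted mean over the donor selection alone; that expectation equals the weighted mean of the respondents, which is the starting point for the unbiasedness argument under the response model.

```python
import random
from fractions import Fraction


def draw_weighted_donors(w_resp, num_recipients, rng):
    """Select donor indices with replacement, with probability
    proportional to the respondents' sampling weights."""
    return rng.choices(range(len(w_resp)), weights=w_resp, k=num_recipients)


def expected_imputed_mean(y_resp, w_resp, w_nonresp):
    """Expectation, over donor selection only, of the weighted mean
    computed from observed and imputed values (one imputation cell)."""
    w_r = sum(w_resp)
    w_m = sum(w_nonresp)
    # Each imputed value has expectation sum(w_j * y_j) / sum(w_j).
    expected_value = Fraction(sum(w * y for w, y in zip(w_resp, y_resp)), w_r)
    observed_total = sum(w * y for w, y in zip(w_resp, y_resp))
    return (observed_total + w_m * expected_value) / (w_r + w_m)


y_resp = [2, 5, 9]         # hypothetical respondent values
w_resp = [10, 20, 30]      # hypothetical respondent weights
w_nonresp = [15, 25]       # hypothetical nonrespondent weights

donors = draw_weighted_donors(w_resp, len(w_nonresp), random.Random(7))
expected = expected_imputed_mean(y_resp, w_resp, w_nonresp)
weighted_resp_mean = Fraction(
    sum(w * y for w, y in zip(w_resp, y_resp)), sum(w_resp)
)
```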
jackknife replicatc* when a r e s p o n d e n t is deleted. I h e v a r i a n c e e s t i m a t o r c a n b e w r i t t e n a s I. 2 (••i.i;}) where r; + (I - f i j ) { ' h + - .7, (o.-l-l) J=1 j€.lnr. with 1 J •' (o.-io)