A COMPREHENSIVE SOFTWARE PACKAGE FOR SURVEY DATA ANALYSIS

(1)

SUDAAN: A C O M P R E H E N S I V E S O F T W A R E PACKAGE FOR SURVEY DATA A N A L Y S I S Lisa M. LaVange, B a b u b h a i V. Shah, Beth G. B a r n w e l l and Joyce F. K i l l i n g e r

Lisa M. LaVan~e. R e s e a r c h T r i a n g l e Institute KEYWORDS: v a r i a n c e estimation, user interface,

C language A B S T R A C T

The d e v e l o p m e n t of a survey data analysis software p a c k a g e that is flexible enough to meet the needs of survey statisticians, as w e l l as research scientists not n e c e s s a r i l y schooled in survey sampling, is u n d e r w a y at the R e s e a r c h Triangle Institute. The system was d e s i g n e d to provide s i g n i f i c a n t e n h a n c e m e n t s to RTI's existing software package, SUDAAN. The n e w SUDAAN system is w r i t t e n in the C language and provides greater e f f i c i e n c y and portability. The software d e v e l o p m e n t strategy consists of d e v e l o p i n g and d e b u g g i n g stand alone s t a t i s t i c a l p r o c e d u r e s on an IBM PC w i t h e x p a n d e d capacity. The procedures are then i m p l e m e n t e d in two a d d i t i o n a l c o m p u t i n g environments, V A X / V M S and IBM/0S. Variance e s t i m a t i o n options and data analysis techniques not p r e v i o u s l y available in the RTI software package are among the m a n y m o d i f i c a - tions i n c o r p o r a t e d into the n e w SUDAAN system. The purpose of this paper is to describe the design and d e v e l o p m e n t of this system. I N T R O D U C T I O N

Scientists at the R e s e a r c h Triangle Institute have been i n v o l v e d in designing, developing, and m a i n t a i n i n g software systems for the analysis of survey data for the past 17 years. Recently, RTI has been under contract to the Public H e a l t h Service (PHS) to develop a c o m p r e h e n s i v e software package that meets the needs of statistical analysts at the N a t i o n a l Center for H e a l t h Statistics (NCHS) and PHS. In particular, this package must be flexible enough to handle most of the statistical designs used by these agen- cies. For this purpose, we have e m b a r k e d on the task of d e v e l o p i n g a system that incorporates many of the features of RTI's e x i s t i n g survey data analysis system but also includes signifi- cant enhancements. The purpose of this paper is to describe in detail the design and d e v e l o p m e n t of this c o m p r e h e n s i v e software package. The immediate ojective of the SUDAAN system design was to develop a series of p r o c e d u r e s capable of p e r f o r m i n g s t a t i s t i c a l data analysis for complex sample surveys. The u l t i m a t e objective was to have a system that could aid in the design and e v a l u a t i o n of n e w s t a t i s t i c a l techniques.

RTI's e x i s t i n g software system consists of a series of p r o c e d u r e s that produce statistical analyses in the form of w e i g h t e d c r o s s - t a b u l a - tions, g e n e r a l i z e d ratio estimators, and linear and logistic regressions. All of the p r o c e d u r e s were c o n s t r u c t e d u s i n g a u n i f i e d set of F O R T R A N

subroutines that execute in the SAS c o m p u t i n g environment on an IBM mainframe. In the course of d e v e l o p i n g a n e w system, we w e n t through the process of r e t h i n k i n g our current design. Four goals emerged:

The system design for the c o m p r e h e n s i v e survey data analysis software, SUDAAN, evolved w i t h these goals in mind. The first goal is

being met by w r i t i n g the n e w software entirely in the C p r o g r a m m i n g language and testing simul- taneously on several systems. Compilers for the C language are a v a i l a b l e on a v a r i e t y of operat- ing systems. The initial programs are c u r r e n t l y executing under the TURBO C compiler on an IBM personal computer, the SVC c o m p i l e r on a 68020, and under the U N I X o p e r a t i n g system on a GOULD computer. V e r s i o n s are also being tested on the VAX/VMS and IBM/VS.

The second goal is being met by b u i l d i n g checks into the s y s t e m to assure that defini- tions of objects are m e a n i n g f u l and c o m p u t a t i o n s are feasible. We have i m p l e m e n t e d n u m e r i c a l l y stable p r o c e d u r e s for the a c c u m u l a t i o n of sums of squares and cross products and for m a t r i x operations, i n c l u d i n g m a t r i x inversion.

For a c h i e v i n g the third goal of improved efficiency, special care has been taken in the design to isolate tasks that are repeated numer- ous times, such as a n u m e r i c a l c a l c u l a t i o n on each v a r i a b l e of each record. These sections of code are o p t i m i z e d to the extent possible with- out using a s s e m b l y language. Some of the opti- m i z i n g algorithms include p r o c e s s i n g of sparse matrices and vectors, r e p l a c e m e n t of m u l t i p l i c a - tion by table lookup w i t h addition, e l i m i n a t i o n of repetitive (in the loop) "if" tests by c o n s t r u c t i n g separate do loops, and many others.

An important p r o d u c t of the SUDAAN design effort has b e e n the c r e a t i o n of a h i g h level statistical p r o g r a m m i n g language. The SUDAAN language has been d e v e l o p e d to provide the s t a t i s t i c i a n / p r o g r a m m e r w i t h easy access to the software package, thus a c c o m p l i s h i n g the fourth goal listed above. S t a t i s t i c a l procedures are currently being w r i t t e n in the SUDAAN language. Once they are finalized, some parts may be c o n v e r t e d to C programs to m a x i m i z e c o m p u t i n g efficiency.

The SUDAAN design allows for four types of user interface as d e p i c t e d in Figure I. The general user w i l l be able to access the system through user friendly procedures. U l t i m a t e l y the p r o c e d u r e s w i l l be available through a statistical system, such as SAS, but the current design consists of stand-alone procedures that interface w i t h a standard ASCII file format. The more s o p h i s t i c a t e d user can w r i t e SUDAAN programs for c o m p u t a t i o n s not available in the procedure library. This access feature is p a r t i c u l a r l y well suited for testing m o d i f i c a - tions to e x i s t i n g procedures, such as the calcu- lation of an a l t e r n a t i v e test statistic. A C p r o g r a m m e r can access the system directly through C c a l l a b l e functions.

The key features of the SUDAAN system are d e s c r i b e d in detail in the f o l l o w i n g sections. RTI SURVEY D A T A A N A L Y S I S P R O C E D U R E S P o r t a b i l i t y R e l i a b i l i t y / n u m e r i c a l a c c u r a c y C o m p u t a t i o n a l e f f i c i e n c y Ease of m o d i f i c a t i o n / e n h a n c e m e n t . B A C K G R O U N D The c o l l a b o r a t i o n of s t a t i s t i c i a n s and

(2)

computer scientists at RTI, b e g i n n i n g in 1971 (e.g., Folsom, Bayless, and Shah, 1971) and continuing today, has r e s u l t e d in a series of programs for survey data analysis that reflects s t a t e - o f - t h e - a r t m e t h o d o l o g y in this field. All of RTI's analysis packages employ the Taylor series or Delta m e t h o d of v a r i a n c e e s t i m a t i o n for complex sample survey designs (Folsom, 1974). Each p r o g r a m c u r r e n t l y m a r k e t e d is designed to run w i t h i n the SAS framework.

RTI is c u r r e n t l y m a r k e t i n g the following procedures for general use:

• S E S U D A A N • R A T I O E S T • R T I F R E Q S • S U R R E G R " RTILOGIT.

The S E S U D A A N (Shah, 1981) procedure produces w e i g h t e d estimates of means, totals, and propor-

tions and their standard errors. R A T I O E S T (Shah, 1982) calculates w e i g h t e d frequency distributions and p e r c e n t a g e s for c r o s s - t a b u l a - tions.

RTI's survey r e g r e s s i o n package, SURREGR (Holt, 1977), provides estimates of linear regression coefficients, their e s t i m a t e d v a r i a n c e - c o v a r i a n c e matrix, and tests of hypo- theses c o n c e r n i n g the c o e f f i c i e n t s that are appropriate for a c o m p l e x sample design. Logis- tic r e g r e s s i o n c a p a b i l i t i e s became available w i t h the d e v e l o p m e n t of R T I L O G I T (Shah, et al., 1984). This p r o c e d u r e runs in c o n j u n c t i o n w i t h SAS PROC LOGIST and produces pseudo m a x i m u m likelihood estimates of the p a r a m e t e r s and tests of h y p o t h e s e s about these parameters.

New System P r o c e d u r e s

The SUDAAN s y s t e m design consists of two phases of software development, each resulting in a series of statistical procedures. The Phase I p r o c e d u r e s are p r i m a r i l y c o n c e r n e d w i t h descriptive data analysis, including cross-tabu- lations, g e n e r a l i z e d ratio estimation, and quantile estimation. The Phase II design, currently in the p l a n n i n g stages, will incorpor- ate several a n a l y t i c a l procedures, including linear and l o g - l i n e a r m o d e l i n g and survivorship analysis of survey data.

Several features are available under the n e w SUDAAN system that were not p r e v i o u s l y available w i t h RTI software. Key among these are m u l t i p l e variance options. Sampling v a r i a n c e s are computed for one of three design options invoked by the user. These are:

• W i t h r e p l a c e m e n t sampling at the first stage

• Simple r a n d o m sampling at each stage (with or w i t h o u t replacement)

• W i t h o u t replacement, unequal p r o b a b i l i t y sampling at the first stage and simple r a n d o m sampling (with or w i t h o u t replacement) at subsequent stages.

The second o p t i o n requires the input of stratum specific p o p u l a t i o n and sample counts for each stage. In a d d i t i o n to these counts, the third o p t i o n requires the input of joint p r o b a b i l i t i e s of s e l e c t i o n for each pair of primary sampling units w i t h i n a stratum. These

quantities are used to produce Y a t e s - G r u n d y - S e n variance estimates for this design option. The three v a r i a n c e options cover a wide range of sample designs i n c l u d i n g sampling w i t h probabil- ities p r o p o r t i o n a l to size at the first stage. Unequal p r o b a b i l i t i e s of s e l e c t i o n at a subsequent stage can be h a n d l e d by forming pseudo strata at that stage since p r o b a b i l i t i e s of selection can v a r y from s t r a t u m to stratum. In addition to these v a r i a n c e options, the user may request that the v a r i a n c e - c o v a r i a n c e m a t r i x for estimates w i t h i n a table be c o m p u t e d for all procedures a l l o w i n g for c r o s s - t a b u l a t i o n s . Estimates of v a r i a n c e - c o v a r i a n c e m a t r i c e s were p r e v i o u s l y a v a i l a b l e only in the SURREGR and RTILOGIT procedures.

Phase I software consists of three procedures: CROSSTAB, DESCRIPT, and RATIO. The CROSSTAB p r o c e d u r e is the R T I F R E Q S e q u i v a l e n t in the n e w SUDAAN system. In a d d i t i o n to c o m p u t i n g w e i g h t e d f r e q u e n c y and p e r c e n t a g e distributions, Chi-square tests of i n d e p e n d e n c e for a two-way c o n t i n g e n c y table are produced. The tests are computed using a W a l d statistic a c c o r d i n g to methods a d v o c a t e d by Koch, Freeman, and F r e e m a n

(1976). D u r i n g Phase II, the R a o - S c o t t S a t t e r t h w a i t e c o r r e c t e d test statistic (Thomas and Rao, 1984 and 1985) w i l l be c o m p u t e d in addition to the W a l d statistic for goodness of fit tests. This statistic has been shown by Thomas and Rao to have the best c h a r a c t e r i s t i c s w i t h respect to both power and significance level w h e n c o m p a r e d w i t h other statistics c o r r e c t i n g for the liberalness of the W a l d test.

The D E S C R I P T p r o c e d u r e computes estimates of means, totals, proportions, geometric means, quantiles, and their sampling errors. Estimates can be r e q u e s t e d for the subdomains of user- specified c r o s s - t a b u l a t i o n s . Options are also available to the user for s t a n d a r d i z e d and/or p o s t - s t r a t i f i e d means and p r o p o r t i o n s and their sampling errors. The s t a n d a r d i z i n g and/or post- stratifying v a r i a b l e s and their d i s t r i b u t i o n s are required as input.

RTI has recently c o m p l e t e d a s i m u l a t i o n study comparing two m e t h o d s for e s t i m a t i n g the variances of quantiles for sample survey data

(Wheeless, et al., 1988). The two methods c o n s i d e r e d w e r e (i) the one p r o p o s e d by Fuller and F r a n c i s c o (1986) and c u r r e n t l y used in the PC CARP software package and (2) a direct l i n e a r i z a t i o n m e t h o d p r o p o s e d by RTI. The simulation also c o m p a r e d two m e t h o d s of e s t i m a t i n g quantiles themselves. Based on the results of this study, the direct l i n e a r i z a t i o n m e t h o d is currently being p r o g r a m m e d for use in the DESCRIPT procedure.

The user can request estimates of linear contrasts among the d o m a i n estimates p r o d u c e d by DESCRIPT. Four types of c o n t r a s t specifications are available. The user can specify a general linear contrast, all p o s s i b l e pairs of differences among levels of a d o m a i n variable, the differences b e t w e e n levels of a domain v a r i a b l e and the m a r g i n a l value, and p o l y m o n i a l effects for an ordinal d o m a i n variable.

The RATIO p r o c e d u r e computes the estimates of g e n e r a l i z e d ratios for survey data. The user can specify n u m e r a t o r and d e n o m i n a t o r variables that are c o n t i n u o u s or categorical. In the latter case, levels of the v a r i a b l e s for w h i c h

(3)

ratios are to be e s t i m a t e d are required. This procedure, as the D E S C R I P T procedure, can be e x e c u t e d in one of three modes: basic estimation, s t a n d a r d i z e d estimation, or p o s t - s t r a t i - fied estimation.

The c u r r e n t p l a n for Phase II software development c o n s i s t s of four s t a t i s t i c a l p r o c e d u r e s w r i t t e n in the n e w S U D A A N framework. The first of these is a g e n e r a l c a t e g o r i c a l data analysis p r o c e d u r e that w i l l offer l o g - l i n e a r m o d e l i n g c a p a b i l i t i e s for survey data analysis. This p r o c e d u r e w i l l be a survey data a n a l o g u e to the SAS p r o c e d u r e CATMOD. P r o c e d u r e s two and three w i l l be linear and logistic r e g r e s s i o n proce-

dures, d e v e l o p e d b a s e d on the e x i s t i n g procedures S U R R E G R and LOGIST. E a c h of these w i l l be r e p r o g r a m m e d in the n e w S U D A A N s y s t e m and v a r i o u s e n h a n c e m e n t s added. E n h a n c e m e n t s to the linear r e g r e s s i o n p r o c e d u r e are b e i n g p l a n n e d for s t o c h a s t i c r e g r e s s i o n c o e f f i c i e n t s m o d e l analysis of survey data. Software has b e e n d e v e l o p e d at RTI u n d e r a c o n t r a c t to the N a t i o n a l I n s t i t u t e for C h i l d H e a l t h and H u m a n D e v e l o p m e n t for this p r o b l e m (LaVange, et al., 1988) that w i l l be i n c o r p o r a t e d into the SUDAAN system. Fourth, a g e n e r a l p r o c e d u r e for sur- vival data a n a l y s i s is p l a n n e d that w i l l a l l o w the a n a l y s t to p e r f o r m K a p l a n - M e i r e s t i m a t i o n and Cox p r o p o r t i o n a l h a z a r d m o d e l e s t i m a t i o n using c o m p l e x sample survey data.

SUDAAN S Y S T E M D E S I G N

The i m m e d i a t e o b j e c t i v e of the SUDAAN s y s t e m design was to d e v e l o p a series of p r o c e d u r e s capable of p e r f o r m i n g s t a t i s t i c a l data analysis for c o m p l e x sample surveys. The u l t i m a t e objective was to have a s y s t e m that could aid in the design and e v a l u a t i o n of n e w s t a t i s t i c a l techniques. The s y s t e m d e s i g n for S U D A A N consists of three layers:

(i) PROCs or p r o c e d u r e s similar to SAS or BMDP p r o c e d u r e s

(2) SUDAAN data s t r u c t u r e s and functions o p e r a t i n g on these data structures (3) A s y s t e m level i n t e r p r e t e r for (i) and

(2).

These layers c o r r e s p o n d to d i f f e r e n t views of the system. S U D A A N m a y be p e r c e i v e d quite d i f f e r e n t l y by d i f f e r e n t groups of users. In this section we discuss these views and describe in detail the m a j o r c o m p o n e n t s of the SUDAAN system.

S y s t e m Views

A s u b s t a n t i v e r e s e a r c h e r or data analyst w i l l v i e w SUDAAN as a c o l l e c t i o n of s t a t i s t i c a l procedures for a p p l y i n g a l t e r n a t i v e s t a t i s t i c a l t e c h n i q u e s for survey data analysis. A statistician m a y v i e w it as a c o n v e n i e n t tool for e v a l u a t i n g n e w s t a t i s t i c a l formulas or analysis techniques. S U D A A N allows for easy c o n v e r s i o n of formulas into e x e c u t a b l e statements, thereby e n a b l i n g n e w p r o c e d u r e s or e n h a n c e m e n t s to e x i s t i n g p r o c e d u r e s to be p r o g r a m m e d q u i c k l y and efficiently.

A s y s t e m d e s i g n e r m a y v i e w it as a compiler, w r i t t e n in C, for a v e r y h i g h level language in order to specify s t a t i s t i c a l analysis tasks. For the c o m p u t e r scientist, the d e v e l o p m e n t

effort c o n s i s t e d of d e s i g n i n g an o p t i m i z i n g c o m p i l e r or i n t e r p r e t e r w i t h the p o t e n t i a l for global o p t i m i z a t i o n of all a v a i l a b l e resources. The S U D A A N p r o g r a m defines only w h a t is to be done, not how. The s y s t e m is free to select any of the p o s s i b l e sequences of o p e r a t i o n s in order to achieve the u s e r - d e f i n e d result.

The PROC i n t e r f a c e is in SUDAAN for the c o n v e n i e n c e of the u s e r w h o r e p e a t e d l y applies the same p r o c e d u r e s to d i f f e r e n t datasets or d i f f e r e n t v a r i a b l e s w i t h i n the same dataset. The systems p r o g r a m m e r ' s v i e w represents one of the m a n y p o s s i b l e c o m p i l e r designs for the i m p l e m e n t a t i o n of SUDAAN.

However, the p r i m a r y force b e h i n d the S U D A A N design is the s t a t i s t i c i a n ' s view, n a m e l y a tool for t r a n s l a t i n g m a t h e m a t i c a l formulas into a c o m p u t e r p r o g r a m capable of p r o d u c i n g n u m e r i c a l results for a given data c o n f i g u r a t i o n . The s t a t i s t i c i a n n e e d not specify h o w a p a r t i c u l a r e s t i m a t o r is to be calculated. R a t h e r his focus is on w h a t is to be calculated. For the statistician, SUDAAN is a "nonprocedural" language. Given the d e f i n i t i o n of a s t a t i s t i c a l q u a n t i t y in terms of a m a t h e m a t i c a l formula, the s y s t e m determines the c o m p u t a t i o n a l algorithm. Users have no k n o w l e d g e of the r e p r e s e n t a t i o n of the data structures in m e m o r y or on an i n p u t / o u t p u t device and no c o n t r o l of the p r o g r a m flow. In this sense the SUDAAN language is v i e w e d as d e f i n i t i o n a l and not p r o c e d u r a l .

The basic elements of the s y s t e m are as follows:

• S t a t i s t i c a l data objects and a b s t r a c t data s t r u c t u r e s

• F u n c t i o n s of and o p e r a t i o n s to data structures

• F u n c t i o n s to input, output, and print data structures.

Each of these is d e s c r i b e d in turn in the following subsections.

S t a t i s t i c a l Data S t r u c t u r e s

An example of a s t a t i s t i c a l data structure is the dataset r e s u l t i n g from a n a t i o n a l h o u s e h o l d survey. Such a d a t a s e t m i g h t consist of records c o n t a i n i n g i n f o r m a t i o n on a sample of indivi- duals, each s e l e c t e d from a sample of h o u s e h o l d s w i t h i n e n u m e r a t i o n districts, counties, and

states. The actual d a t a s e t contains a record for each individual, c o n s i s t i n g of v a r i o u s attributes and item responses for that individual and h o u s e h o l d . E a c h record is i d e n t i f i e d by several h i e r a r c h i c a l i d e n t i f i e r s c o r r e s p o n d - ing to the stages of sample selection: state, county, e n u m e r a t i o n district, household, and individual. An example of a d i f f e r e n t type of data structure is a summary dataset of county- specific p o p u l a t i o n counts of i n d i v i d u a l s by c h a r a c t e r i s t i c s , such as race and sex. Such a dataset is a c o l l e c t i o n of two d i m e n s i o n a l tables or m a t r i c e s w i t h the two h i e r a r c h i c a l identifiers state and county.

In SUDAAN, s t a t i s t i c a l data structures are d e f i n e d as c o l l e c t i o n s of m u l t i d i m e n s i o n a l arrays w i t h n e s t e d h i e r a r c h i c a l identifiers. These data s t r u c t u r e s are r e f e r r e d to as B a l a n c e d A r r a y Trees (BATs). E a c h of the datasets d e s c r i b e d above is an example of a BAT, the

(4)

first w i t h five n e s t e d i d e n t i f i e r s and the s e c o n d w i t h only two. The S U D A A N l a n g u a g e also c o n s i s t s of a series of f u n c t i o n s and m a t h e - m a t i c a l o p e r a t o r s that are u s e d to input and o u t p u t BATs and to d e f i n e BATs as f u n c t i o n s of BATs and o t h e r input objects. The S U D A A N l a n g u a g e i t s e l f is a m a t h e m a t i c a l l y c l o s e d set of BATs w i t h r e s p e c t to these f u n c t i o n s . A p p l y - ing a f u n c t i o n to a BAT y i e l d s a n o t h e r BAT. The level of n e s t e d i d e n t i f i e r s m a y change, but one or m o r e BATs a l w a y s result.

S U D A A N P r o g r a m E x a m p l e s

In o r d e r to i l l u s t r a t e the S U D A A N l a n g u a g e concepts, c o n s i d e r the p r o b l e m of c o m p u t i n g the b e t w e e n p r i m a r y s a m p l i n g u n i t (PSU), w i t h i n s t r a t u m m e a n s q u a r e for an e s t i m a t e d p o p u l a t i o n total. Let y(hijk) d e n o t e the o b s e r v e d v a l u e s of the v a r i a b l e y c o l l e c t e d f r o m a n e s t e d sample d e s i g n for the k t h i n d i v i d u a l in the j t h h o u s e - hold, ith PSU, and h t h stratum. The f i l e n a m e

"EXAMPLEI" r e f e r s to an A S C I I file w i t h the n e s t e d sample i d e n t i f i e r s i n c l u d e d as v a r i a b l e s one to four on the file. The v a r i a b l e y is the fifth v a r i a b l e on the file. The input file is s t o r e d in a s t a n d a r d file f o r m a t w i t h a c o d e b o o k d e s c r i b i n g the v a r i a b l e s and a s s o c i a t e d levels. V a r i a b l e labels or formats are a u t o m a t i c a l l y read in and u s e d for p r i n t i n g . In the f o l l o w i n g statements, X refers to the BAT c o n s i s t i n g of these data v a l u e s . The p r o g r a m uses the summa- tion function, SIGMA, to c o m p u t e totals of y at v a r i o u s levels. The first a r g u m e n t of S I G M A refers to the BAT c o n t a i n i n g data v a l u e s to be summed, w h i l e the s e c o n d a r g u m e n t refers to the n e s t i n g level up to w h i c h the s u m m i n g takes place. This a r g u m e n t e q u a l s the n u m b e r of n e s t e d i d e n t i f i e r s in the r e s u l t i n g BAT that c o n t a i n s the sums. C o m m e n t s , d e l i m i t e d by the strings "/*" and "*/", f u r t h e r c l a r i f y the logic of the f o l l o w i n g p r o g r a m s t a t e m e n t s s h o w n in F i g u r e 2.

Since the S U D A A N l a n g u a g e deals w i t h s y m b o l i c d e f i n i t i o n only, the above s t a t e m e n t s can e a s i l y be m o d i f i e d to o b t a i n the m e a n square errors a s s o c i a t e d w i t h m o r e t h a n one v a r i a b l e . R e p l a c - ing the s e c o n d s t a t e m e n t w i t h the f o l l o w i n g results in the c a l c u l a t i o n of m e a n square errors for v a r i a b l e s s t a t e m e n t 5, 6, 9, and i0 in this example.

X = S E L E C T ( R E C O R D S , V E C T O R ( 5 , 6 , 9 , 1 0 ) ) ; Next, c o n s i d e r the p r o b l e m of c o m p u t i n g m e a n square errors for e s t i m a t e d p o p u l a t i o n counts c o r r e s p o n d i n g to the cells of a m u l t i d i m e n s i o n a l table. In the f o l l o w i n g e x a m p l e s t a t e m e n t s , T A B L E S refers to a BAT c o n s i s t i n g of a t w o - w a y and a t h r e e - w a y table. The first is the cross- c l a s s i f i c a t i o n of v a r i a b l e s 5 and 6 on the input file and the s e c o n d is the c r o s s - c l a s s i f i c a t i o n of v a r i a b l e s 5, 9, and i0. The E F F E C T S f u n c t i o n c o n v e r t s c a t e g o r i c a l v a r i a b l e s w i t h s p e c i f i e d n u m b e r s of levels into d u m m y (0,i) v e c t o r s . The C R O S S P f u n c t i o n d e f i n e s the t a b l e s as cross p r o d u c t s of these e f f e c t v e c t o r s . S I G M A W G returns two a r g u m e n t s , the u n w e i g h t e d and w e i g h t e d sums for e a c h cell in the tables. Once X S U M 2 is defined, the p r o g r a m s t a t e m e n t s in the above e x a m p l e f o l l o w to c o m p u t e the m e a n square errors. The r e q u i r e d s t a t e m e n t s are shown in F i g u r e 3.: S U D A A N P r o c e d u r e s The t a s k of d e v e l o p i n g n e w p r o c e d u r e s in S U D A A N is f a i r l y s t r a i g h t f o r w a r d . The p r o c e s s i n v o l v e s w r i t i n g a S U D A A N p r o g r a m and a s s o c i a t - ing o b j e c t s in the p r o g r a m w i t h a w e l l d e f i n e d set of k e y w o r d s . The u s e r - i n p u t to a P R O C c o n s i s t s of a c o m m a n d level syntax, such as that u s e d in I B M / T S 0 or SAS. The P R O C p r o c e s s o r i n t e r p r e t s the u s e r i n p u t and g e n e r a t e s the n e c e s s a r y s t a t e m e n t s for the S U D A A N l a n g u a g e parser. In short, w r i t i n g a S U D A A N p r o c e d u r e is

l

e q u i v a l e n t to d e f i n i n g a p r e p r o c e s s o r for a S U D A A N p r o g r a m .

The m a j o r a d v a n t a g e of this a p p r o a c h is that the u s e r has a S U D A A N p r o g r a m a v a i l a b l e for e a c h p r o c e d u r e that serves as d e t a i l e d d o c u m e n t a t i o n . The u s e r can see the t r a n s l a t i o n of input to o u t p u t and the series of c a l c u l a t i o n s that are p e r f o r m e d once the p r o c e d u r e has b e e n called. M o d i f i c a t i o n s to the p r o c e d u r e can be m a d e by m o d i f y i n g the a c c o m p a n y i n g S U D A A N p r o g r a m w i t h - out w r i t i n g a c o m p l e t e l y n e w p r o c e d u r e . The source for all of the S U D A A N p r o c e d u r e s w i l l be a v a i l a b l e to the user.

C O N C L U D I N G R E M A R K S

The u n i q u e f e a t u r e of S U D A A N is the e n t i t y d e f i n e d as a BAT. M a t h e m a t i c a l l y , a BAT represents a f i n i t e s e q u e n c e of m u l t i d i m e n s i o n a l arrays. The set of BATs r e p r e s e n t s a c l o s e d set w i t h r e s p e c t to b i n a r y o p e r a t i o n s and f u n c t i o n s w h o s e a r g u m e n t s are BATs. The f u n c t i o n a l s y n t a x is "natural" for a m a t h e m a t i c a l s t a t i s t i c i a n w h o can use the S U D A A N l a n g u a g e to d e f i n e the BATs of interest.

A S U D A A N u s e r w i l l h a v e the S U D A A N source code a v a i l a b l e as p a r t of the d o c u m e n t a t i o n . By r e f e r r i n g to the S U D A A N p r o g r a m for a statistical p r o c e d u r e , the a n a l y s t w i l l be able to d e t e r m i n e e x a c t l y w h a t u n d e r l y i n g m a t h e m a t i c a l formulas are e m p l o y e d . W h e n s t a t i s t i c a l r e s e a r c h i n d i c a t e s n e w or i m p r o v e d f o r m u l a s for a g i v e n a n a l y s i s , the f o r m u l a s can be r e a d i l y t r a n s l a t e d into the S U D A A N l a n g u a g e to y i e l d a w o r k i n g c o m p u t e r p r o g r a m .

In w r i t i n g a S U D A A N program, the u s e r only defines the c o m p u t a t i o n a l o b j e c t s t h r o u g h function cells. The u s e r does not n e e d to s p e c i f y h o w the c a l c u l a t i o n s w i l l be c a r r i e d out. I n c r e a s e d r e l i a b i l i t y is a c h i e v e d t h r o u g h checks b u i l t into the s y s t e m to a s s u r e that the defini- tions are m e a n i n g f u l and that the o b j e c t s

r e q u i r e d as o u t p u t can be g e n e r a t e d . The s y s t e m is free to s e l e c t the c o m p u t a t i o n a l a l g o r i t h m that is o p t i m a l g i v e n the d a t a s e t and a v a i l a b l e c o m p u t i n g r e s o u r c e s . The s y s t e m also p r o v i d e s t r e m e n d o u s f l e x i b i l i t y w i t h r e s p e c t to input and output. A l l i n p u t / o u t p u t o b j e c t s are d e f i n e d as BATs. A n y i n t e r m e d i a t e BAT can be s a v e d or d i s p l a y e d at any time in the p r o g r a m .

The use of BATs, w i t h the a b o v e s t a t e d advan- tages, has some r e s t r i c t i o n s and l i m i t a t i o n s . The u s e r is l i m i t e d to the BAT f u n c t i o n s

c u r r e n t l y i m p l e m e n t e d , and all c o m p u t a t i o n s m u s t be r e p r e s e n t e d as BATs or f u n c t i o n s thereof.

It is p o s s i b l e to test n e w f u n c t i o n s

i n d e p e n d e n t l y b e f o r e i n c o r p o r a t i n g t h e m into the system, t h e r e b y g r e a t l y f a c i l i t a t i n g the addition of f u n c t i o n s to the s y s t e m l i b r a r y on an a s - n e e d e d basis.

(5)

Computations that cannot be represented as

BATs will not be feasible in SUDAAN. The user

must define all computations as functions of

input BATs or data objects. The SUDAAN system

is therefore not for general programming use, but should prove reasonably adequate for statistical analysis, and in particular, survey data analysis.

A major portion of the overall software development effort at RTI has been channeled

into the SUDAAN system design. While this

strategy has slowed the development of the statistical procedures initially, we are antici- pating a tremendous savings in the long run due to the ease w i t h w h i c h each procedure can be

designed, implemented, and debugged. The end

result of this effort will be not only a user- friendly software package that runs efficiently but also a valuable aid to the statistician for enhancing existing procedures and planning for additional procedures as n e w methodology becomes available.

REFERENCES

COX, D. R. (1972). "Regression Models and Life

Tables" (with discussion), Journal of the

Royal Statistical Society B, 34: 187-220.

FOLSOM, R. E., BAYLESS, D. L., and SHAH, B. V.

(1971). "Jackknifing for Variance

Components in Complex Sample Survey

Designs." Proceedings of the American

Statistical Association: 36-39.

FOLSOM, R. E. (1974). "National Assessment

Approach to Sampling Error Estimation.

Sampling Error Monograph. Prepared for the

National Assessment of Educational Progress.

FRANCISCO, C. A. and FULLER, W . A . (1986)

"Estimation of the Distribution Function

with Data from a Complex Survey." Presented

at the American Statistical Association Meetings, Chicago, Illinois.

HOLT, M. M. (1977). SURREGR: Standard Errors

of Regression Coefficients from Sample

Survey Data. Research Triangle Institute,

Research Triangle Park, NC 27709.

KAPLAN, E. L. and MEIER, P. (1958).

"Nonparametric Estimation from Incomplete

Observations." J. of the American

Statistical Association, 53: 457-481.

KOCH, G. G., FREEMAN, D. H., JR. and FREEMAN, J.

L. (1975). "Strategies in the Multivariate

Analysis of Data from Complex Surveys." Inter. Stat. Rev. 43, 59-78.

LAVANGE, L. M., PFEFFERMANN, D., SHAH, B. V.,

and GABEL, T. J. (1988). "The Analysis of

Relationships Between Growth and Dietary

Intake Using the NHANES Dataset." Final

Project Report. Research Triangle

Institute, Research Triangle Park, NC 27709.

SHAH, B. V., HOLT, M. M. and FOLSOM, R. E. (1977), "Inference about Regression Models

from Sample Survey Data." Proceedings of

the 3rd Meeting, International Association of Survey Statisticians, New Delhi.

SHAH, B . V . (1981). RATIOEST: Standard Errors

Program for Computing Ratio Estimates from

Sample Survey Data. Research Triangle

SHAH, B . V . (1982). RTIFREQS: Program tQ

Compute Weighted Frequencies, Percentages,

and Their Standard Errors. Research

Triangle Institute, Research Triangle Park,

NC 27709.

SHAH, B . V . (1981). SESUDAAN: Standard Errors

Program for Computing Standardized Rates

from Sample Survey Data. Research Triangle

SHAH, B . V . (1978). "SUDANN: Survey Data

Analysis Software." Presented at the Joint

Statistical Meetings of the American Statistical Association, San Diego.

SHAH, B. V. (1979). VCMPNLS: Program to

Compute Variance Components. Research

Triangle Institute, Research Triangle Park,

NC 27709.

SHAH, B. V., FOLSOM, R.E. HARRELL, F. E. and

DILLARD, C . N . (1984): "RTILOGIT: Survey

Data Analysis Software for Logistic Regression," Final Report, Work Assignment 74, Research Triangle Institute, Research Triangle Park, NC.

THOMAS, D. R. and RAO, J. N. K. (1984). "A

Monte Carlo Study of Exact Levels of Goodness-of-Fit Statistics Under Cluster

Sampling." Proceedings of the Section on

Survey Research Methods, American Statistical Association, Washington, DC.

THOMAS, D. R. and RAO, J. N. K. (1985). "On

the Power of Some Goodness-of-Fit Tests

Under Cluster Sampling." Proceedings of the

Section on Survey Research MethodE, American Statistical Association, Washington, DC.

WHEELESS, S. and SHAH, B . V . (1982), "RATIO2

and RATIO3: Computing Ratio Estimates with

Two Data Files. Research Triangle

Institute, Research Triangle Park, NC. WHEELESS, S. C., SHAH, B. V., and LAVANGE, L. M.

(1987). "Results of a Simulation for

Comparing Two Methods for Estimating

Quantiles and their Variances." Project

Report. Research Triangle Institute,

(6)

F i g u r e i. U s e r A c c e s s to the S U D A A N S y s t e m U s e r

]

C P r o g r a m C L a n g u a g e I n t e r f a c e S U D A A N P r o g r a m S U D A A N P a r s e r P R 0 C S t a t e m e n t s P R O C P r e p r o c e s s o r S t a t i s t i c a l S y s t e m (e.g. SAS) s t a t e m e n t s S t a t i s t i c a l S y s t e m I n t e r f a c e S U D A A N S u p e r v i s o r : C o r e P r o g r a m s F i g u r e 2 R E C O R D S = F I C S F I L E ( " E X A M P L E I " , V E C T O R ( I , 2 , 3 , 4 ) ) ; X = S E L E C T ( R E C O R D S , 5) ; P S U L E V E L = 2 ; X S U M 2 = S I G M A (X, P S U L E V E L ) ; M H = S I G M A ( C O N S T A N T ( X S U M 2 , 1 ) ,i) ; M H M I N I = M H - I; S T R A T L E V = I; X S U M I = S I G M A ( X S U M 2 , S T R A T L E V ) ; X S Q R I = S l G M A ( X S U M 2 * X S U M 2 , S T R A T L E V ) ; X S S 0 = S I G M A ( ( M H * X S Q R I - X S U M I * X S U M I ) / M H M I N I , 0 ) ; M S E = S Q R T ( X S S O ) ; P R T A B S (MSE) ; / * R E C O R D S is a B A T c o n t a i n i n g the i n p u t d a t a s e t * / /*X is a BAT of y v a l u e s a n d id's */ / * S u m X o v e r e a c h P S U */ / * C o m p u t e the n u m b e r of P S U s p e r s t r a t u m by d e f i n i n g a v e c t o r of o n e ' s for e a c h P S U in X S U M 2 * / / * S u m X o v e r e a c h s t r a t u m */ / * S t r a t u m s u m of s q u a r e s of P S U t o t a l s */ / * C o m p u t e s u m of s q u a r e d d e v i a t i o n s o v e r all s t r a t a */ / * P r i n t the r e s u l t */ F i g u r e 3 R E C O R D S = F I C S F I L E ( " E X A M P L E I " , V E C T O R ( I , 2 , 3 , 4 ) ) ; S U B G R O U P S = V E C T O R ( 5 , 6 , 9 , 1 0 ) ; / * V a r i a b l e s d e f i n i n g s u b g r o u p s */ L E V E L S = V E C T O R ( 7 , 2 , 2 , 2 ) ; / * A s s o c i a t e d v a r i a b l e l e v e l s */ O P T I O N S = V E C T O R ( 0 , 0 , 0 , 0 ) ; / * F o u r o p t i o n s are off */ E F F V E C T O R S = E F F E C T S ( R E C O R D S , S U B G R O U P S , L E V E L S , 0 P T I O N S ) ; T A B 1 = V E C T O R ( I , 2 ) ; / * V a r i a b l e 5 by v a r i a b l e 6 */ T A B 2 = V E C T O R ( I , 3 , 4 ) ; / * V a r i a b l e 5 by v a r 9 by v a r i0 */ T A B V E C = V E C T O R ( T A B I , T A B 2 ) ; T A B L E S = C R 0 S S P ( E F F V E C T O R S , T A B V E C ) ; W E I G H T = S E L E C T ( R E C O R D S , I I ) / * V a r i a b l e II is the w e i g h t */ N W S U M = S I G M A W G ( T A B L E S , P S U L E V E L , W E I G H T ) ; X S U M 2 = N W S U M ( 2 ) ;