• No results found

1 Files to download. 3 A macro to list out-of-range data values. 2 Reading in the example data file. 22S:172 Lab session 9 Macros for data cleaning

N/A
N/A
Protected

Academic year: 2021

Share "1 Files to download. 3 A macro to list out-of-range data values. 2 Reading in the example data file. 22S:172 Lab session 9 Macros for data cleaning"

Copied!
6
0
0

Loading.... (view fulltext now)

Full text

(1)

22S:172

Lab session 9

Macros for data cleaning

July 20, 2005

1

Files to download

patients.txt

lab9.sas

2

Reading in the example data file

options linesize = 72 mprint ; * mprint option puts macro debugging info in the log file ;

/*---Program to create SAS data set PATIENTS

--- */ *LIBNAME CLEAN "C:\TEMP";

LIBNAME CLEAN "/space/kcowles/172/clean"; DATA CLEAN.PATIENTS;

INFILE "C:\TEMP\patients.dat" PAD;

* INFILE "/space/kcowles/172/clean/patients.dat" PAD; INPUT @1 PATNO $3. @4 GENDER $1. @5 VISIT MMDDYY10. @15 HR 3. @18 SBP 3. @21 DBP 3. @24 DX $3. @27 AE $1.;

LABEL PATNO = "Patient Number"

GENDER = "Gender" VISIT = "Visit Date" HR = "Heart Rate"

SBP = "Systolic Blood Pressure" DBP = "Diastolic Blood Pressure" DX = "Diagnosis Code"

AE = "Adverse Event?"; FORMAT VISIT MMDDYY10.;

RUN;

3

A macro to list out-of-range data values

Recall that a macro program is a piece of SAS code in which parts of the code are replaced by variable information by the macro processor before the code is processed in the usual way by the SAS compiler. When the macro is called, the macro processor replaces each of the arguments with the values you specify. Then, in the macro program, every macro variable is replaced by the assigned value.

/*************************************************************************** Program 2-7 Writing a macro to list out-of-range data values

****************************************************************************/ *---*

| Program Name: RANGE.SAS in C:\CLEANING | | Purpose: Macro that takes lower and upper limits for a | | numeric variable, and an ID variable to print out | | an exception report to the output window |

| Arguments: DSN - Data set name |

| VAR - Numeric variable to test |

| LOW - Lowest valid value |

| HIGH - Highest valid value |

| IDVAR - ID variable to print in the exception |

| report |

| Example: %RANGE(CLEAN.PATIENTS,HR,40,100,PATNO) | *---*; %MACRO RANGE(DSN,VAR,LOW,HIGH,IDVAR);

(2)

TITLE "Listing of Invalid Patient Numbers and Data Values"; DATA _NULL_;

SET &DSN(KEEP=&IDVAR &VAR); FILE PRINT;

IF (&VAR LT &LOW AND &VAR NE .) OR &VAR GT &HIGH THEN PUT "&IDVAR:" &IDVAR @18 "Variable:&VAR"

@38 "Value:" &VAR @50 "out-of-range"; RUN;

%MEND RANGE;

%RANGE(CLEAN.PATIENTS,HR,40,100,PATNO)

Look at what appears in the SAS log:

MPRINT(RANGE): TITLE "Listing of Invalid Patient Numbers and Data Values";

MPRINT(RANGE): DATA _NULL_;

MPRINT(RANGE): SET CLEAN.PATIENTS(KEEP=PATNO HR); MPRINT(RANGE): FILE PRINT;

MPRINT(RANGE): IF (HR LT 40 AND HR NE .) OR HR GT 100 THEN PUT "PATNO:" PATNO @18 "Variable:HR" @38 "Value:" HR @50 "out-of-range"; MPRINT(RANGE): RUN;

NOTE: 7 lines were written to file PRINT.

NOTE: There were 31 observations read from the data set CLEAN.PATIENTS. NOTE: The DATASTEP printed page 1.

And here is the output:

Listing of Invalid Patient Numbers and Data Values 1 20:04 Monday, July 14, 2003 PATNO:004 Variable:HR Value:101 out-of-range

PATNO:008 Variable:HR Value:210 out-of-range PATNO:014 Variable:HR Value:22 out-of-range PATNO:017 Variable:HR Value:208 out-of-range PATNO:321 Variable:HR Value:900 out-of-range PATNO:020 Variable:HR Value:10 out-of-range PATNO:023 Variable:HR Value:22 out-of-range

4

A program to use PROC UNIVARIATE to look for

highest and lowest values by percentiles

Note: the%EVALis needed in the 7th line to perform the integer arithmetic. If the value of LOW_PERwere 20, the value ofUP_PERwithout the%EVALfunction would be the text string ”100 - 20” instead of 80.

The data set TMP created by proc univariate contains only one observation and three variables: PATNO,L_20andL_80.

We want to add the two values ofL_20andL_80to every observation in the original dataset. The SET statement brings in an observation from the PATIENTS dataset, keeping only the variables PATNO and HR. Because all variables brought in with a SET statement are automatically retained, the values forL_20andL_80are added to every observation in the dataset HILO.

/*************************************************************************** Program 2-12 Program to print the top and bottom "n" percent of data values, using PROC UNIVARIATE

****************************************************************************/ ***Solution using PROC UNIVARIATE and Percentiles;

*LIBNAME CLEAN "C:\CLEANING";

***The two macro variables that follow define the lower and upper percentile cut points;

***Change the value in the line below to the percentile cut-off you want;

%LET LOW_PER=20;

***Compute the upper cut-off value; %LET UP_PER= %EVAL(100 - &LOW_PER); ***Choose a variable to operate on; %LET VAR = HR;

PROC UNIVARIATE DATA=CLEAN.PATIENTS NOPRINT; VAR &VAR;

ID PATNO;

OUTPUT OUT=TMP PCTLPTS=&LOW_PER &UP_PER PCTLPRE = L_;

(3)

RUN; DATA HILO;

SET CLEAN.PATIENTS(KEEP=PATNO &VAR);

***Bring in upper and lower cutoffs for variable; IF _N_ = 1 THEN SET TMP;

IF &VAR LE L_&LOW_PER THEN DO; RANGE = ’LOW ’;

OUTPUT; END;

ELSE IF &VAR GE L_&UP_PER THEN DO; RANGE = ’HIGH’;

OUTPUT; END; RUN;

PROC SORT DATA=HILO(WHERE=(&VAR NE .)); BY DESCENDING RANGE &VAR;

RUN;

PROC PRINT DATA=HILO;

TITLE "High and Low Values For Variables"; ID PATNO;

VAR RANGE &VAR; RUN;

The output:

High and Low Values For Variables 2 20:04 Monday, July 14, 2003 PATNO RANGE HR 020 LOW 10 014 LOW 22 023 LOW 22 022 LOW 48 003 LOW 58 019 LOW 58 001 HIGH 88 007 HIGH 88 HIGH 90 004 HIGH 101 017 HIGH 208 008 HIGH 210 321 HIGH 900 o

5

Turning the prevous program into a macro

PROC DATASETS is used to delete the two temporaray datasets TMP and HILO.

/*************************************************************************** Program 2-13 Creating a macro to list the highest and lowest "n" percent of the data, using PROC UNIVARIATE

****************************************************************************/ *---*

| Program Name: HILOWPER.SAS in C:\CLEANING | | Purpose: To list the n percent highest and lowest values for |

| a selected variable. |

| Arguments: DSN - Data set name |

| VAR - Numeric variable to test |

| PERCENT - Upper and Lower percentile cutoff | | IDVAR - ID variable to print in the report | | Example: %HILOWPER(CLEAN.PATIENTS,SBP,20,PATNO) | *---*; %MACRO HILOWPER(DSN,VAR,PERCENT,IDVAR);

***Compute upper percentile cutoff; %LET UP_PER = %EVAL(100 - &PERCENT); PROC UNIVARIATE DATA=&DSN NOPRINT;

VAR &VAR; ID &IDVAR;

OUTPUT OUT=TMP PCTLPTS=&PERCENT &UP_PER PCTLPRE = L_; RUN;

DATA HILO;

(4)

IF _N_ = 1 THEN SET TMP; IF &VAR LE L_&PERCENT THEN DO;

RANGE = ’LOW ’; OUTPUT; END;

ELSE IF &VAR GE L_&UP_PER THEN DO; RANGE = ’HIGH’;

OUTPUT; END; RUN;

PROC SORT DATA=HILO(WHERE=(&VAR NE .)); BY DESCENDING RANGE &VAR;

RUN;

PROC PRINT DATA=HILO;

TITLE "Low And High Values For Variables"; ID &IDVAR;

VAR RANGE &VAR; RUN;

PROC DATASETS LIBRARY=WORK NOLIST; DELETE TMP; DELETE HILO; RUN; QUIT; %MEND HILOWPER ; %HILOWPER(CLEAN.PATIENTS,HR,20,PATNO) %HILOWPER(CLEAN.PATIENTS,SBP,20,PATNO) %HILOWPER(CLEAN.PATIENTS,DBP,20,PATNO)

6

Introduction to PROC RANK

PROC RANK produces a new variable (or replaces the values of an existing variable) with values equal to the ranks of another variable. The GROUPS= option on PROC RANK allows you to group your data values. For example, if you set GROUPS = 4, the new variable that usually holds rank values with instead have values of 0,1,2, and 3, with those

observations in group 0 being in the bottom quartile, etc. Here is sample code to illustrate how PROC RANK works.

data sample ; input myval ; datalines ; 12 72 46 192 13 81 0 42 189 16 3 ; run ;

proc rank data = sample out = smplrank; var myval ;

ranks valrank ; run ;

proc print data = smplrank ; run ;

proc rank data = sample out = smplrank2 groups = 3; var myval ;

ranks valrank ; run ;

proc print data = smplrank2 ; run ;

Here is the output.

(5)

Obs myval valrank 1 12 3 2 72 8 3 46 7 4 192 11 5 13 4 6 81 9 7 0 1 8 42 6 9 189 10 10 16 5 11 3 2

The SAS System 2

Obs myval valrank

1 12 0 2 72 2 3 46 1 4 192 2 5 13 1 6 81 2 7 0 0 8 42 1 9 189 2 10 16 1 11 3 0

7

Using PROC RANK to look for highest and lowest

values by percentage

This method is simpler and more efficient, but slightly less accurate than the method using PROC UNIVARIATE.

The%SYSEVALFfunction allows noninteger arithmetic and also provides various conversions (CEIL, FLOOR, INTEGER, or BOOLEAN).

/*************************************************************************** Program 2-14 Creating a macro to list the highest and lowest "n" percent of the data, using PROC RANK

****************************************************************************/ *---*

| Macro Name: HI_LOW_P |

| Purpose: To list the upper and lower n% of values | | Arguments: DSN - Data set name (one- or two-level |

| VAR - Variable to test |

| PERCENT - Upper and lower n% |

| IDVAR - ID variable |

| Example: %HI_LOW_P(CLEAN.PATIENTS,SBP,20,PATNO) | *---*; %MACRO HI_LOW_P(DSN,VAR,PERCENT,IDVAR);

***Compute number of groups for PROC RANK; %LET GRP = %SYSEVALF(100 / &PERCENT,FLOOR);

***Value of the highest GROUP from PROC RANK, equal to the number of groups - 1;

%LET TOP = %EVAL(&GRP - 1); PROC FORMAT;

VALUE RNK 0=’Low’ &TOP=’High’; RUN;

PROC RANK DATA=&DSN OUT=NEW GROUPS=&GRP; VAR &VAR;

RANKS RANGE; RUN;

***Sort and keep top and bottom n%;

PROC SORT DATA=NEW (WHERE=(RANGE IN (0,&TOP))); BY &VAR;

RUN;

***Produce the report; PROC PRINT DATA=NEW;

TITLE "Upper and Lower &PERCENT.% Values for %UPCASE(&VAR)"; ID &IDVAR;

VAR RANGE &VAR; FORMAT RANGE RNK.;

(6)

RUN;

PROC DATASETS LIBRARY=WORK NOLIST; DELETE NEW;

RUN; QUIT; %MEND HI_LOW_P;

References

Related documents

Notice of this Application and Hearing on the Application shall be given by certified mail return receipt requested to Decedent's surviving spouse, to all persons entitled to inherit

We believe these results are consistent with this new theory of cost behaviour, although the coefficients have different signs, because the managers were afraid that with the

Head mounted displays are able to create a sense of being inside a virtual en- vironment by using motion tracking technology to match the view direction that is displayed to the

The actual reflection point, where the laser incident ray strikes the metal reflective surface on the back of the mirror, lies just a bit toward the incident ray side of the

Spells per Day When an arcane investigator level is gained, the character gains new spells per day as if he had also gained a level in a spellcasting class he belonged to

The potential of important agro-industrial wastes; apple pomace (AP) and orange peel (OP) as C sources, was investigated in the maximization of polygalacturonase (PG), an

The Dunn-Lin theory [8] attempted to develop a better viscous compressible stability theory by removing the largest restrictions from Lees & Lin’s theory: the idea that phase

These results suggest that TRPC3-Nox2 interaction per se is critical for stabilization of Nox2 protein complex, and activation of TRPC3 is important for Nox2-dependent