String Manipulation - SPSS Programming and Data Management, 2nd Edition

Since just about the only restriction you can impose on string variables is the maximum number of characters, string values may often be recorded in an inconsistent manner and/or contain important bits of information that would be more useful if they could be extracted from the rest of the string.

Changing the Case of String Values

Perhaps the most common problem with string values is inconsistent capitalization. Since string values are case sensitive, a value of “male” is not the same as a value of “Male.” This example converts all values of a string variable to lowercase letters.

*string_case.sps.

DATA LIST FREE /gender (A6). BEGIN DATA Male Female male female MALE FEMALE END DATA. COMPUTE gender=LOWER(gender). EXECUTE.

The LOWER function converts all uppercase letters in the value of gender to lowercase letters, resulting in consistent values of “male” and “female.”

You can use the UPCASE function to convert string values to all uppercase letters.

Combining String Values

You can combine multiple string and/or numeric values to create new string variables. For example, you could combine three numeric variables for area code, exchange, and number into one string variable for telephone number with dashes between the values.

*concat_string.sps.

DATA LIST FREE /tel1 tel2 tel3 (3F4). BEGIN DATA 111 222 3333 222 333 4444 333 444 5555 555 666 707 END DATA.

STRING telephone (A12). COMPUTE telephone =

CONCAT((STRING(tel1, N3)), "-",

(STRING(tel2, N3)), "-", (STRING(tel3, N4))). EXECUTE.

The STRING command defines a new string variable that is 12 characters long. Unlike new numeric variables, which can be created by transformation commands, you must define new string variables before using them in any transformations.

The COMPUTE command combines two string manipulation functions to create the new telephone number variable.

The CONCAT function concatenates two or more string values. The general form of the function is CONCAT(string1, string2, ...). Each argument can be a variable name, an expression, or a literal string enclosed in quotes.

Each argument of the CONCAT function must evaluate to a string; so we use the

STRING function to treat the numeric values of the three original variables as strings. The general form of the function is STRING(value, format). The value argument can be a variable name, a number, or an expression. The format argument must be a valid numeric format. In this example, we use N format to support leading zeros in values (for example, 0707).

The dashes in quotes are literal strings that will be included in the new string value; a dash will be displayed between the area code and exchange and between the exchange and number.

Figure 4-27

Original numeric values and concatenated string values

Taking Strings Apart

In addition to being able to combine strings, you can also take them apart. For example, you could take apart a 12-character telephone number, recorded as a string (because of the embedded dashes), and create three new numeric variables for area code, exchange, and number.

If all of the values were in the form nnn-nnn-nnnn with no spaces, it would be fairly easy to extract each segment of the telephone number, but some of the values have leading spaces or spaces before and after the dashes.

*substr_index.sps.

***Create some inconsistent sample numbers***. DATA LIST FREE (",") /telephone (A16).

BEGIN DATA 111-222-3333 222 - 333 - 4444 333-444-5555 444 - 555-6666 555-666-0707 END DATA.

***Now extract the component parts***. COMPUTE tel1 = NUMBER(SUBSTR(telephone, 1, INDEX(telephone, "-")-1), F5). COMPUTE tel2 = NUMBER(SUBSTR(telephone, INDEX(telephone, "-")+1, RINDEX(telephone, "-")-(INDEX(telephone, "-")+1)), F5). COMPUTE tel3 = NUMBER(SUBSTR(telephone, RINDEX(telephone, "-")+1), F5). EXECUTE.

FORMATS tel1 tel2 (N3) tel3 (N4).

***Alternate method***. STRING #telstr(A16).

COMPUTE #telstr = telephone. VECTOR tel(3,f4).

LOOP #i = 1 to 2.

- COMPUTE #dash = INDEX(#telstr,"-").

- COMPUTE tel(#i) = NUMBER(SUBSTR(#telstr,1,#dash-1),f10). - COMPUTE #telstr = SUBSTR(#telstr,#dash+1).

END LOOP.

COMPUTE tel(3) = NUMBER(#telstr,f10). EXECUTE.

FORMATS tel1 tel2 (N3) tel3 (N4).

The NUMBER function converts a number expressed as a string to a numeric value. The basic format is NUMBER(value, format). The value argument can be a variable name, a number expressed as a string in quotes, or an expression. The format argument must be a valid numeric format; this format is used to determine the numeric value of the string. In other words, the format argument says, “Read the string as if it were a number in this format.”

The value argument for the NUMBER function for all three new variables is an expression using the SUBSTR function. The general form of the function is

SUBSTR(value, position, length). The value argument can be a variable name, an expression, or a literal string enclosed in quotes. The position argument is a number

that indicates the starting character position within the string. The optional length argument is a number that specifies how many characters to read starting at the value specified on the position argument. Without the length argument, the string is read from the specified starting position to the end of the string value. So SUBSTR("abcd", 2, 2) would return “bc,” and SUBSTR("abcd", 2) would return “bcd.”

INDEX and RINDEX functions are used to determine starting position and/or length for the three new variables. The general form of these commands is

[R]INDEX(haystack, needle). The haystack argument can be a variable name or a literal string enclosed in quotes. The needle argument can be a literal string enclosed in quotes or an expression. Both arguments must evaluate to strings. The function returns a numeric value that represents the starting position of needle

within haystack. For INDEX, the number is the starting position of the first occurrence of needle, and for RINDEX it’s the starting position of the last occurrence of needle. So, INDEX("abcabc", "b") would return a value of 2, and

RINDEX("abcabc", "b") would return a value of 5.

For tel1, SUBSTR(telephone, 1, INDEX(telephone, "-")-1) defines a substring starting with the first character in the value of telephone and ending with the last character prior to the first dash.

For tel3, (SUBSTR(telephone, RINDEX(telephone, "-")+1) defines a substring starting with the first character after the last dash in the value of telephone. In the absence of a length argument, the remainder of the string value is read.

Extracting the value of tel2 is a little more complicated, since it’s in the middle of the original string value. The starting position is the first character after the first dash: INDEX(telephone, "-")+1. The length is the difference between that value and the starting position of the second dash:

RINDEX(telephone, "-")-(INDEX(telephone, "-")+1)).

FORMATS assigns N format to the three new variables for numbers with leading zeros (for example, 0707).

Figure 4-28

Substrings extracted and converted to numbers

The alternative method eliminates the need to use a somewhat complicated expression to extract a substring from the middle of the string value by using a temporary variable and changing the value of the temporary variable to the remaining portion(s) of the string value as each segment is extracted.

A temporary (scratch) string variable, #telstr, is declared and set to the value of the original string telephone number.

The VECTOR command creates three numeric variables—tel1, tel2, and tel3—and creates a vector containing those variables.

The LOOP structure iterates twice to produce the values for tel1 and tel2.

COMPUTE #dash = INDEX(#telstr,"-") creates another temporary variable, #dash, that contains the position of the first dash in the string value.

On the first iteration, COMPUTE tel(#i) = NUMBER(SUBSTR(#telstr,1,#dash-1),f10)

extracts everything prior to the first dash, converts it to a number, and sets tel1 to that value.

COMPUTE #telstr = SUBSTR(#telstr,#dash+1) then sets #telstr to the remaining portion of the string value after the first dash.

On the second iteration, COMPUTE #dash... sets #dash to the position of the “first” dash in the modified value of #telstr. Since the area code and the original first dash have been removed from #telstr, this is the position of the dash between the exchange and the number.

COMPUTE tel(#)... sets tel2 to the numeric value of everything up to the “first” dash in the modified version of #telstr, which is everything after the first dash and before the second dash in the original string value.

COMPUTE #telstr... then sets #telstr to the remaining segment of the string value— everything after the “first” dash in the modified value, which is everything after the second dash in the original value.

After the two loop iterations are complete, COMPUTE tel(3) = NUMBER(#telstr,f10)

sets tel3 to the numeric value of the final segment of the original string value.

Changing the Defined Width of a String Variable

When reading in data from text files or databases, the width of string variables is sometimes set higher than necessary. In some cases, string variables are automatically set to an arbitrarily long width. This can make the string variables awkward to work with. The following example counts the number of characters in each string value, ignoring trailing spaces, and changes the string variable width to the maximum character count.

*string_length.sps.

DATA LIST FREE /stringvar (A10). BEGIN DATA

a ab abc a abcde ab abcdefg END DATA.

COMPUTE strlength=LENGTH(RTRIM(stringvar)). SORT CASES BY strlength (D).

DO IF ($casenum = 1).

WRITE OUTFILE = 'c:\temp\temp.sps'

/"STRING newstring (A" strlength (N5) ")." /"COMPUTE newstring = stringvar."

/"MATCH FILES FILE=* /DROP stringvar strlength.". END IF.

EXECUTE.

INSERT FILE = 'c:\temp\temp.sps'.

The COMPUTE command creates a new numeric variable, strlength, that is the number of characters in each string value.

The LENGTH function has the general form LENGTH(string), and it returns the number of characters in the string argument. The value of the argument can be a variable name, an expression, or a literal string.

RTRIM strips off any trailing blanks in the string value. This is necessary because string values are right-padded to the defined width of the string variable, so without

it, LENGTH(varname) will always return the defined width of the string variable, which does not help in this case.

The SORT CASES command sorts the cases in descending order of the new variable

strlength; the case with the longest string value for stringvar will be the first case

in the working data file.

DO IF ($casenum = 1) restricts the transformations in the DO IF structure to the first case in the file, which is the case with the highest value for strlength.

The WRITE command creates a command syntax file that declares a new string variable newstring with a defined width set to the value of strlength. We need to create a new string variable because you cannot change the defined width of an existing variable.

N5 format is used to write the value of strlength because we know the value will be somewhere between one and five digits long (since the maximum string width is 32,767 characters), and we need to write the value without any preceding blanks. Using the default F8.2 format for strlength would result in a format specification of

(A 7), which is invalid. N format fills out the values to the defined width with leading zeros, resulting in a format specification of (A00007).

The COMPUTE command generated by the WRITE command copies the contents of

stringvar into newstring, and then MATCH FILES is used to delete the intermediate variable strlength and the now redundant original string variable stringvar.

The INSERT command runs the command syntax file created by the WRITE

command, creating the new string variable with the more appropriate width specification.

In document SPSS Programming and Data Management, 2nd Edition (Page 131-138)