• No results found

It’s fairly common for programs to have a need to do some simple kinds of lexical analysis and parsing, such as splitting a command string up into tokens. You can do this with thestrtokfunction, declared in the header file ‘string.h’.

Function

char *strtok (char *restrictnewstring, const char *restrictdelimiters)

A string can be split into tokens by making a series of calls to the function

strtok.

The string to be split up is passed as the newstring argument on the first call only. Thestrtokfunction uses this to set up some internal state information. Subsequent calls to get additional tokens from the same string are indicated by passing a null pointer as thenewstringargument. Callingstrtokwith another nonnullnewstringargument reinitializes the state information. It is guaranteed that no other library function ever callsstrtokbehind your back (which would mess up this internal state information).

The delimiters argument is a string that specifies a set of delimiters that may surround the token being extracted. All the initial characters that are members of this set are discarded. The first character that isnota member of this set of delimiters marks the beginning of the next token. The end of the token is found by looking for the next character that is a member of the delimiter set. This character in the original stringnewstring is overwritten by a null character, and the pointer to the beginning of the token innewstringis returned.

On the next call tostrtok, the searching begins at the next character beyond the one that marked the end of the previous token. Note that the set of delim- itersdelimiters do not have to be the same on every call in a series of calls to

strtok.

If the end of the string newstring is reached, or if the remainder of the string consists only of delimiter characters,strtokreturns a null pointer.

Character is used here in the sense of byte. In a string using a multibyte- character encoding (abstract), characters consisting of more than 1 byte are not

treated as an entity. Each byte is treated separately. The function is not locale dependent.

Function

wchar_t *wcstok (wchar_t *newstring, const char *delimiters)

A string can be split into tokens by making a series of calls to the function

wcstok.

The string to be split up is passed as thenewstringargument on the first call only. Thewcstokfunction uses this to set up some internal state information. Sub- sequent calls to get additional tokens from the same wide-character string are indicated by passing a null pointer as thenewstringargument. Callingwcstok

with another nonnullnewstring argument reinitializes the state information. It is guaranteed that no other library function ever callswcstokbehind your back (which would mess up this internal state information).

Thedelimiters argument is a wide-character string that specifies a set of delim- iters that may surround the token being extracted. All the initial wide characters that are members of this set are discarded. The first wide character that isnota member of this set of delimiters marks the beginning of the next token. The end of the token is found by looking for the next wide character that is a member of the delimiter set. This wide character in the original wide-character stringnew- string is overwritten by a null wide character, and the pointer to the beginning of the token innewstring is returned.

On the next call to wcstok, the searching begins at the next wide character beyond the one that marked the end of the previous token. The set of delim- itersdelimiters do not have to be the same on every call in a series of calls to

wcstok.

If the end of the wide-character stringnewstring is reached, or if the remainder of the string consists only of delimiter wide characters,wcstokreturns a null pointer.

Character is used here in the sense of byte. In a string using a multibyte- character encoding (abstract), characters consisting of more than one byte are not treated as an entity. Each byte is treated separately. The function is not locale dependent.

Warning: Since strtok and wcstok alter the string they are parsing, you should always copy the string to a temporary buffer before parsing it with

strtok/wcstok(seeSection 5.4 [Copying and Concatenation], page 93). If you allowstrtokorwcstokto modify a string that came from another part of your program, you are asking for trouble; that string might be used for other purposes after strtok orwcstok has modified it, and it would not have the expected value.

The string that you are operating on might even be a constant. Then when

strtok orwcstok tries to modify it, your program will get a fatal signal for writing in read-only memory.2 Even if the operation ofstrtokorwcstokwould

2

Chapter 5: String and Array Utilities 121 not require a modification of the string (e.g., if there is exactly one token), the string can (and in theGNUlibc case will) be modified.

This is a special case of a general principle: if a part of a program does not have as its purpose the modification of a certain data structure, then to modify the data structure temporarily is to risk errors.

The functionsstrtokandwcstokare not reentrant.3 Here is a simple example showing the use ofstrtok:

#include <string.h> #include <stddef.h>

...

const char string[] = "words separated by spaces -- and, punctuation!"; const char delimiters[] = " .,;:!-";

char *token, *cp;

...

cp = strdupa (string); /* Make writable copy. */ token = strtok (cp, delimiters); /* token => "words" */ token = strtok (NULL, delimiters); /* token => "separated" */ token = strtok (NULL, delimiters); /* token => "by" */ token = strtok (NULL, delimiters); /* token => "spaces" */ token = strtok (NULL, delimiters); /* token => "and" */

token = strtok (NULL, delimiters); /* token => "punctuation" */ token = strtok (NULL, delimiters); /* token => NULL */

The GNU C Library contains two more functions for tokenizing a string which overcome the limitation of nonreentrancy. They are only available for multibyte- character strings.

Function

char *strtok r (char *newstring, const char *delimiters, char **save ptr)

Just likestrtok, this function splits the string into several tokens that can be accessed by successive calls tostrtok_r. The difference is that the informa- tion about the next token is stored in the space pointed to by the third argument,

save ptr, which is a pointer to a string pointer. Callingstrtok_rwith a null pointer for newstring and leaving save ptr between the calls unchanged does the job without hindering reentrancy.

This function is defined in POSIX.1 and can be found on many systems which support multithreading.

3

See Loosemore et al., “Signal Handling and Nonreentrant Functions”, for a discussion of where and why reentrancy is important.

Function

char *strsep (char **string ptr, const char *delimiter)

This function has a similar functionality tostrtok_r, with thenewstring ar- gument replaced by the save ptr argument. The initialization of the moving pointer has to be done by the user. Successive calls tostrsepmove the pointer along the tokens separated bydelimiter, returning the address of the next token and updatingstring ptr to point to the beginning of the next token.

One difference betweenstrsepandstrtok_ris that if the input string con- tains more than one character fromdelimiterin a row,strsepreturns an empty string for each pair of characters fromdelimiter. This means that a program nor- mally should test forstrsepreturning an empty string before processing it. This function was introduced in 4.3BSDand therefore is widely available. Here is how the above example looks like whenstrsepis used:

#include <string.h> #include <stddef.h>

...

const char string[] = "words separated by spaces -- and, punctuation!"; const char delimiters[] = " .,;:!-";

char *running; char *token;

...

running = strdupa (string);

token = strsep (&running, delimiters); /* token => "words" */ token = strsep (&running, delimiters); /* token => "separated" */ token = strsep (&running, delimiters); /* token => "by" */ token = strsep (&running, delimiters); /* token => "spaces" */ token = strsep (&running, delimiters); /* token => "" */ token = strsep (&running, delimiters); /* token => "" */ token = strsep (&running, delimiters); /* token => "" */ token = strsep (&running, delimiters); /* token => "and" */ token = strsep (&running, delimiters); /* token => "" */

token = strsep (&running, delimiters); /* token => "punctuation" */ token = strsep (&running, delimiters); /* token => "" */

token = strsep (&running, delimiters); /* token => NULL */

Function

char *basename (const char *filename)

The GNUversion of thebasenamefunction returns the last component of the path in filename. This function is the preferred usage, since it does not mod- ify the argument, filename, and respects trailing slashes. The prototype for

basenamecan be found in ‘string.h’. This function is overriden by the

Chapter 5: String and Array Utilities 123 Here is an example usingGNUbasename:

#include <string.h>

int

main (int argc, char *argv[]) {

char *prog = basename (argv[0]);

if (argc < 2) {

fprintf (stderr, "Usage %s <arg>\n", prog); exit (1);

}

... }

Portability Note: This function may produce different results on different sys- tems.

Function

char *basename (char *path)

This is the standard XPG-definedbasename. It is similar in spirit to theGNU

version, but may modify the path by removing trailing ‘/’ characters. If the

path is made up entirely of ‘/’ characters, then “/” will be returned. Also, if

path isNULL or an empty string, then “.” is returned. The prototype for the

XPGversion can be found in ‘libgen.h’. Here is an example usingXPGbasename:

#include <libgen.h>

int

main (int argc, char *argv[]) {

char *prog;

char *path = strdupa (argv[0]);

prog = basename (path);

if (argc < 2) {

fprintf (stderr, "Usage %s <arg>\n", prog); exit (1);

}

}

Function

char *dirname (char *path)

The dirnamefunction is the compliment to theXPGversion ofbasename. It returns the parent directory of the file specified bypath. Ifpath isNULL, an empty string, or contains no ‘/’ characters, then “.” is returned. The prototype for this function can be found in ‘libgen.h’.