5.6 Taking strings apart - Accelerated C++ Practical Programming by Example

Now that we've seen some of what we can do with containers, we're going to turn our attention back to strings. Until now, we've done only a few things with strings: We've created them, read them, concatenated them, written them, and looked at their size. In each of these uses, we have dealt with the string as a single entity. Often, this kind of abstract usage is what we want: We want to ignore the detailed contents of a string. Sometimes, though, we need to look at the specific characters in a string.

As it turns out, we can think of a string as a special kind of container: It contains only characters, and it supports some, but not all, of the container operations. The operations that it does support include indexing, and the string type provides an iterator that is similar to a vector iterator. Thus, many of the techniques that we can apply to vectors apply also to strings.
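To make the container analogy concrete, here is a minimal sketch that walks a string with an iterator in the same way we would walk a vector. The helper name count_whitespace is our own invention for illustration, and the cast to unsigned char is a defensive habit we have added, not something the text requires.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not from the text): counts the whitespace
// characters in s by visiting each character through an iterator.
std::string::size_type count_whitespace(const std::string& s)
{
    std::string::size_type n = 0;
    for (std::string::const_iterator it = s.begin(); it != s.end(); ++it) {
        // Assumption: we convert to unsigned char first, because passing
        // a negative char value to isspace is undefined.
        if (std::isspace(static_cast<unsigned char>(*it)))
            ++n;
    }
    return n;
}
```

The loop has the same shape as a loop over a vector<char>; only the container type changes.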

For example, we might want to break a line of input into words, separated from each other by whitespace (space, tab, or the end of the line). If we can read the input directly, we can get the words from the input trivially. After all, that's exactly how the string input operator executes: It skips any leading whitespace, then reads characters up to the next whitespace character. However, there are times when we want to read an entire line of input and examine the words within that line. We'll see examples in §7.3/126 and §7.4.2/131.
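As a rough illustration of the "trivial" case, the sketch below collects words by relying on the string input operator alone. The function name read_words is hypothetical; any input stream, such as cin, can be passed to it.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the text): gathers every
// whitespace-separated word that can be read from the stream in.
std::vector<std::string> read_words(std::istream& in)
{
    std::vector<std::string> words;
    std::string word;
    // Each >> skips leading whitespace and stops at the next whitespace
    // character, so every successful read yields exactly one word.
    while (in >> word)
        words.push_back(word);
    return words;
}
```

Note that this approach loses track of where one line ends and the next begins, which is why a function that works on a line we have already read is still useful.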

Because such an operation might be generally useful, we'll write a function to do it. The function will take a string and return a vector<string>, which will contain an entry for each whitespace-separated word in that string. In order to understand this function, you need to know that strings support indexing much the same way as vectors do. So, for example, if s is a string that contains at least one character, the first character of s is s[0], and the last character of s is s[s.size() - 1].
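The indexing described above can be sketched as follows. The helper first_and_last is a name we made up for this illustration, and it assumes that its argument is not empty.

```cpp
#include <string>

// Hypothetical helper (not from the text): builds a two-character string
// from the first and last characters of s.
// Precondition (our assumption): s contains at least one character.
std::string first_and_last(const std::string& s)
{
    std::string result;
    result += s[0];              // first character
    result += s[s.size() - 1];   // last character
    return result;
}
```

When s holds a single character, s[0] and s[s.size() - 1] denote the same character, so the result repeats it.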

Our function will define two indices, i and j, that will delimit each word in turn. The idea is that we will locate a word by computing values for i and j such that the word will be the characters in the range [i, j). For example, in the string "some text", the first word is delimited by i = 0 and j = 4, and the second by i = 5 and j = 9.

Once we have these indices, we'll use the characters that they delimit to create a new string, which we will copy into our vector. When we are done, we will return the vector to our caller:

    #include <cctype>
    #include <string>
    #include <vector>

    using std::isspace;
    using std::string;
    using std::vector;

    vector<string> split(const string& s)
    {
        vector<string> ret;
        typedef string::size_type string_size;
        string_size i = 0;

        // invariant: we have processed characters [original value of i, i)
        while (i != s.size()) {
            // ignore leading blanks
            // invariant: characters in range [original i, current i) are all spaces
            while (i != s.size() && isspace(s[i]))
                ++i;

            // find end of next word
            string_size j = i;
            // invariant: none of the characters in range [original j, current j) is a space
            while (j != s.size() && !isspace(s[j]))
                ++j;

            // if we found some nonwhitespace characters
            if (i != j) {
                // copy from s starting at i and taking j - i chars
                ret.push_back(s.substr(i, j - i));
                i = j;
            }
        }
        return ret;
    }

In addition to the system headers that we have already encountered, this code needs the <cctype> header, which defines isspace. More generally, this header defines useful functions for processing individual characters. The c at the beginning of cctype is a reminder that the ctype facility is part of C++'s inheritance from C.
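As a small taste of what else <cctype> offers, the following sketch uses toupper, another function from that header, to build an uppercase copy of a string. The name upper_copy is hypothetical, and the casts are our own precaution rather than anything the text prescribes.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not from the text): returns a copy of s in which
// every lowercase letter has been replaced by its uppercase equivalent.
std::string upper_copy(const std::string& s)
{
    std::string ret;
    for (std::string::size_type i = 0; i != s.size(); ++i) {
        // toupper takes and returns an int; we convert through
        // unsigned char on the way in and back to char on the way out.
        ret += static_cast<char>(
            std::toupper(static_cast<unsigned char>(s[i])));
    }
    return ret;
}
```

Characters that are not lowercase letters, such as digits and spaces, pass through unchanged.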

The split function has a single parameter, which is a reference to a const string that we'll name s. Because we will be copying words from s, split does not need to change the string. As in §4.1.2/54, we can pass a const reference to avoid the cost of copying the string, while still ensuring that split will not change its argument.

We start off by defining ret, which will hold the words from the input string. The next two statements define and initialize our first index, i. As we saw in §2.4/22, string::size_type is the name for the appropriate type to index a string. Because we need to use this type more than once, we start by defining a shorter synonym for this type, as we did in §3.2.2/43, to simplify the subsequent declarations. We will use i as the index that finds the start of each word, advancing i through the input string one word at a time.

The test in the outermost while ensures that once we've processed the last word in the input, we'll stop.

Inside the while, we start by positioning our two indices. First, we find the first non-space character in s that is at or after the position currently indicated by i. Because there might be multiple whitespace characters in the input, we increment i until it denotes a character that is not whitespace.

There is a lot going on in this statement:

    while (i != s.size() && isspace(s[i]))
        ++i;

The isspace function is a predicate that takes a char and returns a value that indicates whether that char is whitespace. The && operator tests whether both its operands are true, failing if either of them is false. In this expression, the operation will succeed if i is not equal to the size of s (meaning that we have not reached the end of the string), and s[i] is a whitespace character. In that case, we will increment i and check again.

As we described in §2.4.2.2/26, the logical && operation uses a short-circuit strategy for evaluating its operands. Unlike our earlier examples, this one relies on the short-circuit property of &&. The binary logical operations (operators && and ||) execute by testing their left-hand operands first. If that test suffices to determine the overall result, then the right-hand operand is not evaluated. In the case of the &&, the second condition is evaluated if and only if the first condition is true. Thus, the condition in the while executes by first checking whether i != s.size(). Only if this test succeeds does it use i to look at a character in s. Of course, if i is equal to s.size(), then there are no more characters left to examine, and so we drop out of the loop.
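To see the short-circuit guard in isolation, we can lift the loop into a function of its own. This is only a sketch: skip_spaces is a hypothetical name, the function assumes that i is no greater than s.size(), and the cast to unsigned char is an extra safeguard that we have added.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not part of the text's split function): returns
// the index of the first non-whitespace character at or after i, or
// s.size() if there is no such character.
// Precondition (our assumption): i <= s.size().
std::string::size_type skip_spaces(const std::string& s,
                                   std::string::size_type i)
{
    // The left operand guards the right one: s[i] is examined only when
    // i != s.size() is true, so we never look past the last character.
    while (i != s.size() && std::isspace(static_cast<unsigned char>(s[i])))
        ++i;
    return i;
}
```

If the operands were written in the opposite order, the loop would evaluate s[i] with i equal to s.size() before discovering that it had run out of characters.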

Once we fall out of this while, we know either that i denotes a character that is not whitespace, or that we've run out of input without finding such a character.

Assuming that i is still a valid index, the next while will find the space that terminates the current word in s. We start by creating our other index, j, and initializing it to the value of i. The next while,

    while (j != s.size() && !isspace(s[j]))
        ++j;

executes similarly to the previous one, but this time the while stops when it encounters a whitespace character. As before, we start by ensuring that j is still in range. If so, we again call isspace on the character indexed by j. This time, we negate the return from isspace using the logical negation operator, !. In other words, we want the condition to be true if isspace(s[j]) is not true.
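The second loop can be isolated in the same spirit. In this sketch, word_end is a hypothetical name, i is assumed to be no greater than s.size(), and the cast to unsigned char is again our own addition.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not part of the text's split function): given the
// index i at which a word starts, returns the index one past the word's
// last character.
// Precondition (our assumption): i <= s.size().
std::string::size_type word_end(const std::string& s,
                                std::string::size_type i)
{
    std::string::size_type j = i;
    // Keep advancing while j is in range and s[j] is NOT whitespace.
    while (j != s.size() && !std::isspace(static_cast<unsigned char>(s[j])))
        ++j;
    return j;
}
```

If the character at i is itself whitespace, or if i is already s.size(), the loop body never executes and the function returns i, which is exactly the i == j case that split must detect.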

Having completed the two inner while loops, we know that we have either found another word or run out of input while looking for a word. If we have run out of input, then both i and j will be equal to s.size(). Otherwise, we have found a word, which we must push onto ret:

    // if we found some nonwhitespace characters
    if (i != j) {
        // copy from s starting at i and taking j - i chars
        ret.push_back(s.substr(i, j - i));
        i = j;
    }

The call to push_back uses a member of the string class, named substr, that we have not previously seen. It takes an index and a length, and creates a new string that contains a copy of characters from the initial string, starting at the index given by the first argument, and copying as many characters as indicated by its second argument. The substring that we extract starts at i, which is the first character in the word that we just found. We copy characters from s starting with the one indexed by i, and continuing until we have copied the characters in the (half-open) range [i, j). Remembering from §2.6/31 that the number of elements in a half-open range is the difference between the bounds, we see that we will copy exactly j - i characters.
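The relationship between the half-open range [i, j) and the arguments to substr can be sketched with a small wrapper. The name slice is hypothetical, and the function assumes that i is no greater than j and that j is no greater than s.size().

```cpp
#include <string>

// Hypothetical helper (not from the text): returns the characters of s
// in the half-open range [i, j).
// Precondition (our assumption): i <= j && j <= s.size().
std::string slice(const std::string& s,
                  std::string::size_type i,
                  std::string::size_type j)
{
    // substr takes a starting index and a length, so the length of the
    // range, j - i, becomes the second argument.
    return s.substr(i, j - i);
}
```

For the string "some text", the ranges [0, 4) and [5, 9) yield the two words, and an empty range such as [1, 1) yields an empty string.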
