Sometimes we need to filter our datasets based on the context in which a word or phrase is used. For example, during one of our engagements, we were trying to understand the discussions being held around a conference called “Sapphire Now.” According to its Twitter account, it is “The world’s premier business technology conference around business technology trends and innovations.” When we first started collecting data, we simply searched on the word sapphire, expecting all of the tweets we collected to refer to the show.
We forgot about:
■ Sapphire, the gem
■ Sapphire, the liquor
■ Sapphire, the color
■ and so on
So in cleansing this data, we had to look at all of the tweets that referenced sapphire and select just the ones that dealt with the conference or technology in general; in other words, the ones that used the word sapphire in the context of technology.
The sample dataset we used previously had the same problem. We had a set of data about “apple,” and it came in a few “flavors”:
■ Apple, the corporation (and its products)
■ Apple, the fruit
■ Apple, as a sign of affection
■ No doubt there would be others
One way we could sort out the data was to look for the word apple in the
dataset and select just those tweets that discussed the iPad (or any of Apple’s products).
Our egrep command could look something like this:
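A minimal sketch of such a command follows. It assumes a handful of stand-in tweets held in a shell variable: the three iPad tweets are the ones reported in this section, while the other two lines and the variable names are made up for illustration. It uses grep -E, which accepts the same extended regular expressions as egrep.

```shell
# Stand-in data: lines 1, 3, and 5 are tweets reported in this section;
# lines 2 and 4 are hypothetical and exist only to show non-matches.
tweets='Apple said to prepare new 12.9-inch IPad for early 2015:
An apple a day keeps the doctor away
RT @BloombergNews Apple said to prepare new 12.9-inch IPad for early 2015:
Apple iPhone 6 review roundup
Apple is working on a bigger iPad with a 12.9-inch screen that will launch early next year'

# -E: extended regular expressions (the syntax egrep uses)
# -i: ignore case, so "IPad" and "iPad" both match
# -n: prefix each matching line with its line number
matches=$(printf '%s\n' "$tweets" | grep -E -i -n '(apple).*(ipad)')
printf '%s\n' "$matches"
```

Against a file of tweets, the same pattern would be passed to egrep -i, followed by the file name.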
In this case, we specify that we want all tweets that contain the word apple (remember, we use the -i option of egrep to ignore case), followed anywhere later on the same line by the word ipad. The .* between the two words does that work: the dot matches any single character, and the asterisk repeats it zero or more times. The output of this command yields:
1: Apple said to prepare new 12.9-inch IPad for early 2015:
3: RT @BloombergNews Apple said to prepare new 12.9-inch IPad for early 2015:
5: Apple is working on a bigger iPad with a 12.9-inch screen that will launch early next year
Obviously, as before, specifying (apple).*(ip) would have included the tweet about the iPhone. The point is that looking for specific words only when other keywords are present allows us to understand (or at least assume) that our search words are being used in the proper context.
Many search engines allow keywords to be filtered by proximity, matching only when other words appear within a certain distance (for example, if we see the word apple, the word ipad must be seen within three words; otherwise, it will not be considered a match).
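A proximity rule of this kind can be approximated with a regular expression. The sketch below is our own illustration, not a feature of any particular search engine: it assumes that "within three words" means ipad is one of the three words following apple (so at most two words sit between them), treats a word as any run of letters and digits, and uses hypothetical sample lines.

```shell
# Hypothetical sample lines for illustration only.
samples='Apple iPad sales rise
Apple unveils new iPad today
Apple is working on a bigger iPad with a 12.9-inch screen'

# "apple", then at most two intervening words, then "ipad".
# Each (...) group is one run of separators followed by one word;
# {0,2} allows zero, one, or two such groups before the final separator.
near=$(printf '%s\n' "$samples" |
  grep -E -i 'apple([^[:alnum:]]+[[:alnum:]]+){0,2}[^[:alnum:]]+ipad')
printf '%s\n' "$near"
```

The first two lines match because ipad falls within three words of apple; the third does not, because five words separate the two. Widening the window is a matter of raising the upper bound in {0,2}.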
Summary
What we discussed in this chapter could loosely be called the process of data cleansing: detecting and removing inaccurate data from a dataset (or database). It’s an essential step in feeding our more complex “downstream” steps of analysis and interpretation. While we took you through some relatively simple examples, we hope we made our point: Any analysis is only as good as its weakest link (if your datasets are corrupt or inaccurate, the errors will only be magnified later). If you minimize the corruption or inaccuracies up front, the magnified error should, with luck, not be large enough to skew your results in the wrong direction.
The question “How do we know if we have the right data?” is a difficult one to answer. The process of data cleansing, like the whole process of analysis, is iterative in nature (we discuss this issue further in Chapter 12, where we discuss things that can go wrong). At some point we have to decide that “enough is enough.” If we’ve passed over the data three or four times in an attempt to clean it, and each pass produces less and less cleaning, repeating the process becomes a question of “Is it worth continuing to clean with less and less of a result?” There is no simple answer to that question other than your having a gut feeling that your data is ready for the next step.
Endnotes
[1] Lynd, Robert. The Orange Tree: A Volume of Essays. Methuen, 1926.
[2] Quote attributed to Charles Babbage (https://en.wikiquote.org/wiki/Charles_Babbage).
3: Whose Comments Are We Interested In?
All opinions are not equal. Some are a very great deal more robust, sophisticated, and well supported in logic and argument than others.
—Douglas Adams, The Salmon of Doubt [1]
Up to this point, we have concerned ourselves with what data to analyze while ensuring that what we selected is germane to our topic. In this chapter, we explore how important it is to determine whose comments we are interested in. A few examples are as follows:
■ If we are interested in getting objective feedback on a product from a specific company, we might want to make sure that we can identify or exclude this company’s employees from the pool of content under analysis.
■ Similarly, we need to ask: Are we interested in comments from the general public, or are we interested in the comments of C-level employees (that is, chief marketing officers or chief information officers)?
■ Also, are we interested only in people who have a positive bias toward a company or those with a strong negative bias?