It’s Not What You Say but WHERE You Say It

Sometimes we need to filter our datasets based on the context in which a word or phrase is used. For example, during one of our engagements, we were trying to understand the discussions being held around a conference called “Sapphire Now.” According to its Twitter account, it is “The world’s premier business technology conference around business technology trends and innovations.” When we first started collecting data, we simply searched on the word sapphire, expecting to find all of our collected tweets referring to the show.

We forgot about:

■ Sapphire, the gem
■ Sapphire, the liquor
■ Sapphire, the color
■ and so on

So in the cleansing of this data, we had to look at all of our tweets that referenced sapphire and select just the ones that dealt with the conference or technology in general. In other words, we kept only the uses of the word sapphire in the context of technology.

Using the sample dataset we had previously, we had the same problem. We had a set of data about “apple,” and it came in a few “flavors”:

■ Apple, the corporation (and its products)
■ Apple, the fruit
■ Apple, as a sign of affection
■ No doubt there would be others

One way we could sort out the data was to look for the word apple in the dataset and select just those tweets that discussed the iPad (or any of Apple’s products).

Our egrep command would pair the -i option with a pattern along the lines of (apple).*(ipad).


In this case, we specify that we want all tweets that contain the word apple (remember, we use the -i option of egrep to ignore case) followed anywhere later on the line by the word ipad; the period and asterisk between the two words match any run of characters. The output of this command yields:

1: Apple said to prepare new 12.9-inch IPad for early 2015:
3: RT @BloombergNews Apple said to prepare new 12.9-inch IPad for early 2015:
5: Apple is working on a bigger iPad with a 12.9-inch screen that will launch early next year

Obviously, as before, specifying (apple).*(ip) would have included the tweet about the iPhone. The point is that looking for specific words when other keywords are present allows us to understand (or at least assume) our search words are being used in the proper context.
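As a rough illustration of the difference between the two patterns, the sketch below applies them with Python's re module rather than egrep (re.IGNORECASE plays the role of the -i option). Tweets 1, 3, and 5 are the matches quoted above; tweets 2 and 4, and the helper name filter_in_context, are made up for this example and are not part of our dataset.

```python
import re

# Tweets 1, 3 and 5 are the matches quoted in the text. Tweets 2 and 4 are
# invented stand-ins so the filter has something to reject.
tweets = {
    1: "Apple said to prepare new 12.9-inch IPad for early 2015:",
    2: "Nothing beats a warm slice of apple pie",  # hypothetical
    3: "RT @BloombergNews Apple said to prepare new 12.9-inch IPad for early 2015:",
    4: "Apple iPhone sales beat expectations",  # hypothetical
    5: "Apple is working on a bigger iPad with a 12.9-inch screen "
       "that will launch early next year",
}


def filter_in_context(tweets, pattern):
    """Return the numbers of the tweets whose text matches pattern, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [number for number, text in tweets.items() if regex.search(text)]


# "apple", then anything, then "ipad": only the iPad tweets survive.
ipad_only = filter_in_context(tweets, r"(apple).*(ipad)")

# The looser pattern also lets the iPhone tweet through.
any_ip = filter_in_context(tweets, r"(apple).*(ip)")

print(ipad_only)
print(any_ip)
```

With this toy data, the stricter pattern keeps tweets 1, 3, and 5, while the looser one also keeps the invented iPhone tweet. Neither pattern keeps the tweet about the fruit.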

Many search engines allow keywords to be filtered on whether other words appear within a certain distance (for example, if we see the word apple, the word ipad must be seen within three words; otherwise, it will not be considered a match).
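A minimal sketch of such a proximity test follows. It is not the operator of any particular search engine: the function name within_words, the decision to count only words that follow the first keyword, and the simple \w+ tokenization are all assumptions made for illustration.

```python
import re


def within_words(text, first, second, max_distance):
    """Return True if `second` occurs at most `max_distance` words after `first`."""
    # Lowercase the text and split it into runs of letters, digits and
    # underscores. Note that "12.9-inch" becomes three words this way.
    words = re.findall(r"\w+", text.lower())
    first_positions = [i for i, word in enumerate(words) if word == first]
    second_positions = [i for i, word in enumerate(words) if word == second]
    return any(
        0 < j - i <= max_distance
        for i in first_positions
        for j in second_positions
    )


headline = "Apple said to prepare new 12.9-inch IPad for early 2015:"

# In this headline "ipad" is the eighth word after "apple".
print(within_words(headline, "apple", "ipad", 3))
print(within_words(headline, "apple", "ipad", 8))

# A made-up headline in which the two words sit close together.
print(within_words("Apple unveils iPad Air", "apple", "ipad", 3))
```

The first call reports no match because the two words are too far apart for a three-word window; widening the window to eight words, or using the shorter made-up headline, produces a match. How words are counted depends heavily on how the text is split, so the right window size is something to tune against real data.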

Summary

What we discussed in this chapter could loosely be called the process of data cleansing. This is essentially the process of detecting and removing inaccurate data from a dataset (or database). It’s an essential step in feeding our more complex “downstream” steps of analysis and interpretation. While we took you through some relatively simple examples, we hope we made our point: An analysis is only as accurate as its weakest link. If your datasets are corrupt or inaccurate, those errors will only be magnified later. If you minimize the corruption and inaccuracies up front, the magnified error will, with luck, be too small to skew your results in the wrong direction.

The question “How do we know if we have the right data?” is a difficult one to answer. The process of data cleansing, like the whole process of analysis, is iterative in nature (we discuss this issue further in Chapter 12, where we discuss things that can go wrong). At some point we have to decide that “enough is enough.” If we’ve passed over the data three or four times in an attempt to clean it, and each pass produces less and less cleaning, the question becomes “Is it worth continuing to clean with less and less of a result?” There is no simple answer to that question other than your having a gut feeling that your data is ready for the next step.
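One way to inform that gut feeling is to record how many records each cleaning pass removes, so that “less and less of a result” shows up as numbers. The sketch below is only an illustration: the function name, the three toy passes, the sample records, and the 5 percent cutoff are all assumptions, and the cutoff is no substitute for judgment.

```python
def clean_until_diminishing(records, passes, min_fraction=0.05):
    """Apply cleaning passes in order and stop after the first pass that
    removes less than min_fraction of the records it was given."""
    removed_per_pass = []
    for clean in passes:
        before = len(records)
        records = clean(records)
        removed = before - len(records)
        removed_per_pass.append(removed)
        if before == 0 or removed / before < min_fraction:
            break
    return records, removed_per_pass


# Three hypothetical passes over a toy dataset.
def drop_duplicates(records):
    return list(dict.fromkeys(records))  # keeps the first copy, preserves order


def drop_fruit(records):
    return [r for r in records if "pie" not in r.lower()]


def drop_blank(records):
    return [r for r in records if r.strip()]


records = [
    "Apple iPad rumor",
    "Apple iPad rumor",
    "apple pie recipe",
    "Apple iPhone sales",
    "Apple Watch review",
]

cleaned, removed_per_pass = clean_until_diminishing(
    records, [drop_duplicates, drop_fruit, drop_blank]
)
print(removed_per_pass)
print(cleaned)
```

Here the first two passes each remove one record and the third removes none, which is the signal that another pass of the same kind is unlikely to be worth the effort.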

Endnotes

[1] Lynd, Robert. The Orange Tree: A Volume of Essays. Methuen, 1926.
[2] Quote attributed to Charles Babbage (https://en.wikiquote.org/wiki/Charles_Babbage).


3: Whose Comments Are We Interested In?

All opinions are not equal. Some are a very great deal more robust, sophisticated, and well supported in logic and argument than others.

—Douglas Adams, The Salmon of Doubt [1]

Up to this point, we have concerned ourselves with what data to analyze while ensuring that what we selected is germane to our topic. In this chapter, we explore how important it is to determine whose comments we are interested in. A few examples are as follows:

■ If we are interested in getting objective feedback on a product from a specific company, we might want to make sure that we can identify or exclude this company’s employees from the pool of content under analysis.

■ Similarly, we need to ask: Are we interested in comments from the general public, or are we interested in the comments of C-level employees (that is, chief marketing officers or chief information officers)?

■ Also, are we interested only in people who have a positive bias toward a company or those with a strong negative bias?


In document Social Media Analytics_ Techniques and Insights [Dr.soc] (Page 65-69)