CHAPTER 3: METHODS
3.2 Data
Predictive analysis based on machine learning requires a considerable amount of data. Unfortunately, data from online health forums, where active exchange of health information would be expected, were not available because of privacy concerns. For this study, I therefore chose two publicly available datasets that are heavily used and that represent two popular types of social media: social Q&A and online reviews. Although the two types differ in the format and kind of information exchanged, they share the property that users communicate with one another bi-directionally over the Internet. Predicting the credibility of health information disseminated through one-directional media such as news could also be meaningful. However, research on social media is more meaningful because users who face complex health problems or have difficulty finding relevant information tend to rely on interactive media when seeking health information. Social media is also better suited than traditional media to providing personalized information.
3.2.1 Yahoo! Answers Dataset
Yahoo! Answers is the most popular Q&A site (Fichman, 2011) and has over 100 million users (Dom & Paranjpe, 2008). As of 2008, it had more than 23 million archived questions (Adamic et al., 2008). Yahoo! Answers is a website on which people can post questions and answers, all of which are publicly available. It is one of the most frequently consulted reference sites, second only to Wikipedia (Fichman, 2011). Moreover, it is the most active collaborative information-seeking and knowledge-sharing community (Adamic et al., 2008; Chua & Banerjee, 2015; Jin, Zhou, Lee, & Cheung, 2013). Compared with traditional media, information sharing on social Q&A sites can be more interactive and customized because users post specific questions and receive answers that address their own needs. The tremendous amount of data and the diversity of topics have attracted many researchers' attention, and Yahoo! Answers has therefore become a popular setting for research despite its short history (Agichtein et al., 2008).
Yahoo! Research makes several datasets available for non-commercial research purposes through its Webscope™ Program (https://webscope.sandbox.yahoo.com/). Yahoo! Answers Comprehensive Questions and Answers version 1.0 was used in this dissertation study. This dataset contains 4,483,032 questions and their answers, the “best answer” selected by users, the category/sub-category selected by the asker, and other metadata. The data version used is the Yahoo! Answers corpus as of October 25th, 2007.
The best answer is selected by the asker or, if the asker does not select one, by the participants in the thread through voting (Shah & Pomerantz, 2010). Multiple answers can be of high quality, but only one answer is selected as the best answer. Rankings assigned by users can be problematic because users may evaluate answer quality subjectively (Fichman, 2011). In the study by Liu et al. (2008), for instance, the selection was considered to be an indication of the user’s satisfaction. The best answer is partially related to credibility but is not the same concept as credibility. Therefore, the credibility labels were created independently, and the best-answer status was not used as a credibility label in this study. In other words, answers selected as the best answer by users were used in the study, but the credibility labels of these best answers were collected through crowdsourcing.
The associated metadata includes the ID of the user who asked the question, the ID of the user who provided the best answer, the language in which the question and answers were posted, the location where the question was posted, the date when the question was posted, the timestamp of the last answer to the question, the date when the question was resolved, and the date when the best answer was voted on. Because of privacy concerns, the Yahoo! Answers dataset is somewhat limited: it includes only minimal user information, namely the virtual user IDs of the person who asked the question and the person who provided the best answer. Other user information, such as the IDs of users who provided answers that were not selected as the best answer, is not available in the dataset.
Data selection
Only questions and reviews related to health inquiries were selected, as follows. In the Yahoo! Answers dataset, three types of category information are available: main_category, category, and sub_category. The main_category is the top-level category in the hierarchy, so it was used as the primary filter for health-related questions. After questions in the “health” category were selected, 278,939 questions remained. Then, questions not written in English were removed, leaving 278,882 questions. Finally, questions posted from countries other than the US were removed, because the way users ask and answer health-related questions is likely to differ by region and culture, and this study focused on the health information behavior of the English-speaking population living in the United States. After this filtering process, 255,828 questions remained.
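The three filtering steps described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: only main_category is a field name given in the text, while "language" and "country" (and their values "en" and "us") are hypothetical stand-ins for the dataset's language and location metadata, and the sample records are invented.

```python
def filter_health_questions(questions):
    """Apply the category, language, and country filters in order.

    Returns the surviving records and the number of records left
    after each step, mirroring the counts reported in the text.
    """
    counts = {}

    # Step 1: keep questions whose top-level category is "health".
    kept = [q for q in questions
            if q.get("main_category", "").lower() == "health"]
    counts["health"] = len(kept)

    # Step 2: keep questions written in English
    # ("language" is a hypothetical field name).
    kept = [q for q in kept if q.get("language") == "en"]
    counts["english"] = len(kept)

    # Step 3: keep questions posted from the US
    # ("country" is a hypothetical field name).
    kept = [q for q in kept if q.get("country") == "us"]
    counts["us"] = len(kept)

    return kept, counts


# Invented toy records, for illustration only.
sample = [
    {"id": 1, "main_category": "Health", "language": "en", "country": "us"},
    {"id": 2, "main_category": "Health", "language": "en", "country": "uk"},
    {"id": 3, "main_category": "Health", "language": "es", "country": "us"},
    {"id": 4, "main_category": "Pets", "language": "en", "country": "us"},
    {"id": 5, "main_category": "health", "language": "en", "country": "us"},
]

kept, counts = filter_health_questions(sample)
print(counts)
```

Recording the count after each step makes it easy to check a run against the figures reported for each stage of the filtering.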
As mentioned in the previous section, there is no guarantee that answers selected as the best answer are credible. The expertise of the questioner may not be sufficient to make the right judgment, and many other factors can affect users' votes. The focus of this dissertation research was to assess the credibility of answers. However, the quality of the question is known to be one of the critical factors affecting the credibility of the answer. There is a Korean proverb about "a wise answer to a silly question." This proverb mainly compliments a good answer, but it also stresses how difficult it is to expect good answers to bad questions. Because Yahoo! Answers has a fairly large number of low-quality questions, it is important to apply a certain level of quality control.
Taking Yahoo! Answers as an example, some questions receive thousands of tags of interest, whereas others do not get even one answer (Li, Jin, Lyu, King, & Mak, 2012). Low-quality questions tend to get poor answers, whereas good answers can be expected from high-quality questions (Agichtein et al., 2008). Therefore, it is important to keep the quality of the questions selected in this study at a reasonable level. Prior studies applied different criteria to identify high-quality questions. Li et al. (2012) applied three criteria: first, the question should attract great attention from users; second, it should attract more answers; third, it should receive its best answer within a short period of time. Agichtein et al. (2008, p. 189) defined question quality as "well-formedness, readability, utility, and interestingness" and analyzed the essential features of question quality.
Exclusion criteria
After exploring and inspecting 500 questions randomly selected from the 255,828, I created a negative list of question types for exclusion. Creating a positive list of question types for inclusion could have an excessive impact on the quality of the selected questions by selectively choosing good-quality questions. More specifically, this decision was made for the following reasons: 1) Choosing data by specific question types increases the likelihood that the researcher's bias is introduced into the study. As mentioned earlier, question quality is known to affect answer quality. A negative list can also introduce a certain level of bias, but that bias can be minimized if only questions for which good answers could not reasonably be expected are removed. 2) A large number of questions were registered in the health category but were entirely out of topic. For instance, several questions asked how to prepare for a job in the healthcare industry. Also, many questions were conversational rather than seeking health information. The exclusion criteria are summarized in Table 4.
Table 4. Exclusion Criteria
Type: Conversational
Example: “Does anyone know what would be a real good comeback to someone that insults you about your weight?”
Reason: Answers to this type of question would not include health information.
Type: Emotional
Example: “My mom has had bone cancer for over a year, it started in her hand and last month she did scan, and the cancer has spread in all her body ... I don’t wanna lose her, she is everything to me in this life, I cry day and night. Plz pray for my mom.”
Reason: This question is emotional and can expect only subjective, emotional feedback, the credibility of which is hard to judge.
Type: Out-of-topic
Example: “How many more melting ice-bergs, hurricanes, disasters, before we act on global warming by reducing pollution?”
Reason: These questions are not relevant to health topics.
Type: Broad topic
Example: “How can I stop eating junk foods while working?”
Reason: This question is related to health, but the scope is too broad. The credibility of answers to this type of question would be hard to judge because there are too many possibilities.
Type: Trash jokes/silly question
Example: “I'm so hot that I have to get drunk all the time just to deal with myself what should I do?”
Reason: The questioner asks a very personal question that is hard and silly to answer.
Type: Personal affair
Example: “How many of you are Lactose Intolerant?”
Reason: The answer will depend on the person who answers. There is no right, credible answer.
Among the exclusion criteria above, most of the questions to be excluded fell under trash jokes or personal affairs. It was almost impossible to pick out these types of questions manually. Thus, I examined the 500 randomly selected questions that were used to create the exclusion criteria and generated a list of words closely related to those criteria. For instance, there were many trash jokes about sex; "Does masturbation effect the length of the penis?" is one example. Smoking is closely related to health and may have a significant link to cancer or respiratory diseases. However, of the 500 questions, 18 were related to cigarettes and drugs, and 17 of those were about personal opinions or quitting smoking. “Who besides me hates cigarettes? They smell, and make me sick when I'm around a smokers?” and “What is the best thing to do to stop smoking if the patch doesn't work?” are examples. Only one case was related to managing health: the question was “Will smoking make my asthma worse?” and the answer was “Asthma sufferers are the last people that should smoke - YES it will make it worse.” Questions about smoking like this one are expected to have obvious answers because smoking is known to be bad for health. In addition to the 500 questions, I retrieved further smoking-related questions by keyword search, and most of them sought to reconfirm whether smoking is bad for health. The questioners probably already knew that it is but did not want to quit. In this way, a list of spam words was created, and questions containing any word in this list were excluded from the dataset using regular expressions.
Table 5. A List of Spam Words
Topic            Spam words
Cigarette/drug   Smoking, smoke, smoker, cigarette, cigar, and drug
Sex              Kissing, masturbation, Viagra, gonad, saliva, sex, pantiliner, condom, and virginity
Others           Hiccup and bourbon
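The regular-expression filter built from the spam words in Table 5 can be sketched as follows. The text states only that regular expressions were used, so the matching rules here (whole words, case-insensitive) are assumptions, and the function names and example questions are illustrative.

```python
import re

# Spam words taken from Table 5.
SPAM_WORDS = [
    "smoking", "smoke", "smoker", "cigarette", "cigar", "drug",
    "kissing", "masturbation", "viagra", "gonad", "saliva", "sex",
    "pantiliner", "condom", "virginity",
    "hiccup", "bourbon",
]

# Assumption: match whole words only, ignoring case.
SPAM_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in SPAM_WORDS) + r")\b",
    re.IGNORECASE,
)


def contains_spam_word(text):
    """Return True if the text contains any spam word as a whole word."""
    return SPAM_PATTERN.search(text) is not None


def drop_spam_questions(questions):
    """Keep only the questions that contain no spam word."""
    return [q for q in questions if not contains_spam_word(q)]


# Illustrative questions; the first is quoted in the text above.
examples = [
    "Will smoking make my asthma worse?",
    "What are common symptoms of a pollen allergy?",
    "Is this drugstore inhaler safe?",
]
kept = drop_spam_questions(examples)
print(kept)
```

Under the whole-word assumption, "drugstore" is not treated as a match for "drug", but plural forms such as "cigarettes" are also missed. Dropping the trailing word boundary would catch plurals at the cost of more false positives.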
Yahoo! Answers has various categories under the “health” category, such as alternative medicine, dental, diet & fitness, disease & condition, men's health, and women's health. Credible answers to the questions in these categories vary greatly depending on the situation and the individual. For example, in the case of diet, the degree of the diet and its goal differ from one individual to another. There is also a practical issue with the men's health and women's health categories: they contain too many trash jokes related to sex, which are hard to filter manually. Therefore, only health questions classified in the disease & condition category were selected, as they relate to relatively less subjective information needs. The disease & condition category has sub-categories, and some of them contain more subjective and out-of-topic questions than others. For instance, skin condition and sexually transmitted diseases (STDs) include many trash jokes. Therefore, I selected five sub-categories (allergies, cancer, diabetes, heart diseases, and respiratory diseases) that focus more on actual health inquiries. Specifically, the questions used in the study were stratified samples of these five topics, which relate to general health problems.
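Stratified sampling over the five sub-categories can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study: the equal quota per sub-category, the random seed, and the toy records are all hypothetical, and sub_category is the only field name taken from the text.

```python
import random
from collections import defaultdict

# The five sub-categories named in the text.
SUB_CATEGORIES = [
    "allergies", "cancer", "diabetes",
    "heart diseases", "respiratory diseases",
]


def stratified_sample(records, per_stratum, seed=0):
    """Shuffle the records, then fill an equal quota for each sub-category.

    Records from other sub-categories are skipped. The equal quota is an
    assumption made for this sketch.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)

    picked = defaultdict(list)
    for record in shuffled:
        stratum = record["sub_category"]
        if stratum in SUB_CATEGORIES and len(picked[stratum]) < per_stratum:
            picked[stratum].append(record)
    return dict(picked)


# Invented toy data: ten records per sub-category, plus one excluded one.
toy = [
    {"id": f"{name}-{i}", "sub_category": name}
    for name in SUB_CATEGORIES + ["skin conditions"]
    for i in range(10)
]

sample = stratified_sample(toy, per_stratum=3, seed=42)
for name, rows in sample.items():
    print(name, len(rows))
```

Shuffling once and then filling quotas in a single pass keeps the selection random within each stratum while guaranteeing that no sub-category exceeds its share.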
There was one more point to consider in selecting the final dataset. As explained earlier, this dissertation study aimed to examine how the effects of factors on credibility judgments vary with the topic (general vs. specific). Therefore, each question was also coded as general or specific. A question covering specific information needs, such as the side effects or effectiveness of a particular drug or treatment, was considered specific, whereas a question covering general information needs, such as the cause or effect of a disease, was considered general. Because I coded the data myself, instances whose topic type was uncertain were excluded from the study to minimize coder bias. All questions and answers were first randomly sorted, and filtering continued until the final 2,000 question-and-answer pairs from the Yahoo! Answers dataset were collected, with an even distribution across topic types and sub-categories and with the exclusion criteria listed in Table 4 applied.
3.2.2 Yelp Dataset
Yelp is the largest business-listing site for service businesses. Yelp releases new datasets to the public for research purposes every six months through its Dataset Challenge (https://www.yelp.com/dataset/challenge). I used the dataset from the 9th round of the Yelp Dataset Challenge. Because of the large size of its data, Yelp is considered a representative online review service when only service reviews are considered (Racherla & Friske, 2012). The dataset contains online reviews of local businesses in four countries (US, Canada, England, and Germany) written between March 2005 and January 2017. It comprises 2,685,065 reviews written by 686,555 reviewers for 85,950 local businesses.
Data for reviews, reviewers, and businesses were provided as JSON objects in separate files. The most recent review in the dataset was written on January 20th, 2017. Some attributes can be used directly as machine learning features, while others, such as review texts and lists of friends, can be turned into features through processing. The review object has a star rating, review text, date, and number of votes for “useful,” “funny,” and “cool.” The business object includes basic information about local businesses such as location, star rating, review count, and service category. The user object consists of review count, average stars, number of compliments, and list of friends. Unlike the Yahoo! Answers dataset, the Yelp dataset has rich network-related attributes (e.g., list of friends) that can be used for network analysis.
Since most reviews are about restaurants, which are not the target of this study, the dataset was filtered by category information (e.g., doctors, dentists, dermatologists, and hospitals) so that only health-related reviews were considered. After this filtering, 81,340 reviews remained. As with the Yahoo! Answers data, 1,000 reviews were needed for specific health information needs, such as cardiologists and gynecologists, and another 1,000 for general health information needs, such as family practice and urgent care. A physician reviewed the criteria for topic categorization before the coding process. All reviews were first randomly sorted, and 1,000 reviews were then selected independently for each of the specific and general topics. In total, 2,000 reviews were used in the dissertation study.
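The category-based filtering and the random selection per topic type can be sketched as follows. This is a simplified illustration, not the procedure used in the study: the record layout (business_id, categories), the category names, and the two category sets are assumptions, and the lookup function is a hypothetical stand-in for the manual coding step. Treating businesses that match both sets as uncoded is also an assumption of this sketch.

```python
import random

# Hypothetical category sets standing in for the manual topic coding.
SPECIFIC = {"Cardiologists", "Gynecologists"}
GENERAL = {"Family Practice", "Urgent Care"}


def topic_of(categories):
    """Return 'specific', 'general', or None for non-health or ambiguous cases."""
    cats = set(categories or [])
    is_specific = bool(cats & SPECIFIC)
    is_general = bool(cats & GENERAL)
    if is_specific and not is_general:
        return "specific"
    if is_general and not is_specific:
        return "general"
    return None


def sample_reviews(businesses, reviews, n_per_topic, seed=0):
    """Randomly order the reviews, then take up to n_per_topic per topic type."""
    topic_by_business = {
        b["business_id"]: topic_of(b.get("categories")) for b in businesses
    }
    shuffled = list(reviews)
    random.Random(seed).shuffle(shuffled)

    selected = {"specific": [], "general": []}
    for review in shuffled:
        topic = topic_by_business.get(review["business_id"])
        if topic is not None and len(selected[topic]) < n_per_topic:
            selected[topic].append(review)
    return selected


# Invented toy data, for illustration only.
businesses = [
    {"business_id": "b1", "categories": ["Doctors", "Cardiologists"]},
    {"business_id": "b2", "categories": ["Urgent Care"]},
    {"business_id": "b3", "categories": ["Restaurants"]},
    {"business_id": "b4", "categories": ["Cardiologists", "Family Practice"]},
]
reviews = (
    [{"business_id": "b1", "text": f"cardiology review {i}"} for i in range(3)]
    + [{"business_id": "b2", "text": f"urgent care review {i}"} for i in range(3)]
    + [{"business_id": "b3", "text": f"restaurant review {i}"} for i in range(2)]
    + [{"business_id": "b4", "text": "mixed practice review"}]
)

selected = sample_reviews(businesses, reviews, n_per_topic=2, seed=7)
print({topic: len(rows) for topic, rows in selected.items()})
```

Because the business file and the review file are separate, the topic type is resolved once per business and then looked up for each review through the shared business identifier.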
3.2.3 Comparison of Characteristics of Datasets
As mentioned earlier, the two datasets have different characteristics because they represent different types of social media. Additionally, Yahoo! Answers is designed for sharing answers to a specific question, whereas Yelp is designed for sharing reviews of a specific business. Previous credibility studies using the Yahoo! Answers dataset (Eysenbach & Köhler, 2002; Fogg, 2003; Liu, 2004; Rieh, 2002) have found that credibility is closely related to answerers’ expertise as well as to source-related factors such as links and references. However, questioners have to rely more on message credibility than on source credibility, partly because source information is unavailable (Kim, 2010). The Yahoo! Answers dataset does not contain rich information related to the source in the way the Yelp dataset does. Figure 2 describes the overall characteristics of the datasets in terms of the type of social media (vertical axis) and the availability of content features and source features (horizontal axis). The Yelp dataset contains information about both the review and its author, while the Yahoo! Answers dataset contains very limited author information.
Figure 2. Overall Characteristics of Datasets
Therefore, minimizing the potential confounding caused by the different availability of source-related features was critical in conducting this study. However, it was also important to examine how discriminative the source-related features available in Yelp are in predicting the credibility of health information. This study was therefore conducted with the following strategies. First, when MTurk workers were asked for credibility judgments, a minimum number of source-related cues were shown to them. In this way, the potential confounding factors that could arise when judging the credibility of Yahoo! Answers data and Yelp data were minimized. Second, when examining factors affecting credibility judgments according to the type of social media, this study used a minimum number of source-related features, as mentioned above. Third, in order to compare the discriminative power of the source-related features with that of the content-related features, I separated the experiments into two types: experiments using the Yahoo! Answers dataset and experiments using the Yelp dataset.