CHAPTER 3: METHODS
3.5 Feature Ablation Study
Features for machine learning were created using natural language processing (NLP), network metrics, and other available resources. I organized features into three types: content-based, link-based, and user-based features (Table 7). Content-based features measured characteristics of the content itself (e.g., the answer or review). Link-based features measured characteristics of nodes in a network. For instance, I used several network metrics (e.g., degree, eigenvector, and betweenness). In these metrics, nodes were users of Yelp, and edges were friendships between users. User-based features measured characteristics of users (e.g., age of account). Tools that are promising for generating useful features are also summarized in Table 7. It is important to note that the features listed here are not the final set of features used in the study. One of the reasons for conducting the content analysis was to find clues for operationalizing credibility factors as features while inspecting the health information manually. Therefore, the final features that were operationalized based on the content analysis are summarized in the results section.
Table 7. Promising Features in Predicting Credibility of Information and Tools to Extract the Corresponding Feature
Type of feature | Feature | Tool
Content-based | Word shared between question and answer | NLTK (Bird, 2006)
Content-based | Non-standard use of grammar and punctuation | NLTK (Bird, 2006)
Content-based | Part-of-Speech (e.g., number of adverbs, adjectives, and nouns) | NLTK (Bird, 2006)
Content-based | Readability (e.g., SMOG score) | Python readability library¹
Content-based | Text complexity (document entropy) | NLTK (Bird, 2006), Python Scikits library (Pedregosa et al., 2011)
Content-based | Text informative (e.g., tf-idf) | NLTK (Bird, 2006), Python Scikits library (Pedregosa et al., 2011)
Content-based | Quantity of information (uncertainty reduction theory) | NLTK (Bird, 2006)
Content-based | Vividness, descriptiveness, and strength of message (e.g., inclusion of personal experience, number of suggestions) | NLTK (Bird, 2006)
Content-based | Perceived expertise (ratio of jargon) | MetaMap (Aronson & Lang, 2010)
Content-based | Frequency of modal and inferential conjunction | NLTK (Bird, 2006)
Content-based | Sentiment (e.g., frequency of affective words, attitude, tone) | NLTK (Bird, 2006), TextBlob library², WordNet affect (Strapparava & Valitutti, 2004), Stanford CoreNLP (Manning et al., 2014)
Content-based | Content similarity with validated source | NLTK (Bird, 2006), Python Scikits library (Pedregosa et al., 2011), MetaMap (Aronson & Lang, 2010), SemRep (Rindflesch & Fiszman, 2003)
Content-based | Redundancy of the content | NLTK (Bird, 2006), Python Scikits library (Pedregosa et al., 2011), MetaMap (Aronson & Lang, 2010), SemRep (Rindflesch & Fiszman, 2003)
Content-based | Frequency of emoticon | NLTK (Bird, 2006)
Content-based | Fraction of content containing spam words | NLTK (Bird, 2006)
Content-based | Authorship (authors and their affiliation) and attribution (references and sources of content) | NLTK (Bird, 2006)
Content-based | Number of witnesses (e.g., number of positive replies) | NLTK (Bird, 2006), TextBlob library, WordNet affect (Strapparava & Valitutti, 2004), Stanford CoreNLP (Manning et al., 2014)
Link-based | PageRank (Brin & Page, 1998) |
Link-based | HITS (Kleinberg, 1999) |
Link-based | Network metrics (e.g., degree, eigenvector, and betweenness) | Pajek (Batagelj & Mrvar, 1998)
Link-based | Ratio of external/internal inbound links to outbound links | Pajek (Batagelj & Mrvar, 1998)
Link-based | Number of internal/external inbound links | Pajek (Batagelj & Mrvar, 1998)
Link-based | Ratio of number of replies to number of questions | Python³
Link-based | Maximum depth of the propagation trees | Pajek (Batagelj & Mrvar, 1998)
User-based | Entropy of the categories of user | Python
User-based | Entropy of time window | Python
User-based | Reliability of a user (e.g., Mean absolute error of user’s ratings) | Python
User-based | Number of comments added by other participants | Python
User-based | Number of questions, answers, replies, and giving thanks | Python
User-based | Difference in positive and negative replies | Python
User-based | Frequency and regularity of posts | Python
User-based | Age of account | Python
Some features, such as content similarity with a validated source, require external authoritative resources for comparison. Content similarity measured the similarity between an answer and the most relevant pages from these authoritative resources. In this dissertation study, eight online resources were used as authoritative data. Links to each resource and brief explanations are in Table 8.
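The content-similarity feature described above can be illustrated with a small sketch. It assumes tf-idf weighting with cosine similarity and a whitespace tokenizer; the function names (`tfidf_vectors`, `max_similarity`) are hypothetical, and the study's actual pipeline (NLTK, the Python Scikits library, MetaMap, SemRep) may have computed the feature differently.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real NLP tokenizer."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (a dict of term weights) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed idf (an assumption) so that very common terms keep a nonzero weight.
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def max_similarity(answer, reference_pages):
    """Similarity between an answer and its most similar reference page."""
    vectors = tfidf_vectors([answer] + list(reference_pages))
    return max((cosine(vectors[0], ref) for ref in vectors[1:]), default=0.0)
```

Under these assumptions, an answer that shares no vocabulary with any authoritative page scores 0, and an answer identical to a page scores 1.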
Table 8. Authoritative Online Resources for Content Validation
Name | URL | Information
Asthma and Allergy Foundation of America | http://www.aafa.org/ | “a not-for-profit organization founded in 1953, is the leading patient organization for people with asthma and allergies, and the oldest asthma and allergy patient group in the world”
American Heart Association | http://www.heart.org/ | “the nation’s oldest and largest voluntary organization dedicated to fighting heart disease and stroke.” “We fund innovative research, fight for stronger public health policies, and provide critical tools and information to save and improve lives”
HealthLine | https://www.healthline.com/ | HealthLine Media is a for-profit health news provider founded in 1999 as YourDoctor.com and a competitor to WebMD.
Cochrane | http://www.cochrane.org/ | A non-profit organization formed in order to provide up-to-date and accurate information on the effects of globally available healthcare.
American Cancer Society | https://www.cancer.org/ | A nationwide voluntary health organization that is established to free people from cancer. It funds research, shares expert information, supports patients, and publishes three journals: Cancer; CA: A Cancer Journal for Clinicians; and Cancer Cytopathology.
MedlinePlus | https://medlineplus.gov/ | “the National Institutes of Health's Web site for patients and their families and friends. Produced by the National Library of Medicine, the world’s largest medical library, it brings you information about diseases, conditions, and wellness issues in language you can understand.”
MAYO CLINIC | https://www.mayoclinic.org/ | This is the Website of the MAYO CLINIC. “Because physicians, scientists, and other medical experts dedicate a portion of their clinical time to this site, we are in the unique position to give you access to the knowledge and experience of Mayo Clinic.”
WebMD | https://www.webmd.com/ | A for-profit company which is an online publisher of health news and information. It is the market-leading online health information provider, and its contents are verified by over 100 doctors and health experts.
Features utilized in this study were grouped into categories that were developed through the content analysis. Some categories had sub-categories as needed. For example, content informativeness had the sub-categories of plausibility, relevance, comprehensiveness, and specificity. Each category was created to represent a potential factor influencing perceived credibility. The overall hierarchy of the feature categories and corresponding factors is summarized in the results section.
I trained logistic regression classifiers and tested their performance in predicting the binary credibility class of each answer or review, using the features developed through the content analysis and literature review. A logistic regression classifier predicts a dependent variable as a function of a combination of independent variables. Logistic regression is suitable for predicting a binary dependent variable (credible and not_credible in this study) using multiple categorical or continuous independent variables. Logistic regression has been used successfully for similar predictive tasks, and the main focus of this study was to measure the marginal contribution of various feature categories by conducting feature ablation studies. Recent algorithms such as deep learning can be effective in improving the performance of classifiers. However, the hidden layers of a deep learning model combine features in ways that make it difficult to understand which features actually affect the credibility of health information. Logistic regression classifiers based on the generalized linear model were implemented using the R caret package. Feature selection methods such as forward, backward, stepwise, and correlation-based feature selection (CFS) were applied to remove features that were not discriminative. To prevent overfitting, 20% of the data (200 instances for each dataset) was used in the feature selection and the remaining 80% (800 instances for each dataset) was used in the feature ablation studies.
Credibility models were trained and evaluated using 10-fold cross-validation. The instances for the 10 folds were randomly selected, and the same 10 folds were used in all experiments. The evaluation metric was decided after examining the distribution of credibility labels. If the data had an even distribution (credible vs. not_credible), classification accuracy could be used. Accuracy, which measures the percentage of correct predictions, is easy to interpret if the data is uniformly distributed (50/50), because a random classifier is expected to achieve about 50% accuracy. In our case, the data was not uniformly distributed. According to the experts’ annotation, the credible class (80%) was much larger than the not_credible class (20%), so a classifier that always predicted credible would reach about 80% accuracy without identifying any not_credible content. Therefore, I used average precision (AP) as the main evaluation metric. AP is the average of the precision values at the points where recall changes, taken over all prediction confidence threshold values. AP ranges from 0 to 1 and is roughly the area under the precision-recall curve.
AP is defined in Equation 1. In this equation, D represents the test set and D^{-} represents the set of gold-standard negative examples. Φ represents a ranking of D in descending order of prediction confidence value, and Φ(k) represents the instance at rank k in Φ. P(k) represents the precision at rank k, which is the percentage of top-k instances that are gold-standard negative examples. Finally, I is an indicator function.
\[ \mathrm{AP} = \frac{1}{|D^{-}|} \sum_{k=1}^{|D|} P(k) \times I\big(\Phi(k) \in D^{-}\big) \tag{1} \]
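As a minimal sketch, Equation 1 can be computed directly from a list of prediction confidences and gold-standard flags. The function name `average_precision` and its argument layout are assumptions made for illustration; the scores are taken to be the classifier's confidence that an instance is not_credible.

```python
def average_precision(scores, is_negative):
    """Average precision for the not_credible class, following Equation 1.

    scores      -- prediction confidence that each instance is not_credible
    is_negative -- True where the gold-standard label is not_credible
    """
    n_negative = sum(1 for flag in is_negative if flag)
    if n_negative == 0:
        raise ValueError("AP is undefined without gold-standard negative examples")

    # Rank all instances of the test set by descending confidence.
    ranked = sorted(zip(scores, is_negative), key=lambda pair: pair[0], reverse=True)

    hits = 0
    total = 0.0
    for k, (_, negative) in enumerate(ranked, start=1):
        if negative:              # indicator: the instance at rank k is a gold negative
            hits += 1
            total += hits / k     # P(k): precision among the top-k instances
    return total / n_negative
```

For a ranking in which the gold negatives sit at ranks 1 and 3 of four instances, the sum is 1/1 + 2/3 and AP is (1 + 2/3) / 2, or about 0.83. Tied scores keep their input order here, which is an arbitrary choice.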
Between the two classes (credible and not_credible), performance metrics for the not_credible class were mainly reported, for two reasons. First, the not_credible class is hard to predict because it is the minority class. Second, just as it is important to filter spam from email, it is important to filter out health information that is not credible in real health information applications. The reported performance measures were averaged across the 10 cross-validation iterations. False positive and false negative cases were randomly selected for error analysis, in order to analyze the reasons for misprediction and to apply that understanding to the improvement of features and credibility models.
While creating data for training and testing, I used a threshold of ζ = 16, the mid-point of the scale, to decide whether an answer or review was credible. In other words, answers or reviews with credibility scores (calculated by summing the scores of all credibility items) of 16 or higher were considered credible, and answers or reviews with credibility scores lower than 16 were considered not_credible. Since there are two labels in the experts' annotation and three labels in the crowd workers' annotation, the “credibility score” is the average credibility score: the sum of the credibility scores of all evaluators divided by the number of evaluators.
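The labeling rule can be sketched as follows. The function name `credibility_label` is hypothetical, and the sketch assumes that an average score of exactly 16 counts as credible.

```python
THRESHOLD = 16  # mid-point of the summed credibility scale, as stated in the text


def credibility_label(evaluator_scores, threshold=THRESHOLD):
    """Map per-evaluator summed credibility scores to a binary class label.

    evaluator_scores -- one summed credibility score per evaluator
                        (two for expert annotation, three for crowd workers)
    """
    average_score = sum(evaluator_scores) / len(evaluator_scores)
    # Assumption: a score exactly at the threshold is treated as credible.
    return "credible" if average_score >= threshold else "not_credible"
```

For example, two expert scores of 18 and 14 average to 16 and yield credible, while scores of 15 and 16 average to 15.5 and yield not_credible.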
An extensive set of feature ablation studies was conducted to examine the marginal contribution of the different feature groups. In each feature ablation study, the same 10 folds from the previous experiment were used. The performance measure of each fold, obtained using the full feature set, was compared with the performance measure of the same fold obtained using a reduced feature set, in which the features in one feature category were removed. In this way, this dissertation study examined the influence of the features of the removed category and the statistical significance of that influence. These feature ablation studies were iterated through all feature categories introduced by the content analysis. The paired samples were made using the 10 held-out test sets. In cross-validation, there is a risk of having non-independent samples that do not follow a normal distribution. For instance, in each iteration of 10-fold cross-validation, 9 folds are used for training and one fold is used for testing. The 9 folds used for training are reused in other iterations, so the samples cannot be assumed to be independent. In this situation, non-parametric statistics are more appropriate. Therefore, a bootstrap-shift test (Noreen, 1989) was applied to the performance measures of each fold. The bootstrap-shift test does not require an assumption about the distribution of the test statistic under the null hypothesis, which makes the test result more convincing and robust.
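The paired comparison can be illustrated with a small sketch of a bootstrap test using the shift method. It is one common reading of the procedure, not necessarily the exact variant used in the study: the per-fold differences are shifted to have mean zero (so that the null hypothesis holds), resampled with replacement, and the p-value is the share of resampled means at least as extreme as the observed mean. The function name, the two-sided test, the number of resamples, and the seed are all assumptions.

```python
import random


def bootstrap_shift_test(full_scores, ablated_scores, n_resamples=10000, seed=0):
    """Two-sided paired bootstrap test with the shift method.

    full_scores    -- per-fold performance of the full-feature model
    ablated_scores -- per-fold performance with one feature category removed
    Returns (observed mean difference, p-value).
    """
    diffs = [f - a for f, a in zip(full_scores, ablated_scores)]
    n = len(diffs)
    observed = sum(diffs) / n

    # Shift the differences so that their mean is zero under the null hypothesis.
    shifted = [d - observed for d in diffs]

    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        resampled_mean = sum(rng.choice(shifted) for _ in range(n)) / n
        if abs(resampled_mean) >= abs(observed):
            extreme += 1

    # Add-one correction keeps the estimated p-value strictly above zero.
    return observed, (extreme + 1) / (n_resamples + 1)
```

With 10 folds, `full_scores` and `ablated_scores` would each hold 10 values of the chosen performance measure, one per held-out fold, in the same fold order.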
Multiple comparisons of statistical significance can lead to several problems, but the most critical issue is the increased likelihood of committing a Type I error, that is, rejecting a true null hypothesis (Ware, Ferron, & Miller, 2012). When a large number of comparisons are made, a certain number of the null hypotheses can be rejected due to random chance (Demšar, 2006). Fisher (1959) presented a solution to the problem of multiple comparisons, “analysis of variance” (ANOVA). ANOVA is used to analyze the differences among group means in sample data. ANOVA is a fundamental inference procedure, but it relies on some assumptions (e.g., normal distribution). When the data do not meet these assumptions and the sample size is relatively small, we should apply other procedures that make fewer assumptions. These assumptions are mostly violated when measuring the performance of machine learning algorithms (Demšar, 2006).
Rank-based statistics such as the Wilcoxon test and resampling approaches such as bootstrapping are non-parametric alternatives to the parametric approach. Non-parametric procedures “do not involve estimates of parameters, but rather compare the cumulative frequency distributions to assess whether two (or more) sets of scores might have come from the same population” (Ware et al., 2012, p. 254). The Friedman test is a non-parametric equivalent of the within-subjects ANOVA. Thus, the Friedman test was conducted to examine the differences among classifier means.
As there was a control model in these feature ablation studies, not all pairwise comparisons were necessary. To compare the full model with each of the six constrained treatment models, I used a paired-sample Wilcoxon signed-rank test. The Wilcoxon signed-rank test (Wilcoxon, 1945) is a non-parametric alternative to the paired t-test. The averages used in the t-test are sensitive to outliers. The Wilcoxon signed-rank test, a traditional non-parametric approach, converts the differences in performance between two models to ranks. Although some of the information contained in the actual values is lost, the impact of outliers is mitigated. In other words, the exact magnitudes of the differences are discarded, but larger differences still receive higher ranks and therefore count for more (Demšar, 2006). Demšar (2006) empirically demonstrated its robustness for statistical comparisons of classifiers. With today’s powerful computing devices available at our fingertips, we can also work with the data in its original metric by resampling the distribution of the observed statistics (Ware et al., 2012). Thus, I also applied the bootstrap-shift test (Noreen, 1989) to examine the difference in performance between classifiers.
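To make the rank conversion concrete, the following is a minimal sketch of a two-sided Wilcoxon signed-rank test with an exact p-value obtained by enumerating all sign assignments, which is feasible for 10 folds (1,024 assignments). The function names are hypothetical, zero differences are dropped (one common convention), and a statistics package may handle ties and zeros differently.

```python
from itertools import product


def average_ranks(values):
    """Rank values in ascending order; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j hold the 1-based ranks i+1..j+1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def wilcoxon_signed_rank(full_scores, ablated_scores):
    """Return (W+, exact two-sided p-value) for paired per-fold scores."""
    # Assumption: pairs with a zero difference are discarded.
    diffs = [f - a for f, a in zip(full_scores, ablated_scores) if f != a]
    if not diffs:
        return 0.0, 1.0

    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    center = sum(ranks) / 2            # expected W+ under the null hypothesis
    observed = abs(w_plus - center)

    # Under the null, every difference is equally likely to be positive or negative.
    extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - center) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** len(ranks)
```

When the full model beats the constrained model on all 10 folds with distinct margins, W+ is 55 and the exact two-sided p-value is 2/1024.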
Multiple comparisons of statistical significance can lead to false rejection of a true null hypothesis. A popular adjustment is to divide the significance level (α) by the number of tests, called the “Bonferroni technique” (Ware et al., 2012). However, previous studies have shown that the Bonferroni technique is overly conservative and weak because it assumes the independence of hypotheses (Salzberg, 1997). Holm (1979) proposed a modification to the Bonferroni technique. Instead of using the modified significance level α′ = α / m (where m is the number of tests) for every comparison, he suggested changing the denominator from m to m − (k − 1), where k is the rank of the comparison's p-value in ascending order. I applied Holm’s modification to both the Wilcoxon signed-rank test and the bootstrap-shift test.
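Holm's adjustment can be sketched in a few lines. The function name `holm_reject` and the default α of 0.05 are assumptions; the step-down stopping rule (no further rejections after the first non-significant comparison) is part of Holm's procedure.

```python
def holm_reject(p_values, alpha=0.05):
    """Apply Holm's step-down procedure.

    Returns a list of booleans, in the original order of p_values,
    marking which null hypotheses are rejected.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p-values
    reject = [False] * m
    for k, idx in enumerate(order, start=1):
        # The k-th smallest p-value is compared with alpha / (m - (k - 1)).
        if p_values[idx] <= alpha / (m - (k - 1)):
            reject[idx] = True
        else:
            break  # all larger p-values are also retained
    return reject
```

With four comparisons and p-values of 0.01, 0.04, 0.03, and 0.005, the thresholds are 0.0125, 0.0167, 0.025, and 0.05 in rank order, so only the comparisons with p = 0.005 and p = 0.01 are rejected.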
In addition to the feature ablation study explained above, I conducted two more feature ablation studies to focus on variations in the factors of credibility judgments influenced by topic and prior knowledge. For instance, to examine how the importance of the factors affecting credibility judgments differs by the topic of health information (general vs. specific), the dataset was separated into a dataset with general topics and another dataset with specific topics. Then the feature ablation study described above was applied to both datasets in the same way. By comparing the results of the bootstrap-shift test on each dataset, the relative importance of each feature category depending on the topic of health information could be examined. Because multiple comparisons were made on the same dataset, Holm's modification (Holm, 1979) was applied to adjust the test results for multiple comparisons.
To examine the influence of prior knowledge of users, I created two additional sets of features: (1) prior knowledge and (2) interaction features that represent interactions between basic features and prior knowledge. Labels of prior knowledge of MTurk workers were created by the majority vote from self-assessments of their own prior knowledge (Figure 5 and Appendix 6). Scores per item on the 7-point Likert scale were summed, examined, and converted to a binary class (prior knowledge or no prior knowledge) using the median value of the data gathered. Interaction features were the products of individual basic features and the prior knowledge feature. A model that included the basic features chosen by a feature selection method was considered the control model. I created two treatment models using the two additional feature sets. The first treatment model was created by adding the prior knowledge feature to the control model. The second treatment model was created by adding the prior knowledge feature and the interaction features to the control model. Then, the classifiers were compared based on the feature ablation study using bootstrap-shift tests and Holm’s modification.
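The three feature sets can be sketched as follows. The function and feature names are hypothetical, and prior knowledge is assumed to be coded as 1 (prior knowledge) or 0 (no prior knowledge).

```python
def build_feature_sets(basic_features, prior_knowledge):
    """Build the control and the two treatment feature sets for one instance.

    basic_features  -- dict mapping a basic feature name to its numeric value
    prior_knowledge -- 1 if the evaluator has prior knowledge, otherwise 0
    """
    control = dict(basic_features)

    # Treatment 1: control features plus the prior knowledge feature.
    treatment_1 = dict(control)
    treatment_1["prior_knowledge"] = prior_knowledge

    # Treatment 2: treatment 1 plus one interaction term per basic feature,
    # computed as the product of the feature value and prior knowledge.
    treatment_2 = dict(treatment_1)
    for name, value in basic_features.items():
        treatment_2[f"{name}_x_prior_knowledge"] = value * prior_knowledge

    return control, treatment_1, treatment_2
```

With this coding, every interaction term is zero for evaluators without prior knowledge and equals the basic feature value otherwise, so the second treatment model can fit a separate weight for each feature among evaluators with prior knowledge.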
Figure 5. A Questionnaire for Asking MTurk Workers’ Prior Knowledge on the Task
It should be noted that many of the link-based and user-based features could not be created in the case of the Yahoo! Answers data. Thus, the feature ablation studies with the Yahoo! Answers dataset could only examine the discriminative power of each feature category mostly in the content-based feature categories. Unfortunately, experiments that utilized full features could only be conducted with the Yelp Review data.