CHAPTER 5: PREDICTIVE ANALYSIS RESULTS
6.3 Limitations
As with any study, this one had several limitations. First, the amount of data used for machine learning (2,000 instances in total) was relatively small compared to recent trends. This was somewhat unavoidable, however, because human annotation, which required complex verification, was necessary to create the credibility labels. For instance, 6,000 credibility judgments were collected, but aggregating them by majority vote reduced the number of final labels. To examine how the effects of features varied with topic and prior knowledge, I first divided the data into two groups (Yahoo! Answers and Yelp) according to topic, and then divided each group into two sub-groups according to prior knowledge, resulting in four final groups. Thus, the number of data instances in each group was 500. Had time and resources allowed the collection of more credibility labels, the study results might have had a higher level of generalizability.
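To illustrate how majority voting reduces the number of usable labels, the following is a minimal sketch rather than the procedure actually used in this study. It assumes binary labels (1 = credible, 0 = not credible) and three judgments per item, which is consistent with 6,000 judgments and 2,000 instances but is an assumption here; the item identifiers and the `majority_vote` helper are hypothetical.

```python
from collections import Counter


def majority_vote(judgments):
    """Return the most frequent label among one item's judgments.

    Returns None on a tie so that ties can be handled explicitly.
    """
    counts = Counter(judgments).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


# Hypothetical judgments: three per item, 1 = credible, 0 = not credible.
judgments_by_item = {
    "item_a": [1, 1, 0],
    "item_b": [0, 0, 1],
    "item_c": [1, 1, 1],
}

# Nine judgments collapse into three final labels, one per item.
labels = {item: majority_vote(j) for item, j in judgments_by_item.items()}
```

With an odd number of binary judgments per item a tie cannot occur; the tie check only matters if the number of judgments or label classes differs from this assumption.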
Another limitation of this study is that the method used to create credibility labels could be improved. Designing a new instrument, especially for an inevitably subjective concept such as credibility, is very challenging and requires careful verification and validation. Since developing a new instrument could itself be the subject of another dissertation, an existing instrument that had been validated and that best fit the context of this dissertation study was used. It might be necessary to develop a new credibility instrument for the population of crowd workers. It is also necessary to investigate how to weight each assessor's credibility evaluation appropriately, instead of weighting all assessors equally in the majority vote. As discussed earlier, prior knowledge had a significant impact, but it was not applied to the weighting scheme in this study. As will be explained later, I plan to continue working on this problem after graduation.
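One possible form of assessor weighting is sketched below. This is an illustration under stated assumptions, not a scheme evaluated in this study: the assessor identifiers, the weight values, and the `weighted_vote` helper are all hypothetical, and the weights are simply assumed to reflect something like an assessor's prior knowledge.

```python
from collections import defaultdict


def weighted_vote(judgments, weights, default_weight=1.0):
    """Aggregate (assessor_id, label) pairs using per-assessor weights.

    Assessors missing from `weights` receive `default_weight`.
    On a tie, the label that was encountered first is returned.
    """
    totals = defaultdict(float)
    for assessor, label in judgments:
        totals[label] += weights.get(assessor, default_weight)
    return max(totals, key=totals.get)


# Hypothetical judgments for one item: 1 = credible, 0 = not credible.
judgments = [("assessor_1", 1), ("assessor_2", 0), ("assessor_3", 0)]

# Equal weighting reproduces the plain majority vote.
equal_label = weighted_vote(judgments, weights={})

# Assumed weights: assessor_1 is given more influence (e.g., more prior knowledge).
assumed_weights = {"assessor_1": 2.5, "assessor_2": 1.0, "assessor_3": 1.0}
weighted_label = weighted_vote(judgments, assumed_weights)
```

In this example the equally weighted vote yields 0, whereas the assumed weights flip the label to 1, which shows why the choice of weights would itself need validation.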
The data used in this study are not necessarily the ideal data for studying the credibility of health information on social media. As mentioned earlier, data from online health forums, where active health information exchange is expected, were not available due to privacy concerns. I sent data usage requests, which described the purpose of the data use, a brief introduction to the research, and the secure data management method, to seven online forums, but they either went unanswered or were rejected. Therefore, publicly available Yahoo! Answers and Yelp data were used in this dissertation research. Because these platforms are open to the public, users might be less willing to disclose their specific information needs. For example, users often cannot ask an anonymous public audience questions that they might ask a close friend or family member. However, this choice was somewhat inevitable given the increasing concerns about privacy and research ethics.
In addition, the two datasets are imbalanced with respect to source-related metadata. Yelp has rich metadata for the source, whereas Yahoo! Answers has very limited metadata for the source. It is therefore difficult to compare the results from the two datasets directly, because the predictive models for each dataset used different features. However, this might be an advantage with respect to customization, and the limited availability of datasets for this research made it unavoidable.
Another limitation of the study was that the quality of some features showed room for improvement and that not all of the factors identified in the content analysis were operationalized. The quality of the extracted features varied. Features related to platform-specific factors, such as plausibility, relevance, and specificity, were not found to be statistically significant in predicting the credibility of health information on social media, despite their statistically significant impacts in the regression analyses and F-tests. Some features (e.g., plausibility) were very difficult to extract, whereas others (e.g., comprehensiveness) were relatively straightforward. Plausibility in Yahoo! Answers was the most discriminative factor in the regression analysis that used experts’ annotations. However, it turned out not to be discriminative in the feature ablation study, which depended on natural language processing and language models. Some features, such as objectiveness and sympathy, were not operationalized due to time constraints and concerns about performance, despite their importance in the statistical analyses. As a result, the effects of feature categories could not be tested comprehensively in the feature ablation studies, and the generalizability of the results of this study should be examined carefully.
Finally, it was impossible to examine interactions among users in the datasets used in this study. For instance, if other users are suspicious of the quality of an answer, the credibility of that answer will decrease. However, the publicly released datasets did not include the order of the responses, the relationships among them, or the profiles of authors. Therefore, this study could not reflect the dynamics of information exchange caused by interactions among users. According to Wilson’s Model of Information Behavior (Wilson, 1997), information is actively constructed, dependent on context, and highly individualized. Research on how interactions among users shape the dynamics of credibility judgments could provide more insight into the cognitive process behind those judgments.