CHAPTER 5: PREDICTIVE ANALYSIS RESULTS
6.4 Future Research
This study has identified a number of directions for future research. The results suggest that the criteria for assessing credibility depend on context, which varies from person to person and over time. Future research should develop adaptive and
information retrieval. Methodologically, Reinforcement Learning (RL) can be considered as a research framework. RL is an area of machine learning in which an “autonomous agent” learns to take actions that maximize rewards based on feedback received from a “trainer.” Credibility judgments vary with the characteristics of users and with the contextual factors under which users utilize the information over time. By leveraging RL as a framework to investigate users’ feedback on the credibility of information, users’ preferences for different credibility criteria and the influence of credibility clues presented on the screen over time can be examined to develop a personalized credibility model. This can help find a balance between humans and algorithms.
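As a rough illustration of how user feedback could drive such a model, the sketch below uses an epsilon-greedy multi-armed bandit, which is a deliberately simplified form of RL and not a method used in this study. Each arm is a hypothetical credibility criterion, and the reward is simulated user feedback. The class name, the criterion names, and the preference probabilities are all assumptions made for illustration.

```python
import random


class CueBandit:
    """Epsilon-greedy bandit; each arm is a hypothetical credibility criterion."""

    def __init__(self, criteria, epsilon=0.1, seed=0):
        self.criteria = list(criteria)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {c: 0 for c in self.criteria}
        self.values = {c: 0.0 for c in self.criteria}  # running mean reward

    def select(self):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.criteria)
        return max(self.criteria, key=lambda c: self.values[c])

    def update(self, criterion, reward):
        # Incremental update of the mean reward observed for this criterion.
        self.counts[criterion] += 1
        n = self.counts[criterion]
        self.values[criterion] += (reward - self.values[criterion]) / n


def simulate(user_preferences, steps=2000, seed=0):
    """Simulated user: returns reward 1 with the probability given per criterion."""
    rng = random.Random(seed + 1)
    agent = CueBandit(user_preferences, epsilon=0.1, seed=seed)
    for _ in range(steps):
        criterion = agent.select()
        reward = 1.0 if rng.random() < user_preferences[criterion] else 0.0
        agent.update(criterion, reward)
    return agent


if __name__ == "__main__":
    # Hypothetical user who responds best to source expertise.
    prefs = {"writing_quality": 0.5, "source_expertise": 0.8, "plausibility": 0.3}
    agent = simulate(prefs)
    print(max(agent.values, key=agent.values.get))
```

In this toy setting the agent's estimates drift toward the criterion the simulated user rewards most often. A personalized credibility model would additionally need to account for context and for preferences that change over time, which a stationary bandit does not capture.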
Another important direction is to improve the quality of credibility labels. The agreement between experts and MTurk workers in this study was moderate, with a Cohen’s kappa coefficient of 0.525. This suggests either that there is considerable room for improvement or that credibility is highly subjective. This study created an alternative credibility questionnaire that can be easily understood by crowd workers and collected MTurk workers’ responses. However, the validation and verification of this instrument were left for post-graduation study, considering the complexity of the research and time constraints. Further research is necessary to determine the threshold value, which is the criterion for converting the sum of scores obtained through existing questionnaires and new questionnaires into binary credibility classes. The median or mid-point split is widely used, but it should be applied carefully. It is also necessary to study how best to weight each assessor's assessment when aggregating the assessments into a majority vote.
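The three quantities discussed above (inter-rater agreement, the binarization threshold, and assessor weighting) can be made concrete with a minimal sketch. The function names are hypothetical, and the example does not reproduce the procedure or the data of this study. It uses the standard definition of Cohen's kappa, (p_o - p_e) / (1 - p_e), a median split as the default threshold, and a simple weighted vote.

```python
from collections import Counter
from statistics import median


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def binarize(scores, threshold=None):
    """Convert summed questionnaire scores to binary classes (1 = credible).

    With no threshold given, the median is used. Scores equal to the
    threshold fall into class 0, which is itself a design choice.
    """
    if threshold is None:
        threshold = median(scores)
    return [1 if s > threshold else 0 for s in scores]


def weighted_vote(votes, weights):
    """Weighted majority over binary votes; ties resolve to 0 (not credible)."""
    credible = sum(w for v, w in zip(votes, weights) if v == 1)
    return 1 if credible > sum(weights) / 2 else 0


if __name__ == "__main__":
    print(cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0]))  # 0.5
    print(binarize([10, 20, 30, 40]))                 # [0, 0, 1, 1]
    print(weighted_vote([1, 0, 0], [3, 1, 1]))        # 1
```

The handling of scores that fall exactly on the threshold, and of tied votes, changes the resulting labels. This is one reason the median split should be applied with care.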
The primary goal of this study is to understand the cognitive processes of general users making credibility judgments about health information on social media. Thus, credibility labels used in this study were created by crowd workers. However, considering the negative impact of
prior knowledge on predictive modeling for Yahoo! Answers, it is necessary to examine whether crowd workers who are general online users are the ideal population to judge the credibility of health information. There were 630 overlapping credibility labels between experts and MTurk workers in this dissertation study. In future work, I will increase the number of overlapping credibility labels between experts and MTurk workers. I also plan to hire experts who are clinicians (e.g., nurses and physicians). In addition to using the credibility labels created by experts to control the quality of the labels created by crowd workers, I will compare the labels created by the two groups, who differ in expertise, to examine the effect of expertise on credibility judgments. In health information, expertise is one of the most critical factors affecting the ability to provide credible information. A more systematic method to create and validate credibility labels would provide a solid foundation for related research.
Another research direction is feature engineering. Some characteristics can be represented easily in numbers, while others are difficult to convert to numbers. For instance, there are values and beliefs in our lives that are difficult to quantify. Likewise, there were credibility factors such as plausibility that were operationalized but were not very successful in representing the actual “plausibility.” The features that are already developed should continue to be improved through an iterative process, and features that have not been developed yet should be operationalized by applying methods collected from a wider body of literature and advanced natural language processing techniques. In the case of plausibility, the simple term similarity between an answer and documents retrieved by Google CSE might not be useful because the retrieved documents contain considerable noise. A Web site typically contains more information than is needed to answer the question. When asked about the cause of diabetes, Google CSE might find a Web site that explains the cause of diabetes as well as management
and treatment methods. The latter can create noise when measuring similarity. Therefore, we may need a two-step approach. In the first step, relevant parts can be selected. Then, in the second step, the similarity can be calculated only with these selected parts. Also, the present study collected annotations from experts on the credibility factors. These annotations can be used as ground truth data for machine learning to improve each feature. The interaction between prior knowledge and individual features needs to be further examined and reflected in future work.
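The two-step approach described above can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the feature implementation used in this study: relevance in the first step is approximated by bag-of-words cosine similarity between the question and each sentence of the retrieved document, and all function names and the example texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase alphabetic tokens; no stemming or stop-word removal."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def two_step_similarity(question, answer, document, top_k=2):
    # Step 1: keep the top_k sentences most similar to the question.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    q_vec = Counter(tokens(question))
    ranked = sorted(
        sentences, key=lambda s: cosine(q_vec, Counter(tokens(s))), reverse=True
    )
    selected = " ".join(ranked[:top_k])
    # Step 2: compare the answer with the selected sentences only.
    return cosine(Counter(tokens(answer)), Counter(tokens(selected)))


if __name__ == "__main__":
    question = "What causes diabetes?"
    answer = "Diabetes is caused by insulin resistance."
    document = (
        "Insulin resistance causes diabetes. "
        "Treatment includes diet and exercise. "
        "Management requires regular glucose monitoring."
    )
    whole = cosine(Counter(tokens(answer)), Counter(tokens(document)))
    focused = two_step_similarity(question, answer, document, top_k=1)
    print(round(whole, 3), round(focused, 3))
```

In this toy example, restricting the comparison to the sentence about causes raises the similarity score, because the sentences on treatment and management no longer dilute it. A practical version would likely need a stronger relevance model for the first step than raw term overlap.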
Also, the criteria that the MTurk workers used to judge the credibility of the health information have not yet been investigated. This study collected comments made by crowd workers regarding their credibility criteria. Since the workers made a total of 6,000 comments, it was not feasible to code all of them manually. As a part of the post-graduation study, I plan to analyze the credibility criteria of actual online users rather than researchers. After identifying representative criteria with a topic modeling technique such as LSA, I can examine the criteria in depth by manually analyzing a random sample of the comments. Examining credibility criteria used by actual online users will help to reinforce the criteria used in this study and to further develop features that can be used in real applications.
Regarding the directions of future research mentioned above, a method of linking human feedback with features should be developed. In interactive machine learning, human feedback would not be helpful in future prediction unless it is reflected in the feature representation. Just as nodes in the hidden layers of a deep learning model represent latent variables that humans could not provide to a model, and LSA finds latent semantic relations between words, it is imperative to find latent variables using users’ feedback and existing features and to keep updating the feature representation. It is essential to study how to combine several existing methods to find latent variables (e.g., structural equation modeling) and to examine hidden semantics (e.g., LSA).
Non-credible information, both online and offline, hinders the exchange of ideas that democracy relies upon, and creates distrust among people. As fragmentation grows and public trust decreases, governments and journalists strive to perform traditional editorial roles that distinguish misinformation from facts. As the amount of circulated information increases
exponentially, however, it is difficult for a small number of people and organizations to play the editorial role. Collaboration among stakeholders across government, industry, journalism, and academia is necessary to counteract non-credible information. I hope this study will be one meaningful step toward this collaboration.
APPENDIX 1: RECRUITMENT POSTING
In this HIT, you will be asked to judge the credibility of given health information you might search in your daily life. Either a pair of health-related question and answer or a review of a medical facility will be given to you. You do not need to have an expertise in health to
participate in this HIT, because you will be asked to evaluate the credibility of the information on its own merit by using available clues in the given information and your current knowledge. You will be also asked about your prior knowledge on the given topic and criteria you used to make credibility judgments.