Several areas for further research have already been identified as gaps in the published research. To enable robust future research, findings from efforts such as this dissertation need to be published in journals. Research will need to continue as the forms and volume of social media usage change over time, both within the overall population and within sub-populations. Research findings indicate that both news and social media have some capacity to provide unique information and insight that is not available through more traditional public health data sources. The utility and value of these data vary by condition or disease and by source, and need to be examined further to understand fully their maximum potential.
To enable future researchers to contribute, and to facilitate practical application of this work, a freely available public health search term vocabulary needs to be built for each medium using natural language processing. For each medium, terms need to be evaluated both individually and as symptom complexes or syndromes (20) to assess their utility. These information sources are often unstructured and difficult to interpret, and they require advanced computational resources to support effective categorical or quantitative assessments. Increased research on natural language processing, and the development of related tools for information retrieval, text classification, and text mining, are crucial next steps for converting text to structured event data. Additionally, methods such as Latent Dirichlet Allocation (LDA) should be used to identify new terms for disease topics that are not intuitively obvious but are likely relevant (122). Specifically, this type of methodology would allow new slang terms for diseases or symptoms to be rapidly identified and incorporated into analyst dashboards. LDA has been shown to extract valuable topics from large amounts of data (122), including from user profiles using labeled LDA (L-LDA) (123), and this capacity will become more critical as the amount of available information continues to grow.
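As an illustration of how LDA could surface candidate terms for a disease topic, the following is a minimal sketch of a collapsed Gibbs sampler written with the Python standard library only. It is not the implementation used in the cited studies (122, 123). The function names, the hyperparameter values (alpha, beta), and the toy posts are all assumptions made for illustration; a production system would more plausibly rely on an established topic modeling library and a much larger corpus.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenized documents by collapsed Gibbs sampling.

    Returns (topic_word, doc_topic): per-topic word counts and
    per-document topic counts after the final sweep.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                       # tokens assigned to topic k
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word, doc_topic


def top_terms(topic_word, n=5):
    """List the n most frequent words with a nonzero count in each topic."""
    return [
        [w for w, c in counts.most_common() if c > 0][:n]
        for counts in topic_word
    ]


# Hypothetical, already-tokenized posts invented for this sketch.
posts = [
    ["fever", "cough", "flu", "chills"],
    ["flu", "fever", "aches", "cough"],
    ["stomach", "vomiting", "nausea", "cramps"],
    ["nausea", "stomach", "cramps", "vomiting"],
    ["cough", "chills", "fever", "flu"],
    ["vomiting", "cramps", "nausea", "stomach"],
]

topic_word, doc_topic = lda_gibbs(posts, num_topics=2)
for index, terms in enumerate(top_terms(topic_word)):
    print(f"topic {index}: {terms}")
```

In practice, an analyst would review the top terms of each topic and add unfamiliar but relevant words (for example, emerging slang for a symptom) to the search term vocabulary. Because the sampler is stochastic, the topics it recovers depend on the seed, the number of sweeps, and the hyperparameters.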
Findings would need to be validated against traditional, resource-intensive observational or cohort studies (31). The key characteristics of an effective surveillance system (representativeness of the system, the outbreak detection algorithms in use by the system, and the specificity of those algorithms) (2) have not yet been investigated and should be examined for each of these novel surveillance sources. Both sensitivity and specificity are unclear, and false positives have the potential to increase the workload of already overburdened public health employees. One survey respondent indicated the critical need for a real-time validation effort rather than retrospective analysis. Real-time validation would reduce the risk of retrofitting data based on historical insights and would also improve understanding of how this content would work in daily practice.
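The concern about false positives can be made concrete with the standard definitions of sensitivity, specificity, and positive predictive value. The sketch below is illustrative only: the function name and the counts are hypothetical and are not results from this study.

```python
def surveillance_metrics(tp, fp, fn, tn):
    """Compute sensitivity, specificity, and positive predictive value
    from counts of true/false positive and true/false negative alerts."""

    def ratio(numerator, denominator):
        # Return NaN rather than raising when a denominator is zero.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # share of true events detected
        "specificity": ratio(tn, tn + fp),  # share of non-events left unflagged
        "ppv": ratio(tp, tp + fp),          # share of alerts that are true events
    }


# Hypothetical validation counts, invented for illustration.
example = surveillance_metrics(tp=8, fp=40, fn=2, tn=950)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

With these invented counts, specificity is high (about 0.96) yet only one alert in six corresponds to a true event, because true events are rare relative to the number of periods monitored. This is the mechanism by which even a fairly specific source could add to analyst workload.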
Many survey respondents indicated the need for improved baseline data for these emerging sources, both to distinguish threatening anomalous events more precisely and to understand the background level of reporting more clearly, on both a regional and a per-disease basis. This differentiation is especially critical at a time when the amount of information available to public health officials and to the public is growing at an astounding rate; every two days, more information is created than was created between the dawn of civilization and 2003 (124). Multivariable linear regression can be used to predict the normalized
(77) or some similar type of “control analysis” for various regions or periods. Improvements to aberration detection algorithms, such as training Bayesian classifiers to increase positive predictive value and thereby decrease false alarms, are also important. Methods are also needed to vary the weight given to different sources of information and to identify events as high confidence when multiple sources provide corroborating information.
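To illustrate how a baseline supports aberration detection, the sketch below flags days on which a count exceeds a trailing baseline by a chosen number of standard deviations. This simple z-score rule is a stand-in chosen for brevity, not the regression or Bayesian approach discussed above; the function name, the 7-day window, the threshold of 3, and the daily counts are all assumptions made for illustration.

```python
from statistics import mean, stdev


def flag_aberrations(counts, baseline_days=7, threshold=3.0):
    """Return indices of days whose count exceeds the trailing baseline
    mean by more than `threshold` sample standard deviations."""
    flagged = []
    for t in range(baseline_days, len(counts)):
        window = counts[t - baseline_days:t]  # the preceding baseline_days values
        mu = mean(window)
        sigma = stdev(window)
        if sigma > 0:
            z = (counts[t] - mu) / sigma
        else:
            # Flat baseline: any increase at all is treated as extreme.
            z = float("inf") if counts[t] > mu else 0.0
        if z > threshold:
            flagged.append(t)
    return flagged


# Hypothetical daily counts of posts mentioning a symptom term.
daily_counts = [10, 12, 11, 9, 10, 13, 11, 12, 10, 40, 12]
print(flag_aberrations(daily_counts))  # prints [9]
```

Raising the threshold or lengthening the baseline window reduces false alarms at the cost of sensitivity, which is the trade-off that region-specific and disease-specific baseline data would help to tune.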
A better understanding of health behaviors and concerns among the public could be gained from this type of data (30), although this would require the development of new methods as well as the comparison studies mentioned above. For instance, new symptoms or home remedy treatments may first be identified through these alternative information sources (125). Twitter is potentially suitable for longitudinal text mining (to identify changes in opinions or responses) and can provide instantaneous snapshots of the public’s opinions and behavioral responses (23). It is reasonable to presume that other new or emerging technologies may have similar value.
A further area of needed research is determining the policy implications of these findings: how much of a change from baseline will warrant further investigation, or the deployment of resources for investigation (83) or engagement? That is, what is “actionable” in this space? It has been shown that the spread of information about a disease has the potential to impart benefits and reduce the spread of disease, because information creates awareness, and awareness triggers a tendency toward protective behavior (119); however, the dynamics of behavior resulting from social media have not been fully examined. Methods are needed to assess the impact of disease or health-related messaging (including rumors) as it spreads through these new media, as well as to understand the factors that amplify the spread of information.