Research Statement

Wei Gao

School of Information Systems, Singapore Management University

https://sites.google.com/view/gaowei

Background

My research aims to deliver innovations in problem-solving for intelligent systems in business, industry, and the public sector, making intensive use of machine learning and data-driven approaches. Specifically, my primary research interest lies in Natural Language Processing and Text Mining, a sub-field of data science.

I develop technologies for searching and mining unstructured and semi-structured data from social media and the Web, using various machine-learning-based methods to meet the challenges of effective online content mining and analysis, especially those presented by the proliferation of social media content. My research builds on solid principles and techniques at the intersection of several related fields, including Natural Language Processing, Information Retrieval, Social Computing, and Machine Learning. I believe that the inherent connections among these fields, and their continued development, will converge to produce the next major real-world impact, following great inventions of our lifetime such as the World Wide Web and the search engine, driven by people’s common concern about the ever-increasing difficulty and complexity of managing information online effectively.

The research issues I pursue must be not only technically challenging but also high in application potential. Today, the ubiquitous social media content around us, which is dynamic, noisy, unreliable, and personal, presents tricky challenges and interesting opportunities for existing NLP and text mining technologies. It opens new horizons in areas such as sentiment and stance analysis, credibility and trust of information, cross-lingual and multilingual linking, language adaptation, conversational AI, and multi-task and multi-modal NLP. My research is motivated by the goal of proposing new methods and developing viable technologies, based on principled approaches, that unleash the power of social natural language processing for solving real-world problems.

Current and Past Research

– Rumor detection and verification in social media: Social media platforms such as microblogs are ideal places for seeding and spreading misinformation and rumors. Such unverified information is detrimental because it often causes confusion, public panic, and social unrest. Debunking rumors, especially at an early stage of diffusion, is therefore necessary and important for minimizing their damaging effects. For distinguishing rumors from genuine facts, the traditional approach of manual investigation results in low topical coverage and long debunking delays. My co-workers and I conducted pilot studies on automatic rumor detection and verification using various machine learning techniques. We investigated learning algorithms that capture temporal variations in a wide range of social context features for rumor detection on microblogging websites such as Twitter and Sina Weibo [1]. Such an approach, which relies on feature engineering, is however painstakingly detailed, biased, and labor-intensive. We therefore exploited different deep neural networks to automatically learn high-level rumor-indicative patterns for improving rumor detection. With Recurrent Neural Networks (RNN), we modeled the claims associated with an event as a variable-length time sequence of relevant microblog posts to capture useful temporal and textual signals in a supervised fashion [2]. We also extended the sequential model into structured models based on tree kernels [3] and tree-structured Recursive Neural Networks [4], which consider the propagation structure together with the textual patterns of source posts and their responses. In addition, we studied how to reinforce stance detection and rumor detection jointly in a unified neural multi-task learning model [5], exploiting the strong connections between the veracity of a claim and the stances expressed in responsive posts via task-invariant feature sharing. In a networked online environment, trust is also highly relevant to the credibility of information. Based on our preliminary findings in trust-based rumor spreader detection [6, 7], I am currently developing a method that leverages deeper, implicit trust structures derived from user and information propagation networks to learn discriminative representations for identifying rumor claims and their spreaders simultaneously.
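To make the idea of a variable-length time sequence of posts concrete, the following minimal sketch groups the posts of one event into consecutive time intervals, so that each non-empty interval becomes one step of a sequence model's input. It is an illustration under stated assumptions, not the procedure of [2]: the fixed-width binning, the function name `posts_to_sequence`, and the sample data are all hypothetical.

```python
def posts_to_sequence(posts, interval_seconds):
    """Group (timestamp_seconds, text) posts into time-ordered intervals.

    Returns a list of lists of texts, one inner list per non-empty interval.
    Events with different time spans yield sequences of different lengths.
    """
    if not posts:
        return []
    ordered = sorted(posts, key=lambda post: post[0])
    start = ordered[0][0]
    bins = {}
    for timestamp, text in ordered:
        # Index of the fixed-width interval this post falls into (assumed scheme).
        index = int((timestamp - start) // interval_seconds)
        bins.setdefault(index, []).append(text)
    # Empty intervals are skipped, so only intervals with activity remain.
    return [bins[index] for index in sorted(bins)]


# Hypothetical event: four posts, timestamps in seconds since the first post.
event_posts = [(130, "is this true?"), (0, "breaking: ..."), (30, "wow"), (400, "debunked")]
sequence = posts_to_sequence(event_posts, interval_seconds=60)
print(sequence)
```

Each inner list could then be turned into a feature vector (for example, term weights over the posts in that interval) before being fed to a recurrent model; that step is omitted here.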

– Tweet sentiment quantification: Sentiment classification has become a ubiquitous enabling technology in the Twittersphere, since classifying tweets according to the sentiment they convey towards a given topic (e.g., a product, a person, a political party, or a policy) has many applications. Usually, however, the final goal of such studies is not to estimate the label of an individual tweet but to estimate the distribution of a set of tweets across the classes of interest; that is, the granularity of interest is the aggregate level rather than the individual level. I worked on the supervised estimation of class prevalence, i.e., the distribution of data across the classes of interest, a task known as quantification. Addressing quantification via classify and count, as most previous studies do, is suboptimal because a good classifier is not necessarily a good quantifier (or prevalence estimator). I investigated various quantification-specific algorithms and evaluation measures for learning to quantify tweet sentiment optimally under the single-label multi-class setting, with or without ordinal scales [8, 9]. I am now modeling quantification as a multi-target regression problem for learning to quantify the class prevalences. The technology under development has potentially wide applications in political science, social science, market research, and many other areas.
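The gap between classifying and quantifying can be illustrated with a small sketch. It does not reproduce the methods of [8, 9]; it only contrasts plain classify and count with the standard adjusted-count correction from the quantification literature, and scores both with Kullback-Leibler divergence, a commonly used quantification error measure. The function names, the tweet counts, and the classifier rates (tpr = 0.8, fpr = 0.1) are hypothetical values chosen for illustration.

```python
import math
from collections import Counter


def classify_and_count(predicted_labels, classes):
    """Estimate prevalence as the fraction of items predicted in each class."""
    counts = Counter(predicted_labels)
    total = len(predicted_labels)
    return {c: counts[c] / total for c in classes}


def adjusted_count(cc_positive, tpr, fpr):
    """Binary adjusted count: correct the classify-and-count estimate using
    true/false positive rates measured on held-out labeled data."""
    if tpr == fpr:  # correction undefined; fall back to the raw estimate
        return cc_positive
    corrected = (cc_positive - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, corrected))  # clip to a valid proportion


def kl_divergence(true_prev, est_prev, eps=1e-9):
    """KL divergence from true to estimated prevalence (lower is better)."""
    return sum(
        p * math.log((p + eps) / (est_prev[c] + eps))
        for c, p in true_prev.items()
        if p > 0
    )


# Hypothetical test set: 1000 tweets, 20% truly positive. A classifier with
# tpr=0.8 and fpr=0.1 labels 0.2*0.8*1000 + 0.8*0.1*1000 = 240 tweets positive.
true_prev = {"pos": 0.2, "neg": 0.8}
predicted = ["pos"] * 240 + ["neg"] * 760

cc = classify_and_count(predicted, ["pos", "neg"])
acc_pos = adjusted_count(cc["pos"], tpr=0.8, fpr=0.1)
acc = {"pos": acc_pos, "neg": 1.0 - acc_pos}

print("classify and count:", cc, "KLD =", kl_divergence(true_prev, cc))
print("adjusted count:    ", acc, "KLD =", kl_divergence(true_prev, acc))
```

In this constructed case the classifier overestimates the positive class at 24%, while the adjusted estimate recovers approximately 20%. In practice tpr and fpr are themselves estimates and may shift with the class distribution, which is one reason quantification-specific learning methods are studied.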

– Microblog topic extraction: Modeling topics in massive collections of microblog posts can uncover the hidden semantic structures of the underlying collection and is also useful to downstream applications. However, content from microblogging platforms is short, colloquial, and lacking in structure, rendering conventional topic models ineffective. A propagation structure constructed from message reposting and replying relations may not only enrich the context of a conversation but also provide useful clues for identifying topics and salient information. Using structured prediction algorithms such as Conditional Random Fields (CRF), trained on annotated data with the enriched context, we differentiate posts into leaders, which initiate important aspects of previously focused topics or shift the focus to different topics, and followers, which do not introduce any new information but simply respond to the messages they repost or reply to. The identified leader-follower differentiation is then incorporated as prior knowledge to guide topic modeling, where we exploit the words’ topic dependencies contained in leader-follower interactions to enhance topic assignments [10]. The defined propagation structure is generic and has been further exploited in other tasks, including microblog summarization [11] and rumor detection [3, 4].
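As a rough illustration of a propagation structure built from repost and reply relations, the sketch below assembles a tree from (post, parent) pairs and walks it depth-first, giving each post its conversational context. The data layout, the function names, and the sample identifiers are assumptions made for this example and are not taken from [10]; the leader-follower labeling and the topic model are not shown.

```python
from collections import defaultdict


def build_propagation_tree(posts):
    """Build a tree from (post_id, parent_id) pairs; parent_id is None for a source post.

    Returns (roots, children), where children maps a post id to the ids of
    the posts that repost or reply to it, in input order.
    """
    children = defaultdict(list)
    roots = []
    for post_id, parent_id in posts:
        if parent_id is None:
            roots.append(post_id)
        else:
            children[parent_id].append(post_id)
    return roots, dict(children)


def depth_first(root, children):
    """Return post ids in depth-first order, so each reply follows its parent's thread."""
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        # Push replies in reverse so the earliest reply is visited first.
        stack.extend(reversed(children.get(node, [])))
    return order


# Hypothetical conversation: one source post, two direct replies, one nested reply.
conversation = [("s", None), ("r1", "s"), ("r2", "s"), ("r3", "r1")]
roots, children = build_propagation_tree(conversation)
print(depth_first(roots[0], children))
```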

– Cross-source news summary and highlights generation: Social media streams are regarded as faster, first-hand sources of information generated by massive numbers of users. The content diffused through this channel, although noisy, can provide an important complement, and sometimes even a substitute, to traditional news media reports. I proposed approaches to summarize a given subject matter by jointly identifying salient, complementary excerpts across news media and social media platforms. The intuition is that news articles typically emphasize general or objective aspects of an event, while social media posts may express more specific and subjective information, so the two are complementary to each other. I patented a complementary measure, which balances the commonality and difference of any given post-sentence pair, for finding the posts that complement relevant news sentences based on a variant of the LDA-based topic model [19]. In addition to such implicit cross-source correlation, we also used posts that explicitly link to a news article to generate single-document summaries or news highlights [13, 14], based on the hypotheses that 1) the post content can indicate the position of salient sentences in the article; and 2) the post condenses the original news sentences owing to the post length restriction. With the relevant posts as helpers, we explored various supervised and unsupervised summarization algorithms for generating story highlights.

– Heterogeneous transfer learning and applications: In many domains where machine learning is applied, it is expensive and time-consuming to obtain enough training data to learn the needed models, because the required labeled data may not be available. Bridging the domain gap is critical for a model trained in a source domain to generalize well to a target domain. We developed generic domain adaptation methods for improving knowledge transfer, applicable to text classification and web search result ranking [15, 16, 17, 20, 21, 22]. In particular, we proposed heterogeneous learners that combine the multi-view learning and transfer learning paradigms, and demonstrated their effectiveness for cross-domain document classification [15, 17] and web search ranking adaptation [16]. To improve vertical search and the adaptation of a search engine, I also proposed an active ranking adaptation approach that unifies active query selection and ranking model adaptation in a principled framework [20].

– Cross-lingual query log mining for web search: In my PhD study, I worked on web search technologies that cater to users’ cross-lingual information needs by mining common search interests from query logs in different languages. I aimed to bridge the language gap between users’ needs and the information by reformulating web search from two perspectives, namely query formulation and result ranking. Unlike previous cross-lingual query formulation methods such as query translation and expansion, I developed a novel cross-lingual query suggestion method that leverages large-scale search engine query logs to learn to suggest closely related target-language queries for a given source-language query [23, 26]. The idea is motivated by the ever-increasing common search interests of users from different language backgrounds. I further represented such common search intents across languages by forming bilingual queries, and then tried to improve monolingual web search by aligning the cross-lingually similar results obtained by issuing bilingual queries, which enriches the monolingual ranker with useful cross-lingual features for learning to rank the monolingual search results [24]. Furthermore, I developed a joint ranking model for multilingual web search based on the Restricted Boltzmann Machine (RBM), a stochastic recurrent neural network, to incorporate various relevant measures for holistically ranking search results in multiple languages [25]. By modeling the semantic similarities of search results as edge features between neurons in the RBM, the model leveraged relevant documents in one language to help estimate the relevance of documents in other languages, improving the overall relevance estimation via joint probabilistic inference.

Research Plan

The popularity of social media platforms such as Twitter and Facebook has pushed information dissemination and sharing to a new height, thanks to phenomenal real-time citizen journalism. However, without systematic moderation efforts in social media, large volumes of false or unverified information can spread wildly and pollute our online society through malicious activities such as spreading rumors, fabricating news reports, and conducting unfair social advertising or political campaigns. Creating and disseminating false information via social media platforms has never been so easy or so cheap. The massive spread of falsehoods exposes individuals and society to increasing danger with devastating repercussions, giving rise to daunting “post-truth era” phenomena that pose unprecedented challenges to our contemporary news ecosystem.

A public survey shows that only 4% of adults can correctly identify whether a news story is true or fake. Humans are naturally not good at differentiating real from fake news, owing to the effects of naive realism and confirmation bias [27]. Fact-checking is the professional act of verifying claims in news reports to determine the veracity and truthfulness of the asserted statements. On the Internet, popular fact-checking websites such as snopes.com and politifact.com conduct post-hoc verification of the veracity of circulated news events originating from social media posts and news reports, based on world knowledge and investigative journalism. This non-trivial process requires a tremendous amount of time for manual investigation and analysis; moreover, it is prone to low efficiency and poor coverage owing to the complexity of the topics to be checked, and it cannot keep pace with the speed at which information is generated and diffused online.

Existing automated approaches to fake news and rumor detection rely solely on training supervised classifiers, for which past events or claims are gathered and labelled as trustworthy or fake. Various linguistic, temporal, network, and propagation features are typically extracted from social media posts relevant to the events or claims concerned to generate an appropriate data representation [1, 2, 3, 4, 5, 6]. However, fact-checking is inherently difficult and sometimes controversial, and research on information credibility and user trust evaluation is still too immature to deal with many fundamental issues in the social media context. For example, methods that classify the veracity of news by assessing and comparing user reactions, viewpoints, or stances on the claims can suffer from low recall, since many news stories may not spark enough enquiry posts. Moreover, the sources of responsive information can be noisy, ambiguous, and propagandistic, and are thus unreliable for deriving a solid credibility assessment.

I propose to investigate novel and solid fact-checking methods based on in-depth natural language analysis combined with network analysis, cross-validating news veracity by moderating different news sources across social media and the Web. I will create new algorithms and pipelines to: 1) build effective models for locating relevant clues for extracting supporting evidence from heterogeneous news platforms; 2) construct evidence embeddings by learning to represent structured features of evidence at various granularities; 3) develop trust and credibility models to enable concrete quantitative veracity evaluation; and 4) exploit natural language inference for verifying the claims concerned based on the learned evidence and knowledge.

Validating a claim requires accurately extracting various pieces of (partial) evidence, organizing them properly, and evaluating their overall cohesion and coherence. This in turn requires effectively fusing different kinds of techniques for meaningful cross fact-checking, such as evidence retrieval, discourse analysis, natural language understanding and inference at various levels, and learning models for representing useful patterns from networked, noisy content. Together, these constitute a consistent research line of social NLP for dealing with these challenges.

I also plan to continue mining large-scale social media data collections and building technologies conducive to downstream applications: 1) I will continue my text quantification study by looking into learning more effective prevalence estimation functions and studying their robustness under various magnitudes of class distribution drift. Robustness is crucial for accurately quantifying class prevalence, since the class distribution typically varies extensively across different topics of interest. Furthermore, I will explore feature learning for automatically generating quantification features, to address the weak representativeness that results from conventional approaches. 2) Inspired by the success of the IBM Watson QA system in closed-domain applications, I am interested in investigating open-domain Community Question Answering (cQA), particularly in resource-poor settings where there are few reliable knowledge sources to draw on. I will focus on automatic inference of the factual accuracy and trustworthiness of information for answer verification.

References

[1] J. Ma, W. Gao, Z. Wei, Y. Lu, and K.F. Wong. Detect rumors using time series of social context information on microblogging websites. CIKM 2015.

[2] J. Ma, W. Gao, P. Mitra, S. Kwon, B.J. Jansen, K.F. Wong, and M. Cha. Detecting rumors from microblogs with recurrent neural networks. IJCAI 2016.

[3] J. Ma, W. Gao, and K.F. Wong. Detect rumors in microblog posts using propagation structure via kernel learning. ACL 2017.

[4] J. Ma, W. Gao, and K.F. Wong. Rumor detection on Twitter with tree-structured recursive neural networks. ACL 2018.

[5] J. Ma, W. Gao, and K.F. Wong. Detect rumor and stance jointly by neural multi-task learning. WWW 2018 Companion.

[6] B. Rath, W. Gao, J. Ma, and J. Srivastava. From retweet to believability: Utilizing trust to identify rumor spreaders on Twitter. ASONAM 2017.

[7] B. Rath, W. Gao, J. Ma, and J. Srivastava. Utilizing computational trust to identify rumor spreaders on Twitter. Social Network Analysis and Mining, 8:64, December 2018.

[8] W. Gao and F. Sebastiani. From classification to quantification in tweet sentiment analysis. Social Network Analysis and Mining, Volume 6, Issue 1, Article 19, 2016.

[9] G. Da San Martino, W. Gao, and F. Sebastiani. Ordinal text quantification. SIGIR 2016.

[10] J. Li, M. Liao, W. Gao, Y. He, and K.F. Wong. Topic extraction from microblog posts using conversation structures. ACL 2016.

[11] J. Li, W. Gao, Z. Wei, B. Peng, and K.F. Wong. Using content-level structures for summarizing microblog repost trees. EMNLP 2015.

[12] K. Song, S. Feng, W. Gao, D. Wang, G. Yu, and K.F. Wong. Personalized sentiment classification based on latent individuality of microblog users. IJCAI 2015.

[13] Z. Wei and W. Gao. Utilizing microblogs for automatic news highlights extraction. COLING 2014.

[14] Z. Wei and W. Gao. Gibberish, assistant, or master? Using tweets linking to news for extractive single-document summarization. SIGIR 2015.

[15] P. Yang and W. Gao. Information-theoretic multi-view domain adaptation: A theoretical and empirical study. Journal of Artificial Intelligence Research, 49:501-525, 2014.

[16] W. Gao and P. Yang. Democracy is good for ranking: Towards multi-view rank learning and adaptation in web search. WSDM 2014.

[17] P. Yang and W. Gao. Multi-view discriminant transfer learning. IJCAI 2013.

[18] Y. He, C. Lin, W. Gao, and K.F. Wong. Dynamic joint sentiment-topic model. ACM Transactions on Intelligent Systems and Technology, Volume 5, Issue 1, Article 6, 2013.

[19] W. Gao, P. Li, and K. Darwish. Joint topic modeling for event summarization across news and social media streams. CIKM 2012.

[20] P. Cai, W. Gao, K.F. Wong, and A. Zhou. Relevant knowledge helps in choosing right teacher: Active query selection for ranking adaptation. SIGIR 2011.

[21] P. Cai, W. Gao, A. Zhou, and K.F. Wong. Query weighting for ranking adaptation. ACL 2011.

[22] W. Gao, P. Cai, K.F. Wong, and A. Zhou. Learning to rank only using training data from related domain. SIGIR 2010.

[23] W. Gao, C. Niu, J.Y. Nie, M. Zhou, K.F. Wong, and H.W. Hon. Exploiting query logs for cross-lingual query suggestion. ACM Transactions on Information Systems 28(2), Article 6, 2010.

[24] W. Gao, J. Blitzer, M. Zhou, and K.F. Wong. Exploiting bilingual information to improve monolingual web search. ACL 2009.

[25] W. Gao, C. Niu, M. Zhou, and K.F. Wong. Joint ranking for multilingual web search. ECIR 2009.

[26] W. Gao, C. Niu, J.Y. Nie, M. Zhou, J. Hu, K.F. Wong, and H.W. Hon. Cross-lingual query suggestion using query logs of different languages. SIGIR 2007.

[27] K. Shu, A. Sliva, S. Wang, J. Tang, and H. Liu. Fake news detection on social media: A data mining perspective. ACM SIGKDD Explorations Newsletter, Vol. 19, Issue 1, 2017.