
3.2 Techniques

3.2.3 Cross-domain and transfer learning

Cross-domain learning is an attractive research thread because it paves the way for exploiting large-scale unlabeled data. Basically, a model is learned on a labeled source domain and then used to classify the polarity of a distinct, unlabeled target domain. For example, suppose we have learned a model on a set of movie reviews and are now interested in understanding people’s opinions about kitchen appliances, whose reviews are unfortunately not labeled. Manually categorizing the unlabeled reviews is time-consuming and becomes infeasible with large-scale text sets. Instead of manually labeling the kitchen appliance reviews and building a model from scratch, we want to exploit a model that is already available.

The rationale behind the cross-domain approach is model reuse: a model is built once and used multiple times. Reusability is relevant for both research and business, since collecting, labeling, processing and analyzing data are time-consuming tasks that hinder scalability in big data scenarios. However, the test documents may not reflect the regularities of the training set, because language is heterogeneous across domains. For example, while a movie review is likely to include terms like amusing or boring, a kitchen appliance review is more likely to contain clean or broken. Hence, a transfer learning phase is typically required to bridge the inter-domain semantic gap.
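The inter-domain gap can be made concrete with a minimal sketch. The toy reviews, the whitespace tokenizer and the helper names (`train_naive_bayes`, `classify`, `oov_rate`) are all assumptions introduced here for illustration; the classifier is a plain multinomial Naive Bayes with add-one smoothing, not one of the methods surveyed in this section. A model is trained on labeled movie reviews and applied unchanged to unlabeled kitchen appliance reviews, reporting how many target tokens it has never seen.

```python
import math
from collections import Counter

# Toy data, invented for illustration only.
SOURCE_MOVIES = [
    ("an amusing and great movie", "pos"),
    ("great plot and amusing actors", "pos"),
    ("a boring and bad movie", "neg"),
    ("bad plot and boring actors", "neg"),
]
TARGET_KITCHEN = [
    "a great blender and easy to clean",
    "bad toaster arrived broken",
]

def tokenize(text):
    return text.lower().split()

def train_naive_bayes(labeled_docs):
    """Return per-class token counts, class document counts and the vocabulary."""
    token_counts = {}
    class_counts = Counter()
    vocab = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        counts = token_counts.setdefault(label, Counter())
        for tok in tokenize(text):
            counts[tok] += 1
            vocab.add(tok)
    return token_counts, class_counts, vocab

def classify(text, model):
    """Pick the class with the highest log-posterior; unseen tokens are skipped."""
    token_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        denom = sum(token_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # target-only words carry no evidence
                score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def oov_rate(text, vocab):
    """Fraction of tokens in the text that never occurred in the source domain."""
    toks = tokenize(text)
    return sum(tok not in vocab for tok in toks) / len(toks)

model = train_naive_bayes(SOURCE_MOVIES)
for review in TARGET_KITCHEN:
    unseen = oov_rate(review, model[2])
    print(f"{review!r} -> {classify(review, model)} (unseen tokens: {unseen:.0%})")
```

In this toy setting the source model can only rely on the few words shared by both domains (such as great or bad), while domain-specific terms like clean or broken are ignored. This is the situation a transfer learning phase is meant to address.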

Transfer learning generally entails learning knowledge from a source domain and using it in a target domain. Specifically, cross-domain methods are used to handle data of a target domain where labeled instances are only available in a source domain, similar but not equal to the target one. While these methods are used in image matching [SMGE11], genomic prediction [DMMP16] and many other contexts, classification of text documents by either topic or sentiment is perhaps their most common application.

Two major approaches can be distinguished in cross-domain classification [PY+10]: “instance transfer” directly adjusts source instances to the target domain, while “feature representation transfer” maps features of both domains to a different common space. The most popular transfer learning methods are clustering algorithms and approaches that make use of feature expansion and external, sometimes hierarchical, knowledge bases. These techniques, employed in conjunction with standard text classification algorithms, lead to good results in sentiment classification [TWC08, QZHZ09, MGL09]. Nevertheless, parameter tuning is necessary to ensure adequate effectiveness, because the parameter values that yield optimal accuracy on one text set do not usually produce comparably good results on different corpora. Parameter tuning is therefore a bottleneck whenever a new text set has to be classified. Defining algorithms that are not affected (or only slightly affected) by this problem is an open research challenge.
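The difference between the two approaches can be sketched on toy data. Everything below is an illustrative assumption: the corpora, the overlap-based weighting in `instance_weights` and the hand-written `SHARED_SPACE` mapping are deliberately simplistic stand-ins, whereas the methods in the literature learn weights and mappings from data.

```python
from collections import Counter

# Toy corpora, invented for illustration only.
source_docs = ["boring plot", "great value", "amusing actors", "bad quality"]
target_docs = ["great quality blender", "bad value toaster"]

def tokens(text):
    return text.lower().split()

def instance_weights(source, target):
    """Instance transfer (simplified): weight each source document by the
    fraction of its tokens that also occur in the target domain."""
    target_vocab = {tok for doc in target for tok in tokens(doc)}
    weights = []
    for doc in source:
        toks = tokens(doc)
        weights.append(sum(tok in target_vocab for tok in toks) / len(toks))
    return weights

# Hypothetical hand-written mapping from domain-specific words to shared
# features; real methods learn such a mapping from data.
SHARED_SPACE = {
    "boring": "NEGATIVE_QUALITY", "broken": "NEGATIVE_QUALITY",
    "amusing": "POSITIVE_QUALITY", "clean": "POSITIVE_QUALITY",
}

def to_shared_features(text):
    """Feature representation transfer (simplified): project tokens of either
    domain into a common feature space, leaving unmapped tokens unchanged."""
    return Counter(SHARED_SPACE.get(tok, tok) for tok in tokens(text))

print(instance_weights(source_docs, target_docs))
print(to_shared_features("boring movie"))
print(to_shared_features("broken toaster"))
```

With instance transfer, source documents that resemble the target domain count more during training. With feature representation transfer, a movie review containing boring and an appliance review containing broken end up sharing a feature, so a standard classifier trained on one domain can use it on the other.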

With reference to text categorization by topic, transfer learning has been accomplished in different ways, for example by clustering together documents and words [DXYY07], by extending probabilistic latent semantic analysis to unlabeled instances [XDYY08], by extracting latent words and topics, both common and domain-specific [LJL12], and by iteratively refining the representation of target categories without burdensome parameter tuning [DMPS14a, DMPS14b].

Many transfer learning methods have also been proposed for cross-domain sentiment classification. Aue and Gamon [AG05] tried several approaches to customize a classifier to a new target domain: training on a mixture of labeled data from other domains where such data are available, possibly considering only the features observed in the target domain; using multiple classifiers trained on labeled data from diverse domains; and including a small amount of labeled data from the target domain. Bollegala et al. [BWC13] suggested the adoption of a thesaurus created from labeled data of the source domain and unlabeled data of both source and target domains. Blitzer et al. [BDP07] discovered a measure of domain similarity that contributes to better domain adaptation. Pan et al. [PNS+10] proposed a spectral feature alignment algorithm, which aims to align words belonging to different domains into the same clusters by means of domain-independent terms. Such clusters form a latent space that can be used to improve the sentiment classification accuracy on a target domain. He et al. [HLA11] extended the joint sentiment-topic model by incorporating prior word sentiment through a modification of the topic-word Dirichlet priors. Feature and document expansions are performed by adding polarity-bearing topics to align domains. Zhang et al. [ZHL+15] proposed an algorithm that transfers the polarity of features from the source domain to the target domain, using the domain-independent features as a bridge. Their approach addresses not only the feature divergence issue, where different features are used to express similar sentiment in different domains, but also the polarity divergence problem, where the same feature is used to express different sentiment in different domains. Franco et al. [FSCTR15] used the BabelNet multilingual semantic network to generate features derived from word sense disambiguation and vocabulary expansion, which can help both in-domain and cross-domain tasks. Bollegala et al. [BMG16] modeled cross-domain sentiment classification as embedding learning, using objective functions that capture domain-independent features, label constraints in the source documents and some geometric properties derived from both domains without supervision.
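The idea of aligning domain-specific words through domain-independent terms, mentioned above for spectral feature alignment, can be illustrated with a rough sketch. This is not the spectral algorithm of [PNS+10]: it replaces the spectral step with a simple co-occurrence count, and the toy reviews and the helper name `align_by_pivot` are assumptions made here for illustration only.

```python
from collections import Counter, defaultdict

# Toy reviews from two domains, invented for illustration only.
movie_reviews = ["great amusing film", "bad boring film"]
kitchen_reviews = ["great clean blender", "bad broken blender"]

def tokens(text):
    return text.lower().split()

def vocabulary(docs):
    return {tok for doc in docs for tok in tokens(doc)}

# Domain-independent terms: here simply the words seen in both domains.
pivots = vocabulary(movie_reviews) & vocabulary(kitchen_reviews)

def align_by_pivot(docs, pivots):
    """Group each domain-specific word under the domain-independent term it
    co-occurs with most often; words with a tie are left unaligned."""
    cooccurrence = defaultdict(Counter)
    for doc in docs:
        toks = tokens(doc)
        doc_pivots = [tok for tok in toks if tok in pivots]
        for word in toks:
            if word not in pivots:
                cooccurrence[word].update(doc_pivots)
    clusters = defaultdict(set)
    for word, counts in cooccurrence.items():
        ranked = counts.most_common(2)
        if not ranked:
            continue  # the word never appeared next to a pivot
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            continue  # ambiguous word, no dominant pivot
        clusters[ranked[0][0]].add(word)
    return dict(clusters)

clusters = align_by_pivot(movie_reviews + kitchen_reviews, pivots)
for pivot in sorted(clusters):
    print(pivot, sorted(clusters[pivot]))
```

Words from different domains that behave alike with respect to the shared terms end up in the same cluster, and the cluster identifiers can then serve as additional features for a classifier applied to the target domain.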

4 Stock market analysis

It is no mystery that stock market analysis is a primary interest for finance, and it has historically attracted attention from shareholders as well as academia. Understanding the market trend allows investors to choose the best trade-off between risk minimization and profit maximization.

This chapter gives an overview of the problem, focusing on the techniques used for analysis and prediction rather than on the financial point of view. In particular, several well-known approaches to stock market forecasting are discussed, emphasizing the predictive potential of social media. Driven by this existing research, a novel Twitter-based method for stock market prediction and trading will be proposed in Part III.

4.1 Introduction

The Efficient Market Hypothesis (EMH), proposed by Fama [Fam65], states that prices of financial assets are driven by rational investors who rely on new information, i.e. news. Since news is not predictable, neither is the stock market, which generally follows a random walk, according to past studies [KAYT90, FAM91]. The EMH was later critically re-examined by Malkiel [Mal03], who discussed the evidence on whether market prices reflect all the available information. Moreover, several studies showed that the trend of the stock market does not follow a random walk model and can be predicted to some extent [LM88, BM92], for example by applying mining techniques to market news [GE01, SC06], to past prices [LWD+11], or even to financial reports [LLKC11].
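As a reminder of what the random walk model implies, a minimal simulation is sketched below. Under this model the price at time t equals the price at time t-1 plus an unpredictable shock, so past price changes carry no information about future ones. The Gaussian shocks, the starting price, the seed and the function names are assumptions made here for illustration; they do not come from the cited studies.

```python
import random

def simulate_random_walk(n_steps, start=100.0, seed=0):
    """Generate prices where each step adds an independent Gaussian shock."""
    rng = random.Random(seed)
    prices = [start]
    for _ in range(n_steps):
        prices.append(prices[-1] + rng.gauss(0.0, 1.0))
    return prices

def lag1_autocorrelation(values):
    """Sample correlation between consecutive elements of a series."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values)
    cov = sum((values[i] - mean) * (values[i + 1] - mean) for i in range(n - 1))
    return cov / var

prices = simulate_random_walk(5000)
changes = [later - earlier for earlier, later in zip(prices, prices[1:])]
print(round(lag1_autocorrelation(changes), 3))
```

For a true random walk the lag-1 autocorrelation of the price changes stays close to zero. Studies that reject the random walk model report, in essence, statistically significant departures from this behavior in real market data.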

Many approaches to stock market analysis and prediction have been proposed over time, from time series prediction to textual news analysis and social network analysis. In this chapter, we start from traditional approaches to stock market prediction, then review methods based on text and news analysis, and finally summarize the most recent work using social network information to forecast market prices.

In document Big Data mining and machine learning techniques applied to real world scenarios (Page 38-41)