4.4 Methods based on social media

Social media are a pervasive global phenomenon whose use has grown steadily over the last decade. Facebook, Twitter, LinkedIn and Xing are among the most popular examples. They are huge data sources where users post information and opinions about a wide range of topics.

Since text and news have already been shown to help stock market analysis [Mit04, MK06a], researchers have started investigating whether social media, as a valuable source of unstructured data, can also support prediction. One of the most analyzed social networks is Twitter, whose content is dense with information because users must express opinions concisely: tweets were originally limited to 140 characters (a limit raised to 280 in 2017).

4.4.1 The predictive value of Twitter

Twitter is a well-known American online news and social networking service on which users post and interact with messages known as “tweets”. With over 330 million users and an average of 500 million tweets posted every day, it offers access to a continuous stream of global posts. Given the amount of data available, Twitter represents a vast knowledge base covering the most disparate topics.

Many research fields related to text mining have taken advantage of this source of information. For instance, sentiment analysis benefits from tweets because of the opinionated nature of what users post. Many articles have been written on sentiment analysis and the study of tweet polarity [PP10, TBP11, AXV+11, KBQ14, KS12, BMMP13, MCMVULMR14]. It can be argued that such a knowledge base can provide an indication of the public mood. Indeed, emotional states, although experienced by individual human beings, propagate through society and become a feature of the population as a whole. This phenomenon was studied by Bollen et al. [BMP11], who found that events in the social, political, cultural and economic spheres have a significant, immediate and highly specific effect on various dimensions of the public mood extracted from Twitter. They speculated that large-scale mood analysis can provide a solid platform to model collective emotive trends in terms of their predictive value with regard to existing social and economic indicators. This predictive power of Twitter mood has been used to forecast different phenomena, for example the sales of a movie [AH10], public opinion on a particular brand [GSZ13], the political alignment of Twitter users [CRF+11, CGR+11], and the results of elections [GA12].

CHAPTER 4. STOCK MARKET ANALYSIS 33

4.4.2 Twitter-based stock market prediction

Many approaches in the literature have applied sentiment analysis techniques to tweets in order to build forecasting models for the stock market. Bollen et al. [BMZ11] measured collective mood states (positive, negative, calm, alert, sure, vital, kind and happy) through sentiment analysis applied to more than 9 million tweets posted in 2008. Tweets were selected using generic sentiment expressions (e.g. “I’m feeling”) that were not directly related to the stock market. The authors analyzed tweets with two mood tracking tools: OpinionFinder (OF) [WHS+05], which classifies tweets as positive or negative, and Google-Profile of Mood States (GPOMS), which measures mood in six dimensions. They found that the calm mood profile yields the best prediction result for the Dow Jones Industrial Average (DJIA), with an accuracy of 86.7% in predicting daily directions in December. Furthermore, they showed that aggregating tweets over a 3-day period gives better predictions of the daily DJIA. Similarly, Chyan et al. [CHL12] used the calm score of tweets extracted from June and December 2009, achieving an accuracy of 75% in a 20-day prediction of the Dow Diamonds ETF (DIA). Accuracy increased up to 80% by adding a quantitative feature related to the previous DIA value.
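The 3-day aggregation and the directional accuracy mentioned above can be illustrated with a minimal sketch. It is not the model used by the cited authors: the trailing mean, the rule "predict an up day when the smoothed score rose on the previous day", the function names and the toy numbers are all assumptions made purely for illustration.

```python
from statistics import mean


def trailing_mean(values, window=3):
    """Mean of the last `window` values; None until enough history exists."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(mean(values[i + 1 - window : i + 1]))
    return out


def direction_accuracy(signal, closes):
    """Fraction of days t on which the direction of the signal change
    (day t-2 to day t-1) matches the direction of the index move
    (day t-1 to day t). Days without enough signal history are skipped."""
    hits = total = 0
    for t in range(2, len(closes)):
        if signal[t - 1] is None or signal[t - 2] is None:
            continue
        predicted_up = signal[t - 1] > signal[t - 2]
        actual_up = closes[t] > closes[t - 1]
        hits += predicted_up == actual_up
        total += 1
    return hits / total if total else float("nan")


# Synthetic daily mood scores and index closes (illustrative values only).
calm = [0.2, 0.4, 0.3, 0.6, 0.5, 0.7, 0.4, 0.2]
index_close = [100, 101, 102, 101, 103, 102, 105, 103]

smoothed = trailing_mean(calm, window=3)
accuracy = direction_accuracy(smoothed, index_close)
print(f"directional accuracy on toy data: {accuracy:.2f}")
```

On real data the signal would come from a mood tracking tool and the evaluation would use a held-out test period, as in the studies summarized above.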

An analysis similar to that of Bollen et al. was conducted by Mittal and Goel [MG12]. The authors employed the same dataset used in [CHL12] for a multi-class classification, considering only calm, happy, alert and kind as mood dimensions. Furthermore, four learning algorithms (i.e. Linear Regression, Logistic Regression, SVMs and SOFNN) were used to learn the model and make the actual predictions; among them, the SOFNN-based model performed best, achieving an accuracy of nearly 76%. Oliveira et al. [OCA13] compared six popular lexical resources for sentiment analysis (Harvard General Inquirer, Opinion Lexicon, Macquarie Semantic Orientation Lexicon, MPQA Subjectivity Lexicon, SentiWordNet, Emoticons) to evaluate the usefulness of each resource for stock prediction. Sprenger et al. [STSW14] applied sentiment analysis to stock-related tweets collected over a six-month period. To reduce noise, they selected tweets containing cashtags (i.e. ticker symbols prefixed with $) of S&P 100 companies. Each message was classified by a Naïve Bayes method trained on a set of 2,500 tweets. The analysis showed that sentiment indicators are associated with abnormal returns, and that message volume is correlated with trading volume. Similarly, Rao and Srivastava [RS14] associated a polarity with each day, based on the number of positive and negative tweets identified via sentiment140, and tested the DJIA and NASDAQ-100 indexes over a 13-month period between 2010 and 2011. Mao et al. [MCB11b] surveyed a variety of web data sources (Twitter, news headlines and Google search queries) and tested two sentiment analysis methods for the prediction of stock market behavior. They found that their Twitter sentiment indicator and the frequency of financial terms on Twitter are statistically significant predictors of daily market returns.
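A per-day polarity derived from counts of positive and negative tweets can be sketched as follows. The text does not give the exact formula used in the cited work, so the smoothed log ratio below is an assumption (one common way to turn two counts into a signed score), as are the function name, the label strings and the example tweets.

```python
import math
from collections import Counter, defaultdict


def daily_polarity(labeled_tweets):
    """Map each day to ln((1 + positives) / (1 + negatives)).

    `labeled_tweets` is an iterable of (day, label) pairs, where label is
    "pos" or "neg". The +1 smoothing avoids division by zero on days with
    no negative tweets. The formula is an illustrative assumption.
    """
    counts = defaultdict(Counter)
    for day, label in labeled_tweets:
        counts[day][label] += 1
    return {
        day: math.log((1 + c["pos"]) / (1 + c["neg"]))
        for day, c in sorted(counts.items())
    }


# Hypothetical classifier output: (day, sentiment label) per tweet.
tweets = [
    ("2011-01-03", "pos"),
    ("2011-01-03", "pos"),
    ("2011-01-03", "neg"),
    ("2011-01-04", "neg"),
]
scores = daily_polarity(tweets)
for day, score in scores.items():
    print(day, round(score, 3))
```

A positive score marks a day dominated by positive tweets, a negative score the opposite, and zero a balanced day; such a series can then be compared with daily index returns.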

Other approaches do not use sentiment analysis to make predictions. For example, Mao et al. [MWWL12] used a linear regression model to analyze the correlation between Twitter predictors and stock indicators at three levels (i.e. stock market, sector and single company). The authors discovered that the daily number of tweets mentioning S&P 500 stocks was significantly correlated with the S&P 500 daily closing price. An accuracy of about 68% was obtained in a 19-day test for stock market and sector level prediction, and about 52% for individual company stocks. Porshnev et al. [PRS13] created different types of features: to a “basic” bag-of-words of the previous day’s tweets, they added features counting the tweets that contain the words hope, worry or fear (Basic&HWF), or the words happy, loving, calm, energetic, fearful, angry, tired and sad (Basic&8emo). Training a support vector machine model on such datasets over a 7-month period in 2013, they obtained a maximum baseline accuracy of 65.17% for DJIA, 57% for S&P 500 and 50.67% for NASDAQ. Taking a different approach, Ruiz et al. [RHC+12] extracted two types of features, one concerning the overall activity on Twitter and the other measuring the properties of an induced interaction graph. They found a correlation between these features and changes in S&P 500 price and traded volume. Zhang et al. [ZFG11] found a high negative correlation (0.726, significant at level p < 0.01) between the Dow Jones index and the presence of some words in tweets, such as hope, fear and worry. A quantitative analysis was carried out by Mao et al. [MWW13]: using Twitter volume spikes over a 15-month period (from February 2012 to May 2013), they trained a Bayesian classifier to assist S&P 500 stock trading, showing that it may lead to substantial profit. Arias et al. [AAX13] showed through extensive testing that the prediction of stocks or indexes can be improved by adding Twitter-related data (either in terms of volume or public sentiment) to non-linear time series models (support vector machines or neural networks).
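The Basic&HWF feature construction described above (a bag-of-words of one day's tweets plus the number of tweets containing hope, worry or fear) can be sketched as follows. This is a simplified illustration rather than the cited authors' pipeline: the tokenizer, the exact-match rule, the feature names and the example tweets are assumptions.

```python
import re
from collections import Counter

HWF_WORDS = ("hope", "worry", "fear")


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def day_features(tweets, extra_words=HWF_WORDS):
    """Build one feature dictionary for a day's tweets.

    Word keys hold bag-of-words term counts over all tweets; the
    hypothetical `n_tweets_<word>` keys hold the number of tweets that
    contain each extra word at least once.
    """
    bag = Counter()
    tweets_with = Counter()
    for tweet in tweets:
        tokens = tokenize(tweet)
        bag.update(tokens)
        for word in extra_words:
            if word in tokens:
                tweets_with[word] += 1
    features = dict(bag)
    for word in extra_words:
        features[f"n_tweets_{word}"] = tweets_with[word]
    return features


# Made-up tweets standing in for one day's sample.
sample = [
    "I hope the market rallies",
    "Hope and fear today",
    "nothing to worry about, no worry",
]
features = day_features(sample)
print(features["n_tweets_hope"], features["n_tweets_worry"], features["n_tweets_fear"])
```

Exact token matching means that inflected forms such as "hopes" or "worried" are not counted; a stemmer could be added if that matters. The resulting dictionaries, one per day, would be vectorized and passed to a classifier such as a support vector machine.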

5 Recommender systems for job search

Job searching, also known as job hunting or job seeking, is the process by which people look for and try to obtain a job. Conversely, finding the right employee is a key concern for enterprises, which continuously have to recruit professionals according to their current needs. Both tasks are sides of the same general problem, namely enabling communication between companies and potential applicants in order to establish an employment relationship.

Recommender Systems (RSs) are software tools and techniques that provide suggestions for items that are most likely of interest to a particular user [AT05, RRS15]. The word item refers to what is recommended, from movies to products, to jobs. The increased amount of digital information available in the age of Big Data has changed the way companies conduct their business, and also the way they employ candidates.

E-recruitment platforms have spread in recent years to address the challenge of recruiting the right person. At the same time, they have become an opportunity for job seekers, who typically publish their profiles on job portals. Looking for a job is a tough, often tedious task that everyone eventually has to face. Each year, millions of people engage in job search for several reasons, including accidental job loss, return to work, completion of job training, or the desire to pursue new career opportunities. Consequently, a large volume of job postings and user profiles is available online, and can be exploited to improve the matching between candidates and jobs.

Job recommender systems aim at suggesting jobs that are as pertinent as possible to candidates’ profiles. The best fit may depend on aspects that are hard to measure, such as personal characteristics, social skills, preference for the job location, and so on. For these reasons, studies investigating job search behavior have recently proliferated, and many approaches have been proposed to match candidates with jobs.

In this chapter, an overview of recruitment systems and job recommendation systems is given, briefly describing challenges as well as research advances. Additionally, an emerging research thread is introduced, namely Career Pathway Recommendation (CPR), whose goal is to help a user achieve a career goal, possibly far in the future, by suggesting the most convenient path towards it.

36 CHAPTER 5. RECOMMENDER SYSTEMS FOR JOB SEARCH

5.1 Recruitment systems

Recruitment, or recruiting, refers to the process of identifying and selecting suitable candidates for jobs within an organization. Such a process has been extensively studied in the human resources area [MWH94, AVdV01], and recruitment systems have been built accordingly [Lee07, ELW08], which are used by human resources departments to select candidates fitting the profiles enterprises are looking for.

In recent times, e-recruitment platforms have emerged to support the recruiting process by exploiting the increased amount of web information [TBW08], especially that collected from social networks such as LinkedIn, Facebook, Twitter and Xing [ZESD14, Fle15]. These platforms have led job seekers to publish their curricula on job portals. For each posted job, companies receive thousands of resumes; consequently, a huge volume of job descriptions and candidate resumes is becoming available online. Curricula and profiles can be exploited by recommender systems to match candidates with jobs using various approaches, including collaborative filtering, content-based filtering, knowledge-based and hybrid approaches [WHF07]. However, e-recruiting platforms are usually based on boolean search and filtering techniques, which cannot sufficiently capture the complexity of person-job fit in selection decisions [MKWW06].

Davison et al. [DMB11] pointed out that LinkedIn provides more accurate information than Facebook because anybody in a person’s network can easily contradict her assertions. This may be among the reasons why Zide et al. [ZESD14] defined LinkedIn as the world’s largest professional network. In addition to its reliability, LinkedIn also offers recruiter accounts designed to support the recruiting process, and about 94% of recruiters use it [Kas15]. The same trend does not hold among job seekers on social media, of whom only 40% use this network, although members are sometimes notified of possibly interesting job offers. Since LinkedIn’s recommendation techniques are proprietary, a complete understanding of how job positions are recommended is not possible. Nevertheless, analyzing some public profiles and the corresponding recommended job positions, we noted that offers are often wrongly retrieved (i.e. not interesting) because of homonymy. This issue could make the job seeking process less effective, more manual and more time consuming.
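To make the contrast with boolean search concrete, content-based filtering can be sketched as ranking job descriptions by their textual similarity to a candidate profile. The sketch below uses TF-IDF weights with a smoothed IDF and cosine similarity; this particular weighting, the tokenizer, the function names and the example texts are assumptions for illustration, not a method taken from the cited works.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep simple word-like tokens."""
    return re.findall(r"[a-z+#]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict) per document, using the
    smoothed IDF ln((1 + n) / (1 + df)) + 1."""
    term_counts = [Counter(tokenize(doc)) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for counts in term_counts for term in counts)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: c * idf[t] for t, c in counts.items()} for counts in term_counts]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_jobs(profile, jobs):
    """Return (job, score) pairs sorted from most to least similar to the profile."""
    vectors = tfidf_vectors([profile] + jobs)
    scores = [cosine(vectors[0], vec) for vec in vectors[1:]]
    return sorted(zip(jobs, scores), key=lambda pair: pair[1], reverse=True)


# Made-up profile and job postings.
profile = "python developer with machine learning experience"
jobs = [
    "truck driver with clean licence",
    "machine learning engineer python",
    "accountant",
]
ranked = rank_jobs(profile, jobs)
for job, score in ranked:
    print(f"{score:.3f}  {job}")
```

Unlike a boolean filter, which only accepts or rejects a posting, this produces a graded ranking. It still ignores the hard-to-measure aspects of person-job fit, which is one reason why hybrid and knowledge-based approaches are also used.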

While some works focused on the human resources aspect of the recruitment process [Bue14, PCG11], others demonstrated the benefit of machine learning approaches for job placement. For instance, Min and Emam [ME03] used rules generated by a decision tree to manage the recruitment of truck drivers. Buckley et al. [BMJM04] presented an automated recruitment and screening system, showing conservative savings at a 6 to 1 ratio due to reduced employee turnover, reduced staffing costs and increased hiring process efficiency. Chi [Chi99] applied principal component analysis to identify jobs that can adequately be performed by various types of disabled workers. A dataset of 1,259 occupations was used, summarized into 112 different jobs; 41 available skills were analyzed with principal component analysis, finding five principal factors, namely occupational hazard, verbal communication education and training, visual acuity, body agility, and manual ability. Finally, the 112 job titles were classified into 15 homogeneous clusters, creating useful data to expand both the counselor and counselee perspectives about job possibilities and job requirements through the five principal factors. Zhu et al. [ZZX+16] used unsupervised learning techniques to discover recruitment market trends based on large-scale recruitment data. Specifically, their sequential latent variable model is able to capture the sequential dependencies of corporate recruitment states, and to automatically learn the latent recruitment topics.

In document Big Data mining and machine learning techniques applied to real world scenarios (Page 44-49)