9.3 A trading protocol for DJIA
9.3.2 Experiments
We ran experiments comparing three prediction models: a random predictor of the DJIA rise/drop, and the two prediction models described in Section 9.2. Recall that the model trained without removing noisy tweets achieved an F-measure of 79.9%, whereas discarding noisy tweets raised the F-measure to 88.9%. Trading actions follow the protocol described in Section 9.3.1. The trading period coincides with the test interval of the prediction, i.e. from October to December 2008.
CHAPTER 9. METHODS FOR DJIA INDEX PREDICTION
The performance of the three experiments has been measured according to the return on investment (ROI) described in Section 9.3.1. Table 9.5 reports the total ROI achieved by the above-mentioned prediction models over the three months used as the testing period.
Prediction model type    ROI       Std. dev.
Random                   -0.097    0.138
No noise removal          0.562    n/a
Noise removal             0.840    n/a
Table 9.5: Total return on investment (ROI) in the three months used as the testing period. Three prediction models have been used: random trading actions, trading without removing noisy tweets, and trading with noise removal.
The random prediction model, which conducts completely random buy/sell trading actions, has been executed 10,000 times, changing the random generator seed. On average, it achieves a 10% loss with a 14% standard deviation.
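The repeated-seed evaluation of the random baseline can be sketched as follows. This is a minimal illustration under stated assumptions, not the protocol of Section 9.3.1: the function names, the long/short rule, and the synthetic daily returns are all hypothetical, and the numbers it prints are unrelated to the results in Table 9.5.

```python
import random
import statistics


def random_trading_roi(daily_returns, rng):
    """ROI of a trader that takes a random buy/sell position every day.

    Hypothetical rule (an assumption, not the protocol of Section 9.3.1):
    +1 means buy (gain when the index rises), -1 means sell (gain when it drops).
    """
    capital = 1.0
    for r in daily_returns:
        position = rng.choice((1, -1))
        capital *= 1.0 + position * r
    return capital - 1.0  # ROI relative to the initial capital of 1.0


def random_baseline(daily_returns, runs=10_000, base_seed=0):
    """Repeat the random trader with a different seed per run.

    Returns the mean and the sample standard deviation of the ROI.
    """
    rois = [
        random_trading_roi(daily_returns, random.Random(base_seed + i))
        for i in range(runs)
    ]
    return statistics.mean(rois), statistics.stdev(rois)


# Synthetic daily index returns, NOT the real DJIA series of October-December 2008.
data_rng = random.Random(42)
synthetic_returns = [data_rng.gauss(0.0, 0.02) for _ in range(63)]

mean_roi, std_roi = random_baseline(synthetic_returns, runs=2_000)
print(f"mean ROI = {mean_roi:.3f}, std. dev. = {std_roi:.3f}")
```

Seeding each run separately keeps the experiment reproducible, which is why the mean and standard deviation can be reported as stable figures.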
Our first prediction model, trained without the removal of noisy tweets, gains 56%, whereas the second one, trained after removing noisy tweets from the training set as described in Section 9.2.3, gains 84%. Standard deviation has not been computed for these two results, as they involve no random choices and the experiments do not need to be repeated. As far as we know, both results outperform the state of the art in trading with DJIA [OM06], even with respect to more recent approaches based on deep learning and recurrent neural networks [FM18, BYR17]. To obtain a rough annual ROI, the above-mentioned gains should be multiplied by four, as they have been accumulated over a period of just three months (a simple extrapolation that ignores compounding).
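The annualization step is plain arithmetic and can be checked with a short sketch. The multiply-by-four rule is the one stated in the text; the compounded variant is our own addition for comparison and assumes that the same three-month ROI would repeat every quarter with gains reinvested. The function names are illustrative.

```python
def annualize_simple(quarterly_roi, periods_per_year=4):
    """Annual ROI obtained by multiplying the three-month ROI by four."""
    return quarterly_roi * periods_per_year


def annualize_compounded(quarterly_roi, periods_per_year=4):
    """Annual ROI if each quarter's gain were reinvested (an assumption, not from the text)."""
    return (1.0 + quarterly_roi) ** periods_per_year - 1.0


# Three-month ROI values taken from Table 9.5.
for name, roi in (("No noise removal", 0.562), ("Noise removal", 0.840)):
    simple = annualize_simple(roi)
    compounded = annualize_compounded(roi)
    print(f"{name}: simple = {simple:.3f}, compounded = {compounded:.3f}")
```

Either figure is an extrapolation: nothing in the experiments guarantees that the performance of the last quarter of 2008 would persist over a full year.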
Figure 9.5 reports the ROI time series produced by the three models over the test period, i.e. the last three months of 2008. The ROI time series of the random prediction model is always negative, and on some days the loss exceeds 10%. The prediction model without noise removal starts with a slight loss, then increases almost constantly. The prediction model with noise removal produces no gains in the first week, then grows much faster than the other two models.
9.4 Final remarks
In this chapter, we have developed a text mining method for predicting the DJIA trend based on knowledge extracted from ten million tweets posted within a year. The method recognizes and filters out irrelevant tweets that represent noise and would negatively affect the prediction accuracy. This is achieved by means of a noise detection technique, which acts both at the tweet level and at the instance level (i.e. aggregation of tweets).
On the same tweet dataset and DJIA trends used in [BMZ11], our simple classification model based on the vector space model achieves about 80% accuracy. When the noise detection method is applied, irrelevant tweets are filtered out of the training set and accuracy rises to 88.9%, outperforming both our base classifier and the best prediction method based on social network posts proposed in [BMZ11].
Figure 9.5: Time series of the return on investment (ROI) of three prediction models: random, without noise removal, with noise removal. The test period spans three months, i.e. from October to December 2008. Source: [MPD+].
Finally, a trading protocol has been added in order to perform buy/sell operations on the basis of the proposed prediction methods. The trading experiments measure the total return on investment (ROI) with respect to the initial capital. Over the three-month test period, the values without and with noise removal are 56% and 84% respectively.
Part IV
Big Data Mining and Machine Learning Methods for Job Search
10 A skillset-based job recommender
In this chapter, a job recommendation approach is introduced, where the best job is suggested to a given candidate based on the semantic similarity between skills extracted from LinkedIn users. A job clustering is first proposed to find and group semantically similar jobs. Based on this clustering, a novel method is introduced to match candidates with jobs. The basic idea is to discover similarity between different job positions, and then to find out their latent associations with people’s skills.
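The idea of matching a candidate to jobs through skills can be illustrated with a toy ranking. This is only a sketch under strong assumptions: it uses plain Jaccard overlap between skill sets as a stand-in for semantic similarity, it skips the job clustering step, and the job names, skills, and function names are invented for the example rather than taken from the LinkedIn data.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of skills (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_jobs(candidate_skills, job_skills, top_k=3):
    """Rank jobs by skill overlap with the candidate; ties are broken by job name."""
    scored = [
        (jaccard(candidate_skills, skills), job)
        for job, skills in job_skills.items()
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(job, score) for score, job in scored[:top_k]]


# Hypothetical skill profiles, one per job position.
JOB_SKILLS = {
    "data scientist": {"python", "machine learning", "statistics", "sql"},
    "web developer": {"javascript", "html", "css", "sql"},
    "database administrator": {"sql", "oracle", "backup", "linux"},
}

ranking = recommend_jobs({"python", "statistics", "sql"}, JOB_SKILLS)
for job, score in ranking:
    print(f"{job}: {score:.2f}")
```

A purely lexical measure such as this one cannot relate differently named but equivalent skills, which is one reason to look for latent associations between jobs and skills instead.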
As far as we know, the work closest to ours, in terms of both the proposal and the data used, is by Bastian et al. [BHV+14], who constructed a folksonomy of skills and implemented a skill recommender system. In particular, their goal was to analyze the users' skillsets extracted from LinkedIn in order to help users fill in the skills in their profiles. The authors pointed out the feasibility as well as the practical utility of job recommendation based on LinkedIn users' skills, but without further discussion or detailed analysis.
10.1 Building a hierarchy of job positions
This section explains how LinkedIn public profiles can be used to build a hierarchical clustering of jobs. Hierarchical clustering helps to build a folksonomy (i.e. a user-defined taxonomy) of jobs. Such a folksonomy is useful to correlate distinct job positions when only each person's job history is available, as is normally the case. Therefore, a good job clustering is an important first step for systems that match candidates with jobs, namely job recommendation systems as well as recruitment systems.
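To make the notion of a job hierarchy concrete, the following sketch runs a bottom-up (agglomerative) clustering on a handful of made-up job positions. It rests on assumptions of ours, not on the method developed in this chapter: each job is represented by a set of skills, distance is Jaccard distance, and clusters are merged by average linkage. All names and data are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 minus the Jaccard similarity of two skill sets (0.0 if both are empty)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def average_linkage(c1, c2, features):
    """Mean pairwise distance between the members of two clusters."""
    pairs = [(x, y) for x in c1 for y in c2]
    total = sum(jaccard_distance(features[x], features[y]) for x, y in pairs)
    return total / len(pairs)


def agglomerate(features):
    """Merge the two closest clusters until one remains.

    Returns the merge history as a list of (cluster_a, cluster_b, distance),
    which encodes the hierarchy from the leaves up to the root.
    """
    clusters = [(name,) for name in sorted(features)]
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], features),
        )
        merges.append((a, b, average_linkage(a, b, features)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical job positions with the skills of the people holding them.
JOBS = {
    "software engineer": {"java", "python", "git", "sql"},
    "software developer": {"java", "python", "git", "html"},
    "accountant": {"bookkeeping", "excel", "tax"},
    "financial analyst": {"excel", "forecasting", "tax"},
}

merges = agglomerate(JOBS)
for a, b, dist in merges:
    print(f"merge {a} + {b} at distance {dist:.2f}")
```

Reading the merge history from first to last gives the hierarchy: closely related positions join early at small distances, and unrelated groups join only near the root.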