Weekly Analysis - Case Study - Connection Manager


6.1 Case Study

6.1.3 Weekly Analysis

In weekly analysis, we take the buzz scores of the n preceding weeks, with sample sizes of 4 and 8 weeks. For instance, to predict the buzz score for each day of a week, we take the scores of the n preceding weeks, where each week consists of daily scores obtained by aggregating (summing and averaging) all buzz scores recorded during a day for a particular product type, e.g. Adobe books in the E-book category. For example, to predict the buzz scores for the week from 17/07/2005 to 23/07/2005 with a 4 week sample, we take the buzz scores of four previous weeks, which need not be consecutive. We used the weeks 12/06/2005 to 18/06/2005, 19/06/2005 to 25/06/2005, 26/06/2005 to 02/07/2005 and 10/07/2005 to 16/07/2005, where each week consists of the aggregated buzz score of every day from Sunday to Saturday, as shown in the sample table below.

| Week | Sunday | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday |
|---|---|---|---|---|---|---|---|
| 12/06/2005 to 18/06/2005 | 7.803881 | 7.866027 | 8.009351 | 8.19867 | 8.034405 | 7.925652 | 8.022644 |
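The aggregation behind the sample table can be sketched as follows. This is a minimal illustration, not the framework's own code: it assumes the raw data arrives as (date, score) pairs with possibly several readings per day, that a day's value is the mean of its readings, and that weeks run from Sunday to Saturday as in the text. The names `week_start` and `weekly_table` are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def week_start(d: date) -> date:
    """Return the Sunday that starts the Sunday-to-Saturday week containing d."""
    # date.weekday() is Monday=0 .. Sunday=6, so (weekday + 1) % 7 is days since Sunday.
    return d - timedelta(days=(d.weekday() + 1) % 7)


def weekly_table(readings):
    """Average all readings of each day, then group the daily means by week.

    readings: iterable of (date, score) pairs; several readings per day are allowed.
    Returns {week_start: [Sunday mean, ..., Saturday mean]}, with None for missing days.
    """
    per_day = defaultdict(list)
    for d, score in readings:
        per_day[d].append(score)

    weeks = defaultdict(lambda: [None] * 7)
    for d, scores in per_day.items():
        start = week_start(d)
        weeks[start][(d - start).days] = sum(scores) / len(scores)
    return dict(weeks)
```

Each row of the returned mapping corresponds to one row of the sample table: the key is the Sunday that opens the week and the list holds the seven daily values.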

To generate Figure XI, the framework used the following queries:

SELECT TRANSFORM (date, time, buzz_scores) USING 'daily_analysis' FROM YAHOO_BUZZ_SCORES WHERE product = 'ebooks' AND date >= '2005-06-12' AND date <= '2005-06-18'

SELECT TRANSFORM (date, time, buzz_scores) USING 'daily_analysis' FROM YAHOO_BUZZ_SCORES WHERE product = 'ebooks' AND product_type = 'adobebook' AND ((date >= '2005-06-12' AND date <= '2005-07-02') OR (date >= '2005-07-10' AND date <= '2005-07-16'))

SELECT TRANSFORM (date, time, buzz_scores) USING 'daily_analysis' FROM YAHOO_BUZZ_SCORES WHERE product = 'ebooks' AND product_type = 'safaribook' AND ((date >= '2005-06-12' AND date <= '2005-07-02') OR (date >= '2005-07-10' AND date <= '2005-07-16'))

SELECT TRANSFORM (date, time, buzz_scores) USING 'daily_analysis' FROM YAHOO_BUZZ_SCORES WHERE product = 'ebooks' AND product_type = 'amazonbook' AND ((date >= '2005-06-12' AND date <= '2005-07-02') OR (date >= '2005-07-10' AND date <= '2005-07-16'))

All the above queries use the daily_analysis script for weekly analysis. Only the WHERE condition changes, which selects the rows to analyze. The user inputs the table name and column name, selects the daily analysis script from the module repository, and optionally supplies dates for the analysis.
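The daily_analysis script itself is not listed in this section. As a rough sketch of how a script used with SELECT TRANSFORM is shaped: Hive streams the selected columns to the script's standard input as tab-separated lines and reads tab-separated rows back from its standard output. The averaging logic below, and the names `daily_means` and `main`, are assumptions for illustration only.

```python
import sys
from collections import defaultdict


def daily_means(lines):
    """Yield 'date<TAB>mean score' rows from 'date<TAB>time<TAB>buzz_score' rows."""
    totals = defaultdict(lambda: [0.0, 0])  # date -> [sum of scores, count]
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip malformed rows rather than fail the whole job
        day, _time, score = fields
        try:
            value = float(score)
        except ValueError:
            continue  # skip non-numeric scores (Hive sends NULL as \N by default)
        totals[day][0] += value
        totals[day][1] += 1
    for day in sorted(totals):
        total, count = totals[day]
        yield f"{day}\t{total / count:.6f}"


def main():
    # Entry point when the file is run by Hive as the TRANSFORM script.
    for row in daily_means(sys.stdin):
        print(row)
```

When saved as a standalone script, the file would end with the usual `if __name__ == "__main__": main()` guard so that Hive can run it directly.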

Figure XI Actual and Predicted Score with 4 week sample (panels: E-book, E-book Adobe Book, E-book Safari Book, E-book Amazon)

Figure XI shows that, with a sample of 4 weeks, the regression based technique captures the trend better for the product E-book and the product type Adobe book. Safari book and Amazon are also predicted better by the regression based technique. The scores from the extrapolation technique vary more consistently with the actual scores, but they are higher than the actual scores.
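This section does not spell out the regression model. As one plausible reading, the sketch below assumes an ordinary least-squares line fitted to the scores of the same weekday in the n preceding weeks and evaluated one step ahead. The function name `predict_next` and the choice of a straight-line model are assumptions, not the framework's documented method.

```python
def predict_next(scores):
    """Fit y = a + b*t by least squares to scores at t = 0..n-1 and return the value at t = n."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two preceding weeks")
    t_mean = (n - 1) / 2
    y_mean = sum(scores) / n
    # Sums of squares and cross products around the means.
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(scores))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return intercept + slope * n
```

With a 4 week sample, `scores` would hold the four Sunday values (or Monday values, and so on) of the preceding weeks, and the call is repeated once per weekday.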


Figure XII Actual and Predicted Score with 8 week sample (panels include E-book Safari Book and E-book Amazon)

Figure XII shows the actual and predicted scores with a sample of 8 weeks. Again, the trend for E-books is captured better by the regression based prediction. For Adobe book, the regression based technique shows an increase in buzz scores on Wednesday, while the extrapolation based technique shows almost identical variation to the actual results and lies closer to them, but with lower buzz scores on each day. For Safari book, the predictions vary almost opposite to the actual results, but the regression based technique again gives better predictions than the extrapolation based technique. In addition, the search volume of 'safari book' is low except on Saturday, where it is predicted to be higher. For Amazon books, the regression and extrapolation based techniques give close results. The table below shows the correlation between actual and predicted buzz scores.

Correlation between Actual and Predicted buzz scores

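$$ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}} $$

where $\bar{x}$ and $\bar{y}$ are the means of x and y.

| Technology | EP with 4 weeks samples | RP with 4 weeks samples | EP with 8 weeks samples | RP with 8 weeks samples |
|---|---|---|---|---|
| E-book | -0.22 | -0.34 | -0.45 | -0.29 |
| Adobe Book | -0.62 | -0.38 | 0.68 | -0.73 |
| Amazon | 0.01 | 0.70 | -0.47 | 0.44 |
| Safari Book | -0.33 | 0.88 | 0.23 | 0.33 |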
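The correlation between actual and predicted scores is the standard Pearson coefficient. A minimal sketch of that computation, assuming two equal-length lists of daily scores (the function name `pearson` is illustrative):

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    # Numerator: sum of products of deviations from the means.
    num = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    # Denominator: square root of the product of the sums of squared deviations.
    den = math.sqrt(
        sum((a - x_mean) ** 2 for a in x) * sum((b - y_mean) ** 2 for b in y)
    )
    if den == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return num / den
```

Here `x` would be the seven actual daily scores of the predicted week and `y` the seven predicted scores from one technique and sample size.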


Table 6-2 shows the correlation between actual and predicted scores. The correlation is 68% for Adobe Book when the 8 week sample was used with the EP technique, and 70% and 88% for Amazon and Safari books when the 4 week sample was used with the RP technique. In addition, the correlation for Amazon books is low (0.01) when the 4 week sample was used with the EP technique. We conclude from our results that the RP technique is generally better than the EP technique for predicting future trends, and hence users of the framework such as non-expert scientists can predict future market trends with either technique with less effort.

6.2 Case Study 2

In this case study, we use a language dataset from the Yahoo sandbox, which is almost 30 GB of uncompressed data, of which approximately 12 GB is 5-gram data. The data was gathered from more than 12,000 news related websites and approximately 14.6 million documents, which contain almost 126 million unique sentences. The data set covers a period of 11 months, from February 2006 to December 2006. Scientists working in the linguistic domain, such as social scientists, can use the data to build statistical models for different domains and to analyze events that happened in specific years by extracting the related information from the data [9].

This data set consists of n-grams with n from 1 to 5, where a 1-gram is a single word and a 5-gram is a sequence of five words in the corpus. We use 5-grams in this case study because the likelihood of finding meaningful information in the corpus increases with n. We use this corpus to find the important events that happened between February and December 2006, from a total of 29,570,136 n-grams in the corpus.

In the first scenario we search for H5N1 influenza, also called bird flu. The first case was reported in January in Turkey, and deaths were recorded in various countries throughout the year, including Turkey and other countries of Africa, Asia and Europe [39]. We use different n-grams to look for possible text: 'bird flu' is a 2-gram, and 'bird flu deaths', 'bird flu disease', 'bird flu came' and 'bird flu influenza' are 3-grams. Similarly, 'bird flu cause deaths' might be a 4-gram. Figure XIII shows the number of statements related to bird flu found in the corpus, where <token> represents a word within an n-gram. For example, bird flu <token> <token> <token> might be 'bird flu death in Egypt' or 'bird flu deaths have occurred'. Similarly, <token> bird flu <token> <token> might be 'human bird flu death came' or 'six bird flu deaths traced', and <token> <token> bird flu <token> might be '41st bird flu death' or 'confirm another bird flu death'.


Figure XIII Number of bird flu related statements found in the corpus

Figure XIII shows that statements with the bird flu phrase at the start of the 5-gram have a higher frequency than statements with the phrase in the middle. In total, 1563 5-grams in the whole corpus are related to the bird flu event, which is (1563/29570136) * 100 = 0.0053% of the entire data set. To perform the above analysis, the user provides the keyword to search for and selects frequency as the column name from the data browser interface, along with the table name. The query constructor then constructs the following query.

SELECT TRANSFORM (key_word, frequency)

USING 'bird_flu_event_analysis' FROM YAHOO_NGRAMS

The 'bird_flu_event_analysis' script is a user written algorithm for analyzing the supplied keyword in the corpus. The analytical processor selects the bird_flu_event_analysis script from the module repository and executes the above query on the Hive warehouse. The response processor parses the response and sends the information back to the data browser for visualization.
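The event analysis script is user written and not reproduced here. The sketch below shows one way the positional patterns of Figure XIII could be counted, assuming rows of (n-gram text, frequency) and counting each matching 5-gram once. The function name `pattern_counts` is hypothetical.

```python
from collections import Counter


def pattern_counts(rows, phrase, n=5):
    """Count n-grams containing phrase, grouped by where the phrase sits.

    rows: iterable of (ngram_text, frequency) pairs.
    Returns a Counter keyed by patterns such as 'bird flu <token> <token> <token>'.
    """
    target = phrase.lower().split()
    counts = Counter()
    for ngram, _frequency in rows:
        words = ngram.lower().split()
        if len(words) != n:
            continue  # only n-grams of the requested length are considered
        for start in range(n - len(target) + 1):
            if words[start:start + len(target)] == target:
                pattern = ["<token>"] * n
                pattern[start:start + len(target)] = target
                # Counts distinct n-grams; add _frequency instead of 1 to weight by corpus frequency.
                counts[" ".join(pattern)] += 1
                break
    return counts


# Share of the corpus reported in the text: 1563 matching 5-grams out of 29,570,136.
share_percent = 100 * 1563 / 29_570_136
```

Rounded to four decimals, `share_percent` gives the 0.0053% quoted in the text.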

In the second scenario we consider the 2006 FIFA World Cup, which Italy won by beating France 5-3 on penalty kicks after a 1-1 draw over the two halves. The event took place in June and July of 2006 and was watched by almost 26.29 billion non-unique viewers [40]. As the FIFA World Cup is the most watched sports event in the world, we expect considerable discussion of this event on the internet.

To explore the data for events related to the World Cup, we use two key words, 'Italy beats France' and 'Italy wins'. Again, we take 5-gram data for the analysis because, as explained before, 5-grams are more likely to give results than shorter n-grams.

Figure XIV Frequency of related statements found in the corpus

Figure XIV shows the frequency of occurrence of the different statements in the corpus. As with bird flu, statements with 'Italy beat France' as the first tokens have a higher frequency of occurrence than the other combinations. In total, 388 statements directly related to the event were found.

To perform this analysis using the framework, the user will input the key word and frequency as column name and the query constructor will construct the following query.

SELECT TRANSFORM (key_word, frequency)

USING 'world_cup_event_analysis' FROM YAHOO_NGRAMS

The query constructor passes the query to the analytical processor, which selects the user defined algorithm from the module repository and executes it on the Hive warehouse for analysis. The response processor receives the analysis response and passes the result to the data browser for visualization.
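The query constructor's implementation is not shown in this chapter. As a rough sketch of the step it performs, the function below assembles a SELECT TRANSFORM query from a table name, column names and a script name. The function `build_transform_query`, its parameters and its validation rule are assumptions made for illustration.

```python
import re

# Plain identifiers only, so user input cannot inject extra SQL.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_transform_query(table, columns, script, where=None):
    """Build a Hive SELECT TRANSFORM query string from user-supplied names.

    where: optional, already trusted condition text appended as the WHERE clause.
    """
    for name in (table, script, *columns):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"invalid identifier: {name!r}")
    query = f"SELECT TRANSFORM ({', '.join(columns)}) USING '{script}' FROM {table}"
    if where:
        query += f" WHERE {where}"
    return query
```

Calling it with the table `YAHOO_NGRAMS`, the columns `key_word` and `frequency`, and an event analysis script name yields a query of the same shape as the ones listed in this case study.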

In the third scenario we consider Google's purchase of YouTube on 9th October 2006 for $1.65 billion, competing with other bidders such as Yahoo, Microsoft, Viacom and News Corporation [41]. We expect this event to be discussed on different websites and blogs. We search for key words such as 'Google bought YouTube' and 'Google buys YouTube'.


Figure XV Frequency of related statement found from the corpus

Figure XV shows a higher frequency of occurrence when the phrase 'Google buys YouTube' is used. Overall, we found 56 occurrences of statements directly related to our key words.

To perform the above analysis, the user inputs the key word and frequency as column names, and the query constructor constructs the following query.

SELECT TRANSFORM (key_word, frequency)

USING 'google_youtube_event_analysis' FROM YAHOO_NGRAMS

The query constructor passes the query to the analytical processor, which takes the user defined algorithm from the module repository and executes it on the Hive warehouse. The response from the infrastructure layer is handled by the response processor, which parses it and sends the result back to the data browser for visualization.

From this case study we can conclude that the framework is usable by non-expert users, who can explore and analyze data with minimum effort and without technical knowledge of big data technologies. By analyzing the graphs, a non-expert user can easily conclude that the most popular or discussed of these events in 2006 was bird flu.

6.3 Summary

In this chapter we presented two case studies to demonstrate our framework. In the first case study we tried to predict future trends from the market dataset using extrapolation and regression based techniques, and in the second case study we demonstrated the exploration of a large data set to understand the important events that happened in 2006. The case studies demonstrate the usefulness of the framework for non-expert users. In the next chapter, we evaluate our framework by comparing it with the related work and in terms of performance.

7 Evaluation

In this chapter, we evaluate our solution by comparing it with the related work discussed in chapter two. We also check the performance of the tool and discuss its usability for non-expert users.

In document Lightweight Stack for Big Data Analytics. Department of Computer Science University of St Andrews (Page 47-55)