© 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 5: Big Data Analytics Algorithms
1
! E6893 Big Data Analytics Lecture 5:
! Big Data Analytics Algorithms -- II
Ching-Yung Lin, Ph.D.
Adjunct Professor, Dept. of Electrical Engineering and Computer Science
Mgr., Dept. of Network Science and Big Data Analytics, IBM Watson Research Center
October 2nd , 2014
Course Structure
Class Data Number Topics Covered
09/04/14 1 Introduction to Big Data Analytics
09/11/14 2 Big Data Analytics Platforms
09/18/14 3 Big Data Storage and Processing
09/25/14 4 Big Data Analytics Algorithms -- I
10/02/14 5 Big Data Analytics Algorithms -- II (recommendation)
10/09/14 6 Big Data Analytics Algorithms — III (clustering) 10/16/14 7 Big Data Analytics Algorithms — IV (classification)
10/23/14 8 Linked Big Data – Graph Computing
10/30/14 9 Big Data Visualization
11/06/14 10 Mobile Data Collection, Analysis, and Interface 11/13/14 11 Hardware, Processors, and Cluster Platforms
11/20/14 12 Big Data Next Challenges – IoT, Cognition, and Beyond
11/27/14 Thanksgiving Holiday
12/04/14 13 Final Projects Discussion (Optional)
© 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 5: Big Data Analytics Algorithms
3
Review — Key Components of Mahout
Mahout reference book
© 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 5: Big Data Analytics Algorithms
5
Setting Up Mahout
Step 1: Java JVM and IDEs (e.g., Eclipse) Step 2: Maven
Step 3: Mahout
Eclipse Luna (June 2014)
Recommender — Inputs
Solid lines: positively related Dashed lines: negatively related
Input Data: User, Item, Rating
© 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 5: Big Data Analytics Algorithms
7
User-based Recommendation — Scenario I
gettofail.com
User-based Recommendation — Scenario II
© 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 5: Big Data Analytics Algorithms
9
User-based Recommendation — Scenario III
User-based Recommendation Algorithms
© 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 5: Big Data Analytics Algorithms
11
Example Recommender Code via Mahout
Process and output of the example
Recommendation for Person 1:
Item 104 > Item 106
Item 107 is not favored
© 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 5: Big Data Analytics Algorithms
13
Refresh (Reload) Data
Update data
© 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 5: Big Data Analytics Algorithms
15
User Similarity Measurements
• Pearson Correlation Similarity
• Euclidean Distance Similarity
• Cosine Measure Similarity
• Spearman Correlation Similarity
• Tanimoto Coefficient Similarity (Jaccard coefficient)
• Log-Likelihood Similarity
! !
Pearson Correlation Similarity
Data:
missing
data
© 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 5: Big Data Analytics Algorithms
17
On Pearson Similarity
Three problems with the Pearson Similarity:
! 1. Not take into account of the number of items in which two users’ preferences overlap. (e.g., 2 overlap items ==> 1, more items may not be better.)
2. If two users overlap on only one item, no correlation can be computed.
3. The correlation is undefined if either series of preference values are identical.
Adding Weighting.WEIGHTED as 2nd parameter of the constructor can cause the
resulting correlation to be pushed towards 1.0, or -1.0, depending on how many
points are used.
Euclidean Distance Similarity
Similarity = 1 / ( 1 + d )
© 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 5: Big Data Analytics Algorithms
19
Cosine Similarity
Cosine similarity and Pearson similarity get the same results if data are
normalized (mean == 0).
Spearman Correlation Similarity
Example for ties
Pearson value on the relative ranks
© 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 5: Big Data Analytics Algorithms
21
Caching User Similarity
Spearman Correlation Similarity is time consuming.
Need to use Caching ==> remember s user-user similarity which was previously computed.
Tanimoto (Jaccard) Coefficient Similarity
Discard preference values
© 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 5: Big Data Analytics Algorithms
23
Log-Likelihood Similarity
Asses how unlikely it is that the overlap between the two users is just due to chance.
Performance measurements
Using GroupLens data (http://grouplens.org): 10 million rating MovieLens dataset.
• Spearnman: 0.8
• Tanimoto: 0.82
• Log-Likelihood: 0.73
• Euclidean: 0.75
• Pearson (weighted): 0.77
• Pearson: 0.89
© 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 5: Big Data Analytics Algorithms
25
Performance measurements
95% of training; 5% of testing
10 nearest neighbors: 0.98
100 nearest neighbors: 0.89
500 nearest neighbors: 0.75
Selecting the number of neighbors
Based on number of neighbors
Based on a fixed threshold, e.g., 0.7 or 0.5
© 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 5: Big Data Analytics Algorithms
27
Item-based recommendation
Item-based recommendation algorithm
© 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 5: Big Data Analytics Algorithms
29
Code and Performance of Item-Based Recommendation
performance
Slope-One Recommender
© 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 5: Big Data Analytics Algorithms
31
Slope-One Algorithm
Slope-One got a result of near 0.65 on the GroupLens data Difference values from
the example
Other recommenders — SVD recommender
number of features
lambda: factor for regularization
number of
training step
© 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 5: Big Data Analytics Algorithms
33
Linear Interpolation Item-based recommender
SVD method got 0.76 on the GroupLens data
Cluster-based Recommendation
© 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 5: Big Data Analytics Algorithms
35
Other Recommenders not in Mahout — Groups (SDM 06)
CB DR
CB
DR
CBDR
– Results: beyond Collaborative Filtering: (1) Collaborative + Content Filtering (53%
improvement); (2) CBDR: Collaborative + Content Filtering + Graph Community Analytics (259% accuracy improvement over collaborative filtering)
– A 3
rdparty Knowledge Repository: 30K users and 20K documents.
Study the most active 697 users who have at least 20 download in a year.
Other Recommenders not in Mahout — Info Flow (SIGIR ’06)
! Comparing to Collaborative Filtering (CF) + Similar People Precision: IF is 91% better, TIF is 108% better
Number of recommended users
Number of recommended users
CF + SP IF TIF
CF + SP IF TIF
IF: Graphical Information Flow Model
TIF: Joint Topic Detection + Information Flow Model
Early adopter
Late adopter
Tests:
– 1 month – 586 new docs – 1,170 users
Network Info Flow
Innovators
Early majority Late majority
Laggards
adopt ? ?
Early adopters
© 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 5: Big Data Analytics Algorithms
37
Distributed Item-based Recommender
Distributed recommender — get co-occurrence matrix
Data:
© 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 5: Big Data Analytics Algorithms
39
Multiply the co-occurrence matrix with user preference
The highest is 103 (101, 104, 105, 107 have been purchased by user 3)
Translating to MapReduce: generating user vectors
© 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 5: Big Data Analytics Algorithms
41
Translating to MapReduce: calculating co-occurrence
Translating to MapReduce: matrix multiplication
© 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 5: Big Data Analytics Algorithms
43
Translating to MapReduce: partial products
Translating to MapReduce: partial product II
© 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 5: Big Data Analytics Algorithms