! E6893 Big Data Analytics Lecture 5:! Big Data Analytics Algorithms -- II

(1)

1

! E6893 Big Data Analytics Lecture 5:

! Big Data Analytics Algorithms -- II

Ching-Yung Lin, Ph.D.

Adjunct Professor, Dept. of Electrical Engineering and Computer Science

Mgr., Dept. of Network Science and Big Data Analytics, IBM Watson Research Center

October 2nd , 2014

(2)

Course Structure

Class Data Number Topics Covered

09/04/14 1 Introduction to Big Data Analytics

09/11/14 2 Big Data Analytics Platforms

09/18/14 3 Big Data Storage and Processing

09/25/14 4 Big Data Analytics Algorithms -- I

10/02/14 5 Big Data Analytics Algorithms -- II (recommendation)

10/09/14 6 Big Data Analytics Algorithms — III (clustering) 10/16/14 7 Big Data Analytics Algorithms — IV (classification)

10/23/14 8 Linked Big Data – Graph Computing

10/30/14 9 Big Data Visualization

11/06/14 10 Mobile Data Collection, Analysis, and Interface 11/13/14 11 Hardware, Processors, and Cluster Platforms

11/20/14 12 Big Data Next Challenges – IoT, Cognition, and Beyond

11/27/14 Thanksgiving Holiday

12/04/14 13 Final Projects Discussion (Optional)

(3)

3 Review — Key Components of Mahout

(4)

Mahout reference book

(5)

5 Setting Up Mahout

Step 1: Java JVM and IDEs (e.g., Eclipse) Step 2: Maven

Step 3: Mahout

Eclipse Luna (June 2014)

(6)

Recommender — Inputs

Solid lines: positively related Dashed lines: negatively related

Input Data: User, Item, Rating

(7)

7 User-based Recommendation — Scenario I

gettofail.com

(8)

User-based Recommendation — Scenario II

(9)

9 User-based Recommendation — Scenario III

(10)

User-based Recommendation Algorithms

(11)

11 Example Recommender Code via Mahout

(12)

Process and output of the example

Recommendation for Person 1:

Item 104 > Item 106

Item 107 is not favored

(13)

13 Refresh (Reload) Data

(14)

Update data

(15)

15 User Similarity Measurements

• Pearson Correlation Similarity

• Euclidean Distance Similarity

• Cosine Measure Similarity

• Spearman Correlation Similarity

• Tanimoto Coefficient Similarity (Jaccard coefficient)

• Log-Likelihood Similarity

! !

(16)

Pearson Correlation Similarity

Data:

missing

data

(17)

17 On Pearson Similarity

Three problems with the Pearson Similarity:

! 1. Not take into account of the number of items in which two users’ preferences overlap. (e.g., 2 overlap items ==> 1, more items may not be better.)

2. If two users overlap on only one item, no correlation can be computed.

3. The correlation is undefined if either series of preference values are identical.

Adding Weighting.WEIGHTED as 2nd parameter of the constructor can cause the

resulting correlation to be pushed towards 1.0, or -1.0, depending on how many

points are used.

(18)

Euclidean Distance Similarity

Similarity = 1 / ( 1 + d )

(19)

19 Cosine Similarity

Cosine similarity and Pearson similarity get the same results if data are

normalized (mean == 0).

(20)

Spearman Correlation Similarity

Example for ties

Pearson value on the relative ranks

(21)

21 Caching User Similarity

Spearman Correlation Similarity is time consuming.

Need to use Caching ==> remember s user-user similarity which was previously computed.

(22)

Tanimoto (Jaccard) Coefficient Similarity

Discard preference values

(23)

23 Log-Likelihood Similarity

Asses how unlikely it is that the overlap between the two users is just due to chance.

(24)

Performance measurements

Using GroupLens data (http://grouplens.org): 10 million rating MovieLens dataset.

• Spearnman: 0.8

• Tanimoto: 0.82

• Log-Likelihood: 0.73

• Euclidean: 0.75

• Pearson (weighted): 0.77

• Pearson: 0.89

(25)

25 Performance measurements

95% of training; 5% of testing

10 nearest neighbors: 0.98

100 nearest neighbors: 0.89

500 nearest neighbors: 0.75

(26)

Selecting the number of neighbors

Based on number of neighbors

Based on a fixed threshold, e.g., 0.7 or 0.5

(27)

27 Item-based recommendation

(28)

Item-based recommendation algorithm

(29)

29 Code and Performance of Item-Based Recommendation

performance

(30)

Slope-One Recommender

(31)

31 Slope-One Algorithm

Slope-One got a result of near 0.65 on the GroupLens data Difference values from

the example

(32)

Other recommenders — SVD recommender

number of features

lambda: factor for regularization

number of

training step

(33)

33 Linear Interpolation Item-based recommender

SVD method got 0.76 on the GroupLens data

(34)

Cluster-based Recommendation

(35)

35 Other Recommenders not in Mahout — Groups (SDM 06)

CB DR

CB

DR

CB

DR

– Results: beyond Collaborative Filtering: (1) Collaborative + Content Filtering (53%

improvement); (2) CBDR: Collaborative + Content Filtering + Graph Community Analytics (259% accuracy improvement over collaborative filtering)

– A 3

^rd

party Knowledge Repository: 30K users and 20K documents.

Study the most active 697 users who have at least 20 download in a year.

(36)