(1)

© 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 5: Big Data Analytics Algorithms

E6893 Big Data Analytics Lecture 5:

Big Data Analytics Algorithms -- II

Ching-Yung Lin, Ph.D.

Adjunct Professor, Dept. of Electrical Engineering and Computer Science

Mgr., Dept. of Network Science and Big Data Analytics, IBM Watson Research Center

October 2nd, 2014

(2)

Course Structure

Class Date  Number  Topics Covered
09/04/14    1       Introduction to Big Data Analytics
09/11/14    2       Big Data Analytics Platforms
09/18/14    3       Big Data Storage and Processing
09/25/14    4       Big Data Analytics Algorithms -- I
10/02/14    5       Big Data Analytics Algorithms -- II (recommendation)
10/09/14    6       Big Data Analytics Algorithms -- III (clustering)
10/16/14    7       Big Data Analytics Algorithms -- IV (classification)
10/23/14    8       Linked Big Data – Graph Computing
10/30/14    9       Big Data Visualization
11/06/14    10      Mobile Data Collection, Analysis, and Interface
11/13/14    11      Hardware, Processors, and Cluster Platforms
11/20/14    12      Big Data Next Challenges – IoT, Cognition, and Beyond
11/27/14            Thanksgiving Holiday
12/04/14    13      Final Projects Discussion (Optional)

(3)

Review — Key Components of Mahout

(4)

Mahout reference book

(5)

Setting Up Mahout

Step 1: Java (JVM) and an IDE (e.g., Eclipse)

Step 2: Maven

Step 3: Mahout

Eclipse Luna (June 2014)

(6)

Recommender — Inputs

Solid lines: positively related

Dashed lines: negatively related

Input Data: User, Item, Rating

(7)

User-based Recommendation — Scenario I

gettofail.com

(8)

User-based Recommendation — Scenario II

(9)

User-based Recommendation — Scenario III

(10)

User-based Recommendation Algorithms

(11)

Example Recommender Code via Mahout

(12)

Process and output of the example

Recommendation for Person 1:

Item 104 > Item 106

Item 107 is not favored

(13)

Refresh (Reload) Data

(14)

Update data

(15)

User Similarity Measurements

• Pearson Correlation Similarity

• Euclidean Distance Similarity

• Cosine Measure Similarity

• Spearman Correlation Similarity

• Tanimoto Coefficient Similarity (Jaccard coefficient)

• Log-Likelihood Similarity

(16)

Pearson Correlation Similarity

Data: [table not recovered; it includes entries marked as missing data]

(17)

On Pearson Similarity

Three problems with the Pearson similarity:

1. It does not take into account the number of items on which two users' preferences overlap (e.g., two overlapping items can give a correlation of 1, and more overlapping items do not necessarily give a higher value).

2. If two users overlap on only one item, no correlation can be computed.

3. The correlation is undefined if all preference values in either series are identical.

Passing Weighting.WEIGHTED as the 2nd parameter of the constructor can cause the resulting correlation to be pushed towards 1.0 or -1.0, depending on how many points are used.
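
The three problems can be seen in a few lines of code. The following is a minimal Python sketch, not Mahout's implementation; the function name and the dict-based preference layout (item id -> rating) are assumptions made for illustration.

```python
from math import sqrt

def pearson_similarity(prefs_a, prefs_b):
    """Pearson correlation over the items both users have rated.

    prefs_a, prefs_b: dicts mapping item id -> rating (hypothetical layout).
    Returns None when the correlation is undefined.
    """
    common = sorted(set(prefs_a) & set(prefs_b))
    n = len(common)
    if n < 2:
        return None  # problem 2: a single shared item gives no correlation
    xs = [prefs_a[i] for i in common]
    ys = [prefs_b[i] for i in common]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # problem 3: one series is constant
    return cov / sqrt(var_x * var_y)
```

With only two shared items whose ratings move in the same direction, the function returns 1.0 regardless of how little evidence that is, which is problem 1.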

(18)

Euclidean Distance Similarity

Similarity = 1 / (1 + d), where d is the Euclidean distance between the two users' preference values
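
A minimal Python sketch of this formula follows. It assumes the distance is taken over the items both users rated and that preferences are stored as dicts (item id -> rating); both are illustrative assumptions, and a library implementation may normalize the distance differently.

```python
from math import sqrt

def euclidean_similarity(prefs_a, prefs_b):
    """Similarity = 1 / (1 + d), with d over the co-rated items."""
    common = set(prefs_a) & set(prefs_b)
    if not common:
        return None  # no shared items, so no distance to measure
    d = sqrt(sum((prefs_a[i] - prefs_b[i]) ** 2 for i in common))
    return 1.0 / (1.0 + d)
```

Identical ratings give d = 0 and a similarity of 1; the value shrinks towards 0 as the distance grows.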

(19)

Cosine Similarity

Cosine similarity and Pearson similarity give the same result if the data are normalized (mean == 0).
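
This equivalence can be checked numerically. The sketch below is illustrative only: the helper names and the sample vectors are assumptions, and both vectors are taken to cover the same items in the same order.

```python
from math import sqrt

def cosine(xs, ys):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(xs, ys))
    return dot / (sqrt(sum(x * x for x in xs)) * sqrt(sum(y * y for y in ys)))

def center(xs):
    """Subtract the mean so the vector has mean 0."""
    mean = sum(xs) / len(xs)
    return [x - mean for x in xs]

def pearson(xs, ys):
    """Pearson correlation written out from its covariance definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

xs = [5.0, 3.0, 2.5]  # hypothetical ratings of user A
ys = [2.0, 2.5, 5.0]  # hypothetical ratings of user B on the same items
same = abs(cosine(center(xs), center(ys)) - pearson(xs, ys)) < 1e-9
```

On the raw, uncentered ratings the two measures generally differ, which is why the mean == 0 condition matters.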

(20)

Spearman Correlation Similarity

Example for ties

Pearson value on the relative ranks
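
A minimal Python sketch of "Pearson on the relative ranks" follows. Giving tied values the mean of their rank positions is one common convention and is an assumption here; the slides do not say which tie rule the original example used.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # undefined when one rank series is constant
    return cov / sqrt(var_x * var_y)

def spearman(xs, ys):
    """Spearman correlation: Pearson applied to ranks instead of raw ratings."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

Because only the ordering matters, ratings of [1, 2, 3] and [10, 100, 1000] correlate perfectly under this measure.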

(21)

Caching User Similarity

Spearman Correlation Similarity is time consuming.

Use caching: remember each user-user similarity that was previously computed.
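
The idea can be sketched in Python with a memoizing wrapper. This is an illustration of the caching principle, not Mahout's caching class; `make_cached_similarity` and its arguments are hypothetical names, and the wrapped similarity is assumed to be symmetric.

```python
from functools import lru_cache

def make_cached_similarity(similarity, prefs):
    """Wrap a user-user similarity so each pair is computed only once.

    similarity: function taking two preference dicts (assumed symmetric).
    prefs: dict mapping user id -> preference dict.
    """
    @lru_cache(maxsize=None)
    def ordered(user_a, user_b):
        return similarity(prefs[user_a], prefs[user_b])

    def cached(user_a, user_b):
        # Normalize the pair order so (a, b) and (b, a) share one cache entry.
        if user_b < user_a:
            user_a, user_b = user_b, user_a
        return ordered(user_a, user_b)

    return cached
```

An unbounded cache trades memory for time; with many users a bounded `maxsize` would be the more cautious choice.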

(22)

Tanimoto (Jaccard) Coefficient Similarity

Discard preference values
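
A minimal Python sketch of the coefficient: only which items each user has expressed a preference for is used, so the rating values are dropped. The function name and input layout are illustrative assumptions.

```python
def tanimoto_similarity(items_a, items_b):
    """Size of the intersection divided by size of the union of two item sets.

    Accepts any iterable of item ids; passing a dict (item id -> rating)
    uses only its keys, so the preference values are discarded.
    """
    a, b = set(items_a), set(items_b)
    union = a | b
    if not union:
        return None  # neither user has any items
    return len(a & b) / len(union)
```

For example, users with items {101, 102, 103} and {101, 103, 104, 105} share 2 of 5 distinct items, giving 0.4.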

(23)

Log-Likelihood Similarity

Assesses how unlikely it is that the overlap between the two users is just due to chance.
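
One way to make this concrete is the log-likelihood ratio (G-test) on a 2x2 table of item counts: items both users have, items only one of them has, and items neither has. The Python sketch below is an illustration under that assumption; the mapping of the ratio to a similarity via 1 - 1/(1 + LLR) is likewise an assumption, not something stated on the slide.

```python
from math import log

def x_log_x(x):
    return 0.0 if x == 0 else x * log(x)

def entropy(*counts):
    """Unnormalized entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def log_likelihood_ratio(k11, k12, k21, k22):
    """G statistic for a 2x2 contingency table.

    k11: items both users have   k12: items only user A has
    k21: items only user B has   k22: items neither user has
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))  # clamp tiny negative rounding

def log_likelihood_similarity(items_a, items_b, num_items):
    a, b = set(items_a), set(items_b)
    llr = log_likelihood_ratio(len(a & b), len(a - b), len(b - a),
                               num_items - len(a | b))
    return 1.0 - 1.0 / (1.0 + llr)  # assumed mapping into [0, 1)
```

A table with no association (for example 1, 1, 1, 1) yields a ratio of 0, while a strongly aligned table (10, 0, 0, 10) yields a large ratio and a similarity close to 1.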

(24)

Performance measurements

Using GroupLens data (http://grouplens.org): the 10-million-rating MovieLens dataset.

• Spearman: 0.8

• Tanimoto: 0.82

• Log-Likelihood: 0.73

• Euclidean: 0.75

• Pearson (weighted): 0.77

• Pearson: 0.89

(25)

Performance measurements

95% of the data for training; 5% for testing

10 nearest neighbors: 0.98

100 nearest neighbors: 0.89

500 nearest neighbors: 0.75
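
A hold-out evaluation of this kind can be sketched as follows. The slides do not name the score being reported; this Python sketch assumes it is the mean absolute difference between predicted and held-out ratings (lower is better under that assumption). `evaluate` and `build_predictor` are hypothetical names.

```python
import random

def evaluate(build_predictor, ratings, train_fraction=0.95, seed=0):
    """Train on a random fraction of the ratings, score on the remainder.

    ratings: list of (user, item, rating) tuples.
    build_predictor: function taking the training list and returning
        predict(user, item) -> estimated rating, or None if it cannot estimate.
    Returns the mean absolute difference on the held-out ratings.
    """
    rng = random.Random(seed)
    shuffled = ratings[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    train, test = shuffled[:cut], shuffled[cut:]
    predict = build_predictor(train)
    errors = []
    for user, item, actual in test:
        estimate = predict(user, item)
        if estimate is not None:
            errors.append(abs(estimate - actual))
    return sum(errors) / len(errors) if errors else None
```

Running the same split with recommenders built for 10, 100, and 500 neighbors would give one score per configuration, as in the list above.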

(26)

Selecting the number of neighbors

Based on number of neighbors

Based on a fixed threshold, e.g., 0.7 or 0.5
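
Both strategies are short to express. The Python sketch below assumes the similarities to the target user are already available as a dict (user id -> score); the function names are illustrative.

```python
def nearest_n(similarities, n):
    """Keep the n users with the highest similarity to the target user."""
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    return [user for user, _ in ranked[:n]]

def above_threshold(similarities, threshold):
    """Keep every user whose similarity is at least the threshold."""
    return [user for user, score in similarities.items() if score >= threshold]
```

A fixed n always returns the same neighborhood size even when some neighbors are weak matches, whereas a threshold returns a variable number of neighbors, possibly none.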

(27)

Item-based recommendation

(28)

Item-based recommendation algorithm

(29)

Code and Performance of Item-Based Recommendation

performance

(30)

Slope-One Recommender

(31)

Slope-One Algorithm

Slope-One got a result of near 0.65 on the GroupLens data.

Difference values from the example
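
The difference values are the core of the method. The following Python sketch shows a weighted Slope-One predictor on a small hypothetical dataset; it is an illustration of the general technique, not the code or data from the slide.

```python
from collections import defaultdict

def build_diffs(prefs):
    """Average rating difference (item_j - item_i) over users who rated both."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for ratings in prefs.values():
        for i, r_i in ratings.items():
            for j, r_j in ratings.items():
                if i != j:
                    totals[(i, j)] += r_j - r_i
                    counts[(i, j)] += 1
    diffs = {pair: totals[pair] / counts[pair] for pair in totals}
    return diffs, dict(counts)

def predict(prefs, diffs, counts, user, target):
    """Estimate user's rating of target from each rated item plus its average diff."""
    num, den = 0.0, 0
    for item, rating in prefs[user].items():
        pair = (item, target)
        if pair in diffs:
            # Weight each estimate by how many users support that difference.
            num += (rating + diffs[pair]) * counts[pair]
            den += counts[pair]
    return num / den if den else None

# Hypothetical ratings: user -> {item: rating}
prefs = {"A": {1: 5.0, 2: 3.0, 3: 2.0}, "B": {1: 3.0, 2: 4.0}, "C": {2: 2.0, 3: 5.0}}
diffs, counts = build_diffs(prefs)
estimate = predict(prefs, diffs, counts, "B", 3)
```

Here item 3 averages 3 points below item 1 (one user) and 1 point above item 2 (two users), so the estimate for user B is (0 * 1 + 5 * 2) / 3.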

(32)

Other recommenders — SVD recommender

number of features

lambda: factor for regularization

number of training steps

(33)

Linear Interpolation Item-based recommender

SVD method got 0.76 on the GroupLens data

(34)

Cluster-based Recommendation

(35)

Other Recommenders not in Mahout — Groups (SDM 06)

– Results: beyond Collaborative Filtering: (1) Collaborative + Content Filtering (53% improvement); (2) CBDR: Collaborative + Content Filtering + Graph Community Analytics (259% accuracy improvement over collaborative filtering)

– A 3rd-party Knowledge Repository: 30K users and 20K documents.

– Study of the most active 697 users, who had at least 20 downloads in a year.

(36)

Other Recommenders not in Mahout — Info Flow (SIGIR ’06)

Compared to Collaborative Filtering (CF) + Similar People (SP), precision: IF is 91% better, TIF is 108% better

IF: Graphical Information Flow Model

TIF: Joint Topic Detection + Information Flow Model

Tests: 1 month, 586 new docs, 1,170 users

[Figures: precision vs. number of recommended users for CF + SP, IF, and TIF; network info flow across adopter categories (innovators, early adopters, early majority, late majority, laggards)]

(37)

Distributed Item-based Recommender

(38)

Distributed recommender — get co-occurrence matrix

Data:

(39)

Multiply the co-occurrence matrix with user preference

The highest score among the items user 3 has not yet purchased is for 103 (101, 104, 105, 107 have already been purchased by user 3)
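
The computation can be sketched in Python on a small hypothetical dataset (not the data from the slide). The sketch counts how often two items appear together in a user's preferences, multiplies those counts by the user's preference values, and ranks only the items the user does not already have. The diagonal of the co-occurrence matrix is skipped, since it only affects the scores of items the user already has.

```python
from collections import defaultdict
from itertools import permutations

def cooccurrence(prefs):
    """counts[(a, b)] = number of users who have both item a and item b."""
    counts = defaultdict(int)
    for items in prefs.values():
        for a, b in permutations(items, 2):
            counts[(a, b)] += 1
    return counts

def recommend(user, prefs, counts):
    """Row-by-vector products, restricted to items the user has not rated."""
    owned = prefs[user]
    scores = defaultdict(float)
    for (a, b), c in counts.items():
        if b in owned and a not in owned:
            scores[a] += c * owned[b]
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical preferences: user -> {item: rating}
prefs = {
    1: {"A": 5.0, "B": 3.0},
    2: {"A": 2.0, "B": 4.0, "C": 1.0},
    3: {"A": 4.0, "D": 5.0},
    4: {"B": 3.0, "C": 4.0},
}
counts = cooccurrence(prefs)
ranking = recommend(3, prefs, counts)
```

For user 3, item B co-occurs with A twice (2 * 4.0 = 8.0) and item C co-occurs with A once (1 * 4.0 = 4.0), so B ranks first.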

(40)

Translating to MapReduce: generating user vectors

(41)

Translating to MapReduce: calculating co-occurrence
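
The two jobs named so far (generating user vectors, then calculating co-occurrence) can be mimicked in plain Python. This is a single-process sketch of the map, shuffle, and reduce phases, not Hadoop or Mahout code; `run_mapreduce`, the mapper and reducer names, and the "user,item,rating" input format are assumptions for illustration.

```python
from collections import defaultdict
from itertools import permutations

def run_mapreduce(records, mapper, reducer):
    """Tiny in-memory stand-in for map -> shuffle (group by key) -> reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    out = []
    for key in sorted(groups):
        out.extend(reducer(key, groups[key]))
    return out

# Job 1: "user,item,rating" lines -> (user, {item: rating})
def parse_mapper(line):
    user, item, rating = line.split(",")
    yield int(user), (int(item), float(rating))

def user_vector_reducer(user, item_ratings):
    yield user, dict(item_ratings)

# Job 2: user vectors -> one co-occurrence row per item
def cooccurrence_mapper(record):
    _, vector = record
    for a, b in permutations(sorted(vector), 2):
        yield a, b  # item a was seen together with item b

def cooccurrence_reducer(item, co_items):
    row = defaultdict(int)
    for other in co_items:
        row[other] += 1
    yield item, dict(row)

lines = ["1,101,5.0", "1,102,3.0", "2,101,2.0", "2,102,2.5", "2,103,5.0"]
vectors = run_mapreduce(lines, parse_mapper, user_vector_reducer)
rows = dict(run_mapreduce(vectors, cooccurrence_mapper, cooccurrence_reducer))
```

The output of the first job is the input of the second, which is how the stages chain on a cluster as well.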

(42)

Translating to MapReduce: matrix multiplication

(43)

Translating to MapReduce: partial products

(44)

Translating to MapReduce: partial product II
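
The partial-product idea is that the matrix-vector product can be written as a sum of co-occurrence columns, each scaled by one of the user's preference values, so every column can be processed independently and the results added afterwards. The Python sketch below illustrates this on hypothetical data; the function names and the row layout are assumptions, and because the co-occurrence matrix is symmetric its rows are used as columns.

```python
from collections import defaultdict

def partial_products(user_vector, cooccurrence_rows):
    """Map side: one scaled co-occurrence column per item the user rated."""
    for item, pref in user_vector.items():
        column = cooccurrence_rows.get(item, {})
        yield {other: count * pref for other, count in column.items()}

def sum_partials(partials):
    """Reduce side: add the partial vectors element by element."""
    total = defaultdict(float)
    for partial in partials:
        for item, value in partial.items():
            total[item] += value
    return dict(total)

# Hypothetical symmetric co-occurrence rows and one user's preferences
rows = {
    "A": {"B": 2, "C": 1, "D": 1},
    "B": {"A": 2, "C": 2},
    "C": {"A": 1, "B": 2},
    "D": {"A": 1},
}
user_vector = {"A": 4.0, "D": 5.0}
scores = sum_partials(partial_products(user_vector, rows))
```

Only the columns for items the user actually rated are touched, which keeps the work proportional to the length of the user vector rather than the size of the matrix.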

(45)

Running Recommender on MapReduce and HDFS

(46)

Questions?
