RANDOM PROJECTIONS FOR SEARCH AND MACHINE LEARNING


KEYWORD-BASED SEARCH

Document Data

300 unique words per document

300 000 words in vocabulary

Data sparsity: 99.9%

Words are strong query features
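The sparsity figure follows from the two counts above. A minimal sketch of the arithmetic; the variable names are illustrative and not from the talk:

```python
# Illustrative arithmetic only; the variable names are assumptions.
unique_words_per_doc = 300
vocabulary_size = 300_000

# Fraction of vocabulary entries that are zero for a typical document.
sparsity = 1 - unique_words_per_doc / vocabulary_size
print(f"{sparsity:.1%}")  # prints 99.9%
```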

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015

IMAGE SEARCH BY EXAMPLE

Image Data

800 pixels

200 black pixels

Data sparsity 80%

TWO KINDS OF SEARCH

Data            Sparse                     Dense
Query Size      Short                      Long
Use Cases       Keyword Search             Image Search, Semantic Search, Recommendations
Search Method   Inverted Index (Lucene)    Random Projections

OUTLINE

1. High dimensional data (images, text, clicks)
2. Random Projections: Why? What? How?
3. Image Search Benchmark

HIGH-DIMENSIONAL DATA

Images

Pixels (4 x 4 image, numbered row by row):

 1  2  3  4
 5  6  7  8
 9 10 11 12
13 14 15 16

Dimensions (one per pixel):

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
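A minimal sketch of how a 4 x 4 pixel grid becomes a 16-dimensional vector, assuming NumPy and row-by-row numbering as on the slide; the variable names are illustrative:

```python
import numpy as np

# A 4 x 4 image whose pixels are numbered 1..16 row by row, as on the slide.
image = np.arange(1, 17).reshape(4, 4)

# Flattening row by row turns the 16 pixels into 16 dimensions.
vector = image.reshape(-1)

print(image)
print(vector)
```

Stacking one such vector per image, row after row, gives the matrix representation of the dataset.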


MATRIX REPRESENTATION OF DATASET


HIGH-DIMENSIONAL DATA

Text

A beer can do everyone a favor.

Dimensions:  a  abstract  be  bee  beer  can  do  everyone
Counts:      2  0         0   0    1     1
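The word-count vector for the example sentence can be reproduced with a few lines of Python. This is a sketch under simple assumptions: the vocabulary is just the eight words visible on the slide, and the tokenizer (lowercase, strip periods, split on spaces) and the helper name `to_count_vector` are hypothetical:

```python
from collections import Counter

# Hypothetical toy vocabulary taken from the words shown on the slide.
vocabulary = ["a", "abstract", "be", "bee", "beer", "can", "do", "everyone"]

def to_count_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    words = text.lower().replace(".", "").split()
    counts = Counter(words)
    return [counts[word] for word in vocabulary]

vector = to_count_vector("A beer can do everyone a favor.", vocabulary)
print(vector[:6])  # first six dimensions, as on the slide: [2, 0, 0, 0, 1, 1]
```

With a real vocabulary of 300 000 words, almost every entry of such a vector is zero, which is why text data is sparse.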


HIGH-DIMENSIONAL DATA

Clicks

1 0 0


DENSE VS SPARSE MATRIX

IMAGES, 1000 PIXELS            CANNOT SEARCH WITH LUCENE
TEXT/CLICKS, 300 000 WORDS     CAN SEARCH WITH LUCENE
TEXT/CLICKS, 500 DIMENSIONS    CANNOT SEARCH WITH LUCENE

Rows: img_1, img_3, ..., img_n (images); doc_1, doc_3, ..., doc_n (text/clicks)

DIMENSIONALITY REDUCTION

300 000 dimensions: rows doc_1, doc_3, ..., doc_n; columns such as beer, wine

SVD

500 dimensions: rows doc_1, doc_3, ..., doc_n; columns such as beer + wine
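A minimal sketch of SVD-based dimensionality reduction, assuming NumPy. The toy matrix, the word list and `k = 2` are made up for illustration (the talk reduces 300 000 dimensions to 500); the point is that a reduced dimension is a mix of original columns such as beer + wine:

```python
import numpy as np

# Hypothetical toy document-term matrix (rows: documents, columns: words).
words = ["beer", "wine", "cat", "dog"]
X = np.array([
    [4.0, 2.0, 0.0, 0.0],
    [2.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
])

k = 2  # target number of dimensions (500 in the talk, 2 in this toy example)

# Thin SVD: X = U @ diag(s) @ Vt, singular values sorted in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the k strongest directions; each row of Vt[:k] mixes the original columns.
X_reduced = X @ Vt[:k].T

print(X_reduced.shape)
print(np.round(np.abs(Vt[:k]), 2))  # row 0 weights beer and wine equally
```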


DIMENSIONALITY REDUCTION = PATTERN DISCOVERY


APPLICATIONS

Random Projections

Random Forest

SVM

Boosting

SVD

Search (Nearest Neighbor)

RANDOM PROJECTIONS IN INDUSTRY

Spotify for music recommendations

https://github.com/spotify/annoy

Etsy for user/product recommendations

“Style in the Long Tail: Discovering Unique Interests with Latent Variable Models in Large Scale Social E-commerce”, Diane Hu, Rob Hall, Josh Attenberg

Referred to as Locality Sensitive Hashing


HOW DOES IT WORK?


ILLUSTRATION ON ARTIFICIAL DATASET

Data points: id_1, id_3, ..., id_n

Nearest neighbor problem: Find the closest point

Approximation: Neighbors are in a circle with small radius

Core idea: random grids

Dynamic grids (variably sized)

Random = cheap

PARTITION BY RANDOM SPLITS


Hyperplane splits the dataset
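A minimal sketch of one random split, assuming NumPy and a made-up 2-D dataset. The talk does not say where along the random direction the cut is placed; splitting at the median projection is an assumption made here so that both halves have equal size:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dataset: 1000 points in 2 dimensions.
points = rng.normal(size=(1000, 2))

# A random direction defines the hyperplane (it is the hyperplane's normal).
direction = rng.normal(size=2)
direction /= np.linalg.norm(direction)

# Projection = dot product with the direction: a one-dimensional view of the data.
projections = points @ direction

# Assumption: split at the median projection so both sides get half of the points.
threshold = np.median(projections)
left = points[projections <= threshold]
right = points[projections > threshold]

print(len(left), len(right))
```

Applying the same step recursively to each half yields a tree of partitions.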

PROJECTION = ONE DIMENSIONAL VIEW OF DATASET

Random Direction (1 Dimension)

PLATO’S ALLEGORY OF THE CAVE

PARTITIONING BY PROJECTING ON RANDOM LINES

ALGORITHM VISUALIZATION


When two points are close, they are LIKELY to end up in the same partition, even under random partitioning
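This claim can be checked numerically. The sketch below uses a simplified setting that is an assumption, not the talk's exact procedure: two points on the unit circle and random splitting lines through the origin. For that setting the probability of staying together is known to be 1 - angle/pi, so nearby points (small angle) are rarely separated. The function name `same_side_probability` is hypothetical:

```python
import math
import random

def same_side_probability(angle, trials=20000, seed=0):
    """Estimate how often a random line through the origin keeps two points
    together, when the points are `angle` radians apart on the unit circle."""
    rng = random.Random(seed)
    a = (1.0, 0.0)
    b = (math.cos(angle), math.sin(angle))
    same = 0
    for _ in range(trials):
        # Random normal vector of the splitting line.
        nx, ny = rng.gauss(0, 1), rng.gauss(0, 1)
        side_a = nx * a[0] + ny * a[1] > 0
        side_b = nx * b[0] + ny * b[1] > 0
        same += side_a == side_b
    return same / trials

for degrees in (5, 45, 90):
    angle = math.radians(degrees)
    estimate = same_side_probability(angle)
    exact = 1 - angle / math.pi  # known result for hyperplanes through the origin
    print(degrees, round(estimate, 3), round(exact, 3))
```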


MULTIPLE TREES

ADDING MORE TREES

Panels: Tree 1 | Trees 1 and 2 | Trees 1, 2 and 3

CONSTRAINING THE NEAREST NEIGHBOR REGION

THRESHOLD = the number of trees in which a “NN candidate” must appear

Panels: Threshold = 1 | Threshold = 20 | Threshold = 60
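A minimal sketch of the threshold idea, assuming NumPy. Each "tree" here is simplified to a handful of random hyperplanes shared by all points, which is an assumption and not the talk's actual tree construction; the dataset, `num_trees`, `depth` and the helper names are hypothetical. A point's vote count is the number of trees in which it lands in the same partition as the query:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
points = rng.normal(size=(2000, 2))       # hypothetical toy dataset
query = np.array([0.5, 0.5])

num_trees, depth = 60, 4

def bucket_ids(data, directions, offsets):
    """Bucket of each row: the pattern of sides across `depth` random splits."""
    bits = (data @ directions.T > offsets).astype(int)
    return bits @ (1 << np.arange(bits.shape[1]))

votes = Counter()
for _ in range(num_trees):
    # One simplified "tree": `depth` random hyperplanes shared by all points.
    directions = rng.normal(size=(depth, 2))
    offsets = rng.normal(size=depth)
    point_buckets = bucket_ids(points, directions, offsets)
    query_bucket = bucket_ids(query[None, :], directions, offsets)[0]
    for idx in np.flatnonzero(point_buckets == query_bucket):
        votes[int(idx)] += 1

def candidates(threshold):
    """Points that share a partition with the query in at least `threshold` trees."""
    return [idx for idx, count in votes.items() if count >= threshold]

for threshold in (1, 20, 60):
    print(threshold, len(candidates(threshold)))
```

Raising the threshold can only shrink the candidate set, which is how the nearest-neighbor region gets constrained.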


“WE’VE GOT HIM”


IMAGE SEARCH FOR PREDICTION

[Figure: Results table with columns Image and Label; label values shown: 5, 3, 3]

STAGE 1: INDEXING

42 000 examples with labels

784 pixels per image

    index = Index(...)
    labels = {}  # maps image id to its label
    for image in training_examples:
        labels[image.id] = image.label
        index.add_item(image.id, image.vector)
    index.build(...)

STAGE 2: SEARCH TO PREDICT THE LABEL

    index.load('.../file.index')
    ...
    image_vec = read_image(...)
    nn = 100  # number of nearest neighbors
    results = index.get_nns_by_vector(image_vec, nn)
    top_result = results[0]  # closest match
    predicted_label = labels[top_result.id]

28 000 test examples

TOOLS

stefansavev.com/randomtrees

BENCHMARK IMAGE SEARCH

DIMENSIONS       # TREES   # NN CANDIDATES   ACCURACY ON TEST [%]   SEARCH TIME [ms/query]
784              10        1000-1010         97.0                   4.85
100, after SVD   10        1000-1010         97.5                   0.7
100, after SVD   40        180 (mean)        97.5                   0.36

[Slide callout: BOTTLENECK]

Tool: stefansavev.com/randomtrees

Reference: https://github.com/stefansavev/random-projections-talk/

SAME FOUNDATION FOR SEARCH AND MACHINE LEARNING

Data point id: 32, 56, 89

Histogram

label   frequency
“0”     4
“1”     51
“7”     38
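One plausible way to turn such a label histogram into a prediction is a majority vote. The sketch below uses the frequencies from the slide; the voting rule and the confidence measure are assumptions, since the talk does not spell them out:

```python
from collections import Counter

# Label frequencies copied from the slide's histogram.
histogram = Counter({"0": 4, "1": 51, "7": 38})

# Predict the most frequent label and report its share as a rough confidence.
predicted_label, frequency = histogram.most_common(1)[0]
confidence = frequency / sum(histogram.values())

print(predicted_label, round(confidence, 2))  # prints: 1 0.55
```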

CONCLUSION

Search in Dense Data

CHEAP method to “explore” data

Makes algorithms FASTER (e.g. SVD, SVM)

Stefan Savev

Blog: stefansavev.com

THANK YOU!


EXTRA SLIDES


ADVANTAGES/DISADVANTAGES OF DIMENSIONALITY REDUCTION

Advantages

“Semantic Similarity”

“Concentrate the signal”

Disadvantages

No exact matches possible

Adds noise


RANDOM PROJECTIONS AS HASH FUNCTIONS

Image (4 x 4)        Random pattern       Elementwise product

0 0 1 0              -1  1  1 -1           0  0  1  0
0 1 1 0       *       1 -1  1  1     =     0 -1  1  0
0 0 1 0               1  1 -1 -1           0  0 -1  0
0 0 1 0              -1  1 -1  1           0  0 -1  0

Sum of the product: 1 + 1 - 1 - 1 - 1 = -1
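The arithmetic on this slide can be reproduced directly. A minimal sketch assuming NumPy, with the image and the random pattern copied from the slide; turning the sign of the sum into a hash bit is an assumption about how the value is used:

```python
import numpy as np

image = np.array([
    [0, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 0],
])
pattern = np.array([
    [-1,  1,  1, -1],
    [ 1, -1,  1,  1],
    [ 1,  1, -1, -1],
    [-1,  1, -1,  1],
])

# Elementwise product followed by a sum: a dot product of the flattened arrays.
overlap = int(np.sum(image * pattern))

# Assumption: the sign of the overlap is used as one hash bit.
hash_bit = 1 if overlap > 0 else 0

print(overlap, hash_bit)  # prints: -1 0
```

Repeating this with several independent random patterns gives one bit per pattern, and the bits together act as a hash code for the image.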


EXAMPLE-BASED IMAGE SEARCH

MNIST image dataset of digits

Images are digits from 0 to 9

42 000 images for training

28 000 images for test

28 x 28 pixels with values from 0 to 255

784 (=28*28) features

Dataset is already preprocessed

Goal is to predict the label of an image


SOME IMPLEMENTATION TRICKS

sparse random projections: combine only a small number of dimensions instead of all of them

apply dimensionality reduction first

pick better random projections: prefer ones where the projected histogram is wide

project on multiple lines simultaneously: Hadamard matrix trick

reuse random projections via recombination

cut out a “ball” from the densest region

PCA may help

use the neighbors of neighbors of a point as additional similarity candidates
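The first trick in this list, sparse random projections, can be sketched as follows, assuming NumPy. The number of combined dimensions (`num_nonzero = 16`), the use of random signs and the helper name `sparse_project` are assumptions chosen for illustration, not details from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

num_dimensions = 784   # pixels per image in the talk's benchmark
num_nonzero = 16       # assumption: how many dimensions each projection combines

# Pick a few dimensions at random and give each a random sign.
chosen = rng.choice(num_dimensions, size=num_nonzero, replace=False)
signs = rng.choice([-1.0, 1.0], size=num_nonzero)

def sparse_project(vector):
    """Project onto the sparse random direction using only the chosen dimensions."""
    return float(vector[chosen] @ signs)

# Equivalent dense direction, built only to check the result.
dense_direction = np.zeros(num_dimensions)
dense_direction[chosen] = signs

vector = rng.random(num_dimensions)
print(sparse_project(vector), float(vector @ dense_direction))
```

The sparse version touches 16 values per projection instead of 784, which is where the speedup comes from.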


PROJECTION

Dot product (cosine, similarity)

View of the dataset

Overlap with Pattern

“Geometry” of Search

Foundation for Search and Machine Learning


WHY RANDOM?

Cheap (replace optimization with randomization); Simply build more trees

If we build the perfect tree, it’s just one tree. We need more and DIVERSE trees

In high dimensions all algorithms will have problems, so use the cheapest one

With only a few points/dimensions, random is not the best choice

DIMENSIONALITY REDUCTION

300 000 WORDS: rows doc_1, doc_3, ..., doc_n; columns such as beer, wine

SVD: MIXING THE COLUMNS OF THE ORIGINAL MATRIX TO MAXIMIZE SIGNAL TO NOISE

500 DIMENSIONS: columns such as beer + wine

PROJECTION = OVERLAP WITH A PATTERN

OVERLAP(image, pattern) = SUM(image * pattern) =

   1, if more red pixels than white
  -1, if more white pixels than red

DATA POINT – SVD FEATURE – RANDOM FEATURE


PLATO’S ALLEGORY OF THE CAVE

He then explains how the philosopher is like a prisoner who is freed from the cave and comes to understand that the shadows on the wall do not make up reality at all, as he can perceive the true form of reality rather than the mere shadows seen by the prisoners

Source: http://en.wikipedia.org/wiki/Allegory_of_the_Cave


SECRET SAUCE

How to make it faster?

How to make it more accurate?

When does it work?

Is there something better?


http://stefansavev.com/random-projections-secretsouce.html
