compsci 514: algorithms for data science Andrew McGregor Lecture 10

(2)

summary

Last Class:

• Introduced the k-frequent elements problem: identify all elements of a stream of n elements that occur ≥ n/k times.

• Saw how to solve approximately using the Count-min sketch algorithm. Based on random hashing into buckets containing approximate counts.

This Class:

• Randomized methods for dimensionality reduction.

• The Johnson-Lindenstrauss Lemma.

(5)

high dimensional data

‘Big Data’ means not just many data points, but many measurements per data point. I.e., very high dimensional data.

• Twitter has 321 million active monthly users. Records (tens of) thousands of measurements per user: who they follow, who follows them, when they last visited the site, timestamps for specific interactions, how many tweets they have sent, the text of those tweets, etc.

• A 3 minute YouTube clip with a resolution of 500 × 500 pixels at 15 frames/second with 3 color channels is a recording of ≥ 2 billion pixel values. Even a 500 × 500 pixel color image has 750,000 pixel values.

• The human genome contains 3 billion+ base pairs. Genetic datasets often contain information on 100s of thousands+ mutations and genetic markers.

(10)

data as vectors and matrices

In data analysis and machine learning, data points with many attributes are often stored, processed, and interpreted as high dimensional vectors, with real valued entries.

Similarities/distances between vectors (e.g., $\langle x, y \rangle$, $\|x - y\|_2$) have meaning for underlying data points.
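As a small illustration of the two quantities named above, the following sketch computes an inner product and a Euclidean distance for two made-up data points. The vectors and function names are hypothetical and chosen only for this example.

```python
import math

def inner_product(x, y):
    """Return <x, y> = sum over k of x(k) * y(k)."""
    return sum(a * b for a, b in zip(x, y))

def euclidean_distance(x, y):
    """Return ||x - y||_2."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Two hypothetical data points with d = 3 attributes each.
x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]

print(inner_product(x, y))       # 1*4 + 2*5 + 3*6 = 32.0
print(euclidean_distance(x, y))  # sqrt(9 + 9 + 9) = sqrt(27)
```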

(12)

datasets as vectors and matrices

Data points are interpreted as high dimensional vectors, with real valued entries. Data set is interpreted as a matrix.

Data Points: $\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n \in \mathbb{R}^d$.

Data Set: $X \in \mathbb{R}^{n \times d}$ with $i$th row equal to $\vec{x}_i$.

Many data points $n \implies$ tall. Many dimensions $d \implies$ wide.

(15)

dimensionality reduction

Dimensionality Reduction: Compress data points so that they lie in many fewer dimensions.

$$\vec{x}_1, \ldots, \vec{x}_n \in \mathbb{R}^d \;\to\; \tilde{x}_1, \ldots, \tilde{x}_n \in \mathbb{R}^m \quad \text{for } m \ll d.$$

‘Lossy compression’ that still preserves important information about the relationships between $\vec{x}_1, \ldots, \vec{x}_n$.

Generally will not consider directly how well $\tilde{x}_i$ approximates $\vec{x}_i$.

(19)

low distortion embedding

Low Distortion Embedding: Given $\vec{x}_1, \ldots, \vec{x}_n \in \mathbb{R}^d$, distance function $D$, and error parameter $\epsilon \ge 0$, find $\tilde{x}_1, \ldots, \tilde{x}_n \in \mathbb{R}^m$ (where $m \ll d$) and distance function $\tilde{D}$ such that for all $i, j \in [n]$:

$$(1 - \epsilon)\, D(\vec{x}_i, \vec{x}_j) \le \tilde{D}(\tilde{x}_i, \tilde{x}_j) \le (1 + \epsilon)\, D(\vec{x}_i, \vec{x}_j).$$

Have already seen one example in class: MinHash.

With large enough signature size $r$,

$$\frac{\#\,\text{matching entries in } \tilde{x}_A, \tilde{x}_B}{r} \approx J(\vec{x}_A, \vec{x}_B).$$

• Note: here $J(\vec{x}_A, \vec{x}_B)$ is a similarity rather than a distance. So this is not quite low distortion embedding, but is closely related.
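The MinHash estimate above can be tried out with a minimal sketch. It assumes idealized hash functions, modeled here as explicit random permutations of a small universe. The sets, the universe size, the seed, and the signature size `r = 400` are all hypothetical choices made for illustration.

```python
import random

def minhash_signature(item_set, permutations):
    """For each permutation (a dict: element -> rank), keep the minimum rank over the set."""
    return [min(perm[e] for e in item_set) for perm in permutations]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature entries on which the two signatures agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

rng = random.Random(0)        # fixed seed so the run is repeatable
universe = list(range(90))    # hypothetical universe of items
r = 400                       # signature size

# Each "hash function" is a uniformly random permutation of the universe.
permutations = []
for _ in range(r):
    order = universe[:]
    rng.shuffle(order)
    permutations.append({e: rank for rank, e in enumerate(order)})

set_a = set(range(0, 60))
set_b = set(range(30, 90))
true_j = len(set_a & set_b) / len(set_a | set_b)   # 30 / 90

est_j = estimate_jaccard(minhash_signature(set_a, permutations),
                         minhash_signature(set_b, permutations))
print(true_j, est_j)
```

The estimate fluctuates around the true Jaccard similarity, and the fluctuation shrinks as `r` grows.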

(25)

embeddings for euclidean space

Euclidean Low Distortion Embedding: Given $\vec{x}_1, \ldots, \vec{x}_n \in \mathbb{R}^d$ and error parameter $\epsilon \ge 0$, find $\tilde{x}_1, \ldots, \tilde{x}_n \in \mathbb{R}^m$ (where $m \ll d$) such that for all $i, j \in [n]$:

$$(1 - \epsilon)\|\vec{x}_i - \vec{x}_j\|_2 \le \|\tilde{x}_i - \tilde{x}_j\|_2 \le (1 + \epsilon)\|\vec{x}_i - \vec{x}_j\|_2.$$

Recall that for $\vec{z} \in \mathbb{R}^n$, $\|\vec{z}\|_2 = \sqrt{\sum_{i=1}^{n} \vec{z}(i)^2}$.

Can use $\tilde{x}_1, \ldots, \tilde{x}_n$ in place of $\vec{x}_1, \ldots, \vec{x}_n$ in clustering, SVM, linear classification, near neighbor search, etc.

(29)

embedding with assumptions

A very easy case: Assume that $\vec{x}_1, \ldots, \vec{x}_n$ all lie on the 1st axis in $\mathbb{R}^d$.

Set $m = 1$ and $\tilde{x}_i = [\vec{x}_i(1)]$ (i.e., $\tilde{x}_i$ contains just a single number).

• $\|\tilde{x}_i - \tilde{x}_j\|_2 = \sqrt{[\vec{x}_i(1) - \vec{x}_j(1)]^2} = |\vec{x}_i(1) - \vec{x}_j(1)| = \|\vec{x}_i - \vec{x}_j\|_2$.

• An embedding with no distortion from any $d$ into $m = 1$.
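The easy case can be checked numerically. The sketch below builds a few hypothetical points that lie on the first axis of a 5-dimensional space, keeps only the first coordinate, and compares pairwise distances before and after. The point values and helper names are assumptions made for illustration.

```python
import math

def euclidean_distance(x, y):
    """Return ||x - y||_2."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def embed_first_axis(x):
    """Keep only the first coordinate, so the embedded point has m = 1 entries."""
    return [x[0]]

d = 5
# Hypothetical points on the 1st axis: every coordinate except the first is 0.
points = [[c] + [0.0] * (d - 1) for c in (-2.0, 0.5, 3.0)]
embedded = [embed_first_axis(p) for p in points]

for i in range(len(points)):
    for j in range(i + 1, len(points)):
        original = euclidean_distance(points[i], points[j])
        compressed = euclidean_distance(embedded[i], embedded[j])
        print(i, j, original, compressed)  # the two distances agree
```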

When else might no (or low) distortion Euclidean embeddings into

very low-dimensional space exist?

(34)

embedding with no assumptions

What about when we don’t make any assumptions on $\vec{x}_1, \ldots, \vec{x}_n$? I.e., they can be scattered arbitrarily around $d$-dimensional space.

• Can we find a no-distortion embedding into $m \ll d$ dimensions? No. Require $m = d$.

• Can we find an $\epsilon$-distortion embedding into $m \ll d$ dimensions for $\epsilon > 0$? Yes! Always, with $m$ depending on $\epsilon$.

For all $i, j$: $(1 - \epsilon)\|\vec{x}_i - \vec{x}_j\|_2 \le \|\tilde{x}_i - \tilde{x}_j\|_2 \le (1 + \epsilon)\|\vec{x}_i - \vec{x}_j\|_2$.

(37)

the johnson-lindenstrauss lemma

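Johnson-Lindenstrauss Lemma: For any set of points $\vec{x}_1, \ldots, \vec{x}_n \in \mathbb{R}^d$ and $\epsilon > 0$ there exists a linear map $\Pi : \mathbb{R}^d \to \mathbb{R}^m$ such that $m = O\left(\frac{\log n}{\epsilon^2}\right)$ and letting $\tilde{x}_i = \Pi \vec{x}_i$:

For all $i, j$: $(1 - \epsilon)\|\vec{x}_i - \vec{x}_j\|_2 \le \|\tilde{x}_i - \tilde{x}_j\|_2 \le (1 + \epsilon)\|\vec{x}_i - \vec{x}_j\|_2$.

Further, if $\Pi \in \mathbb{R}^{m \times d}$ has each entry chosen i.i.d. from $N(0, 1/m)$, it satisfies the guarantee with high probability.

For $d$ = 1 trillion, $\epsilon = .05$, and $n = 100{,}000$, $m \approx 6600$.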
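The figure $m \approx 6600$ can be reproduced with a short calculation. The lemma only gives $m$ up to a constant factor, so the sketch below assumes a leading constant of 1 and a base-2 logarithm. Both are assumptions picked because they match the number quoted above, not something the lemma statement fixes.

```python
import math

def jl_dimension(n, eps, constant=1.0):
    """Return constant * log2(n) / eps^2.

    The constant and the base of the logarithm are assumptions for
    illustration; the lemma hides both inside the O(.) notation.
    """
    return constant * math.log2(n) / eps ** 2

m = jl_dimension(n=100_000, eps=0.05)
print(round(m))  # 6644
```

Note that the original dimension $d$ does not appear in the calculation at all: the target dimension depends only on $n$ and $\epsilon$.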

Very surprising! Powerful result with a simple construction: applying a random linear transformation to a set of points preserves distances between all those points with high probability.

(40)

random projection

For any $\vec{x}_1, \ldots, \vec{x}_n$ and $\Pi \in \mathbb{R}^{m \times d}$ with each entry chosen i.i.d. from $N(0, 1/m)$, with high probability, letting $\mathbf{x}_i = \Pi \vec{x}_i$:

For all $i, j$: $(1 - \epsilon)\|\vec{x}_i - \vec{x}_j\|_2 \le \|\mathbf{x}_i - \mathbf{x}_j\|_2 \le (1 + \epsilon)\|\vec{x}_i - \vec{x}_j\|_2$.

• $\Pi$ is known as a random projection. It is a random linear function, mapping length $d$ vectors to length $m$ vectors.

• $\Pi$ is data oblivious. Stark contrast to methods like PCA.
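A minimal sketch of such a random projection, using only the Python standard library, is shown here. The sizes `n = 5`, `d = 1000`, `m = 200`, the seed, and the uniformly random test points are hypothetical choices. The code draws $\Pi$ without looking at the data and then reports how much each pairwise distance changed.

```python
import math
import random

def random_projection_matrix(m, d, rng):
    """m x d matrix with i.i.d. N(0, 1/m) entries (standard deviation 1/sqrt(m))."""
    sigma = 1.0 / math.sqrt(m)
    return [[rng.gauss(0.0, sigma) for _ in range(d)] for _ in range(m)]

def project(Pi, x):
    """Return Pi x: entry j is the inner product of row j of Pi with x."""
    return [sum(p * v for p, v in zip(row, x)) for row in Pi]

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

rng = random.Random(0)
n, d, m = 5, 1000, 200   # hypothetical sizes, small enough to run quickly

points = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]
Pi = random_projection_matrix(m, d, rng)      # chosen independently of the data
compressed = [project(Pi, x) for x in points]

# Ratio of compressed distance to original distance for every pair.
ratios = []
for i in range(n):
    for j in range(i + 1, n):
        ratios.append(euclidean_distance(compressed[i], compressed[j])
                      / euclidean_distance(points[i], points[j]))

print(min(ratios), max(ratios))  # both should be close to 1
```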

(44)

algorithmic considerations

• Many alternative constructions: ±1 entries, sparse (most entries 0), Fourier structured, etc. $\implies$ more efficient computation of $\mathbf{x}_i = \Pi \vec{x}_i$.

• Data oblivious property means that once $\Pi$ is chosen, $\mathbf{x}_1, \ldots, \mathbf{x}_n$ can be computed in a stream with little memory.

• Memory needed is just $O(nm)$ vs. $O(nd)$ to store the full data set.

• Compression can also be easily performed in parallel on different servers.

• When new data points are added, they can be easily compressed, without updating existing points.
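The streaming and "new data point" properties can be illustrated with a small sketch. It assumes the ±1 construction with entries scaled by $1/\sqrt{m}$, which is one common way to set up that variant. The sizes, seed, and function names are hypothetical.

```python
import math
import random

def sign_matrix(m, d, rng):
    """m x d matrix with i.i.d. entries equal to +1/sqrt(m) or -1/sqrt(m)."""
    scale = 1.0 / math.sqrt(m)
    return [[rng.choice((-scale, scale)) for _ in range(d)] for _ in range(m)]

def compress_stream(stream, Pi):
    """Yield Pi x for each point x as it arrives.

    Apart from Pi itself, only the current point is held in memory.
    """
    for x in stream:
        yield [sum(p * v for p, v in zip(row, x)) for row in Pi]

rng = random.Random(1)
d, m = 50, 8
Pi = sign_matrix(m, d, rng)   # fixed once, before seeing any data

first_batch = [[rng.random() for _ in range(d)] for _ in range(3)]
compressed = list(compress_stream(iter(first_batch), Pi))

# A new point arrives later: it is compressed with the same Pi,
# and the previously compressed points are left untouched.
new_point = [rng.random() for _ in range(d)]
compressed_new = next(compress_stream([new_point], Pi))
print(len(compressed), len(compressed_new))
```

Because each output depends only on $\Pi$ and one input point, different servers holding the same $\Pi$ could compress disjoint parts of the data independently.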

(47)

connection to simhash

Compression operation is $\mathbf{x}_i = \Pi \vec{x}_i$, so for any $j$,

$$\mathbf{x}_i(j) = \langle \Pi(j), \vec{x}_i \rangle = \sum_{k=1}^{d} \Pi(j, k) \cdot \vec{x}_i(k).$$

$\Pi(j)$, the $j$th row of $\Pi$, is a vector with independent random Gaussian entries.

Points with high cosine similarity have similar random projections. Computing a length $m$ SimHash signature $SH_1(\vec{x}_i), \ldots, SH_m(\vec{x}_i)$ is identical to computing $\mathbf{x}_i = \Pi \vec{x}_i$ and then taking $\mathrm{sign}(\mathbf{x}_i)$.

$\vec{x}_1, \ldots, \vec{x}_n$: original points ($d$ dims.), $\mathbf{x}_1, \ldots, \mathbf{x}_n$: compressed points ($m < d$ dims.), $\Pi \in \mathbb{R}^{m \times d}$: random projection (embedding function)
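The relationship between random projection and SimHash can be sketched as follows: project with a Gaussian $\Pi$, then keep only the sign of each coordinate. The sizes, seed, and helper names are hypothetical. The example also shows that the signature depends only on the direction of a vector, which is why it tracks cosine similarity.

```python
import math
import random

def random_projection_matrix(m, d, rng):
    """m x d matrix with i.i.d. N(0, 1/m) entries."""
    sigma = 1.0 / math.sqrt(m)
    return [[rng.gauss(0.0, sigma) for _ in range(d)] for _ in range(m)]

def project(Pi, x):
    """Entry j is <Pi(j), x>, the inner product of row j of Pi with x."""
    return [sum(p * v for p, v in zip(row, x)) for row in Pi]

def simhash_signature(Pi, x):
    """Sign (+1 or -1) of each coordinate of Pi x."""
    return [1 if v >= 0 else -1 for v in project(Pi, x)]

rng = random.Random(2)
d, m = 20, 16
Pi = random_projection_matrix(m, d, rng)

x = [rng.uniform(-1.0, 1.0) for _ in range(d)]
scaled = [2.0 * v for v in x]      # same direction as x, cosine similarity 1
opposite = [-v for v in x]         # opposite direction, cosine similarity -1

sig_x = simhash_signature(Pi, x)
sig_scaled = simhash_signature(Pi, scaled)
sig_opposite = simhash_signature(Pi, opposite)

print(sig_x == sig_scaled)                                  # same signature
print(all(a == -b for a, b in zip(sig_x, sig_opposite)))    # every entry flipped
```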

(55)

distributional jl

The Johnson-Lindenstrauss Lemma is a direct consequence of a closely related lemma:

Distributional JL Lemma: Let $\Pi \in \mathbb{R}^{m \times d}$ have each entry chosen i.i.d. as $N(0, 1/m)$. If we set $m = O\left(\frac{\log(1/\delta)}{\epsilon^2}\right)$, then for any $\vec{y} \in \mathbb{R}^d$, with probability $\ge 1 - \delta$:

$$(1 - \epsilon)\|\vec{y}\|_2 \le \|\Pi \vec{y}\|_2 \le (1 + \epsilon)\|\vec{y}\|_2.$$

Applying a random matrix $\Pi$ to any vector $\vec{y}$ preserves $\vec{y}$’s norm with high probability.

• Like a low-distortion embedding, but for the length of a compressed vector rather than distances between vectors.

• Can be proven from first principles. Will see next class.

$\Pi \in \mathbb{R}^{m \times d}$: random projection matrix. $d$: original dimension. $m$: compressed dimension, $\epsilon$: embedding error, $\delta$: embedding failure prob.

(59)

distributional jl =⇒ jl

Distributional JL Lemma $\implies$ JL Lemma: Distributional JL shows that a random projection $\Pi$ preserves the norm of any $\vec{y}$. The main JL Lemma says that $\Pi$ preserves distances between vectors.

Since $\Pi$ is linear these are the same thing!

Proof: Given $\vec{x}_1, \ldots, \vec{x}_n$, define $\binom{n}{2}$ vectors $\vec{y}_{ij}$ where $\vec{y}_{ij} = \vec{x}_i - \vec{x}_j$.

• If we choose $\Pi$ with $m = O\left(\frac{\log(1/\delta)}{\epsilon^2}\right)$, for each $\vec{y}_{ij}$ with probability $\ge 1 - \delta$ we have:

$$(1 - \epsilon)\|\vec{y}_{ij}\|_2 \le \|\Pi \vec{y}_{ij}\|_2 \le (1 + \epsilon)\|\vec{y}_{ij}\|_2.$$

Substituting $\vec{y}_{ij} = \vec{x}_i - \vec{x}_j$:

$$(1 - \epsilon)\|\vec{x}_i - \vec{x}_j\|_2 \le \|\Pi(\vec{x}_i - \vec{x}_j)\|_2 \le (1 + \epsilon)\|\vec{x}_i - \vec{x}_j\|_2.$$

By linearity, $\Pi(\vec{x}_i - \vec{x}_j) = \Pi \vec{x}_i - \Pi \vec{x}_j = \mathbf{x}_i - \mathbf{x}_j$, so:

$$(1 - \epsilon)\|\vec{x}_i - \vec{x}_j\|_2 \le \|\mathbf{x}_i - \mathbf{x}_j\|_2 \le (1 + \epsilon)\|\vec{x}_i - \vec{x}_j\|_2.$$

Claim: If we choose $\Pi$ with i.i.d. $N(0, 1/m)$ entries and $m = O\left(\frac{\log(1/\delta')}{\epsilon^2}\right)$, letting $\mathbf{x}_i = \Pi \vec{x}_i$, for each pair $\vec{x}_i, \vec{x}_j$ with probability $\ge 1 - \delta'$ we have:

$$(1 - \epsilon)\|\vec{x}_i - \vec{x}_j\|_2 \le \|\mathbf{x}_i - \mathbf{x}_j\|_2 \le (1 + \epsilon)\|\vec{x}_i - \vec{x}_j\|_2.$$

With what probability are all pairwise distances preserved?

Union bound: With probability $\ge 1 - \binom{n}{2} \cdot \delta'$ all pairwise distances are preserved.

Apply the claim with $\delta' = \delta / \binom{n}{2}$. $\implies$ for $m = O\left(\frac{\log(1/\delta')}{\epsilon^2}\right)$, all pairwise distances are preserved with probability $\ge 1 - \delta$.

$$m = O\left(\frac{\log(1/\delta')}{\epsilon^2}\right) = O\left(\frac{\log\left(\binom{n}{2}/\delta\right)}{\epsilon^2}\right) = O\left(\frac{\log(n^2/\delta)}{\epsilon^2}\right) = O\left(\frac{\log(n/\delta)}{\epsilon^2}\right)$$

$\vec{x}_1, \ldots, \vec{x}_n$: original points, $\mathbf{x}_1, \ldots, \mathbf{x}_n$: compressed points, $\Pi \in \mathbb{R}^{m \times d}$: random projection matrix. $d$: original dimension. $m$: compressed dimension, $\epsilon$: embedding error, $\delta$: embedding failure prob.

Yields the JL lemma.
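The union bound step can be made concrete with numbers. The sketch below uses $n = 100{,}000$ points and a hypothetical overall failure probability $\delta = 0.01$, computes the per-pair failure probability $\delta' = \delta / \binom{n}{2}$, and checks that $\log(1/\delta')$ is at most $2\log(n/\delta)$, which is why the final bound can be written in terms of $\log(n/\delta)$.

```python
import math

n = 100_000     # number of points
delta = 0.01    # hypothetical overall failure probability

pairs = math.comb(n, 2)        # number of pairs (i, j) with i < j
delta_prime = delta / pairs    # failure probability allowed per pair

# Union bound: total failure probability is at most pairs * delta_prime = delta.
total_failure_bound = pairs * delta_prime

# log(1/delta') = log(C(n,2)/delta) <= log(n^2/delta) <= 2 * log(n/delta) for delta <= 1.
lhs = math.log(1.0 / delta_prime)
rhs = 2.0 * math.log(n / delta)

print(pairs)                 # 4999950000
print(total_failure_bound)   # approximately 0.01
print(lhs <= rhs)            # True
```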

(76)