• No results found

compsci 514: algorithms for data science Andrew McGregor Lecture 10

N/A
N/A
Protected

Academic year: 2021

Share "compsci 514: algorithms for data science Andrew McGregor Lecture 10"

Copied!
76
0
0

Loading.... (view fulltext now)

Full text

(1)

compsci

514:

algorithms for data science

(2)

summary

Last Class:

Introduced the k-frequent elements problem – identify all

elements of a stream of n elements that occur ≥ n/k times.

Saw how to solve approximately using the Count-min sketch

algorithm. Based on random hashing into buckets containing

approximate counts.

This Class:

Randomized methods for dimensionality reduction.

The Johnson-Lindenstrauss Lemma.

(3)

summary

Last Class:

Introduced the k-frequent elements problem – identify all

elements of a stream of n elements that occur ≥ n/k times.

Saw how to solve approximately using the Count-min sketch

algorithm. Based on random hashing into buckets containing

approximate counts.

This Class:

(4)

summary

Last Class:

Introduced the k-frequent elements problem – identify all

elements of a stream of n elements that occur ≥ n/k times.

Saw how to solve approximately using the Count-min sketch

algorithm. Based on random hashing into buckets containing

approximate counts.

This Class:

Randomized methods for dimensionality reduction.

The Johnson-Lindenstrauss Lemma.

(5)

high dimensional data

‘Big Data’ means not just many data points, but many measurements per data point. I.e., veryhigh dimensional data.

• Twitter has 321 million active monthly users. Records(tens of) thousands of measurements per user: who they follow, who follows them, when they last visited the site, timestamps for specific interactions, how many tweets they have sent, the text of those tweets, etc.

• A 3 minute Youtube clip with a resolution of 500 × 500 pixels at 15 frames/second with 3 color channels is a recording of≥ 2 billion pixel values. Even a 500 × 500 pixel color image has 750, 000 pixel values. • The human genome contains 3 billion+ base pairs. Genetic datasets

(6)

high dimensional data

‘Big Data’ means not just many data points, but many measurements per data point. I.e., veryhigh dimensional data.

• Twitter has 321 million active monthly users. Records(tens of) thousands of measurements per user: who they follow, who follows them, when they last visited the site, timestamps for specific interactions, how many tweets they have sent, the text of those tweets, etc.

• A 3 minute Youtube clip with a resolution of 500 × 500 pixels at 15 frames/second with 3 color channels is a recording of≥ 2 billion pixel values. Even a 500 × 500 pixel color image has 750, 000 pixel values. • The human genome contains 3 billion+ base pairs. Genetic datasets

often contain information on100s of thousands+ mutations and genetic markers.

(7)

high dimensional data

‘Big Data’ means not just many data points, but many measurements per data point. I.e., veryhigh dimensional data.

• Twitter has 321 million active monthly users. Records(tens of) thousands of measurements per user: who they follow, who follows them, when they last visited the site, timestamps for specific interactions, how many tweets they have sent, the text of those tweets, etc.

• A 3 minute Youtube clip with a resolution of 500 × 500 pixels at 15 frames/second with 3 color channels is a recording of≥ 2 billion pixel values. Even a 500 × 500 pixel color image has 750, 000 pixel values.

(8)

high dimensional data

‘Big Data’ means not just many data points, but many measurements per data point. I.e., veryhigh dimensional data.

• Twitter has 321 million active monthly users. Records(tens of) thousands of measurements per user: who they follow, who follows them, when they last visited the site, timestamps for specific interactions, how many tweets they have sent, the text of those tweets, etc.

• A 3 minute Youtube clip with a resolution of 500 × 500 pixels at 15 frames/second with 3 color channels is a recording of≥ 2 billion pixel values. Even a 500 × 500 pixel color image has 750, 000 pixel values. • The human genome contains 3 billion+ base pairs. Genetic datasets

often contain information on100s of thousands+ mutations and genetic markers.

(9)

data as vectors and matrices

In data analysis and machine learning, data points with many

attributes are often stored, processed, and interpreted as

high

dimensional vectors

, with real valued entries.

Similarities/distances between vectors

(e.g., hx , y i, kx − y k

2

) have meaning

(10)

data as vectors and matrices

In data analysis and machine learning, data points with many

attributes are often stored, processed, and interpreted as

high

dimensional vectors

, with real valued entries.

Similarities/distances between vectors

(e.g., hx , y i, kx − y k

2

) have meaning

for underlying data points.

(11)

data as vectors and matrices

In data analysis and machine learning, data points with many

attributes are often stored, processed, and interpreted as

high

dimensional vectors

, with real valued entries.

Similarities/distances between vectors

(e.g., hx , y i, kx − y k

2

) have meaning

(12)

datasets as vectors and matrices

Data points are interpreted as

high dimensional vectors

, with real

valued entries. Data set is interpreted as a matrix.

Data Points: ~

x

1

, ~

x

2

, . . . , ~

x

n

∈ R

d

.

Data Set: X ∈ R

n×d

with i

th

rows equal to ~

x

i

.

Many data points n =⇒ tall. Many dimensions d =⇒ wide.

(13)

datasets as vectors and matrices

Data points are interpreted as

high dimensional vectors

, with real

valued entries. Data set is interpreted as a matrix.

Data Points: ~

x

1

, ~

x

2

, . . . , ~

x

n

∈ R

d

.

Data Set: X ∈ R

n×d

with i

th

rows equal to ~

x

i

.

(14)

datasets as vectors and matrices

Data points are interpreted as

high dimensional vectors

, with real

valued entries. Data set is interpreted as a matrix.

Data Points: ~

x

1

, ~

x

2

, . . . , ~

x

n

∈ R

d

.

Data Set: X ∈ R

n×d

with i

th

rows equal to ~

x

i

.

Many data points n =⇒ tall. Many dimensions d =⇒ wide.

(15)

dimensionality reduction

Dimensionality Reduction: Compress data points so that they lie in many fewer dimensions.

~

x1, . . . , ~xn∈ Rd →x˜1, . . . , ˜xn∈ Rm for m  d .

‘Lossy compression’ that still preserves important information about the relationships between ~x1, . . . , ~xn.

(16)

dimensionality reduction

Dimensionality Reduction: Compress data points so that they lie in many fewer dimensions.

~

x1, . . . , ~xn∈ Rd →x˜1, . . . , ˜xn∈ Rm for m  d .

‘Lossy compression’ that still preserves important information about the relationships between ~x1, . . . , ~xn.

Generally will not consider directly how well ˜xi approximates ~xi.

(17)

dimensionality reduction

Dimensionality Reduction: Compress data points so that they lie in many fewer dimensions.

~

x1, . . . , ~xn∈ Rd →x˜1, . . . , ˜xn∈ Rm for m  d .

‘Lossy compression’ that still preserves important information about the relationships between ~x1, . . . , ~xn.

(18)

dimensionality reduction

Dimensionality Reduction: Compress data points so that they lie in many fewer dimensions.

~

x1, . . . , ~xn∈ Rd →x˜1, . . . , ˜xn∈ Rm for m  d .

‘Lossy compression’ that still preserves important information about the relationships between ~x1, . . . , ~xn.

Generally will not consider directly how well ˜xi approximates ~xi.

(19)

low distortion embedding

Low Distortion Embedding: Given ~x1, . . . , ~xn∈ Rd, distance function D, and error parameter  ≥ 0, find ˜x1, . . . , ˜xn∈ Rm(where m  d ) and distance function ˜D such that for all i , j ∈ [n]:

(1 − )D(~xi, ~xj) ≤ ˜D(˜xi, ˜xj) ≤ (1 + )D(~xi, ~xj).

Have already seen one example in class: MinHash.

With large enough signature size r , # matching entries in ˜xA, ˜xB

r ≈ J(~xA, ~xB).

(20)

low distortion embedding

Low Distortion Embedding: Given ~x1, . . . , ~xn∈ Rd, distance function D, and error parameter  ≥ 0, find ˜x1, . . . , ˜xn∈ Rm(where m  d ) and distance function ˜D such that for all i , j ∈ [n]:

(1 − )D(~xi, ~xj) ≤ ˜D(˜xi, ˜xj) ≤ (1 + )D(~xi, ~xj). Have already seen one example in class: MinHash.

With large enough signature size r , # matching entries in ˜xA, ˜xB

r ≈ J(~xA, ~xB).

• Note: here J(~xA, ~xB) is asimilarityrather than a distance. So this is not quite low distortion embedding, but is closely related.

(21)

low distortion embedding

Low Distortion Embedding: Given ~x1, . . . , ~xn∈ Rd, distance function D, and error parameter  ≥ 0, find ˜x1, . . . , ˜xn∈ Rm(where m  d ) and distance function ˜D such that for all i , j ∈ [n]:

(1 − )D(~xi, ~xj) ≤ ˜D(˜xi, ˜xj) ≤ (1 + )D(~xi, ~xj). Have already seen one example in class: MinHash.

With large enough signature size r , # matching entries in ˜xA, ˜xB

r ≈ J(~xA, ~xB).

(22)

low distortion embedding

Low Distortion Embedding: Given ~x1, . . . , ~xn∈ Rd, distance function D, and error parameter  ≥ 0, find ˜x1, . . . , ˜xn∈ Rm(where m  d ) and distance function ˜D such that for all i , j ∈ [n]:

(1 − )D(~xi, ~xj) ≤ ˜D(˜xi, ˜xj) ≤ (1 + )D(~xi, ~xj). Have already seen one example in class: MinHash.

With large enough signature size r , # matching entries in ˜xA, ˜xB

r ≈ J(~xA, ~xB).

• Note: here J(~xA, ~xB) is asimilarityrather than a distance. So this is not quite low distortion embedding, but is closely related.

(23)

low distortion embedding

Low Distortion Embedding: Given ~x1, . . . , ~xn∈ Rd, distance function D, and error parameter  ≥ 0, find ˜x1, . . . , ˜xn∈ Rm(where m  d ) and distance function ˜D such that for all i , j ∈ [n]:

(1 − )D(~xi, ~xj) ≤ ˜D(˜xi, ˜xj) ≤ (1 + )D(~xi, ~xj). Have already seen one example in class: MinHash.

With large enough signature size r , # matching entries in ˜xA, ˜xB

r ≈ J(~xA, ~xB).

(24)

low distortion embedding

Low Distortion Embedding: Given ~x1, . . . , ~xn∈ Rd, distance function D, and error parameter  ≥ 0, find ˜x1, . . . , ˜xn∈ Rm(where m  d ) and distance function ˜D such that for all i , j ∈ [n]:

(1 − )D(~xi, ~xj) ≤ ˜D(˜xi, ˜xj) ≤ (1 + )D(~xi, ~xj). Have already seen one example in class: MinHash.

With large enough signature size r , # matching entries in ˜xA, ˜xB

r ≈ J(~xA, ~xB).

• Note: here J(~xA, ~xB) is asimilarityrather than a distance. So this is not quite low distortion embedding, but is closely related.

(25)

embeddings for euclidean space

Euclidean Low Distortion Embedding: Given ~x1, . . . , ~xn∈ Rd and error parameter  ≥ 0, find ˜x1, . . . , ˜xn∈ Rm (where m  d ) such that for all i , j ∈ [n]:

(1 − )k~xi− ~xjk2≤ k˜xi− ˜xjk2≤ (1 + )k~xi− ~xjk2. Recall that for ~z ∈ Rn, k~zk

2= q

Pn i =1~z(i )2.

(26)

embeddings for euclidean space

Euclidean Low Distortion Embedding: Given ~x1, . . . , ~xn∈ Rd and error parameter  ≥ 0, find ˜x1, . . . , ˜xn∈ Rm (where m  d ) such that for all i , j ∈ [n]:

(1 − )k~xi− ~xjk2≤ k˜xi− ˜xjk2≤ (1 + )k~xi− ~xjk2. Recall that for ~z ∈ Rn, k~zk

2= q

Pn i =1~z(i )2.

Can use ˜x1, . . . , ˜xn in place of ~x1, . . . , ~xn in clustering, SVM, linear classification, near neighbor search, etc.

(27)

embeddings for euclidean space

Euclidean Low Distortion Embedding: Given ~x1, . . . , ~xn∈ Rd and error parameter  ≥ 0, find ˜x1, . . . , ˜xn∈ Rm (where m  d ) such that for all i , j ∈ [n]:

(1 − )k~xi− ~xjk2≤ k˜xi− ˜xjk2≤ (1 + )k~xi− ~xjk2.

(28)

embeddings for euclidean space

Euclidean Low Distortion Embedding: Given ~x1, . . . , ~xn∈ Rd and error parameter  ≥ 0, find ˜x1, . . . , ˜xn∈ Rm (where m  d ) such that for all i , j ∈ [n]:

(1 − )k~xi− ~xjk2≤ k˜xi− ˜xjk2≤ (1 + )k~xi− ~xjk2.

Can use ˜x1, . . . , ˜xn in place of ~x1, . . . , ~xn in clustering, SVM, linear

(29)

embedding with assumptions

A very easy case: Assume that ~x1, . . . , ~xn all lie on the 1st axis in Rd.

Set m = 1 and ˜xi= [~xi(1)] (i.e., ˜xi contains just a single number).

• k˜xi− ˜xjk2=p[~xi(1) − ~xj(1)]2= |~xi(1) − ~xj(1)| = k~xi− ~xjk2.

(30)

embedding with assumptions

A very easy case: Assume that ~x1, . . . , ~xn all lie on the 1st axis in Rd.

Set m = 1 and ˜xi= [~xi(1)] (i.e., ˜xi contains just a single number).

• k˜xi− ˜xjk2=p[~xi(1) − ~xj(1)]2= |~xi(1) − ~xj(1)| = k~xi− ~xjk2.

• An embedding withno distortionfrom any d into m = 1.

(31)

embedding with assumptions

A very easy case: Assume that ~x1, . . . , ~xn all lie on the 1st axis in Rd.

Set m = 1 and ˜xi= [~xi(1)] (i.e., ˜xi contains just a single number).

• k˜xi− ˜xjk2=p[~xi(1) − ~xj(1)]2= |~xi(1) − ~xj(1)| = k~xi− ~xjk2.

(32)

embedding with assumptions

When else might no (or low) distortion Euclidean embeddings into

very low-dimensional space exist?

(33)

embedding with no assumptions

What about when we don’t make any assumptions on ~

x

1

, . . . , ~

x

n

.

I.e., they can be scattered arbitrarily around d -dimensional space?

Can we find a no-distortion embedding into m  d dimensions?

No. Require m = d .

Can we find an -distortion embedding into m  d dimensions

for  > 0?

Yes! Always, with m depending on .

(34)

embedding with no assumptions

What about when we don’t make any assumptions on ~

x

1

, . . . , ~

x

n

.

I.e., they can be scattered arbitrarily around d -dimensional space?

Can we find a no-distortion embedding into m  d dimensions?

No. Require m = d .

Can we find an -distortion embedding into m  d dimensions

for  > 0?

Yes! Always, with m depending on .

For all i , j : (1 − )k~

x

i

− ~

x

j

k

2

≤ k˜

x

i

− ˜

x

j

k

2

≤ (1 + )k~

x

i

− ~

x

j

k

2

.

(35)

embedding with no assumptions

What about when we don’t make any assumptions on ~

x

1

, . . . , ~

x

n

.

I.e., they can be scattered arbitrarily around d -dimensional space?

Can we find a no-distortion embedding into m  d dimensions?

No. Require m = d .

Can we find an -distortion embedding into m  d dimensions

for  > 0?

Yes! Always, with m depending on .

(36)

embedding with no assumptions

What about when we don’t make any assumptions on ~

x

1

, . . . , ~

x

n

.

I.e., they can be scattered arbitrarily around d -dimensional space?

Can we find a no-distortion embedding into m  d dimensions?

No. Require m = d .

Can we find an -distortion embedding into m  d dimensions

for  > 0?

Yes! Always, with m depending on .

For all i , j : (1 − )k~

x

i

− ~

x

j

k

2

≤ k˜

x

i

− ˜

x

j

k

2

≤ (1 + )k~

x

i

− ~

x

j

k

2

.

(37)

the johnson-lindenstrauss lemma

Johnson-Lindenstrauss Lemma: For any set of points ~

x1, . . . , ~xn∈ Rd and  > 0 there exists a linear map Π : Rd → Rm

such that m = Olog n2



and letting ˜xi= Π~xi:

For all i , j : (1 − )k~xi− ~xjk2≤ k˜xi− ˜xjk2≤ (1 + )k~xi− ~xjk2.

Further, if Π ∈ Rm×d has each entry chosen i.i.d. from N (0, 1/m),

it satisfies the guarantee with high probability.

For d = 1 trillion,  = .05, and n = 100, 000, m ≈ 6600.

(38)

the johnson-lindenstrauss lemma

Johnson-Lindenstrauss Lemma: For any set of points ~

x1, . . . , ~xn∈ Rd and  > 0 there exists a linear map Π : Rd → Rm

such that m = Olog n2



and letting ˜xi= Π~xi:

For all i , j : (1 − )k~xi− ~xjk2≤ k˜xi− ˜xjk2≤ (1 + )k~xi− ~xjk2.

Further, if Π ∈ Rm×d has each entry chosen i.i.d. from N (0, 1/m),

it satisfies the guarantee with high probability.

For d = 1 trillion,  = .05, and n = 100, 000, m ≈ 6600.

Very surprising! Powerful result with a simple construction: applying a random linear transformation to a set of points preserves distances between all those points with high probability.

(39)

the johnson-lindenstrauss lemma

Johnson-Lindenstrauss Lemma: For any set of points ~

x1, . . . , ~xn∈ Rd and  > 0 there exists a linear map Π : Rd → Rm

such that m = Olog n2



and letting ˜xi= Π~xi:

For all i , j : (1 − )k~xi− ~xjk2≤ k˜xi− ˜xjk2≤ (1 + )k~xi− ~xjk2.

Further, if Π ∈ Rm×d has each entry chosen i.i.d. from N (0, 1/m),

it satisfies the guarantee with high probability.

For d = 1 trillion,  = .05, and n = 100, 000, m ≈ 6600.

(40)

random projection

For any ~x1, . . . , ~xnand Π ∈ Rm×d with each entry chosen i.i.d. from N (0, 1/m), with high probability, letting xi = Π~xi:

For all i , j : (1 − )k~xi− ~xjk2≤ kxi− xjk2≤ (1 + )k~xi− ~xjk2.

• Π is known as arandom projection. It is a random linear function, mapping length d vectors to length m vectors.

• Π isdata oblivious. Stark contrast to methods like PCA.

(41)

random projection

For any ~x1, . . . , ~xnand Π ∈ Rm×d with each entry chosen i.i.d. from N (0, 1/m), with high probability, letting xi = Π~xi:

For all i , j : (1 − )k~xi− ~xjk2≤ kxi− xjk2≤ (1 + )k~xi− ~xjk2.

• Π is known as arandom projection. It is a random linear function, mapping length d vectors to length m vectors.

(42)

random projection

For any ~x1, . . . , ~xnand Π ∈ Rm×d with each entry chosen i.i.d. from N (0, 1/m), with high probability, letting xi = Π~xi:

For all i , j : (1 − )k~xi− ~xjk2≤ kxi− xjk2≤ (1 + )k~xi− ~xjk2.

• Π is known as arandom projection. It is a random linear function, mapping length d vectors to length m vectors.

• Π isdata oblivious. Stark contrast to methods like PCA.

(43)

algorithmic considerations

Many alternative constructions: ±1 entries, sparse (most entries

0), Fourier structured, etc. =⇒ more efficient computation of

x

i

= Π~

x

i

.

Data oblivious property means that once Π is chosen, x

1

, . . . , x

n

can be computed in a stream with little memory.

Memory needed is just O(nm) vs. O(nd ) to store the full data

set.

Compression can also be easily performed in parallel on different

servers.

When new data points are added, can be easily compressed,

(44)

algorithmic considerations

Many alternative constructions: ±1 entries, sparse (most entries

0), Fourier structured, etc. =⇒ more efficient computation of

x

i

= Π~

x

i

.

Data oblivious property means that once Π is chosen, x

1

, . . . , x

n

can be computed in a stream with little memory.

Memory needed is just O(nm) vs. O(nd ) to store the full data

set.

Compression can also be easily performed in parallel on different

servers.

When new data points are added, can be easily compressed,

without updating existing points.

(45)

algorithmic considerations

Many alternative constructions: ±1 entries, sparse (most entries

0), Fourier structured, etc. =⇒ more efficient computation of

x

i

= Π~

x

i

.

Data oblivious property means that once Π is chosen, x

1

, . . . , x

n

can be computed in a stream with little memory.

Memory needed is just O(nm) vs. O(nd ) to store the full data

set.

Compression can also be easily performed in parallel on different

servers.

When new data points are added, can be easily compressed,

(46)

algorithmic considerations

Many alternative constructions: ±1 entries, sparse (most entries

0), Fourier structured, etc. =⇒ more efficient computation of

x

i

= Π~

x

i

.

Data oblivious property means that once Π is chosen, x

1

, . . . , x

n

can be computed in a stream with little memory.

Memory needed is just O(nm) vs. O(nd ) to store the full data

set.

Compression can also be easily performed in parallel on different

servers.

When new data points are added, can be easily compressed,

without updating existing points.

(47)

connection to simhash

Compression operation is xi = Π~xi, so for any j , xi(j ) = hΠ(j ), ~xii =

d X

k=1

Π(j , k) · ~xi(k).

Π(j ) is a vector with independent random Gaussian entries.

~

(48)

connection to simhash

Compression operation is xi = Π~xi, so for any j , xi(j ) = hΠ(j ), ~xii =

d X

k=1

Π(j , k) · ~xi(k). Π(j ) is a vector with independent random Gaussian entries.

~

x1, . . . , ~xn: original points (d dims.), x1, . . . , xn: compressed points (m < d

dims.), Π ∈ Rm×d: random projection (embedding function)

(49)

connection to simhash

Compression operation is xi = Π~xi, so for any j , xi(j ) = hΠ(j ), ~xii =

d X

k=1

Π(j , k) · ~xi(k). Π(j ) is a vector with independent random Gaussian entries.

Points with high cosine similarity have similar

random projections.

~

(50)

connection to simhash

Compression operation is xi = Π~xi, so for any j , xi(j ) = hΠ(j ), ~xii =

d X

k=1

Π(j , k) · ~xi(k). Π(j ) is a vector with independent random Gaussian entries.

Points with high cosine similarity have similar

random projections.

~

x1, . . . , ~xn: original points (d dims.), x1, . . . , xn: compressed points (m < d

dims.), Π ∈ Rm×d: random projection (embedding function)

(51)

connection to simhash

Compression operation is xi = Π~xi, so for any j , xi(j ) = hΠ(j ), ~xii =

d X

k=1

Π(j , k) · ~xi(k). Π(j ) is a vector with independent random Gaussian entries.

Points with high cosine similarity have similar

(52)

connection to simhash

Compression operation is xi = Π~xi, so for any j , xi(j ) = hΠ(j ), ~xii =

d X

k=1

Π(j , k) · ~xi(k). Π(j ) is a vector with independent random Gaussian entries.

Points with high cosine similarity have similar

random projections.

~

x1, . . . , ~xn: original points (d dims.), x1, . . . , xn: compressed points (m < d

dims.), Π ∈ Rm×d: random projection (embedding function)

(53)

connection to simhash

Compression operation is xi = Π~xi, so for any j , xi(j ) = hΠ(j ), ~xii =

d X

k=1

Π(j , k) · ~xi(k). Π(j ) is a vector with independent random Gaussian entries.

Points with high cosine similarity have similar

random projections.

~

(54)

connection to simhash

Compression operation is xi = Π~xi, so for any j , xi(j ) = hΠ(j ), ~xii =

d X

k=1

Π(j , k) · ~xi(k). Π(j ) is a vector with independent random Gaussian entries.

Points with high cosine similarity have similar

random projections. Computing a length m SimHash signature SH1(~xi), . . . , SHm(~xi) is identical to computing xi = Π~xi and then taking sign(xi).

~

x1, . . . , ~xn: original points (d dims.), x1, . . . , xn: compressed points (m < d

dims.), Π ∈ Rm×d: random projection (embedding function)

(55)

distributional jl

The Johnson-Lindenstrauss Lemma is a direct consequence of a closely related lemma:

Distributional JL Lemma: Let Π ∈ Rm×dhave each entry chosen i.i.d. as N (0, 1/m). If we set m = Olog(1/δ)2



, then for any ~

y ∈ Rd, with probability ≥ 1 − δ

(1 − )k~y k2≤ kΠ~y k2≤ (1 + )k~y k2

Applying a random matrix Π to any vector ~y preserves ~y ’s norm with high probability.

• Like a low-distortion embedding, but for the length of a compressed vector rather than distances between vectors.

(56)

distributional jl

The Johnson-Lindenstrauss Lemma is a direct consequence of a closely related lemma:

Distributional JL Lemma: Let Π ∈ Rm×dhave each entry chosen i.i.d. as N (0, 1/m). If we set m = Olog(1/δ)2



, then for any ~

y ∈ Rd, with probability ≥ 1 − δ

(1 − )k~y k2≤ kΠ~y k2≤ (1 + )k~y k2

Applying a random matrix Π to any vector ~y preserves ~y ’s norm with high probability.

• Like a low-distortion embedding, but for the length of a compressed vector rather than distances between vectors.

• Can be proven from first principles. Will see next class.

Π ∈ Rm×d: random projection matrix. d : original dimension. m: compressed dimension, : embedding error, δ: embedding failure prob.

(57)

distributional jl

The Johnson-Lindenstrauss Lemma is a direct consequence of a closely related lemma:

Distributional JL Lemma: Let Π ∈ Rm×dhave each entry chosen i.i.d. as N (0, 1/m). If we set m = Olog(1/δ)2



, then for any ~

y ∈ Rd, with probability ≥ 1 − δ

(1 − )k~y k2≤ kΠ~y k2≤ (1 + )k~y k2

Applying a random matrix Π to any vector ~y preserves ~y ’s norm with high probability.

• Like a low-distortion embedding, but for the length of a compressed vector rather than distances between vectors.

(58)

distributional jl

The Johnson-Lindenstrauss Lemma is a direct consequence of a closely related lemma:

Distributional JL Lemma: Let Π ∈ Rm×dhave each entry chosen i.i.d. as N (0, 1/m). If we set m = Olog(1/δ)2



, then for any ~

y ∈ Rd, with probability ≥ 1 − δ

(1 − )k~y k2≤ kΠ~y k2≤ (1 + )k~y k2

Applying a random matrix Π to any vector ~y preserves ~y ’s norm with high probability.

• Like a low-distortion embedding, but for the length of a compressed vector rather than distances between vectors.

• Can be proven from first principles. Will see next class.

Π ∈ Rm×d: random projection matrix. d : original dimension. m: compressed dimension, : embedding error, δ: embedding failure prob.

(59)

distributional jl =⇒ jl

Distributional JL Lemma =⇒ JL Lemma: Distributional JL show that a random projection Π preserves thenormof any y . The main JL Lemma says that Π preservesdistancesbetween vectors.

Since Π islinearthese are the same thing!

Proof: Given ~x1, . . . , ~xn, define n2 vectors ~yij where ~yij= ~xi− ~xj.

~

(60)

ran-distributional jl =⇒ jl

Distributional JL Lemma =⇒ JL Lemma: Distributional JL show that a random projection Π preserves thenormof any y . The main JL Lemma says that Π preservesdistancesbetween vectors.

Since Π islinearthese are the same thing!

Proof: Given ~x1, . . . , ~xn, define n2 vectors ~yij where ~yij= ~xi− ~xj.

~

x1, . . . , ~xn: original points, x1, . . . , xn: compressed points, Π ∈ Rm×d:

ran-dom projection matrix. d : original dimension. m: compressed dimension, :

(61)

distributional jl =⇒ jl

Distributional JL Lemma =⇒ JL Lemma: Distributional JL show that a random projection Π preserves thenormof any y . The main JL Lemma says that Π preservesdistancesbetween vectors.

Since Π islinearthese are the same thing!

Proof: Given ~x1, . . . , ~xn, define n2 vectors ~yij where ~yij= ~xi− ~xj.

~

(62)

ran-distributional jl =⇒ jl

Distributional JL Lemma =⇒ JL Lemma: Distributional JL show that a random projection Π preserves thenormof any y . The main JL Lemma says that Π preservesdistancesbetween vectors.

Since Π islinearthese are the same thing!

Proof: Given ~x1, . . . , ~xn, define n2 vectors ~yij where ~yij= ~xi− ~xj.

~

x1, . . . , ~xn: original points, x1, . . . , xn: compressed points, Π ∈ Rm×d:

ran-dom projection matrix. d : original dimension. m: compressed dimension, :

(63)

distributional jl =⇒ jl

Distributional JL Lemma =⇒ JL Lemma: Distributional JL show that a random projection Π preserves thenormof any y . The main JL Lemma says that Π preservesdistancesbetween vectors.

Since Π islinearthese are the same thing!

Proof: Given ~x1, . . . , ~xn, define n2 vectors ~yij where ~yij= ~xi− ~xj.

• If we choose Π with m = Olog 1/δ2



, for each ~yij with probability ≥ 1 − δ we have:

(1 − )k~yijk2≤ kΠ~yijk2≤ (1 + )k~yijk2

~

(64)

ran-distributional jl =⇒ jl

Distributional JL Lemma =⇒ JL Lemma: Distributional JL show that a random projection Π preserves thenormof any y . The main JL Lemma says that Π preservesdistancesbetween vectors.

Since Π islinearthese are the same thing!

Proof: Given ~x1, . . . , ~xn, define n2 vectors ~yij where ~yij= ~xi− ~xj.

• If we choose Π with m = Olog 1/δ2



, for each ~yij with probability ≥ 1 − δ we have:

(1 − )k~xi− ~xjk2≤ kΠ(~xi− ~xj)k2≤ (1 + )k~xi− ~xjk2

~

x1, . . . , ~xn: original points, x1, . . . , xn: compressed points, Π ∈ Rm×d:

ran-dom projection matrix. d : original dimension. m: compressed dimension, :

(65)

distributional jl =⇒ jl

Distributional JL Lemma =⇒ JL Lemma: Distributional JL show that a random projection Π preserves thenormof any y . The main JL Lemma says that Π preservesdistancesbetween vectors.

Since Π islinearthese are the same thing!

Proof: Given ~x1, . . . , ~xn, define n2 vectors ~yij where ~yij= ~xi− ~xj.

• If we choose Π with m = Olog 1/δ2



, for each ~yij with probability ≥ 1 − δ we have:

(1 − )k~xi− ~xjk2≤ kxi− xjk2≤ (1 + )k~xi− ~xjk2

~

(66)

ran-distributional jl =⇒ jl

Claim: If we choose Π with i.i.d. N (0, 1/m) entries and m = Olog(1/δ2 0)



, letting xi = Π~xi, for each pair ~xi, ~xj with probability ≥ 1 − δ0 we have:

(1 − )k~xi− ~xjk2≤ kxi− xjk2≤ (1 + )k~xi− ~xjk2.

With what probability are all pairwise distances preserved?

Union bound: With probability ≥ 1 − n2 · δ0 all pairwise distances are preserved.

Apply the claim with δ0= δ/ n2. =⇒ for m = Olog(1/δ2 0)



, all pairwise distances are preserved with probability ≥ 1 − δ.

m = O log(1/δ 0) 2  = O log( n 2/δ) 2 ! = O log(n 2/δ) 2  = O log(n/δ) 2  ~

x1, . . . , ~xn: original points, x1, . . . , xn: compressed points, Π ∈ Rm×d:

ran-dom projection matrix. d : original dimension. m: compressed dimension, : embedding error, δ: embedding failure prob.

(67)

distributional jl =⇒ jl

Claim: If we choose Π with i.i.d. N (0, 1/m) entries and m = Olog(1/δ2 0)



, letting xi = Π~xi, for each pair ~xi, ~xj with probability ≥ 1 − δ0 we have:

(1 − )k~xi− ~xjk2≤ kxi− xjk2≤ (1 + )k~xi− ~xjk2.

With what probability are all pairwise distances preserved?

Union bound: With probability ≥ 1 − n2 · δ0 all pairwise distances are preserved.

Apply the claim with δ0= δ/ n2. =⇒ for m = Olog(1/δ2 0)



, all pairwise distances are preserved with probability ≥ 1 − δ.

m = O log(1/δ 0) 2  = O log( n 2/δ) 2 ! = O log(n 2/δ) 2  = O log(n/δ) 2  ~

x1, . . . , ~xn: original points, x1, . . . , xn: compressed points, Π ∈ Rm×d:

(68)

distributional jl =⇒ jl

Claim: If we choose Π with i.i.d. N (0, 1/m) entries and m = Olog(1/δ2 0)



, letting xi = Π~xi, for each pair ~xi, ~xj with probability ≥ 1 − δ0 we have:

(1 − )k~xi− ~xjk2≤ kxi− xjk2≤ (1 + )k~xi− ~xjk2.

With what probability are all pairwise distances preserved?

Union bound: With probability ≥ 1 − n2 · δ0 all pairwise distances are preserved.

Apply the claim with δ0= δ/ n2. =⇒ for m = Olog(1/δ2 0)



, all pairwise distances are preserved with probability ≥ 1 − δ.

m = O log(1/δ 0) 2  = O log( n 2/δ) 2 ! = O log(n 2/δ) 2  = O log(n/δ) 2  ~

x1, . . . , ~xn: original points, x1, . . . , xn: compressed points, Π ∈ Rm×d:

ran-dom projection matrix. d : original dimension. m: compressed dimension, : embedding error, δ: embedding failure prob.

(69)

distributional jl =⇒ jl

Claim: If we choose Π with i.i.d. N (0, 1/m) entries and m = Olog(1/δ2 0)



, letting xi = Π~xi, for each pair ~xi, ~xj with probability ≥ 1 − δ0 we have:

(1 − )k~xi− ~xjk2≤ kxi− xjk2≤ (1 + )k~xi− ~xjk2.

With what probability are all pairwise distances preserved?

Union bound: With probability ≥ 1 − n2 · δ0 all pairwise distances are preserved.

Apply the claim with δ0= δ/ n2.

=⇒ for m = Olog(1/δ2 0)



, all pairwise distances are preserved with probability ≥ 1 − δ.

m = O log(1/δ 0) 2  = O log( n 2/δ) 2 ! = O log(n 2/δ) 2  = O log(n/δ) 2  ~

x1, . . . , ~xn: original points, x1, . . . , xn: compressed points, Π ∈ Rm×d:

(70)

distributional jl =⇒ jl

Claim: If we choose Π with i.i.d. N (0, 1/m) entries and m = Olog(1/δ2 0)



, letting xi = Π~xi, for each pair ~xi, ~xj with probability ≥ 1 − δ0 we have:

(1 − )k~xi− ~xjk2≤ kxi− xjk2≤ (1 + )k~xi− ~xjk2.

With what probability are all pairwise distances preserved?

Union bound: With probability ≥ 1 − n2 · δ0 all pairwise distances are preserved.

Apply the claim with δ0= δ/ n2. =⇒ for m = Olog(1/δ2 0)



, all pairwise distances are preserved with probability ≥ 1 − δ.

m = O log(1/δ 0) 2  = O log( n 2/δ) 2 ! = O log(n 2/δ) 2  = O log(n/δ) 2  ~

x1, . . . , ~xn: original points, x1, . . . , xn: compressed points, Π ∈ Rm×d:

ran-dom projection matrix. d : original dimension. m: compressed dimension, : embedding error, δ: embedding failure prob.

(71)

distributional jl =⇒ jl

Claim: If we choose Π with i.i.d. N (0, 1/m) entries and m = Olog(1/δ2 0)



, letting xi = Π~xi, for each pair ~xi, ~xj with probability ≥ 1 − δ0 we have:

(1 − )k~xi− ~xjk2≤ kxi− xjk2≤ (1 + )k~xi− ~xjk2.

With what probability are all pairwise distances preserved?

Union bound: With probability ≥ 1 − n2 · δ0 all pairwise distances are preserved.

Apply the claim with δ0= δ/ n2. =⇒ for m = Olog(1/δ2 0)



, all pairwise distances are preserved with probability ≥ 1 − δ.

m = O log(1/δ 0) 2  = O log( n 2/δ) 2 ! = O log(n 2/δ) 2  = O log(n/δ) 2  ~

x1, . . . , ~xn: original points, x1, . . . , xn: compressed points, Π ∈ Rm×d:

(72)

distributional jl =⇒ jl

Claim: If we choose Π with i.i.d. N (0, 1/m) entries and m = Olog(1/δ2 0)



, letting xi = Π~xi, for each pair ~xi, ~xj with probability ≥ 1 − δ0 we have:

(1 − )k~xi− ~xjk2≤ kxi− xjk2≤ (1 + )k~xi− ~xjk2.

With what probability are all pairwise distances preserved?

Union bound: With probability ≥ 1 − n2 · δ0 all pairwise distances are preserved.

Apply the claim with δ0= δ/ n2. =⇒ for m = Olog(1/δ2 0)



, all pairwise distances are preserved with probability ≥ 1 − δ.

m = O log(1/δ 0) 2  = O log( n 2/δ) 2 ! = O log(n 2/δ) 2  = O log(n/δ) 2  ~

x1, . . . , ~xn: original points, x1, . . . , xn: compressed points, Π ∈ Rm×d:

ran-dom projection matrix. d : original dimension. m: compressed dimension, : embedding error, δ: embedding failure prob.

(73)

distributional jl =⇒ jl

Claim: If we choose Π with i.i.d. N (0, 1/m) entries and m = Olog(1/δ2 0)



, letting xi = Π~xi, for each pair ~xi, ~xj with probability ≥ 1 − δ0 we have:

(1 − )k~xi− ~xjk2≤ kxi− xjk2≤ (1 + )k~xi− ~xjk2.

With what probability are all pairwise distances preserved?

Union bound: With probability ≥ 1 − n2 · δ0 all pairwise distances are preserved.

Apply the claim with δ0= δ/ n2. =⇒ for m = Olog(1/δ2 0)



, all pairwise distances are preserved with probability ≥ 1 − δ.

m = O log(1/δ 0) 2  = O log( n 2/δ) 2 ! = O log(n 2/δ) 2  = O log(n/δ) 2  ~

x1, . . . , ~xn: original points, x1, . . . , xn: compressed points, Π ∈ Rm×d:

(74)

distributional jl =⇒ jl

Claim: If we choose Π with i.i.d. N (0, 1/m) entries and m = Olog(1/δ2 0)



, letting xi = Π~xi, for each pair ~xi, ~xj with probability ≥ 1 − δ0 we have:

(1 − )k~xi− ~xjk2≤ kxi− xjk2≤ (1 + )k~xi− ~xjk2.

With what probability are all pairwise distances preserved?

Union bound: With probability ≥ 1 − n2 · δ0 all pairwise distances are preserved.

Apply the claim with δ0= δ/ n2. =⇒ for m = Olog(1/δ2 0)



, all pairwise distances are preserved with probability ≥ 1 − δ.

m = O log(1/δ 0) 2  = O log( n 2/δ) 2 ! = O log(n 2/δ) 2  = O log(n/δ) 2  ~

x1, . . . , ~xn: original points, x1, . . . , xn: compressed points, Π ∈ Rm×d:

ran-dom projection matrix. d : original dimension. m: compressed dimension, : embedding error, δ: embedding failure prob.

(75)

distributional jl =⇒ jl

Claim: If we choose Π with i.i.d. N (0, 1/m) entries and m = Olog(1/δ2 0)



, letting xi = Π~xi, for each pair ~xi, ~xj with probability ≥ 1 − δ0 we have:

(1 − )k~xi− ~xjk2≤ kxi− xjk2≤ (1 + )k~xi− ~xjk2.

With what probability are all pairwise distances preserved?

Union bound: With probability ≥ 1 − n2 · δ0 all pairwise distances are preserved.

Apply the claim with δ0= δ/ n2. =⇒ for m = Olog(1/δ2 0)



, all pairwise distances are preserved with probability ≥ 1 − δ.

m = O log(1/δ 0) 2  = O log( n 2/δ) 2 ! = O log(n 2/δ) 2  = O log(n/δ) 2 

Yields the JL lemma.

~

x, . . . , ~x : original points, x , . . . , x : compressed points, Π ∈ Rm×d:

(76)

Summary:

References

Related documents

Perancangan corporate identity ini ditujukan untuk menghasilkan sebuah identitas perusahaan bagi Catterpie kids clothing yang dapat menyampaikan citra perusahaan

When asked about need for supervision in performing certain surgical procedures at the completion of surgical residency training, 100% of the fellows reported that they felt

In comparing the preferred sources of coaching knowledge for coaches who wanted to stay at their current developmental level versus those who wished to coach at a higher

We then discuss optimal student credit arrangements that balance three important objectives: (i) providing credit for students to access college and finance consumption while in

Conceptualizing advanced nursing practice: Curriculum issues to consider in the educational preparation of advanced practice nurses in

This attitude is seen in other countries’ sexuality education programs: For example, youths in other industrialized countries receive unambiguous messages that sexual activity is

The essay will follow the standard structure of research papers in political science and economics, starting with an introduction, a section detailing the theory (or the