compsci
514:
algorithms for data science
summary
Last Class:
•
Introduced the k-frequent elements problem – identify all
elements of a stream of n elements that occur ≥ n/k times.
•
Saw how to solve approximately using the Count-min sketch
algorithm. Based on random hashing into buckets containing
approximate counts.
This Class:
•
Randomized methods for dimensionality reduction.
•
The Johnson-Lindenstrauss Lemma.
summary
Last Class:
•
Introduced the k-frequent elements problem – identify all
elements of a stream of n elements that occur ≥ n/k times.
•
Saw how to solve approximately using the Count-min sketch
algorithm. Based on random hashing into buckets containing
approximate counts.
This Class:
summary
Last Class:
•
Introduced the k-frequent elements problem – identify all
elements of a stream of n elements that occur ≥ n/k times.
•
Saw how to solve approximately using the Count-min sketch
algorithm. Based on random hashing into buckets containing
approximate counts.
This Class:
•
Randomized methods for dimensionality reduction.
•
The Johnson-Lindenstrauss Lemma.
high dimensional data
‘Big Data’ means not just many data points, but many measurements per data point. I.e., veryhigh dimensional data.
• Twitter has 321 million active monthly users. Records(tens of) thousands of measurements per user: who they follow, who follows them, when they last visited the site, timestamps for specific interactions, how many tweets they have sent, the text of those tweets, etc.
• A 3 minute Youtube clip with a resolution of 500 × 500 pixels at 15 frames/second with 3 color channels is a recording of≥ 2 billion pixel values. Even a 500 × 500 pixel color image has 750, 000 pixel values. • The human genome contains 3 billion+ base pairs. Genetic datasets
high dimensional data
‘Big Data’ means not just many data points, but many measurements per data point. I.e., veryhigh dimensional data.
• Twitter has 321 million active monthly users. Records(tens of) thousands of measurements per user: who they follow, who follows them, when they last visited the site, timestamps for specific interactions, how many tweets they have sent, the text of those tweets, etc.
• A 3 minute Youtube clip with a resolution of 500 × 500 pixels at 15 frames/second with 3 color channels is a recording of≥ 2 billion pixel values. Even a 500 × 500 pixel color image has 750, 000 pixel values. • The human genome contains 3 billion+ base pairs. Genetic datasets
often contain information on100s of thousands+ mutations and genetic markers.
high dimensional data
‘Big Data’ means not just many data points, but many measurements per data point. I.e., veryhigh dimensional data.
• Twitter has 321 million active monthly users. Records(tens of) thousands of measurements per user: who they follow, who follows them, when they last visited the site, timestamps for specific interactions, how many tweets they have sent, the text of those tweets, etc.
• A 3 minute Youtube clip with a resolution of 500 × 500 pixels at 15 frames/second with 3 color channels is a recording of≥ 2 billion pixel values. Even a 500 × 500 pixel color image has 750, 000 pixel values.
high dimensional data
‘Big Data’ means not just many data points, but many measurements per data point. I.e., veryhigh dimensional data.
• Twitter has 321 million active monthly users. Records(tens of) thousands of measurements per user: who they follow, who follows them, when they last visited the site, timestamps for specific interactions, how many tweets they have sent, the text of those tweets, etc.
• A 3 minute Youtube clip with a resolution of 500 × 500 pixels at 15 frames/second with 3 color channels is a recording of≥ 2 billion pixel values. Even a 500 × 500 pixel color image has 750, 000 pixel values. • The human genome contains 3 billion+ base pairs. Genetic datasets
often contain information on100s of thousands+ mutations and genetic markers.
data as vectors and matrices
In data analysis and machine learning, data points with many
attributes are often stored, processed, and interpreted as
high
dimensional vectors
, with real valued entries.
Similarities/distances between vectors
(e.g., hx , y i, kx − y k
2) have meaning
data as vectors and matrices
In data analysis and machine learning, data points with many
attributes are often stored, processed, and interpreted as
high
dimensional vectors
, with real valued entries.
Similarities/distances between vectors
(e.g., hx , y i, kx − y k
2) have meaning
for underlying data points.
data as vectors and matrices
In data analysis and machine learning, data points with many
attributes are often stored, processed, and interpreted as
high
dimensional vectors
, with real valued entries.
Similarities/distances between vectors
(e.g., hx , y i, kx − y k
2) have meaning
datasets as vectors and matrices
Data points are interpreted as
high dimensional vectors
, with real
valued entries. Data set is interpreted as a matrix.
Data Points: ~
x
1, ~
x
2, . . . , ~
x
n∈ R
d.
Data Set: X ∈ R
n×dwith i
throws equal to ~
x
i.
Many data points n =⇒ tall. Many dimensions d =⇒ wide.
datasets as vectors and matrices
Data points are interpreted as
high dimensional vectors
, with real
valued entries. Data set is interpreted as a matrix.
Data Points: ~
x
1, ~
x
2, . . . , ~
x
n∈ R
d.
Data Set: X ∈ R
n×dwith i
throws equal to ~
x
i.
datasets as vectors and matrices
Data points are interpreted as
high dimensional vectors
, with real
valued entries. Data set is interpreted as a matrix.
Data Points: ~
x
1, ~
x
2, . . . , ~
x
n∈ R
d.
Data Set: X ∈ R
n×dwith i
throws equal to ~
x
i.
Many data points n =⇒ tall. Many dimensions d =⇒ wide.
dimensionality reduction
Dimensionality Reduction: Compress data points so that they lie in many fewer dimensions.
~
x1, . . . , ~xn∈ Rd →x˜1, . . . , ˜xn∈ Rm for m d .
‘Lossy compression’ that still preserves important information about the relationships between ~x1, . . . , ~xn.
dimensionality reduction
Dimensionality Reduction: Compress data points so that they lie in many fewer dimensions.
~
x1, . . . , ~xn∈ Rd →x˜1, . . . , ˜xn∈ Rm for m d .
‘Lossy compression’ that still preserves important information about the relationships between ~x1, . . . , ~xn.
Generally will not consider directly how well ˜xi approximates ~xi.
dimensionality reduction
Dimensionality Reduction: Compress data points so that they lie in many fewer dimensions.
~
x1, . . . , ~xn∈ Rd →x˜1, . . . , ˜xn∈ Rm for m d .
‘Lossy compression’ that still preserves important information about the relationships between ~x1, . . . , ~xn.
dimensionality reduction
Dimensionality Reduction: Compress data points so that they lie in many fewer dimensions.
~
x1, . . . , ~xn∈ Rd →x˜1, . . . , ˜xn∈ Rm for m d .
‘Lossy compression’ that still preserves important information about the relationships between ~x1, . . . , ~xn.
Generally will not consider directly how well ˜xi approximates ~xi.
low distortion embedding
Low Distortion Embedding: Given ~x1, . . . , ~xn∈ Rd, distance function D, and error parameter ≥ 0, find ˜x1, . . . , ˜xn∈ Rm(where m d ) and distance function ˜D such that for all i , j ∈ [n]:
(1 − )D(~xi, ~xj) ≤ ˜D(˜xi, ˜xj) ≤ (1 + )D(~xi, ~xj).
Have already seen one example in class: MinHash.
With large enough signature size r , # matching entries in ˜xA, ˜xB
r ≈ J(~xA, ~xB).
low distortion embedding
Low Distortion Embedding: Given ~x1, . . . , ~xn∈ Rd, distance function D, and error parameter ≥ 0, find ˜x1, . . . , ˜xn∈ Rm(where m d ) and distance function ˜D such that for all i , j ∈ [n]:
(1 − )D(~xi, ~xj) ≤ ˜D(˜xi, ˜xj) ≤ (1 + )D(~xi, ~xj). Have already seen one example in class: MinHash.
With large enough signature size r , # matching entries in ˜xA, ˜xB
r ≈ J(~xA, ~xB).
• Note: here J(~xA, ~xB) is asimilarityrather than a distance. So this is not quite low distortion embedding, but is closely related.
low distortion embedding
Low Distortion Embedding: Given ~x1, . . . , ~xn∈ Rd, distance function D, and error parameter ≥ 0, find ˜x1, . . . , ˜xn∈ Rm(where m d ) and distance function ˜D such that for all i , j ∈ [n]:
(1 − )D(~xi, ~xj) ≤ ˜D(˜xi, ˜xj) ≤ (1 + )D(~xi, ~xj). Have already seen one example in class: MinHash.
With large enough signature size r , # matching entries in ˜xA, ˜xB
r ≈ J(~xA, ~xB).
low distortion embedding
Low Distortion Embedding: Given ~x1, . . . , ~xn∈ Rd, distance function D, and error parameter ≥ 0, find ˜x1, . . . , ˜xn∈ Rm(where m d ) and distance function ˜D such that for all i , j ∈ [n]:
(1 − )D(~xi, ~xj) ≤ ˜D(˜xi, ˜xj) ≤ (1 + )D(~xi, ~xj). Have already seen one example in class: MinHash.
With large enough signature size r , # matching entries in ˜xA, ˜xB
r ≈ J(~xA, ~xB).
• Note: here J(~xA, ~xB) is asimilarityrather than a distance. So this is not quite low distortion embedding, but is closely related.
low distortion embedding
Low Distortion Embedding: Given ~x1, . . . , ~xn∈ Rd, distance function D, and error parameter ≥ 0, find ˜x1, . . . , ˜xn∈ Rm(where m d ) and distance function ˜D such that for all i , j ∈ [n]:
(1 − )D(~xi, ~xj) ≤ ˜D(˜xi, ˜xj) ≤ (1 + )D(~xi, ~xj). Have already seen one example in class: MinHash.
With large enough signature size r , # matching entries in ˜xA, ˜xB
r ≈ J(~xA, ~xB).
low distortion embedding
Low Distortion Embedding: Given ~x1, . . . , ~xn∈ Rd, distance function D, and error parameter ≥ 0, find ˜x1, . . . , ˜xn∈ Rm(where m d ) and distance function ˜D such that for all i , j ∈ [n]:
(1 − )D(~xi, ~xj) ≤ ˜D(˜xi, ˜xj) ≤ (1 + )D(~xi, ~xj). Have already seen one example in class: MinHash.
With large enough signature size r , # matching entries in ˜xA, ˜xB
r ≈ J(~xA, ~xB).
• Note: here J(~xA, ~xB) is asimilarityrather than a distance. So this is not quite low distortion embedding, but is closely related.
embeddings for euclidean space
Euclidean Low Distortion Embedding: Given ~x1, . . . , ~xn∈ Rd and error parameter ≥ 0, find ˜x1, . . . , ˜xn∈ Rm (where m d ) such that for all i , j ∈ [n]:
(1 − )k~xi− ~xjk2≤ k˜xi− ˜xjk2≤ (1 + )k~xi− ~xjk2. Recall that for ~z ∈ Rn, k~zk
2= q
Pn i =1~z(i )2.
embeddings for euclidean space
Euclidean Low Distortion Embedding: Given ~x1, . . . , ~xn∈ Rd and error parameter ≥ 0, find ˜x1, . . . , ˜xn∈ Rm (where m d ) such that for all i , j ∈ [n]:
(1 − )k~xi− ~xjk2≤ k˜xi− ˜xjk2≤ (1 + )k~xi− ~xjk2. Recall that for ~z ∈ Rn, k~zk
2= q
Pn i =1~z(i )2.
Can use ˜x1, . . . , ˜xn in place of ~x1, . . . , ~xn in clustering, SVM, linear classification, near neighbor search, etc.
embeddings for euclidean space
Euclidean Low Distortion Embedding: Given ~x1, . . . , ~xn∈ Rd and error parameter ≥ 0, find ˜x1, . . . , ˜xn∈ Rm (where m d ) such that for all i , j ∈ [n]:
(1 − )k~xi− ~xjk2≤ k˜xi− ˜xjk2≤ (1 + )k~xi− ~xjk2.
embeddings for euclidean space
Euclidean Low Distortion Embedding: Given ~x1, . . . , ~xn∈ Rd and error parameter ≥ 0, find ˜x1, . . . , ˜xn∈ Rm (where m d ) such that for all i , j ∈ [n]:
(1 − )k~xi− ~xjk2≤ k˜xi− ˜xjk2≤ (1 + )k~xi− ~xjk2.
Can use ˜x1, . . . , ˜xn in place of ~x1, . . . , ~xn in clustering, SVM, linear
embedding with assumptions
A very easy case: Assume that ~x1, . . . , ~xn all lie on the 1st axis in Rd.
Set m = 1 and ˜xi= [~xi(1)] (i.e., ˜xi contains just a single number).
• k˜xi− ˜xjk2=p[~xi(1) − ~xj(1)]2= |~xi(1) − ~xj(1)| = k~xi− ~xjk2.
embedding with assumptions
A very easy case: Assume that ~x1, . . . , ~xn all lie on the 1st axis in Rd.
Set m = 1 and ˜xi= [~xi(1)] (i.e., ˜xi contains just a single number).
• k˜xi− ˜xjk2=p[~xi(1) − ~xj(1)]2= |~xi(1) − ~xj(1)| = k~xi− ~xjk2.
• An embedding withno distortionfrom any d into m = 1.
embedding with assumptions
A very easy case: Assume that ~x1, . . . , ~xn all lie on the 1st axis in Rd.
Set m = 1 and ˜xi= [~xi(1)] (i.e., ˜xi contains just a single number).
• k˜xi− ˜xjk2=p[~xi(1) − ~xj(1)]2= |~xi(1) − ~xj(1)| = k~xi− ~xjk2.
embedding with assumptions
When else might no (or low) distortion Euclidean embeddings into
very low-dimensional space exist?
embedding with no assumptions
What about when we don’t make any assumptions on ~
x
1, . . . , ~
x
n.
I.e., they can be scattered arbitrarily around d -dimensional space?
•
Can we find a no-distortion embedding into m d dimensions?
No. Require m = d .
•
Can we find an -distortion embedding into m d dimensions
for > 0?
Yes! Always, with m depending on .
embedding with no assumptions
What about when we don’t make any assumptions on ~
x
1, . . . , ~
x
n.
I.e., they can be scattered arbitrarily around d -dimensional space?
•
Can we find a no-distortion embedding into m d dimensions?
No. Require m = d .
•
Can we find an -distortion embedding into m d dimensions
for > 0?
Yes! Always, with m depending on .
For all i , j : (1 − )k~
x
i− ~
x
jk
2≤ k˜
x
i− ˜
x
jk
2≤ (1 + )k~
x
i− ~
x
jk
2.
embedding with no assumptions
What about when we don’t make any assumptions on ~
x
1, . . . , ~
x
n.
I.e., they can be scattered arbitrarily around d -dimensional space?
•
Can we find a no-distortion embedding into m d dimensions?
No. Require m = d .
•
Can we find an -distortion embedding into m d dimensions
for > 0?
Yes! Always, with m depending on .
embedding with no assumptions
What about when we don’t make any assumptions on ~
x
1, . . . , ~
x
n.
I.e., they can be scattered arbitrarily around d -dimensional space?
•
Can we find a no-distortion embedding into m d dimensions?
No. Require m = d .
•
Can we find an -distortion embedding into m d dimensions
for > 0?
Yes! Always, with m depending on .
For all i , j : (1 − )k~
x
i− ~
x
jk
2≤ k˜
x
i− ˜
x
jk
2≤ (1 + )k~
x
i− ~
x
jk
2.
the johnson-lindenstrauss lemma
Johnson-Lindenstrauss Lemma: For any set of points ~
x1, . . . , ~xn∈ Rd and > 0 there exists a linear map Π : Rd → Rm
such that m = Olog n2
and letting ˜xi= Π~xi:
For all i , j : (1 − )k~xi− ~xjk2≤ k˜xi− ˜xjk2≤ (1 + )k~xi− ~xjk2.
Further, if Π ∈ Rm×d has each entry chosen i.i.d. from N (0, 1/m),
it satisfies the guarantee with high probability.
For d = 1 trillion, = .05, and n = 100, 000, m ≈ 6600.
the johnson-lindenstrauss lemma
Johnson-Lindenstrauss Lemma: For any set of points ~
x1, . . . , ~xn∈ Rd and > 0 there exists a linear map Π : Rd → Rm
such that m = Olog n2
and letting ˜xi= Π~xi:
For all i , j : (1 − )k~xi− ~xjk2≤ k˜xi− ˜xjk2≤ (1 + )k~xi− ~xjk2.
Further, if Π ∈ Rm×d has each entry chosen i.i.d. from N (0, 1/m),
it satisfies the guarantee with high probability.
For d = 1 trillion, = .05, and n = 100, 000, m ≈ 6600.
Very surprising! Powerful result with a simple construction: applying a random linear transformation to a set of points preserves distances between all those points with high probability.
the johnson-lindenstrauss lemma
Johnson-Lindenstrauss Lemma: For any set of points ~
x1, . . . , ~xn∈ Rd and > 0 there exists a linear map Π : Rd → Rm
such that m = Olog n2
and letting ˜xi= Π~xi:
For all i , j : (1 − )k~xi− ~xjk2≤ k˜xi− ˜xjk2≤ (1 + )k~xi− ~xjk2.
Further, if Π ∈ Rm×d has each entry chosen i.i.d. from N (0, 1/m),
it satisfies the guarantee with high probability.
For d = 1 trillion, = .05, and n = 100, 000, m ≈ 6600.
random projection
For any ~x1, . . . , ~xnand Π ∈ Rm×d with each entry chosen i.i.d. from N (0, 1/m), with high probability, letting xi = Π~xi:
For all i , j : (1 − )k~xi− ~xjk2≤ kxi− xjk2≤ (1 + )k~xi− ~xjk2.
• Π is known as arandom projection. It is a random linear function, mapping length d vectors to length m vectors.
• Π isdata oblivious. Stark contrast to methods like PCA.
random projection
For any ~x1, . . . , ~xnand Π ∈ Rm×d with each entry chosen i.i.d. from N (0, 1/m), with high probability, letting xi = Π~xi:
For all i , j : (1 − )k~xi− ~xjk2≤ kxi− xjk2≤ (1 + )k~xi− ~xjk2.
• Π is known as arandom projection. It is a random linear function, mapping length d vectors to length m vectors.
random projection
For any ~x1, . . . , ~xnand Π ∈ Rm×d with each entry chosen i.i.d. from N (0, 1/m), with high probability, letting xi = Π~xi:
For all i , j : (1 − )k~xi− ~xjk2≤ kxi− xjk2≤ (1 + )k~xi− ~xjk2.
• Π is known as arandom projection. It is a random linear function, mapping length d vectors to length m vectors.
• Π isdata oblivious. Stark contrast to methods like PCA.
algorithmic considerations
•
Many alternative constructions: ±1 entries, sparse (most entries
0), Fourier structured, etc. =⇒ more efficient computation of
x
i= Π~
x
i.
•
Data oblivious property means that once Π is chosen, x
1, . . . , x
ncan be computed in a stream with little memory.
•
Memory needed is just O(nm) vs. O(nd ) to store the full data
set.
•
Compression can also be easily performed in parallel on different
servers.
•
When new data points are added, can be easily compressed,
algorithmic considerations
•
Many alternative constructions: ±1 entries, sparse (most entries
0), Fourier structured, etc. =⇒ more efficient computation of
x
i= Π~
x
i.
•
Data oblivious property means that once Π is chosen, x
1, . . . , x
ncan be computed in a stream with little memory.
•
Memory needed is just O(nm) vs. O(nd ) to store the full data
set.
•
Compression can also be easily performed in parallel on different
servers.
•
When new data points are added, can be easily compressed,
without updating existing points.
algorithmic considerations
•
Many alternative constructions: ±1 entries, sparse (most entries
0), Fourier structured, etc. =⇒ more efficient computation of
x
i= Π~
x
i.
•
Data oblivious property means that once Π is chosen, x
1, . . . , x
ncan be computed in a stream with little memory.
•
Memory needed is just O(nm) vs. O(nd ) to store the full data
set.
•
Compression can also be easily performed in parallel on different
servers.
•
When new data points are added, can be easily compressed,
algorithmic considerations
•
Many alternative constructions: ±1 entries, sparse (most entries
0), Fourier structured, etc. =⇒ more efficient computation of
x
i= Π~
x
i.
•
Data oblivious property means that once Π is chosen, x
1, . . . , x
ncan be computed in a stream with little memory.
•
Memory needed is just O(nm) vs. O(nd ) to store the full data
set.
•
Compression can also be easily performed in parallel on different
servers.
•
When new data points are added, can be easily compressed,
without updating existing points.
connection to simhash
Compression operation is xi = Π~xi, so for any j , xi(j ) = hΠ(j ), ~xii =
d X
k=1
Π(j , k) · ~xi(k).
Π(j ) is a vector with independent random Gaussian entries.
~
connection to simhash
Compression operation is xi = Π~xi, so for any j , xi(j ) = hΠ(j ), ~xii =
d X
k=1
Π(j , k) · ~xi(k). Π(j ) is a vector with independent random Gaussian entries.
~
x1, . . . , ~xn: original points (d dims.), x1, . . . , xn: compressed points (m < d
dims.), Π ∈ Rm×d: random projection (embedding function)
connection to simhash
Compression operation is xi = Π~xi, so for any j , xi(j ) = hΠ(j ), ~xii =
d X
k=1
Π(j , k) · ~xi(k). Π(j ) is a vector with independent random Gaussian entries.
Points with high cosine similarity have similar
random projections.
~
connection to simhash
Compression operation is xi = Π~xi, so for any j , xi(j ) = hΠ(j ), ~xii =
d X
k=1
Π(j , k) · ~xi(k). Π(j ) is a vector with independent random Gaussian entries.
Points with high cosine similarity have similar
random projections.
~
x1, . . . , ~xn: original points (d dims.), x1, . . . , xn: compressed points (m < d
dims.), Π ∈ Rm×d: random projection (embedding function)
connection to simhash
Compression operation is xi = Π~xi, so for any j , xi(j ) = hΠ(j ), ~xii =
d X
k=1
Π(j , k) · ~xi(k). Π(j ) is a vector with independent random Gaussian entries.
Points with high cosine similarity have similar
connection to simhash
Compression operation is xi = Π~xi, so for any j , xi(j ) = hΠ(j ), ~xii =
d X
k=1
Π(j , k) · ~xi(k). Π(j ) is a vector with independent random Gaussian entries.
Points with high cosine similarity have similar
random projections.
~
x1, . . . , ~xn: original points (d dims.), x1, . . . , xn: compressed points (m < d
dims.), Π ∈ Rm×d: random projection (embedding function)
connection to simhash
Compression operation is xi = Π~xi, so for any j , xi(j ) = hΠ(j ), ~xii =
d X
k=1
Π(j , k) · ~xi(k). Π(j ) is a vector with independent random Gaussian entries.
Points with high cosine similarity have similar
random projections.
~
connection to simhash
Compression operation is xi = Π~xi, so for any j , xi(j ) = hΠ(j ), ~xii =
d X
k=1
Π(j , k) · ~xi(k). Π(j ) is a vector with independent random Gaussian entries.
Points with high cosine similarity have similar
random projections. Computing a length m SimHash signature SH1(~xi), . . . , SHm(~xi) is identical to computing xi = Π~xi and then taking sign(xi).
~
x1, . . . , ~xn: original points (d dims.), x1, . . . , xn: compressed points (m < d
dims.), Π ∈ Rm×d: random projection (embedding function)
distributional jl
The Johnson-Lindenstrauss Lemma is a direct consequence of a closely related lemma:
Distributional JL Lemma: Let Π ∈ Rm×dhave each entry chosen i.i.d. as N (0, 1/m). If we set m = Olog(1/δ)2
, then for any ~
y ∈ Rd, with probability ≥ 1 − δ
(1 − )k~y k2≤ kΠ~y k2≤ (1 + )k~y k2
Applying a random matrix Π to any vector ~y preserves ~y ’s norm with high probability.
• Like a low-distortion embedding, but for the length of a compressed vector rather than distances between vectors.
distributional jl
The Johnson-Lindenstrauss Lemma is a direct consequence of a closely related lemma:
Distributional JL Lemma: Let Π ∈ Rm×dhave each entry chosen i.i.d. as N (0, 1/m). If we set m = Olog(1/δ)2
, then for any ~
y ∈ Rd, with probability ≥ 1 − δ
(1 − )k~y k2≤ kΠ~y k2≤ (1 + )k~y k2
Applying a random matrix Π to any vector ~y preserves ~y ’s norm with high probability.
• Like a low-distortion embedding, but for the length of a compressed vector rather than distances between vectors.
• Can be proven from first principles. Will see next class.
Π ∈ Rm×d: random projection matrix. d : original dimension. m: compressed dimension, : embedding error, δ: embedding failure prob.
distributional jl
The Johnson-Lindenstrauss Lemma is a direct consequence of a closely related lemma:
Distributional JL Lemma: Let Π ∈ Rm×dhave each entry chosen i.i.d. as N (0, 1/m). If we set m = Olog(1/δ)2
, then for any ~
y ∈ Rd, with probability ≥ 1 − δ
(1 − )k~y k2≤ kΠ~y k2≤ (1 + )k~y k2
Applying a random matrix Π to any vector ~y preserves ~y ’s norm with high probability.
• Like a low-distortion embedding, but for the length of a compressed vector rather than distances between vectors.
distributional jl
The Johnson-Lindenstrauss Lemma is a direct consequence of a closely related lemma:
Distributional JL Lemma: Let Π ∈ Rm×dhave each entry chosen i.i.d. as N (0, 1/m). If we set m = Olog(1/δ)2
, then for any ~
y ∈ Rd, with probability ≥ 1 − δ
(1 − )k~y k2≤ kΠ~y k2≤ (1 + )k~y k2
Applying a random matrix Π to any vector ~y preserves ~y ’s norm with high probability.
• Like a low-distortion embedding, but for the length of a compressed vector rather than distances between vectors.
• Can be proven from first principles. Will see next class.
Π ∈ Rm×d: random projection matrix. d : original dimension. m: compressed dimension, : embedding error, δ: embedding failure prob.
distributional jl =⇒ jl
Distributional JL Lemma =⇒ JL Lemma: Distributional JL show that a random projection Π preserves thenormof any y . The main JL Lemma says that Π preservesdistancesbetween vectors.
Since Π islinearthese are the same thing!
Proof: Given ~x1, . . . , ~xn, define n2 vectors ~yij where ~yij= ~xi− ~xj.
~
ran-distributional jl =⇒ jl
Distributional JL Lemma =⇒ JL Lemma: Distributional JL show that a random projection Π preserves thenormof any y . The main JL Lemma says that Π preservesdistancesbetween vectors.
Since Π islinearthese are the same thing!
Proof: Given ~x1, . . . , ~xn, define n2 vectors ~yij where ~yij= ~xi− ~xj.
~
x1, . . . , ~xn: original points, x1, . . . , xn: compressed points, Π ∈ Rm×d:
ran-dom projection matrix. d : original dimension. m: compressed dimension, :
distributional jl =⇒ jl
Distributional JL Lemma =⇒ JL Lemma: Distributional JL show that a random projection Π preserves thenormof any y . The main JL Lemma says that Π preservesdistancesbetween vectors.
Since Π islinearthese are the same thing!
Proof: Given ~x1, . . . , ~xn, define n2 vectors ~yij where ~yij= ~xi− ~xj.
~
ran-distributional jl =⇒ jl
Distributional JL Lemma =⇒ JL Lemma: Distributional JL show that a random projection Π preserves thenormof any y . The main JL Lemma says that Π preservesdistancesbetween vectors.
Since Π islinearthese are the same thing!
Proof: Given ~x1, . . . , ~xn, define n2 vectors ~yij where ~yij= ~xi− ~xj.
~
x1, . . . , ~xn: original points, x1, . . . , xn: compressed points, Π ∈ Rm×d:
ran-dom projection matrix. d : original dimension. m: compressed dimension, :
distributional jl =⇒ jl
Distributional JL Lemma =⇒ JL Lemma: Distributional JL show that a random projection Π preserves thenormof any y . The main JL Lemma says that Π preservesdistancesbetween vectors.
Since Π islinearthese are the same thing!
Proof: Given ~x1, . . . , ~xn, define n2 vectors ~yij where ~yij= ~xi− ~xj.
• If we choose Π with m = Olog 1/δ2
, for each ~yij with probability ≥ 1 − δ we have:
(1 − )k~yijk2≤ kΠ~yijk2≤ (1 + )k~yijk2
~
ran-distributional jl =⇒ jl
Distributional JL Lemma =⇒ JL Lemma: Distributional JL show that a random projection Π preserves thenormof any y . The main JL Lemma says that Π preservesdistancesbetween vectors.
Since Π islinearthese are the same thing!
Proof: Given ~x1, . . . , ~xn, define n2 vectors ~yij where ~yij= ~xi− ~xj.
• If we choose Π with m = Olog 1/δ2
, for each ~yij with probability ≥ 1 − δ we have:
(1 − )k~xi− ~xjk2≤ kΠ(~xi− ~xj)k2≤ (1 + )k~xi− ~xjk2
~
x1, . . . , ~xn: original points, x1, . . . , xn: compressed points, Π ∈ Rm×d:
ran-dom projection matrix. d : original dimension. m: compressed dimension, :
distributional jl =⇒ jl
Distributional JL Lemma =⇒ JL Lemma: Distributional JL show that a random projection Π preserves thenormof any y . The main JL Lemma says that Π preservesdistancesbetween vectors.
Since Π islinearthese are the same thing!
Proof: Given ~x1, . . . , ~xn, define n2 vectors ~yij where ~yij= ~xi− ~xj.
• If we choose Π with m = Olog 1/δ2
, for each ~yij with probability ≥ 1 − δ we have:
(1 − )k~xi− ~xjk2≤ kxi− xjk2≤ (1 + )k~xi− ~xjk2
~
ran-distributional jl =⇒ jl
Claim: If we choose Π with i.i.d. N (0, 1/m) entries and m = Olog(1/δ2 0)
, letting xi = Π~xi, for each pair ~xi, ~xj with probability ≥ 1 − δ0 we have:
(1 − )k~xi− ~xjk2≤ kxi− xjk2≤ (1 + )k~xi− ~xjk2.
With what probability are all pairwise distances preserved?
Union bound: With probability ≥ 1 − n2 · δ0 all pairwise distances are preserved.
Apply the claim with δ0= δ/ n2. =⇒ for m = Olog(1/δ2 0)
, all pairwise distances are preserved with probability ≥ 1 − δ.
m = O log(1/δ 0) 2 = O log( n 2/δ) 2 ! = O log(n 2/δ) 2 = O log(n/δ) 2 ~
x1, . . . , ~xn: original points, x1, . . . , xn: compressed points, Π ∈ Rm×d:
ran-dom projection matrix. d : original dimension. m: compressed dimension, : embedding error, δ: embedding failure prob.
distributional jl =⇒ jl
Claim: If we choose Π with i.i.d. N (0, 1/m) entries and m = Olog(1/δ2 0)
, letting xi = Π~xi, for each pair ~xi, ~xj with probability ≥ 1 − δ0 we have:
(1 − )k~xi− ~xjk2≤ kxi− xjk2≤ (1 + )k~xi− ~xjk2.
With what probability are all pairwise distances preserved?
Union bound: With probability ≥ 1 − n2 · δ0 all pairwise distances are preserved.
Apply the claim with δ0= δ/ n2. =⇒ for m = Olog(1/δ2 0)
, all pairwise distances are preserved with probability ≥ 1 − δ.
m = O log(1/δ 0) 2 = O log( n 2/δ) 2 ! = O log(n 2/δ) 2 = O log(n/δ) 2 ~
x1, . . . , ~xn: original points, x1, . . . , xn: compressed points, Π ∈ Rm×d:
distributional jl =⇒ jl
Claim: If we choose Π with i.i.d. N (0, 1/m) entries and m = Olog(1/δ2 0)
, letting xi = Π~xi, for each pair ~xi, ~xj with probability ≥ 1 − δ0 we have:
(1 − )k~xi− ~xjk2≤ kxi− xjk2≤ (1 + )k~xi− ~xjk2.
With what probability are all pairwise distances preserved?
Union bound: With probability ≥ 1 − n2 · δ0 all pairwise distances are preserved.
Apply the claim with δ0= δ/ n2. =⇒ for m = Olog(1/δ2 0)
, all pairwise distances are preserved with probability ≥ 1 − δ.
m = O log(1/δ 0) 2 = O log( n 2/δ) 2 ! = O log(n 2/δ) 2 = O log(n/δ) 2 ~
x1, . . . , ~xn: original points, x1, . . . , xn: compressed points, Π ∈ Rm×d:
ran-dom projection matrix. d : original dimension. m: compressed dimension, : embedding error, δ: embedding failure prob.
distributional jl =⇒ jl
Claim: If we choose Π with i.i.d. N (0, 1/m) entries and m = Olog(1/δ2 0)
, letting xi = Π~xi, for each pair ~xi, ~xj with probability ≥ 1 − δ0 we have:
(1 − )k~xi− ~xjk2≤ kxi− xjk2≤ (1 + )k~xi− ~xjk2.
With what probability are all pairwise distances preserved?
Union bound: With probability ≥ 1 − n2 · δ0 all pairwise distances are preserved.
Apply the claim with δ0= δ/ n2.
=⇒ for m = Olog(1/δ2 0)
, all pairwise distances are preserved with probability ≥ 1 − δ.
m = O log(1/δ 0) 2 = O log( n 2/δ) 2 ! = O log(n 2/δ) 2 = O log(n/δ) 2 ~
x1, . . . , ~xn: original points, x1, . . . , xn: compressed points, Π ∈ Rm×d:
distributional jl =⇒ jl
Claim: If we choose Π with i.i.d. N (0, 1/m) entries and m = Olog(1/δ2 0)
, letting xi = Π~xi, for each pair ~xi, ~xj with probability ≥ 1 − δ0 we have:
(1 − )k~xi− ~xjk2≤ kxi− xjk2≤ (1 + )k~xi− ~xjk2.
With what probability are all pairwise distances preserved?
Union bound: With probability ≥ 1 − n2 · δ0 all pairwise distances are preserved.
Apply the claim with δ0= δ/ n2. =⇒ for m = Olog(1/δ2 0)
, all pairwise distances are preserved with probability ≥ 1 − δ.
m = O log(1/δ 0) 2 = O log( n 2/δ) 2 ! = O log(n 2/δ) 2 = O log(n/δ) 2 ~
x1, . . . , ~xn: original points, x1, . . . , xn: compressed points, Π ∈ Rm×d:
ran-dom projection matrix. d : original dimension. m: compressed dimension, : embedding error, δ: embedding failure prob.
distributional jl =⇒ jl
Claim: If we choose Π with i.i.d. N (0, 1/m) entries and m = Olog(1/δ2 0)
, letting xi = Π~xi, for each pair ~xi, ~xj with probability ≥ 1 − δ0 we have:
(1 − )k~xi− ~xjk2≤ kxi− xjk2≤ (1 + )k~xi− ~xjk2.
With what probability are all pairwise distances preserved?
Union bound: With probability ≥ 1 − n2 · δ0 all pairwise distances are preserved.
Apply the claim with δ0= δ/ n2. =⇒ for m = Olog(1/δ2 0)
, all pairwise distances are preserved with probability ≥ 1 − δ.
m = O log(1/δ 0) 2 = O log( n 2/δ) 2 ! = O log(n 2/δ) 2 = O log(n/δ) 2 ~
x1, . . . , ~xn: original points, x1, . . . , xn: compressed points, Π ∈ Rm×d:
distributional jl =⇒ jl
Claim: If we choose Π with i.i.d. N (0, 1/m) entries and m = Olog(1/δ2 0)
, letting xi = Π~xi, for each pair ~xi, ~xj with probability ≥ 1 − δ0 we have:
(1 − )k~xi− ~xjk2≤ kxi− xjk2≤ (1 + )k~xi− ~xjk2.
With what probability are all pairwise distances preserved?
Union bound: With probability ≥ 1 − n2 · δ0 all pairwise distances are preserved.
Apply the claim with δ0= δ/ n2. =⇒ for m = Olog(1/δ2 0)
, all pairwise distances are preserved with probability ≥ 1 − δ.
m = O log(1/δ 0) 2 = O log( n 2/δ) 2 ! = O log(n 2/δ) 2 = O log(n/δ) 2 ~
x1, . . . , ~xn: original points, x1, . . . , xn: compressed points, Π ∈ Rm×d:
ran-dom projection matrix. d : original dimension. m: compressed dimension, : embedding error, δ: embedding failure prob.
distributional jl =⇒ jl
Claim: If we choose Π with i.i.d. N (0, 1/m) entries and m = Olog(1/δ2 0)
, letting xi = Π~xi, for each pair ~xi, ~xj with probability ≥ 1 − δ0 we have:
(1 − )k~xi− ~xjk2≤ kxi− xjk2≤ (1 + )k~xi− ~xjk2.
With what probability are all pairwise distances preserved?
Union bound: With probability ≥ 1 − n2 · δ0 all pairwise distances are preserved.
Apply the claim with δ0= δ/ n2. =⇒ for m = Olog(1/δ2 0)
, all pairwise distances are preserved with probability ≥ 1 − δ.
m = O log(1/δ 0) 2 = O log( n 2/δ) 2 ! = O log(n 2/δ) 2 = O log(n/δ) 2 ~
x1, . . . , ~xn: original points, x1, . . . , xn: compressed points, Π ∈ Rm×d:
distributional jl =⇒ jl
Claim: If we choose Π with i.i.d. N (0, 1/m) entries and m = Olog(1/δ2 0)
, letting xi = Π~xi, for each pair ~xi, ~xj with probability ≥ 1 − δ0 we have:
(1 − )k~xi− ~xjk2≤ kxi− xjk2≤ (1 + )k~xi− ~xjk2.
With what probability are all pairwise distances preserved?
Union bound: With probability ≥ 1 − n2 · δ0 all pairwise distances are preserved.
Apply the claim with δ0= δ/ n2. =⇒ for m = Olog(1/δ2 0)
, all pairwise distances are preserved with probability ≥ 1 − δ.
m = O log(1/δ 0) 2 = O log( n 2/δ) 2 ! = O log(n 2/δ) 2 = O log(n/δ) 2 ~
x1, . . . , ~xn: original points, x1, . . . , xn: compressed points, Π ∈ Rm×d:
ran-dom projection matrix. d : original dimension. m: compressed dimension, : embedding error, δ: embedding failure prob.
distributional jl =⇒ jl
Claim: If we choose Π with i.i.d. N (0, 1/m) entries and m = Olog(1/δ2 0)
, letting xi = Π~xi, for each pair ~xi, ~xj with probability ≥ 1 − δ0 we have:
(1 − )k~xi− ~xjk2≤ kxi− xjk2≤ (1 + )k~xi− ~xjk2.
With what probability are all pairwise distances preserved?
Union bound: With probability ≥ 1 − n2 · δ0 all pairwise distances are preserved.
Apply the claim with δ0= δ/ n2. =⇒ for m = Olog(1/δ2 0)
, all pairwise distances are preserved with probability ≥ 1 − δ.
m = O log(1/δ 0) 2 = O log( n 2/δ) 2 ! = O log(n 2/δ) 2 = O log(n/δ) 2
Yields the JL lemma.
~
x, . . . , ~x : original points, x , . . . , x : compressed points, Π ∈ Rm×d: