Boaz Barak – Microsoft Research
Joint work with Jonathan Kelner (MIT) and David Steurer (Cornell)
Fun and Games with Sums of Squares
This talk is about
• Hilbert’s 17th problem / Positivstellensatz
• Proof complexity
• Semidefinite programming
• The Unique Games Conjecture
• Machine Learning
Exercise: Prove that for every
Hard way to solve: check all extremal points of P (where gradient vanishes)
… there are exponentially many of them
Minkowski (1885): Is every non-negative polynomial a sum of squares?
Hilbert (1888): No!
Indeed, the question of whether a 3SAT formula is satisfiable can be encoded as whether a degree 6 poly is non-negative, and
thus can’t always have a short proof unless
Non-constructive existence proof of a non-negative degree 4 bivariate poly that is not an SOS.
(…first constructive example by Motzkin in 1965)
Yay! We proved
Hilbert’s 17th problem (1900): Is every non-negative polynomial a sum of squares of rational functions?
Artin (1927), Krivine (1964), Stengle (1974): Yes!
Not just over but also over zero sets of arbitrary polys
Grigoriev-Vorobjov (1999), Grigoriev (2001):
Whoa! Degree proofs take bits to write down… Some 3SAT formulas require degree.
SOS Algorithm:
For low degree we consider the program :
$\max_{x \in \mathbb{R}^n} P(x)$ s.t. $P_1(x) = \cdots = P_k(x) = 0$
SOS Proof that :
SOS polys s.t. $(\nu - P)\,S = S' + 1 \pmod{P_1, \ldots, P_k}$
Degree of proof: max degree of [Grigoriev-Vorobjov ’99]
Theorem: [Shor ’87, Parrilo ’00, Nesterov ’00, Lasserre ’01]
1) A proof of degree can be found in time.
2) Can find in time the min s.t. degree d proof that
Positivstellensatz: All true bounds have an SOS proof. [Artin ’27, Krivine ’64, Stengle ’74]
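To make "SOS proof" concrete, here is a minimal sketch of the simplest case: certifying that a polynomial is non-negative by writing it as a non-negative combination of squares, and checking that identity exactly. The polynomial `P`, the squares `q1` and `q2`, and all helper names are illustrative assumptions and do not come from the talk. The check only verifies a given certificate; finding one is what the semidefinite program does.

```python
from collections import defaultdict
from fractions import Fraction

# A polynomial in (x, y) is a dict {(deg_x, deg_y): coefficient}.

def poly_mul(p, q):
    """Exact product of two polynomials."""
    out = defaultdict(Fraction)
    for (a, b), c in p.items():
        for (a2, b2), c2 in q.items():
            out[(a + a2, b + b2)] += Fraction(c) * c2
    return {m: c for m, c in out.items() if c != 0}

def weighted_sum_of_squares(weighted_polys):
    """Expand sum_k w_k * q_k(x, y)^2 exactly."""
    out = defaultdict(Fraction)
    for weight, q in weighted_polys:
        for mono, c in poly_mul(q, q).items():
            out[mono] += weight * c
    return {m: c for m, c in out.items() if c != 0}

def is_sos_certificate(target, weighted_polys):
    """True iff all weights are >= 0 and the expansion equals target."""
    if any(weight < 0 for weight, _ in weighted_polys):
        return False
    cleaned = {m: Fraction(c) for m, c in target.items() if c != 0}
    return weighted_sum_of_squares(weighted_polys) == cleaned

# Hypothetical example: P(x, y) = 2x^4 + 2x^3 y - x^2 y^2 + 5y^4.
P = {(4, 0): 2, (3, 1): 2, (2, 2): -1, (0, 4): 5}
q1 = {(2, 0): 2, (0, 2): -3, (1, 1): 1}   # 2x^2 - 3y^2 + xy
q2 = {(0, 2): 1, (1, 1): 3}               # y^2 + 3xy
half = Fraction(1, 2)
certificate = [(half, q1), (half, q2)]

print(is_sos_certificate(P, certificate))
```

Exact rational arithmetic is used so that the identity P = ½·q1² + ½·q2² is checked symbolically rather than at sample points.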
Program :
$\max_{x \in \mathbb{R}^n} P(x)$ s.t. $P_1(x) = \cdots = P_k(x) = 0$
SOS Proof that :
$(\nu - P)\,S = S' + 1 \pmod{P_1, \ldots, P_k}$
Can optimize in time over programs with degree proofs.
Can’t hope for always: Captures SAT, CLIQUE, 3COL, MAX-CUT, etc…
But maybe often? Essentially only one (robust) lower bound showing [Grigoriev ’01]
Applications:
• Optimizing polynomials w/ non-negative coefficients over sphere.
• Algorithms for quantum separability problem [Brandao-Harrow’13]
• Sparse coding: learning dictionaries beyond the barrier.
• Finding sparse vectors in subspaces.
• Approach to refute the Unique Games Conjecture.
This talk: General method to analyze the SOS algorithm. [B-Kelner-Steurer’13]
Rest of this talk:
• Super high level description of approach.
• More concrete – reduction to task of “finding a needle in a needle-stack”
• Implementing reduction via pseudoexpectations
• Example: Sparse Coding (aka dictionary learning)
Our Approach: High-Level Description
Traditional relaxation-based approach for solving/approximating :
1) Define relaxation optimizing over larger set of ’s.
(e.g., if define the set , optimize over instead)
2) Find rounding algorithm mapping larger set into valid ’s.
Our approach: Invert the steps.
1) Find combining algorithm mapping some set into valid ’s.
2) Use relaxation to supply inputs to
Crucial ingredient: view of relaxation as a proof system.
Program :
$\max_{x \in \mathbb{R}^n} P(x)$ s.t. $P_1(x) = \cdots = P_k(x) = 0$
Finding is hard. We consider easier problem:
“Finding a needle in a needle-stack”
Given many ’s maximizing , find a single with value close to maximum.
(multi) set of s.t. ,
Single s.t. ,
Combiner
Non-trivial combiner:
Only depends on low degree marginals of
$\{ \mathbb{E}_{x \sim S}\, x_{i_1} \cdots x_{i_k} \}_{i_1, \ldots, i_k \in [n]}$
[B-Kelner-Steurer’13]: Transform “simple” non-trivial combiners to algorithm for original problem.
Idea in a nutshell: Simple combiners will output a solution even when fed “fake marginals”.
Pseudoexpectations (aka “Fake Marginals”)
Def: [Lasserre ’01] Degree pseudoexpectation is operator mapping any degree poly into a number satisfying:
• Normalization:
• Linearity: of deg
• Positivity: of deg
Fundamental Fact: deg SOS proof for
for any deg pseudoexpectation operator
Take home message:
• Pseudoexpectation “looks like” real expectation to low degree polynomials.
• Can efficiently find pseudoexpectation matching any polynomial constraints.
• Proofs about real random vars can often be “lifted” to pseudoexpectation.
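The symbols in the definition above were lost in extraction. For reference, the standard formulation is sketched below. It is assumed here rather than recovered from the slide, and the names $\tilde{\mathbb{E}}$, $d$, $\alpha$, $\beta$, $Q$ are chosen for illustration. A degree-$d$ pseudoexpectation assigns a number to every polynomial of degree at most $d$ such that:

```latex
\begin{align*}
\text{Normalization:}\quad & \tilde{\mathbb{E}}[1] = 1 \\
\text{Linearity:}\quad & \tilde{\mathbb{E}}[\alpha P + \beta Q]
    = \alpha\,\tilde{\mathbb{E}}[P] + \beta\,\tilde{\mathbb{E}}[Q]
    && \text{for all } P, Q \text{ of degree} \le d \\
\text{Positivity:}\quad & \tilde{\mathbb{E}}[P^2] \ge 0
    && \text{for all } P \text{ of degree} \le d/2
\end{align*}
```

In this form, the fundamental fact is usually stated as: if the bound $P \le \nu$ has a degree-$d$ SOS proof, then $\tilde{\mathbb{E}}[P] \le \nu$ for every degree-$d$ pseudoexpectation that respects the program's constraints.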
Deg pseudoexpectation operator can be represented by p.s.d matrix
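As a small illustration of the matrix view, the sketch below writes down a degree-2 "fake marginal" for three ±1 variables (the MAX-CUT setting on a triangle). The matrix `M`, indexed by the monomials (1, x1, x2, x3), and the helper names are assumptions made for this example and are not taken from the talk. The matrix passes the normalization and positivity checks, yet assigns x1x2 + x2x3 + x1x3 a value that no actual distribution over ±1 assignments can achieve.

```python
from fractions import Fraction
from itertools import product

half = Fraction(1, 2)

# M[i][j] is the claimed value of E~[b_i * b_j] for the basis b = (1, x1, x2, x3).
# Diagonal entries encode E~[1] = 1 and the constraint x_i^2 = 1.
M = [
    [1, 0, 0, 0],
    [0, 1, -half, -half],
    [0, -half, 1, -half],
    [0, -half, -half, 1],
]

def pseudo_expectation_of_square(c):
    """E~[p^2] for p = c[0] + c[1]*x1 + c[2]*x2 + c[3]*x3, i.e. c^T M c."""
    return sum(c[i] * M[i][j] * c[j] for i in range(4) for j in range(4))

# Positivity spot check on a grid of integer coefficient vectors.
# (This is a sanity check, not a proof; here c^T M c equals
#  c0^2 + ((c1-c2)^2 + (c2-c3)^2 + (c1-c3)^2) / 2, which is never negative.)
grid = range(-3, 4)
min_square = min(pseudo_expectation_of_square(c) for c in product(grid, repeat=4))

# Value the fake marginals assign to x1*x2 + x2*x3 + x1*x3 ...
fake_value = M[1][2] + M[2][3] + M[1][3]
# ... versus the true minimum over all actual +/-1 assignments.
true_min = min(a * b + b * c + a * c for a, b, c in product((-1, 1), repeat=3))

print(min_square, fake_value, true_min)
```

The gap between `fake_value` and `true_min` shows why degree-2 marginals can be "fake"; raising the degree adds positivity constraints that rule out more such operators.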
Problem: Given low degree maximize s.t.
[B-Kelner-Steurer’13]: Transform “simple” non-trivial combiners to
algorithm for original problem.
Non-trivial combiner: Alg with
Input: , r.v. over s.t.
Output: s.t.
Corollary: In this case, we can find efficiently:
• Use SOS SDP to find pseudoexpectation matching input conditions.
• Use to map into an actual solution
Crucial Observation: If proof that is good solution is in SOS framework, then it holds even if is fed with a pseudoexpectation.
Combining Rounding
$\mathbb{E}\,(P(X) - v)^2 = 0, \quad \forall i \;\; \mathbb{E}\, P_i(X)^2 = 0$
dist of s.t. ,
Single s.t. ,
Example Application: Dictionary Learning / Sparse Coding
[Olshausen-Field ’96]
Let set of vectors.
Goal: Given examples of form , where recover
Find the “right” representation of observed data.
LOTS of work: important primitive in Machine Learning, Vision, Neuroscience, …
Previous best (rigorous) results:
[Spielman-Wang-Wright ’12, Arora-Moitra-Ge ’13, Agarwal-Anandkumar-Jain-Netrapalli-Tandon ’13]
We show: is sufficient* (even in non-independent, overcomplete case)
Let set of vectors.
Goal: Given examples of form , where recover
For simplicity, assume , ’s orthonormal basis, i.i.d. random vars over s.t.
(Result generalizes to overcomplete, non-independent case.)
Achieve in 3 steps:
(1) Find a program s.t. every maximizing is close to one of ’s
(2) Give combining alg taking moments of dist over maximizers into a vector close to one of ’s.
(3) Show that arguments in (1) and (2) fall under the SOS framework.
Consider the polynomial
$P(x) = \mathbb{E}\,\langle y, x \rangle^4 = \mathbb{E}\left(\sum_i W_i \langle a_i, x \rangle\right)^4$
(can approximate arbitrarily well from examples)
Opening parentheses we get
$P(x) \le \mu \sum_i \langle a_i, x \rangle^4 + 2\mu^2 \left(\sum_i \langle a_i, x \rangle^2\right)^2 = \mu \sum_i \langle a_i, x \rangle^4 + o(\mu)\,\|x\|^4$
Corollary: unit,
Establishes (1)!
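The claim that maximizers of P sit near dictionary vectors can be sanity-checked numerically. The sketch below estimates P(x) = 𝔼⟨y, x⟩⁴ from samples, under a toy model that is an assumption of this example: the dictionary is the standard basis (a_i = e_i), and each coefficient W_i is 0 with probability 1 − μ and ±1 otherwise. The values of `n`, `mu`, the sample count, and all function names are hypothetical. It compares P at a dictionary vector with P at a "spread out" unit vector.

```python
import random

def sample_y(n, mu, rng):
    """One example y = sum_i W_i e_i, with W_i in {0, +1, -1} and P[W_i != 0] = mu."""
    y = []
    for _ in range(n):
        u = rng.random()
        if u < mu / 2:
            y.append(1.0)
        elif u < mu:
            y.append(-1.0)
        else:
            y.append(0.0)
    return y

def empirical_P(x, samples):
    """Empirical estimate of E <y, x>^4 over the given examples."""
    total = 0.0
    for y in samples:
        inner = sum(yi * xi for yi, xi in zip(y, x))
        total += inner ** 4
    return total / len(samples)

n, mu = 20, 0.05
rng = random.Random(0)
samples = [sample_y(n, mu, rng) for _ in range(20000)]

basis_vec = [1.0] + [0.0] * (n - 1)   # a dictionary vector a_1 = e_1
flat_vec = [n ** -0.5] * n            # unit vector far from every a_i

p_basis = empirical_P(basis_vec, samples)
p_flat = empirical_P(flat_vec, samples)
print(p_basis, p_flat)
```

In this toy model P(e_1) equals μ exactly, while the flat vector only collects the lower-order cross terms, which matches the shape of the bound above.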
Step 2. Let be dist over unit vectors s.t. every satisfies for some
Pick set of random (std gaussian) vectors for
Let be matrix s.t.
Our combining algorithm outputs the top eigenvector of .
Suppose that and for every , .
(Note that )
Then if then (up to scaling) and we’ll succeed.
Establishes (2)!
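The formula for the matrix was lost in extraction, so the following is only a sketch in the spirit of the step above. It assumes that the matrix is the average of ∏_j ⟨g_j, x⟩² · x xᵀ over the distribution, with a few standard Gaussian vectors g_j. That weighting, the power-iteration eigenvector routine, and the names `combine` and `top_eigenvector` are all assumptions of this example. The input distribution is uniform over the standard basis, standing in for the "needle-stack" of good solutions.

```python
import math
import random

def top_eigenvector(M, iters=500):
    """Power iteration for a symmetric PSD matrix, started from the uniform vector."""
    n = len(M)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(t * t for t in w))
        v = [t / norm for t in w]
    return v

def combine(points, gaussians):
    """Assumed combiner: top eigenvector of the reweighted second-moment matrix."""
    n = len(points[0])
    M = [[0.0] * n for _ in range(n)]
    for x in points:
        # Random reweighting that typically makes one point dominate the average.
        weight = 1.0
        for g in gaussians:
            weight *= sum(gi * xi for gi, xi in zip(g, x)) ** 2
        for i in range(n):
            for j in range(n):
                M[i][j] += weight * x[i] * x[j] / len(points)
    return top_eigenvector(M)

rng = random.Random(1)
n, k = 8, 3
basis = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
gaussians = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]

v = combine(basis, gaussians)
weights = [math.prod(g[i] ** 2 for g in gaussians) for i in range(n)]
best = max(range(n), key=lambda i: abs(v[i]))
print(best, max(range(n), key=lambda i: weights[i]))
```

Note that `combine` touches the distribution only through averages of low-degree monomials in x, which is what allows such a combiner to be run on pseudoexpectation values instead of a real distribution.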
Step 3. Show that arguments in (1) and (2) fall under the SOS framework:
Slightly tedious but mostly* straightforward computations.
A personal overview of the Unique Games Conjecture
Unique Games Conjecture: UG/SSE problem is NP-hard. [Khot’02, Raghavendra-Steurer’08]
Reasons to believe:
• “Standard crypto heuristic”: Tried to solve it and couldn’t.
• Very clean picture of complexity landscape: simple algorithms are optimal [Khot’02…Raghavendra’08…]
• Simple poly algorithms can’t refute it [Khot-Vishnoi’04]
• Simple subexp algorithms can’t refute it [B-Gopalan-Håstad-Meka-Raghavendra-Steurer’12]
Reasons to suspect:
• Random instances are easy via simple algorithm [Arora-Khot-Kolla-Steurer-Tulsiani-Vishnoi’05]
• Subexponential algorithm [Arora-B-Steurer ’10]
• Quasipoly algo on KV instance [Kolla ’10]
SOS proof system:
• SOS solves all candidate hard instances [B-Brandao-Harrow-Kelner-Steurer-Zhou ’12]
• SOS useful for sparse vector problem; candidate algorithm for search problem [B-Kelner-Steurer ’13]
Skeletal program to prove UGC
Conclusions
• Sum of Squares is a powerful algorithmic framework that can yield strong results for the right problems.
(contrast with previous results on SDP/LP hierarchies, showing lower bounds when using either wrong hierarchy or wrong problem.)
• “Combiner” view allows one to focus on the features of the problem rather than details of relaxation.
• SOS seems particularly useful for problems with some geometric structure, including several problems related to unique games and machine learning.
• Still have only rudimentary understanding of when SOS works or not.
Other Results
Sparse vector problem:
Recover -sparse vector in -dimensional subspace given arbitrary basis.
(motivation: machine learning, optimization, [Demanet-Hand ’13])
Random case: Recovery for any
(Improving on [Demanet-Hand ’13])
Worst case: Recovery* for
[Brandao-Harrow’12]: Using our techniques, find separable quantum state maximizing a “local operations classical communication” (LOCC) measurement.