Big data challenges for physics in the next decades

(1)

Big data challenges for physics in the next decades

David W. Hogg

Center for Cosmology and Particle Physics, New York University

(2)

punchlines

I Huge data sets create new opportunities. I they also present new challenges

I Industrial (dot-com) methods are brilliant.

I they can only solve a very limited set of problems

I We won’t reap the full benefit of larger data sets without new technology. I call to arms

(3)

(4)

(5)

(6)

(7)

(8)

(9)

(10)

(11)

(12)

(13)

(14)

scale

I Large Synoptic Survey Telescope: 1010 galaxies in 1015 pixels I get the cosmic gravitational-lensing shear map

I and then the cosmological parameters I 109_{stars too, all moving and varying} I Gaia: 109 _{stars in 10}12 _pixels

I infer the dynamics of the Milky Way

I but also—necessarily—the distribution function of stars in that potential I precision requirements areoutrageous(milli-pixel positions)

I Large Hadron Collider instruments: taking data 107 times faster than it can be moved to disk

I _{they found the Higgs! (probably)}

(15)

how do you store big data sets?

I Distribute the data.

I <1 Tb of data per CPU, thousands of CPUs

I all modern databases can do this, even open-source ones I There is a CPU near every data point.

I and there are many, many CPUs

I Hardware is cheap; management is expensive. I Massive CPU redundancy creates opportunities. . .

(16)

map–reduce or die

I “We won’t even consider algorithms that can’t be written in the map–reduce framework.”

(17)

map–reduce

I map:

I at each “data point” (on the distributed system), do an operation on that datum, produce output

I _{if "kittens" is in document:}

return (URL, PageRank)

I distributed datais the key: Every datum is near a CPU. I reduce:

I between each pair of outputs, do an operation and return one new output, recurse up the tree

I if PageRank[0] > PageRank[1]: return (URL[0], PageRank[0]) else:

return (URL[1], PageRank[1])

I tree structureof the data center is the key: There are only log₂N branches to any datum.

(18)

map–reduce

reduce

reduce reduce reduce reduce

"kittens?" reduce reduce reduce reduce cpu0009 cpu0008 reduce cpu0001

(19)

map–reduce

I Brilliant. And a huge opportunity.

I the computational complexity isN logN

I but thetime it takesscales as just logN

I Map–reduce made internet search possible. I (goes by many names, some trade-marked!)

(20)

maximum-likelihood and map–reduce

I full-data likelihood: p(D|θ) =Y

n

p(dn|θ)

I Find a local maximum with respect to θ of this likelihood. I map:

I _compute dlnp(dn|θ)

dθ

I reduce:

I pairwise sum

I Go uphill. Repeat as necessary; each iteration only takes logN time. I (use L-BFGS or conjugate-gradient or whatever you like)

I complexity isN logN,compute timeis logN

I Local optimization typically takes M2logN time, where M is thenumber of parameters.

(21)

all that matters is the number of parameters

I So we are done, right?

I Nope.

I In all real problems, the number of parametersM scales with the data volume N.

(22)

all that matters is the number of parameters

I So we are done, right? I Nope.

(23)

all that matters is the number of parameters

I So we are done, right? I Nope.

(24)

physics problems are hierarchical

n= 1, . . . , N Σ σn Ω xn α true n obs n γ

(25)

nuisance parameters

I Nuisance parameters tend to increase in number with the data size.

I Relevant backgrounds get more subtle and must be modeled more carefully. I Details of sample selection and observed data distribution functions become

more important.

I As precision expectations rise (and they rise with N), noise models get more realistic.

(26)

calibration parameters

I As you go from N = 103 toN = 109, you expect to do more than 103 times better in accuracy.

I Calibration of the device must get correspondingly better. I time-dependence, temperature, Solar Cycle, hysteresis I all these effects bring new parameters

I you usually can’t measure these parameters well enough in your “calibration program”; whenN is large, you end upself-calibrating

(27)

Bayesian inference

isn’t

map–reduce

I p(θ|D) = 1 Z p(θ) Y n p(dn|θ) I map: I compute functionsp(dn|θ) I reduce:

I product functions together (starting with the prior) I but think about how you pass forward those functions

I _θ_{has 10}6_{or more parameters} I functions are multi-modal I support is broader than Gaussian

(28)

even the frequentists are doomed

I All the “M scaling withN” arguments apply to frequentists and Bayesians alike.

I Computing the full-data likelihood functionis just as hard as computing the full-data posterior PDF.

I (local optimization of the likelihood is easy, full description of the function is hard)

(29)

marginalization is hard—and unavoidable

n= 1, . . . , N Σ σn Ω xn α true n obs n γ

(30)

Bayesian state-of-the-art

I Therearehuge non-parametric Bayesian inferences with massive marginalizations out there.

I How were they done?

I carefully chosen priors that make the inferences and marginalizations analytic or tractable

I we can’t do this

I why not? Because for us the priorsactually areour prior beliefs. Ourrealprior beliefs are not conjugate to anything!

(31)

my approach

I Brute Force (tm).

I Plus some help from applied math and computer vision. I approximate Bayesian inference

I very clever Markov-Chain Monte Carlo methods I exploiting problem sparsity

(32)

my approach

I Brute Force (tm).

(33)

my approach

I Brute Force (tm).

(34)

my day job

I Lang & Hogg (forthcoming): a 109-parameter model of the 1013 Sloan Digital Sky Survey pixels (The Tractor)

I Brewer et al.(forthcoming): Bayesian non-parametrics but with priors that represent our actual prior knowledge

I Foreman-Mackey et al.(arXiv:1202.3665): emcee, the MCMC Hammer: flexible, parallelized, adaptive sampler

I Bovy, Murray, Hogg (arXiv:0903.5308): a dynamical inference fully marginalizing out a non-parametric distribution function

I Bovy et al.(arXiv:1105.3975): a 60,000-parameter model of 700,000 flux measurements, followed by predictions for 160,000,000 point sources

(35)

it’s actually worse than you think

I “More data” means “more new discoveries”.

I The number of hypotheses to testalso grows with the data sizeN. I There will be far more hypotheses than physicists.

I (this is already true, of course) I citizen science?

(36)

punchlines

I Huge data sets create new opportunities. I they also present new challenges

I Industrial (dot-com) methods are brilliant.

I they can only solve a very limited set of problems

I We won’t reap the full benefit of larger data sets without new technology. I call to arms