Big data challenges for physics in the next decades
David W. HoggCenter for Cosmology and Particle Physics, New York University
punchlines
I Huge data sets create new opportunities. I they also present new challenges
I Industrial (dot-com) methods are brilliant.
I they can only solve a very limited set of problems
I We won’t reap the full benefit of larger data sets without new technology. I call to arms
scale
I Large Synoptic Survey Telescope: 1010 galaxies in 1015 pixels I get the cosmic gravitational-lensing shear map
I and then the cosmological parameters I 109stars too, all moving and varying I Gaia: 109 stars in 1012 pixels
I infer the dynamics of the Milky Way
I but also—necessarily—the distribution function of stars in that potential I precision requirements areoutrageous(milli-pixel positions)
I Large Hadron Collider instruments: taking data 107 times faster than it can be moved to disk
I they found the Higgs! (probably)
how do you store big data sets?
I Distribute the data.
I <1 Tb of data per CPU, thousands of CPUs
I all modern databases can do this, even open-source ones I There is a CPU near every data point.
I and there are many, many CPUs
I Hardware is cheap; management is expensive. I Massive CPU redundancy creates opportunities. . .
map–reduce or die
I “We won’t even consider algorithms that can’t be written in the map–reduce framework.”
map–reduce
I map:I at each “data point” (on the distributed system), do an operation on that datum, produce output
I if "kittens" is in document:
return (URL, PageRank)
I distributed datais the key: Every datum is near a CPU. I reduce:
I between each pair of outputs, do an operation and return one new output, recurse up the tree
I if PageRank[0] > PageRank[1]: return (URL[0], PageRank[0]) else:
return (URL[1], PageRank[1])
I tree structureof the data center is the key: There are only log2N branches to any datum.
map–reduce
reduce
reduce reduce reduce reduce
"kittens?" reduce reduce reduce reduce cpu0009 cpu0008 reduce cpu0001
map–reduce
I Brilliant. And a huge opportunity.
I the computational complexity isN logN
I but thetime it takesscales as just logN
I Map–reduce made internet search possible. I (goes by many names, some trade-marked!)
maximum-likelihood and map–reduce
I full-data likelihood: p(D|θ) =Yn
p(dn|θ)
I Find a local maximum with respect to θ of this likelihood. I map:
I compute dlnp(dn|θ)
dθ
I reduce:
I pairwise sum
I Go uphill. Repeat as necessary; each iteration only takes logN time. I (use L-BFGS or conjugate-gradient or whatever you like)
I complexity isN logN,compute timeis logN
I Local optimization typically takes M2logN time, where M is thenumber of parameters.
all that matters is the number of parameters
I So we are done, right?
I Nope.
I In all real problems, the number of parametersM scales with the data volume N.
all that matters is the number of parameters
I So we are done, right? I Nope.
I In all real problems, the number of parametersM scales with the data volume N.
all that matters is the number of parameters
I So we are done, right? I Nope.
I In all real problems, the number of parametersM scales with the data volume N.
physics problems are hierarchical
n= 1, . . . , N Σ σn Ω xn α true n obs n γnuisance parameters
I Nuisance parameters tend to increase in number with the data size.
I Relevant backgrounds get more subtle and must be modeled more carefully. I Details of sample selection and observed data distribution functions become
more important.
I As precision expectations rise (and they rise with N), noise models get more realistic.
calibration parameters
I As you go from N = 103 toN = 109, you expect to do more than 103 times better in accuracy.
I Calibration of the device must get correspondingly better. I time-dependence, temperature, Solar Cycle, hysteresis I all these effects bring new parameters
I you usually can’t measure these parameters well enough in your “calibration program”; whenN is large, you end upself-calibrating
Bayesian inference
isn’t
map–reduce
I p(θ|D) = 1 Z p(θ) Y n p(dn|θ) I map: I compute functionsp(dn|θ) I reduce:I product functions together (starting with the prior) I but think about how you pass forward those functions
I θhas 106or more parameters I functions are multi-modal I support is broader than Gaussian
even the frequentists are doomed
I All the “M scaling withN” arguments apply to frequentists and Bayesians alike.
I Computing the full-data likelihood functionis just as hard as computing the full-data posterior PDF.
I (local optimization of the likelihood is easy, full description of the function is hard)
marginalization is hard—and unavoidable
n= 1, . . . , N Σ σn Ω xn α true n obs n γBayesian state-of-the-art
I Therearehuge non-parametric Bayesian inferences with massive marginalizations out there.
I How were they done?
I carefully chosen priors that make the inferences and marginalizations analytic or tractable
I we can’t do this
I why not? Because for us the priorsactually areour prior beliefs. Ourrealprior beliefs are not conjugate to anything!
my approach
I Brute Force (tm).
I Plus some help from applied math and computer vision. I approximate Bayesian inference
I very clever Markov-Chain Monte Carlo methods I exploiting problem sparsity
my approach
I Brute Force (tm).
I Plus some help from applied math and computer vision. I approximate Bayesian inference
I very clever Markov-Chain Monte Carlo methods I exploiting problem sparsity
my approach
I Brute Force (tm).
I Plus some help from applied math and computer vision. I approximate Bayesian inference
I very clever Markov-Chain Monte Carlo methods I exploiting problem sparsity
my day job
I Lang & Hogg (forthcoming): a 109-parameter model of the 1013 Sloan Digital Sky Survey pixels (The Tractor)
I Brewer et al.(forthcoming): Bayesian non-parametrics but with priors that represent our actual prior knowledge
I Foreman-Mackey et al.(arXiv:1202.3665): emcee, the MCMC Hammer: flexible, parallelized, adaptive sampler
I Bovy, Murray, Hogg (arXiv:0903.5308): a dynamical inference fully marginalizing out a non-parametric distribution function
I Bovy et al.(arXiv:1105.3975): a 60,000-parameter model of 700,000 flux measurements, followed by predictions for 160,000,000 point sources
it’s actually worse than you think
I “More data” means “more new discoveries”.
I The number of hypotheses to testalso grows with the data sizeN. I There will be far more hypotheses than physicists.
I (this is already true, of course) I citizen science?
punchlines
I Huge data sets create new opportunities. I they also present new challenges
I Industrial (dot-com) methods are brilliant.
I they can only solve a very limited set of problems
I We won’t reap the full benefit of larger data sets without new technology. I call to arms