• No results found

Big data challenges for physics in the next decades

N/A
N/A
Protected

Academic year: 2021

Share "Big data challenges for physics in the next decades"

Copied!
36
0
0

Loading.... (view fulltext now)

Full text

(1)

Big data challenges for physics in the next decades

David W. Hogg

Center for Cosmology and Particle Physics, New York University

(2)

punchlines

I Huge data sets create new opportunities. I they also present new challenges

I Industrial (dot-com) methods are brilliant.

I they can only solve a very limited set of problems

I We won’t reap the full benefit of larger data sets without new technology. I call to arms

(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
(13)
(14)

scale

I Large Synoptic Survey Telescope: 1010 galaxies in 1015 pixels I get the cosmic gravitational-lensing shear map

I and then the cosmological parameters I 109stars too, all moving and varying I Gaia: 109 stars in 1012 pixels

I infer the dynamics of the Milky Way

I but also—necessarily—the distribution function of stars in that potential I precision requirements areoutrageous(milli-pixel positions)

I Large Hadron Collider instruments: taking data 107 times faster than it can be moved to disk

I they found the Higgs! (probably)

(15)

how do you store big data sets?

I Distribute the data.

I <1 Tb of data per CPU, thousands of CPUs

I all modern databases can do this, even open-source ones I There is a CPU near every data point.

I and there are many, many CPUs

I Hardware is cheap; management is expensive. I Massive CPU redundancy creates opportunities. . .

(16)

map–reduce or die

I “We won’t even consider algorithms that can’t be written in the map–reduce framework.”

(17)

map–reduce

I map:

I at each “data point” (on the distributed system), do an operation on that datum, produce output

I if "kittens" is in document:

return (URL, PageRank)

I distributed datais the key: Every datum is near a CPU. I reduce:

I between each pair of outputs, do an operation and return one new output, recurse up the tree

I if PageRank[0] > PageRank[1]: return (URL[0], PageRank[0]) else:

return (URL[1], PageRank[1])

I tree structureof the data center is the key: There are only log2N branches to any datum.

(18)

map–reduce

reduce

reduce reduce reduce reduce

"kittens?" reduce reduce reduce reduce cpu0009 cpu0008 reduce cpu0001

(19)

map–reduce

I Brilliant. And a huge opportunity.

I the computational complexity isN logN

I but thetime it takesscales as just logN

I Map–reduce made internet search possible. I (goes by many names, some trade-marked!)

(20)

maximum-likelihood and map–reduce

I full-data likelihood: p(D|θ) =Y

n

p(dn|θ)

I Find a local maximum with respect to θ of this likelihood. I map:

I compute dlnp(dn|θ)

I reduce:

I pairwise sum

I Go uphill. Repeat as necessary; each iteration only takes logN time. I (use L-BFGS or conjugate-gradient or whatever you like)

I complexity isN logN,compute timeis logN

I Local optimization typically takes M2logN time, where M is thenumber of parameters.

(21)

all that matters is the number of parameters

I So we are done, right?

I Nope.

I In all real problems, the number of parametersM scales with the data volume N.

(22)

all that matters is the number of parameters

I So we are done, right? I Nope.

I In all real problems, the number of parametersM scales with the data volume N.

(23)

all that matters is the number of parameters

I So we are done, right? I Nope.

I In all real problems, the number of parametersM scales with the data volume N.

(24)

physics problems are hierarchical

n= 1, . . . , N Σ σn Ω xn α true n obs n γ

(25)

nuisance parameters

I Nuisance parameters tend to increase in number with the data size.

I Relevant backgrounds get more subtle and must be modeled more carefully. I Details of sample selection and observed data distribution functions become

more important.

I As precision expectations rise (and they rise with N), noise models get more realistic.

(26)

calibration parameters

I As you go from N = 103 toN = 109, you expect to do more than 103 times better in accuracy.

I Calibration of the device must get correspondingly better. I time-dependence, temperature, Solar Cycle, hysteresis I all these effects bring new parameters

I you usually can’t measure these parameters well enough in your “calibration program”; whenN is large, you end upself-calibrating

(27)

Bayesian inference

isn’t

map–reduce

I p(θ|D) = 1 Z p(θ) Y n p(dn|θ) I map: I compute functionsp(dn|θ) I reduce:

I product functions together (starting with the prior) I but think about how you pass forward those functions

I θhas 106or more parameters I functions are multi-modal I support is broader than Gaussian

(28)

even the frequentists are doomed

I All the “M scaling withN” arguments apply to frequentists and Bayesians alike.

I Computing the full-data likelihood functionis just as hard as computing the full-data posterior PDF.

I (local optimization of the likelihood is easy, full description of the function is hard)

(29)

marginalization is hard—and unavoidable

n= 1, . . . , N Σ σn Ω xn α true n obs n γ

(30)

Bayesian state-of-the-art

I Therearehuge non-parametric Bayesian inferences with massive marginalizations out there.

I How were they done?

I carefully chosen priors that make the inferences and marginalizations analytic or tractable

I we can’t do this

I why not? Because for us the priorsactually areour prior beliefs. Ourrealprior beliefs are not conjugate to anything!

(31)

my approach

I Brute Force (tm).

I Plus some help from applied math and computer vision. I approximate Bayesian inference

I very clever Markov-Chain Monte Carlo methods I exploiting problem sparsity

(32)

my approach

I Brute Force (tm).

I Plus some help from applied math and computer vision. I approximate Bayesian inference

I very clever Markov-Chain Monte Carlo methods I exploiting problem sparsity

(33)

my approach

I Brute Force (tm).

I Plus some help from applied math and computer vision. I approximate Bayesian inference

I very clever Markov-Chain Monte Carlo methods I exploiting problem sparsity

(34)

my day job

I Lang & Hogg (forthcoming): a 109-parameter model of the 1013 Sloan Digital Sky Survey pixels (The Tractor)

I Brewer et al.(forthcoming): Bayesian non-parametrics but with priors that represent our actual prior knowledge

I Foreman-Mackey et al.(arXiv:1202.3665): emcee, the MCMC Hammer: flexible, parallelized, adaptive sampler

I Bovy, Murray, Hogg (arXiv:0903.5308): a dynamical inference fully marginalizing out a non-parametric distribution function

I Bovy et al.(arXiv:1105.3975): a 60,000-parameter model of 700,000 flux measurements, followed by predictions for 160,000,000 point sources

(35)

it’s actually worse than you think

I “More data” means “more new discoveries”.

I The number of hypotheses to testalso grows with the data sizeN. I There will be far more hypotheses than physicists.

I (this is already true, of course) I citizen science?

(36)

punchlines

I Huge data sets create new opportunities. I they also present new challenges

I Industrial (dot-com) methods are brilliant.

I they can only solve a very limited set of problems

I We won’t reap the full benefit of larger data sets without new technology. I call to arms

References

Related documents

JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY: KAKINADA KAKINADA – 533 003, Andhra Pradesh, India.. FIRST SEMESTER AR17B1.1C

products are stored in mobile inventory pods in the center of the warehouse hil t Ygal Bendavid Sourcehttp://www.kivasystems.com/solutions/system‐overview / while operators

[r]

The reference attribute of the BuyPageInformation element is used by Commerce Suite Accelerator to locate buy page information for use when the merchant administrator needs to

If you expect search engine spiders to execute Flash, Java or Javascript code in order to access links to further pages within your site, you'll usually be disappointed with

Abstract In this paper the well-known minimax theorems of Wald, Ville and Von Neumann are generalized under weaker topological conditions on the payoff function ƒ and/or extended

Users with extensive PC audio and some mobile phone communication needs • Premium UC headset seamlessly manages calls to and from your PC and mobile devices • Smart

Watson Collection; Franklin photo © Henry Grant Collection / Museum of London; Maurice Wilkins in the lab photo © King’s College Archives; Photo of X-ray diffraction pattern