Learning from Big Data in

(1)

Learning from Big Data in

Astronomy – an overview

Kirk Borne

George Mason University

School of Physics, Astronomy, & Computational Sciences

(2)

From traditional astronomy …

(3)

… to Big Data Astronomy is a Big Challenge

(4)

Scary News:

Big Data is

taking us to a

Tipping Point

4 http://bit.ly/HUqmu5

(5)

Good News: Big Data is Sexy

5

(6)

Characteristics of Big Data

n If the only distinguishing characteristic was that we

have lots of data, we would call it “Lots of Data”.

n Big Data characteristics: the 3+n V’s =

1. Volume Volume (lots of data = “Tonnabytes”)(lots of data = “Tonnabytes”)

2. Variety (complexity, curse of dimensionality) 3. Velocity (rate of data and information flow)

4. Veracity (verifying inference-based models from

comprehensive data collections)

5. Variability 6. Venue

7. Vocabulary

(7)

Characteristics of Big Data

n If the only distinguishing characteristic was that we

have lots of data, we would call it “Lots of Data”.

n Big Data characteristics: the 3+n V’s =

1. Volume Volume

2. Variety : this one helps us to discriminate subtle

new classes (= Class Discovery)

3. Velocity 4. Veracity 5. Variability 6. Venue 7. Vocabulary 7

(8)

Insufficient

Variety: stars & galaxies

are not separated

in this parameter

(9)

Sufficient

Variety: stars & galaxies

are separated

in this parameter

(10)

Class Discovery: feature separation

and discrimination of classes

n The separation of classes improves when the “correct” features are chosen for investigation, as in the following star-galaxy

discrimination test: the “Star-Galaxy Separation” Problem

Reference: http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/BrunnerDPS.pdf

Not good

Good

(11)

3 Categories of Scientific Knowledge

Discovery in Databases

n

Class Discovery

n Clustering of scientific parameters

n Finding new classes of objects and behaviors

n

Novelty Discovery

n Finding new, rare, one-in-a-million(billion)(trillion)

objects and events

n

Correlation Discovery

n Finding new patterns and dependencies, which

reveal new physics or new scientific principles

(12)

Big Data Science:

Scientific KDD (Knowledge Discovery from Data)

• Characterize the known (clustering,

unsupervised learning)

• Assign the new (classification, supervised learning)

• Discover the unknown (outlier

detection, semi-supervised learning)

• Benefits of very large datasets:

• best statistical analysis of “typical” events • automated search for “rare” events

(13)

This graphic says it all …

• Clustering – examine the data and find the data clusters (clouds), without considering what the items are =

Characterization !

• Classification – for each new data item, try to place it within a

Graphic provided by Professor S. G. Djorgovski, Caltech

try to place it within a known class (i.e., a known category or cluster) = Classify !

• Outlier Detection – identify those data

items that don’t fit into the known classes or clusters = Surprise !

(14)

n Astroinformatics is data-oriented astronomy for

research and education (similar to Bioinformatics,

Geoinformatics, Cheminformatics, and many others).

n Informatics is the discipline of organizing, accessing,

mining, & analyzing information describing complex systems (e.g., the human genome, the Earth, or the

Astroinformatics

systems (e.g., the human genome, the Earth, or the Universe).

n X-informatics: key enabler of scientific discovery in the era

of Big Data science. (X = Bio, Geo, Astro, ...) (Jim Gray, KDD-2003)

n Intelligent data discovery, integration, mining, and

analysis tools will enable exponential knowledge discovery within exponentially growing data

collections.

(15)

Borne (2010): “Astroinformatics: Data-Oriented Astronomy Research and

Education”, Journal of Earth Science Informatics, vol. 3, pp. 5-17.

Astroinformatics Research paper available !

Addresses the data science challenges, research agenda, application areas,

(16)

Astronomy Data Environment: Sky Surveys

• To avoid biases caused by limited samples, astronomers now study the sky systematically = Sky Surveys

• Surveys are used to measure and collect data from all

objects that are contained in large regions of the sky, in a systematic, controlled, repeatable fashion.

• Brief(!!) history of surveys (... this is just a subset): • Brief(!!) history of surveys (... this is just a subset):

– MACHO and related surveys for dark matter objects: ~ 1 Terabyte

– Digitized Palomar Sky Survey: 3 Terabytes

– 2MASS (2-Micron All-Sky Survey): 10 Terabytes

– GALEX (ultraviolet all-sky survey): 30 Terabytes

– Sloan Digital Sky Survey (1/4 of the sky): 40 Terabytes

– Pan-STARRS: 40 Petabytes!

• Leading up to the big survey next decade:

(17)

LSST =

Large

Synoptic

Survey

Telescope

http://www.lsst.org/ 8.4-meter diameter primary mirror = 10 square degrees! Hello !

– 100-200 Petabyte image archive – 20-40 Petabyte database catalog

(18)

Observing Strategy: One pair of images every 40 seconds for each spot on the sky,

then continue across the sky continuously every night for 10 years (~2021-2031), with time domain sampling in log(time) intervals (to capture dynamic range of transients).

• LSST (Large Synoptic Survey Telescope):

– Ten-year time series imaging of the night sky – mapping the Universe !

– ~2,000,000 events each night – anything that goes bump in the night ! – Cosmic Cinematography! The New Sky! @ http://www.lsst.org/

Education and Public Outreach have been an integral and key feature of the project since the beginning – the EPO program includes formal Ed, informal Ed, Citizen Science projects, and Science Centers / Planetaria.

(19)

LSST Key Science Drivers: Mapping the Dynamic Universe

– Solar System Inventory (moving objects, NEOs, asteroids: census & tracking) – Nature of Dark Energy (distant supernovae, weak lensing, cosmology)

– Optical transients (of all kinds, with alert notifications within 60 seconds) – Digital Milky Way (proper motions, parallaxes, star streams, dark matter)

LSST in time and space: – When? ~2021-2031

– Where? Cerro Pachon, Chile

Architect’s design of LSST Observatory

(20)

LSST Summary

http://www.lsst.org/

• 3-Gigapixel camera

• One 6-Gigabyte image every 20 seconds • 30 Terabytes every night for 10 years

• 100-Petabyte final image data archive anticipated –

all data are public!!! all data are public!!!

• 20-Petabyte final database catalog anticipated

• Real-Time Event Mining: 1-10 million events per night, every night, for 10 yrs

– Follow-up observations required to classify these

• Repeat images of the entire night sky every 3 nights:

(21)

LSST Science Collaborations

http://www.lsst.org/lsst/science/participate

• The original 10 teams

:

1. Solar System

2. Milky Way & the Local Volume 3. Transients/Variable Stars

4. Stellar Populations 4. Stellar Populations

5. Supernovae

6. Galaxies

7. Active Galactic Nuclei (AGN, quasars)

8. Strong Lensing

9. Weak Lensing

(22)

LSST Science Collaborations

http://www.lsst.org/lsst/science/participate

• Original 10 teams:

1. Solar System

2. Milky Way & the Local Volume 3. Transients/Variable Stars

4. Stellar Populations

5. Supernovae 5. Supernovae

6. Galaxies

7. Active Galactic Nuclei (AGN, quasars)

8. Strong Lensing

9. Weak Lensing

10. Large-scale Structure / Baryon Oscillations

(23)

The LSST ISSC

• Informatics and Statistics Science Collaboration

• Core Leadership team:

– Kirk Borne – Jogesh Babu – Jogesh Babu – Tom Loredo – Eric Feigelson – Alex Gray

• ~50 members include astronomers, statisticians,

and machine learning data miners

(24)

The LSST ISSC research focus

• In discussing the data-related research challenges

posed by LSST, we identified several research areas:

– Statistics

– Data & Information Visualization – Data mining (machine learning)

– Data-intensive computing & analysis _Informatics – Large-scale scientific data management

– Semantics and Ontologies

• These areas represent Statistics and the science of

Informatics (Astroinformatics) = Data-intensive Science = the 4th _{Paradigm of Scientific Research}

– Addressing the LSST “Data to Knowledge” challenge – Helping to discover the unknown unknowns

(25)

The LSST Big Data Challenges

1.

1. Massive data stream: ~2 Massive data stream: ~2 Terabytes of image data Terabytes of image data per hour that must be per hour that must be mined in real time (for 10 mined in real time (for 10 years).

years).

2.

2. Massive 20Massive 20--Petabyte Petabyte database: more than 50 database: more than 50 billion objects need to be billion objects need to be

•• Challenge #1 includes Challenge #1 includes both the static data both the static data mining aspects of #2 mining aspects of #2 and the dynamic data and the dynamic data mining aspects of #3. mining aspects of #3.

billion objects need to be billion objects need to be classified, and most will be classified, and most will be monitored for important monitored for important variations in real time. variations in real time.

3.

3. Massive event stream: Massive event stream: knowledge extraction in knowledge extraction in real time for ~2,000,000 real time for ~2,000,000 events each night.

events each night.

•• Look at #3 in more

Look at #3 in more

detail ...

(26)

LSST data mining challenge # 3

• Approximately 2,000,000 times each night for 10 years LSST will obtain the following data on a

new sky event, and we will be challenged with classifying these data:

time

fl

u

(27)

new sky event, and we will be challenged with classifying these data: more data points help !

time

fl

u

(28)

LSST data mining challenge # 3

new sky event, and we will be challenged with classifying these data: more data points help !

time

fl

u

(29)

new sky event, and we will be challenged with classifying these data: What will we do with all of these events?

(30)

LSST data mining challenge # 3

new sky event, and we will be challenged with classifying these data: What will we do with all of these events?

Characterize first !

(Unsupervised Learning)

Classify later.

(31)

Characterization includes …

• Feature Detection and Extraction:

• Identifying and describing features in the data

– via machine algorithms or human inspection (including the potentially huge contributions from Citizen Science)

• Extracting feature descriptors from the data

• Curating these features for search, re-use, & discovery • Curating these features for search, re-use, & discovery

• Finding other parameters and features from other archives, other databases, other information sources – and using

those to help characterize (ultimately classify) each new event.

• This introduces more variety and complexity to the data ... • … but this should help us to discriminate subtle distinctions

(32)

Data-driven Discovery (Unsupervised Learning)

i.e., What can I do with characterizations?

1. Class Discovery – Clustering

2. Principal Component Analysis – Dimension

Reduction

3. Outlier (Anomaly / Deviation / Novelty)

Detection – Surprise Discovery

4. Link Analysis – Association Analysis –

Network Analysis

(33)

Data-driven Discovery (Unsupervised Learning)

• Class Discovery – Clustering

• Distinguish different classes of behavior or different types of objects • Find new classes of behavior or new types of objects

• Describe a large data collection by a small number of condensed representations

• Principal Component Analysis – Dimension Reduction

• Find the dominant features among all of the data attributes

• Generate low-dimensional descriptions of events and behaviors, while revealing correlations and dependencies among parameters

• Addresses the Curse of Dimensionality • Addresses the Curse of Dimensionality

• Outlier Detection – Surprise / Anomaly / Deviation / Novelty Discovery

• Find the unknown unknowns (the rare one-in-a-billion or one-in-a-trillion event) • Find objects and events that are outside the bounds of our expectations

• These could be garbage (erroneous measurements) or true discoveries

• Used for data quality assurance and/or for discovery of new / rare / interesting data items

• Link Analysis – Association Analysis – Network Analysis

• Identify connections between different events (or objects)

• Find unusual (improbable) co-occurring combinations of data attribute values • Find data items that have much fewer than “6 degrees of separation”

(34)

Why do all of this?

… for 4 very simple reasons:

• (1) Any real data table may consist of

thousands, or millions, or billions of rows of numbers.

• (2) Any real data table will probably have many more (perhaps hundreds more)

many more (perhaps hundreds more) attributes (features), not just two.

• (3) Humans can make mistakes when

staring for hours at long lists of numbers, especially in a dynamic data stream.

• (4) The use of a data-driven model provides an objective, scientific, rational, and

(35)

Why do all of this?

Volume

Variety

many more (perhaps hundreds more)

attributes (features), not just two.

justifiable test of a hypothesis. 35

Variety

Velocity

(36)

Why do all of this?

Volume

Variety

It is too much !

It is too complex !

many more (perhaps hundreds more) attributes (features), not just two.

justifiable test of a hypothesis. 36

Variety

Velocity

Veracity

Can you prove your results ?

It is too complex !

(37)

Rationale for BIG DATA – 1

• Consequently, if we collect a thorough set of parameters (high-dimensional data) for a

complete set of items within our domain of

study, then we would have a “perfect” statistical model for that domain.

• In other words, Big Data becomes the model for • In other words, Big Data becomes the model for

a domain X = we call this X-informatics.

• Anything we want to know about that domain is specified and encoded within the data.

• The goal of Big Data Science is to find those encodings, patterns, and knowledge nuggets.

• Recall what we said before …

(38)

Rationale for BIG DATA – 2

• … one of the two major benefits of BIG DATA is one of the two major benefits of BIG DATA is one of the two major benefits of BIG DATA is one of the two major benefits of BIG DATA is

to provide the best statistical analysis ever(!) for to provide the best statistical analysis ever(!) for to provide the best statistical analysis ever(!) for to provide the best statistical analysis ever(!) for the domain of study.

the domain of study. the domain of study. the domain of study.

38

Benefits of very large datasets:

1. best statistical analysis of “typical” events

2. automated search for “rare” events

(39)

The other great rationale for Big Data

• Outlier detection:

– Anomaly detection / novelty discovery – Rare event discovery

– Surprise Discovery

– Finding the objects and events that are outside the – Finding the objects and events that are outside the

bounds of our expectations (outside known clusters) – Discovering the unknown unknowns

– These may be real scientific discoveries or garbage

– Outlier detection is therefore useful for:

• Anomaly Detection – is the detector system working? • Data Quality Assurance – is the data pipeline working? • Novelty Discovery – is my Nobel prize waiting?

(40)

Data Science

addresses the

Data-to-Knowledge Challenges:

Finding order in Big Data

and