Learning from Big Data in
Astronomy – an overview
Astronomy – an overview
Kirk Borne
George Mason University
School of Physics, Astronomy, & Computational Sciences
From traditional astronomy …
… to Big Data Astronomy is a Big Challenge
Scary News:
Big Data is
taking us to a
taking us to a
Tipping Point
4 http://bit.ly/HUqmu5Good News: Big Data is Sexy
5
Characteristics of Big Data
n If the only distinguishing characteristic was that we
have lots of data, we would call it “Lots of Data”.
n Big Data characteristics: the 3+n V’s =
1. Volume Volume (lots of data = “Tonnabytes”)(lots of data = “Tonnabytes”)
2. Variety (complexity, curse of dimensionality) 3. Velocity (rate of data and information flow)
4. Veracity (verifying inference-based models from
comprehensive data collections)
5. Variability 6. Venue
7. Vocabulary
Characteristics of Big Data
n If the only distinguishing characteristic was that we
have lots of data, we would call it “Lots of Data”.
n Big Data characteristics: the 3+n V’s =
1. Volume Volume
2. Variety : this one helps us to discriminate subtle
new classes (= Class Discovery)
3. Velocity 4. Veracity 5. Variability 6. Venue 7. Vocabulary 7
Insufficient
Variety: stars & galaxies
are not separated
in this parameter
Sufficient
Variety: stars & galaxies
are separated
in this parameter
Class Discovery: feature separation
and discrimination of classes
n The separation of classes improves when the “correct” features are chosen for investigation, as in the following star-galaxy
discrimination test: the “Star-Galaxy Separation” Problem
Reference: http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/BrunnerDPS.pdf
Not good
Good
3 Categories of Scientific Knowledge
Discovery in Databases
n
Class Discovery
n Clustering of scientific parameters
n Finding new classes of objects and behaviors
n
Novelty Discovery
n Finding new, rare, one-in-a-million(billion)(trillion)
objects and events
n
Correlation Discovery
n Finding new patterns and dependencies, which
reveal new physics or new scientific principles
Big Data Science:
Scientific KDD (Knowledge Discovery from Data)
• Characterize the known (clustering,unsupervised learning)
• Assign the new (classification, supervised learning)
• Discover the unknown (outlier
detection, semi-supervised learning)
• Benefits of very large datasets:
• best statistical analysis of “typical” events • automated search for “rare” events
This graphic says it all …
• Clustering – examine the data and find the data clusters (clouds), without considering what the items are =
Characterization !
• Classification – for each new data item, try to place it within a
Graphic provided by Professor S. G. Djorgovski, Caltech
try to place it within a known class (i.e., a known category or cluster) = Classify !
• Outlier Detection – identify those data
items that don’t fit into the known classes or clusters = Surprise !
n Astroinformatics is data-oriented astronomy for
research and education (similar to Bioinformatics,
Geoinformatics, Cheminformatics, and many others).
n Informatics is the discipline of organizing, accessing,
mining, & analyzing information describing complex systems (e.g., the human genome, the Earth, or the
Astroinformatics
systems (e.g., the human genome, the Earth, or the Universe).
n X-informatics: key enabler of scientific discovery in the era
of Big Data science. (X = Bio, Geo, Astro, ...) (Jim Gray, KDD-2003)
n Intelligent data discovery, integration, mining, and
analysis tools will enable exponential knowledge discovery within exponentially growing data
collections.
Borne (2010): “Astroinformatics: Data-Oriented Astronomy Research and
Education”, Journal of Earth Science Informatics, vol. 3, pp. 5-17.
See also http://arxiv.org/abs/0909.3892
Astroinformatics Research paper available !
Addresses the data science challenges, research agenda, application areas,Astronomy Data Environment: Sky Surveys
• To avoid biases caused by limited samples, astronomers now study the sky systematically = Sky Surveys
• Surveys are used to measure and collect data from all
objects that are contained in large regions of the sky, in a systematic, controlled, repeatable fashion.
• Brief(!!) history of surveys (... this is just a subset): • Brief(!!) history of surveys (... this is just a subset):
– MACHO and related surveys for dark matter objects: ~ 1 Terabyte
– Digitized Palomar Sky Survey: 3 Terabytes
– 2MASS (2-Micron All-Sky Survey): 10 Terabytes
– GALEX (ultraviolet all-sky survey): 30 Terabytes
– Sloan Digital Sky Survey (1/4 of the sky): 40 Terabytes
– Pan-STARRS: 40 Petabytes!
• Leading up to the big survey next decade:
LSST =
Large
Synoptic
Survey
Telescope
http://www.lsst.org/ 8.4-meter diameter primary mirror = 10 square degrees! Hello !– 100-200 Petabyte image archive – 20-40 Petabyte database catalog
Observing Strategy: One pair of images every 40 seconds for each spot on the sky,
then continue across the sky continuously every night for 10 years (~2021-2031), with time domain sampling in log(time) intervals (to capture dynamic range of transients).
• LSST (Large Synoptic Survey Telescope):
– Ten-year time series imaging of the night sky – mapping the Universe !
– ~2,000,000 events each night – anything that goes bump in the night ! – Cosmic Cinematography! The New Sky! @ http://www.lsst.org/
Education and Public Outreach have been an integral and key feature of the project since the beginning – the EPO program includes formal Ed, informal Ed, Citizen Science projects, and Science Centers / Planetaria.
LSST Key Science Drivers: Mapping the Dynamic Universe
– Solar System Inventory (moving objects, NEOs, asteroids: census & tracking) – Nature of Dark Energy (distant supernovae, weak lensing, cosmology)
– Optical transients (of all kinds, with alert notifications within 60 seconds) – Digital Milky Way (proper motions, parallaxes, star streams, dark matter)
LSST in time and space: – When? ~2021-2031
– Where? Cerro Pachon, Chile
Architect’s design of LSST Observatory
LSST Summary
http://www.lsst.org/
• 3-Gigapixel camera
• One 6-Gigabyte image every 20 seconds • 30 Terabytes every night for 10 years
• 100-Petabyte final image data archive anticipated –
all data are public!!! all data are public!!!
• 20-Petabyte final database catalog anticipated
• Real-Time Event Mining: 1-10 million events per night, every night, for 10 yrs
– Follow-up observations required to classify these
• Repeat images of the entire night sky every 3 nights:
LSST Science Collaborations
http://www.lsst.org/lsst/science/participate
• The original 10 teams
:
1. Solar System
2. Milky Way & the Local Volume 3. Transients/Variable Stars
4. Stellar Populations 4. Stellar Populations
5. Supernovae
6. Galaxies
7. Active Galactic Nuclei (AGN, quasars)
8. Strong Lensing
9. Weak Lensing
LSST Science Collaborations
http://www.lsst.org/lsst/science/participate
• Original 10 teams:
1. Solar System
2. Milky Way & the Local Volume 3. Transients/Variable Stars
4. Stellar Populations
5. Supernovae 5. Supernovae
6. Galaxies
7. Active Galactic Nuclei (AGN, quasars)
8. Strong Lensing
9. Weak Lensing
10. Large-scale Structure / Baryon Oscillations
The LSST ISSC
• Informatics and Statistics Science Collaboration
• Core Leadership team:
– Kirk Borne – Jogesh Babu – Jogesh Babu – Tom Loredo – Eric Feigelson – Alex Gray
• ~50 members include astronomers, statisticians,
and machine learning data miners
The LSST ISSC research focus
• In discussing the data-related research challenges
posed by LSST, we identified several research areas:
– Statistics
– Data & Information Visualization – Data mining (machine learning)
– Data-intensive computing & analysis Informatics – Large-scale scientific data management
– Semantics and Ontologies
• These areas represent Statistics and the science of
Informatics (Astroinformatics) = Data-intensive Science = the 4th Paradigm of Scientific Research
– Addressing the LSST “Data to Knowledge” challenge – Helping to discover the unknown unknowns
The LSST Big Data Challenges
The LSST Big Data Challenges
1.
1. Massive data stream: ~2 Massive data stream: ~2 Terabytes of image data Terabytes of image data per hour that must be per hour that must be mined in real time (for 10 mined in real time (for 10 years).
years).
2.
2. Massive 20Massive 20--Petabyte Petabyte database: more than 50 database: more than 50 billion objects need to be billion objects need to be
•• Challenge #1 includes Challenge #1 includes both the static data both the static data mining aspects of #2 mining aspects of #2 and the dynamic data and the dynamic data mining aspects of #3. mining aspects of #3.
billion objects need to be billion objects need to be classified, and most will be classified, and most will be monitored for important monitored for important variations in real time. variations in real time.
3.
3. Massive event stream: Massive event stream: knowledge extraction in knowledge extraction in real time for ~2,000,000 real time for ~2,000,000 events each night.
events each night.
•• Look at #3 in more
Look at #3 in more
detail ...
LSST data mining challenge # 3
• Approximately 2,000,000 times each night for 10 years LSST will obtain the following data on a
new sky event, and we will be challenged with classifying these data:
time
fl
u
LSST data mining challenge # 3
• Approximately 2,000,000 times each night for 10 years LSST will obtain the following data on a
new sky event, and we will be challenged with classifying these data: more data points help !
time
fl
u
LSST data mining challenge # 3
• Approximately 2,000,000 times each night for 10 years LSST will obtain the following data on a
new sky event, and we will be challenged with classifying these data: more data points help !
time
fl
u
LSST data mining challenge # 3
• Approximately 2,000,000 times each night for 10 years LSST will obtain the following data on a
new sky event, and we will be challenged with classifying these data: What will we do with all of these events?
LSST data mining challenge # 3
• Approximately 2,000,000 times each night for 10 years LSST will obtain the following data on a
new sky event, and we will be challenged with classifying these data: What will we do with all of these events?
Characterize first !
(Unsupervised Learning)
Classify later.
Characterization includes …
• Feature Detection and Extraction:
• Identifying and describing features in the data
– via machine algorithms or human inspection (including the potentially huge contributions from Citizen Science)
• Extracting feature descriptors from the data
• Curating these features for search, re-use, & discovery • Curating these features for search, re-use, & discovery
• Finding other parameters and features from other archives, other databases, other information sources – and using
those to help characterize (ultimately classify) each new event.
• This introduces more variety and complexity to the data ... • … but this should help us to discriminate subtle distinctions
Data-driven Discovery (Unsupervised Learning)
i.e., What can I do with characterizations?
1. Class Discovery – Clustering
2. Principal Component Analysis – Dimension
Reduction
3. Outlier (Anomaly / Deviation / Novelty)
Detection – Surprise Discovery
4. Link Analysis – Association Analysis –
Network Analysis
Data-driven Discovery (Unsupervised Learning)
• Class Discovery – Clustering
• Distinguish different classes of behavior or different types of objects • Find new classes of behavior or new types of objects
• Describe a large data collection by a small number of condensed representations
• Principal Component Analysis – Dimension Reduction
• Find the dominant features among all of the data attributes
• Generate low-dimensional descriptions of events and behaviors, while revealing correlations and dependencies among parameters
• Addresses the Curse of Dimensionality • Addresses the Curse of Dimensionality
• Outlier Detection – Surprise / Anomaly / Deviation / Novelty Discovery
• Find the unknown unknowns (the rare one-in-a-billion or one-in-a-trillion event) • Find objects and events that are outside the bounds of our expectations
• These could be garbage (erroneous measurements) or true discoveries
• Used for data quality assurance and/or for discovery of new / rare / interesting data items
• Link Analysis – Association Analysis – Network Analysis
• Identify connections between different events (or objects)
• Find unusual (improbable) co-occurring combinations of data attribute values • Find data items that have much fewer than “6 degrees of separation”
Why do all of this?
… for 4 very simple reasons:
• (1) Any real data table may consist of
thousands, or millions, or billions of rows of numbers.
• (2) Any real data table will probably have many more (perhaps hundreds more)
many more (perhaps hundreds more) attributes (features), not just two.
• (3) Humans can make mistakes when
staring for hours at long lists of numbers, especially in a dynamic data stream.
• (4) The use of a data-driven model provides an objective, scientific, rational, and
Why do all of this?
… for 4 very simple reasons:
• (1) Any real data table may consist of
thousands, or millions, or billions of rows of numbers.
• (2) Any real data table will probably have many more (perhaps hundreds more)
Volume
Variety
many more (perhaps hundreds more)attributes (features), not just two.
• (3) Humans can make mistakes when
staring for hours at long lists of numbers, especially in a dynamic data stream.
• (4) The use of a data-driven model provides an objective, scientific, rational, and
justifiable test of a hypothesis. 35
Variety
Velocity
Why do all of this?
… for 4 very simple reasons:
• (1) Any real data table may consist of
thousands, or millions, or billions of rows of numbers.
• (2) Any real data table will probably have many more (perhaps hundreds more)
Volume
Variety
It is too much !
It is too complex !
many more (perhaps hundreds more) attributes (features), not just two.
• (3) Humans can make mistakes when
staring for hours at long lists of numbers, especially in a dynamic data stream.
• (4) The use of a data-driven model provides an objective, scientific, rational, and
justifiable test of a hypothesis. 36
Variety
Velocity
Veracity
Can you prove your results ?It is too complex !
Rationale for BIG DATA – 1
• Consequently, if we collect a thorough set of parameters (high-dimensional data) for a
complete set of items within our domain of
study, then we would have a “perfect” statistical model for that domain.
• In other words, Big Data becomes the model for • In other words, Big Data becomes the model for
a domain X = we call this X-informatics.
• Anything we want to know about that domain is specified and encoded within the data.
• The goal of Big Data Science is to find those encodings, patterns, and knowledge nuggets.
• Recall what we said before …
Rationale for BIG DATA – 2
• … one of the two major benefits of BIG DATA is one of the two major benefits of BIG DATA is one of the two major benefits of BIG DATA is one of the two major benefits of BIG DATA is
to provide the best statistical analysis ever(!) for to provide the best statistical analysis ever(!) for to provide the best statistical analysis ever(!) for to provide the best statistical analysis ever(!) for the domain of study.
the domain of study. the domain of study. the domain of study.
38
Benefits of very large datasets:
1. best statistical analysis of “typical” events
2. automated search for “rare” events
The other great rationale for Big Data
• Outlier detection:
– Anomaly detection / novelty discovery – Rare event discovery
– Surprise Discovery
– Finding the objects and events that are outside the – Finding the objects and events that are outside the
bounds of our expectations (outside known clusters) – Discovering the unknown unknowns
– These may be real scientific discoveries or garbage
– Outlier detection is therefore useful for:
• Anomaly Detection – is the detector system working? • Data Quality Assurance – is the data pipeline working? • Novelty Discovery – is my Nobel prize waiting?
Data Science
addresses the
Data-to-Knowledge Challenges:
Finding order in Big Data
and