• No results found

Learning from Big Data in

N/A
N/A
Protected

Academic year: 2021

Share "Learning from Big Data in"

Copied!
40
0
0

Loading.... (view fulltext now)

Full text

(1)

Learning from Big Data in

Astronomy – an overview

Astronomy – an overview

Kirk Borne

George Mason University

School of Physics, Astronomy, & Computational Sciences

(2)

From traditional astronomy …

(3)

… to Big Data Astronomy is a Big Challenge

(4)

Scary News:

Big Data is

taking us to a

taking us to a

Tipping Point

4 http://bit.ly/HUqmu5
(5)

Good News: Big Data is Sexy

5

(6)

Characteristics of Big Data

n If the only distinguishing characteristic was that we

have lots of data, we would call it “Lots of Data”.

n Big Data characteristics: the 3+n V’s =

1. Volume Volume (lots of data = “Tonnabytes”)(lots of data = “Tonnabytes”)

2. Variety (complexity, curse of dimensionality) 3. Velocity (rate of data and information flow)

4. Veracity (verifying inference-based models from

comprehensive data collections)

5. Variability 6. Venue

7. Vocabulary

(7)

Characteristics of Big Data

n If the only distinguishing characteristic was that we

have lots of data, we would call it “Lots of Data”.

n Big Data characteristics: the 3+n V’s =

1. Volume Volume

2. Variety : this one helps us to discriminate subtle

new classes (= Class Discovery)

3. Velocity 4. Veracity 5. Variability 6. Venue 7. Vocabulary 7

(8)

Insufficient

Variety: stars & galaxies

are not separated

in this parameter

(9)

Sufficient

Variety: stars & galaxies

are separated

in this parameter

(10)

Class Discovery: feature separation

and discrimination of classes

n The separation of classes improves when the “correct” features are chosen for investigation, as in the following star-galaxy

discrimination test: the “Star-Galaxy Separation” Problem

Reference: http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/BrunnerDPS.pdf

Not good

Good

(11)

3 Categories of Scientific Knowledge

Discovery in Databases

n

Class Discovery

n Clustering of scientific parameters

n Finding new classes of objects and behaviors

n

Novelty Discovery

n Finding new, rare, one-in-a-million(billion)(trillion)

objects and events

n

Correlation Discovery

n Finding new patterns and dependencies, which

reveal new physics or new scientific principles

(12)

Big Data Science:

Scientific KDD (Knowledge Discovery from Data)

• Characterize the known (clustering,

unsupervised learning)

• Assign the new (classification, supervised learning)

• Discover the unknown (outlier

detection, semi-supervised learning)

• Benefits of very large datasets:

• best statistical analysis of “typical” events • automated search for “rare” events

(13)

This graphic says it all …

• Clustering – examine the data and find the data clusters (clouds), without considering what the items are =

Characterization !

• Classification – for each new data item, try to place it within a

Graphic provided by Professor S. G. Djorgovski, Caltech

try to place it within a known class (i.e., a known category or cluster) = Classify !

• Outlier Detection – identify those data

items that don’t fit into the known classes or clusters = Surprise !

(14)

n Astroinformatics is data-oriented astronomy for

research and education (similar to Bioinformatics,

Geoinformatics, Cheminformatics, and many others).

n Informatics is the discipline of organizing, accessing,

mining, & analyzing information describing complex systems (e.g., the human genome, the Earth, or the

Astroinformatics

systems (e.g., the human genome, the Earth, or the Universe).

n X-informatics: key enabler of scientific discovery in the era

of Big Data science. (X = Bio, Geo, Astro, ...) (Jim Gray, KDD-2003)

n Intelligent data discovery, integration, mining, and

analysis tools will enable exponential knowledge discovery within exponentially growing data

collections.

(15)

Borne (2010): “Astroinformatics: Data-Oriented Astronomy Research and

Education”, Journal of Earth Science Informatics, vol. 3, pp. 5-17.

See also http://arxiv.org/abs/0909.3892

Astroinformatics Research paper available !

Addresses the data science challenges, research agenda, application areas,
(16)

Astronomy Data Environment: Sky Surveys

• To avoid biases caused by limited samples, astronomers now study the sky systematically = Sky Surveys

• Surveys are used to measure and collect data from all

objects that are contained in large regions of the sky, in a systematic, controlled, repeatable fashion.

• Brief(!!) history of surveys (... this is just a subset): • Brief(!!) history of surveys (... this is just a subset):

– MACHO and related surveys for dark matter objects: ~ 1 Terabyte

– Digitized Palomar Sky Survey: 3 Terabytes

– 2MASS (2-Micron All-Sky Survey): 10 Terabytes

– GALEX (ultraviolet all-sky survey): 30 Terabytes

– Sloan Digital Sky Survey (1/4 of the sky): 40 Terabytes

– Pan-STARRS: 40 Petabytes!

• Leading up to the big survey next decade:

(17)

LSST =

Large

Synoptic

Survey

Telescope

http://www.lsst.org/ 8.4-meter diameter primary mirror = 10 square degrees! Hello !

– 100-200 Petabyte image archive – 20-40 Petabyte database catalog

(18)

Observing Strategy: One pair of images every 40 seconds for each spot on the sky,

then continue across the sky continuously every night for 10 years (~2021-2031), with time domain sampling in log(time) intervals (to capture dynamic range of transients).

LSST (Large Synoptic Survey Telescope):

– Ten-year time series imaging of the night sky – mapping the Universe !

– ~2,000,000 events each night – anything that goes bump in the night ! – Cosmic Cinematography! The New Sky! @ http://www.lsst.org/

Education and Public Outreach have been an integral and key feature of the project since the beginning – the EPO program includes formal Ed, informal Ed, Citizen Science projects, and Science Centers / Planetaria.

(19)

LSST Key Science Drivers: Mapping the Dynamic Universe

– Solar System Inventory (moving objects, NEOs, asteroids: census & tracking) – Nature of Dark Energy (distant supernovae, weak lensing, cosmology)

– Optical transients (of all kinds, with alert notifications within 60 seconds) – Digital Milky Way (proper motions, parallaxes, star streams, dark matter)

LSST in time and space: – When? ~2021-2031

– Where? Cerro Pachon, Chile

Architect’s design of LSST Observatory

(20)

LSST Summary

http://www.lsst.org/

• 3-Gigapixel camera

• One 6-Gigabyte image every 20 seconds • 30 Terabytes every night for 10 years

• 100-Petabyte final image data archive anticipated –

all data are public!!! all data are public!!!

• 20-Petabyte final database catalog anticipated

• Real-Time Event Mining: 1-10 million events per night, every night, for 10 yrs

– Follow-up observations required to classify these

• Repeat images of the entire night sky every 3 nights:

(21)

LSST Science Collaborations

http://www.lsst.org/lsst/science/participate

• The original 10 teams

:

1. Solar System

2. Milky Way & the Local Volume 3. Transients/Variable Stars

4. Stellar Populations 4. Stellar Populations

5. Supernovae

6. Galaxies

7. Active Galactic Nuclei (AGN, quasars)

8. Strong Lensing

9. Weak Lensing

(22)

LSST Science Collaborations

http://www.lsst.org/lsst/science/participate

• Original 10 teams:

1. Solar System

2. Milky Way & the Local Volume 3. Transients/Variable Stars

4. Stellar Populations

5. Supernovae 5. Supernovae

6. Galaxies

7. Active Galactic Nuclei (AGN, quasars)

8. Strong Lensing

9. Weak Lensing

10. Large-scale Structure / Baryon Oscillations

(23)

The LSST ISSC

• Informatics and Statistics Science Collaboration

• Core Leadership team:

– Kirk Borne – Jogesh Babu – Jogesh Babu – Tom Loredo – Eric Feigelson – Alex Gray

• ~50 members include astronomers, statisticians,

and machine learning data miners

(24)

The LSST ISSC research focus

• In discussing the data-related research challenges

posed by LSST, we identified several research areas:

– Statistics

– Data & Information Visualization – Data mining (machine learning)

– Data-intensive computing & analysis Informatics – Large-scale scientific data management

– Semantics and Ontologies

• These areas represent Statistics and the science of

Informatics (Astroinformatics) = Data-intensive Science = the 4th Paradigm of Scientific Research

– Addressing the LSST “Data to Knowledge” challenge – Helping to discover the unknown unknowns

(25)

The LSST Big Data Challenges

The LSST Big Data Challenges

1.

1. Massive data stream: ~2 Massive data stream: ~2 Terabytes of image data Terabytes of image data per hour that must be per hour that must be mined in real time (for 10 mined in real time (for 10 years).

years).

2.

2. Massive 20Massive 20--Petabyte Petabyte database: more than 50 database: more than 50 billion objects need to be billion objects need to be

•• Challenge #1 includes Challenge #1 includes both the static data both the static data mining aspects of #2 mining aspects of #2 and the dynamic data and the dynamic data mining aspects of #3. mining aspects of #3.

billion objects need to be billion objects need to be classified, and most will be classified, and most will be monitored for important monitored for important variations in real time. variations in real time.

3.

3. Massive event stream: Massive event stream: knowledge extraction in knowledge extraction in real time for ~2,000,000 real time for ~2,000,000 events each night.

events each night.

•• Look at #3 in more

Look at #3 in more

detail ...

(26)

LSST data mining challenge # 3

• Approximately 2,000,000 times each night for 10 years LSST will obtain the following data on a

new sky event, and we will be challenged with classifying these data:

time

fl

u

(27)

LSST data mining challenge # 3

• Approximately 2,000,000 times each night for 10 years LSST will obtain the following data on a

new sky event, and we will be challenged with classifying these data: more data points help !

time

fl

u

(28)

LSST data mining challenge # 3

• Approximately 2,000,000 times each night for 10 years LSST will obtain the following data on a

new sky event, and we will be challenged with classifying these data: more data points help !

time

fl

u

(29)

LSST data mining challenge # 3

• Approximately 2,000,000 times each night for 10 years LSST will obtain the following data on a

new sky event, and we will be challenged with classifying these data: What will we do with all of these events?

(30)

LSST data mining challenge # 3

• Approximately 2,000,000 times each night for 10 years LSST will obtain the following data on a

new sky event, and we will be challenged with classifying these data: What will we do with all of these events?

Characterize first !

(Unsupervised Learning)

Classify later.

(31)

Characterization includes …

• Feature Detection and Extraction:

• Identifying and describing features in the data

– via machine algorithms or human inspection (including the potentially huge contributions from Citizen Science)

• Extracting feature descriptors from the data

• Curating these features for search, re-use, & discovery • Curating these features for search, re-use, & discovery

• Finding other parameters and features from other archives, other databases, other information sources – and using

those to help characterize (ultimately classify) each new event.

• This introduces more variety and complexity to the data ... • … but this should help us to discriminate subtle distinctions

(32)

Data-driven Discovery (Unsupervised Learning)

i.e., What can I do with characterizations?

1. Class Discovery – Clustering

2. Principal Component Analysis – Dimension

Reduction

3. Outlier (Anomaly / Deviation / Novelty)

Detection – Surprise Discovery

4. Link Analysis – Association Analysis –

Network Analysis

(33)

Data-driven Discovery (Unsupervised Learning)

• Class Discovery – Clustering

• Distinguish different classes of behavior or different types of objects • Find new classes of behavior or new types of objects

• Describe a large data collection by a small number of condensed representations

• Principal Component Analysis – Dimension Reduction

• Find the dominant features among all of the data attributes

• Generate low-dimensional descriptions of events and behaviors, while revealing correlations and dependencies among parameters

• Addresses the Curse of Dimensionality • Addresses the Curse of Dimensionality

• Outlier Detection – Surprise / Anomaly / Deviation / Novelty Discovery

• Find the unknown unknowns (the rare one-in-a-billion or one-in-a-trillion event) • Find objects and events that are outside the bounds of our expectations

• These could be garbage (erroneous measurements) or true discoveries

• Used for data quality assurance and/or for discovery of new / rare / interesting data items

• Link Analysis – Association Analysis – Network Analysis

• Identify connections between different events (or objects)

• Find unusual (improbable) co-occurring combinations of data attribute values • Find data items that have much fewer than “6 degrees of separation”

(34)

Why do all of this?

… for 4 very simple reasons:

• (1) Any real data table may consist of

thousands, or millions, or billions of rows of numbers.

• (2) Any real data table will probably have many more (perhaps hundreds more)

many more (perhaps hundreds more) attributes (features), not just two.

• (3) Humans can make mistakes when

staring for hours at long lists of numbers, especially in a dynamic data stream.

• (4) The use of a data-driven model provides an objective, scientific, rational, and

(35)

Why do all of this?

… for 4 very simple reasons:

• (1) Any real data table may consist of

thousands, or millions, or billions of rows of numbers.

• (2) Any real data table will probably have many more (perhaps hundreds more)

Volume

Variety

many more (perhaps hundreds more)

attributes (features), not just two.

• (3) Humans can make mistakes when

staring for hours at long lists of numbers, especially in a dynamic data stream.

• (4) The use of a data-driven model provides an objective, scientific, rational, and

justifiable test of a hypothesis. 35

Variety

Velocity

(36)

Why do all of this?

… for 4 very simple reasons:

• (1) Any real data table may consist of

thousands, or millions, or billions of rows of numbers.

• (2) Any real data table will probably have many more (perhaps hundreds more)

Volume

Variety

It is too much !

It is too complex !

many more (perhaps hundreds more) attributes (features), not just two.

• (3) Humans can make mistakes when

staring for hours at long lists of numbers, especially in a dynamic data stream.

• (4) The use of a data-driven model provides an objective, scientific, rational, and

justifiable test of a hypothesis. 36

Variety

Velocity

Veracity

Can you prove your results ?

It is too complex !

(37)

Rationale for BIG DATA – 1

• Consequently, if we collect a thorough set of parameters (high-dimensional data) for a

complete set of items within our domain of

study, then we would have a “perfect” statistical model for that domain.

• In other words, Big Data becomes the model for • In other words, Big Data becomes the model for

a domain X = we call this X-informatics.

• Anything we want to know about that domain is specified and encoded within the data.

• The goal of Big Data Science is to find those encodings, patterns, and knowledge nuggets.

• Recall what we said before …

(38)

Rationale for BIG DATA – 2

• … one of the two major benefits of BIG DATA is one of the two major benefits of BIG DATA is one of the two major benefits of BIG DATA is one of the two major benefits of BIG DATA is

to provide the best statistical analysis ever(!) for to provide the best statistical analysis ever(!) for to provide the best statistical analysis ever(!) for to provide the best statistical analysis ever(!) for the domain of study.

the domain of study. the domain of study. the domain of study.

38

Benefits of very large datasets:

1. best statistical analysis of “typical” events

2. automated search for “rare” events

(39)

The other great rationale for Big Data

• Outlier detection:

– Anomaly detection / novelty discovery – Rare event discovery

– Surprise Discovery

– Finding the objects and events that are outside the – Finding the objects and events that are outside the

bounds of our expectations (outside known clusters) – Discovering the unknown unknowns

– These may be real scientific discoveries or garbage

– Outlier detection is therefore useful for:

• Anomaly Detection – is the detector system working? • Data Quality Assurance – is the data pipeline working? • Novelty Discovery – is my Nobel prize waiting?

(40)

Data Science

addresses the

Data-to-Knowledge Challenges:

Finding order in Big Data

and

References

Related documents

Understand the words and rest assured in hindi with example sentences and bundle you name that you getting word wise available procure the server.. We insure for we hope and please

Balance billing is allowed: Balance billing is where a healthcare provider, who has already received payment from Medicare for a specific service or treatment, asserts a lien

• Currently, private umbilical cord blood banking is cost-effective only for children with a very high likelihood of needing a stem cell transplant. • Obstetrics

altera showed a distinct effect on the emergence and survival of annual and perennial species 28.. and negatively affected the growth of individuals belonging to both groups

Just as many companies may be unready to hire and retain veterans, many veterans don’t know how to find work in the civilian sector.. They may not know how to locate or connect

Despite the huge void left in our lives, we gain comfort from knowing that mum had so many friends in her Church family and I myself have many fond memories of Swann Lane church..

Employee Informal economy The person is a … Informal employment in the informal sector Informal employee Formal employment in the informal sector Formal employee OAW - Own

• Gained support from local practices • Partnered with Rural Respiratory Nurse • Mobile team provides local programmes. for