
Introduction to LSST Data Management. Jeffrey Kantor Data Management Project Manager


(1)

Introduction to LSST Data Management

Jeffrey Kantor

(2)

LSST Data Management

Principal Responsibilities

• Archive Raw Data: Receive the incoming stream of images that the Camera system generates and archive the raw images.

• Process to Data Products: Detect and alert on transient events within one minute of visit acquisition. Approximately once per year, create and archive a Data Release, a static, self-consistent collection of data products generated from all survey data taken from the date of survey initiation to the cutoff date for the Data Release.

• Publish: Make all LSST data available through an interface that uses community-accepted standards, and facilitate user data analysis and production of user-defined data products at Data Access Centers (DACs).

(3)

LSST From the User's Perspective

• A stream of ~10 million time-domain events per night, detected and transmitted to event distribution networks within 60 seconds of observation.
• A catalog of orbits for ~6 million bodies in the Solar System.
• A catalog of ~37 billion objects (20B galaxies, 17B stars), ~7 trillion observations ("sources"), and ~30 trillion measurements ("forced sources"), produced annually, accessible through online databases.
• Deep co-added images.
• Services and computing resources at the Data Access Centers to enable user-specified custom processing and analysis.
• Software and APIs enabling development of analysis codes.

[Figure: the deliverables above grouped into Level 1, Level 2, and Level 3 data products]
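As a rough sense of what the 60-second requirement implies for alert distribution, here is a back-of-envelope calculation. The visits-per-night figure is an illustrative assumption, not a value from this presentation.

```python
# Back-of-envelope alert throughput. The nightly alert count comes from the
# slide; the visits-per-night figure is an assumption for illustration.
ALERTS_PER_NIGHT = 10_000_000      # ~10 million time-domain events per night
VISITS_PER_NIGHT = 1_000           # assumed typical number of visits per night
LATENCY_BUDGET_S = 60              # alerts distributed within 60 s of observation

alerts_per_visit = ALERTS_PER_NIGHT / VISITS_PER_NIGHT
required_rate = alerts_per_visit / LATENCY_BUDGET_S

print(f"~{alerts_per_visit:,.0f} alerts per visit")
print(f"~{required_rate:,.0f} alerts/second sustained to meet the 60 s budget")
```

Under these assumptions, the system must sustain on the order of a few hundred alerts per second per visit window, which is the scale the event distribution networks are designed around.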

(4)

Data Management System Design (LDM-148)

Application Layer (LDM-151): Scientific Layer
• Pipelines constructed from reusable, standard "parts", i.e. the Application Framework
• Data Product representations standardized
• Metadata extendable without schema change
• Object-oriented custom software in Python and C++

Middleware Layer (LDM-152)
• Portability to clusters, grid, and other platforms
• Provide standard services so applications behave consistently (e.g. provenance)
• Preserve performance (<1% overhead)
• Custom software on top of open source, off-the-shelf software

Infrastructure Layer (LDM-129): Distributed Platform
• Different sites specialized for real-time alerting, data release production, and peta-scale data access
• Off-the-shelf, commercial hardware and software; custom integration

Major WBS elements in the design:
• 02C.01.02.01, 02C.02.01.04, 02C.03, 02C.04 Alert, SDQA, Calibration, Data Release Productions/Pipelines
• 02C.01.02.02 - 03 SDQA and Science Pipeline Toolkits
• 02C.03.05, 02C.04.07 Application Framework
• 02C.05 Science User Interface and Analysis Tools
• 02C.06.01 Science Data Archive (Images, Alerts, Catalogs)
• 02C.06.02 Data Access Services
• 02C.07.01, 02C.06.03 Processing Middleware
• 02C.07.02 Infrastructure Services (System Administration, Operations, Security)
• 02C.07.04.01 Archive Site
• 02C.07.04.02 Base Site
• 02C.08.03 Long-Haul Communications
• Physical Plant (included in above)
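The "pipelines constructed from reusable, standard parts" idea can be sketched as follows. The class names and composition style below are hypothetical illustrations, not the actual LSST Application Framework API.

```python
# Minimal sketch of composing a pipeline from reusable, standard "parts".
# Class and method names are hypothetical, not the LSST Application Framework.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Stage:
    """A reusable pipeline stage: a named transform over in-memory data."""
    name: str
    run: Callable[[dict], dict]

class Pipeline:
    """Chains stages so the output of one feeds the next."""
    def __init__(self, stages: List[Stage]):
        self.stages = stages

    def run(self, data: dict) -> dict:
        for stage in self.stages:
            data = stage.run(data)
        return data

# Standard "parts" can be re-combined into different productions.
isr = Stage("isr", lambda d: {**d, "isr_done": True})
detect = Stage("detect", lambda d: {**d, "sources": ["src1", "src2"]})

nightly = Pipeline([isr, detect])
print(nightly.run({"raw": "exposure-0001"}))
```

The design point this illustrates is that the same standardized components are reused across the Alert, Calibration, and Data Release productions rather than being rewritten per production.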

(5)

Mapping Data Products into Pipelines

• 02C.01.02.01/02. Data Quality Assessment Pipelines
• 02C.01.02.04. Calibration Products Production Pipelines
• 02C.03.01. Instrumental Signature Removal Pipeline
• 02C.03.01. Single-Frame Processing Pipeline
• 02C.03.04. Image Differencing Pipeline
• 02C.03.03. Alert Generation Pipeline
• 02C.03.06. Moving Object Pipeline
• 02C.04.04. Coaddition Pipeline
• 02C.04.04/.05 Association and Detection Pipelines
• 02C.04.06. Object Characterization Pipeline
• 02C.04.03. PSF Estimation
• 02C.01.02.03. Science Pipeline Toolkit
• 02C.03.05/04.07 Common Application Framework

[Figure: the pipelines above grouped by the Level 1, Level 2, and Level 3 data products they produce]
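One way to read this mapping is by production: the 02C.03 pipelines feed the nightly Level 1 (alert) production, while the 02C.04 pipelines feed the annual Level 2 Data Release production. The grouping below is a simplified illustration of that reading, not a definitive configuration.

```python
# Illustrative grouping of the pipelines named above into nightly (Level 1)
# and annual Data Release (Level 2) productions; simplified from the slide.
PRODUCTIONS = {
    "level1_nightly": [
        "Instrumental Signature Removal",
        "Single-Frame Processing",
        "Image Differencing",
        "Alert Generation",
        "Moving Object",
    ],
    "level2_data_release": [
        "Calibration Products Production",
        "Coaddition",
        "Association and Detection",
        "Object Characterization",
        "PSF Estimation",
    ],
}

for production, pipelines in PRODUCTIONS.items():
    print(f"{production}:")
    for name in pipelines:
        print(f"  -> {name} Pipeline")
```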

(6)

Infrastructure: Petascale Computing, Gbps Networks

The computing cluster at the LSST Archive at NCSA will run the processing pipelines.

• Single-user, single-application data center
• Commodity computing clusters
• Distributed file system for scaling and hierarchical storage
• Local-attached, shared-nothing storage where high bandwidth is needed

Long-haul networks transport data from Chile to the U.S.

• 2x100 Gbps from the Summit to La Serena (new fiber)
• 2x40 Gbps from La Serena to Champaign, IL (path diverse, existing fiber)

Archive Site and U.S. Data Access Center: NCSA, Champaign, IL
Base Site and Chilean Data Access Center: La Serena, Chile
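As a rough check that these links comfortably support nightly operations, here is a sketch of the per-visit transfer time. The per-visit data volume and link efficiency are illustrative assumptions, not figures from this presentation.

```python
# Rough transfer-time check for the Chile-to-US links described above.
VISIT_SIZE_GB = 6.4                # assumed raw data per visit, in gigabytes
LINK_GBPS = 40                     # one of the 2x40 Gbps La Serena -> Champaign paths
EFFICIENCY = 0.8                   # assumed usable fraction of nominal bandwidth

visit_bits = VISIT_SIZE_GB * 8e9
seconds = visit_bits / (LINK_GBPS * 1e9 * EFFICIENCY)
print(f"~{seconds:.1f} s to move one visit over a single {LINK_GBPS} Gbps path")
```

Under these assumptions a visit crosses the long-haul link in a couple of seconds, leaving most of the 60-second alert budget for processing.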

(7)

Middleware Layer: Isolating Hardware, Orchestrating Software

Enabling execution of science pipelines on hundreds of thousands of cores:
• Frameworks to construct pipelines out of basic algorithmic components
• Orchestration of execution on thousands of cores
• Control and monitoring of the whole DM System

Isolating the science pipelines from details of the underlying hardware:
• Services used by applications to access/produce data and communicate
• "Common denominator" interfaces handle changing underlying technologies

(8)

Database and Science UI: Delivering to Users

Massively parallel, distributed, fault-tolerant relational database:
• To be built on existing, robust, well-understood technologies (MySQL and xrootd)
• Commodity hardware, open source
• Advanced prototype in existence (qserv)

Science User Interface to enable access to and analysis of LSST data:
• Web and machine interfaces to LSST databases
• Visualization and analysis capabilities
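To make the user-facing side concrete, below is the kind of spatially constrained catalog query such a database serves, with a conceptual view of how a qserv-style engine fans it out over sky partitions. Table, column, and chunk names are assumptions for illustration, not the final LSST schema.

```python
# Illustrative cone-search-style query over the object catalog.
cone_search = """
SELECT objectId, ra, decl, psfMag_r
FROM   Object
WHERE  ra   BETWEEN 150.0 AND 150.2
  AND  decl BETWEEN 2.0   AND 2.2
"""

# Conceptually, the qserv-style engine rewrites this into per-shard queries,
# runs them in parallel on the workers holding those sky partitions, and
# merges the results.
shards = ["chunk_1187", "chunk_1188"]       # hypothetical sky-partition chunks
per_shard_sql = [cone_search.replace("Object", s + ".Object") for s in shards]
for sql in per_shard_sql:
    print(sql.strip().splitlines()[1])      # show the rewritten FROM clause
```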

(9)

Critical Prototypes: Algorithms and Technologies

Petascale Database Design
• Conducted parallel database tests at up to 300 nodes and 100 TB of data, i.e. 100% of the scale for operations year 1

Petascale Computing Design
• Executed in parallel on up to 10k cores (TeraGrid/XSEDE and NCSA Blue Waters hardware) with scalable results

Algorithm Design
• Approximately 60% of the software functional capability has been prototyped
• Over 350,000 lines of C++ and Python coded, unit tested, integrated, and run in production mode
• Have released three terabyte-scale datasets, including single-frame measurements and point-source and galaxy photometry
• Precursors leveraged: Pan-STARRS, SDSS, HSC

Gigascale Network Design
• Currently testing at up to 1 Gbps
• Agreements in principle are in hand with key infrastructure providers (NCSA, FIU/AmPath, REUNA, IN2P3)

(10)

Data Management Scope is Defined and Requirements are Established

• Data Product requirements have been vetted with the Science Collaborations multiple times and successfully passed review (Jul '13)
• Data quality and algorithmic assessments are far advanced, we understand the risks, and these successfully passed review (Sep '13)
• Hardware sizing has been refreshed based on the latest scientific and engineering requirements, system design, technology trends, software performance profiles, and acquisition strategy
• Interfaces are defined to the Phase 2 level
• Requirements and Final Design have been baselined (Data Management Technical Control Team)
• Traceability from OSS to DMSR has been verified

(11)

Data Management ICDs Needed for Construction Start

• ICDs are at the Phase 2 level and under formal change control; some remain in progress (Phase 1)
• ICDs on Confluence: http://ls.st/mmm; Docushare: http://ls.st/col-1033
(12)

Going Where the Talent is: Distributed Team

• Infrastructure
• Middleware
• Science Pipelines
• Database
• User Interfaces
• Mgmt, I&T, and Science QA

(13)

Data Management Organization (document-139)

LSST DM Leadership:
• Project Manager: J. Kantor
• Project Scientist: M. Juric
• System Architecture: K-T. Lim, G. Dubois-Felsmann (SLAC)
• Science User Interface & Tools: X. Wu, D. Ciardi (IPAC)
• Survey Science Group: SSG Lead Scientist TBD; F. Economou (LSST)
• Alert Production: A. Connolly (UW/OPEN)
• International Comms/Base Site: R. Lambert (NOAO)
• Processing Services & Site Infrastructure: D. Petravick (NCSA)
• Science Database & Data Access Services: J. Becla (SLAC)
• Data Release Production: R. Lupton, J. Swinbank (Princeton)

• DM lead institutions are integrated into one project and are performing in their construction roles.

(14)

Leveraging national and international investments

NSF/OCI funded:
– Formal relationships continue with the IRNC-funded AmLight project, the lead entity in securing Chile-US network capacity for LSST
– We have leveraged significant XSEDE and Blue Waters service unit and storage allocations for critical R&D-phase prototypes and productions
– Our LSST Archive Center and US Data Access Center will be hosted in the National Petascale Computing Facility at NCSA
– A strong relationship has been established with the Condor Group at the University of Wisconsin, and HTCondor is now in our processing middleware baseline
– We have reused a wide range of open-source software libraries and tools, many of which received seed funding from the NSF

Other national/international funded:
– We have participated in joint development of astronomical software with Pan-STARRS and HSC
– We have fostered collaborative development of scientific database technology via the eXtremely Large Data Base (XLDB) conferences and collaborations with database developers (e.g. SciDB, MySQL, MonetDB)
– We have a deep process of community engagement to deliver products that are needed, and an architecture that allows the community to deliver their own tools

(15)

Data Management is Construction Ready

• The Data Management System is scoped and credibly estimated
– Requirements have been baselined and are achievable (LSE-61)
– Final Design baselined (LDM-148, -151, -152, -129, -135)
– Approximately 60% of the software functional capability has been prototyped
– Data and algorithmic assessments are far advanced and we understand the risks
– Hardware sizing has been done based on scientific and engineering requirements, system design, technology trends, software performance profiles, and acquisition strategy
– All lowest-level WBS elements have been estimated and scheduled in PMCS, with scope and basis of estimate documented

• All lead institutions are demonstrably integrated into one project and are performing in their construction roles/responsibilities
– Core lead technical personnel are on board at all institutions
– Agreements in principle are in hand with key technology and center providers (NCSA, NOAO, FIU/AmPath, REUNA)

• The software development process has been exercised fully
– Have successfully executed eight software and data releases
– Standard/formal processes, tools, and environment exercised repeatedly and refined
– Automated build and test environment is configured and exercised nightly/weekly

http://ls.st/mmm http://ls.st/col-1033
