(1)

Data Services for Campus Researchers

Research Data Management Implementations Workshop, March 13, 2013

Richard Moore

SDSC Deputy Director & UCSD RCI Project Manager

(2)

SDSC Cloud: A Storage Paradigm Shift for Access, Sharing and Collaboration

Launched September 2011

Largest, highest-performance known academic cloud

5.5 Petabytes (raw), ~2PB usable, 8 GB/sec

Automatic dual-copy and verification

Capacity and performance scale linearly to 100s of petabytes

Open source platform based on NASA and RackSpace software

See http://cloud.sdsc.edu for more information
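As a rough sanity check on the raw-versus-usable figures above, here is the arithmetic under automatic dual-copy (the ~27% RAID overhead is an illustrative assumption; the slide states only the raw and usable totals):

```python
# Rough capacity arithmetic for a dual-copy object store.
# The 5.5 PB raw and ~2 PB usable figures come from the slide;
# the RAID overhead fraction is an assumption for illustration.
def usable_capacity(raw_pb: float, copies: int = 2, raid_overhead: float = 0.27) -> float:
    """Usable capacity (PB) after replication and RAID overhead."""
    return raw_pb * (1 - raid_overhead) / copies

print(round(usable_capacity(5.5), 2))  # close to the ~2 PB usable quoted above
```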

(3)

Key Features of SDSC Cloud

Easy access to shared data for any user with permission, via a range of mechanisms (HTTP, APIs, portals, gateways, …)

Simple interfaces for data owners to manage their data, control access, and set permissions for sharing

“Always-there” disk-based availability of data

High reliability

• Disk RAID; automatic dual-copy; continuous background checksum verification/restoration; offsite replication coming soon

Encryption readily incorporated

Transaction history logged – track usage, assess utility, support provenance

Scalable system in both capacity and bandwidth
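The dual-copy-with-verification behavior described above can be sketched at the file level (a simplified illustration only; the production system operates on objects, and the checksum/restore policy shown is an assumption):

```python
import hashlib
import shutil
from pathlib import Path

def md5sum(path: Path) -> str:
    """MD5 digest of a file's contents (Swift-style ETag)."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_and_restore(primary: Path, replica: Path, expected: str) -> bool:
    """Check the primary copy against its recorded checksum and
    restore it from the second copy on a mismatch.
    Returns True if the primary was (or now is) good."""
    if md5sum(primary) == expected:
        return True
    if md5sum(replica) == expected:
        shutil.copyfile(replica, primary)  # restore from the good copy
        return True
    return False  # both copies bad; needs operator attention
```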

(4)

Applications of SDSC Cloud

Shared/published/curated data collections

HPC simulation data storage and sharing

Web/portal applications and site hosting

Application integration using supported APIs

Serving images/videos

(5)

SDSC Cloud Interfaces

[Architecture diagram] Data owners and external users reach a load-balanced proxy layer in front of the Swift object storage cluster through several interface classes: commercial products (Amanda backup tools, Commvault, CrashPlan, Panzura), traditional clients (GUI applications, command line, SDSC web interface), web services APIs (Amazon S3, Rackspace CloudFiles / OpenStack API), and user-developed web portals/gateways.
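From a client's perspective, the Swift/CloudFiles-compatible API above is plain authenticated HTTP. A minimal sketch of building (not sending) such a request — the endpoint, container, and token values are placeholders, not real SDSC credentials:

```python
import urllib.request

def swift_object_request(endpoint: str, container: str, obj: str,
                         token: str, method: str = "GET") -> urllib.request.Request:
    """Build an (unsent) authenticated request for one object in a
    Swift/CloudFiles-compatible object store."""
    url = f"{endpoint}/{container}/{obj}"
    req = urllib.request.Request(url, method=method)
    req.add_header("X-Auth-Token", token)  # token comes from a prior auth call
    return req

# Placeholder values for illustration only:
req = swift_object_request("https://cloud.example.edu/v1/AUTH_acct",
                           "results", "run42/output.dat", "tok123")
```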
(6)
(7)

Developing an interface between Globus Online and SDSC Cloud

Developing a GridFTP Data Storage Interface to OpenStack’s Swift Object Storage (Swift-DSI) – modules are 80% complete

Developed authentication system between GridFTP and Swift

Directory-level operations under development
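Swift's namespace is flat, which is why directory-level operations need special handling: tools typically emulate directories by treating `/` in object names as a delimiter. A sketch of that convention (the object names are invented for illustration):

```python
def list_level(object_names, prefix=""):
    """Emulate a one-level directory listing over a flat object
    namespace: return (files, subdirectories) directly under `prefix`."""
    files, dirs = [], set()
    for name in object_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if "/" in rest:
            # anything with a further slash belongs to a pseudo-directory
            dirs.add(prefix + rest.split("/", 1)[0] + "/")
        else:
            files.append(name)
    return sorted(files), sorted(dirs)

objs = ["runs/2013/a.dat", "runs/2013/b.dat", "runs/readme.txt"]
print(list_level(objs, "runs/"))  # (['runs/readme.txt'], ['runs/2013/'])
```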

(8)

UNIVERSITY OF CALIFORNIA, SAN DIEGO

First Year’s Experience w/ SDSC Cloud

Adoption primarily by large-scale users w/ commercial interfaces

• Backup services (Commvault), national lab (Panzura)

More typical feedback from individual researchers

• Interfaces are not easy to integrate with normal tools

• POSIX access is more familiar than object-based storage

• Service is too expensive

Data sharing is emerging only slowly as a bottom-up priority among individual researchers

Recovering total cost of ownership, without campus or extramural investments, is a challenge

• And depreciation costs just get tougher to recapture later in equipment life

We expect this to be an increasingly valuable resource, but adoption takes time and barriers to entry remain for individual PIs

(9)

UCSD’s Research Cyberinfrastructure Program

2008-2009: Research Cyberinfrastructure Design Team (RCIDT)

• Chaired by Mike Norman and Phil Papadopoulos; broad campus participation

• Campus-wide survey of research cyberinfrastructure needs (2008)

• April 2009: RCIDT issued Blueprint for the Digital University (http://rci.ucsd.edu/_files/RCIDTReportFinal2009.pdf)

2009-2010: CyberInfrastructure Planning & Operations Committee (CIPOC) developed a business plan with recommendations

• A principle of shared costs between PIs and campus RCI investments

• April 2010: CIPOC issued their report

2011-Present: RCI Oversight Committee charged to implement RCI

• January 2011 – CIPOC business plan accepted, oversight committee charged

• Chaired by Mike Norman and Mike Gilson; broad campus representation

(10)


Elements of UCSD’s Integrated Research CyberInfrastructure

• Data Center Colocation

• Networking

• Centralized Storage

• Data Curation

• Research Computing

• Technical Expertise

(11)

A Data-Centric view of UCSD’s RCI:

From the 2009 Blueprint report

(12)

Research Computing

(in production now)

RCI is evolving SDSC’s Triton system into the “Triton Shared Computing Cluster” (TSCC)

Condo model: Researchers purchase compute nodes, which are operated as part of the shared cluster for 3-4 years

• PI buys hardware & pays a modest ops fee

• Lower operations cost than a local PI cluster; larger-scale resource available (core count and capacity); professionally managed

Hotel: Purchase time by the core-hour; shared queue
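Whether condo or hotel is cheaper comes down to utilization. A toy break-even calculation — all dollar figures and the node size below are invented for illustration, not TSCC's actual pricing:

```python
def breakeven_hours(node_cost: float, annual_ops_fee: float, years: int,
                    hotel_rate: float, cores: int) -> float:
    """Full-node hours per year at which owning a condo node costs the
    same as buying hotel time at `hotel_rate` dollars per core-hour."""
    annual_condo_cost = node_cost / years + annual_ops_fee
    return annual_condo_cost / hotel_rate / cores

# Hypothetical: $5,000 node, $500/yr ops fee, 4-year life,
# $0.025 per core-hour hotel rate, 16-core node
print(round(breakeven_hours(5000, 500, 4, 0.025, 16)))  # hours/year of full-node use
```

Above that many hours of use per year, the condo works out cheaper; below it, the hotel does.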

(13)

Data Curation – in pilot

(production FY13-14)

Completing a two-year pilot phase

• How do lab personnel work with librarians to curate their data?

• How much work is required to curate data and what are options?

• What is a sustainable business model for curation within RCI project?

Five representative programs across UCSD selected as pilots

• The Brain Observatory (Annese)

• Open Topography (Baru)

• Levantine Archaeology Laboratory (Levy)

• SIO Geological Collections (Norris)

• Laboratory for Computational Astrophysics (Wagner)

Using existing tools whenever possible

• Storage at SDSC, campus high-speed networking, the Digital Asset Management System (DAMS) at UCSD Libraries, and the Chronopolis digital preservation network

(14)

Centralized Storage

(phased production thru 2013)

Completing interviews of a broad sample of ~50 representative PIs to understand technical and cost requirements

Identify common needs, and define a sustainable RCI business model with strong adoption

Anticipate production centralized storage services in CY13

RCI Network-Attached Storage (RCI/NAS) available Spring CY13

Further services to be rolled out throughout the year, based on requirements analysis

(15)

PI Interview Responses:

How do You Handle Data Storage/Backup?

Storage Devices

• Network-accessible storage (NAS), USB drives, and server-local drives dominate

• Use of Dropbox for sharing

• Others use Google Drive, Hadoop, XSEDE, SDSC co-location

Backup modes

• Replicated copies in two NAS systems

• A copy in the NAS, a copy on a local hard drive (laptop/workstation), a copy on a USB drive, and maybe a copy in email/Dropbox

Problems:

• Copies fall out of sync
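The out-of-sync problem reported above is mechanically detectable by comparing content digests across the scattered copies. A small sketch (the copy labels are illustrative):

```python
import hashlib

def find_stale_copies(copies: dict) -> list:
    """Given {label: bytes} for each copy of a dataset, return the
    labels whose content disagrees with the most common version."""
    digests = {label: hashlib.sha256(data).hexdigest()
               for label, data in copies.items()}
    counts = {}
    for d in digests.values():
        counts[d] = counts.get(d, 0) + 1
    majority = max(counts, key=counts.get)  # most common digest wins
    return sorted(l for l, d in digests.items() if d != majority)

copies = {"nas": b"v2", "laptop": b"v2", "usb": b"v1"}
print(find_stale_copies(copies))  # ['usb']
```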

(16)


PI Interview Responses:

Where is Your Data Coming From?

Indicates use cases for connectivity requirements

Data sources:

• ~50% campus instruments

• ~30% simulations (XSEDE, campus, lab systems)

• ~20% field instruments

• ~15% other external sources

Percentages reflect the share of PIs surveyed using each source, not data volume; individual PIs use multiple sources, so the percentages add up to >100%.
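Since each PI could name several sources, the percentages are computed per option over all respondents, which is how the totals exceed 100%. A quick sketch with made-up responses:

```python
from collections import Counter

def option_percentages(responses):
    """Percent of respondents selecting each option in a
    multi-select survey question (totals can exceed 100%)."""
    counts = Counter(opt for r in responses for opt in set(r))
    n = len(responses)
    return {opt: 100 * c / n for opt, c in counts.items()}

# Four hypothetical PIs, each naming one or more data sources
responses = [{"campus", "simulation"}, {"campus"}, {"field", "campus"}, {"simulation"}]
pcts = option_percentages(responses)
print(pcts["campus"], sum(pcts.values()))  # campus: 75.0; total exceeds 100
```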

(17)

PI Interview Responses: How much storage do you need now, in the future, and permanently?

For the PIs interviewed, current needs range from 1 TB to 1000 TB

Increasing in the future

(18)

Interview responses: Metadata and longevity requirements

Do you need metadata annotation capability?

How long do you need to retain your data?

(19)

Interview responses: What are you willing to pay for storage?

Intentionally posed a relative question rather than absolute $’s

PIs are comfortable with hardware costs but …

Often ignore the cost of lab staff to operate equipment

• “Hard to compete with slave labor” (Jim Pepin)

(20)


Data Life Cycle Services

for Campus Researchers

Initial Services (currently available)

• High-performance Lustre PFS for HPC scratch and medium-term parking

• Centralized network-attached storage (NFS, CIFS mounts) available from the campus HPC cluster and remotely across campus (10 Gbps infrastructure)

• Support for Data Management Plans

Next-phase services

• Build on experience with pilots to develop basic data curation services

• Improve interfaces for cloud storage to facilitate sharing/access

• Low-cost backup services for research data

• HIPAA/PII protected data services

• Data management tools and improved support for DMPs

Looking ahead

• Tools/technology to support data sharing (via SDSC Cloud, portals, gateways)

• Develop tools for search, discovery and sharing of data

