Big Data in OpenTopography
Vishu Nandigam
San Diego Supercomputer Center
NSF Big Data in Educa<on Workshop
Presenta<on Overview
• Lidar and OpenTopography
• Data and Workflow
• Cyberinfrastructure
• Data Growth and Challenges
• Data Insights
LIDAR
• LIght Detec<on And Ranging (aka airborne laser
swath mapping)
• Billions of of accurate distance measurements with a
scanning laser rangefinder + GPS + Iner<al Measurement Unit (IMU)
Point cloud
(x,y,z coordinates) = fundamental LIDAR data product
OpenTopography
• NSF Earth Science Facility: 3 year support in 2009.
Renewed in 2012
(Award No. 1226353 & 1225810 EAR/IF)
• CI and Science Collabora3on
– SDSC, ASU & UNAVCO
• Related research efforts
– NASA ROSES: Extend to Satellite-‐based LIDAR (waveform data)
– NSF SI2 CyberGIS: OpenTopo as an exemplar of cyber GIS
– NSF CluE: Inves<gate Computer Science issues in big data
• Partnerships with state and local agencies to support
OpenTopography Data
• Large user community with
variable needs and levels of sophis<ca<on.
• Goal: maximize access to
data to achieve greatest scien<fic impact.
• Big data
– treat data as an asset that can
be used and reused
– Co-‐locate data with on-‐demand
Data Workflow
1. Original Source Data from Collector (eg. NCALM)
2. Extract relevant data products
3. Source data is archived in Chronopolis digital preserva<on
network
1. UCSD Library (Library of Congress)
2. Three geographically distributed copies of the data
4. Extracted data products go through QA/QC
5. Data transforma<on and op<miza<on 1. Error correc<on
2. Projec<on conversion
6. Update Metadata ISO 19115 (Data)
7. Generate addi<onal derived products (e.g. GE hillshades)
Catalog Service for the Web / DOI
(8 &9 of the Data Workflow)
• CSW Catalog – ESRI Geoportal Server
• ISO 19115 (Data)
• CZO, CyberGIS, Thomson Reuters Web of Science
Protocol: 'CSW' Profile: 'ArcGIS Server Geoportal Extension’
DOI: hmp://dx.doi.org/10.5069/ G9ST7MR9
EZID is a service of UC Cura<on Center of the CDL
Current Data Holdings
• Lidar Point Cloud Datasets – 770 Billion+ lidar returns.
– Each return has addi<onal amributes
– On-‐demand processing capabili<es
• SRTM, Raster (mul<ple layers)
e.g. Sonoma -‐ several intensity products, canopy height, bare earth, hydro-‐enforced bare earth, canopy top, etc.
• 30+ TB of on-‐demand online data
(Excluding big custom job runs, original archived data, pre processing data and user generated products)
OT CyberInfrastructure
Infrastructure Tier Application Tier
OT Resources Data and Processing
XSEDE Resources Gordon Nodes
SDSC Resources SDSC Cloud Point Cloud Data
PC Data Access Service PC Data Processing Service DOI CSW SRTM Dataset SRTM Data Access Service SRTM Data Processing Service
OT Challenges
(Keeping pace with data and user growth
and advancing science!)
LIDAR Point Cloud Data Growth
0 20 40 60 80 100 120 140 160 Sum m er 2 00 5 fal l2 00 5 w in te r 2 00 6 spr ing 2006 Sum m er 2 00 6 fal l 2 00 6 w in te r 2 00 7 spr ing 2007 summe r 2007 fal l 2 00 7 w in te r 2 00 8 spr ing 2008 Sum m er 2 00 8 Fal l 2 00 8 W in te r 2 00 9 Spr ing 2 00 9 summe r 2009 Fal l 2 00 9 W in te r 2 01 0 Spr ing 2 01 0 Sum m er 2 01 0 Fal l 2 01 0 W in te r 2 01 1 Spr ing 2 01 1 Sum m er 2 01 1 Fal l 2 01 1 W in te r 2 01 2 Spr ing 2 01 2 Sum m er 2 01 2 Fal l 2 01 2 W in te r 2 01 3 Spr ing 2 01 3 Sum m er 2 01 3 Fal l 2 01 3 W in te r 2 01 4 Spr ing 2 01 4 OpenTopographySensor Hardware Technology
• Rapid Evolu<on of Laser Scanner Technology
• Data is being collected on mul<ple channels (different
wavelengths) and capturing full waveform data.
• Early datasets collected with scanners opera<ng at
less than 33Khz. Current systems collect data at ~900Khz
Greater Resolu<on = Larger Data Volumes
Diverse and complex datasets support
• Discrete return lidar
• Full waveform lidar
• Op<cal imagery (R,G,B orthophotography)
• Hyperspectral imagery
NEON Hyperspectral Imagery collected light reflected across the electromagne<c spectrum for a total of 426 bands of informa<on. Image: Nathan Leisso, NEON AOP
User Growth
0 1000 2000 3000 4000 5000 6000 7000 Au g-‐ 08 No v-‐ 08 Fe b-‐0 9 May -‐0 9 Au g-‐ 09 No v-‐ 09 Fe b-‐1 0 May -‐1 0 Au g-‐ 10 No v-‐ 10 Fe b-‐1 1 May -‐1 1 Au g-‐ 11 No v-‐ 11 Fe b-‐1 2 May -‐1 2 Au g-‐ 12 No v-‐ 12 Fe b-‐1 3 May -‐1 3 Au g-‐ 13 No v-‐ 13 Fe b-‐1 4 May -‐1 4 Au g-‐ 14 No v-‐ 14CyberGIS, NASA/UNAVCO communi<es Service Level Access
Pluggable Services Infrastructure
• Methods for scien<fic data processing are evolving
• Users demanding more processing services
• Pluggable services infrastructure – OT development sandbox
– Assist researchers with their code
– Deployment of the algorithm as a service
– Update processing workflow and UI
Data Insights
What can we learn from:
40,000+ custom PC jobs
1.2 trillion lidar points processed 15,000+ custom raster jobs (past year)
Data Access Pamerns
Access Pamerns in LIDAR Scien<fic Datasets
B4 – San Andreas Fault
0 20 40 60 80 100 120 140 160
Access Pamerns in LIDAR Scien<fic Datasets
Event based datasets -‐ El Mayor-‐Cucapah Earthquake
0 20 40 60 80 100 120 140
Access Pamerns in LIDAR Scien<fic Datasets
Event based datasets – Hai< Earthquake
0 20 40 60 80 100 120 140 160
Data Usage
Analy<cs
Northern San Andreas Fault
Social networking with data Recommenda<on system
Understanding Traffic Pamerns
Impact of dataset releases and workshops on traffic
0 20 40 60 80 100 120 140 160 180 200 0 2000 4000 6000 8000 10000 12000 14000
Jan-‐13 Feb-‐13 Mar-‐13 Apr-‐13 May-‐13 Jun-‐13 Jul-‐13 Aug-‐13 Sep-‐13 Oct-‐13 Nov-‐13 Dec-‐13
Re gi ste re d U se rs Pag e Vi si ts
New Users Registered users Daily visits
SRTM Global Dataset Released
Tiered storage based on Data Access
• Ac<vity based data ranking and <ered cloud
integrated storage SSD New and existing most active data Regular Disk
Between Hot and Cool
Cloud
Cloud Compu<ng
• Cost effec<veness & feasibility of data science facili<es on the cloud
• Microsos Azure for Research Award Integra<on of cloud based on-‐
demand geospa<al processing services into community earth science data facili<es.
Leveraging HPC
• Dedicated Gordon nodes via XSEDE (democra<za<on
of supercompu<ng resources)
GEO Data and Cyberinfrastructure Impera<ve: Harness the Power of Compu<ng and
Computa<onal Infrastructure.
Summary
• OpenTopography is an modern agile data facility
• Cyberinfrastructure driven by science use cases – CI
and science collabora<on
• Big Data needs to be usable -‐ Community not only
wants access to data but also wants tools for processing these data.
• Concept of OT can be used as a template for other
Thank You!
@OpenTopography [email protected]
White River, IN. Credit: Indiana Geological Survey/The State of Indiana