Development of Big Data Infrastructure in NCHC: From Sensing to Understanding
Fang-Pang Lin
National Center for High-Performance Computing
National Applied Research Laboratories, Taiwan
BIG DATA ECOSYSTEM: FROM DATA TO DECISIONS
Copyright IDC (2013)
[Figure: IDC big data ecosystem diagram. Recoverable labels follow.]
Dimensions: VOLUME, VELOCITY, VARIETY, VALUE
Roles and stages: END USERS (BUSINESS PROCESS); ANALYSTS / SCIENTISTS (INFO PROCESSING); ARCHITECTS / ENGINEERS (DATA ACQUISITION); PRODUCERS (DATA CREATION); SYSTEMS INTEGRATION
Technology labels: Shared Nothing; Scale-out Storage + SSD; MPP + In-Memory Compute; Hadoop; HiSpeed / -Resiliency Networking; Converged Infrastructure; Cloud; Non-relational DWH
OBJECTIVES: REAL-TIME EVENTS; DEEP INSIGHTS. Items: Stream Processing; Event Management; Data Exploration; Contextualized Data; Modeling / Scenarios; Forecasting
DELIVERY MODELS: ON-DEMAND; PUSH; EMBEDDED. Items: Access-Anywhere Analytics Services; Context-Aware Business Applications; Location-Based Services; Alert and Respond; Workflow and Interaction Automation; Smart devices and systems
Data labels: Email and Messaging; Mobile Apps; Data Transaction and Usage Logs; Machine and Sensors; Geolocation; Relationships and Social Influence
• Customer Engagement [microsegment, net promoter, personalization] (2013 BD Hotspots)
• Geofencing [retail offers, asset tracking] (GPS Innovations)
• Smart Cities [power, water, traffic, safety] (M2M Innovations)
• User-gen Text: 68% (Data collected)
• Transactional: 67% (Data collected)
• Machine or device: 41% (Data collected)
• Managing >5TB data: 36%
• Deployed/ing Hadoop: 21%
• Keep/discard which data: 35% (#1 BD Challenge)
• Deployed/ing Text Analytics: 26%
• Managing Data Quality (#1 IT Challenge)
• Deployed/ing Event Processing: 22%
APeJ BDA Maturity by Country – Market Size (USD)
© IDC Visit us at IDC.com and follow us on Twitter: @IDC
Source: IDC APeJ Big Data Maturity Assessment and Benchmark 2013 (n=802)
Source: IDC APeJ Big Data Market Analysis & Forecast 2013-2017, Oct 2013
Trends among Big Data leaders
Hyper-competition, regulation, analytics sophistication
• HK - IaaS investments, Telco asset utilization, FSI risk
• SGP - FSI microsegment, Telco LBS & data monetization, ICA
• AUS - FSI personalization, Telco psycho-graphic, Data privacy
• NZ - Emergency svc, SME SaaS, Fleet mgmt, Farm asset mgmt
Trends among Big Data midstream
Mass urbanization, “first-measured”, neighborly pressure
• KOR - M2M automation, Manu QC, Govt NLP audio analytics
• PRC - Web 2.0 [ABT] xacts, City CCTV, FSI microfinance
• TWN - Smart Tourism, Data Privacy, Manu process control
• IND - Citizen registration, banking the unbanked, Big Data insource
The Path from Infrastructure to Data
• Sensing for Understanding
  – Sensing: (Networks change the game!)
    • Evolving over the past 10 years: Ecogrid, SARS Grid, etc.
    • Institutional missions based on special vehicles: satellites, research ships and aircraft, met stations, etc.
    • Growing ever larger and broader, e.g. IoT, social networks.
  – Understanding:
    • Modeling from hypothesis to discovery
Ecogrid:
Fish4Knowledge: human-level query for Marine Biology
Partners: U. of Catania (Italy), Centrum Wiskunde & Informatica (Netherlands), U. of Edinburgh (UK), NCHC
• NCHC: sustainable system for data acquisition, storage, and computing.
• U. of Catania in Italy: fish detection and tracking.
• U. of Edinburgh in UK: workflow, fish recognition, and fish behavior.
• CWI in Netherlands: user interfaces.
Data:
• Video data: 10 camera years = 10 camera frames = 112 TB (massive data storage)
• Descriptive data: 20 TB
• Fish: 10^10 detections
• Summary data: 500 GB
• Target: 1-second query answering
Global Lake Ecological Observatory Network (GLEON)
• Lakebase: harvests quality data from the internet, covering more than about 25,000 lakes across the world.
• Global compute service through CONDOR.
• More than 10 major real-time observational datasets from selected GLEON sites.
Infrastructure enables Fish4Knowledge
2015/5/25
[Figure labels: Cloud of Resources; Service Frontend; Data Source; Application Developers; Storage Nodes; Computing Nodes; Video Server; Marine Biologists; Clients; Computing platform; Storage platform]
Video Classification (~350K videos from 2009 to 2013)
• Algae: 9.2%
• Blurred: 33.5%
• Highly Blurred: 13.9%
• Complex Scenes: 4.3%
• Encoding: 23.9%
• Normal: 12.9%
• Unknown: 2.2%
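To make the shares above concrete, the following sketch converts them into approximate video counts. It assumes "~350K" means exactly 350,000 videos, which is an illustrative simplification; the function names are hypothetical and not part of the Fish4Knowledge software.

```python
# Class shares as listed on the slide (percent of classified videos).
CLASS_SHARE_PCT = {
    "Algae": 9.2,
    "Blurred": 33.5,
    "Highly Blurred": 13.9,
    "Complex Scenes": 4.3,
    "Encoding": 23.9,
    "Normal": 12.9,
    "Unknown": 2.2,
}

# Assumption: "~350K videos" is treated as exactly 350,000 for illustration.
TOTAL_VIDEOS = 350_000


def estimated_counts(total: int = TOTAL_VIDEOS) -> dict:
    """Convert percentage shares into rounded video counts."""
    return {name: round(total * pct / 100) for name, pct in CLASS_SHARE_PCT.items()}


def total_share() -> float:
    """Sum of the listed shares, rounded to one decimal place."""
    return round(sum(CLASS_SHARE_PCT.values()), 1)


if __name__ == "__main__":
    # Print classes from largest to smallest estimated count.
    for name, count in sorted(estimated_counts().items(), key=lambda kv: -kv[1]):
        print(f"{name:15s} {count:>7d}")
    print("sum of shares (%):", total_share())
```

Under this assumption only about one video in eight falls in the "Normal" class, which suggests why quality classification matters before running fish detection at scale.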
The SEATANK Event & Other
• SEATANK event
  – It happened again!
  – Scientific evidence is required for Chinese Taipei to win the lawsuit. (TORI, NSPO, NCHC)
2006: Cargo ship 'Tzini' stranded in Yilan waters and leaked >100 tons of fuel oil.
2013.1: Freighter SEATANK stranded in Penghu waters.
The Path from Infrastructure to Data
• Sensing for Understanding
  – Sensing: (Networks change the game!)
    • Evolving over the past 10 years: Ecogrid, SARS Grid, etc.
    • Institutional missions based on special vehicles: satellites, research ships and aircraft, met stations, etc.
    • Growing ever larger and broader, e.g. IoT, social networks.
  – Understanding: (Data change the game!)
    • Modeling from hypothesis to discovery
• Natural disaster
• Facilities infrastructure failure
• Storage failure
• Server hardware/software failure
• Application software failure
• External dependencies (e.g. PKI failure)
• Format obsolescence
• Legal encumbrance
• Human error
• Malicious attack by human or automated agents
• Loss of staffing competencies
• Loss of institutional commitment
• Loss of financial stability
• Changes in user expectations and requirements
CC image by Sharyn Morrow on Flickr
Source: http://blog.alltop.com.tw/story/archives/3278
Source: DataONE
[Figure (from Michener et al 1997): Info. content versus Time. Labels: Time for Publication; Specific Details; General Details; Accident; Retirement or Career Change; Death]
Data Management
The Treasure of NARLabs: Earth Science Observational Data
Big Data Infrastructure Challenge in ESOD
• Data Features:
  – High complexity, large scale, high frequency, real-time and streaming (Big Data)
• Data Size:
  – 155 TB/yr. (Actual data size will be 2 to 3 times larger.)
  – Currently 103 TB for Q1, plus 200 TB of simulation data from TTFRI.
• Real-time, high-frequency data:
  – Process > 18,000 records/sec.
  – Currently 6,000 records/sec.
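The sizing figures above can be turned into a quick back-of-the-envelope calculation. The sketch below uses only the numbers quoted on the slide; the function names are hypothetical, and the 2x and 3x multipliers are applied directly to the 155 TB/yr nominal figure as an assumption about what "actual data size" means.

```python
SECONDS_PER_DAY = 24 * 60 * 60

NOMINAL_TB_PER_YEAR = 155        # nominal yearly data size from the slide
ACTUAL_SIZE_FACTORS = (2, 3)     # "actual data size will be 2 to 3 times larger"
TARGET_RECORDS_PER_SEC = 18_000  # lower bound of the processing target
CURRENT_RECORDS_PER_SEC = 6_000  # current ingest rate


def yearly_storage_range_tb(nominal=NOMINAL_TB_PER_YEAR, factors=ACTUAL_SIZE_FACTORS):
    """Return the (low, high) yearly storage need in TB after applying the multipliers."""
    low, high = factors
    return nominal * low, nominal * high


def records_per_day(rate_per_sec):
    """Number of records arriving in one day at a sustained per-second rate."""
    return rate_per_sec * SECONDS_PER_DAY


if __name__ == "__main__":
    low, high = yearly_storage_range_tb()
    print(f"storage need: {low} to {high} TB per year")
    print(f"current ingest: {records_per_day(CURRENT_RECORDS_PER_SEC):,} records/day")
    print(f"target ingest:  {records_per_day(TARGET_RECORDS_PER_SEC):,} records/day")
```

Read this way, the target rate is three times the current one, and the yearly storage need lands in the hundreds of terabytes.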
ESOD Big Data Service ~ 1PB
[Figure: data-intensive computing system (e.g. Hadoop). Labels: External client; Query; Result; Parallel query server; Parallel compute server; Parallel data server; Source dataset; Derived datasets (d1, d2, d3); Parallel file system (e.g., GFS, HDFS, GPFS); External data sources; Continuous query stream; Continuous query results]
Examples:
• Search
• Photo scene completion
• Log processing
• Science analytics
Characteristics:
• Small queries and results
• Massive data and computation performed on server
Source: David O'Hallaron
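Log processing is one of the examples named for this kind of system. The sketch below illustrates the general map-and-reduce pattern such systems use: each partition is summarized independently, then the partial results are merged into a small answer. It runs in a single process on hypothetical log lines and is not Hadoop code or the ESOD implementation.

```python
from collections import Counter
from functools import reduce

# Hypothetical log lines, split into partitions the way a parallel
# file system might spread one dataset across storage nodes.
PARTITIONS = [
    ["GET /a 200", "GET /b 404", "GET /a 200"],
    ["POST /a 500", "GET /b 200"],
    ["GET /c 200", "GET /a 404"],
]


def map_partition(lines):
    """Map step: count HTTP status codes within one partition."""
    return Counter(line.split()[-1] for line in lines)


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def status_histogram(partitions):
    """Small query, small result: the heavy work stays next to the data."""
    partials = map(map_partition, partitions)  # in a cluster: one task per node
    return reduce(reduce_counts, partials, Counter())


if __name__ == "__main__":
    print(status_histogram(PARTITIONS))
```

The query and its result are tiny compared with the data scanned, which matches the "small queries and results, massive data and computation performed on server" characteristic.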
ESOD Hardware Architecture
Virtually Centralized Resources
Challenges of using distributed resources
ESOD Storage: Disk 918 TB, Tape 433 TB
Sites: TTFRI, NSPO, TORI, NCREE, NCHC
[Figure labels: 10G Bridge Gateway; 8G Optical Fiber; WindRider: 40G InfiniBand (Lustre FS); 1G-10G Fiber; GPFS; Preload/Stage]
Examples of Datasets
ESOD Data Archiving and System Infrastructure
[Figure: three sites, each labelled General Parallel File System, 2 x 10GE, HSM tape library. Other labels: NCHC firewall; Load balance server (Master); Load balance server (Slave); HAProxy; https; ftps; Scale out > 1 PB]
• Failover between 3 sites.
• Automatic load balancing.
• Transparent file systems: direct read/write of files across different sites.
• Automatic storage backup.
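The combination of failover between 3 sites and automatic load balancing can be illustrated with a toy selector. This is a minimal sketch of the behaviour only: it is not HAProxy configuration, the class and site names are hypothetical, and real health checks are replaced by a manual flag.

```python
from itertools import cycle


class SiteBalancer:
    """Round-robin selection over sites, skipping any marked unhealthy."""

    def __init__(self, sites):
        self._sites = list(sites)
        self._healthy = {site: True for site in self._sites}
        self._ring = cycle(self._sites)

    def mark(self, site, healthy):
        """Record the outcome of a (hypothetical) health check."""
        self._healthy[site] = healthy

    def pick(self):
        """Return the next healthy site, or raise if every site is down."""
        for _ in range(len(self._sites)):
            site = next(self._ring)
            if self._healthy[site]:
                return site
        raise RuntimeError("no healthy site available")


if __name__ == "__main__":
    balancer = SiteBalancer(["site-a", "site-b", "site-c"])  # hypothetical names
    print([balancer.pick() for _ in range(3)])  # load is spread across sites
    balancer.mark("site-b", False)              # simulate a site failure
    print([balancer.pick() for _ in range(3)])  # traffic fails over to the rest
```

Requests rotate across all sites while they are healthy and continue on the remaining sites when one fails, which is the property the slide asks of the infrastructure.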
[Figure labels: WindRider; Formosa 2/3/5; Parallel DBMS; High IOPS]
Data discovery: Metadata Catalog
Use a relational database (MySQL) with a multi-dimensional schema design to speed up searching (now migrating to NoSQL solutions).
System Metadata
User name space
- Address / e-mail / telephone number
- Role (administrator, curator, user)
File name space
- Creation date / size / location / checksum
- Owner / access controls
Storage resource name space
- Capacity / quotas / Type (archive, disk, fast cache)
Domain Metadata
User-given metadata
- Key-Value-Unit Triplets, Annotation
- Relational / XML Metadata
- Domain-specific Schema
Adopt OGC standards
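A catalog that keeps system metadata per file alongside user-given key-value-unit triplets could look roughly like the sketch below. It uses SQLite in memory instead of MySQL so that it runs without a server; the table names, column names and sample rows are all hypothetical and do not reflect the actual ESOD schema.

```python
import sqlite3

# Hypothetical schema: one table for system metadata per file,
# one table for user-given key-value-unit triplets.
SCHEMA = """
CREATE TABLE file_meta (
    file_id  INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    owner    TEXT NOT NULL,
    size     INTEGER NOT NULL,
    location TEXT NOT NULL,
    checksum TEXT NOT NULL,
    created  TEXT NOT NULL
);
CREATE TABLE domain_meta (
    file_id INTEGER NOT NULL REFERENCES file_meta(file_id),
    key     TEXT NOT NULL,
    value   TEXT NOT NULL,
    unit    TEXT
);
CREATE INDEX idx_domain_key_value ON domain_meta(key, value);
"""

# Hypothetical sample rows, for illustration only.
FILES = [
    (1, "buoy_a.nc", "curator1", 1024, "disk", "abc123", "2015-05-01"),
    (2, "buoy_b.nc", "curator1", 2048, "archive", "def456", "2015-05-02"),
    (3, "radar_c.nc", "user7", 4096, "fast cache", "0a1b2c", "2015-05-03"),
]
TRIPLETS = [
    (1, "variable", "sea_surface_temperature", "degC"),
    (2, "variable", "sea_surface_temperature", "degC"),
    (3, "variable", "rainfall", "mm"),
    (1, "depth", "5", "m"),
]


def build_catalog():
    """Create an in-memory catalog and load the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO file_meta VALUES (?, ?, ?, ?, ?, ?, ?)", FILES)
    conn.executemany("INSERT INTO domain_meta VALUES (?, ?, ?, ?)", TRIPLETS)
    return conn


def find_files(conn, key, value):
    """Return names of files whose domain metadata contains (key, value)."""
    rows = conn.execute(
        "SELECT f.name FROM file_meta AS f "
        "JOIN domain_meta AS d ON d.file_id = f.file_id "
        "WHERE d.key = ? AND d.value = ? ORDER BY f.name",
        (key, value),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(find_files(build_catalog(), "variable", "sea_surface_temperature"))
```

The index on `(key, value)` is one simple way to speed up searching; the multi-dimensional schema design mentioned on the slide is presumably more elaborate.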
Smart query and answer
• Develop a controlled vocabulary based on international standards, e.g. HDF, NetCDF, OGC, etc.
• Derive an RDF triple dataset from common query tasks of ESOD.
• Combine visual data with metadata to support a specific high-level information-seeking user task.
• Design an interface for graph and visual comparison search.
• Selected one specific task, that of comparing sets of objects, and designed a prototype interface on top of the linked data sets used by the experts to support this task explicitly.
• Develop methods for data provenance.
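To show what querying an RDF-style triple dataset involves, the sketch below stores (subject, predicate, object) triples in a plain Python set and answers a query by pattern matching. The `ex:` identifiers and the sample triples are hypothetical; a real deployment would more likely use an RDF store and SPARQL, which this sketch does not attempt to reproduce.

```python
# Hypothetical triples describing datasets; "ex:" is a made-up prefix.
TRIPLES = {
    ("ex:dataset1", "ex:format", "NetCDF"),
    ("ex:dataset1", "ex:observes", "ex:rainfall"),
    ("ex:dataset2", "ex:format", "HDF"),
    ("ex:dataset2", "ex:observes", "ex:rainfall"),
    ("ex:dataset3", "ex:format", "NetCDF"),
    ("ex:dataset3", "ex:observes", "ex:wind_speed"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def subjects(triples, p, o):
    """Set of subjects that have predicate p with object o."""
    return {t[0] for t in match(triples, p=p, o=o)}


def netcdf_rainfall_datasets(triples):
    """Join two patterns: datasets in NetCDF format that observe rainfall."""
    in_netcdf = subjects(triples, "ex:format", "NetCDF")
    on_rainfall = subjects(triples, "ex:observes", "ex:rainfall")
    return sorted(in_netcdf & on_rainfall)


if __name__ == "__main__":
    print(netcdf_rainfall_datasets(TRIPLES))
```

A shared vocabulary for predicates and values is what makes such joins meaningful across datasets from different sources.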
The Path from Infrastructure to Data
• Sensing for Understanding
  – Sensing: (Infrastructure-Centric)
    • Evolving over the past 10 years: Ecogrid, SARS Grid, etc.
    • Institutional missions based on special vehicles: satellites, research ships and aircraft, met stations, etc.
    • Growing ever larger and broader, e.g. IoT, social networks.
  – Understanding: (Data-Centric)
    • Modeling from hypothesis to discovery
Government Big Data
• Government Clouds & Open Data (Big Data)
  – 10 government clouds:
    • Health, Food, e-Invoice, Transportation, Environment, Finance, Education, Culture, Disaster Prevention, Geospatial Information, Agriculture and e-Government, focusing on societal impact via research & innovation.
  – 3 kinds of data:
    • Infrastructural data, public data and personal data: open, but requiring protection.
• Open, but Protected
  – Licensing models:
    • Open Government License (OGL), Non-Commercial Government License, Charged License.
  – Access Platform
    • Facilities and systems to enable data storage, management and analytics.
Government Big Data
NCHC/NARLabs provides a Big Data Platform, bridging excellent research and societal impact
• Transportation: eTag system
• Health: Electronic Medical Record
• Food Security: Food Traceability
• Network: OpenFlow
• Finance: e-invoice