Development of Big Data Infrastructure in NCHC: From Sensing to Understanding


Academic year: 2021


(1)

Development of Big Data Infrastructure in NCHC: From Sensing to Understanding

Fang-Pang Lin

National Center for High-Performance Computing

National Applied Research Laboratories, Taiwan

(2)

BIG DATA ECOSYSTEM: FROM DATA TO DECISIONS (Copyright IDC, 2013)

[Figure: layered ecosystem diagram. Recoverable labels follow.]

Dimensions: Volume, Velocity, Variety, Value

Roles and stages:
- End users: business process
- Analysts / scientists: info processing
- Architects / engineers: data acquisition
- Producers: data creation
- Systems integration

Objectives:
- Real-time events: stream processing, event management
- Deep insights: data exploration, contextualized data, modeling / scenarios, forecasting

Delivery models:
- On-demand: access-anywhere analytics services, context-aware business applications
- Push: location-based services, alert and respond
- Embedded: workflow and interaction automation, smart devices and systems

Data sources: email and messaging, mobile apps, data transaction and usage logs, machine and sensors, geolocation, relationships and social influence

Infrastructure: shared nothing, scale-out storage + SSD, MPP + in-memory compute, Hadoop, hi-speed / -resiliency networking, converged infrastructure, cloud, non-relational DWH

(3)

BIG DATA ECOSYSTEM: FROM DATA TO DECISIONS (Copyright IDC, 2013)

[Figure: the same ecosystem diagram as on the previous slide, with the following callouts.]

- Customer Engagement [microsegment, net promoter, personalization] (2013 BD Hotspots)
- Geofencing [retail offers, asset tracking] (GPS Innovations)
- Smart Cities [power, water, traffic, safety] (M2M Innovations)
- User-gen Text: 68% (Data collected)
- Transactional: 67% (Data collected)
- Machine or device: 41% (Data collected)
- Managing >5TB data: 36%
- Deployed/ing Hadoop: 21%
- Keep/discard which data: 35% (#1 BD Challenge)
- Deployed/ing Text Analytics: 26%
- Managing Data Quality (#1 IT Challenge)
- Deployed/ing Event Processing: 22%

(4)

APeJ BDA Maturity by Country – Market Size (USD)

© IDC Visit us at IDC.com and follow us on Twitter: @IDC


Source: IDC APeJ Big Data Maturity Assessment and Benchmark 2013 (n=802)

Source: IDC APeJ Big Data Market Analysis & Forecast 2013-2017, Oct 2013

(5)

Trends among Big Data leaders

Hyper-competition, regulation, analytics sophistication


HK - IaaS investments, Telco asset utilization, FSI risk

SGP - FSI microsegment, Telco LBS & data monetization, ICA

AUS - FSI personalization, Telco psycho-graphic, Data privacy

NZ - Emergency svc, SME SaaS, Fleet mgmt, Farm asset mgmt

(6)

Trends among Big Data midstream

Mass urbanization, “first-measured”, neighborly pressure


KOR - M2M automation, Manu QC, Govt NLP audio analytics

PRC - Web 2.0 [ABT] xacts, City CCTV, FSI microfinance

TWN - Smart Tourism, Data Privacy, Manu process control

IND - Citizen registration, banking the unbanked, Big Data insource

(7)

The Path from Infrastructure to Data

Sensing for Understanding

Sensing: (Networks change the game!)

Evolving over the past 10 years: Ecogrid, SARS Grid, etc.

Institutional missions based on special vehicles: satellites, research ships and aircraft, met stations, etc.

It is growing even larger and broader, e.g. IoT, social networks.

Understanding:

Modeling from hypothesis to discovery

(8)

Ecogrid:

(9)

The Path from Infrastructure to Data

Sensing for Understanding

Sensing:

Evolving over the past 10 years: Ecogrid, SARS Grid, etc.

Institutional missions based on special vehicles: satellites, research ships and aircraft, met stations, etc.

It is growing even larger and broader, e.g. IoT, social networks.

Understanding:

Modeling from hypothesis to discovery

(10)

Fish4Knowledge: human-level query for Marine Biology

Partners: UCATANIA (Italy), Centrum Wiskunde & Informatica (Netherlands), U. of Edinburgh (UK)

NCHC: sustainable system for data acquisition, storage, and computing.

U. of Catania in Italy: fish detection and tracking.

U. of Edinburgh in UK: workflow, fish recognition, and fish behavior.

CWI in Netherlands: user interfaces.

NCHC massive data storage:
- Video data: 10 camera years = 10 cameraframes = 112 Tb
- Descriptive data: 20 Tb
- Fish: 10^10 detections
- Summary data: 500 Gb

Target: 1 sec query answering

(11)

Global Lake Ecological Observatory Network (GLEON)

Lakebase: harvests quality data from the internet, covering more than 25,000 lakes across the world.

Global compute service through CONDOR.

More than 10 major real-time observational datasets from selected GLEON sites.

(12)

Infrastructure enables Fish4Knowledge

2015/5/25

[Figure: architecture diagram. Labels: Cloud of Resources; Service Frontend; Data Source; Application Developers; Storage Nodes; Computing Nodes; Video Server; Marine Biologists; Clients; Computing platform; Storage platform.]

(13)
(14)

Video Classification (~350K videos from 2009 to 2013)

Algae: 9.2%

Blurred: 33.5%

Highly Blurred: 13.9%

Complex Scenes: 4.3%

Encoding: 23.9%

Normal: 12.9%

Unknown: 2.2%

(15)

The SEATANK Event & Other

SEATANK event: It happened again!

Scientific evidence is required for Chinese Taipei to win the lawsuit. (TORI, NSPO, NCHC)

2006: Cargo ship ‘Tzini’ stranded in Yilan waters and leaked more than 100 tons of fuel oil.

2013.1: Freighter SEATANK stranded in Penghu waters.

(16)

The Path from Infrastructure to Data

Sensing for Understanding

Sensing: (Networks change the game!)

Evolving over the past 10 years: Ecogrid, SARS Grid, etc.

Institutional missions based on special vehicles: satellites, research ships and aircraft, met stations, etc.

It is growing even larger and broader, e.g. IoT, social networks.

Understanding: (Data change the game!)

Modeling from hypothesis to discovery

(17)

Natural disaster

Facilities infrastructure failure

Storage failure

Server hardware/software failure

Application software failure

External dependencies (e.g. PKI failure)

Format obsolescence

Legal encumbrance

Human error

Malicious attack by human or automated agents

Loss of staffing competencies

Loss of institutional commitment

Loss of financial stability

Changes in user expectations and requirements

CC image by Sharyn Morrow on Flickr
Source: http://blog.alltop.com.tw/story/archives/3278
Source: DataONE
(18)

Data Management

[Figure: info. content plotted against time (from Michener et al 1997). Annotations along the curve: Time for Publication; Specific Details; General Details; Accident; Retirement or Career Change; Death.]

(19)

The Treasure of NARLabs:

Earth Science Observational Data

(20)

Big Data Infrastructure Challenge in ESOD

Data Features: high complexity, large scale, high frequency, real-time and streaming: Big Data.

Data Size: 155 TB/yr. (Actual data size will be 2~3x.) Currently, 103 TB for Q1. Simulation data: 200 TB from TTFRI.

Real-time, high-frequency data: Process > 18,000 records/sec. Currently 6,000 records/sec.
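
The figures on this slide can be cross-checked with a little arithmetic. The sketch below is an illustration rather than anything from the original deck: it assumes decimal units (1 TB = 10^12 bytes), a 365-day year, and that every quarter resembles Q1; the variable names are hypothetical.

```python
# Back-of-envelope check of the ESOD figures quoted on the slide.
# Assumptions (not from the slide): 1 TB = 10**12 bytes, 365-day year,
# and all quarters look like Q1 when annualizing.

PLANNED_TB_PER_YEAR = 155          # "155TB/yr."
GROWTH_FACTORS = (2, 3)            # "Actual data size will 2~3x"
Q1_ACTUAL_TB = 103                 # "Currently, 103 TB for Q1"
TARGET_RECORDS_PER_SEC = 18_000    # "Process > 18,000 records/sec."
CURRENT_RECORDS_PER_SEC = 6_000    # "Currently 6,000 records/sec"

SECONDS_PER_YEAR = 365 * 24 * 60 * 60

# Expected yearly volume range if the actual size is 2x to 3x the plan.
low_tb, high_tb = (PLANNED_TB_PER_YEAR * f for f in GROWTH_FACTORS)

# Naive annualization of the Q1 figure.
annualized_tb = Q1_ACTUAL_TB * 4

# Average sustained ingest rate implied by the planned yearly volume.
avg_mb_per_sec = PLANNED_TB_PER_YEAR * 10**12 / SECONDS_PER_YEAR / 10**6

# How far the record-processing rate has to grow to reach the target.
rate_gap = TARGET_RECORDS_PER_SEC / CURRENT_RECORDS_PER_SEC

print(f"2x-3x range: {low_tb}-{high_tb} TB/yr")
print(f"Q1 annualized: {annualized_tb} TB/yr")
print(f"Average ingest at 155 TB/yr: {avg_mb_per_sec:.1f} MB/s")
print(f"Required speed-up in records/sec: {rate_gap:.0f}x")
```

Under these assumptions the annualized Q1 volume (412 TB) falls inside the 310 to 465 TB range, which is consistent with the 2~3x remark, and the processing target is three times the current rate.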

(21)

ESOD Big Data Service ~ 1PB

Virtually Centralized Resources

[Figure: data-intensive computing system (e.g. Hadoop). An external client sends a query and receives a result, or sends a continuous query stream and receives continuous query results. Inside the system: parallel query server, parallel compute server, parallel data server, source dataset, derived datasets (d1, d2, d3), parallel file system (e.g., GFS, HDFS, GPFS), fed by external data sources. Source: David O’Hallaron]

Examples:
- Search
- Photo scene completion
- Log processing
- Science analytics

Characteristics:
- Small queries and results
- Massive data and computation performed on server
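
The pattern of small queries and results with the heavy computation performed next to the data can be illustrated with a toy partitioned aggregation. This is a minimal sketch, not the ESOD implementation: the partition contents are made up, the names `partial_stats` and `query_mean` are hypothetical, and a thread pool stands in for the parallel compute servers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions standing in for d1, d2, d3 in the diagram.
# On a real system each would live on a different data server.
PARTITIONS = {
    "d1": [12.0, 15.5, 11.2],
    "d2": [14.1, 13.3],
    "d3": [10.4, 16.8, 12.9, 13.0],
}


def partial_stats(values):
    """'Map' step: runs next to the data and returns a tiny summary."""
    return sum(values), len(values)


def query_mean(partitions):
    """Answer a small query (the mean) by combining per-partition summaries."""
    with ThreadPoolExecutor() as pool:
        summaries = list(pool.map(partial_stats, partitions.values()))
    total = sum(s for s, _ in summaries)   # 'reduce' step
    count = sum(n for _, n in summaries)
    return total / count


if __name__ == "__main__":
    print(f"mean = {query_mean(PARTITIONS):.2f}")
```

Only the per-partition summaries and the final number cross the boundary to the client; the raw values never leave the partitions.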

(22)

ESOD Hardware Architecture

Challenges of use of distributed resources

[Figure: ESOD storage and network diagram. Labels: ESOD Storage (Disk 918 TB, Tape 433 TB); sites TTFRI, NSPO, TORI, NCREE, NCHC; links: 10G Bridge Gateway, 8G Optical Fiber, 1G-10G Fiber; WindRider: 40G InfiniBand (Lustre FS); GPFS; Preload/Stage; Examples of Datasets (numbered 1 to 3 on the diagram).]

(23)

ESOD Data Archiving and System Infrastructure

[Figure: three sites, each with a General Parallel File System, 2 x 10GE links, and an HSM tape library. Other labels: NCHC firewall; load balance server (master); load balance server (slave); HAProxy; https; ftps; WindRider; Formosa 2/3/5; Parallel DBMS; High IOPS.]

Failover between 3 sites.

Automatic load balancing.

Transparent file systems: direct read/write of files across different sites.

Automatic storage backup.

Scale out > 1 PB
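
The two behaviours named on this slide, failover between sites and automatic load balancing, can be sketched as a simple site selector. This is only an illustration of the logic, not the deployed configuration (the slide shows dedicated load balance servers doing this work); the class name, site names, and health flags are hypothetical.

```python
from itertools import cycle


class SiteSelector:
    """Round-robin over sites, skipping any that are marked down.

    Illustrative only: site names and the health flags are hypothetical,
    and a real deployment would rely on the load balancer's health checks.
    """

    def __init__(self, sites):
        self.sites = list(sites)
        self.healthy = {site: True for site in self.sites}
        self._ring = cycle(self.sites)

    def mark_down(self, site):
        self.healthy[site] = False

    def mark_up(self, site):
        self.healthy[site] = True

    def next_site(self):
        """Return the next healthy site, or raise if every site is down."""
        for _ in range(len(self.sites)):
            site = next(self._ring)
            if self.healthy[site]:
                return site
        raise RuntimeError("no healthy site available")


if __name__ == "__main__":
    selector = SiteSelector(["site-a", "site-b", "site-c"])
    print([selector.next_site() for _ in range(3)])  # load spread over 3 sites
    selector.mark_down("site-b")                     # simulate a site failure
    print([selector.next_site() for _ in range(4)])  # traffic skips site-b
```

Round-robin is the simplest balancing policy; weighting by capacity or current load would be a natural refinement.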
(24)

Data discovery: Metadata Catalog

Use relational database (MySQL) with multi-dimensional schema design to speed up searching (migrating to NoSQL solutions now).

System Metadata

User name space
- Address / e-mail / telephone number
- Role (administrator, curator, user)

File name space
- Creation date / size / location / checksum
- Owner / access controls

Storage resource name space
- Capacity / quotas / Type (archive, disk, fast cache)

Domain Metadata

User-given metadata
- Key-Value-Unit Triplets, Annotation
- Relational / XML Metadata
- Domain-specific Schema

Adopt OGC standards
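
One way to picture the catalog layout described on this slide is a small relational schema with system metadata in dedicated tables and domain metadata as key-value-unit triplets. The sketch below is an assumption-laden illustration: it uses Python's built-in SQLite in place of MySQL, and every table name, column name, and sample value is hypothetical rather than taken from the ESOD catalog.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL catalog; all table and column
# names below are hypothetical, chosen to mirror the slide's name spaces.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    email TEXT,
    role TEXT CHECK (role IN ('administrator', 'curator', 'user'))
);
CREATE TABLE files (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    created TEXT,
    size_bytes INTEGER,
    location TEXT,
    checksum TEXT,
    owner_id INTEGER REFERENCES users(id)
);
-- Domain metadata stored as key-value-unit triplets attached to a file.
CREATE TABLE file_metadata (
    file_id INTEGER REFERENCES files(id),
    meta_key TEXT NOT NULL,
    meta_value TEXT NOT NULL,
    unit TEXT
);
CREATE INDEX idx_metadata_key_value ON file_metadata (meta_key, meta_value);
""")

# Made-up sample rows, for illustration only.
conn.execute("INSERT INTO users VALUES (1, 'alice', 'alice@example.org', 'curator')")
conn.execute(
    "INSERT INTO files VALUES (1, 'buoy_0001.nc', '2015-01-01', 1024, "
    "'/archive/demo', 'abc123', 1)"
)
conn.executemany(
    "INSERT INTO file_metadata VALUES (?, ?, ?, ?)",
    [(1, "water_temperature", "18.5", "degC"), (1, "depth", "5", "m")],
)


def find_files(key, unit):
    """Data discovery: file names that carry a given metadata key and unit."""
    rows = conn.execute(
        "SELECT f.name FROM files f "
        "JOIN file_metadata m ON m.file_id = f.id "
        "WHERE m.meta_key = ? AND m.unit = ? ORDER BY f.name",
        (key, unit),
    )
    return [name for (name,) in rows]


print(find_files("depth", "m"))
```

The index on the key and value columns is what keeps triplet lookups fast as the catalog grows; a document or key-value store would model the same triplets without the join.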

(25)

Smart query and answer

Develop a controlled vocabulary based on international standards, e.g. HDF, NetCDF, OGC, etc.

Derive RDF triple dataset from common query tasks of ESOD.

Combine visual data and metadata to support a specific high-level information-seeking user task.

Design an interface for graph and visual comparison search.

Selected one specific task, that of comparing sets of objects, and designed a prototype interface on top of linked data sets used by the experts to support this task explicitly.

Develop methods for data provenance.
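
To make the idea of deriving triples from common query tasks concrete, here is a toy in-memory triple store with wildcard pattern matching. It is a sketch under stated assumptions, not an RDF library and not the ESOD vocabulary: the dataset identifiers, predicate names, and helper functions are all invented for illustration.

```python
# A toy triple store: (subject, predicate, object) tuples in a set.
# This is not an RDF library; the identifiers below are made up.
TRIPLES = {
    ("dataset:a", "hasFormat", "NetCDF"),
    ("dataset:a", "providedBy", "org:x"),
    ("dataset:b", "hasFormat", "HDF"),
    ("dataset:b", "providedBy", "org:y"),
    ("dataset:c", "hasFormat", "NetCDF"),
    ("dataset:c", "providedBy", "org:z"),
}


def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def datasets_with_format(triples, fmt):
    """A 'common query task': which datasets use a given format?"""
    return [s for s, _, _ in match(triples, p="hasFormat", o=fmt)]


print(datasets_with_format(TRIPLES, "NetCDF"))
```

A controlled vocabulary plays the role of the fixed predicate and format names here: queries only work across datasets if everyone spells them the same way.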

(26)

The Path from Infrastructure to Data

Sensing for Understanding

Sensing: (Infrastructure-Centric)

Evolving over the past 10 years: Ecogrid, SARS Grid, etc.

Institutional missions based on special vehicles: satellites, research ships and aircraft, met stations, etc.

It is growing even larger and broader, e.g. IoT, social networks.

Understanding: (Data-Centric)

Modeling from hypothesis to discovery

(27)

Government Big Data

Government Clouds & Open data

Big Data

10 government clouds: Health, Food, e-Invoice, Transportation, Environment, Finance, Education, Culture, Disaster Prevention, Geospatial Information, Agriculture and e-Government, focusing on societal impact via research and innovation.

3 kinds of data: Infrastructural data, Public data and Personal data, open but requiring protection.

Open, but Protected

Licensing models: Open Government License (OGL), Non-Commercial Government License, Charged License.

Access Platform

Facilities and systems to enable data storage, management and analytics.

(28)

Government Big Data

NCHC/NARLabs provides a Big Data Platform, bridging excellent research and societal impact.

Transportation: etag system

Health: Electronic Medical Record

Food Security: Food Traceability

Network: Openflow

Finance: e-invoice
