• No results found

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

N/A
N/A
Protected

Academic year: 2021

Share "Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging"

Copied!
43
0
0

Loading.... (view fulltext now)

Full text

(1)
(2)

Outline

2

 High Performance Computing (HPC)

 Towards exascale computing: a brief history

 Challenges in the exascale era

 Big Data meets HPC

Some facts about Big Data

Technologies

HPC and Big Data converging

 Case Studies:

 Bioinformatics

 Natural Language Processing (NLP)

(3)

Outline

3

High Performance Computing (HPC)

Towards exascale computing: a brief history

Challenges in the exascale era

 Big Data meets HPC

Some facts about Big Data

Technologies

HPC and Big Data converging

 Case Studies:

 Bioinformatics

 Natural Language Processing (NLP)

(4)
(5)

What types of big problem might require a “Big Computer”?

High Performance Computing (HPC)

5

Compute Intensive : A single problem requiring a large

amount of computation

Memory Intensive : A single problem requiring a large

amount of memory

High Throughput : Many unrelated problems to be

executed over a long period

Data Intensive : Operation on a large amount of data

(6)

What types of big problem might require a “Big Computer”?

High Performance Computing (HPC)

6

Compute Intensive:

 Distribute the work across multiple CPUs to reduce the execution time as far as possible:

̶ Each thread performs a part of the work on its own CPU, concurrently with the others

 CPUs may need to exchange data rapidly, using specialized hardware

 Large systems running multiple parallel jobs also need fast access to storage

 Many use cases from Physics, Chemistry, Energy, Engineering, Astronomy, Biology...

 The traditional domain of HPC and supercomputers

(7)

What types of big problem might require a “Big Computer”?

High Performance Computing (HPC)

7

Memory Intensive:

 Aggregate sufficient memory to enable solution at all

 Technically more challenging if the program cannot be parallelized

High Throughput:

 Distribute work across multiple CPUs to reduce the overall execution time as far as possible

 Workload is trivially (or embarrassingly) parallel

̶ Workload breaks up naturally into independent pieces

̶ Each piece is performed by a separate process on a separate CPU (concurrently)

 Emphasis is on throughput over a period, rather than on performance on a single problem

 Obviously a supercomputer can do this too

(8)

What types of big problem might require a “Big Computer”?

High Performance Computing (HPC)

8

Data Intensive:

 Distribute the data across multiple CPUs to process in a reasonable time

 Note that the same work may be done on each data segment

 Rapid movement of data in and out of (disk) storage becomes important

Big Data and how to efficiently process it currently occupies much thought

(9)
(10)
(11)
(12)
(13)

Challenges in the exascale era

High Performance Computing (HPC)

13

FLOPS is not on command

 Once upon a time… when FPU was the most expensive

and precious resource in a supercomputer

 Metrics: FLOPS, FLOPS and FLOPS

 But Data movement’s energy efficiency isn’t imporving as

fast as Flop’s energy efficiency

Algorithm designer should be thinking in terms of

wasting the inexpensive resource (flops) to reduce data

movement

 Communication-avoiding algorithms

(14)
(15)

Outline

15

 High Performance Computing (HPC)

 Towards exascale computing: a brief history

 Challenges in the exascale era

Big Data meets HPC

Some facts about Big Data

Technologies

HPC and Big Data converging

 Case Studies:

 Bioinformatics

 Natural Language Processing (NLP)

(16)
(17)

Some facts about Big Data

Big Data meets HPC

17

Data in 2013: 4.4 Zettabytes (4.4 x 10

21

bytes)

Estimation in 2020: 44 Zettabytes

Searching for one element in a 1 MB file: < 0.1 seconds

Searching for one element in a 1 GB file: a few minutes

Searching for one element in a 1 TB file: about 30 hours

Searching for one element in a 1 PB file: > 3 years

Searching for one element in a 1 EB file: 30 centuries

Searching for one element in a 1 ZB file: 3,000 millennium

Estimation using a PC

(18)
(19)
(20)

Technologies: how to process all these data

Big Data meets HPC

20

Map/Reduce paradigm

 Introduced by Dean and Ghemawat (Google, 2004)

 As simple as providing:

̶ MAP function that processes a key/value pair to generate a set of intermediate key/value pairs

̶ REDUCE function that merges all intermediate values associated with the same intermediate key

 Runtime takes care of:

Partitioning the input data (Parallelism)

 Scheduling the program’s execution across a set of machines (Parallelism)

 Handling machine failures

Managing inter-machines communication (Parallelism)

(21)
(22)

Technologies: Apache Hadoop

Big Data meets HPC

22

Apache Hadoop is an open-source implementation of

the Map/Reduce paradigm

 It’s a framework for large-scale data processing

 It is designed to run on cheap commodity hardware

 It automatically handles data replication and node failure

 Hadoop provides:

API+implementation for working with Map/Reduce

Job configuration and efficient scheduling

Browser-based monitoring of important cluster stats

A distributed filesystem optimized for HUGE amounts of

data (HDFS)

(23)
(24)
(25)

Technologies: Apache Hadoop

Big Data meets HPC

25

Pros

:

 Write here all the advantages commented previously

 Hadoop it’s written in Java but allows to execute codes from different programming languages (Hadoop Streaming)

 Hadoop Ecosystem (Pig, Hive, HBase, etc…)

(26)

Technologies: Apache Hadoop

Big Data meets HPC

26

Cons:

 The problem must fit the Map/Reduce paradigm (embarrassingly parallel problems)

 Bad for iterative applications

Important degradations in performance when using Hadoop

Streaming (i.e. when codes are written in languages as Fortran, C, Python, Perl, etc.)

 Intermediate results output is always stored on disks (In-Memory MapReduce – IMMR)

 No reuse computation for jobs with similar input data:

̶ For example, job runs everyday to find the most frequently read news over the past week

Hadoop was not-designed by HPC people (joke!!!)

(27)

Technologies: Apache Spark

Big Data meets HPC

27

Apache Spark

is an open source project

It starts as a research project in Berkeley

Cluster computing framework designed to be fast and general-

purpose

Pros (I):

 Extends the Map/Reduce paradigm to support more types of computations (interactive queries and stream processing)

 APIs in Python, Java and Scala

 Spark has the ability to run computations in memory (Resilient Distributed Datasets – RDDs)

(28)

Technologies: Apache Spark

Big Data meets HPC

28

Pros (II):

 It supports different workloads in the same engine:

̶ Batch applications

̶ Iterative algorithms

̶ Streaming and iterative queries

 Good integration with Hadoop

Cons:

 Memory requirements

(29)
(30)
(31)

Other Technologies

Big Data meets HPC

31

Apache Flink

It’s an European project

Functionalities very similar to those explained for Spark

If you like R, try this:

RHIPE

̶ Released as R package

̶ Map and Reduce functions as R code

Big R (O. D. Lara et al. “Big R_ Large-scale Analytics on Hadoop Using R”, IEEE Int. Congress on Big Data, 2014)

̶ It hides the Map/Reduce details to the programmer

(32)
(33)

HPC and Big Data converging

Big Data meets HPC

33

 Those technologies were designed to run on “cheap”

commodity clusters, but…

 … there is more to Big Data than large amounts of

information

It also related to massive distributed activities such as

complex queries and computation (analytics or data-

intensive scientific computing)

High Performance Data Analytics (HPDA)

(34)

HPC and Big Data converging

Big Data meets HPC

34

Infiniband :

It’s the standard interconnect technology used in HPC

supercomputers

Commodity clusters use 1Gbps or 10Gbps ethernet

Hadoop is very network-intensive (e.g. Data Nodes and

Task Trackers exchange a lot of information)

56Gbps FDR can be 100x faster than 10 Gbps ethernet

due to its superior bandwidth and latency

It allows to scale the big data platform to the desired size,

without worrying about bottlenecks

(35)

HPC and Big Data converging

Big Data meets HPC

35

Accelerators :

Hadoop and Spark only work on homogeneous clusters

(CPUs)

HPC is moving to heterogeneous platforms consisting of

nodes which include GPUs, co-processors (Intel Xeon Phi)

and/or FPGAs.

Most promising systems to reach the exaflop

Accelerator can boost the performance (not all

applications fit well)

For complex analytics or data-intensive scientific

computing

A lot of interest: HadoopCL, Glasswing, GPMR, MapCG,

MrPhi, etc…

(36)

Outline

36

 High Performance Computing (HPC)

 Towards exascale computing: a brief history

 Challenges in the exascale era

 Big Data meets HPC

Some facts about Big Data

Technologies

HPC and Big Data converging

Case Studies:

Bioinformatics

Natural Language Processing (NLP)

(37)
(38)
(39)

Natural Language Processing

Outline

39

 NLP suited to structure and organize the textual information accessible through Internet

 Linguistic processing is a complex task that requires the use of several subtasks organized in interconnected modules

Most of the existent NLP modules are programmed using Perl (regular expressions)

We have integrated into Hadoop three NLP modules (Hadoop Streaming):

Named Entity Recognition (NER): It consists of identifying as a single unit (or token) those words or chains of words denoting an entity, e.g. New York, University of San Diego, Herbert von Karajan, etc.

PoS-Tagging: It assigns each token of the input text a single PoS tag provided with morphological information e.g. singular and masculine adjective, past participle verb, etc.

Named Entity Classification (NEC): It is the process of classifying entities by means of classes such as “People”, “Organizations”, “Locations”, or “Miscellaneous”.

J. M. Abuín, Juan C. Pichel, Tomás F. Pena, P. Gamallo and M. García. “Perldoop: Efficient Execution of Perl Scripts on Hadoop Clusters”, Big Data Conference, 2014

(40)

Natural Language Processing

Outline

40

 HPC people (including myself) are “obsessed” with

performance… Flops, Flops and Flops

 Hadoop Streaming is really nice, but it shows an important

degradation in performance (w.r.t. Java codes)

Perldoop: we designed an automatic source-to-source

translator Perl to Java:

It’s not general-purpose

Perl scripts should be in Map/Reduce format

Perl codes should follow some simple programming rules

J. M. Abuín, Juan C. Pichel, Tomás F. Pena, P. Gamallo and M. García. “Perldoop: Efficient Execution of Perl Scripts on Hadoop Clusters”, Big Data Conference, 2014

(41)
(42)
(43)

43

Thank you!

[email protected] citius.usc.es

References

Related documents

Given that there are now relatively few property funds management groups, each of whom are relatively large and global, together with a decreasing number of outsourced

If the stay at home order is lifted before public and staff safety can be assured, then other factors should be used in determining what level of services the Library is able

 Presented at the Annual Latino Enhancement Cooperative’s Latino Leadership Conference at Indiana University to provide pre-collegiate preparation seminars for high school

The defense intelligence community understands information sharing is critical to advance the mission, facilitate effective and efficient war- fighting, conduct intelligence

Population and economic growth within the Durban Metropolitan region in eastern South Africa has increased the demand for water supply. This ever-increasing demand means that

In the present investigation, the removal of Naphthol green B dye is studied using Hydrogen peroxide treated Red mud as adsorbent.. The sorption characteristics of the adsorbent

According to these results, the best condition for sensory evaluation belonged to treatment under modified atmosphere CO2 70% with flexible pouch 4-layer,since

This paper employed geographic information system (GIS) to process the input data, RIDF curve to generate different design storm scenarios and PCSWMM to simulate