Experimentation on Cloud Databases to Handle Genomic Big Data

Academic year: 2021

(1)

Experimentation on Cloud Databases to Handle Genomic Big Data

Presented by: Abraham Gómez, M.Sc., B.Sc.
Academic Advisor: Alain April, Ph.D., M.Sc.A., B.A.

(2)

Agenda

1. Concepts & Overview
2. Technology description
3. Case study
4. Methodology
5. Obtained Results
6. Conclusion

(3)

Concepts & Overview

At present, industry and academia are focused on delivering information.

Applications that used to run locally are moving to the Cloud.

But is this migration actually easy?

(4)

Concepts & Overview

(5)

Concepts & Overview

This new field of research is called big data.

-Volume: comes in one size: large.

-Velocity: often time-sensitive.

-Variety: extends beyond structured data: text, audio, video, etc.

(6)

Concepts & Overview

Why are genetics and genomics related to Big Data?

-The time and cost of DNA sequencing have been reduced by a factor of 1 million.

-In the near future, mapping a personal genome could cost a few hundred dollars.

(7)

Concepts & Overview

Why are genetics and genomics related to Big Data?

-Today, a personal genome = 100 GB.

-If everybody wants their own genome = hundreds of petabytes of data (Volume).

-If everybody wants to get it fast (Velocity).
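
As a rough check of the Volume figure, the arithmetic can be sketched as follows. The 100 GB per genome comes from the slide; the cohort sizes and the decimal units (1 PB = 1,000,000 GB) are assumptions chosen only for illustration.

```python
# Back-of-the-envelope storage estimate for personal genomes.
GB_PER_GENOME = 100        # figure quoted on the slide
GB_PER_PB = 1_000_000      # assumption: decimal units (1 PB = 10^6 GB)


def genomes_to_petabytes(num_genomes: int) -> float:
    """Return the raw storage, in petabytes, needed for num_genomes genomes."""
    return num_genomes * GB_PER_GENOME / GB_PER_PB


if __name__ == "__main__":
    # Hypothetical cohort sizes, not taken from the presentation.
    for people in (1_000_000, 5_000_000, 10_000_000):
        print(f"{people:>10,} genomes -> {genomes_to_petabytes(people):,.0f} PB")
```

Under these assumptions, a few million genomes already add up to hundreds of petabytes.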

(8)

Agenda

1. Concepts & Overview
2. Technology description
3. Case study
4. Methodology
5. Obtained Results
6. Conclusion

(9)

Concepts & Overview

Companies presently use relational database (RDB) technology.

Big data applications also use RDB technology.

(10)

Concepts & Overview

However, the case study shows that accessing petabytes of data efficiently (in the cloud) is challenging.

Solutions:

-Sharding, but it creates technological problems.

-Buying specialized hardware, but it creates budget problems.
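
To illustrate one kind of technological problem that sharding can introduce, here is a minimal sketch. It assumes a hypothetical hash-based scheme that routes rows by sample_id across a fixed number of shards; the shard count, the routing function, and the row layout are illustrative assumptions, not the setup used in the case study.

```python
from collections import defaultdict

NUM_SHARDS = 4  # assumption: fixed number of database shards


def shard_for(sample_id: int) -> int:
    """Route a row to a shard using its sample_id (hypothetical shard key)."""
    return sample_id % NUM_SHARDS


def build_shards(rows):
    """Distribute (sample_id, variant_id) rows across the shards."""
    shards = defaultdict(list)
    for sample_id, variant_id in rows:
        shards[shard_for(sample_id)].append((sample_id, variant_id))
    return shards


def shards_for_samples(sample_ids):
    """A lookup by shard key only touches the shards that own those keys."""
    return {shard_for(s) for s in sample_ids}


def find_variant(shards, variant_id):
    """A lookup by a non-key column has to scan every shard."""
    scanned = 0
    hits = []
    for shard_id in range(NUM_SHARDS):
        scanned += 1
        hits += [row for row in shards[shard_id] if row[1] == variant_id]
    return hits, scanned


if __name__ == "__main__":
    shards = build_shards([(146, 1), (162, 2), (167, 1), (189, 3)])
    print("shards touched by sample lookup:", shards_for_samples([146, 162]))
    print("variant lookup (hits, shards scanned):", find_variant(shards, 1))
```

Queries that filter on the shard key stay local to a few shards, while queries on any other column fan out to all of them, and the application has to manage that routing itself.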

(11)

Concepts & Overview

Emerging Cloud Computing (CC) database technology (NoSQL) is a new research domain that addresses these challenges.

(12)

Hadoop Project

(13)

Hadoop Project

Several subprojects:

HBase is the Hadoop database: a distributed, scalable big data store.

(14)

Agenda

1. Concepts & Overview
2. Technology description
3. Case study
4. Methodology
5. Obtained Results
6. Conclusion

(15)

Case study

Data from the “Centre d'Excellence en Neuromique de l'Université de Montréal” (CENUM) was used.

The research group uses a five-step process to collect the information saved in the database:

(16)

Case study

(17)

Case study

The current RDB is starting to show weaknesses under the sheer amount of data.

Some tables, holding only a 25% sample of the data, have:

Within the next 2 years there will be 200% more data.

Coverage

(18)

Case study

SQL 1 = Exclude the variants present in 1 or more individuals (list of samples).

SELECT DISTINCT variant_id
FROM sample_variant
WHERE sample_id IN (146, 162, 167, 189, 193)
GROUP BY variant_id
ORDER BY variant_id
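
The behaviour of SQL 1 can be tried on toy data with a small sketch. It assumes SQLite as a stand-in engine (the slides do not name the RDB product) and integer columns for `sample_variant`; the helper name `variants_to_exclude` and the sample rows are hypothetical.

```python
import sqlite3

# SQL 1 as shown on the slide.
QUERY = """
SELECT DISTINCT variant_id
FROM sample_variant
WHERE sample_id IN (146, 162, 167, 189, 193)
GROUP BY variant_id
ORDER BY variant_id
"""


def variants_to_exclude(rows):
    """Load (sample_id, variant_id) rows into an in-memory table and run SQL 1."""
    conn = sqlite3.connect(":memory:")
    try:
        # Assumed schema: two integer columns, as suggested by the query.
        conn.execute(
            "CREATE TABLE sample_variant (sample_id INTEGER, variant_id INTEGER)"
        )
        conn.executemany("INSERT INTO sample_variant VALUES (?, ?)", rows)
        return [variant_id for (variant_id,) in conn.execute(QUERY)]
    finally:
        conn.close()


if __name__ == "__main__":
    toy_rows = [(146, 10), (162, 10), (200, 11), (189, 12), (201, 12)]
    print(variants_to_exclude(toy_rows))
```

Note that `DISTINCT` and `GROUP BY variant_id` both deduplicate here, so either one alone would return the same list.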

(19)

Case study

Some SQL requests become very slow, or sometimes even impossible to run.

(20)

Case study

The problem has three dimensions:

-Data Storage

-Data Retrieval

-Data Analysis

(21)

Agenda

1. Concepts & Overview
2. Technology description
3. Case study
4. Methodology
5. Obtained Results
6. Conclusion

(22)

Methodology

Several case studies show that HBase is a feasible solution.

HBase Pros:

-Avoids the sharding problems.

-Avoids buying specialized hardware.

HBase Cons:

(23)

Methodology

Case study analysis.

-What exactly do they need?

-Think of a long-term data store.

-Improve data retrieval.

Redesign the DB schema based on:

-SQL requests (read / write).

(24)

Methodology

Redesign the DB schema based on:

-Merging the reference tables into one.

-Faster data retrieval.

-Avoiding some relationships.

Use Sqoop for data migration.
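
As an illustration of the Sqoop step, the sketch below only builds the argument list for a `sqoop import` into HBase and prints it; it does not run Sqoop. The slides do not show the actual command, so the JDBC URL, the table name `variant`, the column family `info`, and the row key column are all hypothetical.

```python
import shlex


def sqoop_import_command(jdbc_url, table, hbase_table, column_family, row_key):
    """Build the argument list for a `sqoop import` into HBase (not executed here)."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,            # source relational database
        "--table", table,                 # source table to migrate
        "--hbase-table", hbase_table,     # target HBase table
        "--column-family", column_family, # family receiving the imported columns
        "--hbase-row-key", row_key,       # source column used as HBase row key
        "--hbase-create-table",           # create the HBase table if it is missing
    ]


if __name__ == "__main__":
    cmd = sqoop_import_command(
        jdbc_url="jdbc:mysql://db.example.org/genomics",  # hypothetical database
        table="variant",                                  # hypothetical table
        hbase_table="variant",
        column_family="info",
        row_key="variant_id",
    )
    print(shlex.join(cmd))
```

Keeping the command as a list makes it straightforward to pass to `subprocess.run` once the real connection details are known.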

(25)

Methodology

(26)

Agenda

1. Concepts & Overview
2. Technology description
3. Case study
4. Methodology
5. Obtained Results
6. Conclusion

(27)

Obtained Results

Create a standard for the migration process from RDB to NoSQL DB.

- Proc #1: Create column families according to the access / write pattern (SQL requests).

- Proc #2: Merge the reference tables.

- Proc #3: Spread the relationships manually.

Use of Sqoop.
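
The three procedures can be sketched in plain Python, with nested dictionaries standing in for HBase rows (row key, then column family, then qualifier) and no HBase client involved. Only `sample_variant`, `sample_id`, and `variant_id` come from SQL 1; the `variant` and `gene` tables, their columns, and the `info` and `samples` column families are hypothetical.

```python
def to_hbase_rows(variants, genes, sample_variant):
    """Denormalize relational rows into HBase-style rows.

    Returns {row_key: {column_family: {qualifier: value}}}.
    """
    rows = {}
    for v in variants:
        gene = genes[v["gene_id"]]        # Proc #2: merge the reference table
        rows[str(v["variant_id"])] = {
            "info": {                     # Proc #1: columns that are read together
                "position": v["position"],
                "gene_name": gene["name"],
            },
            "samples": {},                # Proc #1: columns written per sample
        }
    for link in sample_variant:           # Proc #3: spread the relationship
        row = rows[str(link["variant_id"])]
        row["samples"][str(link["sample_id"])] = "1"
    return rows


def variants_in_samples(rows, sample_ids):
    """Equivalent of SQL 1 on the denormalized rows."""
    wanted = {str(s) for s in sample_ids}
    return sorted(
        int(key) for key, row in rows.items()
        if wanted.intersection(row["samples"])
    )


if __name__ == "__main__":
    rows = to_hbase_rows(
        variants=[{"variant_id": 10, "gene_id": 1, "position": 12345}],
        genes={1: {"name": "GENE_A"}},
        sample_variant=[{"sample_id": 146, "variant_id": 10}],
    )
    print(rows)
    print(variants_in_samples(rows, [146, 162]))
```

In this layout the join between variants, genes, and samples is paid once at migration time, so a read only needs the row for a given variant.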

(28)

Obtained Results

The SQL request has better performance:

(Chart comparing HBase and RDB.)

(29)

Agenda

1. Concepts & Overview
2. Technology description
3. Case study
4. Methodology
5. Obtained Results
6. Conclusion

(30)

Conclusion

Migration from RDB to HBase can be standardized.

More data from the case study will be needed.

So far, the results support that HBase is faster than the RDB (in a controlled environment).

(31)

Conclusion

HBase is a good solution for:

1. Data Storage, because it is possible to save petabytes of data.

2. Data Retrieval, because you can obtain your data fast.

3. Data Analysis, if you use other Hadoop subprojects (MapReduce, Hive) from HBase.

(32)

Thanks!!!

Questions?

http://www.etsmtl.ca/
http://www.gelog.etsmtl.ca/
