Experimentation on Cloud Databases to Handle Genomic Big Data
Presented by: Abraham Gómez, M.Sc., B.Sc.
Academic Advisor: Alain April, Ph.D., M.Sc.A., B.A.
Agenda
1. Concepts & Overview
2. Technology description
3. Case study
4. Methodology
5. Obtained Results
6. Conclusion
Concepts & Overview
At present, industry and academia are focused on delivering information.
Local applications are moving to applications in the Cloud.
But is this migration easy?
This new field of research is called big data.
-Volume: comes in one size: large.
-Velocity: often time-sensitive.
-Variety: extends beyond structured data: text, audio, video, etc.
Why are genetics and genomics related to Big Data?
-The time and cost of DNA sequencing have been reduced by a factor of 1 million.
-In the near future, mapping a personal genome could cost a few hundred dollars.
-Today, a personal genome = 100 GB.
-If everybody wants their own genome = hundreds of petabytes of data (Volume).
-If everybody wants to get it fast (Velocity).
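The volume argument above can be checked with simple arithmetic. The sketch below takes the 100 GB per genome quoted on the slide and uses decimal units (1 PB = 10^6 GB). The cohort sizes are hypothetical, chosen only for illustration. Under these assumptions, a few million genomes already reach hundreds of petabytes.

```python
GB_PER_GENOME = 100          # figure quoted on the slide
GB_PER_PETABYTE = 1_000_000  # decimal units: 1 PB = 10^6 GB


def genome_storage_pb(num_genomes: int, gb_per_genome: int = GB_PER_GENOME) -> float:
    """Return the raw storage needed for num_genomes genomes, in petabytes."""
    return num_genomes * gb_per_genome / GB_PER_PETABYTE


if __name__ == "__main__":
    # Hypothetical cohort sizes (assumptions, not figures from the case study).
    for cohort in (1_000, 1_000_000, 5_000_000):
        print(f"{cohort:>9,} genomes -> {genome_storage_pb(cohort):,.1f} PB")
```

This counts raw genome data only. Replication, indexes and derived results would add to the total.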
Companies presently use relational database (RDB) technology.
Big data applications also use RDB technology.
However, the case study shows that accessing petabytes of data efficiently (in the cloud) is challenging.
Solutions:
-Sharding, but it creates technological problems.
-Buying specialized hardware, but it creates budget problems.
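The slide does not say which technological problems sharding creates. One commonly cited issue is sketched below: a lookup on the sharding key touches one shard, while a lookup on any other column must ask every shard. The shard count, the modulo rule and the function names are hypothetical. The (sample_id, variant_id) rows only mimic the shape of the sample_variant table used later in the case study.

```python
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical number of database servers


def shard_for(sample_id: int) -> int:
    """Pick a shard from the sharding key (here: sample_id modulo shard count)."""
    return sample_id % NUM_SHARDS


def build_shards(rows):
    """Distribute (sample_id, variant_id) rows across the shards."""
    shards = defaultdict(list)
    for sample_id, variant_id in rows:
        shards[shard_for(sample_id)].append((sample_id, variant_id))
    return shards


def find_by_sample(shards, sample_id):
    """Lookup on the sharding key: only one shard has to be queried."""
    hits = [row for row in shards[shard_for(sample_id)] if row[0] == sample_id]
    return hits, 1


def find_by_variant(shards, variant_id):
    """Lookup on another column: every shard has to be queried (scatter-gather)."""
    hits = []
    for shard in range(NUM_SHARDS):
        hits.extend(row for row in shards[shard] if row[1] == variant_id)
    return sorted(hits), NUM_SHARDS


if __name__ == "__main__":
    shards = build_shards([(146, 1), (162, 1), (167, 2), (189, 3), (193, 1)])
    print(find_by_sample(shards, 146))   # one shard queried
    print(find_by_variant(shards, 1))    # all shards queried
```

The application, not the database, has to carry this routing logic, and it has to be redone whenever the number of shards changes.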
Emerging Cloud Computing (CC) database technology (NoSQL) is a new research domain that addresses these challenges.
Hadoop Project
Several subprojects:
HBase is the Hadoop database: a distributed, scalable, big data store.
Case study
Data from the “Centre d'Excellence en Neuromique de l'Université de Montréal” (CENUM) has been used.
The research group uses a five-step process to collect the information saved in the database.
The current RDB is starting to show weaknesses under the impressive amount of data.
Within the next 2 years, it will hold 200% more data.
Some tables, with only a 25% sample, have:
Coverage
SQL 1 = Exclude some variants present in 1 or more individuals (list of samples).
SELECT DISTINCT variant_id
FROM sample_variant
WHERE sample_id IN (146, 162, 167, 189, 193)
GROUP BY variant_id
ORDER BY variant_id;
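A minimal sketch of SQL 1 on a toy table follows. It uses Python's built-in sqlite3 module as a stand-in for the group's RDB, whose engine the slides do not name. The two-column schema and the sample rows are assumptions made for illustration. GROUP BY is left out here because DISTINCT already returns each variant_id once.

```python
import sqlite3


def variants_in_samples(conn, sample_ids):
    """Return the distinct variant_ids present in at least one listed sample."""
    placeholders = ", ".join("?" for _ in sample_ids)
    query = f"""
        SELECT DISTINCT variant_id
        FROM sample_variant
        WHERE sample_id IN ({placeholders})
        ORDER BY variant_id
    """
    return [row[0] for row in conn.execute(query, list(sample_ids))]


def demo():
    """Build a toy sample_variant table in memory and run SQL 1 on it."""
    conn = sqlite3.connect(":memory:")
    # Assumed schema: one row per (sample, variant) pair.
    conn.execute("CREATE TABLE sample_variant (sample_id INTEGER, variant_id INTEGER)")
    conn.executemany(
        "INSERT INTO sample_variant VALUES (?, ?)",
        [(146, 10), (146, 11), (162, 10), (200, 12), (193, 13)],  # made-up rows
    )
    result = variants_in_samples(conn, [146, 162, 167, 189, 193])
    conn.close()
    return result


if __name__ == "__main__":
    print(demo())
```

Variant 12 does not appear in the result because it only belongs to sample 200, which is not in the list.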
Some SQL requests become very slow, or sometimes even impossible.
The problem has three dimensions:
-Data Storage
-Data Retrieval
-Data Analysis
Methodology
Several case studies show that HBase is a feasible solution.
HBase Pros:
-Avoids the sharding problems.
-Avoids buying specialized hardware.
HBase Cons:
Case study analysis:
-What exactly do they need?
-Think of a long-term data store.
-Improve data retrieval.
Redesign the DB schema based on:
-SQL requests (read / write).
-Merging the reference tables into one.
-Faster data retrieval.
-Avoiding some relationships.
Use Sqoop for data migration.
Obtained Results
Created a standard migration process from RDB to NoSQL DB:
- Proc #1: Create column families according to the access / write patterns (SQL requests).
- Proc #2: Merge reference tables.
- Proc #3: Spread relationships manually.
Use of Sqoop.
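The three procedures can be sketched with plain Python dictionaries standing in for an HBase table (row key -> column family -> qualifier -> value). This is an illustration under stated assumptions, not the schema used in the case study. The family names "info" and "samples", the reference attributes and the zero-padded row key are all hypothetical, and no HBase client is involved.

```python
def row_key_for(variant_id: int) -> str:
    """Zero-pad the id so that lexicographic key order matches numeric order."""
    return f"{variant_id:010d}"


def build_hbase_rows(variants, sample_variant):
    """Denormalize relational rows into one wide row per variant.

    variants:       {variant_id: {attribute: value}}  (merged reference data)
    sample_variant: iterable of (sample_id, variant_id) relationship rows
    """
    rows = {}
    # Proc #1 and #2: one family per access pattern; reference data merged in.
    for variant_id, attrs in variants.items():
        rows[row_key_for(variant_id)] = {"info": dict(attrs), "samples": {}}
    # Proc #3: spread the relationship into the row, one qualifier per sample.
    for sample_id, variant_id in sample_variant:
        row = rows.setdefault(row_key_for(variant_id), {"info": {}, "samples": {}})
        row["samples"][str(sample_id)] = "1"
    return rows


def variants_in_samples(rows, sample_ids):
    """Equivalent of SQL 1: row keys that have at least one listed sample."""
    wanted = {str(sample_id) for sample_id in sample_ids}
    return sorted(key for key, row in rows.items() if wanted & row["samples"].keys())


if __name__ == "__main__":
    variants = {10: {"gene": "GENE_A"}, 11: {"gene": "GENE_B"}}  # made-up data
    rows = build_hbase_rows(variants, [(146, 10), (162, 10), (200, 11)])
    print(variants_in_samples(rows, [146, 162]))
```

With the relationship stored inside the variant row, the request reads one table and one column family instead of joining or grouping over a separate relationship table.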
The SQL request performs better on HBase than on the RDB.
Conclusion
Migration from RDB to HBase can be standardized.
More data from the case study will be needed.
So far, the results support that HBase is faster than the RDB (in a controlled environment).
HBase is a good solution for:
1. Data Storage, because it is possible to save petabytes of data.
2. Data Retrieval, because you can obtain your data fast.
3. Data Analysis, if you use other Hadoop subprojects (MapReduce, Hive) with HBase.
Thanks!!!
Questions?
http://www.etsmtl.ca/ http://www.gelog.etsmtl.ca/