An introduction to Big Data Science.
Big Data – Explained
Presentation Agenda
• What is Big Data
• Why learn Big Data
• Who is it for
• How to start learning Big Data
• When to learn it
• Objective and Benefits of Big Data
What is Big Data
Large-Scale Data Management Data Science and Analytics
Managing very large amounts of data and extracting
Introduction to Big Data
What is Big Data?
What makes data, “Big” Data?
Big Data Definition
• No single standard definition…
“Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…
Examples: Google, Wikipedia, Amazon, Facebook, eBay, and other corporate enterprises…
Data explosion
Data generation
▫ Web data, e-commerce
▫ Purchases at department and grocery stores
▫ Bank/Credit Card transactions
▫ Social Networks
▫ Health care records
▫ Satellite imagery and weather modeling
Data Approximation
• Google processes 20 PB a day (2008)
• Wayback Machine has 3 PB + 100 TB/month (3/2009)
• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• CERN’s Large Hadron Collider (LHC) generates 15 PB a year
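For a rough feel of these growth rates, the sketch below estimates how long each archive would take to double in size if the quoted rate stayed constant. It assumes decimal units (1 PB = 1000 TB) and a constant growth rate, neither of which the figures above state; the function name is only illustrative.

```python
TB_PER_PB = 1000  # assumption: decimal units


def periods_to_double(stored_pb, growth_tb_per_period):
    """Periods (days or months) until the stored volume doubles at a constant rate."""
    return stored_pb * TB_PER_PB / growth_tb_per_period


facebook_days = periods_to_double(2.5, 15)    # 2.5 PB + 15 TB/day
ebay_days = periods_to_double(6.5, 50)        # 6.5 PB + 50 TB/day
wayback_months = periods_to_double(3, 100)    # 3 PB + 100 TB/month

print(f"Facebook doubles in about {facebook_days:.0f} days")
print(f"eBay doubles in about {ebay_days:.0f} days")
print(f"Wayback Machine doubles in about {wayback_months:.0f} months")
```

At the quoted rates, each archive would double within months to a few years, which is why storage has to be designed to scale out.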
Characteristics of Big Data:
1-Scale (Volume)
• Data Volume
▫ 44x increase from 2009 to 2020
▫ From 0.8 zettabytes (ZB) to 35 ZB
• Data volume is increasing exponentially
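The volume figures can be turned into an implied yearly growth rate. This is a small worked calculation, assuming smooth compound growth between 2009 and 2020, which the slide does not claim; the variable names are illustrative.

```python
start_zb, end_zb = 0.8, 35.0
years = 2020 - 2009

growth_factor = end_zb / start_zb               # total growth over the period
annual_rate = growth_factor ** (1 / years) - 1  # compound annual growth rate

print(f"Total growth: {growth_factor:.2f}x")
print(f"Implied annual growth: {annual_rate:.1%}")
```

The ratio of the two endpoints is 43.75, which the slide rounds to 44x.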
Characteristics of Big Data:
2-Complexity (Variety)
• Various formats, types, and structures
• Text, numerical, images, audio, video, sequences, time series, social media data, multi-dim arrays, etc…
• Static data vs. streaming data
• A single application can be generating/collecting many types of data
To extract knowledge all these types of
Characteristics of Big Data:
3-Speed (Velocity)
• Data is being generated fast and needs to be processed fast
• Online Data Analytics
• Late decisions mean missed opportunities
• Examples
▫ E-Promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you
▫ Healthcare monitoring: sensors monitoring your activities and body
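Processing data "fast" usually means acting on each reading as it arrives instead of storing everything first. The sketch below is one possible illustration of the healthcare example: it keeps a sliding-window average over a stream of readings and flags windows above a threshold. The class and function names, the window size, the threshold, and the sample heart rates are all made up for illustration.

```python
from collections import deque


class SlidingAverage:
    """Keeps the mean of the most recent `size` readings."""

    def __init__(self, size):
        self.window = deque(maxlen=size)
        self.total = 0.0

    def add(self, value):
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]  # oldest value is about to be evicted
        self.window.append(value)
        self.total += value
        return self.total / len(self.window)


def alerts(readings, size=3, threshold=100.0):
    """Yield (index, average) whenever the windowed average exceeds the threshold."""
    average = SlidingAverage(size)
    for index, value in enumerate(readings):
        current = average.add(value)
        if current > threshold:
            yield index, current


heart_rates = [72, 75, 80, 110, 125, 130, 90, 70]  # made-up sample data
flagged = list(alerts(heart_rates))
print(flagged)
```

Because `alerts` is a generator, it works on an unbounded stream and holds only the last few readings in memory.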
Big Data: 3V’s
Some Make it 4V’s
Harnessing Big Data
• OLTP: Online Transaction Processing (DBMSs)
• OLAP: Online Analytical Processing (Data Warehousing)
• RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)
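The difference between the first two workloads can be sketched with Python's built-in `sqlite3` module: OLTP is many small transactions that each touch a few rows, while OLAP is a query that scans and aggregates the whole table. The table, column names, and sample values are assumptions made for this example, and SQLite stands in for a real DBMS or data warehouse. RTAP is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, store TEXT, amount REAL)")


def record_sale(store, amount):
    """OLTP-style operation: one short transaction touching a single row."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO sales (store, amount) VALUES (?, ?)", (store, amount))


for store, amount in [("north", 20.0), ("south", 35.5), ("north", 12.5)]:
    record_sale(store, amount)

# OLAP-style operation: one query that scans and aggregates the whole table.
totals = dict(conn.execute("SELECT store, SUM(amount) FROM sales GROUP BY store"))
print(totals)
```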
Who’s Generating Big Data
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Mobile devices (tracking all objects all the time)
• Sensor technology and networks (measuring all kinds of data)
• Progress and innovation are no longer hindered by the ability to collect data
• They are hindered by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion
Why learn Big Data - The Model Has Changed
• The Model of Generating/Consuming Data has Changed
Old Model: Few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are consuming data
Big Data Types
What’s driving Big Data
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets

- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More of a real-time
Who is it for - Value of Big Data Analytics
• Big data is more real-time in nature than traditional DW applications
• Traditional DW architectures (e.g. Exadata, Teradata) are not well-suited for big data apps
• Shared nothing, massively parallel processing, scale out architectures are well-suited for big data apps
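The idea behind a shared-nothing, scale-out design can be sketched in a few lines: records are partitioned by key so that each node owns its data exclusively, every node aggregates locally, and the partial results are merged at the end. This is a single-process illustration under assumed names (`partition`, `local_sum`, `scale_out_sum`); on a real cluster each partition would sit on a separate machine and be processed in parallel.

```python
from collections import Counter
from zlib import crc32


def partition(records, num_nodes):
    """Assign each (key, value) record to a node by hashing its key."""
    nodes = [[] for _ in range(num_nodes)]
    for key, value in records:
        node_id = crc32(key.encode("utf-8")) % num_nodes  # stable across runs
        nodes[node_id].append((key, value))
    return nodes


def local_sum(node_records):
    """Work a node can do using only the data it owns."""
    totals = Counter()
    for key, value in node_records:
        totals[key] += value
    return totals


def scale_out_sum(records, num_nodes):
    """Aggregate per node, then merge the partial results."""
    partials = [local_sum(part) for part in partition(records, num_nodes)]
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # a key lives on exactly one node, so no conflicts
    return dict(merged)


records = [("alice", 3), ("bob", 5), ("alice", 4), ("carol", 1), ("bob", 2)]
result = scale_out_sum(records, num_nodes=3)
print(result)
```

Adding nodes changes only `num_nodes`; the result stays the same because no node ever needs another node's data.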
Big Data Market
Challenges in Handling Big Data
• The Bottleneck is in technology
▫ New architecture, algorithms, techniques are needed
• Also in technical skills
▫ Experts in using the new technology and dealing with big data
What technologies do we have for Big Data?
How does Big Data work?
Big Data Landscape
Big Data Technology
How to get started
• Learn the platform (how it is designed and works)
▫ How big data are managed in a scalable, efficient way
• Learn writing Hadoop jobs in different languages
▫ Programming Languages: Java, C, Python
▫ High-Level Languages: Apache Pig, Hive
• Learn advanced analytics tools on top of Hadoop
▫ RHadoop: Statistical tools for managing big data
▫ Mahout: Data mining and machine learning tools over big data
• Learn state-of-the-art technology from recent research papers
▫ Optimizations, indexing techniques, and other extensions to Hadoop
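As a first taste of writing a Hadoop job in Python, the sketch below follows the Hadoop Streaming convention: the mapper emits tab-separated `key<TAB>value` lines and the reducer receives those lines sorted by key. Here both are written as plain functions over lists of lines so they can be tried without a cluster. In a real streaming job they would be separate scripts reading standard input, and the local `sorted(...)` call only stands in for the framework's shuffle-and-sort step.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts per word; the input must already be sorted by key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local stand-in for the framework: map, sort by key, then reduce.
text = ["big data is big", "data is everywhere"]
output = list(reducer(sorted(mapper(text))))
print("\n".join(output))
```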
Some popular vendors
When to learn Big Data
• IT companies have a proven historical trend of recruiting engineering graduates
• Those studying ECE, EEE, Civil, Mechanical, and other branches mostly do not get to apply their core degree skills in these jobs
• Pre-graduation is the right time to learn Big Data technologies
Learning Big Data continued…
• If you start now, you can master it within the next 2 years
• Big Data involves not just several tools, but numerous technologies, methodologies, and mathematical and/or statistical concepts
• These need to be thought through, developed, and applied appropriately to reach a certain goal
• Algorithms and computing languages are required to practically turn “Big Data into Applied Intelligence”
Why learn Big Data now?
• A culmination of several technologies
• The sooner, the better
• A flexible mind that starts learning HADOOP and related technologies now can be well positioned for the right job after a few years
• Comparable to 3-year IT diploma courses
Resources & Books
• No specific syllabus
▫ Big Data is a relatively new topic with no fixed syllabus
▫ Evolutionary development, being standardized
• Where to learn
▫ Big Data University
▫ Cloudera CDH VM and many more vendors
• Related books:
▫ Hadoop, The Definitive Guide. Several others.
Resources on the net
• Vast information on Big Data available on the internet
• Tutorials, YouTube videos, articles and vendor white papers
• Most of it is open source and available to everyone
• What one needs is time, interest, energy, and a bit of foresight to work on ambitious projects
Learning curve
• Some Big-Data courses available in the market
▫ Cloudera Certified Administrator for Apache Hadoop
▫ Cloudera Certified Developer for Apache Hadoop
• Several perceptions and perspectives
▫ Several tools exist for Big-Data technology
▫ Students need the right direction to get started
• Who learns what is more important
▫ Clear goals and learning curves for administrators and developers
▫ A combination of the above is the right mix for young minds
Starting with Big Data
• A virtual machine environment is the best way to start
▫ Any supported or popular Linux distribution
▫ Preferred: RHEL, SUSE, CentOS, Ubuntu, or Fedora
▫ Hadoop platform
▫ Single-node and then clustered with High-Availability
• Cloudera Quickstart VM (CDH 4.4)
▫ Cloudera is one of the pioneers in Big Data technologies
▫ CDH or Cloudera Distribution for HADOOP available as a VM
▫ Downloadable from Cloudera website
• Other needed software packages
Introduction to HADOOP
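• The name is not an acronym: Hadoop is named after a toy elephant that belonged to the son of its co-creator, Doug Cutting (“High Availability Distributed Object Oriented Platform” is a popular backronym)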
• Developed by The Apache Software Foundation (http://apache.org)
• Google was founded in the late 1990s; the 2000s brought data management complexities
• In 2004, Google published a paper on MapReduce, a framework that provides a parallel processing model
HADOOP contd..
• Google’s technologies, namely:
1. GFS (Google File System) – A distributed file system
2. MapReduce – A framework for parallel processing
3. BigTable – A data storage system
• Open-source counterparts were developed as Apache Software Foundation projects, based on the designs Google published, and are called:
1. HDFS (Hadoop Distributed File System)
2. MapReduce
3. Apache HBase
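A distributed file system such as HDFS stores a large file as fixed-size blocks and keeps several copies of each block on different machines. The sketch below estimates the footprint of one file. The 128 MB block size and replication factor of 3 are common defaults but are assumptions here (older Hadoop releases default to 64 MB blocks, and both values are configurable); the function name is illustrative.

```python
import math


def hdfs_footprint(file_bytes, block_bytes=128 * 1024**2, replication=3):
    """Return (number of blocks, raw bytes stored across the cluster)."""
    blocks = math.ceil(file_bytes / block_bytes)
    raw_bytes = file_bytes * replication  # the last block only uses the space it needs
    return blocks, raw_bytes


blocks, raw = hdfs_footprint(50 * 1024**3)  # a 50 GiB file
print(f"{blocks} blocks, {raw / 1024**3:.0f} GiB of raw storage")
```

Each block can be processed by a separate MapReduce task, which is how storage layout and parallel processing fit together.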
Real-world scenarios
• Imagine your boss comes to you and says:
“Here are 50 GB of log files. Find a way to improve our business!”
• What would you do?
• Where would you start?
• And what would you do next?
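One reasonable first step is a single pass over the logs to see what they contain before deciding what to optimize. The sketch below counts log levels and the most frequent error messages. The line format (`<timestamp> <LEVEL> <message>`), the regular expression, and the sample lines are assumptions made for this example; real log files will need their own pattern.

```python
import re
from collections import Counter

# Assumed line shape: '<timestamp> <LEVEL> <message>'; real logs will differ.
LINE_RE = re.compile(r"^(\S+)\s+(INFO|WARN|ERROR)\s+(.*)$")


def summarize(lines):
    """Single pass over the log: count levels and distinct error messages."""
    levels, errors, skipped = Counter(), Counter(), 0
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            skipped += 1  # keep track of lines the pattern cannot parse
            continue
        _, level, message = match.groups()
        levels[level] += 1
        if level == "ERROR":
            errors[message] += 1
    return levels, errors, skipped


sample = [
    "2013-10-01T10:00:00 INFO checkout started",
    "2013-10-01T10:00:02 ERROR payment gateway timeout",
    "2013-10-01T10:00:05 ERROR payment gateway timeout",
    "garbled line",
    "2013-10-01T10:01:00 WARN slow response",
]
levels, errors, skipped = summarize(sample)
print(levels, errors.most_common(1), skipped)
```

Passing an open file object instead of a list streams the file line by line, so the same function works on files larger than memory. Once the files outgrow one machine, the same counting logic maps naturally onto a MapReduce job.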
Cloud and Big Data
Most traditional IT skills are moving towards the Cloud and Big Data.
Some related fields:
• Artificial Intelligence
• Distributed computing / super computing
• Business Analytics / Business Intelligence
• Data Analytics / Data Mining
Companies using HADOOP
HADOOP - Business problems types
How does MapReduce help
Hadoop and MapReduce Architecture
A Sample HADOOP Cluster Configuration
RDBMS vs. HADOOP
History of Databases
Object Databases
Relational Dominance
Bigtable, Dynamo and HBase
NoSQL = Not Only SQL
Database ecosystem
Past, Present and Future of IT
• Information technology or IT
• The term “IT” first appeared in 1958
• IT as a catalyst to other areas of science and technology
• A movement from IT driven industry to open information society
• Today we are part of a global village via the internet, which is now a commodity or common consumer service
• Fast innovations and inventions will continue
Big Data development environment
• Use of Virtual Machines
• Java runtime environment
• HADOOP and related software
• Installed on a single node or clustered
• Running Cloudera CDH, IBM BigInsights, etc.
Prerequisites for learning Big Data
• Working knowledge of computers
• Basic knowledge of Linux, C, Java etc.