
Big Data Explained. An introduction to Big Data Science.


Academic year: 2021


(1)

An introduction to Big Data Science.

Big Data – Explained

(2)

Presentation Agenda

• What is Big Data

• Why learn Big Data

• Who is it for

• How to start learning Big Data

• When to learn it

• Objective and Benefits of Big Data

(3)

What is Big Data

Large-Scale Data Management; Data Science and Analytics

Managing very large amounts of data and extracting

(4)

Introduction to Big Data

What is Big Data?

What makes data, “Big” Data?

(5)

Big Data Definition

• No single standard definition…

“Big Data” is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and to extract value and hidden knowledge from it…

Examples: Google, Wikipedia, Amazon, Facebook, eBay, and other corporate enterprises…

(6)

Data explosion

(7)

Data generation

▫ Web data, e-commerce

▫ Purchases at department and grocery stores

▫ Bank/Credit Card transactions

▫ Social Networks

▫ Health care records

▫ Satellite imagery and weather modeling

(8)

Data Approximation

• Google processes 20 PB a day (2008)

• Wayback Machine has 3 PB + 100 TB/month (3/2009)

• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)

• eBay has 6.5 PB of user data + 50 TB/day (5/2009)

• CERN’s Large Hadron Collider (LHC) generates 15 PB a year

(9)

Characteristics of Big Data:

1-Scale (Volume)

Data Volume

44x increase from 2009 to 2020

From 0.8 zettabytes (ZB) to 35 ZB

Data volume is increasing exponentially

Exponential increase in
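As a rough check on the volume figures above, the short sketch below works out the yearly growth rate they imply. It assumes growth at a constant yearly multiplier between 2009 and 2020, which is a simplification and not something the slide states.

```python
import math

# Figures quoted on the slide: 0.8 ZB in 2009, 35 ZB projected for 2020.
START_ZB, END_ZB = 0.8, 35.0
YEARS = 2020 - 2009  # 11 yearly steps


def implied_growth(start, end, years):
    """Return (overall multiple, constant yearly multiplier, doubling time in years)."""
    overall = end / start
    yearly = overall ** (1 / years)          # assumes the same multiplier every year
    doubling = math.log(2) / math.log(yearly)
    return overall, yearly, doubling


overall, yearly, doubling = implied_growth(START_ZB, END_ZB, YEARS)
print(f"overall growth: {overall:.1f}x")
print(f"implied yearly growth: {(yearly - 1) * 100:.0f}%")
print(f"doubling time: {doubling:.1f} years")
```

Under that assumption the data volume grows by roughly 41% a year, which means it doubles about every two years.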

(10)

Characteristics of Big Data:

2-Complexity (Variety)

Various formats, types, and structures

Text, numerical, images, audio, video, sequences, time series, social media data, multi-dim arrays, etc…

Static data vs. streaming data

A single application can be generating/collecting many types of data

To extract knowledge all these types of

(11)

Characteristics of Big Data:

3-Speed (Velocity)

• Data is being generated fast and needs to be processed fast

• Online Data Analytics

• Late decisions → missing opportunities

Examples

E-Promotions: based on your current location, your purchase history, and what you like → send promotions right now for the store next to you

Healthcare monitoring: sensors monitoring your activities and body

(12)

Big Data: 3V’s

(13)

Some Make it 4V’s

(14)

Harnessing Big Data

OLTP: Online Transaction Processing (DBMSs)

OLAP: Online Analytical Processing (Data Warehousing)

RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)

(15)

Who’s Generating Big Data

Social media and networks (all of us are generating data)

Scientific instruments (collecting all sorts of data)

Mobile devices (tracking all objects all the time)

Sensor technology and networks (measuring all kinds of data)

Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely and scalable fashion

(16)

Why learn Big Data - The Model Has Changed

The Model of Generating/Consuming Data has Changed

Old Model: Few companies are generating data, all others are consuming data

New Model: all of us are generating data, and all of us are consuming data

(17)

Big Data Types

(18)

What’s driving Big Data

- Ad-hoc querying and reporting

- Data mining techniques

- Structured data, typical sources

- Small to mid-size datasets

- Optimizations and predictive analytics

- Complex statistical analysis

- All types of data, and many sources

- Very large datasets

- More of a real-time

(19)

Who is it for - Value of Big Data Analytics

Big data is more real-time in nature than traditional DW applications

Traditional DW architectures (e.g. Exadata, Teradata) are not well suited for big data apps

Shared-nothing, massively parallel processing, scale-out architectures are well suited for big data apps
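To make the shared-nothing idea above concrete, here is a minimal sketch of hash partitioning: each record is assigned to exactly one node by hashing its key, so every node can aggregate its own shard without talking to the others. The store names, amounts, and function names (`node_for`, `partition`, `local_totals`) are hypothetical, and real systems add replication and rebalancing that this sketch leaves out.

```python
import zlib
from collections import defaultdict


def node_for(key, num_nodes):
    """Pick the node that owns a key, using a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def partition(records, num_nodes):
    """Group (key, value) records into one shard per owning node."""
    shards = defaultdict(list)
    for key, value in records:
        shards[node_for(key, num_nodes)].append((key, value))
    return dict(shards)


def local_totals(shard):
    """Work done independently on one node: sum the values per key."""
    totals = defaultdict(float)
    for key, value in shard:
        totals[key] += value
    return dict(totals)


# Hypothetical purchase records: (store, amount).
records = [("store-a", 10.0), ("store-b", 5.0), ("store-a", 2.5), ("store-c", 7.0)]
shards = partition(records, num_nodes=3)

# A key lives on exactly one node, so merging the per-node results is a plain union.
result = {}
for shard in shards.values():
    result.update(local_totals(shard))
print(result)
```

Adding nodes (scaling out) only changes `num_nodes`. The per-node work stays the same, which is why this style of architecture suits very large datasets.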

(20)

Big Data Market

(21)

Challenges in Handling Big Data

The Bottleneck is in technology

New architecture, algorithms, techniques are needed

Also in technical skills

Experts in using the new technology and dealing with big data

(22)

What technologies do we have for Big Data?

How does Big Data work?

(23)

Big Data Landscape

(24)

Big Data Technology

(25)

How to get started

Learn the platform (how it is designed and works)

How big data are managed in a scalable, efficient way

Learn to write Hadoop jobs in different languages

Programming Languages: Java, C, Python

High-Level Languages: Apache Pig, Hive

Learn advanced analytics tools on top of Hadoop

RHadoop: Statistical tools for managing big data

Mahout: Data mining and machine learning tools over big data

Learn state-of-the-art technology from recent research papers

Optimizations, indexing techniques, and other extensions to Hadoop
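As an illustration of the "Hadoop jobs in Python" item above, the sketch below follows the Hadoop Streaming convention: the mapper emits tab-separated key/value lines and the reducer receives them sorted by key. The word-count task and the `run_locally` helper are assumptions chosen for illustration. Under Hadoop Streaming itself, the mapper and reducer would each be a separate script reading standard input and writing standard output.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Dry run on one machine: map, sort (standing in for the shuffle), reduce."""
    return list(reducer(sorted(mapper(lines))))


for out in run_locally(["Big data big", "data big"]):
    print(out)
```

Testing the two functions locally like this, on a small sample, is a cheap way to find mistakes before submitting a job to a cluster.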

(26)

Some popular vendors

(27)

When to learn Big Data

• IT companies have a long-standing trend of recruiting engineering graduates

• Graduates from ECE, EEE, Civil, Mechanical, and other branches are mostly unable to apply the core skills they learnt during their degree in such jobs

• The period before graduation is the right time to learn Big Data technologies

(28)

Learning Big Data continued…

• If you start now, you can master it within the next 2 years

• Big Data involves not just several tools, but numerous technologies, methodologies, and mathematical and/or statistical concepts

• These need to be thought through, developed, and applied appropriately to reach a given goal

• Algorithms and computing languages are required to practically turn “Big Data into Applied Intelligence”

(29)

Why learn Big Data now?

• A culmination of several technologies

• The sooner, the better

• If a flexible mind starts learning HADOOP and related tools now, it can be well positioned for the right job after a few years

• Comparable to a 3-year IT diploma course

(30)

Resources & Books

No specific syllabus

Big Data is a relatively new topic with no fixed syllabus

Evolutionary development, being standardized

Where to learn

Big Data University

Cloudera CDH VM and many more vendors

Related books:

Hadoop, The Definitive Guide. Several others.

(31)

Resources on the net

• Vast information on Big Data available on the internet

• Tutorials, YouTube videos, articles and vendor white papers

• Most of it is open source and for everyone

• What one needs is time, interest, energy, and a bit of foresight to work on ambitious projects

(32)

Learning curve

Some Big-Data courses available in the market

Cloudera Certified Administrator for Apache Hadoop

Cloudera Certified Developer for Apache Hadoop

Several perceptions and perspectives

Several tools exist for Big-Data technology

Students need the right direction to get started

Who learns what is more important

Clear goals and learning curves for administrators and developers

A combination of the above is the right mix for young minds

(33)

Starting with Big Data

Virtual machine environment is best suited to start

Any supported or popular Linux distribution

Preferred: RHEL, SUSE, CentOS, Ubuntu, or Fedora

Hadoop platform

Single-node and then clustered with High-Availability

Cloudera Quickstart VM (CDH 4.4)

Cloudera is one of the pioneers in Big Data technologies

CDH or Cloudera Distribution for HADOOP available as a VM

Downloadable from Cloudera website

Other needed software packages

(34)

Introduction to HADOOP

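• HADOOP is often expanded as "High Availability Distributed Object Oriented Platform", but this is a backronym: the project was named after a toy elephant belonging to the son of its co-creator, Doug Cutting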

Developed by The Apache Software Foundation (http://apache.org)

Google was founded in the late 1990s; the 2000s brought data management complexities

In 2004, Google published a paper on MapReduce, a framework that provides a parallel processing model

(35)

HADOOP contd..

Google’s technologies, namely:

1. GFS (Google File System) – A distributed file system

2. MapReduce – A framework for parallel processing

3. BigTable – A data storage system

The Apache Software Foundation hosts open-source counterparts modeled on these published designs:

1. HDFS (Hadoop Distributed File System)

2. MapReduce

3. Apache HBase
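The MapReduce processing model named above can be sketched on a single machine in a few lines. This is only an illustration of the three phases (map, shuffle, reduce), not Hadoop's implementation. The `map_reduce` helper and the year/temperature readings are hypothetical.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Run the three MapReduce phases in memory on one machine."""
    # Map: each input record produces zero or more (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(map_fn(record))

    # Shuffle: bring together all values that share a key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce: collapse each key's values into one result.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def map_reading(line):
    """Turn a 'year,temperature' line into a (year, temperature) pair."""
    year, temp = line.split(",")
    return [(year, int(temp))]


def max_temp(year, temps):
    """Keep the highest temperature seen for a year."""
    return max(temps)


# Hypothetical input lines in the form "year,temperature".
readings = ["1950,22", "1950,31", "1951,28", "1951,19"]
result = map_reduce(readings, map_reading, max_temp)
print(result)
```

In a real cluster the map and reduce calls run in parallel on many machines and the shuffle moves data over the network. The programmer still writes only the two small functions.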

(36)

Real-world scenarios

• Imagine your boss comes to you and says:

“Here are 50 GB of log files. Find a way to improve our business!”

• What would you do?

• Where would you start?

• And what would you do next?
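One possible first step for the log-file scenario above is to stream through the data once and compute a simple summary, so the full 50 GB never has to fit in memory. The sketch assumes a hypothetical log format of three whitespace-separated fields (timestamp, request path, HTTP status code). Real log files would need their own parsing.

```python
from collections import Counter


def error_rate_by_path(lines):
    """Return the share of 5xx responses per request path.

    Accepts any iterable of lines, such as an open file, and reads it once.
    """
    total, errors = Counter(), Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed format
        _, path, status = parts
        total[path] += 1
        if status.startswith("5"):
            errors[path] += 1
    return {path: errors[path] / total[path] for path in total}


# Hypothetical sample lines; a real run would pass an open file instead.
sample = [
    "2013-10-01T12:00:03 /checkout 500",
    "2013-10-01T12:00:04 /checkout 200",
    "2013-10-01T12:00:05 /home 200",
    "garbage",
]
print(error_rate_by_path(sample))
```

A summary like this points to where a closer look is worthwhile, for example a checkout page that fails often. When one machine becomes too slow, the same counting logic maps naturally onto a MapReduce job.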

(37)

Cloud and Big Data

Most traditional IT skills are shifting towards the Cloud and Big Data.

Some related fields:

• Artificial Intelligence

• Distributed computing / super computing

• Business Analytics / Business Intelligence

• Data Analytics / Data Mining

(38)

Companies using HADOOP

(39)

HADOOP - Business problems types

(40)

How does MapReduce help

(41)

Hadoop and MapReduce Architecture

(42)

A Sample HADOOP Cluster Configuration

(43)

RDBMS vs. HADOOP

(44)

History of Databases

(45)

Object Databases

(46)

Relational Dominance

(47)

Bigtable, Dynamo and HBase

(48)

NoSQL = Not Only SQL

(49)

Database ecosystem

(50)

Past, Present and Future of IT

• Information technology or IT

• The term “IT” first appeared in 1958

• IT as a catalyst to other areas of science and technology

• A movement from IT driven industry to open information society

• Today we are part of a global village via the internet, which is now a commodity or a common consumer service

• Fast Innovations and Inventions to continue

(51)

Big Data development environment

• Use of Virtual Machines

• Java runtime environment

• HADOOP and related software

• Installed on a single node or clustered

• Running Cloudera CDH, IBM BigInsight etc.

Prerequisites for learning Big Data

• Working knowledge of computers

• Basic knowledge of Linux, C, Java etc.

(52)

Thank You
