TRAINING PROGRAM ON BIGDATA/HADOOP


Course: Training on Bigdata/Hadoop with Hands-on

Course Duration / Dates / Time: 4 Days / 24th - 27th June 2015 / 9:30 - 17:30 Hrs

Venue: Eagle Photonics Pvt Ltd

First Floor, Plot No 31, Sector 19C, Vashi, Navi Mumbai - 400705

Ph: 022 27841425

Fee Details:

For Indian participants: 20,000 INR

For Foreign participants: 350 USD

Course Description:

Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools or processing applications. On the other hand, the Apache Hadoop software library is a framework that allows for the distributed processing of large data sets using simple programming models. This hands-on course teaches participants how to manage Big Data using Hadoop.

Who should attend?

This course is meant for software developers/programmers who are interested in Bigdata/Hadoop.

Key benefits:

On course completion, participants will be knowledgeable about managing Big Data and comfortable working with the Hadoop Distributed File System and its components.

Course Outline:

Module 1: Introduction to Big Data

Session 1: Introduction to Big Data

• So What Is Big Data?

• History of Data Management: Evolution of Big Data

• Structuring of Big Data

• Types of Big Data

• Elements of Big Data

• Application of Big Data in the Business Context

• Careers in Big Data

Session 2: Business Application of Big Data

• Significance of Social Network Data

• Uses of Social Network Data Analysis

• Financial Fraud and Big Data

• Preventing Fraud Using Big Data Analytics

• Use of Big Data in the Retail Industry

Session 3: Technologies for handling Big Data

• Distributed and Parallel Computing for Big Data

• Virtualization and its Importance to Big Data

• Introducing Hadoop

• Cloud Computing and Big Data

• Features of Cloud Computing

• Providers in Big Data Cloud Market

• Issues in Using Cloud Services

• In-Memory Technology for Big Data

Session 4: Understanding the Hadoop Ecosystem

• The Hadoop Ecosystem

• Processing Data with Hadoop MapReduce

• Managing Resources and Applications with Hadoop YARN

• Storing Big Data with HBase

• Using Hive for Querying Big Databases

• Interacting with Hadoop Ecosystem

Session 5: MapReduce Fundamentals

• Origins of MapReduce

• Characteristics of MapReduce

• How MapReduce Works

• More about Map and Reduce Functions

• Optimization Techniques for MapReduce Jobs

• Hardware/Network Topology

• Applications of MapReduce

• Role of HBase in Processing Big Data

• Mining Big Data with Hive
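As a rough illustration of the "How MapReduce Works" topic above, the map, shuffle and reduce steps can be sketched in plain Python. This is a single-process teaching sketch, not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are hypothetical and do not correspond to any Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key, as a framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse the list of counts for one word into a total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical sample input, chosen only for illustration.
    sample = ["big data needs hadoop", "hadoop processes big data"]
    print(word_count(sample))
```

In a real cluster the mappers and reducers would run on different nodes and the framework would perform the grouping step between them. Here everything runs in one process so that the data flow is easy to follow.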

Module 2: Managing an Enterprise Wide Big Data Ecosystem

Session 1: Big Data Technology Foundations

• Exploring the Big Data Stack

• Virtualization and Big Data

• Processor and Memory Virtualization

• Data and Storage Virtualization

• Managing Virtualization with Hypervisor

• Abstraction and Virtualization

• Implementing Virtualization to Work with Big Data


Session 2: Big Data Management Systems - Databases and Warehouses

• RDBMSs and Big Data Environment

• PostgreSQL Relational Database

• Nonrelational Databases

• Key-Value Pair Databases

• Document Databases

• Columnar Databases

• Graph Databases

• Spatial Databases

• Polyglot Persistence

• Integrating Big Data with Traditional Data Warehouse

• Rethinking Extraction, Transformation, and Loading

• Big Data Analysis and Data Warehouse

• Changing Deployment Models in Big Data Era

Session 3: Analytics and Big Data

• Using Big Data to Get Results

• What Constitutes Big Data

• Exploring Unstructured Data

• Understanding Text Analytics

• Building New Models and Approaches to Support Big Data

Session 4: Integrating Data, Real-Time Data and Implementing Big Data

• Stages in Big Data Analysis

• Fundamentals of Big Data Integration

• Streaming Data and Complex Event Processing

• Making Big Data a Part of Your Operational Process

• Ensuring Validity, Veracity, and Volatility of Big Data

• Data Validity and Veracity

• Data Volatility

Session 5: Big Data Solutions and Data in Motion

• Big Data as a Business Strategy Tool

• Analysis in Real-Time: Adding New Dimensions to the Cycle

• The Needs for Data in Motion

• Case 1: Using Streaming Data for Environmental Impact

• Case 2: Using Streaming Data for Public Policy

• Case 3: Use of Streaming Data in Health Care Industry

• Case 4: Use of Streaming Data in Energy Industry

• Case 5: Improving Customer Experience with Real-Time Text Analytics

• Case 6: Using Real-Time Data in the Finance Industry

• Case 7: Using Real-Time Data for Insurance Fraud Prevention

Module 3: Storing and Processing Data - HDFS and MapReduce

Session 1: Storing Data in Hadoop

• HDFS, HBase

• Combining HDFS and HBase for Effective Data Storage

• Choosing an Appropriate Hadoop Data Organization for Your Applications

Session 2: Processing Your Data with MapReduce

• Getting to Know MapReduce

• Your First MapReduce Application

• Designing MapReduce Implementations

Session 3: Customizing MapReduce Execution

• Controlling MapReduce Execution with Input Format

• Reading Data Your Way with Custom Record Reader

• Organizing Output Data with Custom Output Formats

• Optimizing Your MapReduce Execution with a Combiner

• Controlling Reducer Execution with Partitioners

Session 4: Testing and Debugging MapReduce Applications

• Unit Testing MapReduce Applications

• Local Application Testing with Eclipse

• Using Logging for Hadoop Testing

• Reporting Metrics with Job Counters

• Defensive Programming in MapReduce

Session 5: Implementing MapReduce Wordcount Program - A Case Study

Module 4: Increasing Efficiency with Hadoop Tools: Hive and Pig

Session 1: Exploring Hive

• Introducing Hive

• Starting Hive

• Executing Hive Queries from Files

• Data Types

• Hive Built-In Functions

• Compressed Data Storage

• Data Manipulation in Hive

Session 2: Advanced Querying with Hive

• Queries

• Manipulating Column Values Using Functions

• JOINS in Hive

• Hive Best Practices

• Performance-Tuning and Query Optimizations

• Various Execution Types

• Hive File and Record Formats

• Hive Thrift Service

• Security in Hive

Session 3: Analyzing Data with Pig

• Introduction to Pig

• Installing Pig

• Properties of Pig

• Running Pig

• Pig Latin Application Flow

• Beginning with Pig Latin

• Relational Operators in Pig

Module 5: Additional Hadoop Tools: Sqoop, Flume, YARN and Storm

Session 1: Efficiently Transferring Bulk Data Using Sqoop

• Introducing Sqoop

• Using Sqoop 1

• Importing Data with Sqoop

• Controlling Parallelism

• Encoding NULL Values

• Importing Data into Hive Tables

• Importing Data into HBase

• Exporting Data

• Exporting Data into Subset of Columns

• Drivers and Connectors in Sqoop

• Sqoop Architecture Overview

• Sqoop 2

Session 2: Flume

• Introducing Flume

• The Flume Architecture

• Setting Up Flume

• Building Flume

Session 3: Beyond MapReduce – YARN

• Why YARN?


• The YARN Ecosystem

• A YARN API Example

• Mesos versus YARN

Session 4: Storm on YARN

• Storm and Hadoop

• Overview of Storm

• The Storm API

• Storm on YARN

• Installing Storm on YARN

• An Example of Storm on YARN

Module 6: Leveraging NoSQL, Hadoop Security, on Cloud and Real Time

Session 1: Hello NoSQL

• Two Simple Examples

• Storing and Accessing Data

• Storing and Accessing Data in MongoDB

• Storing and Accessing Data in HBase

• Storing and Accessing Data in Apache Cassandra

• Language Bindings for NoSQL Data Stores

Session 2: Working with NoSQL

• Creating Records

• Accessing Data

• Updating and Deleting Data

• MongoDB Query Language Capabilities

• Accessing Data from Column-Oriented Databases Like HBase

Session 3: Hadoop Security

• Hadoop Security Challenges

• Authentication

• Delegated Security Credentials

• Authorization

Session 4: Running Hadoop Applications on AWS

• Getting to Know AWS

• Options for Running Hadoop on AWS

• Understanding the EMR–Hadoop Relationship

• Using AWS S3

• Automating EMR Job Flow Creation and Job Execution

• Orchestrating Job Execution in EMR

Session 5: Real-Time Hadoop

• Real-Time Hadoop Applications

• Using Specialized Real-Time Hadoop Query Systems

• Using Hadoop-Based Event-Processing Systems

Trainer Profile

Mr Biswajyoti Kar holds an A.M.I.E. from the Institution of Engineers (India), Gokhale Road, Calcutta, and a B.Sc. in Physics from the University of Calcutta. He is a Senior Architect with over 19 years of rich experience and a proven record in architecture, designing and implementing systems software. He has experience of Big Data Analytics, UNIX/Linux kernel mode development, and data structures and algorithm development in C. His area of interest is building solutions around Big Data and Analytics and IP creation in the Big Data space.

Training Experience

• Big Data, Hadoop Distributed file systems in Dell

• Algorithm and Data Structures in C/C++, UNIX/Linux advanced programming, shell scripting in Dell

• Algorithm and Data Structures in C, Proton solutions

Project Experiences

1. Big Data Work

Leading a project that involved setting up a Hadoop distributed file system (HDFS) on a Linux box to test the elasticity part of cloud computing.

Benchmarking the Hadoop system for crunching terabytes of data using the macro-programming model called Pig Latin.

Statistical analysis was done using the R language.

2. Parallel network file system

Leading a project that involved setting up a pNFS client-server file and block layout configuration.

Figuring out pros and cons of each configuration in HPC and NAS environments.

3. Big Data Analytics

Providing consulting in the area of Big Data Analysis to a credit rating agency.