HADOOP. Revised 10/19/2015


Table of Contents

Hortonworks HDP Developer: Java
Hortonworks HDP Developer: Apache Pig and Hive
Hortonworks HDP Developer: Windows
Hortonworks HDP Operations: Hadoop Administration 1
Hortonworks HDP Data Science
Hortonworks HDP Developer: Custom YARN Applications
Hortonworks HDP Operations: Migrating to the Hortonworks Data Platform
Hortonworks HDP Analyst: Apache HBase Essentials
Hortonworks HDP Operations: Apache HBase Management
Hortonworks HDP Developer: Storm and Trident Fundamentals Workshop

Hortonworks HDP Developer: Java

Course Length: 4 Days
TE7411_20140825

Course Description:

This advanced four-day course gives Java programmers a deep dive into Hadoop 2.0 application development. Students will learn how to design and develop efficient and effective MapReduce applications for Hadoop 2.0 using the Hortonworks Data Platform. Students who attend this course will learn how to harness the power of Hadoop 2.0 to manipulate, analyze and perform computations on their Big Data.

Who Should Attend:

This class is for experienced Java software engineers who need to design and develop Java MapReduce applications for Hadoop 2.0.

Prerequisites:

This course assumes students have experience developing Java applications and using a Java IDE. Labs are completed using the Eclipse IDE and Maven. No prior Hadoop knowledge is required.

Benefits of Attendance:

Upon completion of this course, students will be able to:

• Explain Hadoop 2.0 and the Hadoop Distributed File System
• Explain the new YARN framework in Hadoop 2.0
• Develop a Java MapReduce application
• Run a MapReduce application on YARN
• Use combiners and in-map aggregation to improve the performance of a MapReduce job
• Write a custom partitioner to avoid data skew on reducers
• Perform a secondary sort by writing custom key and group comparator classes
• Recognize use cases for the various built-in input and output formats
• Write a custom input and output format for a MapReduce job
• Optimize a MapReduce job by following best practices
• Configure various aspects of a MapReduce job to optimize mappers and reducers
• Develop a custom RawComparator class
• Use the Distributed Cache
• Explain the various join techniques in Hadoop
• Perform a map-side join
• Use a Bloom filter to join two large datasets
• Perform unit tests using the MRUnit API
• Explain the basic architecture of HBase
• Write an HBase MapReduce application
• Explain use cases for Pig and Hive
• Write a simple Pig script to explore and transform big data
• Write a Pig UDF (User-Defined Function) in Java
• Execute a Hive query
• Write a Hive UDF in Java
• Use the JobControl class to create a workflow of MapReduce jobs

Course Outline:

Day 1
• Understanding Hadoop and HDFS
• Writing MapReduce Applications
• Map Aggregation

Day 2
• Partitioning and Sorting
• Input and Output Formats
• Optimizing MapReduce Jobs

Day 3
• Advanced MapReduce Features
• Unit Testing
• HBase Programming

Day 4
• Pig Programming
• Hive Programming
• Defining Workflow

Lab Content
• Configuring a Hadoop 2.0 Development Environment
• Putting data into HDFS using Java
• Write a distributed grep MapReduce application
• Write an inverted index MapReduce application
• Configure and use a combiner
• Writing a custom combiner
• Writing a custom partitioner
• Globally sort output using the TotalOrderPartitioner
• Writing a MapReduce job whose data is sorted using a composite key
• Writing a custom InputFormat class
• Writing a custom OutputFormat class
• Compute a simple moving average of historical stock price data
• Use data compression
• Define a RawComparator
• Perform a map-side join
• Using a Bloom filter
• Unit testing a MapReduce job
• Import data into HBase
• Writing an HBase MapReduce job
• Writing a User-Defined Pig Function
• Writing a User-Defined Hive Function
• Defining an Oozie workflow


Hortonworks HDP Developer: Apache Pig and Hive

TE7414_20150603

Course Description:

This 4-day hands-on training course teaches students how to develop applications and analyze Big Data stored in Apache Hadoop 2.0 using Pig and Hive. Students will learn the details of Hadoop 2.0, YARN, and the Hadoop Distributed File System (HDFS), get an overview of MapReduce, and take a deep dive into using Pig and Hive to perform data analytics on Big Data. Other topics covered include data ingestion using Sqoop and Flume, and defining workflow using Oozie. Labs are run in a Linux environment.

Who Should Attend:

This class is for data analysts, BI analysts, BI developers, SAS developers and other types of analysts who need to answer questions and analyze Big Data stored in a Hadoop cluster.

Prerequisites:

Students should be familiar with programming principles and have experience in software development. SQL experience is strongly recommended. Java knowledge is helpful. No prior Hadoop knowledge is required.

Benefits of Attendance:

• Explain Hadoop 2.0 and YARN
• Explain use cases for Hadoop
• Explain how HDFS Federation works in Hadoop 2.0
• Explain the various tools and frameworks in the Hadoop 2.0 ecosystem
• Explain the architecture of the Hadoop Distributed File System (HDFS)
• Use the Hadoop client to input data into HDFS
• Use Sqoop to transfer data between Hadoop and a relational database
• Explain the architecture of MapReduce
• Explain the architecture of YARN
• Run a MapReduce job on YARN
• Write a Pig script to explore and transform data in HDFS
• Define advanced Pig relations
• Use Pig to apply structure to unstructured Big Data
• Invoke a Pig User-Defined Function
• Use Pig to organize and analyze Big Data
• Understand how Hive tables are defined and implemented
• Use the new Hive windowing functions
• Explain and use the various Hive file formats
• Create and populate a Hive table that uses the new ORC file format
• Use Hive to run SQL-like queries to perform data analysis
• Use Hive to join datasets using a variety of techniques, including Map-side joins and Sort-Merge-Bucket joins
• Write efficient Hive queries
• Create ngrams and context ngrams using Hive
• Perform data analytics like quantiles and page rank on Big Data using the DataFu Pig library

Course Outline:

Day 1
• Understanding Hadoop 2.0
• The Hadoop Distributed File System (HDFS)
• Inputting Data into HDFS
• The MapReduce Framework and YARN

Day 2
• Introduction to Pig
• Advanced Pig Programming

Day 3
• Hive Programming
• Using HCatalog
• Advanced Hive Programming

Day 4
• Advanced Hive Programming (cont.)
• Data Analysis and Statistics
• Defining Workflow with Oozie

Lab Content
• Use HDFS commands to add/remove files and folders from HDFS
• Use Sqoop to transfer data between HDFS and an RDBMS
• Run a MapReduce job
• Run a YARN application
• Explore and transform data using Pig
• Split a dataset using Pig
• Join two datasets using Pig
• Use Pig to transform and export a dataset for use with Hive
• Use HCatLoader and HCatStorer to retrieve HCatalog schemas from within a Pig script
• Understand how a Hive table is stored in HDFS
• Use Hive to discover useful information in a dataset
• Understand how Hive queries get executed as MapReduce jobs
• Perform a join of two datasets with Hive
• Use advanced Hive features like windowing, views and ORC files
• Use the Hive analytics functions (rank, dense_rank, cume_dist, row_number)
• Write a custom reducer in Python that reduces the number of underlying MapReduce jobs generated from a Hive query
• Analyze and sessionize clickstream data using the Pig DataFu library
• Compute quantiles of NYSE stock prices
• Use Hive to compute ngrams on Avro-formatted files
• Define an Oozie workflow


Hortonworks HDP Developer: Windows

TE7410_20140825

Course Description:

This 4-day hands-on training course teaches students how to develop applications and analyze Big Data stored in Apache Hadoop on Windows using Pig and Hive. Students will learn the details of Hadoop 2.x, YARN, and the Hadoop Distributed File System (HDFS), get an overview of MapReduce, and take a deep dive into using Pig and Hive to perform data analytics on Big Data. Other topics covered include using Sqoop to transfer data between Hadoop and Microsoft SQL Server, and connecting Microsoft Excel to Hadoop using the Hive ODBC Driver.

Who Should Attend:

This course is for software developers who need to understand and develop applications for Hadoop 2.x on Windows.

Prerequisites:

Students should be familiar with programming principles and have experience in software development. SQL knowledge and familiarity with Microsoft Windows are also helpful. No prior Hadoop knowledge is required.

Benefits of Attendance:

• Explain Hadoop and YARN
• Explain use cases for Hadoop
• Explain the various tools and frameworks in the Hadoop 2.x ecosystem
• Explain the components of the Hortonworks Data Platform on Windows
• Explain the deployment options for HDP on Windows
• Explain the architecture of the Hadoop Distributed File System (HDFS)
• Use the Hadoop client to input data into HDFS
• Use Sqoop to transfer data between Hadoop and Microsoft SQL Server
• Explain the architecture of MapReduce
• Run a MapReduce job on YARN
• Write a Pig script to explore and transform data in HDFS
• Define advanced Pig relations
• Use Pig to apply structure to unstructured Big Data
• Invoke a Pig User-Defined Function
• Use Pig to organize and analyze Big Data
• Understand how Hive tables are defined and implemented
• Use the new Hive windowing functions
• Explain and use the various Hive file formats
• Create and populate a Hive table that uses the new ORC file format
• Use Hive to run SQL-like queries to perform data analysis
• Use Hive to join datasets using a variety of techniques, including Map-side joins and Sort-Merge-Bucket joins
• Write efficient Hive queries
• Create ngrams and context ngrams using Hive

Course Outline:

Day 1
• Understanding Hadoop
• The Hadoop Distributed File System (HDFS)
• Inputting Data into HDFS
• The MapReduce Framework

Day 2
• Introduction to Pig
• Advanced Pig Programming

Day 3
• Hive Programming
• Using HCatalog
• Advanced Hive Programming

Day 4
• The Hive ODBC Driver
• Hadoop 2 and YARN
• Appendix A: Defining Workflow with Oozie

Hands-On Labs: Students will work through the following lab exercises using the Hortonworks Data Platform 2.1 on Windows.
• Start HDP on Windows
• Use HDFS commands to add/remove files and folders from HDFS
• Use Sqoop to transfer data between HDFS and Microsoft SQL Server
• Run a MapReduce job
• Explore and transform data using Pig
• Split a dataset using Pig
• Join two datasets using Pig
• Use Pig to transform and export a dataset for use with Hive
• Use HCatLoader and HCatStorer to retrieve HCatalog schemas from within a Pig script
• Understand how a Hive table is stored in HDFS
• Use Hive to discover useful information in a dataset
• Understand how Hive queries get executed as MapReduce jobs
• Perform a join of two datasets with Hive
• Use advanced Hive features like windowing, views and ORC files
• Use the Hive analytics functions (rank, dense_rank, cume_dist, row_number)
• Analyze and sessionize clickstream data using the Pig DataFu library
• Compute quantiles of NYSE stock prices
• Use Hive to compute ngrams on Avro-formatted files
• Connect Microsoft Excel to Hadoop using the Hive ODBC Driver
• Run a YARN application
• Define an Oozie workflow


Hortonworks HDP Operations: Hadoop Administration 1

TE7408_20151014

Course Description:

This course is designed for administrators who will be managing the Hortonworks Data Platform (HDP) 2.3 with Ambari. It covers installation, configuration, and other typical cluster maintenance tasks.

Who Should Attend:

This course is designed for IT administrators and operators responsible for installing, configuring and supporting an Apache Hadoop 2.3 deployment in a Linux environment.

Prerequisites:

Attendees should be familiar with Hadoop and Linux environments.

Benefits of Attendance:

• Summarize an enterprise environment including Big Data, Hadoop and the Hortonworks Data Platform (HDP)
• Install HDP
• Manage Ambari Users and Groups
• Manage Hadoop Services
• Use HDFS Storage
• Manage HDFS Storage
• Configure HDFS Storage
• Configure HDFS Transparent Data Encryption
• Configure the YARN Resource Manager
• Submit YARN Jobs
• Configure the YARN Capacity Scheduler
• Add and Remove Cluster Nodes
• Configure HDFS and YARN Rack Awareness
• Configure HDFS and YARN High Availability
• Monitor a Cluster
• Protect a Cluster with Backups

Course Outline:

Lab Content: Students will work through the following lab exercises using the Hortonworks Data Platform 2.2.
• Introduction to the Lab Environment
• Performing an Interactive Ambari HDP Cluster Installation
• Configuring Ambari Users and Groups
• Managing Hadoop Services
• Using HDFS Files and Directories
• Using WebHDFS
• Configuring HDFS ACLs
• Managing HDFS
• Managing HDFS Quotas
• Configuring HDFS Transparent Data Encryption
• Configuring and Managing YARN
• Non-Ambari YARN Management
• Configuring YARN Failure Sensitivity, Work Preserving Restarts, and Log Aggregation Settings
• Submitting YARN Jobs
• Configuring Different Workload Types
• Configuring User and Groups for YARN Labs
• Configuring YARN Resource Behavior and Queues
• User, Group and Fine-Tuned Resource Management
• Adding Worker Nodes
• Configuring Rack Awareness
• Configuring HDFS High Availability
• Configuring YARN High Availability
• Configuring and Managing Ambari Alerts
• Configuring and Managing HDFS Snapshots
• Using Distributed Copy (DistCP)


Hortonworks HDP Data Science

TE7412_20140825

Course Description:

Data Science for the Hortonworks Data Platform covers data science principles and techniques through lecture and hands-on experience. During this three-day course, students will learn the processes and practice of data science, including machine learning and natural language processing. Students will also learn the tools and programming languages used by data scientists, including Python, IPython, Mahout, Pig, NumPy, pandas, SciPy, scikit-learn, the Natural Language Toolkit (NLTK), and Spark MLlib.

Who Should Attend:

This class is for architects, software developers, analysts and data scientists who need to understand how to apply data science and machine learning on Hadoop.

Prerequisites:

Students must have experience with at least one programming or scripting language, knowledge of statistics and/or mathematics, and a basic understanding of big data and Hadoop principles.

Benefits of Attendance:

• Recognize use cases for data science
• Describe the architecture of Hadoop and YARN
• Explain the differences between supervised and unsupervised learning
• List the six machine learning tasks
• Recognize use cases for clustering, outlier detection, affinity analysis, classification, regression, and recommendation
• Use Mahout to run a machine learning algorithm on Hadoop
• Write Pig scripts to transform data on Hadoop
• Use Pig to prepare data for a machine learning algorithm
• Write a Python script
• Use NumPy to analyze big data
• Use the data structure classes in the pandas library
• Write a Python script that invokes a SciPy machine learning algorithm
• Explain the options for running Python code on a Hadoop cluster
• Write a Pig User-Defined Function in Python
• Use Pig streaming on Hadoop with a Python script
• Write a Python script that invokes a scikit-learn machine learning algorithm
• Use the k-nearest neighbor algorithm to predict values based on a training data set
• Run a machine learning algorithm on a distributed data set on Hadoop
• Describe use cases for Natural Language Processing (NLP)
• Perform sentence segmentation on a large body of text
• Perform part-of-speech tagging
• Use the Natural Language Toolkit (NLTK) to implement NLP tasks and machine learning algorithms
• Explain the components of a Spark application

Course Outline:

Day 1
• Using Hadoop for Data Science
• Hadoop Architecture
• Machine Learning
• Introduction to Pig

Day 2
• Python Programming
• Analyzing Data with Python
• Running Python on Hadoop

Day 3
• Implementing Machine Learning
• Natural Language Processing
• Spark MLlib

Hands-On Labs: Students will complete the following hands-on labs using their own 7-node Hadoop cluster (HDP 2.1) and IPython Notebook.
• Setting Up a Development Environment
• Using HDFS Commands
• Using Mahout for Machine Learning
• Getting Started with Pig
• Exploring Data with Pig
• Using the IPython Notebook
• Data Analysis with Python
• Interpolating Data Points
• Define a Pig UDF in Python
• Streaming Python with Pig
• K-Nearest Neighbor
• K-Means Clustering
• Using NLTK for Natural Language Processing
• Classifying Text using Naive Bayes
• Spark Programming
• Running Data Science Algorithms using Spark MLlib


Hortonworks HDP Developer: Custom YARN Applications

TE7415_20140725

Course Description:

This 2-day hands-on training course teaches students how to develop custom YARN applications for Apache Hadoop. Students will learn the details of the YARN architecture, the steps involved in writing a YARN application, the details of writing a YARN client and ApplicationMaster, and how to launch Containers. Applications are developed using Eclipse and Gradle connected remotely to a 7-node HDP 2.1 cluster running in a virtual machine that the students can keep for use after the training.

Who Should Attend:

This course is intended for software engineers familiar with Java who need to develop YARN applications on Hadoop 2.x by writing custom YARN clients and ApplicationMasters in Java.

Prerequisites:

Students must have attended the Developing Applications with the Hortonworks Data Platform using Java course, or attended the Data Analysis with the Hortonworks Data Platform using Pig and Hive course, or possess similar Hadoop development knowledge and understand HDFS and the MapReduce framework.

Benefits of Attendance:

• Explain the lifecycle of a YARN application
• Write a YARN client application
• Run a YARN application on a Hadoop 2.x cluster
• Monitor the status of a running YARN application
• View the aggregated logs of a YARN application
• Configure a ContainerLaunchContext
• Define a LocalResource for sharing application files across the cluster
• Write a YARN ApplicationMaster
• Explain the differences between synchronous and asynchronous ApplicationMasters
• Allocate Containers in a cluster
• Launch Containers on NodeManagers
• Write a custom Container to perform specific business logic
• Explain the job schedulers of the ResourceManager
• Define queues for the Capacity Scheduler

Course Outline:

Day 1
• Unit 1: The YARN Architecture
• Unit 2: Overview of a YARN Application
• Unit 3: Writing a YARN Client

Day 2
• Unit 4: Writing a YARN ApplicationMaster
• Unit 5: Containers
• Unit 6: Job Scheduling

Lab Content: Students will work through the following lab exercises using the Hortonworks Data Platform 2.1.
• Running a YARN Application
• Set Up a YARN Development Environment
• Writing a YARN Client
• Submitting an ApplicationMaster
• Writing an ApplicationMaster
• Requesting Containers
• Running Containers
• Writing Custom Containers


Hortonworks HDP Operations: Migrating to the Hortonworks Data Platform

TE7416_20150223

Course Description:

This course is designed for administrators who are familiar with administering other Hadoop distributions and are migrating to the Hortonworks Data Platform (HDP). It covers installation, configuration, maintenance, security and performance topics.

Who Should Attend:

This class is for experienced Hadoop administrators and operators responsible for installing, configuring and supporting the Hortonworks Data Platform.

Prerequisites:

Attendees should be familiar with Hadoop fundamentals and have experience administering a Hadoop cluster and installing and configuring Hadoop components such as Sqoop, Flume, Hive, Pig and Oozie.

Benefits of Attendance:

• Install and configure an HDP 2.x cluster
• Use Ambari to monitor and manage a cluster
• Mount HDFS to a local filesystem using the NFS Gateway
• Configure Hive for Tez
• Use Ambari to configure the schedulers of the ResourceManager
• Commission and decommission worker nodes using Ambari
• Use Falcon to define and process data pipelines
• Take snapshots using the HDFS snapshot feature
• Implement and configure NameNode HA using Ambari
• Secure an HDP cluster using Ambari
• Set up a Knox gateway

Course Outline:

Hands-On Labs
• Install HDP 2.x using Ambari
• Add a new node to the cluster
• Stop and start HDP services
• Mount HDFS to a local file system
• Configure the capacity scheduler
• Use WebHDFS
• Dataset mirroring using Falcon
• Commission and decommission a worker node using Ambari
• Use HDFS snapshots
• Configure NameNode HA using Ambari
• Secure an HDP cluster using Ambari
• Setting up a Knox gateway


Hortonworks HDP Analyst: Apache HBase Essentials

TE7417_20150612

Course Description:

This course is designed for big data analysts who want to use the HBase NoSQL database, which runs on top of HDFS, to provide real-time read/write access to sparse datasets. Topics include HBase architecture, services, installation and schema design.

Who Should Attend:

This class is for architects, software developers, and analysts responsible for implementing non-SQL databases in order to handle sparse data sets commonly found in big data use cases.

Prerequisites:

Students must have basic familiarity with data management systems. Familiarity with Hadoop or databases is helpful but not required.

Benefits of Attendance:

• Integrate HBase with Hadoop and HDFS
• Describe architectural components and core concepts of HBase
• Understand HBase functionality
• Install and configure HBase
• Understand HBase schema design
• Import and export data
• Perform backup and recovery
• Monitor and manage HBase
• Describe how Apache Phoenix works with HBase
• Integrate HBase with Apache ZooKeeper
• Use HBase services and perform data operations
• Optimize HBase Access

Course Outline:

Hands-On Labs
• Using Hadoop and MapReduce
• Using HBase
• Importing Data from MySQL to HBase
• Using Apache ZooKeeper
• Examining Configuration Files
• Using Backup and Snapshot
• HBase Shell Operations
• Creating Tables with Multiple Column Families
• Exploring HBase Schema
• Blocksize and Bloom filters
• Exporting Data
• Using a Java Data Access Object Application to interact with HBase


Hortonworks HDP Operations: Apache HBase Management

TE7419_20150921

Course Description:

This course is designed for administrators who will be installing, configuring and managing HBase clusters. It covers installation with Ambari, configuration, security and troubleshooting HBase implementations. The course includes an end-of-course project in which students work together to design and implement an HBase schema.

Who Should Attend:

This course is for architects, software developers, and analysts responsible for implementing non-SQL databases in order to handle sparse data sets commonly found in big data use cases.

Prerequisites:

Students must have basic familiarity with data management systems. Familiarity with Hadoop or databases is helpful but not required. Students new to Hadoop are encouraged to take the HDP Overview: Apache Hadoop Essentials course.

Benefits of Attendance:

• Discuss running applications in the cloud
• Provision the cluster
• Use the HBase shell
• Ingest data
• Perform operational management
• Perform backup and recovery
• Provide security
• Monitor HBase and diagnose problems
• Perform maintenance
• Troubleshoot

Course Outline:

Hands-On Labs
• Installing and Configuring HBase with Ambari
• Manually Installing HBase (Optional)
• Using Shell Commands
• Ingesting Data using ImportTSV
• Enabling HBase High Availability
• Viewing Log Files
• Configuring and Enabling Snapshots
• Configuring Cluster Replication
• Enabling Authentication and Authorization
• Diagnosing and Resolving Hot Spotting
• Region Splitting
• Monitoring JVM Garbage Collection
• End-of-Course Project: Designing an HBase Schema


Hortonworks HDP Developer: Storm and Trident Fundamentals Workshop

TE7418_20150630

Course Description:

This course provides a technical introduction to the fundamentals of Apache Storm and Trident, including the concepts, terminology, architecture, installation, operation, and management of Storm and Trident. Simple Storm and Trident code excerpts are provided throughout the course. The course also includes an introduction to, and code samples for, Apache Kafka, a messaging system that is commonly used in concert with Storm and Trident.

Who Should Attend:

This course is for data architects, data integration architects, technical infrastructure teams, and Hadoop administrators or developers who want to understand the fundamentals of Storm and Trident.

Prerequisites:

No previous Hadoop or programming knowledge is required. Students will need browser access to the Internet.

Benefits of Attendance:

• Recognize differences between batch and real-time data processing
• Define Storm elements including tuples, streams, spouts, topologies, worker processes, executors, and stream groupings
• Explain Storm architectural components, including Nimbus, Supervisors, and ZooKeeper cluster
• Recognize/interpret Java code for a spout, bolt, or topology
• Identify how to install and configure a Storm cluster
• Identify how to develop and submit a topology to a local or remote distributed cluster
• Recognize and explain the differences between reliable and unreliable Storm operation
• Manage and monitor Storm using the command-line client or browser-based Storm User Interface (UI)
• Define Trident elements including tuples, streams, batches, partitions, topologies, Trident spouts, and operations
• Recognize and interpret the code for Trident operations, including filters, functions, aggregations, merges, and joins
• Recognize and understand Trident repartitioning operations

Course Outline:

See Course Objectives
