HADOOP
Revised 10/19/2015
Table of Contents
Hortonworks HDP Developer: Java
Hortonworks HDP Developer: Apache Pig and Hive
Hortonworks HDP Developer: Windows
Hortonworks HDP Operations: Hadoop Administration 1
Hortonworks HDP Data Science
Hortonworks HDP Developer: Custom YARN Applications
Hortonworks HDP Operations: Migrating to the Hortonworks Data Platform
Hortonworks HDP Analyst: Apache HBase Essentials
Hortonworks HDP Operations: Apache HBase Management
Hortonworks HDP Developer: Storm and Trident Fundamentals Workshop
Hortonworks HDP Developer: Java
Course Length 4 Days
TE7411_20140825
Course Description:
This advanced four-day course provides Java programmers a deep dive into Hadoop 2.0 application development. Students will learn how to design and develop efficient and effective MapReduce applications for Hadoop 2.0 using the Hortonworks Data Platform. Students who attend this course will learn how to harness the power of Hadoop 2.0 to manipulate, analyze and perform computations on their Big Data.
Who Should Attend:
This class is for experienced Java software engineers who need to design and develop Java MapReduce applications for Hadoop 2.0.
Prerequisites:
This course assumes students have experience developing Java applications and using a Java IDE. Labs are completed using the Eclipse IDE and Maven. No prior Hadoop knowledge is required.
Benefits of Attendance:
Upon completion of this course, students will be able to:
• Explain Hadoop 2.0 and the Hadoop Distributed File System
• Explain the new YARN framework in Hadoop 2.0
• Develop a Java MapReduce application
• Run a MapReduce application on YARN
• Use combiners and in-map aggregation to improve the performance of a MapReduce job
• Write a custom partitioner to avoid data skew on reducers
• Perform a secondary sort by writing custom key and group comparator classes
• Recognize use cases for the various built-in input and output formats
• Write a custom input and output format for a MapReduce job
• Optimize a MapReduce job by following best practices
• Configure various aspects of a MapReduce job to optimize mappers and reducers
• Develop a custom RawComparator class
• Use the Distributed Cache
• Explain the various join techniques in Hadoop
• Perform a map-side join
• Use a Bloom filter to join two large datasets
• Perform unit tests using the MRUnit API
• Explain the basic architecture of HBase
• Write an HBase MapReduce application
• Explain use cases for Pig and Hive
• Write a simple Pig script to explore and transform big data
• Write a Pig UDF (User-Defined Function) in Java
• Execute a Hive query
• Write a Hive UDF in Java
• Use the JobControl class to create a workflow of MapReduce jobs
Course Outline:
Day 1
• Understanding Hadoop and HDFS
• Writing MapReduce Applications
• Map Aggregation
Day 2
• Partitioning and Sorting
• Input and Output Formats
• Optimizing MapReduce Jobs
Day 3
• Advanced MapReduce Features
• Unit Testing
• HBase Programming
Day 4
• Pig Programming
• Hive Programming
• Defining Workflow
Lab Content:
• Configuring a Hadoop 2.0 Development Environment
• Putting data into HDFS using Java
• Write a distributed grep MapReduce application
• Write an inverted index MapReduce application
• Configure and use a combiner
• Writing a custom combiner
• Writing a custom partitioner
• Globally sort output using the TotalOrderPartitioner
• Writing a MapReduce job whose data is sorted using a composite key
• Writing a custom InputFormat class
• Writing a custom OutputFormat class
• Compute a simple moving average of historical stock price data
• Use data compression
• Define a RawComparator
• Perform a map-side join
• Using a Bloom filter
• Unit testing a MapReduce job
• Import data into HBase
• Writing an HBase MapReduce job
• Writing a User-Defined Pig Function
• Writing a User-Defined Hive Function
• Defining an Oozie workflow
Hortonworks HDP Developer: Apache Pig and Hive
Course Length 4 Days
TE7414_20150603
Course Description:
This 4-day hands-on training course teaches students how to develop applications and analyze Big Data stored in Apache Hadoop 2.0 using Pig and Hive. Students will learn the details of Hadoop 2.0, YARN, the Hadoop Distributed File System (HDFS), an overview of MapReduce, and a deep dive into using Pig and Hive to perform data analytics on Big Data. Other topics covered include data ingestion using Sqoop and Flume, and defining workflow using Oozie. Labs are run in a Linux environment.
Who Should Attend:
This class is for data analysts, BI analysts, BI developers, SAS developers and other types of analysts who need to answer questions and analyze Big Data stored in a Hadoop cluster.
Prerequisites:
Students should be familiar with programming principles and have experience in software development. SQL experience is strongly recommended. Java knowledge is helpful. No prior Hadoop knowledge is required.
Benefits of Attendance:
Upon completion of this course, students will be able to:
• Explain Hadoop 2.0 and YARN
• Explain use cases for Hadoop
• Explain how HDFS Federation works in Hadoop 2.0
• Explain the various tools and frameworks in the Hadoop 2.0 ecosystem
• Explain the architecture of the Hadoop Distributed File System (HDFS)
• Use the Hadoop client to input data into HDFS
• Use Sqoop to transfer data between Hadoop and a relational database
• Explain the architecture of MapReduce
• Explain the architecture of YARN
• Run a MapReduce job on YARN
• Write a Pig script to explore and transform data in HDFS
• Define advanced Pig relations
• Use Pig to apply structure to unstructured Big Data
• Invoke a Pig User-Defined Function
• Use Pig to organize and analyze Big Data
• Understand how Hive tables are defined and implemented
• Use the new Hive windowing functions
• Explain and use the various Hive file formats
• Create and populate a Hive table that uses the new ORC file format
• Use Hive to run SQL-like queries to perform data analysis
• Use Hive to join datasets using a variety of techniques, including Map-side joins and Sort-Merge-Bucket joins
• Write efficient Hive queries
• Create ngrams and context ngrams using Hive
• Perform data analytics like quantiles and page rank on Big Data using the DataFu Pig library
Course Outline:
Day 1
• Understanding Hadoop 2.0
• The Hadoop Distributed File System (HDFS)
• Inputting Data into HDFS
• The MapReduce Framework and YARN
Day 2
• Introduction to Pig
• Advanced Pig Programming
Day 3
• Hive Programming
• Using HCatalog
• Advanced Hive Programming
Day 4
• Advanced Hive Programming (cont.)
• Data Analysis and Statistics
• Defining Workflow with Oozie
Lab Content:
• Use HDFS commands to add/remove files and folders from HDFS
• Use Sqoop to transfer data between HDFS and an RDBMS
• Run a MapReduce job
• Run a YARN application
• Explore and transform data using Pig
• Split a dataset using Pig
• Join two datasets using Pig
• Use Pig to transform and export a dataset for use with Hive
• Use HCatLoader and HCatStorer to retrieve HCatalog schemas from within a Pig script
• Understand how a Hive table is stored in HDFS
• Use Hive to discover useful information in a dataset
• Understand how Hive queries get executed as MapReduce jobs
• Perform a join of two datasets with Hive
• Use advanced Hive features like windowing, views and ORC files
• Use the Hive analytics functions (rank, dense_rank, cume_dist, row_number)
• Write a custom reducer in Python that reduces the number of underlying MapReduce jobs generated from a Hive query
• Analyze and sessionize clickstream data using the Pig DataFu library
• Compute quantiles of NYSE stock prices
• Use Hive to compute ngrams on Avro-formatted files
• Define an Oozie workflow
Hortonworks HDP Developer: Windows
Course Length 4 Days
TE7410_20140825
Course Description:
This 4-day hands-on training course teaches students how to develop applications and analyze Big Data stored in Apache Hadoop on Windows using Pig and Hive. Students will learn the details of Hadoop 2.x, YARN, the Hadoop Distributed File System (HDFS), an overview of MapReduce, and a deep dive into using Pig and Hive to perform data analytics on Big Data. Other topics covered include using Sqoop to transfer data between Hadoop and Microsoft SQL Server, and connecting Microsoft Excel to Hadoop using the HiveODBC Driver.
Who Should Attend:
This course is for software developers who need to understand and develop applications for Hadoop 2.x on Windows.
Prerequisites:
Students should be familiar with programming principles and have experience in software development. SQL knowledge and familiarity with Microsoft Windows are also helpful. No prior Hadoop knowledge is required.
Benefits of Attendance:
Upon completion of this course, students will be able to:
• Explain Hadoop and YARN
• Explain use cases for Hadoop
• Explain the various tools and frameworks in the Hadoop 2.x ecosystem
• Explain the components of the Hortonworks Data Platform on Windows
• Explain the deployment options for HDP on Windows
• Explain the architecture of the Hadoop Distributed File System (HDFS)
• Use the Hadoop client to input data into HDFS
• Use Sqoop to transfer data between Hadoop and Microsoft SQL Server
• Explain the architecture of MapReduce
• Explain the architecture of YARN
• Run a MapReduce job on YARN
• Write a Pig script to explore and transform data in HDFS
• Define advanced Pig relations
• Use Pig to apply structure to unstructured Big Data
• Invoke a Pig User-Defined Function
• Use Pig to organize and analyze Big Data
• Understand how Hive tables are defined and implemented
• Use the new Hive windowing functions
• Explain and use the various Hive file formats
• Create and populate a Hive table that uses the new ORC file format
• Use Hive to run SQL-like queries to perform data analysis
• Use Hive to join datasets using a variety of techniques, including Map-side joins and Sort-Merge-Bucket joins
• Write efficient Hive queries
• Create ngrams and context ngrams using Hive
Course Outline:
Day 1
• Understanding Hadoop
• The Hadoop Distributed File System (HDFS)
• Inputting Data into HDFS
• The MapReduce Framework
Day 2
• Introduction to Pig
• Advanced Pig Programming
Day 3
• Hive Programming
• Using HCatalog
• Advanced Hive Programming
Day 4
• The Hive ODBC Driver
• Hadoop 2 and YARN
Appendix A: Defining Workflow with Oozie
Hands-On Labs: Students will work through the following lab exercises using the Hortonworks Data Platform 2.1 on Windows.
• Start HDP on Windows
• Use HDFS commands to add/remove files and folders from HDFS
• Use Sqoop to transfer data between HDFS and Microsoft SQL Server
• Run a MapReduce job
• Explore and transform data using Pig
• Split a dataset using Pig
• Join two datasets using Pig
• Use Pig to transform and export a dataset for use with Hive
• Use HCatLoader and HCatStorer to retrieve HCatalog schemas from within a Pig script
• Understand how a Hive table is stored in HDFS
• Use Hive to discover useful information in a dataset
• Understand how Hive queries get executed as MapReduce jobs
• Perform a join of two datasets with Hive
• Use advanced Hive features like windowing, views and ORC files
• Use the Hive analytics functions (rank, dense_rank, cume_dist, row_number)
• Analyze and sessionize clickstream data using the Pig DataFu library
• Compute quantiles of NYSE stock prices
• Use Hive to compute ngrams on Avro-formatted files
• Connect Microsoft Excel to Hadoop using the HiveODBC Driver
• Run a YARN application
• Define an Oozie workflow
Hortonworks HDP Operations: Hadoop Administration 1
Course Length 4 Days
TE7408_20151014
Course Description:
This course is designed for administrators who will be managing the Hortonworks Data Platform (HDP) 2.3 with Ambari. It covers installation, configuration, and other typical cluster maintenance tasks.
Who Should Attend:
This course is designed for IT administrators and operators responsible for installing, configuring and supporting an Apache Hadoop 2.3 deployment in a Linux environment.
Prerequisites:
Attendees should be familiar with Hadoop and Linux environments.
Benefits of Attendance:
Upon completion of this course, students will be able to:
• Summarize an enterprise environment including Big Data, Hadoop and the Hortonworks Data Platform (HDP)
• Install HDP
• Manage Ambari Users and Groups
• Manage Hadoop Services
• Use HDFS Storage
• Manage HDFS Storage
• Configure HDFS Storage
• Configure HDFS Transparent Data Encryption
• Configure the YARN Resource Manager
• Submit YARN Jobs
• Configure the YARN Capacity Scheduler
• Add and Remove Cluster Nodes
• Configure HDFS and YARN Rack Awareness
• Configure HDFS and YARN High Availability
• Monitor a Cluster
• Protect a Cluster with Backups
Course Outline:
Lab Content: Students will work through the following lab exercises using the Hortonworks Data Platform 2.2.
• Introduction to the Lab Environment
• Performing an Interactive Ambari HDP Cluster Installation
• Configuring Ambari Users and Groups
• Managing Hadoop Services
• Using HDFS Files and Directories
• Using WebHDFS
• Configuring HDFS ACLs
• Managing HDFS
• Managing HDFS Quotas
• Configuring HDFS Transparent Data Encryption
• Configuring and Managing YARN
• Non-Ambari YARN Management
• Configuring YARN Failure Sensitivity, Work Preserving Restarts, and Log Aggregation Settings
• Submitting YARN Jobs
• Configuring Different Workload Types
• Configuring Users and Groups for YARN Labs
• Configuring YARN Resource Behavior and Queues
• User, Group and Fine-Tuned Resource Management
• Adding Worker Nodes
• Configuring Rack Awareness
• Configuring HDFS High Availability
• Configuring YARN High Availability
• Configuring and Managing Ambari Alerts
• Configuring and Managing HDFS Snapshots
• Using Distributed Copy (DistCP)
Hortonworks HDP Data Science
Course Length 3 Days
TE7412_20140825
Course Description:
Data Science for the Hortonworks Data Platform covers data science principles and techniques through lecture and hands-on experience. During this three-day course, students will learn the processes and practice of data science, including machine learning and natural language processing. Students will also learn the tools and programming languages used by data scientists, including Python, IPython, Mahout, Pig, NumPy, pandas, SciPy, Scikit-learn, the Natural Language Toolkit (NLTK), and Spark MLlib.
Who Should Attend:
This class is for architects, software developers, analysts and data scientists who need to understand how to apply data science and machine learning on Hadoop.
Prerequisites:
Students must have experience with at least one programming or scripting language, knowledge in statistics and/or mathematics, and a basic understanding of big data and Hadoop principles.
Benefits of Attendance:
Upon completion of this course, students will be able to:
• Recognize use cases for data science
• Describe the architecture of Hadoop and YARN
• Explain the differences between supervised and unsupervised learning
• List the six machine learning tasks
• Recognize use cases for clustering, outlier detection, affinity analysis, classification, regression, and recommendation
• Use Mahout to run a machine learning algorithm on Hadoop
• Write Pig scripts to transform data on Hadoop
• Use Pig to prepare data for a machine learning algorithm
• Write a Python script
• Use NumPy to analyze big data
• Use the data structure classes in the pandas library
• Write a Python script that invokes a SciPy machine learning algorithm
• Explain the options for running Python code on a Hadoop cluster
• Write a Pig User Defined Function in Python
• Use Pig streaming on Hadoop with a Python script
• Write a Python script that invokes a scikit-learn machine learning algorithm
• Use the k-nearest neighbor algorithm to predict values based on a training data set
• Run a machine learning algorithm on a distributed data set on Hadoop
• Describe use cases for Natural Language Processing (NLP)
• Perform sentence segmentation on a large body of text
• Perform part-of-speech tagging
• Use the Natural Language Toolkit (NLTK) to implement NLP tasks and machine learning algorithms
• Explain the components of a Spark application
Course Outline:
Day 1
• Using Hadoop for Data Science
• Hadoop Architecture
• Machine Learning
• Introduction to Pig
Day 2
• Python Programming
• Analyzing Data with Python
• Running Python on Hadoop
Day 3
• Implementing Machine Learning
• Natural Language Processing
• Spark MLlib
Hands-On Labs: Students will complete the following hands-on labs using their own 7-node Hadoop cluster (HDP 2.1) and IPython Notebook.
• Setting Up a Development Environment
• Using HDFS Commands
• Using Mahout for Machine Learning
• Getting Started with Pig
• Exploring Data with Pig
• Using the IPython Notebook
• Data Analysis with Python
• Interpolating Data Points
• Define a Pig UDF in Python
• Streaming Python with Pig
• K-Nearest Neighbor
• K-Means Clustering
• Using NLTK for Natural Language Processing
• Classifying Text using Naive Bayes
• Spark Programming
• Running Data Science Algorithms using Spark MLlib
Hortonworks HDP Developer: Custom YARN Applications
Course Length 2 Days
TE7415_20140725
Course Description:
This 2-day hands-on training course teaches students how to develop custom YARN applications for Apache Hadoop. Students will learn the details of the YARN architecture, the steps involved in writing a YARN application, the details of writing a YARN client and ApplicationMaster, and how to launch Containers. Applications are developed using Eclipse and Gradle connected remotely to a 7-node HDP 2.1 cluster running in a virtual machine that the students can keep for use after the training.
Who Should Attend:
This course is intended for software engineers familiar with Java who need to develop YARN applications on Hadoop 2.x by writing custom YARN clients and ApplicationMasters in Java.
Prerequisites:
Students must have attended the Developing Applications with the Hortonworks Data Platform using Java course; or attended the Data Analysis with the Hortonworks Data Platform using Pig and Hive course; or possess similar Hadoop development knowledge and understand HDFS and the MapReduce framework.
Benefits of Attendance:
Upon completion of this course, students will be able to:
• Explain the architecture of YARN
• Explain the lifecycle of a YARN application
• Write a YARN client application
• Run a YARN application on a Hadoop 2.x cluster
• Monitor the status of a running YARN application
• View the aggregated logs of a YARN application
• Configure a ContainerLaunchContext
• Define a LocalResource for sharing application files across the cluster
• Write a YARN ApplicationMaster
• Explain the differences between synchronous and asynchronous ApplicationMasters
• Allocate Containers in a cluster
• Launch Containers on NodeManagers
• Write a custom Container to perform specific business logic
• Explain the job schedulers of the ResourceManager
• Define queues for the Capacity Scheduler
Course Outline:
Day 1
• Unit 1: The YARN Architecture
• Unit 2: Overview of a YARN Application
• Unit 3: Writing a YARN Client
Day 2
• Unit 4: Writing a YARN ApplicationMaster
• Unit 5: Containers
• Unit 6: Job Scheduling
Lab Content: Students will work through the following lab exercises using the Hortonworks Data Platform 2.1.
• Running a YARN Application
• Set Up a YARN Development Environment
• Writing a YARN Client
• Submitting an ApplicationMaster
• Writing an ApplicationMaster
• Requesting Containers
• Running Containers
• Writing Custom Containers
Hortonworks HDP Operations: Migrating to the Hortonworks Data Platform
Course Length 2 Days
TE7416_20150223
Course Description:
This course is designed for administrators who are familiar with administering other Hadoop distributions and are migrating to the Hortonworks Data Platform (HDP). It covers installation, configuration, maintenance, security and performance topics.
Who Should Attend:
This class is for experienced Hadoop administrators and operators responsible for installing, configuring and supporting the Hortonworks Data Platform.
Prerequisites:
Attendees should be familiar with Hadoop fundamentals and have experience administering a Hadoop cluster and installing and configuring Hadoop components such as Sqoop, Flume, Hive, Pig and Oozie.
Benefits of Attendance:
Upon completion of this course, students will be able to:
• Install and configure an HDP 2.x cluster
• Use Ambari to monitor and manage a cluster
• Mount HDFS to a local filesystem using the NFS Gateway
• Configure Hive for Tez
• Use Ambari to configure the schedulers of the ResourceManager
• Commission and decommission worker nodes using Ambari
• Use Falcon to define and process data pipelines
• Take snapshots using the HDFS snapshot feature
• Implement and configure NameNode HA using Ambari
• Secure an HDP cluster using Ambari
• Set up a Knox gateway
Course Outline:
Hands-On Labs:
• Install HDP 2.x using Ambari
• Add a new node to the cluster
• Stop and start HDP services
• Mount HDFS to a local file system
• Configure the capacity scheduler
• Use WebHDFS
• Dataset mirroring using Falcon
• Commission and decommission a worker node using Ambari
• Use HDFS snapshots
• Configure NameNode HA using Ambari
• Secure an HDP cluster using Ambari
• Setting up a Knox gateway
Hortonworks HDP Analyst: Apache HBase Essentials
Course Length 2 Days
TE7417_20150612
Course Description:
This course is designed for big data analysts who want to use the HBase NoSQL database, which runs on top of HDFS to provide real-time read/write access to sparse datasets. Topics include HBase architecture, services, installation and schema design.
Who Should Attend:
This class is for architects, software developers, and analysts responsible for implementing non-SQL databases in order to handle sparse data sets commonly found in big data use cases.
Prerequisites:
Students must have basic familiarity with data management systems. Familiarity with Hadoop or databases is helpful but not required.
Benefits of Attendance:
Upon completion of this course, students will be able to:
• Integrate HBase with Hadoop and HDFS
• Describe architectural components and core concepts of HBase
• Understand HBase functionality
• Install and configure HBase
• Understand HBase schema design
• Import and export data
• Perform backup and recovery
• Monitor and manage HBase
• Describe how Apache Phoenix works with HBase
• Integrate HBase with Apache ZooKeeper
• Use HBase services and perform data operations
• Optimize HBase Access
Course Outline:
Hands-On Labs:
• Using Hadoop and MapReduce
• Using HBase
• Importing Data from MySQL to HBase
• Using Apache ZooKeeper
• Examining Configuration Files
• Using Backup and Snapshot
• HBase Shell Operations
• Creating Tables with Multiple Column Families
• Exploring HBase Schema
• Blocksize and Bloom filters
• Exporting Data
• Using a Java Data Access Object Application to interact with HBase
Hortonworks HDP Operations: Apache HBase Management
Course Length 4 Days
TE7419_20150921
Course Description:
This course is designed for administrators who will be installing, configuring and managing HBase clusters. It covers installation with Ambari, configuration, security and troubleshooting HBase implementations. The course includes an end-of-course project in which students work together to design and implement an HBase schema.
Who Should Attend:
This course is for architects, software developers, and analysts responsible for implementing non-SQL databases in order to handle sparse data sets commonly found in big data use cases.
Prerequisites:
Students must have basic familiarity with data management systems. Familiarity with Hadoop or databases is helpful but not required. Students new to Hadoop are encouraged to take the HDP Overview: Apache Hadoop Essentials course.
Benefits of Attendance:
Upon completion of this course, students will be able to:
• Discuss running applications in the cloud
• Provision the cluster
• Use the HBase shell
• Ingest data
• Perform operational management
• Perform backup and recovery
• Provide security
• Monitor HBase and diagnose problems
• Perform maintenance
• Troubleshoot
Course Outline:
Hands-On Labs:
• Installing and Configuring HBase with Ambari
• Manually Installing HBase (Optional)
• Using Shell Commands
• Ingesting Data using ImportTSV
• Enabling HBase High Availability
• Viewing Log Files
• Configuring and Enabling Snapshots
• Configuring Cluster Replication
• Enabling Authentication and Authorization
• Diagnosing and Resolving Hot Spotting
• Region Splitting
• Monitoring JVM Garbage Collection
• End-of-Course Project: Designing an HBase Schema
Hortonworks HDP Developer: Storm and Trident Fundamentals Workshop
Course Length 2 Days
TE7418_20150630
Course Description:
This course provides a technical introduction to the fundamentals of Apache Storm and Trident that includes the concepts, terminology, architecture, installation, operation, and management of Storm and Trident. Simple Storm and Trident code excerpts are provided throughout the course. The course also includes an introduction to, and code samples for, Apache Kafka. Apache Kafka is a messaging system that is commonly used in concert with Storm and Trident.
Who Should Attend:
This course is for data architects, data integration architects, technical infrastructure teams, and Hadoop administrators or developers who want to understand the fundamentals of Storm and Trident.
Prerequisites:
No previous Hadoop or programming knowledge is required. Students will need browser access to the Internet.
Benefits of Attendance:
Upon completion of this course, students will be able to:
• Recognize differences between batch and real-time data processing
• Define Storm elements including tuples, streams, spouts, topologies, worker processes, executors, and stream groupings
• Explain Storm architectural components, including Nimbus, Supervisors, and ZooKeeper cluster
• Recognize/interpret Java code for a spout, bolt, or topology
• Identify how to install and configure a Storm cluster
• Identify how to develop and submit a topology to a local or remote distributed cluster
• Recognize and explain the differences between reliable and unreliable Storm operation
• Manage and monitor Storm using the command-line client or browser-based Storm User Interface (UI)
• Define Trident elements including tuples, streams, batches, partitions, topologies, Trident spouts, and operations
• Recognize and interpret the code for Trident operations, including filters, functions, aggregations, merges, and joins
• Recognize and understand Trident repartitioning operations
Course Outline:
See Course Objectives