Workshop: From Zero
to _
Agenda today
1. Some setup before we start 2. (Back to the) introduction 3. Our workshop today
There is a lot to copy and paste – so let’s all join a Google Hangout chat
http://bit.ly/1xgSQId
• If I forget to paste some content into the chat room, just shout out and remind me
First, let’s all download and setup Virtualbox and Vagrant
http://docs.vagrantup.com/v2/installation/in dex.html
Now let’s setup our development environment
$ vagrant plugin install vagrant-vbguest If you have git already installed:
$ git clone --recursive
https://github.com/snowplow/dev-environment.git If not: $ wget https://github.com/snowplow/dev-environment/archive/temp.zip $ unzip temp.zip $ wget https://github.com/snowplow/ansible-playbooks/archive/temp.zip $ unzip temp.zip
Now let’s setup our development environment
$ cd dev-environment
$ vagrant up
Final step for now, let’s install some software
$ ansible-playbook /vagrant/ansible-playbooks/aws-tools.yml inventory-file=/home/vagrant/ansible_hosts --connection=local $ ansible-playbook /vagrant/ansible-playbooks/scala-sbt.yml inventory-file=/home/vagrant/ansible_hosts --connection=localSnowplow is an open-source web and event analytics platform, built on Hadoop
• Co-founders Alex Dean and Yali Sassoon met at OpenX, the open-source ad technology business in 2008
• We released Snowplow as a skunkworks prototype at start of 2012:
github.com/snowplow/snowplow
• We built Snowplow on top of Hadoop from the very start
We wanted to take a fresh approach to web analytics
• Your own web event data -> in your own data warehouse
• Your own event data model
• Slice / dice and mine the data in highly bespoke ways to answer your specific business questions
• Plug in the broadest possible set of analysis tools to drive value from your data
Data warehouse Data pipeline
Analyse your data in
And we saw the potential of new “big data” technologies and
services to solve these problems in a scalable, low-cost manner
These tools make it possible to capture, transform, store and analyse all your granular, event-level data, to you can perform any analysis
Amazon EMR Amazon S3
Our Snowplow event processing flow runs on Hadoop,
specifically Amazon’s Elastic MapReduce hosted Hadoop service
Website / webapp
Snowplow Hadoop data pipeline
CloudFront-based event collector Scalding-based enrichment on Hadoop JavaScript event tracker Amazon Redshift / PostgreSQL Amazon S3 or Clojure-based event collector
Why did we pick Hadoop?
Scalability Easy to reprocess data Highly testableWe have customers processing 350m Snowplow events a day in Hadoop – runs in <2 hours
If business rules change, we can fire up a large cluster and re-process all historical raw
Snowplow events
We write unit and integration tests for our jobs and run them locally, giving us confidence that our jobs will run correctly at scale on Hadoop
And why Amazon’s Elastic MapReduce (EMR)?
No need to run our own
cluster
Elastic
Interop with other AWS
services
Running your own Hadoop cluster is a huge pain – not for the fainthearted. By contrast, EMR just works (most of the time !)
Snowplow runs as a nightly (sometimes more frequent) batch job. We spin up the EMR cluster to run the job, and shut it down straight after
EMR works really well with Amazon S3 as a file store. We are big fans of Amazon Redshift
… for our workshop today, we will stick to using Elastic MapReduce and try to avoid any unnecessary complexity
… and we will learn by doing!
• Lots of books and articles about Hadoop and the theory of MapReduce
• We will learn by doing – no theory unless it’s required to directly explain the jobs we are creating
• Our priority is to get you up-and-running on Elastic MapReduce, and confident enough to write your own Hadoop jobs
Part 1: a simple Pig Latin job
on EMR
What is Pig (Latin)?
• Pig is a high-level platform for creating MapReduce jobs which can run on Hadoop
• The language you write Pig jobs in is called Pig Latin • For quick-and-dirty scripts, Pig just works
Hadoop DFS Hadoop MapReduce
Crunch Hive Pig
Java Cascading
Let’s all come up with a unique name for ourselves
• Lowercase letters, no spaces or hyphens or anything
• E.g. I will be alexsnowplow – please come up with a unique name for yourself!
• It will be visible to other participants so choose something you don’t mind being public
• In the rest of this workshop, wherever you see YOURNAME, replace it with your unique name
Let’s restart our Vagrant and do some setup
$ mkdir zero2hadoop
$ aws configure
// And type in:
AWS Access Key ID [None]: AKIAILD6DCBTFI642JPQ
AWS Secret Access Key [None]:
KMVdr/bsq4FDTI5H143K3gjt4ErG2oTjd+1+a+ou Default region name [None]: eu-west-1
Let’s create some buckets in Amazon S3 – this is where our data and our apps will live
$ aws s3 mb s3://zero2hadoop-in-YOURNAME $ aws s3 mb s3://zero2hadoop-out-YOURNAME $ aws s3 mb s3://zero2hadoop-jobs-YOURNAME
// Check those worked
Let’s get some source data uploaded
$ mkdir -p ~/zero2hadoop/part1/in $ cd ~/zero2hadoop/part1/in $ wget https://raw.githubusercontent.com/snowplow/sc alding-example-project/master/data/hello.txt $ cat hello.txt Hello world Goodbye world$ aws s3 cp hello.txt s3://zero2hadoop-in-YOURNAME/part1/hello.txt
Let’s get our EMR command-line tools installed (1/2)
$ /vagrant/emr-cli/elastic-mapreduce $ rvm install ruby-1.8.7-head $ rvm use 1.8.7 $ alias emr=/vagrant/emr-cli/elastic-mapreduceLet’s get our EMR command-line tools installed (2/2)
Add this file:
{ "access_id": "AKIAI55OSYYRLYWLXH7A", "private_key": "SHRXNIBRdfWuLPbCt57ZVjf+NMKUjm9WTknDHPTP", "region": "eu-west-1" } to: /vagrant/emr-cli/credentials.json (sudo sntp -s 24.56.178.140)
Let’s get our EMR command-line tools installed (2/2)
// This should work fine now:
$ emr --list <no output>
Let’s do some local file work
$ mkdir -p ~/zero2hadoop/part1/pig $ cd ~/zero2hadoop/part1/pig $ wget https://gist.githubusercontent.com/alexanderd ean/d8371cebdf00064591ae/raw/cb3030a6c48b85d1 01e296ccf27331384df3288d/wordcount.pig // The original https://gist.github.com/alexanderdean/d8371ce bdf00064591aeNow upload to S3
$ aws s3 cp wordcount.pig s3://zero2hadoop-jobs-YOURNAME/part1/
$ aws s3 ls --recursive s3://zero2hadoop-jobs-YOURNAME/part1/
2014-06-06 09:10:31 674 part1/wordcount.pig
And now we run our Pig script
$ emr --create --name "part1 YOURNAME" \ --set-visible-to-all-users true \ --pig-script s3n://zero2hadoop-jobs-YOURNAME/part1/wordcount.pig \ --ami-version 2.0 \ --args "-p,INPUT=s3n://zero2hadoop-in-YOURNAME/part1, \ -p,OUTPUT=s3n://zero2hadoop-out-YOURNAME/part1"
Let’s check out the jobs running in Elastic MapReduce – first at the console
$ $ emr --list
j-1HR90SWPP40M4 STARTING part1 YOURNAME
PENDING Setup Pig
Okay let’s check the output of our job! (1/2)
$ aws s3 ls --recursive s3://zero2hadoop-out-YOURNAME/part1
2014-06-06 09:57:53 0 part1/_SUCCESS 2014-06-06 09:57:50 26 part1/part-r-00000
Okay let’s check the output of our job!
$ mkdir -p ~/zero2hadoop/part1/out $ cd ~/zero2hadoop/part1/out
$ aws s3 cp --recursive s3://zero2hadoop-out-YOURNAME/part1 . $ ls part-r-00000 _SUCCESS $ cat part-r-00000 2 world 1 Hello 1 Goodbye
Part 2: a simple Scalding job
on EMR
What is Scalding?
• Scalding is a Scala API over Cascading, the Java framework for building data processing pipelines on Hadoop:
Hadoop DFS Hadoop MapReduce
Cascading Pig …
Java
Scalding Cascalog PyCascading cascading. jruby
Cascading has a “plumbing” abstraction over vanilla MapReduce which should be quite comfortable to DW practitioners
Scalding improves further on Cascading by reducing boilerplate and making more complex pipelines easier to express
• Scalding written in Scala – reduces a lot of boilerplate versus vanilla Cascading. Easier to look at a job in its entirety and see what it does
• Scalding created and supported by Twitter, who use it throughout their organization
• We believe that data pipelines should be as strongly typed as possible – all the other DSLs/APIs on top of Cascading encourage dynamic typing
Strongly typed data pipelines – why?
• Catch errors as soon as possible – and report them in a strongly typed way too • Define the inputs and outputs of each of your data processing steps in an
unambiguous way
• Forces you to formerly address the data types flowing through your system • Lets you write code like this:
Okay let’s get started!
Let’s get this code down locally and build it
$ mkdir -p ~/zero2hadoop/part2 $ cd ~/zero2hadoop/part2 $ git clone git://github.com/snowplow/scalding-example-project.git $ cd scalding-example-project $ sbt assemblyGood, tests are passing, now let’s upload this to S3 so it’s available to our EMR job
$ aws s3 cp
target/scala-2.10/scalding-example-project-0.0.5.jar s3://zero2hadoop-jobs-YOURNAME/part2/
// If that doesn’t work:
$ aws cp s3://snowplow-hosted-assets/third-party/scalding-example-project-0.0.5.jar s3://zero2hadoop-jobs-YOURNAME/part2/
$ aws s3 ls s3://zero2hadoop-jobs-YOURNAME/part2/
And now we run it!
$ emr --create --name ”part2 YOURNAME" \ --set-visible-to-all-users true \ --jar s3n://zero2hadoop-jobs- YOURNAME/part2/scalding-example-project-0.0.5.jar \ --arg com.snowplowanalytics.hadoop.scalding.WordCou ntJob \ --arg --hdfs \
--arg --input --arg s3n://zero2hadoop-in-YOURNAME/part1/hello.txt \
--arg --output --arg s3n://zero2hadoop-out-YOURNAME/part2
Let’s check out the jobs running in Elastic MapReduce – first at the console
$ emr --list
j-1M62IGREPL7I STARTING scalding-example-project
Okay let’s check the output of our job!
$ aws s3 ls --recursive s3://zero2hadoop-out-YOURNAME/part2
$ mkdir -p ~/zero2hadoop/part2/out $ cd ~/zero2hadoop/part2/out
$ aws s3 cp --recursive s3://zero2hadoop-out-YOURNAME/part2 . $ ls $ cat part-00000 goodbye 1 hello 1 world 2
Part 3: a more complex
Scalding job on EMR
Let’s explore another tutorial together
Questions?
http://snowplowanalytics.com https://github.com/snowplow/snowplow
@snowplowdata
To talk offline – @alexcrdean on Twitter or [email protected]