Hadoop: challenge accepted!

(1)

Hadoop:

challenge

accepted!

Arkadiusz Osiński

(2)

ToC

-‐‑  Hadoop basics -‐‑  Gather data

-‐‑  Process your data

-‐‑  Learn from your data -‐‑  Visualize your data

(3)

BigData

(4)

BigData

-‐‑ Petabytes of (un)structured data -‐‑  12% of data is analyzed

(5)

BigData

(6)

BigData

-‐‑  a lot of data is not gathered -‐‑  how to gain knowledge?

(7)

Power Big Data Data Lake Scalability Petabytes Mapreduce Commodity

(8)

HDFS

(9)

HDFS

-‐‑  Storage layer

(10)

HDFS

-‐‑  Distributed ﬁle system

(11)

HDFS

-‐‑  Commodity hardware

(12)

HDFS

-‐‑  Scalability -‐‑  JBOD

(13)

HDFS

(14)

HDFS

-‐‑  Access control -‐‑  No SPOF

(15)

YARN

(16)

YARN

-‐‑  Distributed computing layer

(17)

YARN

-‐‑  Operations in place of data

(18)

YARN

-‐‑  MapReduce…

(19)

YARN

-‐‑  MapReduce…

-‐‑  and others applications

(20)

Let’s squize our data

to get a juice!!

(21)

Gather data

flume-twitter.sources.Twitter.type = com.cloudera.flume.source.TwitterSource flume-twitter.sources.Twitter.channels = MemChannel flume-twitter.sources.Twitter.consumerKey = (…) flume-twitter.sources.Twitter.consumerSecret = (…) flume-twitter.sources.Twitter.accessToken = (…) flume-twitter.sources.Twitter.accessTokenSecret = (…)

(22)

Process your data

(23)

Process your data

-‐‑  Hadoop Streaming!

(24)

Process your data

-‐‑  Hadoop Streaming!

-‐‑  No need to write code in Java

(25)

Process your data

#!/usr/bin/python import sys import json import datetime as dt keyword='hadoop'

for line in sys.stdin:

data = json.loads(line.strip())

if keyword in data['text'].lower():

dt=dt.datetime.strptime(data['created_at'], '%a %b %d %H:%M:%S +0000 %Y').strftime('%Y-%m-%d')

(26)

Process your data

#!/usr/bin/python

import sys

(counter,datekey=(0,'')

for line in sys.stdin:

line = line.strip().split("\t") if datekey != line[0]:

if datekey:

print "{0}\t{1}".format(str(datekey),str(counter)) datekey = line[0]

counter = 1 else:

counter += 1

(27)

Process your data

yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \

-files ./map.py,./reduce.py \ -mapper ./map.py \

-reducer ./reduce.py \

-input /tweets/2014/04/*/*/* \ -input /tweets/2014/05/*/*/* \ -output /tweet_keyword

(28)

Process your data

(….) 2014-04-24 864 2014-04-25 1121 2014-04-26 593 2014-04-27 649 2014-04-28 1084 2014-04-29 1575 2014-04-30 1170 2014-05-01 1164 2014-05-02 1175 2014-05-03 779 2014-05-04 471 (….)

(29)

(30)

Recommendations

Which product will be desired by client? We’ve got historical users interaction

(31)

(32)

Simple Example

Let’s just do mahout -‐‑ it’s easy! > apt-get install mahout

> cat simple_example.csv 1,101 1,102 1,103 2,101 > hdfs dfs -put simple_example.csv

> mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -b \ -Dmapred.input.dir=/mahout/input/wikilinks/simple_example.csv \ -Dmapred.output.dir=/mahout/output/wikilinks/simple_example \ -Dmapred.job.queue.name=atmosphere_prod

(33)

Simple Example

Tadadam! > hdfs dfs –text /mahout/output/wikilinks/simple_example/part-r-00000.snappy 1 [105:1.0,104:1.0] 2 [106:1.0,105:1.0] 3 [103:1.0,102:1.0] 4 [105:1.0,102:1.0] 5 [107:1.0,106:1.0]

(34)

Wiki Case

We’ve got links between wikipedia articles, and want to propose new links between articles.

„Wikipedia (i_/_ˌ_w_ɪ_k_ɨˈ_pi_ː_diəә/_or i_/_ˌ_w_ɪ_ki_ˈ_pi_ː_diəә/ _WIK_{-‐‑i-‐‑}_PEE_{-‐‑dee-‐‑əә}_{) is a} _{collaboratively edited}_,

multilingual, free Internet encyclopedia that is supported by the non-‐‑proﬁt Wikimedia Foundation. Volunteers worldwide collaboratively write Wikipedia'ʹs 30 million articles in 287 languages, including over 4.5 million in the English Wikipedia. Anyone who can access”

(35)

(36)

Wiki Case

hlp://users.on.net/%7Ehenry/pagerank/links-‐‑simple-‐‑sorted.zip #!/usr/bin/awk -f BEGIN { OFS=",”; } { gsub(":","",$1); for (i=2;i<=NF;i++) { print $1,$i } }

(37)

Wiki Case

yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \ -Dmapreduce.job.max.split.locations=24 \ -Dmapreduce.job.queuename=hadoop_prod \ -Dmapred.output.key.comparator.class=mapred.lib.KeyFieldBasedComparator \ -Dmapred.text.key.comparator.options=-n \ -Dmapred.output.compress=false \ -files ./mahout/mapper.awk \ -mapper ./mapper.awk \ -input /mahout/input/wikilinks/links-simple-sorted.txt \ -output /mahout/output/wikilinks/fixedinput

(38)

Wiki Case

Mahout lib count’s similarity Matrix and gave recommendations for 824 articles.

What’s important, we didn’t gather any knowledge a priori and just ran algorithm’s out of box.

(39)

Wiki Case

Acadèmia_Valenciana_de_la_Llengua

FIFA Valencia

October_1 Calendar

Prehistoric_Iberia Link appears recently

Ceuta Spain City at the north coast of Africa

Roussillon Part of France by the border with Spain

Sweden J

Turís municipality in the Valencian Community

Vulgar_Latin Language article

Western_Italo-‐‑

Western_languages Language article

(40)

(41)

Tweets

Let’s ﬁnd group of:

•  tags

(42)

Tweets

•  Our data is not random

•  We’ve picked speciﬁc keywords

•  We’ll do analysis in two

(43)

Tweets

{

"filter_level":"medium",

"contributors":null,

"text":"PROMOCIÓN MES DE MAYO. con ...",

"geo":null, "retweeted":false, "lang":"es", "entities":{ "urls":[ { "expanded_url":"http://www.agmuriel.com", "indices":[ 69, 91 ], "display_url":"agmuriel.com/#!-/c1gz", "url":"http://t.co/APpPjRRTXn" } ] } (…)

(44)

Tweets

#!/usr/bin/python

import json, sys for line in sys.stdin: line = line.strip()

if '"lang":"en"' in line:

tweet = json.loads(line) try:

text = tweet['text'].lower().strip() if text:

tags = tweet[” entities"][”hashtags”]

for tag in tags:

print tag[“text”]+"\t"+text

except KeyError: continue #!/usr/bin/python import sys (lastKey,text) = (None,"") for line in sys.stdin:

(key,value) = line.strip().split("\t")

if lastKey and lastKey != key: print lastKey+"\t"+text (lastKey,text) = (key,value) else:

(lastKey,text) = (key,text+" "+value)

(45)

Tweets

yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \ -Dmapreduce.job.queuename=atmosphere_time \ -Dmapred.output.compress=false \ -Dmapreduce.job.max.split.locations=24 \ -D-Dmapred.reduce.tasks=20 \ -files ~/mahout/twitter_map.py,~/mahout/twitter_reduce.py \ -mapper ./twitter_map.py \ -reducer ./twitter_reduce.py \ -input /project/atmosphere/tweets/2014/04/*/* \ -output /project/atmosphere/tweets/output \ -outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat

(46)

Tweets

mahout seq2sparse \

-i /project/atmosphere/tweets/output \

-o /project/atmosphere/tweets/vectorized -ow \

-chunk 200 -wt tfidf -s 5 -md 5 -x 90 -ng 2 -ml 50 -seq -n 2

Calculate vector representation for text

{10:0.6292275202550768,14:0.7772211575566166} {10:0.6292275202550768,14:0.7772211575566166}

{3:0.37796447439954967,14:0.37796447439954967,19:0.654653676423271,22:0.534522474858859} {17:1.0}

(47)

Tweets

I’ts time to begin clusterization Let’s ﬁnd 100 clusters mahout kmeans \ -i /tweets_5/vectorized/tfidf-vectors \ -c /tweets_5/kmeans/initial-clusters \ -o /tweets_5/kmeans/output-clusters \ -cd 1.0 -k 100 -x 10 -cl –ow \ -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure

(48)

Tweets

Glance at results

BURN OPEN LEATHER

FAT SOFTWARE WALLET

WEIGHTLOSS LINUX MAN

FITNESS UBUNTU

ZUMBA OPENSUSE

(49)

Tweets

It was easy because tags are

(50)

Tweets

Bigger challenge – user clustering LINUX UBUNTU WINDOWS OS PATCH MAC HACKED MICROSOFT FREE CSRRACING WON RACEYOURFRIENDS ANDROID CSRCLASSIC

(51)

Tweets

Bigger challenge – user clustering

•  Results show that dataset is strongly curved

by mobile and games

•  Dataset wasn’t random – we subscribed

speciﬁc keywords

(52)

Tweets

HADOOP WORLD

run predictive machine learning algorithms on hadoop without even knowing mapreduce.: data scientists are very... h:p://t.co/gdmqm5g1ar

rt @mapr: google cloud storage connector for #hadoop: quick start guide now avail h:p://t.co/17hxtvdlir

(53)

Tweets

HADOOP WORLD

Cloudera wants to do big data in Real Time.

(54)

Visualize data

add jar hive-serdes-1.0-SNAPSHOT.jar; create table tw_data_201404

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\012’

STORED AS TEXTFILE LOCATION ‘/tweets/tw_data_201404’ AS SELECT

v_date,

LOWER(hashtags.text),

lang,

COUNT(*) AS total_count

FROM logs.tweets LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags WHERE v_date like '2014-04-%'

(55)

Visualize data

add jar elasticsearch-hadoop-hive-2.0.0.RC1.jar;

CREATE EXTERNAL TABLE es_export ( v_date string, tag string, lang string, total_count int, info string ) STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler’ TBLPROPERTIES ( 'es.resource' = 'trends/log', 'es.index.auto.create' = 'true') ;

(56)

Visualize data

INSERT overwrite TABLE es_export

SELECT distinct may.v_date,may.tag,may.lang,may.total_count,'nt'

FROM tw_data_201405 may

LEFT outer JOIN tw_data_201404 april

ON april.tag = may.tag

WHERE april.tag is null

(57)

(58)

Visualize data

(59)

Thank you!

Questions?