Data science for the datacenter: processing logs with Apache Spark. William Benton Red Hat Emerging Technology

(1)

Data science for the

datacenter: processing logs with Apache Spark

William Benton

(2)

(3)

(4)

Challenges of log data

SELECT hostname, DATEPART(HH, timestamp) AS hour, COUNT(msg) FROM LOGS WHERE level='CRIT' AND msg LIKE '%failure%'

(5)

11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00 SELECT hostname, DATEPART(HH, timestamp) AS hour, COUNT(msg)

FROM LOGS WHERE level='CRIT' AND msg LIKE '%failure%' GROUP BY hostname, hour

(6)

postgres httpd

syslog

INFO INFO WARN CRIT DEBUG INFO

GET GET GET POST

WARN WARN INFO INFO INFO

GET (404)

INFO

(7)

Challenges of log data postgres httpd syslog INFO WARN GET (404) CRIT INFO

INFO INFO INFO WARN

CouchDB httpd Django

INFO INFO CRIT

GET POST

haproxy k8s

INFO INFO WARN DEBUG CRIT

INFO

Cassandra nginx

Rails

INFO CRIT INFO

GET POST PUT POST

INFO INFO _WARN _INFO INFO

redis INFO CRIT INFO INFO

PUT (500)

httpd syslog

GET PUT

(8)

Challenges of log data postgres httpd syslog INFO WARN GET (404) CRIT INFO

GET POST

haproxy k8s

INFO

Rails

INFO INFO WARN INFO INFO

PUT (500)

GET PUT

INFO WARN INFO INFO

postgres httpd syslog INFO WARN GET (404) CRIT INFO

GET POST

haproxy k8s

INFO

Rails

PUT (500)

GET PUT

GET POST

haproxy k8s

INFO

Cassandra nginx Rails

PUT (500)

GET PUT

GET POST

haproxy k8s

INFO

PUT (500)

GET PUT

GET POST

haproxy k8s

INFO

Rails

PUT (500)

GET PUT

GET POST

haproxy k8s

INFO

PUT (500)

GET PUT

How many services are generating logs in your datacenter today?

(9)

Apache Spark

Spark core (distributed collections, scheduler)

Graph Query ML Streaming

(10)

(11)

Collecting log data

collecting

Ingesting live log data via rsyslog, logstash, fluentd

(12)

collecting

Ingesting live log data via rsyslog, logstash, fluentd

normalizing

Reconciling log record metadata across sources

(13)

collecting

Ingesting live log data via rsyslog, logstash, fluentd normalizing Reconciling log record metadata across sources warehousing Storing normalized records in ES indices

(14)

warehousing

Storing normalized records in ES indices

(15)

Collecting log data warehousing Storing normalized records in ES indices analysis cache warehoused data as Parquet files on Gluster volume

(16)

(17)

(18)

(19)

Schema mediation

timestamp, level, host, IP addresses, message, &c.

rsyslog-style metadata, like

(20)

Structured queries at scale

logs

.select("level").distinct .as[String].collect

(21)

logs

debug, notice, emerg,

err, warning, crit, info, severe, alert

(22)

logs

.groupBy($"level", $"rsyslog.app_name") .agg(count("level").as("total"))

.orderBy($"total".desc) logs

(23)

logs

.groupBy($"level", $"rsyslog.app_name") .agg(count("level").as("total"))

.orderBy($"total".desc)

INFO kubelet 17933574 INFO kube-proxy 10961117 ERR journal 6867921 INFO systemd 5184475 … logs

(24)

(25)

From log records to vectors

(26)

What does it mean for two log records to be similar?

red

green blue

(27)

red green blue orange -> 000 -> 010 -> 100 -> 001

(28)

red green blue orange -> 000 -> 010 -> 100 -> 001 pancakes waffles aebliskiver omelets bacon hash browns

(29)

red green blue orange -> 000 -> 010 -> 100 -> 001 pancakes waffles aebliskiver omelets bacon hash browns -> 10000 -> 01000 -> 00100 -> 00001 -> 00000 -> 00010

(30)

red green blue orange -> 000 -> 010 -> 100 -> 001 pancakes waffles aebliskiver omelets bacon hash browns -> 10000 -> 01000 -> 00100 -> 00001 -> 00000 -> 00010 red pancakes orange waffles

(31)

red green blue orange -> 000 -> 010 -> 100 -> 001 pancakes waffles aebliskiver omelets bacon hash browns -> 10000 -> 01000 -> 00100 -> 00001 -> 00000 -> 00010 red pancakes orange waffles -> 00010000 -> 00101000

(32)

(33)

{level: INFO, hostname: fred, timestamp: "2016-05-06 00:00:01", ...}

{level: DEBUG, hostname: barney, timestamp: "2016-05-06 00:00:02", ...} {level: INFO, hostname: fred, timestamp: "2016-05-06 00:00:03", ...}

{level: ERR, hostname: barney, timestamp: "2016-05-06 00:00:03", ...} {level: ALERT, hostname: wilma, timestamp: "2016-05-06 00:00:03", ...}

(34)

{level: INFO, hostname: fred, timestamp: "2016-05-06 00:00:01", ...}

{level: DEBUG, hostname: barney, timestamp: "2016-05-06 00:00:02", ...} {level: INFO, hostname: fred, timestamp: "2016-05-06 00:00:03", ...}

{level: ERR, hostname: barney, timestamp: "2016-05-06 00:00:03", ...} {level: ALERT, hostname: wilma, timestamp: "2016-05-06 00:00:03", ...}

-> 00000100 -> 00100000 -> 00000100 -> 00001000 -> 10000000

(35)

(36)

(37)

Similarity and distance

(38)

pi - qi

i=1 n

(39)

pi - qi

i=1 n

(40)

p • q

(41)

(42)

(43)

(44)

10

3

(45)

WARN INFO INFO INFO

WARN INFO INFO INFO DEBUG

WARN INFO INFO INFO WARN

WARN

INFO INFO

INFO

Other interesting features

host01 host02

(46)

INFO INFO INFO DEBUG INFO INFO WARN

INFO INFO INFO

WARN INFO INFO INFO

INFO INFO

INFO

INFO WARN

INFO INFO

WARN DEBUG

host01 host02

(47)

INFO INFO

WARN DEBUG WARN INFO INFO INFO DEBUG WARN INFO INFO INFO INFO INFO

WARN

INFO

WARN INFO

host01 host02

(48)

(49)

INFO: Everything is great! Just checking in to let you know I’m OK. CRIT: No requests in last hour; suspending running app containers.

(50)

INFO: Everything is great! Just checking in to let you know I’m OK. CRIT: No requests in last hour; suspending running app containers.

(51)

NATURAL LANGUAGE  

(52)

Introducing word2vec

Madrid Spain

(53)

(54)

(55)

(56)

(57)

v("Madrid") - v("Spain") + v("France") ≈ v("Paris")

Madrid

Paris Spain

(58)

Preprocessing text

What is a word? How are “log words” different from words found in conventional prose?

(59)

(60)

Apache® Oozie™ Java HotSpot™

systemd

(61)

/dev/null

systemd

(62)

/dev/null

OutOfMemoryError Kernel::exec

systemd

(63)

ENOENT

OPEN_MAX /dev/null

systemd

(64)

ENOENT

OPEN_MAX /dev/null

systemd

etcd ZooKeeper

(65)

ENOENT

OPEN_MAX /dev/null

OutOfMemoryError Kernel::exec systemd etcd ZooKeeper 192.168.0.1 QEMU

(66)

// assume messages is an RDD of log message texts

val oneletter = new scala.util.matching.Regex(".*([A-Za-z]).*")

def tokens(s: String, post: String=>String): Seq[String] =

collapseWhitespace(s) .split(" ")

.map(s => post(stripPunctuation(s)))

.collect { case token @ oneletter(_) => token }

val tokenSeqs = messages.map(line => tokens(line), identity[String])

val w2v = new org.apache.spark.mllib.feature.Word2Vec

(67)

(68)

Insights from log messages

(69)

Insights from log messages nova vm glance images containers instances

(70)

Insights from log messages nova vm glance images containers instances password

(71)

Insights from log messages nova vm glance images containers instances password publickey accepted opened IPMI

(72)

Insights from log messages nova vm glance images containers instances password publickey accepted opened IPMI sh

(73)

Insights from log messages nova vm glance images containers instances password publickey accepted opened IPMI sh/usr/bin/bash AUDIT_SESSION

(74)

VISUALIZING STRUCTURE and FINDING OUTLIERS

(75)

(76)

Multidimensional data

(77)

(78)

(79)

(80)

[4,7] [2,3,5]

[7,1,6,5,12,  8,9,2,2,4,

(81)

[4,7] [2,3,5]

[7,1,6,5,12,  8,9,2,2,4,

(82)

A linear approach: PCA 0 0 0 1 1 0 1 0 1 0 0 0 1 0 0 0 1 1 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 1 0 1 1 1 0 0 0 0 0 0 0 1

(83)

A linear approach: PCA 0 0 0 1 1 0 1 0 1 0 0 0 1 0 0 0 1 1 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 1 0 1 1 1 0 0 0 0 0 0 0 1

(84)

(85)

(86)

A nonlinear approach: t-SNE

(87)

A nonlinear approach: t-SNE

(88)

(89)

(90)

Tree-based approaches yes no yes no if orange if !orange if red if !red if !gray if !gray

(91)

Tree-based approaches yes no yes no if orange if !orange if red if !red if !gray if !gray yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no

(92)

Tree-based approaches yes no yes no if orange if !orange if red if !red if !gray if !gray yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no

(93)

(94)

(95)

(96)

(97)

(98)

(99)

(100)

(101)

(102)

(103)

(104)

(105)

(106)

(107)

(108)

(109)

(110)

(111)

(112)

(113)

(114)

Outliers in log data

(115)

0.95 0.97

(116)

0.95 0.97

(117)

Outliers in log data 0.95 0.97 0.92 0.37 An outlier is any

record whose best match was at least

4σ_{below the mean.}

0.94 0.89

0.91 0.93

(118)

(119)

Out of 310 million log records, we identified 0.0012% as outliers.

(120)

(121)

(122)

Thirty most extreme outliers

10 _{Can not communicate with power supply 2.} 9 _{Power supply 2 failed.}

8 Power supply redundancy is lost. 1 Drive A is removed.

1 Can not communicate with power supply 1. 1 Power supply 1 failed.

(123)

LESSONS LEARNED  

(124)

Spark and ElasticSearch

Data locality is an issue and caching is even more

important than when running from local storage.

The best practice is to warehouse ES indices to Parquet

(125)

Structured queries in Spark

Always program defensively: mediate schemas,

explicitly convert null values, etc.

Use the Dataset API whenever possible to minimize

boilerplate and benefit from query planning without

(126)

Natural language tips

Conventional preprocessing pipelines probably won’t  

get you the best results on log message texts.

Consider the sorts of “words” and word variants you’ll

want to recognize with your preprocessing pipeline. Use an approximate whitelist for unusual words.

(127)

Memory and partitioning

Large JVM heaps can lead to appalling GC pauses.

You probably won’t be able to use all of your memory

in a single JVM without executor timeouts.

Consider memory budget at the driver when choosing

(128)

Feature engineering

Design your feature engineering pipeline so you can

translate feature vectors back to factor values.

Favor feature engineering effort over complex or

novel learning algorithms.

(129)

@willb • [email protected] 

https://chapeau.freevariable.com