• No results found

Data science for the datacenter: processing logs with Apache Spark. William Benton Red Hat Emerging Technology

N/A
N/A
Protected

Academic year: 2021

Share "Data science for the datacenter: processing logs with Apache Spark. William Benton Red Hat Emerging Technology"

Copied!
129
0
0

Loading.... (view fulltext now)

Full text

(1)

Data science for the

datacenter: processing logs with Apache Spark

William Benton

(2)
(3)
(4)

Challenges of log data

SELECT hostname, DATEPART(HH, timestamp) AS hour, COUNT(msg) FROM LOGS WHERE level='CRIT' AND msg LIKE '%failure%'

(5)

Challenges of log data

11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00 SELECT hostname, DATEPART(HH, timestamp) AS hour, COUNT(msg)

FROM LOGS WHERE level='CRIT' AND msg LIKE '%failure%' GROUP BY hostname, hour

(6)

Challenges of log data

postgres httpd

syslog

INFO INFO WARN CRIT DEBUG INFO

GET GET GET POST

WARN WARN INFO INFO INFO

GET (404)

INFO

(7)

Challenges of log data postgres httpd syslog INFO WARN GET (404) CRIT INFO

GET GET GET POST

INFO INFO INFO WARN

CouchDB httpd Django

INFO INFO CRIT

GET POST

INFO INFO INFO WARN

haproxy k8s

INFO INFO WARN DEBUG CRIT

WARN WARN INFO INFO INFO

INFO

Cassandra nginx

Rails

INFO CRIT INFO

GET POST PUT POST

INFO INFO WARN INFO INFO

redis INFO CRIT INFO INFO

PUT (500)

httpd syslog

GET PUT

(8)

Challenges of log data postgres httpd syslog INFO WARN GET (404) CRIT INFO

GET GET GET POST

INFO INFO INFO WARN

CouchDB httpd Django

INFO INFO CRIT

GET POST

INFO INFO INFO WARN

haproxy k8s

INFO INFO WARN DEBUG CRIT

WARN WARN INFO INFO INFO

INFO

Cassandra nginx

Rails

INFO CRIT INFO

GET POST PUT POST

INFO INFO WARN INFO INFO

redis INFO CRIT INFO INFO

PUT (500)

httpd syslog

GET PUT

INFO WARN INFO INFO

postgres httpd syslog INFO WARN GET (404) CRIT INFO

GET GET GET POST

INFO INFO INFO WARN

CouchDB httpd Django

INFO INFO CRIT

GET POST

INFO INFO INFO WARN

haproxy k8s

INFO INFO WARN DEBUG CRIT

WARN WARN INFO INFO INFO

INFO

Cassandra nginx

Rails

INFO CRIT INFO

GET POST PUT POST

INFO INFO WARN INFO INFO

redis INFO CRIT INFO INFO

PUT (500)

httpd syslog

GET PUT

INFO WARN INFO INFO

postgres httpd syslog INFO WARN GET (404) CRIT INFO

GET GET GET POST

INFO INFO INFO WARN

CouchDB httpd Django

INFO INFO CRIT

GET POST

INFO INFO INFO WARN

haproxy k8s

INFO INFO WARN DEBUG CRIT

WARN WARN INFO INFO INFO

INFO

Cassandra nginx Rails

INFO CRIT INFO

GET POST PUT POST

INFO INFO WARN INFO INFO

redis INFO CRIT INFO INFO

PUT (500)

httpd syslog

GET PUT

INFO WARN INFO INFO

postgres httpd syslog INFO WARN GET (404) CRIT INFO

GET GET GET POST

INFO INFO INFO WARN

CouchDB httpd Django

INFO INFO CRIT

GET POST

INFO INFO INFO WARN

haproxy k8s

INFO INFO WARN DEBUG CRIT

WARN WARN INFO INFO INFO

INFO

Cassandra nginx Rails

INFO CRIT INFO

GET POST PUT POST

INFO INFO WARN INFO INFO

redis INFO CRIT INFO INFO

PUT (500)

httpd syslog

GET PUT

INFO WARN INFO INFO

postgres httpd syslog INFO WARN GET (404) CRIT INFO

GET GET GET POST

INFO INFO INFO WARN

CouchDB httpd Django

INFO INFO CRIT

GET POST

INFO INFO INFO WARN

haproxy k8s

INFO INFO WARN DEBUG CRIT

WARN WARN INFO INFO INFO

INFO

Cassandra nginx

Rails

INFO CRIT INFO

GET POST PUT POST

INFO INFO WARN INFO INFO

redis INFO CRIT INFO INFO

PUT (500)

httpd syslog

GET PUT

INFO WARN INFO INFO

postgres httpd syslog INFO WARN GET (404) CRIT INFO

GET GET GET POST

INFO INFO INFO WARN

CouchDB httpd Django

INFO INFO CRIT

GET POST

INFO INFO INFO WARN

haproxy k8s

INFO INFO WARN DEBUG CRIT

WARN WARN INFO INFO INFO

INFO

Cassandra nginx Rails

INFO CRIT INFO

GET POST PUT POST

INFO INFO WARN INFO INFO

redis INFO CRIT INFO INFO

PUT (500)

httpd syslog

GET PUT

INFO WARN INFO INFO

How many services are generating logs in your datacenter today?

(9)

Apache Spark

Spark core (distributed collections, scheduler)

Graph Query ML Streaming

(10)
(11)

Collecting log data

collecting

Ingesting live log data via rsyslog, logstash, fluentd

(12)

Collecting log data

collecting

Ingesting live log data via rsyslog, logstash, fluentd

normalizing

Reconciling log record metadata across sources

(13)

Collecting log data

collecting

Ingesting live log data via rsyslog, logstash, fluentd normalizing Reconciling log record metadata across sources warehousing Storing normalized records in ES indices

(14)

Collecting log data

warehousing

Storing normalized records in ES indices

(15)

Collecting log data warehousing Storing normalized records in ES indices analysis cache warehoused data as Parquet files on Gluster volume

(16)
(17)
(18)
(19)

Schema mediation

timestamp, level, host, IP addresses, message, &c.

rsyslog-style metadata, like

(20)

Structured queries at scale

logs

.select("level").distinct .as[String].collect

(21)

Structured queries at scale

logs

.select("level").distinct .as[String].collect

debug, notice, emerg,

err, warning, crit, info, severe, alert

(22)

Structured queries at scale

logs

.groupBy($"level", $"rsyslog.app_name") .agg(count("level").as("total"))

.orderBy($"total".desc) logs

.select("level").distinct .as[String].collect

debug, notice, emerg,

err, warning, crit, info, severe, alert

(23)

Structured queries at scale

logs

.groupBy($"level", $"rsyslog.app_name") .agg(count("level").as("total"))

.orderBy($"total".desc)

INFO kubelet 17933574 INFO kube-proxy 10961117 ERR journal 6867921 INFO systemd 5184475 … logs

.select("level").distinct .as[String].collect

debug, notice, emerg,

err, warning, crit, info, severe, alert

(24)
(25)

From log records to vectors

(26)

From log records to vectors

What does it mean for two log records to be similar?

red

green blue

(27)

From log records to vectors

What does it mean for two log records to be similar?

red green blue orange -> 000 -> 010 -> 100 -> 001

(28)

From log records to vectors

What does it mean for two log records to be similar?

red green blue orange -> 000 -> 010 -> 100 -> 001 pancakes waffles aebliskiver omelets bacon hash browns

(29)

From log records to vectors

What does it mean for two log records to be similar?

red green blue orange -> 000 -> 010 -> 100 -> 001 pancakes waffles aebliskiver omelets bacon hash browns -> 10000 -> 01000 -> 00100 -> 00001 -> 00000 -> 00010

(30)

From log records to vectors

What does it mean for two log records to be similar?

red green blue orange -> 000 -> 010 -> 100 -> 001 pancakes waffles aebliskiver omelets bacon hash browns -> 10000 -> 01000 -> 00100 -> 00001 -> 00000 -> 00010 red pancakes orange waffles

(31)

From log records to vectors

What does it mean for two log records to be similar?

red green blue orange -> 000 -> 010 -> 100 -> 001 pancakes waffles aebliskiver omelets bacon hash browns -> 10000 -> 01000 -> 00100 -> 00001 -> 00000 -> 00010 red pancakes orange waffles -> 00010000 -> 00101000

(32)

From log records to vectors

(33)

From log records to vectors

What does it mean for two log records to be similar?

{level: INFO, hostname: fred, timestamp: "2016-05-06 00:00:01", ...}

{level: DEBUG, hostname: barney, timestamp: "2016-05-06 00:00:02", ...} {level: INFO, hostname: fred, timestamp: "2016-05-06 00:00:03", ...}

{level: ERR, hostname: barney, timestamp: "2016-05-06 00:00:03", ...} {level: ALERT, hostname: wilma, timestamp: "2016-05-06 00:00:03", ...}

(34)

From log records to vectors

What does it mean for two log records to be similar?

{level: INFO, hostname: fred, timestamp: "2016-05-06 00:00:01", ...}

{level: DEBUG, hostname: barney, timestamp: "2016-05-06 00:00:02", ...} {level: INFO, hostname: fred, timestamp: "2016-05-06 00:00:03", ...}

{level: ERR, hostname: barney, timestamp: "2016-05-06 00:00:03", ...} {level: ALERT, hostname: wilma, timestamp: "2016-05-06 00:00:03", ...}

-> 00000100 -> 00100000 -> 00000100 -> 00001000 -> 10000000

(35)
(36)
(37)

Similarity and distance

(38)

Similarity and distance

pi - qi

i=1 n

(39)

Similarity and distance

pi - qi

i=1 n

(40)

Similarity and distance

pq

(41)
(42)
(43)
(44)

Similarity and distance

10

10

3

(45)

WARN INFO INFO INFO

WARN INFO INFO INFO DEBUG

WARN INFO INFO INFO WARN

WARN

INFO INFO

INFO

Other interesting features

host01 host02

(46)

INFO INFO INFO DEBUG INFO INFO WARN

INFO INFO INFO

WARN INFO INFO INFO

INFO INFO

INFO INFO

INFO

INFO WARN

INFO INFO INFO

INFO INFO

WARN DEBUG

Other interesting features

host01 host02

(47)

INFO INFO INFO

INFO INFO

INFO INFO

INFO INFO INFO

WARN DEBUG WARN INFO INFO INFO DEBUG WARN INFO INFO INFO INFO INFO

INFO INFO INFO

WARN

WARN

INFO

WARN INFO

Other interesting features

host01 host02

(48)

Other interesting features

(49)

Other interesting features

INFO: Everything is great! Just checking in to let you know I’m OK. CRIT: No requests in last hour; suspending running app containers.

(50)

Other interesting features

INFO: Everything is great! Just checking in to let you know I’m OK. CRIT: No requests in last hour; suspending running app containers.

(51)

NATURAL LANGUAGE 


(52)

Introducing word2vec

Madrid Spain

(53)

Introducing word2vec

Madrid Spain

(54)

Introducing word2vec

Madrid Spain

(55)

Introducing word2vec

Madrid Spain

(56)

Introducing word2vec

Madrid Spain

(57)

Introducing word2vec

v("Madrid") - v("Spain") + v("France") ≈ v("Paris")

Madrid

Paris Spain

(58)

Preprocessing text

What is a word? How are “log words” different from words found in conventional prose?

(59)

Preprocessing text

What is a word? How are “log words” different from words found in conventional prose?

(60)

Preprocessing text

What is a word? How are “log words” different from words found in conventional prose?

Apache® Oozie™ Java HotSpot™

systemd

(61)

Preprocessing text

What is a word? How are “log words” different from words found in conventional prose?

/dev/null

Apache® Oozie™ Java HotSpot™

systemd

(62)

Preprocessing text

What is a word? How are “log words” different from words found in conventional prose?

/dev/null

Apache® Oozie™ Java HotSpot™

OutOfMemoryError Kernel::exec

systemd

(63)

Preprocessing text

What is a word? How are “log words” different from words found in conventional prose?

ENOENT

OPEN_MAX /dev/null

Apache® Oozie™ Java HotSpot™

OutOfMemoryError Kernel::exec

systemd

(64)

Preprocessing text

What is a word? How are “log words” different from words found in conventional prose?

ENOENT

OPEN_MAX /dev/null

Apache® Oozie™ Java HotSpot™

OutOfMemoryError Kernel::exec

systemd

etcd ZooKeeper

(65)

Preprocessing text

What is a word? How are “log words” different from words found in conventional prose?

ENOENT

OPEN_MAX /dev/null

Apache® Oozie™ Java HotSpot™

OutOfMemoryError Kernel::exec systemd etcd ZooKeeper 192.168.0.1 QEMU

(66)

// assume messages is an RDD of log message texts

val oneletter = new scala.util.matching.Regex(".*([A-Za-z]).*")

def tokens(s: String, post: String=>String): Seq[String] =

collapseWhitespace(s) .split(" ")

.map(s => post(stripPunctuation(s)))

.collect { case token @ oneletter(_) => token }

val tokenSeqs = messages.map(line => tokens(line), identity[String])

val w2v = new org.apache.spark.mllib.feature.Word2Vec

(67)
(68)

Insights from log messages

(69)

Insights from log messages nova vm glance images containers instances

(70)

Insights from log messages nova vm glance images containers instances password

(71)

Insights from log messages nova vm glance images containers instances password publickey accepted opened IPMI

(72)

Insights from log messages nova vm glance images containers instances password publickey accepted opened IPMI sh

(73)

Insights from log messages nova vm glance images containers instances password publickey accepted opened IPMI sh/usr/bin/bash AUDIT_SESSION

(74)

VISUALIZING STRUCTURE and FINDING OUTLIERS

(75)
(76)

Multidimensional data

(77)

Multidimensional data

(78)

Multidimensional data

(79)

Multidimensional data

(80)

Multidimensional data

[4,7] [2,3,5]

[7,1,6,5,12,
 8,9,2,2,4,

(81)

Multidimensional data

[4,7] [2,3,5]

[7,1,6,5,12,
 8,9,2,2,4,

(82)

A linear approach: PCA 0 0 0 1 1 0 1 0 1 0 0 0 1 0 0 0 1 1 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 1 0 1 1 1 0 0 0 0 0 0 0 1

(83)

A linear approach: PCA 0 0 0 1 1 0 1 0 1 0 0 0 1 0 0 0 1 1 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 1 0 1 1 1 0 0 0 0 0 0 0 1

(84)
(85)
(86)

A nonlinear approach: t-SNE

(87)

A nonlinear approach: t-SNE

(88)
(89)
(90)

Tree-based approaches yes no yes no if orange if !orange if red if !red if !gray if !gray

(91)

Tree-based approaches yes no yes no if orange if !orange if red if !red if !gray if !gray yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no

(92)

Tree-based approaches yes no yes no if orange if !orange if red if !red if !gray if !gray yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no yes no no yes yes no yes no

(93)
(94)
(95)
(96)
(97)
(98)
(99)
(100)
(101)
(102)
(103)
(104)
(105)
(106)
(107)
(108)
(109)
(110)
(111)
(112)
(113)
(114)

Outliers in log data

(115)

Outliers in log data

0.95 0.97

(116)

Outliers in log data

0.95 0.97

(117)

Outliers in log data 0.95 0.97 0.92 0.37 An outlier is any

record whose best match was at least

below the mean.

0.94 0.89

0.91 0.93

(118)
(119)

Out of 310 million log records, we identified 0.0012% as outliers.

(120)
(121)
(122)

Thirty most extreme outliers

10 Can not communicate with power supply 2. 9 Power supply 2 failed.

8 Power supply redundancy is lost. 1 Drive A is removed.

1 Can not communicate with power supply 1. 1 Power supply 1 failed.

(123)

LESSONS LEARNED 


(124)

Spark and ElasticSearch

Data locality is an issue and caching is even more

important than when running from local storage.

The best practice is to warehouse ES indices to Parquet

(125)

Structured queries in Spark

Always program defensively: mediate schemas,

explicitly convert null values, etc.

Use the Dataset API whenever possible to minimize

boilerplate and benefit from query planning without

(126)

Natural language tips

Conventional preprocessing pipelines probably won’t 


get you the best results on log message texts.

Consider the sorts of “words” and word variants you’ll

want to recognize with your preprocessing pipeline. Use an approximate whitelist for unusual words.

(127)

Memory and partitioning

Large JVM heaps can lead to appalling GC pauses.

You probably won’t be able to use all of your memory

in a single JVM without executor timeouts.

Consider memory budget at the driver when choosing

(128)

Feature engineering

Design your feature engineering pipeline so you can

translate feature vectors back to factor values.

Favor feature engineering effort over complex or

novel learning algorithms.

(129)

@willb • [email protected]

https://chapeau.freevariable.com

References

Related documents

Study objectives: The purposes of this study were as follows: (1) to determine whether physical performance, quality of life, and dyspnea with activities of daily living

Dengan mengetahui prediksi jumlah mahasiswa baru pada periode yang akan datang, maka Universitas Widyagama Malang dapat merancang strategi promosi yang lebih

This dissertation research was performed to investigate the relationships between SAR methods, characterize the function of dispersion and swelling of sodic soils in pure

influència de salpicar, sinònim també antic (segle xiii) i molt estès en occità (i de més presència en. català que no salpuscar) i que ens puguem trobar davant una mena de

The unbiased analysis of a paradigmatic V α3S1/Vβ13S1-T-cell receptor from a pathogenic epidermal CD8 + T-cell clone of an HLA-C*06:02 + psoriasis patient had revealed

Remove the battery pack before starting any work on the machine...

This study reported the design and development of a Hardware-in-the-Loop simulation platform with illustration of the development and demonstration as applied to a candidate

Figure 4. Brain endothelial cells protect GSC against mitogen deprivation-induced autophagy and apoptosis. a-d) GSC#1 were cultured for 3 days with control medium (C),