• No results found

How To Analyze Big Data With Luajit And Luatjit

N/A
N/A
Protected

Academic year: 2021

Share "How To Analyze Big Data With Luajit And Luatjit"

Copied!
32
0
0

Loading.... (view fulltext now)

Full text

(1)

Ad-hoc Big-Data Analysis with Lua

And LuaJIT

Alexander Gladysh <[email protected]> @agladysh

Lua Workshop 2015 Stockholm

(2)

Outline

Introduction The Problem A Solution Assumptions Examples The Tools Notes Questions? 2 / 32

(3)

Alexander Gladysh

I CTO, co-owner at LogicEditor

I In löve with Lua since 2005

(4)

The Problem

I You have a dataset to analyze,

I which is too large for "small-data" tools,

I and have no resources to setup and maintain (or pay for) the

Hadoop, Google Big Query etc.

I but you have some processing power available.

(5)

Goal

I Pre-process the data so it can be handled by R or Excel or

your favorite analytics tool (or Lua!).

I If the data is dynamic, then learn to pre-process it and build a

data processing pipeline.

(6)

An Approach

I Use Lua!

I And (semi-)standard tools, available on Linux.

I Go minimalistic while exploring, avoid frameworks,

I Then move on to an industrial solution that fits your newly

understood requirements,

I Or roll your own ecosystem! ;-)

(7)

Assumptions

(8)

Data Format

I Plain text

I Column-based (csv-like), optionally with free-form data in the

end

I Typical example: web-server log files

(9)

Data Format Example: Raw Data

2015/10/15 16:35:30 [info] 14171#0: *901195 [lua] index:14: 95c1c06e626b47dfc705f8ee6695091a 109.74.197.145 *.example.com

GET 123456.gif?q=0&step=0&ref= HTTP/1.1 example.com NB: This is a single, tab-separated line from a time-sorted file.

(10)

Data Format Example: Intermediate Data

alpha.example.com 5 beta.example.com 7 gamma.example.com 1

NB: These are several tab-separated lines from a key-sorted file.

(11)

Hardware

I As usual, more is better: Cores, cache, memory speed and

size, HDD speeds, networking speeds...

I But even a modest VM (or several) can be helpful.

I Your fancy gaming laptop is good too ;-)

(12)

OS

Linux (Ubuntu) Server.

This approach will, of course, work for other setups.

(13)

Filesystem

I Ideally, have data copies on each processing node, using

identical layouts.

I Fast network should work too.

(14)

Examples

(15)

Bash Script Example

time pv /path/to/uid-time-url-post.gz \

| pigz -cdp 4 \

| cut -d$’\t’ -f 1,3 \

| parallel --gnu --progress -P 10 --pipe --block=16M \

$(cat <<"EOF"

luajit ~me/url-to-normalized-domain.lua EOF

) \

| LC_ALL=C sort -u -t$’\t’ -k2 --parallel 6 -S20% \

| luajit ~me/reduce-value-counter.lua \

| LC_ALL=C sort -t$’\t’ -nrk2 --parallel 6 -S20% \

| pigz -cp4 >/path/to/domain-uniqs_count-merged.gz

(16)

Lua Script Example: url-to-normalized-domain.lua

for l in io.lines() do

local key, value = l:match("^([^\t]+)\t(.*)")

if value then

value = url_to_normalized_domain(value) end

if key and value then

io.write(key, "\t", value, "\n") end

end

(17)

Lua Script Example: reduce-value-counter.lua 1/3

-- Assumes input sorted by VALUE -- a foo --> foo 3 -- a foo bar 2 -- b foo quo 1 -- a bar -- c bar -- d quo 17 / 32

(18)

Lua Script Example: reduce-value-counter.lua 2/3

local last_key = nil, accum = 0

local flush = function(key)

if last_key then

io.write(last_key, "\t", accum, "\n")

end

accum = 0

last_key = key -- may be nil

end

(19)

Lua Script Example: reduce-value-counter.lua 3/3

for l in io.lines() do

-- Note reverse order!

local value, key = l:match("^(.-)\t(.*)$")

assert(key and value)

if key ~= last_key then

flush(key) collectgarbage("step") end accum = accum + 1 end flush() 19 / 32

(20)

Tying It All Together

Basically:

I You work with sorted data,

I mapping and reducing it line-by-line,

I in parallel where at all possible,

I while trying to use as much of available hardware resources as

practical,

I and without running out of memory.

(21)

The Tools

(22)

The Tools

I parallel

I sort, uniq, grep

I cut, join, comm

I pv

I compression utilities

I LuaJIT

(23)

LuaJIT?

Up to a point:

I 2.1 helps to speed things up,

I FFI bogs down development speed.

I Go plain Lua first (run it with LuaJIT),

I then roll your own ecosystem as needed ;-)

(24)

Parallel

I xargs for parallel computation

I can run your jobs in parallel on a single machine

I or on a "cluster"

(25)

Compression

I gzip: default, bad

I lz4: fast, large files

I pigz: fast, parallelizable

I xz: good compression, slow

I ...and many more,

I be on lookout for new formats!

(26)

GNU sort Tricks

LC_ALL=C \ sort -t$’\t’ --parallel 4 -S60% \ -k3,3nr -k2,2 -k1,1nr I Disable locale. I Specify delimiter.

I Note that parallel x4 with 60% memory will consume 0.6 *

log(4) = 120% of memory.

I When doing multi-key sort, specify parameters after key

number.

(27)

grep

http://stackoverflow.com/questions/9066609/fastest-possible-grep

(28)

Notes and Remarks

(29)

Why Lua?

Perl, AWK are traditional alternatives to Lua, but, if you’re not very disciplined and experienced, they are much less maintainable.

(30)

Start Small!

I Always run your scripts on small representative excerpts from

your datasets, not only while developing them locally, but on actual data-processing nodes too.

I Saves time and helps you learn the bottlenecks.

I Sometimes large run still blows in your face though:

I Monitor resource utilization at run-time.

(31)

Discipline!

I Many moving parts, large turn-around times, hard to keep tabs.

I Keep journal: Write down what you run and what time it took.

I Store actual versions of your scripts in a source control system.

I Don’t forget to sanity-check the results you get!

(32)

Questions?

Alexander Gladysh, [email protected]

References

Related documents

We have introduced a general framework for active testing that minimizes human vetting effort by actively selecting test examples to label and using performance estimators that adapt

«Work package» (WP 2) also included for each partner a proposal of literature review and analysis of : theoretical and empirical research findings; public

The National Housing Policy of 2002, which has the broad goal of ensuring that all Nigerians have access to decent housing accommodation at affordable costs, also stresses

The Department of Homeland Security’s Performance and Accountability Report for Fiscal Year 2003 provides financial and performance information that enables the President,

16 week homeless or alcohol/drug rehabilitation shelter for persons with drug/alcohol addictions. What’s Offered

We contribute an integrated approach to learning proba- bilistic relational rules from probabilistic examples and back- ground knowledge. It is incorporated in the ProbFOIL + sys-

Specifically, we consider the popular Good Features to Track (GFTT) [23] and Harris corner detection principles, and extend them to RGBD content, making the detected keypoints

Rather, both national governments and international actors, especially the United States with its huge military, economic, and political footprint in the region, will need to take