Big Data and Scripting. (lecture, computer science, bachelor/master/phd)

(1)


(2)

Big Data and Scripting - abstract/organization

abstract

• introduction to “Big Data” and involved techniques

• lecture (2+2)

• practical exercises to be turned in

dates

• 2 lectures (Mon 1:30 pm, M628 and Thu 10 am G302)

• 2 lab courses (Fri 10:00 am and 1:30 pm in Z613)

• oral exam, end of semester

me

• Uwe Nagel

(3)

Big Data and Scripting - organizational stuff

exercises

• website:

http://www.inf.uni-konstanz.de/algo/lehre/ss13/bds/

• (about) 3 projects (bash, R, NOSQL/Hadoop)

• programming skills useful, but not required

• discussion and help in lab course (Friday)

(4)

agenda - contents of this lecture

prologue: What is “Big Data” and why bother?

• concrete examples

• identify qualitatively what sets “Big Data” approaches apart

tools and techniques for (distributed) computation

• (some) basic notions of data handling

• Unix command line

• scripting in R

• NOSQL by example

• the map/reduce paradigm (example: Hadoop)

(5)

What this lecture does not cover

basics of data mining

• we will use some data mining techniques

• this is not a data mining course

see the lecture “Data Mining: Artificial Intelligence”

recommender systems

• we will touch on those without going into detail

see the seminar/lecture “Recommender Systems”

(6)

Prologue

what does Big Data mean and why is that interesting?

• Big Data and distributed computing seem like a fashion

• is there really an advantage?

• where does this advantage come from?

• 3 example applications

• increasing level of detail

(7)

What is “Big Data” and why bother?

a simple example - Amazon

• basically a selling platform

• provides:

– connection of suppliers to (private) customers

– a common market place (one interface for all)

– additional services (storage, shipment, payment)

– recommendation

what is the difference to competitors?

• Amazon knows customers, products, sales and views

• same is true for its competitors

(8)

What is “Big Data” and why bother?

• in comparison, Amazon has many more customers

• more customers, more transactions, more views → a larger data collection

→ better recommendations

estimate¹: 1/3 of Amazon’s sales are generated by recommendations

more data = better predictions?

• simple answer: essentially yes

• real answer: it’s a bit more complicated

¹ www.economist.com/blogs/graphicdetail/2013/02/elusive-big-data

(9)

What is “Big Data”? - extraction from examples

what are we trying to find out?

• learning/data mining and artificial intelligence are not that new

• somehow huge amounts of data can make a difference

• question: how and why?

approach: analyze examples using big data

1. where is the big data?

2. what kind of data is involved?

3. what makes a large data base crucial?

(10)

Target and the pregnant teen

Target

• a large discount retail chain (similar to Walmart)

• uses data analysis for targeted marketing

• central to one of the most famous big data stories

the story

• Target predicts pregnancy better than family members

• source: www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/

(11)

Target and the pregnant teen - How?

in a nutshell

• collect data about customers

• predict what they are interested in

• adjust advertisement to the specific person

(12)

Target and the pregnant teen - How?

step 1: data collection

• create large base of data available about customers

• each customer gets some unique ID (credit card, email, . . . )

• everything that can be connected to the customer is

– collected

– connected to customer ID

– used for interest prediction

example of data to collect

• items purchased together

• time/place of purchase

• weather? - whatever can be collected
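The collection step can be pictured as grouping every observed event under the customer ID it belongs to. The following is only a minimal sketch: the event fields, the sample data and the function name `collect_by_customer` are assumptions made for illustration, not a description of any real system.

```python
from collections import defaultdict

# Hypothetical purchase events: (customer_id, item, place).
events = [
    ("c1", "lotion", "store A"),
    ("c2", "bread", "store B"),
    ("c1", "zinc", "store A"),
]

def collect_by_customer(events):
    """Attach every observed event to the customer ID it belongs to."""
    profiles = defaultdict(list)
    for customer_id, item, place in events:
        profiles[customer_id].append((item, place))
    return dict(profiles)

profiles = collect_by_customer(events)
```

Each profile is then the raw material for predicting the interests of that customer.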

(13)

Target and the pregnant teen - How?

next: search for patterns

• simple: people buy what they always bought

• recommendation: “customers who bought this usually also buy . . . ”
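A recommendation of the form “customers who bought this usually also buy . . . ” can be approximated by counting how often two items appear in the same basket. The sketch below assumes toy baskets and hypothetical helper names (`co_purchase_counts`, `also_bought`); real recommender systems are considerably more involved.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets, one set of items per purchase.
baskets = [
    {"diapers", "wipes", "lotion"},
    {"diapers", "wipes"},
    {"bread", "butter"},
    {"diapers", "lotion"},
]

def co_purchase_counts(baskets):
    """Count, for every ordered item pair, how many baskets contain both."""
    counts = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(basket), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts

def also_bought(item, counts, k=2):
    """Return the k items most often bought together with `item`."""
    partners = [(other, n) for (first, other), n in counts.items() if first == item]
    # Most frequent first; ties are broken alphabetically.
    partners.sort(key=lambda pair: (-pair[1], pair[0]))
    return [other for other, _ in partners[:k]]

counts = co_purchase_counts(baskets)
suggestions = also_bought("diapers", counts)
```

With more baskets the counts become more reliable, which is one way in which a larger data collection helps.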

concrete targeting, example: young parents

• a new child is a perfect opportunity:

– parents have to buy a lot of stuff (without having too much money)

– at this stage they are more likely bound to brands

prediction of pregnancy is crucial for advertisement

(14)

Target and the pregnant teen - How?

remark: this is how one could do it, not necessarily how it was done.

ground truth?

• customers are described by their purchases

• goal: identify patterns typical for pregnant women

• first steps: identify purchase records

– of pregnant women (i.e. positive label, group P)

– of non-pregnant customers (i.e. negative label, group N)

searching for hints

• find commonalities within P

• find features distinguishing P from N

• build predictor: P(c ∈ P)

• (it is unknown, how exactly Target is doing this)
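One way such a predictor could be built is a naive Bayes classifier over the products in a purchase record. This is a swapped-in textbook technique chosen for illustration, since the method actually used by Target is unknown; the records, product names and function names below are all assumptions.

```python
import math

# Hypothetical labeled purchase records (sets of products).
P_records = [{"zinc", "calcium", "lotion"}, {"calcium", "magnesium"}, {"zinc", "lotion"}]
N_records = [{"bread", "beer"}, {"bread", "lotion"}, {"beer"}]

def train(P_records, N_records):
    """Estimate, per product, how often it occurs in group P and in group N."""
    products = set().union(*P_records, *N_records)

    def rates(records):
        # Laplace-smoothed fraction of records that contain each product.
        return {p: (sum(p in r for r in records) + 1) / (len(records) + 2)
                for p in products}

    prior = len(P_records) / (len(P_records) + len(N_records))
    return products, rates(P_records), rates(N_records), prior

def score(record, model):
    """Estimate P(c in P) for a customer c with the given purchase record."""
    products, rate_P, rate_N, prior = model
    log_P, log_N = math.log(prior), math.log(1 - prior)
    for p in products:
        if p in record:
            log_P += math.log(rate_P[p])
            log_N += math.log(rate_N[p])
        else:
            log_P += math.log(1 - rate_P[p])
            log_N += math.log(1 - rate_N[p])
    # Normalize the two joint scores into a probability for group P.
    m = max(log_P, log_N)
    a, b = math.exp(log_P - m), math.exp(log_N - m)
    return a / (a + b)

model = train(P_records, N_records)
likely = score({"zinc", "calcium"}, model)
unlikely = score({"bread"}, model)
```

The independence assumption behind naive Bayes is crude, but it shows how features that distinguish P from N turn into a score.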

(15)

Target and the pregnant teen - results

identified patterns

• quoting a Target analyst:

– they identified 25 products

– when analyzed together these allow a “pregnancy prediction score”

P(c ∈ P)

• example: pregnant women buy supplements like calcium, magnesium and zinc “sometime in the first 20 weeks”

business impact

• start of program: 2002

• revenue growth: $44 billion (2002) → $67 billion (2010)

• it is assumed that data mining was crucial for this growth

(16)

a second example: machine translation

the task

• automatic translation of text

• given: text T in language A

• result: text T′ in language B

example: Google’s translator

• URL: http://translate.google.de/

(17)

machine translation: a naive approach

word mappings

• hold a dictionary W : A → B

• replace each w ∈ T by W(w)

1. problem: words don’t match exactly between languages

2. problem: grammar
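Both problems show up immediately when the word mapping is tried on a single sentence. The sketch below uses a tiny hypothetical dictionary W (French → English); the entries and the function name are made up for illustration.

```python
# Hypothetical toy dictionary W : French -> English.
W = {"je": "I", "ne": "not", "vous": "you", "connais": "know", "pas": "not"}

def translate_word_by_word(text, dictionary):
    """Replace each word w of the text by W(w); unknown words are kept as-is."""
    words = text.lower().rstrip(".").split()
    return " ".join(dictionary.get(w, w) for w in words)

result = translate_word_by_word("Je ne vous connais pas.", W)
print(result)  # I not you know not
```

The output keeps the French word order and translates the two-part negation twice, so neither the vocabulary mismatch nor the grammar is handled.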

learning grammar

1. problem: grammar is hard, especially with semantics mixed in

– cf. Chomsky’s hierarchy of grammars

2. problem: language is noisy

(18)

machine translation: a statistical approach

learning from big data

• new approach: don’t understand or analyze

• instead: translation by example

• “examples” are taken from a corpus of manually translated documents

basic idea (roughly)

• learn probability P that T′ is translation of T

• find T′ with maximal P

• approach: breaking down probabilities

note: the following explains the principle and is not correct in every detail

based on: http://michaelnielsen.org/blog/introduction-to-statistical-machine-translation/

(19)

machine translation: breaking down probabilities

example: translate French text F to English text E

• P(E|F) - prob. that E is correct translation of F

• let F = f₁ f₂ . . . (each f_i a sentence, E analogous)

first splitting

• assumption: f_i corresponds to e_i

• E is correct, if each e_i translates its f_i:

P(E|F) = ∏_i P(e_i|f_i)

• try to maximize P(e_i|f_i)
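The product over sentences can be written down directly. The sketch below assumes the per-sentence probabilities P(e_i|f_i) are already known (the numbers are made up); the log version is added because products of many small probabilities underflow quickly in floating point.

```python
import math

def document_probability(sentence_probs):
    """P(E|F) as the product of the per-sentence probabilities P(e_i|f_i)."""
    return math.prod(sentence_probs)

def document_log_probability(sentence_probs):
    """Same quantity in log space: a sum of logs instead of a product."""
    return sum(math.log(p) for p in sentence_probs)

# Hypothetical values of P(e_i|f_i) for a text of three sentences.
probs = [0.9, 0.8, 0.5]
p = document_probability(probs)
log_p = document_log_probability(probs)
```

Since every factor is at most 1, raising any single P(e_i|f_i) can only raise the product, which is why each sentence can be optimized on its own.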

(20)

machine translation: breaking down probabilities

consider a concrete pair of sentences:

• Je ne vous connais pas. ↔ I don’t know you.

– Je - I

– vous - you

– connais - know

– ne . . . pas - don’t

some observations

• words are translated (Je → I)

• some words change place (vous → you)

• some words change “number” (e.g. ne . . . pas → don’t)

(21)

machine translation: breaking down probabilities

formalize our observations into concrete probabilities:

translation P(f|e): f is translation of e (Je → I)

distortion P(t|s, l): word at position t is replaced by word at position s in sentence of length l (vous → you)

fertility P(n|e): e is replaced by n French words (ne . . . pas → don’t)

(22)

machine translation: breaking down probabilities

how does this help for P(E|F)?

• recall assumption P(E|F) = ∏_i P(e_i|f_i)

• P(E|F) is high, if every P(e_i|f_i) is high

• same principle can be applied on the sentence level

breaking up sentences

• P(f_i, e_i) has many parts

– translation, distortion, fertility for every word

– some more, unknown

– combination by product (assuming independence)

• P(f_i, e_i) → 1 , if all the parts → 1

use translation, distortion, fertility as indicators
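For the example pair “Je ne vous connais pas.” ↔ “I don’t know you.”, the combination by product could look as follows. This is a sketch under strong assumptions: the word alignment is given by hand, all probability values are made up, and l is taken to be the length of the French sentence.

```python
import math

# Hypothetical probability tables; all numbers are made up for illustration.
translation = {("Je", "I"): 0.9, ("ne", "don't"): 0.5, ("pas", "don't"): 0.5,
               ("connais", "know"): 0.8, ("vous", "you"): 0.7}   # P(f|e)
distortion = {(1, 1, 5): 0.9, (2, 2, 5): 0.8, (5, 2, 5): 0.4,
              (4, 3, 5): 0.6, (3, 4, 5): 0.5}                     # P(t|s,l)
fertility = {(1, "I"): 0.9, (2, "don't"): 0.6,
             (1, "know"): 0.9, (1, "you"): 0.9}                   # P(n|e)

# Hand-made alignment: (English word, its position s, [(French word, position t)]).
alignment = [("I", 1, [("Je", 1)]),
             ("don't", 2, [("ne", 2), ("pas", 5)]),
             ("know", 3, [("connais", 4)]),
             ("you", 4, [("vous", 3)])]

def sentence_score(alignment, french_length):
    """Multiply fertility, translation and distortion terms (independence assumed)."""
    score = 1.0
    for e, s, french_words in alignment:
        score *= fertility[(len(french_words), e)]
        for f, t in french_words:
            score *= translation[(f, e)] * distortion[(t, s, french_length)]
    return score

score = sentence_score(alignment, french_length=5)
```

The score approaches 1 only if every single factor does, so each factor serves as an indicator of translation quality.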

(23)

machine translation: missing data/open questions

how are partial probabilities determined?

• estimation by observation

• recall: translation by example

• derive approximate probabilities by counting in corpus
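Counting in the corpus can be illustrated for the translation probability P(f|e). The sketch assumes that the corpus has already been reduced to word-aligned pairs, which is exactly the matching step not covered here; the pairs and the function name are hypothetical.

```python
from collections import Counter

# Hypothetical word-aligned pairs (English word, French word) from a corpus.
aligned_pairs = [
    ("I", "Je"), ("I", "Je"), ("I", "moi"),
    ("you", "vous"), ("you", "tu"), ("you", "vous"), ("you", "vous"),
]

def estimate_translation_probs(pairs):
    """Estimate P(f|e) as count(e aligned with f) / count(e)."""
    pair_counts = Counter(pairs)
    e_counts = Counter(e for e, _ in pairs)
    return {(f, e): n / e_counts[e] for (e, f), n in pair_counts.items()}

probs = estimate_translation_probs(aligned_pairs)
```

Here P(Je|I) is estimated as 2/3 and P(vous|you) as 3/4; with a larger corpus these relative frequencies become more stable.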

what is left

• basis: large corpus of translated documents

• additional: matching of sentences, words

• not considered here, further information:

http://www.mt-archive.info/

(24)

discussion

why does this work?

• it does not (translate a text into your native language and you’ll see)

translate.google.com

• still the quality of the results is surprising

does it scale?

• why is it not always correct?

• what would be the impact of adding more data?

• can it be parallelized?