Big Data and Scripting
(lecture, computer science, bachelor/master/phd)
Big Data and Scripting - abstract/organization
abstract
• introduction to “Big Data” and involved techniques
• lecture (2+2)
• practical exercises to be turned in
dates
• 2 lectures (Mon 1:30 pm, M628 and Thu 10 am G302)
• 2 lab courses (Fri 10:00 am and 1:30 pm in Z613)
• oral exam, end of semester
me
• Uwe Nagel
Big Data and Scripting - organizational stuff
exercises
• website:
http://www.inf.uni-konstanz.de/algo/lehre/ss13/bds/
• (about) 3 projects (bash, R, NOSQL/Hadoop)
• programming skills useful, but not required
• discussion and help in lab course (Friday)
agenda - contents of this lecture
prologue: What is “Big Data” and why bother?
• concrete examples
• identify qualitatively what sets “Big Data” approaches apart
tools and techniques for (distributed) computation
• (some) basic notions of data handling
• Unix command line
• scripting in R
• NOSQL by example
• the map/reduce paradigm (example: Hadoop)
What this lecture does not cover
basics of data mining
• we are using some dm-techniques
• this is not a data mining course
→ see the lecture “Data Mining: Artificial Intelligence”
recommender systems
• we will touch those without detail
→ see the seminar/lecture “Recommender Systems”
Prologue
what does Big Data mean and why is that interesting?
• Big Data and distributed computing seem like a fashion
• is there really an advantage?
• where does this advantage come from?
• 3 example applications
• increasing level of detail
What is “Big Data” and why bother?
a simple example - Amazon
• basically a selling platform
• provides:
– connection of suppliers to (private) customers
– a common market place (one interface for all)
– additional services (storage, shipment, payment)
– recommendation
what is the difference to competitors?
• Amazon knows customers, products, sales and views
• same is true for its competitors
• in comparison, Amazon has many more customers
• more customers, more transactions, more views → a larger data collection → better recommendations
estimate¹: 1/3 of Amazon’s sales are generated by recommendations
more data = better predictions?
• simple answer: essentially yes
• real answer: it’s a bit more complicated
¹ www.economist.com/blogs/graphicdetail/2013/02/elusive-big-data
What is “Big Data”? - extraction from examples
what are we trying to find out?
• learning/data mining and artificial intelligence are not that new
• somehow huge amounts of data can make a difference
• question: how and why ?
approach: analyze examples using big data
1. where is the big data?
2. what kind of data is involved?
3. what makes a large database crucial?
Target and the pregnant teen
Target
• a large discounter chain (similar to Walmart)
• uses data analysis for targeted marketing
• central to one of the most famous big data stories
the story
• Target predicts pregnancy better than family members
• source: www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/
Target and the pregnant teen - How?
in a nutshell
• collect data about customers
• predict what they are interested in
• adjust advertisement to the specific person
step 1: data collection
• create large base of data available about customers
• each customer gets some unique ID (credit card, email, . . . )
• everything that can be connected to the customer is
– collected
– connected to customer ID
– used for interest prediction
example of data to collect
• items purchased together
• time/place of purchase
• weather? - whatever can be collected
next: search for patterns
• simple: people buy what they always bought
• recommendation: “customers who bought this usually also buy . . . ”
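A minimal sketch of the "customers who bought this usually also buy" pattern, assuming purchases are available as baskets (lists of item names). The baskets and the function names `co_purchase_counts` and `also_bought` are made up for illustration; real recommender systems are considerably more involved.

```python
from collections import Counter
from itertools import combinations


def co_purchase_counts(baskets):
    """Count how often each unordered pair of items appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) gives each pair a canonical order and ignores duplicates
        for a, b in combinations(sorted(set(basket)), 2):
            counts[(a, b)] += 1
    return counts


def also_bought(item, baskets, top=3):
    """Return the items most often bought together with `item`."""
    partners = Counter()
    for (a, b), n in co_purchase_counts(baskets).items():
        if a == item:
            partners[b] += n
        elif b == item:
            partners[a] += n
    return [other for other, _ in partners.most_common(top)]


# Made-up baskets, purely for illustration.
baskets = [
    ["diapers", "wipes", "beer"],
    ["diapers", "wipes"],
    ["beer", "chips"],
    ["diapers", "wipes", "lotion"],
]
print(also_bought("diapers", baskets))
```

In this toy data, "wipes" shares three baskets with "diapers" and is therefore ranked first.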
concrete targeting, example: young parents
• a new child is a perfect opportunity:
– parents have to buy a lot of stuff (without having too much money)
– at this stage they are more likely to become bound to brands
prediction of pregnancy is crucial for advertisement
remark: this is how one could do it, not necessarily how it was done.
ground truth?
• customers are described by their purchases
• goal: identify patterns typical for pregnant women
• first steps: identify purchase records
– of pregnant women (i.e. positive label, group P)
– of non-pregnant customers (i.e. negative label, group N)
searching for hints
• find commonalities within P
• find features distinguishing P from N
• build predictor: P(c ∈ P)
• (it is unknown how exactly Target does this)
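One way such a predictor could be built is a naive Bayes score over purchased products. This is only a sketch under stated assumptions: each customer is a set of product names, products are treated as independent, and the data below is invented. It is not a description of Target's actual method, which is unknown.

```python
import math


def train(pos_records, neg_records, smoothing=1.0):
    """Estimate per-product purchase rates in groups P and N, plus the prior P(c in P).

    Both groups must be non-empty, otherwise the prior is 0 or 1.
    """
    products = set().union(*pos_records, *neg_records)

    def rate(records, product):
        hits = sum(product in r for r in records)
        # Laplace smoothing keeps every rate strictly between 0 and 1
        return (hits + smoothing) / (len(records) + 2 * smoothing)

    model = {p: (rate(pos_records, p), rate(neg_records, p)) for p in products}
    prior = len(pos_records) / (len(pos_records) + len(neg_records))
    return model, prior


def score(record, model, prior):
    """Naive Bayes estimate of P(c in P) for a customer with purchase set `record`."""
    log_odds = math.log(prior / (1 - prior))
    for product, (p_pos, p_neg) in model.items():
        if product in record:
            log_odds += math.log(p_pos / p_neg)
        else:
            log_odds += math.log((1 - p_pos) / (1 - p_neg))
    return 1 / (1 + math.exp(-log_odds))


# Invented purchase records: group P (positive label) and group N (negative label).
P = [{"zinc", "lotion"}, {"zinc", "calcium"}, {"lotion", "calcium"}]
N = [{"beer"}, {"chips", "beer"}, {"lotion"}]
model, prior = train(P, N)
print(score({"zinc", "calcium"}, model, prior))
print(score({"beer"}, model, prior))
```

Products that are frequent in P but rare in N push the score towards 1, which corresponds to "features distinguishing P from N".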
Target and the pregnant teen - results
identified patterns
• quoting a Target analyst:
– they identified 25 products
– when analyzed together, these allow a “pregnancy prediction score” P(c ∈ P)
• example: pregnant women buy supplements like calcium, magnesium and zinc “sometime in the first 20 weeks”
business impact
• start of program: 2002
• revenue growth: $44 billion (2002) → $67 billion (2010)
• it is assumed that data mining was crucial for this growth
a second example: machine translation
the task
• automatic translation of text
• given: text T in language A
• result: text T′ in language B
example: Google’s translator
• URL: http://translate.google.de/
machine translation: a naive approach
word mappings
• hold a dictionary W : A → B
• replace each w ∈ T by W (w )
problem 1: words don’t match exactly between languages
problem 2: grammar
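The naive approach can be sketched in a few lines. The dictionary `W` below is a tiny made-up example and `translate_word_by_word` is a hypothetical helper name; the point is only to make both problems visible.

```python
# Tiny made-up French-to-English dictionary W : A -> B.
W = {"je": "I", "ne": "not", "vous": "you", "connais": "know", "pas": "not"}


def translate_word_by_word(text, dictionary):
    """Replace each word w in the text by dictionary[w]; unknown words are kept."""
    words = text.lower().rstrip(".").split()
    return " ".join(dictionary.get(w, w) for w in words)


print(translate_word_by_word("Je ne vous connais pas.", W))
# prints: I not you know not
```

The output shows both issues at once: "ne ... pas" has no single-word counterpart, and the word order follows French grammar instead of English.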
learning grammar
problem 1: grammar is hard, especially with semantics mixed in
– cf. Chomsky’s hierarchy of grammars
problem 2: language is noisy
machine translation: a statistical approach
learning from big data
• new approach: don’t understand or analyze
• instead: translation by example
• “examples” are taken from a corpus of manually translated documents
basic idea (roughly)
• learn probability P that T′ is translation of T
• find T′ with maximal P
• approach: breaking down probabilities
note: the following explains the principle and is not correct in every detail
based on: http://michaelnielsen.org/blog/introduction-to-statistical-machine-translation/
machine translation: breaking down probabilities
example: translate French text F to English text E
• $P(E \mid F)$: probability that E is a correct translation of F
• let $F = f_1 f_2 \dots$ (each $f_i$ a sentence; E analogously)
first splitting
• assumption: $f_i$ corresponds to $e_i$
• E is correct if each $e_i$ translates its $f_i$:
$$P(E \mid F) = \prod_i P(e_i \mid f_i)$$
• try to maximize each $P(e_i \mid f_i)$
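Under this independence assumption, maximizing P(E|F) reduces to picking the best candidate for every sentence separately. A minimal sketch, assuming the per-sentence probabilities P(e_i|f_i) are already given (the numbers and candidate sentences below are invented):

```python
import math


def document_probability(sentence_probs):
    """P(E|F) as the product of per-sentence probabilities P(e_i|f_i).

    All probabilities must be > 0; summing logs avoids underflow for long texts.
    """
    return math.exp(sum(math.log(p) for p in sentence_probs))


def best_translation(candidates_per_sentence):
    """For each f_i, pick the candidate e_i with the highest P(e_i|f_i)."""
    return [max(cands, key=cands.get) for cands in candidates_per_sentence]


# Invented candidates: one dict {candidate e_i: P(e_i|f_i)} per French sentence f_i.
candidates = [
    {"I don't know you.": 0.6, "I not know you.": 0.1},
    {"Good evening.": 0.7, "Good night.": 0.2},
]
best = best_translation(candidates)
print(best)
print(document_probability([candidates[i][e] for i, e in enumerate(best)]))
```

Because the product has no cross terms between sentences, the per-sentence maxima together give the maximal product.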
consider a concrete pair of sentences:
• Je ne vous connais pas. ↔ I don’t know you.
– Je - I
– vous - you
– connais - know
– ne . . . pas - don’t
some observations
• words are translated (Je → I)
• some words change place (vous → you)
• some words change “number” (e.g. ne . . . pas → don’t)
formalize our observations into concrete probabilities:
• translation P(f|e): f is the translation of e (Je → I)
• distortion P(t|s, l): the word at position t is replaced by the word at position s in a sentence of length l (vous → you)
• fertility P(n|e): e is replaced by n French words (ne . . . pas → don’t)
how does this help for P(E|F)?
• recall assumption $P(E \mid F) = \prod_i P(e_i \mid f_i)$
• $P(E \mid F)$ is high if every $P(e_i \mid f_i)$ is high
• same principle can be applied on the sentence level
breaking up sentences
• $P(e_i \mid f_i)$ has many parts
– translation, distortion, fertility for every word
– some more, unknown
– combination by product (assuming independence)
• $P(e_i \mid f_i) \to 1$ if all the parts $\to 1$
use translation, distortion, fertility as indicators
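The combination by product can be sketched for one sentence pair. This is a simplification under stated assumptions: a word alignment is given, all probability tables are invented, and the function name `sentence_score` is hypothetical. Like the slides, it shows the principle and is not correct in every detail.

```python
import math


def sentence_score(english, french, alignment, t_prob, d_prob, n_prob):
    """Multiply translation, distortion and fertility probabilities for one pair.

    alignment[s] lists the French positions t generated by the English word
    at position s; l is the length of the English sentence.
    """
    l = len(english)
    log_score = 0.0
    for s, e in enumerate(english):
        targets = alignment.get(s, [])
        log_score += math.log(n_prob[(len(targets), e)])   # fertility P(n|e)
        for t in targets:
            log_score += math.log(t_prob[(french[t], e)])  # translation P(f|e)
            log_score += math.log(d_prob[(t, s, l)])       # distortion P(t|s,l)
    return math.exp(log_score)


english = ["I", "don't", "know", "you"]
french = ["Je", "ne", "vous", "connais", "pas"]
alignment = {0: [0], 1: [1, 4], 2: [3], 3: [2]}

# Invented probability tables, for illustration only.
t_prob = {("Je", "I"): 0.9, ("ne", "don't"): 0.5, ("pas", "don't"): 0.5,
          ("connais", "know"): 0.8, ("vous", "you"): 0.7}
n_prob = {(1, "I"): 0.9, (2, "don't"): 0.8, (1, "know"): 0.9, (1, "you"): 0.9}
d_prob = {(0, 0, 4): 0.9, (1, 1, 4): 0.6, (4, 1, 4): 0.3,
          (3, 2, 4): 0.5, (2, 3, 4): 0.5}

print(sentence_score(english, french, alignment, t_prob, d_prob, n_prob))
```

The score approaches 1 only if every single factor does, which is why the three quantities can serve as indicators.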
machine translation: missing data/open questions
how are partial probabilities determined?
• estimation by observation
• recall: translation by example
• derive approximate probabilities by counting in corpus
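Estimation by counting can be sketched as relative frequencies. The example assumes that word pairs have already been matched between translated sentences; the pairs below are invented and `estimate_translation_probs` is a hypothetical name.

```python
from collections import Counter


def estimate_translation_probs(aligned_word_pairs):
    """Estimate P(f|e) as count(f, e) / count(e) from a list of (f, e) pairs."""
    pair_counts = Counter(aligned_word_pairs)
    e_counts = Counter(e for _, e in aligned_word_pairs)
    return {(f, e): n / e_counts[e] for (f, e), n in pair_counts.items()}


# Invented word pairs (French word, English word) observed in a tiny "corpus".
pairs = [("Je", "I"), ("Je", "I"), ("moi", "I"), ("vous", "you")]
probs = estimate_translation_probs(pairs)
print(probs[("Je", "I")])
```

With more observed pairs the relative frequencies become more reliable, which is where the size of the corpus enters.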
what is left
• basis: large corpus of translated documents
• additional: matching of sentences, words
• not considered here, further information:
http://www.mt-archive.info/
discussion
why does this work?
• it does not (translate a text into your native language at translate.google.com and you’ll see)
• still, the quality of the results is surprising
does it scale?
• why is it not always correct?
• what would be the impact of adding more data?
• can it be parallelized?