Large-Scale Text Mining

(1)

SIAM Conference on Data Mining Text Mining 2010

Alan Ratner, Northrop Grumman Information Systems


(2)


Aim

• Identify topic and language/script/coding of real-world informal text at highest speed possible

• Informal text

– Blogs, posts, tweets

– Don’t necessarily follow conventional rules of spelling & grammar

– Transliterated language usually ad hoc

– In typical web documents there are far more bytes of HTML & JavaScript than content, making everything look like English. The markup can be parsed out, but doing so is time-consuming.

– Does not look like newswire (written by journalists, rich in named entities, summarized in first paragraph)

• High speed

– Want to process documents as quickly as possible

– Trillions of web pages

– Gigabytes per second speed desired

(3)

What is Text?

(4)


6 Fundamental Questions in Mining Text

• Detection

1. Does a document contain text in any language? (Or is it audio, video, …?)

2. If so, does the text have a topic? (41% of tweets are “pointless babble”.)

• Clustering

3. Which documents are in the same (but not necessarily known) language?

4. Which documents are on the same (but not necessarily known) topic?

• Identification

5. What is the language? (Is it a language the system has been trained to recognize?)

6. What is the topic? (Is it a topic the system has been trained to recognize?)

(5)

Specific Goals

• Identify specific language(s) of documents and identify documents on specific topics

• Accuracy requirements

– False negatives OK - users don’t know what has been missed

– Precision needs to be high enough so we don’t annoy users with lots of false alarms

– Extremely low false positive rate for topic id (<< 1 ppm)

• Speed requirement

– Fast hardware or software; simple algorithms

• Ideally use same algorithm for both language and topic id

• Language-neutral

– Work on all languages, including Asian languages that do not parse words with spaces
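One way a token set can stay language-neutral is to use byte N-grams, which need no word boundaries. The sketch below is only an illustration of that idea, not the tokenization used in this work; the function name `byte_ngrams`, the choice of UTF-8 and the value N=3 are all assumptions.

```python
from collections import Counter

def byte_ngrams(text, n=3):
    """Count overlapping byte N-grams of the UTF-8 encoding of text.

    No whitespace or word segmentation is needed, so the same code
    handles scripts that do not separate words with spaces.
    """
    data = text.encode("utf-8")
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

# Hypothetical inputs: one space-delimited string, one without spaces.
english = byte_ngrams(" the cat and the dog ")
japanese = byte_ngrams("日本語のテキスト")

# " th" occurs twice in the English sample.
print(english[b" th"])
# Total number of trigrams extracted from the Japanese sample.
print(sum(japanese.values()))
```

Byte N-grams trade vocabulary size for generality: the token set grows quickly with N, so in practice only the most informative N-grams would be kept.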

(6)

Text Analysis Algorithms

NORTHROP GRUMMAN PRIVATE / PROPRIETARY LEVEL I

• Many algorithms have been used for topic id

– Bayesian

– Markov

– Markov Orthogonal Sparse Bi-word

– Hyperspace

– Correlative

– Entropy (Optimal Compressor) (longest string match)

– Minimum Description Length

– Term Frequency*Inverse Document Frequency

– Morphological

– Centroid-based

– Logistic Regression (similar to SVM and single-layer NN)

• Most can be expressed using additive weights of detected tokens

• In the domain of interest (low false alarms on informal text), logistic regression worked best and is computationally efficient

(7)

Additive Weight Algorithms

• Training

– Define & select tokens (e.g., words, words with spaces before & after, phrases, N-grams)

– Assign weights (for logistic regression, weights range from roughly -1 to +1)

• Testing

– Detect tokens

– Add each detected token's weight to a running sum (S)

– For each document, convert weight sum to likelihood score & compare to threshold

• P(on-topic or on-language) = 1/(1+e^-S)
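The testing step above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the token strings, their weights and the 0.85 threshold are hypothetical values chosen for the example, and `topic_probability` is not a name from the original system.

```python
import math

def topic_probability(detected_tokens, weights):
    """Sum the weights of detected tokens into S, then return 1/(1+e^-S)."""
    s = sum(weights.get(token, 0.0) for token in detected_tokens)
    # Two algebraically equal forms avoid overflow for large |S|.
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

# Hypothetical weights in the rough range -1 to +1.
weights = {" goal ": 0.8, " match ": 0.6, " recipe ": -0.7}

# Tokens are listed once per detection, so repeats add their weight again.
p = topic_probability([" goal ", " match ", " goal "], weights)
threshold = 0.85  # assumed value; a real threshold is tuned to the false-alarm target
print(p, p > threshold)
```

With no detected tokens S is 0 and the score is 0.5, so in this sketch the threshold alone decides how much positive evidence a document needs.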

(8)

Token Detection

• In hardware

– Load tokens and ternary bit masks into CAM (Content Addressable Memory)

– Stream data through CAM to automatically identify token(s)

• In software

– Aho-Corasick algorithm creates a large state machine

• e.g., 50K tokens with 130K states

– “Terminal” states indicate detection of a token

– For each byte of data

• Next_state = TableLookup[Previous_state][New_byte]

• If Terminal_state[Next_state] == TRUE then retrieve the token's weight and add it to the sum (S)

– Execution time is relatively independent of number of tokens or average length of tokens
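The per-byte loop above can be sketched as follows. This is an illustrative Aho-Corasick construction with a dense 256-entry row per state, not the code used in this work; the function names and the token weights are assumptions, and the four tokens are the ones from the state-machine slide.

```python
from collections import deque

def build_automaton(token_weights):
    """Return (table, terminal, weight) for a dict of token -> weight."""
    children = [{}]          # trie edges: children[state][byte] -> state
    terminal = [False]
    weight = [0.0]
    for token, w in token_weights.items():
        state = 0
        for byte in token.encode("utf-8"):
            if byte not in children[state]:
                children.append({})
                terminal.append(False)
                weight.append(0.0)
                children[state][byte] = len(children) - 1
            state = children[state][byte]
        terminal[state] = True
        weight[state] += w

    n = len(children)
    fail = [0] * n                           # longest proper suffix state
    table = [[0] * 256 for _ in range(n)]    # table[state][byte] -> state
    queue = deque()
    for byte, child in children[0].items():
        table[0][byte] = child
        queue.append(child)
    while queue:                             # breadth-first over the trie
        state = queue.popleft()
        f = fail[state]
        if terminal[f]:                      # a shorter token ends here too
            terminal[state] = True
            weight[state] += weight[f]
        for byte in range(256):
            child = children[state].get(byte)
            if child is None:
                table[state][byte] = table[f][byte]
            else:
                fail[child] = table[f][byte]
                table[state][byte] = child
                queue.append(child)
    return table, terminal, weight

def weight_sum(data, table, terminal, weight):
    """One table lookup per byte; add the weight at each terminal state."""
    state, total = 0, 0.0
    for byte in data:
        state = table[state][byte]
        if terminal[state]:
            total += weight[state]
    return total

# Hypothetical weights for the four tokens of the example state machine.
tokens = {" the ": 0.25, " then ": 0.125, " they ": 0.0625, " and ": 0.5}
table, terminal, weight = build_automaton(tokens)
total = weight_sum(b" the cat and then they left ", table, terminal, weight)
print(len(table), total)
```

The scan does the same amount of work per byte whatever the token set, which is the property the slide describes; the cost of a larger token set shows up as table memory, since this sketch stores 256 entries per state.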

(9)

4-Token State Machine

[Figure: state machine for the 4 tokens “ the ”, “ then ”, “ they ” and “ and ” (sp = space). From a state, transition to the orange, green or pink state if the next byte is a space, t or a; else transition to the blue state.]

(10)

Solving Text Analysis Problems using Hadoop

• Hadoop is a framework for writing and running applications that process vast amounts of data in parallel on large clusters (up to thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

• Hadoop is free/open-source software that emulates Google’s proprietary MapReduce

– The master node takes the input, chops it up into smaller sub-problems, and distributes those to slave nodes where the “Map” tasks ingest and transform the input

– The “Reduce” task(s) then aggregate or summarize the Map output to deliver the final output.


(11)

Hadoop Software

• Hadoop will run on anything from a laptop to a vast cluster of computers

• Software packages work with Hadoop to provide:

− scalable distributed file systems and data warehouses (HBase, CloudBase)

− data summarization, ad hoc querying, scripting (Pig, Hive)

− massive matrix math, graph computation, machine learning, social network analysis (Hama, Mahout, X-Rime, Pegasus)

Program stacks:

− Windows/Cygwin/Hadoop

− Windows/VM/Linux/Hadoop

− Linux/Hadoop

− Linux/VM/Linux/Hadoop

(12)

Hadoop Data Flow with 1 Reduce

[Figure: data flow from Input File 1 and Input File 2 through Map and Combine on Slave 1 and Slave 2 to Reduce]

(13)

Hadoop/MapReduce

• Main/Run code defines interfaces & loads globals into the distributed cache

• Your Map, Combine and Reduce each have their own Configure code

• Hadoop automatically distributes blocks of data to slave nodes and then lines of text to Maps

• Your Map code transforms one line of text & outputs KVPs

• Hadoop automatically sorts and groups outputs of 1 node’s Maps by key

• Your Combine code transforms sorted & grouped output of all Maps for one node & outputs KVPs

• Hadoop automatically sorts and groups outputs of all Combines by key

• Your Reduce code summarizes sorted and grouped output of the Combines & outputs KVPs

• One output file from each Reduce

KVP = Key/Value pair
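The Map, Combine and Reduce stages described on this slide can be imitated in a single process to show how key/value pairs move between them. The sketch below does not use the Hadoop API; the word-count job, the function names and the two-node input are assumptions made for the example, and `sort_and_group` stands in for the sorting and grouping that Hadoop performs automatically.

```python
from itertools import groupby

def sort_and_group(pairs):
    """Stand-in for Hadoop's automatic sort and group by key."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=lambda kv: kv[0])]

def map_line(line):
    """Map: transform one line of text into key/value pairs."""
    return [(word, 1) for word in line.split()]

def combine(key, values):
    """Combine: pre-aggregate the Map output of one node."""
    return (key, sum(values))

def reduce_counts(key, values):
    """Reduce: summarize the grouped output of all Combines."""
    return (key, sum(values))

def run_job(lines_by_node):
    combined = []
    for lines in lines_by_node.values():
        # Each node maps its own lines, then combines locally.
        mapped = [kv for line in lines for kv in map_line(line)]
        combined += [combine(k, vs) for k, vs in sort_and_group(mapped)]
    # A single Reduce sees the output of every Combine, grouped by key.
    return dict(reduce_counts(k, vs) for k, vs in sort_and_group(combined))

# Hypothetical input: lines of text already distributed to two slave nodes.
lines_by_node = {
    "slave1": ["the cat and the dog"],
    "slave2": ["then the bird"],
}
result = run_job(lines_by_node)
print(result)
```

The Combine step reduces the volume of pairs that must travel between nodes: here slave1 sends one pair for "the" instead of two.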

(14)


Original Hardware-Based Text Analyzer

• High speed solution with expensive special hardware

• HW limitations (# & length of tokens, wildcarding in ternary CAM)

• Few people with FPGA/VHDL skills

(15)

New Improved Hadoop Text Analyzer

• High speed software solution on generic hardware

• Enabled use of a very sophisticated detection algorithm (Aho-Corasick trie, as in “information reTRIEval”); unlimited token length; speed relatively independent of number of tokens

• Many people with Java skills

(16)

Cluster Configuration

[Figure: cluster diagram with Master & Slave 1, Slave 2, Slave 3 through Slave N]

• Master finds slaves using IPs in Host Table

• Each server may host more than one slave

• Each slave may run many Maps

• Cluster may have 1 or many Reduces

• Our servers each have 2 quad-core Nehalem Xeons, 24-48 GB RAM and four 1 TB drives

• Per rack: 328 cores, 1 TB RAM, 164 TB drives, 30 kW, 1700 pounds

(17)

Language Identification Results

• Performance varies

– No standard data set or procedure for testing language identification

– Worked very well overall except on documents with just a few words

– Mutually intelligible languages such as Dutch/Afrikaans, Indonesian/Malay, or Norwegian/Swedish are harder to distinguish than dissimilar languages

– Because relatively few tokens (the most common words in informal language) are used for each language (7 for Hindi, 55 for English, 95 for Spanish), it is possible to construct difficult documents

– Could not distinguish random words from real language

• Language is defined as words and grammar

(18)

Topic Identification Results

Based on incremental content of newsgroup posts (quoted prior posts and metadata such as newsgroup, thread and author removed).

(19)

The Bottom Line

• Text analysis readily performed on a cluster of commodity computers using Hadoop

• Comparison between software and hardware solutions

– In hardware achieved 2.5 Gb/s real-time throughput (but most such links operate at a fraction of their capacity)

– In software achieved 1.3 Gb/s off-line throughput on a cluster of 64 ancient servers (on new but unoptimized servers 1.6 Gb/s)

– In a properly configured Hadoop cluster performance scales linearly

• Elapsed time = 15-20 seconds + C/(number of servers)
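The elapsed-time model above can be evaluated directly to see where adding servers stops paying off. In this sketch the value of C (6400 server-seconds) is a made-up figure for illustration, the fixed term is taken as 15 seconds from the low end of the stated range, and `elapsed_seconds` is a hypothetical helper.

```python
def elapsed_seconds(work_c, servers, startup=15.0):
    """Elapsed time = fixed startup + C / (number of servers)."""
    if servers < 1:
        raise ValueError("need at least one server")
    return startup + work_c / servers

# Hypothetical job: C = 6400 server-seconds of work.
for servers in (1, 64, 128, 1024):
    print(servers, elapsed_seconds(6400, servers))
```

The C/(number of servers) term shrinks linearly, but the fixed 15-20 second term does not, so for short jobs the startup cost dominates once the cluster is large.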

(20)

