OLSUG Workshop Oracle Data Mining

(1)

(2)

OLSUG Workshop

Oracle Data Mining

Charlie Berger

Sr. Director of Product Mgmt, Life Sciences and Data Mining Oracle Corporation

Carolyn K. Hamm, Ph.D. Chief, Decision Support Center Walter Reed Army Medical Center Washington, DC 20307

202 356-1012 x 40166 [email protected]

[email protected] Dr. Lutz Hamel

Asst. Professor, Computer Science University of Rhode Island

(3)

Oracle Data Mining Workshop

• Oracle Data Mining overview

• Data mining process & example use cases

• Explore, build, test, cluster, etc.

(4)

Oracle Data Mining

• Platform for data mining

• PL/SQL API • Java API

• Oracle Data Miner (GUI)

• Wide range of algorithms

• Classification

• Support Vector Machines, Naïve Bayes,

Adaptive Bayes Networks

• Attribute Importance • Association Rules • Clustering

• Enhanced K-Means, Orthogonal Clustering

• Nonnegative Matrix Factorization

(feature extraction)

(5)

Oracle Data Mining

Algorithms & Example Applications

• Attribute Importance

• Identify most influential attributes for a target attribute

• Factors associated a disease • Promising leads

• Classification and Prediction

• Predict most likely to:

• Doctors who prescribe a new drug • Patients who respond to a treatment

• Regression

• Predict a numeric value • Predict a value

(6)

Oracle Data Mining

• Clustering

• Find naturally occurring groups

• Gene clusters

• Find disease subgroups

• Distinguish normal from non-normal behavior

• Association Rules

• Find co-occurring items

• Suggest interactions

• Feature Extraction

• Reduce a large dataset into representative new attributes

(7)

Oracle Data Mining

• Text Mining

• Combine data and text for better models

• Add unstructured text e.g. physician’s notes to

structured data e.g. age, weight, height, etc., to predict outcomes

• Classify and cluster documents

• Combined with Oracle Text to develop advanced text mining applications e.g. Medline

• BLAST

ATGCAATGCCAGGATTTCCA

CTGCAAGGCCAGGAAGTTCCA

ATGCGTTGCCAC…ATTTCCA

(8)

10g

Statistics & SQL Analytics

• Ranking functions

• rank, dense_rank, cume_dist, percent_rank, ntile

• Window Aggregate functions

(moving and cumulative)

• Avg, sum, min, max, count, variance, stddev, first_value, last_value

• LAG/LEAD functions

• Direct inter-row reference using offsets

• Reporting Aggregate functions

• Sum, avg, min, max, variance, stddev, count, ratio_to_report

• Statistical Aggregates

• Correlation, linear regression family, covariance

• Linear regression

• Fitting of an ordinary-least-squares regression line to a set of number pairs.

• Frequently combined with the COVAR_POP, COVAR_SAMP, and CORR functions.

• Descriptive Statistics

• average, standard deviation, variance, min, max, median

(via percentile_count), mode, group-by & roll-up

• DBMS_STAT_FUNCS: summarizes numerical columns of a table and returns count, min, max, range, mean, stats_mode, variance, standard deviation, median, quantile values, +/- 3 sigma values, top/bottom 5 values

• Correlations

• Pearson’s correlation coefficients, Spearman's and Kendall's (both nonparametric).

• Cross Tabs

• Enhanced with % statistics: chi squared, phi coefficient, Cramer's V, contingency coefficient, Cohen's kappa

• Hypothesis Testing

• Student t-test , F-test, Binomial test, Wilcoxon Signed Ranks test, Chi-square, Mann Whitney test, Kolmogorov-Smirnov test, One-way ANOVA

• Distribution Fitting

• Kolmogorov-Smirnov Test, Anderson-Darling Test, Chi-Squared Test, Normal, Uniform, Weibull, Exponential

(9)

Statistics

• Enables analytic pipelines without removing data

to statistical packages for simple analyses

(10)

Workshop Outline

• Explore the data

• View data, simple graphs, ranges, etc.

• Cluster the data (undirected) & look for interesting patterns

• Determine problem to be solved

• What factors are associated with target 1, target 2, etc. • Predict patients likely to respond to treatment

• Data transformations

• Building Models

• Classification Models • Build, Test, Apply

• Throw out attributes e.g. 100% correlations etc. • Classification Models w/ unstructured data (text)

• Mining Activity Guides

• 10gR2 Preview

• Decision Trees (10gR2) • Anomaly Detection (10gR2)

(11)

Explore the data

• View data, simple graphs, ranges, etc.

• Lymphoma_7 data • Bpress default

• Relative value data (WRMC)

• Cluster the data

(undirected) looking for interesting patterns

(12)

Determine problem

• State the problem in terms of data mining

• What factors are associated with target 1, target 2, etc. • Predict patients likely to respond to treatment

(13)

Building Models

• Classification

• Build, Test, Apply

• Use SVM on Brain Tumor data w/o TEXT and SVM on Brain Tumor w/ TEXT

• Throw out attributes e.g. 100% correlations etc. • Use Diabetes data • Classification Models w/ unstructured data (text)

(14)

Mining Activity Guides

• Step by step guidance to achieve a goal and increase the likelihood of successful data mining

(15)

“The future of data mining lies

in predictive analytics.”

The Future of Data Mining – Predictive Analytics

Article published in DM Review Magazine August 2004 Issue

By Lou Agosta

(16)

What is “Predictive Analytics”?

• “One click data mining”

• Automatically selects “appropriate” algorithm • Automates all advanced algorithm settings • Automates Train, Test, and Apply steps

• Power data analysts can use

• Oracle Data Miner (wizard driven gui) • PL/SQL API

• Java API

• Concept of performing predictive analytics is

better than doing nothing

(17)

Oracle Data Mining

Attribute Importance

• Identify most influential attributes for a target attribute

• Factors associated a disease • Promising leads

Classification and Prediction

• Predict most likely to:

• Doctors who prescribe a new drug • Patients who respond to a treatment

• Regression

• Predict a numeric value

• Predict a value

• Predict the size tumor reduction

A1 A2 A3 A4 A5 A6 A7

Explain PA

“Easy Button” Attribute Importance

Predict PA

(18)

Life Sciences

Oracle Data Miner 10g

(19)

Age

Status

Temp Gender Days ICU

>45 <45

No Infection Infection >35 <=35

<100 >100 M >4

Age

Oracle Data Mining

Decision Trees

Risk = 0 Risk = 1 Risk = 1 Risk = 0

<=4 Risk = 1 F Risk = 0 Problem: Find profiles of high risk patients • Decision Trees • Popular algorithm • Human readable “rules” • Builds classification trees in Database • Parallel

(20)

Oracle Data Mining

10g Release 2 New Features

Problem: Detect rare cases

• Anomaly Detection

• “One-Class” Classification • Builds SVM classification

models where only one class e.g. “0’s” exists

• Network intrusion detection • Disease outbreaks

• Outlier detection

• Rare events, true novelty X1

(21)

Oracle Data Mining

10g Release 2 New Features

(Continued)

• Oracle Predictive Analytics PL/SQL Packages (Available now on OTN)

• EXPLAIN and PREDICT PL/SQL packages completely automate data mining

• Oracle Spreadsheet Add-In for Predictive Analytics on OTN

• http://www.oracle.com/technology/products/bi/odm/pa-addin/odm_pred_analytics_addin.htmlPrediction Operator SQL-Level Data Mining Capability

• Prediction Operator SQL-Level Data Mining Capability

• Fast, SQL-level data mining prediction (“Apply”) functions that can be used to “pipeline” predictions e.g.

• “Select customers where Churner_predicted >.80 AND Customer_value_prediction > $500 AND

Response_likehood > .6”

• Java Data Mining (JDM) Compliant Java API

• Oracle Database 10g R2 provides a Java Data Mining (JDM) JSR-73 compliant Java API

(22)

Oracle Data Mining

10g Release 2

• Updated Oracle Data Miner (GUI)

• Ability to mine text column • Anomaly detection • Decision Trees • Predictive Analytics

(“one click” data mining)

(23)

Oracle Data Mining

10g Release 2

(24)

Oracle Data Mining

10g Release 2

(25)

Q U E S T I O N S

A N S W E R S

(26)

(27)

Additional Life Sciences Use

Case Slides

(28)

Life Sciences Use Cases

1. Gene expression analysis

2. Clinical treatment outcome analysis

3. Classification of Multiple Tumor Types

(29)

Oracle Data Mining in the Life Sciences

Gene expression analysis

• Problem 1

• Given thousands of gene expression values for each patient, can a small subset of the

expressions be identified that can be used to distinguish one type of leukemia from another?

• Solution

• Apply ODM’s Attribute Importance algorithm to the data to decrease the size of the problem

• Build an Adaptive Bayes Network Classification model to predict disease type from the gene

(30)

Oracle Data Mining in the Life Sciences

Top Genes (of ~7000) for Classifying Leukemia

Gene Expression Relative Importance

V00594_s_at 0.298955976210004 D43950_at 0.292217965904811 U34038_at 0.227177556507829 J03827_at 0.227177556507829 U64863_at 0.227177556507829 S85655_at 0.175469338594625 L07758_at 0.17031674247889 U19345_at 0.17031674247889 U89336_cds4_at 0.125995412839 U79295_at 0.125995412839 HG311-HT311_at 0.125995412839 V00599_s_at 0.125995412839

(31)

Oracle Data Mining in the Life Sciences

ABN Model Predictions

Lymphoid Leukemia vs. Myeloid Leukemia

Predicted LL ML Actual LL 19 1

ML 2 12

(32)

Oracle Data Mining in the Life Sciences

Clinical treatment outcome analysis

• Problem 2

• Is it possible to classify treatments that are most effective in causing improvement in clinical

patients suffering from a given disease?

• Solution

• Use Attribute Importance to rank the treatment factors

• Use Association Rules to establish correlations between treatment and outcome

Source: Walter Reed Medical Center, Dr. Carolyn Hamm, presentation at Oracle Life Sciences User Group Meeting, June 2004

(33)

Oracle Data Mining in the Life Sciences

Factors associated with positive diabetes outcomes

1. DRUG_TYPE 2. COMPLETE_HISTORY_RECORDED (Scorecard) 3. NUM_HOSPITAL_ADMISSIONS 4. GENDER 5. NUM_VISITS_TO_PROVIDER 6. INSURANCE_TYPE 7. BLOOD_PRESSURE_GOAL 8. WEIGHT_GOAL 9. LDL_GOAL 10.PROVIDER_TYPE

(34)

Oracle Data Mining in the Life Sciences

Sample Association Rules

Source: Walter Reed Medical Center, Dr. Carolyn Hamm, presentation at Oracle Life Sciences User Group Meeting,

If Then OUTCOME Percentage of Cases

NUM_HOSPITAL_ADMISSIONS=0 NO_IMPROVEMENT 0.5463472 NUM_VISITS_TO_PROVIDER>5 IMPROVEMENT 0.37195602 NUM_HOSPITAL_ADMISSIONS =0 and NUM_VISITS_TO_PROVIDER>5 IMPROVEMENT 0.36252946 DRUG_GROUP=2 and NUM_HOSPITAL_ADMISSIONS =0 NO_IMPROVEMENT 0.3267871 DRUG_GROUP=2 and NUM_VISITS_TO_PROVIDER>5 NO_IMPROVEMENT 0.30911234 NUM_HOSPITAL_ADMISSIONS =0 and GENDER=FEMALE NO_IMPROVEMENT 0.28711703 COMPLETE_HISTORY_RECORDED=NO NO_IMPROVEMENT 0.28436762 NUM_HOSPITAL_ADMISSIONS =0 and COMPLETE_HISTORY_RECORDED=Yes IMPROVEMENT 0.22663

(35)

Oracle Data Mining in the Life Sciences

Classification of Multiple Tumor Types

DNA Microarray Data

Oracle Data Mining Actual\Predicted BR PR LU CO LY BL ML UT LE RE PA OV MS BR BREAST-BR 1 1 PROSTATE-PR 1 1 LUNG-LU 1 2 COLON-CO 3 LYMPHOMA-LY 6 BLADDER-BL 1 2 MELANOMA-ML 1 1 UTERUS-UT 2 LEUKEMIA-LE 1 5 RENAL-RE 3 PANCREAS-PA 1 2 OVARY-OV 1 2 MESOTHELIOMA-MS 3 BRAIN-BR 4 78.25% accuracy Green=Correct Red=Errors We feed multiple cancer types

data into the Oracle DB: 16,063 genes, 144 cancer

patients and

10 samples per class.

We mine the data using Support Vector

Machines and create the confusion matrix

(36)

Oracle Data Mining in the Life Sciences

Tumor Class # Train # Test Tumor Class # Train # Test

Breast (BR) 8 3 Uterus (UT) 8 2

Prostate (PR) 8 2 Leukemia (LE) 24 6

Lung (LU) 8 3 Renal (RE) 8 3

Colorectal (CO) 8 5 Pancreas (PA) 8 3

Lymphoma (LY) 16 6 Ovary (OV) 8 3

Bladder (BL) 8 3 Mesothelioma (MS) 8 3

• Multiple examples of 14 tumor types

• Training set: 144 samples. Test set: 46 samples

• Microarrays gene expression profiles: 7,129 genes (features)

• Can we build a model to distinguish between multiple tumor types?

(37)

Multi-Tumor Dataset

Build SVM Model (Training)

Evaluate Model on Test Set

Data Preparation (Scaling)

Read into RDMS as Table

Oracle Task

SQLLDR

SQL query

ODM Model Build

ODM Model Apply

Tumor Labels (Train) Tumor Labels (Train) Tumor Labels (Test) Tumor Labels (Test)

Oracle Data Mining in the Life Sciences

(38)

Oracle Data Mining in the Life Sciences

NUMBER expr VARCHAR2(30) gene NUMBER sid type column NUMBER expr VARCHAR2(30) gene NUMBER sid type column

• The datasets were downloaded from the web site and stored in flat files prior to loading them to the Oracle database

• The data was loaded using SQLLDR to create a fact table of the following format:

• Rescaling: the values were divided by a constant (10000) to make them into small numbers near 1 (to keep the dot products between all samples in the dataset inside the [-1, 1] range

(39)

Oracle Data Mining in the Life Sciences

• Entire methodology implemented in Oracle Database • The SVM model works with all 7,129 input features

(genes) genes and do not require feature selection. • The SVM model is relatively fast: 9 minutes training

time on 500MHz Netra.

• The SVM is very accurate for multi-tumor molecular classification: 78.25% accuracy

• Comparable to published results in Ramaswamy et al PNAS 2001 paper, they also found that k-NN = 63% and Weighted Voting = 46% accuracy

(40)

Oracle Data Mining in the Life Sciences

Actual\Predicted BR PR LU CO LY BL ML UT LE RE PA OV MS BR BREAST-BR 1 1 PROSTATE-PR 1 1 LUNG-LU 1 2 COLON-CO 3 LYMPHOMA-LY 6 BLADDER-BL 1 2 MELANOMA-ML 1 1 UTERUS-UT 2 LEUKEMIA-LE 1 5 RENAL-RE 3 PANCREAS-PA 1 2 OVARY-OV 1 2 MESOTHELIOMA-MS 3 BRAIN-BR 4 Green=Correct Red=Errors

Oracle Data Mining’s SVM models are able to accurately predict the

multi-class tumor problem with 78.25% accuracy.

(41)