OLSUG Workshop
Oracle Data Mining
Charlie Berger
Sr. Director of Product Mgmt, Life Sciences and Data Mining Oracle Corporation
Carolyn K. Hamm, Ph.D. Chief, Decision Support Center Walter Reed Army Medical Center Washington, DC 20307
202 356-1012 x 40166 [email protected]
[email protected] Dr. Lutz Hamel
Asst. Professor, Computer Science University of Rhode Island
Oracle Data Mining Workshop
•
Oracle Data Mining overview
•
Data mining process & example use cases
• Explore, build, test, cluster, etc.
Oracle Data Mining
• Platform for data mining
• PL/SQL API • Java API
• Oracle Data Miner (GUI)
• Wide range of algorithms
• Classification
• Support Vector Machines, Naïve Bayes,
Adaptive Bayes Networks
• Attribute Importance • Association Rules • Clustering
• Enhanced K-Means, Orthogonal Clustering
• Nonnegative Matrix Factorization
(feature extraction)
Oracle Data Mining
Algorithms & Example Applications
•
Attribute Importance
• Identify most influential attributes for a target attribute
• Factors associated a disease • Promising leads
•
Classification and Prediction
• Predict most likely to:
• Doctors who prescribe a new drug • Patients who respond to a treatment
•
Regression
• Predict a numeric value • Predict a value
Oracle Data Mining
Algorithms & Example Applications
• Clustering
• Find naturally occurring groups
• Gene clusters
• Find disease subgroups
• Distinguish normal from non-normal behavior
• Association Rules
• Find co-occurring items
• Suggest interactions
• Feature Extraction
• Reduce a large dataset into representative new attributes
Oracle Data Mining
Algorithms & Example Applications
• Text Mining
• Combine data and text for better models
• Add unstructured text e.g. physician’s notes to
structured data e.g. age, weight, height, etc., to predict outcomes
• Classify and cluster documents
• Combined with Oracle Text to develop advanced text mining applications e.g. Medline
• BLAST
ATGCAATGCCAGGATTTCCA
CTGCAAGGCCAGGAAGTTCCA
ATGCGTTGCCAC…ATTTCCA
10g
Statistics & SQL Analytics
• Ranking functions
• rank, dense_rank, cume_dist, percent_rank, ntile
• Window Aggregate functions
(moving and cumulative)
• Avg, sum, min, max, count, variance, stddev, first_value, last_value
• LAG/LEAD functions
• Direct inter-row reference using offsets
• Reporting Aggregate functions
• Sum, avg, min, max, variance, stddev, count, ratio_to_report
• Statistical Aggregates
• Correlation, linear regression family, covariance
• Linear regression
• Fitting of an ordinary-least-squares regression line to a set of number pairs.
• Frequently combined with the COVAR_POP, COVAR_SAMP, and CORR functions.
• Descriptive Statistics
• average, standard deviation, variance, min, max, median
(via percentile_count), mode, group-by & roll-up
• DBMS_STAT_FUNCS: summarizes numerical columns of a table and returns count, min, max, range, mean, stats_mode, variance, standard deviation, median, quantile values, +/- 3 sigma values, top/bottom 5 values
• Correlations
• Pearson’s correlation coefficients, Spearman's and Kendall's (both nonparametric).
• Cross Tabs
• Enhanced with % statistics: chi squared, phi coefficient, Cramer's V, contingency coefficient, Cohen's kappa
• Hypothesis Testing
• Student t-test , F-test, Binomial test, Wilcoxon Signed Ranks test, Chi-square, Mann Whitney test, Kolmogorov-Smirnov test, One-way ANOVA
• Distribution Fitting
• Kolmogorov-Smirnov Test, Anderson-Darling Test, Chi-Squared Test, Normal, Uniform, Weibull, Exponential
Statistics
•
Enables analytic pipelines without removing data
to statistical packages for simple analyses
Workshop Outline
• Explore the data
• View data, simple graphs, ranges, etc.
• Cluster the data (undirected) & look for interesting patterns
• Determine problem to be solved
• What factors are associated with target 1, target 2, etc. • Predict patients likely to respond to treatment
• Data transformations
• Building Models
• Classification Models • Build, Test, Apply
• Throw out attributes e.g. 100% correlations etc. • Classification Models w/ unstructured data (text)
• Mining Activity Guides
• 10gR2 Preview
• Decision Trees (10gR2) • Anomaly Detection (10gR2)
Explore the data
• View data, simple graphs, ranges, etc.
• Lymphoma_7 data • Bpress default
• Relative value data (WRMC)
• Cluster the data
(undirected) looking for interesting patterns
Determine problem
• State the problem in terms of data mining
• What factors are associated with target 1, target 2, etc. • Predict patients likely to respond to treatment
Building Models
• Classification• Build, Test, Apply
• Use SVM on Brain Tumor data w/o TEXT and SVM on Brain Tumor w/ TEXT
• Throw out attributes e.g. 100% correlations etc. • Use Diabetes data • Classification Models w/ unstructured data (text)
Mining Activity Guides
• Step by step guidance to achieve a goal and increase the likelihood of successful data mining“The future of data mining lies
in predictive analytics.”
The Future of Data Mining – Predictive Analytics
Article published in DM Review Magazine August 2004 Issue
By Lou Agosta
What is “Predictive Analytics”?
•
“One click data mining”
• Automatically selects “appropriate” algorithm • Automates all advanced algorithm settings • Automates Train, Test, and Apply steps
•
Power data analysts can use
• Oracle Data Miner (wizard driven gui) • PL/SQL API
• Java API
•
Concept of performing predictive analytics is
better than doing nothing
Oracle Data Mining
Algorithms & Example Applications
Attribute Importance
• Identify most influential attributes for a target attribute
• Factors associated a disease • Promising leads
Classification and Prediction
• Predict most likely to:
• Doctors who prescribe a new drug • Patients who respond to a treatment
• Regression
• Predict a numeric value
• Predict a value
• Predict the size tumor reduction
A1 A2 A3 A4 A5 A6 A7
Explain PA
“Easy Button” Attribute Importance
Predict PA
Life Sciences
Oracle Data Miner 10g
Age
Status
Temp Gender Days ICU
>45 <45
No Infection Infection >35 <=35
<100 >100 M >4
Age
Oracle Data Mining
Decision Trees
Risk = 0 Risk = 1 Risk = 1 Risk = 0
<=4 Risk = 1 F Risk = 0 Problem: Find profiles of high risk patients • Decision Trees • Popular algorithm • Human readable “rules” • Builds classification trees in Database • Parallel
Oracle Data Mining
10g Release 2 New Features
Problem: Detect rare cases
•
Anomaly Detection
• “One-Class” Classification • Builds SVM classificationmodels where only one class e.g. “0’s” exists
• Network intrusion detection • Disease outbreaks
• Outlier detection
• Rare events, true novelty X1
Oracle Data Mining
10g Release 2 New Features
(Continued)• Oracle Predictive Analytics PL/SQL Packages (Available now on OTN)
• EXPLAIN and PREDICT PL/SQL packages completely automate data mining
• Oracle Spreadsheet Add-In for Predictive Analytics on OTN
• http://www.oracle.com/technology/products/bi/odm/pa-addin/odm_pred_analytics_addin.htmlPrediction Operator SQL-Level Data Mining Capability
• Prediction Operator SQL-Level Data Mining Capability
• Fast, SQL-level data mining prediction (“Apply”) functions that can be used to “pipeline” predictions e.g.
• “Select customers where Churner_predicted >.80 AND Customer_value_prediction > $500 AND
Response_likehood > .6”
• Java Data Mining (JDM) Compliant Java API
• Oracle Database 10g R2 provides a Java Data Mining (JDM) JSR-73 compliant Java API
Oracle Data Mining
10g Release 2
• Updated Oracle Data Miner (GUI)
• Ability to mine text column • Anomaly detection • Decision Trees • Predictive Analytics
(“one click” data mining)
Oracle Data Mining
10g Release 2
• Decision Trees (10gR2) • Anomaly Detection (10gR2)Oracle Data Mining
10g Release 2
• Decision Trees (10gR2) • Anomaly Detection (10gR2)Q U E S T I O N S
Q U E S T I O N S
A N S W E R S
Additional Life Sciences Use
Case Slides
Life Sciences Use Cases
1.
Gene expression analysis
2.
Clinical treatment outcome analysis
3.
Classification of Multiple Tumor Types
Oracle Data Mining in the Life Sciences
Gene expression analysis
•
Problem 1
• Given thousands of gene expression values for each patient, can a small subset of the
expressions be identified that can be used to distinguish one type of leukemia from another?
•
Solution
• Apply ODM’s Attribute Importance algorithm to the data to decrease the size of the problem
• Build an Adaptive Bayes Network Classification model to predict disease type from the gene
Oracle Data Mining in the Life Sciences
Gene expression analysis
Top Genes (of ~7000) for Classifying Leukemia
Gene Expression Relative Importance
V00594_s_at 0.298955976210004 D43950_at 0.292217965904811 U34038_at 0.227177556507829 J03827_at 0.227177556507829 U64863_at 0.227177556507829 S85655_at 0.175469338594625 L07758_at 0.17031674247889 U19345_at 0.17031674247889 U89336_cds4_at 0.125995412839 U79295_at 0.125995412839 HG311-HT311_at 0.125995412839 V00599_s_at 0.125995412839
Oracle Data Mining in the Life Sciences
Gene expression analysis
ABN Model Predictions
Lymphoid Leukemia vs. Myeloid Leukemia
Predicted LL ML Actual LL 19 1
ML 2 12
Oracle Data Mining in the Life Sciences
Clinical treatment outcome analysis
•
Problem 2
• Is it possible to classify treatments that are most effective in causing improvement in clinical
patients suffering from a given disease?
•
Solution
• Use Attribute Importance to rank the treatment factors
• Use Association Rules to establish correlations between treatment and outcome
Source: Walter Reed Medical Center, Dr. Carolyn Hamm, presentation at Oracle Life Sciences User Group Meeting, June 2004
Oracle Data Mining in the Life Sciences
Clinical treatment outcome analysis
Factors associated with positive diabetes outcomes
1. DRUG_TYPE 2. COMPLETE_HISTORY_RECORDED (Scorecard) 3. NUM_HOSPITAL_ADMISSIONS 4. GENDER 5. NUM_VISITS_TO_PROVIDER 6. INSURANCE_TYPE 7. BLOOD_PRESSURE_GOAL 8. WEIGHT_GOAL 9. LDL_GOAL 10.PROVIDER_TYPE
Oracle Data Mining in the Life Sciences
Clinical treatment outcome analysis
Sample Association Rules
Source: Walter Reed Medical Center, Dr. Carolyn Hamm, presentation at Oracle Life Sciences User Group Meeting,
If Then OUTCOME Percentage of Cases
NUM_HOSPITAL_ADMISSIONS=0 NO_IMPROVEMENT 0.5463472 NUM_VISITS_TO_PROVIDER>5 IMPROVEMENT 0.37195602 NUM_HOSPITAL_ADMISSIONS =0 and NUM_VISITS_TO_PROVIDER>5 IMPROVEMENT 0.36252946 DRUG_GROUP=2 and NUM_HOSPITAL_ADMISSIONS =0 NO_IMPROVEMENT 0.3267871 DRUG_GROUP=2 and NUM_VISITS_TO_PROVIDER>5 NO_IMPROVEMENT 0.30911234 NUM_HOSPITAL_ADMISSIONS =0 and GENDER=FEMALE NO_IMPROVEMENT 0.28711703 COMPLETE_HISTORY_RECORDED=NO NO_IMPROVEMENT 0.28436762 NUM_HOSPITAL_ADMISSIONS =0 and COMPLETE_HISTORY_RECORDED=Yes IMPROVEMENT 0.22663
Oracle Data Mining in the Life Sciences
Classification of Multiple Tumor Types
DNA Microarray Data
Oracle Data Mining Actual\Predicted BR PR LU CO LY BL ML UT LE RE PA OV MS BR BREAST-BR 1 1 PROSTATE-PR 1 1 LUNG-LU 1 2 COLON-CO 3 LYMPHOMA-LY 6 BLADDER-BL 1 2 MELANOMA-ML 1 1 UTERUS-UT 2 LEUKEMIA-LE 1 5 RENAL-RE 3 PANCREAS-PA 1 2 OVARY-OV 1 2 MESOTHELIOMA-MS 3 BRAIN-BR 4 78.25% accuracy Green=Correct Red=Errors We feed multiple cancer types
data into the Oracle DB: 16,063 genes, 144 cancer
patients and
10 samples per class.
We mine the data using Support Vector
Machines and create the confusion matrix
Oracle Data Mining in the Life Sciences
Classification of Multiple Tumor Types
Tumor Class # Train # Test Tumor Class # Train # Test
Breast (BR) 8 3 Uterus (UT) 8 2
Prostate (PR) 8 2 Leukemia (LE) 24 6
Lung (LU) 8 3 Renal (RE) 8 3
Colorectal (CO) 8 5 Pancreas (PA) 8 3
Lymphoma (LY) 16 6 Ovary (OV) 8 3
Bladder (BL) 8 3 Mesothelioma (MS) 8 3
• Multiple examples of 14 tumor types
• Training set: 144 samples. Test set: 46 samples
• Microarrays gene expression profiles: 7,129 genes (features)
• Can we build a model to distinguish between multiple tumor types?
Multi-Tumor Dataset
Multi-Tumor Dataset
Build SVM Model (Training)
Build SVM Model (Training)
Evaluate Model on Test Set
Evaluate Model on Test Set
Data Preparation (Scaling)
Data Preparation (Scaling)
Read into RDMS as Table
Read into RDMS as Table
Oracle Task
SQLLDR
SQL query
ODM Model Build
ODM Model Apply
Tumor Labels (Train) Tumor Labels (Train) Tumor Labels (Test) Tumor Labels (Test)
Oracle Data Mining in the Life Sciences
Oracle Data Mining in the Life Sciences
Classification of Multiple Tumor Types
NUMBER expr VARCHAR2(30) gene NUMBER sid type column NUMBER expr VARCHAR2(30) gene NUMBER sid type column
• The datasets were downloaded from the web site and stored in flat files prior to loading them to the Oracle database
• The data was loaded using SQLLDR to create a fact table of the following format:
• Rescaling: the values were divided by a constant (10000) to make them into small numbers near 1 (to keep the dot products between all samples in the dataset inside the [-1, 1] range
Oracle Data Mining in the Life Sciences
Classification of Multiple Tumor Types
• Entire methodology implemented in Oracle Database • The SVM model works with all 7,129 input features
(genes) genes and do not require feature selection. • The SVM model is relatively fast: 9 minutes training
time on 500MHz Netra.
• The SVM is very accurate for multi-tumor molecular classification: 78.25% accuracy
• Comparable to published results in Ramaswamy et al PNAS 2001 paper, they also found that k-NN = 63% and Weighted Voting = 46% accuracy
Oracle Data Mining in the Life Sciences
Classification of Multiple Tumor Types
Actual\Predicted BR PR LU CO LY BL ML UT LE RE PA OV MS BR BREAST-BR 1 1 PROSTATE-PR 1 1 LUNG-LU 1 2 COLON-CO 3 LYMPHOMA-LY 6 BLADDER-BL 1 2 MELANOMA-ML 1 1 UTERUS-UT 2 LEUKEMIA-LE 1 5 RENAL-RE 3 PANCREAS-PA 1 2 OVARY-OV 1 2 MESOTHELIOMA-MS 3 BRAIN-BR 4 Green=Correct Red=Errors
Oracle Data Mining’s SVM models are able to accurately predict the
multi-class tumor problem with 78.25% accuracy.