Data Mining and Business Intelligence
Increasing potential to support business decisions (layers listed from top to bottom):
Decision Making
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources
Roles shown alongside the layers: End User, Business Analyst, Data Analyst, DBA
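MDA Strategy
[Figure: flowchart with YES/NO decision nodes "Structured Matrix", "Analysis in the variable space", "By row" and "Symmetric Analysis", separating Explicative Analyses from Exploratory Analyses]
Groups of methods at the ends of the branches:
- Discriminant Analysis, Segmentation
- PLS Regression, Conjoint Analysis, Non-Symmetric Correspondence Analysis
- Canonical Correlation, Multiple Correspondence Analysis, 3-Way Analysis
- Cluster Analysis, Multidimensional Scaling
- Principal Component Analysis, Correspondence Analysis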
Matrices & Methods in MDA
• Categorical Variables
• Ordinal Variables
• Quantitative Variables
Correspondence analysis (AFC) of the first-round results of the presidential election in the arrondissements of Paris (May 2002)
Decision Trees
Decision trees are the main outcome of a segmentation procedure.
They represent a learning technique for solving classification and forecasting problems.
Graphically, a decision tree may be seen as an upside-down tree:
[Figure: tree diagram with a root node, internal nodes, and leaf nodes]
• Explanatory purpose:
  - explain the response variable from the set of predictors
• Decisional purpose:
Example: Referendum on the European Constitution (Binary Response Y)
Vote for European Constitution  Sex     Age Class  Political Affiliation  Last Degree         Confidence in the future
Yes                             Female  25-34      PS                     Bac+3/4             Confident+
Yes                             Male    60+        PS                     < Bac               Confident-
Yes                             Female  35-44      UMP                    Bac+3/4             Don't know
Yes                             Male    45-59      PS                     Bac                 Confident++
Yes                             Female  35-44      UMP                    Bac+5/Grande école  Confident++
Yes                             Male    25-34      UMP                    Bac                 Confident+
Yes                             Female  25-34      UMP                    Bac                 Confident+
Yes                             Male    35-44      PS                     Bac+5/Grande école  Confident+
Yes                             Female  35-44      UDF                    No degree           Confident+
Yes                             Male    45-59      UDF                    < Bac               Confident--
Yes                             Male    25-34      UMP                    Bac+5/Grande école  Confident+
Yes                             Male    60+        UMP                    < Bac               Confident+
Yes                             Female  35-44      PS                     < Bac               Confident+
Yes                             Male    18-24      UMP                    Bac+3/4             Confident-
Yes                             Female  35-44      PS                     Bac+2               Confident-
Yes                             Female  18-24      Verts                  Bac                 Confident++
Yes                             Female  60+        UMP                    < Bac               Confident+
Yes                             Male    35-44      PS                     Bac+2               Confident+
Building the second level in the tree
Building the third level in the tree
Extracting association rules from decision trees
• The knowledge represented in a decision tree may also be represented in terms of “IF → THEN” rules.
• For each path from the root to a terminal node, an association rule may be defined.
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “no”
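The path-to-rule idea can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the slides: the nested-dict encoding of the tree and the function name `extract_rules` are hypothetical, and the tree is assumed to split on `credit_rating` under the age “>40” branch.

```python
# Hypothetical encoding of the buys_computer tree:
# an internal node is {"attribute": name, "branches": {value: subtree}},
# a leaf is simply the predicted class label (a string).
tree = {
    "attribute": "age",
    "branches": {
        "<=30": {"attribute": "student", "branches": {"no": "no", "yes": "yes"}},
        "31…40": "yes",
        ">40": {
            "attribute": "credit_rating",
            "branches": {"excellent": "yes", "fair": "no"},
        },
    },
}


def extract_rules(node, target, conditions=()):
    """Return one 'IF ... THEN ...' string per root-to-leaf path."""
    if isinstance(node, str):  # leaf: emit the rule for the path walked so far
        premise = " AND ".join(f'{attr} = "{value}"' for attr, value in conditions)
        return [f'IF {premise} THEN {target} = "{node}"']
    rules = []
    for value, child in node["branches"].items():
        step = conditions + ((node["attribute"], value),)
        rules.extend(extract_rules(child, target, step))
    return rules


rules = extract_rules(tree, "buys_computer")
for rule in rules:
    print(rule)
```

Each leaf produces exactly one rule, so the number of rules equals the number of terminal nodes (five here).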
Partial Least Squares (PLS) Path Modeling for:
• Causal Network of Relationships
• Multi-block Analysis
Path model describing a network of causal relationships for Customer Satisfaction
[Figure: path diagram with the latent variables Image, Customer Expectation, Perceived quality, Perceived value, Customer satisfaction, Complaints, and Loyalty]
Measurement Instrument for the Mobile Phone Industry
Examples of latent and manifest variables
Customer expectation
a) Expectations for the overall quality of “your mobile phone provider” at the moment you became a customer of this provider.
b) Expectations for “your mobile phone provider” to provide products and services to meet your personal needs.
c) How often did you expect that things could go wrong at “your mobile phone provider”?
Customer satisfaction
a) Overall satisfaction
b) Fulfilment of expectations
c) How well do you think “your mobile phone provider” compares with your ideal mobile phone provider?
Customer loyalty
a) If you needed to choose a new mobile phone provider, how likely is it that you would choose “your provider” again?
b) Let us now suppose that other mobile phone providers decide to lower fees and prices, but “your mobile phone provider” stays at the same level as today. At which level of difference (in %) would you choose another phone provider?
c) If a friend or colleague asks you for advice, how likely is it that you would recommend “your mobile phone provider”?
And so on for the other latent variables ...
ECSI Model in the XLSTAT-PLSPM software
[Figure: path diagram; latent variables and their manifest variables]
Image: IMAG1, IMAG2, IMAG3, IMAG4, IMAG5
Expectation: CUEX1, CUEX2, CUEX3
Perceived Quality: PERQ1, PERQ2, PERQ3
Perceived Value: PERV1, PERV2
Satisfaction: CUSA1, CUSA2, CUSA3
Loyalty: CUSL1, CUSL2, CUSL3
Complaints: CUSCO
ECSI Path model for a “Mobile phone provider”
[Figure: path diagram linking Image, Customer Expectation, Perceived quality, Perceived value, Customer satisfaction, Complaint, and Loyalty]
Path coefficients shown on the arrows, each with its accompanying value in parentheses: .492 (7.67), .544 (10.71), .066 (1.10), .037 (1.14), .153 (3.07), .211 (2.54), .541 (6.93), .543 (8.62), .201 (3.59), .468 (5.18), .540 (11.08), .049 (1.11)
R2 values shown on the latent variables: .242, .296, .335, .672, .432, .292
Latent Variable Computation
Example: Customer Satisfaction Index
\[
\mathrm{CSI} = \frac{0.0158 \times \text{C\_sat1} + 0.0231 \times \text{C\_sat2} + 0.0264 \times \text{C\_sat3}}{0.0158 + 0.0231 + 0.0264}
\]
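The index is a weighted average of the three satisfaction items, with weights 0.0158, 0.0231 and 0.0264. A minimal Python sketch of that computation follows; the function name `weighted_index` and the example scores are hypothetical, only the weights come from the slide.

```python
# Weights taken from the slide; the scores below are made-up illustration values.
weights = {"C_sat1": 0.0158, "C_sat2": 0.0231, "C_sat3": 0.0264}


def weighted_index(scores, weights):
    """Weighted average of the manifest variables (weights normalised to sum to 1)."""
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight


example_scores = {"C_sat1": 70.0, "C_sat2": 80.0, "C_sat3": 75.0}  # hypothetical
csi = weighted_index(example_scores, weights)
print(round(csi, 2))
```

Because the weights are normalised, the index stays on the same scale as the items: it always lies between the smallest and the largest item score.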
Mean and standard deviation of the latent variables
Latent variable         N    Minimum  Maximum  Mean     Std. Deviation
IMAGE                   250  26.49    100.00   72.6878  13.7660
CUSTOMER EXPECTATION    250  25.85    100.00   72.3198  14.1259
PERCEIVED QUALITY       250  23.95    100.00   74.5765  14.2573
PERCEIVED VALUE         250  .00      100.00   61.5887  20.5987
CUSTOMER SATISFACTION   250  23.68    100.00   71.2876  15.3417
COMPLAINT               250  .00      100.00   67.4704  25.2684
LOYALTY                 250  1.29     100.00   69.1757  21.2668
Explanatory Variables for Customer Satisfaction
Variable           β̂_j   Correlation  Contribution to R² (%)
Image              .153  .671         15.28
Expectation        .037  .481         2.67
Perceived Value    .200  .604         17.98
Perceived Quality  .544  .791         64.07
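The contribution column appears to follow the usual decomposition of R² for standardized regressors: R² is the sum of β̂_j × correlation_j, and each term is expressed as a percentage of that sum. This reading is inferred from the numbers in the table, not stated on the slide. A short Python check under that assumption:

```python
# (beta_hat_j, correlation with Customer Satisfaction), copied from the table.
predictors = {
    "Image": (0.153, 0.671),
    "Expectation": (0.037, 0.481),
    "Perceived Value": (0.200, 0.604),
    "Perceived Quality": (0.544, 0.791),
}

# Assumed decomposition: R2 = sum_j beta_hat_j * cor(x_j, y)
r2 = sum(beta * corr for beta, corr in predictors.values())

# Share of each term in R2, in percent
contributions = {
    name: 100 * beta * corr / r2 for name, (beta, corr) in predictors.items()
}

print(round(r2, 3))
for name, share in contributions.items():
    print(name, round(share, 2))
```

Up to rounding of the published coefficients, the recomputed shares agree with the table.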
PLS1 regression: an overview of the algorithm
Step 1: Search for m orthogonal components t_h = Xa_h that are as correlated as possible with y and as explanatory as possible of their own group of variables.
The number m of components is obtained by cross-validation.
Step 2: Regression of y on the m components t_h.
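The two steps can be sketched in pure Python. This is only an illustration under stated assumptions: it uses the classical PLS1 scheme in which each component is built from the deflated X with weights proportional to X'y (the slides do not say how a_h is computed, and here t_h is expressed through the deflated X rather than the original X). The function name `pls1` and the toy data are hypothetical, and the cross-validation that selects m is not shown.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def pls1(columns, y, m):
    """Toy PLS1: `columns` is a list of predictor columns, `y` the response."""
    X = [center(col) for col in columns]  # centered predictors, stored by column
    y = center(y)
    n = len(y)
    components, coefs = [], []
    for _ in range(m):
        # Step 1: weights proportional to the covariances between predictors and y
        w = [dot(col, y) for col in X]
        norm = dot(w, w) ** 0.5
        if norm == 0.0:  # nothing left in X that covaries with y
            break
        w = [v / norm for v in w]
        t = [sum(w[j] * X[j][i] for j in range(len(X))) for i in range(n)]
        tt = dot(t, t)
        # Step 2: the t_h are orthogonal, so regressing y on each t_h
        # separately gives the same c_h as the joint regression.
        coefs.append(dot(y, t) / tt)
        components.append(t)
        # Deflate X so that the next component is orthogonal to this one
        X = [[x - dot(col, t) / tt * ti for x, ti in zip(col, t)] for col in X]
    fitted = [sum(c * t[i] for c, t in zip(coefs, components)) for i in range(n)]
    return components, coefs, fitted


# Hypothetical toy data: y is an exact linear function of the two predictors
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y_raw = [2 * a - b for a, b in zip(x1, x2)]
components, coefs, fitted = pls1([x1, x2], y_raw, m=2)
print(dot(components[0], components[1]))  # close to 0: components are orthogonal
```

With as many components as predictors, the fit coincides with ordinary least squares on this toy example; in practice m is usually smaller and chosen by cross-validation.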
Objective of step 1 of PLS regression
[Figure: scatter plots of the observations in the (X1, X2) plane and against y, showing the direction labelled CPX1 and the first PLS component t1]
PLS1 Regression: Next Steps
• Similarly, we proceed for the next components.
• Finally, the m-component PLS regression model:
\begin{align*}
y &= c_1 t_1 + c_2 t_2 + \dots + c_m t_m + \text{Residual} \\
  &= c_1 X a_1 + c_2 X a_2 + \dots + c_m X a_m + \text{Residual} \\
  &= X (c_1 a_1 + c_2 a_2 + \dots + c_m a_m) + \text{Residual} \\
  &= b_1 x_1 + b_2 x_2 + \dots + b_k x_k + \text{Residual}
\end{align*}
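The last two lines of the model say that the regression coefficients on the original variables are b = c_1 a_1 + c_2 a_2 + … + c_m a_m. A small Python sketch of that collapse, using made-up values for c_h and a_h (m = 2 components, k = 3 variables); the function name `regression_coefficients` is hypothetical.

```python
def regression_coefficients(c, a):
    """c: the m coefficients c_h; a: the m weight vectors a_h, each of length k."""
    k = len(a[0])
    return [sum(c_h * a_h[j] for c_h, a_h in zip(c, a)) for j in range(k)]


# Hypothetical values, for illustration only
c = [0.5, 2.0]
a = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0]]
b = regression_coefficients(c, a)

# For any observation x, the two ways of writing the prediction agree
x = [1.0, 2.0, 3.0]
via_components = sum(
    c_h * sum(a_hj * x_j for a_hj, x_j in zip(a_h, x)) for c_h, a_h in zip(c, a)
)
via_coefficients = sum(b_j * x_j for b_j, x_j in zip(b, x))
print(b, via_components, via_coefficients)
```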
Wine data (Asselin, Morlat & Pagès)
Each column is a wine, labelled by its appellation (3 Appellations) and its soil (4 Soils).
Wine                            2el     1cha    1fon       1vau    …  t1      t2
Appellation                     Saumur  Saumur  Bourgueil  Chinon  …  Saumur  Saumur
Soil                            1       1       1          3       …  4       4
X
Smell intensity at rest         3.07    2.96    2.86       2.81    …  3.70    3.71
Aromatic quality at rest        3.00    2.82    2.93       2.59    …  3.19    2.93
Fruity note at rest             2.71    2.38    2.56       2.42    …  2.83    2.52
Floral note at rest             2.28    2.28    1.96       1.91    …  1.83    2.04
Spicy note at rest              1.96    1.68    2.08       2.16    …  2.38    2.67
Visual intensity                4.32    3.22    3.54       2.89    …  4.32    4.32
Shading (orange to purple)      4.00    3.00    3.39       2.79    …  4.00    4.11
Surface impression              3.27    2.81    3.00       2.54    …  3.33    3.26
Smell intensity after shaking   3.41    3.37    3.25       3.16    …  3.74    3.73
Smell quality after shaking     3.31    3.00    2.93       2.88    …  3.08    2.88
Fruity note after shaking       2.88    2.56    2.77       2.39    …  2.83    2.60
Floral note after shaking       2.32    2.44    2.19       2.08    …  1.77    2.08
Spicy note after shaking        1.84    1.74    2.25       2.17    …  2.44    2.61
Vegetable note after shaking    2.00    2.00    1.75       2.30    …  2.29    2.17
Phenolic note after shaking     1.65    1.38    1.25       1.48    …  1.57    1.65
Aromatic intensity in mouth     3.26    2.96    3.08       2.54    …  3.44    3.10
Aromatic persistence in mouth   3.26    2.96    3.08       2.54    …  3.44    3.10
Aromatic quality in mouth       3.26    2.96    3.08       2.54    …  3.44    3.10
Intensity of attack             2.96    3.04    3.22       2.70    …  2.96    3.33
Acidity                         2.11    2.11    2.18       3.18    …  2.41    2.57
Astringency                     2.43    2.18    2.25       2.18    …  2.64    2.67
Alcohol                         2.50    2.65    2.64       2.50    …  2.96    2.70
Balance (Acid., Astr., Alco.)   3.25    2.93    3.32       2.33    …  2.57    2.77
Mellowness                      2.73    2.50    2.68       1.68    …  2.07    2.31
Bitterness                      1.93    1.93    2.00       1.96    …  2.22    2.67
Ending intensity in mouth       2.86    2.89    3.07       2.46    …  3.04    3.33
Harmony                         3.14    2.96    3.14       2.04    …  2.74    3.00
y
Global quality                  3.39    3.21    3.54       2.46    …  2.64    2.85
Hierarchical PLS model for wine data
Variable loading plot (w*, c)
[Figure: loading plot with axes w*c[1] (horizontal) and w*c[2] (vertical), showing the sensory descriptors of the wine data (from SMELL INTENSITY AT REST to HARMONY) together with GLOBAL QUALITY. Legend: Positive, Negative, Non significant. Produced with SIMCA-P 10.5.]
Explanatory methods
Response variable Y, explanatory variables X1, X2, …, Xk
Y quantitative:
- X quantitative: Multiple regression
- X qualitative: Analysis of variance
- X mixed: Analysis of covariance
Y qualitative:
- X quantitative: Logistic regression, Segmentation, Factorial discriminant analysis, Bayesian factorial analysis
- X qualitative: Logistic regression, Segmentation, Factorial discriminant analysis
- X mixed: Logistic regression, Segmentation, Factorial discriminant analysis
Several response variables, several explanatory variables: PLS regression
Descriptive methods
Visualization methods, by type of the variables X1, X2, …, Xk:
- X quantitative: Principal component analysis (PCA)
- X qualitative: Multiple correspondence analysis (MCA)
- X mixed: PCA, MCA, Optimal coding
Clustering methods
- Agglomerative hierarchical clustering (of observations or variables)
- Dynamic clusters method (nuées dynamiques)
Forecasting methods
• Analysis of a time series
  - Search for a trend and seasonal factors
  - Identification of outliers
• Forecasting
  - Smoothing methods (short series)
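As an illustration of a smoothing method (méthode de lissage) for a short series, here is a minimal sketch of simple exponential smoothing in Python. The slides do not say which smoothing method is meant, so the choice of technique, the function name `exponential_smoothing`, the smoothing constant and the data are all assumptions.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing.

    Returns the smoothed level after each observation; the last level
    serves as the one-step-ahead forecast.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]  # initialise the level with the first observation
    levels = [level]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
        levels.append(level)
    return levels


series = [12.0, 15.0, 14.0, 16.0, 19.0]  # hypothetical short series
levels = exponential_smoothing(series, alpha=0.5)
forecast = levels[-1]
print(levels, forecast)
```

A larger `alpha` follows recent observations more closely; a smaller one smooths more strongly. This simple variant handles neither trend nor seasonality.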
“A drop of water in the ocean…
Do not underestimate it,
The ocean is made of nothing but drops of water…”
Photo taken from the book « Rendons à César … »