Statistics for Materials Engineers
MATLS 3J03
© Kevin Dunn, 2013
Instructor: Tim Dietrich
Overall revision number: 19 (January 2013)
Copyright, sharing, and attribution notice
This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
Unported License. To view a copy of this license, please visit
http://creativecommons.org/licenses/by-sa/3.0/
This license allows you:
I to share - to copy, distribute and transmit the work
I to adapt - but you must distribute the new result under the same or similar license to this one
I to commercialize - you are allowed to use this work for commercial purposes
I attribution - but you must attribute the work as follows:
I “Portions of this work are the copyright of Kevin Dunn”, or
I “This work is the copyright of Kevin Dunn”
We appreciate:
I if you let us know about any errors in the slides
I any suggestions to improve the notes
All of the above can be done by writing to
kevin.dunn@mcmaster.ca
or anonymous messages can be sent to Kevin Dunn at
http://learnche.mcmaster.ca/feedback-questions
If reporting errors/updates, please quote the current revision number: 77
Please note that all material is provided “as-is” and no liability will be accepted for your usage of the material.
Intro
In context
I The section we’ve just finished could be considered:
“Empirical modelling of systems using a least squares model”
I Experiments are important:
I We learn more about our systems
I We use the data to fit an empirical model
I Main aim: use the model to optimize a process for higher profit
I Happenstance (as-is) data
I cannot tell cause-and-effect
I most often is not in a DOE layout, but might still be valuable to learn from.
References
I Box, Hunter and Hunter, Statistics for Experimenters
I chapters 10, 11, 12, 13, 15 in first edition
I chapters 5 and 6 in second edition
Experiments with a single variable at two levels
I Simplest case:
I catalyst A vs catalyst B
I low RPM vs high RPM
I etc
I Measure nA values from setup A
I Measure nB values from setup B
Recap of group-to-group differences
Recap:
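$$s_P^2 = \frac{(n_A - 1)s_A^2 + (n_B - 1)s_B^2}{n_A - 1 + n_B - 1}$$
$$z = \frac{(\bar{x}_B - \bar{x}_A) - (\mu_B - \mu_A)}{\sqrt{s_P^2\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}}$$
$$(\bar{x}_B - \bar{x}_A) - c_t \sqrt{s_P^2\left(\frac{1}{n_A} + \frac{1}{n_B}\right)} \;\leq\; \mu_B - \mu_A \;\leq\; (\bar{x}_B - \bar{x}_A) + c_t \sqrt{s_P^2\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}$$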
I Significant difference: does confidence interval span zero?
I Practical difference?
I width of confidence interval
I where it lies relative to zero
Using linear least squares models
I Can achieve the same result using least squares: $y_i = b_0 + g\,d_i$
I $d_i = 0$ for A
I $d_i = 1$ for B
I and $y_i$ is the response variable.
Two totally different methods; same result! Confirm it for yourself.
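One way to confirm this is the sketch below. It borrows the TK104 (A) and TK107 (B) numbers that appear later in these slides, and assumes NumPy is available; the variable names are illustrative only. It computes the z-value from the pooled variance under the null hypothesis, then fits the least squares model with a 0/1 indicator and compares the two.

```python
import numpy as np

# Data from the TK104 (A) vs TK107 (B) example later in these slides
y_a = np.array([254, 440, 501, 368, 697, 476, 188, 525], dtype=float)
y_b = np.array([338, 470, 558, 426, 733, 539, 240, 628, 517], dtype=float)
n_a, n_b = len(y_a), len(y_b)

# Method 1: pooled variance and z-value, assuming mu_B - mu_A = 0
s2_pooled = ((n_a - 1) * y_a.var(ddof=1) + (n_b - 1) * y_b.var(ddof=1)) / (n_a - 1 + n_b - 1)
diff_means = y_b.mean() - y_a.mean()
z = diff_means / np.sqrt(s2_pooled * (1 / n_a + 1 / n_b))

# Method 2: least squares, y_i = b0 + g * d_i, with d_i = 0 for A and 1 for B
y = np.concatenate([y_a, y_b])
d = np.concatenate([np.zeros(n_a), np.ones(n_b)])
X = np.column_stack([np.ones_like(d), d])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, g = coef

# Standard error of g from the residual variance (n - 2 degrees of freedom)
residuals = y - X @ coef
s2_resid = residuals @ residuals / (len(y) - 2)
se_g = np.sqrt(s2_resid * np.linalg.inv(X.T @ X)[1, 1])
t_g = g / se_g

print(f"difference of means = {diff_means:.3f}, slope g = {g:.3f}")
print(f"z from pooled variance = {z:.4f}, g / SE(g) = {t_g:.4f}")
```

The slope g equals the difference of the group averages, the intercept equals the average of group A, and g divided by its standard error reproduces the z-value.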
Importance of randomization
Why randomize experiments?
I Prevent unmeasured and uncontrollable disturbances affecting y
I Guarantees independence in the data
I We can then use t-distributions (which require independence)
I The example of Fisher: lady and the tea. Modern day example: Coke vs Pepsi.
I Engineering example: A = TK104 and B = TK107
I nA= 8: [254, 440, 501, 368, 697, 476, 188, 525]
I nB = 9: [338, 470, 558, 426, 733, 539, 240, 628, 517]
I Null hypothesis: there is no difference
I Implies these numbers could have come from either A or B
Details of the analysis are given in the course textbook
Importance of randomization
I nA= 8: [254, 440, 501, 368, 697, 476, 188, 525]
I nB = 9: [338, 470, 558, 426, 733, 539, 240, 628, 517 ]
I Randomly assign “A” to any nA of the values and “B” to any nB of the values
I $\frac{(n_A + n_B)!}{n_A!\,n_B!} = 24310$ possible combinations
I Combinations: number of unique ways to split 17 experiments into 2 groups of nA = 8 and nB = 9
I 1: AAAAAAAABBBBBBBBB
I 2: AAAAAAABABBBBBBBB
I 3: AAAAAAABBABBBBBBB
I etc
I For each arrangement we calculate: $\bar{y}_A - \bar{y}_B$
I Plot a histogram of this difference of averages
Importance of randomization
I Probability that the actual experiment could have come from chance?
I 79.6% of the combinations have a lower value than the actual difference
I Using standard group-to-group difference:
I z = 0.8435
I Pr(z < 0.8435) = 79.3% (DOF = nA + nB − 2)
I Result: if we don’t randomize, we cannot use z-values and confidence intervals - they may be misleading.
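The randomization argument can be reproduced by brute force. The sketch below assumes NumPy and enumerates every way of labelling 8 of the 17 values as "A". The difference is taken as B minus A, matching the sign convention of the z-value above; that convention is an assumption of this sketch.

```python
from itertools import combinations

import numpy as np

values = np.array(
    [254, 440, 501, 368, 697, 476, 188, 525,        # observed in A
     338, 470, 558, 426, 733, 539, 240, 628, 517],  # observed in B
    dtype=float,
)
n_a, n_b = 8, 9
total = values.sum()
actual_diff = values[n_a:].mean() - values[:n_a].mean()

# For every possible choice of which 8 values are labelled "A",
# compute (average of B) - (average of A)
diffs = []
for idx_a in combinations(range(n_a + n_b), n_a):
    sum_a = values[list(idx_a)].sum()
    diffs.append((total - sum_a) / n_b - sum_a / n_a)
diffs = np.array(diffs)

# Fraction of arrangements with a smaller difference than the one observed
# (small tolerance so the observed arrangement itself is not counted)
fraction_lower = np.mean(diffs < actual_diff - 1e-9)
print(len(diffs), f"{100 * fraction_lower:.1f}%")
```

The printed percentage can be compared against the 79.6% quoted on the slide, and a histogram of `diffs` gives the reference distribution described earlier.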
Importance of randomization
The previous derivation used random combinations and made no statistical assumptions.
Why don’t we use this approach instead?
The original data set (still a small data set by today’s standards) was nA = 20 and nB = 23. There are $\frac{(n_A + n_B)!}{n_A!\,n_B!} \approx 960{,}567$ million combinations, and it would take about 3 years on a regular computer to do the computation (never mind storing the results).
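The growth in the number of arrangements is easy to check with the Python standard library (`math.comb`, available from Python 3.8); this is only a quick count, not a timing estimate.

```python
from math import comb

small = comb(8 + 9, 8)      # split 17 runs into groups of 8 and 9
large = comb(20 + 23, 20)   # split 43 runs into groups of 20 and 23

print(small)                          # number of arrangements for the small example
print(round(large / 1e6), "million")  # arrangements for the original data set
```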
Change one single variable at a time (COST)
I Base case: T = 346 K, S = 1.5 g/L; yield = 63%.
Change one single variable at a time (COST)
I Trapped in a sub-optimal solution
I In the previous example: we would have considered experiment 7 to be the optimum
I experiment 3 is the optimum wrt “Temperature”
I then experiment 7 is the optimum wrt “Substrate”
I but, we’re still away from the true optimum
I We have known for 80 years now: COST is the wrong way to optimize a system
Why not use existing data?
I Existing data = historical data = happenstance data
I This is data without any intentional perturbations
I Problem: we see correlations, but we cannot tell if they are
causal
Terminology: Factors
Factor: the thing that is being changed.
I growing plants?
I water used = [50 mL vs 80 mL]
I maximizing sales in a store?
I height from floor = [3 ft vs 5 ft]
I first date or “date-night”?
I action movie vs chick flick
Terminology: Response
Response: the outcome that is being measured
I growing plants?
I e.g. height of plant after 10 days
I other outcomes are possible
I maximizing sales in a store?
I total profit
A response variable:
I is usually (in almost every case) a continuous variable
I should be measured in the same manner for all experiments
I should be reproducibly measurable
I measure as many outcomes as you can to avoid repeating experiments later
Factorial designs: 2 levels for 2 or more factors
I Change multiple factors simultaneously
I Factor: is a variable that we can manipulate/adjust/set
I Consider, for now, two levels in each factor. For example:
I continuous: low and high pH
I continuous: short reaction time and long reaction time
I discrete: catalyst A and B
Factorial designs: by example
I We will use this system for our example
Factorial designs: by example
Bioreactor example: aim is to maximize y = conversion [%]
I T: Temperature: T_low = 338 K and T_high = 354 K
I S: Substrate concentration: S_low = 1.25 g/L and S_high = 1.75 g/L
I How is the range chosen?
I About 25% of typical operating range if no other prior knowledge. We will consider other criteria later also.
I Factors are: T and S
I Number of experiments (runs): $2^k$; k = number of factors
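A minimal sketch of how the $2^k$ runs could be listed in standard order and then shuffled into a random run order. The function name `full_factorial` is hypothetical, and levels are written in coded form (−1 for low, +1 for high).

```python
import random
from itertools import product


def full_factorial(factor_names):
    """List the runs of a 2-level full factorial in standard order.

    Standard order: the first factor alternates fastest, the last slowest.
    """
    k = len(factor_names)
    runs = []
    for levels in product([-1, +1], repeat=k):
        # itertools.product varies the LAST position fastest, so reverse it
        runs.append(dict(zip(factor_names, reversed(levels))))
    return runs


design = full_factorial(["T", "S"])
for number, run in enumerate(design, start=1):
    print(number, run)

# Randomize the order in which the experiments are actually performed
run_order = random.sample(range(1, len(design) + 1), len(design))
print("run the experiments in this order:", run_order)
```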
Factorial designs: by example
I Run your experiments in random order, collect results:
Notes:
I we don’t need to run an experiment at the baseline (it can be useful though)
I baseline at $T = \frac{1}{2}(338 + 354)$ and $S = \frac{1}{2}(1.25 + 1.75)$, i.e. baseline at (346 K; 1.5 g/L) = midpoint of the factorial
I if we had replicate experiments, then use the average of the response variable
Factorial designs: by example
Analysis: Main effects
I Main effect: difference from high to low level
I Where would you run your next experiment(s) to improve yield?
Analysis: Main effects
I No computer? Use an interaction plot (see notes for section 1 of the
course)
I Lines are roughly parallel in this case
I The numbers “1”, “2”, “3”, “4” refer to the experiment number in standard order
Analysis: interaction effects
Experiment   T [K]        S [g/L]          y [%]
1            − (390 K)    − (0.5 g/L)      77
2            + (400 K)    − (0.5 g/L)      79
3            − (390 K)    + (1.25 g/L)     81
4            + (400 K)    + (1.25 g/L)     89
I Main effect of T:
I Main effect of S:
Analysis: interaction effects
Experiment   T [K]        S [g/L]          y [%]
1            − (390 K)    − (0.5 g/L)      77
2            + (400 K)    − (0.5 g/L)      79
3            − (390 K)    + (1.25 g/L)     81
4            + (400 K)    + (1.25 g/L)     89
I Main effect of T: 5% per 10 K; but reported as 2.5% per 5 K
I $\Delta T_{S+}$ = 8% per 10 K
I $\Delta T_{S-}$ = 2% per 10 K
I Main effect of S: 7% per 0.75 g/L; reported as 3.5% per 0.375 g/L
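The arithmetic behind these numbers, written out from the table (each main effect is taken here as the average of the two single-level differences):

```latex
\begin{aligned}
\Delta T_{S+} &= 89 - 81 = 8, & \Delta T_{S-} &= 79 - 77 = 2, & \text{main effect of } T &= \tfrac{1}{2}(8 + 2) = 5 \\
\Delta S_{T+} &= 89 - 79 = 10, & \Delta S_{T-} &= 81 - 77 = 4, & \text{main effect of } S &= \tfrac{1}{2}(10 + 4) = 7
\end{aligned}
```

The reported values (2.5 and 3.5) are half of these effects, i.e. the change in y from the midpoint to the high level.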
Analysis: interaction effects
I Lines not parallel
I Indicates magnitude of effect is not the same at both levels of the variable being held constant
I Implies there is an interaction
I In this case, interaction between T and S
I could also be called the S and T interaction: symmetrical
I called the T × S interaction (or S × T interaction)
I it is a 2-factor interaction (2fi)
Analysis: interaction effects
Recall the system with no interaction (earlier example):
I Main effect of T:
I $\Delta T_{S+}$ = −11% per 16 K
I $\Delta T_{S-}$ = −9% per 16 K
I Main effect of S:
I $\Delta S_{T+}$ = −7% per 0.5 g/L
I $\Delta S_{T-}$ = −5% per 0.5 g/L
Analysis: interaction effects
System with interaction (second example):
I Main effect of T: 5% per 10 K
I $\Delta T_{S+}$ = 8% per 10 K
I $\Delta T_{S-}$ = 2% per 10 K
I Main effect of S: 7% per 0.75 g/L
I $\Delta S_{T+}$ = 10% per 0.75 g/L
I $\Delta S_{T-}$ = 4% per 0.75 g/L
I There was an important phenomenon that we did not capture with the main effects alone
I The main effects are quite different for each estimate.
I We need “something else” to capture this interaction
Analysis: interaction effects
I T interaction with S:
I ∆y due to T at high S: +8
I ∆y due to T at low S: +2
I The half difference: [+8 − (+2)]/2 = 3
I S interaction with T:
I ∆y due to S at high T: +10
I ∆y due to S at low T: +4
I The half difference: [+10 − (+4)]/2 = 3
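The half differences above can be checked with a few lines of plain Python. This is a sketch using the four responses from the second example; the dictionary keyed by coded (T, S) levels is just one convenient representation.

```python
# Responses y [%] from the second (high-interaction) example,
# keyed by the coded levels (T, S)
y = {(-1, -1): 77, (+1, -1): 79, (-1, +1): 81, (+1, +1): 89}

# Effect of T at each level of S, and effect of S at each level of T
t_at_high_s = y[(+1, +1)] - y[(-1, +1)]
t_at_low_s = y[(+1, -1)] - y[(-1, -1)]
s_at_high_t = y[(+1, +1)] - y[(+1, -1)]
s_at_low_t = y[(-1, +1)] - y[(-1, -1)]

# Main effects: average of the two single-level effects
main_t = (t_at_high_s + t_at_low_s) / 2
main_s = (s_at_high_t + s_at_low_t) / 2

# Interaction: half the difference; the same either way round (symmetrical)
ts_interaction = (t_at_high_s - t_at_low_s) / 2
st_interaction = (s_at_high_t - s_at_low_t) / 2

print(main_t, main_s, ts_interaction, st_interaction)
```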
Interpretation:
I T and S increase y by a greater amount when both are high
I Similarly, both terms reduce y when they are of opposite sign.
Interaction terms dominate on a ridge, and are important as
Visualizing the interaction: we are on a ridge
I T interacts with S:
I ∆y due to T at S+: +8
I ∆y due to T at S−: +2
I S interacts with T:
I ∆y due to S at T+: +10
I ∆y due to S at T−: +4
T and S increase y when they both operate together (they are
Analysis by least squares modelling
I Return to the system with little interaction:
Experiment   T [K]        S [g/L]          y [%]
Baseline     346 K        1.50 g/L
1            − (338 K)    − (1.25 g/L)     69
2            + (354 K)    − (1.25 g/L)     60
3            − (338 K)    + (1.75 g/L)     64
4            + (354 K)    + (1.75 g/L)     53
I Standard form: $\frac{\text{variable} - \text{center point}}{\text{range}/2}$
I $T_- = \frac{338 - 346}{(354 - 338)/2} = \frac{-8}{8} = -1$
I $S_- = \frac{1.25 - 1.50}{(1.75 - 1.25)/2} = \frac{-0.25}{0.25} = -1$
I $T_+ = +1$
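The standard form is a one-line transformation. The helper names `to_coded` and `to_real` below are hypothetical; they are not from the course material.

```python
def to_coded(value, low, high):
    """Convert a real-world value to coded units: -1 at low, +1 at high."""
    center = (low + high) / 2
    half_range = (high - low) / 2
    return (value - center) / half_range


def to_real(coded, low, high):
    """Inverse of to_coded: convert coded units back to real-world units."""
    center = (low + high) / 2
    half_range = (high - low) / 2
    return center + coded * half_range


print(to_coded(338, 338, 354), to_coded(354, 338, 354))    # T at low and high levels
print(to_coded(1.25, 1.25, 1.75), to_real(0, 1.25, 1.75))  # S at low level; S at baseline
```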
Analysis by least squares modelling
Least squares model
$$y = \beta_0 + \beta_T x_T + \beta_S x_S + \beta_{TS} x_T x_S + \varepsilon$$
$$y = b_0 + b_T x_T + b_S x_S + b_{TS} x_T x_S + e$$
I 4 parameters to estimate: $b_0, b_T, b_S, b_{TS}$
I 4 data points
I Zero degrees of freedom (i.e. SE = 0, no confidence intervals possible)
Analysis by least squares modelling
Aim: Write out the LS model equation for the 4 data points; stack them as rows in a matrix.
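For example, for the first experiment from the standard order table, which was run at low T and low S:
$$y_1 = b_0 + b_T x_T + b_S x_S + b_{TS} x_T x_S + e_1$$
$$y_1 = b_0 + b_T T_- + b_S S_- + b_{TS} T_- S_- + e_1$$
$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix} =
\begin{bmatrix}
1 & T_- & S_- & T_-S_- \\
1 & T_+ & S_- & T_+S_- \\
1 & T_- & S_+ & T_-S_+ \\
1 & T_+ & S_+ & T_+S_+
\end{bmatrix}
\begin{bmatrix} b_0 \\ b_T \\ b_S \\ b_{TS} \end{bmatrix} +
\begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \end{bmatrix}$$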
Visualizing the least squares modelling
I Least squares model for DOE in 2 factors
I Interaction term is small: blue plane is flat
Analysis by least squares modelling
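$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix} =
\begin{bmatrix}
1 & T_- & S_- & T_-S_- \\
1 & T_+ & S_- & T_+S_- \\
1 & T_- & S_+ & T_-S_+ \\
1 & T_+ & S_+ & T_+S_+
\end{bmatrix}
\begin{bmatrix} b_0 \\ b_T \\ b_S \\ b_{TS} \end{bmatrix} +
\begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \end{bmatrix}$$
$$\begin{bmatrix} 69 \\ 60 \\ 64 \\ 53 \end{bmatrix} =
\begin{bmatrix}
1 & -1 & -1 & +1 \\
1 & +1 & -1 & -1 \\
1 & -1 & +1 & -1 \\
1 & +1 & +1 & +1
\end{bmatrix}
\begin{bmatrix} b_0 \\ b_T \\ b_S \\ b_{TS} \end{bmatrix} +
\begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \end{bmatrix}$$
$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e}$$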
X matrix is trivial to set up:
I Intercept column: is always a column of 1’s
I xT column: comes directly from the standard table
I xS column: comes directly from the standard table
I xTxS column: element-by-element product of the xT and xS columns
Analysis by least squares modelling
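I $X^TX = \begin{bmatrix} 4 & 0 & 0 & 0 \\ 0 & 4 & 0 & 0 \\ 0 & 0 & 4 & 0 \\ 0 & 0 & 0 & 4 \end{bmatrix}$
I $X^Ty = \begin{bmatrix} 246 \\ -20 \\ -12 \\ -2 \end{bmatrix}$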
Analysis by least squares modelling
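$$\mathbf{b} = (X^TX)^{-1}X^Ty =
\begin{bmatrix} 1/4 & 0 & 0 & 0 \\ 0 & 1/4 & 0 & 0 \\ 0 & 0 & 1/4 & 0 \\ 0 & 0 & 0 & 1/4 \end{bmatrix}
\begin{bmatrix} 246 \\ -20 \\ -12 \\ -2 \end{bmatrix} =
\begin{bmatrix} 61.5 \\ -5 \\ -3 \\ -0.5 \end{bmatrix}$$
I $y = b_0 + b_T x_T + b_S x_S + b_{TS} x_T x_S + e$
I $y = 61.5 - 5x_T - 3x_S - 0.5x_Tx_S + e$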
You can easily calculate these effects by hand:
I (+69 + 60 + 64 + 53)/4 = 61.5
I (−69 + 60 − 64 + 53)/4 = −5
I (−69 − 60 + 64 + 53)/4 = −3
I (+69 − 60 − 64 + 53)/4 = −0.5
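For a computer check of the same numbers, the sketch below builds X from the coded columns and solves the normal equations. It assumes NumPy; the variable names are illustrative.

```python
import numpy as np

# Coded factor levels in standard order, and the measured conversions [%]
x_t = np.array([-1, +1, -1, +1])
x_s = np.array([-1, -1, +1, +1])
y = np.array([69, 60, 64, 53])

# Columns: intercept, x_T, x_S, and the interaction x_T * x_S
X = np.column_stack([np.ones(4), x_t, x_s, x_t * x_s])

XtX = X.T @ X                   # diagonal, because the columns are orthogonal
Xty = X.T @ y
b = np.linalg.solve(XtX, Xty)   # solves (X^T X) b = X^T y

residuals = y - X @ b           # all zero: 4 parameters fitted to 4 points
print(XtX)
print(Xty)
print(b)
```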
Analysis by least squares modelling
1. $X^TX$: zeros on off-diagonals
I the columns of X are orthogonal
I each column is varied independently of the others
I calculate the kth slope coefficient separately: $b_k = \frac{x_k^T y}{x_k^T x_k}$
2. Interpret $b_T = -5$?
I $b_T$ is the change in y when the normalized temperature $x_T$ changes by 1 unit
I Changing $x_T$ from 0 to 1 implies $T_\text{actual}$ changes from 346 K to 354 K (baseline to high level)
I Changing $x_T$ from −1 to 0 implies $T_\text{actual}$ changes from 338 K to 346 K (low level to baseline)
I a 5% decrease in conversion for every 8 K increase in temperature
3. Now interpret $b_S = -3$?
4. How to use this model for a prediction?
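One possible answer to the last question, as a sketch: convert the real-world settings to coded units, then substitute into the fitted model. The function name `predict_conversion` and the trial point (350 K, 1.6 g/L) are illustrative assumptions. Predictions are only reasonable inside the range covered by the experiments.

```python
def predict_conversion(temp_k, substrate_g_per_l):
    """Predict conversion [%] from the fitted 2-factor model.

    Center points and half-ranges come from the factorial in these slides:
    T: 338 to 354 K (center 346, half-range 8)
    S: 1.25 to 1.75 g/L (center 1.50, half-range 0.25)
    """
    x_t = (temp_k - 346.0) / 8.0
    x_s = (substrate_g_per_l - 1.50) / 0.25
    return 61.5 - 5.0 * x_t - 3.0 * x_s - 0.5 * x_t * x_s


# The model reproduces the four corner points exactly (zero degrees of freedom)
print(predict_conversion(338, 1.25), predict_conversion(354, 1.75))

# An illustrative interior point, chosen only for demonstration
print(round(predict_conversion(350, 1.6), 2))
```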
Analysis by least squares modelling
I The least squares model was $y = 61.5 - 5x_T - 3x_S - 0.5x_Tx_S + e$
I The geometric construction was:
Analysis by least squares modelling
Return to the system with high interaction
I Baseline: T = 395 K and S = (1.25 + 0.5)/2 = 0.875 g/L
I Calculate deviation variables
I Build the matrices and calculate $b = (X^TX)^{-1}X^Ty$
I Verify at home: $y = 81.5 + 2.5x_T + 3.5x_S + 1.5x_Tx_S$
The large interaction is confirmed in the least squares model by the relatively large coefficient of 1.5 on the $x_Tx_S$ term.
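A sketch of the "verify at home" step, assuming NumPy. Because the coded columns are orthogonal, each coefficient can be computed on its own as $x_k^T y / x_k^T x_k$, with no matrix inversion.

```python
import numpy as np

# High-interaction example in real-world units, standard order
temp = np.array([390.0, 400.0, 390.0, 400.0])   # T [K]
subs = np.array([0.5, 0.5, 1.25, 1.25])         # S [g/L]
y = np.array([77.0, 79.0, 81.0, 89.0])          # y [%]

# Deviation (coded) variables: (value - center) / (range / 2)
x_t = (temp - 395.0) / 5.0
x_s = (subs - 0.875) / 0.375

# Each coefficient separately: b_k = (x_k . y) / (x_k . x_k)
columns = {"b0": np.ones(4), "bT": x_t, "bS": x_s, "bTS": x_t * x_s}
coeffs = {name: float(col @ y) / float(col @ col) for name, col in columns.items()}
print(coeffs)
```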
Analysis by least squares modelling: visualizing it
High interaction system:
Ignoring interaction term:
We are estimating a linear equation using linear least squares.
With interaction term:
DOE of a 3-factor experiment
Plastics molding factory; waste treatment.
I Factor 1: C: chemical compound added (A or B)
I Factor 2: T: treatment temperature (72F or 100F)
I Factor 3: S: stirring speed (200 rpm or 400 rpm)
I y = amount of pollutant discharged [lb]
I Categorical variables: A = −1 and B = +1 (or vice versa)
DOE of a 3-factor experiment
Example on the board:
1. Geometric illustration of the data
2. Calculate main effects
3. Calculate the 3 two-factor interactions, and the single 3 factor interaction
I C × T and C × S and T × S and C × T × S
4. Main effects and interactions using least squares (by-hand)
5. Computer verification:
I $y = 11.25 + 6.25x_C + 0.75x_T - 7.25x_S + 0.25x_Cx_T$
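The model matrix for a full $2^3$ factorial can be set up as sketched below, assuming NumPy. The measured responses are not listed on these slides, so the sketch stops at building X and checking its orthogonality; the column ordering is an assumption.

```python
from itertools import product

import numpy as np

# Coded levels in standard order: C alternates fastest, then T, then S
runs = np.array([levels[::-1] for levels in product([-1, 1], repeat=3)])
x_c, x_t, x_s = runs.T

# Intercept, 3 main effects, 3 two-factor interactions, 1 three-factor interaction
X = np.column_stack([
    np.ones(8),
    x_c, x_t, x_s,
    x_c * x_t, x_c * x_s, x_t * x_s,
    x_c * x_t * x_s,
])

# Orthogonal columns: X^T X is 8 times the identity, so every coefficient
# is simply (column . y) / 8 once the 8 responses are available
print(X.T @ X)
```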
Summary of factorial designs
I Good visual interpretation, even on paper
I Few experiments, but powerful information
I Building blocks for complex designs
I $2^k$ experiments for k factors
I Each factor is varied independently of the others
I Each factor in model can be interpreted independently
I Least squares model easily derived by hand
I Main effects cannot be interpreted separate from their
interactions
I $y = b_0 + b_P x_P + b_Q x_Q + b_{PQ} x_P x_Q + e$
I Sometimes a small effect is desirable: implies y is not sensitive to that factor
Summary of factorial designs
Much more efficient than changing one single variable at a time (COST)
I COST: cannot estimate interactions
Review: Change one variable at a time
If COST in a cross shape (experiments 2, 3, 4, 5, 6):
I we cannot estimate interactions
I only a single estimate of each main effect
I can be rescued into a full factorial: e.g. use experiments 2, 3, 6 and add a new point below 2, to the right of 6