A (very) short course on the analysis of Water Quality Data

(1)

Carl James Schwarz

Department of Statistics and Actuarial Science Simon Fraser University

Burnaby, BC, Canada cschwarz @ stat.sfu.ca

(2)

Introduction

(3)

Objectives:

Review the concepts of mean, standard deviations, standard errors, confidence intervals and percentiles in terms of water quality studies.

Show how compare water quality using synoptic reading at two or more sites.

Use some sample datasets provided by MOE to examine how to deal with problematic datasets with suggestions on how to proceed.

(4)

Excel 2010 ships with some additional statistical (and other) analysis in a tookpack, but this may not be started.

Check the Data → Analysis menu to see if the Analysis Took Pack shows.

(5)

(6)

Select and load the Analysis TookPack

This is not the best package (very limited functionality). You are better off using a proper package such as JMP/ R/ SAS, etc.

(7)

Helsel, D. R. and Hirsch, R. M. (online).

Statistical Methods in Water Resources

http://pubs.usgs.gov/twri/ twri4a3/

(8)

McBride, G. B. (2005). Using Statistical Methods for Water Quality

Management. Wiley, New York

(9)

Helsel, D. R.(2005). Non-detects and data analysis: Statistics for censored environmental data. Wiley, New York.

(10)

http://people.stat.sfu.ca/~cschwarz/CourseNotes/

(11)

What is population of interest? Water quality parameter over the entire year?

What is question of interest?

Interested in average water quality? Interested in spread of water quality?

Interested in higher percentiles (e.g. 95th_percentile?)

Interested in comparing averages across sites?

Interested in comparing percentiles across sites? (not part of this course)

Interested in relationship of water quality to other covariates (such as rainfall, temperature)? (not part of this course)

(12)

The 3 R’s of good data collection

Randomization - this is what makes the sample representative of the population

Every member of population should have a known probability of selection. Often this is equal.

Replication - determines how reproducible are your estimates.

This controls the reproducibility of your results.

Stratification - controls for some explain sources of variation in your readings.

Synoptic readings to compare sites controls for “time” effects.

No amount of statistical wizardry can rescue badly collected data!

(13)

Often has a lower bound of zero and no negative values are possible (usually not a problem).

Outliers, typically on the high side (can greatly affect results if not accounted for).

Long right tails (skewness). Rather than being symmetric, there is a long tail to the right (try a log-transform). Non-normally distribution data. (Try a log-transform or resistant methods).

Censoring (e.g. below detection limits (BDL)). (not part of this course)

Seasonal Patterns (e.g. values tend to be higher in the summer than winter). Try stratification or modeling.

Autocorrelation (e.g. measurements taken closely together in space and time tend to be related). (not part of this course, but see webinar for Air Quality branch)

(14)

No formal definition of an outlier other than a data value that is unusual.

Dealing with outliers:

Through away “old-fashioned” rules such as Grubbs rule. Check data and see if a typing error etc.

Perhaps population is more complex than previously thought. Try analysis with outlier in and outlier out. If it makes no difference, who cares.

(15)

MCAR (Missing completely at random). The missingness is unrelated to the response or another covariate in a study. For example, the sample vial broke when accidentally dropped. No bias introduced into results but standard error is larger than with complete data.

MAR (Missing at random). The missingness is unrelated to the response, but may be related to another covariate. For example, rather than having equal number of samples in each month, fewer samples can be obtained in January because of access issues. If you know the adjustment factor, the collected data can be reweighed (not part of this course).

IM (Informative missing). The missingness is related to the response value. For example, the concentration is so large that it exceeds the detection limits of the machine. Censoring is relatively easy to deal with modern software (not Excel). Other forms of informative missings are very difficult to deal

(16)

You say “toe-mah-toe” and I say “toe-may-toe”.

Don’t be afraid of transformations, e.g. pH is on a log-scale. Most common transformation in water quality is log() transform -discussed later.

Sometimes not obvious what is correct scale (e.g. mpg or L/100 km?)

See

Hoaglin, David C., 1988,

Transformations in everyday experience: Chance 1, 40-45.

(17)

(18)

Objectives:

Mean, Standard Deviation, Empirical Intervals Mean, Standard Errors, Confidence Intervals Mean vs. Geometric Mean

Median vs. Mean

Quantiles and Tolerance Intervals

(19)

Sample data set: Arsenic (As) Concentrations from Ground Water (See H & H, Chapter 3, page 71)

(20)

Data must be collected using RRRs – otherwise summary statistics are nonsense!

Mean (Y ) = simple average of values. Use the average() function in Excel. YAs = 98.35.

Mean are greatly influenced by outliers so these need to be dealt with prior to analysis.

(21)

Data must be collected using RRRs – otherwise summary statistics are nonsense! Standard Deviation s = q (Yi−Y )2 n−1 .

This is a measure of spread of the INDIVIDUAL values in the sample (and by inference in the population).

Use the stdev.s() function in Excel. sAs = 144.7.

Standard deviations are greatly influenced by outliers so these need to be dealt with prior to analysis.

(22)

Empirical Rule: Y ± 2s contains about 95% of INDIVIDUAL data values.

Use the average() +/- 2*stdev.s() function in Excel. Empirical Rule for As is (−191 → 387) which contains 24/25 = 96% of observations. This gives a rough range for FUTURE observations.

Empirical intervals are greatly influenced by outliers so these need to be dealt with prior to analysis.

(23)

Every time we take a NEW sample, the sample mean (Y ) will change. How reproducible are our results?

The standard error (SE) measures the variability of future SAMPLE MEANS.

The standard deviation (SD) measures the variability of future OBSERVATIONS.

Under Simple Random Sampling the SE of the sample mean is √s n.

Different sample schemes and different statistics have different SE formula. Excel does NOT have a built-in function.

SE = average(CELLRANGE)/sqrt(count(CELLRANGE)) = 28.93. SE are greatly influenced by outliers so these need to be dealt with prior to analysis.

(24)

A 95% confidence interval is an estimate of the likely values of MEANS from 95% future samples.

A 95% empirical interval is an estimate of the likely values of OBSERVATIONS from future samples.

Under Simple Random Sampling, a 95% c.i. is found as Y ± t_α/2,n−1√s

n

95% c.i. lower bound = average(CELLRANGE) -/+

confidence.t(.05,stddev.s(CELLRANGE),count(CELLRANGE)) For As data the 95% c.i. is (38.6 → 158).

CI are greatly influenced by outliers so these need to be dealt with prior to analysis.

(25)

(26)

(27)

Usual summary statistics. Notice that the half-width of the 95% CI is provided and you still must compute the upper and lower bound.

98.352 − 59.72 = 38.6 98.352 + 59.72 = 158.1

(28)

The mean measures the average value in the data set.

The standard deviation (SD) measures the variability of future OBSERVATIONS.

The standard error (SE) measures the variability of future SAMPLE MEANS.

A 95% empirical interval (EI) is an estimate of the likely values of OBSERVATIONS from future samples.

A 95% confidence interval (CI) is an estimate of the likely values of MEANS from 95% future samples.

The SD does not depend on n.

The SE and CI decline as a function of √1 n

These statistics are greatly influenced by outliers.

(29)

Means, SD implicitly assume a symmetric distribution so that averages and spread are meaningful.

SE, CI are acceptable regardless of distribution in large samples. [How big is large??]. But they perform better with symmetric distributions in small samples.

For skewed data (e.g. long right-tail) a logarithmic transform often improves inference.

(30)

(31)

Two common transformation in water quality are log() vs. ln() Natural logarithms - ln() - base e, e.g. ln(100) = 4.60 and e4.60_{= exp(4.60) = 100.}

Common logarithsm - log() - base 10, e.g. log(100) = 2 and 102= 100.

Relationship: ln(Y ) ≈ 2.3 log(Y ). So it doesn’t matter what you use, but be clear in documents what is used.

CAUTION: Many computer packages (except Excel) use log() to represent the natural logarithmic transformation, and log10() to represent the common logarithmic transformation.

(32)

Effect of transformation:

(33)

Arithmetic Mean - Y = P Yi

n .

Geometric Mean - GMY = n

√

Y1Y2. . . Yn.

Can also be found by using logarithmic transformation: Geometric Mean - GMY = antilog (log (Y )) Example:

Y = (5, 12, 20). Y = 12.33 GMY = 3 √ 5 × 12 × 20 = 10.62. GMY = exp(ln(5)+ln(12)+ln(20)₃ ) = exp(1.609+2.485+2.996₃ ) = exp(2.363) = 10.62. GMY ≤ Y (this is always true)

GMY is less sensitive to outliers than Y and is often close to

(34)

You can compute the usual statistics on the log() data ...

etc.

(35)

transform mean, EI, and CI bounds. DO NOT BACK TRANSFORM SD or SE.

(36)

Don’t be afraid to use log() transformation, e.g. pH log() vs. ln(). CAUTION. Packages assume log() is natural logarithm.

Suitable for highly skewed data.

CAUTION on back transformation of mean and CI bounds -these refer to geometric mean in population (or very close to median).

(37)

Quantiles (Percentiles) are measures of LOCATION within the dataset.

Qp is the value of Y with at least 100p% of INDIVIDUAL values

≤ Qp and at least 100(1 − p)% of INDIVIDUAL values ≥ Qp.

Examples:

Q.50 = 50th percentile = median has at least 50% of

INDIVIDUAL values ≤ Q.50 and at least 50% of INDIVIDUAL

values ≥ Q.50.

Q.95 = 95th percentile has at least 95% of INDIVIDUAL

values ≤ Q.95 and at least 5% of INDIVIDUAL values ≥ Q.95.

(38)

Finding the p quantiles. [There are at least 5 different ways(!) to compute percentiles.]

Sort the data from smallest to largest. Compute (n + 1)p.

If (n + 1)p is integer, use Y[(n+1)p]+, i.e. the ((n + 1)p)th

observation after sorting.

if (n + 1)p is fractional, interpolated between two bracketing observations.

In large samples, all of the methods will all give similar values. Example

As data. n=25. Median = 50th percentile. (n + 1)p = 13. Use the 13th observation = 19.

The median of (13, 11, 17) is 13.

The median of (13, 11, 17, 15) is (13 + 15)/2 = 14. The median of (13, 11, 20234234) is still 13.

The median of (13, 11, < 5, < 5, 9) is 9 where < 5 indicates BDL.

(39)

Quantiles are (theoretically) invariant to log () transformation, i.e. if you find the quantile on transformed data and then back transform, you get the same value as the quantile on the original data. But, because of interpolation, the values may be slightly different.

Excel function percentile.inc(DATARANGE, p) looks at 100p/(n − 1) and interpolates if this is not an integer.

(40)

Comparison of Excel vs JMP (Different interpolation rules).

(41)

Quantile plots:

1 _{Sort the data (Y ) from smallest to largest. Even if data is}

censored, it usually can be ordered.

2 _{Number the sorted values from 1 to n.}

3 _{Create the plotting positions as p}_i _{= (i − 0.4)/(n + 0.2).}

[Cunnane plotting positions (other formula exist)]. When tied values are present, each is assigned a separate plotting value and the tied values will create a vertical “cliff” on the plots.

4 Plot the Y variable on the bottom axis, and the plotting

(42)

Refer to [As] data:

(43)

(44)

Refer to log([As]) data:

Read percentiles off the graph.

(45)

If you are willing to assume a specified distribution (e.g. normal) then you can estimate percentiles based on sample mean and sample standard deviation.

c Qp= Y + zs p z .50 0.00 .75 0.67 .80 0.84 .90 1.28 .95 1.65 .98 2.05 .99 2.33

(46)

For small samples, the sample standard deviation (s) is a biased estimator of σ.

The adjusted estimate of the percentile is:

\

Qp,adjusted = Y + z

s M(df )

where M(df ) is the mean of a standardized chi-distribution. [See my notes for more details.]

(47)

(48)

Avoid NAKED ESTIMATES, e.g. never report a mean with out a SE.

Tolerance intervals are confidence intervals for percentiles.

“We are 95% confident that no more than 10% of observations exceed xxxxx”.

This is a one-sided 95% confidence interval for the 90th percentile.

(49)

Let p be the quantile of interest. Compute Rupper = n(p) + zpn(p)(1 − p)

The upper bound of the tolerance interval is then Y [Rupper].

For a 95% confidence interval use z = 1.645 as seen earlier. Example, with n = 500, find a 95% one-sided tolerance interval for the 99th percentile, i.e. you will be 95% confidence that 1% or less of future observations will exceed this value.

The estimated quantile is the 500(.99) = 495th value. z = 1.645: Rupper = 500(.99) + 1.645p500(.99)(1 − .99) =

495 + 3.66 = 498.66

The tolerance upper bound is then Y [498.36]. Some interpolation may be required to find Y [498.36].

(50)

CAUTION:

It may require extrapolation outside range of dataset (and cannot be computed).

Example, if n = 10 then

Rupper = 10(.99) + 1.645p10(.99)(1 − .99) = 9.9 + .517 = 10.41st

observation which is out of the range of the dataset (!). [Also for small samples replace the normal distribution with a binomial distribution as shown in Conover (1999)]

(51)

(52)

You will be (1 − pn) × 100% confident that at least the fraction p of the future observations will lie below the maximum of the observed data.

For the As dataset, n = 25

You are (1 − .9525) × 100% = 72% confident that at least 95% of future observations will lie below the largest value, or 72% confident that no more than 5% of observations will exceed the maximum.

You are (1 − .9925_{) × 100% = 22% confident that at least}

99% of future observations will lie below the largest value, or only 22% confident that no more than 1% of observations will exceed the maximum

If n = 5 (not much useful information available)

You are (1 − .955) × 100% = 23% confident that at least 95% of future observations will lie below the largest value,or only 23% confident that no more than 5% of observations will exceed the maximum.

(53)

(54)

The previous slides show that very little information is available in small samples if you are not willing to assume a distribution for the data.

If the assumption of normality is sensible, you can do better. Basic form is :

TIp= Y + ks

where k is obtained from tables.

http://www.itl.nist.gov/div898/handbook/prc/section2/ prc263.htm

Gerow, K. and Bielen, Confidence Intervals for Percentiles: An Application to Estimation of Potential Maximum Biomass of Trout in Wyoming Streams. North American Journal of Fisheries

Management 19, 149-151.

http://dx.doi.org/10.1577/1548-8675(1999)019<0149: CIFPAA>2.0.CO;2

http://statpages.org/tolintvl.html and Excel spreadsheet. CAUTION: TI are very sensitive to the normality assumption!54 / 118

(55)

(56)

Quantiles measure LOCATION of INDIVIDUAL observations (e.g. median or 95th percentile.

Tolerance Intervals (TI) are confidence intervals on

percentiles, e.g. you are 90% sure that no more than 5% of INDIVIDUAL values exceed xxxxx.

Quantiles can be estimated:

Using non-parametric methods. Find Qp= Ynp. [The (np)

varies among methods.]

Assuming a distribution (e.g. normal). Qp= Y + zs

Tolerance intervals can be estimated:

Using non-parametric methods. Require large samples. Based on min/max of observed data. Frightening how little knowledge is available in small samples.

Based on normal distribution. Assumption of normality is crucial for extreme quantiles (!)

(57)

Data must be collected using RRRs; otherwise summary statistics have no meaning.

Are you interested in the AVERAGE

Estimate mean, SE, and CI

CI say NOTHING about INDIVIDUAL observations.

Are you interested in the SPREAD

Estimate mean, SD, and EI

Two-sided tolerance intervals (not covered in this webinar).

Are you interested in EXTREMES (e.g. higher order quantiles)

Estimate quantiles and TI

Non-parametric (with large samples), or parametric (with small samples), but the latter is EXTREMELY sensitive to

(58)

Assessing Normality - Normal Quantile Plots

(59)

Objectives:

Constructing a normal quantile plot. Assessing normality based on quantile plot.

(60)

Quantile-plots are a graphical method to assess distributional assumptions:

Easier to assess “straight-line” fit rather than fitting to a curve It is not necessary to create arbitrary bins as is needed for histograms.

All of the data are displayed unlike box-plots.

Every point in the data can be displayed without overlap.

(61)

Construction:

1 Sort the data (Y ) from smallest to largest.

2 _{Number the sorted values from 1 to n. No adjustment is}

made for tied values.

3 _{Create the plotting positions as p}_i _{= (i − 0.4)/(n + 0.2).} 4 _{Compute the normal quantile (Q}_p_{) using the NORMINV(p}_i₎

or the NORM.S.INV(pi) function in Excel.

5 _{Plot Q}_p _{(on the X-axis) vs Y on the Y-axis.}

If the distribution is correct for the data, the points should lie on an approximate straight line.

Slope of the line estimates s (the sample standard deviation); Value of the curve at X = 0 estimates the mean.

(62)

Quantile plot of the original data.

(63)

(64)

(65)

(66)

(67)

(68)

Q-Q plots are easy way to assess normality.

Look for departures from linearity (but don’t over-interpret the plots).

For inference about the MEAN, not too crucial that underlying distribution is normal except in (very) small sample sizes. For inference about extreme percentiles, it is crucial that normality assumptions be satisfied.

(69)

(70)

Objectives:

How does WQ compare between 2+ sites Types of designs (paired vs. unpaired) Case study - French Creek

(71)

Types of designs Paired/blocked

Paired (2 sites) or Blocked (2+ sites) designs are similar Synoptic (same day) or near synoptic (same week?) readings are taken

Interested in the average “difference”.

Not necessary to have a random sample to times. It may be preferable to select times to enhance contrast (e.g. sample at low and peak flows).

Paired t-test; Single Factor Randomized Blocked ANOVA

Independent samples (not part of this course).

2+ sites to be compared.

Separate, random samples taken from each site. Interested in compare the MEANS across the sites. Independent sample t-test; Single Factor Completely Randomized Design ANOVA (a.k.a. One-way ANOVA)

(72)

French Creek - data set available Five locations along French Creek

Monthly + two sets of 5-in-30 samples starting late-July and late-October

(Near) Synoptic data

How do the readings compare across the sites?

(73)

Compare readings in 2 sites. Align the data by date

Find the difference in readings or log(ratio) of readings

Use the difference if the range of values is small so that the differences between sites are relatively similar (e.g. a consistent difference of about 2 units).

Use the log(ratio) if the range of values is large so that differences between sites are NOT relatively similar (e.g. range from 2 to 200 units), but RATIO of readings (e.g. one site’s readings is about twice the other site).

Compute the mean difference, se of mean difference, 95% confidence interval for mean difference and see if the 95% confidence interval includes 0.

If want a p-value, test the hypothesis that mean difference in population is zero.

(74)

Turbidity Readings

Sampling BARCLAY GRAFTON NEW WINCH

Week BRG COOMBS ROAD HWY -ESTER

2010.0427 1.9 1.1 0.9 1.9 1.4 2010.0525 1 0.7 0.6 0.7 1 2010.0622 1.8 0.6 0.3 0.7 1 2010.0726 0.7 0.3 0.8 0.5 0.8 2010.0816 0.7 0.3 1.5 0.6 0.9 2010.0824 0.8 0.5 0.4 0.4 1.4 2010.0831 0.9 0.3 1.1 0.6 1.6 2010.0907 0.7 0.3 1.6 0.5 2.5 2010.0914 0.8 0.3 0.2 0.4 0.7 2010.102 0.5 0.4 0.2 0.3 0.6 2010.1026 4.3 2 1.2 2.3 1.1 2010.1102 7.4 4 2.1 4.2 1.9 2010.1109 2.5 1.3 1.3 2 1.3 2010.1116 1.9 1.1 0.6 0.8 2010.1229 5.3 3.8 4.3 6.2 3.1 74 / 118

(75)

(76)

Compute the difference or log(ratio) between two sites (e.g. Barclay vs. Coombs).

Drop cases where missing values are present.

(77)

(78)

Compute using Excel functions

Number of differences/log-ratios - count() Mean difference/log-ratio - average() Std dev of diff/log-ration - stdev.s () 95% CI half width - confidence.t() 95% CI mean ± 95% CI half width

t-test for testing equality of means - t.test() Also take anti-logs of mean and 95% CI for log-ratio

(79)

(80)

Barclay averages .947 NTU higher than Coombs with a 95% c.i. of between 0.45 and 1.44 NTU higher than Coombs. Barclay is, on average, approx 1.98x (95% CI 1.70x to 2.30x) larger than Coombs.

CAUTION: 95% c.i. say nothing about individual differences or individual ratios, i.e. NOT CORRECT TO SAY that 95% of differences lie between 0.45 and 1.44 NTU.

(81)

CAUTIONS: Excel does not deal with missing data very nicely. Look what happens if last reading for Coombs is missing.

(82)

Similar output from JMP with additional graphs to show that log(ratio) is likely “better” choice than difference:

(83)

(84)

A formal p-value is un-necessary but can be obtained as well.

Because the p-values are very small, there is strong evidence of a difference (on average) between the two sites.

(85)

Pairing is induced by synoptic readings

Delete any pairs with missing values. More advanced software (e.g. R/SAS/JMP) can also incorporate missing data but this is beyond scope of course.

CAUTION: EXCEL does NOT handle missing data well.

Compute differences and/or log(ratio)

Use differences if readings are similar over time

Use log(ratio) if large variation in readings and RATIO is consistent over time

Compute mean difference or log-ratio and 95% confidence interval for population mean difference or log-ratio.

Is 0 included in the 95% confidence interval? If so, then there is no evidence of a difference (on average).

CAUTION: 95% confidence interval say NOTHING about individual differences or log-ratios

Use anti-log on mean and 95% c.i. to convert log-ratio back to ratios.

(86)

It is possible to extend the analysis to 3+ more sites with synoptic readings.

Make an array of week by site Record actual values or ln(values)

CAUTION: EXCEL does not ALLOW for ANY missing data. You must exclude entire sampling week if any data is missing. This can lead to a LARGE loss of data.

JMP/SAS/R can gracefully deal with missing data.

(87)

Notice that readings on 2010.1116 are dropped FOR all sites because missing data at New Highway

(88)

Known as a Randomized Block Design

Blocks = Synoptic Times = device for “pairing up” observations that are affected in similar way.

Assume that differences among sites is relatively consistent across the blocks == NO INTERACTION between blocks and sites.

(OR) Assume that ratio among sites is relatively consistent across the blocks so that differences of log(values) is relatively consistent (e.g. site A might always be about 2x larger than site B).

Again start with plot of values as seen earlier

(89)

(90)

(91)

(92)

Here, column effect = effects of SITES

P-value = 0.02, so there is some evidence of a consistent difference in the MEANS among sites

Does NOT indicate which sites could have the same or different means. Need to follow-up with a Tukey Multiple Comparison Procedure to identify which pairs of sites could have different means.

(93)

Look at estimates of MARGINAL means:

Critical range is CR = Qa

q

MSE

# blocks where Qa is value from

Studentized range with df1 = #sites and df2 = dfMSE. In this case

we look for (5, 52) df (see previous slide) at http://www.stat. duke.edu/courses/Spring98/sta110c/qtable.html and find the Qa= 4.20 and CR = 4.20

q

0.65

14 = 0.91.

(94)

This is VERY tedious and error prone in EXCEL – use a proper package such as JMP/ R/ SAS etc.

The output is “automatic” and more-informative and can handle missing data.

(95)

P-value is small, so there is some evidence of a difference in means among the sites. It does not indicate where the difference may lie.

(96)

This indicates which Sites could have the same mean.

Think of ’paint-chips’ to understand overlap in ranges of sites that could be the same.

(97)

Provides estimates of effects and confidence intervals for each pairwise difference.

(98)

Typically used for synoptic data to see if sites are comparable Exactly 2 sites:

Find difference or log(ratio) of readings from both sites. Drop any sites with missing data.

Find mean and 95% confidence interval for difference in MEAN See if 0 is included in the confidence interval.

Find p-value using Paired t-test.

(99)

3+ Site:

Analyze either raw data or log(data).

Use Randomized Block Design analysis (Excel: Two-factor with no replication).

Look at ANOVA table at either Rows/Columns effects that correspond to SITES.

Program Tukey Multiple Comparison procedure by hand (groan)!

CAUTION: EXCEL does not deal with missing values correctly -GIVES WRONG ANSWERS.

CAUTION: EXCEL is very clumsy in finding where the differences lie. You must program Tukey procedure. You will make mistakes in doing this!

CAUTION: EXCEL does not provide other output to check assumptions of the models.

(100)

Case Study French Creek Barclay Bridge

-Turbidity

(101)

(102)

What are the objectives?

Find the distribution of turbidity across the year? Estimate the 95th percentile across the year?

Estimate the mean and percentiles only in November? Problems:

Non-random sampling across the year with some “days” having a higher probability of sampling than other days. More samples deliberately selected in August (when the turbidity is low) and more samples deliberately taken in November when turbidity is high.

High autocorrelation in values taken close together in the “5-in-30” samples. Standard errors will be understated, i.e. you will think you are more precise than you are.

Not clear how to interpret means and percentiles for part of a year. The 95th percentile based on the “5-in-30” will NOT estimate the 95th percentile for the year.

(103)

Example of bias in computing percentiles.

Simulated data to follow data based on previous curve with some random noise around the curve.

Computed 95th percentile based on yearly data and on sample dates data.

(104)

Example of bias in computing means.

Simulated data to follow data based on previous curve with some random noise around the curve.

Computed mean based on yearly data and on sample dates data.

(105)

Not clear what to do with this type of data without some consideration of “filling in” some of the missing data.

(106)

Readings are taken monthly, except “5-in-30” days samples are again taken in August and November.

There are many censored readings (indicated by the < character in the adjacent column.

There duplicate and split samples for QA/QC work. There is seasonality in some of the characteristics.

(107)

Duplicate and split-sample measurements. These are NOT independent observations and the usual way to deal with these is to the the average of the duplicate or split-sample

measurements.

Outliers. The sole outliers for NO3 and Turbidity are synoptic. What happened?

For some variables (e.g. NO3), weak seasonality present so some pooling over all month in a year is a possibility. I am still worried about the “5-in-30” readings as serial correlation may be a problem?

Some variables are highly censored while others are lightly censored. There is no easy way to deal with censoring in Excel other than to either ignore it (i.e. treat the limit as the data value) or use 1/2 of the detection limit as the data value.

(108)

Turbidity and NO3 in Mercantile Creek:

(109)

Several interesting features.

There appears to be seasonality with high readings in October and April, but they are not consistent over time (e.g. look what happened in 2003).

Sampling intensity is not uniform over the years with some months missed and no apparent 5-in-30 sampling occasions. A very large value occurred in November 2002 with very small values on the months on either side of it indicating a very volatile system.

As in the French Creek dataset, it is not clear what the 95th percentile is supposed to be measuring? Just the peak events? Across the entire year?

There appears to be an increasing trend in both the mean and variability. Fitting a trend line to this data set is beyond the capabilities of Excel because of the problems above.

(110)

Try a linear fit to the Turbidity data after dropping the outliers in 2002.

This was done in R (don’t use Excel - gives WRONG results for many regressions!)

More details on fitting linear models available in my course notes.

Estimate Std. Error t value Pr(>|t|) (Intercept) -7.6373521418 2.1271802524 -3.590364 0.0009771295 Date 0.0006675769 0.0001699621 3.927799 0.0003719998 Standard errors may be too small because of the autocorrelation in

the residuals.

Strong evidence of an increase over time.

(111)

(112)

Try a linear fit to the Turbidity data after dropping the outliers in 2002.

Residual plots (lower left) shows increase in variance with mean.

(113)

Try a linear fit to the NO3 data after dropping the outliers in 2002.

This was done in R (don’t use Excel - gives WRONG results for many regressions!)

More details on fitting linear models available in my course notes.

Estimate Std. Error t value Pr(>|t|) (Intercept) -3.453517e-02 2.546656e-01 -0.1356099 0.8928851 Date 1.381042e-05 2.034783e-05 0.6787172 0.5016572 Standard errors may be too small because of the autocorrelation in the residuals.

(114)

Try a linear fit to the NO3 data after dropping the outliers in 2002.

(115)

(116)

RRR’s

No easy way to compute means and percentiles for

non-randomly selected data (e.g. monthly + “5-in-30” data). More flexibility when comparing across sites or trends over time. Some caution needed if readings are too close together (e.g. hourly or daily). See my webinar for Air Quality.

What are you interested in?

Averages - estimate means, SE, and CI Spread - estimate means, SD, and EI

Extremes - estimate percentiles and TI (either parametric or non-parametric). CAUTION. These results are extremely sensitive to violations of the RRRs and distributional assumptions. Assess normality using Normal Quantile plots. Trends - use linear models (see my webinar for Air Quality)

No amount of statistical wizardry can rescue poorly corrected data! You will be severely constrained in your analyses if you only use Excel!

(117)

Missing values?

If missingness is MCAR, then no problems. MAR/ IM are more/ very difficulty to deal with.

Outliers?

Run analysis with outliers in and with outliers out. If no difference then who cares.

(118)

Other possible analyzes not-part of this course:

Quantile-regression where you model the change in

percentiles, e.g. has the 90th percentile changed over time? Testing for changes in the standard deviation over time rather than the mean to see if variability has changed over time. Modeling the number of events (e.g. days exceeding WQ guidelines in a month) and if they change over time.

Further help, contact Carl Schwarz (cschwarz @ stat.sfu.ca)