Step - Down Method - NONPARAMETRIC ANALYSIS OF VARIANCE

NONPARAMETRIC ANALYSIS OF VARIANCE

26.4.3 Step - Down Method

In the step - down method, by contrast, the multiple regression of Y on all the X variables is calculated fi rst. The contribution of each X to a reduction in the SS of the Y values is then computed and the variable giving the smallest reduction eliminated by a specifi c rule.

This variable is then excluded and the process repeated until no variable qualifi es for omission according to the rule employed.

TA B L E 26.1 Stepwise Multiple Regression of Lichen Growth Data Presented in Statnote 25

138 STEPWISE MULTIPLE REGRESSION

26.5 CONCLUSION

Stepwise multiple regression techniques are useful in identifying the major variables infl uencing the dependent variable Y and in ranking them in order of importance. The step - up and step - down methods do not necessarily select the same variables for inclusion in the regression and these differences are magnifi ed when the X variables are themselves intercorrelated. Investigators may sometimes use the step - up method when they wish to defi ne the variables that infl uence Y more rigorously and to exclude variables that make relatively small contributions to the regression. The step - down method may retain more variables, some of which may make small contributions to the regression, but by retaining them, a better prediction may result. Investigators should also note the high probability of making a type 1 error when carrying out stepwise multiple regression, that is, claiming a signifi cant effect when one is not present. Hence, if there were 20 X variables included in a study and none of them actually infl uenced Y , the probability of achieving at least one signifi cant F to enter would be 1 − (1 − 0.05) ²⁰ , that is, 0.642, greater than a 50% chance!

Statnote 27

CLASSIFICATION AND DENDROGRAMS

Purpose of classifi cation.

Classifi cation for “ convenience ” (dissection).

Theory of classifi cation.

Methods of classifi cation.

The unweighted pair - group method using arithmetic averages (UPGMA).

Euclidean distance.

27.1 INTRODUCTION

The analysis of bacterial genomes for epidemiological purposes often results in the pro-duction of a banding profi le of DNA fragments characteristic of the genome under inves-tigation. These may be produced using the numerous methodologies available, many of which are founded in the cutting or amplifi cation of DNA into defi ned and reproducible characteristic fragments. It is frequently of interest to enquire whether the bacterial isolates are naturally classifi able into distinct groups based on their DNA profi les. A major problem with this approach is whether classifi cation or clustering of the data is even appropriate.

It is always possible to classify such data, but it does not follow that the strains they represent are “ actually ” classifi able into well - defi ned separate parts. Hence, the act of classifi cation does not in itself answer the question: Do the strains consist of a number of different distinct groups or species or do they merge imperceptibly into one another

140 CLASSIFICATION AND DENDROGRAMS

because DNA profi les vary continuously? Nevertheless, we may still wish to classify the data for “ convenience ” even though strains may vary continuously, and such a classifi ca-tion has been called a dissecca-tion (Kendall & Stuart, 1966 ). This statnote discusses the use of classifi catory methods in analyzing the DNA profi les from a sample of bacterial isolates. An approach to analyzing and representing the relationship between isolates in a nonhierarchical manner will be discussed in the fi nal statnote.

27.2 SCENARIO

Eight unknown isolates of methicillin - resistant Staphylococcus aureus (MRSA) and a culture of S. aureus strain NCTC8325 as a control were incubated for 18 to 24 hours at 37 ° C in brain – heart infusion (BHI) broth. Following incubation, the bacterial cells were harvested and 20 mg (wet weight) of cells were resuspended in 1 ml NET - 100 [0.1 M Na 2 EDTA (pH 8.0), 0.1 M NaCl, 0.01 M Tris - HCl (pH 8.0)] and mixed with an equal volume of molten low melting point chromosomal grade agarose [0.9% (w/v) in NET 100; BioRad, UK]. The prepared blocks were incubated for 24 hours at 37 ° C in 3 ml lysis solution (6 m M Tris pH 7.6, 100 m M EDTA pH 8, 100 m M NaCl, 0.5% lauroyl sarcosine, and 1 mg/ml lysozyme) with 20 units of lysostaphin (Sigma, UK). The initial lysis solution was removed and the blocks were incubated for 48 hours at 50 ° C in 3 ml ESP [0.5 M EDTA pH 9, 1.5 mg/ml proteinase K (Sigma, UK), and 1% lauroyl sarcosine]. The blocks were washed at room temperature twice for 2 hours followed by two 1 - hour washes using TE buffer (10 m M Tris and 1 m M EDTA, pH 8). A portion of each agarose block (1 × 1 × 9 mm) was digested with 20 units of Sma I (Roche, UK) in 0.1 - ml buffer for 16 hours at 25 ° C. The digested DNA samples were subjected to pulsed - fi eld gel electro-phoresis (PFGE) (CHEF Mapper System, BioRad, UK) under the conditions outlined by Bannerman et al. (1995) . Gels were stained with 1 μ g/ml of ethidium bromide for 45 minutes and destained for 45 minutes in distilled water. Gels were visualized under ultra-violet (UV) illumination and photographed using the GeneGenius Bio Imaging System (Syngene, UK). All images were saved using the tagged image fi le format (TIFF). Figure 27.1 represents the resulting PFGE patterns of the eight Sma I genomic digests of MRSA;

wells 3 and 8 carry a Sma I chromosomal digest from S. aureus strain NCTC 8325 as a control and molecular weight marker (Murchan et al., 2003 ).

27.3 DATA

The data comprise the band distances migrated in millimeters from the origin of the gel for each of the DNA preparations and are presented in Table 27.1 . Note that unlike the data for a multiple regression analysis (see Statnotes 25 and 26 ), there is no Y variable, the data comprise a series of X variables representing the bacterial strains.

27.4 ANALYSIS 27.4.1 Theory

If there are s bands defi ned by distance from the origin of the gel, each bacterial strain could be represented by a point in an s - dimensional coordinate frame, any one point having

ANALYSIS 141

Figure 27.1. Resulting pulsed - fi eld gel electrophoresis (PFGE) patterns of the eight Sma I genomic digests of MRSA; wells 3 and 8 carry an Sma I chromosomal digest from S. aureus strain NCTC 8325 as a control and molecular weight marker.

TA B L E 27.1 Band Distances, Migrated in Millimeters, from the Origin of Gel for Each of the DNA Preparations Extracted from Several Strains of MRSA and a Culture of S. aureus Strain NCTC8325 as a Control

Bacterial Strains

Band A B C D E F G H I J

1 17 17 18 17 17 26 18 18 18 18

2 46 46 38 39 46 40 50 38 45 45

3 52 52 42 53 52 42 58 42 52 52

4 56 56 48 56 56 66 64 48 58 58

5 80 80 58 79 80 70 69 58 79 79

6 85 85 66 84 85 71 79 66 84 84

7 89 89 76 89 89 81 82 76 89 89

8 94 94 80 94 94 88 85 80 94 94

9 98 98 91 98 98 89 91 91 98 98

10 102 102 95 102 102 93 92 95 102 102

11 103 103 98 103 103 96 93 98 103 103

12 104 104 104 104 104 104 97 104 105 105

13 105 105 105 105 105 99 105

14 109

coordinates represented by the distances traveled by the defi ning bands. If in this s -dimensional frame, the points formed a hypersphere then the strains sampled would constitute a single homogeneous group. Alternatively, the strains may be clustered into two or more separate hyperspheres. Moreover, if the strains formed a single nonisodia-metric cluster such as an ellipse, then the strains may be neither homogeneous nor divisible into separate groups but would essentially be continuously distributed without distinct “ gaps. ”

142 CLASSIFICATION AND DENDROGRAMS

There is a bewildering array of methods available for classifying data (Pielou, 1969 ).

Hence, classifi cation methods may be hierarchical or reticulate, divisive or agglomerative, and monothetic or polythetic. Moreover, several methods are available for measuring the similarity or “ distance ” between strains. The most commonly employed hierarchic cluster-ing methods include nearest - neighbor, furthest - neighbor, and the unweighted pair – group method using arithmetic averages (UPGMA) (Clifford & Sokal, 1975 ), the most frequently used of which is UPGMA. When employing UPGMA to construct a dendrogram (tree diagram), an assumption is made during the calculation that each molecular strain type diverges equally from the others, and it is this approach that will be adopted in this statnote.

27.4.2 How Is the Analysis Carried Out?

The fi rst problem to be considered in any classifi cation analysis is the nature of the vari-ables and whether the measurements have been made on the same or different scales. For example, each strain may have been defi ned by band distance and intensity of the band, and, if both types of data were included, the analysis would be biased by the variable with the greatest mean and range. Hence, such data are often “ standardized ” by converting them so that they are members of the standard normal distribution, that is, a distribution with a mean of zero and a standard deviation of one unit (see Statnote 2 ). Second, a distance measure d needs to be selected that refl ects the similarity of one strain to another. There are various methods of computing distance, but the most straightforward is to compute d as if the s variables are dimensions making up an s - dimensional space, namely to use Euclidean distance as a measure of similarity. Third, a linkage rule needs to be selected, and, as discussed in Section 27.4.1 , the method based on UPGMA has been the most commonly used to analyze bacterial strains. Nevertheless, note that the nearest - neighbor method is often the default option available in various statistical packages and also offers a satisfactory method of classifi cation.

Most of the major statistical packages such as SPSS and STATISTICA offer multi-variate classifi catory methods, and, although the “ mechanics ” of carrying out the analyses may differ in detail, they use a similar approach. We will illustrate the analysis of our data using STATISTICA software.

27.4.3 Interpretation

The dendrogram obtained from the data in Table 27.1 using UPGMA and Euclidean dis-tance methods is shown in Figure 27.2 . The process of classifi cation is agglomerative, individual strains being combined with those that they resemble most closely and then, successively, the groups are combined as the dendrogram is ascended. Hence, the data from wells I / J , C / H , and A / B / E immediately form three groups, the members of which are each identical according to band distances and cluster at a linkage distance of 0.

Furthermore, D is the strain most closely related to A / B / E , and F is most closely related to I / J . Strain G appears to be the most unrelated to the others based on band distance, but at a linkage distance of approximately 125 all strains have been combined into a single group.

An obvious question to ask is how many groups should be retained? A useful addi-tional plot in the interpretation of the data is the graph of the amalgamation schedule and is shown in Figure 27.3 . As linkage distance increases, larger and larger clusters are formed but with a greater degree of within - cluster diversity. A clear discontinuity in the graph

ANALYSIS 143

Figure 27.2. Resulting dendrogram obtained from the data in Table 27.1 using the hierarchic clustering method UPGMA and Euclidean distance as a measure of the similarity between bacte-rial strains.

Figure 27.3. Graph of amalgamation schedule obtained from data in Table 27.1 , the hierarchic clustering method UPGMA, and Euclidean distance as a measure of the similarity between bacte-rial strains.

suggests that many clusters are being amalgamated at the same linkage distance, and this level can be used as an approximate cut - off to determine the number of groups to retain.

In Figure 27.3 , for example, this discontinuity occurs at a linkage distance of approxi-mately 40. Hence, drawing a line across the dendrogram in Figure 27.2 at this level would suggest the presence of three groups of MRSA strains, namely strain G alone, strains I / J and F , and strains A / B / C / D / E and H .

144 CLASSIFICATION AND DENDROGRAMS

27.5 CONCLUSION

The use of classifi catory methods is popular in the interpretation of DNA banding data from bacterial strains. The analysis results in a dendrogram that illustrates the relationships between individual strains by combining them into groups. There are two problems with this approach. First, there is no guarantee that the data are actually classifi able, and, second, there are many possible variations of the statistical analysis and their relative merits and usefulness with reference to DNA banding data have not been established. An alternative method of analysis is to make no assumptions as to the relationships between the strains.

Such a nonhierarchical method of analysis involving factor analysis (FA) and principal components analysis (PCA) will be described in the fi nal statnote.

Statnote 28

FACTOR ANALYSIS AND PRINCIPAL

In document Statistical Analysis in Microbiology StatNotes (Page 143-151)