METHOD | STATISTIC REFLECTS | TYPE OF DATA | INTERPRETATION
Intraclass correlation coefficient (ICC) | Association of two or more measures and amount of association | Continuous | Value of zero = no association between variables; value of 1 = there is perfect association; most commonly used statistic for more than two measures; different variables can be compared or different people’s scores can be compared
Spearman’s rho | Association of two measures; comparable to Pearson’s in that it does not reflect the sameness of the scores, only the nature of how they change | Ordinal | Value of zero = no association between variables; value of 1 = there is perfect association
Kappa (κ) | Association of two measures but accounting for “chance” agreements | Nominal | Value of zero = no association between variables; value of 1 = there is perfect association; values tend to be lower than other measures of association; contribution of chance is removed
Kendall’s tau | Association of two measures; comparable to Pearson’s in that it does not reflect the sameness of the scores, only the nature of how they change | Nominal | Value of zero = no association between variables; value of 1 = there is perfect association
Cronbach’s alpha | Internal consistency of an instrument or scale | Continuous, categorical, ordinal | Internal consistency should be close to 1; all items relate to each other and to the construct being measured by the scale
Standard error of measurement (SEM) | Variability of the standard deviation (SD) of a measure against probabilities of this variation | Continuous | High values indicate large variability of SD, implying a lack of reliability.
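To see why kappa values tend to be lower than simple agreement, it can help to compute both for the same ratings. The sketch below assumes the standard two-rater form of Cohen's kappa, (observed agreement − chance agreement) / (1 − chance agreement); the function name and the ratings are hypothetical and are not taken from any study discussed in this chapter.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Return Cohen's kappa for two equal-length lists of nominal ratings."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: proportion of cases both raters labelled identically.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: for each category, the product of the two raters'
    # marginal proportions, summed over categories.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


# Hypothetical data: two raters classify 10 patients as impaired (I) or normal (N).
rater_a = list("IIIINNNNNI")
rater_b = list("IIINNNNNII")
agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
print(f"raw agreement = {agreement:.2f}, kappa = {cohens_kappa(rater_a, rater_b):.2f}")
```

With these made-up ratings the raters agree on 8 of 10 patients (0.80), but half of that agreement would be expected by chance, so kappa is 0.60.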
because means are only estimates. CIs can give you this information, and they are typically expressed at the 95% or the 90% confidence level (see Fig. 4.6, Note 4.4).⁶
Guyatt et al⁶ offer the example of a coin toss to illustrate the concept of CIs. The larger the sample (coin tosses), the better the estimate you have of the true value, that is, the narrower the CI.
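The coin-toss idea can be sketched numerically. The example below is an illustration only: it assumes the simple normal-approximation CI for a proportion, and it uses hypothetical results in which exactly half of the tosses come up heads at every sample size. The function name is invented for this sketch.

```python
import math
from statistics import NormalDist


def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation CI for a proportion (for example, P(heads))."""
    p_hat = successes / n
    # z value that leaves (1 - confidence) / 2 in each tail (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


# Hypothetical results: half of the tosses are heads at each sample size.
for n in (10, 100, 1000):
    low, high = proportion_ci(n // 2, n)
    print(f"n = {n:4d}  95% CI for P(heads): {low:.3f} to {high:.3f}")
```

Under these assumptions the interval narrows from roughly 0.19 to 0.81 with 10 tosses to roughly 0.47 to 0.53 with 1,000 tosses, although the estimate itself (0.50) never changes.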
Interpreting CIs
CIs can be calculated for many different statistics. One common and useful CI is computed for the difference between means from two treatment groups in an intervention study. Research is based on the behavior of samples of people, but you are trying to apply this research to your patients. In a typical intervention study comparing two different treatments, the mean differences between groups are computed to determine which treatment was more effective. The difference between means from the sample is used to make inferences about a population of people, and the CI helps you determine the range in which the population value of this difference is likely to lie. CIs must be reviewed and interpreted. If the CI crosses zero, meaning that the CI range includes negative and positive values, then the result of the intervention cannot be considered statistically significant. Consider the following: If the results of an intervention are expressed as a positive change (for example, increased strength on a dynamometer), then the estimated mean difference in the change between groups should be positive following intervention if the treatment group had greater strength than the comparison group. If the CI range includes negative values, then this finding suggests that the true mean (recall that the means of the samples in the study are estimates of the population mean) may be negative. Negative values suggest that the comparison group performed better after intervention.
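The check for whether a CI crosses zero can be sketched as follows. This is a minimal illustration, not the method of any particular study: it assumes a large-sample (normal) approximation, whereas published studies with small groups typically use a t distribution, and the strength values are hypothetical.

```python
import math
from statistics import NormalDist


def mean_difference_ci(mean_t, sd_t, n_t, mean_c, sd_c, n_c, confidence=0.95):
    """Large-sample CI for (treatment mean - comparison mean)."""
    difference = mean_t - mean_c
    # Standard error of the difference between two independent group means.
    standard_error = math.sqrt(sd_t**2 / n_t + sd_c**2 / n_c)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return difference - z * standard_error, difference + z * standard_error


def crosses_zero(low, high):
    """True when the interval includes both negative and positive values."""
    return low <= 0 <= high


# Hypothetical strength gains (kg on a dynamometer): 6.0 vs 4.5, SD 4.0 in both groups.
for n_per_group in (40, 100):
    low, high = mean_difference_ci(6.0, 4.0, n_per_group, 4.5, 4.0, n_per_group)
    verdict = "crosses zero" if crosses_zero(low, high) else "does not cross zero"
    print(f"n = {n_per_group:3d} per group: 95% CI {low:.2f} to {high:.2f} ({verdict})")
```

In this made-up example the same 1.5-kg difference has a CI that includes zero with 40 participants per group but excludes zero with 100 per group, because the larger sample narrows the interval.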
DIGGING DEEPER 4.3
Interface of Descriptive and Inferential Statistics
QUESTION 4: Were descriptive and inferential statistics applied to the results?
Descriptive statistics help summarize information about groups before and after intervention. However, these summary scores must be interpreted to understand their implications for clinical decisions. Should you be concerned about the difference in the mean values of the demographic or clinical characteristics in Figure 4.2? The descriptive statistics, in this case the mean values, indicate that the differences exist in these samples. However, this type of statistic does not help you sort out whether these differences threaten the validity of the study, with the consequence that the study is less useful for your clinical decision. Could the differences in subject characteristics affect or explain the outcomes of the study sufficiently, such that you are less certain that the outcomes are due to the treatment? In addition, it is not certain as to how to interpret the change in
Some measures, such as a pain scale, might be expected to have lower values after intervention. In this example, an improvement would be expressed as a negative value between group means. The CI for this result should only include a range of scores with negative values.
Regardless of what type of measure is used or whether a positive or negative range is expected, if the CI crosses zero, then the result is not statistically significant. This means that both a positive and a negative result are possible given the difference between means. However, further interpretation of CIs is warranted. Guyatt et al⁶ suggest looking at the low and high values in the CI to evaluate the sample size of a study further and thus what you can take away from the study. In a study with a positive result for a treatment, if the lowest value in a CI is greater than the smallest difference that is clinically important, then you know that the sample size was adequate in demonstrating a positive effect above the minimal threshold. Even if the CI is somewhat wide, the lowest value would still indicate a positive effect, albeit small, of the treatment. In a study with a negative result for a treatment, if the highest value in a CI is less than the smallest difference that is clinically important, then you know that the sample size was adequate in demonstrating a negative effect above the minimal threshold.
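The reasoning above for a study with a positive result can be written as a short decision rule. This is a sketch under stated assumptions: higher scores are taken to mean improvement, `mcid` is shorthand introduced here for "the smallest difference that is clinically important," and the function name, labels, and numbers are hypothetical.

```python
def appraise_ci(ci_low, ci_high, mcid):
    """Classify a CI for a between-group difference where higher = better.

    mcid is the smallest difference judged clinically important (> 0).
    Returns a short label describing what the CI allows you to conclude.
    """
    if mcid <= 0:
        raise ValueError("mcid must be positive")
    if ci_low <= 0 <= ci_high:
        # Both a positive and a negative result remain possible.
        return "not significant"
    if ci_low > mcid:
        # Even the lowest plausible value is clinically important.
        return "significant and clinically important"
    if ci_low > 0:
        # Positive effect, but it may be smaller than the important threshold.
        return "significant, importance uncertain"
    # The whole interval lies below zero: the comparison group did better.
    return "favors comparison group"


# Hypothetical CIs for a strength difference (kg), with an assumed mcid of 0.4 kg.
print(appraise_ci(0.5, 3.0, 0.4))
print(appraise_ci(0.2, 3.0, 0.4))
print(appraise_ci(-0.3, 3.3, 0.4))
```

The labels are deliberately coarse; in practice you would also weigh how wide the interval is and how the important threshold was chosen.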
CIs are typically expressed at the 90% or 95% confidence level. This level reflects the degree of certainty associated with the CI: you can be 90% or 95% certain that the “true” (population) mean (in this example) is within the range given for the CI.
Change in Test Performance Over Time From Baseline for Participants with Subacromial Impingement Syndrome in High-Intensity Laser Therapy (HILT) and Ultrasound (US) Therapy Groups: Evaluation Between Groupsᵃ
ᵃ CI = confidence interval, VAS = visual analog scale, CMS = Constant-Murley Scale, SST = Simple Shoulder Test.
ᵇ The statistical inferences were adjusted according to Bonferroni inequality (0.05/6 = 0.008 and 0.01/6 = 0.002).
Note 4.4 Is this a difference that we should consider in our clinical decisions?
We can be 95% confident that the true mean lies between −2.27 and −1.12.
FIGURE 4.6 Comparison of two interventions for the treatment of subacromial impingement syndrome.
From: Santamato A, Solfrizzi V, Panza F, et al. Short-term effects of high-intensity laser therapy versus ultrasound therapy in the treatment of people with subacromial impingement syndrome: a randomized clinical trial. Phys Ther. 2009;89:643–652; with permission.
mean scores from before to after treatment (see Fig. 4.6, Note 4.5). You can see from the data in columns 3 and 4 that both treatment groups made positive changes on all three variables that were measured (VAS, CMS, and SST) but that the HILT group made larger positive changes. From these data, would you accept that HILT is more effective than US Therapy?
Descriptive statistics are useful but insufficient for making conclusions about the differences between groups. Inferential statistics²,³,⁷,⁸ are helpful in making these conclusions. Inferential statistics are tools that use the mathematics of probability to interpret the differences observed in research studies. These statistics focus on the following question:
Is the outcome due to the intervention, or could it be due to chance?
Role of Chance in Communicating Results
Inferential statistics are based on the mathematics of probability; they can be expressed as probability, or p values. The sixth column in the table in Figure 4.5 includes p values. A p value is an expression of the probability that a difference as large as the one identified would occur by chance alone. If there is a high probability that the results are due to chance (large p value), then you cannot conclude that the treatment was the reason for the difference observed between the two groups. What values represent high or low probability? Researchers want the probability that chance explains the results of a study to be as low as possible, and by consensus they have agreed on a threshold.
This agreed-upon value is termed the alpha level, and it is typically 5%.
The results of an analysis using inferential statistics are expressed as p values, and these values are evaluated as either below or above the agreed-upon alpha level (p < 0.05 or p > 0.05) (see Fig. 4.6, Note 4.6). If you were comparing groups, then you would infer that the groups are statistically significantly different and that the HILT group improved significantly more than the US Therapy group, although both groups improved following treatment.
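Comparing a p value with the alpha level can be illustrated with summary statistics alone. The sketch below assumes a two-sided, large-sample z test on two independent group means; this is a stand-in chosen because it needs only the standard library, and it is not necessarily the test used in the study shown in Figure 4.6. All numbers are hypothetical.

```python
import math
from statistics import NormalDist

ALPHA = 0.05  # conventional alpha level


def two_sample_p_value(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Two-sided p value from a large-sample z test on two group means."""
    standard_error = math.sqrt(sd_1**2 / n_1 + sd_2**2 / n_2)
    z = (mean_1 - mean_2) / standard_error
    # Probability of a difference at least this large, in either direction,
    # if chance alone were operating.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical improvement scores: 6.0 vs 4.5, SD 4.0, 100 participants per group.
p = two_sample_p_value(6.0, 4.0, 100, 4.5, 4.0, 100)
if p < ALPHA:
    print(f"p = {p:.3f}: below alpha, statistically significant")
else:
    print(f"p = {p:.3f}: above alpha, not statistically significant")
```

With these assumed values the p value falls below 0.05; rerunning the function with 40 participants per group gives a p value above 0.05 for the same 1.5-point difference.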
CHAPTER 4 Critically Appraise the Results of an Intervention Research Study
Hypothesis testing
Using inferential statistics to test the role of probability in determining differences between groups relies on an understanding of hypothesis testing, which is fundamental to statistical analyses. In what is referred to as the null hypothesis, groups are considered (in statistical terms) to be equal; that is, there is no difference between groups. A given statistical test supports either acceptance or rejection of the null hypothesis. If the null hypothesis is rejected (the groups are different), then the p values express the probability that chance contributed to this result.
TYPE I ERROR: The results of a study may conclude falsely that there is a statistically significant difference when there is actually no difference. You might assume falsely that the intervention under study was effective when in fact it was not. This is a type I error, because the result was actually due to chance. This is one reason for the decision among scientists to use the conservative alpha level of 0.05; there is only a small probability that chance could account for the result. As the alpha level increases (0.08 or 0.10), the chances of type I errors increase.
TYPE II ERROR: The result of a study may also conclude falsely that there is no statistically significant difference between groups when there is actually a difference. This is a type II error. From this conclusion, we might not use a treatment that might indeed be effective with our patients. Small samples are an important contributing factor to type II error. The difference between groups may not have been large enough to detect with only a small sample.
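The link between small samples and type II error can be made concrete with an approximate calculation. The sketch below assumes a two-sided z test, two equal-sized groups with the same SD, and a hypothetical true difference of 1.5 points; the function name and all numbers are illustrative assumptions.

```python
import math
from statistics import NormalDist


def type_ii_error_rate(true_difference, sd, n_per_group, alpha=0.05):
    """Approximate probability of missing a real difference (two-sided z test)."""
    std_normal = NormalDist()
    standard_error = math.sqrt(2 * sd**2 / n_per_group)
    z_critical = std_normal.inv_cdf(1 - alpha / 2)
    shift = true_difference / standard_error
    # Power: probability that the test statistic lands beyond either critical value.
    power = (1 - std_normal.cdf(z_critical - shift)) + std_normal.cdf(-z_critical - shift)
    return 1 - power


# Hypothetical true difference of 1.5 points with an SD of 4.0 in each group.
for n in (20, 50, 200):
    print(f"n = {n:3d} per group: type II error rate is about {type_ii_error_rate(1.5, 4.0, n):.2f}")
```

Under these assumptions the chance of missing the real difference is high with 20 participants per group and small with 200. When the true difference is set to zero, the function returns 1 − alpha, because the only way to reach significance is then a type I error.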
The last column in the table in Figure 4.6 includes the Bonferroni-corrected p values, which are the values to accept in terms of statistical significance. The Bonferroni correction takes into account the number of statistical comparisons that are made in an analysis. Recall that inferential statistics are based on probability, and if there are many comparisons in an analysis, then there is an increased probability that one or more will be significant by chance alone. For example, if 20 statistical comparisons were made at an alpha level of 0.05, then you would expect that one of the comparisons would show a statistically significant difference between groups even if one does not exist. The Bonferroni correction adjusts for these multiple comparisons.
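The arithmetic behind the Bonferroni correction is brief enough to sketch. The thresholds 0.05/6 and 0.01/6 come from the footnote to Figure 4.6; the family-wise error calculation additionally assumes that the comparisons are independent, which is a simplification. Function names are invented for this illustration.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison alpha after a Bonferroni correction."""
    return alpha / n_comparisons


def expected_false_positives(alpha, n_comparisons):
    """Expected number of chance 'significant' results when no real differences exist."""
    return alpha * n_comparisons


def familywise_error_rate(alpha, n_comparisons):
    """Chance of at least one false positive, assuming independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons


# Thresholds from the Figure 4.6 footnote: six comparisons.
print(f"0.05 / 6 = {bonferroni_alpha(0.05, 6):.3f}")
print(f"0.01 / 6 = {bonferroni_alpha(0.01, 6):.3f}")

# The 20-comparison example from the text.
print(f"expected false positives: {expected_false_positives(0.05, 20):.1f}")
print(f"chance of at least one, uncorrected: {familywise_error_rate(0.05, 20):.2f}")
corrected = bonferroni_alpha(0.05, 20)
print(f"chance of at least one, corrected:   {familywise_error_rate(corrected, 20):.2f}")
```

Dividing alpha by the number of comparisons keeps the overall chance of at least one false positive near the original alpha level, at the cost of making each individual comparison harder to pass.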
DIGGING DEEPER 4.4
Should you be concerned about the difference in the distribution of demographic or clinical characteristics at baseline? Figure 4.6⁵ includes characteristics of the participants in the two treatment groups, HILT and US Therapy, at baseline prior to intervention. The means, standard deviations, and range are given for age, onset of pain, and the stage in the diagnostic process. How do you interpret these values? All of the p values are greater than 0.05, the conventional alpha level. This supports the statement that the groups are not statistically significantly different. This is interpreted to mean that the groups are similar at baseline and that subject characteristics included in the baseline measures will not likely bias the results of the study. This increases the certainty that the treatments, and not an imbalance between groups at baseline, are responsible for the results.
What does it mean to approach significance? Recall that the alpha level of 0.05 is a convention, the amount of error that researchers are willing to accept. What if the p value is in the range of 0.05 to 0.10 or slightly higher? Do you automatically assume that the difference between groups is due only to chance? A p value of 0.06 indicates that a difference this large would be expected about 6% of the time from chance alone. A bit of common sense must prevail. You do not want to ignore important findings because the p values are above the conventional 0.05 threshold. Carefully review all p values in a study, and consider the relevance of values between 0.05 and 0.15. There may be a pattern of results that should be considered even if the p values are above 0.05.
There are many inferential statistical tools, but there are some statistics that are typically used in the rehabilitation literature. A list of these statistics and their uses is included in Digging Deeper 4.5. Familiarity with these statistics assists you in interpreting the results of a study.
Part D: Summarizing
QUESTION 5: Was there a treatment effect? If so, was it clinically relevant?
Descriptive and inferential statistics are two categories of statistical tools that are used to analyze the data in a research study. Readers sometimes conclude (and authors sometimes imply) that a finding of statistical significance from a valid study (e.g., a result of p < 0.05 in a well-designed intervention study) means that the study’s results must be important. However, this is not the case: statistically significant results may not be clinically important.
For example:
Suppose you try a new intervention with a patient who has severely limited range of motion in several fingers of her dominant hand. After a reasonable trial period, the patient increases her ability to move her fingers by about a quarter of an inch. If you replicated this intervention with a very large sample of patients and achieved the same outcome, a statistical analysis might show the significance of this change is p < 0.05. However, this change is not very likely to improve the functional use of the hand.
To be clinically important, the results must:
• Show change on a measure that has value to the patient in terms of his or her daily life (patient values). In the example, a measure of functional use of the hand as well as range of motion might help evaluate the value of the change for the patient.
• Show change of a magnitude that will make a real difference in the patient’s life (in terms of function, satisfaction, comfort, etc.).
Therefore, you need some measures other than p values to answer the following question: Are the results of this valid study important?
Effect Size
The p values help to establish the role of chance in the results of a study. However, you also need to know how big a difference occurred between the treatments. One approach to evaluating the magnitude of the difference following treatments is to calculate the effect size, the most common form of which is Cohen’s d.⁹ The effect size d provides a measure of just how distinct the samples are, or specifically, how far apart the two means are, relative to the variability, as you can see from looking at the following formula:
$$d = \frac{\text{Mean}_{\text{group 1}} - \text{Mean}_{\text{control group}}}{\text{SD}_{\text{control group}}}$$
A large d value indicates a big difference between the two groups. Papers do not always include the effect sizes, but descriptive statistics (means and standard deviations) of the results are usually given in tables. You can readily calculate d to help you evaluate the magnitude of the differences that are reported. Interpretation of the d statistic as suggested by Cohen⁹ is included in Table 4.2.
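Calculating d from a results table takes only a few lines. The sketch below follows the formula given above, which divides by the control group's SD, and applies the ranges in Table 4.2. The group means and SD are hypothetical, and the handling of values that fall exactly on a boundary (0.5 or 0.8) is an assumption, because the table's ranges overlap at those points.

```python
def effect_size_d(mean_group_1, mean_control, sd_control):
    """Effect size d: difference between means divided by the control group SD."""
    if sd_control <= 0:
        raise ValueError("sd_control must be positive")
    return (mean_group_1 - mean_control) / sd_control


def interpret_d(d):
    """Label the magnitude of d using the ranges in Table 4.2.

    Boundary handling is an assumption: exactly 0.8 is treated as medium
    and exactly 0.5 as medium, since the table's ranges share endpoints.
    """
    size = abs(d)
    if size > 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "below the small range"


# Hypothetical table values: treatment mean 6.0, control mean 4.5, control SD 4.0.
d = effect_size_d(6.0, 4.5, 4.0)
print(f"d = {d:.3f} ({interpret_d(d)})")
```

With these assumed values, d is 0.375, which falls in the small range, even though a difference of this size could be statistically significant in a large enough sample.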
Number Needed to Treat
Many discussions in the medical EBP literature about examining the importance of results focus on dichotomous outcomes. A dichotomous outcome has two categories; for example, healthy or sick, admitted or not admitted, or relapsed or not relapsed. Effect size can be computed only on results
TABLE 4.2 Interpretation of Effect Size Values
Large >0.8
Medium 0.5–0.8
Small 0.2–0.5
From: Cohen J. Statistical Power Analysis for the Behavioral Sciences. Hillsdale, NJ: Erlbaum; 1988.