
CHAPTER 2: GENERAL METHODS


2.11.1 Chapter 3 Statistical Analyses

Total and time-averaged area under the curve (AUC) for all appetite sensations were calculated using the trapezoidal rule. Paired t-tests were used to compare characteristics of the exercise sessions and total AUC between pairs of trials.
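As an illustration of the trapezoidal calculation, the sketch below computes total and time-averaged AUC for a single series of appetite ratings. The function names, time points and ratings are hypothetical and are not taken from the thesis data.

```python
def trapezoid_auc(times, values):
    """Total area under the curve by the trapezoidal rule."""
    pairs = list(zip(times, values))
    return sum(
        (t1 - t0) * (v0 + v1) / 2.0
        for (t0, v0), (t1, v1) in zip(pairs, pairs[1:])
    )


def time_averaged_auc(times, values):
    """Total AUC divided by the length of the observation period."""
    return trapezoid_auc(times, values) / (times[-1] - times[0])


# Hypothetical hunger ratings (mm on a VAS) at 0, 30, 60 and 120 minutes.
times = [0, 30, 60, 120]
ratings = [50, 60, 40, 20]
print(trapezoid_auc(times, ratings))      # 4950.0 (mm x min)
print(time_averaged_auc(times, ratings))  # 41.25 (mm)
```

Dividing by the observation time expresses the AUC in the original measurement units, which makes periods of different length comparable.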

Repeated measures ANOVA were used to assess differences in appetite, EI and body composition between trials.

Bland-Altman plots were used to visually assess agreement in EI measurements made in the control and exercise trials. The Bland-Altman plot is commonly used in method comparison studies and allows visual judgement of the extent of agreement and any potential bias between two measurements. The two-way mixed effects intraclass correlation coefficient (ri) was used to quantify the test-retest reliability of EI and appetite values. This coefficient estimates the average correlation between all possible pairs of observations, giving a measurement of reliability (Bland & Altman, 1996). This statistic is commonly used in similar reliability studies and gives a quantitative assessment of test-retest reliability alongside the visual representation of the Bland-Altman plots (Arvaniti et al, 2000; Nair et al, 2009; Laan et al, 2010). All results are presented as mean (95% Confidence Interval (CI)) in this chapter unless specified otherwise.
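The two reliability measures described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code used in the thesis: the limits of agreement use the conventional bias ± 1.96 SD, and the coefficient is computed in its single-measure consistency form (often written ICC(3,1)), which is one common reading of "two-way mixed effects"; the text does not say which form was used. Function names and the example values are hypothetical.

```python
from statistics import mean, stdev


def bland_altman(first, second):
    """Return the bias and 95% limits of agreement for paired measurements."""
    diffs = [a - b for a, b in zip(first, second)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)


def icc_consistency(first, second):
    """Two-way mixed, single-measure, consistency ICC for two sessions."""
    n, k = len(first), 2
    everything = list(first) + list(second)
    grand = mean(everything)
    row_means = [(a + b) / 2.0 for a, b in zip(first, second)]
    col_means = [mean(first), mean(second)]
    ss_total = sum((v - grand) ** 2 for v in everything)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between participants
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between sessions
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)


# Hypothetical energy intakes (MJ) from two repeats of the same trial.
trial_1 = [9.8, 11.2, 8.5, 10.1, 12.4]
trial_2 = [10.4, 10.1, 9.0, 11.3, 11.8]
bias, limits = bland_altman(trial_1, trial_2)
print(bias, limits, icc_consistency(trial_1, trial_2))
```

A coefficient near 1 indicates that participants keep their rank order and spacing across repeats; values near 0 indicate that within-participant variation is as large as between-participant variation.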

2.11.2 Chapter 4 Statistical Analyses

Time-averaged AUC was calculated for acylated ghrelin and peptide YY concentrations using the trapezoidal rule. Paired t-tests were used to assess differences in EE during the intervention, metabolic rate, total EI and time-averaged AUC of peptide hormones. A repeated measures analysis of variance (ANOVA) was used to investigate effects of trial and time on acylated ghrelin, peptide YY, EI and appetite. Repeated measures ANOVA was selected for this analysis as it allows multiple comparisons of data collected at different time points to be drawn whilst minimising the type 1 error rate associated with using multiple t-tests for comparisons (Kao & Green, 2008). Post-hoc Tukey comparisons were then used to identify specific significant differences; the Tukey test allows comparisons of all pairs of means to be drawn, and provides confidence intervals of mean differences. The Tukey test is useful when all pairwise comparisons are of interest and has the advantage of providing smaller confidence intervals than other available multiple comparison tests (Kao & Green, 2008). Pearson correlation coefficients were calculated for associations between appetite, acylated ghrelin and peptide YY. All results in this chapter are presented as mean ± SEM unless specified otherwise.

Due to the large degree of variability in exercise intervention EE, analysis to identify potential outliers was carried out for this study after all data were collected. Box plots and a histogram with an overlaid normal curve were plotted for ExEE data; these plots graphically summarise the statistical distribution of data, and highlight any values which lie outside this distribution (Williamson et al, 1989).
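The numerical rule that a box plot applies when it flags a point can be sketched as below. This assumes the conventional fences of 1.5 × the interquartile range beyond the quartiles; the text does not state which rule or quartile definition the plotting software used, and the ExEE values shown are hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR beyond the quartiles."""
    q1, _median, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical exercise energy expenditures (MJ); one value is far from the rest.
ex_ee = [1.5, 1.6, 1.6, 1.7, 1.7, 1.8, 3.5]
print(iqr_outliers(ex_ee))  # [3.5]
```

A flagged value is only a candidate outlier; whether it is excluded remains a judgement about the measurement, not a consequence of the rule.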

2.11.3 Chapter 5 Statistical Analyses

Time-averaged AUC was calculated for appetite measures using the trapezoidal rule and differences were assessed using paired t-tests. This test was also used to assess differences in total and relative EI, metabolic rate, and exercise responses between trials. Repeated measures ANOVA was used to assess differences in body composition and cardiovascular fitness levels over time, and to assess the interactive effect of time*trial on appetite ratings, macronutrient intake, EI, and acylated ghrelin and peptide YY levels. The post-hoc Tukey test of multiple comparisons was then used to identify specific significant differences.

All results are presented as mean ± SEM in this chapter unless specified otherwise.


2.11.4 Chapter 6 Statistical Analyses

Two sample t-tests were used to compare differences between restrained and unrestrained eaters, and between the two subgroups of restrained eaters.

Pearson correlation coefficients were calculated between EE and physical and anthropometric characteristics. Statistical significance was set at p < 0.05.

Values are mean ± SEM unless otherwise stated.

2.12 Sample Size and Power Calculations in Acute Exercise Studies

Given that very few of the studies reviewed in this chapter report a power calculation, a discussion of appropriate methods of conducting power calculations in acute exercise studies seems prudent. Using data from this thesis, worked examples will be used in the discussion.

The purpose of power calculations is to minimise the chance of a type 1 or type 2 error occurring. The probability of these errors in a given sample is represented by the statistical values α and β. Many studies in this field find non-significant results; these findings can simply be the result of inadequate power, a classic example of a type 2 error (the “false negative”). The probability of this type of error is quantified by the value β, which is related to statistical power in the following way:

Power = 1 – β

Thus β must be minimised in order to maximise statistical power in any given study; power calculation of appropriate sample size is necessary to achieve this (Altman, 1990).

The value of α in power calculations must also be considered; α is related to β and quantifies the probability of rejecting the null hypothesis when it is in fact true. This is known as a type 1 error – the “false positive”. The value of α is usually set at 5%, which relates to the widely accepted value of 0.05 taken to indicate statistical significance.

Studies that find a statistically significant result can be assumed to have had adequate power and are not susceptible to type 2 errors. Such studies could still be subject to type 1 errors, but this will not be discussed as this section will focus specifically on type 2 errors.

Power calculations are usually done prospectively, based on pilot data, findings of a similar study, or sometimes simply a good guess as to how variable the population is likely to be. There is also value in conducting power calculations retrospectively as they can provide valuable information for future studies, and give more accurate details of the effect size that can be detected. Retrospective power analysis was carried out on the acylated ghrelin data reported in chapter 3. In this study acylated ghrelin data for 15 overweight, pre-menopausal, female participants were available. A prospective power calculation was not carried out for this study, since it was part of the larger study detailed in chapter 4, and the original power calculation was conducted for outcomes of the larger study. As a result a retrospective power calculation was conducted on the chapter 4 data.

2.12.1 Using the Correct Standard Deviation

In the majority of studies which have examined the acute effects of exercise on acylated ghrelin, measurements are taken over a period of hours, pre- and post-intervention and the average difference in ghrelin concentrations over a period of hours is assessed. The study used as an example was complicated as it consisted of two observational periods; afternoon on day 1 and morning on day 2, incorporating a 16 hour overnight period spent at home. Each observational period was thus treated discretely for the purpose of power calculations due to the length of time separating them.


Before any calculations could be carried out it was necessary to decide where a difference in ghrelin was most likely to be present in order to elucidate the appropriate SD. It was reasoned that since day 1 observations included the intervention period, a difference would most feasibly be seen here (although the method described is applicable to data from either day). On day 1, a series of 5 measurements was made before the evening meal, and there was a steady, slight rise in acylated ghrelin over this period as would be expected. If exercise affected ghrelin it may be expected that this rise would be either attenuated or stimulated, hence the power calculation was constructed to detect a difference in this day 1 change in ghrelin concentrations between trials. Firstly, the change between the first and last measurements on day 1 was examined to obtain the necessary SD:

• For each participant the difference between the baseline and end values (time points 210-0) of acylated ghrelin concentrations during day 1 was calculated.

• The mean and standard deviation of these differences for each trial were then calculated, giving an average change in acylated ghrelin concentrations for both control and exercise trials over the course of day 1 (Control trial: Mean 36.9 SD 110.2 pg ml^-1; Exercise trial: Mean 35.0 SD 81.2 pg ml^-1).

• A paired t test was then carried out on these data. This provided the mean and standard deviation of the difference between the average day 1 change in ghrelin concentrations in the control and exercise trial. Results of this t-test are shown below:

Paired T for CON Δ 210-0 - EX Δ 210-0

               N   Mean   StDev  SE Mean
CON Δ 210-0   15  36.94  110.21    28.45
EX Δ 210-0    15  35.04   81.15    20.95
Difference    15   1.90  100.86    26.04

95% CI for mean difference: (-53.95444, 57.75444)

T-Test of mean difference = 0 (vs. not = 0): T-Value = 0.07 P-Value = 0.943

There are clearly no differences in the day 1 change in ghrelin concentrations between trials (i.e. the difference between first and last ghrelin measurements on day 1 is approximately 36 pg ml^-1 for both trials and the mean difference between these two trials is 1.9 pg ml^-1). To determine the effect size which can be excluded in this study, the SD of the difference (100.9 pg ml^-1) must be used in a retrospective power calculation. The size of this SD relative to the mean change shows that acylated ghrelin concentrations were highly variable in these particular participants.
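The paired t-test figures reported above can be checked from the summary statistics alone. The sketch below recomputes the standard error, t value and 95% confidence interval of the mean difference; the function name is hypothetical, and the critical value 2.1448 is the tabulated two-sided 5% point of the t distribution with 14 degrees of freedom, hard-coded here rather than computed.

```python
import math

T_CRIT_DF14 = 2.1448  # tabulated upper 2.5% point of t with 14 degrees of freedom


def paired_t_summary(mean_diff, sd_diff, n, t_crit):
    """Return (SE, t, 95% CI) for a paired comparison from summary statistics."""
    se = sd_diff / math.sqrt(n)
    t_value = mean_diff / se
    half_width = t_crit * se
    return se, t_value, (mean_diff - half_width, mean_diff + half_width)


se, t_value, ci = paired_t_summary(1.90, 100.86, 15, T_CRIT_DF14)
print(round(se, 2))                      # 26.04
print(round(t_value, 2))                 # 0.07
print(round(ci[0], 2), round(ci[1], 2))  # -53.95 57.75
```

The SD that enters this calculation is the SD of the within-participant differences (100.86), not the SD of either trial on its own, which is why it is the value carried forward into the power calculations.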

2.12.2 Calculating the Existing Statistical Power of a Study

The first step involved in retrospective power calculation for a study is to calculate the power to detect the observed difference (1.9 pg ml^-1 in this case). This requires the use of the power analysis function in a statistical analysis package (Minitab 14, Minitab Ltd., Coventry, UK). In this case a 1-sample t test power calculation for paired data was used; the method of calculating power, including values of α and β, is detailed below:

FORMULAS

Values Specified by the User
n = sample size
σ = estimated/approximated standard deviation
δ = difference between true mean and hypothesized mean
α = significance level

Derived Values
ν = degrees of freedom for error = n - 1
λ = non-centrality parameter for t
tα = one-sided critical value (upper α point of the t distribution with ν degrees of freedom)
tα/2 = two-sided critical value (upper α/2 point of the t distribution with ν degrees of freedom)

Non-centrality Parameter
λ = SQRT( n ) * δ / σ

One-sided Power
For the > alternative, Power = 1 – t( tα; ν, λ )
For the < alternative, Power = t( –tα; ν, λ )

Two-sided Power
Power = 1 – t( tα/2; ν, λ ) + t( –tα/2; ν, λ )

Here t( x; ν, λ ) denotes the cumulative distribution function of the non-central t distribution with ν degrees of freedom and non-centrality parameter λ, evaluated at x.

The information which must be provided by the user for this formula is the study sample size (15), the observed actual difference in the data (1.9), and the SD of that difference (100.9). Thus the power that the study had to detect the difference that actually exists between trials is calculated. The results of this calculation are shown below.

1-Sample t Test

Testing mean = null (versus not = null)
Calculating power for mean = null + difference
Alpha = 0.05  Assumed standard deviation = 100.9

            Sample
Difference    Size  Power
       1.9      15   0.05

This shows there is only 5% power to detect the actual difference observed in acylated ghrelin concentrations between exercise and control trials when α = 0.05. This means that if there is a true difference of 1.9 pg ml^-1, a hypothesis test on these data would be very unlikely to detect it; in about 95% of such studies the p value of the test would be greater than 0.05.
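The two-sided power formula given in this section can be evaluated directly. The sketch below is one way to do so using only the Python standard library; it is not Minitab's algorithm, which is not described in the text. The non-central t distribution function is obtained by numerical integration (midpoint rule), using the fact that a non-central t variable can be written as (Z + λ)/S, where Z is standard normal and S is the square root of a chi-square variable divided by its degrees of freedom. All function names are hypothetical, and the integration assumes ν ≥ 2.

```python
import math
from statistics import NormalDist

_PHI = NormalDist().cdf  # standard normal distribution function


def nct_cdf(x, df, ncp, steps=2000):
    """P(T <= x) for non-central t, integrating Phi(x*s - ncp) over the density of S."""
    log_c = math.log(2.0) + (df / 2.0) * math.log(df / 2.0) - math.lgamma(df / 2.0)
    upper = 1.0 + 10.0 / math.sqrt(2.0 * df)  # density of S is negligible beyond this
    h = upper / steps
    total = 0.0
    for i in range(steps):
        s = (i + 0.5) * h
        density = math.exp(log_c + (df - 1) * math.log(s) - df * s * s / 2.0)
        total += _PHI(x * s - ncp) * density
    return total * h


def t_critical(df, alpha):
    """Two-sided critical value, found by bisection on the central t distribution."""
    lo, hi = 0.0, 50.0
    for _ in range(50):
        mid = (lo + hi) / 2.0
        if nct_cdf(mid, df, 0.0) < 1.0 - alpha / 2.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def paired_power(n, delta, sd, alpha=0.05):
    """Two-sided power of a 1-sample (paired) t test."""
    df = n - 1
    ncp = math.sqrt(n) * delta / sd
    tc = t_critical(df, alpha)
    return 1.0 - nct_cdf(tc, df, ncp) + nct_cdf(-tc, df, ncp)


print(round(paired_power(15, 1.9, 100.9), 2))  # 0.05
```

With a non-centrality parameter this close to zero the power is essentially equal to α, which is why the reported power of 5% coincides with the significance level.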

In the present study β is 95%, therefore if the observed difference were a true difference, and not simply the result of normal variation, there is a 95% chance it would not be detected. In other words, a 95% chance of a “false negative”, or type 2 error, exists in this study. In reality it is unlikely that 1.9 pg ml^-1 would be a truly biologically significant difference, but this calculation provides a greater understanding of inherent error rates. A larger sample size would be needed to reduce the possibility of a type 2 error occurring to an acceptable level (typically 80% power is considered adequate for most power calculations; this means the chance of a type 2 error occurring is 20%).

2.12.3 Post-hoc power calculations

There is no reference value for the magnitude of change in ghrelin levels that would be considered biologically or clinically significant; previous studies have found differences of 20-40 pg ml^-1 to be significant and thus two power calculations were carried out to calculate the sample size that would be needed in the study to detect changes of 20 and 40 pg ml^-1. This involves conducting a 1-sample t test power calculation for paired data, for which three values are required: the desired difference (20 or 40), the observed SD (100.9), and the desired level of power (80%). The results of the calculations are shown below:

Power and Sample Size calculation 1

1-Sample t Test

Testing mean = null (versus not = null)
Calculating power for mean = null + difference
α = 0.05  Assumed standard deviation = 100.9

            Sample  Target
Difference    Size   Power  Actual Power
        20     202     0.8          0.80

Power and Sample Size calculation 2

1-Sample t Test

Testing mean = null (versus not = null)
Calculating power for mean = null + difference
α = 0.05  Assumed standard deviation = 100.9

            Sample  Target
Difference    Size   Power  Actual Power
        40      52     0.8          0.80


Thus 52 participants would be required to detect a difference of 40 pg ml^-1 and 202 participants would be needed to detect a difference of 20 pg ml^-1 in this data set with 80% power and α = 0.05.
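These sample sizes can be approximated without specialist software. The sketch below uses the normal-approximation formula n ≈ ((z_α/2 + z_β) σ / δ)², optionally adding a commonly used small-sample correction of z_α/2² / 2 to allow for the use of the t distribution. It is an approximation, not the exact non-central t calculation that produced the output above, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def approx_sample_size(delta, sd, alpha=0.05, power=0.80, small_sample_correction=True):
    """Approximate n for a two-sided 1-sample (paired) t test."""
    z = NormalDist().inv_cdf
    z_alpha = z(1.0 - alpha / 2.0)  # about 1.96 for alpha = 0.05
    z_beta = z(power)               # about 0.84 for 80% power
    n = ((z_alpha + z_beta) * sd / delta) ** 2
    if small_sample_correction:
        n += z_alpha ** 2 / 2.0
    return math.ceil(n)


print(approx_sample_size(20, 100.9))  # 202
print(approx_sample_size(40, 100.9))  # 52
print(approx_sample_size(20, 100.9, small_sample_correction=False))  # 200
```

Because n depends on (σ / δ)², halving the difference to be detected roughly quadruples the required sample size, as the step from 52 to 202 participants illustrates.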

2.12.4 Power curves

Another function available in statistical software (Minitab 14) is the construction of a power curve, which helps the researcher determine the minimum effect that can be either excluded or detected at 80% power, α of 0.05, and the existing sample size. Once again a 1-sample t test power calculation is used for this type of paired data. This time the required power (i.e. 80%), the relevant standard deviation, the α value, and the range of values we wish to be able to detect or exclude are entered (i.e. if the values 50:100/0.05 are entered this creates a power curve telling us how many participants are needed to detect a range of effects of 50 pg ml^-1 up to 100 pg ml^-1). The range of effect sizes and the number of participants required to detect each effect is then produced. For the existing sample size the power curve output is shown below (in increments of 0.5 pg ml^-1):

Power and Sample Size

1-Sample t Test

Testing mean = null (versus not = null)
Calculating power for mean = null + difference
α = 0.05  Assumed standard deviation = 100.8

            Sample  Target
Difference    Size   Power  Actual Power
     78.50      15     0.8      0.800749
     79.00      15     0.8      0.805677
     79.50      15     0.8      0.810529
     80.00      15     0.8      0.815307
     80.50      15     0.8      0.820008
     81.00      15     0.8      0.824633
     81.50      15     0.8      0.829182
     81.60      15     0.8      0.830082

Therefore for the current data set the minimum effect size we have power to detect or exclude is 78.5 pg ml^-1 with 80% power, and the maximum is 81.6 pg ml^-1 with 83% power.
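The smallest difference detectable with 80% power at the existing sample size can be approximated by hand as δ ≈ (tα/2 + tβ) σ / √n, where both t values are taken with n − 1 degrees of freedom. The sketch below applies this approximation; it is not the exact calculation behind the power curve output, the function name is hypothetical, and the two t values are tabulated figures for 14 degrees of freedom that are hard-coded rather than computed.

```python
import math

T_ALPHA_DF14 = 2.1448  # tabulated upper 2.5% point of t, 14 degrees of freedom
T_BETA_DF14 = 0.8681   # tabulated upper 20% point of t, 14 degrees of freedom


def min_detectable_difference(sd, n, t_alpha, t_beta):
    """Approximate smallest difference detectable with the power implied by t_beta."""
    return (t_alpha + t_beta) * sd / math.sqrt(n)


mdd = min_detectable_difference(100.8, 15, T_ALPHA_DF14, T_BETA_DF14)
print(round(mdd, 1))  # 78.4
```

The approximation gives about 78.4 pg ml^-1, in line with the first row of the power curve output, where a difference of 78.50 has power just above 0.80.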

Thus power calculations are useful not only for determining sample size prospectively, but can also be a useful retrospective tool that aids interpretation of a data set. For instance, with these calculations the conclusion of the study in question can be more informative than simply saying there was no significant difference. Instead the author and reader can be informed that the study excludes a difference in ghrelin levels between trials of 78.5 to 81.6 pg ml^-1; smaller effects may occur and escape detection. This knowledge gives more informative conclusions and highlights the importance of power calculations.

Papers published without reporting power calculations in any manner are difficult to interpret in comparison.

CHAPTER 3: Low test-retest reliability of twenty-four hour post-exercise assessment of energy intake and appetite measures in overweight and obese women obtained using current methodology.

3.1 Participants and Study Design


3.1.1 Participants

Fourteen healthy women were screened as described in section 2.1 and gave written, informed consent to participate (appendix III), and there was no attrition from this study.

3.1.2 Study Design

Each participant completed a sub-maximal fitness test before completing two sets of trials. Participation involved a total of four trials – two exercise and two control. Participants were asked to avoid alcohol and standardise their food intake prior to each trial. All trials were timed to ensure they were completed in the same phase of the menstrual cycle for each participant; in practice trials were typically approximately four weeks apart, with exact timings depending on the individual participant’s cycle. The first set of trials, one exercise and one control, was completed in a randomised, counter-balanced fashion. Thereafter the second set of trials was completed in reverse order to minimise potential order effects.

Each trial lasted 24 hours, spanning over 2 days (figure 3.1); observation was carried out in the afternoon of day 1, and the morning of day 2. Participants attended the laboratory on day 1 at ~2pm and remained for four hours, during which the intervention (exercise or control) period was completed. Participants fasted overnight at home and returned to the laboratory the next morning, remaining for a further five hours. Body composition was measured via bio-impedance in the fasted state at the beginning of day 2 of all trials (TANITA TBF-300, Tanita B.V, Hoofddorp, The Netherlands). Appetite assessment was carried out a total of eleven times; five on day 1 and six on day 2 using a visual analogue scale (VAS) as described in section 2.7 (appendix VIII) (Flint et al, 2000). Three ad-libitum buffet meals were served during each trial; evening meal on day 1, and breakfast and lunch on day 2.

Figure 3.1 Schematic representation of study protocol. Grey rectangle represents intervention period, black arrows represent appetite assessments, grey arrows represent metabolic rate measurements, outlined rectangles represent buffet meal while solid black rectangles represent normal meal at home/work.

3.1.3 Sub-maximal Fitness Test

Participants completed a graded, sub-maximal fitness test on the treadmill before beginning trials in order to determine intensity of exercise sessions.

Fitness tests were carried out according to the protocol described in section 2.3.

3.1.4 Intervention sessions

During the exercise trials, a moderate intensity (65% V̇O2max) treadmill walking session was carried out in the afternoon of day 1 to expend 1.65 MJ; an EE similar to that recommended for individual sessions for long term body mass control (Donnelly et al, 2009). At all other times during the trials participants were sedentary. The control trials were identical except that participants remained sedentary during the intervention period on day 1.

3.1.5 Ad-libitum Buffet meals

During each trial participants were served 3 buffet meals; evening meal on day 1, and breakfast and lunch on day 2. Dinner was served at the end of the day 1 observation period before participants returned home to fast, breakfast was served after fasting measurements were made on the morning of day 2, and lunch was served 4 hours after the breakfast meal. After the lunchtime meal final measurements were made before the trial concluded. Meals were prepared and served as described in section 2.8.


3.2 Results

3.2.1 Participant characteristics

Fourteen participants completed the study with no attrition. Participants had mean age of 35.7 ± 8.7 years, height 162 ± 6 cm, body mass 78.6 ± 14.3 kg, BMI 30.0 ± 5.1 kg m^-2, body fat 38.7 ± 5.5 %, and V̇O2max 30.9 ± 7.1 ml kg^-1 min^-1 (mean ± SD). There were no significant differences in body mass, BMI, or body composition between trials (p>0.05).

3.2.2 Exercise responses

There were no significant differences in intensity (60.5 ± 3.4 vs. 61.0 ± 3.5 % V̇O2max), duration (67.7 ± 3.3 vs. 66.4 ± 3.3 minutes) or EE (1.65 ± 0.10 vs. 1.66 ± 0.10 MJ) of the two exercise sessions conducted during exercise trials (mean ± SEM, all p>0.05).

3.2.3 Energy and macronutrient intake

Paired t-tests showed that although there was no difference in EI between exercise trials (p>0.05), there was a tendency for control trial EI to differ (p=0.08; table 3.1). Repeated measures ANOVA comparison of EI in all trials showed that there was a significant main effect of meal; EI was significantly lower at the breakfast meal than at either lunch or dinner in all trials (p=0.0003).

The ri for the control trial EI was 0.50 (0.03, 0.80) (p=0.02), and for exercise trial EI was 0.04 (-0.53, 0.55) (p=0.45). The ri value for the difference in exercise and control trial EI was -0.05 (-0.54, 0.48) (p=0.57).

Macronutrient intake was not different between exercise trials (p>0.05) but total protein (p=0.004) and fat intake (p=0.02) were significantly different between control trials (table 3.1). Repeated measures ANOVA of macronutrient intakes showed a significant main effect of meal for all four trials; carbohydrate and protein intake was significantly lower at breakfast compared to lunch or evening meal in all trials (p<0.05), and intake of these macronutrients at lunch was also significantly lower than evening meal intake (p<0.05). Fat intake was
