4. Chapter Four: Main Study Methodology
4.16 Statistical analysis
Statistics encompasses the methods of collecting, summarizing, analysing and drawing conclusions from data; researchers may use several statistical techniques to achieve their study aims. Because data can take many different forms, different statistical tests are required, so it is important to know what form each variable takes before deciding on the most appropriate statistical methods. Each variable in this study was one of two types: categorical or continuous. Age and BMI were recorded as continuous numerical variables; they were recoded and used as categorical variables in the Chi-square tests (Chapter 5) and used as continuous variables in the regression analyses. All other items were analysed as categorical variables.
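The recoding and Chi-square step can be illustrated with a minimal sketch. It is not the SPSS procedure used in this study: the records, the age cut-off of 17 and the names `recode_age` and `chi_square` are hypothetical and chosen only for illustration. The p-value shortcut is exact only for a 2 x 2 table (one degree of freedom).

```python
import math
from collections import Counter

def recode_age(age, cutoff=17):
    """Recode a continuous age into two categories (cut-off is an assumption)."""
    return "younger" if age < cutoff else "older"

def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Hypothetical records: (age in years, 1 = current smoker, 0 = non-smoker)
records = [(15, 1)] * 30 + [(16, 0)] * 70 + [(17, 1)] * 50 + [(18, 0)] * 50

# Cross-tabulate the recoded age group against smoking status
counts = Counter((recode_age(age), smoker) for age, smoker in records)
table = [[counts[(group, 1)], counts[(group, 0)]] for group in ("younger", "older")]

stat, df = chi_square(table)
# For df == 1 the upper tail of the chi-square distribution equals erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(stat / 2))
print(table, round(stat, 3), df, round(p_value, 4))
```

With more than two categories per variable the statistic and degrees of freedom are computed in the same way, but the p-value then requires the general chi-square distribution rather than the one-degree-of-freedom shortcut.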
After the data were entered and cleaned, Chi-square analyses were used to investigate the association of students’ health-related behaviours with their background (age, class level, type of school, home district, father’s and mother’s educational level), academic performance and liking school, and to examine the interrelation between different risk behaviours. Most social and medical studies make extensive use of significance tests, and study reports are often full of the resultant ‘p-values’. The way in which significance tests have gained their important role, and the problems involved in their use, are considered by, amongst others, Atkins and Jarrett, cited by Jones and Rushton (1982):
‘‘Significance tests perform a vital function in the social sciences because they appear to supply an objective method of drawing conclusions from quantitative data. Sometimes they are used mechanically, with little comment, and with even less regard for whether or not the required assumptions are satisfied’’.
However, there is no clear agreement about the cut-off value for significance; critics of p-values point out that the criterion used to decide “statistical significance” rests on a somewhat arbitrary choice of level, which is often set at 0.05. It has been argued that studies generating a large number of measures of association are likely to produce some false-positive results through random error across the multiple comparisons. Statistically significant results can arise by chance in many epidemiological studies even if there are no real associations between factors. Thus, researchers should be aware of this and look behind the numbers. A variety of procedures have been proposed to tackle this problem; one of the best known, and most debated, is the Bonferroni adjustment. The Bonferroni adjustment, or correction, requires a smaller critical p-value for rejecting the null hypothesis on each individual test, in light of the multiple tests, to “preserve” the stated alpha level for the entire study. Epidemiologists have expressed little enthusiasm for such formal correction methods, arguing that Bonferroni adjustments are, at best, unnecessary and, at worst, deleterious to sound statistical inference (Rothman, 1990; Savitz and Olshan, 1995). The medical epidemiologist Thomas V Perneger stated:
‘‘What would happen to biomedical research if Bonferroni adjustments became routine? Cynical researchers would slice their results like salami, publishing one P value at a time to escape the wrath of the statistical reviewer. Idealists would conduct studies to examine only one association at a time - wasting time, energy, and public money. Meta-analysts would go out of business, since a pooled analysis would invalidate retrospectively all original findings by adding more tests to be adjusted for. Journals would have to create a new section entitled “P value updates,” in which P values of previously published papers would be corrected for newly published tests based on the same study. And so on ...’’ (Perneger, 1998).
In this study, however, results were considered statistically significant at the 0.05 level. This significance level was chosen because it is widely used in the literature and, most importantly, to be consistent with the majority of studies in Saudi Arabia (i.e. to allow accurate comparison).
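To make the multiple-comparison argument discussed above concrete, the following minimal sketch computes the chance of at least one false-positive result across a set of tests, and the per-test threshold that a Bonferroni adjustment would impose. It is an illustration only and was not part of the analysis in this study; the number of tests (20) is hypothetical, and the first function assumes that the tests are independent and that every null hypothesis is true.

```python
def family_wise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across independent tests
    when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(alpha, n_tests):
    """Per-test critical p-value under a Bonferroni adjustment."""
    return alpha / n_tests

alpha = 0.05   # overall significance level
n_tests = 20   # hypothetical number of comparisons

fwer_unadjusted = family_wise_error_rate(alpha, n_tests)
per_test = bonferroni_threshold(alpha, n_tests)
fwer_adjusted = family_wise_error_rate(per_test, n_tests)

print(round(fwer_unadjusted, 3), per_test, round(fwer_adjusted, 3))
```

The sketch shows both sides of the debate: without adjustment the chance of some false-positive finding grows quickly with the number of tests, while the adjusted threshold becomes very strict for each individual association.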
Multivariable logistic regression analyses were performed for binary outcomes to examine the nature of the association between risk behaviours as outcomes and explanatory factors: for example, smoking behaviour (current smoker or non-smoker) as the outcome, with socio-demographic variables or family and peer smoking status as explanatory (independent) factors. Moreover, I used linear regression for BMI, which was analysed as a continuous variable; because BMI was skewed, log BMI was used. The construction of the analysis plan was based on the findings of studies from different fields (e.g. public health, epidemiology, sociology, psychology).
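The use of log BMI as the outcome can be illustrated with a minimal one-predictor sketch. The ages and BMI values are hypothetical, and `simple_ols` is an illustrative helper rather than the SPSS procedure used here; the models in this study contained several explanatory variables.

```python
import math

def simple_ols(x, y):
    """Least-squares intercept and slope for a single explanatory variable."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: age in years and BMI in kg/m^2
ages = [15, 16, 17, 18, 19]
bmi = [20.1, 21.0, 21.4, 22.5, 23.2]

# Regress the natural log of BMI on age
log_bmi = [math.log(value) for value in bmi]
intercept, slope = simple_ols(ages, log_bmi)

# Back-transform: exp(slope) is the multiplicative change in BMI per year of age
ratio_per_year = math.exp(slope)
print(round(slope, 4), round(ratio_per_year, 4))
```

Because the outcome is on the log scale, a coefficient is read after back-transformation as a ratio (a proportional change in BMI) rather than as a difference in BMI units.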
Several hypotheses guided the analysis. First, it was hypothesized that socio-demographic/background characteristics influence adolescents’ behaviours. Therefore, their relation to the risk behaviours was examined first and then controlled for in the other models. Second, it was hypothesized that school has an important role in, and influence on, many aspects of adolescents’ lives, and that liking school and school achievement/performance are associated with adolescents’ behaviours.
Third, health risk behaviours are interrelated and tend to be inter-associated. Typically, involvement in one risk behaviour is found to be positively associated with
involvement in other risk behaviours. Therefore, logistic analyses were used to
P-values from likelihood ratio tests (F-tests for linear regression models) were reported for all variables in all results tables from the regression analyses. Likelihood ratio or F-tests were used to determine whether there was an overall significant association for each variable. Where a variable had more than two levels, Wald tests (binary logistic regression) or t-tests (linear regression) were used to test for differences between the levels of that variable. Odds ratios (OR) with 95% confidence intervals (CI) from logistic regression were reported for the risk factors related to each health risk behaviour. I tried not to force the models, although I included several risk behaviour variables in one model. The purpose of entering variables from different domains simultaneously was to reduce the number of models computed and to identify the variables that were independently associated with the outcome. Automatic selection techniques (backward elimination or forward selection) were not applied, since several variables from different domains were entered in one model and because it was recognised that risk behaviours tend to cluster. Instead, the non-significant variables were removed one at a time, starting with the variable with the highest p-value.
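For a single binary explanatory variable, the quantities reported from logistic regression can be reproduced by hand from a 2 x 2 table, which gives a feel for what the odds ratio, its 95% confidence interval and the likelihood ratio test represent. The sketch below is an illustration under stated assumptions: the counts are hypothetical, all four cells must be non-zero, and the function names are illustrative. It does not reproduce the multivariable models fitted in SPSS.

```python
import math

def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI from a 2 x 2 table.
    a, b = exposed with / without the outcome
    c, d = unexposed with / without the outcome"""
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

def likelihood_ratio_test(a, b, c, d):
    """Likelihood ratio statistic comparing the one-predictor model with the
    intercept-only model, and its p-value on one degree of freedom."""
    table = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    n = a + b + c + d
    g = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            g += table[i][j] * math.log(table[i][j] / expected)
    g *= 2
    p_value = math.erfc(math.sqrt(g / 2))  # chi-square upper tail, df = 1
    return g, p_value

# Hypothetical counts: smokers / non-smokers among students whose peers
# smoke (40 / 60) and whose peers do not smoke (20 / 80)
odds_ratio, ci_lower, ci_upper = odds_ratio_wald_ci(40, 60, 20, 80)
lr_stat, lr_p = likelihood_ratio_test(40, 60, 20, 80)
print(round(odds_ratio, 2), round(ci_lower, 2), round(ci_upper, 2))
print(round(lr_stat, 2), round(lr_p, 4))
```

A confidence interval that excludes 1 and a small likelihood ratio p-value point in the same direction here, but the two are based on different approximations and need not agree exactly in every data set.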
The analysis of each behaviour followed several stages. Stage one aimed to investigate the association between the behaviour and the socio-demographic variables. I started the regression analyses with a first model examining the association between a single behaviour (for example, smoking) and all the socio-demographic variables. Personal and socio-demographic backgrounds are associated with behaviours. For example, it has been found that risk behaviours increase with age (Currie et al., 2004, 2008; Shaw et al., 2010) and boys seem to have a higher number of concurrent risk behaviours (Chou et al., 2006; Brener and Collins, 1998; Currie et al., 2008; Shaw et al., 2010). In the second model, I added the interaction between father’s and mother’s education, since the preliminary analysis showed that they were highly correlated. In model 3, I removed the non-significant variables one at a time.
Stage two aimed to investigate the association between a single behaviour and the school variables (school performance and school satisfaction) and general perceived health status. I started with a first model examining the association between a single behaviour and school performance, school satisfaction and
perceived health status while controlling for socio-demographic variables. Studies have reported that students’ perceptions of school are related to academic achievement (Kristjansson et al., 2009) and socioeconomic position (Koivusilta et al., 2006; Koivusilta et al., 2003). Also, self-reported academic achievement has been shown to be associated with multiple risk behaviours and may play a role in the adoption of risk behaviours (Kristjansson et al., 2010; Martins and Alexandre, 2009; Lynskey and Hall, 2000; Patterson et al., 2004; Schnohr et al., 2009). In model 2, because liking school and school satisfaction were highly correlated in the preliminary analysis, I added the interaction between school performance and school satisfaction to the model, and in model 3 I removed the non-significant variables one at a time.
Stage three aimed to investigate the co-occurrence (clustering) of health risk behaviours. There is evidence that health risk behaviours tend to cluster together, with similar risk factors for many different risk behaviours (Lindberg et al., 2000; Brener and Collins, 1998; DuRant et al., 1999; Viner et al., 2006; Rhee et al., 2007). In model 1, I investigated the association between a single behaviour and several health-related behaviours while adjusting for socio-demographic status. Then, I removed the non-significant variables from the model (see Figure 4-2 below).
Finally, all significant variables in the final model of each stage were included in one final regression model, which allowed identification of whether variables were truly independently associated with the outcomes, rather than associated through confounding by variables not analysed in the separate models. Moreover, all the final models were re-run excluding the non-significant variables.
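The manual removal of non-significant variables used within each stage can be sketched as a simple loop. This is an illustration only: `backward_eliminate` is a hypothetical helper, `toy_fit` stands in for refitting the regression model, and its p-values are invented. In a real analysis the p-values of the remaining variables change each time the model is refitted.

```python
def backward_eliminate(variables, fit_p_values, alpha=0.05):
    """Remove the least significant variable, refit, and repeat until every
    remaining variable has a p-value below alpha."""
    kept = list(variables)
    while kept:
        p_values = fit_p_values(kept)  # refit with the current set
        worst = max(kept, key=lambda name: p_values[name])
        if p_values[worst] < alpha:
            break  # everything left is significant
        kept.remove(worst)
    return kept

# Invented p-values, fixed for simplicity (an assumption of this sketch)
TOY_P_VALUES = {
    "age": 0.01,
    "school_type": 0.40,
    "father_education": 0.03,
    "home_district": 0.75,
}

def toy_fit(variables):
    """Stand-in for fitting a model and reading off one p-value per variable."""
    return {name: TOY_P_VALUES[name] for name in variables}

final_variables = backward_eliminate(list(TOY_P_VALUES), toy_fit)
print(final_variables)
```

Removing one variable at a time keeps each decision visible to the analyst, which is the reason given above for preferring this approach to automatic selection.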
Figure 4-2. Analytical approach
* Family members’ and peers’ behaviours were only investigated with regard to students’ smoking and physical activity behaviours.
In each model, I tested and reported goodness of fit (the Hosmer and Lemeshow Test and the Omnibus Test). A non-significant Hosmer and Lemeshow test indicates that the model fits the data well, i.e. that the observed and predicted outcome frequencies do not differ significantly. The Omnibus Test, on the other hand, shows whether the model containing the explanatory variables fits significantly better than the model without them, and has to be significant for the best-fitting model. In SPSS, an observation with a missing value in at least one of the variables in the model is excluded from the analysis by default.
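The logic of the Hosmer and Lemeshow test can be sketched as follows. This is a simplified illustration, not the SPSS implementation: observations are sorted by predicted probability and split into ten equal-sized groups, the data are simulated, and the p-value helper `chi2_sf_even_df` is a closed form that is valid only for an even number of degrees of freedom (here 10 - 2 = 8).

```python
import math
import random

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable (even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

def hosmer_lemeshow(outcomes, predicted, groups=10):
    """Compare observed and expected event counts within groups of
    observations ordered by predicted probability."""
    pairs = sorted(zip(predicted, outcomes))
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        n_g = len(chunk)
        observed = sum(y for _, y in chunk)
        expected = sum(p for p, _ in chunk)
        mean_p = expected / n_g
        stat += (observed - expected) ** 2 / (n_g * mean_p * (1 - mean_p))
    df = groups - 2
    return stat, df, chi2_sf_even_df(stat, df)

# Simulated data: outcomes drawn from the predicted probabilities themselves,
# so the "model" is correctly specified by construction
random.seed(1)
predicted = [random.uniform(0.05, 0.95) for _ in range(500)]
outcomes = [1 if random.random() < p else 0 for p in predicted]

hl_stat, hl_df, hl_p = hosmer_lemeshow(outcomes, predicted)
print(round(hl_stat, 2), hl_df, round(hl_p, 3))
```

A small p-value would indicate that observed and predicted frequencies differ by more than chance would explain, which is why a non-significant result is the desired outcome for this test.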
There has been growing interest in considering factors defined at multiple levels in public health research, and over the past few years multilevel analysis has emerged as one analytical strategy that may partly address this need. In the context of my study, behaviours may cluster among pupils within specific classes or schools, because adolescent students are nested within classes and within schools, and these settings might influence adolescents’ behaviours through peer pressure or the impact of individual teachers. I acknowledge that not using multilevel analysis is a limitation of my analysis; it could be the subject of further research.
[Figure 4-2 labels: Age; School type; Father’s education; Mother’s education; Home district; School performance; Liking school; Perceived health; Family members’ behaviours*; Peers’ behaviours*; Other risk behaviours (clusters of behaviours); Single risk behaviours (e.g. smoking, dietary, physical activity, violence)]
However, in my study all the students were from Riyadh, all were male (i.e. no gender differences), there were no race differences (i.e. all were Arab and Muslim), and all were from high schools. Moreover, no previous social, psychological or economic research data suggest differences among adolescents in different districts of Riyadh.