Chapter 4 Data management and statistical methods
4.1 Data management
4.1.1 Data collection and data handling
Questionnaires were administered directly to the cases and controls by the interviewer. Detailed instructions for completion of the questionnaires were provided and discussed at site visit meetings with all fieldworkers in all collaborating centres. A sample instruction manual was produced by the London co-ordinating centre and all collaborating centres produced their own extended version of this document. These manuals included precise instructions for administering and coding the questionnaires. In addition, the manuals defined any medical or technical terms in the questionnaire.
All data were recorded in duplicate on the study questionnaire using carbon copies. Form A and Form B were administered directly to the cases and hospital controls, whilst in hospital, by the interviewer. Form C was completed for cases only and without seeing the subject. This section was completed by reference to medical records, the senior nursing staff in charge of the patient, nursing records and/or the doctors responsible for the management of the case.
Most of the variables were coded by the local centres. After local checking, copies of completed questionnaires were sent to London for re-checking, computerisation, verification and validation. Data entry was carried out by one person in London, and a second entry was carried out by a different person to detect keying errors. Errors or queries arising from the data entry were noted on a duplicated query sheet which was sent to the relevant centre for correction and return (Figure 4.1).
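The double-entry check described above can be illustrated with a minimal sketch. The record identifiers, field names and values are hypothetical and are not taken from the study questionnaires; the sketch only shows how two independent keyings of the same forms might be compared to produce a list of queries.

```python
def keying_discrepancies(first_entry, second_entry):
    """Compare two independent keyings of the same questionnaires.

    Both arguments map a record identifier to a dict of field values.
    Returns one (record_id, field, first_value, second_value) tuple for
    every field whose value differs or is missing in one of the entries.
    """
    queries = []
    for record_id in sorted(set(first_entry) | set(second_entry)):
        a = first_entry.get(record_id, {})
        b = second_entry.get(record_id, {})
        for field in sorted(set(a) | set(b)):
            if a.get(field) != b.get(field):
                queries.append((record_id, field, a.get(field), b.get(field)))
    return queries


# Hypothetical example: two keyings of two questionnaires.
entry_1 = {
    "0001": {"age": 34, "smoking": "never"},
    "0002": {"age": 41, "smoking": "current"},
}
entry_2 = {
    "0001": {"age": 34, "smoking": "never"},
    "0002": {"age": 14, "smoking": "current"},  # transposed digits
}

query_sheet = keying_discrepancies(entry_1, entry_2)
```

Each tuple in `query_sheet` corresponds to one line that would be noted on a query sheet for the relevant centre.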
4.1.2 Quality control
Quality control was maintained during data collection by means of:
i) On-site training of fieldworkers in questionnaire completion, which was achieved by regular site visits to all collaborating centres by the study coordinator.
ii) Rapid feedback of queries arising from incomplete or inadequately completed questionnaires.
iii) Cross checks of several questions which were incorporated into the design and coding of the questionnaire.
iv) Use of locally prepared manuals of instructions.
v) Evaluating completeness of case and control recruitment as frequently as facilities allowed in those centres where accurate hospital discharge data were produced.
4.2 Statistical methods
4.2.1 Specification of derived variables used in data analysis
Smoking was categorised by self-report as never smoker, ex-smoker and current smoker. Relative body weight was measured using body mass index (BMI) (self-reported weight [in kg] divided by height [in metres] squared) and categorised into four levels (BMI <20, 20-25, >25-30, >30). Oral contraceptive (OC) use was classified as never user, past user and current user. A current user was defined as one who had used OCs at any time within the 3 months prior to the event precipitating hospital admission for cases, or the 3 months prior to hospital admission for controls. Alcohol consumption was classified into 4 categories: never drinker, occasional drinker, <14 units per week and >14 units per week. One unit is equivalent to 10 g of alcohol. Past medical history of hypertension, diabetes, rheumatic heart disease and abnormal lipid profile was based on self-report, recorded as 'no', 'yes' or 'don't know'. Locally defined social class was categorised into 5 levels: upper, upper middle, middle, lower middle and low.
Educational attainment was classified as none or primary level (lowest), secondary level (intermediate), vocational or technical level and university (highest). Due to the small number of subjects in the 'university' category, this was combined with vocational or technical level to form a single category.
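As an illustration of how two of these derived variables could be computed, the following is a minimal sketch. The function names, the handling of values falling exactly on a category boundary, and the representation of OC use as "months since last use" are assumptions made for the example, not details given in the text.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kg divided by height in metres squared."""
    return weight_kg / height_m ** 2


def bmi_category(value):
    """Assign one of the four BMI levels (<20, 20-25, >25-30, >30)."""
    if value < 20:
        return "<20"
    if value <= 25:
        return "20-25"
    if value <= 30:
        return ">25-30"
    return ">30"


def oc_status(months_since_last_use):
    """Classify OC use relative to the index date.

    months_since_last_use is None for a woman who never used OCs
    (hypothetical encoding). Use within the 3 months before the index
    date counts as current use; treating exactly 3 months as current
    is an assumption.
    """
    if months_since_last_use is None:
        return "never user"
    if months_since_last_use <= 3:
        return "current user"
    return "past user"


# Hypothetical subject: 68 kg, 1.65 m, last used OCs 2 months before admission.
subject_bmi = bmi(68, 1.65)
subject_level = bmi_category(subject_bmi)
subject_oc = oc_status(2)
```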
4.2.2 Power calculation and statistical analysis
There are 9297 subjects from fifteen countries (the former Yugoslavia considered as one country) participating in this study. Analyses for individual countries would not generate sufficient power to establish a firm association between potential risk factors and disease, especially when the incidence of disease is low among young women and the strength of association is modest. In order to obtain more reliable information about the association, the data were pooled according to their geographical areas (Figure 3.1).
The average number of controls recruited per case and the prevalence of risk factors among controls varied between regions and diseases. The following table shows the statistical power that this study would have to detect an increased odds ratio of 2 (two-sided significance level α = 0.05).
           Ischaemic               Haemorrhagic            Unspe   AMI
           stroke                  stroke                  stroke
           Asia    E.E.    L.A.    Asia    E.E.    L.A.    Africa  E.E.    N.E.
Cases      280     125     268     322     187     495     165     146     170
Ratio      2.92    2.69    2.74    3.00    2.75    2.63    2.66    2.50    2.70
Prev (%)
HBP        5.8     10.1    6.0     2.6     8.6     9.2     7.7     6.5     7.6
HIP        8.1     14.6    11.3    6.0     9.4     15.7    12.5    11.2    9.7
DM         0.9     1.5     1.2     0.4     0.8     2.2     0.5     1.1     2.2
FHCVA      1.5     1.2     2.1     1.9     3.1     3.4     2.5     2.2     3.3
FHAMI      0.7     2.1     2.5     1.3     2.1     3.2     0.9     2.7     1.5
LIPIDS     0.4     1.5     0.4     0.1     0.4     0.3     0       0.3     1.1
Cur OC     6.7     22.9    13.6    6.9     10.1    11.9    23.2    19.1    9.1
Sec Ed     40.3    36.6    33.3    37.5    44.2    29.8    25.7    31.2    36.3
Low Ed     38.2    39.3    45.8    38.8    38.9    55.1    70.4    43.3    50.5
Power (%)
HBP        83      67      81      58      79      100     69      57      71
HIP        92      78      96      87      82      100     84      75      79
Cur OC     83      71      95      89      80      100     89      76      71
Sec Ed     96      67      94      98      75      99      12      75      58
Low Ed     96      67      94      98      75      99      17      75      60
Smallest risk estimates
DM         3.0     3.7     2.8     4.2     4.1     1.9     5.9     4.1     2.7
FHCVA      2.5     4.1     2.3     2.1     2.2     1.7     2.6     2.9     2.3
FHAMI      3.4     3.1     2.1     2.4     2.6     1.9     4.2     2.7     3.1
LIPIDS     4.6     3.7     4.9     10.7    6.2     4.1     -       9.7     3.6

Unspe      Unspecified stroke
E.E.       Eastern Europe
L.A.       Latin America and Jamaica
N.E.       non-Europe
Ratio      number of controls per case
Prev       prevalence
HBP        high blood pressure
HIP        BP problem in pregnancy
DM         diabetes mellitus
FHCVA      family history of premature stroke
FHAMI      family history of premature heart attack
LIPIDS     abnormal blood lipids
Cur OC     current OCs use
Sec Ed     secondary education
Low Ed     low education
For 280 ischaemic stroke cases in Asia, a ratio of 2.9 controls for each case with an exposure prevalence of 5.8% among controls gave a power of 83% to detect an increased odds ratio of 2; an exposure prevalence of 8.1% (HIP) gave a power of 92%. With more cases and a higher exposure prevalence, the study would have more than 80% power to detect an increased risk of 2 or more. For example, there were 495 haemorrhagic stroke cases in Latin America and, with an exposure prevalence of 9.2% or more among the controls, the study would have 100% power to detect an increased risk for those exposures. The table also illustrates the smallest significant odds ratios that this study would have been able to detect for a given low exposure prevalence with 80% certainty (Schlesselman, 1982).
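The kind of calculation behind such power figures can be sketched as follows. This is a common normal-approximation formula for an unmatched case-control comparison with several controls per case; it is an assumption that it resembles the method actually used, and its results need not agree exactly with the tabulated values, which were derived following Schlesselman (1982) for a matched study. The function name and argument names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def case_control_power(n_cases, controls_per_case, prevalence,
                       odds_ratio=2.0, alpha=0.05):
    """Approximate power of an unmatched case-control study.

    prevalence is the exposure prevalence among controls, given as a
    proportion (0.058 for 5.8%). The test is two-sided at level alpha.
    """
    p0 = prevalence
    c = controls_per_case
    # Exposure prevalence among cases implied by the odds ratio.
    p1 = p0 * odds_ratio / (1 + p0 * (odds_ratio - 1))
    # Pooled exposure prevalence, weighted by group size.
    p_bar = (p1 + c * p0) / (1 + c)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    numerator = (sqrt(n_cases) * (p1 - p0)
                 - z_alpha * sqrt((1 + 1 / c) * p_bar * (1 - p_bar)))
    denominator = sqrt(p1 * (1 - p1) + p0 * (1 - p0) / c)
    return NormalDist().cdf(numerator / denominator)


# Inputs taken from the table: ischaemic stroke in Asia (high blood pressure)
# and haemorrhagic stroke in Latin America (high blood pressure).
power_asia = case_control_power(280, 2.92, 0.058)
power_latin_america = case_control_power(495, 2.63, 0.092)
```

The sketch reproduces the qualitative pattern described above: power rises with the number of cases and with the exposure prevalence among controls.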
The size of the study was unlikely to give it the power to detect interaction that deviated significantly from a simple multiplication of risk.
Because of the small number of ischaemic or haemorrhagic stroke cases in Africa, the results are of little help in assessing the relation of risk factors to the risk of disease.
Homogeneity of odds ratios of countries within regions was assessed by likelihood ratio statistics, obtained by subtracting the deviance for the model containing the main effect of educational level and education-country interaction terms from the deviance for the model containing only the main effect of educational attainment for that region. We obtained $\chi^2_{(\text{educational levels}-1)\times(\text{number of countries}-1)}$ statistics for each of the 4 regions using conditional logistic regression, so that all analyses would take into account the matched design of the study. Subjects with incomplete data were excluded from the analyses that included the missing variables. Consequently, the number of subjects varied when different independent variables were used. Comparisons of crude (unadjusted) and adjusted odds ratios, however, were based on subjects with no missing data.
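The likelihood ratio test of homogeneity described above can be sketched as follows. The deviance values and the number of countries are hypothetical. The p-value uses the closed-form chi-square tail probability that holds only for an even number of degrees of freedom, which is the case when there are three educational levels; that restriction is a simplification made for the example.

```python
from math import exp, factorial


def homogeneity_lr_test(deviance_main_only, deviance_with_interaction,
                        n_levels, n_countries):
    """Likelihood ratio statistic, degrees of freedom and p-value.

    The statistic is the deviance of the model with only the main effect
    of education minus the deviance of the model that also contains the
    education-country interaction terms.
    """
    statistic = deviance_main_only - deviance_with_interaction
    df = (n_levels - 1) * (n_countries - 1)
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch handles even, positive df only")
    # For df = 2k: P(X > x) = exp(-x/2) * sum_{i<k} (x/2)^i / i!
    half = statistic / 2
    p_value = exp(-half) * sum(half ** i / factorial(i) for i in range(df // 2))
    return statistic, df, p_value


# Hypothetical deviances for one region with 3 educational levels and 4 countries.
lr_stat, lr_df, lr_p = homogeneity_lr_test(812.4, 803.1, n_levels=3, n_countries=4)
```

A large p-value would give no evidence against homogeneity of the odds ratios across the countries of that region.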
Unadjusted odds ratio estimates associated with educational attainment were calculated for each country, using high education as the reference group. Tests of heterogeneity of the odds ratios across countries were performed prior to pooling the data.
Univariate analyses were performed to identify individual risk factors for stroke and acute myocardial infarction separately, using disease status as the dependent variable.
In studying the possible relationship between educational level and disease, and the extent to which other postulated risk factors had contributed, a multivariate logistic regression model was fitted that included educational attainment and those potential confounding variables which were shown to produce at least a 5% change in odds ratios in a stepwise procedure. The following variables were considered as potential confounding variables: smoking, self-reported history of high blood pressure, diabetes mellitus, abnormal blood lipid profile, oral contraceptive (OC) use, alcohol consumption and body mass index (BMI). Tests for trend in odds ratios across educational levels were calculated by fitting a linear term for education. To assess the impact of risk factor adjustment on the OR associated with educational attainment, the proportion of excess risk accounted for by any risk factor adjustment was calculated as: $(\mathrm{OR}_{\text{adjusted}} - \mathrm{OR}_{\text{unadjusted}})/(\mathrm{OR}_{\text{unadjusted}} - 1)$. The impact of risk factor adjustment on trend was assessed by calculating the percentage change in the linear slope term for education. A second logistic regression model, containing the known risk factors for CVD and any potential risk factors which were shown to be statistically significant on univariate analysis, was fitted to assess the independent effect of each factor on risk of the study disease.
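A short worked example may help in reading the excess risk measure, taken here as (OR adjusted - OR unadjusted)/(OR unadjusted - 1). The odds ratios and slope values below are hypothetical, and the exact definition of the percentage change in slope (change relative to the unadjusted slope) is an assumption, since the text does not spell it out.

```python
def excess_risk_change(or_unadjusted, or_adjusted):
    """(OR_adjusted - OR_unadjusted) / (OR_unadjusted - 1).

    With this sign convention the value is negative when adjustment
    moves the odds ratio towards 1 from above.
    """
    if or_unadjusted == 1:
        raise ValueError("undefined when the unadjusted odds ratio is 1")
    return (or_adjusted - or_unadjusted) / (or_unadjusted - 1)


def slope_percent_change(slope_unadjusted, slope_adjusted):
    """Percentage change in the linear education term after adjustment
    (assumed definition: change relative to the unadjusted slope)."""
    return 100 * (slope_adjusted - slope_unadjusted) / slope_unadjusted


# Hypothetical values: the OR for low education falls from 2.0 to 1.5
# after adjustment, and the trend slope falls from 0.40 to 0.30.
change_in_excess_risk = excess_risk_change(2.0, 1.5)
change_in_slope = slope_percent_change(0.40, 0.30)
```

In this hypothetical case the magnitude of `change_in_excess_risk` indicates that half of the excess risk above 1 is accounted for by the adjustment.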
Conditional logistic regression analyses were performed using EGRET software (SERC and CYTEL, 1991). All other statistical analyses were performed using the Statistical Analysis System (SAS), Version 6.08.
Figure 4.1 Data flow and management
[Flow chart; recoverable labels: Collaborating Hospital; Interviewer; Diagnostic Reviewer; Local Coordinating Centre; Local Coordinator; Study Coordinating Centre; Data Entry; Data Manager; Data Validation; Study Database. Forms: Form A: Eligibility; Form B: Sociodemographic data, medical and contraceptive use; Form C: Diagnostic information (cases only). Flow labels: Confirm questionnaires completion; copy retained; Review problems and missing data; Review; Correct; Confirm.]