A. EDUYEARS ANALYSES
Cohorts were asked to estimate this regression equation for each measured SNP (we drop the SNP subscript j here to avoid notational clutter):
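(4.1) $\mathit{EduYears} = \beta_0 + \beta_1 \mathit{SNP} + \mathbf{PC}\,\boldsymbol{\gamma} + \mathbf{B}\,\boldsymbol{\alpha} + \mathbf{X}\,\boldsymbol{\theta} + \epsilon,$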
where SNP is the allele dose of the SNP; PC is a vector of the first ten principal components of the variance-covariance matrix of the genotypic data, estimated after the removal of genetic outliers; B is a vector of standardized controls, including a third-order polynomial in age, an indicator for being female, and their interactions; and X is a vector of study-specific controls. Specifically, in X, study analysts were encouraged to include dummy variables for major events such as wars or policy changes that may have affected access to education in their specific sample. Mixed-sex cohorts were additionally asked to upload separate regression results for men and women.
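A minimal sketch of how one single-SNP regression of the form (4.1) could be estimated. Everything in it is an assumption made for illustration and is not taken from the Analysis Plan: the data are simulated, the effect sizes are invented, only one principal component is used instead of ten, the sex-by-age interactions and the study-specific controls X are omitted, and the fit is plain OLS through the normal equations rather than whatever software the cohorts used.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


def simulate(n=2000, beta1=0.5, seed=0):
    """Simulate a hypothetical cohort; all coefficients here are made up."""
    rng = random.Random(seed)
    rows, y = [], []
    for _ in range(n):
        snp = sum(rng.random() < 0.3 for _ in range(2))  # allele dose: 0, 1 or 2
        pc1 = rng.gauss(0, 1)                            # one PC (the paper uses ten)
        age = rng.uniform(-1, 1)                         # age rescaled to [-1, 1]
        female = rng.randint(0, 1)
        # Columns: intercept, SNP, PC, third-order polynomial in age, female
        rows.append([1.0, snp, pc1, age, age**2, age**3, female])
        y.append(12 + beta1 * snp + 0.3 * pc1 + 0.2 * age
                 - 0.4 * female + rng.gauss(0, 1))
    return rows, y


rows, y = simulate()
beta_hat = ols(rows, y)
beta1_hat = beta_hat[1]  # estimated effect of one additional allele copy
print(f"estimated beta_1 = {beta1_hat:.3f} (simulated true value 0.5)")
```

In a real GWAS this regression would be repeated for every measured SNP, with only the SNP column changing between runs.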
B. COLLEGE ANALYSES
The College specification is analogous to the EduYears specification. Cohorts uploaded coefficient estimates from either a linear probability model or a logistic regression model.
Linear Regression. The linear model can be written as
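(4.2) $\mathit{College} = \beta_{0,\mathrm{lin}} + \beta_{1,\mathrm{lin}} \mathit{SNP} + \mathbf{PC}\,\boldsymbol{\gamma}_{\mathrm{lin}} + \mathbf{B}\,\boldsymbol{\alpha}_{\mathrm{lin}} + \mathbf{X}\,\boldsymbol{\theta}_{\mathrm{lin}} + \epsilon_{\mathrm{lin}},$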
where $\mathit{College}$ is an indicator variable equal to one for individuals who completed college, the other variables are defined as above, and the subscript "lin" indicates that the parameters correspond to the linear probability model. The parameter $\beta_{1,\mathrm{lin}}$ is the average change in the fraction of subjects whose value of $\mathit{College}$ is equal to one associated with being endowed with one more copy of the reference allele, after linear adjustment for the covariates.
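Logistic Regression. Most participating cohorts uploaded coefficient estimates from the logistic regression model,
(4.3) $P(\mathit{College} = 1 \mid \mathit{SNP}, \mathbf{PC}, \mathbf{B}, \mathbf{X}) = \dfrac{1}{1 + e^{-(\beta_{0,\mathrm{log}} + \beta_{1,\mathrm{log}} \mathit{SNP} + \mathbf{PC}\,\boldsymbol{\gamma}_{\mathrm{log}} + \mathbf{B}\,\boldsymbol{\alpha}_{\mathrm{log}} + \mathbf{X}\,\boldsymbol{\theta}_{\mathrm{log}})}},$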
where the subscript "log" is used to label coefficients from the logistic model. In this model, the parameter $\beta_{1,\mathrm{log}}$ can be interpreted as follows: controlling for the covariates, the odds of having completed college are multiplied by a factor of $e^{\beta_{1,\mathrm{log}}}$ for each additional copy of the reference allele.
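The odds interpretation can be checked numerically. The sketch below evaluates the logistic form of (4.3) at allele doses 0 and 1 and confirms that the ratio of the two odds equals the exponentiated SNP coefficient. The coefficient values are hypothetical, and the covariate terms are folded into the intercept for simplicity.

```python
import math


def college_probability(linear_index):
    """Logistic function: P(College = 1) for a given value of the linear index."""
    return 1.0 / (1.0 + math.exp(-linear_index))


def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)


# Hypothetical coefficients, chosen only for illustration.
beta0_log = -1.0   # intercept (covariates held fixed and absorbed here)
beta1_log = 0.05   # log-odds change per copy of the reference allele

p_dose0 = college_probability(beta0_log + beta1_log * 0)
p_dose1 = college_probability(beta0_log + beta1_log * 1)

odds_ratio = odds(p_dose1) / odds(p_dose0)
print(odds_ratio, math.exp(beta1_log))  # the two values agree up to rounding
```

Note that the probability itself does not change by a constant amount per allele: only the odds scale by a constant factor, which is what distinguishes this coefficient from the one in the linear probability model.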
C. SAMPLE SELECTION CRITERIA
Only individuals satisfying the following criteria were eligible for inclusion in the estimation sample:
a. Educational attainment was measured when the subject was 30 years of age or older.
b. The subject passed the cohort's standard quality controls, which typically include removal of subjects who are genetic outliers (to mitigate stratification concerns) and subjects with poor genotyping rates.
c. The subject is of European ancestry, and the subject's mother tongue is the same as the main language in the country of the cohort.
d. All relevant covariates are available for the subject.
D. STUDY-SPECIFIC DETAILS
The EduYears analyses are based on summary statistics from all 64 samples listed in Supplementary Table 1.1 of Okbay, Beauchamp, et al. (2016). Of the 64 samples, whose combined sample size is N=293,723, 5 were from single-sex cohorts, and 59 contained pooled results from mixed-sex cohorts (which additionally uploaded separate results for men and women).
The College analyses were based on results from 52 of the 64 EduYears samples. The combined sample size of these 52 cohorts is N=280,007. One small cohort, LBC1921, is excluded because it did not upload College results. The cohort analyst determined that the low fraction of college-educated individuals (1-5%) and the small sample would not yield reliable estimates of the standard errors. Indeed, because analytical standard errors may not be reliably estimated in small samples when the dependent variable is rare, we restrict our final analysis to cohorts with a combined sample size ($N_{\mathrm{tot}}$) of at least 500 and at least 100 cases ($N_{\mathrm{cases}}$). We also drop one family-based cohort (ERF) and one isolate (ORCADES) because the estimated standard errors of the logistic regression coefficients did not account for the sample relatedness (in both cases, the standard errors from their EduYears analyses did account for relatedness). Column 3 of Supplementary Table 1.5 in Okbay, Beauchamp, et al. (2016) reports whether a given sample was included in the College analyses and also explains why, in two samples, the EduYears sample size is not identical to the College sample size.
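The inclusion rules for the College analyses can be expressed as a simple filter. The sketch below is illustrative only: the `Cohort` record, its field names, and the example cohorts with their counts are hypothetical and do not correspond to any row of Supplementary Table 1.5.

```python
from dataclasses import dataclass


@dataclass
class Cohort:
    name: str
    uploaded_college: bool                    # did the cohort upload College results?
    n_tot: int                                # combined sample size
    n_cases: int                              # number of college completers
    se_accounts_for_relatedness: bool = True  # trivially True for unrelated samples


def college_exclusion_reason(c, min_n_tot=500, min_n_cases=100):
    """Return None if the cohort enters the College analyses, else the reason it is dropped."""
    if not c.uploaded_college:
        return "no College results uploaded"
    if c.n_tot < min_n_tot:
        return f"combined sample size below {min_n_tot}"
    if c.n_cases < min_n_cases:
        return f"fewer than {min_n_cases} cases"
    if not c.se_accounts_for_relatedness:
        return "standard errors do not account for relatedness"
    return None


# Hypothetical cohorts with invented counts.
cohorts = [
    Cohort("cohort_A", True, 5000, 1200),
    Cohort("cohort_B", False, 450, 15),
    Cohort("cohort_C", True, 800, 60),
    Cohort("cohort_D", True, 2000, 400, se_accounts_for_relatedness=False),
]

included = [c.name for c in cohorts if college_exclusion_reason(c) is None]
reasons = {c.name: college_exclusion_reason(c) for c in cohorts}
print(included)
```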
Column 4 reports whether the cohorts omitted any of the basic control variables recommended in the Analysis Plan in their specification. For example, some cohorts dropped higher-order polynomials in birth year because collinearity was causing problems in model estimation. Column 5 lists extra controls included by the cohorts in the vector X, such as controls for cohort-specific events that may have impacted the education system in the cohort.
Several cohorts contain samples with related subjects. The Analysis Plan encouraged cohorts that include related subjects to estimate mixed linear models (MLMs) (Kang et al., 2010; Yang, Zaitlen, Goddard, Visscher, & Price, 2014). To facilitate their implementation, the Analysis Plan contained a supplement with sample code for MLM estimation written for the software GCTA (Yang, Lee, et al., 2011). Conceptually, the estimation of MLMs involves two steps: (i) the genome-wide data are used to estimate the degree of genetic similarity between each pair of individuals in the sample, and (ii) unlike in standard regression, where the covariance of the error term (in an educational attainment regression) between any two individuals is assumed to be zero, the covariance is fitted as an increasing linear function of the individuals' genetic similarity. In other words, to the extent that two individuals are more recently descended from a common ancestor (as very accurately measured by overall genetic similarity), and thus are more likely to be similar on unobserved environmental factors, these individuals are treated as correlated observations.
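The two steps can be written compactly as follows. This is one common formulation, using the genetic relatedness matrix convention associated with GCTA, and is offered as a sketch rather than as the exact model each cohort fitted. The symbols $A_{ik}$, $M$, $x_{ij}$, $p_j$, $\sigma_g^2$, $\sigma_e^2$, and $\mathbf{V}$ do not appear in the text above and are introduced here for illustration.

```latex
% Step (i): genetic similarity between individuals i and k, averaged over M SNPs,
% where x_{ij} is the allele dose of individual i at SNP j and p_j is the
% allele frequency of SNP j.
A_{ik} = \frac{1}{M} \sum_{j=1}^{M}
         \frac{(x_{ij} - 2p_j)(x_{kj} - 2p_j)}{2 p_j (1 - p_j)}

% Step (ii): the error covariance is a linear, increasing function of A_{ik}.
\operatorname{Cov}(\epsilon_i, \epsilon_k) = \sigma_g^2 A_{ik} \quad (i \neq k),
\qquad
\operatorname{Var}(\boldsymbol{\epsilon}) = \mathbf{V}
  = \sigma_g^2 \mathbf{A} + \sigma_e^2 \mathbf{I},
\qquad \sigma_g^2 \ge 0
```

Setting $\sigma_g^2 = 0$ recovers the standard regression assumption of uncorrelated errors, so the MLM nests the model in (4.1) as a special case.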
Many cohorts that include related subjects have developed strategies for ensuring that the standard errors correctly account for relatedness. Column 6 of Supplementary Table 1.5 (Okbay, Beauchamp, et al., 2016) reports whether the estimated standard errors were adjusted for family relatedness and provides information about the adjustment used. The details vary by software. For example, QIMR estimated a model implemented in the software Merlin Offline (W.-M. Chen & Abecasis, 2007), in which the variance-covariance matrix of the phenotypes of members of the same family is assumed to have a particular structure according to which resemblance between relatives is induced by the additive effects of their shared genes. Some cohorts made no adjustment for non-independence but instead sought to restrict the estimation samples to conventionally unrelated individuals. For example, 23andMe restricts its estimation sample to conventionally unrelated individuals by ensuring that no pair of participants in the final estimation sample shares more than 700 centimorgans of their genome identical-by-descent (Eriksson et al., 2010).
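A restriction of this kind can be sketched as a pruning step over pairwise identical-by-descent (IBD) sharing. The greedy rule below (keep a participant only if they share at most 700 cM with everyone already kept) is an assumption made for illustration; the text does not say how 23andMe chooses which member of a related pair to drop. The participant IDs and IBD values are hypothetical.

```python
def prune_related(ids, ibd_cm, threshold_cm=700.0):
    """Greedily select participants so that no kept pair shares more than threshold_cm.

    ibd_cm maps frozenset({id1, id2}) to the centimorgans shared identical-by-descent;
    pairs missing from the mapping are treated as sharing 0 cM.
    """
    kept = []
    for person in ids:
        shares_too_much = any(
            ibd_cm.get(frozenset((person, other)), 0.0) > threshold_cm
            for other in kept
        )
        if not shares_too_much:
            kept.append(person)
    return kept


# Hypothetical participants and pairwise IBD sharing (in cM).
ids = ["a", "b", "c", "d"]
ibd_cm = {
    frozenset(("a", "b")): 1700.0,  # closely related pair: one member is dropped
    frozenset(("b", "c")): 900.0,   # irrelevant once "b" has been dropped
    frozenset(("c", "d")): 650.0,   # below the threshold: both can stay
}

unrelated = prune_related(ids, ibd_cm)
print(unrelated)
```

Because the rule is greedy, the result depends on the order of `ids`; a different ordering can retain a different, equally valid set of conventionally unrelated individuals.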