Association analyses - Essays on Genetics and the Social Sciences

A. EDUYEARS ANALYSES

Cohorts were asked to estimate this regression equation for each measured SNP (we drop the SNP subscript j here to avoid notational clutter):

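(4.1) $\mathit{EduYears} = \beta_0 + \beta_1\,\mathit{SNP} + \mathbf{PC}\,\boldsymbol{\gamma} + \mathbf{B}\,\boldsymbol{\alpha} + \mathbf{X}\,\boldsymbol{\theta} + \epsilon,$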

where SNP is the allele dose of the SNP; PC is a vector of the first ten principal components of the variance-covariance matrix of the genotypic data, estimated after the removal of genetic outliers; B is a vector of standardized controls, including a third-order polynomial in age, an indicator for being female, and their interactions; and X is a vector of study-specific controls. Specifically, in X, study analysts were encouraged to include dummy variables for major events such as wars or policy changes that may have affected access to education in their specific sample. Mixed-sex cohorts were additionally asked to upload separate regression results for men and women.
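As a rough illustration of specification (4.1), the following sketch fits the regression for a single SNP by ordinary least squares on simulated data. The function names, the simulated sample, and the choice to z-score age before forming the polynomial are assumptions made for this example; they are not taken from the Analysis Plan, and cohorts used their own software.

```python
import numpy as np


def build_b_controls(age, female):
    """Controls B: cubic polynomial in age, a female indicator, and their interactions."""
    # Assumption: age is z-scored before taking powers, for numerical stability.
    age_z = (age - age.mean()) / age.std()
    poly = np.column_stack([age_z, age_z**2, age_z**3])
    return np.column_stack([poly, female, poly * female[:, None]])


def fit_snp_ols(edu_years, snp, pcs, b_controls, x_controls):
    """Return the OLS estimate of beta_1 in (4.1) and its standard error."""
    n = len(edu_years)
    # Columns: intercept, SNP dose, principal components, B, study-specific X.
    design = np.column_stack([np.ones(n), snp, pcs, b_controls, x_controls])
    coef, _, _, _ = np.linalg.lstsq(design, edu_years, rcond=None)
    resid = edu_years - design @ coef
    sigma2 = resid @ resid / (n - design.shape[1])
    cov = sigma2 * np.linalg.inv(design.T @ design)
    return coef[1], np.sqrt(cov[1, 1])


# Simulated (hypothetical) cohort; none of these numbers come from the study.
rng = np.random.default_rng(0)
n = 2000
snp = rng.binomial(2, 0.3, size=n).astype(float)           # allele dose 0/1/2
pcs = rng.normal(size=(n, 10))                             # first ten PCs
age = rng.uniform(30, 80, size=n)
female = rng.integers(0, 2, size=n).astype(float)
x_controls = rng.integers(0, 2, size=(n, 1)).astype(float)  # e.g. a policy dummy
edu = 12 + 0.1 * snp + 0.5 * female + rng.normal(scale=3, size=n)

beta1, se1 = fit_snp_ols(edu, snp, pcs, build_b_controls(age, female), x_controls)
print(f"beta_1 = {beta1:.3f}, SE = {se1:.3f}")
```

In a genome-wide analysis the same design matrix, minus the SNP column, would be reused for every SNP; the sketch refits everything per SNP only to keep the mapping to (4.1) explicit.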

B. COLLEGE ANALYSES

The College specification is analogous to the EduYears specification. Cohorts uploaded coefficient estimates from either a linear probability model or a logistic regression model.

Linear Regression. The linear model can be written as

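(4.2) $\mathit{College} = \beta_{0,\mathrm{lin}} + \beta_{1,\mathrm{lin}}\,\mathit{SNP} + \mathbf{PC}\,\boldsymbol{\gamma}_{\mathrm{lin}} + \mathbf{B}\,\boldsymbol{\alpha}_{\mathrm{lin}} + \mathbf{X}\,\boldsymbol{\theta}_{\mathrm{lin}} + \epsilon_{\mathrm{lin}},$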

where College is an indicator variable equal to one for individuals who completed college, the other variables are defined as above, and the subscript “lin” indicates that the parameters correspond to the linear probability model. The parameter $\beta_{1,\mathrm{lin}}$ is the average change in the fraction of subjects whose value of College is equal to one associated with being endowed with one more copy of the reference allele, after linear adjustment for the covariates.

Logistic Regression. Most participating cohorts uploaded coefficient estimates from the logistic regression model,

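(4.3) $P(\mathit{College} = 1 \mid \mathit{SNP}, \mathbf{PC}, \mathbf{B}, \mathbf{X}) = \dfrac{1}{1 + e^{-(\beta_{0,\mathrm{log}} + \beta_{1,\mathrm{log}}\,\mathit{SNP} + \mathbf{PC}\,\boldsymbol{\gamma}_{\mathrm{log}} + \mathbf{B}\,\boldsymbol{\alpha}_{\mathrm{log}} + \mathbf{X}\,\boldsymbol{\theta}_{\mathrm{log}})}},$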

where the subscript “log” is used to label coefficients from the logistic model. In this model, the parameter $\beta_{1,\mathrm{log}}$ can be interpreted as follows: controlling for the covariates, the odds of having completed college are multiplied by a factor of $e^{\beta_{1,\mathrm{log}}}$ for each additional copy of the reference allele.
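To make the odds interpretation concrete, the short sketch below converts a logistic coefficient into an odds ratio and into the implied change in the probability of college completion at a given baseline. The coefficient (0.02) and the baseline probability (0.30) are hypothetical values chosen for illustration, not estimates from the study. The final lines compare the exact change with the standard first-order approximation $\beta_{1,\mathrm{log}}\,p(1-p)$, which indicates roughly how a logistic coefficient relates to a linear-probability coefficient at that baseline.

```python
import math


def odds_ratio(beta1_log):
    """Multiplicative change in the odds per additional copy of the reference allele."""
    return math.exp(beta1_log)


def prob_after_extra_allele(p_baseline, beta1_log):
    """Probability of College = 1 after one more allele copy, starting from p_baseline."""
    odds = p_baseline / (1.0 - p_baseline)
    new_odds = odds * odds_ratio(beta1_log)
    return new_odds / (1.0 + new_odds)


# Hypothetical inputs, for illustration only.
beta1_log = 0.02
p_baseline = 0.30

exact_change = prob_after_extra_allele(p_baseline, beta1_log) - p_baseline
# First-order approximation: derivative of the logistic function at the baseline.
approx_change = beta1_log * p_baseline * (1.0 - p_baseline)

print(f"odds ratio      = {odds_ratio(beta1_log):.4f}")
print(f"exact change    = {exact_change:.5f}")
print(f"approx. change  = {approx_change:.5f}")
```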

C. SAMPLE SELECTION CRITERIA

Only individuals satisfying the following criteria were eligible for inclusion in the estimation sample:

a. Educational attainment was measured when the subject was 30 years of age or older.

b. The subject passed the cohort’s standard quality controls, which typically include removal of subjects who are genetic outliers (to mitigate stratification concerns) and subjects with poor genotyping rates.

c. The subject is of European ancestry, and the subject’s mother tongue is the same as the main language in the country of the cohort.

d. All relevant covariates are available for the subject.

D. STUDY-SPECIFIC DETAILS

The EduYears analyses are based on summary statistics from all 64 samples listed in Supplementary Table 1.1 of Okbay, Beauchamp, et al. (2016). Of the 64 samples, whose combined sample size is N=293,723, 5 were from single-sex cohorts, and 59 contained pooled results from mixed-sex cohorts (which additionally uploaded separate results for men and women).

The College analyses were based on results from 52 of the 64 EduYears samples. The combined sample size of these 52 cohorts is N=280,007. One small cohort, LBC1921, is excluded because it did not upload College results: the cohort analyst determined that the low fraction of college-educated individuals (1-5%) and the small sample would not yield reliable estimates of the standard errors. Indeed, because analytical standard errors may not be reliably estimated in small samples when the dependent variable is rare, we restrict our final analysis to cohorts with a combined sample size ($N_{tot}$) of at least 500 and at least 100 cases ($N_{cases}$). We also drop one family-based cohort (ERF) and one isolate (ORCADES) because the estimated standard errors of the logistic regression coefficients did not account for the sample relatedness (in both cases, the standard errors from their EduYears analyses did account for relatedness). Column 3 of Supplementary Table 1.5 in Okbay, Beauchamp, et al. (2016) reports whether a given sample was included in the College analyses and also explains why, in two samples, the EduYears sample size is not identical to the College sample size.
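The inclusion rules for the College analyses can be expressed as a simple filter over cohort-level summaries. The sketch below is only an illustration: the record type, its field names, and the example cohorts are hypothetical, and only the thresholds (at least 500 subjects and at least 100 cases) and the relatedness condition come from the text above.

```python
from dataclasses import dataclass


@dataclass
class CohortSummary:
    """Hypothetical cohort-level record; field names are assumptions for this sketch."""
    name: str
    n_total: int                 # N_tot
    n_cases: int                 # N_cases (subjects with College = 1)
    uploaded_college: bool = True
    # True if the sample is unrelated or the standard errors adjust for relatedness.
    se_accounts_for_relatedness: bool = True


def include_in_college(cohort, min_total=500, min_cases=100):
    """Apply the size, case-count, and relatedness conditions described in the text."""
    return (
        cohort.uploaded_college
        and cohort.n_total >= min_total
        and cohort.n_cases >= min_cases
        and cohort.se_accounts_for_relatedness
    )


# Made-up cohorts, for illustration only.
cohorts = [
    CohortSummary("COHORT_A", n_total=5000, n_cases=1200),
    CohortSummary("COHORT_B", n_total=450, n_cases=120),   # too small
    CohortSummary("COHORT_C", n_total=2000, n_cases=60),   # too few cases
    CohortSummary("COHORT_D", n_total=3000, n_cases=700,
                  se_accounts_for_relatedness=False),      # SEs ignore relatedness
]
included = [c.name for c in cohorts if include_in_college(c)]
print(included)
```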

Column 4 reports whether the cohorts omitted any of the basic control variables recommended in the Analysis Plan in their specification. For example, some cohorts dropped higher-order polynomials in birth year because collinearity was causing problems in model estimation. Column 5 lists extra controls included by the cohorts in the vector X, such as controls for cohort-specific events that may have impacted the education system in the cohort.

Several cohorts contain samples with related subjects. The Analysis Plan encouraged cohorts that include related subjects to estimate mixed linear models (MLMs) (Kang et al., 2010; Yang, Zaitlen, Goddard, Visscher, & Price, 2014). To facilitate their implementation, the Analysis Plan contained a supplement with sample code for MLM estimation written for the software GCTA (Yang, Lee, et al., 2011). Conceptually, MLM estimation involves two steps: (i) the genome-wide data are used to estimate the degree of genetic similarity between each pair of individuals in the sample, and (ii) unlike in standard regression, where the covariance of the error term (in an educational attainment regression) between any two individuals is assumed to be zero, the covariance is fitted as an increasing linear function of the individuals’ genetic similarity. In other words, to the extent that two individuals are more recently descended from a common ancestor (as very accurately measured by overall genetic similarity), and thus are more likely to be similar on unobserved environmental factors, these individuals are treated as correlated observations.
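The two conceptual steps can be sketched in a few lines of NumPy. This is not the GCTA implementation and not the sample code from the Analysis Plan: the function names and the simulated data are hypothetical, and the variance components `var_g` and `var_e` are simply assumed known here, whereas MLM software estimates them from the data. Under those assumptions, step (i) builds a genetic relationship matrix from standardized allele doses, and step (ii) estimates the regression coefficients by generalized least squares with an error covariance that is linear in that matrix.

```python
import numpy as np


def genetic_relationship_matrix(genotypes):
    """Step (i): pairwise genetic similarity from allele doses (individuals x SNPs)."""
    freqs = genotypes.mean(axis=0) / 2.0
    keep = (freqs > 0) & (freqs < 1)              # drop monomorphic SNPs
    p = freqs[keep]
    z = (genotypes[:, keep] - 2 * p) / np.sqrt(2 * p * (1 - p))
    return z @ z.T / z.shape[1]


def gls_snp_effect(y, design, grm, var_g, var_e):
    """Step (ii): GLS with error covariance V = var_g * GRM + var_e * I."""
    n = len(y)
    v = var_g * grm + var_e * np.eye(n)
    v_inv_x = np.linalg.solve(v, design)          # V^{-1} X
    v_inv_y = np.linalg.solve(v, y)               # V^{-1} y
    xtvx = design.T @ v_inv_x                     # X' V^{-1} X
    coef = np.linalg.solve(xtvx, design.T @ v_inv_y)
    se = np.sqrt(np.diag(np.linalg.inv(xtvx)))
    return coef, se


# Simulated (hypothetical) data; the variance components are assumed, not estimated.
rng = np.random.default_rng(1)
n, m = 200, 500
allele_freqs = rng.uniform(0.1, 0.9, size=m)
genotypes = rng.binomial(2, allele_freqs, size=(n, m)).astype(float)
grm = genetic_relationship_matrix(genotypes)

snp = genotypes[:, 0]
y = 0.2 * snp + rng.normal(size=n)
design = np.column_stack([np.ones(n), snp])       # intercept and SNP dose only
coef, se = gls_snp_effect(y, design, grm, var_g=0.3, var_e=0.7)
print(f"beta_1 = {coef[1]:.3f}, SE = {se[1]:.3f}")
```

Setting `var_g` to zero makes the covariance matrix proportional to the identity, in which case the coefficient estimates coincide with ordinary least squares; this is the sense in which the MLM generalizes the standard regression.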

Many cohorts that include related subjects have developed strategies for ensuring that the standard errors correctly account for relatedness. Column 6 of Supplementary Table 1.5 (Okbay, Beauchamp, et al., 2016) reports whether the estimated standard errors were adjusted for family relatedness and provides information about the adjustment used. The details vary by software. For example, QIMR estimated a model implemented in the software Merlin Offline (W.-M. Chen & Abecasis, 2007), in which the variance-covariance matrix of the phenotypes of members of the same family is assumed to have a particular structure according to which resemblance between relatives is induced by the additive effects of their shared genes. Some cohorts made no adjustment for non-independence but instead sought to restrict the estimation samples to conventionally unrelated individuals. For example, 23andMe restricted its estimation sample by ensuring that no pair of participants in the final estimation sample shares more than 700 centimorgans of their genome identical-by-descent (Eriksson et al., 2010).
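One way to obtain a sample in which no pair exceeds an identity-by-descent threshold is to drop individuals greedily until no offending pair remains. The sketch below illustrates that idea with the 700 centimorgan threshold mentioned above. The greedy rule, the function name, and the toy IBD values are assumptions made for this example; the text does not describe the procedure 23andMe actually used.

```python
def prune_related(ids, ibd_cm, max_shared_cm=700.0):
    """Greedily drop individuals until no retained pair shares more than max_shared_cm.

    ids:     iterable of individual identifiers
    ibd_cm:  dict mapping (id_a, id_b) pairs to centimorgans shared identical-by-descent
    """
    related = {i: set() for i in ids}
    for (a, b), shared in ibd_cm.items():
        if shared > max_shared_cm:
            related[a].add(b)
            related[b].add(a)

    kept = set(related)
    while kept:
        # Retained individual with the most retained relatives (ties broken by id).
        worst = max(kept, key=lambda i: (len(related[i] & kept), str(i)))
        if not related[worst] & kept:
            break  # no retained pair exceeds the threshold
        kept.remove(worst)
    return kept


# Toy (hypothetical) IBD sharing values in centimorgans.
ibd = {("a", "b"): 1800.0, ("b", "c"): 900.0, ("c", "d"): 100.0}
unrelated = prune_related(["a", "b", "c", "d"], ibd)
print(sorted(unrelated))
```

Dropping the individual with the most above-threshold relatives first tends to retain more people than removing one member of each related pair at random, although it is a heuristic and does not guarantee the largest possible unrelated subset.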
