
1.2 Data and Measure of Intergenerational Mobility

1.2.1 Individual-Level Data

To measure intergenerational mobility, information on both the parent and the child as an adult is required, which in turn necessitates matching individuals across historical records to create a linked sample. Specifically, I match native-born white sons aged 3-17, whose fathers are the household heads, from the complete counts of the 1910 census to their older selves in the full 1940 enumeration.3 Both datasets are from the Integrated Public Use Microdata Series (IPUMS) (Ruggles et al. 2017). The 1910 census provides information on the economic status of fathers, while the 1940 census contains the outcomes of sons during adulthood. Although it is possible to create linked data using even older censuses, a 1910-1940 sample is preferred for two reasons.4

1 Chetty et al. (2018) refers to Chetty, Hendren, Jones, and Porter (2018), and not to Chetty, Friedman, Hendren, Jones, and Porter (2018), which will henceforth be referred to as CFHJP instead. Apart from Chetty et al. (2018), Chetty and Hendren (2018a) and Chetty et al. (2014) also provide contemporary measures of upward mobility at the CZ level. The three papers use similar data sources, but each applies a slightly different set of sample restrictions. Appendix A.1 compares the characteristics of the three samples and shows that the mobility estimates are highly correlated across these studies. Chetty et al.’s (2018) measure is preferred here because it is available separately for non-Hispanic white males, which is the demographic group that corresponds most closely to the historical linked sample described in this section.

2 CZ-level estimates of upward mobility are also available from CFHJP. I mention CFHJP separately from Chetty and Hendren (2018a) and Chetty et al. (2014) because the underlying samples in CFHJP and Chetty et al. (2018) are identical. The method used to estimate upward mobility in CFHJP, however, is slightly different from that in the other three papers. Practically, this makes little difference as the mobility estimates in CFHJP and Chetty et al. (2018) have a high correlation of 0.967 based on all 722 mainland CZs. Using Chetty et al. (2018) instead of CFHJP allows for a consistent reference throughout my chapter. This is because the age-at-move analysis in Chetty et al. (2018) is based on moves to different CZs, as is the age-at-move analysis in a later section of this chapter, whereas CFHJP implement their analysis with moves to different census tracts.

3 To simplify the analysis, I exclude adopted sons and stepsons. Chetty et al. (2018) do not distinguish between biological and adopted children/stepchildren. They consider the first person(s) who claims a child as a dependent on a 1040 tax form to be the child’s parent(s).

4 One cannot create linked data using censuses after 1940 as the full enumerations are required for linking. Since the complete counts of a given census are only publicly released 72 years after the census was taken, the 1940 full counts are the latest records that are available at the time of writing.

First, since the goal is to estimate upward mobility for each CZ, there needs to be a sufficient number of linked persons in a given CZ to ensure that the aggregate figures are reasonably accurate. This constraint is relevant for all historical censuses as match rates will be well below 100 percent when linking individuals between these records.5 However, it is more binding with older censuses due to the smaller population and lower enumeration quality further back in time, both of which reduce the number of persons who can be linked. Second, the 1940 census is the first federal census to record information on education and wages. These will be used in the subsequent analysis.

I focus on native-born white sons aged 3-17 for the following reasons. First, I limit the sample to sons because daughters tend to change their last names upon marriage, making it difficult to track them across censuses by name. Second, I look at native-born persons because a substantial share of immigrants Americanize their names with time in the US (Biavaschi et al. 2017), which could make it harder to accurately match foreign-born individuals by name.6 Linking natives also allows for a more consistent mapping of birthplaces over time – this may be relevant given the territorial and regime changes that occurred in Europe and the Russian Empire after World War I (WWI).7 Third, I consider only whites in order to obtain the widest geographic coverage whilst simplifying the analysis below by excluding variation in upward mobility across CZs that is driven solely by differences in racial composition. In addition, the match rate for blacks, who comprise the majority of native-born non-whites, is likely to be lower than that for whites as the former tend to have more common names which results in relatively fewer unique matches.8 Fourth, I use a wide age range of 3-17 to maximize the size of the linked sample.9

5 For example, Collins and Wanamaker (2015) match 26 percent of southern white men aged 0-40 in the 1910 census to the later 1930 census. Feigenbaum (2015) links 46.9 percent of white sons residing in urban locations from the 1920 census to 1940. Long and Ferrie (2013) obtain a success rate of 22 percent when matching white males aged 25 and under from the 1850 to 1880 censuses.

6 Biavaschi et al. (2017) compile a random sample of immigrants who completed their naturalization papers in New York City by 1930. About a third of these individuals Americanized their first names. This might be an upper bound for the extent of name Americanization among foreign-born persons, since immigrants who intend to stay in the US may have greater incentives to assimilate with the native population.

7 Natives will be matched by their state of birth. While states and territories within the US were being formed and divided up during the 19th and early 20th centuries, state borders were stable by 1910. Matching natives by their reported state of birth is thus likely to be more accurate than linking foreign-born persons by their country of birth.
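As an illustration only, the sample restrictions described above can be written as a simple record filter. The sketch below is not the code used to construct the linked sample: the field names (`sex`, `race`, `native_born`, `age`, `father_is_household_head`, `relation_to_father`) and the toy records are hypothetical and do not correspond to actual IPUMS variable names or codes.

```python
# Minimal sketch of the 1910 sample restrictions.
# All field names and values are hypothetical, chosen for readability;
# they are NOT IPUMS variable names or codes.

def in_linking_sample(person):
    """Return True if a 1910 record meets the restrictions stated in the text."""
    return (
        person["sex"] == "male"                             # sons only
        and person["race"] == "white"                       # whites only
        and person["native_born"]                           # native-born only
        and 3 <= person["age"] <= 17                        # aged 3-17 in 1910
        and person["father_is_household_head"]              # father heads the household
        and person["relation_to_father"] == "biological"    # no adopted sons or stepsons
    )

# Toy records (invented for illustration).
records_1910 = [
    {"id": 1, "sex": "male", "race": "white", "native_born": True, "age": 9,
     "father_is_household_head": True, "relation_to_father": "biological"},
    {"id": 2, "sex": "male", "race": "white", "native_born": True, "age": 19,
     "father_is_household_head": True, "relation_to_father": "biological"},
    {"id": 3, "sex": "male", "race": "white", "native_born": True, "age": 5,
     "father_is_household_head": True, "relation_to_father": "step"},
    {"id": 4, "sex": "female", "race": "white", "native_born": True, "age": 8,
     "father_is_household_head": True, "relation_to_father": "biological"},
]

sample_ids = [p["id"] for p in records_1910 if in_linking_sample(p)]
```

In this toy example only the first record is retained: the second is above the age limit, the third is a stepson, and the fourth is a daughter.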

Individuals are matched using the iterative approach popularized by Abramitzky et al. (2012, 2014, 2017). In their basic procedure, Abramitzky et al. (2012, 2014, 2017) first standardize the names of individuals with the New York State Identification and Intelligence System (NYSIIS) algorithm, before searching for exact matches by name, birthplace, and age.10 If an exact match cannot be found, they then allow for an age difference of one year. This is repeated one more time for persons who are still without any match, allowing for an age difference of two years. To test if their results are robust to different linking methods and to false positives, Abramitzky et al. (2012, 2014, 2017) also implement more conservative versions of their basic procedure. These variations include restricting the sample to persons who have unique NYSIIS-standardized names within 5-year age bands in each census year, using reported rather than standardized names, requiring matches to be exact on age, and using Jaro-Winkler string distances between names to determine matches, amongst others.
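The iterative logic of the basic procedure can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of Abramitzky et al. or the one used in this chapter: names are compared as given (no NYSIIS step), the field names and the function `link_records` are hypothetical, and the handling of ambiguous cases (a son with several candidates is dropped, and a 1940 record claimed by more than one son is discarded) is an assumption of the sketch.

```python
from collections import Counter, defaultdict

def link_records(sons_1910, men_1940, census_gap=30, max_age_diff=2):
    """Iteratively link records on name, birthplace, and age.

    Pass 0 requires the expected age exactly; later passes allow the age
    to be off by 1, then 2 years, for sons still without any candidate.
    """
    # Index the later census by (first name, last name, birthplace, age).
    index = defaultdict(list)
    for rec in men_1940:
        index[(rec["first"], rec["last"], rec["bpl"], rec["age"])].append(rec["id"])

    links = {}
    unmatched = list(sons_1910)
    for band in range(max_age_diff + 1):
        still_unmatched = []
        for son in unmatched:
            expected = son["age"] + census_gap
            candidates = []
            # A set avoids checking the same age twice when band == 0.
            for age in {expected - band, expected + band}:
                key = (son["first"], son["last"], son["bpl"], age)
                candidates.extend(index.get(key, []))
            if len(candidates) == 1:
                links[son["id"]] = candidates[0]      # unique match: accept
            elif not candidates:
                still_unmatched.append(son)           # try a wider age band
            # Several candidates: ambiguous, so the son is dropped (assumption).
        unmatched = still_unmatched

    # Discard links where one 1940 record was claimed by several sons (assumption).
    claims = Counter(links.values())
    return {s: m for s, m in links.items() if claims[m] == 1}

# Toy data (invented for illustration).
sons = [
    {"id": "a", "first": "john", "last": "smith", "bpl": "Ohio", "age": 5},
    {"id": "b", "first": "carl", "last": "berg", "bpl": "Iowa", "age": 10},
    {"id": "c", "first": "amos", "last": "lee", "bpl": "Texas", "age": 7},
    {"id": "d", "first": "eli", "last": "wood", "bpl": "Maine", "age": 12},
]
men = [
    {"id": "A", "first": "john", "last": "smith", "bpl": "Ohio", "age": 35},
    {"id": "B", "first": "carl", "last": "berg", "bpl": "Iowa", "age": 41},
    {"id": "C1", "first": "amos", "last": "lee", "bpl": "Texas", "age": 37},
    {"id": "C2", "first": "amos", "last": "lee", "bpl": "Texas", "age": 37},
    {"id": "D", "first": "eli", "last": "wood", "bpl": "Maine", "age": 45},
]
links = link_records(sons, men)
```

In the toy data, son `a` is linked on the exact-age pass and son `b` on the one-year pass, while `c` is ambiguous and `d` falls outside the two-year band. Dropping ambiguous cases rather than picking one candidate trades a lower match rate for fewer false positives.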

8 Collins and Wanamaker (2015) obtain a match rate for blacks that is 7 percentage points lower than that for whites. Among native-born males aged 3-17 in the 1910 1 percent IPUMS sample, and whose fathers are the household heads, I find that blacks are more likely to have popular first and last names. 27.6 percent of whites and 31.5 percent of blacks have the top 10 first names in their respective racial groups. 5.3 percent of whites and 12.5 percent of blacks have the top 10 last names in their corresponding racial groups. Sample weights are used when computing all figures.

9 Feigenbaum (2018) also uses an age range of 3-17 when linking sons from the 1915 Iowa state census to the 1940 federal census. Imposing an upper age limit of 17 accounts for the fact that older sons are more likely to have left home. Sons need to be residing with their fathers at the time of the 1910 census in order for me to obtain information on their fathers’ economic status. Up until age 17, the proportion of native-born white males living in households with their fathers is close to or above 0.8, a share that declines with age (author’s calculation based on the 1910 1 percent IPUMS sample with sample weights).

10 The NYSIIS algorithm standardizes names based on their pronunciation, thus allowing names

I use Abramitzky et al.’s (2012, 2014, 2017) basic iterative procedure with actual instead of NYSIIS-adjusted names, as the latter tends to increase the frequency of false positives (Bailey et al. 2018).11,12 This generates a linked sample of 2,962,656 individuals, with a corresponding match rate of 29.3 percent.13 Further details on the construction and representativeness of the linked sample are provided in Appendix A.2.
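As a rough check on magnitudes, the linked-sample size and the match rate reported above jointly imply the size of the pool of 1910 sons eligible for linking. The short calculation below assumes that the match rate is computed as linked individuals divided by eligible 1910 sons; this definition is an assumption of the sketch, and because the reported rate is rounded to one decimal place, the result is only a range.

```python
# Back-of-the-envelope check using only the two figures reported in the text.
# Assumption: match rate = linked individuals / eligible 1910 sons.
linked = 2_962_656
match_rate = 0.293                 # 29.3 percent, rounded to one decimal place

implied_base = linked / match_rate

# Bounds implied by rounding: the true rate lies between 29.25 and 29.35 percent.
base_low = linked / 0.2935
base_high = linked / 0.2925

print(f"implied eligible sons: about {implied_base / 1e6:.1f} million")
print(f"range from rounding: {base_low / 1e6:.2f} to {base_high / 1e6:.2f} million")
```

Under this assumption, the eligible pool is on the order of 10.1 million sons.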
