A Construction of the Data Set - Competition, Wages and Teacher Sorting: Lessons Learned from a Voucher Reform

The data come from population-wide registers collected by Statistics Sweden. The analysis is based on the teacher register (Lärarregistret), which contains all teachers employed in Swedish schools along with a person identifier, information about where the teacher is employed (region, public/private), whether the individual is certified as a teacher, and his/her field of specialization. From 1995 onward, the data also contain unique identifiers for the school in which the teacher is employed.

The teacher register is combined with several additional administrative data sources from Statistics Sweden, from which I derive teachers' demographic characteristics as well as aggregated regional statistics, such as the number of high school students in the local labour market where the teacher is employed.

The sample is restricted to teachers who received their main source of income from teaching according to earnings information from a matched employer-employee database (RAMS), which I link to the teacher register using the person identifier. Wages come from Strukturlönestatistiken, which has information on monthly wages adjusted to full-time. These are measured in November each year, and I assign teachers in the academic year 1991-1992 to their 1991 wage observation from the school generating the highest annual income.
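The assignment rule above (one wage observation per teacher and year, taken from the school generating the highest annual income) can be illustrated with a minimal sketch. The record layout, field names, and numbers below are hypothetical and are not taken from the registers.

```python
from collections import namedtuple

# Hypothetical record layout; the field names are illustrative assumptions.
Job = namedtuple("Job", "person_id year school_id annual_income monthly_wage")


def main_job_per_year(jobs):
    """Return {(person_id, year): Job}, keeping the highest-income job."""
    best = {}
    for job in jobs:
        key = (job.person_id, job.year)
        if key not in best or job.annual_income > best[key].annual_income:
            best[key] = job
    return best


# Made-up example: teacher 1 holds two positions in 1991.
jobs = [
    Job(1, 1991, "A", 180_000, 17_500),
    Job(1, 1991, "B", 40_000, 16_900),
    Job(2, 1991, "C", 210_000, 19_000),
]
main = main_job_per_year(jobs)
```

In this toy example teacher 1 is assigned the wage observed at school "A", since that school accounts for most of the annual income.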

Wages are available for all individuals employed in the public sector and for a sample of individuals in the private sector. The sampling is stratified by firm size and industry, and the register holds weights that can be used to obtain aggregated regional statistics that are nationally representative. However, as described in the main text, part of the empirical strategy relies on within-teacher variation in the competitiveness of the local labour market, which means that only teachers who appear in the sample for two or more years help to identify the coefficient of interest. To deal with this sampling issue, I use an imputed wage measure for private high school teachers who are not in the wage register in a given year.
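As an illustration of how sampling weights yield aggregated regional statistics, the sketch below computes a weighted mean wage by region. It assumes that each weight is the number of population members a sampled individual represents; the tuple layout and the numbers are made up for illustration.

```python
from collections import defaultdict


def weighted_mean_wage_by_region(records):
    """records: iterable of (region, wage, weight) tuples (assumed layout)."""
    num = defaultdict(float)
    den = defaultdict(float)
    for region, wage, weight in records:
        num[region] += weight * wage
        den[region] += weight
    return {region: num[region] / den[region] for region in num}


# Made-up sample: the second teacher represents three population members.
sample = [
    ("north", 20_000.0, 1.0),
    ("north", 24_000.0, 3.0),
    ("south", 22_000.0, 2.0),
]
means = weighted_mean_wage_by_region(sample)
```

Weights of this kind recover regional aggregates but do not add the missing teacher-year observations, which is why the within-teacher analysis relies on imputed wages.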

In practice, I obtain the predicted wages from a traditional Mincerian wage equation that controls for individual characteristics (gender, education and the age-earnings profile), a dummy for whether the teacher worked in a private school, and detailed (4-digit) information about the type of teaching position. In addition, I include the teacher's approximated wage, which I derive from their annual earnings from the same employer adjusted for the number of months worked. The estimated model is:

$$\log(w_{it}) = \beta_1 X_{it} + \beta_2\,\text{private}_{it} + \beta_3 \log(\text{wageapprox}_{it}) + \gamma_t + \varepsilon_{it} \tag{A.4}$$

Here $X_{it}$ is a vector that contains a dummy indicating whether the teacher is female, age, age squared, education level (6 bins) and 4-digit indicators for field of teaching position; $\text{private}_{it}$ is an indicator for whether the teacher works in a private school; $\text{wageapprox}_{it}$ is the approximated monthly income; $\gamma_t$ is a year dummy; and $\varepsilon_{it}$ is the error term, robust to heteroskedasticity. I use this model to derive the predicted monthly wage, $\hat{w}_{it}$, which is used for teachers who are not sampled in a given year and for whom information on true wages is therefore missing.
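The imputation step can be sketched as follows: fit the wage equation by OLS on teachers with observed wages, then apply the coefficients to a teacher without a wage observation. This is a simplified illustration, not the estimation code behind the paper. It omits age squared, the education bins, the field indicators and the year dummies, uses made-up noise-free data so that the known coefficients are recovered, and solves the normal equations with a small hand-written routine.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(X, y):
    """Return OLS coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


# Made-up, noise-free toy data (not from the registers).
# Columns: [const, female, age, private, log_wage_approx]
TRUE_BETA = [5.0, -0.02, 0.003, 0.01, 0.5]
X = []
for i in range(12):
    X.append([1.0, i % 2, 30 + 2 * i, 1 if i % 3 == 0 else 0,
              9.8 + 0.03 * ((7 * i) % 11)])
y = [sum(b * x for b, x in zip(TRUE_BETA, row)) for row in X]

# Step 1: estimate the wage equation on teachers with observed wages.
beta_hat = ols(X, y)

# Step 2: predict the log wage of a hypothetical non-sampled teacher.
new_teacher = [1.0, 1, 41, 1, 10.0]
log_wage_hat = sum(b * x for b, x in zip(beta_hat, new_teacher))
```

With real data the fit would not be exact, and a statistical package would normally be used to obtain heteroskedasticity-robust standard errors.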

Although the annual earnings data contain information about the start and end of the employment spell, they lack information about total hours worked, which means that the approximated wages will not perfectly predict the true wages. To mitigate this issue, I restrict the sample in (A.4) to teachers with an approximated monthly wage falling between the 1st and the 99th percentile of the true wage distribution. Table A.1 compares the actual and predicted wages for sampled high school teachers in the private sector. The predicted wages correspond reasonably closely to the actual data when looking at the log wage distribution.
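The approximation and trimming steps described above could be sketched as below. The function names and all numbers are hypothetical, and the percentile definition (the default of Python's `statistics.quantiles`) is an assumption, since the text does not specify one.

```python
from statistics import quantiles


def approx_monthly_wage(annual_earnings, months_worked):
    """Annual earnings from one employer divided by months in the spell."""
    return annual_earnings / months_worked


def trim_to_true_wage_range(approx_wages, true_wages):
    """Keep approximated wages within the 1st-99th percentile of true wages."""
    cuts = quantiles(true_wages, n=100)  # 99 cut points: p1, ..., p99
    low, high = cuts[0], cuts[-1]
    return [w for w in approx_wages if low <= w <= high]


# Made-up "true" monthly wages from the wage register: 15,000 to 24,950.
true_wages = list(range(15_000, 25_000, 50))

# Made-up earnings spells: (annual earnings, months worked).
approx = [
    approx_monthly_wage(240_000, 12),  # 20,000 per month
    approx_monthly_wage(90_000, 3),    # 30,000 per month, implausibly high
    approx_monthly_wage(60_000, 6),    # 10,000 per month, likely part-time
]
kept = trim_to_true_wage_range(approx, true_wages)
```

The second and third observations fall outside the range of the true wage distribution and are dropped, which is the kind of hours-related mismeasurement the restriction is meant to limit.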

Table A.2 shows the results from variants of equation (A.1). Column (1) reports the adjusted $R^2$ when individual covariates, net of the approximated wage measure, are included in the model; column (2) includes only the approximated wage; and column (3) reports the estimated $\beta_3$ with both individual covariates and the approximated wage. The explanatory power increases from 0.737 (column 1) to 0.803 (column 3) when log(wageapprox) is included in the model. Hence, this regression equation should do a fairly good job of predicting wages. The estimated coefficient on approximated earnings is 0.457 when only year effects are included in the model (column 2), and somewhat smaller, 0.320, when all covariates are added (column 3).

Importantly, to make sure that the results are not sensitive to the imputation procedure, I check the baseline results using the sampling weights instead of the imputed wages. I also report the results for public school teachers separately, where wages are available for all teachers.

Table A.1: Comparison of Actual and Predicted Wages

| | (1) Actual | (2) Predicted |
|---|---|---|
| Mean | 10.042 | 10.012 |
| Sd | 0.150 | 0.148 |
| 10th percentile | 9.878 | 9.821 |
| 50th percentile | 10.043 | 10.032 |
| 90th percentile | 10.204 | 10.177 |
| Observations | 5,077 | 5,077 |

Table A.2: Estimation Results from Equation (A.1)

| | (1) | (2) | (3) |
|---|---|---|---|
| log(wageapprox) | | 0.457*** | 0.320*** |
| | | (0.001) | (0.001) |
| Adjusted $R^2$ | 0.737 | 0.735 | 0.803 |
| Observations | 295,441 | 295,441 | 295,441 |
| Year dummies | yes | yes | yes |
| Individual background characteristics | yes | no | yes |

Notes. *, ** and *** denote statistical significance at the 10, 5 and 1% levels, respectively. The dependent variable is the log monthly wage. All models include year effects. The individual background characteristics in columns (1) and (3) are gender, age, age squared, education level (6 bins) and 4-digit indicators for field of teaching position.
