To be able to work with the PME data, two working data sets are defined: one cross-sectional data set described in detail in Section 4.3.1 and a longitudinal data set described in this section.
At the time this research began, micro-data for the years 2004 and 2005 were available. The aim is to form a longitudinal data set from these data. Table 3.2 showed that some of the panels in the available data include households observed only from their fifth interview onwards, so no information is available for their first four interviews. The available data were therefore restricted to households whose first interview took place in 2004 or 2005. In this way, the longitudinal data set contains only the households that entered the sample from January 2004 onwards. These households are followed over time until the end of 2005.
Table 3.8 gives a representation of the PME panels after this first restriction. This forms the general working data set from which a cross-sectional and a longitudinal data set can be selected. Recall that, by design, every monthly sample contains households at each interview occasion, from the first to the eighth. Each column therefore represents an occasion, and the entries beneath it are the monthly samples from which the data are selected. For example, from the whole January 2004 sample, only those households being interviewed for the first time are considered, whereas from the February 2004 sample, households being interviewed for their first and second times are considered, and so on. Selecting the households that had their first interview from January 2004 onwards allows the construction of a longitudinal series that starts from their first interview.
According to the representation in Table 3.8, different longitudinal data sets could be formed. There is one, for example, where households have the full set of interviews (eight interviews), and another where households have up to the first four interviews. It is worth reinforcing that households which have their last four interviews in the year 2004 were not considered in the working data set. The working longitudinal data set will be formed by only those panels that by design should have the full set of eight interviews, which are the first nine panels in the table.
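The selection logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the occasion offsets are read off the pattern in Table 3.8 (four consecutive months, then the same four calendar months one year later), the data window is taken to be January 2004 to December 2005, and the function names (`add_months`, `schedule`, `eligible_occasions`, `complete_panels`) are hypothetical helpers, not part of any PME software.

```python
from datetime import date

def add_months(d, n):
    """Return the first day of the month n months after d (n may be negative)."""
    idx = d.year * 12 + (d.month - 1) + n
    return date(idx // 12, idx % 12 + 1, 1)

# Months elapsed since the first interview at each of the eight occasions
# (assumed from the pattern shown in Table 3.8).
OFFSETS = [0, 1, 2, 3, 12, 13, 14, 15]

def schedule(first_month):
    """Interview months for a panel whose first interview is in first_month."""
    return [add_months(first_month, k) for k in OFFSETS]

def eligible_occasions(sample_month, window_start):
    """Occasions (1-8) kept from one monthly sample: those belonging to
    panels whose first interview falls on or after window_start."""
    return [occ for occ, k in enumerate(OFFSETS, start=1)
            if add_months(sample_month, -k) >= window_start]

def complete_panels(window_start, window_end):
    """First-interview months of panels with all eight occasions in the window."""
    starts, m = [], window_start
    while m <= window_end:
        if schedule(m)[-1] <= window_end:
            starts.append(m)
        m = add_months(m, 1)
    return starts

START, END = date(2004, 1, 1), date(2005, 12, 1)
full = complete_panels(START, END)
```

Under these assumptions, `eligible_occasions(date(2004, 1, 1), START)` keeps only the first occasion and the February 2004 sample keeps the first two, as in the text, while `full` holds the nine first-interview months from January to September 2004.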
Table 3.8: Representation of the Working Data-Set
Occasions
Panel 1 2 3 4 5 6 7 8
F4 Jan-04 Feb-04 Mar-04 Apr-04 Jan-05 Feb-05 Mar-05 Apr-05
F5 Feb-04 Mar-04 Apr-04 May-04 Feb-05 Mar-05 Apr-05 May-05
F6 Mar-04 Apr-04 May-04 Jun-04 Mar-05 Apr-05 May-05 Jun-05
F7 Apr-04 May-04 Jun-04 Jul-04 Apr-05 May-05 Jun-05 Jul-05
F8 May-04 Jun-04 Jul-04 Aug-04 May-05 Jun-05 Jul-05 Aug-05
G1 Jun-04 Jul-04 Aug-04 Sep-04 Jun-05 Jul-05 Aug-05 Sep-05
G2 Jul-04 Aug-04 Sep-04 Oct-04 Jul-05 Aug-05 Sep-05 Oct-05
G3 Aug-04 Sep-04 Oct-04 Nov-04 Aug-05 Sep-05 Oct-05 Nov-05
G4 Sep-04 Oct-04 Nov-04 Dec-04 Sep-05 Oct-05 Nov-05 Dec-05
G5 Oct-04 Nov-04 Dec-04 Jan-05 Oct-05 Nov-05 Dec-05 Jan-06
G6 Nov-04 Dec-04 Jan-05 Feb-05 Nov-05 Dec-05 Jan-06 Feb-06
G7 Dec-04 Jan-05 Feb-05 Mar-05 Dec-05 Jan-06 Feb-06 Mar-06
G8 Jan-05 Feb-05 Mar-05 Apr-05 Jan-06 Feb-06 Mar-06 Apr-06
H1 Feb-05 Mar-05 Apr-05 May-05 Feb-06 Mar-06 Apr-06 May-06
H2 Mar-05 Apr-05 May-05 Jun-05 Mar-06 Apr-06 May-06 Jun-06
H3 Apr-05 May-05 Jun-05 Jul-05 Apr-06 May-06 Jun-06 Jul-06
H4 May-05 Jun-05 Jul-05 Aug-05 May-06 Jun-06 Jul-06 Aug-06
H5 Jun-05 Jul-05 Aug-05 Sep-05 Jun-06 Jul-06 Aug-06 Sep-06
H6 Jul-05 Aug-05 Sep-05 Oct-05 Jul-06 Aug-06 Sep-06 Oct-06
H7 Aug-05 Sep-05 Oct-05 Nov-05 Aug-06 Sep-06 Oct-06 Nov-06
H8 Sep-05 Oct-05 Nov-05 Dec-05 Sep-06 Oct-06 Nov-06 Dec-06
I1 Oct-05 Nov-05 Dec-05 Sep-06 Oct-06 Nov-06 Dec-06 Jan-07
I2 Nov-05 Dec-05 Sep-06 Oct-06 Nov-06 Dec-06 Jan-07 Feb-07
I3 Dec-05 Sep-06 Oct-06 Nov-06 Dec-06 Jan-07 Feb-07 Mar-07
As discussed in Section 3.3, the matching of individual records is rather difficult. Although matching across all eight time points is difficult, it is not impossible. In order to fulfil the methodological motivation of analysing a data set containing the full set of interviews from the PME survey, it was decided to consider a longitudinal data set that includes all eight time points for as large a sample as possible, subject to some criteria for validating the matching. One alternative for working with individual data from the PME is to consider only the heads of household. These are the household reference units and therefore the most important members of the household. In addition, these individuals are easier to identify in the data set. Each household contains only one individual identified as the head of the household, so matching rates similar to those found for the households are expected. Even so, further matching criteria based on individual characteristics, such as gender and age, would still be necessary to guarantee a more accurate matching.
With the objective of modelling job earnings, the variable for the usual employment earnings in the main job was chosen as the variable of interest from the other three job earnings variables available in the data. As defined earlier in this chapter, the usual job earnings is the contractual pay received in the reference month and excludes any benefits received. This income component was also used in Jenkins (2000), and for simplicity is adopted here and hereafter referred to as labour income. As also presented earlier in this chapter, IBGE defines the employed population as those who undertook some kind of paid or unpaid work for at least one hour in the reference week. It also includes those who had a job but were temporarily absent from this occupation. Job earnings are defined as the benefit in return for work done and are measured as monthly earnings from the main and second jobs. Therefore, only those classified as employed have data for job earnings. For this reason, the analysis sample is reduced to include only employed heads of household.
The aim is to select a balanced data set of heads of household who were employed at all eight time points, starting from the first interview. There are different strategies for selecting this longitudinal data set and different sets of criteria for validating the matching of individuals. A complete-case balanced data set is adopted for simplicity, even though the methods used to analyse this data set are flexible enough to accommodate an unbalanced data set. To guarantee a balanced data set, the heads of household that do not have, by design, all eight time points were not considered. Those that dropped out of the panel or had intermittent non-response were also not considered. This leaves a total of 12,170 heads of household. Furthermore, only those with valid data for all variables in the analysis (see Tables 4.2 and 5.1) were selected, which leaves 10,183 heads of household.
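As an illustration of the complete-case selection just described, the following sketch filters a toy collection of records. The data layout (a dictionary of per-interview records with `wave`, `employed` and `income` fields) and the function name `balanced_complete_cases` are assumptions made for the example; they do not reflect the actual PME file structure, and the sketch applies the balance, employment and valid-data conditions in a single pass rather than in the separate steps reported above.

```python
def balanced_complete_cases(records, required_vars, n_waves=8):
    """Keep heads of household observed, employed and fully recorded at every wave.

    records: dict mapping a (hypothetical) person id to a list of dicts, one per
    interview, each with 'wave' (1..n_waves), 'employed' and the analysis variables.
    """
    kept = {}
    for pid, rows in records.items():
        by_wave = {r["wave"]: r for r in rows}
        # Balanced: every wave present (excludes drop-outs and
        # intermittent non-response).
        if set(by_wave) != set(range(1, n_waves + 1)):
            continue
        ordered = [by_wave[w] for w in range(1, n_waves + 1)]
        # Employed at all time points, so labour income is defined throughout.
        if not all(r.get("employed") for r in ordered):
            continue
        # Complete case: no missing value on any analysis variable.
        if any(r.get(v) is None for r in ordered for v in required_vars):
            continue
        kept[pid] = ordered
    return kept

# Toy data (invented values, for illustration only).
toy = {
    "a": [{"wave": w, "employed": True, "income": 100.0 + w} for w in range(1, 9)],
    # Drops out before the eighth interview.
    "b": [{"wave": w, "employed": True, "income": 90.0} for w in range(1, 8)],
    # Not employed at the third interview.
    "c": [{"wave": w, "employed": w != 3, "income": 80.0} for w in range(1, 9)],
    # Missing income at the fifth interview.
    "d": [{"wave": w, "employed": True, "income": None if w == 5 else 70.0}
          for w in range(1, 9)],
}
selected = balanced_complete_cases(toy, required_vars=["income"])
```

In this toy example only the head of household labelled `"a"` survives all three conditions.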
The set of criteria to validate the matching across the eight time points was chosen based on the exercises performed earlier in this chapter. To ensure that the same head of household is being followed over time, the data set was further reduced to consider only those heads that, from one interview to the next, had:
- no change in the variable for gender;
- no change in the variable for skin colour;
- a change in the categorical variable for education of up to two (ordered) categories ($|\mathrm{Educ}_r - \mathrm{Educ}_s| \le 2$);
- and a change in the declared age of up to three years ($|\mathrm{Age}_r - \mathrm{Age}_s| \le 3$).
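The four criteria above can be expressed as a simple check over consecutive interviews. The sketch below is a minimal illustration: it assumes that r and s denote consecutive interviews, that education is stored as an ordered integer code, and that the field names (`gender`, `colour`, `educ`, `age`) and the function `is_valid_match` are hypothetical.

```python
def is_valid_match(interviews, max_educ_jump=2, max_age_jump=3):
    """Return True if a sequence of interviews plausibly follows one person.

    interviews: list of dicts ordered by interview number, each holding
    'gender', 'colour', 'educ' (ordered category code) and 'age'.
    """
    for prev, curr in zip(interviews, interviews[1:]):
        if prev["gender"] != curr["gender"]:
            return False  # gender must not change
        if prev["colour"] != curr["colour"]:
            return False  # skin colour must not change
        if abs(prev["educ"] - curr["educ"]) > max_educ_jump:
            return False  # education may move by at most two categories
        if abs(prev["age"] - curr["age"]) > max_age_jump:
            return False  # declared age may move by at most three years
    return True

# Toy sequences (invented values, for illustration only).
same_person = [
    {"gender": "F", "colour": 1, "educ": 3, "age": 40},
    {"gender": "F", "colour": 1, "educ": 4, "age": 41},
    {"gender": "F", "colour": 1, "educ": 4, "age": 41},
]
age_jump = [
    {"gender": "M", "colour": 2, "educ": 2, "age": 30},
    {"gender": "M", "colour": 2, "educ": 2, "age": 36},
]
```

Here `is_valid_match(same_person)` holds, whereas `is_valid_match(age_jump)` fails because the declared age changes by six years between two interviews. Allowing small changes in education and age is a tolerance for reporting error, so the thresholds are kept as parameters.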
Figure 3.1 presents the sample sizes at the different steps of the data selection, including the number of heads of household that did not satisfy the validation criteria. The final sub-set in Figure 3.1, of 6,524 employed heads of household, constitutes the balanced working longitudinal data set to be used in Chapters 5 and 7. It is worth mentioning that this is a rather restricted data set, formed to fulfil the methodological motivations of this thesis. It may therefore not be appropriate for drawing substantive conclusions.
Figure 3.1: Sample Sizes at Wave 1