The Comparison between the Census and the Population Registers
Daniela Casale, Marco Fortini, Roberta Radini, Leonardo Tininini, Luca Valentino
ISTAT, Italy, e-mail: casale,fortini,radini,tininini,[email protected]
Abstract
A fundamental process closely related to the 15th Italian General Population and Housing Census was the contextual comparison between Census results and Population Registers. This comparison is required by Italian law and is mainly used to review the Population Registers, as well as to determine the so-called “legal population”. In this paper we illustrate the main features of the web application that was specifically developed to support the Municipal Census Departments in all activities related to this complex process. Some statistics on the overall usage and effectiveness of the application are also provided. Finally, we describe the underlying techniques used to perform all operations in online response times, more specifically the strategy to select the list of candidates (on which the similarity score has to be computed) and the main characteristics of the linkage module that computes the similarity score.
Keywords: Census Web Application, Data Quality, Record Linkage
1. Introduction
In the context of the 15th Italian General Population and Housing Census, particular importance was attached to the contextual comparison between Census results and Population Registers. This operation is an integral part of the Census survey and is required by Italian law to review the data in the Population Registers.
In particular, its completion is required to determine the so-called “legal population”, whose relevance goes far beyond the statistical aims of the Census.
The comparison was carried out by the Municipal Census Departments (in Italian: Uffici Comunali di Censimento – henceforth UCC) at the end of the quality check of the paper questionnaires: whenever a family/household form was returned to the UCC premises, the municipal operators had to indicate the correct correspondences (as well as possible mismatches) between each component on the form and each individual position on the Municipal Population Registers (Liste Anagrafiche Comunali – henceforth LAC).
In order to support this operation and to provide every UCC with a common standard procedure for the fulfillment of the Census-Registers comparison, ISTAT implemented an online web application, included in the Survey Management System (Sistema di Gestione della Rilevazione – henceforth SGR), which was used by the municipal operators to handle every phase of the Census survey. This web application allowed the UCC to easily identify and indicate the possible correspondences between the individuals enumerated by the Census and the ones present on the LAC. More specifically, three different cases may arise:
• successful match between individuals enumerated by the Census and individuals on the LAC (even if the two archives may disagree on the address);
• an individual is only enumerated by the Census (but has no corresponding counterpart on the LAC);
• an individual is only present on the LAC (but she/he was not enumerated by the Census).
The web application supports the UCC in this activity, by automatically suggesting certain and highly probable correspondences, based on similarity comparisons of household components in the two lists.
The software application also supports the detection of Census duplications, i.e. enumeration of the same person more than once in the same municipality. This may happen for instance when the same Census questionnaire is returned both by means of the paper form and by using the online compilation and can be easily detected by exploiting the questionnaire ID.
More difficult to detect is the case when the same individual is enumerated in two distinct Census forms, e.g. one for the actual address of her/his dwelling and the other one related to a different address where she/he is recorded into the Municipal Population Registers.
In this case, due to response errors and other minor differences in key variables (e.g. caused by typos and misspellings), exact search would produce poor results and consequently string distance functions and probabilistic methods had to be used instead.
At the same time, sophisticated filtering techniques were required to reduce the number of possible candidates to be compared, thus ensuring the required online (fast) response times.
2. The Web Application for Census-Register Comparison
As mentioned above, SGR (the Census Survey Management System) was designed as a workflow management system for the processing of all Census questionnaires, and the Census-Registers comparison can be considered the final activity of this workflow. As for any other process activity, SGR provides the UCC officers with several online functionalities, enabling them to both manage and monitor the progress of all operations related to the Census-Registers comparison, which were carried out on every enumerated individual. The integration of two specific capabilities is the most distinctive feature of this web application:
1. the computer aided validation of the questionnaires, including the interactive comparison between the Census data and the contents of the Population Registers;
2. the contextual and automatic production of the individual lists, aimed at reviewing the Population Registers, on the basis of the comparison results.
At the end of the process, the Municipalities were immediately able to carry out the post-census review of their Population Registers. This operating method produced a number of benefits in terms of timeliness, data quality, transparency, homogeneity and costs.
2.1. Managing and Monitoring the Comparison Operations
The SGR online functions supporting the Census-Registers comparison were an important tool for all Census authorities, enabling them to monitor all related operations at different levels of detail (from the top level of the whole Country down to the lowest level of census tracts). In particular, they gave the operators an overview of many key figures and indicators, e.g. the number of individuals and families/households enumerated in each Census tract, and other parameters useful for the interactive comparison, such as family composition or nationality.
Likewise, the authorities in charge of monitoring the Census survey at Country level (ISTAT, Ministry of the Interior) could use the system to obtain accurate, complete and continuously updated reports. In particular, two specific functions were devoted to this purpose: the ad hoc balances and the summarizing reports. The former enable the user to browse the data on comparison operations at various levels of territorial aggregation, while the latter provide some key aggregate data related to the process of questionnaire review by Municipalities.
2.2. Computer Aided Validation and Comparison
Since web questionnaires and paper questionnaires require quite different management processes, two different panels have been implemented in SGR, one for each of them.
The former panel is used for web-delivered questionnaires. For these, an offline procedure was implemented that automatically links the names in the list filled in by the citizens (the so-called "A-list") to the individual names present in the LAC (and thus expected for the same household).
Figure 1 - (left) percentage of web-delivered questionnaires on total amount of delivered ones, by province; (right) percentage of surveyed citizens within web-delivered questionnaires, which were recognized by the automatic linkage procedure, on total amount of surveyed resident citizens, by province.
The analysis of the output shows that in 90% of the cases the automatic linkage was successful in combining the two lists, thus completing and closing the comparison process without requiring any further intervention by the operators.
In the remaining 10% of cases, i.e. in all cases where the correspondence is not automatically determined for all the names in the household or where the linkage is not considered "safe" by the system, the questionnaires are highlighted with an appropriate flag on the web-panel. In this way, the operator is given the possibility to analyze the detailed data of the individuals, confirming or retracting the correspondences determined by the system. When the system fails to determine any correspondence and this is confirmed by the analysis of the operator, then the enumerated citizen is declared as "non-resident". Conversely, if a name in the LAC is confirmed to have no correspondence with those in the A-list, he or she is declared "untraceable".
Figure 2 – Distribution of the residual 10% not automatically linked individuals on web-delivered questionnaires.
The latter panel is used by operators dealing with paper questionnaires. Given a questionnaire ID, the panel lists the components for the associated household, as expected by the data on the LAC. This allows the operator to perform a rapid visual comparison with the A-list originating from the paper questionnaire filled in by the citizen.
By simple clicks on the web panel the operator can mark the single component as:
• "surveyed" if her/his name is present both in LAC and in the A-List;
• "untraceable" if the name is listed in LAC but not in the A-List.
Finally, if an individual is present in the A-List, but not in the LAC of the household, then a search function (using a similarity-based linkage procedure) supports the operator’s activity, by proposing a list of names, which are extracted from the LAC of other households and which are “very similar to” the personal data of the individual. By examining the proposed list the operator can decide to select one of the names and mark the individual as "enumerated at a different address" or mark her/him as "surveyed and non-resident".
The system keeps track of all movements of individuals, thus preventing duplicate enumerations, i.e. the same citizen being surveyed in more than one questionnaire.
Figure 3 – Number of citizens surveyed daily (resident + non-resident + untraceable) from Nov 30th, 2011 to Jun 6th, 2012, with a peak of about 2,800,000 citizens on Jan 27th, 2012.
Figure 4 – Automatically surveyed individuals are shown in black, whereas individuals surveyed by operators are shown in gray. The manual activities are concentrated in December – February, with a peak (about 1,000,000) on Jan 19th, 2012. It is worth noticing that automatic processing is concentrated in the first part of this timespan. This suggests that increasing the share of web-delivered questionnaires would not only reduce the amount of work performed by operators, but also allow the linkage of surveyed citizens to be completed earlier.
The ad hoc balances are a further fundamental SGR monitoring tool for the Municipal Census Departments. Their data, refreshed daily, offer a view of the progress of the comparison activity, as well as an initial projection based on partial data. When the comparison activity is completed for all questionnaires, with confirmation of all respondents and untraceable individuals, the SGR application allows municipalities to formally declare the closure of census activities. From that moment on, the ad hoc balances represent the final population as enumerated by the Census. Thus the data for the Census-Registers comparison are acquired at the same time as the municipality certifies them. This enabled ISTAT to disseminate the data on legal population in a very timely manner, shortly after the closure of the data collection process.
3. Performing Linkage in Online Response Times
As illustrated above, two fundamental functions of the Census-Registers comparison require a very efficient module performing record linkage. In the following we briefly illustrate the linkage module used by both functions and the technique to select the list of individuals on which the similarity score is computed.
3.1. The Linkage Module
This module is embedded in the web application and performs linkage between the people enumerated in the Census and those present in the LAC. The module uses a probabilistic methodology based on the Fellegi and Sunter (1969) approach, which requires, as input, two thresholds $T_m$ (match threshold) and $T_u$ (un-match threshold), with $T_m > T_u$.
For each pair of records to be analyzed, the common attributes are compared and the single attribute scores are then combined to form a global score $r$. If $r$ is greater than $T_m$, the two records are automatically linked. If $r$ is less than $T_u$, the pair is excluded from further analysis. Finally, if $r$ lies between $T_u$ and $T_m$, the possible link requires an explicit intervention by the operator, who has to confirm or retract it.
In practice, the list of possible links is proposed in a specific page of the web application to complete the linkage process.
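The three-way decision rule described above can be sketched as follows. This is a minimal illustration, not the SGR implementation: the names `classify_pair`, `t_match` and `t_unmatch` are hypothetical, and the treatment of scores that fall exactly on a threshold (sent to the operator here) is an assumption.

```python
from enum import Enum


class Decision(Enum):
    LINK = "link"                    # accepted automatically
    POSSIBLE_LINK = "possible link"  # shown to the operator for confirmation
    NON_LINK = "non-link"            # excluded from further analysis


def classify_pair(r: float, t_match: float, t_unmatch: float) -> Decision:
    """Apply the two-threshold rule to the global score r of a record pair.

    t_match and t_unmatch are hypothetical names for the match and
    un-match thresholds; the match threshold must be the larger one.
    """
    if t_match <= t_unmatch:
        raise ValueError("the match threshold must exceed the un-match threshold")
    if r > t_match:
        return Decision.LINK
    if r < t_unmatch:
        return Decision.NON_LINK
    # Scores between the two thresholds (boundaries included, by assumption)
    # require an explicit decision by the operator.
    return Decision.POSSIBLE_LINK
```

With illustrative thresholds such as `t_match=100.0` and `t_unmatch=1.0`, only the pairs scoring in between would reach the operator's page.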
More specifically, in order to assign the global score r, six attributes of the two records are considered: name, surname, gender, and year, month and day of birth. An approximate (similarity) comparison is applied to assess concordance or discordance for the name and surname attributes, while an exact comparison is applied to the other attributes.
The outcome of the comparison phase is a 6-dimensional vector $\gamma = (\gamma_1, \ldots, \gamma_6)$,
where each component $\gamma_i$ takes value 1 in case of concordance on the i-th attribute, and 0 otherwise.
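The construction of the comparison vector can be sketched as below. The paper does not state which string similarity function was used, so this sketch substitutes `difflib.SequenceMatcher` from the Python standard library with a hypothetical cut-off of 0.85; the `Person` record and its field names are also assumptions made for illustration. A component is set to `None` when one of the two values is missing, a case the paper treats separately.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional

SIMILARITY_THRESHOLD = 0.85  # hypothetical cut-off, not taken from the paper


@dataclass
class Person:
    name: Optional[str]
    surname: Optional[str]
    gender: Optional[str]
    year: Optional[int]
    month: Optional[int]
    day: Optional[int]


def similar(a: str, b: str) -> bool:
    """Approximate string agreement (stand-in for the unspecified similarity function)."""
    ratio = SequenceMatcher(None, a.strip().upper(), b.strip().upper()).ratio()
    return ratio >= SIMILARITY_THRESHOLD


def comparison_vector(x: Person, y: Person) -> List[Optional[int]]:
    """Return the six agreement indicators for a pair of records.

    Order: name, surname, gender, year, month, day of birth.
    1 = concordance, 0 = discordance, None = comparison not possible.
    """
    gamma: List[Optional[int]] = []
    for attr in ("name", "surname"):  # approximate comparison
        a, b = getattr(x, attr), getattr(y, attr)
        gamma.append(None if a is None or b is None else int(similar(a, b)))
    for attr in ("gender", "year", "month", "day"):  # exact comparison
        a, b = getattr(x, attr), getattr(y, attr)
        gamma.append(None if a is None or b is None else int(a == b))
    return gamma
```

For instance, a misspelled first name such as "Giovani" against "Giovanni" still counts as a concordance under this similarity function, whereas a different day of birth is a plain discordance.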
As described in the proposed model, we adopt as global score the frequency ratio, defined as follows: assuming known the set M of the actual links (pairs of records that represent the same person) and the set U of the actual non-links, the frequency ratio $r$ depending on the comparison vector $\gamma$ is defined as:
$r(\gamma) = \dfrac{m(\gamma)}{u(\gamma)}$ (3.1.1)
where $m(\gamma) = P(\gamma \mid M)$ is the probability of vector $\gamma$ for pairs in M and $u(\gamma) = P(\gamma \mid U)$ is the corresponding probability for pairs in U.
Figure 5 – Frequency distribution of pairs by m, u probabilities and likelihood ratio r
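It is clear that high values of the score $r$ correspond to a high probability that the pair is in M and not in U. As proposed in the Fellegi and Sunter method, under the conditional independence assumption for the six attributes, it results
$m(\gamma) = \prod_{i=1}^{6} m_i^{\gamma_i} (1 - m_i)^{1 - \gamma_i}$ (3.1.2) and
$u(\gamma) = \prod_{i=1}^{6} u_i^{\gamma_i} (1 - u_i)^{1 - \gamma_i}$ (3.1.3)
where $m_i = P(\gamma_i = 1 \mid M)$ and $u_i = P(\gamma_i = 1 \mid U)$.
When comparison is not possible because one or both records have no value for an attribute, then the attribute is not considered for the calculation. This is equivalent to the following assignment for the corresponding factors:
$m_i^{\gamma_i} (1 - m_i)^{1 - \gamma_i} = u_i^{\gamma_i} (1 - u_i)^{1 - \gamma_i} = 1$ (3.1.4)
Then, assuming the parameters $m_i$ and $u_i$ are known, the global score $r$ can be calculated only from $\gamma$.
There are some techniques enabling the estimation of the parameters $m_i$ and $u_i$ (known as marginal probabilities) even when M and U are unknown.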
As these techniques are generally time-consuming, the estimation was performed by an offline process using RELAIS (Cibella et al. 2008), an open-source toolkit that implements the EM algorithm to solve the Fellegi and Sunter problem in practice.
Finally, these parameters are passed as additional input to the linkage module, so that the global scores r can be computed very quickly, even for a significant number of record pairs. In this way an adequately low response time could be ensured for the web application.
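Once the marginal probabilities are available, the online computation of the global score reduces to a product of per-attribute ratios, which is why it is fast. The sketch below assumes conditional independence as in the text and skips attributes that cannot be compared (marked `None`), so that they contribute a factor of 1. The function name and the numeric values in `M_PROBS` and `U_PROBS` are purely illustrative assumptions; they are not the parameters estimated with RELAIS.

```python
from typing import Optional, Sequence

# Illustrative marginal probabilities for (name, surname, gender, year, month, day).
# These numbers are invented for the example, not estimated from Census data.
M_PROBS = (0.95, 0.95, 0.99, 0.98, 0.98, 0.98)  # P(agreement | pair in M)
U_PROBS = (0.01, 0.01, 0.50, 0.02, 0.08, 0.03)  # P(agreement | pair in U)


def frequency_ratio(
    gamma: Sequence[Optional[int]],
    m: Sequence[float] = M_PROBS,
    u: Sequence[float] = U_PROBS,
) -> float:
    """Compute r = m(gamma) / u(gamma) under conditional independence.

    gamma[i] is 1 (concordance), 0 (discordance) or None (not comparable).
    """
    if not (len(gamma) == len(m) == len(u)):
        raise ValueError("gamma, m and u must have the same length")
    r = 1.0
    for g, m_i, u_i in zip(gamma, m, u):
        if not 0.0 < m_i < 1.0 or not 0.0 < u_i < 1.0:
            raise ValueError("marginal probabilities must lie strictly between 0 and 1")
        if g is None:
            continue  # attribute not considered: its factor is 1
        if g == 1:
            r *= m_i / u_i
        else:
            r *= (1.0 - m_i) / (1.0 - u_i)
    return r
```

In a sketch like this, full agreement on all six attributes yields a very large ratio, while each discordance multiplies the score by a factor smaller than 1; the result is then compared with the two thresholds.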
3.2. Selecting the List of Candidates for Linkage
As described in Section 2.2, if an individual X is present in the A-List, but not in the LAC of the household, then a search function supports the operator’s activity, by proposing a list of individuals from other households in the LAC who are "very similar to" the personal data of X. Since the number of potential candidates is huge and incompatible with the required online response times, a preliminary filtering technique is used to significantly reduce the list of individuals on which the linkage procedure is applied.
Conventional database techniques (e.g. indexes on name, surname and date of birth) could not be directly used, as they all assume exact matches, which would produce poor results on names (where typos and small misspellings are common). Therefore, the individual records in the database were preprocessed by an offline procedure to compute a "name code" and a "surname code" for each individual record in the LAC.
In the first step of this offline procedure a normalized version of the name (and surname) is computed, e.g. by replacing accented vowels with the corresponding unaccented ones, and consonants with diacritics (e.g. č, š and ž) with the corresponding ones without diacritics (c, s and z). More generally, the outcome of this step is a transformed version of the name consisting only of the 26 letters of the basic (ASCII) Latin alphabet.
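A minimal sketch of this normalization step, using Unicode decomposition from the Python standard library, is given below. The function name is hypothetical, and two choices are assumptions not stated in the paper: the result is upper-cased, and any character that is not a letter A-Z after stripping diacritics (spaces, apostrophes, hyphens, letters with no decomposition such as "ø") is simply dropped.

```python
import unicodedata


def normalize_name(raw: str) -> str:
    """Reduce a name to the 26 basic Latin letters (upper case, by assumption)."""
    # NFKD splits accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", raw)
    # Drop the combining marks (accents, carons, ...).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Keep only A-Z; everything else is discarded (assumption).
    return "".join(ch for ch in stripped.upper() if "A" <= ch <= "Z")
```

A production procedure would probably need an explicit replacement table for letters that Unicode does not decompose, which this sketch does not attempt.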
In the second step, starting from the normalized version of the name computed in the previous step, a two-letter code is computed, consisting of its first two consonants, whenever possible, and possibly including vowels if the normalized name contains fewer than two consonants. These name and surname codes are added to each individual LAC record in specific, separate columns of the database table.
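The code computation can be sketched as follows. The paper does not specify how vowels are chosen when fewer than two consonants are available, so this sketch assumes that the consonants come first and the vowels follow in their original order; it also assumes that the input is already normalized to upper-case A-Z and that "Y" counts as a consonant. The function name is hypothetical.

```python
VOWELS = frozenset("AEIOU")


def two_letter_code(normalized: str) -> str:
    """Build the two-letter code from a normalized (upper-case A-Z) name.

    The first two consonants are used when available; otherwise the code is
    completed with vowels in order of appearance (assumption).
    """
    consonants = [ch for ch in normalized if ch not in VOWELS]
    vowels = [ch for ch in normalized if ch in VOWELS]
    return "".join((consonants + vowels)[:2])
```

Under these assumptions, "ROSSI" and the misspelled "ROSI" both map to "RS", and "GIOVANNI" and "GIOVANI" both map to "GV", which illustrates why such a code tolerates many common typos while remaining usable as an exact-match database key.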
Finally, the offline procedure builds conventional database indexes on each possible combination of 5 columns out of the following 6: name code, surname code, gender, day, month and year of birth. In this way any partial search, setting 5 out of the 6 possible values, is very efficiently supported by a highly selective database index and the corresponding records can be retrieved in a few milliseconds.
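Since there are exactly six ways of choosing 5 columns out of 6, this step amounts to creating six composite indexes. The sketch below generates them with `itertools.combinations` and applies them to an in-memory SQLite database. The paper does not name the DBMS or the schema, so SQLite, the table name `lac` and all column names are assumptions chosen only for illustration.

```python
import sqlite3
from itertools import combinations
from typing import List

# Hypothetical column names for the six search keys.
KEY_COLUMNS = ("name_code", "surname_code", "gender",
               "birth_day", "birth_month", "birth_year")


def index_statements(table: str = "lac") -> List[str]:
    """Return one CREATE INDEX statement per 5-out-of-6 column combination."""
    statements = []
    for n, cols in enumerate(combinations(KEY_COLUMNS, 5), start=1):
        statements.append(
            f"CREATE INDEX idx_{table}_{n} ON {table} ({', '.join(cols)})"
        )
    return statements


def create_lac_table(conn: sqlite3.Connection) -> None:
    """Create an illustrative LAC table and its six composite indexes."""
    conn.execute(
        "CREATE TABLE lac ("
        "person_id INTEGER PRIMARY KEY, name TEXT, surname TEXT, "
        "name_code TEXT, surname_code TEXT, gender TEXT, "
        "birth_day INTEGER, birth_month INTEGER, birth_year INTEGER)"
    )
    for statement in index_statements():
        conn.execute(statement)
```

Each index covers one of the six possible "all but one" searches, so a query that fixes any five of the six values can be answered by an index lookup instead of a table scan.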
This preliminary offline procedure is exploited by the online search procedure as follows.
First of all, given the (A-list) individual to be searched, the codes of her/his name and surname are computed by the online procedure. Similarly to the offline procedure, accents and special characters are removed/replaced by the canonical ones to obtain a normalized version of the name. The normalized names are then used to compute the corresponding codes to be used by the database search function.
In practice the outcome of the previous step is a 6-dimensional vector consisting of (name code, surname code, gender, day of birth, month of birth, year of birth) of the individual to be searched. This vector is used to extract the candidate matches from the database, where a candidate match is a record that matches at least 5 out of the 6 values in the vector, e.g. a record that matches the name code, gender, day, month and year of birth, but possibly not the surname code. This search can be done very efficiently by exploiting the highly selective database indexes built by the offline procedure, and it produces a reduced list of candidate matches (fewer than 100 in the worst case).
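The "at least 5 out of 6" search can be expressed as the union of six exact-match queries, each leaving out one of the six values. The sketch below illustrates this on a tiny in-memory SQLite table with invented rows; the DBMS, the table name `lac`, the column names and the helper names are all assumptions, and the composite indexes that would serve each query in a real deployment are omitted for brevity.

```python
import sqlite3
from itertools import combinations
from typing import Dict, Set, Union

# Hypothetical column names for the six search keys.
KEY_COLUMNS = ("name_code", "surname_code", "gender",
               "birth_day", "birth_month", "birth_year")


def build_demo_db() -> sqlite3.Connection:
    """Create a small in-memory LAC table with made-up rows (illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE lac (person_id INTEGER PRIMARY KEY, name_code TEXT, "
        "surname_code TEXT, gender TEXT, birth_day INTEGER, "
        "birth_month INTEGER, birth_year INTEGER)"
    )
    conn.executemany(
        "INSERT INTO lac VALUES (?, ?, ?, ?, ?, ?, ?)",
        [
            (1, "GV", "RS", "M", 12, 5, 1970),  # agrees on all six values
            (2, "GV", "RS", "M", 21, 5, 1970),  # differs on day only
            (3, "GV", "BN", "M", 12, 5, 1970),  # differs on surname code only
            (4, "MR", "BN", "F", 12, 5, 1970),  # differs on three values
            (5, "GV", "RS", "M", 21, 6, 1970),  # differs on day and month
        ],
    )
    return conn


def find_candidates(
    conn: sqlite3.Connection, key: Dict[str, Union[str, int]]
) -> Set[int]:
    """Return the ids of records agreeing with `key` on at least 5 of the 6 values."""
    ids: Set[int] = set()
    for cols in combinations(KEY_COLUMNS, 5):
        where = " AND ".join(f"{col} = ?" for col in cols)
        params = [key[col] for col in cols]
        rows = conn.execute(f"SELECT person_id FROM lac WHERE {where}", params)
        ids.update(row[0] for row in rows)
    return ids
```

Running six narrow queries rather than one query with a large OR condition keeps each lookup aligned with exactly one composite index, which is the property the offline indexing step is designed to provide.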
Finally, this reduced list of candidate matches is systematically compared with the searched individual by using the linkage module illustrated above. It is worth noting that the score is computed using the name and surname in complete form (the codes are only used to pre-filter the set of candidates, and not by the linkage module).
4. Conclusion
In this paper we have described the main issues and solutions related to the process of comparing Census results (from the 15th Italian General Population and Housing Census) and Population Registers. We have illustrated the main features of the web application, which was specifically developed to support the Municipal Census Departments in all activities related to this complex process. One of the main challenges of performing these activities through a web application was reconciling the conflicting requirements of good accuracy and low response times. This was made possible by a sophisticated mechanism combining a linkage module to compute the similarity score with a very efficient strategy to select the list of candidates to be analyzed by the record linkage module. The overall outcome of the implemented solutions was shown to be clearly positive, in terms of quality of the record linkage process, reduced intervention by operators and timeliness, enabling ISTAT to disseminate the data on legal population shortly after the closure of the data collection process.
References
Cibella, N., Fortini, M., Scannapieco, M., Tosco, L., Tuoto, T. (2008), Theory and practice of developing a record linkage software, Proceedings of the Combination of surveys and administrative data Workshop of the CENEX Statistical Methodology Project Area "Integration of survey and administrative data", Vienna, Austria. See also: http://www.istat.it/en/tools/methods-and-software
Fellegi, I.P., Sunter, A.B. (1969), A Theory for Record Linkage, Journal of the American Statistical Association, 64, pp. 1183-1210.