I-CUE
Improving the Capacity and Usability of EUROMOD
Design Study
implemented as a
Specific Support Action
Deliverable D5.2
Standardisation of EUROMOD database variables
May 2007
Contract number: 011859
Project Co-ordinator: Holly Sutherland
Project website: http://www.iser.essex.ac.uk/msu/emod/i-cue.php
Project funded by the European Community
under the “Structuring the European Research Area” Specific Programme Research Infrastructures action
Horacio Levy1 May 2007
1 EUROMOD Database
The EUROMOD database is a set of variables that the model takes as input from an external micro-database (e.g. a household survey). These variables contain personal and household information that the model needs in order to calculate taxes and benefits for each individual/household in the sample. Among others, the variables include demographic and labour market characteristics, sources of income, assets and expenditure.
Before the I-CUE project, this database was extremely long and quite difficult to manage. Given the model’s own nature, some length and complexity is not only inevitable but desirable. Therefore, the reorganization of the database has not involved a drastic reduction or simplification. Instead it offers an alternative way that makes the content of the database clearer and handier to EUROMOD users.
Basically, there are two types of EUROMOD database variables: common and country specific. Originally, common variables were intended to contain data that would be easily comparable and transferable across countries. Therefore the same variable could be used in all countries. Country specific variables (CSV) were thought of as those that contain information that is specific to a particular country. Thus, they would only be applicable and used in that country.
During the initial construction of EUROMOD, country specificity was interpreted in a somewhat broad sense. For example, social benefits were considered country specific, reflecting the fact that their rules and amounts differ across countries. Common variables were then restricted to some basic information such as demographic characteristics (e.g., age, gender, marital status, etc), labour market characteristics, and some sources of market income (e.g., earnings and capital income). Most data were considered country specific and included in country specific variables. About 85% of the database variables were country specific, although this proportion varied considerably across countries.
A high level of country specificity is problematic for a model like EUROMOD. First, country specificity makes it harder to compare and analyse data across countries, as the user needs first to identify and classify the content of each of the variables before performing the comparison. Second, EUROMOD requires a large amount of information from a substantial (and increasing) number of countries. If a significant part of the information is considered country specific, there is a risk that the number of variables in the EUROMOD database becomes excessively high. Finally, country specific variables were designed and documented independently by each EUROMOD national team and in most cases did not follow any commonly agreed rule or convention. Thus, they were more difficult to interpret and use than common variables.
1 I would like to thank Emilia del Bono, Killian Mullan, Thomas Siedler and Francesca Zantomio for their help interpreting some of the EUROMOD variables and Francesco Figari, Christine Lietz, Daniela Mantovani, Alari Paulus and Holly Sutherland for comments on previous versions. The usual disclaimer applies.
An exhaustive (variable-by-variable) study of the country specific variables has shown that only a fraction of them were in fact neither relevant nor applicable in other countries. In many cases these variables were available in other countries but with different names. In other cases they provided equal or similar information but were designed differently (e.g. in terms of categories used or the measurement period). Finally, many variables shared the same type of information but at different levels of particularity or disaggregation.
It must be recognised that country specificity could neither have been anticipated nor prevented during the construction of EUROMOD. On the one hand, as EUROMOD was the first multi-country tax-benefit model to be built, it could not rely on previous experience and lessons. Therefore, problems like this were to some extent expected and unavoidable. On the other hand, EUROMOD was built mainly through the collaboration of national experts. These experts had their own national models, tax benefit systems and data as background. Thus, it is understandable that the design and scope of the national models has somehow influenced the development of EUROMOD. Now that lessons have been learned, these issues can be revised and changed. This is the objective of the work described in the next section.
2 A new approach
Comparing variables across countries shows how complex the EUROMOD database is. National data not only diverge in terms of country specific information, but also in the level of detail for which information is available. Similar information may also be regarded or named in different ways, reflecting how it is usually considered in a particular country. The official names of some variables are difficult to interpret for those not familiar with the country’s tax-benefit system (e.g., it may not be evident to someone not familiar with the UK system that “attendance allowance” is a benefit for disabled people who require day or night-time care).
Given these characteristics some relevant conclusions about the structure of the EUROMOD database and its evolution can be drawn:
1. variable names cannot be expected to be exhaustive. New databases are likely to bring new variables that will require new names;
2. variable names cannot be expected to be exclusive. Given different levels of detail and aggregation, similar information may be divided or presented in different ways, therefore requiring different names;
3. new data will be included in the future, hence it is desirable that the structure is flexible and adaptable to include within this framework different and not anticipated variables;
4. variables should be named in a way that provides clear and straightforward information to a wide range of EUROMOD users;
5. a brief additional description besides the official name of the variable should be available to complement the information about the variable content;
6. since similar information may be in different variables, it is desirable that variables with similar content can be easily identified and/or grouped together. This facilitates comparability and transferability across countries.
Following these conclusions, it was decided not to provide a full list of new variable names but a flexible and adaptable naming convention. This naming convention consists of a list of acronyms that, put together in a predetermined order, build a variable name. The acronyms were designed to cover most existing variables (some variables are still unidentified or are proposed to be dropped from the database) and are able to construct a wide range of new variable names that are not yet in use. Of course, the list of acronyms may be extended in order to accommodate variables to be added in the future.
There are three classes of acronyms, ordered hierarchically:
Class 1: as before, the names of variables begin with two lowercase characters that identify the variable as common (co). Country specific acronyms (e.g., at, be, uk) could be used but must be avoided as much as possible.2
Class 2: one uppercase character identifies the subject that the variable refers to: Asset, Labour market, Demographic, Register, System, primarY income, eXpenditure, Benefit, Pension, Taxes and Contributions.
Class 3: two uppercase characters identify specific information for the subject determined by the Class 2 acronym.
For example, under this new system, the variable “employment income” is named “coYEM” (co: common, Y: primary income, EM: employment). The old name was coEMPY, which was simply an arbitrary abbreviation.
Class 3 acronyms are designed according to the characteristics of the subject (Class 2 acronym) they belong to. Each of these acronyms has a unique meaning within each subject. For consistency, the same item is also referred to by the same acronym across subjects (e.g., agriculture is AG in all subjects where it is used). However, the same acronym can mean different things across subjects (e.g., AG stands for age for demographics, but agriculture for assets, taxes, primary income and benefits). In order to make the most of the existing ones, acronyms should be used and interpreted in a wide but unambiguous way (e.g., the acronym for health, HL, should also be used to suggest illness, sickness or injury).
There is no limit on the number of acronyms that can be used to create a variable name. Nevertheless, in the interest of clarity and brevity, it is recommended that not more than 5 Class 3 acronyms are used (using more would result in a name more than 13 characters long: 2 for Class 1, 1 for Class 2 and 2 for each Class 3 acronym). Most names use one to three Class 3 acronyms.
Class 3 acronyms are arranged in subgroups that are ordered in levels. This order is important to ensure that acronyms are sorted in a unique way. This prevents the same variable ending up having different names because the acronyms are used in different orders. Following the order of acronyms in Chart 1, a housing benefit complement for pensioners must be named coBHOPECM instead of coBHOCMPE (where B: benefit, HO: housing, PE: pensioner, CM: complement).
2 Currently, no variable using the new approach is named as a country specific variable. Of course, there are some common variables that are only used in one country, but they could (if needed) be used in other countries.
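Chart 1. Extract of Class 3 acronyms available for Benefits and Pensions.
B BENEFITS, P PENSIONS
Subgroup                          Level    Acronym  Description
Benefit Type                      1.000    CH       Child
Benefit Type                      1.000    DI       Disability-Invalidity
Benefit Type                      1.000    ED       Education
Benefit Type                      1.000    FA       Family
Benefit Type                      1.000    HL       Health
Benefit Type                      1.000    HO       Housing
Benefit Type                      1.000    MA       Maternity
Benefit Type                      1.000    OA       Old Age
Benefit Type                      1.000    SA       Social Assistance
Benefit Type                      1.000    SU       Survivors
Benefit Type                      1.000    UN       Unemployment
Personal/Family Characteristics   6.000    AD       Additional child
Personal/Family Characteristics   6.000    CP       Couple
Personal/Family Characteristics   6.000    FH       Father
Personal/Family Characteristics   6.000    LG       Large Family
Personal/Family Characteristics   6.000    MH       Mother
Personal/Family Characteristics   6.000    ND       Non-dependent
Personal/Family Characteristics   6.000    OW       Old worker
Personal/Family Characteristics   6.000    PE       Pensioner
Payment                           14.000   CM       Complement
Payment                           14.000   LS       Lump-Sum
Payment                           14.000   MM       Minimum
Payment                           14.000   TP       Tax Pay Back
Payment                           14.000   TU       Top Up
Payment                           14.000   XP       Extra pay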
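The ordering rule illustrated with Chart 1 can be sketched in a few lines of Python. This is only an illustration and not part of EUROMOD: the function name build_name and the dictionary LEVELS are hypothetical, and the levels cover only the extract shown in Chart 1. Acronyms from the same level keep the order in which they are supplied, which is an assumption, since the text only advises against combining them.

```python
# Subgroup levels copied from the extract in Chart 1 (Benefits and Pensions only).
LEVELS = {
    "CH": 1, "DI": 1, "ED": 1, "FA": 1, "HL": 1, "HO": 1,
    "MA": 1, "OA": 1, "SA": 1, "SU": 1, "UN": 1,
    "AD": 6, "CP": 6, "FH": 6, "LG": 6, "MH": 6, "ND": 6, "OW": 6, "PE": 6,
    "CM": 14, "LS": 14, "MM": 14, "TP": 14, "TU": 14, "XP": 14,
}


def build_name(class1, class2, class3):
    """Join the acronyms, placing Class 3 acronyms in ascending level order."""
    unknown = [a for a in class3 if a not in LEVELS]
    if unknown:
        raise KeyError(f"acronyms not in the Chart 1 extract: {unknown}")
    # sorted() is stable, so acronyms sharing a level keep their input order.
    ordered = sorted(class3, key=LEVELS.__getitem__)
    return class1 + class2 + "".join(ordered)


# Housing benefit complement for pensioners, acronyms given in arbitrary order.
print(build_name("co", "B", ["CM", "HO", "PE"]))  # coBHOPECM
```

Whatever order the acronyms are supplied in, the same name results, which is the property the level ordering is meant to guarantee.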
The number of acronym levels varies across subjects. Previous ranking levels are not compulsory when defining a variable name. A variable can, for example, use acronyms from level 1 and 5, without including acronyms from levels 2, 3 and 4. The key advantage of this approach is its flexibility. A wide range of variable names can be created out of these acronyms. Moreover, it is able to adjust to the fact that the level of detail and specificity about similar information varies considerably between countries. Hence, variable names can be constructed either for very basic aggregated information or for a quite particular variable.
The rule of ordering acronyms according to a determined order ensures that similar variables are placed next to each other when names are sorted alphabetically. This makes it easier to identify and group together similar information.
The availability of a substantial list of acronyms makes it likely that variables containing equivalent information have the same or, at least, very similar names. This can be helpful for users who want to compare or swap policies or systems across countries and need to know if common or similar variables are available in other countries.
The use of acronyms allows for the implementation of an automatic variable label which puts into words the meaning of the acronyms that form the name of the variable. For example, the UK’s attendance allowance (that was named ukBENATT and described as attendance allowance) now should be named coBDICA and its automatic label would read Benefit : Disability : Receiving Care. Although the language is not elegant, this label can be useful as a complement to the variable description.
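A minimal sketch of how such an automatic label could be derived is given below. It is not the EUROMOD implementation: the function auto_label and the dictionary DESCRIPTIONS are hypothetical, and the descriptions are limited to a few acronyms whose wording appears in the labels quoted in this document. Descriptions are stored per subject because the same acronym can mean different things in different subjects.

```python
# Class 2 letter -> (subject label, {Class 3 acronym -> description}).
# Only a handful of entries, taken from labels quoted in this document.
DESCRIPTIONS = {
    "B": ("Benefit", {"DI": "Disability", "CA": "Receiving Care"}),
    "P": ("Public Pension", {"OA": "Old Age", "CT": "Contributory",
                             "CM": "Complement"}),
}


def auto_label(name):
    """Spell out a variable name as 'Subject : Description : ...'."""
    subject, rest = name[2], name[3:]  # name[:2] is the Class 1 acronym
    if not rest or len(rest) % 2:
        raise ValueError(f"{name!r} lacks complete two-character Class 3 acronyms")
    subject_label, class3 = DESCRIPTIONS[subject]
    parts = [subject_label]
    for i in range(0, len(rest), 2):
        parts.append(class3[rest[i:i + 2]])
    return " : ".join(parts)


print(auto_label("coBDICA"))    # Benefit : Disability : Receiving Care
print(auto_label("coPOACTCM"))  # Public Pension : Old Age : Contributory : Complement
```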
Of course, this approach also has some drawbacks. First, it requires a careful look at the list of existing acronyms and understanding of the rules before naming a variable. This can be quite time consuming. However, one must remember that building databases and adding new variables are not frequent tasks and therefore the extra time this demands is not expected to be a big burden. A second problem is that some variable names are less intuitive than before (e.g., previously the variable containing individual age was named coAGE, which is more intuitive than the new coDAG). It is expected that the automatic label will alleviate this problem.
2.1 Naming Rules
1) All variable names must always begin with a class 1 acronym followed by a class 2 and at least one class 3 acronym;
2) Class 3 acronyms should be placed in ascending order in accordance with the subgroup levels. Some variables may use more than one acronym from the same level but this should be avoided or done with care;
3) Previously ranked levels are not compulsory when using Class 3 acronyms. A variable name can be created, for example, using acronyms from level 1 and 5, without acronyms from levels 2, 3 and 4;
4) Each additional acronym included in a variable name must be relevant or informative to distinguish information from another variable;
5) For consistency, the same acronym refers to the same description across variable types (e.g., Agriculture is AG in all variable types where it is used);
6) Among monetary variables each additional acronym represents an additional level of detail or specificity and it is assumed to be a component (part) of a commonly rooted and less detailed variable. For example, coBUNCTCM is a component of coBUNCT which is a component of coBUN;
7) Among monetary variables main amounts may be differentiated from complements by using the acronym 00: basic/main. For example, coPOA00 is the main (“basic”) old-age pension and coPOACM is its complement. Following rule 6, both are components of variable coPOA;
8) When creating new variable names, the user should first check whether an appropriate name already exists and if the data that are available can be adapted in accordance with this variable definition. This is particularly important with regard to categories and time periods. When possible one should use the categorical variables and the time periods already available. For example, if there is already a variable that measures the time worked in months, one should not create another that measures the same in weeks or years.
9) New acronyms should only be added after a careful analysis has proved that none of the existing ones would be applicable to name a new variable. The new acronym must be different from any existing acronym within the acronym list where it will be added. This inclusion should be agreed with the modelling team.
10) Each acronym must have a clear and (whenever possible) short label description.
11) Whenever possible the acronym should have some relation to its description (e.g., EM for employment).
12) If it is not possible to use/adapt the categories already existing, a new categorical acronym must be created. Categorical acronyms must have all possible categories defined and documented in their description. The description must be preceded with “(c)” to highlight that the acronym is categorical.
13) Variables measuring time-periods must consistently explain the period they refer to. For example, months worked may refer to the last year, current year, or the whole work history.
14) Pensions and benefits are distinguished on the grounds that all social transfers for old-age, disability and survivors are classified as "pension", all other transfers are "benefits", independently of their official names.
15) In order to make clear the difference between "who pays/receives" and "what is paid/received", there are some similar acronyms for Taxes and Contributions. For example, PE is for pensioner (e.g., who gets a tax allowance) and PI is for pension insurance (e.g., paid by employees as social insurance contribution for pension). There is no acronym for, say, "pension". It is assumed that if there is a tax on pension income, this is paid by a "pensioner". The same rationale applies to other income sources (including earnings, assumed to be the income of an employee).
16) In order to keep consistency between names and the instruments where they are simulated, refundable tax credits are part of the income tax even if they are viewed as a benefit. So, for example, the UK Child Tax Credit is cotintrch (Income tax : Refund Tax Credit : Child)
17) In order to avoid using a temporary variable that is already in use in the model, there is a distinction between "intermediate" and "temporary" variables. The first are used for policy parameterisation and the second are left empty so that users can use them freely in their simulations.
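Rules 1 and 6 lend themselves to a mechanical check. The sketch below is illustrative only: the names NAME_PATTERN, is_well_formed and parents are hypothetical, and the pattern tests just the shape of a name (two lowercase characters, one uppercase character, then two-character acronyms, allowing the 00 acronym of rule 7). It does not check that the acronyms exist in the acronym list or that they are in level order.

```python
import re

# Class 1: two lowercase letters; Class 2: one uppercase letter;
# Class 3: one or more two-character acronyms (uppercase pairs or "00").
NAME_PATTERN = re.compile(r"[a-z]{2}[A-Z](?:[A-Z]{2}|00)+")


def is_well_formed(name):
    """Rule 1: Class 1, then Class 2, then at least one Class 3 acronym."""
    return NAME_PATTERN.fullmatch(name) is not None


def parents(name):
    """Rule 6: less detailed variables that `name` is a component of."""
    if not is_well_formed(name):
        raise ValueError(f"{name!r} does not follow the naming convention")
    # Drop one trailing Class 3 acronym at a time, keeping at least one.
    return [name[:end] for end in range(len(name) - 2, 3, -2)]


print(is_well_formed("coBUNCTCM"))  # True
print(is_well_formed("ukBENATT"))   # False: odd number of characters after Class 2
print(parents("coBUNCTCM"))         # ['coBUNCT', 'coBUN']
print(parents("coPOA00"))           # ['coPOA']
```

A shape check of this kind cannot tell an old name from a new one when the old name happens to fit the pattern, so it complements rather than replaces a lookup in the acronym list.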
2.2 Effects of new approach
New names have been proposed for each EUROMOD database variable.3 Table 1 shows that the new approach reduces the number of variables by about 28%. The main reductions occur among pensions, benefits, labour market, expenditure and primary income variables.
3 The complete list of new variable names is available at http://www.iser.essex.ac.uk/msu/emod/i-cue/.
Some variables are proposed to be dropped as they are not used by the model or repeat information already available in other variables. In just a few cases they contain very detailed information that is proposed to be eliminated.
The analysis was inconclusive about a significant number of variables (117). In most cases these are variables that could not be accurately identified and this will require assistance from the national teams.4
Under the new approach, 80 new variables are used at least twice to replace old variables. New unemployment benefit and social assistance benefit variables replace 8 old variables, each.
Table 1 Number of old and new variables
                      Number of Variables
                      Old    New    Reduction   % Reduction
Assets                 24     18        6           25%
Benefit               138     88       50           36%
Demographic            26     23        3           12%
Labour Market          37     27       10           27%
Public Pension        120     71       49           41%
Register                3      3        0            0%
System                  7      7        0            0%
Tax                    41     39        2            5%
Expenditure            91     72       19           21%
Primary Income         84     65       19           23%
Sub-Total             571    413      158           28%
To be dropped          32
Not known             117
Total                 720

Max No. Repeated        8
Used more than once    80
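The arithmetic behind Table 1 can be reproduced with a short script. The figures are copied from the table; the variable names (TABLE1, old_total and so on) are chosen here for illustration only.

```python
# (old, new) variable counts per subject, copied from Table 1.
TABLE1 = {
    "Assets": (24, 18),
    "Benefit": (138, 88),
    "Demographic": (26, 23),
    "Labour Market": (37, 27),
    "Public Pension": (120, 71),
    "Register": (3, 3),
    "System": (7, 7),
    "Tax": (41, 39),
    "Expenditure": (91, 72),
    "Primary Income": (84, 65),
}
TO_BE_DROPPED = 32
NOT_KNOWN = 117

for subject, (old, new) in TABLE1.items():
    reduction = old - new
    print(f"{subject:15s} {old:4d} {new:4d} {reduction:4d} "
          f"{round(100 * reduction / old):3d}%")

old_total = sum(old for old, _ in TABLE1.values())
new_total = sum(new for _, new in TABLE1.values())
share = round(100 * (old_total - new_total) / old_total)

print("Sub-Total", old_total, new_total, old_total - new_total, f"{share}%")
print("Total", old_total + TO_BE_DROPPED + NOT_KNOWN)
```

The sub-totals (571 old, 413 new), the overall reduction of about 28% and the grand total of 720 old variables all follow from the per-subject rows.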
Chart 2 shows the effect of this new approach on how variables are named and labelled. This highlights how it significantly simplifies and organises the database. It also shows how the new approach makes it easier to identify variables that contain identical or very similar information.
4 Since national teams from the EU-15 member states are not involved in I-CUE, this is planned to be carried out after the conclusion of this project and as part of the updating of the EU-15 tax-benefit systems.
Chart 2 Effects of the new approach on naming and labelling
New name | New Label | Old name | Description
coPOA | Public Pension : Old Age | atSIBPEN | at: old age pension ("alterspension", pv)
coPOA | Public Pension : Old Age | beRETPEN | be: retirement pension (pension de retraite)
coPOA | Public Pension : Old Age | GRBEN_OA | gr: old age pension
coPOA | Public Pension : Old Age | irRET | ir: retirement pension
coPOA00 | Public Pension : Old Age : Basic : | geBEN006 | ge: own old age pension
coPOA00 | Public Pension : Old Age : Basic : | nlBENAOW | dutch basic old age pension (aow)
coPOAAGCT | Public Pension : Old Age : Agriculture: Contributory | PTBEN02B | pt: old-age agric.insurance (ressa)
coPOACS | Public Pension : Old Age : Civil Servant : | geBEN008 | ge: civil servants' own pension
coPOACS | Public Pension : Old Age : Civil Servant : | itPEN5 | it: excluding supp. pension: ipat (institute of treasury-managed insurance): old age, retirement pension
coPOACT | Public Pension : Old Age : Contributory : | PTBEN02 | pt: old-age insurance
coPOACT00 | Public Pension : Old Age : Contributory : Basic | SpBE002a | sp: Old-age (insurance and early retirement)
coPOACT00 | Public Pension : Old Age : Contributory : Basic | ukBENpen | uk: retirement pension
coPOACTCM | Public Pension : Old Age : Contributory : Complement | ukbenser | uk: state earnings related pension (serps)
coPOACTCM | Public Pension : Old Age : Contributory : Complement | SpBE002b | sp: old-age (minimum pension)
coPOAFR | Public Pension : Old Age : Farmer : | geBEN010 | ge: farmers' own pension
coPOAMN | Public Pension : Old Age : Miners : | geBEN007 | ge: miners' own pension
coPOAMT | Public Pension : Old Age : Means-Tested : | PTBEN03 | pt: old-age social pension
coPOANC00 | Public Pension : Old Age : Non-Contributory : Basic | SpBE002c | sp: old-age (non-contributory – new system)
coPOANCEX | Public Pension : Old Age : Means-Tested : Extinguished : | SpBE002d | sp: old-age (assistance – old system)
coPOASS | Public Pension : Old Age : Social Security : | itPEN1 | it: excluding supp. pension: inps (national institute of social insurance): old age, retirement pension
coPOAWR | Public Pension : Old Age : War : | geBEN009 | ge: war victims' own pension
3 Implementation
The names of the EUROMOD database variables are used in a variety of places including parameter files and the model code. Therefore, replacing the old with the new variable names would involve substantial detailed work from the modelling and national teams. This would require resources that are not available in the I-CUE project. On the other hand, it is important to have this new approach operative so that it can be used for the construction of the prototype models of the New Member States.
In order to deal with this, a mechanism for an intermediate implementation has been devised. This mechanism allows for two different names to be used for the same variable. Thus, old and new variable names can, for the time being, co-exist in EUROMOD as the model is able to recognise both. The names that denote the same variable are stored in EUROMOD’s variable description file. Figure 1 shows that new (standard) names are included only where needed. Otherwise, the new name is left empty. In this case, the model assumes that the new name is the same as the old one. The modeller must indicate whether old or new variable names should be used for each country and database. This is done via a parameter in EUROMOD’s control file (use_stdvarnames). If it is set to 1, as in Figure 2, the model uses the new variable names. If it is set to zero, the model uses the old ones. Therefore, while old names are still used in the EU-15 countries, new variable names can be used in the NMS. This mechanism is intended to be temporary and will be removed once the new variable names are implemented in all EUROMOD countries.
Figure 1. EUROMOD’s variable description file (VarDesc)
Figure 2. EUROMOD’s control file: use old or new variable names parameter
systems                 PL_2005
inclist                 PL_2005
taxunit                 PL_2005
spine                   PL_2005
baseline                yes
country                 PL
datayear_col            -
datayear_inc            -
database                pl_2005
database_table1         pl_test
use_stdvarnames         1
currency_db             national
currency_output         national
currency_param          national
exch_rate_euro_to_nat   4.0388
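The dual-name mechanism can be sketched as follows. This is a simplified illustration under stated assumptions, not the model's code: the record layout of VAR_DESC, the field names "old" and "std", the function active_name and the placeholder variable coEXAMPLE are all hypothetical. Only the parameter name use_stdvarnames and the two old/new name pairs come from this document.

```python
# Hypothetical in-memory view of the variable description file: each record
# holds the old name and, where one is needed, the new (standard) name.
VAR_DESC = [
    {"old": "ukBENATT", "std": "coBDICA"},  # attendance allowance
    {"old": "coEMPY", "std": "coYEM"},      # employment income
    {"old": "coEXAMPLE", "std": ""},        # placeholder: new name left empty
]


def active_name(record, use_stdvarnames):
    """Return the name the model should recognise for this variable."""
    if use_stdvarnames == 1 and record["std"]:
        return record["std"]
    # Old names are requested, or the new name is empty and therefore
    # assumed to be the same as the old one.
    return record["old"]


for flag in (0, 1):
    names = [active_name(record, flag) for record in VAR_DESC]
    print(f"use_stdvarnames = {flag}: {names}")
```

With the flag set to 1, as in Figure 2, the first two variables resolve to their new names while the third keeps its only name.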
4 Documentation
EUROMOD’s database variables are documented in two different files: the Data Requirement Document (DRD) and the variable description parameter file (VarDesc). The structure of both documents has been adapted in order to take into account the standardisation of the database.
The DRD is a template Excel workbook that is filled in by national teams. Figure 3 illustrates its cover sheet. Different groups of variables (personal, labour market, income, etc) are documented in separate worksheets (the structure of the worksheets is the same). Figure 4 shows the worksheet for personal information variables. The worksheet is divided into three pages. The first includes a variable description written by the national team. This is complemented with the automatic label explained in section 2. Whenever applicable, variable categories are documented. The derivation of data from the original data source, any imputations or adjustments and other remarks are pointed out in a space reserved for notes. The second page provides further information that is used in EUROMOD’s data processing (e.g., whether the variable contains a monetary amount or is categorical). Finally, the third page contains some descriptive statistics that are useful for validation.
The new DRD also includes a worksheet with a guide explaining the aim and content of the DRD, and how it should be filled in (see Figure 5). Experience has shown that corrections, improvements and inclusions are frequently performed on the database after its release. These changes need to be documented so users are aware of them. For that reason a change log worksheet has been added (see Figure 6). Information about how the EUROMOD database was derived from the original data source is documented in the worksheet “sample” (see Figure 7). Finally, the full lists of EUROMOD acronyms and variables are also included in two additional sheets (see Figure 8 and Figure 9, respectively).
Figure 5. DRD: Guide
Figure 7. DRD: Sample sheet
Figure 9. DRD: List of EUROMOD variables
VarDesc is a EUROMOD parameter file with information about all variables used by the model. As already seen in the previous section and illustrated in Figure 1, this file includes the old and new variable names, information on whether the variable is model-generated, monetary or categorical, and the default value or alternative variable to use in case it is missing. It also contains the automatic label and the variable descriptions (filled in by the national teams) separated for each country. This is useful not only in order to know in which countries a variable is used but also to have a better understanding of its content across countries.
Figure 10. VarDesc: Automatic label and per country description
5 Tools
New tools were set up in order to facilitate the handling of acronyms. These new tools are included in a EUROMOD variable administration file (AdminVar). Figure 11 shows a first set of tools to add, delete or change acronyms and their definitions. A macro within this Excel file reads the information entered in the sheet by the user and then automatically makes the necessary changes in the appropriate EUROMOD parameter files. Figure 12 shows a second set of tools, also based on an Excel macro, that allow the user as well as model developers to check whether and where acronyms are used, check whether variable names are correctly using the acronyms, and compare the acronym lists in two different versions of EUROMOD.
Figure 11. AdminVar: Administrate Acronyms