Project N°: 262608
Acronym: Data without Boundaries
STAND-ALONE DOCUMENT
Analysis of Researchers' Needs re. Secure Access to Official Microdata
WORK PACKAGE 4
Improving Access to Official Statistics Microdata
DATE OF ISSUE OF DELIVERABLE: April 2015
DOCUMENT PREPARED BY: Partner 1 CNRS-RQ
Combination of CP & CSA project funded by the European Community under the programme “FP7 - SP4 Capacities”
PRELIMINARY STATEMENT
This report is meant to complement and update the findings of a feasibility study on the organisational architecture for managing pan-European access carried out as part of the DwB deliverable D4.2.
TABLE OF CONTENTS
PRELIMINARY STATEMENT
INTRODUCTION
PART 1 - SURVEY METHODOLOGY: A DOUBLE QUANTITATIVE AND QUALITATIVE APPROACH TO CAPTURE A SHARPER VISION OF NEEDS
1. Preliminary Approach: A Business Case Analysis
2. The Online Survey: A Quantitative Approach
3. Individual and Collective Interviews: The Paris & The Hague Workshops and The Supplementary Questionnaire
PART 2 - SOME CHARACTERISTICS OF ACCESS CENTERS, RESEARCHERS, TEAMS AND RESEARCH PROJECTS
1. The Surveyed Research Data Centers
2. Researchers’ Profile
PART 3 - RESULTS
1. "Soft Needs": What Can Be Easily Done to Improve Current Way of Work
2. “Strong Needs” and Researchers Requirements Regarding a Eu-RAN
Reduce Waiting Time throughout the Process
Access Points: No Travel and One System
Anticipate a Potential Mobility between Institutions or Countries
Research Environment
Improve Documentation and Support
Output Checking: An Essential Issue
Merging Data: A Request For New Research Opportunities
IN CONCLUSION: SOME GENERAL POINTS
ANNEXES
ANNEX 1: QUESTIONNAIRE
ANNEX 2: WHO ARE THE RESPONDENTS?
INTRODUCTION
The drive to meet researchers' expectations while protecting confidential data was crucial in extending secure access to data in a number of countries, initially on site, then through remote execution and remote access. In the same way, it is crucial that researchers' expectations be taken into account in guiding the implementation of a secure European data network that makes it possible to carry out complex research projects involving transnational access to confidential data stored at different sites across borders. In order to demonstrate the necessity of a network of secure data centers and its feasibility, we first have to identify accurately to what extent the current development of transnational access leaves research needs unfulfilled. At present, non-resident researchers either travel to the relevant country and access the requested data in situ (United Kingdom) or are allowed to use remote access and process foreign data from their regular workplace (France). Such a network also has to be instrumental in providing researchers with the relevant facilities. This involves a full exploration of researchers' working methods and organization, without discounting the legal constraints and security requirements of the centers where data are stored.
In order to collect this type of information, we conducted an online survey, together with individual and group interviews, on a sample of researchers with some experience of secure remote facilities, that is, researchers who had processed confidential data, mostly in Research Data Centres that are DwB partners. Between the launch of the online survey at the beginning of the project and the group interviews at the end of the project, thinking on a possible architecture for a European Remote Access Network matured, making it possible to present researchers with a more elaborate project for discussion and to help them assess more precisely the need for such a network. The various phases of a research project involving secure access to confidential data had been defined beforehand. The online survey was designed to capture, for each phase, the difficulties and needs encountered by researchers on the basis of their actual experience, which was mainly restricted to a single facility. Group interviews allowed a more prospective approach: researchers were asked to project themselves into a different context in which they could access different facilities from a single point of access. It is worth noting that researchers, when discussing certain obstacles pertaining to legal or privacy issues that would, for instance, make it difficult to merge datasets from different countries for a single analysis, suggested mid-range solutions that were somewhat less ambitious, but more pragmatic in the short term, than those that WP4 had envisaged with a longer perspective in mind.
We will first briefly describe the various methods we used (online survey, individual and group interviews) and their rationale. Then, after describing the responding population and underlining some characteristics that are important for the analysis, we will present the results under two headings. We will first identify what we label “soft needs”, which are likely to be met relatively easily and are in some cases already dealt with in some facilities. These are relevant for all types of access, transnational or not, within the framework of a future network or in the current context, and they are widely expressed by researchers whose experience is mostly based on national access. We also identified “strong needs”, which are harder to meet given the current design of access procedures and security constraints, especially in the case of transnational access, which is our main concern here. It is worth noting that a network meeting these “strong needs” would also be likely to satisfy needs which so far have not been dealt with at national level either. In conclusion, we will present a few user cases reflecting the problems experienced by some researchers in the current situation; we will also address some major points for the design of a European Remote Access Network and the benefit of such a network for research activities.
PART 1 - SURVEY METHODOLOGY: A DOUBLE QUANTITATIVE AND QUALITATIVE APPROACH TO CAPTURE A SHARPER VISION OF NEEDS
This was a three-step process: first, a preliminary approach aimed at defining the basis of the survey and its main assumptions concerning the problems experienced by researchers; then a quantitative online survey, which made it possible to weight the relative importance of these problems and to investigate possible relations with researchers' profiles (disciplines, types of projects, institutional research contexts); and finally a qualitative investigation, well suited to a more prospective approach to transnational access issues. This last approach was dictated by the fact that, given the current constraints, few researchers have had the opportunity to gain any actual experience of transnational access.
1. Preliminary Approach: A Business Case Analysis
Our major concern at first was to avoid an abstract mode of reasoning; we decided to start from a real situation, making it possible to identify the obstacles experienced by researchers whenever their project requires secure transnational access to confidential databases stored in various facilities and in different countries. Though progressing (see WP3), transnational secure access remains complex and, where available, poorly publicized. It is therefore no wonder that complex research projects (involving the use of multiple databases from different countries) remain an exception; the very few projects of that kind happen to be led by highly experienced research groups. However rare they are, such situations do exist; the researchers who designed them have invented solutions in a highly constraining environment, and these solutions can help anticipate the design of the future network. Such is the case of the research project we selected as a starting point: it was conducted by a team of several partners based in 4 countries (already endowed with Research Data Centers, which are also DwB members): France, the Netherlands, Germany and the United Kingdom. In the course of this research project, researchers use administrative microdata produced by the various national social insurance offices; these include individual data on wages and employment in each country. The project aims to provide new evidence on the effects of social security contributions using large administrative panel datasets in France, Germany, the Netherlands and the UK, by means of a micro-based cross-country analysis.[1]
Following an in-depth discussion with our team, the analysis provided by J. Grenet, one of the persons in charge of this project, made it possible to identify three major problems hindering the completion of this type of project. This analysis, which illustrates the researchers' perspective, was presented by Grenet at the 1st European Data Access Forum. The first problem pertains to information: lack of information on existing data; variable-level information that is rarely available, since we are dealing with confidential data; poorly documented metadata; and a language barrier, with translations that are non-existent or very partial. The second major issue pertains to the accreditation process required to obtain permission to access and process the data. Whenever a project involves several teams and several national databases stored in various facilities and countries, this implies multiple accreditations, with procedures and forms specific to each country and facility that have to be signed by all members. This is time consuming and not compatible with the schedule and financial constraints of research conducted in response to funding agencies' calls. Thirdly, once these obstacles are overcome, even access to the data as such differs across facilities and/or countries. The researcher has to deal with different access systems and different standards for output checking and anonymization, not to mention variable access fees that are impossible to predict at the start of the project. There are also problems arising from the lack of compatibility between software programs and, ultimately, the long delay before obtaining usable results. Lastly, whenever research projects involve more than one country, remote access across borders may be forbidden, sometimes making it necessary for non-residents to go on site. An even more serious issue may lie in the impossibility of merging databases from various national sources (this would imply transferring them from one country to another) in order to run a single analysis, as opposed to several distinct cross-tabulations yielding different statistical results.
[1] GRENET J. (2012), “Crossing Obstacles for a European Research Project: From a Business Case to an Ideal World”, DwB First European Data Access Forum, Luxembourg.
The organizational structure of the team in charge of this project (several teams based in various countries), its working methods (the need to work together at a certain point and to compare the data), the comparative nature of the project, the type of data used (administrative files) and the obstacles met all turned out, once the survey was completed, to be typical of the projects that a network of secure data centers would make it possible to develop.
2. The Online Survey: A Quantitative Approach
The online survey, conducted on a sample of researchers working in several secure data centers, was based on this preliminary study, which made it possible to better identify the major points to be further explored for each phase of a research project requiring access to confidential data. Eight phases had previously been described for such projects (information - accreditation - data - access - support - output checking - feedback - project closure), and they were largely confirmed by the analysis initially conducted on the business case.
Although the business case highlighted issues related to the information and accreditation phases, these questions had been thoroughly explored in other parts of the project (WP3, WP7 and WP8), so we decided to focus the questionnaire on 4 phases: access - data (processing) - output checking - support and surveillance. Interestingly, though they were not at the core of the online survey, researchers here again spontaneously mentioned the importance of issues related to information and accreditation. The questionnaire, focused on the selected steps, was designed to further explore the possible impact of various technical modes and security constraints on data processing by the researchers, as they appeared in the description of several Research Data Centers (s […] analysis, the questionnaire tried to identify the more specific difficulties in the event of transnational access; moreover, a few questions targeted transversal issues common to all phases, such as language issues. Detailed results of this investigation, as well as the underlying assumptions, are presented further on. The questionnaire also allowed the researcher to address other points more freely in a final part. Assuming that, in a European secure data network, researchers should be able to process all foreign data without having to travel abroad, we selected a sample of researchers who had all experienced remote access, whether from their desk, from a specific room within their university, or from a specific center outside their university but at a reasonable distance. In line with the overall project, we selected researchers who had worked with “remote access”, which allows them to see the actual data and to work freely until the final output; however, a few researchers had also worked using the remote execution mode and were able to make some comparisons.
The first part of the questionnaire made it possible to check the researcher's experience of secure access modes and to identify his/her institutional affiliation: research environment, field of research, nature of the research project (namely the need to access data stored in different RDCs/countries), and individual or collective working method, all characteristics to be compared with the greater or lesser intensity of the problems experienced. The questionnaire, with 57 questions (Annex 1), was sent to researchers by 5 European RDCs, partners of the DwB project, offering remote access solutions in France (CASD), Germany (IAB), the Netherlands (CBS), and the United Kingdom (SDS, ONS). The centers sent the questionnaire's web link to their users either via their general mailing list or to a selection of researchers. Researchers completed the questionnaire anonymously and submitted it directly online to the survey design team. 90 researchers submitted a questionnaire, but only 65 questionnaires were fully completed and usable. Among the 65 respondents, 40% had been contacted by the Centre d’Accès Sécurisé Distant aux Données (CASD) in France, 22% by the Research Data Centre of the German Federal Employment Agency at the Institute for Employment Research, 26% by a UK system (20% by the Virtual Microdata Laboratory, although not all questions were usable for this centre, and 6% by the Secure Data Service) and finally 12% by the Central Bureau of Statistics (CBS) in the Netherlands. Obviously this sample, deliberately centered on a specific population, cannot be viewed as representative; however, it provides converging pieces of information suited to the aim of identifying, on the basis of researchers' actual experience (mostly limited to national access), the major points to consider in designing a secure European data network.
3. Individual and Collective Interviews: The Paris & The Hague Workshops and The Supplementary Questionnaire
In order to explore more prospective aspects concerning research projects based on multiple national data and conducted by teams located at various access points who need to work together across borders, we conducted a set of focused individual and group interviews. In this way, certain points insufficiently covered by the online survey were further explored: researchers' organization and working methods, and the division of labor within the team (who processes the data? who needs to analyze the output during the data processing phase without working on raw data? who is only involved in writing the report?). Few interviewees were involved in transnational projects involving several teams, which is no surprise considering the problems currently encountered in carrying out this type of project. This was also the case for the researchers who took part in group interviews; however, the group discussions were a great opportunity for researchers to describe, on the basis of their current practice, the essential needs that should be taken into account to support team work in a secure network providing access to databases from multinational sources. Group discussions also revealed some problems experienced by teams working in a less complex situation, i.e. within a national framework, suggesting that the design of a network structure dedicated to transnational access might also provide solutions at the national level.
Two workshops were organized with the assistance of Kamel Gadouche (CASD) and Leo Engberts (CBS), each with a group of ten researchers: the first in Paris at the Centre d’accès sécurisé distant (CASD) attached to GENES, the second at the Research Data Center attached to CBS, the Dutch statistical institute. In both cases, workshop attendees were users of the data facility. After describing their research project and the framework in which it was carried out (individual or collective and, in the latter case, each partner's role), the researchers were asked to comment on the overall project and on the preliminary results of the online survey, and to debate the broad outline of the design of a secure European network (Eu-RAN). Following these two workshops, a supplementary questionnaire focused on the typology of projects and teams was sent to the researchers, enabling them to revise their needs and to add comments. In a few cases, individual interviews were carried out by phone with researchers who had been unable to attend the workshop, or in order to further explore some points of major interest.
PART 2 - SOME CHARACTERISTICS OF ACCESS CENTERS, RESEARCHERS, TEAMS AND RESEARCH PROJECTS
Let us first present some information on researchers, teams and research projects, as revealed by this qualitative and quantitative investigation. There is obviously some link between the problems and needs expressed by interviewees and their personal characteristics, which can differ widely from one researcher to another, or from one project to another. As already mentioned, few interviewees were currently involved in complex multinational projects, owing to the extreme difficulty of carrying out projects of that type. Nevertheless, we are dealing with a population with some experience of secure access to confidential data, mostly from a single facility but sometimes with some experience of other centers; we may reasonably assume that this population is quite similar to the population of future users of a Eu-RAN.
Here are some results drawn from the online survey. Workshop attendees are not exactly a subsample of the online survey sample: the researchers selected for these workshops were not necessarily part of the online survey sample. However, they do not form a very different group, since they were drawn from the pool of users of these two facilities, which also provided the online survey sample. The graphs and tables below report the situation of the 65 respondents to the online survey, but we can assume that they would be basically similar for the 20 additional researchers who took part in the group interviews.
1. The Surveyed Research Data Centers
Let us first recap the characteristics of the various Research Data Centers whose users received the questionnaire. All these centers provide remote access; however, equipment and procedures vary (see deliverable 4.1 on the state of the art regarding secure remote access centers in Europe). Certain issues mentioned by researchers can be directly linked to these specificities, and their importance may be weighted differently from one center to another.
As mentioned above, 5 RDCs from 4 different European countries, all partners of the DwB project, took part in the survey: IAB for Germany, CASD (attached to GENES) for France, and CBS for the Netherlands; in the United Kingdom, two centers were involved: VML, affiliated with ONS, and SDS, affiliated with UKDA. All of them provide “remote access”, but with notable differences regarding earlier phases (accreditation procedure, available information), access system, data processing possibilities and constraints regarding points of access. Some are long established, such as VML and especially CBS, where many researchers have long had the opportunity to work; others, like CASD and SDS, are more recent and were still developing at the time of the survey. Some issues mentioned by researchers during the survey, e.g. authentication problems, can be related to this context and may have been solved since then. The institutional context of these centers is also of some relevance. Those attached to a National Statistical Institute (NSI), such as VML for ONS in the United Kingdom and CBS in the Netherlands, provide access to a wide range of datasets stored by the NSI. The field may be more restricted when the data center is a service of a ministry, as is the case for IAB with its longitudinal employer/employee databases. The situation is less predictable for centers under contract with data producers to provide secure access. Such is the case for CASD in France, under contract with INSEE and increasingly with other producers of official statistics (in particular the statistical services of various ministries), and for SDS, attached to UKDA, the UK Data Archive, which currently provides access only to several Secure Use Files from ONS.
The accreditation procedures also vary. In the case of IAB, for example, the accreditation procedure is handled by the access data center itself and generally requires a short amount of time. Conversely, the accreditation procedure may depend on a specific external committee, as is the case for VML, SDS and CASD; this implies that the researcher, while submitting a request for accreditation to the relevant committee, must simultaneously get in touch with the access data center and the data producer, which are required to assess the feasibility of the project and give a positive response to the accreditation application. Moreover, when accreditation requests are not screened continuously but on a quarterly basis, as is the case for the Committee of statistical confidentiality in France, the delay increases; it is therefore no wonder that CASD users are likely to mention this problem more frequently than others. Equipment also matters: if the facility does not use dedicated and integrated equipment, as is the case for CBS, the researcher is more likely to experience problems with software installation on his/her computer, requiring third-party help; on the other hand, whenever dedicated equipment is provided, the researcher is deprived of his/her usual working environment and has to become familiar with the new equipment. Constraints also vary widely from one facility to another regarding the locations that may serve as points of access for remote access.
In the case of CBS and CASD (the latter using dedicated equipment, the SdBox, sent by post), the researcher is allowed to access data from his/her university office, even if affiliated with an institution in another European country. For all the other centers, constraints are heavier: the researcher must travel to an accredited center at national level, which also means that European researchers from other countries currently have to travel to the United Kingdom or to Germany to use remote access. Note that, for IAB data, SDS will in the future be accredited as a point of access in the UK, as a result of the work conducted within DwB (including installation of an IAB server at SDS and signature of a contract with the University of Essex), thus saving the trip to Germany. Researchers dealing with the facilities where constraints are strongest are obviously those who tend to consider travel requirements a hassle. Finally, researchers with comparative experience of VML and, more recently, SDS in the United Kingdom spontaneously report that they greatly appreciate systems which allow the researcher to actually “see” the data.
2. Researchers’ Profile
The breakdown by country of the 65 respondents’ institutions of affiliation (see Annex 2, graph 1) is fairly similar to their distribution by access center: in most cases (9 out of 10), respondents use an access center located in their academic home country. Few of them have any experience of other secure data centers, and even fewer have any experience of transnational access since, in a number of cases, it would involve traveling abroad.
In most cases (83%), the users were affiliated with academic and public research centers (see Annex 2, table 1). This point is likely to ease the implementation of a secure European network, which will run much better once the accreditation authorities have agreed on converging criteria in this matter. It is clear, however, that the specific organization of research in each country cannot be discounted: some countries leave more room for private institutions (11% of the researchers in our sample are affiliated with private institutions). As to the definition of who is, or is not, a researcher, there is some debate, as shown by the experiment conducted during the 1st European Data Access Forum, where research projects and researchers' profiles were submitted to groups of participants who had to say whether, in their opinion, these profiles pertained to research activity or not.
The questionnaire makes it possible to collect information on the researchers' institutional environment, profile, experience, all characteristics that may impact their opinion concerning the current framework but that are also relevant for the design of a secure European network able to embrace a variety of situations.
The institutional research environment in particular is relevant because institutions differ in their capacity and willingness to provide dedicated spaces for access points, or to provide financial support for the access fees usually charged to the final user when a share of the infrastructure's investment and overhead costs is passed on to users (as is the case for VML, CBS and CASD). We may thus assume that an isolated researcher, affiliated with a university with little demand for such services, is in a more difficult situation than researchers attached to large research centers with high demand for confidential data. The institutional environment may also affect the capacity of new users to benefit from their colleagues' experience.
Two researchers out of three are economists, and one in four is a sociologist, demographer or geographer. The weight of the economists' community is obviously no surprise: economists are generally more likely than other social science disciplines to use large individual databases. Moreover, econometric tools increasingly require highly detailed data, accessible only via secure data systems. However, other disciplines are represented, and one may assume that their importance will increase in the future. Their needs are somewhat specific, as some answers reveal, though they are difficult to generalize. Geographers, for instance, need highly disaggregated spatial data, which pose specific problems in terms of the anonymization criteria applied to their outputs.
The responding population appears fairly experienced in terms of secure access: in most cases, respondents conduct at least two research projects using such a facility. These researchers are also familiar with the datasets; they may have previously worked on Scientific Use Files, which obviously facilitates working in safe mode, all the more so as documentation is often incomplete. The datasets concerned may be business data as well as household or individual data; the proportion of business data is somewhat larger because such data, considered impossible to anonymize, can only be accessed through secure facilities.
Some research projects make use of several different data files. However, these datasets were mostly stored in a single access center: only 4 researchers declared that they used two different access centers, and of these 4, two declared that they used the two twin centers in the UK, VML and SDS, probably because the recent creation of SDS allows more flexible access to certain data from ONS.
However, while working only on data stored in a single secure data access center, the majority of respondents declare that they work on their project with other researchers (see Annex 2, graph 2), often affiliated with their own research institution. This issue of “collective work” under secure access, which was of special interest to us for assessing the specific problems encountered by multinational research projects involving one or several teams, turns out to be relevant even for national projects. It reveals the type of problems encountered by research projects under the current framework and, at the same time, allows us to extend our conclusions to more complex situations involving multiple partners and/or several teams and based on several data files stored in different places. The central question concerning teamwork was how the work was organized and how the tasks were distributed. Only one half of the respondents working in a team with other partners declared that all members of their team had access to the data and could process them directly.
Finally, 82% of the respondents, whether or not they were involved in collective work, declared that they had good control over their working organization. When this is not the case, it is mostly due to spatial problems and traveling constraints if the access point is not close to their regular environment (study or office located in their institution).
PART 3 - RESULTS
As mentioned above, the questionnaire was structured along the different steps we had previously identified for a research project involving secure access to confidential data. Of these eight steps (information, accreditation, access, data processing, support, output checking, feedback and project closure), we retained the phases directly related to access systems: access, data processing, support and output checking. The answers can be sorted according to the degree of difficulty they reveal. The following table gives an overview of this classification for certain issues concerning each selected phase.
NO PROBLEM | SOFT NEEDS | STRONG NEEDS
Access
Training: X
Authentication: Best success of authentication
Location: No travel
Equipment: Its own work environment
Hardware/Software: More software available; Best support to install
Merging datasets: Single point of access would offer new research opportunities
Working with data
Organization of the work: More flexibility
Software: More software; Homogeneous practices
Going back to the data: X
Storage: X
Delay: X; Could be shorter
Documentation: Improve documentation and support
Output checking
Delay: Shorter
Restrictions: Be able to discuss outputs with other researchers
Additional formats: More formats
Support and surveillance: Human and reactive support
General opinion: Reduce waiting time
Some points do not pose major problems or only problems that could be solved without compromising the delicate but necessary balance between meeting researchers' needs and the security demands specific to confidential data, either by adopting better practices to the satisfaction of everyone in some facilities, or at the minimum cost of making some investments that do not jeopardize the whole operation. Other issues raised by the researchers are more difficult in the current context, thus requiring different compromises and new solutions that will also meet the security demands. The construction of a secure European network must obviously take both minor and major issues into account but we will see that the needs most difficult to meet are also those that are the thorniest within the framework of a transnational network.
Before reviewing these various points, let us first consider two results worthy of attention. First of all, even though the questionnaire focused on the phases that directly involve the secure systems, the researchers on all occasions (i.e. the preliminary study or business case, the online survey and the group interviews) repeatedly mentioned difficulties encountered prior to these phases, concerning firstly access to information and secondly accreditation. Here we face a sort of paradox since, conversely, the technical aspects of secure access are well accepted, even when they involve specific constraints concerning authentication, as we will see further down.
A specific concern regarding metadata for the Secure Use Files is the need for details at the variable level. When Scientific Use Files and Secure Use Files are both available, it is sometimes hard for researchers to rapidly identify which file they need and where the two differ. The lengthy accreditation and access procedure is all the more difficult to accept when the researcher does not have access to precise documentation, or when repeated interactions with the producer are necessary to get hold of this information. The problem is not quite as acute for older (CBS) or more highly specialized centers (IAB). It gets harder when researchers need access to data (often administrative data) that are not managed by the national statistical institute, such as social security files or job-search files stored by specific government agencies that have invested less in metadata. Only a few respondents had some experience of transnational access, but it is obvious that the difficulty is increased for researchers from other countries: not only are they less familiar with the datasets, but they also suffer from the language barrier when the metadata are unavailable in English. One may assume that part of the difficulties experienced within WP9 and WP10 in obtaining research proposals was related to this information deficit and to language issues for non-resident researchers.
Researchers also spontaneously mentioned other issues: difficulty in getting information on the accreditation process, a cumbersome procedure whenever the team includes many researchers, lengthy delays in obtaining accreditation, and application forms going back and forth between the data producer, who must give authorization, and the organization in charge of accreditation. The discrepancy between two different schedules is often mentioned: there is frequently a mismatch between the timing of calls for proposals and the timing of the accreditation process. The seriousness of this issue varies with the complexity of the procedure; it is likely to be quite serious in the case of a transnational
Conversely, several of the points mentioned concerning access (enrolment, authentication) and data processing do not give rise to significant remarks. In three cases out of four, researchers declare they are satisfied with the data processing schedule; one out of four is unhappy because in his/her opinion the system runs too slowly; this is likely to be fixed. A large majority (61/65) considers the storage capacity appropriate. Only researchers in charge of very large projects express some dissatisfaction on that issue. Almost all researchers were able to resume data processing after having run their programs a first time; this is of major importance in order to go further into the details or to meet the referees' demands before publication.
Whereas secure access solutions had sometimes been a source of concern in the research community, we observe that this working method was easily accepted, all the more so for remote access (as opposed to remote execution). We may therefore assume that a secure European network, once implemented, will quickly attract a growing pool of users, in the same way as the national secure facilities did.
Let us now consider in more detail, firstly, the issues that raise difficulties without representing a major obstacle and, secondly, the issues that are likely to be harder to fix, typically in a transnational network.
1. "Soft Needs": What Can Be Easily Done To Improve Current Way of Work
As to the initial access phase, the questionnaire included questions about enrollment, authentication methods, places of access, and computer equipment.
Let us note that a very large share of all researchers attended a training session on the technical use of the facility (47), on the legal aspects (44), on anonymisation and output procedures (41) and finally on data and metadata (33). Some differences between the centers are revealed; some seem likely to emphasize more technical aspects, whereas others insist on the legal or anonymization issues. In general, researchers appreciate these sessions, which seem to meet their expectations.
Once they have attended the enrollment meeting, researchers have to go through an authentication process in order to access the required data; this process is based on various techniques (biometrics, login, password, smartcards). Most of the time, a single method is required (51 cases out of 65), the most widespread being the login/password combination; the biometric fingerprint comes next (see Graph). Whenever two methods are combined, the combination consists of the biometric print supplemented by the smartcard.
The good news is that a large number of researchers seem to be fairly satisfied with the authentication methods available (58/65) as well as with the frequency of authentication (52/65). Only a minority considers the requests for authentication too frequent and complains that they may slow down data processing. The bad news, however, is that authentication may sometimes be somewhat unpredictable. More than one half of survey respondents and workshop participants experienced problems with passwords or biometric devices (excessive sensitivity of the fingerprint device, this problem being concentrated in certain facilities). However, this problem is about to be fixed, since researchers mentioned that the problems they experienced at first were solved later on. Lastly, having to use a different card for each project seems to be a hassle. Authentication is a serious issue since it takes place at the beginning of the process, but the problems mentioned seem likely to be fairly “easily” fixed.
Two major "soft needs" related to the next phase (the data processing phase) appear once the researchers are allowed to access the data: installation of computer equipment and additional software. Some researchers may have to install a device allowing them to access the data (this varies according to the facilities' access systems, the need to install dedicated equipment or not - SD Box for example - or to modify the computer, or to have a computer technician do it) and/or to install software for data processing. Out of 65 respondents, 24 were obliged to install some hardware device or software; in one half of cases, this task was performed by the researcher or by a member of his institution's IT team (respectively 9 and 8 answers). In other cases, it was performed by an external operator. Most difficulties mentioned here are related to technical issues arising from the facilities' security requirements and to communication problems between the secure data center staff and the members of the academic or research institution. Getting help from IT experts, in particular from the IT team in the university, may sometimes be a problem. The local IT team may also have security concerns regarding external installations. For a European remote access network (Eu-RAN), this issue should be considered with the greatest attention, since we can expect all access systems to be different. The adjustment process should be user-friendly and the computer installations as homogeneous as possible. Providing swift and competent assistance to researchers will be crucial for success. This is a very important point indeed; however, we categorize it as a "soft need", since in general those researchers who needed assistance finally obtained satisfaction.
[Graph: authentication methods. Categories: Biometric; Login, password; Smartcard; Other]
Software turns out to be a major topic for discussion, with once again strong variations from one facility to the other: software supply is variable and researchers' expectations are dealt with in different manners. 13 researchers needed software that was unavailable and, on several occasions, in the questionnaire or during the workshops and interviews, researchers mentioned the need to obtain or install data analysis software or data processing tools other than those at hand. Specifically, cartography software, SPSS or R are in demand. Sometimes researchers only need to add specific lines of program or to update certain modules, and this is generally a source of technical problems. Along the same lines, the lack of a connection to the web makes it impossible to download some software packages directly; researchers are then obliged to ask the facility staff to do it, which often involves more or less serious installation delays.
As to the support and surveillance phases, we should note that a majority of the researchers do not know whether or not they are under scrutiny during the data processing phase (only 11 of them mention this point, concerning control of the methodology they used or of the organization of work at the beginning of analysis). They generally willingly accept the constraints related to secure access to data. However, the output checking phase turns out to be a source of difficulty, as do the lack of confidence in the researchers and the absence of guarantees for their property rights; all these issues are addressed next in the “strong needs” part.
2. “Strong Needs” and Researchers Requirements Regarding a Eu-RAN
Some issues either remain unmentioned by the researchers or raise only minor problems that are easy to fix without any major change in the way secure data access centers are run. Other issues, however, turn out to be major obstacles likely to delay research activities considerably, to limit them or, at worst, to cause the project to fall apart. These difficulties are all the stronger in a transnational perspective. We may group them under 7 major headlines corresponding to strong requirements expressed by the researchers. In the framework of a secure European network of centers, they have to be treated with great care. Some of these points call for discussion of new solutions adapted to researchers' needs while respecting privacy and security constraints.
Reduce Waiting Time throughout the Process
Issues relating to timeframes and delays are mentioned over and over by the researchers in all modes of investigation, qualitative as well as quantitative: simplification of the data access procedure and reduction of delays are two major requirements. The results suggest that the most disturbing delays occur at two points in the data access process.
They occur first at an early stage, with the accreditation procedure. We have already mentioned that although this issue was outside the scope of the survey, it was spontaneously pointed out by the researchers on several occasions: they emphasize the discrepancy between "research timing", marked by specific constraints (dissertation, research projects subject to the specific deadlines of calls for proposals), and “bureaucratic timing”, which involves delays that are too long and too unpredictable before accreditation is received, thus leaving too short a time for data processing; some projects may even fail because of such delays. Typically, this is the case if the researcher has to obtain accreditations for data involving several countries and/or several teams.
At a later stage, researchers complain that the checking of their outputs often takes too long. One third of all researchers consider that the delay to obtain outputs is too long. And among researchers who declared a delay of several days (as is mostly the case according to our results - 58%), one out of two considers this to be unacceptable.
Let us quote here some comments which show that these delays are neither understood nor accepted: “such delay is not convenient for quality research”, “… any waiting just to learn that the program crashed is very frustrating”, “not acceptable for a paid service”. Some challenge the competence of the staff in charge of output checking: “Remote Access services are generally not experienced enough to judge that output is correct”. On top of that, there is the time needed to find relevant information concerning available data and procedures, the time lost on trips when there is no access point available at the researcher's home institution, the time needed to access the data once accreditation has been granted, and the time needed to obtain the necessary support; in short, it is the whole chain of operations which is being challenged. Since most research projects involve several teams, each with its own deadlines, coordination becomes extremely tricky, which in turn delays the whole operation.
Access Points: No Travel and One System
Researchers express a second major requirement: they want to be able to access and process data from their home institution, without having to travel to another place.
Among the on-line survey respondents, 6 researchers out of 10 could reach the data from their home institution, either from their own office or from a dedicated room; 3 out of 10 could access data from their own office, which is perceived as the least frustrating solution (see graph below).
[Graph: How long did it take for each checking of output (on average)? Categories: A few minutes; A few hours; A few days; Other]
In fact these results reflect the breakdown of respondents across the various secure centers they used. At CASD and at CBS, researchers may access the data from their home institution, as opposed to the other facilities that expect the researcher to visit an accredited center in order to access the data; the latter may be located in another city than the researcher's residence.
About one half (53%) of the respondents who could not work from their own office viewed this fact as problematic. The issue at stake is time but also money, i.e. the cost of transportation and accommodation if the accredited access point is located away from the researcher's home city. It is also an organizational problem, as it implies a tighter schedule for researchers who teach or are involved in other activities. The problems listed include: the need to book seats in advance; being unable to work when the center is closed, which is a serious problem when the user has to find accommodation on the spot; not having one's documentation at hand; and not being able to discuss a recent problem immediately with a colleague. All these elements tend to be detrimental to team work, interactions and, ultimately, efficiency.
What is true at national level is also relevant in the case of transnational access. By using a secure European network to process confidential data, researchers will save travelling expenses, time and hassle. Whether researchers will be able to process data from their home institution remains an open question. This point was crucial in the discussion between national statistical institutes concerning the recent European regulation on researchers' access to the European data held by Eurostat, which should allow the creation (currently at project stage) of secure remote access to the Secure Use Files, thus saving users a trip to the on-site access point in Luxembourg (which is actually little used). It turned out that access would only be allowed from accredited facilities, initially limited to those of the NSIs, though later on it might be possible to accredit other institutions under very strict conditions. It is unlikely that many universities would make such an investment for a limited number of possible users, which would force many researchers to travel within their country to accredited access points that may be distant from their usual workplace. A similar situation may arise for a Eu-RAN providing access to national microdata, which would raise serious problems for researchers.
[Graph: place of access. Categories: In your own institution, in your own office; In your own institution, in a specific room; In another research institution; In a data center of a NSI; Other]
Anticipate a Potential Mobility between Institutions or Countries
This point is in part related to the access-points issue. Three types of situation regarding mobility may arise:
a) Occasional or short-term mobility involving a trip to another location in order to work with colleagues. In this situation, the researcher is unable to resume data processing immediately in order to modify or test new models or refine the analysis.
b) Another type of short-term mobility, when the researcher makes a trip of a few days or weeks. Whether access takes place from the researcher's office or from a dedicated spot, the access point is not mobile. Researchers expect a more flexible solution ("I wish the box could be usable from several locations"), and some suggest a more restricted form of access for such cases, such as remote execution.
c) Long-term mobility involving another national or foreign institution. If it takes place during the course of a project, it means starting the accreditation and access procedures all over again (typically if the accreditation involves the institution, as is generally the case).
Research Environment
A fourth strong requirement of the researchers pertains to their research environment. Less than one survey respondent out of 10 was able to use his/her own computer to access the data. Most of them used dedicated equipment, whatever the center they described and whatever the location. For those working on dedicated equipment from their own office, the hardship is reduced, since their own computer is in the same room. Not working with their own computer is often problematic (42% of the researchers mentioned it) for two kinds of reasons. The first is the lack of access to their personal files, documentation and software and, more generally, to their own work environment. Researchers described this as an important constraint, particularly during the discussions we had at the workshops. The second reason concerns the transfer of programs and the recovery of results or programs. Program transfer (authorized after control in several centers) has to be carried out by the center, which sometimes causes long delays.
The research environment issue may be addressed from a second perspective, that of communication with other researchers. We have already mentioned the handicap caused by being unable to work in the same place as the research team and other colleagues. The problem of being unable to access the web comes next; it is frequently mentioned as a limitation entailing other problems: the difficulty of checking references, of comparing results, but also of asking questions or getting support concerning the work in progress. While researchers willingly accept protected access and its constraints, they would consider a more open research environment an improvement. One researcher made this comment, which reflects the general opinion well: "It would be possible to do much more in terms of research with accessibility from my desktop computer in the office". How far a more open research environment can be compatible with highly secure access to data is an open issue for a Eu-RAN.
Improve Documentation and Support
Metadata issues during the data processing phase are of a more traditional nature and apply to secure datasets as well as to other files. However, for Secure Use Files, which are less frequently used, and most particularly for administrative files, the documentation is often more defective than for Scientific Use Files. The poor quality of documentation was mentioned several times, sometimes in a quite outspoken manner: "Some documentation is missing; this is unacceptable for a paid service", “clear and up-to-date documentation, standard format for standard variables (e.g. time variables, geographical units' id, etc.)”.
Several researchers also express the need for improving the assistance to researchers by upgrading the hot line or remote assistance services. Let us quote two researchers: "we can contact the service only via a generic email address; this is not convenient as some of our requests got lost, and sometimes it would be easier to explain the problem orally", "The hot-line could be made more efficient by providing, for example, a specific contact address per project or specific addresses for each kind of services (input, output, IT technical support, data technical support, accounting, etc.)”. In general, researchers ask for a “human” and responsive technical support, which they sometimes miss.
Finally, as regards transnational access, the language issue comes first for all information on accreditation and access procedures (see in particular the business case on this subject). The language issue is less salient for aspects related to data processing; to some extent this is probably because most researchers are members of international teams in which each member processes his/her own national data. In a network dedicated to facilitating transnational access, the translation problem will most certainly be more acute.
Output Checking: An Essential Issue
While remote access allows researchers to process data freely and pile up their outputs while working, output checking becomes mandatory as soon as they wish to take them out of the “bubble”, either in order to discuss their results with others or for publication. Output checking relates to any type of result: descriptive tables, regressions, results of econometric models, etc. This control, performed by the centers' staff, is meant to check that results comply with statistical confidentiality rules. Even though researchers most of the time accept the constraints due to control, this issue raised a number of remarks in the on-line questionnaire as well as during the workshops.
The first topic for discussion relates to those restrictions on outputs that may be detrimental to research. The second point relates to the subsequent difficulties as to the possibility of discussing the results during the data processing period, either with team members, or with some experienced colleagues who are not project members.
Slightly more than one researcher out of three declared he/she experienced difficulties in the past due to output restrictions. Four types of reasons are revealed.
- Restrictions apply in the case of small numbers or an insufficient number of cases in each cell, or make it impossible to get data detailed enough to be mapped. Disciplines more prone to using detailed descriptive data, such as sociology (urban sociology in particular), demography and geography, are more likely to experience this type of problem.
- In some other cases, limitations pertain to software use (graphs computed with Stata cannot be used, import problems) or to unavailable formats. Researchers suggest that more formats should be available for outputs (Stata graphs, software adapted for cartography, Excel format, Stata, new version of SPSS).
- It may be impossible to report some descriptive statistical indicators (mean, max, min).
- For confidentiality reasons, some methodologies may be prohibited.
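The first restriction above, on small numbers of cases per cell, can be illustrated with a minimal sketch. It assumes a simple minimum-frequency rule; the threshold value, the function name and the toy records are hypothetical and do not describe the rule applied by any of the surveyed centers.

```python
from collections import Counter

# Hypothetical threshold: each center sets its own confidentiality rule.
MIN_CELL_COUNT = 10

def flag_small_cells(records, keys, min_count=MIN_CELL_COUNT):
    """Return the cells of a frequency table whose count is below min_count."""
    counts = Counter(tuple(record[k] for k in keys) for record in records)
    return {cell: n for cell, n in counts.items() if n < min_count}

# Toy records, invented for illustration only.
records = (
    [{"district": "A", "sector": "retail"}] * 25
    + [{"district": "A", "sector": "mining"}] * 3
    + [{"district": "B", "sector": "retail"}] * 12
)

small = flag_small_cells(records, ["district", "sector"])
print(small)
```

With these toy values, only the cell combining district A and the mining sector falls below the assumed threshold. This is the kind of cell that finely detailed geographical or descriptive tables tend to produce, which is why the disciplines listed above are the most exposed.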
The second issue relates to the negative effects of output control on work organization. During the data processing phase, researchers are frequently unable to discuss their results with other team members unless the outputs have first been checked. This may be the case when not all team members are enrolled and some therefore cannot log in, or when they have to travel to the accredited access facilities. The problem is of course even worse when it comes to discussing outputs with experts who are not part of the team. Researchers therefore have to wait for output checking before their outputs can be shared outside the “bubble”. Loss of time, organizational issues and additional costs (if the number of output tables is restricted, as is the case in some facilities) are only one aspect of the problem. While working, researchers may want to discuss intermediate outputs that are not designed for publication, some of which do not meet the anonymisation requirements for publication.
Most researchers make it very clear that they need to discuss their tentative results in order to validate methodologies or outputs, or just to gain a good understanding of the data. When researchers do not all work in the same location (in different countries, but also in different cities of a single country), it is very difficult to discuss intermediate outputs, and they can discuss only the final outputs after validation. This limits interaction and responsiveness because the exchange delays can be long (checking the outputs, discussion with colleagues, back to the analysis to modify it, new output, etc.). If only a few team members are accredited, then only they will be able to discuss the tentative outputs. Once approved, the final outputs are discussed with the rest of the team, but only at a second stage, which is detrimental to group cohesion. “It’s particularly a problem when you work with external people”, “major problem is that an RA may run a program and can look at the output (complete, which is necessary for checking), but is prevented from showing it to an off-location researcher”. This is just as problematic in the case of Ph.D. candidates and their supervisors: whether or not the supervisor is personally accredited, he/she would have to travel to the access facilities to access data and tentative outputs. Once again, this whole process is time consuming (which is particularly harmful in the case of a Ph.D. candidate), may be a source of errors and may inflate the costs (if the cost is a function of the number of outputs).
Another specific problem was mentioned: some researchers would like to use outputs not for publication but for further analysis “out of the bubble”. This may apply to demographers; it may also apply to researchers who do not currently have permission to merge several cross-national datasets in order to run a single analysis, and who might in certain cases perform this type of analysis on the basis of outputs computed from each dataset. Even though the requested outputs meet statistical confidentiality standards, certain RDCs seem to consider that these outputs, not being intended for publication, cannot be validated because they are meant to be re-used in other analyses.
Thus, the output checking process simultaneously raises problems of definition, problems regarding the possibility of discussing tentative outputs with experts who are not members of the project, and finally problems concerning work organization within a team. The latter arise either because certain members do not have access to the data, not all of them having asked to be accredited or enrolled (even though they were listed as team members at the time of the application for accreditation), or because they work off-site and have to travel to an accredited access point in order to access the data, which makes "real-time" work more complicated. This is particularly the case for multinational teams, and obviously all the more complicated if transnational access happens to be impossible. Speaking of the secure European network project, one researcher made this comment: "that system could be improved by allowing access for researchers based outside the UK. If we have a project with someone based in the US, for example, that person cannot access the SDS and so cannot see the ongoing work until it has reached a point where we can extract an output from the SDS".
Merging Data: A Request For New Research Opportunities
The last point, particularly related to the question of a Eu-RAN, is the possibility of merging data files coming from various countries to perform a single analysis. This involves being able to transfer at least one of the relevant data files across borders, from one country to the other. This is currently the sore point; even in the case of remote access, it is generally understood that there is no data transfer as such. In a number of cases, this represents a significant barrier to data analysis for researchers: statistical results obtained from a separate analysis of each data set are not comparable to results obtained from a single database merging the two national datasets. As this solution currently encounters difficulties, some researchers interestingly suggested allowing data analysis to be conducted on the basis of outputs computed from each data set. As we have seen previously, this solution also encounters difficulties. It could however be a way to progress, pending a more comprehensive solution requiring agreements between countries based on the construction of a “circle of trust” (see on this point the work carried out under
research environment where such computation on outputs would be possible could facilitate a solution.
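Why separate analyses are not always comparable to an analysis of merged data, and why computing on outputs only partly closes the gap, can be shown with a minimal sketch. The values below are toy numbers invented for illustration, and `summarize` is a hypothetical helper, not a procedure used by any center.

```python
from statistics import mean, median

# Toy values standing in for one variable held in two national datasets.
country_a = [1, 2, 3, 4, 100]
country_b = [10, 20, 30]

def summarize(values):
    """Per-dataset output of the kind a researcher might ask to extract."""
    return {"n": len(values), "sum": sum(values), "median": median(values)}

out_a, out_b = summarize(country_a), summarize(country_b)

# A pooled mean can be rebuilt exactly from the two outputs alone.
pooled_mean_from_outputs = (out_a["sum"] + out_b["sum"]) / (out_a["n"] + out_b["n"])
pooled_mean_direct = mean(country_a + country_b)

# A pooled median cannot be derived from the two separate medians:
# it requires the merged records themselves.
pooled_median_direct = median(country_a + country_b)

print(pooled_mean_from_outputs, pooled_mean_direct)
print(out_a["median"], out_b["median"], pooled_median_direct)
```

In this sketch, the mean computed from the two outputs matches the mean of the merged data, whereas the median of the merged data differs from both national medians. Statistics that decompose into counts and sums can therefore be combined outside the "bubble", while others still require access to the merged records.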
IN CONCLUSION: SOME GENERAL POINTS
1) Researchers acknowledge the legitimacy of constraints resulting from highly detailed and sensitive data processing.
2) From a technical standpoint, out of all secure access modes, remote access is the favorite.
3) However, researchers consider that once accredited (through too lengthy a procedure) and enrolled, they should enjoy greater trust: control over them is too fussy and not always competent. "Once a researcher has been approved by the system and their institution and has a substantial track record of publication, they are to be trusted. The primary goal of a researcher is good and accurate research output, not stealing of data". As a consequence, some point out the discrepancy in terms of the guarantees they enjoy regarding intellectual protection. Let us quote one researcher: "in principle, a remote access is a very good system that allows to access sensitive data while protecting their confidentiality (thus hopefully ensuring their long-term availability). In this sense, it is very protective for the data producer and the access provider. However, the protection of the users (researchers) regarding their property rights on data and code produced, results and papers obtained, etc., should be equally addressed very carefully. I have no doubt concerning the honesty of people "behind" a remote access but, just as a secured remote access doesn't trust "me" (it is its principle)... you never know!". Another asks to "clarify and address the issue of intellectual property rights: there is no protection of the remote access' user's intellectual property; all what we create on the remote access (software procedures; results; word texts) and what we input and output, are not protected; we have no insurance that our methodology/findings/interpretation of results, are not used by any third parties".
4) The impact on the researchers' organization of work is significant. Typically, time constraints are important because researchers need time to work on other datasets, or to perform their teaching or administrative duties. The type of organization imposed by secure access, in particular if it is necessary to travel to the access points but also to wait, is a heavy burden, and as a consequence often results in remote access being used over a long period, sometimes unwillingly.
5) This constraint and the difficulty of organizing work are particularly severe when the researchers work within a team, which is mostly the case, whether on site (even when based in a single institution) or within the framework of multinational projects, and this in all disciplines. The time spent on output controls during the data processing phase, and the difficulty in certain cases of sharing outputs, matter all the more when the project is carried out by a team based in different places. This major point should be considered when designing a network of secure centers.
On top of time planning, cost planning is also mentioned when access is not free and/or the number of outputs is restricted, because researchers tend to underestimate the time necessary to complete the whole project, the number of trips to the accredited access points, and the number of outputs they need.
6) The survey reveals a number of important points worth considering when designing a secure European network: metadata, translation, the research environment, output checking, the possibility of merging data, and the possibility of using different access modes at different moments in the research process, which would allow more flexibility regarding access points (for example, remote execution may coexist with remote access, allowing researchers to modify previously tested models when they cannot work from their access point).
7) Altogether, when asked about the most salient issue for a secure European network, researchers give priority to access. They consider that a first step, rich in possibilities, would be to build the backbone infrastructure allowing access to datasets located in different RDCs/countries from a single point of access with the same equipment. When questioned about the usefulness of such a network, given that most data are not yet harmonized, researchers argue that harmonization is their own usual business.
How did you organise your work for this project, regarding remote access system used?
You concentrated your work in a short time period and worked on a daily basis 23%
You used remote access system several times during several months 72%
Other 5%
ANNEX 1: QUESTIONNAIRE
ANNEX 2: WHO ARE THE RESPONDENTS?
Graph 1
Table 1
Q2. Is your institution…
A public University 48%
A public research center 35%
A private research center 11%
Other 6%
Graph 2
Sources: web-link survey
[Bar chart: "Where is your institution located?" Categories: France, Germany, Netherlands, United Kingdom, Other.]
[Bar chart: "Did you apply for data through remote access" Categories: Alone; With researchers from your institution; With other researchers from other institutions in your country; With other researchers from other countries.]
ANNEX 3: ADDITIONAL QUESTIONNAIRE
RESEARCHERS' NEEDS FOR TRANSNATIONAL ACCESS TO CONFIDENTIAL MICRODATA
The goal of this questionnaire is to learn more about your project, your team, how work is organized within the team, and your opinion about a future EU-RAN.
Thank you very much for your help!
--- DESCRIPTION OF THE PROJECT
Please choose one of your former or current projects to be described.
1. In the chosen project, how many datasets do you use?
2. Which one(s)? (for each of them, please indicate from which country they are)