4.3.1 Request for supply of data
Following approval of a pilot and/or an integration project, Statistics NZ will formally request a dataset from the source agency or agencies. All data in datasets that are obtained by Statistics NZ for integration will be considered to have been collected under the Statistics Act 1975, and all relevant provisions of that Act will apply to the data.
Prior to requesting the data, an agreement document such as a Service Level Agreement (SLA) or a Memorandum of Understanding (MOU) should ideally be in place. An SLA is a formal agreement between two or more parties that seeks to achieve a mutually agreed level of services through the efforts of all the parties involved. An MOU is a formal voluntary agreement between two or more parties that seeks to achieve mutually agreed outcomes through the efforts of the parties. The only difference between an SLA and an MOU is that an SLA is a contract, while an MOU is a voluntary agreement. In data integration projects in Statistics NZ an MOU is commonly used and would stipulate:
• a specification of which variables to request and the formats of the data
• security and confidentiality measures
• the frequency of data supply.
A request to obtain a dataset can be prepared once the source agency and Statistics NZ have agreed on the appropriate dataset specifications to allow Statistics NZ to proceed with integration in the most cost-effective way. Such specifications are to include the variables requested, the corresponding format, periodicity and timing of delivery, the transport mechanism, missing data handling, information about data quality and responsibility for cleaning the data.
The source agency:
• is responsible for ensuring that all required variables are identified, specified and supplied as agreed
• must provide relevant documentation about the dataset, as well as access to data experts who can assist with queries as to structure, format, etc
• must also provide information on who currently, or potentially, has access to the dataset to assist Statistics NZ in establishing confidentiality requirements
• must advise any changes in its collection mode or classifications to Statistics NZ.
All of these requirements of the source agencies should be clearly documented in the MOU, and the information collected about the administrative data should be well documented in the metadata.
It has proved helpful to obtain the data model of the provider’s database so that data can be requested in terms used by the source agency.
Once sufficient understanding of the data has been gained, a request for a data extract can be prepared. This will include the following items:
• content of the file: population, time period, fields required
• how the file will be formatted: file type, field separators, provision of separate look up tables
• checklist for the extract before it is sent (eg no special characters, valid data values, range checks etc)
• how the data will be delivered: media, transport, encryption (see section 4.3.2).
Good communication with the data providers, and unambiguous specification of data requirements, reduce the likelihood of the data extract failing to meet the needs of the project. Sufficient time should be allowed for the providing agency to extract and supply the data – this can range from days to months, and should be discussed with the provider before submitting the request.
Along with the choice of the fields to request and the corresponding formats of the data, another very important decision in requesting data from providers is the time period over which the request is made. Data integration projects may integrate data over a monthly, quarterly or annual cycle. This requires precise definition of which records should be supplied each period, without overlap or gap. Again, using the field names on the data model helps avoid misunderstandings. For instance, “all records collected in December quarter 2002” and “all records with creation_date field within 1st October to 31st December 2002” could result in different datasets. In the first instance, records that were created earlier but updated in that period could be included, while in the second instance that would not happen. The agency receiving the data can detect records that should not have been received. However, there is usually no way for the receiving agency to check which records have not been received. Deciding on the time period can be complex, because it can also involve decisions about the frequency of supply and about how to treat late returns or claims filed much later.
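The difference between the two wordings can be illustrated with a minimal sketch. The records, the `updated_date` field and the helper `in_period` are hypothetical and exist only for this illustration; only the `creation_date` field name comes from the example above.

```python
from datetime import date

# Hypothetical records; only creation_date is named in the text above.
records = [
    {"id": 1, "creation_date": date(2002, 10, 15), "updated_date": date(2002, 10, 15)},
    {"id": 2, "creation_date": date(2002, 6, 1), "updated_date": date(2002, 11, 20)},
    {"id": 3, "creation_date": date(2003, 1, 5), "updated_date": date(2003, 1, 5)},
]

# December quarter 2002, with both end dates included.
PERIOD_START = date(2002, 10, 1)
PERIOD_END = date(2002, 12, 31)


def in_period(d):
    """True if the date falls within the requested period (inclusive)."""
    return PERIOD_START <= d <= PERIOD_END


# Precise wording: records whose creation_date lies in the quarter.
created_in_quarter = [r["id"] for r in records if in_period(r["creation_date"])]

# One possible reading of "collected in the quarter": created OR updated then.
touched_in_quarter = [
    r["id"]
    for r in records
    if in_period(r["creation_date"]) or in_period(r["updated_date"])
]

print(created_in_quarter)
print(touched_in_quarter)
```

Record 2 was created in June but updated in November, so it appears only under the looser reading. Stating the field and the inclusive date range in the request removes that ambiguity.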
The data request and supply process may need to be iterative, with modifications or corrections made to the data supplied as needed. It is recommended that the specification be tested first by transfer of a small version of the full dataset.
4.3.2 Data transfer
Statistics NZ corporate standards and policies on data transfer are being developed, but as yet there is no standard approach in place.
Various transmission modes have been employed in data integration projects: email, courier and carriage by hand. The storage medium has also not been consistent across projects and has variously involved the use of tape, compact disc (CD) and digital video disc (DVD). Furthermore, the method used to ensure data security for these media has also been variable, ranging from none, to a password-protected compressed (ZIP) file, to Pretty Good Privacy (PGP) encryption.
Email is a fast method of transferring data between the data source agency and Statistics NZ. This method of transfer has two main weaknesses. First, email transfer can be less secure than other methods and, secondly, email data transfer is suitable only for small datasets. However, more secure email systems are being introduced that could provide better options for consideration in the future.
Considering future demands for more efficient and secure ways of transferring data, the method of carriage by hand should be reviewed. In general, the recommended data transfer option is by courier.
Although storing data on CDs is still a common practice, transferring large datasets by DVD would be the preferable mode in the future. Because the different media for storing data do not have a built-in security mechanism, it is expected that all media content will be encrypted.
Past records show that few measures have been taken to secure the media content. PGP encryption is the method recommended by Statistics NZ for encrypting data. However, where providers do not have the resources for PGP encryption, password protection should be used.
Example of data collection process – Injury Statistics
This example outlines the administrative data collection process of the Injury Statistics programme. This process was developed as a temporary measure by the injury team to serve their current requirements.
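Process diagram (not reproduced here): the diagram links the Data Provider, Commercial Courier, Data Custodian, Help Desk, Server/Network, Developer, Safe Storage and Disposal through the numbered steps 1 to 10 described below. Its legend distinguishes three kinds of link: Media/Instruction Flow, Information/Communication Flow, and Working Together.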
Step 1: Media creation
The data provider will extract information from their administrative systems, which is then copied to CD/DVD media in PGP-encrypted format. The encryption key/password will be communicated to the data custodian via telephone. Statistics NZ highly recommends encryption. However, if the providers do not wish to encrypt the data, it will be received in flat file text format. The file naming convention should be specified in the SLA or MOU.
Step 2: Handover to courier
The media is handed over to the courier. The package is to be addressed securely, as specified in the MOU.
Step 3: Delivery
The Injury Statistics data custodian will receive the data media and the Data Collection Log will be updated. The Data Collection Log is a log of all events related to data collection and will be maintained by the data custodian in hard copy or electronic format to keep a track record of the collection.
Step 4: Transfer at Helpdesk
The data custodian or a representative will place a Helpdesk request to copy the file and personally carry the media to the Helpdesk for the transfer of data to the file server on the Statistics NZ network. On completion of the transfer, the media is to be brought back by the same person. Though the transfer is expected to take only a few minutes, the data custodian is expected to make an appointment with the Helpdesk beforehand.
In the event that data transfer was unsuccessful (eg media corruption, incorrect format), the data custodian will be informed and the medium stored in an appropriate place.
Step 5: Collect from Helpdesk
The media is collected from the Helpdesk once the transfer from the medium to the server is complete. The Data Collection Log is then to be updated.
Step 6: Place in safe storage
If data transfer and the database loading are completed successfully, the same medium is to be stored safely for the period specified in the MOU. The Data Collection Log will be updated, and the storage location for the medium recorded in it.
Step 7: Disposal
On completion of the storage period specified in the MOU, the media is to be disposed of by shredding.
Step 8: Feedback to provider
The data custodian is expected to inform the provider of all successful data transfers. In the case of an unsuccessful data transfer, the provider has to be informed about why the transfer failed, and a request for a new set of data must be made.
Step 9: Developer expertise
In the event that data from the server is not loaded to the database, or tests on loaded data show possible error, the data custodian may consult Application Services for expert advice.
Step 10: Transfer from media
This is the actual transfer of the data from the CD/DVD medium to the server, which is carried out by Helpdesk personnel.
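Where the Data Collection Log updated in Steps 3, 5 and 6 is kept electronically, it could be as simple as a delimited file with one row per event. The sketch below is one possible layout, not the Injury Statistics team's actual log: the column names, the function `log_event`, the media identifier and the dates are all made up for illustration.

```python
import csv
import io
from datetime import datetime

# Hypothetical column layout; the text does not prescribe one.
LOG_FIELDS = ["timestamp", "event", "media_id", "recorded_by", "note"]


def log_event(log_file, event, media_id, recorded_by, note="", when=None):
    """Append one event row to an open, writable log file."""
    writer = csv.DictWriter(log_file, fieldnames=LOG_FIELDS)
    writer.writerow({
        "timestamp": (when or datetime.now()).isoformat(timespec="seconds"),
        "event": event,
        "media_id": media_id,
        "recorded_by": recorded_by,
        "note": note,
    })


# An in-memory file is used here; a real log would live in a secured location.
log = io.StringIO()
csv.DictWriter(log, fieldnames=LOG_FIELDS).writeheader()

log_event(log, "media received", "DVD-001", "custodian",
          when=datetime(2006, 3, 1, 9, 30))
log_event(log, "transfer completed", "DVD-001", "custodian",
          when=datetime(2006, 3, 1, 10, 5))
log_event(log, "placed in safe storage", "DVD-001", "custodian",
          note="cabinet 2", when=datetime(2006, 3, 1, 10, 20))

# Read the log back to confirm what was recorded.
log_rows = list(csv.DictReader(io.StringIO(log.getvalue())))
print(len(log_rows), log_rows[-1]["event"])
```

Recording the storage location in the `note` column is a design choice made for this sketch; a dedicated column would serve equally well.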
4.3.3 Data verification
On receipt of a data extract, a number of checks can be performed to verify that:
• the number of records extracted is equal to the number received
• there are no duplicate unique identifiers
• numeric fields contain numbers, and text fields are predominantly text
• all the variables requested are present, and no extra variables have been provided by mistake
• the range of values in each field is appropriate, and there are no unusual or surprising values
• the distribution of values in each field is as expected
• there is consistency with other fields in the data
• the relationship between files is as expected (only relevant if more than one file has been supplied).
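Several of these checks can be automated. The following is a minimal sketch covering the record count, the variables supplied, duplicate identifiers, and numeric and range checks. The function `verify_extract`, its parameters, the field names and the sample data are hypothetical; distribution and cross-field consistency checks are not covered because they depend on the dataset.

```python
import csv
import io
from collections import Counter


def verify_extract(rows, expected_count, expected_fields, id_field,
                   numeric_fields, ranges):
    """Return a list of problem descriptions; an empty list means no problems."""
    problems = []

    # Number of records received equals the number the provider extracted.
    if len(rows) != expected_count:
        problems.append(
            f"record count: expected {expected_count}, received {len(rows)}")

    # All requested variables are present, and no extras were supplied.
    received_fields = set(rows[0].keys()) if rows else set()
    missing = set(expected_fields) - received_fields
    extra = received_fields - set(expected_fields)
    if missing:
        problems.append(f"missing variables: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected variables: {sorted(extra)}")

    # No identifier appears more than once.
    id_counts = Counter(r.get(id_field) for r in rows)
    duplicated = sum(1 for c in id_counts.values() if c > 1)
    if duplicated:
        problems.append(f"{duplicated} identifier(s) appear more than once")

    # Numeric fields contain numbers within the agreed range, if one is given.
    for field in numeric_fields:
        low, high = ranges.get(field, (float("-inf"), float("inf")))
        bad = 0
        for r in rows:
            try:
                number = float(r.get(field, ""))
            except (TypeError, ValueError):
                bad += 1
                continue
            if not low <= number <= high:
                bad += 1
        if bad:
            problems.append(f"{field}: {bad} value(s) non-numeric or out of range")

    return problems


# Made-up extract with one duplicate identifier and two bad values.
sample = io.StringIO(
    "claim_id,age,payment\n"
    "A1,34,120.50\n"
    "A2,51,80.00\n"
    "A2,230,abc\n"
)
rows = list(csv.DictReader(sample))
problems = verify_extract(
    rows,
    expected_count=3,
    expected_fields=["claim_id", "age", "payment"],
    id_field="claim_id",
    numeric_fields=["age", "payment"],
    ranges={"age": (0, 120)},
)
for problem in problems:
    print(problem)
```

The messages report counts rather than identifiers or values, so they can be passed on without exposing individual records.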
4.3.4 Feedback to provider
If data validation succeeds, the data custodian is expected to inform the provider that the data transfer and validation were successful. If the data transfer or validation fails, the provider has to be informed about why it failed, and a new set of data will be requested. For privacy reasons (Privacy Act 1993, see section 2.3, above), when contacting the provider about a failure in transfer or validation, the data custodian should never disclose personal information, for example by stating that the record with IRD number XXXX has the payment field missing.
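One way to describe a validation failure without disclosing personal information is to report counts per field rather than individual records. This is a sketch of that idea only, not a documented Statistics NZ procedure; the function `summarise_failures` and the failure tuples are hypothetical.

```python
from collections import Counter


def summarise_failures(failures):
    """Turn (record_id, field, reason) tuples into lines with no identifiers."""
    # The record identifier is deliberately dropped before counting.
    counts = Counter((field, reason) for _record_id, field, reason in failures)
    return [
        f"{n} record(s): field '{field}' {reason}"
        for (field, reason), n in sorted(counts.items())
    ]


# Made-up validation failures; the identifiers are placeholders.
failures = [
    ("XXXX", "payment", "is missing"),
    ("YYYY", "payment", "is missing"),
    ("ZZZZ", "age", "is out of range"),
]
report = summarise_failures(failures)
for line in report:
    print(line)
```

The resulting lines tell the provider which fields to investigate while leaving out the identifiers of the affected records.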