Pseudonymisation Implementation Project (PIP) Reference Paper 4

(1)

Pseudonymisation Implementation Project (PIP)

Reference Paper 4

Pseudonymisation Technical White Paper

- Design and MS-SQL

(2)

Page 2 of 42 Final FV2 24/03/2010

Programme NPFIT Document Record ID Key Sub-Prog /

Project

Pseudonymisation Implementation

Project (PIP) NPFIT-GUIDANCE-4 Prog.

Director J Thorp Version FV2

Owner Status Final

Authors Wally Gowing/John Nickson Version Date 24/03/2010

Document Status:

This is a controlled document.

Whilst this document may be printed, the electronic version maintained in FileCM is the controlled copy. Any printed copies of the document are not controlled.

Related Documents:

These documents will provide additional information. Ref Doc Reference

Number

Title Version

1

NPFIT-FNT-TO-BPR-0022.01 PIP Implementation Guidance FV1.1

2

NPFIT-FNT-TO-BPR-0023.01 Reference Paper 1 - Terminology FV1.1

3

NPFIT-FNT-TO-BPR-0024.01 Reference Paper 2 – Business Processes and Safe Havens FV1.1 4

NPFIT-FNT-TO-BPR-0025.01 Reference Paper 3 – De-identification FV1.1

5 PIP Planning Template and Guidance1

6 TBA SQL Code Examples as a Standalone Set of SQL

(3)

Page 3 of 42 Final FV2 24/03/2010

1 Introduction

1.1 Purpose and scope

1.1.1 This paper provides guidance on technical aspects to support local implementation of de-identification for the Pseudonymisation Implementation Project (PIP) and links to the paper on de-identification (Ref 4).

1.1.2 The purpose of the paper is to:

■ Provide a design and development context for implementation of technical solutions

■ Set out basic principles for technical solutions

■ Provide a ‘toolkit’ for undertaking technical implementation

■ Provide sample code for enabling existing systems to be modified to facilitate use of pseudonymised data.

1.1.3 The focus of this paper is on practical approaches to the design of a system that allows the creation and maintenance of pseudonyms. This is in the context of local organisations that need to support a mix of primary and secondary uses of patient level data with data drawn from a range of sources. Whilst it is intended to be a standalone document for supporting technical implementation, references to other PIP guidance documents are included for readers to access additional relevant material.

1.1.4 The first two sections set the context for the technical implementation, whilst the later sections are aimed principally at technical staff involved in the implementation of solutions for de-identification. It should be noted that this paper is specifically targeted at NHS organisations and potentially their suppliers; it is not intended for more general consumption. The paper complements PIP Reference Paper 3 – “Guidance on De-Identification” (Ref 4). 1.1.5 It is also important that both the technical approach to implementation and associated

access controls are considered in the context of the business and associated Information Governance requirements that the application must support. Sections one and two of this paper are generic; sections three and four of this version support implementation for MS-SQL Server Users. An equivalent document is currently in production to provide an Oracle version of sections three and four.

1.1.6 The paper contains technical language and a multiplicity of acronyms, not all of which may be defined, but which are expected to be understood by NHS IT staff.

1.2 Context

1.2.1 The aim of de-identification through pseudonymisation is to allow data to be assembled and analysed at person level without the need to reveal identity. An effective pseudonym will destroy any structure within a relevant field which might allow it to be reconstituted or otherwise allow an individual to be identified (other than through a secure and pre-defined mechanism). However, the pseudonym must maintain sufficient information from the original identifiable text to provide a consistent basis for discriminating different cases and associating those cases which are the same.

1.2.2 The implementation of pseudonymisation is particularly useful when information crosses system boundaries and domains so that data is not being handled within a single security framework. This is common even within a single organisation, for example where data from multiple operational systems is gathered into a single data repository/ warehouse for secondary analysis. Outside the organisation, there may be a need to provide data to third parties who need to be able to distinguish individuals without the ability to identify them. 1.2.3 It should be added that this paper is targeted at users of medium sized databases and data

(5)

Page 5 of 42 Final FV2 24/03/2010

Services. While some reference is made to performance issues, these are of greater significance for very large databases (such as those at national level) and this may require some change to the relatively simple techniques identified here.

1.2.4 It cannot be emphasised too strongly that pseudonymisation is not an alternative to the maintenance of rigorous security across all the layers of the solution. A security failure at any one of these layers can mean failure for an entire solution; areas of concern include:

■ Security policies and human behaviour

■ Network security (firewalls, ports, encryption)

■ Operating system security

■ Server level security (endpoints, server level logins, ports, protocols, and other surface area configurations)

■ Database level security (granting rights to logins/roles/schemas, encryption options, and determining what rights are appropriate when)

■ All solution components (such as ETL, and for SQL sites whether using SQL Integration Services (SSIS) or otherwise, SQL Server database, SQL Analysis Server (SSAS) and products used to support the reporting layer (including third party products, SQL Report Services, MS-Excel etc)).

1.2.5 A full consideration of these issues is outside the scope of this document, although some useful references are identified in the bibliography. These references are not intended to be all encompassing.

1.2.6 This document provides guidance and suggested techniques and code that organisations can adopt to support pseudonymisation.

Code examples are provided to illustrate the narrative only and no responsibility is taken for the correctness or reliability of their operation; it is the absolute

responsibility of the user to test the functionality and operation of any proposed changes to the systems for which they are responsible and to ensure that such testing has been undertaken in respect of the systems they use.

(6)

Page 6 of 42 Final FV2 24/03/2010

2 The Design Process

2.1 Introduction

2.1.1 Few NHS providers or commissioners have substantial in-house capability to develop and sustain complex IT solutions. Where the need for change to support de-identification does exist, it is likely that organisations will look to third party solution suppliers, multi-organisation collaborations or in-house services to provide suitable depth and continuity of expertise. 2.1.2 With possible limitations of skills and knowledge in mind, the general approach to

implementing de-identification is to “keep it simple”. There is a significant risk of over-engineering that can be avoided through a careful review of requirements. In particular, a requirement to maintain data both in the original ‘clear’ identifiable form and within (or accessible from) a given database significantly increases both the functional requirements and the attention which must be given to the security infrastructure. The need for and scope of any such requirement should therefore be the subject of explicit review. The conclusions of the review on the approach to be taken will require sign-off by the relevant Senior Information Risk Officer (SIRO).

2.1.3 A second key aspect of the overall approach is the need to take a strategic view of where pseudonymisation fits with business needs and how this is best handled given the systems or services available to the organisation. The starting point is to stand back and clarify the requirements and then to review the options available to implement them.

2.1.4 A series of logical steps that should be followed are set out below. Whilst this list may seem lengthy, all have been included to act as an aide-memoire in preparing a solution. The list and the guidance are a means to an end and are not intended to generate unnecessary work or take up inordinate amounts of time.

2.1.5 It should be noted that the author of this text is versed in MS Windows and MS SQL. Other operating systems and software packages are available and my use different terminology. An Oracle specific version of Sections 3 and 4 is in preparation.

2.2 Step 1 – Establish the project

2.2.1 The existing Implementation Guidance (Ref 1) and the associated planning templates have already set out the need for a well defined, sponsored project to implement the changes required to effectively deal with Local NHS Data Usage and Governance for Secondary Uses.

2.2.2 The overall change involved in revising arrangements for data usage and governance and the introduction of de-identification techniques is a classic project of moving from point A to point B. Organisations must have established a formal project to achieve this and, once they have identified a sponsor and a team they can plan the process based on the 12 essential steps, identified in the previously submitted planning template and maturity model.2 In effect, the technical implementation should be a component of a wider implementation project.

2.3 Step 2 – Review the Business Requirements

2.3.1 Prior to the technical implementation stage of the project, the organisation should be clear about the requirements in relation to the use of patient data in both identifiable and de-identified forms in operating and undertaking its business and associated processes.

2.3.2 The default position should be that staff needing access to patient level data should be accessing de-identified data unless there is justifiable cause to access identifiable data. For instance is ‘risk stratification’ analysis to be undertaken to support targeting of primary care services? If so, identifiable data will only need to be made available to a known group of clinicians with legitimate right of access to that data for that purpose, those undertaking the

(7)

Page 7 of 42 Final FV2 24/03/2010

general analysis do not have such a right. Access to identifiable data should be documented and signed off by the organisation’s Caldicott Guardian.

2.3.3 As indicated in the PIP Planning Guidance, organisations must have policies and procedures in place for determining who has access to identifiable data and for what reason, together with a process of approval of such access by the time the revised access to identifiable data is implemented.

2.3.4 After the technical requirements have been developed, see below, the interaction of the business and technical requirements should be reviewed to ensure consistency and where possible seek simplification in the technical design.

2.4 Step 3 – Identify Technical Requirements

2.4.1 The technical requirements should derive from the business requirements. This section sets out a list of some of the issues to be considered when determining how de-identification and pseudonymisation will be implemented in light of those. While the checklist below is relatively long, the approach should always be to look for options that avoid the need for complexity and not all cases will apply to all organisations.

Step 3a - Identify flows of patient level data

2.4.2 The need to identify existing flows and review their purpose and content against business needs has been indicated in Section 2.3 and is considered in more detail in the guidance documents (Ref 1, Ref 2, Ref 3, Ref 4). Knowledge about the flows and their usage is the starting point for developing the technical requirements.

Step 3b – Identify and confirm who needs to access personal identifiers and why.

2.4.3 The complexity of the approach adopted to implement de-identification within a local reporting system will depend on:

■ Whether there is a need to access personal identifiers. Cases where the requirement is to de-identify data consistently with no requirement for re-identification are inherently simpler.

■ The drivers underlying those needs, since these have wide variation in impact.

■ The alternative approaches that are available to meeting those needs.

2.4.4 The examples in Figure 1 overleaf indicate some of the considerations that arise from different drivers which have come to the attention of the PIP Team. They bear directly on a number of key considerations:

■ Who does the need to see data in identifiable form apply to?

■ In what circumstances does the need arise and what is the requirement for re-identification?

■ Which identifiers (i.e. identifiable data items) need to be accessible as identifiable data?

■ How up to date does the reporting system need to be and what are the implications of this for the volume of data to be processed and the time available for processing? 2.4.5 The entries in Figure 1 refer to supporting tables which, to aid readability, have been grouped

(8)

Page 8 of 42 Final FV2 24/03/2010

Figure 1 - Sample Use of identifiable data

Sample use Driver

Weaknesses in operational systems in provider

organisations

As well as supporting secondary use, the reporting solution supports the front line of delivery of care by providing access to locally developed reports which are not available from the operational system. Organisations should consider weaknesses that existing within current operational systems3

Identification of

individuals “at risk” Analysis of the population using personal data drawn from primary care, interventions by primary care interaction with secondary care and other sources to identify persons where intervention will reduce the risk of transition to a worsened state of health4

Provision of apparent secondary use data to General Practice

To meet the requirement from GPs to be able to identify data relating to those patients with whom they have a legitimate

relationship so that they can check data quality and cross reference to their own expectations around the delivery of care5_{– e.g.}

whether activity recorded in secondary care was consistent with expectations.

Support for clinical

audit To enable detailed audit (e.g. by pulling patient records) following analysis of a large sample of data6(including linked records). To enable data

quality issues to be addressed

To enable primary sources to be checked when analysis of the data indicates inconsistencies. This includes checking anomalous cases which are identified in audit validation7

To support tracing To support the use of tracing services to confirm NHS Number status, practice and PCT. However, please note that the Secondary Uses Service now includes practice and PCT derivation against NHS Number via PDS, avoiding the need to trace to confirm these items. Maintaining data in identifiable form for tracing purposes rather than making use of this embedded derivation will therefore require particular justification8

Spatial analysis To allow the allocation of data to geographic areas to support mapping and other spatial analysis9_.

Linking diverse data

sources A key reason for receiving data in identifiable form rather than pre-pseudonymised is the need to be able to link diverse datasets. Unknown future

requirements

Attempting to cater for “unknown unknowns” is not an acceptable justification for maintaining personal data or building functionality with no obvious rationale. However, two specific considerations are mentioned below, these relate to the proxy personal identifiers of data of birth and postcode. In the first case, there may be a case for retaining the ability to access the date in identifiable form when new events are to be added and “age at event” is to be maintained. In the case of postcode there may be a need to ensure the ability to map to revised 2011 Census boundaries to support analysis in Public Health. It is to be emphasised that these cases should be the subject of specific consideration and justification.

3_{Table 1} 4_{Table 2} 5_{Table 3} 6_{Table 4} 7_{Table 5} 8_{Table 6} 9_{Table 7}

(9)

Page 9 of 42 Final FV2 24/03/2010

Step 3c - Identify any sensitive data flows

2.4.6 Does any data require special treatment because of its sensitivity? For example, sexual health, GUM or addiction services may all require special treatment to avoid identification under any scenario or to ensure that there is no possibility of visibility outside the New Safe Haven. This is a particular concern in an environment where there is a stated requirement for patients associated with other datasets to be identifiable to a sub-set of users.

2.4.7 The simplest and safest approach is to anonymise such cases completely or not to publish them in data structures visible outside the environment of the New Safe Haven. It is also possible to identify and implement approaches which deliver internal linkage within the sensitive dataset based on non-reversible pseudonyms that are unique to that set.

2.4.8 The existence of a true need to go beyond this adds significant complexity and requires a very robust solution. Specific issues include:

■ Strengthened authentication for any users able to see data in identifiable form

■ The ability to maintain multiple pseudonymisations.

■ Encryption requirements

■ The stage of processing where de-identification must take place

■ Targeting of audit functionality.

Step 3d - Identify whether there is a need to support multiple pseudonymisations

2.4.9 Applying different pseudonymisations to the same data is a way of segmenting the ability to link it and of reducing the risk that pseudonymisation will be broken by the creation of maps and the potential impact of any such breach. It is a major design consideration in the design of national systems which, for example, are intended to meet the needs of researchers. 2.4.10 However, it adds complication to design and will not normally be a requirement for a single

organisation10_{. Nonetheless, it is important that any requirement is identified as part of the} review because such requirements are easier to incorporate into an initial solution than to retrofit.

2.4.11 Any requirement to maintain multiple pseudonyms is most likely to be associated with the need to handle especially sensitive data or from the need to send data to third parties. It is important to recognise that these requirements are likely to differ in detail:

■ The first relates to the need to pseudonymise on or prior to input to the reporting solution, so that there is no point of contact with the data of special concern and other data where linkage might disclose sensitive information.

■ The second case relates to output pseudonymisation. While this can be handled on a one off basis using the techniques set out below, the creation of general functionality is significantly more complex and outside the scope of this paper.11

Step 3e – Identify the data derivations that need to be supported

2.4.12 This issue relates to the two proxy personal identifiers: date and postcode. In these cases the primary approach to de-identification is abstraction by derivations which reduce information content, although the maintenance of true pseudonyms is relevant where the data is to be used to support record matching. The first requirement is therefore to identify which derivations are required and ensure that these are pre-calculated.

10_{Later discussion identifies the case for maintaining distinct root and output pseudonyms. The comment here relates to the}

need to maintain many output pseudonyms.

11_{For example, it not only involves the implementation and maintenance of some form of key store but consideration must be}

(10)

Page 10 of 42 Final FV2 24/03/2010

2.4.13 Otherwise, the strategic issue relates to whether there is a need to maintain the identifiable form data in full once derivations have been undertaken and how, if maintained, these will be kept secure.

2.4.14 It also needs to be recognised that some derivations involve such a low level of abstraction that, if maintained, they have the potential to act as effective proxy identifiers and should be subject to the same degree of security and confidentiality. Specific cases include:

■ Grid references and other geocodes

■ Census output areas. (For the 2001 Census, the minimum OA size was 40 resident households and 100 resident persons.)

■ Age derivations at the level of days, weeks or months.

2.4.15 As far as the potential need to maintain full postcodes is concerned, the next Census is due on 27 March 2011. While it is intended that Super Output and Output Areas will remain

relatively stable some change will be inevitable and some users/uses may need to remap data to match these.

Step 3f – Identify and address data quality issues which would give inconsistent pseudonymisation

2.4.16 Inconsistently formatted data will generate inconsistent pseudonyms and this is discussed in detail in section 3.5.4

2.4.17 The need is therefore twofold:

■ Identify issues with historic data which must be addressed before and while implementing pseudonymisation.

■ Identify current risks and issues and ensure that these are handled during processing on a continuing basis.

2.4.18 Sample code to address some of the above issues is included within the Appendices to this paper.

2.5 Step 4 – Review the Options

2.5.1 As indicated in Section 2.3, business and technical requirements should be reviewed with the aim of ensuring that the approach adopted offers the most effective way to meet requirements within the required timescale (which will inevitably be the simplest).

2.5.2 Cases for specific consideration include:

■ Is the reporting system being asked to maintain information which should and could be managed by operational systems? A case in point is the identification of individuals at risk who are identified by analysis of large populations. Once the cases have been identified, the better and simpler approach is to identify the individuals within the reporting system through a process of specific de-identification and then undertake further management through the use of an operational patient system (e.g. via a virtual

(11)

Page 11 of 42 Final FV2 24/03/2010

ward12). The impact of adopting this approach is to significantly reduce the requirements the reporting system must support.

■ Can any of the requirements be better addressed by enhancing operational systems? This is particularly relevant where the reporting system is being stretched to encompass identifiable data to address weaknesses in the reporting functionality of operational systems. Again the impact of adopting this approach will be to significantly reduce requirements around the reporting system.

■ Requirements which involve the need to support cross-linkage of less sensitive and especially sensitive data flows or the de-identification of the latter require specific justification.

2.6 Step 5 – Determine the technical solution

2.6.1 The key need is to develop and document revisions to the existing systems and process design in light of the technical requirements, piloting if necessary, even where the proposed changes appear minor. The development of the technical solution should be used as an opportunity to check that the solution fully meets security requirements and should not be limited to the implementation of pseudonymisation alone.

2.6.2 Areas that should receive some consideration include the following.

Structure

2.6.3 SQL Server and Oracle have a wide range of sophisticated data management facilities and tools (eg Analysis services, Internet Information Services). The structure of the revised overall solution needs to utilise the benefits of these facilities and tools in a coherent fashion.

Security model

2.6.4 The security facilities inherent in SQL Server and Oracle need to be used as the basis of the overall security model in enabling different users to access the functionality and data relevant to them and only them.

2.6.5 In addition to this need to assure the overall security of the solution at all layers, specific consideration should be given to the following issues:

■ Which additional objects and principals (groups and users) are to be created to support it and what are the rights of each>

■ Is encryption required and if so of which fields?

■ How are input data (e.g. .csv files) to be secured?

■ How will the value of salts13 (see below) and keys be protected?

Data quality checks

2.6.6 The need to undertake data discovery and address data quality issues both at initial implementation and during load processing are discussed in detail in following sections below.

■ Inconsistently formatted data will cause different pseudonyms to be created.

■ Pseudonymisation may obscure data quality issues. They therefore need to be identified and rectified and/or flagged prior to pseudonymisation.

12_{See en.wikipedia.org/wiki/Virtual_Wards}

13_{a ‘salt’ comprises random bits that are used as one of the inputs to a key derivation function; a ‘salt’ can also be used as a part}

(12)

Page 12 of 42 Final FV2 24/03/2010

Specific and General Audit

2.6.7 As indicated in PIP Reference Paper 3, logs should be kept of access to identifiable data for audit purposes. Some of these logging and audit requirements may be met by features intrinsic to the software (eg SQL Server or Oracle) or additional logs (e.g. for use of any re-identification facility) may be required, or possibly combinations of both. A log of key events, such as an explicit call to a routine to undertake de-identification is a particularly valuable mechanism to ensure that Information Governance requirements are not being breached.

Robustness of ETL and operational processes

2.6.8 The instance of process failure or database corruption may be more severe when data is pseudonymised because of the absolute importance of maintaining referential integrity. There is a need therefore to review existing processes for robustness and ensure that they follow best practice:

■ Are ETL processes sufficiently robust? What risks are associated with the new processes introduced to support pseudonymisation? – e.g. new risks that an attempt will be made to implement a duplicate value on a primary key, causing processes to fail. How will the design avoid these and what additional checks are required?

■ Is there a need to define transactions, improve error checking14_{and/or strengthen} update logs?

■ Is there a need to review the approach to database logging which may have consequent impact on the management of transaction logs?

■ Is there a need to review policy around the back-up of cryptographic keys and certificates to avoid data loss through the corruption of either? How will application defined pseudo-keys such as salts be maintained and protected?

■ Is there a need to review operational processes around routine database integrity checking?15

■ Is there a need to change operational procedures in light of the above?

Features required to support transition

2.6.9 There is a need to identify the facilities required to support the transition from current processes and operations with identifiable data to operating with both de-identified and identifiable data:

■ Are development and test environments available?

■ Is there an intention to support dual running and what are the implications of this for the solution?

■ Is there a need to be able to change certificates, keys and salts between test and production versions?

■ Is it intended to undertake phased implementation and what are the implications of this?

■ Is the infrastructure (and particularly disk space) sufficient to support the processing required to restructure the database, should this be required?

2.7 Step 6 – Plans

2.7.1 As part of a wider project, the technical implementation must be planned appropriately. An important component is ensuring that there are comprehensive test plans in order to

14_{A further reason for using SQL Server 2005 or later is the implementation of the TRY..CATCH construction to allow improved}

error checking in T-SQL processes

15_{For users migrating from SQL Server 2000 this includes a decision on whether to move to CHECKSUM rather than}

(13)

Page 13 of 42 Final FV2 24/03/2010

minimise risk, including some form of dual running to check that outputs are consistent using the different forms of data.

2.8 Step 7 – Review the risks

2.8.1 The penultimate stage is to review the risks that arise from the proposed design and approach to implementation and ensure that they are mitigated and controlled.

2.8.2 Key items for consideration include:

■ Risks of irretrievable data loss – inconsistent pseudonymisation, incomplete update processes

■ Failure to maintain pseudonymisation – e.g. Mapping revealed

■ Unintended release of identifiable data – failures in security model e.g. through accidental grant of implicit permissions, source data not secured, transient tables not cleared etc.

■ Source data not secured.

2.9 Step 8 – Iterate requirements, risks and the technical solution

2.9.1 The final stage involves iterating through the above with the aim of ensuring the simplest and most robust approach to providing a technical solution that meets the business requirements.

2.10 Tables in support of Figure 1

Table 1 - Weaknesses in provider operational systems

Relevant data items  All – typically including patient name Scope of patients who may need to

be identifiable  All or large sub-sets Access to identifiable information  On-line

Number of users needing to access

identifiable information  Many

Update interval  Frequent - at least overnight

Identifiable data retention period16 _{ Short – generally less than 12 months from last event}

Key audit event  Any access

Table 2 - Identification of Individuals at Risk

Relevant data items  All – patient name, address and phone number will need to be accessed to support intervention Scope of patients who may need to

be identifiable  Those identified as cases of interest only

Access to identifiable information  By re-identification of small subset of individuals defined through analysis of pseudonymised data

identifiable information  Few

Update interval  Weekly update for initial analysis

Identifiable data retention period  Medium – last three years (may be longer, eg for smoking)

Key audit event  Re-identification

16_{Identifiable data retention period relates to typical periods for which such data needs to be readily available for}

(14)

Page 14 of 42 Final FV2 24/03/2010

Table 3 - Secondary data to General Practice

Relevant data items  All – typically including patient name Scope of patients who may need to

be identifiable  All with a relationship to a given GP Practice, subject to any cases (implicit or explicit) where the patient would expect information to be withheld (e.g. some sexual health interventions)

Access to identifiable information  On-line (potentially via web) or via distribution of formatted data reports (which should be encrypted) Number of users needing to access

identifiable information  GPs and identified individuals in GP Practices in respect of defined subsets

Update interval  Weekly update17

Identifiable data retention period  Short – generally less than 12 months from last event

Key audit event  Any access

Table 4 - Clinical Audit

Relevant data items  Local Patient Identifier + DoB (to confirm) Scope of patients who may need to

Access to identifiable information  By re-identification of small subset of individuals defined through analysis of pseudonymised data

identifiable information  Few –typically restricted to clinical audit team, named individuals in individuals + some clinicians

Update interval  Weekly update

Identifiable data retention period  Need to identify individuals will not usually extend beyond those recently receiving care, though the requirement is to link patient events over an extended period.

Key audit event  Re-identification

Table 5 - Data Quality

Relevant data items  Local Patient Identifier + DoB (to confirm) + any personal identifiers which are subject to query; potentially SUS spell-id and pathway identifier Scope of patients who may need to

Access to identifiable information  By re-identification of small subset of individuals defined through analysis of pseudonymised data.

 Or access to clear data within New Safe Haven Number of users needing to access

identifiable information  Few – named individuals in information departments and/or medical records.

Identifiable data retention period  Need to identify individuals will not usually extend beyond those recently receiving care, though the requirement is to link patient events over an extended period.

Key audit event  Re-identification or access to New Safe Haven Comment:  Good practice is to centralise Data Quality reporting

within the New Safe Haven.

17_{Cases where a local reporting solution is used to provide operation data – such as discharge letters or results – to GPs are}

(15)

Page 15 of 42 Final FV2 24/03/2010

Table 6 –– Tracing and validation of Practice/PCT

Relevant data items  NHS Number + DoB + Gender Scope of patients who may need to

be identifiable  All

Access to identifiable information  Visibility not required except for a small subset of cases requiring manual resolution

identifiable information  Few – named individuals in information departments and/or medical records Update interval  Monthly update perhaps weekly

Identifiable data retention period  Data requirement is transient and information could be cleared after tracing process is complete

Key audit event  Re-identification where required to support manual resolution

Comment:  Assumes access to NSTS-like tracing functionality via PAR. There is the potential to use the PDS derivations with SUS to validate practice and PCT for SUS data.

Table 7 - Spatial analysis

Relevant data items  Postcode & geocode derivation Scope of patients who may need to

be identifiable  All

Access to identifiable information  Not required to be visible as potentially identifiable other than when point mapping at relatively high resolution is required and in some cases with regard to

communicable disease Number of users needing to access

identifiable information  None at a level which is personally identifiable, except in the special case where mapping is used to support communicable disease tracing.

Identifiable data retention period  Extended for postcode

(16)

Page 16 of 42 Final FV2 24/03/2010

3 System Design – MS SQL Server

3.1 Introduction

3.1.1 The focus of this paper is on practical approaches to the creation and maintenance of pseudonyms in the context of local organisations that need to support a mix and primary secondary uses with data drawn from a range of sources; the discussion does not extend to more theoretical discussion as part of a wider whole systems architecture18. The discussion encompasses two cases:

■ Where there is no need to maintain personal identifiers to support analysis, but where there is a need to assemble person level data from a range of disparate sources which require linkage at person level.

■ Cases where local data is accessed by a mix of users, some of whom have a legitimate interest in personally identifiable data but where others do not. (An example of the first group might include people identified as needing intervention by a PCT analysis of people at risk of readmission to hospital.)

3.1.2 Native encryption and hashing technologies were added to MS SQL Server as part of SQL Server 2005 and form an important element in the pseudonymisation toolkit. SQL Server 2005 also saw the implementation of User/Schema separation (discussed further below) which strengthens the ability to secure systems which must meet the needs of a range of users. For these reasons:

■ The discussion which follows does not apply to SQL Server 2000 and earlier products.

■ Where there is a confirmed need to support pseudonymisation (which will be the norm), users who are currently using SQL Server 2000 or earlier should consider migrating to a more recent version of the product.

3.1.3 The process of migration from SQL Server 2000 typically involves a close consideration of database structure and processes. Users planning to migrate from SQL Server 2000 should consider whether to use this as an opportunity to implement pseudonymisation as part of the implementation process.

3.1.4 An issue has been raised about the mechanisms available to support pseudonymisation with MS-Access. There are a number of considerations:

■ MS-Access does not in itself offer an adequately secure environment and should not be used to maintain clear data.

■ The only exception is where its use is to support the analysis of small datasets on a fully encrypted machine. In that case, the data which is maintained in clear should be restricted strictly on a need to know basis – for example, to date of birth, where there is a requirement to support multiple age related derivations, without the automatic assumption that other sensitive fields should be visible in clear.

■ In addition, MS-Access lacks the backup, recovery and transaction logging functionality available in MS-SQL.

■ Where users are seeking to pseudonymise the static content of an existing MS-Access database, or clear data maintained in other office products it is possible to use MS-SQL Server express as an engine to do so using the techniques set out below. An example will be found in Appendix 3.

(17)

Page 17 of 42 Final FV2 24/03/2010

■ While SQL Server Express supports column encryption, the fact that encryption keys are normally maintained within the SQL server environment means that physical security remains and issue.19

3.2 Code Samples – Summary of Testing Approach

3.2.1 A number of approaches have been taken to testing of the proposed pseudonymisation function:

■ for postcode, the PAF has been pseudonymised (2.4m records), both on the desktop and by WMCSS

■ For NHS Number an initial check was made on approx 24,000 CFH synthesised test cases, with WMCSS checking a further 5m+ cases

■ Dates – a table has been pre-populated with pseudonymised look-up from 1890 to 2040 and checked

3.2.2 A key consideration in checking the proposed approach is that, when correctly implemented the table which contains the cross reference between pseudonym and clear text, provides a rigorous check on the integrity of the approach. Key points of failure would be:

■ The routine generates duplicates pseudonyms for differing inputs

■ The routine generates more than one pseudonym for the same input

■ Data is lost

3.2.3 The cross reference tables which link clear data and pseudonym have a primary key defined on the root key (so that it cannot contain duplicates or be null) and unique indexes declared on the public (group or output) pseudonyms (so that these cannot contain repeated values, though in this case, for the reasons set out in the note, a null is allowed. So, if:

■ The table has been successfully populated, and

■ It has the same count is the input (once repeating values have been removed from the input) +1 (to allow for the NULL case which is added explicitly when the table is first created), and

■ The clear values in that table match the input set, then

■ Data integrity has been maintained and the pseudonyms have been created successfully.

These conditions were met in all cases.

3.2.4 Further checks were undertaken on format and length.

3.2.5 The other risk around any process of pseudonymisation is that the input can be derived from the distribution of the pseudonymised values. Appendix 2 Figure 10 contains a check on the distribution of 1M numbers when subjected to the proposed pseudonymisation approach. This gives some reassurance, as does the known characteristics of a cryptographic hash function, which provides the basis of the approach.

3.2.6 The fact that the initial application of the algorithm for a short form pseudonym produces few clashes means that the process of resolving them (i.e. taking further passes through the data) to remove them is relatively simple. The need to ensure that there are no duplicates in these cases is now emphasised in the text as well as being incorporated in the code (the latter has always been the case). The checks on the production of short form pseudonyms set out above were applied to this code.

19_{There are additional mechanisms available to mitigate this risk, particularly under SQL Server 2008 which provides support for}

(18)

Page 18 of 42 Final FV2 24/03/2010

3.2.7 That said, it is clearly critical that any user implementing pseudonymisation tests rigorously. The paper documents a wide variety of cases – particularly where data quality is poor - which can derail the process if not taken into account and this much be checked as part of the specific solution.

3.3 Design

issues

3.3.1 The need to support de-identification and to derive and assign unique short-form pseudonyms implies the need to maintain clear data somewhere within the application. This data must be maintained in a demonstrably secure environment.

3.3.2 This leads to the general principle that personal data should be maintained in clear values in one, and only one, location20. Adopting this approach provides simplicity and clarity, in that there is only one source of the clear data to control. While design constraints may result in the duplication of physical tables between a relational staging area and dimensional tables in SQL Server Analysis Server the designdeisgn should ensure that these operate as a single virtual table.21.

3.3.3 The only exception is the maintenance of transient data as part of load processing, where access should be tightly limited to those directly responsible for maintaining the data.22

3.3.4 The diagram below shows the resulting impact on design. Although the approach is a common to the maintenance of any pseudonymised value, the discussion below is couched in terms of the NHS number.

20_{This statement is made for simplicity, It is accepted that the technical needs of clustering, disaster recovery etc may make this}

a single logical location

21_{. Note also the comments in relation to key length later in this table.}

22_{The security and confidentiality of the source extracts, backups and the use of temporary storage to process data are also}

relevant issues. Root pseudonym (Primary Key) Clear value (encrypted as required) Public pseudonym(s) 1-n IsNull flag

Data quality indicators Associated data… Last modified date.. etc

Dimensional design (FACT)

Normalised Transactional table – e.g. Finished General Consultant Episode Primary key Surrogate key Foreign key

Code examples are provided to illustrate the narrative only and no responsibility is taken for the correctness or reliability of their operation; it is the absolute responsibility of the user to test the functionality and operation of any proposed changes to the systems for which they are responsible and to ensure that such testing has been undertaken in respect of the systems they use.

(19)

Page 19 of 42 Final FV2 24/03/2010

Figure 2 Pseudonymisation - design

3.3.5 As shown in Figure 2, a pseudonymisation master table:

■ Holds root pseudonym. The root pseudonym is used for internal linkage only and never exposed to users. It will be the primary key of the master table and will act as foreign key to other tables needing to recover a public pseudonym or clear values. In a dimensional model it acts as a surrogate key to link dimension and FACT tables.

■ Maintains the clear value of the sensitive data, which may or may not be encrypted.

■ Maintain one or more public (or output) pseudonyms23 . These are the pseudonyms which are exposed to the users who need access to a pseudonym; the ability to maintain and hold multiple public pseudonyms allows different pseudonymisations to be applied in different contexts.24

3.3.6 There are strong reasons for not using the root key as a public pseudonym as, if compromised (e.g. by the creation of an external map between it and the clear text), the only option is to rebuild database tables in a way which changes the values of the key. In SQL Server, there are also design constraints which mitigate the use of the randomised and relatively long values which are appropriate to an exposed pseudonym as primary / surrogate keys. These are discussed at greater length below.

3.3.7 The length and format of public pseudonyms is constrained by the fact that they may need to be:

■ shared between users to support the identification of a problem case without the need to refer to personally identifiable information;

■ shown on reports without impacting adversely on existing formats;

■ need to be handled by existing systems.

These factors imply that public pseudonyms should be consistent in length with corresponding clear text if at all possible (“short-form pseudonyms”).

3.3.8 As an internal value, the format and length of the root key is not a concern to users. However, where the root pseudonym will be used within a dimensional approach, for example within SQL Server Analysis Server, design considerations indicate that it should be as short as possible..

3.3.9 Other data elements within the pseudonymisation master table will be determined by the design and may include:

■ An IsNull flag. This relates to the handling of records with NULL values for the clear data and this is discussed below.

■ The table may also maintain data quality indicators – for example, in the case of the NHS number this might include a flag to confirm that the input, clear, NHS number passed the modulo 11 check, and a consistency flag (set to fail if inconsistent data is identified on records for the same NHS number – e.g. inconsistent dates of birth).

■ Subject to local business requirements and system design, the table may also include other data relevant to the grain and dimension – for example, there may be a case for extending the NHS Number pseudonymisation master table to include date of birth25, to support consistency checking and for maintaining the most recent GP practice code associated with the patient to support access control.

23_{also known as Group Pseudonyms in the context of SUS}

24_{The above assumes the need to support a limited number of public pseudonymisations – where the total number of}

pseudonymisations is not limited a normalised structure may be required

25_{In this case, allowance must be made for possible inconsistencies in the date of birth – for example by taking the latest}

(20)

Page 20 of 42 Final FV2 24/03/2010

3.3.10 Simple algorithms for deriving pseudonyms are discussed below. However, there are three broad approaches:

■ Check the incoming value against the data maintained in clear and assign a new value if the incoming value is new, assigning the next sequence number given by the identity function as the root pseudonym for the new value.

■ Derive a pseudonym directly using a cryptographic hash.

■ Derive a short form pseudonym using one of the approaches set out below. 3.3.11 The first two of these are most relevant to the derivation of a root pseudonym:

■ A particular risk with using an identity function is that the data to which it is applied is sequenced in some way which imparts meaning to the root key – the generation of a date pseudonymisation master table and the initial load and creation of master tables are obvious examples.

■ The length of a cryptographic hash (16-20 bytes) limits its direct usefulness where the result will be used as a surrogate key. Within SQL Server, a long random value is also a poor candidate as a clustered index and can give rise to poor performance on update and significant inflation in table size. However, the pseudo-random relationship of the hashed output to the clear text, together with the fact that it can be derived directly from the source can be useful as an intermediate step when generating root pseudonyms.

For these reasons, the practical examples discussed in the next section tend to use a combination of both methods. .

3.3.12 It is to be emphasised that where an existing system design already supports entities (such as a person dimension) which can be developed to support pseudonymisation with limited modification, where processes to maintain surrogate keys are already in place and where keys are populated, nothing in this document should be taken as indicating that there is a need to rebuild the existing database using a new set of keys. Rather the approach should be to:

■ Ensure that the keys which act as root pseudonym are not exposed to users26_;

■ Implement public pseudonyms as additional attributes. 3.3.13 In these circumstances, effort should be directed to:

■ Ensuring that the root pseudonym is never exposed to users and is protected by design and strong access controls.

■ Implementing support for one or more public pseudonyms, where users need access to a pseudonym.

■ Implementing rigorously controlled processes to support de-pseudonymisation where de-pseudonymisation functionality is required.

3.4 Pseudonymisation

methods

3.4.1 A number of approaches have been considered as mechanisms for generating pseudonyms. These are summarised below; more detail will be found in the appendix.

Cryptographic hash functions

3.4.2 A cryptographic hash function maps strings of arbitrary length to strings of fixed length so that it is computationally infeasible:

■ to enable the source text to discovered (pre-image resistance)

26_{Accepting that technical staff may need to use the root key to build efficient new reports. Any such work on live data should}

(21)

Page 21 of 42 Final FV2 24/03/2010

■ to discover an alternative source input to be discovered by inspecting the output which would give rise to the same output. (second pre-image resistance)

■ to find any two distinct inputs will give rise to the same output (collision resistance). 3.4.3 A number of algorithms have been developed aimed at meeting these requirements, notably

MD4, MD5, SHA-1 and SHA-227. SQL Server currently supports all but SHA-2.

3.4.4 Although the application of a given algorithm to the same clear text will give the same result, alternative pseudonymisations can be generated by the application of a constant (or salt) to the input stream prior to hashing.

3.4.5 The attractions of a cryptographic hash as a mechanism for pseudonymisation can be summarised as follows:

■ The length of the output is fixed and known in advance.

■ The function is relatively fast in its operation (see the discussion below).

■ Since the function produces a definitive result, it can be used to obtain the pseudonym algorithmically without the need to first check for an existing pseudonym in a pseudonymisation master table, though the need to maintain entries in a master table remains if there is a need to support de-pseudonymisation.

■ Because the pseudonym can be re-derived from source data, providing only that the algorithm in use and the salt are known, it offers additional recovery mechanisms if the event of data corruption.

3.4.6 Because the output of a hash function is stable for a given input, salt and platform a cryptographic hash can provide a direct method of creating a pseudonym without reference to a look-up table, though such a table is required to support re-identification.

3.4.7 The primary disadvantage of using a cryptographic hash lies in the length of the result, which depending on the algorithm in use is 17 or 20 bytes (i.e. 34 or 40 characters when expressed as hexadecimal). This is not consistent with a requirement for a public pseudonym of the same length as the clear text being pseudonymised and also gives rise to issues around physical design and the suitability of the result for use as a primary key.

Random sampling without replacement

3.4.8 Each clear value is uniquely associated with a random number.28 The process:

■ First checks the pseudonymisation master table to find whether a pseudonym has previously been identified against the clear text

 If a pseudonym is found, that value replaces the clear text in the input record.

■ If there is no entry in the pseudonymisation master table,

 a new pseudonym is required and is either taken from a pre-populated list of unused values

 or requested from a random number generator; the value checked to see

whether it is already in use and the process repeated until an unused value is returned.

 The new value is updated to the pseudonymisation master table together with the clear text

 And the new pseudonym replaces the clear text in the input record.

27_{SHA-3 is currently under development.}

(22)

Page 22 of 42 Final FV2 24/03/2010

3.4.9 The advantage of the approach is that it can support the generation of pseudonyms of any length.

3.4.10 The key issue is the mechanism used to produce random numbers and a comparison of the approaches available in SQL Server is set out in Appendix 2, Figure 9.

3.4.11 Finally, special considerations apply where the set of data to be pseudonymised has a defined set of occurrences – these include dates and postcodes. There are grounds for pre-populating tables of pseudonyms in these cases, ensuring that the pseudonyms are drawn from a very wide range of random values.

Partial extraction from a cryptographic hash

3.4.12 This approach contains some of the features of the previous two approaches and provides a mechanism for producing a reduced length pseudonym from a cryptographic hash function, though with a non-zero probability that the candidate pseudonym will clash with a value created by a different input. For this reason, a two-step approach is required:

■ A candidate pseudonym is derived.

■ There is then a need to check and resolve potential clashes in a way similar to that described above when a random number is used.

That said, the probability of a clash typically varies from low to very low, allowing the implementation of efficient mechanisms to undertake the second step.

3.4.13 The output of the process is pseudo-random; see Appendix A for details.

3.4.14 The approach uses the fact that SQL server will convert a substring of up to 8 bytes taken from a binary (varbinary) field, such as that produced by the cryptographic hash function, and convert it to a corresponding integer (8 byte bigint) value, giving a candidate pseudonym of reduced length. The length of the pseudonym can be reduced further if necessary by expressing the integer to Base16 (hexadecimal) or to Base32.

3.4.15 The size of the population from which the candidate pseudonym is drawn will depend on the number of bytes extracted but can be very large. Details are given in the table below:29. 3.4.16 In consequence the power of the approach to obfuscate the data is high and the numbers of

clashes which are likely to be generated are few. In the following cases, the number of bytes extracted was dictated by the need to provide an output of the same length as the clear text input (when expressed to Base32):

■ Pseudonymising the 2.4m entries in the postcode directory and extracting a 5 byte integer gave rise to a minimum of 2 and a maximum of 3 clashes.

■ Pseudonymising over 5 million NHS numbers30_{and extracting a 6 byte integer gave} rise to no clashes.

29_{Negative numbers are held as a complement. As a result, if a full 8 bytes are extracted approximately half of the numbers}

returned will be negative. Because of the way SQL Server pads shorter fields fewer bytes extracted and expressed as a bigint will always return a positive value.

(23)

Page 23 of 42 Final FV2 24/03/2010

Bytes Minimum

value

Maximum value Maximum

length integer Length Base 16 Length Base 32 Comment Held as Bigint 4 0 2^32-1 _{(4,294,967,295)} 10 8 7 5 0 2^40-1

(1,099,511,627,775) 13 10 8 Useful for postcode

6 0 2^48-1 (281,474,976,710,655) 15 12 10 Useful for NHS# 7 0 2^56-1 (72,057,594,037,927,935) 17 14 12 8 -2^63 2^63-1 (9,223,372,036,854,775,807) _{20 16}₁₄*

May give rise to negative integers because negatives held as complements Held as integer 4 -2^31 2^31-1 (2,147,483,647) _{10 8}₇*

May give rise to negative integers because negatives held as complements Table 8 - Range and length of extracted varbinary data

3.4.17 The approach was found to be the most effective approach to the production of short pseudonyms:

■ It is relatively simple to code

■ It is relatively fast

■ Multiple pseudonymisations can be generated by specifying different salt values for the hash function.

■ Because clashes are few they can be quickly resolved which has the advantage that a near definitive output can be obtained from a given input

3.4.18 Examples of the SQL code required to support the use of this approach will be found in the code examples.

Encryption as a mechanism for pseudonymisation

3.4.19 In principle, it is possible to use a simple stream encryption as a basis for pseudonymisation, with the encrypted value acting as the pseudonym and de-pseudonymisation being effected by decryption.

3.4.20 The use of this approach is, however, deprecated, not least because the native functionality (which involves the use of RC4 and RC4_128 algorithms) which would support it has been deprecated in SQL server and will be removed by Microsoft some time in the future.

3.4.21 Other ciphers are implemented by a form of double encryption to ensure that the encrypted value of a given clear text varies from case to case. This has two consequences:

■ Encryption is not a suitable basis for the direct algorithmic creation of pseudonyms.

■ Indexes are not an effective mechanism to improve performance on the encrypted field as the output is not determinate.

3.4.22 While it is possible to sidestep these constraints to some degree by using a lookup process to store the first occurrence of the encrypted value and then use this as a pseudonym, this approach is not recommended:

■ The process generates a very long pseudonym of indeterminate length – for example the generated pseudonym for a 10 character NHS number is 20 bytes using an SHA-1 hash algorithm and between 66 bytes using AES-128 or AES-256. This is wasteful of

(24)

Page 24 of 42 Final FV2 24/03/2010

storage and likely to give rise to less efficient processing than the use of a hash-key which is shorter in length.

■ Efficient processing of both updates and patient specific searches requires either that the look-up table is supplemented by the use of a hash function or that the source data is maintained unencrypted on the look-up table31_.

Special case – dates and other restricted sets

3.4.23 When the range of potential values is relatively small and bounded, the key requirement is to ensure that the pseudonyms in use are drawn from a wide set of potential values.

3.4.24 Note there is a difference between the handling of dates and postcodes:

■ Dates are static and the list of potential pseudonyms can be created as a one off solution.

■ Postcodes are relatively static but not completely so. While it is possible to pseudonymise the whole PAF file, allowance must be made for the fact that the Post Office routinely add 4-5,000 new postcodes a month32_{, so that incoming data must be} routinely checked to ensure that the new additions are made to the reference table of postcode pseudonyms as required..

3.5 Other

considerations

Null values

3.5.1 Special consideration needs to be given to the pseudonymised representation of NULL values within the clear text. The requirements can be summarised as follows:

■ It should be possible to support joins between the pseudonymisation master table containing the root pseudonymisation and associated fields to other tables on the root pseudonym using a simple inner join. This implies that record should be included in the pseudonymisation master table with a value for the root pseudonym to cover the case where the input clear value is NULL.

■ It should be possible to distinguish these cases from those where the input value is populated. This implies pre-populating the root pseudonym with a special value (e.g. 0), flagging the NULL case or both.

■ In a transactional environment, user queries based on a public pseudonym (where allowed) should ensure that accidental joins are not made between records where the input value was NULL and ensure that null values are propagated in accordance with the rules of three value logic. This implies that the value of the public pseudonym should be set to NULL when the input value was NULL.

■ In a dimensional (data warehousing) model, the usual approach is to use a specific label (such as ‘Missing’) to flag NULL values. This step can be handled when the dimension tables are built.

3.5.2 An example is set out below: Clear Root Pseudonym/ Surrogate Key Public Pseudonym 1 (transactional) Public Pseudonym 2 (Dimensional) IsNull Flag (0 for other cases)

NULL 0 NULL ‘Not Known’ 1

31_{http://technet.microsoft.com/en-us/library/cc837966.aspx}

32_{In the past, the Post Office have re-used postcodes. More recently they have indicated that they will seek to avoid doing so,}

(25)

Page 25 of 42 Final FV2 24/03/2010

3.5.3 The approach can be extended to include cases where there a different classes of unknown – e.g. in the cases of postcode where there are range of pseudo-postcodes for unknown cases33_{, for example by flagging each case in the master table.}

Data quality and internal representation

3.5.4 Inconsistently formatted data will generate inconsistent pseudonyms. Where it is intended to base pseudonymisation on an approach which involves a cryptographic hash function this extends to the need to ensure a consistent internal representation of the data.

3.5.5 The examples set out in Figure 3 have been generated by the simple approach to creating short pseudonyms discussed below and show that the pseudonyms can change as a result of apparently minor differences in format or of the internal representation of the data types used to present them for pseudonymisation.34

Ref Field description Example Pseudonym

P1 8 chars left/right justified (old postcode _format) CW2 5GX Y6BQEB7R P2 8 chars left/right justified – mixed case CW2 5gx G63EJCM8 P3 7 chars – single space separator, no _{trailing space} CW2 5GX 5H9BV0ZL

P4 8 chars – single space separator, _{trailing space} CW2 5GX 8T22EGGF

D1 Date held as character input format to _{pseudonym varchar format date} 2000-01-01 03W5Z4VD

D2 Date held as datetime input format to _{pseudonym varchar date time} 2000-01-01 _00:00:00 0179DTBA

D3

Date held as datetime, but containing time as well as date information input format to pseudonym varchar date time

2000-01-01

00:00:01 00QVPHY2

D4 Data held as datetime – input format to pseudonym – internal representation (varbinary)

2000-01-01

00:00:00 00YPNETW

T1 Maintained as int 12345 E6QQF6GW

T2 Maintained as bigint 12345 4Y39TK2G

T3 Maintained as varchar 12345 HJR26Z86

T4 Maintained as nvarchar 12345 5VJ1DEHT

Figure 3 - Different input formats can give rise to different pseudonyms

3.5.6 Particular problem cases include:

■ Inconsistent formatting of postcodes maintained in local systems. This is a particular concern given that the standard for maintaining postcodes differs between the historic

33_{See http://www.hesonline.nhs.uk/Ease/servlet/ContentServer?siteID=1937&categoryID=571&dirID=110582} 34_{The code used to generate examples is referenced in Appendix 3 as Code Sample 16.}

Pseudonymisation Implementation Project (PIP) Reference Paper 4