1. Mechanisms for obtaining a value which is randomly associated with the source data include:
■ The SQL Server Identity function
■ The SQL Server Unique Identifier functions
■ The rand function
■ The CRYPT_GEN_RANDOM function (SQL Server 2008 only)
■ A sub-string generated by taking a segment of a cryptographic hash
2. The characteristics of different approaches are summarised below; the basis of the analysis is set out in Code example 17 17 – Analysis of alternative approaches to creating pseudo–
random numbers:
Figure 9 - Approaches to Generating pseudo-Random Numberss compared
Approach N of
Produces as a string which is incremented by 1 or other constant for each new record.
While clearly not random in itself. may be regarded as having a pseudo-random relationship to related fields provided that records can be safely regarded as being added randomly to the table.
Short length (4 or 8 bytes integer)
Has the best characteristic for generating a primary clustered key.
No need to check for and remove clashes Cannot be reliably recreated from source data
Only one Identity allowed for each table
Global Unique Identifier function
NEWID() 0
Designed to provide a unique identifier which is truly unique.
Implemented as a default derivation to a table column of type uniqueidentifier
No need to check for and remove clashes Relatively long (16 bytes) - produces value rather than number
Pseudo random
Cannot be recreated from source data
Rand() function, unseeded via
set processing 1,000,000
Produces a quasi-random value between 0 and 1 each time it is called. If called without a record specific seed as part of set processing will assume the same value for all records.
Example included to demonstrate that Rand() is not useful within set processing
Rand() function, seeded at record
518
The rand function is applied to an integer seed which varies for each value to be
pseudonymised.
For a static input duplicates can be removed by brute force before the random set is used.
Where new pseudonyms are required on a case by case there is a need to test that the random value has not been previously issued and obtain a new value if this is the case
Will not automatically generate a unique number; number of clashes will depend in part on the characteristics of the seed.
Need to check for and remove clashes before use;
Length is determined by the processing of the random number
Quality of randomness has been questioned Very low chance of recovery from source data unless integer used to seed is available
Page 40 of 42 Final FV2 24/03/2010
© Crown Copyright 2010
Figure 9 - Approaches to Generating pseudo-Random Numberss compared
Approach N of called and cannot be seeded to predetermine the numbers which will be issued
Will not automatically generate a unique number; Need to check for and remove clashes before use;
Length is determined by input parameter No ability to recover pseudonym from source value
Four bytes are recovered from the hash of a unique input value. These can be cast as an integer. Because the hash is a one way function which can be applied to a text string and distribution is believed random there is the potential to use this as a random number
Will not automatically generate a unique number; Need to check for and remove clashes before use;
Length is four bytes and may encompass +ve and –ve values in the absence of additional processing
Potential to partially recover pseudonym from source data when input and salt are known – though in the example considered here it would not be possible to do so for using Base32 (see below with regard to formatting)
Will not automatically generate a unique number; Need to check for and remove clashes before use;
Length at base 32 is 8 characters – suitable for postcode short pseudonym
Potential to partially recover pseudonym from source data when input and salt are known is very high – very few clashes are created and depending on processing these may capable of resolution
Eight bytes are recovered from the hash of an input value.
These can be cast as bigint As above, but the numbers are generated across a wider range56
Not guaranteed to generate a unique number, though probability of this appears very good for other than very large sets.
Need to check for and remove clashes before use depends on context;
Length is eight bytes and may encompass +ve and –ve values in the absence of additional processing. This translates t o a 20 character number which can be shrunk to 14 characters using Base 32 to present Potential to recover pseudonym from source data when input and salt are known is very good
Internal representation is guaranteed consistent within a given version, but impact of change to version must be checked (note:
no checks have been made comparing 64 and 32 bit implementation)
56 See analysis on p 38 for analysis of the specific characteristics of this approach
Page 41 of 42 Final FV2 24/03/2010
© Crown Copyright 2010
Figure 9 - Approaches to Generating pseudo-Random Numberss compared
Approach N of
As above except that only 6 bytes of the output from the hash is stored as bigint, limiting the maximum value
See above; clearly the reduction in range will increase the risk of clash and checks should always be made where used as a root.
If presented using Base 32 outputs has an output length of 10 –suitable for NHS number pseudonymisation
Very good prospect of recovering
pseudonym providing input values and salt are known
3. The table below details a check on the distribution of integers extracted as a sub-string from a varbinary generated by a hash function when applied to a sequence running from 1 to 1,000,000. . In the table below, conversion is to integer from a 6 byte extract..
4. Figure 10 shows the results when 20 bins of equal range are created to cover the output values of the pseudonymisation in terms of the minimum and maximum values of the inputs which generated and the number of input cases which fall into each bin.
■ That there is no obvious correlation between the distribution of input and output;
■ There is a boradly even distribution of input cases in each output bin
Figure 10 - Distribution of integer subset of hash function over 1,000,000 cases
step Outputs Inputs The code to produce this histogram is referenced as item 18 in Appendix 3.
Page 42 of 42 Final FV2 24/03/2010
© Crown Copyright 2010