Mechanisms for Creating Random Numbers - Pseudonymisation Implementation Project (PIP) Referen

1. Mechanisms for obtaining a value which is randomly associated with the source data include:

■ The SQL Server Identity function

■ The SQL Server Unique Identifier functions

■ The rand function

■ The CRYPT_GEN_RANDOM function (SQL Server 2008 only)

■ A sub-string generated by taking a segment of a cryptographic hash

2. The characteristics of different approaches are summarised below; the basis of the analysis is set out in Code example 17 17 – Analysis of alternative approaches to creating pseudo–

random numbers:

Figure 9 - Approaches to Generating pseudo-Random Numberss compared

Approach N of

Produces as a string which is incremented by 1 or other constant for each new record.

While clearly not random in itself. may be regarded as having a pseudo-random relationship to related fields provided that records can be safely regarded as being added randomly to the table.

Short length (4 or 8 bytes integer)

Has the best characteristic for generating a primary clustered key.

No need to check for and remove clashes Cannot be reliably recreated from source data

Only one Identity allowed for each table

Global Unique Identifier function

NEWID() 0

Designed to provide a unique identifier which is truly unique.

Implemented as a default derivation to a table column of type uniqueidentifier

No need to check for and remove clashes Relatively long (16 bytes) - produces value rather than number

Pseudo random

Cannot be recreated from source data

Rand() function, unseeded via

set processing 1,000,000

Produces a quasi-random value between 0 and 1 each time it is called. If called without a record specific seed as part of set processing will assume the same value for all records.

Example included to demonstrate that Rand() is not useful within set processing

Rand() function, seeded at record

518

The rand function is applied to an integer seed which varies for each value to be

pseudonymised.

For a static input duplicates can be removed by brute force before the random set is used.

Where new pseudonyms are required on a case by case there is a need to test that the random value has not been previously issued and obtain a new value if this is the case

Will not automatically generate a unique number; number of clashes will depend in part on the characteristics of the seed.

Need to check for and remove clashes before use;

Length is determined by the processing of the random number

Quality of randomness has been questioned Very low chance of recovery from source data unless integer used to seed is available

Page 40 of 42 Final FV2 24/03/2010

Figure 9 - Approaches to Generating pseudo-Random Numberss compared

Approach N of called and cannot be seeded to predetermine the numbers which will be issued

Will not automatically generate a unique number; Need to check for and remove clashes before use;

Length is determined by input parameter No ability to recover pseudonym from source value

Four bytes are recovered from the hash of a unique input value. These can be cast as an integer. Because the hash is a one way function which can be applied to a text string and distribution is believed random there is the potential to use this as a random number

Will not automatically generate a unique number; Need to check for and remove clashes before use;

Length is four bytes and may encompass +ve and –ve values in the absence of additional processing

Potential to partially recover pseudonym from source data when input and salt are known – though in the example considered here it would not be possible to do so for using Base32 (see below with regard to formatting)

Will not automatically generate a unique number; Need to check for and remove clashes before use;

Length at base 32 is 8 characters – suitable for postcode short pseudonym

Potential to partially recover pseudonym from source data when input and salt are known is very high – very few clashes are created and depending on processing these may capable of resolution

Eight bytes are recovered from the hash of an input value.

These can be cast as bigint As above, but the numbers are generated across a wider range⁵⁶

Not guaranteed to generate a unique number, though probability of this appears very good for other than very large sets.

Need to check for and remove clashes before use depends on context;

Length is eight bytes and may encompass +ve and –ve values in the absence of additional processing. This translates t o a 20 character number which can be shrunk to 14 characters using Base 32 to present Potential to recover pseudonym from source data when input and salt are known is very good

Internal representation is guaranteed consistent within a given version, but impact of change to version must be checked (note:

no checks have been made comparing 64 and 32 bit implementation)

56 See analysis on p 38 for analysis of the specific characteristics of this approach

Page 41 of 42 Final FV2 24/03/2010

Figure 9 - Approaches to Generating pseudo-Random Numberss compared

Approach N of

As above except that only 6 bytes of the output from the hash is stored as bigint, limiting the maximum value

See above; clearly the reduction in range will increase the risk of clash and checks should always be made where used as a root.

If presented using Base 32 outputs has an output length of 10 –suitable for NHS number pseudonymisation

Very good prospect of recovering

pseudonym providing input values and salt are known

3. The table below details a check on the distribution of integers extracted as a sub-string from a varbinary generated by a hash function when applied to a sequence running from 1 to 1,000,000. . In the table below, conversion is to integer from a 6 byte extract..

4. Figure 10 shows the results when 20 bins of equal range are created to cover the output values of the pseudonymisation in terms of the minimum and maximum values of the inputs which generated and the number of input cases which fall into each bin.

■ That there is no obvious correlation between the distribution of input and output;

■ There is a boradly even distribution of input cases in each output bin

Figure 10 - Distribution of integer subset of hash function over 1,000,000 cases

step Outputs Inputs The code to produce this histogram is referenced as item 18 in Appendix 3.

Page 42 of 42 Final FV2 24/03/2010

In document Pseudonymisation Implementation Project (PIP) Reference Paper 4 (Page 39-42)