
Privacy with Bin Encoding

Because Bin Encoding is a simple lossy encoding technique, it cannot be classed as encryption. However, encoding can still protect privacy and secure data. This protection is analysed in Sections 4.2.1 and 4.2.2 for frequency and brute force attacks, respectively.

4.2.1 Frequency Attack

The simple substitution cipher used by Mary, Queen of Scots, was broken via frequency analysis, which led to her execution in 1587 [156]; this is why frequency analysis is covered in this section. Analysing the frequency of a bin occurring gives an estimation of the letters mapped to it. For example, given the letter frequencies in Table 4.1 [155], if a bin occurs at a relatively small frequency, it is more likely to contain letters that also have a smaller frequency. This also gives a reduction in bin combinations for a malicious user to try, because certain combinations do not fall within the estimated frequencies calculated.

Table 4.1: English letter frequencies (%)

A     B     C     D     E      F     G     H     I     J     K     L     M
7.61  1.54  3.11  3.95  12.62  2.34  1.95  5.51  7.34  0.15  0.65  4.11  2.54

N     O     P     Q     R     S     T     U     V     W     X     Y     Z
7.11  7.65  2.03  0.10  6.15  6.50  9.33  2.72  0.99  1.89  0.19  1.72  0.09

Figure 4.2: Difference in calculated and actual bin frequencies (axes: Frequency % Difference, % of Bins Generated; series: 13 Bins, 3 Bins)

Figure 4.2 shows the difference between the estimated frequency obtained from counting bins in the index and the actual frequency of the letters in each bin. These results show that with enough encoded documents indexed, it is possible to predict the actual letter frequency of a bin to within ±2.5%. They also show that bin frequencies are harder to estimate when the index uses a smaller number of bins.
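The comparison between calculated and observed bin frequencies can be sketched as follows. This is a minimal illustration rather than the experiment behind Figure 4.2: the letter frequencies are copied from Table 4.1, while the function names and the 3-bin example mapping (A-I, J-R, S-Z) are assumptions made only for this example.

```python
from collections import Counter

# Letter frequencies (%) copied from Table 4.1.
LETTER_FREQ = {
    "A": 7.61, "B": 1.54, "C": 3.11, "D": 3.95, "E": 12.62, "F": 2.34,
    "G": 1.95, "H": 5.51, "I": 7.34, "J": 0.15, "K": 0.65, "L": 4.11,
    "M": 2.54, "N": 7.11, "O": 7.65, "P": 2.03, "Q": 0.10, "R": 6.15,
    "S": 6.50, "T": 9.33, "U": 2.72, "V": 0.99, "W": 1.89, "X": 0.19,
    "Y": 1.72, "Z": 0.09,
}


def expected_bin_frequencies(mapping):
    """Sum the Table 4.1 frequencies of the letters assigned to each bin."""
    totals = {}
    for letter, bin_id in mapping.items():
        totals[bin_id] = totals.get(bin_id, 0.0) + LETTER_FREQ[letter]
    return totals


def observed_bin_frequencies(text, mapping):
    """Encode text with the mapping and return each bin's share (%) of the output."""
    bins = [mapping[c] for c in text.upper() if c in mapping]
    counts = Counter(bins)
    return {bin_id: 100.0 * n / len(bins) for bin_id, n in counts.items()}


# Hypothetical mapping for illustration only: A-I -> 0, J-R -> 1, S-Z -> 2.
EXAMPLE_MAPPING = {letter: i // 9 for i, letter in enumerate(sorted(LETTER_FREQ))}

if __name__ == "__main__":
    print(expected_bin_frequencies(EXAMPLE_MAPPING))
    print(observed_bin_frequencies("hello world", EXAMPLE_MAPPING))
```

An attacker who only sees the encoded index can compute the observed values and compare them with the expected sums of candidate mappings, discarding candidates whose sums fall too far away.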

Figure 4.3 shows the distribution of 100 million unique random bins for a 3 bin configuration, giving a total of 300 million bins, each containing 8-9 letters. The average summed frequency for a bin is around 33.3%, even though the English letter frequencies vary. Therefore, when generating the bin mapping, we can check if the bin frequencies are within the majority of all possible bin combinations. For example, using 3 bins, the scheme might only accept bins with frequencies between 20% and 46%. This means that even if a malicious user knows the frequency of each bin to ±2.5%, this still leaves over 20% of all possible bin combinations. Note that this experiment only contained bins of even size (8-9 letters); an implementation that varies the number of letters per bin would have more possible bin combinations.

Figure 4.3: Letter frequencies in 300 million randomly generated bins (axes: Sum of Letter Frequencies in a Bin (%), % of Bins)
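The acceptance check described above can be sketched as a small sampling experiment. This is an illustrative sketch under stated assumptions, not the code used to produce Figure 4.3: the 20% to 46% range comes from the text, whereas the function names, the sample size, and the seed are hypothetical, and the sample is far smaller than the 100 million used in the experiment.

```python
import random

# Letter frequencies (%) copied from Table 4.1.
LETTER_FREQ = {
    "A": 7.61, "B": 1.54, "C": 3.11, "D": 3.95, "E": 12.62, "F": 2.34,
    "G": 1.95, "H": 5.51, "I": 7.34, "J": 0.15, "K": 0.65, "L": 4.11,
    "M": 2.54, "N": 7.11, "O": 7.65, "P": 2.03, "Q": 0.10, "R": 6.15,
    "S": 6.50, "T": 9.33, "U": 2.72, "V": 0.99, "W": 1.89, "X": 0.19,
    "Y": 1.72, "Z": 0.09,
}


def random_even_bins(letters, num_bins, rng):
    """Shuffle the letters and deal them into bins whose sizes differ by at most one."""
    shuffled = list(letters)
    rng.shuffle(shuffled)
    return [shuffled[i::num_bins] for i in range(num_bins)]


def bin_sums(bins):
    """Summed letter frequency (%) of each bin."""
    return [sum(LETTER_FREQ[c] for c in b) for b in bins]


def is_acceptable(bins, low=20.0, high=46.0):
    """Accept a mapping only if every bin's summed frequency lies in [low, high]."""
    return all(low <= s <= high for s in bin_sums(bins))


def acceptance_rate(samples, num_bins=3, seed=0):
    """Fraction of randomly generated mappings that pass the acceptance check."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    letters = sorted(LETTER_FREQ)
    accepted = sum(
        is_acceptable(random_even_bins(letters, num_bins, rng))
        for _ in range(samples)
    )
    return accepted / samples


if __name__ == "__main__":
    print(f"accepted fraction: {acceptance_rate(10_000):.3f}")
```

A generator built this way would simply redraw the mapping until `is_acceptable` returns true, so that no bin stands out by having an unusually high or low summed frequency.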

4.2.2 Brute Force

There exists a finite number of states for a mapping; therefore, a malicious entity need only find which one is in use. With an even number of characters per bin, the expression below determines the number of possible states, where N is the number of characters, b is the number of bins, and N mod b = 0. For example, the set {A-Z, 0-9} gives N = 36. When b = 3, this results in ≈ 3.38e15 possible states, and if b = 12 then there are ≈ 1.71e32 possible states. An uneven number of characters per bin grows these values further.

\[
\prod_{i=1}^{b} \binom{N - (N/b)(i-1)}{N/b}
\]
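As a quick check of the figures quoted above, the product of binomial coefficients can be evaluated directly. The function name `mapping_states` is an assumption for this sketch; it counts assignments to labelled bins and requires N mod b = 0, as in the text.

```python
from math import comb


def mapping_states(n_chars, n_bins):
    """Number of ways to split n_chars characters into n_bins equal-sized bins."""
    if n_chars % n_bins != 0:
        raise ValueError("n_chars must be divisible by n_bins")
    per_bin = n_chars // n_bins
    states = 1
    for i in range(1, n_bins + 1):
        # Choose the characters for bin i from those not yet assigned.
        states *= comb(n_chars - per_bin * (i - 1), per_bin)
    return states


if __name__ == "__main__":
    print(f"b = 3:  {mapping_states(36, 3):.2e}")
    print(f"b = 12: {mapping_states(36, 12):.2e}")
```

The product telescopes to N! / ((N/b)!)^b, which is a convenient way to cross-check the result.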

Even with the correct state or mapping, because Bin Encoding is lossy, there are still many combinations that the original plain-text value could be. With b = 3 and N = 36, there are 12 possibilities for each bin. A query of length 5 would have 12^5 (≈ 2.49e5) possibilities, before considering whether it is a valid word or phrase. An encoded document with only 100 characters would therefore have 12^100 ≈ 8.28e107 possibilities. Note that with 100 characters, and given that spaces have been removed, word lengths are not known. If the mapping is not known, trying to solve an encoded document is hard (≈ 8.28e107 × 3.38e15 = 2.80e123).
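The arithmetic in this paragraph can be reproduced with exact integers. The sketch below assumes b = 3 and N = 36 as in the text; the variable names are illustrative only.

```python
from math import comb

N, b = 36, 3
per_bin = N // b  # 12 candidate characters for every encoded symbol

query_candidates = per_bin ** 5       # 5-character query, mapping known
document_candidates = per_bin ** 100  # 100-character document, mapping known

# Number of equal-sized 3-bin mappings over 36 characters.
mapping_states = comb(36, 12) * comb(24, 12) * comb(12, 12)

# Mapping unknown: every mapping combined with every candidate plain text.
combined = document_candidates * mapping_states

if __name__ == "__main__":
    print(f"query:    {query_candidates:,}")
    print(f"document: {document_candidates:.2e}")
    print(f"combined: {combined:.2e}")
```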

The challenge is knowing the solution is correct. For example, using the prime number theorem to estimate the number of prime numbers that are 1024 bits long gives many more possibilities:

\[
\frac{2^{1024}}{\log(2^{1024})} - \frac{2^{1023}}{\log(2^{1023})} \approx 1.27\mathrm{e}305
\]

However, the difference for encrypted text is that it is usually obvious when the decryption key is found, whereas with Bin Encoding there are many possible solutions even with knowledge of the mapping.
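The prime-count estimate can be checked numerically. Because 2^1024 exceeds the range of a double-precision float, this sketch uses Python's `decimal` module; the function names are assumptions, and log is taken to be the natural logarithm, as the prime number theorem requires.

```python
from decimal import Decimal, getcontext

getcontext().prec = 50  # more than enough digits for a three-figure estimate


def approx_prime_count(bits):
    """Prime number theorem estimate of the primes below 2**bits: x / ln(x)."""
    x = Decimal(2) ** bits
    return x / (bits * Decimal(2).ln())


def approx_primes_with_bits(bits):
    """Estimated number of primes in [2**(bits - 1), 2**bits)."""
    return approx_prime_count(bits) - approx_prime_count(bits - 1)


if __name__ == "__main__":
    print(f"{approx_primes_with_bits(1024):.2e}")
```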

An example can be seen in Table 4.2, where all 5-letter words (6919) from a dictionary [157] were tested against each other for collisions using the same mapping. This shows that with only a few bins, there is the potential for some false positives in results. However, it also means that when trying to recover the plain text, there are a number of options even if the mapping and word length are known. Results from another experiment in Tables 4.3 and 4.4 show the large number of potential matches if the mapping is not known. Each word was encoded into x bins using ASCII values modulo x, and tested for a match against the query word, giving the "same mapping" column. The "other mapping" column was computed by encoding the query and testing whether each 5- or 10-letter word could be a potential encoding. The criterion for a potential match was that no character gets encoded to different bins. For example, hello is not a potential match for the encoded query ABBAC because l is split across two bins. Table 4.4 also includes two 5-letter words which still have potential matches with 10-letter words.
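The two tests described here can be sketched as follows. This is a minimal illustration, not the code used for Tables 4.2 to 4.4: the function names, the use of letters A, B, C, ... as bin labels, the lowercase input, and the requirement that word and encoding have equal length are all assumptions made for this example.

```python
import string


def encode_mod(word, num_bins):
    """Encode each character as a bin label using its ASCII value modulo num_bins.

    Bins are labelled A, B, C, ... purely for readability (an assumption),
    so num_bins must not exceed 26.
    """
    return "".join(string.ascii_uppercase[ord(c) % num_bins] for c in word)


def is_potential_match(word, encoded):
    """Return True if no character of word would have to map to two different bins."""
    if len(word) != len(encoded):  # assumed: only equal lengths are compared
        return False
    assigned = {}
    for char, bin_label in zip(word, encoded):
        # Remember the first bin seen for this character; any later
        # disagreement means the character is split across bins.
        if assigned.setdefault(char, bin_label) != bin_label:
            return False
    return True


if __name__ == "__main__":
    print(encode_mod("hello", 3))
    print(is_potential_match("hello", "ABBAC"))  # 'l' would need bins B and A
    print(is_potential_match("world", "ABBAC"))  # all characters are distinct
```

The "same mapping" test corresponds to comparing `encode_mod` outputs for two words, while the "other mapping" test corresponds to `is_potential_match`, which ignores the mapping and only checks consistency.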