5.4.1 Complexity Analysis
Proposition 6 Complexity. The average runtime complexity of our proposed protocol is bounded by $O(\log_f n \times d^2)$ operations, where $d$ is the number of records and $n$ is the number of columns in EncDS.
Proof. The generation of EncDS from Sub-Protocol 1.2 requires $O(n \times d)$ operations. The shuffling of EncDS also requires $O(n \times d)$ operations [SHKS12]. As a result, Protocol 5.1 takes $O(n \times d)$ operations to generate MixDS. The shuffling of the $n$ items in Univ and the construction of TaxTree with fan-out $f$ require $O(n)$ operations. In Sub-Protocol 2.1, $2^n$ distinct partitions might be created in the worst-case scenario. However, since it is impossible in practice to have $|MixDS| = d = 2^n$ records to fill all the partitions, we argue that the average-case complexity is a more accurate measure of our protocol's performance. Given that the mean of the Laplace distribution is 0, the noise cancels out in the average case, and we assume that the records are assigned evenly among all partitions at the same level in PartTree (worst case). The number of levels in PartTree is $\log_2(d/f)$, and the total number of partitions across all levels is $\sum_{i=0}^{\log_2(d/f)} 2^i \times f = O(d)$, where $O(\log_f n)$ operations are applied on each partition. Since all records in MixDS are validated against each partition in PartTree for assignment, the required number of operations is $O(\log_f n \times d^2)$.
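The partition count used in the proof can be checked numerically. The following is a minimal sketch, assuming that the number of levels is read as $\log_2(d/f)$ and that $d/f$ is a power of two; the function names are hypothetical and are not part of the protocol.

```python
import math


def total_partitions(d: int, f: int) -> int:
    """Sum 2^i * f over levels i = 0 .. log2(d/f), as in the proof.

    Assumption: d is a multiple of f and d/f is a power of two.
    """
    ratio, remainder = divmod(d, f)
    if remainder != 0 or ratio < 1 or ratio & (ratio - 1) != 0:
        raise ValueError("this sketch assumes d/f is a power of two")
    levels = ratio.bit_length() - 1  # exact log2 of a power of two
    return sum((2 ** i) * f for i in range(levels + 1))


def closed_form(d: int, f: int) -> int:
    """Geometric-series closed form f * (2^(log2(d/f) + 1) - 1) = 2d - f."""
    return 2 * d - f


def operation_bound(d: int, n: int, f: int) -> float:
    """Records x partitions x per-partition cost, mirroring the O(log_f n * d^2) argument."""
    return d * total_partitions(d, f) * math.log(n, f)
```

Since the closed form is $2d - f$, the partition count grows linearly in $d$, and checking each of the $d$ records against every partition gives the quadratic term.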
Discussion. The analysis shows that our approach is suitable for high-dimensional data since the complexity is logarithmic with regard to the number of dimensions. On the other hand, it is quadratic with respect to the number of records. This is due to the security protections of our protocol, where no information can be inferred by malicious adversaries either during data integration or during the partitioning process. Lowering record complexity while maintaining the same level of security is a non-trivial open problem.
5.4.2 Security Analysis
Proposition 7 Integrity. The overall protocol is sound under the malicious adversarial model.
Proof. All steps in our solution are publicly verifiable, which prevents a compromised data owner from deviating from the correct computation without detection. If a deviation is detected, honest data owners will not proceed, preventing the completion of the protocol (as the decryption operations throughout the protocol, including the last step, require all participants). Table 9 illustrates the publicly verifiable primitive of each security-sensitive step in each proposed protocol and sub-protocol. We inherit integrity against a dishonest majority from our building blocks (and we can provide robustness against a dishonest minority by adjusting the threshold of the decryption operation).
Table 9: The publicly verifiable primitives involved in each security-sensitive step of the proposed protocol.
P.V. Primitive | Construction 1 | Protocol 1 | Sub-Protocol 1.1 | Sub-Protocol 1.2 | Protocol 2 | Sub-Protocol 2.1 | Sub-Protocol 2.2
Mix Network: 1,4,5†
Mix and Match: 1−4 1.b
NIZKP: 1
Homomorphic Operation: 2 2,3 2
Public Encryption: 1.a‡
Distributed Decryption: 4
Cut-and-Choose: 1
Distributed Proxy Re-encryption: 2.a.i
Cleartext Operation∗: 2,3,6 3 1−4 21,2.a.ii, .b,3 5−7
† The shuffling in Step 2 and 5 is performed at the same time to ensure that the same random permutation π is used.
‡ All participants agree on one randomness value for encryption so that anyone can verify the ciphertexts by regenerating them.
∗ Cleartext operations involve steps that do not require a secret, such as sub-protocol calls and broadcasting an output.

We must also ensure that all inputs to the protocol are correctly formed. In the setup phase, where the data owners interact to construct the public key, the distributed key generation (DKG) protocol ensures that the output is uniformly distributed at random [GJKR07]. In the case of data encryption, each ciphertext must be an element of $G_q \times G_q$ so that the data owners are able to check the independence of the ciphertexts. When mixing (shuffling and re-randomization), distributed proxy re-encryption, and plaintext equality tests are performed, each data owner inputs a random exponent from $\mathbb{Z}_q^*$ for blinding. As long as at least one exponent is uniformly distributed at random, the sum of all exponents is also random. Finally, during the generation of a uniformly random bitstring using the Coin Toss protocol, the same property holds: as long as there is at least one honest data owner, the result is uniformly random.
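The "one honest contribution suffices" argument can be illustrated with a small exhaustive check. This is a toy sketch under stated assumptions: exponents are drawn from all of $\mathbb{Z}_q$ for a tiny $q$ (rather than $\mathbb{Z}_q^*$) so that the bijection is exact, the coin toss is modeled as a plain XOR of the contributed bitstrings, and all function names are hypothetical.

```python
from collections import Counter


def combined_exponent(exponents, q):
    """Add all contributed blinding exponents modulo q."""
    return sum(exponents) % q


def distribution_with_one_honest(adversarial, q):
    """Count each combined value as the single honest exponent ranges over Z_q.

    The adversarial exponents are fixed arbitrarily; only the honest
    exponent r varies.
    """
    return Counter(
        combined_exponent(list(adversarial) + [r], q) for r in range(q)
    )


def xor_coin_toss(shares):
    """Combine bitstring contributions (encoded as integers) by XOR."""
    result = 0
    for share in shares:
        result ^= share
    return result
```

For any fixed adversarial inputs, the honest value is mapped one-to-one onto the output space, so a uniformly random honest contribution yields a uniformly random result. For example, `distribution_with_one_honest([3, 9], 11)` assigns a count of exactly one to each of the 11 residues.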
Proposition 8 Privacy-preserving. The overall protocol is privacy-preserving.
Proof. To prove that our protocol is privacy-preserving, we show that the data is protected throughout the protocol execution.
Input Data. Each data owner $P_i$ encrypts his data, proves knowledge of it, and then inputs it to the protocol. The proof is zero-knowledge, where $P_i$ proves that he knows the underlying plaintexts of the encrypted data without revealing any information about the plaintexts.
Encrypted Data. While encrypted, the data is protected under the CPA-security of the encryption scheme (e.g., DDH for ElGamal) and the proof is zero-knowledge. The adversary cannot decrypt items arbitrarily, as the decryption key is (n, n)-shared between all data owners, requiring the adversary to corrupt every data owner to be successful (in which case, all the inputs are already known). Moreover, applying verifiable mixing on the columns and rows of the encrypted data removes any correspondence between ciphertexts and the original items/records.
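The effect of an (n, n)-shared decryption key can be sketched with a toy ElGamal instance. This is only an illustration under stated assumptions: the group parameters (p = 23, q = 11, g = 2) are far too small for real use, the key is shared additively, no zero-knowledge proofs or verifiability are included, and the function names are hypothetical rather than taken from the protocol.

```python
# Toy parameters (assumption): g = 2 generates the order-11 subgroup of Z_23^*.
P, Q, G = 23, 11, 2


def public_key(shares):
    """Combine the per-owner values g^{x_i} into h = g^{x_1 + ... + x_n}."""
    h = 1
    for x_i in shares:
        h = (h * pow(G, x_i, P)) % P
    return h


def encrypt(m, h, r):
    """ElGamal encryption of a subgroup element m with randomness r."""
    return pow(G, r, P), (m * pow(h, r, P)) % P


def partial_decrypt(c1, x_i):
    """Each data owner raises c1 to its own key share."""
    return pow(c1, x_i, P)


def combine(c2, partials):
    """Multiply all partial decryptions and strip the mask from c2."""
    mask = 1
    for d_i in partials:
        mask = (mask * d_i) % P
    return (c2 * pow(mask, -1, P)) % P  # modular inverse (Python 3.8+)
```

With shares `[3, 7, 5]`, message `9`, and randomness `6`, combining all three partial decryptions recovers `9`, while combining only two of them does not, which mirrors the claim that every data owner must participate in decryption.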
Decrypted Data. The underlying data remains encrypted throughout the protocol except in two areas: within Mix and Match (during plaintext equality tests) and within proxy re-encryption. However, both sub-protocols are verifiable and already provide protection against a malicious adversary.
5.4.3 Correctness and Utility Analysis
Proposition 9 Correctness. Given p≥2 set-valued datasets with record and item overlaps, the proposed protocol generates ε-differentially private set-valued data.
Proof. We first show that our protocol can handle record and item overlaps, and then show that the released data is ε-differentially private.
Data Overlap. If data about the same individual exists in more than one dataset (record overlap), then Step 2.a of Sub-Protocol 1.2 is applied to generate one integrated record for that individual. Function n-OR is used to set the total number of occurrences of an item to one if the item exists in more than one record for the same individual (item overlap).
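The overlap handling can be pictured with a cleartext analogue. The sketch below is an assumption-laden illustration: in the protocol the records are encrypted and the merge is carried out on ciphertexts, whereas here records are plain (identifier, bit vector) pairs, and the names `n_or` and `integrate_records` are hypothetical.

```python
def n_or(bits):
    """Return 1 if at least one input bit is set, otherwise 0."""
    return 1 if any(bits) else 0


def integrate_records(records, num_items):
    """Merge all records of the same individual into one integrated record.

    records: iterable of (individual_id, bit_vector) pairs, where
    bit_vector[j] == 1 means that item j occurs in that record.
    """
    grouped = {}
    for individual_id, vector in records:
        grouped.setdefault(individual_id, []).append(vector)
    return {
        individual_id: [
            n_or(vector[j] for vector in vectors) for j in range(num_items)
        ]
        for individual_id, vectors in grouped.items()
    }
```

An individual who appears in two datasets with vectors `[1, 0, 1]` and `[1, 1, 0]` ends up with the single record `[1, 1, 1]`: the record overlap is collapsed, and the item shared by both records is counted once.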
ε-Differentially Private Data Generation. Sub-Protocol 2.1 performs the same sequence of partitioning operations as the algorithm in [CMF+11], except that our protocol is in a distributed setting. Since [CMF+11] generates ε-differentially private set-valued data, we prove the correctness of Sub-Protocol 2.1 by only proving the correctness of the different steps:
• Record-Partition Assignment. Verifying that every item in DifNodes exists in R, and R has already been assigned to Parent(Part), is equivalent to verifying that every item in HCut exists in R. Moreover, verifying that every item in Part.CCut does not exist in R ensures that the same record is not assigned to more than one sibling partition.
• Noise Generation. In Sub-Protocol 2.2, even though LN noise is differentially private, the partial noise $X_k$ from $P_k$ is not; hence the use of the Cut-and-Choose protocol to allow for encrypted partial noises while ensuring that they are random variables following a gamma distribution. Moreover, since the total noise LN noise of a leaf partition is equal to LapNoise(ε/2), the output is guaranteed to be ε/2-differentially private.
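The role of gamma-distributed partial noises can be illustrated with a cleartext simulation. The sketch relies on the standard fact that a Laplace variable with scale b has the same distribution as the sum of n independent differences of two Gamma(1/n, b) variables. It is not the encrypted Cut-and-Choose procedure itself: the function names are hypothetical, and the mapping from the budget ε/2 to the scale b (for example b = 2/ε for sensitivity 1) is an assumption that the source does not spell out here.

```python
import random


def partial_noise(num_parties, scale, rng):
    """One party's share: the difference of two Gamma(1/num_parties, scale) draws."""
    shape = 1.0 / num_parties
    return rng.gammavariate(shape, scale) - rng.gammavariate(shape, scale)


def total_noise(num_parties, scale, rng):
    """Sum of all partial noises; distributed as Laplace(0, scale)."""
    return sum(partial_noise(num_parties, scale, rng) for _ in range(num_parties))


def sample_moments(num_parties, scale, samples, seed):
    """Empirical mean and variance of the total noise, for a sanity check."""
    rng = random.Random(seed)
    draws = [total_noise(num_parties, scale, rng) for _ in range(samples)]
    mean = sum(draws) / samples
    variance = sum((x - mean) ** 2 for x in draws) / samples
    return mean, variance
```

A Laplace variable with scale b has mean 0 and variance 2b², so `sample_moments(3, 2.0, 20000, seed)` should return a mean close to 0 and a variance close to 8. No single partial noise follows a Laplace distribution on its own, which is consistent with the remark that the partial noise from one party is not differentially private.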