• No results found

Privacy-Preserving Deep Neural Networks using Secure Computing

Deep neural networks (DNN), also known as deep learning (DL), have been increasingly used in many fields such as computer vision, natural language processing, speech/audio recognition, etc. With increasing availability of huge amounts of training data and devices with increased computing power, they are becoming more practical for use in ML based applications. Such DNNs usually consist of two phases: the training

phase and theinference phase. In thetraining phase, a well-designed DNN is provided as input a training dataset and an appropriate optimization algorithm to generate optimal parameters; then, in the inference

phase, the generated model (i.e., optimal parameters) is used for inference tasks, namely, predicting a label for a specific input sample.

In DNN-enabled applications, one of the critically needed components is a powerful computing infras- tructure with powerful CPU and GPU, larger memory storage, etc. Existing commercial AI service providers such as Google, Microsoft, and IBM have devoted significant efforts toward buildinginfrastructure as a ser- vice (IaaS) platforms for clients that do not have such powerful computing devices. The clients can employ these AI related IaaS to manage and train their models and provide prediction services to their customers.

The size of the available training dataset is another critical issue. In particular, the performance of a DNN system heavily relies on the availability of a large volume of training data. However, in many scenarios, training data used by a DNN may contain privacy-sensitive information. Recent data breach incidents have increased the privacy concerns related to large-scale collection and use of personal data [150, 176]. Moreover, recent regulations such as European General Data Protection Regulation (GDPR), California Consumer Privacy Act, Cybersecurity Law of China, etc., restrict the availability and use of privacy sensitive data. Such privacy concerns of users and requirements imposed by regulations pose a significant challenge for the deployment of DNN solutions.

To balance the privacy concerns and regulatory issues mentioned above, with that of the need of large volumes of training data, several approaches have been proposed in the literature for building privacy- preserving ML (PPML) solutions. These approaches include: (i) applying privacy-preserving mechanisms such asdifferential privacy to limit the disclosure of private information before outsourcing a dataset to a third party to train a DNN model [1]; (ii) employing new DNN architectures such asfederated learning (FL) where each participant trains a model locally and exchanges only the model parameters with the coordinating server [166]; and (iii) utilizing existing general secure multi-party computation (SMC) techniques or other encryption schemes (e.g., homomorphic encryption) to build appropriate security protocols to protect the input training data while training a DNN model [128, 151, 73, 129, 191].

In Table 4.1, we summarize existing representative privacy-preserving approaches used by DNN models. Existing solutions such as a FL approach and those based on differential privacy cannot provide strong privacy guarantees because ofinference attacks, as demonstrated in the literature [66, 167, 131]. The general

Table 4.1: Comparison of representative privacy-preserving approaches in deep learning

Proposal Training Prediction Privacy. Multi-Source† Technique

[166] 3 7 # - Federated Setting∗ [1] 3 7 # - Differential Privacy [128] - 3 H# 7 Customized SMC (GC)‡ [151] - 3 H# 7 General SMC (GC) [73, 86, 31, 92] 7 3 7 Homomorphic Encryption [129] 3 3 7 Homomorphic Encryption

CryptoNN (our work) [191] 3 3 - Functional Encryption

NN-EMD (our work) 3 3 3 Hybrid Functional Encryption

.The column denotes privacy guarantee degree such as mild approach#(e.g. differential privacy) and strong guarantee

(e.g. confidentiality level privacy guarantee).

The minus symbol indicates the approach supports multiple data sources to some extent.

The data owner trains the model by itself and outsources partial computation in a privacy-preserving setting.The model is trained in a federated manner where each data owner trains a partial model on their private data.

SMC(garbled circuits based) approaches, such as those proposed in [128, 151], have a limitation with regards to large volumes of encrypted data that need to be transferred during the execution of the associated secure protocols. Except for the recently proposed solutions in [191, 129], most of these SMC approaches only address privacy issues in the inference phase rather than in the training phase; this is mainly due to the efficiency challenges related to both computation and communication.

Furthermore, none of the existing solutions for privacy-preserving training of a DNN consider the fact that training data may be coming from multiple data sources partitioned horizontally or vertically. The training dataset may have different composition cases; it may include data from multiple data sources, where:

(i) Each data source provides a dataset that includes all the features;

(ii) Each source provides a dataset that has only a subset of the required features; but, collectively these datasets cover the complete set of features;

(iii) It is a hybrid of (i) and (ii).

Even though existing privacy-preserving solutions such as CryptoNN [191] support case (i), cases (ii) and (iii) still pose a huge challenge.

In this chapter, we propose a framework to train aNeural Network over Encrypted Multi-sourced Datasets

(NN-EMD). That is,NN-EMD trains a DNN using a dataset that is composed of independently encrypted datasets from many different sources. Each data source may provide its encrypted data that may include a

complete set of features or only asubset of features.

The goal here is to provide a strong privacy guarantee, while training a DNN model in a more efficient way as compared to the most recently proposed solutions, namely, those in [191, 129]. To the best of our knowledge, this is the first efficient and more practical approach for training a DNN over a set of encrypted/private datasets.