7.5 Analysis of defence

7.5.2 Defence against causative availability attack

Defending against the causative availability attack requires addressing two problems: defence against the causative attack and defence against the exploratory attack. Specifically, defence against the causative attack aims to make the learner resilient to contamination of the training data, while defence against the exploratory attack aims to limit the attacker’s ability to learn about the classifier (e.g. defence against probing).

Defence from internal and external attack perspective: For insider attacks, the ideal scenario is to use insider detection techniques to detect the insider adversary and prevent it from poisoning more training data. However, a detailed discussion of insider attacks is beyond the scope of this study; for more information, readers are referred to [264–266]. We specifically discuss strategies for removing malicious data from the training set and for hardening the SVM algorithm against the proposed causative availability attack.

For external attacks, one of the most important security problems to consider is defence against exploratory attacks. Barreno et al. [255] proposed a strategy for defending against exploratory attacks that aims to prevent an adversary from learning about the training data and limits the attacker’s ability to reconstruct the internal state of the classifier. The main idea is to make reverse engineering of the learner more difficult using techniques such as randomization, limiting feedback, or misleading feedback.
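As an illustration of randomization combined with limited feedback, the following minimal sketch wraps a signed decision value so that the caller only ever sees a hard label, and labels close to the decision boundary are randomized. The function name `hardened_predict`, the `margin` parameter and the linear probability rule are assumptions made for this example; they are not taken from Barreno et al. [255].

```python
import random

def hardened_predict(score, margin=0.5, rng=random):
    """Return only a label in {-1, +1} for a signed decision value `score`.

    Hypothetical hardening rule (an assumption for illustration):
    - limited feedback: the raw score is never returned to the caller;
    - randomization: inside the band |score| < margin the label is drawn
      at random, with the probability of +1 growing linearly with the score.
    """
    if abs(score) < margin:
        p_positive = 0.5 + score / (2 * margin)  # 0 at -margin, 1 at +margin
        return 1 if rng.random() < p_positive else -1
    return 1 if score > 0 else -1

# Far from the boundary the answer is deterministic.
far_positive = hardened_predict(2.0)
far_negative = hardened_predict(-2.0)

# Exactly on the boundary repeated probes give inconsistent answers,
# which makes it harder to locate the boundary by probing.
rng = random.Random(0)
boundary_labels = {hardened_predict(0.0, rng=rng) for _ in range(200)}
```

The trade-off in this sketch is that legitimate queries falling inside the band may also receive a randomized label, so the band width would need to be chosen with the acceptable loss of accuracy in mind.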

Defence from training data perspective: On the one hand, to eliminate the impact of poisoned support vectors in the training dataset, we can use a leave-one-out strategy to train the classifier and remove those instances that have a significant negative impact on classification performance. Assume we are given a training set $X_{tr}$ and an initial testing set $X_{te}$. Our goal is to identify the potentially malicious data samples in $X_{tr}$. For each data point $(x_i, y_i) \in X_{tr}$ (where $x_i$ is the sample and $y_i$ is its label), a classifier $C_i$ is trained on $X_{tr} \setminus \{(x_i, y_i)\}$ and tested on $X_{te}$. To determine the significance of a change, we compare the shift in classification performance to the average shift caused by a single point. If the classification performance associated with $(x_i, y_i)$ shows a significant decrease compared with the average, $(x_i, y_i)$ is removed from the training set $X_{tr}$.
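The leave-one-out procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this thesis: a nearest-centroid learner (`fit_centroid`) stands in for the SVM so that the example needs only the standard library, a point is treated as harmful when the classifier trained without it is markedly more accurate on the testing set, and "significant" is taken to mean more than `k` standard deviations above the average shift. The helper names and the threshold `k` are hypothetical.

```python
from statistics import mean, pstdev

def fit_centroid(train):
    """Stand-in learner (NOT an SVM): predict the label of the nearest class centroid."""
    groups = {}
    for x, y in train:
        groups.setdefault(y, []).append(x)
    centroids = {y: tuple(mean(col) for col in zip(*xs)) for y, xs in groups.items()}

    def predict(x):
        return min(centroids,
                   key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))
    return predict

def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)

def leave_one_out_filter(X_tr, X_te, fit=fit_centroid, k=2.0):
    """Remove training points whose presence lowers accuracy far more than average."""
    base = accuracy(fit(X_tr), X_te)
    # shifts[i] > 0 means the classifier is more accurate WITHOUT point i.
    shifts = [accuracy(fit(X_tr[:i] + X_tr[i + 1:]), X_te) - base
              for i in range(len(X_tr))]
    # Assumed significance rule: more than k standard deviations above the mean shift.
    threshold = mean(shifts) + k * pstdev(shifts)
    kept = [point for point, s in zip(X_tr, shifts) if s <= threshold]
    return kept, shifts

# Toy 1-D data: the last training point is a poisoned sample labelled as class 0.
X_tr = [((0.0,), 0), ((1.0,), 0), ((2.0,), 0),
        ((8.0,), 1), ((9.0,), 1), ((10.0,), 1),
        ((20.0,), 0)]
X_te = [((0.5,), 0), ((1.5,), 0), ((3.0,), 0), ((4.0,), 0),
        ((7.0,), 1), ((8.5,), 1), ((9.5,), 1), ((11.0,), 1)]

kept, shifts = leave_one_out_filter(X_tr, X_te)
```

In this toy example only the poisoned point changes the test accuracy when it is left out, so it is the only point removed. The procedure trains one classifier per training point, which is expensive for large training sets.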

On the other hand, poisoned points can be filtered out of the candidate training data before they are used. One recently proposed method is the Reject On Negative Impact (RONI) defence [255]. It measures the shift in accuracy caused by each training instance and eliminates those points that have a substantial negative impact. To determine whether a candidate training sample is a poisoning point, the RONI defence first trains a classifier on a base training set and then updates the classifier by adding the candidate instance to the training set. Both classifiers are then applied to a quiz set (i.e. a safe training set). Finally, the classification results of the two classifiers are compared: if the second classifier produces substantially more classification errors, the candidate instance is rejected; otherwise, it is accepted.
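The RONI test for a single candidate can be sketched as below. This is an illustrative sketch under assumptions rather than the reference implementation from [255]: a nearest-centroid learner again stands in for the SVM, and "substantially more errors" is modelled by a hypothetical `tolerance` parameter (the number of additional quiz errors that is still accepted).

```python
from statistics import mean

def fit_centroid(train):
    """Stand-in learner (NOT an SVM): predict the label of the nearest class centroid."""
    groups = {}
    for x, y in train:
        groups.setdefault(y, []).append(x)
    centroids = {y: tuple(mean(col) for col in zip(*xs)) for y, xs in groups.items()}

    def predict(x):
        return min(centroids,
                   key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))
    return predict

def count_errors(predict, data):
    return sum(predict(x) != y for x, y in data)

def roni_accept(base_set, candidate, quiz_set, fit=fit_centroid, tolerance=0):
    """Accept `candidate` unless adding it raises quiz errors by more than `tolerance`."""
    errors_before = count_errors(fit(base_set), quiz_set)
    errors_after = count_errors(fit(base_set + [candidate]), quiz_set)
    return errors_after - errors_before <= tolerance

# Toy 1-D data: a trusted base set and a trusted quiz set.
base_set = [((0.0,), 0), ((1.0,), 0), ((2.0,), 0),
            ((8.0,), 1), ((9.0,), 1), ((10.0,), 1)]
quiz_set = [((0.5,), 0), ((1.5,), 0), ((3.0,), 0), ((4.0,), 0),
            ((7.0,), 1), ((8.5,), 1), ((9.5,), 1), ((11.0,), 1)]

benign_ok = roni_accept(base_set, ((1.5,), 0), quiz_set)     # ordinary class-0 point
poison_ok = roni_accept(base_set, ((20.0,), 0), quiz_set)    # far outlier labelled class 0
```

Unlike the leave-one-out filter, this check is applied to each incoming candidate before it enters the training set, so it depends on the base set and the quiz set being trustworthy.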

7.6 Conclusion

In this chapter, we formulated a model of a causative availability attack against the support vectors of the support vector machine classification algorithm. The proposed scheme can potentially be used by both internal and external attackers. Experimental results from one simulated study and four real-world scenarios showed that the proposed causative availability attack can significantly decrease classification performance. To defend against the proposed attack, we also presented several defence strategies.

Chapter 8 Conclusion and Future Work

8.1 Conclusion

In this thesis, we have addressed a number of research challenges and proposed effective solutions for deficient data classification. We have also simulated and analysed a security problem of the support vector machine classification algorithm.

Firstly, extensive experiments were conducted to examine the impact of the class imbalance problem, especially when values are missing. The experimental results demonstrate that class imbalance has a negative impact on classification performance. Moreover, the problem can be more severe when the imbalanced data contain missing values.

Secondly, many learning algorithms exhibit reduced performance when the imbalanced data contain missing values. We therefore proposed a fuzzy-based information decomposition (FID) oversampling method that re-distributes the imbalanced training data to alleviate these problems. In the proposed scheme, both missing data imputation and rebalancing of the training data are treated as a specific missing data estimation problem. In particular, FID rebalances the training data by creating synthetic samples for the minority class. The proposed scheme has two steps: weighting and recovery. In the weighting step, the weights obtained from membership functions are used to quantify the contribution of the observed data to the missing features. In the recovery step, missing values are estimated by taking into account the different contributions of the observed data. To evaluate the performance of the FID method, a large number of classification experiments were carried out on 27 well-known datasets. The results show that the FID method significantly outperforms 10 other state-of-the-art individual methods and 8 different combination methods in the critical situation of missing values and imbalanced data.

Thirdly, we studied the spam detection problem in online social networking and found that the classification performance of spam detection drops dramatically as the imbalance ratio increases. To address the class imbalance problem in online social networking spam detection, we presented an ensemble approach that strengthens our information decomposition algorithm for repairing imbalanced data by integrating it with random oversampling and random undersampling into one unified technique. The proposed ensemble algorithm was applied to ground-truth Twitter spam detection, and the experimental results confirmed the effectiveness of the proposed scheme.

Fourthly, we further explored the online social networking spam detection problem and found two limitations of classification algorithms when faced with streaming data. The first is data ‘drifting’: classification performance drops dramatically when historical data are used to build classification models for future spam detection, because the data structure is always changing. The second is that it is impossible to pre-label a large number of samples for training real-time classification models. To address these problems, we proposed a new asymmetric sampling technique that re-balances the sizes of the spam and non-spam samples in the training data. Experiments show that the proposed model is applicable to ‘drifting’ Twitter spam detection.

Finally, we formulated a model of a causative availability attack against support vectors for the security analysis of the support vector machine classification algorithm. The proposed scheme enables both internal and external attackers to poison the identified support vectors rather than the whole dataset, which makes the attack more focused. The results of a simulated study and four real-world scenarios showed that the proposed causative availability attack can significantly decrease classification performance. To defend against the proposed attack, we also presented several defence strategies.

The work presented in this thesis is significant from both academic and industrial perspectives. In particular, the proposed schemes have the potential to benefit the Internet of Things, cyber-physical systems, network system applications, healthcare applications, the health industry, researchers and, finally, the wider community. In the long term, potential benefits such as security and privacy applications in social networks make this an attractive proposition.

In document Deficient data classification with fuzzy learning (Page 177-182)