A New Privacy Preserving Data Mining Algorithm using Data Distortion Techniques
Mohana Chelvan P¹ and Perumal K²
¹Department of Computer Science, Hindustan College of Arts and Science, Chennai, INDIA.
²Department of Computer Applications, Madurai Kamaraj University, Madurai, INDIA.
email: [email protected], [email protected]
(Received on: May 9, 2018)
ABSTRACT
Most privacy preserving data mining algorithms transform the records in order to preserve privacy, but this leads to a loss of accuracy. In data mining, feature selection is a crucial technique for handling “the Curse of Dimensionality”. In recent years data have become high dimensional, so feature selection is essential for reducing the dimensionality of a dataset: it improves accuracy, reduces computational cost and improves model interpretability. Feature selection stability is the robustness of a feature selection algorithm in choosing the same or similar feature subsets in subsequent iterations. Unstable feature selection leaves researchers uncertain about their findings, so feature selection stability has recently become a critical and active research topic. Feature selection stability depends mainly on the characteristics of the dataset, but it is not entirely independent of the algorithm. Privacy preserving data mining transforms the dataset in order to preserve privacy, and this affects selection stability because stability is largely dataset dependent. This research paper introduces a privacy preserving data mining algorithm that offers privacy preservation, improved accuracy and feature selection stability. In this paper, a data distortion (data perturbation) technique is used to preserve the privacy of individuals during data mining.
Keywords: Data Mining, Privacy Preservation, Feature Selection, Selection Stability, Kuncheva Index.
1. INTRODUCTION
Data mining may be characterised as the examination of the historical datasets of organisations to extract potentially valuable, previously unknown, non-trivial, verifiable and interesting patterns or knowledge. Data mining is essential for organisations seeking an edge over their rivals. Owing to advances in internet throughput, the data about individuals collected by online systems are mostly high dimensional, which makes data mining tasks considerably harder, a difficulty known as “the curse of dimensionality” [1]. Feature selection is a dimensionality reduction technique in which a small subset of relevant features is chosen from the original dataset according to a given relevance evaluation criterion [2,3]. Feature selection leads to better learning performance, for example lower computational cost, higher learning accuracy, better model interpretability and reduced storage space. In addition, high dimensional data combined with background knowledge or public information can reveal the identity of hidden record owners and can therefore pose a threat to their privacy.
Feature selection stability is the insensitivity of a feature selection algorithm to the addition or deletion of a few tuples in the dataset: a stable algorithm selects the same or similar feature subsets in subsequent iterations [4]. Unstable feature selection causes confusion in researchers' minds about their conclusions, and the experimental results become unreliable [5,6,7]. The importance of feature selection stability is now recognised by researchers, because instability reduces their confidence in their research findings. Selection stability is also regarded as an important criterion for feature selection algorithms and has become an emerging research topic [6,8]. Changes in the characteristics of the dataset affect feature selection stability; however, stability is not entirely independent of the algorithm [9,10,11]. The factors that affect selection stability include the number of selected features [12], the dimensionality, the sample size [5] and the differing data distributions across folds.
During data mining, the data often include sensitive personal information, for example medical records, salaries and other financial details, which is exposed to several parties including government, owners, users and miners. The mined patterns, which are expressed as decision trees, association rules, classification models or clusters, also carry such information. Private information about individuals or businesses is thus contained in the knowledge discovered by different data mining techniques. Privacy preserving data mining (PPDM) is concerned with protecting the privacy of individual data or sensitive knowledge without sacrificing the utility of the data. Current techniques can be broadly classified into two categories [13]:
(i) Methodologies that protect the sensitive data itself in the mining process, and
(ii) Methodologies that protect the sensitive data mining results (i.e. extracted knowledge) produced by applying data mining.
PPDM perturbs the original data so that the results of the data mining task do not violate privacy constraints.
Privacy preserving data mining is the branch of data mining that aims to protect privacy-sensitive data about individuals from unauthorised and sometimes inadvertent disclosure, thereby safeguarding the tuples of the dataset together with their privacy. In privacy preserving data mining, the sensitive raw data and the sensitive knowledge in the mining results are protected by perturbing the original dataset with a purpose-built algorithm [14]. With this approach, the privacy of individuals is protected while useful knowledge is still extracted from the dataset [15]. The real criterion for a good privacy preserving technique is high data quality together with privacy.
To protect an individual's records from being re-identified, these techniques perturb the collected dataset through some form of modification before its release [16]. These perturbations can affect selection stability, which is largely dataset dependent. Larger modifications to the dataset lead to unstable feature selection, which in turn lowers data utility. No substantial research was found on this point, i.e., the relationship between perturbing data for privacy preserving data mining and feature selection stability.
2. METHODOLOGY
2.1 Proposed Methodology
Microdata datasets contain a great deal of public information owing to advances in internet technology, which increases the dimensionality of the datasets, a problem known as “the Curse of Dimensionality”. A dataset contains three kinds of attributes: identifiers, quasi identifiers and sensitive attributes. Identifiers are attributes that uniquely identify a tuple, such as the roll number of a student. Quasi identifiers are attributes that, taken as a group, indirectly identify a tuple, such as date of birth, age and sex. Sensitive attributes are attributes that contain sensitive information, such as income.
The feature selection algorithm CFS is used to identify the quasi identifier attributes. Applying the algorithm yields a ranked list of attributes, from which the quasi identifier attributes are selected. The statistical properties mean, standard deviation and variance are calculated for the experimental dataset. The feature selection algorithm is then applied to the experimental dataset, and the accuracy of the selected features is calculated. The identified quasi identifier attributes and sensitive attributes are perturbed by the privacy preserving data mining algorithm shown in Algorithm 1. After the perturbation of the experimental dataset, the mean, standard deviation and variance are calculated again. The feature selection algorithm is applied to the perturbed dataset, and the accuracy of the selected features is calculated once more. Finally, selection stability is calculated from the selected features.
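The before-and-after comparison of statistical properties described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the sample values are hypothetical, and population (rather than sample) standard deviation and variance are an assumption, since the paper does not say which was used.

```python
import statistics

def column_statistics(values):
    """Return (mean, standard deviation, variance) of one numeric attribute.

    Population formulas are an assumption; the paper does not specify.
    """
    return (
        statistics.mean(values),
        statistics.pstdev(values),
        statistics.pvariance(values),
    )

def statistics_drift(original, perturbed):
    """Absolute change in (mean, std, variance) caused by perturbation."""
    before = column_statistics(original)
    after = column_statistics(perturbed)
    return tuple(abs(b - a) for b, a in zip(before, after))

# Hypothetical attribute values before and after perturbation.
original = [30.0, 42.0, 38.0, 55.0, 47.0]
perturbed = [31.5, 40.5, 39.0, 54.0, 47.0]

mean_drift, std_drift, var_drift = statistics_drift(original, perturbed)
print(f"mean drift={mean_drift:.4f} std drift={std_drift:.4f} "
      f"variance drift={var_drift:.4f}")
```

Small drift values suggest that the perturbation changed individual values while leaving the overall distribution of the attribute largely intact.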
2.2 Privacy Preserving Algorithm
The proposed privacy preserving algorithm used in the experiments is shown in Algorithm 1. Data can be altered in different ways, including suppression, distortion or perturbation, data swapping, data shuffling, microaggregation, association rule hiding, k-anonymity, rounding or coarsening, and noise addition. The data perturbation techniques used in the algorithm are additive data perturbation, multiplicative data perturbation, rotation data perturbation and the reconstruction tree method.
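As a rough illustration of the first three perturbation techniques named above, the sketch below applies additive, multiplicative and rotation perturbation to small numeric columns. It is a simplified sketch under stated assumptions, not the authors' implementation: the Gaussian noise model, the noise scales, the rotation angle, the function names and the sample values are all hypothetical.

```python
import math
import random

def additive(values, scale, rng):
    """Additive perturbation: add zero-centred Gaussian noise to each value."""
    return [v + rng.gauss(0.0, scale) for v in values]

def multiplicative(values, scale, rng):
    """Multiplicative perturbation: multiply each value by noise centred on 1."""
    return [v * rng.gauss(1.0, scale) for v in values]

def rotate(xs, ys, angle_rad):
    """Rotation perturbation: rotate each (x, y) attribute pair about the origin."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    rotated_x = [c * x - s * y for x, y in zip(xs, ys)]
    rotated_y = [s * x + c * y for x, y in zip(xs, ys)]
    return rotated_x, rotated_y

rng = random.Random(42)                      # fixed seed for repeatability
age = [23.0, 35.0, 47.0, 51.0]               # hypothetical quasi identifier
income = [30.0, 42.0, 55.0, 61.0]            # hypothetical sensitive attribute

noisy_age = additive(age, 1.0, rng)
scaled_income = multiplicative(income, 0.05, rng)
rot_age, rot_income = rotate(age, income, math.radians(30))
```

Rotation changes every individual value but preserves the distance of each record from the origin and the distances between records, which is one reason it tends to disturb distance-based mining results less than independent noise.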
Input: Microdata Table T
Output: Privacy Preserved Table T*
Step 1.
1. For each sensitive attribute Cj in K, where 1 ≤ j ≤ d do
2. Select the noise term ej in N for the sensitive attribute Cj
3. The j-th operation opj ← {Add}
Step 2.
1. For each Xi ∈ K do
2. For each aj in Xi = (a1, ..., ad), where aj is the observation of the j-th attribute do
3. a'j ← Transform(aj, opj, ej)
4. End
Step 3.
1. For each sensitive attribute Cj in K, where 1 ≤ j ≤ d do
2. Select the noise term ej in N for the sensitive attribute Cj
3. The j-th operation opj ← {Mult}
Step 4.
1. For each Xi ∈ K do
2. For each aj in Xi = (a1, ..., ad), where aj is the observation of the j-th attribute do
3. a'j ← Transform(aj, opj, ej)
4. End
Step 5.
1. For each sensitive attribute Cj in K, where 1 ≤ j ≤ d do
2. Select the noise term ej in N for the sensitive attribute Cj
3. The j-th operation opj ← {Rotation}
Step 6.
1. For each Xi ∈ K do
2. For each aj in Xi = (a1, ..., ad), where aj is the observation of the j-th attribute do
3. a'j ← Transform(aj, opj, ej)
4. End
Step 7.
1. if (dataset.length > 3)
2. for w = 0 to dataset.length − 1 do
3. wage = wage + Double.parseDouble(dataset[w][1])
4. end for
5. root = wage / dataset.length
6. left = new ArrayList<String>()
7. right = new ArrayList<String>()
8. for d = 0 to dataset.length − 1 do
9. if (Double.parseDouble(dataset[d][1]) > root)
10. tmp = dataset[d][0] + dataset[d][1] + dataset[d][2]
11. left.add(tmp)
12. else
13. tmp = dataset[d][0] + dataset[d][1] + dataset[d][2]
14. right.add(tmp)
15. end for
16. conVar = mkStringArray(left)
17. divAndCon(conVar)
18. conVar = mkStringArray(right)
19. divAndCon(conVar)
Algorithm 1. Proposed privacy preserving algorithm
2.3 Feature Selection Algorithms
Feature selection generally follows one of three approaches: filter, wrapper and embedded. The filter method removes features according to some criterion or measure, and the goodness of a feature is assessed using intrinsic or statistical properties of the dataset. A feature is chosen for a data mining or machine learning application after it has been judged the most suitable on the basis of these properties. In the wrapper method, a subset of features is generated and the goodness of the subset is then evaluated using a classifier. In this method the features are ranked according to their value to the classifier, and a feature is chosen for the given application on the basis of this rank. The embedded method tries to combine the benefits of the filter and wrapper methods. The main idea behind these algorithms is to use a filter approach to reduce the search space for a wrapper approach.
2.4 Information Gain IG
Entropy is a measure of the impurity of a training set S. Information gain is the additional information about Y provided by X, that is, the amount by which the entropy of Y decreases when X is known [17]. This measure is given in (1).
$$IG = H(Y) - H(Y \mid X) = H(X) - H(X \mid Y) \tag{1}$$
IG is a symmetrical measure: the information gained about Y after observing X is equal to the information gained about X after observing Y. IG is biased towards features that take many values, even when they are not informative. The information gain with respect to the class is computed from the values of the evaluated attribute. IG assesses the dependence between the class label and the feature as the difference between the entropy of the class and the conditional entropy of the class given the feature, as shown in (2).
$$IG(\text{Class}, \text{Attribute}) = H(\text{Class}) - H(\text{Class} \mid \text{Attribute}) \tag{2}$$
2.5 Correlation-based Feature Selection CFS
CFS evaluates subsets of attributes by considering the individual predictive ability of each feature together with the degree of redundancy among them. Feature subsets whose features are highly correlated with the class yet have low inter-correlation with one another are favoured [5]. Search strategies such as genetic search, best-first search, backward elimination, forward selection and bi-directional search can be combined with CFS to find the best feature subset, whose merit is given in (3).
$$r_{zc} = \frac{k\,\overline{r_{zi}}}{\sqrt{k + k(k-1)\,\overline{r_{ii}}}} \tag{3}$$
Here $r_{zc}$ is the correlation between the summed subset features and the class variable, $k$ is the number of features in the subset, $\overline{r_{zi}}$ is the average of the correlations between the subset features and the class variable, and $\overline{r_{ii}}$ is the average inter-correlation among the subset features [5].
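The effect of redundancy on the CFS merit can be seen in a short sketch. It assumes the standard form of the merit, k·r_zi / sqrt(k + k(k − 1)·r_ii), with r_zi the average feature-class correlation and r_ii the average feature-feature correlation; the function name and the example correlations are hypothetical.

```python
import math

def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """CFS merit of a subset of k features (standard form, assumed here)."""
    numerator = k * mean_feature_class_corr
    denominator = math.sqrt(k + k * (k - 1) * mean_feature_feature_corr)
    return numerator / denominator

# Four features, each moderately correlated with the class (0.5).
uncorrelated = cfs_merit(4, 0.5, 0.0)   # features carry independent information
redundant = cfs_merit(4, 0.5, 1.0)      # features duplicate one another
single = cfs_merit(1, 0.5, 0.0)         # one such feature on its own

print(uncorrelated, redundant, single)
```

With fully redundant features the merit of the four-feature subset falls back to that of a single feature, which is why CFS favours subsets with low inter-correlation.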
3. SELECTION STABILITY MEASURES
3.1 Kuncheva Index KI
In the vast majority of stability measures, some overlap between two feature subsets arises purely by chance. The larger the cardinality of the selected feature lists, the greater the chance of overlap. To overcome this drawback, the Kuncheva Index KI was proposed in [18]; it includes a correction term that discounts intersection by chance. KI is the principal measure that satisfies all the requirements set out in [18], i.e., monotonicity, limits and correction for chance. The correction for chance term is what makes KI attractive. Unlike other measures, the stability value in KI is not affected by a larger cardinality.
$$KI(\mathcal{F}'_1, \mathcal{F}'_2) = \frac{|\mathcal{F}'_1 \cap \mathcal{F}'_2| \cdot m - k^2}{k\,(m - k)} \tag{4}$$
In (4), Ƒ'1 and Ƒ'2 are the feature subsets selected in subsequent iterations of the feature selection algorithm, k is the number of features in each subset and m is the total number of features in the experimental dataset. KI is bounded in the range [−1, 1]. It reaches −1 when there is no intersection between the two subsets and k = m/2. KI reaches 1 when the cardinality of the intersection equals k, i.e., Ƒ'1 and Ƒ'2 are identical. KI is close to zero for independently drawn feature subsets.
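A minimal sketch of the index, assuming the form KI = (r·m − k²) / (k(m − k)) with r the size of the intersection of the two subsets, is given below. The function name and the example subsets are hypothetical.

```python
def kuncheva_index(subset_a, subset_b, m):
    """Kuncheva Index of two equal-sized feature subsets drawn from m features."""
    a, b = set(subset_a), set(subset_b)
    k = len(a)
    if len(b) != k:
        raise ValueError("both subsets must contain the same number of features")
    if not 0 < k < m:
        raise ValueError("k must satisfy 0 < k < m")
    r = len(a & b)                       # features selected in both runs
    return (r * m - k * k) / (k * (m - k))

identical = kuncheva_index({1, 2, 3}, {1, 2, 3}, m=10)
disjoint_half = kuncheva_index({1, 2}, {3, 4}, m=4)     # k = m/2, no overlap
partial = kuncheva_index({1, 2, 3}, {2, 3, 4}, m=10)

print(identical, disjoint_half, partial)
```

The first two calls reproduce the limits of the index: identical subsets give 1, and disjoint subsets with k = m/2 give −1.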
4. EXPERIMENTAL RESULTS
The two datasets used in the experiments are the Census-Income (KDD) dataset and the Insurance Company Benchmark (COIL 2000) dataset. The datasets were obtained from the KEEL dataset repository [19]. Table 1 shows the characteristics of the datasets. The Census dataset has both categorical and numeric values, whereas the Coil 2000 dataset has only numeric values.
Table 1. Characteristics of the datasets Census and Coil 2000
S. No.   Characteristic    Census                    Coil 2000
1        Type              Classification            Classification
2        Origin            Real World                Real World
3        Instances         142521                    9822
4        Features          41                        85
5        Classes           3                         2
6        Missing Values    Yes                       No
7        Attribute Type    Numerical, Categorical    Numerical
The ranked attributes are obtained by assessing the significance of each attribute, measured as its information gain with respect to the class. This is done with the feature selection algorithm Information Gain IG. From the ranked attributes, the quasi identifiers are identified and selected for privacy preserving perturbation. The quasi identifiers and sensitive attributes are perturbed using the privacy preserving algorithm shown in Algorithm 1. Every domain value of the selected attributes is changed for 100% privacy preservation, so an intruder or malicious data miner, even with substantial background knowledge, cannot be certain about the accuracy of a re-identification.
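Ranking attributes by information gain with respect to the class, as described above, can be sketched as follows. The sketch assumes discrete attribute values (numeric attributes would first need to be discretised) and uses IG(Class, Attribute) = H(Class) − H(Class | Attribute); the function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(attribute_values, class_labels):
    """H(Class) - H(Class | Attribute) for one discrete attribute."""
    total = len(class_labels)
    groups = {}
    for value, label in zip(attribute_values, class_labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(class_labels) - conditional

def rank_attributes(columns, class_labels):
    """Return (attribute, gain) pairs sorted from most to least informative."""
    gains = {name: information_gain(values, class_labels)
             for name, values in columns.items()}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)

# Toy data: 'age_band' determines the class, 'sex' is unrelated to it.
columns = {
    "sex": ["m", "f", "m", "f"],
    "age_band": ["young", "young", "old", "old"],
}
class_labels = ["yes", "yes", "no", "no"]
ranked = rank_attributes(columns, class_labels)
print(ranked)
```

In this toy example the attribute that fully determines the class is ranked first, and the attribute unrelated to the class receives a gain of zero.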
The feature selection algorithm CFS is applied to select attributes from both the original and the privacy preserved datasets, and the search method used in the experiment is BestFirst. CFS is a filter-based algorithm, so it does not involve any classifier in the selection process. Overfitting is reduced by using 10-fold cross validation.
BestFirst searches the space of attribute subsets by greedy hill climbing augmented with a backtracking facility. BestFirst may start with the full set of attributes and search backward, start with the empty set of attributes and search forward, or start at any point and search in both directions by considering all possible single attribute additions and deletions at a given point. The number of selected features is kept at an optimal value, since selection stability improves up to the optimal number of relevant features and then decreases.
The feature selection stability of the privacy preserved datasets Census and Coil 2000 is computed using the stability measure Kuncheva Index KI, and the result is shown in Fig. 2. KI is used as the stability measure in the experiments because a larger cardinality does not affect the selection stability it reports. Selection stability is inversely related to the variation of the dataset, i.e., perturbation of the training samples. The privacy preserving algorithm produces highly stable feature selection results because the statistical properties of the numerical attributes of the perturbed datasets remain consistent. The Coil 2000 dataset has only numeric attributes, while the Census dataset has both categorical and numerical attributes. The results therefore show that Coil 2000 is more stable than Census because it contains only numeric attributes.
Fig. 2. Feature Selection stability using Kuncheva Index KI for the datasets Census and Coil 2000 after privacy preserving perturbation
Feature selection stability and data utility are positively correlated. Because the feature selection stability results for the privacy preserving algorithm are good, the accuracy of the privacy preserved datasets is nearly the same as before perturbation. The accuracy results are shown in Fig. 3.
Fig. 3. Accuracy for selected features for the datasets Census and Coil 2000 before and after privacy preserving perturbation
Thus the proposed privacy preserving algorithm has been tested on two different experimental datasets for its performance in privacy preservation, feature selection stability and data utility. The experimental results show that applying the algorithm to the test datasets gives stable feature selection with highly consistent accuracy. Table 2 summarises the results of the experiments on the datasets with respect to feature selection stability and accuracy.
Table 2. Summary of feature selection stability and accuracy for datasets Census and Coil 2000
Experimental Results                                    Census    Coil 2000
Feature Selection stability using Kuncheva Index KI     0.89      0.93
Overall accuracy before perturbation                    74.41%    76.84%
Overall accuracy after perturbation                     69.73%    71.62%
Accuracy of selected features before perturbation       79.72%    82.82%
Accuracy of selected features after perturbation        75.42%    77.86%
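The cost of perturbation in Table 2 can be read off as the difference between the accuracy before and after perturbation. The short sketch below performs that arithmetic on the figures reported in the table; the dictionary layout and the function name are illustrative only.

```python
# Accuracy figures (in %) taken from Table 2: (before, after) perturbation.
results = {
    "Census": {"overall": (74.41, 69.73), "selected": (79.72, 75.42)},
    "Coil 2000": {"overall": (76.84, 71.62), "selected": (82.82, 77.86)},
}

def accuracy_drop(before, after):
    """Loss of accuracy in percentage points, rounded to two decimals."""
    return round(before - after, 2)

drops = {
    dataset: {kind: accuracy_drop(*pair) for kind, pair in rows.items()}
    for dataset, rows in results.items()
}

for dataset, rows in drops.items():
    for kind, drop in rows.items():
        print(f"{dataset} ({kind} features): -{drop} percentage points")
```

On these figures the loss lies between roughly 4.3 and 5.2 percentage points for both datasets.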