Discrimination Prevention by Different Measures in Direct Rule Protection Algorithm

International Journal of Emerging Technology and Advanced Engineering

Website: www.ijetae.com (ISSN 2250-2459,ISO 9001:2008 Certified Journal, Volume 5, Issue 9, September 2015)

Manoj Ashok Wakchaure¹, Prof. Dr. Shirish Sane²

¹Research Scholar, KKWIEER, Nashik, India

²Professor & Head, KKWIEER, Nashik, India

Abstract— Data mining plays an important role in knowledge discovery and helps make automated decision making more efficient. Automated decisions may appear fair, but they can still discriminate because the training data itself may be a source of discrimination, i.e. it may be biased towards or against a certain group. People do not want to be discriminated against on the basis of race, gender, religion, nationality, etc. The privacy of an individual is also very important. Data sanitization is one of the privacy-preserving approaches that can be used for discrimination prevention. The method described in [4] modifies the original data so that only discrimination-free rules are extracted by the subsequent mining process. The approach consists of two major steps: discrimination discovery and data transformation. Performance is measured using elift.

Keywords—Data mining, Data sanitization, Discrimination, Data Transformation, Extended lift.

I. INTRODUCTION

The importance of data mining is increasing day by day, as it supports automated classification as well as decision making in many applications. Discrimination is the prejudicial or distinguishing treatment of an individual based on their actual or perceived membership of a certain group or category; it restricts members of one group from privileges or opportunities that are available to another group. This leads to the exclusion of individuals or entities on the basis of irrational decision making. In the classification process, training data is used to derive rules for making decisions, but if the training data itself is biased with respect to a particular group, race, gender or other sensitive attribute, the resulting model will show discrimination. The legal limits on discrimination vary from country to country; for example, Article 15 of the Constitution of India prohibits discrimination on grounds of religion, race, caste, sex, place of birth or any of them. Many such anti-discrimination laws exist, but they are reactive rather than proactive. Technology can add proactiveness to them.

Automated decision-making systems are used in applications such as loan approval or denial, job hiring, insurance, crime detection and many more. It is therefore important to eliminate discrimination from such systems.

II. MOTIVATION

Our aim is to prevent discrimination in data mining that occurs on the basis of age, gender, race, nationality and similar attributes. The historical data used for learning classification rules may itself be biased towards or against a certain group. Publishing such data for public use or for decision support systems results in discriminatory classification rules and discriminatory mining models. One may remove the discriminatory attributes from the original dataset, but that does not solve the problem and may degrade data quality. To avoid such discrimination, the data must be transformed so that it no longer leads to the extraction of discriminatory classification rules or discriminatory mining models. The DRP (Direct Rule Protection) algorithm described in [4] was found to be the most recent algorithm for preventing discrimination, and it uses elift as the measure for evaluating performance. The algorithm exhibits good results, which motivated us to implement the DRP algorithm, which deals with conditional discrimination prevention. For discrimination discovery, it would also be interesting to analyze the performance of the algorithm using, along with elift, other measures such as slift and glift that have been used in other reported research.

III. REVIEW

There are three approaches to discrimination prevention: pre-processing, in-processing and post-processing. Discrimination can be of three types: direct, indirect, or a combination of both, depending on the presence of discriminatory attributes and of other attributes that are strongly related to the discriminatory ones. Depending on which of these types is present, the appropriate discrimination prevention method is applied. In the pre-processing approach the original data is modified, whereas in the in-processing and post-processing approaches the data mining algorithm and the resulting mining model are changed, respectively. Dino Pedreschi, Salvatore Ruggieri and Franco Turini [1] introduced a model for the analysis of and reasoning about discrimination in Decision Support Systems (DSS), which helps DSS owners and control authorities in the process of discrimination analysis.

A. Discrimination Prevention by Pre-processing Method

In this method, the original dataset is modified so that it does not result in discriminatory classification rules. After pre-processing, any data mining algorithm can be applied to obtain the mining model. Kamiran and Calders [2] proposed a method based on data massaging, in which the class labels of some records in the dataset are changed. Because this method is intrusive, the concept of preferential sampling was introduced, in which the distribution of objects in a given dataset is changed to make it non-discriminatory [2]. It is based on the idea that, "Data objects that are close to the decision boundary are more vulnerable to be victim of discrimination." This method uses a ranking function and does not need to change the class labels. It first divides the data into four groups, DP, DN, PP and PN, where the first letter (D or P) indicates the deprived or privileged group and the second letter (P or N) indicates the positive or negative class label. The ranking function then sorts the data in ascending order with respect to the positive class label. Finally, the sample size of each group is changed to make the data bias free. Sara Hajian and Josep Domingo-Ferrer [4] proposed another pre-processing method that removes direct and indirect discrimination from the original dataset. It employs elift as the discrimination measure and has been used to prevent discrimination in crime and intrusion detection systems [5]. Pre-processing methods are useful in applications where data mining is to be performed by a third party and the data needs to be published for public use [4].
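The grouping step of preferential sampling can be illustrated with a short sketch. It is a minimal illustration under stated assumptions, not the implementation from [2]: the record layout, the function name `group_sizes` and the attribute values are hypothetical. It partitions records into DP, DN, PP and PN and computes, for each group, the size it would have if the sensitive attribute and the class label were statistically independent, which is one plausible target for the resampling step.

```python
from collections import Counter

def group_sizes(records, deprived_value="female", positive_label="yes"):
    """Return (actual, expected) sizes of the DP, DN, PP and PN groups.

    records: list of (sensitive_value, class_label) pairs (hypothetical layout).
    Expected size assumes independence: |sensitive group| * |class| / N.
    """
    n = len(records)
    actual = Counter()
    for sensitive, label in records:
        s = "D" if sensitive == deprived_value else "P"   # deprived / privileged
        c = "P" if label == positive_label else "N"       # positive / negative
        actual[s + c] += 1
    expected = {}
    for s in "DP":
        for c in "PN":
            s_total = actual[s + "P"] + actual[s + "N"]   # size of sensitive group
            c_total = actual["D" + c] + actual["P" + c]   # size of class
            expected[s + c] = s_total * c_total / n
    return dict(actual), expected

records = ([("female", "yes")] * 2 + [("female", "no")] * 3
           + [("male", "yes")] * 4 + [("male", "no")] * 1)
actual, expected = group_sizes(records)
```

In this toy data the DP group holds 2 records against an expected 3, so a sampling scheme of this kind would add DP (and PN) records and drop DN (and PP) records, choosing which ones with the ranking function.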

B. Discrimination Prevention by In-processing Method

In-processing methods change the mining algorithm instead of modifying the original dataset. Faisal Kamiran, Toon Calders and Mykola Pechenizkiy [3] introduced such an in-processing method based on decision trees. The approach consists of two techniques for the decision tree construction process: dependency-aware tree construction and leaf relabeling. The first technique focuses on the splitting criterion in order to build a discrimination-aware decision tree. To do so, it first calculates the information gain with respect to the class and to the sensitive attribute, denoted IGC and IGS respectively. There are three alternative criteria for determining the best split, each combining the two gains with a different mathematical operation: (i) IGC-IGS, (ii) IGC/IGS and (iii) IGC+IGS. The second technique post-processes the decision tree with discrimination-aware pruning and relabels the tree leaves [3]. Unfortunately, in-processing methods require special-purpose data mining algorithms.
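The three split criteria can be made concrete with a small sketch. It assumes the standard entropy-based definition of information gain; the function names and the handling of a zero IGS in the ratio criterion are assumptions of this sketch, not details taken from [3].

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a list of categorical values."""
    n = len(values)
    return -sum((k / n) * math.log2(k / n) for k in Counter(values).values())

def info_gain(split, target):
    """Information gain about `target` obtained by partitioning on `split`."""
    n = len(target)
    parts = {}
    for s, t in zip(split, target):
        parts.setdefault(s, []).append(t)
    remainder = sum(len(p) / n * entropy(p) for p in parts.values())
    return entropy(target) - remainder

def split_scores(split, class_labels, sensitive):
    """Score a candidate split under the three criteria."""
    igc = info_gain(split, class_labels)   # gain with respect to the class
    igs = info_gain(split, sensitive)      # gain with respect to the sensitive attribute
    return {
        "IGC-IGS": igc - igs,
        # Assumption: a split that reveals nothing about the sensitive
        # attribute (IGS == 0) gets an infinite ratio.
        "IGC/IGS": igc / igs if igs > 0 else math.inf,
        "IGC+IGS": igc + igs,
    }

scores = split_scores(["a", "a", "b", "b"],
                      ["yes", "yes", "no", "no"],
                      ["m", "f", "m", "f"])
```

Here the candidate split separates the classes perfectly (IGC = 1) while carrying no information about the sensitive attribute (IGS = 0), which is the kind of split the IGC-IGS criterion favours.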

IV. STATEMENT OF SCOPE

In an automated decision-making system, decisions are made on the basis of historical data, i.e. a classifier is learned from training data. If that training data favours or disfavours a certain group, the resulting mining model will contain discriminatory rules. If the data mining task on such data is to be performed by a third party or external entity, or if the data needs to be made available for public use, the data must be discrimination free. The simple solution would be to remove all discriminatory items from the original data, but this may degrade data quality and removes only direct discrimination. It cannot remove indirect discrimination, because a non-discriminatory item may be strongly correlated with a discriminatory item. For example, if discrimination occurs on the basis of race, removing the race attribute from the original dataset does not guarantee that discrimination has been prevented: another, non-discriminatory attribute such as pin code, which can be obtained from publicly available data, may reveal that a particular area is populated mainly by a particular race. One must also evaluate the impact of the discrimination measure on the number of discriminatory rules and on the quality of the transformed data relative to the original dataset. This project focuses on direct discrimination prevention on categorical input data and on comparing different discrimination measures.

V. SOFTWARE CONTEXT

Data mining has been emerging in recent years and helps in automated decision making. This system can be used in applications where people want discrimination-free decisions and where data needs to be published for public use.

VI. MAJOR CONSTRAINTS

To achieve the goal of discrimination prevention, the input data must be categorical, because the Apriori algorithm works on categorical data only. In addition, the discriminatory items must be specified by the user.

VII. USE CASE DIAGRAM

The following UML diagram shows the use case view of the system.

Figure 1.1: Use Case Diagram

Figure 1.1 shows the use case diagram, in which there are two actors, User and Dataset, and a number of use cases. The pre-process input data use case produces the pre-processed input data. The frequent classification rule use case represents the classification rules extracted from the input data. The discrimination discovery use case represents the discriminatory classification rules obtained using elift, slift, glift, etc. The data transformation use case performs the transformation on the given input data. Finally, the results obtained using elift, glift and slift are compared to evaluate their performance.

VIII. DATA MODEL DESCRIPTION

This section describes information domain for the software.

A. Data Description

It gives the processes, data stores, data flows, and source and destination entities.

B. Data objects and Relationships

Data objects, their major attributes and the relationships among data objects are described using an ERD-like form.

C. Level 0 Data Flow Diagram:

Figure 1.2: Level 0 Data Flow Diagram

D. Level 1 Data Flow Diagram:

Level 1 data flow diagram gives a detailed view of the flow of data in the discrimination prevention system. The processes shown in figure 1.3 are pre-processing, frequent classification rules, discriminatory rules, data transformation and finally the transformed dataset.

Figure 1.3: Level 1 Data Flow Diagram

IX. FUNCTIONAL MODEL AND DESCRIPTION

A. Flow Model:

B. Description

Initially, the given dataset is pre-processed to convert numerical data into categorical data. The pre-processed data is used to extract the frequent classification rules. The next step is to obtain the discriminatory rules using elift, glift and slift separately, together with a threshold value (the discrimination threshold). Data transformation is then performed on the original dataset to make it discrimination free, which yields the transformed dataset as output. Finally, the results obtained using the different discrimination measures are evaluated. The system provides discrimination-free classification rules and determines which of the discrimination measures among elift, slift and glift is best suited for discrimination-free classification.
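The discriminatory-rule step can be sketched as follows. The sketch assumes the definitions of the measures as they are commonly stated in the cited literature for a potentially discriminatory rule A, B -> C, where A is the discriminatory itemset and B the context: elift = conf(A, B -> C) / conf(B -> C), slift = conf(A, B -> C) / conf(not A, B -> C), and glift = β/γ if β >= γ and (1-β)/(1-γ) otherwise, with β = conf(A, B -> C) and γ = conf(B -> C). The function names, the item encoding and the treatment of undefined ratios are assumptions of this sketch.

```python
import math

def count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = frozenset(items)
    return sum(1 for t in transactions if items <= t)

def ratio(numerator, denominator):
    # Assumption of this sketch: an undefined ratio is reported as infinity.
    return numerator / denominator if denominator else math.inf

def lift_measures(transactions, a, b, c):
    """Return elift, slift and glift for the PD rule A, B -> C."""
    a, b, c = frozenset(a), frozenset(b), frozenset(c)
    n_b, n_bc = count(transactions, b), count(transactions, b | c)
    n_ab, n_abc = count(transactions, a | b), count(transactions, a | b | c)
    conf_abc = ratio(n_abc, n_ab)                  # conf(A, B -> C)
    conf_bc = ratio(n_bc, n_b)                     # conf(B -> C)
    conf_not_a = ratio(n_bc - n_abc, n_b - n_ab)   # conf(not A, B -> C)
    if conf_abc >= conf_bc:
        glift = ratio(conf_abc, conf_bc)
    else:
        glift = ratio(1 - conf_abc, 1 - conf_bc)
    return {"elift": ratio(conf_abc, conf_bc),
            "slift": ratio(conf_abc, conf_not_a),
            "glift": glift}

def is_discriminatory(measure_value, alpha):
    """A rule is flagged when its measure reaches the discrimination threshold."""
    return measure_value >= alpha

female, male = "gender=female", "gender=male"
city, deny, grant = "city=NYC", "credit=deny", "credit=grant"
transactions = ([frozenset({female, city, deny})] * 3
                + [frozenset({female, city, grant})] * 1
                + [frozenset({male, city, deny})] * 2
                + [frozenset({male, city, grant})] * 4)
measures = lift_measures(transactions, {female}, {city}, {deny})
flagged = is_discriminatory(measures["elift"], alpha=1.2)
```

In this toy data conf(A, B -> C) is 0.75 and conf(B -> C) is 0.5, so elift and glift are both 1.5 while slift is about 2.25. The same rule can therefore pass or fail a given threshold depending on the measure, which is why the measures are compared separately.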

C. Design Constraint:

To implement the discrimination prevention method, the input dataset must contain categorical data only. Numerical data is discretized to make it categorical.
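A minimal sketch of such a discretization step is given below. The cut points, the labels and the function name `discretize` are hypothetical; the text does not state which partitioning scheme the system uses.

```python
from bisect import bisect_right

def discretize(values, cut_points, labels):
    """Map each numeric value to a categorical label.

    cut_points: sorted boundaries; a value equal to a boundary falls
    into the upper interval. labels: one name per interval, so
    len(labels) must be len(cut_points) + 1.
    """
    if len(labels) != len(cut_points) + 1:
        raise ValueError("need exactly one label per interval")
    return [labels[bisect_right(cut_points, v)] for v in values]

ages = [19, 30, 45, 62]
age_groups = discretize(ages, [30, 50], ["young", "adult", "senior"])
# age_groups == ["young", "adult", "adult", "senior"]
```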

D. Software Interface Description:

The system runs on a single machine; no external machine interface is required.

X. RESTRICTIONS, LIMITATIONS AND CONSTRAINTS

The system assumes that the given dataset contains categorical data only; numerical data is converted into categorical data by discretization. The system modifies the input data to prevent direct discrimination, whereas indirect discrimination is not taken into consideration.

XI. VALIDATION CRITERIA

A. Description:

The system uses different classes for different purposes. The user class performs all user-related operations. The calculation class calculates the discrimination measures, i.e. elift, glift, slift, etc. The user interface and result classes perform GUI-related operations and the evaluation of the discrimination prevention results.

B. Estimation Software Response:

i. Estimated cost:

The system requires hardly any financial investment, as it is developed using open source software that is partly hardware independent.

ii. Performance bounds:

Performance is measured using the DDPD and DDPP parameters.
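The two parameters can be sketched as follows, assuming the definitions usually attributed to [4]: DDPD (direct discrimination prevention degree) is the fraction of discriminatory rules of the original dataset that no longer appear as discriminatory after transformation, and DDPP (direct discrimination protection preservation) is the fraction of protective (non-discriminatory) rules of the original dataset that remain protective. The function names and rule identifiers are hypothetical.

```python
def ddpd(discriminatory_before, discriminatory_after):
    """(|MR| - |MR'|) / |MR| for discriminatory rule sets MR and MR'."""
    before, after = set(discriminatory_before), set(discriminatory_after)
    if not before:
        raise ValueError("no discriminatory rules in the original dataset")
    return (len(before) - len(after)) / len(before)

def ddpp(protective_before, protective_after):
    """|PR intersect PR'| / |PR| for protective rule sets PR and PR'."""
    before, after = set(protective_before), set(protective_after)
    if not before:
        raise ValueError("no protective rules in the original dataset")
    return len(before & after) / len(before)

prevention = ddpd({"r1", "r2", "r3", "r4"}, {"r4"})
preservation = ddpp({"p1", "p2", "p3", "p4", "p5"},
                    {"p1", "p2", "p3", "p4", "r1"})
# prevention == 0.75, preservation == 0.8
```

Both values lie between 0 and 1 in this example, and higher is better: a DDPD of 1 would mean that every discriminatory rule was removed, and a DDPP of 1 that every protective rule survived the transformation.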

XII. DESCRIPTION FOR COMPONENT

A. Processing narrative for component:

1. Pre-processing - Takes a dataset containing records as input, performs the pre-processing operations and provides its output as input to the Apriori algorithm.

2. Apriori - Runs the Apriori algorithm on the dataset obtained after pre-processing. It consists of candidate set generation and pruning steps to derive the frequent itemsets.

4. Elift - Uses only PD rules and calculates the elift for them.

5. Glift - Uses only PD rules and calculates the glift for them.

6. Slift - Uses only PD rules and calculates the slift for them.

7. Discriminatory rules - Uses the discrimination threshold and the discrimination measures elift, glift and slift separately to derive the discriminatory rules.

8. Dataset
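The candidate generation and pruning steps of the Apriori component can be sketched as below. This is a compact textbook-style version written for illustration, not the implementation used in the system; the function name, the item encoding and the relative `min_support` parameter are assumptions.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    min_count = min_support * len(transactions)

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: n for s, n in counts.items() if n >= min_count}
    frequent = dict(current)

    k = 2
    while current:
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Pruning: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: n for s, n in counts.items() if n >= min_count}
        frequent.update(current)
        k += 1
    return frequent

data = [
    {"gender=female", "city=NYC", "credit=deny"},
    {"gender=female", "city=NYC"},
    {"gender=female", "credit=deny"},
    {"city=NYC", "credit=deny"},
]
frequent = apriori(data, min_support=0.5)
```

With a minimum support of 0.5, all three single items and all three pairs are frequent in this toy data, while the full triple occurs only once and is discarded. Classification rules would then be formed from frequent itemsets that contain a class item.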

B. The main functions are given below:

1. Pre-processing input data

• Input - Dataset containing numerical and categorical attributes, and the partitioning value for discretization

• Output - Pre-processed dataset containing categorical data only

2. Generate classification rules

• Input - Pre-processed dataset

• Output - Frequent classification rules extracted from the given dataset using the Apriori algorithm

3. Find PD rules

• Input - Frequent classification rules

• Output - Potentially discriminatory (PD) classification rules

4. Calculate elift

• Input - Potentially discriminatory classification rules

• Output - The value of elift for all PD rules

5. Calculate slift

• Output - The value of slift for all PD rules

6. Calculate glift

• Output - The value of glift for all PD rules

7. Find discriminatory rules

• Input - PD rules along with their values of elift, slift and glift, and the threshold value

• Output - Modified dataset
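Function 3 above can be illustrated with a short sketch. It assumes the usual convention that a classification rule is potentially discriminatory (PD) when its premise contains at least one of the user-specified discriminatory items, and potentially non-discriminatory otherwise. The rule representation and the function name are hypothetical.

```python
def split_pd_rules(rules, discriminatory_items):
    """Split (premise, conclusion) rules into PD and non-PD lists."""
    discriminatory_items = frozenset(discriminatory_items)
    pd_rules, pnd_rules = [], []
    for premise, conclusion in rules:
        # A rule is PD if its premise shares an item with the discriminatory set.
        if discriminatory_items & frozenset(premise):
            pd_rules.append((premise, conclusion))
        else:
            pnd_rules.append((premise, conclusion))
    return pd_rules, pnd_rules

rules = [
    (("gender=female", "city=NYC"), "credit=deny"),
    (("city=NYC",), "credit=deny"),
]
pd_rules, pnd_rules = split_pd_rules(rules, {"gender=female"})
```

Only the first rule is passed on to the elift, slift and glift calculations, since the second one does not mention a discriminatory item.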

REFERENCES

[1] D. Pedreschi, S. Ruggieri, and F. Turini, "Integrating Induction and Deduction for Finding Evidence of Discrimination," Proc. 12th ACM Int’l Conf. Artificial Intelligence and Law (ICAIL ’09), pp. 157-166, 2009.

[2] F. Kamiran and T. Calders, "Classification with no Discrimination by Preferential Sampling," Proc. 19th Machine Learning Conf. Belgium and The Netherlands, 2010.

[3] F. Kamiran, T. Calders, and M. Pechenizkiy, "Discrimination Aware Decision Tree Learning," Proc. IEEE Int’l Conf. Data Mining (ICDM ’10), pp. 869-874, 2010.

[4] S. Hajian and J. Domingo-Ferrer, "A Methodology for Direct and Indirect Discrimination Prevention in Data Mining," IEEE Trans. Knowledge and Data Engineering, vol. 25, no. 7, pp. 1445-1459, July 2013.

[5] S. Hajian, J. Domingo-Ferrer, and A. Martínez-Ballesté, "Discrimination Prevention in Data Mining for Intrusion and Crime Detection," Proc. IEEE Symp. Computational Intelligence in Cyber Security (CICS ’11), pp. 47-54, 2011.

[6] S. Hajian, A. Monreale, D. Pedreschi, J. Domingo-Ferrer, and F. Giannotti, "Injecting Discrimination and Privacy Awareness into Pattern Discovery," Proc. 2012 IEEE 12th Int’l Conf. Data Mining Workshops, pp. 360-369, 2012.

[7] S. Ruggieri, D. Pedreschi, and F. Turini, "DCUBE: Discrimination Discovery in Databases," Proc. ACM Int’l Conf. Management of Data (SIGMOD ’10), pp. 1127-1130, 2010.

[8] T. Calders and S. Verwer, "Three Naive Bayes Approaches for Discrimination-Free Classification," Data Mining and Knowledge Discovery, vol. 21, no. 2, pp. 277-292, 2010.

AUTHOR’S PROFILE

Manoj Ashok Wakchaure is currently pursuing his PhD at KK Wagh Institute of Engineering Education and Research, Nashik. His area of research is data mining.

Dr. Shirish S. Sane is currently working as the Head, Department of Computer Engineering, and Vice-Principal of KK Wagh Institute of Engineering Education and Research, Nashik. He has more than 27 years of teaching experience. His area of research is data mining, Database