Investigative Data Mining in Fraud Detection


Chun Wei Clifton Phua

BBusSys

A Thesis

submitted in partial fulfilment of the requirements for the

Degree of Bachelor of Business Systems (Honours)

School of Business Systems

Monash University

November

2003


Chun Wei Clifton Phua ©2003


Declaration

I, Chun Wei Clifton Phua, declare that this thesis contains no material that has been accepted for the award of any other degree or diploma in any university or other institution. To the best of my knowledge and belief, this thesis contains no material previously published or written by another person, except where due reference is made in the text.


Acknowledgements

My supervisor, Dr. Damminda Alahakoon, has been very encouraging and helpful. I believe that he has more patience than most people I know. My deepest appreciation to him for ensuring that resources were available for me, and for consistently giving me sound, valuable advice. The School of Business Systems, Faculty of Information Technology and Monash University have given me excellent financial aid, opportunities and resources. In particular, I want to express my utmost gratitude to Dr. Dineli Mather. Through her recommendation, I obtained special departmental help before commencing my honours year. Without her, this thesis would never have started. Other departmental staff members, especially Assoc. Prof. Kate Smith, Dr. Leonid Churilov, Dr. Ai Cheo Yeo, and Ms. Fay Maglen, have extended important help in my honours year.

My appreciation goes to both the Automated Learning Group (ALG) at the National Center for Supercomputing Applications (NCSA) and the Angoss Software Corporation for providing their Data to Knowledge™ (D2K) and KnowledgeSeeker™ IV tools for free. I also extend my appreciation to my reliable friends, Ellyne, Edwin, Sandy, and others who gave careful and useful comments on the thesis drafts.

To my excellent study partner and confidante, Sheila, I offer my heartfelt thanks for always sharing my ups and downs since my first day at Monash. To the Monash International Student Navigators, Kean, Kar Wai, Hanny and many others, I am grateful for all their cherished friendship and kindness. To close friends from Monash College, Kenneth, Lynnette, Kelly and others, I often looked forward to our occasional feasts. To my fellow tutors, Mary, Nic, Prasanna and others, I had the pleasure of working with and learning from all of them. To my honours gang, Ronen, Cyrus, Nelson, and others, I truly enjoyed our interesting “discussions” and nights of “work” together. I wish every person mentioned here happiness.


To my brother, Yat, I am delighted to be the beneficiary of his wonderful gifts over these years and have to commend him for spending more than five hours trying to “decipher” this dissertation.

My uncle, Long, and aunties, Patricia and Elsie, welcomed me into their homes, where I stayed for about three years. I am indebted to them. Without them, living in Melbourne would have been financially impossible. My parents made many personal sacrifices for me to get an overseas education but have never put any kind of pressure on me. At this point, I hope that enough has been done in my studies to make them feel proud, not only of me, but also of themselves.


Abstract

The purpose of this dissertation is to determine the most appropriate data mining methodology, methods, techniques and tools to extract knowledge or insights from enormous amounts of data to detect white-collar crime. Fraud detection in automobile insurance is used as the application domain.

The focus is on overcoming the technical problems and alleviating the practical problems of data mining in fraud detection. The technical obstacles are due to imperfect, highly skewed data and hard-to-interpret predictions. The practical barriers are caused by the dearth of domain knowledge, the many evolving fraud patterns, and the weaknesses in some evaluation metrics.

The problem-solving approach is to integrate database, machine learning, neural network, data visualisation, statistics, and distributed data mining techniques and tools into the crime detection system, which is based on the CRoss-Industry Standard Process for Data Mining (CRISP-DM) methodology. The crime detection method utilises the naive Bayesian, C4.5, and backpropagation learning algorithms and the Self Organising Map; classification and clustering visualisations; t-tests and hypothesis tests; and bagging and stacking to construct predictions and descriptions of fraud.

The results on the automobile insurance data set confirm the effectiveness of the crime detection system. Specifically, the stacking-bagging experiment achieved higher cost savings than the other nine experiments, and the most statistically significant insight is the discovery of 21 to 25 year old fraudsters who use sports cars as their crime tool.

This thesis demonstrates that the crime detection system has the potential to significantly reduce loss from illegitimate behaviour.


Table of Contents

Acknowledgements
Abstract
Table of Contents
List of Tables
List of Figures

CHAPTER 1 INTRODUCTION
1.1 INVESTIGATIVE DATA MINING
1.2 FRAUD DETECTION PROBLEMS
1.3 OBJECTIVES
1.4 SCOPE
1.5 CONTRIBUTIONS
1.6 OUTLINE

CHAPTER 2 BACKGROUND
2.1 EXISTING FRAUD DETECTION METHODS
2.1.1 Insurance Fraud
2.1.2 Credit Card Fraud
2.1.3 Telecommunications Fraud
2.1.4 Analysis of Methods
2.2 THE NEW INVESTIGATIVE DETECTION METHOD
2.2.1 Precogs
2.2.2 Integration Mechanisms
2.2.3 Analytical Machinery
2.2.4 Visual Symbols
2.2.5 Analysis of New Method
2.3 SUPPORTING CRIMINAL DETECTION TECHNIQUES
2.3.1 Bayesian Belief Networks
2.3.2 Decision Trees
2.3.3 Artificial Neural Networks
2.3.4 Analysis of Techniques

CHAPTER 3 THE CRIME DETECTION METHOD
3.1 STEP ONE: CLASSIFIERS AS PRECOGS
3.1.1 Naive Bayesian Classifiers
3.1.2 C4.5 Classifiers
3.1.3 Backpropagation Classifiers
3.1.4 Analysis of Algorithms
3.2 STEP TWO: COMBINING OUTPUT AS INTEGRATION MECHANISMS
3.2.1 Cross Validation
3.2.2 Bagging
3.2.3 Stacking
3.3 STEP THREE: CLUSTER DETECTION AS ANALYTICAL MACHINERY
3.3.1 Self Organising Maps
3.3.2 Analysis of Self-Organisation
3.4 STEP FOUR: VISUALISATION TECHNIQUES AS VISUAL SYMBOLS
3.4.1 Classification Visualisation
3.4.2 Clustering Visualisation
3.5 SUMMARY OF NEW METHOD

CHAPTER 4 IMPLEMENTING THE CRIME DETECTION SYSTEM: PREPARATION
4.1 PHASE ONE: PROBLEM UNDERSTANDING
4.1.1 Determine the Investigation Objectives
4.1.2 Assess the Situation
4.1.3 Determine the Data Mining Objectives
4.1.4 Produce the Project Plan
4.2 PHASE TWO: DATA UNDERSTANDING
4.2.1 Describe the Data
4.2.2 Explore the Data
4.2.3 Verify the Data Quality
4.3 PHASE THREE: DATA PREPARATION
4.3.1 Select the Data
4.3.2 Clean the Data
4.3.3 Format the Data
4.3.4 Construct the Data
4.4 KEY ISSUES IN PREPARATION

CHAPTER 5 IMPLEMENTING THE CRIME DETECTION SYSTEM: ACTION
5.1 PHASE FOUR: MODELLING
5.1.1 Generate the Experiment Design
5.1.2 Build the Models
5.1.3 Assess the Models
5.2 PHASE FIVE: EVALUATION
5.2.1 Evaluate the Results
5.2.2 Review the Process
5.3 PHASE SIX: DEPLOYMENT
5.3.1 Plan the Deployment
5.3.2 Plan the Monitoring and Maintenance

CHAPTER 6 CONCLUSION
6.1 RECOMMENDATIONS
6.1.1 Overcome the Technical Problems
6.1.2 Alleviate the Practical Problems
6.2 FUTURE RESEARCH DIRECTIONS
6.2.1 Credit Application Fraud
6.2.2 Mass Casualty Disaster Management

APPENDICES
REFERENCES

List of Tables

Table 2.1: Features in Fraud Detection Methods
Table 2.2: Strengths and Weaknesses of Data Mining Techniques
Table 3.1: Extracting Rules from a Decision Tree
Table 3.2: Net Input and Output Calculations
Table 3.3: Calculation of the Error at each Neuron
Table 3.4: Calculations for Weight Updating
Table 3.5: Bagging Predictions from Different Algorithms
Table 3.6: Evaluation of Fraud Detection Capability of Algorithms within Clusters
Table 4.1: Costs of Predictions
Table 5.1: The Experiments Plan
Table 5.2: The Tests Plan
Table 5.3: Bagged Success Rates versus Averaged Success Rates
Table 5.4: Best Determining Training Set Distribution for Data Partitions
Table 5.5: Bagging versus Stacking versus Conventional Backpropagation
Table 5.6: Claim Handling Actions and Thresholds
Table 5.7: Rule Induction on Training Data Set
Table 5.8: Ranking of Experiments using Cost Model
Table 5.9: Descriptions of Existing Fraud

List of Figures

Figure 2.1: Predictions using Precogs, Analytical Machinery, and Visual Symbols
Figure 3.1: Steps in the Crime Detection Method
Figure 3.2: Fully Grown Decision Tree
Figure 3.3: Pruned Decision Tree
Figure 3.4: Multilayer Feedforward Neural Network
Figure 3.5: Building and Applying One Classifier using Data Partitions
Figure 3.6: Stacking Classifiers from Examples
Figure 3.7: Stacking Classifiers from Instances
Figure 3.8: The Self Organising Map
Figure 3.9: The Confusion Matrix
Figure 4.1: Phases of the Crime Detection System
Figure 4.2: Claim Trends by Month
Figure 4.3: Proportion of Fraud within the Vehicle Age Category
Figure 4.4: Proportion of Fraud within the Policy Holder Age Category
Figure 4.5: Dividing Data into 50/50 Distribution for Each Partition
Figure 5.1: Prioritisation of Fraud Data Instances in Organisations
Figure 5.2: Attribute Input Strength from the Backpropagation Algorithm
Figure 5.3: Claim Trends by Month for 1996


“Life is change, that is how it differs from the rocks, change is its very nature.”
- John Wyndham, 1955, The Chrysalids

For my parents Chye Twee and Siok Moy


CHAPTER 1

INTRODUCTION

“Ask me when fraud will stop and the answer is never. Fraud has become one of the constants of life.”
- Frank Abagnale, 2001, The Art of the Steal: How to Protect Yourself and Your Business from Fraud

The world is overwhelmed with millions of inexpensive gigabyte disks containing terabytes of data. It is estimated that the data stored in all corporate and government databases worldwide doubles every twenty months. The types of data available, in addition to the size, are also growing at an alarming rate. Some relevant examples can put this situation into perspective:

• In the United States (US), DataBase Technologies (DBT) Online Incorporated contains four billion records used by US law enforcement agencies. The Insurance Services Office Incorporated (ISO) claim search database contains over nine billion US claim records, with over two billion claims records added annually (Converium, 2002).

• In Australia, the Insurance Reference Service (IRS) industry database consists of over thirteen million individual insurance claims records (Converium, 2002). The Law Enforcement Assistance Program (LEAP) database for the Victorian police in Australia details at least fourteen million vehicle, property, address and offender records, with at least half a million criminal offences and incidents added annually (Mickelburough and Kelly, 2003).

• Private firms sell all types of data on individuals and companies, often in the forms of demographic, real estate, utility usage, telecom usage, automobile, credit, criminal, government, and Internet data (Infoglide Software Corporation, 2002).

This results in a data rich but information poor situation where there is a widening gap between the explosive growth of data and its types, and the ability to analyse and interpret it. Hence there is a need for a new generation of automated and intelligent tools and techniques (Fayyad et al, 1996), known as investigative data mining, to look for patterns in data. These patterns can lead to new insights, competitive advantages for business, and tangible benefits for society. This chapter introduces investigative data mining in Section 1.1 and outlines the research problems in Section 1.2. While Section 1.3 covers the objectives, Section 1.4 defines the scope of the research. A summary of the research contributions is provided in Section 1.5. Section 1.6 concludes with the outline of the other chapters in this thesis.

1.1 INVESTIGATIVE DATA MINING

Data mining is the process of discovering, extracting and analysing meaningful patterns, structure, models, and rules from large quantities of data (Berry and Linoff, 2000). The process (see Appendix B) is automatic or semi-automatic, with interactive and iterative steps such as problem and data understanding, data selection, data preprocessing and cleaning, data transformation, incorporation of appropriate domain knowledge to select the data mining task and algorithm, application of data mining algorithm(s), and knowledge interpretation and evaluation. The last step is either the refinement by modifications or the consolidation of discovered knowledge (Fayyad et al, 1996). Discovered patterns, or insights, should be statistically reliable, not known previously, and actionable (Elkan, 2001).

The data mining field spans several research areas (Cabena et al, 1998) with stunning progress over the last decade. Database theories and tools provide the necessary infrastructure to store, access and manipulate data. Artificial intelligence research such as machine learning and neural networks is concerned with inferring models and extracting patterns from data. Data visualisation examines methods to easily convey a summary and interpretation of the information gathered. Statistics is used to support and negate hypotheses on collected data and control the chances and risks that must be considered upon making generalisations. Distributed data mining deals with the problem of learning useful new information from large and inherently distributed databases where multiple models have to be combined.


The most common goal of business data mining applications is to predict customer behaviour. However this can be easily tailored to meet the objective of detecting and preventing criminal activity. It is almost impossible for perpetrators to exist in this modern era without leaving behind a trail of digital transactions in databases and networks (Mena, 2003b).

Therefore, investigative data mining is about systematically examining, in detail, hundreds of possible data attributes from such diverse sources as law enforcement, industry, government, and private data provider databases. It is also about building upon the findings, results and solutions provided by the database, machine learning, neural networks, data visualisation, statistics, and distributed data mining communities, to predict and deter illegitimate activity.

1.2 FRAUD DETECTION PROBLEMS

The Oxford English Dictionary defines fraud as “criminal deception”, “the use of false representations to obtain an unjust advantage”, and “to injure the rights and interests of another”. Abagnale (2001) explains that technological advancements like the Internet make it easier to commit fraud. The same author also states that the situation is exacerbated by little legal deterrence (as courts rarely put fraudsters in jail) and by societal indifference to white collar crime.

Fraud takes many diverse forms, and is extremely costly to society. It can be classified into three main types, namely, against organisations, government and individuals (see Appendix C). This thesis focuses on fraud against organisations. KPMG’s (2002a; 2002b) fraud surveys about Australian and Singaporean organisations detail the severity of scams committed by external parties, internal management, and non-management employees. In the Australian fraud survey, about one third of fraud losses are caused by external parties like customers, service providers and suppliers. Data mining can minimise some of these losses because of the massive collections of customer data. Insurance fraud is shown to be the most critical area, followed by credit card fraud, cheque forgery, and telecommunications fraud (see Appendix C). Within insurance fraud, automobile, travel, household contents and workers’ compensation insurance are more common (Baldock, 1997; O’Donnell, 2002). Recently, automobile insurance fraud alone cost about AUD$32 million for nine insurance companies in Australia (KPMG, 2002a) and SGD$20 million for the industry in Singapore (Tan, 2003).

Fraud detection poses some technical and practical problems for data mining; the most significant technical problem is due to limitations, or poor quality, in the data itself. The data is usually collected as a by-product of other tasks rather than for the purpose of fraud detection. Although one form of data collection standard for fraud detection has recently been introduced (National Association of Insurance Commissioners, 2003), not all data attributes are relevant for producing accurate predictions and some attribute values are likely to have data errors.

Another crucial technical dilemma is due to the highly skewed data in fraud detection. Typically there are many more legitimate than fraudulent examples. This means that by predicting all examples to be legal, a very high success rate is achieved without detecting any fraud. Another negative consequence of skewed data is the higher chances of overfitting the data. Overfitting occurs when models’ high accuracy arises from fitting patterns in the training set that are not statistically reliable and not available in the score set (Elkan, 2001).
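The arithmetic behind this skew trap is easy to demonstrate. The sketch below is illustrative only: the class ratio is an assumed example, not the thesis's actual data.

```python
# Illustrative only: with highly skewed classes, a classifier that predicts
# "legal" for every claim still scores a high success rate while catching
# zero fraud. The 95/5 class ratio below is assumed for the sketch.

def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)

# 950 legitimate claims, 50 fraudulent ones (assumed 5% fraud rate).
labels = ["legal"] * 950 + ["fraud"] * 50

# The trivial classifier: call everything legal.
always_legal = ["legal"] * len(labels)

print(accuracy(always_legal, labels))   # 0.95 -- looks impressive
caught = sum(1 for p, t in zip(always_legal, labels) if p == "fraud" == t)
print(caught)                           # 0 -- but no fraud is detected
```

This is why later chapters rank models by cost savings rather than raw success rate.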

Another major technical problem involves finding the best ways to make predictions more understandable to data analysts.

The most important practical problem is a lack of domain knowledge, or prior knowledge, which reveals information such as the important attributes, the likely relationships and the known patterns. With some of the domain knowledge described in this and the following paragraph, the search time for the data mining process can be reduced. Basically, fraud detection involves discovering three profiles of fraud offenders (see Appendix C), each with constantly evolving modus operandi (Baldock, 1997). Average offenders can be of any gender or socio-economic group and they commit fraud when there is opportunity, sudden temptation, or when suffering from financial hardship. Criminal offenders are usually males and have criminal records. Organised crime offenders are career criminals who are part of organised groups which are prepared to contribute considerable amounts of time, effort and resources to perpetrate major and complex fraud.

Using learning algorithms in data mining to recognise a great variety of fraud scenarios over time is a difficult undertaking. Fraud committed by average offenders is known as soft fraud, which is the hardest to mitigate because the investigative cost for each suspected incident is usually greater than the cost of the fraud itself (National White Collar Crime Center, 2003). Fraud perpetrated by criminal and organised crime offenders is termed hard fraud; it circumvents anti-fraud measures and approximates many legal forms (Sparrow, 2002).
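The economics that make soft fraud hard to mitigate can be pictured with a toy break-even rule. Every figure and the helper name below are invented for illustration and are not drawn from the cited surveys.

```python
# Hypothetical triage rule: investigate a suspected claim only when the
# expected recovery exceeds the fixed cost of investigating it. All figures
# are invented for illustration.

def worth_investigating(fraud_probability, claim_amount, investigation_cost):
    """Expected recovery if fraudulent must beat the cost of checking."""
    return fraud_probability * claim_amount > investigation_cost

# Soft fraud: a small opportunistic claim -- checking costs more than it saves.
print(worth_investigating(0.30, 500, 200))    # False
# Hard fraud: a large organised claim -- well worth an investigator's time.
print(worth_investigating(0.30, 20000, 200))  # True
```

Under this toy rule, many small opportunistic claims are rationally left unchallenged, which is exactly the soft fraud problem described above.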

The next practical problem involves assessing the potential for significant impact of using data mining in fraud detection. Success cannot be defined in terms of predictive accuracy because of the skewed data.

1.3 OBJECTIVES

The main objective of this study is to survey and evaluate methods and techniques to solve the six fraud detection problems outlined in Section 1.2. Classification, cluster detection, and visualisation techniques are carefully examined and selected. These techniques are then hybridised into a method envisioned in the science fiction story Minority Report (Dick, 1956). It has the potential to allow data analysts to better understand the classifiers and their predictions. The method is incorporated, together with a cost model, statistics, and domain knowledge, into a comprehensive data mining system to address the fraud detection problems.


1.4 SCOPE

This thesis builds upon some well-known methods and techniques chosen from insurance, credit card, and telecommunication fraud detection. It concentrates solely on the naive Bayesian, C4.5, and backpropagation algorithms to generate classifiers which make predictions. To simplify the use of ensemble mechanisms on fraud detection classifiers, only the bagging and stacking strategies are applied on the classifier predictions. The Self Organising Map (SOM) and two-dimensional visualisation techniques, such as the confusion matrix and the naive Bayesian visualisation, are chosen to analyse and interpret the data and classifier predictions. The other important outputs include scores and rules. A simple cost model is preferred over a cost-sensitive algorithm. This research demonstrates the crime detection system’s capability in an automobile insurance data set.

In reality, data mining is only part of the solution to fraud detection. Process re-engineering (FairIsaac, 2003b), manual reviews by fraud specialists, interactive rule-based expert systems (Magnify, 2002b), and link analysis (Mena, 2003b) are also essential but beyond the scope of this thesis. Although this research uses the crime detection system for fraud, most of the system and its techniques are general for non-crime and other types of crime data.

1.5 CONTRIBUTIONS

The main thesis contribution is the creation of the new crime detection method to predict and describe criminal patterns from data:

• The innovative use of naive Bayesian, C4.5, and backpropagation classifiers to process the same partitioned numerical data increases the chances of getting better predictions.

• The selection of the best classifiers of different algorithms using stacking and the merger of their predictions using bagging produces better predictions than the single algorithm approach.

• The SOM is introduced as a descriptive tool to understand particular characteristics of fraud within the clusters, and also as an evaluation tool to assess the algorithms’ ability to cope with evolving fraud.

• The visualisation of model predictions and results is expressed with the confusion matrix, naive Bayesian visualisation, column graph, decision tree visualisation, cumulative lift charts, radar graphs, scores, and rules.

The other significant thesis contributions include:

• The introduction of the crime detection system, through modifying the CRISP-DM methodology, which provides a comprehensive and step-by-step approach to preparing for and carrying out data mining.

• The adoption of a cost model which allows the realistic measure of monetary cost and benefit within a data mining project.

• The demonstration of visualisations as an important part of attribute exploration before and after model building.

• The strong emphasis on statistics to objectively evaluate algorithms, classifiers, and cluster descriptions prevents inaccurate predictive accuracy estimates and poor model interpretation.

• The justification for using the score-based feature in fraud detection over the rule-based feature.

• The extensive literature review conducted on existing fraud detection methods, techniques and tools to seek out the best data mining practices. This resulted in the integration of methods and techniques across insurance, credit card, and telecommunications fraud detection.

• The in-depth analysis of naive Bayesian, C4.5, and backpropagation algorithms at the practical and conceptual levels to overcome their specific limitations. The critical issues of overfitting and missing values were addressed.
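As a rough illustration of the kind of cost model referred to above, the following sketch turns confusion-matrix counts into monetary savings. The per-claim amounts and the formula itself are assumptions for the example, not the thesis's actual cost model.

```python
# A minimal monetary cost model sketch: translate confusion-matrix counts
# into savings versus paying every claim. Figures are assumed, not taken
# from the thesis.

def cost_savings(hits, false_alarms, misses, avg_claim=2000, handling=100):
    """Money saved by the detector relative to paying all claims.

    hits         -- fraudulent claims correctly flagged (payout withheld)
    false_alarms -- legal claims flagged (wasted handling cost)
    misses       -- fraudulent claims paid out in full
    """
    saved = hits * (avg_claim - handling)   # recovered, minus review cost
    wasted = false_alarms * handling        # needless investigations
    lost = misses * avg_claim               # undetected fraud still paid
    return saved - wasted - lost

print(cost_savings(hits=40, false_alarms=60, misses=10))   # 50000
```

The point of such a model is that two classifiers with identical accuracy can differ sharply in savings once false alarms and misses carry different prices.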


1.6 OUTLINE

Chapter 2 contains existing fraud detection methods and techniques, the new crime detection method, the recommended three classification techniques, and the relevant statistical concepts.

Chapter 3 focuses on the strengths and limitations of the three classification algorithms, two ensemble mechanisms, the role of self organisation to describe fraud and algorithm performance, and the relevant visualisation techniques. A summary of the new crime detection method is given.

Chapter 4 prepares the data using the crime detection system phases of problem understanding, data understanding, and data preparation. Essential issues in preparation are highlighted.

Chapter 5 applies the crime detection phases of modelling, evaluation, and deployment to the prepared data.

Chapter 6 concludes with the summary of the research, recommendations for the research problems, and possible directions for future research.


CHAPTER 2

BACKGROUND

“…The Precrime System, the prophylactic pre-detection of criminals through the ingenious use of mutant precogs, capable of previewing future events and transferring orally that data to analytical machinery...”
- Philip K. Dick, 1956, The Minority Report

Studies have shown that detecting clusters of crime incidents (Levine, 1999) and finding possible cause/effect relations with association rule mining (Estivill-Castro and Lee, 2001) are important to criminal analysis. Classification techniques have also proven to be highly effective in fraud detection (He et al, 1998; Chan et al, 1999) and can be used to categorise future crime data and to provide a better understanding of present crime data. This chapter describes and evaluates the existing fraud detection methods and techniques in Section 2.1. The need for a new and innovative method for crime detection is introduced and discussed in Section 2.2. The choice of three classification techniques in crime detection is justified in Section 2.3. Finally, Section 2.4 concludes the chapter by specifying the statistical tests that must be used to evaluate the performance of classifiers.

2.1 EXISTING FRAUD DETECTION METHODS

This section concentrates on the analysis of present data mining methods applied specifically to the data-rich areas of insurance, credit card, and telecommunications fraud detection. A brief description of each method and its applications is given and some commercial fraud detection software is examined. The methods are critically evaluated to determine which are more widely used in fraud detection.

2.1.1 Insurance Fraud

Ormerod et al (2003) recommend the use of dynamic real-time Bayesian Belief Networks (BBNs), named the Mass Detection Tool (MDT), for the early detection of potentially fraudulent claims; the MDT's output is then used by a rule generator named the Suspicion Building Tool (SBT). The weights of the BBN are refined by the rule generator’s outcomes, and claim handlers have to keep pace with evolving frauds. This approach evolved from ethnology studies of large insurance companies and loss adjustors who argued against the manual detection of fraud by claim handlers.

The hot spot methodology (Williams and Huang, 1997) applies a three step process: the k-means algorithm for cluster detection, the C4.5 algorithm for decision tree rule induction, and domain knowledge, statistical summaries and visualisation tools for rule evaluation. It has been applied to detect health care fraud by doctors and the public for the Australian Health Insurance Commission. Williams (1999) has expanded the hot spot architecture to use genetic algorithms to generate rules and to allow the domain user, such as a fraud specialist, to explore the rules and to evolve them according to how interesting the discovery is. Brockett et al (1998) presented a similar methodology, utilising the SOM for cluster detection before backpropagation neural networks, in automobile injury claims fraud.
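For readers unfamiliar with the first step of the hot spot process, here is a minimal k-means sketch written from the textbook algorithm. The claim amounts and starting centres are invented, and this is a one-dimensional simplification, not Williams and Huang's implementation.

```python
# A minimal k-means (Lloyd's algorithm) sketch of the hot spot method's
# first step, cluster detection; rule induction and evaluation would then
# run on each cluster. One-dimensional toy data, invented for illustration.

def kmeans(points, centres, rounds=10):
    """Cluster 1-D values (e.g. claim amounts) around the given centres."""
    for _ in range(rounds):
        # Assignment step: attach each point to its nearest centre.
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        # Update step: move each centre to the mean of its cluster.
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres, clusters

# Two obvious groups of claim amounts; start the centres far apart.
amounts = [100, 120, 130, 5000, 5200, 5100]
centres, clusters = kmeans(amounts, centres=[0.0, 6000.0])
print(centres)   # roughly [116.7, 5100.0]
```

In the full methodology, each resulting cluster ("hot spot") is then characterised by induced rules and judged with domain knowledge.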

The use of supervised learning with backpropagation neural networks, followed by unsupervised learning using the SOM to analyse the classification results, is recommended by He et al (1998). Results from clustering show that, out of the four output classification categories used to rate medical practice profiles, only two of the well defined categories are important. Like the hot spot methodology, this innovative approach was applied to instances of the Australian Health Insurance Commission health practitioners’ profiles.

Von Altrock (1995) suggested a fuzzy logic system which incorporates the actual fraud evaluation policy using optimum threshold values. It outputs the degree of likelihood of fraud and gives reasons why an insurance claim is possibly fraudulent. In experiments with one thousand two hundred insurance claims belonging to an anonymous company, the fuzzy logic system predicted marginally better than the experienced auditors.


Cox (1995) proposed another fuzzy logic system which uses two different approaches to mimic the common-sense reasoning of fraud experts: the discovery model and the fuzzy anomaly-detection model. The first uses an unsupervised neural network to learn the natural relationships in the data and to derive significant clusters. A neuro-fuzzy classification system is then used to identify patterns within the clusters. The second uses the Wang-Mendel algorithm to generate a fuzzy model. It was primarily applied to search and explain the reasons for health care providers committing fraud against insurance companies.

The EFD system (Major and Riedinger, 1995) is an expert system in which expert knowledge is integrated with statistical information assessment to identify providers whose behaviour does not fit the norm. It also searches in the power set of heuristics to find better classification rules. EFD has been used to detect insurance fraud in twelve US cities.

FraudFocus Software (Magnify, 2002a) automatically scores all claims and continually rescores them as more are added. Claims are prioritised in descending fraud potential to get relevant attention and descriptive rules are generated for fraudulent claims.

SAS Enterprise Miner Software (SAS e-Intelligence, 2000) depends on association rules, cluster detection and classification techniques to detect fraudulent claims. It compares the expected results with the actual results so that large deviations can be further investigated. One of its successful applications saved four million dollars for an anonymous US health insurer with over four million members.

2.1.2 Credit Card Fraud

The BBN and Artificial Neural Network (ANN) comparison study (Maes et al, 2002) uses the STAGE algorithm for BBNs and the backpropagation algorithm for ANNs in fraud detection. Comparative results show that BBNs were more accurate and much faster to train, but slower when applied to new instances. Real world credit card data was used but the number of instances is unknown.

The distributed data mining model (Chan et al, 1999) is a scalable, supervised black box approach that uses a realistic cost model to evaluate C4.5, CART, Ripper and naive Bayesian classification models. The results demonstrated that partitioning a large data set into smaller subsets to generate classifiers using different algorithms, experimenting with fraud/legal distributions within training data and using stacking to combine multiple models significantly improves cost savings. This method was applied to one million credit card transactions from two major US banks, Chase Bank and First Union Bank.
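The combining strategies mentioned here can be sketched in miniature. The toy code below majority-votes base predictions (bagging-style merging) and shows, schematically, how base votes could feed a second-level learner (stacking). The base classifier outputs and the hand-written meta-rule are stand-ins for illustration, not trained models or Chan et al's actual system.

```python
# Toy sketch of combining base classifiers: bagging merges predictions by
# majority vote, while stacking feeds the base predictions to a second-level
# learner. Base "classifier" outputs below are invented stand-ins.

from collections import Counter

def bag(predictions_per_classifier):
    """Majority vote across classifiers, instance by instance."""
    combined = []
    for votes in zip(*predictions_per_classifier):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Three base classifiers' predictions on four claims (assumed outputs).
nb  = ["fraud", "legal", "fraud", "legal"]
c45 = ["fraud", "legal", "legal", "legal"]
bp  = ["legal", "legal", "fraud", "legal"]

print(bag([nb, c45, bp]))   # ['fraud', 'legal', 'fraud', 'legal']

# Stacking, schematically: the tuple of base votes becomes the input
# attributes for a meta-learner; here a hand-written stand-in rule.
def meta(votes):
    return "fraud" if votes.count("fraud") >= 2 else "legal"

print([meta(list(v)) for v in zip(nb, c45, bp)])
```

In a real stacked system the meta-level rule would itself be learned from a validation set of base predictions rather than written by hand.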

The neural data mining approach (Brause et al, 1999) uses generalised rule-based association rules to mine symbolic data and Radial Basis Function neural networks to mine analog data. It has found that using supervised neural networks to check the results of association rules increases the predictive accuracy. The source of the credit card data was unknown and over fifty thousand transactions were used for training.

The credit fraud model (Groth, 1998) recommends a classification approach when there is a fraud/legal attribute, or a clustering followed by a classification approach if there is no fraud/legal attribute.

The HNC (now known as FairIsaac) Falcon Fraud Manager Software (Weatherford, 2002) recommends backpropagation neural networks for fraudulent credit card use.

2.1.3 Telecommunications Fraud

The Advanced Security for Personal Communications Technologies (ASPECT) research group (Weatherford, 2002) focuses on neural networks, particularly unsupervised ones, to train legal current user profiles that store recent user information and user profile histories that store long term information to define normal patterns of use. Once trained, fraud is highly probable when there is a difference between a mobile phone user’s current profile and the profile history.

Cahill et al (2002) builds upon the adaptive fraud detection framework (Fawcett and Provost, 1997) by using an event-driven approach of assigning fraud scores to detect fraud as it happens, and weighting recent mobile phone calls more heavily than earlier ones. The Cahill et al (2002) framework can also detect types of fraud using rules, in addition to detecting fraud in each individual account, from large databases. This framework has been applied to both wireless and wire line fraud detection systems with over two million customers.

The adaptive fraud detection framework presents rule-learning fraud detectors based on account-specific thresholds that are automatically generated for profiling the fraud in an individual account. The system, based on the framework, has been applied by combining the most relevant rules, to uncover fraudulent usage that is added to the legitimate use of a mobile phone account (Fawcett and Provost, 1996; Fawcett and Provost, 1997).

2.1.4 Analysis of Methods

Table 2.1 illustrates the features of fraud detection methods used by these three fraud types in research and practice. Supervised learning is an approach in which the algorithm’s answer to each input pattern is directly compared with the known desired answer, and feedback is given to the algorithm to correct possible errors. On the other hand, unsupervised learning is another approach that teaches the algorithm to discover, by itself, correlations and similarities among the input patterns of the training set and to group them into different clusters. There is no feedback from the environment with which to compare answers. The score-based approach uses numbers with a specified range, which indicates the relative risk that a particular instance may be fraudulent, to rank instances. The rule-based approach uses rules that are expressions of the form Body → Head, where Body describes the conditions under which the rule is generated and Head is typically a class label.


Table 2.1 depicts supervised learning as a more popular approach than unsupervised learning, and the score-based approach as more widely used in credit card fraud than the rule-based approach. All the research described before uses some form of supervised learning to detect fraud. For unsupervised learning, Silipo (2003) points out that the organisation of input space and the dimensionality reduction provided by unsupervised learning algorithms would accelerate the supervised learning process. Berry and Linoff (2000) also advocate the use of unsupervised learning to find new clusters and new insights which can improve the supervised learning results. Indeed, much of the research outlined above first utilised unsupervised learning to derive clusters, then used supervised learning approaches to obtain scores or rules from each cluster. The only exception was He et al’s (1998) work that used unsupervised learning subsequently to improve the initial supervised learning results. Only one of the seventeen fraud detection methods, the SAS Enterprise Miner Software, allows the use of all the supervised and unsupervised learning, score-based and rule-based approaches.

Table 2.1: Features in Fraud Detection Methods

Main Fraud Types          Supervised Learning  Unsupervised Learning  Score-Based  Rule-Based
Insurance Fraud           100%                 56%                    67%          67%
Credit Card Fraud         100%                 20%                    80%          50%
Telecommunications Fraud  100%                 33%                    67%          67%
All                       100%                 41%                    71%          59%

2.2 THE NEW INVESTIGATIVE DETECTION METHOD

This section proposes a different but non-trivial method of detecting crime based partially on Minority Report (Dick, 1956). The idea is to simulate the book’s Precrime method of precogs, integration mechanisms, analytical machinery, and visual symbols, with existing data mining methods and techniques. An overview of how the new investigative detection method can be used to predict crime, and its advantages and disadvantages, are discussed.


Figure 2.1: Predictions using Precogs, Analytical Machinery, and Visual Symbols

2.2.1 Precogs

Precogs, or precognitive elements, are entities that have the knowledge to predict that something will happen. Figure 2.1 uses three precogs to foresee and prevent crime by stopping potentially guilty criminals (Dick, 1956, p19). Unlike the human “mutant” precogs (Dick, 1956, p3), each precog contains multiple classification models, or classifiers, trained with one data mining technique (see Section 2.3) in order to extrapolate the future. This is a top-down approach (Berry and Linoff, 2000) because the organisation aims to predict certain types of crime. The three precogs proposed here are different from each other in that they are trained by different data mining algorithms. For example, the first, second, and third precog are trained using naive Bayesian, C4.5 and backpropagation algorithms respectively. They require numerical inputs of past examples to output corresponding class predictions for new instances.

2.2.2 Integration Mechanisms

Figure 2.1 shows that as each precog outputs its many predictions for each instance, all are counted and the class with the highest tally is chosen as the main prediction (Breiman, 1994). The main predictions can be combined either by majority count, or the predictions can be fed back into one of the precogs (Wolpert, 1992), to derive a final prediction.
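As a small illustration, not taken from the thesis, the majority-count combination of the three precogs' predictions might be sketched as follows; the function name and the example votes are purely illustrative:

```python
from collections import Counter

def combine_by_majority(predictions):
    """Return the class with the highest tally among the precog predictions."""
    tally = Counter(predictions)
    winner, _ = tally.most_common(1)[0]
    return winner

# Three precogs vote on one instance; "fraud" wins 2 to 1.
votes = ["fraud", "legal", "fraud"]
print(combine_by_majority(votes))  # fraud
```

Feeding the main predictions back into one of the precogs, as stacking does, would replace this simple tally with a trained combiner.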

2.2.3 Analytical Machinery

Dick’s (1956, p3) analytical machinery is made up of three computers that record, study, compare, and represent the precogs’ predictions in easily understood terms. The first two computers are similar models. However, if both produce different conclusions about the final predictions on each instance, the third, by means of statistical analysis, is used to check the results of the other two (Dick, 1956, p20). Instead of three computers, the analytical machinery is simplified by emulating one computer with one type of unsupervised learning, the SOM, for grouping similar data into clusters. This is a bottom-up approach (Berry and Linoff, 2000) that allows the data to speak for itself in patterns. The data analyst assesses the performance of the classifiers within each cluster and decides which crime patterns are important.

2.2.4 Visual Symbols

Dick’s (1956, p5) analytical machinery produces, in word form, details of a crime based on the final prediction. Instead, graphical visualisations, numerical scores, and descriptive rules about the final prediction, are used. They are important for explaining and understanding the main and final predictions.

2.2.5 Analysis of New Method

Figure 2.1 demonstrates that, to effectively and efficiently predict future crime, supervised and unsupervised learning techniques have to be used concurrently. Unseen numerical crime data that has no class label is simultaneously fed into the trained precogs and analytical machinery. One of the precogs identifies which attributes have the most predictive value after being trained and chooses a subset of attributes for the analytical machinery from which to generate visual symbols. The analytical machinery groups the instances into clusters. At the same time, each precog produces its predictions on an instance, which are then combined into a main prediction or, by one of the three precogs, into a final prediction. The main predictions and final prediction of each instance are then appended to the clustered instances. From this, visual symbols are generated to explain the precogs’ predictions and the most relevant visual symbols are kept and analysed.

The black box approach of using precogs to generate predictions has been transformed into a semi-transparent approach by using the analytical machinery to analyse and interpret the results. Visual symbols, scores, and rules help data analysts to comprehend the predictions faster. The amount of effort required for data preparation is reduced, as the same numerical instances are used as inputs for all three precogs and for the analytical machinery. Considerable time is saved because predictions and clusters are computed simultaneously. The key advantage lies in the fact that precogs can be shared between organisations to increase the accuracy of predictions without violating competitive and legal requirements. However, the problems of overfitting and missing values must be addressed, and the predictive accuracies of the precogs must be compared based on statistical concepts, before this method can be really effective.

2.3 SUPPORTING CRIMINAL DETECTION TECHNIQUES

This section focuses on the directed or supervised data mining approach to crime detection using BBNs, decision trees and ANNs. A brief description is provided of each technique, its applications in combating crime, and the use of its most widely used algorithm to classify a new instance. A summary of each technique’s main advantages and disadvantages is presented. The benefits of using the three different techniques together on the same crime data are also highlighted.

2.3.1 Bayesian Belief Networks

BBNs, based on Bayes’ (1763) theorem, provide a graphic model of causal relationships from which they predict class membership probabilities (Han and Kamber, 2000), such as whether a given instance is legal or fraudulent (Prodromidis, 1999). One type of Bayesian classifier, known as the naive Bayesian classifier, assumes that the attributes of an instance are independent of each other, given the target attribute (Minsky and Papert, 1969; Feelders, 2003). The main objective here is to assign a new instance to the class that has the highest posterior probability.

Although the naive Bayesian algorithm is simple, it is very effective in many real world data sets because it can give better predictive accuracy than well known methods like C4.5 decision trees and backpropagation (Domingos and Pazzani, 1996; Elkan, 2001) and is extremely efficient in that it learns in a linear fashion using ensemble mechanisms, such as bagging and boosting, to combine classifier predictions (Elkan, 1997). However, when attributes are redundant and not normally distributed, the predictive accuracy is reduced (Witten and Frank, 1999).

2.3.2 Decision Trees

Decision trees are machine learning techniques that express a set of independent attributes and a dependent attribute in the form of a tree-shaped structure that represents a set of decisions (Witten and Frank, 1999). Classification rules extracted from decision trees are IF-THEN expressions, in which the preconditions are logically ANDed together and all the tests must succeed for the rule to apply. Various related applications range from the analysis of instances of drug smuggling, government financial transactions (Mena, 2003b), and customs declaration fraud (Shao et al, 2002), to more serious crimes such as drug-related homicides, serial sex crimes, stranger rapes (SPSS, 2003), and homeland security (James, 2002; Mena, 2003a). The main idea of using decision trees is to use C4.5 (Quinlan, 1993) to divide data into statistically significant segments based on the desired output and to generate graphic decision trees or descriptive classification rules that can be used to classify a new instance.

C4.5 can help not only to make accurate predictions from the data but also to explain the criminal patterns in it. It deals with the problems of numeric attributes, missing values, pruning, estimating error rates, complexity of decision tree induction, and generating rules from trees (Witten and Frank, 1999). In terms of predictive accuracy, C4.5 performs slightly better than CART and ID3 (Prodromidis, 1999). C4.5’s successor, C5.0, shows marginal improvements to decision tree induction but not enough to justify its use. The learning and classification steps of C4.5 are generally fast (Han and Kamber, 2000). However, scalability and efficiency problems, such as the substantial decrease in performance and poor use of available system resources, can occur when C4.5 is applied to large data sets.

2.3.3 Artificial Neural Networks

ANNs represent complex mathematical equations, with lots of summations, exponential functions, and many parameters to mimic neurons from the human brain (Berry and Linoff, 2000). They have been used to classify crime instances such as burglary, sexual offences, and known criminals’ facial characteristics (Mena, 2003b). An artificial neural network (Rosenblatt, 1958) is a set of connected input/output units in which each connection has an associated weight. The main objective is to use the backpropagation learning algorithm (Rumelhart and McClelland, 1986) to make the network learn by adjusting and finalising the weights so that it can be used to classify a new instance.

Backpropagation neural networks can process a very large number of instances, have a high tolerance to noisy data and the ability to classify patterns on which they have not been trained (Han and Kamber, 2000). They are an appropriate choice for some crime detection areas where the results of the model are more important than understanding how it works (Berry and Linoff, 2000). However, backpropagation neural networks require long training times and extensive testing and retraining of parameters, such as the number of hidden neurons, learning rate and momentum, to determine the best performance (Bigus, 1996).

2.3.4 Analysis of Techniques

Table 2.2 illustrates that each technique is intrinsically different from the others, according to the evaluation criteria, and has its own strengths and weaknesses. Interpretability refers to how much a domain expert or non-technical person can understand each of the model predictions through visualisations or rules. Effectiveness highlights the overall predictive accuracy and performance of each technique. Robustness assesses the ability to make correct predictions given noisy data or data with missing values. Scalability refers to the capability to construct a model efficiently given large amounts of data. Speed describes how fast a technique searches for the patterns that make up the model.

By using the three techniques together on the same data, within the context of classification data analysis, their strengths can be combined and their weaknesses reduced. BBNs could be used for scalability and speed, decision trees for interpretability, and ANNs for effectiveness and robustness.

Table 2.2: Strengths and Weaknesses of Data Mining Techniques

Data Mining Techniques    Interpretability  Effectiveness  Robustness  Scalability  Speed
Bayesian Belief Networks  Good              Good           Good        Excellent    Excellent
Decision Trees            Excellent         Good           Good        Poor         Good
Neural Networks           Poor              Excellent      Excellent   Excellent    Poor

2.4 EVALUATION AND CONFIDENCE

Experiments described in this thesis split the main data set into a training data set and a score data set. The class labels of the training data are known, and the training data is historical compared to the score data. The class labels of the score data set are not known, and the score data set is processed by the classifiers for actual predictions.

An evaluation is made to assess how each classifier C_i performs and to systematically compare one classifier with another (Witten and Frank, 1999). The true but unknown success rate, T, measures the proportion of successes, or correct predictions, made by classifiers on the score data set. T lies within a specified interval of 2z, where z is the standard deviation from the mean, with a specified confidence. Given x instances from the score data set and S_i successes, the observed success rate of a classifier is o_i = S_i / x.


Witten and Frank (1999) recommend the use of the paired Student’s t-test with k − 1 degrees of freedom to compare, with confidence, two learning algorithms L_A and L_B, using the k-fold cross-validation method (see Section 3.2.1). It is applied to the differences of the observed success rates o_A,i and o_B,i of each pair of their classifiers C_A,i and C_B,i derived in each of the k runs, 1 ≤ i ≤ k. Given that o_i = o_A,i − o_B,i and ō = ō_A − ō_B, the null hypothesis, H0, states that L_A and L_B have the same success rate if ō = 0. H0 is rejected in favour of the alternate hypothesis, H1, which states that L_A and L_B have different performance with confidence c, if

t = ( ō · √k ) / √( Σ_{i=1}^{k} (o_i − ō)² / (k − 1) )

is greater than t_{k−1, (1+c)/2} for the 2-tailed t-test, 0 < c < 1. For example, with 99% confidence, and assuming k = 11, t_{10, 0.995} is equal to 3.169.
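The t statistic above is straightforward to compute directly; the sketch below uses only the standard library, and the eleven success-rate differences are invented for illustration:

```python
import math

def paired_t_statistic(diffs):
    """t = (mean * sqrt(k)) / sqrt(sum((o_i - mean)^2) / (k - 1))."""
    k = len(diffs)
    mean = sum(diffs) / k
    variance = sum((d - mean) ** 2 for d in diffs) / (k - 1)
    return mean * math.sqrt(k) / math.sqrt(variance)

# Hypothetical success-rate differences from k = 11 cross-validation runs.
diffs = [0.02, 0.01, 0.03, 0.02, 0.00, 0.01, 0.02, 0.03, 0.01, 0.02, 0.02]
t = paired_t_statistic(diffs)
# With 99% confidence, compare against the critical value t_{10, 0.995} = 3.169.
print(round(t, 3), t > 3.169)
```

For these invented differences the statistic exceeds the critical value, so H0 would be rejected; with differences closer to zero it would not be.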

According to Prodromidis (1999), the o_i from a classifier can be affected by factors such as the selection of training data, the random variation of score data, and the randomness within a data mining algorithm. When two classifiers, C_A and C_B, exhibit different o_A and o_B respectively on x, it may not mean that they have different predictive performances. In order to evaluate the results of C_A and C_B with confidence, Salzberg (1997), Prodromidis (1999), and Elkan (2001) suggest the use of McNemar’s hypothesis test on the classifiers’ results from x. Given that the number of instances correctly classified only by C_A is represented by s_A, and the number of instances correctly classified only by C_B is represented by s_B, the null hypothesis, H0, states that C_A and C_B have the same success rate if s_A = s_B. H0 is rejected in favour of the alternate hypothesis, H1, which states that C_A and C_B have different performances with confidence c, if

χ² = ( |s_A − s_B| − 1 )² / ( s_A + s_B )

is greater than χ²_{1, c}, 0 < c < 1. For example, with 99% confidence, χ²_{1, 0.99} is equal to 6.635. The subtraction of 1 in the numerator is a correction for the fact that s_A and s_B are discrete while the chi-squared distribution is continuous.
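The continuity-corrected McNemar statistic is a one-line computation; the counts in the sketch below are invented for illustration:

```python
def mcnemar_statistic(s_a, s_b):
    """(|s_A - s_B| - 1)^2 / (s_A + s_B), with the continuity correction."""
    return (abs(s_a - s_b) - 1) ** 2 / (s_a + s_b)

# Hypothetical counts: C_A alone got 25 instances right, C_B alone got 10.
stat = mcnemar_statistic(25, 10)
# Compare against the chi-squared critical value 6.635 (99% confidence).
print(stat, stat > 6.635)  # 5.6 False -> no significant difference
```

Even a seemingly large gap between the two classifiers (25 versus 10) is not significant at this confidence level, which is exactly the caution the test is meant to enforce.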

Elkan (2001) lamented the dearth of awareness and understanding among data mining researchers and practitioners regarding statistical significance. He argued that the problem lies in training data sets with small numbers of the rare class, such as fraudulent examples. To avoid inaccurate predictive accuracy estimates and poor model interpretation, this thesis determines in Section 5.1.3 whether predictive relationships are statistically reliable.


CHAPTER 3

THE CRIME DETECTION METHOD

The primary goals of investigative data mining are to predict and describe criminal patterns from observed data. According to Fayyad et al (1996), prediction involves using available data attributes to extrapolate values from other unknown data attributes, and description concentrates on finding human-interpretable patterns explaining the data. These two goals are achieved in steps which provide the core capability of generalising large numbers of specific facts into new knowledge shown in Figure 3.1. This chapter scrutinises three learning algorithms in Section 3.1, using their classifiers to predict occurrences of fraud. Section 3.2 illustrates two ways of improving their predictions by integrating multiple classifiers. An ANN approach to clustering is presented as a new descriptive approach in Section 3.3. Section 3.4 chooses some existing visualisation techniques to describe the patterns and Section 3.5 ends the chapter with an in-depth analysis of the overall crime detection method.

Figure 3.1: Steps in the Crime Detection Method

[Figure 3.1 shows the four steps of the method: 3.1 Classifiers; 3.2 Combining Output; 3.3 Cluster Detection; 3.4 Visualisation Techniques.]

“To solve really hard problems, we will have to use several different representations... Each have domains of competence and efficiency, so that one may work where another fails.”


3.1 STEP ONE: CLASSIFIERS AS PRECOGS

This section applies each algorithm to small data sets to show its importance in fraud detection. In progressive steps, the likely problems in each algorithm are explained and effective solutions to overcome them are proposed. The advantages and disadvantages of using the three algorithms in fraud detection are also presented.

3.1.1 Naive Bayesian Classifiers

The naive Bayesian classifier learns according to the algorithm in Appendix G, using the training data in Appendix D. The classifier has to predict the class of instance X = (sex = “0”, fault = “1”, driver_rating = “0”, number_of_suppliments = “0.33”) to be either fraud or legal.

According to Step 2 of the naive Bayesian algorithm,

P(fraud) = 2/20 = 0.1
P(legal) = 18/20 = 0.9

According to Step 3 of the naive Bayesian algorithm,

P(sex = "0" | fraud) = 0/2 = 0
P(fault = "1" | fraud) = 2/2 = 1
P(driver_rating = "0" | fraud) = 1/2 = 0.5
P(number_of_suppliments = "0.33" | fraud) = 0/2 = 0
P(X | fraud) = Π_{k=1}^{n} P(x_k | fraud) = 0 × 1 × 0.5 × 0 = 0

According to Step 4 of the naive Bayesian algorithm,

P(sex = "0" | legal) = 4/18 = 0.22
P(fault = "1" | legal) = 13/18 = 0.72
P(driver_rating = "0" | legal) = 6/18 = 0.33
P(number_of_suppliments = "0.33" | legal) = 2/18 = 0.11
P(X | legal) = Π_{k=1}^{n} P(x_k | legal) = 0.22 × 0.72 × 0.33 × 0.11 = 0.0057

Using the above probabilities and Step 5 of the naive Bayesian algorithm,

P(X | fraud) P(fraud) = 0 × 0.1 = 0
P(X | legal) P(legal) = 0.0057 × 0.9 = 0.0051

Therefore, the naive Bayesian classifier predicts that instance X is legal. P(sex = "0" | fraud) = 0 and P(number_of_suppliments = "0.33" | fraud) = 0 highlight the problem of an attribute value not being present in any of the fraud examples from the training set. The probability P(X | fraud) is then always 0, and poor score set results are inevitable. The Laplace estimator improves this situation by adding 1 to the numerator and the number of attribute values to the denominator of each conditional probability in P(X | fraud) and P(X | legal) (Witten and Frank, 1999). To prove this, an assumption that each attribute value is equally probable has to be made when the algorithm is used.

According to Step 3 of the naive Bayesian algorithm,

P(sex = "0" | fraud) = (0 + 1) / (2 + 2) = 0.25
P(fault = "1" | fraud) = (2 + 1) / (2 + 2) = 0.75
P(driver_rating = "0" | fraud) = (1 + 1) / (2 + 4) = 0.33
P(number_of_suppliments = "0.33" | fraud) = (0 + 1) / (2 + 4) = 0.17
P(X | fraud) = 0.25 × 0.75 × 0.33 × 0.17 = 0.011


According to Step 4 of the naive Bayesian algorithm,

P(sex = "0" | legal) = (4 + 1) / (18 + 2) = 0.25
P(fault = "1" | legal) = (13 + 1) / (18 + 2) = 0.7
P(driver_rating = "0" | legal) = (6 + 1) / (18 + 4) = 0.32
P(number_of_suppliments = "0.33" | legal) = (2 + 1) / (18 + 4) = 0.14
P(X | legal) = 0.25 × 0.7 × 0.32 × 0.14 = 0.0078

Using the above probabilities and Step 5 of the naive Bayesian algorithm,

P(X | fraud) P(fraud) = 0.011 × 0.1 = 0.0011
P(X | legal) P(legal) = 0.0078 × 0.9 = 0.00702

By converting the results into percentage format, P(fraud | X) = 13.55% and P(legal | X) = 86.45%. Therefore, instance X is about 6 times more likely to be legal.
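The Laplace-smoothed computation above can be reproduced with a short script. The counts below are read off the worked example as used here; because exact fractions are kept throughout, the result differs slightly from the thesis figures, which round each intermediate probability:

```python
# Counts from the worked example: 2 fraud and 18 legal training examples.
# For each attribute of X: (matches in fraud, matches in legal, distinct values).
counts = [
    (0, 4, 2),   # sex = "0"
    (2, 13, 2),  # fault = "1"
    (1, 6, 4),   # driver_rating = "0"
    (0, 2, 4),   # number_of_suppliments = "0.33"
]

def laplace_likelihood(class_size, matches_values):
    """Product of Laplace-smoothed conditional probabilities for one class."""
    p = 1.0
    for match, n_values in matches_values:
        p *= (match + 1) / (class_size + n_values)
    return p

p_fraud = 0.1 * laplace_likelihood(2, [(f, v) for f, _, v in counts])
p_legal = 0.9 * laplace_likelihood(18, [(l, v) for _, l, v in counts])
posterior_fraud = p_fraud / (p_fraud + p_legal)
print(round(100 * posterior_fraud, 1))  # about 13% fraud
```

Without the +1 smoothing, the zero counts for sex = "0" and number_of_suppliments = "0.33" would force the fraud posterior to exactly zero, as the unsmoothed calculation showed.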

Since attributes are treated as though they are completely independent, the addition of redundant ones dramatically reduces predictive power. The best way of relaxing this conditional independence assumption is to add derived attributes (Elkan, 2001). These attributes are created from combinations of existing attributes (see Section 4.3.4). To illustrate this, the naive Bayesian classifier will learn using the training data with two derived attributes in Appendix E. The classifier has to predict the class of instance Y = (is_holidayweek_claim = "0", fault = "1", driver_rating = "0", age_price_wsum = "0.33") instead.


According to Step 3 of the naive Bayesian algorithm,

P(is_holidayweek_claim = "0" | fraud) = (0 + 1) / (2 + 2) = 0.25
P(fault = "1" | fraud) = (2 + 1) / (2 + 2) = 0.75
P(driver_rating = "0" | fraud) = (1 + 1) / (2 + 4) = 0.33
P(age_price_wsum = "0.33" | fraud) = (0 + 1) / (2 + 7) = 0.11
P(Y | fraud) = 0.25 × 0.75 × 0.33 × 0.11 = 0.0068

According to Step 4 of the naive Bayesian algorithm,

P(is_holidayweek_claim = "0" | legal) = (16 + 1) / (18 + 2) = 0.85
P(fault = "1" | legal) = (13 + 1) / (18 + 2) = 0.7
P(driver_rating = "0" | legal) = (6 + 1) / (18 + 4) = 0.32
P(age_price_wsum = "0.33" | legal) = (3 + 1) / (18 + 7) = 0.16
P(Y | legal) = 0.85 × 0.7 × 0.32 × 0.16 = 0.03

Using the above probabilities and Step 5 of the naive Bayesian algorithm,

P(Y | fraud) P(fraud) = 0.0068 × 0.1 = 0.00068
P(Y | legal) P(legal) = 0.03 × 0.9 = 0.027

By converting the results into percentage format, P(fraud | Y) = 2.46% and P(legal | Y) = 97.54%. Therefore, given two derived attributes as training examples, instance Y is about 40 times more likely to be legal. Given this small illustration, and that the correct class label is legal, using derived attributes can result in more accurate predictions.

The naive Bayesian classifier is one of the few classifier types that can handle missing values in training examples well (Elkan, 1997; Witten and Frank, 1999). To demonstrate this, the naive Bayesian classifier learns based on the training data with nine missing values in Appendix F to predict the class of instance Y = (is_holidayweek_claim = “0”, fault = “1”, driver_rating = “0”, age_price_wsum = “0.33”) instead.


According to Step 3 of the naive Bayesian algorithm,

P(is_holidayweek_claim = "0" | fraud) = (0 + 1) / (2 + 2) = 0.25
P(fault = "1" | fraud) = (1 + 1) / (2 + 2) = 0.5
P(driver_rating = "0" | fraud) = (1 + 1) / (2 + 4) = 0.33
P(age_price_wsum = "0.33" | fraud) = (0 + 1) / (2 + 7) = 0.11
P(Y | fraud) = 0.25 × 0.5 × 0.33 × 0.11 = 0.0045

According to Step 4 of the naive Bayesian algorithm,

P(is_holidayweek_claim = "0" | legal) = (14 + 1) / (18 + 2) = 0.75
P(fault = "1" | legal) = (11 + 1) / (18 + 2) = 0.6
P(driver_rating = "0" | legal) = (5 + 1) / (18 + 4) = 0.27
P(age_price_wsum = "0.33" | legal) = (3 + 1) / (18 + 7) = 0.16
P(Y | legal) = 0.75 × 0.6 × 0.27 × 0.16 = 0.019

Using the above probabilities and Step 5 of the naive Bayesian algorithm,

P(Y | fraud) P(fraud) = 0.0045 × 0.1 = 0.00045
P(Y | legal) P(legal) = 0.019 × 0.9 = 0.017

By converting the results into percentage format, P(fraud | Y) = 2.58% and P(legal | Y) = 97.42%. Therefore, given nine missing values in the training examples, instance Y is about 38 times more likely to be legal. As the nine missing values are simply not included in the frequency counts, the probability ratios are dependent on the number of values that are actually present.
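The treatment of missing values can be sketched in a few lines, mirroring the arithmetic above: a missing entry simply never matches, so it drops out of the frequency count. The tiny data set here is illustrative, not the actual Appendix F examples:

```python
def cond_prob(examples, attr, value, cls, n_values):
    """Laplace-smoothed P(attr = value | class); missing entries (None)
    never match, so they are excluded from the frequency count."""
    in_class = [e for e in examples if e["class"] == cls]
    matches = sum(1 for e in in_class if e[attr] == value)  # None never matches
    return (matches + 1) / (len(in_class) + n_values)

examples = [
    {"fault": "1", "class": "fraud"},
    {"fault": None, "class": "fraud"},  # missing value, not counted
    {"fault": "0", "class": "legal"},
]
print(cond_prob(examples, "fault", "1", "fraud", 2))  # (1 + 1) / (2 + 2) = 0.5
```

With one of the two fraud examples missing its fault value, the count of matches falls from a possible 2 to 1, reproducing the (1 + 1) / (2 + 2) = 0.5 seen in Step 3 above.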

3.1.2 C4.5 Classifiers

The C4.5 classifier learns according to the algorithm in Appendix H based on the training data with two derived attributes in Appendix E to predict the outcome of instance Z = (is_holidayweek_claim = “0”, fault = “1”, driver_rating = “0.66”, age_price_wsum = “0.5”). The attribute age_price_wsum is used for demonstration purposes.


According to Step 2 of the C4.5 algorithm,

I(2, 18) = −(2/20) log₂(2/20) − (18/20) log₂(18/20) = 0.467

According to Step 3 of the C4.5 algorithm,

E(age_price_wsum) = (9/20) I(1, 8) = (9/20) [ −(1/9) log₂(1/9) − (8/9) log₂(8/9) ] = 0.225

According to Step 4 of the C4.5 algorithm,

Gain(age_price_wsum) = 0.467 − 0.225 = 0.242

Through working on the other attributes, Gain(age_price_wsum) has the highest information gain. The other information gains are Gain(is_holidayweek_claim) = −0.001, Gain(fault) = 0.037, and Gain(driver_rating) = 0.077.
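The entropy and gain arithmetic above can be checked with a few lines of Python. The class split per branch is taken from the worked example; the age_price_wsum branches other than the 9-example one are pure, so only that branch contributes entropy. Keeping full precision gives values that differ in the third decimal from the thesis's rounded figures:

```python
import math

def info(counts):
    """Entropy I(c1, c2, ...) of a class distribution, in bits."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Step 2: class distribution of the whole training set (2 fraud, 18 legal).
i_root = info([2, 18])

# Step 3: only the 9-example branch of age_price_wsum is impure (1 fraud, 8 legal).
e_attr = (9 / 20) * info([1, 8])

# Step 4: information gain from splitting on age_price_wsum.
gain = i_root - e_attr
print(round(i_root, 3), round(e_attr, 3), round(gain, 3))
```

The same `info` function, applied to each candidate attribute's branches, yields the other gains quoted above, and the attribute with the highest gain becomes the root of the tree.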

According to Step 5 of the C4.5 algorithm, a decision tree is created, in Figure 3.2, by first having a node named age_price_wsum, and branches are grown for each of the attribute’s values. Each rectangle node depicts a test of an attribute and each oval node (leaf) represents a class. The number in every node indicates the number of examples within it.


Figure 3.2: Fully Grown Decision Tree

Therefore, using Figure 3.2, the C4.5 classifier predicts that instance Z is legal. Missing values in the training data pose a serious problem, because it is not clear which branch should be taken when a node tests an attribute whose value is missing. Witten and Frank (1999) suggest that the simplest solution is to assume that the absence of the value is not significant and to use the branch with the most examples for the missing attribute value. Using the examples with missing values in Appendix F, the resultant decision tree is the same as in Figure 3.2.

To avoid overfitting, subtree-raising postpruning can be used to remove insignificant branches and leaves to improve performance (Witten and Frank, 1999). This is generally restricted to raising the subtree of the most popular branch, for example, the subtree of driver_rating. As shown in Figure 3.3, the raising is done because the branch from the driver_rating node to the fault node has equal to, or more, examples than the other leaves on the same level. The entire subtree from the fault node downward has been “raised” to replace the subtree of the driver_rating node in Figure 3.3.

[Figure 3.2 shows the fully grown tree: the root node age_price_wsum (20 examples) branches on its values; the 0.5 branch tests driver_rating (9 examples), which leads to fault (3 examples) and then is_holidayweek_claim (2 examples); almost all leaves are legal.]


Figure 3.3: Pruned Decision Tree

Therefore, after pruning, the C4.5 classifier predicts that instance Z is fraud, instead of legal as before pruning. This is a case of fragmentation, where the number of examples at each given branch is too small to be statistically significant, and the result is inaccuracy and incomprehensibility. Given more examples, other problems surface, like repetition, which occurs when an attribute is tested more than once along a given branch of a tree, and replication, which occurs when subtrees are duplicated. According to Han and Kamber (2000), derived attributes can mitigate fragmentation, repetition and replication, as has been done in the C4.5 discussions above.

Knowledge in decision trees can be extracted in the form of IF-THEN rules. One rule is created for each path from the root to a leaf node. Table 3.1 displays seven descriptive rules which are created from Figure 3.3 so that the tree can be better understood. However, this ability is often overstated, as a large complex decision tree may contain many leaves that are not useful (Berry and Linoff, 2000) and they often lack interpretability (Elkan, 2001).

[Figure 3.3 shows the pruned tree: the root node age_price_wsum (20 examples); its 0.5 branch now tests fault (9 examples) directly, followed by is_holidayweek_claim (6 examples).]


Table 3.1: Extracting Rules from a Decision Tree

Rule 1: IF age_price_wsum = "0.33" THEN class = "legal"
Rule 2: IF age_price_wsum = "0.67" THEN class = "fraud"
Rule 3: IF age_price_wsum = "0.84" THEN class = "legal"
Rule 4: IF age_price_wsum = "1" THEN class = "legal"
Rule 5: IF age_price_wsum = "0.5" AND fault = "0" THEN class = "legal"
Rule 6: IF age_price_wsum = "0.5" AND fault = "1" AND is_holidayweek_claim = "0" THEN class = "fraud"
Rule 7: IF age_price_wsum = "0.5" AND fault = "1" AND is_holidayweek_claim = "1" THEN class = "legal"
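The rules of Table 3.1 can be applied directly; the sketch below encodes them as (conditions, class) pairs, with the first rule whose preconditions all hold deciding the class (a first-match strategy is an assumption here, though these rules are mutually exclusive, so order does not matter). It reproduces the post-pruning prediction of fraud for instance Z:

```python
# (conditions, class) pairs transcribed from Table 3.1.
RULES = [
    ({"age_price_wsum": "0.33"}, "legal"),
    ({"age_price_wsum": "0.67"}, "fraud"),
    ({"age_price_wsum": "0.84"}, "legal"),
    ({"age_price_wsum": "1"}, "legal"),
    ({"age_price_wsum": "0.5", "fault": "0"}, "legal"),
    ({"age_price_wsum": "0.5", "fault": "1", "is_holidayweek_claim": "0"}, "fraud"),
    ({"age_price_wsum": "0.5", "fault": "1", "is_holidayweek_claim": "1"}, "legal"),
]

def classify(instance):
    """Return the class of the first rule whose ANDed preconditions all hold."""
    for conditions, label in RULES:
        if all(instance.get(a) == v for a, v in conditions.items()):
            return label
    return None  # no rule fires

Z = {"is_holidayweek_claim": "0", "fault": "1",
     "driver_rating": "0.66", "age_price_wsum": "0.5"}
print(classify(Z))  # fraud, as predicted by the pruned tree
```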

In the discussion so far, C4.5 has been applied on a small training set of only twenty examples. There are likely to be problems of scalability and efficiency when it is used to mine very large real data sets. Section 3.2.1 proposes an approach to overcome this limitation.

3.1.3 Backpropagation Classifiers

The backpropagation classifier learns according to the algorithm in Appendix I, using the first training example in Appendix E. Example 1 = (is_holidayweek_claim = 0, fault = 1, driver_rating = 0, age_price_wsum = 0.5, class = d1 = 0), where fraud is represented by 1 and legal is represented by 0 in the class attribute. To simplify the discussion, the learning rate c and the steepness of the activation function λ are fixed at 1, and all initial weights w and v start at 0.1.

Example 1 is fed into the network in Figure 3.4, and the net input and output of the hidden neurons and output neuron are calculated in Table 3.2 using Steps 2 and 3 of the backpropagation algorithm. The error of the output neuron is computed and propagated backwards in Table 3.3 to update the weights in Table 3.4 using Steps 4 and 5 of the backpropagation algorithm. Once all examples in the
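Since Figure 3.4 and Tables 3.2 to 3.4 are not reproduced here, the sketch below assumes a small 4-2-1 network (the two hidden neurons are an assumption) and performs one forward pass and one backpropagation update for Example 1, with sigmoid activations, learning rate c = 1, λ = 1, and all weights starting at 0.1:

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))  # steepness lambda fixed at 1

x = [0.0, 1.0, 0.0, 0.5]      # Example 1 inputs
d = 0.0                        # desired class: legal = 0
c = 1.0                        # learning rate
w = [[0.1] * 4, [0.1] * 4]     # input -> hidden weights (2 hidden neurons)
v = [0.1, 0.1]                 # hidden -> output weights

# Steps 2-3: forward pass, net input and output of each neuron.
h = [sigmoid(sum(wi * xi for wi, xi in zip(row, x))) for row in w]
o = sigmoid(sum(vi * hi for vi, hi in zip(v, h)))

# Steps 4-5: output error, propagated backwards to update the weights.
delta_o = (d - o) * o * (1 - o)
for j in range(len(v)):
    delta_h = delta_o * v[j] * h[j] * (1 - h[j])  # use pre-update v[j]
    v[j] += c * delta_o * h[j]
    for i in range(len(x)):
        w[j][i] += c * delta_h * x[i]

print(round(o, 4))  # network output before the update
```

Because the desired output is 0 and the network initially outputs a value above 0.5, the update decreases the hidden-to-output weights, nudging the next forward pass towards legal.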
