Fraud Detection Based on Data Mining
Mohammed Abdulrahman Al Dosari
Information Systems Department, College of Computer Engineering & Sciences
Salman Bin Abdulaziz University Email: [email protected]
Under the supervision of: Dr. Mohammad Omar Alhawarat
Computer Science Department,
College of Computer Engineering & Sciences, Salman Bin Abdulaziz University
Email: [email protected]
Abstract
The widespread use of the Internet and the reach of its services make individuals and companies targets for fraudsters, who seek to access their information and exploit it illegally. This paper is concerned with anomaly detection, and specifically with fraud detection. It discusses supervised and unsupervised methods for detecting fraud, rather than traditional statistical methods, which are limited. Its ultimate aim is to promote the use of data mining techniques for fraud detection and to show their importance. The vast amount of data available today, together with the rapid development of fraud techniques, makes illegal exploitation of this data possible. This exploitation, called fraud, can affect many agencies, whether private or governmental, and these organizations have millions of riyals invested. All such agencies should therefore use fraud detection systems to guard against fraudulent activity. The paper also clarifies the benefits of fraud detection and helps show its importance for security and for the community.
Key words:
Research Summary
In the current era we observe the enormous use of the Internet and the spread of its services everywhere, whether in government services or in the many affairs of daily life. For this reason, these parties have become a target and a prime opportunity for fraudsters to reach the information available electronically, in order to target it and exploit it in illegitimate ways. Traditional methods for detecting fraud exist, but with the progress of the digital era they are no longer effective enough to provide services that give companies and individuals assurance about the electronically stored data they own. Fraud cases are no longer traditional either; with this progress, fraudsters have advanced and developed their means of illegitimate exploitation. As is well known, a great deal of data is now stored electronically, and with the development of data mining tools this data has gained considerable value. This concept contributes to the development of non-traditional methods that help provide many of the services that give individuals and institutions assurance about the data they own.
Introduction
Data mining can be defined as the "Non-trivial extraction of implicit, previously unknown and potentially useful information from data" [1]. Data mining is also used to help decision makers make good decisions and to prevent malicious acts.
Data mining is very important because many disciplines, such as information technology and business, deal with vast amounts of data. In the digital era this data may not be useful by itself, but with data mining techniques meaningful knowledge can be extracted from it to help decision makers make successful decisions that benefit individuals, companies and communities. For example, digital crimes are increasing, and traditional methods of data analysis are no longer viable for detecting fraudulent activities. Detecting fraudulent activity from the behavior of the intruder, which makes it possible to catch previously unseen (zero-day) attacks, is more effective than non-behavioral approaches such as signature-based antivirus systems.
In addition, databases have developed massively and grown in size over the years, which created a need for techniques that can support this growth where traditional techniques are no longer appropriate. These techniques must deal with complex databases whose data comes from different sources and in different types, because fraudsters may take advantage of the vast amount of data available on the Internet.
Several major data mining techniques have been developed and used in recent data mining projects, including association, classification, clustering, prediction and sequential patterns [2]. Each of them may be classified as predictive or descriptive.
An anomaly (or outlier) is a pattern in data that does not conform to a well-defined notion of normal behavior [3]. Anomaly detection is the problem of finding such patterns. It is used extensively in a wide range of applications, such as fraud detection for credit cards, insurance or health care, intrusion detection for cyber-security, and fault detection in safety-critical systems [3].
Anomalies in a data set can point to malicious activity such as social network fraud, credit card fraud and identity theft. Normal cases usually have equal or nearby values and therefore group together, whereas abnormal cases have values that differ from the rest.
There are two categories of anomaly detection techniques. The first is unsupervised anomaly detection, which detects anomalies in an unlabeled test data set under the assumption that the majority of the instances are normal, by looking for the instances that seem to fit least with the remainder of the data set [3].
The second is supervised anomaly detection, which requires a data set whose instances have been labeled as "normal" or "abnormal" and involves training a classifier. The key difference from many other statistical classification problems is the inherently unbalanced nature of outlier detection [3].
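The unbalanced nature of labeled anomaly data can be illustrated with a small sketch. The data set below is hypothetical (995 "normal" and 5 "abnormal" instances), and the function names are chosen only for this example. It shows why plain accuracy is a misleading measure here: a classifier that never reports an anomaly still scores very high.

```python
def confusion_counts(actual, predicted, positive="abnormal"):
    """Count true/false positives and negatives for one positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return tp, fp, fn, tn


def accuracy_and_recall(actual, predicted, positive="abnormal"):
    """Return overall accuracy and the recall of the positive class."""
    tp, fp, fn, tn = confusion_counts(actual, predicted, positive)
    accuracy = (tp + tn) / len(actual)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


# Hypothetical labeled set: 995 normal instances and 5 abnormal ones.
actual = ["normal"] * 995 + ["abnormal"] * 5
# A useless classifier that labels every instance as normal.
always_normal = ["normal"] * 1000

accuracy, recall = accuracy_and_recall(actual, always_normal)
print(f"accuracy={accuracy:.3f} recall={recall:.3f}")
```

The accuracy is 99.5% although none of the five abnormal instances is found, which is why measures such as recall on the abnormal class are usually reported alongside accuracy.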
Fraud can be defined as a criminal activity, involving false representations to gain an unjust advantage. Fraud occurs in a wide variety of forms and is ever changing as new technologies and new economic and social systems provide new opportunities for fraudulent activity [4].
The evolution of information technology, and its use in many fields and by people of all ages, makes it possible for fraudsters to commit e-crimes. In the past, fraud often depended on physically reaching a location in order to steal money or important data written on paper and kept on shelves. This was difficult for fraudsters, and especially so where physical security precautions were heightened.
With this massive spread, e-crime has become widespread: fraudsters use the latest methods to reach information and use it without authorization. For example, a fraudster may use a stolen credit card to buy and sell via the Internet, or access personal information in order to sell it to other parties. Fraud methods usually also focus on hiding the electronic evidence of the crime.
Fraud detection is a hot topic in many industries, including banking and the financial sector, insurance, government agencies and law enforcement. Fraud attempts have increased drastically in recent years, making fraud detection more important than ever. Despite the efforts of the affected institutions, hundreds of millions of dollars are lost to fraud every year. Because relatively few cases in a large population are fraudulent, finding them can be difficult.
Fraud detection is important because most organizations and individuals hold a lot of important data that must not be seen by unauthorized users. This data may be governmental or health related. Fraudsters may also use web sites to steal the identity of other users in order to spread rumors or for other purposes.
"Even the most comprehensive fraud prevention controls can be circumvented by a determined and skilled fraudster. Fraud detection techniques can help to uncover new fraud in action as well as historical frauds. Effective fraud detection saves money and sensitive information. Promoting the organization's detection activities can also act as a deterrent to would-be fraudsters" [5].
There are various types of fraud, including credit card fraud, telecommunication fraud, computer intrusion, bankruptcy fraud, theft/counterfeit fraud, application fraud and behavioral fraud. The following are important targets for fraudsters:
Credit cards: Credit cards are one of the most important issues in fraud because they are a major target for fraudsters, who tend to use unauthorized credit card information to buy, sell or transfer money via the Internet. Sophisticated methods to detect this type of fraud must therefore be developed.
Social network sites: With the rapid evolution of social networking web sites and their increasing use, they have become a great opportunity for fraudsters to commit identity theft and impersonation in order to spread rumors and damage the reputation of a victim.
Computer devices: Fraudsters attempt to gain unauthorized access in order to obtain or destroy important data.
Telecommunication services: Fraudsters attempt to access e-services provided via the Internet without authorization, for private purposes or out of curiosity. Examples include entry to hospital health portals for patients and education portals for students.
Detecting Fraud
Traditional methods of data analysis cannot accurately detect anomalies. For example, if a user buys from an e-commerce web site with a credit card, traditional methods cannot detect illegal use of the card, because they cannot differentiate between the behaviors of different users or take other differences into account. Data mining techniques, in contrast, detect when the normal behavior of a user changes and alert the concerned authority so that it can take the necessary action. All of these processes are based on the profile assigned to each user and on finding the relationships between normal and abnormal behaviors. This approach is interesting because it is not limited to detecting theft. For example, the usage (buying behavior) of a customer's credit card may change when the card is used by his son with his knowledge. In this case the fraud detection method detects a change from the father's actual behavior and takes the necessary action, such as suspending the credit card or sending an alert to the father's mobile phone.
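The profile-based idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of any particular system: the profile is reduced to the mean and standard deviation of past purchase amounts, the three-standard-deviation threshold is an arbitrary choice, and the purchase history and the names `build_profile` and `is_unusual` are hypothetical.

```python
from statistics import mean, stdev


def build_profile(amounts):
    """Summarise a user's past purchase amounts as (mean, standard deviation)."""
    return mean(amounts), stdev(amounts)


def is_unusual(amount, profile, threshold=3.0):
    """Flag an amount more than `threshold` standard deviations from the mean."""
    avg, spread = profile
    if spread == 0:
        # All past purchases were identical, so any other amount is a change.
        return amount != avg
    return abs(amount - avg) / spread > threshold


# Hypothetical purchase history (in riyals) for one cardholder.
history = [120, 95, 110, 130, 105, 115, 100, 125]
profile = build_profile(history)

for amount in (118, 2500):
    action = "alert cardholder" if is_unusual(amount, profile) else "accept"
    print(amount, action)
```

A real profile would cover more than the purchase amount (for example merchant type, time of day and location), but the principle is the same: the decision is made relative to each user's own recorded behavior rather than to a single global rule.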
Indicators that may point to fraud include: unusual behavior, excessive voids, missing documents, excessive credit memos, adjustments to receivables or payables, excess purchases, employee expense accounts, inventory shortages, increased scrap, large payments to individuals and increased employee overtime.
Once again, traditional fraud detection methods that use data analysis tools are limited. They are very helpful for explaining information and extracting statistics, but with large amounts of data and increasing complexity, dimensionality and diversity of sources, they fail to meet the requirements of the data. Data mining and machine learning techniques are therefore used to overcome the limitations of traditional data analysis methods.
Fraud detection requires complex and time-consuming investigations that deal with different domains of knowledge and many dimensions. Fraud often consists of many instances or incidents involving repeated transgressions, which may appear abnormal and call for the necessary action to be taken. Fraud is an adaptive crime, so detecting and preventing it needs special methods of intelligent and sophisticated data analysis. Such methods offer applicable and successful solutions in different areas of fraud [6].
One method actually used to detect fraudulent activity is the predictive method, which is based on historical data. Because fraud cases occur rarely, this data may be scarce, so all of the available historical data is used to identify fraudulent activities based on predefined models. Traditional methods of data analysis have long been used to detect fraud, but as noted earlier, the evolution of fraud creates a need for more sophisticated techniques, such as those that rely on the behavior of users and their preferences.
As mentioned earlier, detecting fraud requires time and complex work. Two of the most important advanced methods for detecting fraud are presented here. In both, the normal behavior of each user is assumed to be known within certain limits, so that electronic transactions can be considered safe while they stay within those limits; when the behavior changes, the change must be noted and the user informed. Much work has been carried out on all methods of data mining, but here supervised and unsupervised methods for handling fraud are highlighted.
Before starting the work, a fraud data profile must be developed. This is the process of drawing a picture of a fraud scenario with data. The clarity of the picture depends on the availability and integrity of the information in the database. The fraud data profile focuses on the master file description, the transactions, or both. The actual behavior of the users, and how it changes over time, must also be observed and recorded first. The data set needed to build a model properly may be large, possibly millions of instances.
Supervised Method
In the supervised method, each instance carries a class label, "Fraudulent" or "Non Fraudulent", based entirely on previously recorded historical data. The records, with all their details and their labels, are used to build a model that is then used to detect fraud. As mentioned above, this data depends on the user profile, which must accurately record the user's behavior and every transaction the user performs, together with the classification of each transaction as fraudulent or non-fraudulent.
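As a rough illustration of the supervised setting, the sketch below classifies a new transaction by a majority vote among its nearest labeled neighbours (k-nearest neighbours). The choice of k-nearest neighbours is an assumption made for this example, since the paper does not name a specific classifier. The two features (purchase amount relative to the user's average, and number of transactions in the last hour) and all records are hypothetical.

```python
from collections import Counter
from math import dist


def knn_predict(training, query, k=3):
    """Return the majority class label among the k nearest labeled records."""
    nearest = sorted(training, key=lambda record: dist(record[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labeled history:
# (amount relative to the user's average, transactions in the last hour) -> label
training = [
    ((1.0, 1), "Non Fraudulent"),
    ((0.8, 1), "Non Fraudulent"),
    ((1.2, 2), "Non Fraudulent"),
    ((0.9, 1), "Non Fraudulent"),
    ((1.1, 2), "Non Fraudulent"),
    ((9.0, 6), "Fraudulent"),
    ((12.0, 8), "Fraudulent"),
    ((10.5, 7), "Fraudulent"),
]

print(knn_predict(training, (11.0, 7)))   # close to the fraudulent records
print(knn_predict(training, (1.05, 1)))   # close to the normal records
```

The quality of such a model depends entirely on the labeled history: if the recorded transactions are mislabeled or too few fraudulent examples exist, the classifier has nothing reliable to learn from.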
Unsupervised Method
The unsupervised method does not depend on a class label, so unlike the previous method, instances are not classified in advance as Fraudulent or Non Fraudulent. Instead, it builds models that describe the transactions or behavior. Normal cases tend to form groups of similar instances close to each other, whereas an anomalous transaction or behavior lies away from the cases regarded as normal; such a case is called an "outlier" or anomalous observation, and the necessary action is then taken. The most important unsupervised technique used in data mining to detect fraud is clustering. Figure 1 shows a set of two-dimensional data that falls into two groups, N1 and N2. Note that o1, o2 and o3 lie outside both groups and are classified as anomalies.
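The clustering idea can be sketched without any labels. The example below is one simple possibility, not necessarily the technique the cited survey uses: points closer than a chosen radius are linked into the same cluster (single-linkage grouping with a distance cutoff), and points that end up in very small clusters are reported as outliers. The data, the radius and the minimum cluster size are all hypothetical.

```python
from math import dist


def threshold_clusters(points, radius):
    """Group point indices so that points within `radius` share a cluster."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        frontier = [unvisited.pop()]
        cluster = []
        while frontier:
            i = frontier.pop()
            cluster.append(i)
            # Collect unvisited points that are close enough to point i.
            near = [j for j in unvisited if dist(points[i], points[j]) <= radius]
            unvisited.difference_update(near)
            frontier.extend(near)
        clusters.append(sorted(cluster))
    return clusters


def find_outliers(points, radius, min_size):
    """Return the points that belong to clusters smaller than `min_size`."""
    return sorted(
        points[i]
        for cluster in threshold_clusters(points, radius)
        if len(cluster) < min_size
        for i in cluster
    )


# Hypothetical two-dimensional data: two dense groups and three stray points.
points = [
    (1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.8),
    (8.0, 8.0), (8.2, 7.9), (7.8, 8.1), (8.1, 8.3), (7.9, 7.8),
    (4.5, 1.0), (1.0, 6.0), (9.0, 2.0),
]

print(find_outliers(points, radius=1.0, min_size=3))
```

The two dense groups play the role of the normal regions, and the three stray points are reported as outliers. The result depends strongly on the radius and minimum size, which would have to be tuned for real data.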
Fig. 1 Simple example of anomalies. Taken from [3].
Conclusion
The unsupervised method does not need to build or rely on data profiles, and its predictions are not based on earlier occurrences, but such models may be very expensive to build. The supervised method may be less expensive to build, but it requires an accurate data profile to be built up front. The supervised method also has some problems, chiefly the difficulty of developing sophisticated ways to identify the actual behavior of the user and to find the relationships in the data that are needed to build the data profile. Further research is needed, and college students should be encouraged to work in this field, because it is important in all areas of science and contributes efficient, high-quality service to the community.
References
[1] Pang-Ning Tan, Michael Steinbach, Vipin Kumar, 2006, Introduction to Data Mining, 1st ed., Addison-Wesley.
[2] Kalyani M Raval, 10 October 2012, Data Mining Techniques.
[3] Varun Chandola, September 2009, Anomaly Detection: A Survey.
[4] Gary Miner, H. G. Wells, 2009, Handbook of Statistical Analysis and Data Mining Applications, 1st ed., Academic Press.
[5] http://www.fraudadvisorypanel.org/, Fraud Fact, issue 12, April 2012.
[6] G.K. Palshikar, 28 May 2002, The Hidden Truth - Frauds and Their Control: A Critical Application for Business Intelligence.