finding shell company accounts using anomaly detection CODS COMAD2018 final

(1)

Devendra Kumar Luna

Computer Science and Engineering, Indian Institute of Technology Kanpur

Kanpur, Uttar Pradesh, India [email protected]

Girish Keshav Palshikar

TCS Research and Innovation, Tata Consultancy Services Limited

Pune, Maharashtra, India [email protected]

Manoj Apte

TCS Research and Innovation, Tata Consultancy Services Limited

Pune, Maharashtra, India [email protected]

Arnab Bhattacharya

Computer Science and Engineering, Indian Institute of Technology Kanpur

Kanpur, Uttar Pradesh, India [email protected]

ABSTRACT

Money launderingrefers to activities pertaining to hiding the true income, evading taxes, or converting illegally earned money for normal use. These activities are often performed throughshell companiesthat masquerade as real companies but where actual the purpose is to launder money. Shell companies are used in all the three phases of money laundering, namely, placement, layering, and integration, often simultaneously. In this paper, we aim to identify shell companies. We propose to useonlybank transactions since that is easily available. In particular, we look at all incoming and outgoing transactions from a particular bank account along with its various attributes, and use anomaly detection techniques to identify the accounts that pertain to shell companies. Our aim is to create an initial list of potential shell company candidates which can be investigated by financial experts later. Due to lack of real data, we propose a banking transactions simulator (BTS) to simulate both honest as well as shell company transactions by studying a host of actual real-world fraud cases. We apply anomaly detection algorithms to detect candidate shell companies. Results indicate that we are able to identify the shell companies with a high degree of precision and recall.1

CCS CONCEPTS

•Computing methodologies→Anomaly detection; Model-ing and simulation; •Information systems→Data mining;

KEYWORDS

Money Laundering; Shell Companies; Bank Transaction Simulation;

1_{The simulator is available from http://www.cse.iitk.ac.in/users/arnabb/projects/} money-laundering/.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

CoDS-COMAD ’18, January 11–13, 2018, Goa, India

ACM ISBN 978-1-4503-6341-9/18/01. . . $15.00 https://doi.org/10.1145/3152494.3152519

1 INTRODUCTION

Money laundering (ML)refers to activities performed to enable the use of illegally obtained money while hiding its true source from government and taxation authorities. As often stated in colloquial terms, it is converting “black” money to “white” [17], [18], [8], [11]. Illegally obtained money may come from frauds, bribes, or orga-nized crimes such as drug trafficking, illicit arms trades, gambling, etc., or even from incomes of legal businesses hidden for tax evasion. Money laundering is a worldwide phenomenon and is a serious threat to the national economies. The United Nations Office on Drugs and Crime estimated that criminal proceeds of roughly USD 1.6 trillion were laundered in 20092 (thisdoes notinclude legiti-mate earnings laundered for tax evasion). As a recent example, a global bank paid a fine of USD 1.9 billion to the US government in a large money laundering case3. Widespread money laundering has resulted in significant international collaboration to tackle it (e.g.,

www.fatf-gafi.org), and most countries have enacted laws related to it.

Broadly, any money laundering (ML) method has three phases. In theplacementphase, the money (typically cash) is brought into banks or other financial institutions, usually through a complex set of deposits in various bank accounts. In thelayeringphase, the money in the banks or financial institutions is routed or transferred to other accounts, and converted into various forms, through a complex set of transactions, in order to hide the trail and also to hide the true source of the money. In theintegrationphase, the money is put back to legitimate use, such as business investments, real estate purchases, etc.

Shell companiesare widely used as part of various ML schemes. The Enforcement Directorate of India recently found that about 1,000 shell companies were being used for ML by a single indi-vidual4. Ashell company(akafront companyormailbox company) is essentially a dummy (or virtual) organization. It has a minimal physical existence, does not possess any significant physical assets (except money), and has no significant physical business operations. A shell company is often created by lawyers or accountants, and is

2 http://www.unodc.org/unodc/en/frontpage/2011/October/illicit-money_-how-much-is-out-there.html

3 http://www.reuters.com/article/2012/12/11/us-hsbc-probe-idUSBRE8BA05M20121211

(2)

sometimes registered in known tax havens. A particular ML agent— person or organization—often creates multiple shell companies (we call them ashell set) that work in tandem as a group. Shell companies in a particular shell set are often interlinked through relationships such as one being a subsidiary of another, or through common directors, etc. Transactions among the shell companies within a shell set are characterized by excessive complications and/or vol-ume, and are often designed to hide money trails. Most business transactions of a shell company are fake in the sense that they are based on false purchase orders, false sales invoices, false customers, false employees, etc. While shell companies are widely used for ML, they are also used for tax evasion, purchasing land, obtaining large contracts, as fronts for individuals or other companies, and even as protection against litigation. Shell companies are used in all the three phases of ML, often simultaneously.

Typically, a shell company receives some money during the layer-ing phase. Then it transfers this money to other shell companies in the shell set through fake business transactions; it may also receive some money from them, thus creating a documentary evidence for the money as legal income. This cycle continues and eventually, the shell company may start participating in the integration phase, by using the money for investments and acquisitions. By generat-ing a very large number of such transactions, and by usgenerat-ing shell companies across different countries, the money trail becomes very difficult to unearth, thereby hiding the true source of the money.

In this document, we focus on the problem of classifying a com-pany account in a bank as belonging to a possible shell comcom-pany or not, usingonlythe incoming and outgoing transactions of that account over a period of time (say, 1 year). We do not consider the problem of detecting the shell set in this paper, for which we are applying community detection techniques over amoney flow graph[10]. Given that a bank may have millions of company ac-counts, we aim to simply create an initial list of potential (candidate) shell companies, which the experts can investigate. We assume that there is no labeled training data in which accounts are labeled as shell companies or normal companies; thus, we need to use unsu-pervised methods to identify accounts belonging to shell companies. In particular, we use anomaly detection techniques to identify shell company accounts, because of the following crucial assumption thatthe transactions of a shell company account have significantly different “characteristics” from those of normal companies. Hence, we represent each company account as a single vector of charac-teristic summary features, derived from the nature and timing of money flows, and then use various anomaly detection techniques to detect “discordance” between the expected and observed business transaction profiles of normal and shell companies.

At a high level, one can identify severalred flagsto identify a shell company account. For example, (i) a shell company makes “too much” use of wire transfers, electronic fund transfer or other indirect money transfer mechanisms; or (ii) it may be involved in “too many” international transactions, particularly from known tax havens. We compare various well-known anomaly detection algorithms in terms of how well they succeed in establishing this discordance between profiles of normal and shell companies. Since there are no public-domain datasets on shell companies, we de-scribe our synthetic data generator, in which we have encoded our understanding of the characteristics of the business transactions of

normal businesses. We generate transactions of shell companies by “violating” these characteristics in specific ways.

Our main contributions in this paper are:

(1) We create a simulator that generates banking transactions for normal companies. We enhance that to simulate several different types of shell companies.

(2) We propose an anomaly detection technique to identify shell company accounts using only the banking transaction char-acteristics.

The simulator and other codes are available from http://www.cse. iitk.ac.in/users/arnabb/projects/money-laundering/.

A lot of other types of information can be incorporated into our approach. For example, if any of the account holders (e.g., directors of a company) are known criminals or politically exposed persons, then we could add another summary variable (for each account) based on this information. However, our focus is to primarily use the patterns within the banking transactions and, hence, we do not make use of such side information. Similarly, there might be countries that are not tax havens but which have a weak banking or tax regime. We are currently not explicitly making use of this kind of side information, although our model can be easily extended to incorporate them.

The rest of the paper is organized as follows. Section 2 con-tains a short summary of related work. In Section 3, we explain the banking transaction simulator. Section 4 details the summary of data followed by experimental results of anomaly detection in Section 5. Finally, Section 6 concludes with possibilities for future work mentioned.

2 RELATED WORK

(3)

and all accounts with score above a given threshold were consid-ered suspicious. A stochastic approximation algorithm for finding roots of the logit function is used to findk0points closest to the current hyperplane and among these candidates, the point with the maximum value for the Fisher information matrix is selected and presented to the user for manual labeling. Michalak et al. [9] proposed a graph mining method for the detection of subgraphs corresponding to suspicious transaction patterns (e.g., a lattice-like sender-intermediaries-receiver pattern). Their method takes into consideration dependencies between individual transfers that may be indicative of illegal activities. Considering the huge transaction volumes, the authors discussed the use of data warehousing and OLAP cube technology in anti-ML applications. Wang et al. [19] used agent-oriented ontology to capture anti-ML knowledge. The FINCEN AI system (FAIS) links and evaluates reports of large cash transactions to identify potential ML [15]. FAIS clusters the trans-actions and using summary attributes of the clusters and linkages among clusters identifies groups of transactions for investigations. They also included innovative data visualization (e.g., wagon-wheel displays) and Case-Based Reasoning (CBR) and data mining tech-niques like nearest neighbor and decision trees. Chang et al. [2] presented coordinated visualization metaphors including heatmap, search by example, keyword graphs, and strings and beads, which are based on identifying specific keywords within the wire transac-tions. This set of visualizations helps analysts to detect accounts and transactions that exhibit suspicious behaviors. Several com-mercial products are available for ML detection; but, details of their internal algorithms are not clear.

3 SYNTHETIC DATA GENERATION

3.1 Transactions of an Honest Company

Due to lack of any public domain dataset about banking transac-tions of a company (honest or shell), we propose abanking trans-actions simulator (BTS). BTS is designed to take many parameters into consideration that affect a company’s banking activities di-rectly or indidi-rectly and use these parameters to generate banking transactions for a given period (we set the period to one year in this paper). Thus, in essence, we generate synthetic data that pro-duces all the banking transactions of a particular company, based on the specific values of these parameters for that company. The incoming banking transactions of a company are essentially about income (e.g., deposits, payments received from customers, returns on its investments, etc.) and the outgoing banking transactions are about expenditures (e.g., employee salaries, payments to suppliers, rents, utilities payments, operational expenses, taxes, refunds to customers, etc.). We assume that most (but not all) companies for which we generate data are located in India and, for convenience, we show all transaction amounts converted to Indian rupees. However, this is not essential and the BTS can be adapted to any geography. The parameters control various aspects of the transactions – the amounts, dates, frequency, payment modes, etc. These parameters, when taken together, depict a company’s general banking behavior. Each specific company has a different set of values for these param-eters, which distinguishes it from other companies. For example, a small-scale manufacturing company will spend a high percentage of its revenues on procurement of supplies (input materials) and

electricity bills, which may not be true for a consultancy firm. Simi-larly, a retail business (such as a jeweller) will have a large number of customers, but a bulk business such as a construction company will have much fewer clients. Even though companies belonging to the same industry sector tend to possess similar (but not identical) banking behavior, there will be differences among them that can be attributed to individual and random causes. Therefore, to capture this dissimilarity, we treat each parameter as a random variable (RV), and allow the user to only specify a probability distribution for each parameter. BTS then automatically chooses the actual value of each parameter (for a specific company) from the associated probability distribution.

The data generation algorithm is as follows. The only input is the numberNof companies for which to generate data.

For each company, the banking transactions data is generated as follows. The parameters are summarized in Table 1.

(1) Generate the metadata for the company, i.e., values for the random variablesindustry-sector, annual-revenue-category, employee-base, num-employees, num-owners, location, is-location-tax-haven, annual-revenue, etc.

(2) Generate company bank account details, such as initial-balance.

(3) Generate the actual number for the annual revenue of this company, from its workforce size and mean revenue per employee, as follows. Suppose for this company, annual-revenue-category = LOW. Then, generate a value for a RV X (mean revenue per employee per month) using the nor-mal distributionN(500000,50000). Generate a value for an-other RVY (standard deviation of revenue per employee per month) using the normal distributionN(0.1·X,0.01·X). Next, generate a value for the RVZcorresponding to the revenue per employee per month for this company using the normal distributionN(X,Y). Finally, generate the annual revenue for the company asannual-revenue=12×num-employees

×Z. For other values ofannual-revenue-category, the process

is the same, except that different mean and standard devi-ation values are used for the normal distribution ofX. In other words, the values forXandYcome from normal distri-butions whose means and deviations depend on the business category. This ensures that a business such as an investment firm will have a higher value of revenue per employee per month, in comparison with a low revenue business such as a small-scale manufacturer.

(4)

Table 1: Parameters for generating the banking transactions of a company.

Parameter Possible Values Probability Distributions

Parameters for Account Metadata

industry-sector 10 values (JEWELLERY, REAL-ESTATE, RESTAURANT, HOTEL etc.) discrete (user-specified)

annual-revenue-category HIGH, MEDIUM, LOW [0.34, 0.33, 0.33]

employee-base SMALL (1−50), MEDIUM (50−200), LARGE (200−500) [0.34, 0.33, 0.33]

num-employees positive integer conditional onemployee-base uniform random

num-owners 1, 2, 3, 4 or 5 [0.45, 0.3, 0.2, 0.025, 0.025]

location INDIA, OFFSHORE [0.99, 0.01]

is-location-tax-haven YES, NO (used only whenlocation =OFFSHORE) [0.01, 0.99]

annual-revenue positive real number conditional onannual-revenue-categoryandnum-employees see text

Parameters for Transaction

date 1 to 365 uniform random

transaction-type DEBIT, CREDIT see text

payment-mode CASH, CHEQUE, TRANSFER see text

currency INDIAN RUPEE, USD, OTHER discrete (user-specified)

amount positive real number see text

source-account see text see text

destination-account see text see text

having small amounts. Iteratively, we take one distribution and generate transactions using it until its proportion in theannual-revenueis exhausted. Thus, the total number of transactions is a RV, whose value depends upon the mixture coefficients. The date of each transaction is chosen uniformly randomly over the range 1 to 365 (since we are considering only 1 year). Other details of each transaction, such as the

payment-mode(cash, cheque, transfer),currency(Indian ru-pees or foreign),is-tax-haven(whether the transaction is from/to a known tax haven), etc., are all RVs, whose values are chosen randomly according to the corresponding user-specified probability distribution. For incoming transactions, the destination account is always the company account and source account IDs are generated randomly, with some prob-ability for repetition, as explained below.

(5) Generate outgoingsalary transactionfor each employee for each month. We pick the salary amount randomly from a mixture of 5 normal distributions, to cover salary variations due to seniority, expertise, role, designation, etc. The means of these normal distributions depend on the salary category (very low up to very high) and mixture weights denote the proportion of employees in that salary category. The date for each salary transaction is picked randomly towards the last few days of each month. Thepayment-modeis always TRANSFER, and other details of each salary transaction are chosen as above.

(6) A company has to pay its suppliers, for raw materials, ser-vices, etc. Companies inindustry-sector such as manufac-turing and retail spend a much larger percentage of their revenues on suppliers, compared to service based companies. The amounts in such outgoingsupplier transactions, corre-sponding to payments to suppliers, are generated iteratively, in a similar manner as above, using another mixture of 5 normal distributions. We could further refine our model to

distinguish between goods suppliers and services suppliers, which might be very useful in identifying shell companies, as some of these are known to spend a lot of money on services suppliers.

(7) A company has to pay for electricity, water, rent, taxes, phones, etc., typically every month; we call them outgo-ingutility transactions. Amounts in utility transactions are randomly generated using another mixture of 5 normal dis-tributions, in a similar manner as above. Some utility trans-actions, such as taxes, may have a different frequency (e.g., annual), which can be accommodated in a similar manner.

While generating incoming transactions of a company, the des-tination account will always be the company account and source accounts are generated randomly, but in case of outgoing transac-tions, company account will be the source account and destination accounts are generated randomly. These accounts that are asso-ciated with a company are referred to asfriend accountsin this paper. These friend accounts are generated randomly with some repeat probability which signifies how likely it is for a new trans-action to use an old (i.e., already used) friend account. If the repeat probability is high for incoming transactions of a business then there will be less number of unique clients. The friend accounts for salary transactions are employee accounts and they are fixed for every company. The Gaussian mixture in salary transactions char-acterizes the hierarchical structure of salaries in an organization. These transactions are generated with the assumption that each employee is paid salary every month which is generally followed in India by most of the industries. Other details of friend accounts, such as account type (INDIVIDUAL or COMPANY),location, etc. are chosen probabilistically in a manner suitable for the company’s

(5)

3.2 Transactions of a Shell Company

Shell companies are hard to detect due to their complex nature of working which makes them look like any other account. Some criminals go to great extent to hide the trail of money laundering by doing complicated transactions among domestic and international locations. We have analyzed some of the case studies of money laun-dering involving shell companies and formulated a few patterns [14], [16]. These patterns are simulated by the banking transactions simulator explained above with some modifications in the parame-ter setting. These modifications are not too aberrant, in the sense thatindividuallythey arenot ableto distinguish between normal and shell accounts. Only by considering themtogether, a shell com-pany shows a deviation from the general behavior of the business. This is an important point to note since otherwise detecting a shell company would have become much simpler and only one attribute need to be inspected.

Following are the patterns that we have accommodated in BTS to generate banking transactions of a shell company account:

(1) One prominent characteristic of a shell company (whether located in India, abroad or in a tax haven) is that it has a significant number of transactions with a company located in a known tax haven (such as Panama, the Bahamas etc.). Such transactions are out of jurisdiction of the home country and are hard to trace or prosecute. We can simulate such a shell company by appropriately modifying the parameters of foreign and tax haven transactions.

(2) Another prominent characteristic of a shell company is that it has too many (incoming or outgoing) transactions in small amounts—not necessarily in cash—from personal (individual) accounts. Some money laundering methods involve setting up hundreds of personal accounts and making many trans-fers in small amounts from those accounts to the shell com-pany account. The amounts involved in these transactions are just below the “radar” to avoid any unwanted attention, but often sum up to a very large amount in total. We can simulate such a shell company by appropriately modifying the parameters of the RVamount.

(3) Another prominent characteristic of a shell company (e.g., one acting as a small retail business) is that it has too many incoming cash transactions in small amounts (e.g., as pay-ments from individual customers). While the amounts in individual transactions are small, the total sum of cash re-ceived by the shell company is often fairly large. Assuming the customers are fake, such a shell company is acting as a placement agent in the money laundering methodology, al-lowing cash to come into the financial system as a legitimate income of a business. For example, such a shell company might be controlled by a drug lord, and allows him to have a large inflow of small cash bills as legitimate income. He could not possibly deposit such a large amount of cash directly in a few bank accounts, as it would alert the authorities. We can simulate such a shell company by appropriately modifying the parameters ofamountandpayment-mode.

(4) Another characteristic of a shell company is that it has fre-quent transactions in large amounts with a specific company located abroad, but not necessarily in a known tax haven.

The transactions of the shell company with such a foreign company are typically either incoming or outgoing but not both. Thepayment-modefor such transactions is usually TRANSFER. The idea here is that a shell company may act as a front for either an import or an export business. Unfortu-nately, this is a weak pattern and may hold even for honest companies.

(5) A typical layering phase in some money laundering methods involves creating and using, not one, but hundreds of shell companies together (ashell set). The layering is achieved by creating thousands of dummy transactions among the member shell companies of such a shell set, thereby making it very difficult to trace the origins of the money. The key characteristic here is that a shell company’s proportion of transactions among its shell set is much higher than that of transactions with other genuine accounts. Thus, a shell set forms acommunityamong the bank accounts of a particular bank, if many of the shell companies in the shell are using the same bank. The BTS system can generate transactions for such shell sets, although detection of such a shell set is not our focus in this paper.

4 DATA ANALYSIS

In this section, we summarize the synthetic banking transactions data that we generated using the BTS system described earlier. The transactions of the shell companies are generated using the 4 flags described in Section 3. Each transaction has the following form:

txid: (date, source account ID, destination account ID, source country, destination country, payment-mode, currency, amount)

The synthetic dataset contains 1,309,880,987 monthly banking transactions of 10,000 honest companies and 7,562,500 monthly banking transactions of 60 shell companies. Average annual rev-enue for honest companies is INR 30,252,019,031 while for shell companies it is INR 37,421,284,048. It is difficult to manually check each transaction for suspicious behavior. Also, a single transaction, when considered individually, does not give sufficient information about whether it is a part of a larger money laundering episode. Hence, based on domain knowledge and the patterns that we have outlined above, we have summarized a set of 20summary features

that describe all the transactions of a company and summarize an account’s (i.e., a company’s) banking activities within a year. Thus, we have summarized all the transactions of each account in the given one year period into a single vector, with these 20 features (see Table 2). These features are essentially non-linear functions of the data under study and are expected to represent the behavior of the companies.

(6)

Table 2: Summary features for each account.

ID Feature Description Mean (honest) St.dev. (honest) Mean (shell) St.dev. (shell)

a1 Proportion of local transactions 78.48 21.19 65.62 16.02

a2 Proportion of foreign transactions 21.52 21.19 34.38 16.02

a3 Proportion of tax haven transactions 11.85 14.46 20.10 14.41

a4 Clients to incoming transactions ratio 49.93 23.94 44.38 20.80

a5 Proportion of small amounting incoming transactions 32.12 21.20 47.57 18.91

a6 Proportion of large amounting incoming transactions 67.88 21.20 52.43 18.91

a7 Proportion of incoming saving accounts transactions 39.68 23.33 41.48 30.80

a8 Proportion of incoming current accounts transactions 60.32 23.33 58.52 30.80

a9 Proportion of incoming cash transactions 34.11 18.22 26.62 16.08

a10 Proportion of incoming online transactions 65.89 18.22 73.38 16.08

a11 Annual expenditure 49.82 23.99 46.98 20.93

a12 Proportion of large amounting outgoing transactions 44.47 23.67 48.92 22.10

a13 Proportion of small amounting outgoing transactions 55.53 23.67 51.08 22.10

a14 Average transactions per client 27.16 16.68 37.73 11.81

a15 Incoming transactions percentage 62.36 34.57 63.30 30.38

a16 Outgoing transactions percentage 37.64 34.57 36.70 30.38

a17 Average incoming transaction amount (monthly) 49,142.97 42,647.92 39,543.72 20,770.86 a18 Average outgoing transaction amount (monthly) 40,522.70 25,412.42 45,777.67 27,200.69 a19 Average number of incoming transactions (monthly) 1,03,699.61 1,00,900.05 88,549.97 46,568.17 a20 Average number of outgoing transactions (monthly) 27,288.49 19,958.46 34,877.70 30,590.68

Figure 1: Proportion of small amounting incoming transac-tions.

(47.56,18.91)respectively. The histogram for the featurea5 in Fig-ure 1 shows that using a threshold of featFig-urea5 alone, it is difficult to find the shell companies, considering the relative abundance of honest companies as compared to the shell companies. Likewise, it is also not possible to differentiate between honest and shell com-panies using any other particular single feature. Figure 2 shows the histogram of the featurea14 (average transactions per client). It is clear from the distribution of the data in the histograms that there is significant overlap between honest and shell companies and it is not possible to draw a threshold to distinguish between them. Figures 3, 4 and 5 show histograms for some more summary variables.

Another way to support the statement that it is not possible to use any one single summary feature to separate between honest and shell companies is to measure the summary variable’s correlation with the class label (honest or shell). The maximum correlation was between featurea5 and class label—0.05—which in itself is quite weak. We further created a correlation matrix for all the

pairsof the 20 summary features. Some of the feature pairs like (a1: proportion of local transactions anda2: proportion of foreign

Figure 2: Average transactions per client.

Figure 3: Proportion of foreign transactions.

transactions) are complimentary to each other and, therefore, show a perfect negative correlation of−1. Table 3 shows the top-5 pairs with highest positive correlation.

(7)

Figure 4: Proportion of tax haven transactions.

Figure 5: Proportion of incoming cash transactions.

Table 3: Top-5 pairs with highest positive correlation.

Summary feature pair Correlation value

a2−a3 0.787

a15−a19 0.734

a16−a17 0.703

a5−a19 0.668

a12−a20 0.668

in the data and followed the following procedures in order: (i) ap-plied an outlier detection algorithm to it, (ii) ranked the detected outlier records, (iii) selected top-Krecords with largest outlier score among them (we usedK=60 since that is the number of shell com-panies in our dataset), and (iv) checked how many among these Koutliers were indeed shell companies. With a single feature, the maximum accuracy was shown by featuresa1 anda2, where 5 out of 60 outliers were shell companies, thus having a precision, recall and F-score of onlyP =R =F =5/60=0.083. The next best single feature wasa18, which detected 4 shell companies out of 60 (P =R =F = 0.067). Note that some summary variables appear to have very different means (e.g.,a1 ora2); however, con-sidering the spread in the values (standard deviation), there is a significant overlap among the honest and shell companies. Thus, it is still to distinguish between honest and shell companies using such a summary variable alone. Also, often there is a single group of shell companies that work together to make a very complex trail of dummy transactions among themselves in order to hide the true source of the money. Detecting suchshell setsis currently out of scope of this paper and is a future avenue of work.

In the next experiment, we kept 2 summary variables at a time in the input data and checked how many among the topK=60 outliers were shell companies. When a combination of 2 features is considered as the feature vector for each account, then most of such pairs were able to detect 5 shell companies among the top 60 outliers. The pairsa15−a20 anda5−a13 were the best and were able

to detect 15 and 10 shell companies respectively among the top 60 outliers (the corresponding quality metrics areP=R=F=0.250 andP =R = F = 0.167 respectively). In a similar manner, the triplets ofa5−a13−a18 anda1−a16−a20 are able to detect 25 shell companies (P=R=F =0.417). The combination of 4 features

a1−a5−a13−a18 detected 27 shell companies (P=R=F =0.450).

In general, we found that the F-score for shell company detection increases as the number of summary features used increases.

Next, to check the redundancy among the 20 summary variables, we performed principal component analysis (PCA) on the input data. Note that since the pairs of variablesa1−a2,a5−a6,a7−a8,

a9−a10,a12−a13, anda15−a16 are complimentary (i.e., perfectly

negatively correlated), in essence, there are only 14 independent variables. The PCA analysis showed that we need at least 9 principal components out of these 14 to reproduce 90% of the total variance present in the data, and at least 10 components to capture 95% of the variance.

5 EXPERIMENTAL RESULTS

We have adapted some of the well known anomaly detection algo-rithms on the 20 summary feature data of 10,000 honest and 60 shell companies for the problem of detecting shell companies.

Ramaswamy, Rastogi and Shim (RRS) [13] defined a distance-based notion of an outlier distance-based on the distanceD(P,k)of a data pointPfrom itskthnearest neighbor. IfD(P,k)is large thenPlies in a sparse region and is most likely an outlier. Once the distance of each point from itskthneighbor is computed, either the top-K points or points withD(P,k)greater than a given threshold value are considered as outliers. Here, we have used the threshold based approach for the sake of comparison with other algorithms. We have experimented with the values ofkranging from 2 to 10 and found the best results fork=3 and threshold=20000.

A novel approach was proposed in [7] using the variance of an-gles between pairs of data points. The intuition is that the variance of angles subtended by a outlier with other points is much smaller compared to that by a normal point with other points. However, computing the angles and finding out variance for all the triplets in a dataset of sizenhas a time complexity ofO(n3). In [12], a near linear time approximation algorithm FastVOA was proposed using the AMS sketching approach. We have used FastVOA with the parameterss1=1600,s2=10,t =100, wheretis the number of random projections, ands1ands2are parameters that affect the accuracy of the AMS sketching. The values ofs1,s2 andtare taken as reported in [12].

(8)

Table 4: Performance of anomaly detection algorithms.

Algorithm Precision Recall F-score

RRS 0.16 0.20 0.18

FastVOA 0.49 0.57 0.52

LOF 0.98 0.80 0.88

Figure 6: The LOF outlier detection algorithm (red points are shell companies).

Table 4 shows the performance of the different anomaly detection algorithms for our dataset. The LOF algorithm clearly outperforms the other two. Also, it is simpler to use in practice, since we can easily tune the LOF threshold between 0 and 1.

Figure 6 shows the Local Outlier Factor (LOF) for each company. The honest companies are represented as green data points while the shell companies are shown as red data points. As can be seen, most of the shell companies have a significantly higher LOF score as compared to the honest ones. Since BTS simulates real scenarios, some of the shell companies have a neighborhood surrounded by honest companies and, hence, there are false negatives. Similarly, some honest companies show up as false positives.

6 CONCLUSIONS AND FUTURE WORK

Money laundering refers to activities pertaining to hiding the true income, evading taxes, or converting illegally earned money for normal use, all the while hiding the true source of the money as well as the true beneficiaries. Many schemes for money laundering involve shell companies, i.e., companies that masquerade as real companies whereas their primary purpose is to launder money. Shell companies are used in all three phases of money laundering (placement, layering, integration), often simultaneously.

Due to lack of any public domain dataset we have proposed a banking transactions simulator (BTS) that works on multiple pa-rameters affecting a company’s banking activities such industry sector, annual revenue category, number of employees, suppliers, utilities, etc. We have formulated a few patterns of frauds after care-fully analyzing some case studies of shell companies. Some of these patterns include sizable amount of transactions with tax havens, too many incoming transactions in small amounts, too many incoming

transactions in cash, frequent large amount transactions with a specific company abroad, large number of transactions with same set of accounts (shell sets), etc.

In this paper, we have shown that analyzing banking transactions by summarizing each account using a set of summary features and then using anomaly detection techniques on such account summary datasets can be used to detect candidate shell company accounts. Such candidate shell company can be investigated further by legal and financial experts to identify true shell companies among these. In future, we would like to focus on the detection ofshell sets, i.e., groups of shell companies operating together in the layering phase to hide the source of money by creating a false trail of a larger number of dummy transactions among themselves. Finally, we would love to work with a real dataset of banking transactions.

REFERENCES

[1] Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander. 2000. LOF: Identifying density-based local outliers.ACM Sigmod Record29, 2 (2000), 93–104.

[2] R. Chang, A. Lee, M. Ghoniem, R. Kosara, W. Ribarsky, and J. Yang et al. 2008. Scalable and interactive visual analysis of financial wire transactions for fraud detection.Information Visualization7, 1 (2008), 63–76.

[3] X. Deng, V. R. Joseph, A. Sudjianto, and J. Wu. 2009. Active learning via sequential design with applications to detection of money laundering.J. American Statistics Association104, 487 (2009), 969–981.

[4] Z. Gao and M. Ye. 2007. A framework for data mining-based anti-money laun-dering research.J. Money Laundering Control10, 2 (2007), 170–179.

[5] C. Ju and L. Zheng. 2009. Research on suspicious financial transactions recog-nition based on privacy preserving of classification algorithm. InProc. 1st Int. Workshop on Education Technology and Computer Science (ETCS09). 525–528. [6] Hans-Peter Kriegel, Peer Kroger, Erich Schubert, and Arthur Zimek. 2011.

Inter-preting and unifying outlier scores. InICDM. 13–24.

[7] Hans-Peter Kriegel, Arthur Zimek, et al. 2008. Angle-based outlier detection in high-dimensional data. InKDD. 444–452.

[8] J. Madinger. 2012.Money Laundering: A guide for criminal Investigators(3 ed.). CRC Press.

[9] K. Michalak and J. Korczak. 2011. Graph mining approach to suspicious transac-tion detectransac-tion. InProc. Federated Conference on Computer Science and Information Systems (FedCSIS). 69–75.

[10] G.K. Palshikar and M. Apte. 2008. Collusion Set Detection Using Graph Clustering.

Data Mining and Knowledge Discovery16 (2008), 135–164. Issue 2.

[11] G.K. Palshikar and M. Apte. 2013. Financial Security against Money Laundering: A Survey. InEmerging Trends in Information and Communication Technologies Security. Elsevier (Morgan Kaufman), 577–590.

[12] Ninh Pham and Rasmus Pagh. 2012. A near-linear time approximation algorithm for angle-based outlier detection in high-dimensional data. InKDD. 877–885. [13] Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. 2000. Efficient

algo-rithms for mining outliers from large data sets.ACM Sigmod Record29, 2 (2000), 427–438.

[14] Stephen Schneider. 2004. The incorporation and operation of criminally con-trolled companies in Canada.J. Money Laundering Control7, 2 (2004), 126–138. [15] T. E. Senator, H. G. Goldberg, J. Wooton, M. A. Cottini, A. F. Umar Khan, and

C. D. Klinger et al. 1995. The financial crimes enforcement network AI sys-tem (FAIS): Identifying potential money laundering from reports of large cash transactions.AI Magazine16, 4 (1995), 21–39.

[16] Graham Stack. 2015. Shell companies, Latvian-type correspondent banking, money laundering and illicit financial flows from Russia and the former Soviet Union.J. Money Laundering Control18, 4 (2015), 496–512.

[17] E.M. Truman and P. Reuter. 2004.Chasing dirty money: Progress on anti-money laundering. Peterson Institute.

[18] J.E. Turner. 2011.Money laundering prevention: Deterring, detecting, and resolving financial fraud. Wiley.

[19] Y. Wang, H. Wang, S. Gao, D. Xu, and K. Ye. 2007. Agent-oriented ontology for monitoring and detecting money laundering process. InProc. 2nd Int. Conf. on Scalable Information Systems (InfoScale).

[20] J. S. Zdanowicz. 2004. Detecting money laundering and terrorist financing via data mining.Commun. ACM47, 5 (2004), 53–55.

[21] G. Zengan. 2009. Application of cluster based local outlier factor algorithm in anti money laundering. InProc. Int. Conference on Management and Service Science (MASS09). 1–4.