Exploring the Dataset - Problem Description and Objectives

4.1 Problem Description and Objectives

4.2.2 Exploring the Dataset

To get an initial overview of the statistical properties of the data, we can use the function summary():1

> summary(sales)

ID Prod Quant Val

v431 : 10159 p1125 : 3923 Min. : 100 Min. : 1005 v54 : 6017 p3774 : 1824 1st Qu.: 107 1st Qu.: 1345 v426 : 3902 p1437 : 1720 Median : 168 Median : 2675 v1679 : 3016 p1917 : 1702 Mean : 8442 Mean : 14617 v1085 : 3001 p4089 : 1598 3rd Qu.: 738 3rd Qu.: 8680 v1183 : 2642 p2742 : 1519 Max. :473883883 Max. :4642955 (Other):372409 (Other):388860 NA's : 13842 NA's : 1182

Insp ok : 14462 unkn :385414 fraud: 1270

We have a signiﬁcant number of products and salespeople, as we can con- ﬁrm using the function nlevels():

> nlevels(sales$ID)

[1] 6016

> nlevels(sales$Prod)

[1] 4548

1_{An interesting alternative can be obtained using the function describe() from the extra}

The result of the summary() function reveals several relevant facts on this data. First there are a considerable number of unknown values in the columns Quant and Val. This can be particularly problematic if both happen at the same time, as this would represent a transaction report without the crucial information on the quantities involved in the sale. We can easily check if there are such situations:

> length(which(is.na(sales$Quant) & is.na(sales$Val)))

[1] 888

As you can see, this is a reasonable number of transactions. Given the large total amount of transactions, one can question whether it would not be better to simply delete these reports. We will consider this and other alternatives in Section 4.2.3.

As a side note, particularly relevant for very large datasets, there are more eﬃcient forms of obtaining this type of information. Although the previous code using length() and which() may be considered more understandable, we can take advantage of the way logical values are coded inR (T=1 and F=0) to obtain the same number more eﬃciently:

> sum(is.na(sales$Quant) & is.na(sales$Val))

[1] 888

Another interesting observation from the results of the summary() function is the distribution of the values in the inspection column. In eﬀect, and as expected, the proportion of frauds is relatively low, even if we only take into account the reports that were inspected, which are also a small proportion overall:

> table(sales$Insp)/nrow(sales) * 100

ok unkn fraud

3.6051712 96.0782359 0.3165930

Figure 4.1 shows the number of reports per salesperson. As you can con- ﬁrm, the numbers are rather diverse across the salespeople. Figure 4.2 shows the same number but per product. Again we observe a strong variability. Both ﬁgures were obtained with the following code:

> totS <- table(sales$ID) > totP <- table(sales$Prod)

> barplot(totS, main = "Transactions per salespeople", names.arg = "", + xlab = "Salespeople", ylab = "Amount")

> barplot(totP, main = "Transactions per product", names.arg = "", + xlab = "Products", ylab = "Amount")

Detecting Fraudulent Transactions 169

Transactions per salespeople

Salespeople Amount 0 2000 4000 6000 8000 10000

FIGURE 4.1: The number of transactions per salesperson.

Transactions per product

Products Amount 0 1000 2000 3000

The descriptive statistics of Quant and Val show a rather marked variability. This suggests that the products may be rather different and thus it may make sense to handle them separately. In effect, if the typical prices of the products are too different, then a transaction report can only be considered abnormal in the context of the reports of the same product. Still, these two quantities may not be the ideal ones to draw this conclusion. In effect, given the different quantity of products that are sold on each transaction, it is more correct to carry out this analysis over the unit price instead. This price can be added as a new column of our data frame:

> sales$Uprice <- sales$Val/sales$Quant

The unit price should be relatively constant over the transactions of the same product. When analyzing transactions over a short period of time, one does not expect strong variations of the unit price of the products.

If we check the distribution of the unit price, for example,

> summary(sales$Uprice)

Min. 1st Qu. Median Mean 3rd Qu. Max.

2.4480e-06 8.4600e+00 1.1890e+01 2.0300e+01 1.9110e+01 2.6460e+04 NA's

1.4136e+04

we again observe a rather marked variability.

Given these facts, it seems inevitable that we should analyze the set of transactions of each product individually, looking for suspicious transactions on each of these sets. One problem with this approach is that some products have very few transactions. In eﬀect, of the 4,548 products, 982 have less than 20 transactions. Declaring a report as unusual based on a sample of less then 20 reports may be too risky.

It may be interesting to check what the top most expensive and cheap products are. We will use the median unit price to represent the typical price at which a product is sold. The following code obtains the information we are looking for: > attach(sales) > upp <- aggregate(Uprice,list(Prod),median,na.rm=T) > topP <- sapply(c(T,F),function(o) + upp[order(upp[,2],decreasing=o)[1:5],1]) > colnames(topP) <- c('Expensive','Cheap') > topP Expensive Cheap [1,] "p3689" "p560" [2,] "p2453" "p559" [3,] "p2452" "p4195" [4,] "p2456" "p601" [5,] "p2459" "p563"

Detecting Fraudulent Transactions 171 We have attached the data frame to facilitate access to the columns of the data. We then obtained the median unit price of each product using the aggregate() function. This applies a function that produces some scalar value (in this case the median) to subgroups of a dataset formed according to some factor (or list of factors). The result is a data frame with the values of the aggregation function for each group. From this obtained data frame we have generated the ﬁve most expensive (cheapest) products by varying the parameter decreasing of the function order(), using the sapply() function.

We can conﬁrm the completely diﬀerent price distribution of the top products using a box plot of their unit prices:

> tops <- sales[Prod %in% topP[1, ], c("Prod", "Uprice")] > tops$Prod <- factor(tops$Prod)

> boxplot(Uprice ~ Prod, data = tops, ylab = "Uprice", log = "y")

The %in% operator tests if a value belongs to a set. The call to the function factor() is required because otherwise the column Prod of the data frame tops would have the same number of levels as the column in the original sales data frame, which would lead the boxplot() function to draw a box plot for each level. The scales of the prices of the most expensive and cheapest products are rather different. Because of this, we have used a log scale in the graph to avoid the values of the cheapest product becoming indistinguishable. This effect is obtained by the parameter setting log=y, which indicates that the Y -axis is on log scale (notice how the same distance in the axis corresponds to a different range of values of unit price). The result of this code is shown in Figure 4.3.

We can carry out a similar analysis to discover which salespeople are the ones who bring more (less) money to the company,

> vs <- aggregate(Val,list(ID),sum,na.rm=T) > scoresSs <- sapply(c(T,F),function(o) + vs[order(vs$x,decreasing=o)[1:5],1]) > colnames(scoresSs) <- c('Most','Least') > scoresSs Most Least [1,] "v431" "v3355" [2,] "v54" "v6069" [3,] "v19" "v5876" [4,] "v4520" "v6058" [5,] "v955" "v4515"

It may be interesting to note that the top 100 salespeople on this list account for almost 40% of the income of the company, while the bottom 2,000 out of the 6,016 salespeople generate less than 2% of the income. This may provide some insight into eventual changes that need to be carried out within the company:

● ● ● ● p560 p3689 1e−02 1e+00 1e+02 1e+04 Upr ice

FIGURE 4.3: The distribution of the unit prices of the cheapest and most

expensive products.

> sum(vs[order(vs$x, decreasing = T)[1:100], 2])/sum(Val, na.rm = T) *

+ 100

[1] 38.33277

> sum(vs[order(vs$x, decreasing = F)[1:2000], 2])/sum(Val,

+ na.rm = T) * 100

[1] 1.988716

If we carry out a similar analysis in terms of the quantity that is sold for each product, the results are even more unbalanced:

> qs <- aggregate(Quant,list(Prod),sum,na.rm=T) > scoresPs <- sapply(c(T,F),function(o) + qs[order(qs$x,decreasing=o)[1:5],1]) > colnames(scoresPs) <- c('Most','Least') > scoresPs Most Least [1,] "p2516" "p2442" [2,] "p3599" "p2443" [3,] "p314" "p1653" [4,] "p569" "p4101" [5,] "p319" "p3678"

Detecting Fraudulent Transactions 173 > sum(as.double(qs[order(qs$x,decreasing=T)[1:100],2]))/ + sum(as.double(Quant),na.rm=T)*100 [1] 74.63478 > sum(as.double(qs[order(qs$x,decreasing=F)[1:4000],2]))/ + sum(as.double(Quant),na.rm=T)*100 [1] 8.94468

You may have noticed in the code above the use of the function as.double(). This is required in this case because the sum of the quantities generates too large a number that must be stored as a double. This function ensures this transformation.

From the 4,548 products, 4,000 represent less than 10% of the sales volume, with the top 100 representing nearly 75%. Notice that this information is only useful in terms of the production of the products. In particular, it does not mean that the company should consider stopping the production of the products that sell too few units. In effect, these may be more profitable if they have a larger profit margin. Because we do not have any information on the production costs of the products, we cannot draw any conclusion in terms of the usefulness in continuing to produce these products that sell so few units. One of the main assumptions we will be making in our analysis to find abnormal transaction reports is that the unit price of any product should follow a near-normal distribution. This means that we expect that the transactions of the same product will have roughly the same unit price with some small variability, possibly caused by some strategies of the salespeople to achieve their commercial goals. In this context, there are some basic statistical tests that can help us in finding deviations from this normality assumption. An example is the box plot rule. This rule serves as the basis of outlier identification in the context of box plots that we have already seen several times in this book. The rule states that an observation should be tagged as an anomaly high (low) value if it is above (below) the high (low) whisker, defined as Q3+ 1.5× IQR

(Q1− 1.5 × IQR), where Q1 is the ﬁrst quartile, Q3 the third quartile, and

IQR = (Q3− Q1) the inter-quartile range. This simple rule works rather well

for normally distributed variables, and it is robust to the presence of a few outliers being based in robust statistics like the quartiles. The following code determines the number of outliers (according to the above deﬁnition) of each product:

> out <- tapply(Uprice,list(Prod=Prod),

+ function(x) length(boxplot.stats(x)$out))

The boxplot.stats() function obtains several statistics that are used in the construction of box plots. It returns a list with this information. The out component of this list contains the observations that, according to the box plot rule, are considered outliers. The above code calculates their number for the transactions of each product. The products with more outliers are the following:

> out[order(out, decreasing = T)[1:10]]

Prod

p1125 p1437 p2273 p1917 p1918 p4089 p538 p3774 p2742 p3338 376 181 165 156 156 137 129 125 120 117

Using this very simple method, 29,446 transactions are considered outliers, which corresponds to approximately 7% of the total number of transactions,

> sum(out)

[1] 29446

> sum(out)/nrow(sales) * 100

[1] 7.34047

One might question whether this simple rule for identifying outliers would be suﬃcient to provide the kind of help we want in this application. In Sec- tion 4.4.1.1 we will evaluate the performance of a small variant of this rule adapted to our application.

There is a caveat to some of the conclusions we have drawn in this section. We have been using the data independently of the fact that some of the reports were found to be fraudulent and some other may also be fraudulent although not yet detected. This means that some of these “conclusions” may be biased by data that is wrong. The problem is that for the transactions that are tagged as frauds, we do not know the correct values. Theoretically, the only transactions that we are sure to be correct are the ones for which the column Insp has the value OK, but these are just 3.6% of the data. So, although the analysis is correct, the conclusions may be impaired by low-quality data. This should be taken into account in a real-world situation not to provide advice to the company based on data that includes errors. Because a complete inspection of the data is impossible, this risk will always exist. At most we can avoid using the small number of transactions already found to be errors in all exploratory analysis of the data. Another thing one can do is present the results to the company and if some result is unexpected to them, carry out a closer analysis of the data that leads to that surprising result. This means that this sort of analysis usually requires some form of interaction with the domain experts, particularly when there are doubts regarding data quality, as is the case in this problem. Moreover, this type of exploratory analysis is of key importance with low-quality data as many of the problems can be easily spotted at these stages.

In document Data Mining With R Learning With Case Studies _KUPdLMi3 (Page 183-190)