Overfitting and K-fold Cross-validation

Overfitting is a common issue in machine learning that can seriously affect the reliability of the models built. It often leads to a trained model that works well on the training dataset but lacks the generality needed to handle unseen data. Figure 4.8 shows a simple regression problem that helps to explain the issue. In this figure, black dots are training data that the model sees during training; white dots are testing data, which are unknown during training. In a regression problem, the task is to find a line which can

CHAPTER 4. PLAYER MODELLING WITH DATA MINING 57

Table 4.8: A brief summary of measurements being applied in this work

Measurement Name | Imbalanced Data Impact | Random Classifier Performance
Area Under PR curve | Yes, performance is positively correlated with the proportion of positive examples | Can be approximated as the ratio of positive examples
Cohen’s Kappa | Yes, performance decreases with bias to any class | 0.0
Area Under ROC curve | No | 0.5

Figure 4.8: Overfitting Problem

minimise the sum of distances from all data points to the line. The left sub-figure uses a linear (less complex) model to fit the data points; the right one uses a non-linear (more complex) model. Looking at the training data (black dots) only, the more complex model passes through all training points and has no error, whereas the simpler model performs worse. However, when both models are used to predict unseen data points (white dots), the simpler model has a lower error than the complex one. In this case, the complex model is said to be overfitted. As this example shows, more complex (higher-dimensional) models are more likely to overfit, because they can cover more of the training examples, including the outliers. For the same reason, however, on a larger dataset (where the ratio between ‘normal’ data and outliers is generally higher), more complex models can generally give better predictions.
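The comparison in Figure 4.8 can be reproduced with a small sketch. It is not taken from this work: the data points are hypothetical, the error measure is mean squared error (an assumption standing in for the sum of distances), and the complex model is a Lagrange polynomial chosen only because it passes through every training point.

```python
def fit_line(points):
    """Simple model: least-squares straight line y = a*x + b."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return lambda x: a * x + b


def fit_interpolating_polynomial(points):
    """Complex model: Lagrange polynomial passing through every training point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


def mean_squared_error(model, points):
    """Average squared difference between predictions and observed values."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)


# Hypothetical data: the true trend is y = x; the training points carry noise.
train = [(0.0, 0.3), (1.0, 0.6), (2.0, 2.4), (3.0, 2.7), (4.0, 4.3), (5.0, 4.6)]
test = [(0.5, 0.5), (1.5, 1.5), (2.5, 2.5), (3.5, 3.5), (4.5, 4.5)]

line = fit_line(train)
poly = fit_interpolating_polynomial(train)

for name, model in (("linear", line), ("polynomial", poly)):
    print(name,
          "train MSE:", round(mean_squared_error(model, train), 4),
          "test MSE:", round(mean_squared_error(model, test), 4))
```

On this toy data the polynomial has no training error but a larger testing error than the straight line, which is the pattern described for the two sub-figures.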

Overfitting happens when machine-learning algorithms capture both the needed information and misleading random noise at the same time (Lee et al., 2006). As discussed in the above example, this situation often appears when the amount of data is insufficient relative to the complexity of the model to be trained. With few data points, the classifier cannot reliably distinguish ‘normal’ data points from outliers, because both are small in number. In an extreme situation, in which the amount of real information is the same as the amount of outliers, an algorithm can hardly decide which is the best path to follow. For this reason, the resultant classifier performs well only on the training set; it generalises poorly to unseen samples. In other words, the performance shown by classifiers trained in this way is unreliable.

To limit the impact of overfitting on evaluation and to report reliable performance, K-fold cross-validation is a widely used method of evaluation (Arlot et al., 2010). This method splits the whole dataset into K pieces and repeats the experiment K times, each time using a different piece as the testing set and the remaining pieces as the training set. The procedure is given in Algorithm 1. Notice that, in each repeated experiment, the training set and testing set are disjoint: that is, the trained classifiers are always tested on unseen samples. In this work, the commonly used 10-fold cross-validation was applied to every experiment.

Algorithm 1: K–Fold Cross Validation

procedure K–Fold Cross Validation
    Split Dataset into k pieces
    resultSet = { }
    for each piece i ∈ k do
        testingSet = i, trainingSet = k \ i
        Train classifier with trainingSet
        Test classifier with testingSet and store in currentResult
        resultSet = resultSet ∪ currentResult
    end for
    Calculate averaged result from elements in resultSet
end procedure
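Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the implementation used in this work: the function names, the shuffling step, the nearest-class-mean classifier and the toy data are all assumptions introduced here.

```python
import random


def split_into_folds(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k disjoint pieces."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def k_fold_cross_validation(rows, labels, k, train_fn, score_fn, seed=0):
    """Average score over k rounds; each piece is the testing set exactly once."""
    folds = split_into_folds(len(rows), k, seed)
    result_set = []
    for i, test_idx in enumerate(folds):
        # The training set is every piece except piece i.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = train_fn([rows[j] for j in train_idx],
                         [labels[j] for j in train_idx])
        result_set.append(score_fn(model,
                                   [rows[j] for j in test_idx],
                                   [labels[j] for j in test_idx]))
    return sum(result_set) / len(result_set)


# Hypothetical stand-in classifier: nearest class mean on one numeric feature.
def train_nearest_mean(rows, labels):
    means = {}
    for label in set(labels):
        values = [r for r, y in zip(rows, labels) if y == label]
        means[label] = sum(values) / len(values)
    return means


def accuracy(means, rows, labels):
    predictions = [min(means, key=lambda c: abs(r - means[c])) for r in rows]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical, well-separated toy data: two classes of ten samples each.
rows = [float(v) for v in range(10)] + [float(v) for v in range(100, 110)]
labels = [0] * 10 + [1] * 10
mean_accuracy = k_fold_cross_validation(rows, labels, 5, train_nearest_mean, accuracy)
print(mean_accuracy)
```

Passing the training and scoring steps in as functions keeps the splitting logic independent of any particular classifier or evaluation measure.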

As introduced in Section 4.6, several experiments in this work use feature selection as an essential part. In this case, K-fold cross-validation has to be modified. Since all experiments were performed with 10-fold cross-validation, feature selection should be applied separately within each of the ten repeated experiments rather than once to the whole dataset. If feature selection were done before splitting, all data points would take part in the selection, so information from samples that later fall into the testing set would leak into the selected features. This breaks the principle of k-fold cross-validation, which keeps the testing set unseen by the training process in every sub-experiment, and can bias the results. Therefore, Algorithm 1 for k-fold cross-validation was modified to Algorithm 2 when feature selection was enabled.
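The modification can be sketched as below: feature selection is run inside the loop, on the training pieces only, and the same selected columns are then applied to the testing piece. The selection rule (largest gap between class means), the nearest-class-mean classifier and the toy data are hypothetical stand-ins, not the methods used in this work.

```python
import random


def class_means(rows, labels, feature):
    """Mean of one feature for each class label."""
    means = {}
    for label in set(labels):
        values = [row[feature] for row, y in zip(rows, labels) if y == label]
        means[label] = sum(values) / len(values)
    return means


def select_features(rows, labels, n_keep):
    """Toy filter for binary labels 0/1: keep features whose class means differ most."""
    def gap(feature):
        means = class_means(rows, labels, feature)
        return abs(means[0] - means[1])
    ranked = sorted(range(len(rows[0])), key=gap, reverse=True)
    return sorted(ranked[:n_keep])


def keep_columns(rows, selected):
    return [[row[f] for f in selected] for row in rows]


def fold_accuracy(train_rows, train_labels, test_rows, test_labels):
    """Nearest-class-mean classifier, scored by accuracy on the testing set."""
    n_features = len(train_rows[0])
    centroids = {
        label: [class_means(train_rows, train_labels, f)[label]
                for f in range(n_features)]
        for label in set(train_labels)
    }

    def predict(row):
        return min(centroids, key=lambda c: sum(
            (a - b) ** 2 for a, b in zip(row, centroids[c])))

    hits = sum(predict(r) == y for r, y in zip(test_rows, test_labels))
    return hits / len(test_rows)


def cross_validate_with_selection(rows, labels, k, n_keep, seed=0):
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    scores, selections = [], []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        train_rows = [rows[j] for j in train_idx]
        train_labels = [labels[j] for j in train_idx]
        # Feature selection sees the training pieces only, never the testing piece.
        selected = select_features(train_rows, train_labels, n_keep)
        selections.append(selected)
        scores.append(fold_accuracy(
            keep_columns(train_rows, selected), train_labels,
            keep_columns([rows[j] for j in test_idx], selected),
            [labels[j] for j in test_idx]))
    return sum(scores) / len(scores), selections


# Hypothetical data: feature 0 separates the classes, feature 1 is constant.
rows = [[float(i), 1.0] for i in range(10)] + [[100.0 + i, 1.0] for i in range(10)]
labels = [0] * 10 + [1] * 10
mean_score, selections = cross_validate_with_selection(rows, labels, k=5, n_keep=1)
print(mean_score, selections)
```

Because the selection is repeated per fold, different folds may in general keep different features; the returned list makes that visible.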

4.10 Summary

This chapter provides an overview of the experiments that were conducted in this work. There are seven stages to each experiment: game data sources, labelling methods, balancing methods, data representations, feature selection, classification algorithms and evaluation methods. The basic information for each stage has been introduced in turn. In summary, Algorithm 3 depicts how classifiers are trained in the experiments of this work. In this algorithm, Dataset X, which has been produced by the selected data representation, is ready for the training process, and the label Y is chosen by the selected labelling method. Based on the experimental procedure introduced here, the case studies conducted


Algorithm 2: K–Fold Cross Validation with Feature Selection

procedure K–Fold Cross Validation with Feature Selection
    Split Dataset into k pieces
    resultSet = { }
    for each piece i ∈ k do
        testingSet = i, trainingSet = k \ i
        Get selectedFeatureSet via feature selection based on trainingSet
        Filter trainingSet to trainingSetSelected by only keeping selectedFeatureSet
        Filter testingSet to testingSetSelected by only keeping selectedFeatureSet
        Train classifier with trainingSetSelected
        Test classifier with testingSetSelected and store in currentResult
        resultSet = resultSet ∪ currentResult
    end for
    Calculate averaged result from elements in resultSet
end procedure

Algorithm 3: The Simplified Complete Modelling Process

procedure Train(RawData R)
    Get ready-to-use Dataset X by selected data representation from R
    Get label list Label Y with selected labelling methods from R
    Split Dataset X and Label Y into k pieces
    resultSet = { }
    for each piece i ∈ k do
        testingSet = i, trainingSet = k \ i
        Get selectedFeatureSet via feature selection based on trainingSet
        Filter trainingSet to trainingSetSelected by only keeping selectedFeatureSet
        Filter testingSet to testingSetSelected by only keeping selectedFeatureSet
        Apply balancing method in trainingSetSelected
        Select the algorithm as the training classifier
        Perform random search (3-fold cross-validation) on trainingSetSelected to find best hyper-parameters for the classifier
        Set the classifier with the best hyper-parameters
        Train the classifier with trainingSetSelected
        Test the classifier with testingSetSelected and store in currentResult
        resultSet = resultSet ∪ currentResult
    end for
    Calculate averaged result from elements in resultSet
end procedure

in the following chapters, which use event-frequency-based data representation for different predictive purposes, can now be discussed. In the next chapter, the case studies start with a popular predictive target: first purchase. The experiments conducted for predicting first purchases help to investigate both the generality and the performance that event-frequency-based data representation can bring.
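The per-fold tuning stage of Algorithm 3 (balancing the training data, running a random search scored by 3-fold cross-validation, then training with the best hyper-parameters) can be sketched as follows. Everything concrete here is an assumption made for illustration: random oversampling as the balancing method, k-nearest neighbours as the classifier, its neighbour count as the only hyper-parameter, and the toy data. Feature selection and the outer cross-validation loop are omitted to keep the sketch short.

```python
import random


def oversample(rows, labels, rng):
    """Random oversampling: duplicate minority-class rows until classes are equal."""
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def knn_predict(train_rows, train_labels, row, k):
    """Majority vote among the k nearest training rows (squared Euclidean distance)."""
    nearest = sorted(
        zip(train_rows, train_labels),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], row)),
    )[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)


def inner_cv_score(rows, labels, k, n_folds, rng):
    """Mean validation accuracy of k-NN over n_folds disjoint pieces."""
    order = list(range(len(rows)))
    rng.shuffle(order)
    folds = [order[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        fit_rows = [rows[j] for j in train_idx]
        fit_labels = [labels[j] for j in train_idx]
        hits = sum(knn_predict(fit_rows, fit_labels, rows[j], k) == labels[j]
                   for j in val_idx)
        scores.append(hits / len(val_idx))
    return sum(scores) / n_folds


def tune_and_train(train_rows, train_labels, candidates, n_iter=5, seed=0):
    rng = random.Random(seed)
    # Balance the training data only; the testing set is never touched here.
    rows, labels = oversample(train_rows, train_labels, rng)
    best_k, best_score = None, -1.0
    for _ in range(n_iter):
        # Random search: draw a candidate and score it with 3-fold cross-validation.
        k = rng.choice(candidates)
        score = inner_cv_score(rows, labels, k, n_folds=3, rng=rng)
        if score > best_score:
            best_k, best_score = k, score
    # For k-NN, "training" amounts to storing the balanced data with the best k.
    return (lambda row: knn_predict(rows, labels, row, best_k)), best_k


# Hypothetical imbalanced data: 12 samples of class 0, 4 samples of class 1.
rows = [[i * 0.1, i * 0.1] for i in range(12)] + \
       [[10 + i * 0.1, 10 + i * 0.1] for i in range(4)]
labels = [0] * 12 + [1] * 4
predict, best_k = tune_and_train(rows, labels, candidates=[1, 3, 5])
print(best_k, predict([0.5, 0.5]), predict([9.5, 9.5]))
```

In a full run, this function would be called once per outer fold on that fold's training set, and the returned classifier would then be scored on the corresponding testing set.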

Chapter 5

Predicting First Purchase

The previous chapter introduced the basic approach to my experiments. From this chapter onwards, the experiments performed to investigate both the generality and the performance of event-frequency-based data representations are discussed. As introduced in Chapter 3, revenue-related predictive purposes have attracted many studies in the game-analytics area. In this chapter, the player’s first-purchasing behaviour, one of the most important revenue-related predictive purposes, is used as the predictive target. The first part of this chapter reviews the state of the art in this area. Next, the results from the experiment of predicting first purchases with event-frequency-based data representation are presented and discussed. As mentioned in Section 4.6, since event-frequency-based data representation is a high-dimensional approach, feature selection is added for dimensionality reduction to see whether a less complex model can be generated without losing significant accuracy.

Main points in this chapter:

• introduction to the problem of first purchase,

• case studies for predicting first purchase, and

• experiments for investigating the effect caused by feature selection.

5.1 First Purchase

Game revenue comes from different sources depending on the business model. In the modern game industry, two main types of monetization are typically used: fixed pricing (including subscriptions for online games) and ‘freemium’ pricing (Marchand and Hennig-Thurau, 2013). In the fixed-pricing case, in addition to the paid basic content, modern games usually offer extra content, such as downloadable content (DLC), which can be considered an in-game purchase. The main content of freemium games, on the other hand, is often free of charge, and in-game items, such as special skins and powerful items, are the sources of revenue. This strategy is usually found in mobile games and web games, in which in-game purchases act as the main source of revenue. Whichever business model a game follows, in-game purchasing behaviours are important, especially for ‘freemium’ games. Because it is so closely tied to revenue, purchasing-behaviour prediction is important to any company: once a predictive model is built, developers are able to identify the important potential purchasers in their games, so that special care can be taken of these players to achieve better revenue.

Figure 5.1: Experiment of First Purchase Prediction

As described in Section 3.6, many efforts have been made to predict purchase behaviours. However, except for a recent study by Sifa et al. (2015), most studies in this area do not focus on players’ first-purchase decisions. As discussed in Section 4.5.2, we attempted to conduct an experiment that would allow us to compare event-frequency-based data representation (introduced in Section 4.5.1) with the features used by Sifa et al. (2015). However, because most of the features they chose were not available in our game datasets, the comparison could not be made. Details of the availability issue are explained further in Section 5.2. First purchase is a special and important behaviour among all purchasing behaviours, because it is the point at which a non-paying player becomes a paying one. According to Kim (2012), once a player has made his or her first purchase, it is very often the case that he or she will start paying for more items. To investigate whether the first purchase can be successfully foreseen, event-frequency-based data representation is utilised.

In document A generic data representation for predicting player behaviours (Page 56-62)