Data Representation - A generic data representation for predicting player behaviours

The next stage in the blueprint is data representation. Several methods for representing the data were reviewed in Chapter 3. However, as explained there, most current methods lack generality and therefore cannot be applied across games. This section introduces, as the main contribution of this research, a new data-representation method named event-frequency-based data representation.

4.5.1 Event Frequency-based data representation

The limitation on generality in game metrics mostly comes from two aspects: game specificity and data availability. The first aspect is easy to understand: a data representation that is formed by game-content features can hardly be applied to a game with different content, especially a game in another genre. For instance, player behaviours in football games can be very different from those in a music game. The second aspect mostly depends on the design of a game's data-collection system. For example, the length of a game session is reasonably generic across games (Hadiji et al., 2014), but it may not be collected if the data-collection system is not designed to record it. When session information is not directly available, an alternative way to calculate the session length is to take the time difference between the ‘last’ and the ‘first’ event in the same session. However, this estimate becomes unreliable in situations such as the one shown in Figure 4.5.

CHAPTER 4. PLAYER MODELLING WITH DATA MINING 49

Figure 4.5: Problem for calculating session length. The game-session lengths of players A and B are different. However, since the time between the first and last recorded events is the same, their game-session lengths will be considered the same.
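The session-length problem can be illustrated with a minimal sketch. The timestamps and true session lengths below are hypothetical values chosen only to mimic the situation described for players A and B; they are not taken from any dataset used in this work, and the function name is likewise an assumption.

```python
def estimated_session_length(event_times):
    """Estimate session length (seconds) as last minus first recorded event time."""
    if not event_times:
        raise ValueError("at least one recorded event is required")
    return max(event_times) - min(event_times)


# Hypothetical sessions: B keeps playing after the last *recorded* event,
# so B's true session is longer than A's.
player_a = {"true_length": 600, "event_times": [0, 120, 600]}
player_b = {"true_length": 900, "event_times": [0, 300, 600]}

est_a = estimated_session_length(player_a["event_times"])
est_b = estimated_session_length(player_b["event_times"])

# Both estimates are 600 seconds although the true lengths differ.
print(est_a, est_b)
```

Any activity that happens before the first or after the last recorded event is invisible to this estimate, which is why it cannot distinguish the two players.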

Thus, to solve both problems, a new data-representation method needs to be both independent of game content and able to use whatever game events are available. Although many efforts have been made to achieve various predictive purposes (e.g., disengagement and purchase) in the area of game data mining, as discussed in Sections 3.5 and 3.6, the most widely investigated approaches do not provide data-representation methods that are generic enough to be migrated to different games without adaptation. To cope with this issue, a new generic data-representation method is introduced in this section as the main contribution of my research.

A possible solution is to represent each player by the number of times each event appears in that player's data. This is inspired by a similar model widely used in text mining (Zhang et al., 2010). The model is called ‘bag of words’: the words in an article are enumerated, and each distinct word is treated as a single dimension (feature) (Cormack, 2007). This approach makes sense because the frequency of words reflects some information about the whole article. Likewise, in the context of games, the frequency of the events that a player performs or experiences may also contain valuable patterns. Furthermore, event frequency provides good generality, as it is independent of game content and can use any event that happens in the game.
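The counting idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input format (a flat log of `(player_id, event_name)` pairs), the function name and the toy event names are hypothetical, and the sketch is not the implementation used in this work.

```python
from collections import Counter, defaultdict


def event_frequency_features(event_log):
    """Turn (player_id, event_name) pairs into a player-by-event count matrix."""
    counts = defaultdict(Counter)
    for player_id, event_name in event_log:
        counts[player_id][event_name] += 1

    # One feature (column) per distinct event name, as in bag of words.
    vocabulary = sorted({name for c in counts.values() for name in c})
    players = sorted(counts)
    # A Counter returns 0 for events a player never triggered.
    matrix = [[counts[p][name] for name in vocabulary] for p in players]
    return players, vocabulary, matrix


# Hypothetical toy log; real event names depend on the game's logging system.
toy_log = [
    ("p1", "LevelUp"),
    ("p1", "LevelUp"),
    ("p2", "VideoPlayed"),
    ("p1", "VideoPlayed"),
]
players, vocabulary, matrix = event_frequency_features(toy_log)
```

Because the function only looks at event names as opaque labels, the same code applies to any game whose log can be reduced to such pairs.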

To verify this conjecture, the main contribution of this work, event-frequency-based data representation, relies only on the number of occurrences of each event in a game. Table 4.5 gives a subset example of the feature space that was built for the game I Am Playr while predicting players’ disengagement behaviours. In this example, the three randomly selected events are “LevelUpOffer-MissOut–Unset”, “Video-Played–MD10a-playrmp4” and “Player-Trophy-UnlockItem-IAmTyphon”. Their definitions can be found in Table 4.4.

Table 4.4: Example event explanation

| Event Name | Explanation |
| --- | --- |
| LevelUpOffer-MissOut–Unset | The number of times that the player missed a level-up offer called ‘unset’ |
| Video-Played–MD10a-playrmp4 | The number of times that the player played a video (generally when a milestone is reached) |
| Player-Trophy-UnlockItem-IAmTyphon | The number of times that the player unlocked a trophy named ‘IAmTyphon’ |

Table 4.5: Event Frequency Data Representation

| | LevelUpOffer-MissOut–Unset | Video-Played–MD10a-playrmp4 | Player-Trophy-UnlockItem-IAmTyphon |
| --- | --- | --- | --- |
| Player 1 | 10 | 1 | 1 |
| Player 2 | 9 | 1 | 1 |
| Player 3 | 8 | 1 | 0 |
| Player 4 | 3 | 1 | 1 |
| Player 5 | 9 | 0 | 1 |

Taking Player 1 as an example, he/she missed the level-up offer named ‘unset’ 10 times, played the video ‘MD10a’ once and unlocked the trophy ‘IAmTyphon’ once. As only the counts of the events are used, their actual meanings become less important in this data representation. This is also why this data-representation method can be applied across different games for multiple predictive purposes. Therefore, the hypothesis of my research is as follows:

Event-frequency-based data representation can be used to predict player behaviour with supervised learning, providing significantly better performance than random guessing and competitive performance when compared with other state-of-the-art methods, where applicable.

In this hypothesis, to be more precise, “Method A provides a significantly better performance than Method B” represents the situation where, in all cases of an experiment, the p-values resulting from two-tailed t-tests conducted between A and B are less than 0.01, and the t-values and effect sizes are both positive. In contrast, “Method A provides a competitive performance when compared with Method B” stands for the situation where, in most cases of an experiment, either A provides significantly better performance than B or no significant difference (p-value larger than 0.01) is found between A and B.
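The two definitions can be written down as simple decision rules. The sketch below assumes that the t-tests have already been run and that each case of an experiment is summarised by a p-value, a t-value and an effect size. The `CaseResult` container, the function names, the reading of ‘most cases’ as a strict majority and the treatment of a p-value of exactly 0.01 as not significant are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence

ALPHA = 0.01  # significance level stated in the text


@dataclass(frozen=True)
class CaseResult:
    """Summary of one two-tailed t-test between methods A and B (hypothetical)."""
    p_value: float
    t_value: float
    effect_size: float


def a_better_in_case(r: CaseResult) -> bool:
    # Significant difference in favour of A: p < 0.01, t > 0 and effect size > 0.
    return r.p_value < ALPHA and r.t_value > 0 and r.effect_size > 0


def no_difference_in_case(r: CaseResult) -> bool:
    # Assumption: p == 0.01 is treated as not significant.
    return r.p_value >= ALPHA


def significantly_better(results: Sequence[CaseResult]) -> bool:
    """A is significantly better than B if it wins in all cases."""
    return len(results) > 0 and all(a_better_in_case(r) for r in results)


def competitive(results: Sequence[CaseResult], min_fraction: float = 0.5) -> bool:
    """A is competitive if it wins or ties in most cases (assumed: more than half)."""
    if not results:
        return False
    ok = sum(1 for r in results if a_better_in_case(r) or no_difference_in_case(r))
    return ok / len(results) > min_fraction


# Hypothetical results: A wins once, ties once and loses once.
example = [
    CaseResult(p_value=0.001, t_value=3.2, effect_size=0.8),
    CaseResult(p_value=0.200, t_value=-0.5, effect_size=-0.1),
    CaseResult(p_value=0.005, t_value=-4.0, effect_size=-0.9),
]
```

With these hypothetical results, A is not significantly better than B, because it does not win in every case, but it counts as competitive, because it wins or ties in two of the three cases.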

4.5.2 Game Specific Data Representation

To provide a point of comparison for event-frequency-based data representation in predicting players’ disengagement behaviours, this work also implements a game-specific data-representation method introduced by Runge et al. (2014). The details of this data-representation method are introduced in Chapter 6.

As for predicting players’ first purchases, to the best of my knowledge, only a few works (Sifa et al., 2015) have aimed at the same predictive purpose. Unfortunately, most of the features used in that data representation are not available in the game datasets used in this study. Therefore, to predict players’ first-purchase behaviours, a random classifier was used as the baseline for comparison with event-frequency-based data representation.
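A random-classifier baseline of this kind can be sketched as follows. The text does not state how the random classifier draws its predictions, so the sketch assumes a uniform choice between the two classes (1 for ‘will purchase’, 0 otherwise); the function names and the fixed seed are also assumptions made for illustration.

```python
import random


def random_baseline_predictions(n_players, seed=0):
    """Predict 0 or 1 uniformly at random for each player (assumed baseline)."""
    rng = random.Random(seed)  # local generator keeps results reproducible
    return [rng.randint(0, 1) for _ in range(n_players)]


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


baseline = random_baseline_predictions(1000, seed=42)
```

A model built on event-frequency features would then be evaluated on the same players and compared against the baseline's scores.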
