3.4 Benchmarks for Expensive Continuous Optimisation
4.1.4 Sources of Uncertainty
Based on the described taxonomy, we can identify several different sources and types of uncertainties that all interact in a typical game optimisation problem (see sections 4.1.4.1 and 4.1.4.2). These also depend on the choice of evaluation model (see section 4.1.4.3).
4.1.4.1 Feedback Dimension
The process of obtaining appropriate feedback for game content can be relatively complex, and there are many caveats to consider, depending on what type of data is acquired.
Feedback Survey For example, the most common way to obtain feedback from human players, either explicitly or implicitly (EXP, IMP), is via a survey. While using this type of feedback reduces the number of assumptions required to build an evaluation model, several problems can arise from the nature of real-world experiments. For example, external issues with the experiments could cause a loss of immersion for the participants. Additionally, human bias might affect both the creation and the answering of a survey questionnaire. For instance, the questions might not be neutral, or a participant might be influenced by factors unrelated to a given question, e.g. by the entertainment value of a game. Furthermore, it is difficult to mobilise enough survey participants to obtain sufficient data to train an evaluation model. Nevertheless, survey participants should be selected with care and with the target audience in mind in order to avoid introducing a sampling bias.
Interpretation of feedback data No matter how and what feedback is obtained, the collected data needs to be interpreted and translated into a fitness function. This can introduce additional errors and biases. While the risk of errors is reduced the more detailed and explicit the questions posed to human players are (EXP), there is always the possibility of miscommunication. Additionally, when qualitative feedback is obtained, it is interpreted through the perspective of a designer, which opens the process up to subjectivity and issues such as confirmation bias.
Obtaining implicit feedback (IMP) does alleviate these issues, as no human is consciously involved in either the interpretation or response process. IMP approaches, however, have to rely on a model that translates unconscious behaviour such as physical signals to some form of qualitative feedback for the game. These models are still heavily researched at the time of writing [41, 78, 132, 159], and it is not entirely clear how reliable they are. They therefore probably introduce an unobservable bias if applied exclusively. Additionally, the interpretation of physical signals in the context of games is complicated by the fact that survey participants react less expressively when confronted with a virtual reality [33].
Finally, if the evaluation model does not consider feedback data at all (NONE), the evaluation is entirely reliant on the accuracy of the modelling assumptions made by the designer. Designers might be able to define a fitness function that expresses their design goals, and it seems to be common practice in industry to take statistics such as win rates into account during design. Still, defining these functions undoubtedly requires design experience, and their results are usually only used in conjunction with playtests.
Non-determinism in Games As most games are non-deterministic, the feedback might vary depending on the specific playthrough. Especially if no human players are involved (NONE), this issue is mostly addressed by aggregating quantitative feedback. In surveys (EXP, IMP), questions can instead be formulated in a way that does not (significantly) change with the playthrough.
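The aggregation of quantitative feedback mentioned above can be sketched as follows. This is a minimal illustration, not the procedure used in this thesis: `simulate_playthrough` is a hypothetical stand-in for running one non-deterministic playthrough, and the win probability of 0.6 is an arbitrary assumption. The sketch estimates a win rate from repeated playthroughs and reports the binomial standard error, which shrinks as more playthroughs are aggregated.

```python
import math
import random


def simulate_playthrough(win_probability, rng):
    """Hypothetical stand-in for one non-deterministic playthrough (1 = win, 0 = loss)."""
    return 1 if rng.random() < win_probability else 0


def estimate_win_rate(win_probability, n_playthroughs, seed=0):
    """Aggregate repeated playthroughs into a win rate and its standard error."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    wins = sum(simulate_playthrough(win_probability, rng)
               for _ in range(n_playthroughs))
    rate = wins / n_playthroughs
    # Binomial standard error of the estimated rate.
    standard_error = math.sqrt(rate * (1.0 - rate) / n_playthroughs)
    return rate, standard_error


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        rate, se = estimate_win_rate(0.6, n)
        print(f"n={n:5d}  win rate={rate:.3f}  standard error={se:.3f}")
```

In practice, the number of playthroughs per evaluation is a trade-off between the remaining uncertainty and the computational budget.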
Types of Errors In order to handle the uncertainties introduced in an optimisation problem, it is important to consider what types of errors might occur. Errors resulting from a survey or the interpretation of its results will probably be non-symmetric. The same is true for modelling errors as potentially introduced by approaches in category NONE. In contrast, most non-determinism in games will cause symmetric noise.
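The practical difference between the two error types can be illustrated with a small sketch. The distributions are assumptions chosen purely for illustration: zero-mean Gaussian noise stands in for symmetric noise, and a one-sided exponential error stands in for a non-symmetric modelling or survey error. Averaging many evaluations recovers the true fitness in the symmetric case, but leaves a systematic offset in the non-symmetric case.

```python
import random
import statistics


def noisy_evaluations(true_fitness, n, error_kind, seed=0):
    """Return n evaluations of true_fitness disturbed by an illustrative error model."""
    rng = random.Random(seed)
    if error_kind == "symmetric":
        # Assumption: zero-mean Gaussian noise, symmetric around the true value.
        errors = [rng.gauss(0.0, 1.0) for _ in range(n)]
    elif error_kind == "one_sided":
        # Assumption: exponential error with mean 1, never negative.
        errors = [rng.expovariate(1.0) for _ in range(n)]
    else:
        raise ValueError(f"unknown error kind: {error_kind}")
    return [true_fitness + error for error in errors]


if __name__ == "__main__":
    true_fitness = 5.0
    for kind in ("symmetric", "one_sided"):
        mean = statistics.fmean(noisy_evaluations(true_fitness, 20000, kind))
        print(f"{kind:10s} mean of evaluations = {mean:.3f} (true value {true_fitness})")
```

Repeated evaluation is therefore a remedy only for the symmetric case; a non-symmetric error has to be addressed in the model or in the data collection itself.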
4.1.4.2 Input Dimension
Uncertainties can also result from the decisions made on the input dimension. Depending on what type of data is used for a game evaluation model, for instance, additional issues might occur.
Data Selection As in any data-driven modelling approach, the selection of input data is important. Omitting relevant data prevents the model from being accurate, whereas adding too much data will result in overfitting and uninterpretable results. Thus, especially if the evaluation model is only trained from outcome statistics (OUT) or measures based on the encoding (CODE), many intricacies that affect the gameplay and the resulting feedback might be missed completely and therefore not be modelled. The resulting evaluation model would then not express the intended fitness function.
A related problem also occurs in methods that use playthrough data (PLAY). Here, all relevant data theoretically should be available, but usually raw data cannot be parsed as input. The data is therefore selected and aggregated, thus allowing for potential biases introduced by the conscious decisions and intentions of the designer.
It is furthermore important with all approaches that the input space is sampled adequately. Sparsely sampled regions can result in extrapolation issues for the model trained on the data. Bias in the choice of samples can affect the model as well.
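One simple way to detect sparsely sampled regions is to measure how far a query point lies from the nearest training sample. The following is a minimal sketch under stated assumptions: the function names, the Euclidean distance measure and the threshold `max_distance` are all hypothetical choices for illustration and would need to be adapted to the scaling of the actual input space.

```python
import math


def nearest_sample_distance(query, samples):
    """Euclidean distance from a query point to its closest training sample."""
    return min(math.dist(query, sample) for sample in samples)


def is_extrapolating(query, samples, max_distance):
    """Flag queries that lie further than max_distance from every sample."""
    return nearest_sample_distance(query, samples) > max_distance


if __name__ == "__main__":
    # Hypothetical training samples in a two-dimensional, normalised input space.
    samples = [(0.10, 0.20), (0.15, 0.25), (0.20, 0.10), (0.80, 0.90)]
    for query in [(0.12, 0.22), (0.90, 0.10)]:
        distance = nearest_sample_distance(query, samples)
        flag = is_extrapolating(query, samples, max_distance=0.2)
        print(f"query={query}  nearest distance={distance:.3f}  extrapolating={flag}")
```

Predictions for flagged queries should be treated with caution, since the model has no nearby data to support them.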
Data Generation As the data used in evaluation models for game optimisation is generated from a process that is consciously designed as well, additional issues can occur in methods that use data from playthroughs (OUT, PLAY). One of the main issues in this regard is the AIs required to obtain this data automatically. Despite the continuous efforts in the field of player modelling [56, 94, 127], there is no reliable general approach to developing AIs that behave in a human-like manner. In fact, there is not even an appropriate measure that expresses behavioural differences in players on a strategic level [146]. As a result, in most game optimisation problems, AIs are used to generate input data despite the fact that they might behave entirely differently from human players. The effect this has on the evaluation models is rarely investigated, but it might be more striking the more information is used (playthrough data, PLAY, vs. outcome statistics, OUT).
In order to combat the issues described above, some approaches (e.g. restricted play [64]) choose to analyse specific use cases instead of the whole game. This reduces the complexity of the problem, and the restricted setting might improve an AI’s ability to imitate human behaviour. At the same time, the reduced complexity might not produce data that can be translated in order to evaluate the complete game. Additionally, the need to select use cases to analyse introduces a further source of bias, and important aspects might be missed entirely.
Types of Errors As explained in the previous section, it is important to consider the types of errors that need to be handled. Both data selection and data generation issues will likely result in non-symmetric error distributions, as these issues are modelling problems.
4.1.4.3 Evaluation Model
In addition to potential issues and uncertainties in the data obtained, the choice and implementation of the model to train can introduce problems. These come on top of issues resulting from insufficient data, for example caused by a lack of data in either dimension (see sections 4.1.4.1 and 4.1.4.2 above).
Model Choice Many machine learning models make assumptions about the problem and data they are trained on. The Kriging model used in this thesis, for example, assumes a specific form of correlation between search space and objective space by the choice of kernel (see section 2.3). The assumptions of a specific model in question should therefore be carefully tested before using it as an evaluation model.
This is especially important because we are dealing with black-box problems, where it is often difficult to decide which assumptions are safe to make. Additionally, in game optimisation, fitness landscapes are likely rarely continuous due to the many interacting mechanisms in a game. This is because even small changes to game parameters might flip the balance between different characters or mechanisms, causing entirely different behaviour. This caveat should be considered, as most models do assume some form of continuity of the fitness function.
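A toy example illustrates how such a flip produces a discontinuous fitness landscape. The game below is entirely hypothetical and not taken from this thesis: two characters with equal health exchange attacks in turn, character A strikes first, and the fitness is A's remaining health if A wins, or the negative of B's remaining health if B wins. All values are integers to avoid rounding effects.

```python
def duel_fitness(attack_a, attack_b=100, health=1000):
    """Fitness of a deterministic duel from the perspective of character A.

    Hypothetical rules: A strikes first, then B, until one character's health
    reaches zero. Returns A's remaining health if A wins, otherwise the
    negative of B's remaining health.
    """
    health_a, health_b = health, health
    while True:
        health_b -= attack_a
        if health_b <= 0:
            return health_a
        health_a -= attack_b
        if health_a <= 0:
            return -health_b


if __name__ == "__main__":
    # Changing A's attack value by a single unit flips the winner.
    for attack_a in (98, 99, 100, 101):
        print(f"attack_a={attack_a}  fitness={duel_fitness(attack_a)}")
```

Between attack values of 99 and 100, the fitness jumps from -10 to 100, although its neighbours on either side differ by at most 10. A model that assumes a smooth fitness function would have difficulty representing such a jump.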
Types of Errors The errors discussed above are all modelling errors and most likely non-symmetric.