
6.2 Future Work

6.2.1 Game Optimisation Problems

6.2.1.1 Validation of Evaluation Functions

The GBEA benchmark presented in this thesis and run for the GECCO 18 and 19 workshops¹ consciously utilises previously published game optimisation problems and the evaluation functions proposed in the corresponding papers. This is done in order to reflect the state-of-the-art in automatic game (content) generation and tuning.

This has the added benefit of not requiring a thorough validation of these functions in terms of how meaningful they are for human gameplay. However, as made apparent by our taxonomy (see section 4.1) and survey (see appendix A), evaluation functions in the literature are of a very limited variety. Most evaluation approaches in the survey are based on model assumptions that are not validated for human players (category feedback NONE). Moreover, we have shown in our case study in appendix B that these assumptions can be very misleading for any optimisation algorithm or prediction model utilising the resulting evaluation functions.

For this reason, I would recommend that more effort be made towards validating evaluation functions in the context of (semi-)automatic game design and tuning. Unfortunately, conducting comprehensive experiments with human players is not always possible, especially in cases where the evaluation approach is only a minor component of the project. In our planned task force on game evaluation, we are attempting to tackle these problems and facilitate the validation of evaluation functions.

One suggestion is to host websites where surveys based on popular research games (such as Mario and the GVGAI framework) can be set up for online participation. Reducing the effort to set up these surveys might lead to more researchers collecting data from human players. Making the survey available online through a browser should also increase the number of participants, and thus the significance of the results. Ideally, this set-up would also include several game-playing agents, as well as extensive logging and visualisation capabilities. A description of what such a system could entail, together with a discussion of its potential, can be found in a recent vision paper [47].

An even easier approach is to make researchers aware of previously published evaluation functions, especially if they have been validated. In the context of the task force, a website that provides this information in an easily accessible way is planned. Further in the future, it would be worthwhile to investigate whether meaningful and validated evaluation functions exhibit specific patterns that can be generalised to multiple (similar) games. Such an investigation could be based on the data from both the website and the online surveys described above. From the GBEA results, we were already able to observe some characteristics that were consistent for certain types of functions, such as steps in fitness functions based on the encoding in MarioGAN. With more results, these observations could be extended and verified.

Another option is to depart from automatic evaluation and instead obtain the evaluations of game content directly from playtests. Of course, this is only practicable if only a small number of solutions needs to be evaluated. One potential approach could be surrogate-based algorithms that reduce the number of exact evaluations required to a minimum. A promising option here is to use surrogates based on a variety of information, possibly including multiple diverse fully automatic evaluation functions. Such an experimental set-up was proposed in section 5.1. Even though the corresponding experiments were conducted, they are unfortunately only interpretable after more baseline experiments have been completed for the GBEA. We will therefore definitely come back to these results in the future.
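To illustrate how a surrogate can limit the number of exact evaluations, the following is a minimal sketch under stated assumptions, not the set-up from section 5.1. The function `playtest_score` is a hypothetical stand-in for an expensive exact evaluation such as a playtest, and the nearest-neighbour surrogate was chosen only for brevity. In each generation, all candidates are ranked by the cheap surrogate and only the most promising ones receive an exact evaluation.

```python
import random

def playtest_score(x):
    """Hypothetical stand-in for an expensive exact evaluation (e.g. a playtest)."""
    return sum((xi - 0.3) ** 2 for xi in x)  # lower is better

def surrogate_predict(archive, x, k=3):
    """Predict fitness as the mean of the k nearest exactly evaluated points."""
    by_distance = sorted(
        archive,
        key=lambda entry: sum((a - b) ** 2 for a, b in zip(entry[0], x)),
    )
    nearest = by_distance[:k]
    return sum(score for _, score in nearest) / len(nearest)

def optimise(dim=5, generations=20, offspring=10, exact_per_gen=2, seed=0):
    rng = random.Random(seed)
    # Initial design: a few exact evaluations to seed the surrogate.
    archive = []
    for _ in range(5):
        x = [rng.random() for _ in range(dim)]
        archive.append((x, playtest_score(x)))
    exact_evaluations = len(archive)

    for _ in range(generations):
        # Mutate the best exactly evaluated solution found so far.
        parent = min(archive, key=lambda entry: entry[1])[0]
        candidates = [
            [min(1.0, max(0.0, xi + rng.gauss(0, 0.1))) for xi in parent]
            for _ in range(offspring)
        ]
        # Rank all candidates with the cheap surrogate ...
        candidates.sort(key=lambda c: surrogate_predict(archive, c))
        # ... and spend exact evaluations only on the most promising ones.
        for c in candidates[:exact_per_gen]:
            archive.append((c, playtest_score(c)))
            exact_evaluations += 1

    best_x, best_score = min(archive, key=lambda entry: entry[1])
    return best_x, best_score, exact_evaluations

best_x, best_score, n_exact = optimise()
print(f"best score {best_score:.4f} after {n_exact} exact evaluations")
```

In this sketch, the ratio of `exact_per_gen` to `offspring` controls how much of the evaluation budget is spent on exact evaluations; with playtests, `exact_per_gen` would be set by the number of sessions that can realistically be run.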

Further tools for similar scenarios, i.e. semi-automatic optimisation of game problems, have also been proposed in [104] and tested using a real-time strategy (RTS) game. This suggests that considering several types of data is a promising solution, even for complex games with large search spaces.

6.2.1.2 Analysis of Fitness Landscapes

Independent of the validity of the evaluation function, it should also be considered what type of fitness landscape results from its usage. Information on the fitness landscape is crucial for the choice of suitable optimisation algorithms as well as for putting their performance into context. The need for further analysis of existing evaluation functions has also been recognised in other publications, see for example recent surveys and vision papers [19, 130].

This is the reason we are investigating the function suites in GBEA in more detail. Part of this analysis are the line walks as described in section 5.3.1. In addition, we also plan to do further analysis using techniques from the field of Exploratory Landscape Analysis (ELA). Corresponding approaches centre around computable features intended to characterise functions in terms of a set of abstract concepts, such as their modularity and the existence of plateaus. The features have been commonly used as inputs for models that choose which one of a set of algorithms to run on a given problem [72]. We want to use them instead as a way to characterise the resulting fitness landscapes. This could be done by training models to recognise specific characteristics based on information from the ELA features.
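To make the idea of a computable landscape feature concrete, the following minimal sketch evaluates a function along a straight line between two points and reports the fraction of consecutive steps without a fitness change as a crude plateau indicator. This is an illustration only: the walk may differ from the procedure in section 5.3.1, `plateau_ratio` is not an established ELA feature, and `stepped_fitness` is a hypothetical toy function.

```python
import math

def stepped_fitness(x):
    """Hypothetical toy function with plateaus: fitness only changes in steps."""
    return sum(math.floor(4 * xi) for xi in x)

def line_walk(f, start, end, steps=100):
    """Evaluate f at evenly spaced points on the segment from start to end."""
    values = []
    for i in range(steps + 1):
        t = i / steps
        point = [a + t * (b - a) for a, b in zip(start, end)]
        values.append(f(point))
    return values

def plateau_ratio(values, tolerance=1e-12):
    """Fraction of consecutive walk steps whose fitness does not change."""
    flat = sum(1 for u, v in zip(values, values[1:]) if abs(u - v) <= tolerance)
    return flat / (len(values) - 1)

walk = line_walk(stepped_fitness, [0.0, 0.0], [0.99, 0.99])
ratio = plateau_ratio(walk)
print(f"plateau ratio along the walk: {ratio:.2f}")
```

A ratio close to 1 indicates that most steps along the walk are neutral, whereas a strictly increasing function such as a plain sum of the coordinates yields a ratio of 0.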

The analysis is also going to include visualisations of the fitness functions, as ELA features do not necessarily provide interpretable information. Visualisations allow a more holistic overview for the human observer. However, corresponding plots would likely be done on smaller-scale versions of the GBEA problems, due to the inherent dimensional limitations of visualisations.

Additionally, since the fitness functions seem to be flatter than expected due to the lack of baseline performance results (see section 5.3.3), properties that are usually assumed of a fitness landscape should be investigated further. One such property is locality, i.e. sensitivity to small modifications in search space. This is especially important with regard to the correlations between fitness and available mutation operators. This relationship is of course influenced by the representation of the search space, as game optimisation problems invariably include some form of phenotype-genotype mapping. For example, the steps in the encoding-based fitness functions were clearly a result of the one-hot encoding used in MarioGAN (see section 5.3.1). Based on these insights, common representation methods for levels and game parameters should also be analysed in terms of their influence on the properties of the resulting game optimisation problems.
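One possible way to probe locality empirically is to apply a small mutation to random genotypes and record how much the fitness changes. The sketch below is illustrative only and contrasts a direct real-valued mapping with a one-hot-like mapping in which each block of genes selects a tile type by its largest entry. All names, the toy fitness function and the block size are assumptions and do not reproduce the MarioGAN encoding.

```python
import random
import statistics

def decode_direct(genotype):
    """Direct mapping: the phenotype is the genotype itself."""
    return list(genotype)

def decode_argmax(genotype, block=4):
    """One-hot-like mapping: each block of genes selects one of `block` tile types."""
    tiles = []
    for i in range(0, len(genotype), block):
        chunk = genotype[i:i + block]
        tiles.append(chunk.index(max(chunk)))
    return tiles

def fitness(phenotype):
    """Hypothetical evaluation function on the phenotype."""
    return sum(phenotype)

def fitness_changes(decode, dim=16, sigma=0.01, samples=500, seed=1):
    """Absolute fitness change caused by one small Gaussian mutation, per sample."""
    rng = random.Random(seed)
    changes = []
    for _ in range(samples):
        parent = [rng.random() for _ in range(dim)]
        child = [g + rng.gauss(0, sigma) for g in parent]
        changes.append(abs(fitness(decode(child)) - fitness(decode(parent))))
    return changes

for name, decode in [("direct", decode_direct), ("argmax", decode_argmax)]:
    changes = fitness_changes(decode)
    neutral = sum(1 for c in changes if c == 0) / len(changes)
    print(f"{name}: mean change {statistics.mean(changes):.4f}, "
          f"neutral mutations {neutral:.0%}")
```

Under the one-hot-like mapping, most small mutations leave the phenotype untouched and the remaining ones change the fitness by whole steps, which is the kind of stepped behaviour that a representation can introduce independently of the evaluation function.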

6.2.1.3 Analysis of Uncertainty

The analysis of the uncertainties identified in the taxonomy described in section 4.1 has largely been quantitative in nature, see chapter 5. The only large exception is the case study in appendix B that verifies the existence of a specific type of bias. However, common sources of uncertainty should ideally be investigated qualitatively and in more detail. To discuss potential future work in this regard, we refer to the several sources of uncertainty identified in section 4.1.4.

For the feedback dimension, the main errors are based on survey design and the interpretation of the feedback. Both of these problems would be mitigated if the evaluation functions could be validated as discussed in section 6.2.1.1. If enough data is available, even the non-determinism in games would not be an issue in terms of obtaining a meaningful signal for the evaluation.

Lacking a thorough validation of a given evaluation function, other approaches to analyse the uncertainty could be taken. If no feedback from human players is available, survey design becomes irrelevant. The issues caused by non-determinism in games are also alleviated, as evaluations that do not require playtests can usually be repeated often enough to obtain statistically significant results. What remains is the interpretability of the feedback, which is significantly harder without human feedback. Explainable AI as envisioned in [146] could be the key to translating AI behaviour into interpretable feedback. However, explainable and interpretable algorithms are still an active field of research, with sometimes counter-intuitive results.²

For the input dimension, data selection and data generation have been identified as the two main issues. The latter has been covered by the case study in appendix B. However, more of these studies should be conducted in different settings in order to identify how prevalent and noticeable data generation bias really is. Uncertainty from data selection can be approached by either avoiding a selection completely (by using enough computational resources and employing deep learning practices), or by using suitable dimensionality reduction techniques developed in machine learning, such as feature selection or principal component analysis.

² Such a result comes from a recent study that finds that increasing the transparency of models reduced

The final source of uncertainty stemming from the choice of model can be investigated using model validation approaches, e.g. cross-validation as described in the context of SAPEO (see section 4.2.2). Because the experimental framework (see section 4.4) includes automatic logging of the prediction error, as well as the predicted uncertainty, model fit can be conveniently investigated. These features already produced interesting results in the experiments. An example is the conclusion from section 5.2.1.3 to increase the strictness of model validation in SAPEO.
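As an illustration of the kind of model validation referred to here, the following minimal sketch runs k-fold cross-validation for a surrogate model and reports the prediction error on each held-out fold. The k-nearest-neighbour surrogate, the toy data and the acceptance threshold are all assumptions made for this example; they do not reproduce the model or the validation rule used in SAPEO.

```python
import random

def knn_predict(train, x, k=3):
    """Hypothetical surrogate: mean fitness of the k nearest training points."""
    nearest = sorted(
        train,
        key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)),
    )[:k]
    return sum(y for _, y in nearest) / len(nearest)

def k_fold_rmse(data, folds=5):
    """Root mean squared prediction error on each held-out fold."""
    errors = []
    for f in range(folds):
        # Every folds-th point (offset f) is held out; the rest is used for training.
        test = data[f::folds]
        train = [p for i, p in enumerate(data) if i % folds != f]
        squared = [(knn_predict(train, x) - y) ** 2 for x, y in test]
        errors.append((sum(squared) / len(squared)) ** 0.5)
    return errors

# Toy data: 100 random points with a simple quadratic fitness.
rng = random.Random(42)
points = [[rng.random(), rng.random()] for _ in range(100)]
data = [(x, (x[0] - 0.5) ** 2 + (x[1] - 0.5) ** 2) for x in points]

fold_errors = k_fold_rmse(data)
mean_error = sum(fold_errors) / len(fold_errors)
# A validation rule could reject the model if the error exceeds a chosen threshold;
# the value below is an arbitrary illustration.
model_accepted = mean_error < 0.05
print(f"per-fold RMSE: {[round(e, 4) for e in fold_errors]}, accepted: {model_accepted}")
```

Making the validation stricter corresponds to lowering the threshold in this sketch, at the cost of rejecting the surrogate more often and falling back to exact evaluations.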

In document Uncertainty handling in surrogate assisted optimisation of games (Page 138-141)