4.2.1.3 Building the Model

Building the model entails feature selection and algorithm training. These are two distinct steps, but they are intimately connected in a self-consistency loop. The outcome of the training depends on the choice of the input features and, in turn, the input is chosen such that the training score is maximized. Since each algorithm follows a different strategy to map the input data to the output, the optimal choice of input features is algorithm dependent. The self-consistency loop starts by defining the input vector, which involves finding a suitable transformation of the available training data. Once the input vector is defined, an ML algorithm is trained. For training purposes a second data split, into the training and the validation set, is performed. This is needed because the precision of the algorithm has to be determined during the training. The difference between the validation and the test set is that information about the validation set is implicitly built into the ML model, whereas information about the test set is not. The feedback is mediated by human intervention, namely by using the validation score to improve the model. Therefore, the validation set cannot be used to obtain an unbiased measure of the performance, which is why it is so important to keep the test data aside. Since the information about the validation set is already implicitly contained in the model, it makes sense to use the entire train-validation set for the training. This is especially true when the dataset is small to begin with. When the train-validation set is small, the training process may become sensitive to the details of the split. A single random split can be highly biased, since it may, by pure chance, distort the underlying distribution of the target property. Making a number of different splits reduces the probability of accidentally selecting a biased dataset. The k-fold cross-validation is commonly used to avoid this shortcoming (cf. section 1.2.3). In this way a more robust ML model can be obtained and the dataset is used more efficiently.
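The splitting scheme described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation used in this work: the function names, the 20% test fraction, the synthetic data and the trivial mean-value "model" are all assumptions made for the example. The test indices are set aside first, and only the remaining train-validation indices enter the k-fold loop.

```python
import random

def train_test_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle the indices 0..n_samples-1 and set a test portion aside."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(test_fraction * n_samples))
    return indices[n_test:], indices[:n_test]  # (train-validation, test)

def k_fold_splits(indices, k=5):
    """Yield (train, validation) index lists; each index is validated exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

def cross_validate(fit, score, X, y, indices, k=5):
    """Average the validation score of a model over the k folds."""
    scores = []
    for train, validation in k_fold_splits(indices, k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        scores.append(score(model,
                            [X[i] for i in validation],
                            [y[i] for i in validation]))
    return sum(scores) / len(scores)

# Hypothetical stand-ins for a real algorithm and metric:
# the "model" is just the mean of the training targets.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)

def negative_mae(model, X_val, y_val):
    return -sum(abs(v - model) for v in y_val) / len(y_val)

# Synthetic data, for illustration only.
X = [[float(i)] for i in range(50)]
y = [2.0 * x[0] + 1.0 for x in X]

train_val, test = train_test_split(len(X))
cv_score = cross_validate(fit_mean, negative_mae, X, y, train_val, k=5)
print(f"mean validation score over 5 folds: {cv_score:.3f}")
```

The test indices are never passed to `cross_validate`, so whatever is learned from the validation scores cannot leak into the final performance estimate.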

Feature selection is the key step in building an ML model. However, a unique or standard way to perform such a selection does not exist. Consequently, it is difficult to interpret the significance of the input features that the ML model finds important, especially when the dimension of the input vector is large. Since in materials science we often want to gain a deeper understanding of the problem under investigation, the difficulty of interpreting the ML model may seem off-putting. However, this shortcoming may be circumvented if one takes a constructive approach to model building. One can start from a full physical description of the system and identify the key elements and processes that define the quantity of interest. The data that best represent these are then included in the input vector. For each trial input the ML model is trained and the precision, evaluated on the validation set, is measured. The model is then iteratively improved until a satisfactory precision is achieved. For example, all material properties are, in principle, defined by the structure and the composition of the material. This information should then be considered sufficient to describe any problem in materials science. However, it would be naive to expect that an arbitrary property can be described in this way. The core of the problem lies in the discrete nature of the chemical space. Very often, the internal task of an ML algorithm is to find a non-linear transformation that makes the problem linear in some abstract hyperspace. The available data, e.g. the atomic numbers and the positions, are in general insufficient to define the necessary transformations. In general, the smoother the dependency between the input vector and the output, the less data the ML algorithm requires to learn the rule [122]. The input features should be chosen with this in mind.

For example, the magnetic moment arises from the interaction of electronic degrees of freedom, which are not explicitly specified by the material structure. Therefore, one needs to resort to auxiliary variables, which introduce more information about the underlying problem. Ideally, one would like to have linear dependencies between the input and the output. Feature selection is thus a slow, trial-and-error procedure, but it often still outpaces the typical throughput of the ab initio approach. The advantage is that one works with all the materials at once, and the ML model obtained in this way should, ideally, include all the main trends that can be found in the dataset.
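The benefit of choosing features that make the dependency linear can be illustrated with a toy example. The sketch below is purely hypothetical: the data are synthetic, the "raw" and "squared" candidate features are invented for the illustration, and a one-dimensional least-squares line stands in for a real ML algorithm. Each trial feature is fitted on a training subset and scored on a validation subset, mirroring the trial-and-error loop described above.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(params, x, y):
    """Coefficient of determination of a fitted line on the data (x, y)."""
    slope, intercept = params
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Synthetic data (assumption): the target depends quadratically on a raw descriptor.
raw = [-2.0 + 0.1 * i for i in range(41)]
target = [v * v + 0.5 for v in raw]

# Candidate input features: the raw descriptor and a transformed one.
candidates = {
    "raw": lambda v: v,
    "squared": lambda v: v * v,
}

train_idx = range(0, 41, 2)  # even-numbered points are used for training
val_idx = range(1, 41, 2)    # odd-numbered points are used for validation

scores = {}
for name, transform in candidates.items():
    feature = [transform(v) for v in raw]
    params = fit_line([feature[i] for i in train_idx],
                      [target[i] for i in train_idx])
    scores[name] = r_squared(params,
                             [feature[i] for i in val_idx],
                             [target[i] for i in val_idx])

best = max(scores, key=scores.get)
print(scores, "-> best feature:", best)
```

In this constructed case the linear model cannot capture the target from the raw descriptor, while the transformed feature makes the dependency exactly linear. Real descriptors are rarely this clean, so the comparison should be read as a qualitative illustration only.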

For the training one usually considers a whole range of ML algorithms aimed at a specific ML task, e.g. regression. A subset of algorithms that perform best in the training is selected for testing. It is important to note that the quality of an ML algorithm is not given by the score alone. If an ML algorithm has too many “internal degrees of freedom”, the training scores will usually be high but misleading. This is commonly known as over-fitting. All ML algorithms come with a number of free parameters, which can be used to tune their behaviour. These are called the hyper-parameters. The purpose of the ML training step is to find an optimal set of these parameters for a given algorithm. The parameters can be of different nature, but their common feature is that they define the capacity of the ML algorithm, i.e. its ability to represent different functions. This is directly related to the size of the space of admissible functions. If this space is too large, the ML model can represent almost any function. Therefore, the task of training an ML algorithm is that of maximizing the precision while keeping the capacity of the algorithm minimal. This is known as structural risk minimization [124]. A common technique used to check that over-fitting does not occur is to plot the learning curve (see section 1.2.3). By plotting this curve we can directly see whether the learning process is converging and to what accuracy. It also allows us to estimate whether we have enough data and how much more would be needed to obtain an improvement. This is of vital importance when one needs to estimate the “cost” of proceeding with the ML approach. If generating new data is expensive, it may be more appropriate to take a less precise but faster converging algorithm, i.e. one that can be properly trained using a smaller dataset. The final decision depends on the specific requirements (precision, throughput, fidelity, etc.).
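The interplay between capacity, over-fitting and the learning curve can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in this work: the data are synthetic (a noisy sine), and a one-dimensional k-nearest-neighbour regressor is chosen only because its single hyper-parameter k controls the capacity in a transparent way (k = 1 reproduces the training data exactly). All function names are hypothetical.

```python
import math
import random

def knn_predict(X_train, y_train, x, k):
    """Predict the target at x as the mean over the k nearest training points."""
    nearest = sorted(range(len(X_train)), key=lambda i: abs(X_train[i] - x))[:k]
    return sum(y_train[i] for i in nearest) / k

def rmse(X_train, y_train, X_eval, y_eval, k):
    """Root-mean-square error of the k-NN model on the evaluation points."""
    squared = [(knn_predict(X_train, y_train, x, k) - y) ** 2
               for x, y in zip(X_eval, y_eval)]
    return math.sqrt(sum(squared) / len(squared))

def learning_curve(X_train, y_train, X_val, y_val, k, sizes):
    """Return (n, training RMSE, validation RMSE) for growing training sets."""
    curve = []
    for n in sizes:
        Xn, yn = X_train[:n], y_train[:n]
        curve.append((n,
                      rmse(Xn, yn, Xn, yn, k),
                      rmse(Xn, yn, X_val, y_val, k)))
    return curve

# Synthetic data (assumption): a noisy sine on [0, 2*pi].
rng = random.Random(0)
X = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(400)]
y = [math.sin(x) + rng.gauss(0.0, 0.3) for x in X]
X_train, y_train = X[:200], y[:200]
X_val, y_val = X[200:], y[200:]

sizes = [20, 50, 100, 200]
curve_k1 = learning_curve(X_train, y_train, X_val, y_val, k=1, sizes=sizes)
curve_k15 = learning_curve(X_train, y_train, X_val, y_val, k=15, sizes=sizes)

for label, curve in (("k=1", curve_k1), ("k=15", curve_k15)):
    for n, train_err, val_err in curve:
        print(f"{label:5s} n={n:3d}  train RMSE={train_err:.3f}  val RMSE={val_err:.3f}")
```

With k = 1 the training error vanishes for every training-set size, which is the "high but misleading" training score discussed above; only the validation error reveals the actual precision. Comparing how the validation error of each setting changes with n gives a rough indication of whether more data would still help.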