4.2.1 General Considerations
4.2.1.2 Data Selection
The first step, the problem definition, determines what kind of data is needed. The data must then be obtained and put into a form that is useful for building the ML model. This step, although conceptually straightforward, is usually time-consuming and can take even longer than building the model itself. The difficulty is that the data has to be gathered from various sources, not all of which are easy to access or to process.
Data collection starts by identifying the sources that contain useful data and deciding on the best strategy to extract it. Manual data collection is slow and time-consuming. Automated computational techniques for this task are being actively developed, and a new field, commonly known as “data scraping” [201, 202], is starting to emerge. Once extracted, the data is likely to be in an inhomogeneous format, so it needs to be curated and brought into a standardized form. If all the data is already available in a database, only the standardization step may be needed.
For example, consider the task of collecting experimental data. The data will be scattered across different journals, in arbitrarily formatted tables, graphs, etc. The extracted values will have been obtained with different experimental techniques and expressed in various measurement units. Data curation ensures that one can interpret what each data point means and where it comes from. Curation already involves some standardization, but only to the extent needed to make the data interpretable. Standardization proper requires a transformation that allows different data points to be compared directly. In general, such a transformation may not exist. Data that cannot be fully standardized may still be used for ML, but the input to the algorithm will then be noisy and the precision of the final model will be lower. In practice one needs to find a balance between including as much data as possible and keeping the data points directly comparable.
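The distinction between curation and standardization can be illustrated with a minimal sketch. Everything in it is assumed for illustration: the `Record` class, the `TO_KELVIN` table, the compound labels and the numerical values are invented placeholders, not data from the sources discussed here. Each record keeps its provenance (curation), and the value is converted to a common unit only when a conversion is known (standardization).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    compound: str   # placeholder label, not a real material
    value: float    # value as reported in the source
    unit: str       # unit as reported in the source
    source: str     # provenance, kept by the curation step


# Known conversions to kelvin (illustrative, deliberately incomplete).
TO_KELVIN = {
    "K": lambda v: v,
    "C": lambda v: v + 273.15,
}


def standardize(record: Record) -> Optional[float]:
    """Return the value in kelvin, or None if the unit is not recognized."""
    convert = TO_KELVIN.get(record.unit)
    return None if convert is None else convert(record.value)


# Invented records mimicking values collected from heterogeneous sources.
records = [
    Record("A", 300.0, "K", "journal table"),
    Record("B", 26.85, "C", "digitized graph"),
    Record("C", 1.2, "arb. units", "digitized graph"),
]

standardized = [(r.compound, standardize(r)) for r in records]
```

The third record remains interpretable, since its meaning and origin are known, but it cannot be compared directly with the other two. Whether such points are dropped or kept as noisy input is the trade-off described above.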
Data preparation is closely tied to the model building procedure. As indicated in figure 4.2, the data from the database is passed to the next step in a structured, tabular form. The definition of the problem, shown in figure 4.2-a, sets the target property that the ML model has to evaluate. The list of material properties in the database that can potentially be correlated with the target property is usually not large. It can be chosen a priori, using our knowledge of the underlying physics, or through a separate data exploration step. The latter may involve unsupervised learning methods, especially if our understanding of the problem is insufficient to make the decision. In the model building step this list is transformed and narrowed down to a useful set of features. The optimal transformation is not known in advance and depends implicitly on the ML model. If a satisfactory model cannot be obtained from the existing data, one may need to return to the data selection step. Data selection can therefore also be viewed as part of the model building procedure.
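One very simple form of data exploration is to rank the candidate properties by the strength of their linear correlation with the target. The sketch below assumes NumPy and uses a synthetic table; the column names `feature_a`, `feature_b`, `feature_c` and the helper `rank_by_correlation` are hypothetical. It is not the procedure used in this work, only an illustration of how a list of candidates might be screened.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic table: three candidate properties (hypothetical columns).
feature_a = rng.normal(size=n)
feature_b = rng.normal(size=n)
feature_c = rng.normal(size=n)  # unrelated to the target by construction

# Synthetic target that depends on two of the three columns.
target = 2.0 * feature_a - 1.0 * feature_b + 0.1 * rng.normal(size=n)

table = {
    "feature_a": feature_a,
    "feature_b": feature_b,
    "feature_c": feature_c,
}


def rank_by_correlation(table, target):
    """Sort columns by the absolute Pearson correlation with the target."""
    scores = {
        name: abs(np.corrcoef(column, target)[0, 1])
        for name, column in table.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_by_correlation(table, target)
for name, score in ranking:
    print(f"{name}: |r| = {score:.2f}")
```

A linear correlation coefficient misses nonlinear dependencies, so a low score does not by itself justify discarding a property. This is one reason the final feature set is fixed only in the model building step.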
The dataset is then split into two parts, the training set and the test set, indicated by the green and the red table in figure 4.2-b. The test set is crucial for establishing the precision of the model in an objective way. The training set is typically the larger of the two, containing more than 50 % of the data. Ideally, the test set should be as large as possible. This is not difficult to achieve when a large amount of data is available. In practice, however, datasets are often small, and the size of the test set is then a compromise between being able to train a model and being able to ascertain its precision.
The subtle point of the data splitting step is that it needs to preserve the underlying data distribution. The purpose of the split is to ensure an objective test of the model. This requires that the training data originates from the same “distribution”, i.e. from the same generating process, as the data in the test set. In other words, the test data has to be representative. The split is therefore performed with respect to the target property, for instance the value of the magnetic moment. This step is conceptually simple, but it is best performed with a standard method provided by the ML software package, since such methods have been thoroughly tested. In this way one avoids coding errors that can lead to an improperly split dataset. A biased training set of this kind would likely yield an ML model that performs poorly on the test data. Because such errors are unpredictable, they are hard to detect and to correct, so it is best to avoid them from the start. After the split, the test dataset is stored in a separate file and used only after the ML model has been built.
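A split performed with respect to a continuous target can be sketched as follows. The example assumes scikit-learn and NumPy, which the text does not name, and uses synthetic data in place of a real database. The target is first binned into quartiles, and the bin labels are passed to `train_test_split` through its `stratify` argument, so that the training and test sets cover the range of target values in the same proportions. The number of bins and the 80/20 ratio are arbitrary choices made for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the tabular data: 200 entries, 3 features.
X = rng.normal(size=(200, 3))
# Synthetic, skewed target (for instance a magnetic moment).
y = rng.gamma(shape=2.0, size=200)

# Bin the continuous target into quartiles; labels run from 0 to 3.
edges = np.quantile(y, [0.25, 0.5, 0.75])
bins = np.digitize(y, edges)

# Stratified split: each quartile contributes the same fraction
# of its entries to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=bins, random_state=0
)

# Number of test entries that fall into each quartile of the full target.
test_counts = np.bincount(np.digitize(y_test, edges), minlength=4)
print(test_counts)
```

Fixing `random_state` makes the split reproducible, so the same test set can be written to a separate file once and reused unchanged when the model is finally evaluated.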