
Chapter 5 Evaluating Descriptors for Ion Mobility in Solids via Machine Learning

5.3.2 Implementation of Machine Learning

Feature filtering. Figure 5.2 shows the Pearson correlation analysis for the vacancy mechanism. No individual feature shows a high linear correlation with the elementary barrier; the largest coefficient is 0.34, for the path distance (CCD). This confirms that a single feature cannot represent ion migration in solids, and it motivates ML analysis, which can capture complicated correlations between many features and the elementary barriers. The same conclusion holds for the interstitial dumbbell mechanism, where the largest coefficient is 0.14, for the phonon frequency (𝜔LEO).

Seven features were removed based on this analysis (Table 5.1). The octahedron volume (Vocta) is highly correlated with the cell volume (Va), with 𝜌 = 0.98, indicating that one of these volumetric features is redundant. Vocta was therefore removed, since Va is the more general of the two. The total dielectric constant (𝜀) is strongly correlated with the ionic dielectric constant (𝜀i), with 𝜌 = 0.99, because the ionic contribution to the total dielectric constant is usually much larger than the electronic contribution (𝜀e). Accordingly, 𝜀 was removed in favor of 𝜀i: 𝜀 includes 𝜀e, whereas 𝜀i has its own meaning independent of 𝜀. The electronegativity of the cation (ENC) is highly correlated with the ionic radius (rC), atomic mass (mC), and polarizability (𝛼C) of the cation, with 𝜌 ≤ -0.98, because all of these quantities are determined by the type of species. Thus rC, mC, and 𝛼C were removed in favor of ENC. Likewise, the ionic radius of the octahedral anion (rO) is highly correlated with the electronegativity (ENO) and polarizability (𝛼O) of the octahedral anion, with |𝜌| ≥ 0.94, so ENO and 𝛼O were removed in favor of rO. This filtering implies that if a retained feature proves influential for the target property, the features removed in its favor are important as well. The correlation analysis for the interstitial dumbbell mechanism gives similar results, so we used the same features for the dumbbell mechanism.

Figure 5.2 Pearson correlation analysis among descriptors and vacancy elementary barriers. A value of +1/-1 indicates a perfect positive/negative linear correlation, whereas a coefficient close to 0 indicates no linear correlation.
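The filtering step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in this work: the threshold of 0.94 is simply the smallest |𝜌| quoted above, the helper names `pearson` and `filter_redundant` are hypothetical, and the toy values are invented for the example. The order of `keep_priority` encodes which feature of a correlated pair is retained (for example, Va before Vocta).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def filter_redundant(features, keep_priority, threshold=0.94):
    """Drop every feature whose |rho| with an already kept feature
    reaches the threshold (assumed value, see the text above).

    features      : dict mapping feature name -> list of values
    keep_priority : names ordered from most to least preferred
    Returns (kept names, {dropped name: kept name that replaces it}).
    """
    kept, dropped = [], {}
    for name in keep_priority:
        partner = next(
            (k for k in kept
             if abs(pearson(features[name], features[k])) >= threshold),
            None,
        )
        if partner is None:
            kept.append(name)
        else:
            dropped[name] = partner
    return kept, dropped


# Invented toy values, for illustration only (not data from this work).
toy = {
    "V_a": [10.0, 12.0, 15.0, 11.0, 18.0, 14.0],
    "V_octa": [5.1, 6.0, 7.4, 5.6, 9.1, 6.9],
    "EN_C": [0.98, 0.93, 0.82, 0.79, 0.98, 0.82],
    "r_C": [0.76, 1.02, 1.38, 1.52, 0.76, 1.38],
    "omega": [3.0, 1.0, 4.0, 1.5, 2.0, 5.0],
}
kept, dropped = filter_redundant(
    toy, ["V_a", "V_octa", "EN_C", "r_C", "omega"]
)
print("kept:", kept)
print("dropped:", dropped)
```

The returned `dropped` mapping records which retained feature stands in for each removed one, which is the bookkeeping needed to carry an importance result back to the removed features.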

Model selection. Figure 5.3 compares the performance of ML models built with different algorithms, each optimized with the 38 features that remain after filtering. The ‘adaboost + ERTR’ algorithm gives the best predictability for vacancy barriers, with a root mean squared error (RMSE) of 72 meV. RFR performs best for dumbbell barriers (58 meV RMSE), but it predicts vacancy barriers poorly (113 meV RMSE, more than 1.5 times that of ‘adaboost + ERTR’). Moreover, all the tree-based algorithms except DTR show similar predictive power for dumbbell barriers; the largest difference between the RMSEs of these five models is less than 1.6 meV. Therefore, we selected the ‘adaboost + ERTR’ algorithm to evaluate feature subsets for both the vacancy and dumbbell mechanisms.
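A comparison of this kind could be set up with scikit-learn roughly as follows. This is a minimal sketch on synthetic data, not the models or hyperparameters of this work. In particular, we assume here that ‘adaboost + ERTR’ corresponds to `AdaBoostRegressor` wrapping a single extremely randomized tree (`ExtraTreeRegressor`); only three of the algorithms are shown, and the 80/20 split, the number of estimators, and the random seeds are arbitrary choices.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, ExtraTreeRegressor

# Synthetic stand-in for 38 filtered features and barriers in meV
# (invented for illustration; not the DFT data set of this work).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 38))
y = (300 + 80 * X[:, 0] - 50 * X[:, 1] * X[:, 2]
     + rng.normal(scale=20, size=300))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "DTR": DecisionTreeRegressor(random_state=0),
    "RFR": RandomForestRegressor(n_estimators=100, random_state=0),
    # Assumed mapping of 'adaboost + ERTR'; the base estimator is passed
    # positionally so the call works across scikit-learn versions.
    "adaboost + ERTR": AdaBoostRegressor(
        ExtraTreeRegressor(random_state=0),
        n_estimators=100,
        random_state=0,
    ),
}

# Fit every model on the training set and score it on the test set.
rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse[name] = float(np.sqrt(mean_squared_error(y_test, pred)))

best = min(rmse, key=rmse.get)
for name, value in rmse.items():
    print(f"{name:>16s}: {value:6.1f} meV")
print("lowest test RMSE:", best)
```

As in Figure 5.3, the RMSE is evaluated on the held-out test set, so the ranking reflects predictive power rather than the quality of the fit to the training data.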

Figure 5.3 Comparison between ML models optimized with 38 features after filtering: (a) vacancy and (b) interstitial dumbbell mechanisms. The test set is used to obtain RMSEs. A red bar indicates the best model.

Figure 5.4 presents the training and test results for vacancy and dumbbell barriers with the ‘adaboost + ERTR’ algorithm. Figures 5.4(a) and 5.4(b) show that the ML models reproduce the DFT-calculated barriers of the training set perfectly for both vacancy and dumbbell, indicating that the learning process was performed appropriately. Most of the data points in Figure 5.4(c) lie close to the guideline, implying good predictability of the ML model for vacancy (72 meV RMSE). A large portion of the error comes from 5 data points with barriers larger than 550 meV; these 5 samples (among 64 test samples) account for 64% of the sum of squared errors (SSE). This is because the training data contain relatively few very large barriers over 550 meV [4.4% of the training set, Figure 5.4(a)]. The ML model for dumbbell also shows reasonable predictability [Figure 5.4(d)]. As with vacancy, a large portion of the RMSE (58 meV) comes from 12 data points with barriers larger than 150 meV; these 12 samples (among 56 test samples) account for 79% of the SSE. Only 10.0% of the training data have barriers over 150 meV [Figure 5.4(b)].
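To make the weight of the large-barrier samples concrete, the quoted numbers can be turned into separate RMSEs for the large-barrier samples and for the remaining test samples, using SSE = N × RMSE². The sketch below uses only the rounded values quoted above, so the results are approximate; the function name `split_rmse` is hypothetical.

```python
from math import sqrt


def split_rmse(rmse, n_total, n_tail, tail_share):
    """Split an overall RMSE into the RMSE of a subset of samples
    (the 'tail') and the RMSE of the remaining samples.

    rmse       : overall RMSE over n_total samples
    n_tail     : number of samples in the tail subset
    tail_share : fraction of the total SSE carried by the tail
    """
    sse = n_total * rmse ** 2  # total sum of squared errors
    tail_rmse = sqrt(tail_share * sse / n_tail)
    rest_rmse = sqrt((1 - tail_share) * sse / (n_total - n_tail))
    return tail_rmse, rest_rmse


# Vacancy: 72 meV RMSE, 64 test samples, 5 samples carry 64% of the SSE.
vac_tail, vac_rest = split_rmse(72, 64, 5, 0.64)
# Dumbbell: 58 meV RMSE, 56 test samples, 12 samples carry 79% of the SSE.
dum_tail, dum_rest = split_rmse(58, 56, 12, 0.79)

print(f"vacancy : tail {vac_tail:.0f} meV, rest {vac_rest:.0f} meV")
print(f"dumbbell: tail {dum_tail:.0f} meV, rest {dum_rest:.0f} meV")
```

With these inputs, the 5 large-barrier vacancy samples have an RMSE of roughly 206 meV against roughly 45 meV for the other 59 samples, and the 12 large-barrier dumbbell samples have roughly 111 meV against roughly 30 meV for the other 44.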

Figure 5.4 Training results of the ‘adaboost + ERTR’ model with 38 features after filtering using the training data set of (a) vacancy and (b) dumbbell mechanisms. The performance of the model is presented by predicting the test data set of (c) vacancy and (d) dumbbell mechanisms. The black line is a guideline for the perfect prediction.

Feature subset evaluation. Figure 5.5 presents the effect of the number of features on the predictability of the ML model. For the vacancy barriers, the RMSE decreases and converges as the number of features increases from 1 to 8. However, the RMSE with 33 features is higher than that with 8 features, and the error increases further with more than 33 features, giving a U-shape. We believe that the RMSE would not decrease meaningfully if the number of features were increased beyond 8. The same trend is observed for dumbbell barriers, where the RMSE fluctuates once the number of features exceeds 5. Although not severe in our results, this U-shape is a well-known overfitting trend.243 It implies that the predictability of a model is maximized at an optimal number of features. Therefore, the 8-feature subset for vacancy and the 5-feature subset for dumbbell are the optimal choices for both predictability and computational efficiency. (The features in these subsets are discussed in the next section.)
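A sweep over the number of features can be sketched as below. How the subsets are built is not specified in this section, so the sketch makes its own assumption: features are ranked once by the impurity-based importances of a model trained on all features, and the top k are then refitted and scored for several k. The data are synthetic, the model settings are arbitrary, and `make_model` is a hypothetical helper.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import ExtraTreeRegressor


def make_model():
    """Assumed stand-in for the 'adaboost + ERTR' model."""
    return AdaBoostRegressor(
        ExtraTreeRegressor(random_state=0), n_estimators=50, random_state=0
    )


# Synthetic data: only the first three of 38 features carry signal
# (invented for illustration; not the data set of this work).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 38))
y = (300 + 80 * X[:, 0] + 60 * X[:, 1] - 40 * X[:, 2]
     + rng.normal(scale=20, size=300))
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Assumption: rank features by importance from a fit on all features.
importances = make_model().fit(X_tr, y_tr).feature_importances_
ranking = np.argsort(importances)[::-1]

# Refit with the k highest-ranked features and record the held-out RMSE.
rmse_by_k = {}
for k in (1, 2, 3, 5, 8, 16, 38):
    cols = ranking[:k]
    model = make_model().fit(X_tr[:, cols], y_tr)
    pred = model.predict(X_te[:, cols])
    rmse_by_k[k] = float(np.sqrt(mean_squared_error(y_te, pred)))

best_k = min(rmse_by_k, key=rmse_by_k.get)
for k, value in rmse_by_k.items():
    print(f"{k:2d} features: {value:6.1f} meV")
print("lowest held-out RMSE at k =", best_k)
```

Plotting `rmse_by_k` against k gives a curve of the kind shown in Figure 5.5; once the informative features are included, adding further features mainly adds noise for the trees to split on.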

Compared with Figure 5.4(c), which uses the full set of 38 features (72 meV RMSE), Figure 5.6(a) shows that with the optimal feature subset for vacancy the test samples lie closer to the guideline, with a lower RMSE of 58 meV. The same holds for dumbbell: with the optimal feature subset, the RMSE decreases from 58 meV [Figure 5.4(d)] to 45 meV [Figure 5.6(b)]. This result verifies that the optimal feature subset predicts and describes ion migration through elementary paths better than subsets with a larger or smaller number of features.
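For scale, the relative RMSE reductions implied by these values are roughly 19% for vacancy and 22% for dumbbell. They are computed from the rounded RMSEs quoted above, so the last digit is approximate:

```latex
\[
\frac{72 - 58}{72} \approx 0.19
\quad \text{(vacancy)},
\qquad
\frac{58 - 45}{58} \approx 0.22
\quad \text{(dumbbell)}
\]
```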