Supplementary Information

Finite Temperature Structures of Supported Subnanometer Catalysts Inferred via Statistical Learning and Genetic Algorithm-Based Optimization

Yifan Wang¹,², Ya-Qiong Su³,⁴, Emiel J.M. Hensen³,*, and Dionisios G. Vlachos¹,²,*

¹ Department of Chemical and Biomolecular Engineering, 150 Academy St., University of Delaware, Newark, Delaware 19716, United States

² Catalysis Center for Energy Innovation, RAPID Manufacturing Institute, and Delaware Energy Institute (DEI), 221 Academy St., University of Delaware, Newark, Delaware 19716, United States

³ Laboratory of Inorganic Materials and Catalysis, Department of Chemical Engineering and Chemistry, Eindhoven University of Technology, P.O. Box 513, 5600 MB Eindhoven, The Netherlands

⁴ School of Chemistry, Xi'an Key Laboratory of Sustainable Energy Materials Chemistry, MOE Key Laboratory for Nonequilibrium Synthesis and Modulation of Condensed Matter, State Key Laboratory of Electrical Insulation and Power Equipment, Xi'an Jiaotong University, Xi'an 710049, China

Figure S1. DFT and configuration dataset size in each active learning iteration. Structure size n (the number of Pd atoms) vs. the number of DFT structures at each iteration of the active learning process. The size of each circle indicates the number of configurations, i.e., the isomorphic graphs generated by rotational and translational operations within the lattice.

Figure S2. Pd lattice representation and unit cell. (a) FCC lattice showing A, B, and C sites. (b) 2D projection of a unit cell with the neighboring A sites included. (c) A unit cell for the 3D lattice with nearest-neighbor connections and the corresponding site types and layer numbers labeled. Type A sites on the second layer (A2) and type B and C sites in the base layer (B1, C1) are not in the cluster expansion and are therefore not shown. (Color code: green, type A sites; blue, type B sites; orange, type C sites; light green, type A sites in neighboring cells.) Pd atoms prefer to stay at a minimal distance from other Pd atoms, which can be approximated by the nearest-neighbor (NN) distance in the Pd bulk structure. In general, Pd atoms in the relaxed DFT structures are mapped to the closest fixed-position sites in the supercell while being placed no closer than this minimal distance. Each site is a nearest neighbor to sites of the same type in the same layer and to sites of the other two types in the adjacent layer. For example, A1 sites are nearest neighbors to A1 sites in the 1st layer and to B2 and C2 sites in the 2nd layer. Three additional rules on the possible Pd atomic positions are defined: (1) In the 1st layer, Pd atoms are only allowed to occupy type A sites. (2) The 1st-layer A sites give rise to B and C sites in the 2nd layer and to A, B, and C sites in the 3rd and 4th layers. (3) Some structures violate the minimal-distance constraint, as two Pd atoms can be packed closer than the nearest-neighbor distance on a higher layer (for example, in both B2 and C2 sites in a Pd6-3D structure (Figure 1a)).
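The site rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `ALLOWED_TYPES`, `is_allowed`, and `nn_partner_types` are hypothetical, the sketch works only at the level of site types and layer numbers (it builds no lattice coordinates), and it assumes that "adjacent layer" covers both the layer below and the layer above where those sites exist.

```python
# Site types available on each layer, following rules (1) and (2):
# layer 1 holds only A sites; layer 2 holds B and C sites;
# layers 3 and 4 hold A, B, and C sites.
ALLOWED_TYPES = {1: {"A"}, 2: {"B", "C"}, 3: {"A", "B", "C"}, 4: {"A", "B", "C"}}


def is_allowed(site_type, layer):
    """Return True if a Pd atom may occupy this site type on this layer."""
    return site_type in ALLOWED_TYPES.get(layer, set())


def nn_partner_types(site_type, layer):
    """Type-level nearest neighbors of a site: the same type in the same
    layer, plus the other two types in the adjacent layers (assumption:
    both the layer below and the layer above count as adjacent)."""
    if not is_allowed(site_type, layer):
        raise ValueError(f"{site_type}{layer} is not a valid site")
    partners = {(site_type, layer)}
    for adjacent in (layer - 1, layer + 1):
        for other in ALLOWED_TYPES.get(adjacent, set()) - {site_type}:
            partners.add((other, adjacent))
    return partners


# The example from the text: A1 neighbors A1, B2, and C2.
print(sorted(nn_partner_types("A", 1)))
```

Rule (3) is deliberately not encoded here, since it describes exceptions to the minimal-distance constraint rather than a type-level rule.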

Figure S3. Histogram of the longest interaction distance in the cluster patterns with nonzero ECI values, measured in units of the nearest-neighbor (NN) distance.

Figure S4. Flow chart and operations of the Cluster Genetic Algorithm (CGA).

Figure S5. Simulation trajectories for n = 20, shown as cluster energy vs. the number of generations in the CGA and vs. the number of Monte Carlo moves in Metropolis MC. The CGA was run for 2,000 generations, whereas Metropolis MC was run for 10,000 MC moves. The range of energies in the CGA population is shown as the green shaded area. (a) The entire simulation trajectory. (b) The first 100 moves/generations.
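For reference, the acceptance step of a Metropolis MC move, the baseline compared against the CGA in Figure S5, can be sketched as follows. This is the textbook Metropolis criterion, not code from this work; the function name `metropolis_accept` is hypothetical, and the sketch assumes energies in eV and temperature in kelvin.

```python
import math
import random

K_B = 8.617333262e-5  # Boltzmann constant in eV/K (assumes energies in eV)


def metropolis_accept(delta_e, temperature, rng):
    """Accept a trial move with probability min(1, exp(-delta_e / (k_B * T)))."""
    if delta_e <= 0.0:
        return True  # downhill (or neutral) moves are always accepted
    return rng.random() < math.exp(-delta_e / (K_B * temperature))


rng = random.Random(0)
# Hypothetical energy changes for two trial moves at 300 K.
print(metropolis_accept(-0.10, 300.0, rng))
print(metropolis_accept(10.0, 300.0, rng))
```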

[Figure S4 labels. Flow chart: Population (N), Fitness Evaluation, Crossover, Mutation (Single-atom), Mutation (Multi-atom), Fitness Evaluation, Diversity Check and Selection, Replenishing, Converged? (Yes/No), N_unique = N? (Yes/No), Final Population (N). Operation panels: Crossover; Mutation (Single-atom); Mutation (Multi-atom); Diversity Check and Selection. Other labels: Parent Population (N), Offspring Population (N), Ranking of Unique Individuals (≥N), Selection, Crossover, Mutation, New Population (N), Parent 1, Parent 2, Offspring 1, Offspring 2, Occupied Site 1, Occupied Site 2.]
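The loop in the Figure S4 flow chart (fitness evaluation, crossover, single- and multi-atom mutation, diversity check and selection, replenishing) can be sketched as below. This is a toy illustration under loud assumptions, not the CGA itself: individuals are sets of occupied site indices on a hypothetical 30-site lattice, `toy_energy` is a placeholder for the cluster-expansion energy and the fitness metrics, every child is mutated, and the convergence test is replaced by a fixed number of generations.

```python
import random

N_SITES = 30   # hypothetical number of lattice sites
N_ATOMS = 5    # atoms per individual (cluster size n)
POP_SIZE = 10  # population size N


def toy_energy(ind):
    """Placeholder fitness, NOT the cluster-expansion energy; lower is better."""
    sites = sorted(ind)
    return sites[-1] - sites[0]  # favors compact sets of site indices


def random_individual(rng):
    return frozenset(rng.sample(range(N_SITES), N_ATOMS))


def crossover(p1, p2, rng):
    """Draw a child with N_ATOMS occupied sites from the union of both parents."""
    return frozenset(rng.sample(sorted(p1 | p2), N_ATOMS))


def mutate(ind, rng, n_moves=1):
    """Move n_moves atoms to empty sites (single-atom: 1, multi-atom: > 1)."""
    occupied = set(ind)
    for _ in range(n_moves):
        empty = sorted(set(range(N_SITES)) - occupied)
        occupied.remove(rng.choice(sorted(occupied)))
        occupied.add(rng.choice(empty))
    return frozenset(occupied)


def run_ga(n_generations=50, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(POP_SIZE)]
    for _ in range(n_generations):
        offspring = []
        for _ in range(POP_SIZE):
            p1, p2 = rng.sample(population, 2)
            child = crossover(p1, p2, rng)
            offspring.append(mutate(child, rng, n_moves=rng.choice([1, 2])))
        # Diversity check: pool parents and offspring, keep unique
        # individuals only, rank them by fitness, and select the best N.
        unique = sorted(set(population) | set(offspring), key=toy_energy)
        population = unique[:POP_SIZE]
        # Replenish with random individuals if fewer than N unique remain.
        while len(population) < POP_SIZE:
            candidate = random_individual(rng)
            if candidate not in population:
                population.append(candidate)
    return sorted(population, key=toy_energy)


final = run_ga()
print(toy_energy(final[0]), sorted(final[0]))
```

Pooling parents and offspring before ranking keeps the best individuals found so far, which mirrors the "Ranking of Unique Individuals (≥N)" step in the flow chart.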

Figure S6. Fitness metrics (a-c) and CNs (d-e) trajectories for n = 20 in the CGA. (a) Total number of isolated atoms, $n_{iso}$. (b) Negative structure-average CN1, $-\overline{CN_1}$. (c) Structure-average distance to the support, $\overline{Z}$ (Å). (d) CN1 of each atom. (e) CN2 of each atom vs. the number of generations. Solid colored lines are the mean, and the blue shaded areas indicate the range.

Figure S7. (a) Energy per atom, (b) generalized coordination number (GCN) distribution, and (c) surface atom ratio distribution for different seeds of the random number generator used to initialize the lattice. Sample distributions are computed using Gaussian kernel density estimation (KDE). The maximum, the mean, and the minimum are marked as lines. The predicted stable structures from the GA at each cluster size are shown in (a). Changing the random seed in the CGA simulations has no significant effect on the stable structures.
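As a reminder of what the Gaussian KDE in these distribution plots computes, a minimal sketch follows. It is a generic estimator written from the standard definition, not the plotting code used for the figures; the function name `gaussian_kde`, the fixed bandwidth, and the sample values are all assumptions made for illustration (the source does not state a bandwidth rule).

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function x -> density estimate built from Gaussian kernels,
    one kernel of width `bandwidth` centered on each sample."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Made-up per-atom energies, for illustration only (not data from this work).
energies = [-3.10, -3.05, -3.00, -2.90]
density = gaussian_kde(energies, bandwidth=0.05)
print(density(-3.0))
```

The max, mean, and min markers in the figures are plain summary statistics of the same samples and do not depend on the bandwidth.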

Figure S8. Cluster-specific descriptors. (a) CGA simulation trajectories at n = 5-21. (b) All structures, including stable, metastable, and unstable structures, in the CGA trajectories. (c) $\overline{GCN}$ vs. energy per atom for all structures. (d) $\overline{CN_2}$ vs. energy per atom for all structures.

Figure S9. Parity plots for the ordinary least squares regression relating the energy per atom of all structures in the CGA trajectories to (a) $\overline{CN_1}$, 0.5 order; (b) $\overline{CN_1}$, 1st order; (c) $\overline{CN_1}$, 2nd order; (e) $\overline{CN_2}$, 0.5 order; (f) $\overline{CN_2}$, 1st order; (g) $\overline{CN_2}$, 2nd order; (h) $\overline{GCN}$, 0.5 order; (i) $\overline{GCN}$, 1st order; (j) $\overline{GCN}$, 2nd order. The order of the polynomial term and the error (RMSE) of those relations are also shown. The most accurate fit is for $\overline{CN_1}$ with an order of 0.5 (the equation shown in Figure 7), suggesting $\overline{CN_1}$ as an excellent descriptor for subnanometer clusters. Note that GCN has a lower RMSE with an order of 0.5 with a relation of $\overline{E}_{\mathrm{Pd}_n} = -0.916\sqrt{\overline{GCN}} + 0.179\,\overline{GCN}$. However, its fit is worse at larger sizes (Figure S12b).
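A least-squares fit of this kind can be sketched in pure Python. The sketch assumes that the 0.5-order model has the two-term, no-intercept form quoted in the caption, y = a·sqrt(x) + b·x; the function names `fit_sqrt_linear` and `rmse` are hypothetical, and the data below are synthetic points generated from the quoted coefficients purely to check that the fit recovers them. They are not data from this work.

```python
import math


def fit_sqrt_linear(x, y):
    """Least-squares fit of y = a*sqrt(x) + b*x (no intercept),
    solved through the 2x2 normal equations."""
    s11 = sum(xi for xi in x)                  # sum of sqrt(x) * sqrt(x)
    s12 = sum(xi * math.sqrt(xi) for xi in x)  # sum of sqrt(x) * x
    s22 = sum(xi * xi for xi in x)             # sum of x * x
    t1 = sum(math.sqrt(xi) * yi for xi, yi in zip(x, y))
    t2 = sum(xi * yi for xi, yi in zip(x, y))
    det = s11 * s22 - s12 * s12
    a = (t1 * s22 - s12 * t2) / det
    b = (s11 * t2 - s12 * t1) / det
    return a, b


def rmse(x, y, a, b):
    """Root-mean-square error of the fitted relation on the data."""
    residuals = [yi - (a * math.sqrt(xi) + b * xi) for xi, yi in zip(x, y)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))


# Synthetic descriptor values and energies generated from the quoted relation.
gcn = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
energy = [-0.916 * math.sqrt(g) + 0.179 * g for g in gcn]
a, b = fit_sqrt_linear(gcn, energy)
print(a, b, rmse(gcn, energy, a, b))
```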

Figure S10. Cluster energies calculated from DFT vs. cluster size, compared to the energy distribution of a low-energy ensemble obtained from the CGA-CE model. Sample distributions are computed using Gaussian kernel density estimation (KDE). The maximum, the mean, and the minimum are marked as lines. The stable structures for n = 1-15 agree well with the most stable structures seen in the DFT dataset. At other sizes, especially for larger clusters, the CGA discovers new structures with lower energies that are not included in the DFT dataset, owing to the vast configurational sampling space.

Figure S11. (a) CN1 and (b) CN2 distributions for the structure ensemble vs. cluster size. Sample distributions are computed using Gaussian kernel density estimation (KDE). The maximum, the mean, and the minimum are marked as short horizontal lines.

Figure S12. Structure optimization for large clusters (n = 25, 30, 38, 55). (a)-(c) Cluster-specific descriptors for all structures in the trajectories. (d)-(h) Atom-specific descriptors for stable and metastable structures in the low-energy ensemble. (a) $\overline{CN_1}$ vs. energy per atom. The fit from Eq. 9 is shown as the dashed line. (b) $\overline{GCN}$ vs. energy per atom. The fit from Supplementary Figure 9d is shown as the dashed line. (c) $\overline{CN_2}$ vs. energy per atom. (d) Energy per atom vs. the cluster size. (e) CN1 vs. the cluster size. (f) GCN vs. the cluster size. (g) CN2 vs. the cluster size. (h) Surface atom ratio vs. the cluster size.

Figure S13. Stable Pd20 structures reported in the literature (Pd20-pyramid) and predicted by CGA-CE (Pd20-rod).