1.3 Circular Dichroism
1.3.5 Computational Tractability
Calculations on proteins are computationally demanding because of the size of the molecules, which can contain dozens of chains comprising thousands of atoms. The matrix method facilitates the use of ab initio parameters, but three limits have to be considered for large proteins: the maximum number of atom coordinates used to define the chromophore positions, the number of chromophoric groups to be processed, and the dimension of the Hamiltonian matrix that can be diagonalized. In the past, these limitations were addressed manually whenever they were encountered for a protein. Automating the calculations for public use through the web interface DichroCalc [89], however, required handling as many PDB files as possible without manual intervention.
Due to compiler constraints, at most 10,000 atom coordinates can be read in. However, only the atoms that define the chromophore positions, and to whose positions monopole charges have been fitted, are actually needed. In a peptide bond, for example, these are the atoms designated C, N and O in the PDB file; for side chain chromophores, only a subset of the atoms is required. The PDB file is therefore filtered in an additional step to remove all atoms not required in the calculation. These include, for example, solvent molecules, which in the past were one reason for reaching the critical atom limit. Filtering reduces the number of atoms to about 50% and considerably increases the number of proteins that can be processed.
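The filtering step can be illustrated with a minimal sketch. It is not the DichroCalc implementation: the function name `filter_pdb_lines` and the `side_chain_atoms` mapping are hypothetical, and the side chain atom subsets actually used are not given in the text, so they are left to the caller. The sketch relies only on the fixed-column PDB layout (atom name in columns 13 to 16, residue name in columns 18 to 20).

```python
# Backbone atoms needed to place the peptide chromophore (from the text).
BACKBONE_ATOMS = {"C", "N", "O"}


def filter_pdb_lines(lines, side_chain_atoms=None):
    """Keep only ATOM records whose atoms are needed for the calculation.

    side_chain_atoms is a hypothetical mapping such as {"PHE": {"CG", "CZ"}};
    the real subsets depend on the chromophore parameter sets.
    """
    side_chain_atoms = side_chain_atoms or {}
    kept = []
    for line in lines:
        # Skipping everything but ATOM records drops HETATM entries
        # (solvent, ligands) as well as header and remark lines.
        if not line.startswith("ATOM"):
            continue
        atom_name = line[12:16].strip()  # PDB columns 13-16
        res_name = line[17:20].strip()   # PDB columns 18-20
        if atom_name in BACKBONE_ATOMS:
            kept.append(line)
        elif atom_name in side_chain_atoms.get(res_name, ()):
            kept.append(line)
    return kept
```

Called with the lines of a PDB file, the function returns the reduced list of records; counting its length against the 10,000-atom limit shows whether the structure can be read in as a whole.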
The diagonalization algorithm limits the size of the Hamiltonian matrix. Inclusion of charge-transfer transitions doubles the matrix dimension. Side chain chromophores also enlarge the calculation, but to a smaller extent, which depends on the content of aromatic side chains (usually between 5 and 15%).
Table 2: Large proteins of the SP175 set that required splitting into single chains to calculate the CD.

Protein                   PDB Code   Chains   Residues   Residues per Chain
Pyruvate kinase           1a49       8        4152       519
Aldolase                  1ado       4        1452       363
Peroxidase                1atj       6        1836       306
β-Galactosidase           1bgl       8        8168       1021
Catalase                  1f4j       4        1928       479 / 481
c-Phycocyanin             1ha7       24       3996       162 / 171
Glutamate dehydrogenase   1hwx       6        3006       501
Immunoglobulin            1igt       4        1316       214 / 444
Ceruloplasmin             1kcw       1        1017       1017
Ovalbumin                 1ova       4        1519       386 / 373
Phosphoglucomutase        3pmg       2        1122       561
Considering two backbone transitions, our current algorithm and compiler allow the calculation of the CD spectrum of a protein comprising about 2000 residues for backbone-only calculations and about 1000 residues if charge-transfer transitions are taken into account. If the limit is reached, the protein is split into single chains and their individual spectra are calculated. If all single chain spectra are similar, it is most likely that the overall spectrum will look similar, since CD is not orientation dependent and interchain interactions are negligible due to the distance. There are some variations in the intensity, but these differences are relatively small. In the SP175 set, 11 proteins were too large to be calculated as a whole, that is, with all chains of the oligomer (Table 2). The single chains usually have the same sequence; some have different chain termini, but this has almost no effect on the CD spectrum. An exception, however, was immunoglobulin (1igt), which consists of four chains, of which chains A and C are equivalent, as are chains B and D.
Figure 14: Comparison of spectra calculated from the single chains of proteins for which calculation would otherwise be intractable. Left: c-Phycocyanin (1ha7, 24 chains, 29,916 atoms). Right: Catalase (1f4j, 4 chains, 15,402 atoms). Axes: [θ] / deg cm² dmol⁻¹ against wavelength λ / nm (160 to 240 nm).
For this protein, rather than calculating each chain separately, the file was split in half, creating two equal fractions of the protein. Figure 14 shows some examples of single chain calculations and the minor differences in the intensities. For the proteins β-galactosidase (1bgl) and ceruloplasmin (1kcw), which contain chains with more than 1000 residues (Table 2), backbone-only calculations are feasible, but calculations that also include charge-transfer transitions are not.
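The residue limits quoted above follow from simple bookkeeping, sketched below under explicit assumptions. The limit of 4000 for the matrix dimension is not stated in the text; it is inferred here from "about 2000 residues" times two backbone transitions and should be read as approximate. The function names are hypothetical, side chain chromophores are ignored, and the number of peptide groups is approximated by the number of residues.

```python
# Assumption: inferred from about 2000 residues x 2 backbone transitions.
ASSUMED_MAX_DIMENSION = 4000


def hamiltonian_dimension(residues, charge_transfer=False,
                          backbone_transitions=2):
    """Approximate matrix dimension for a backbone-only calculation.

    Including charge-transfer transitions doubles the dimension,
    as stated in the text. Side chain chromophores are not counted.
    """
    dimension = residues * backbone_transitions
    if charge_transfer:
        dimension *= 2
    return dimension


def is_tractable(residues, charge_transfer=False,
                 max_dimension=ASSUMED_MAX_DIMENSION):
    """Return True if the estimated dimension fits the assumed limit."""
    return hamiltonian_dimension(residues, charge_transfer) <= max_dimension
```

With the values from Table 2, a single β-galactosidase chain (1021 residues) gives an estimated dimension of 2042 for the backbone only, which fits, and 4084 with charge-transfer transitions, which does not. This matches the behaviour described for 1bgl and 1kcw.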
The maximum dimension of the Hamiltonian poses the most critical limit for the calculations, and possibilities to tackle this limitation were explored. About 98% of the off-diagonal interactions are orders of magnitude smaller than the strongest interactions. Removing these matrix elements before diagonalization did not have a substantial effect on the results for the CD190 protein set. Since the interaction of dipoles decreases with the cube of their separation, the vast majority of the influential 2% is due to close neighbour interactions, which are situated near the diagonal of the matrix. Calculations were therefore performed considering different numbers of residues depending on their separation in the chain sequence (Figure 15). Taking only the nearest and n+2 neighbours into account considerably worsens the results, affecting the Spearman rank correlation coefficient especially below 200 nm. However, considering only interactions with chromophores up to three residues away does not visibly affect the correlation graph above 205 nm and only slightly decreases the correlation at shorter wavelengths. Therefore, if the need arose to deal with larger proteins, the square Hamiltonian matrix could be reduced to a band diagonal matrix. The diagonalization routine could then be replaced by an algorithm for band matrices, which can handle larger dimensions.

Figure 15: Spearman rank correlation coefficient of the CD190 set for distance-dependent consideration of interactions. The solid line shows the correlation when all interactions are considered (normal calculation). The other graphs show the correlation considering only interactions with the nearest neighbour (n+1, long dashes, blue), up to two residues away (n+2, dashed, red) and up to three residues away (n+3, dotted, brown). Axes: Spearman rank correlation coefficient (−0.5 to 1) against wavelength λ / nm (190 to 230 nm).
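The reduction to a band matrix can be sketched as follows. This is an illustration only, not the routine used for the calculations. It assumes that the matrix rows are ordered by residue with a fixed number of transitions per residue (two backbone transitions here) and that the protein is a single chain, so that sequence separation maps directly onto distance from the diagonal. All function names are hypothetical.

```python
def residue_of(index, transitions_per_residue=2):
    """Residue number of a matrix row or column (assumed ordering)."""
    return index // transitions_per_residue


def band_truncate(matrix, max_separation, transitions_per_residue=2):
    """Zero all elements coupling residues further apart than max_separation."""
    n = len(matrix)
    truncated = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            separation = abs(residue_of(i, transitions_per_residue)
                             - residue_of(j, transitions_per_residue))
            if separation <= max_separation:
                truncated[i][j] = matrix[i][j]
    return truncated


def half_bandwidth(matrix):
    """Largest distance |i - j| from the diagonal of a non-zero element."""
    n = len(matrix)
    return max((abs(i - j) for i in range(n) for j in range(n)
                if matrix[i][j] != 0.0), default=0)


def band_storage(n, bandwidth):
    """Stored elements in compact symmetric band form versus the full matrix."""
    return (bandwidth + 1) * n, n * n
```

Under these assumptions, keeping interactions up to three residues away with two transitions per residue gives a half bandwidth of 2 × 3 + 1 = 7, so a matrix of dimension 4000 would need 32,000 stored elements instead of 16,000,000. For a real symmetric matrix, the compact form could then be handed to a band eigenvalue solver, for example the symmetric band routines in LAPACK or `scipy.linalg.eig_banded`; which routine would suit the existing code is not specified in the text.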