U12-type introns are present in diverse eukaryotic organisms at very low levels.
Assuming that these remaining U12-type introns are remnants of a formerly widespread splicing system, one explanation for their persistence over time is that their host genes experience selection against any change. We found conservation of gene structure, protein sequence identity, and local protein sequence identity to be greater in U12-type intron host genes than in controls.
Gene structure is more conserved in uOGCs compared to nOGCs. Several mechanisms for intron loss and gain in genes over evolutionary time have been proposed [41], but the prevalence and balance of these events are still much debated [42]. One possible explanation for the persistence of U12-type introns is that they reside in genes with generally slowly evolving gene structures that undergo little intron loss and/or gain. To test this hypothesis, we compared conservation of gene structure within uOGCs to conservation of gene structure within nOGCs. We define a measure of gene structure similarity (GSS) between the sequences of two clades within an OGC as the number of shared intron sites with common introns (at least one sequence in each clade has an intron in this position), divided by the total number of shared intron sites, including sites at which one sequence of one clade aligns well to one sequence of the other clade but introns occur in sequences of only one of the clades (see Figure 10 for an example). Our sets for comparison are the uOGCs for which there is at least one U12-type cintron between the clades and the nOGCs for which there is at least one U2-type cintron between the clades. Figure 11 shows that uOGCs have greater mean GSS than nOGCs in all of the clade comparisons except human versus mouse. This difference is most apparent in the vertebrates versus drosophila and the animals versus plants comparisons (differences greater than 19%). These differences were found to be statistically significant based on a sampling test (see Methods), in which we compared the uOGC GSS mean to the GSS means of same-sized random samples from the nOGCs. The lack of a significant difference between human and mouse may be explained by the shorter divergence time between these clades, which leaves less opportunity for gene structure changes to accumulate. Overall, we conclude that U12-type introns tend to be in OGCs with slowly evolving gene structures.
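The GSS definition above can be illustrated with a minimal Python sketch. It assumes that each clade has already been reduced to the set of alignment positions, within well-aligned regions, at which at least one of its sequences has an intron. The function name, the set-based representation, and the example positions are hypothetical and are not taken from the original analysis pipeline.

```python
from typing import Set


def gene_structure_similarity(sites_a: Set[int], sites_b: Set[int]) -> float:
    """GSS between two clades within one OGC.

    sites_a / sites_b: alignment positions (in regions where the clades
    align well) at which at least one sequence of that clade has an intron.
    This representation is an assumption made for illustration.
    """
    all_sites = sites_a | sites_b      # every shared intron site
    if not all_sites:
        raise ValueError("no shared intron sites to compare")
    common = sites_a & sites_b         # sites with an intron in both clades
    return len(common) / len(all_sites)


# Hypothetical intron positions for one OGC (not real data).
vertebrate_sites = {120, 340, 515, 780}
drosophila_sites = {120, 515, 902}

gss = gene_structure_similarity(vertebrate_sites, drosophila_sites)
print(f"GSS = {gss:.2f}")  # 2 common sites / 5 total sites = 0.40
```

A GSS of 1.0 means every shared intron site carries an intron in both clades, and a GSS of 0.0 means no intron position is common to the two clades.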
Protein sequence is slightly more conserved in uOGCs compared to nOGCs over short evolutionary time spans. Another possible explanation for the conservation of U12-type introns is that they reside in genes that have slowly evolving protein sequences. We tabulated the maximum protein sequence percent identity between genes from one clade versus genes from another clade for each OGC. Our sets for comparison are the uOGCs in which there is at least one U12-type cintron between the clades and the nOGCs in which there is at least one U2-type cintron between the clades. For each clade comparison except animals versus plants, the uOGCs have a greater mean protein sequence identity than the nOGCs, although these differences were less than 5% in all cases (Figure 12). These differences, except in the vertebrates versus drosophila comparison, were found to be statistically significant through a sampling test (see Methods). For the drosophila versus vertebrates and plants versus animals comparisons, the divergence times may be too large to detect a relationship between U12-type intron conservation and protein sequence conservation. Overall, we conclude that conserved U12-type introns tend to be in OGCs experiencing slightly reduced protein sequence change over relatively short evolutionary time spans.
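The sampling test referred to above is detailed in the Methods, which are not reproduced in this section. The following Python sketch is therefore only one plausible reading of it: draw many random samples from the nOGC values, each the same size as the uOGC set, and report the fraction of samples whose mean is at least as large as the observed uOGC mean. Sampling without replacement, the number of samples, and all example values are assumptions.

```python
import random
from statistics import mean
from typing import Sequence


def sampling_test(u_values: Sequence[float],
                  n_values: Sequence[float],
                  n_samples: int = 10_000,
                  seed: int = 0) -> float:
    """Fraction of same-sized random nOGC samples with mean >= uOGC mean.

    Assumption: samples are drawn without replacement from n_values.
    """
    k = len(u_values)
    if k == 0 or k > len(n_values):
        raise ValueError("uOGC set must be non-empty and no larger than the nOGC set")
    rng = random.Random(seed)          # fixed seed for reproducibility
    observed = mean(u_values)
    pool = list(n_values)
    hits = sum(
        1 for _ in range(n_samples)
        if mean(rng.sample(pool, k)) >= observed
    )
    return hits / n_samples


# Hypothetical per-OGC values (e.g. GSS or percent identity), not real data.
u_ogc = [0.90, 0.85, 0.95, 0.80]
n_ogc = [0.30, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.90]

p_value = sampling_test(u_ogc, n_ogc)
print(f"empirical p-value: {p_value:.4f}")
```

A small returned fraction indicates that random nOGC samples rarely reach the uOGC mean, which is the sense in which a difference would be called significant under this reading.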
U12-type cintrons have slightly greater local protein sequence conservation than U2-type cintrons. A more specific cause for the conservation of U12-type introns may be that the local protein sequence around the U12-type intron, rather than the entire gene, is slowly evolving or is immutable for the sake of the protein’s function, for example because it lies in a critical domain. To test this hypothesis, we compared sequence conservation in up to 10 amino acids on both sides of U12-type cintrons and U2-type cintrons. For regions where an intron is within 10 amino acids of the beginning or end of the protein sequence, the longest available segment was used. The sets for comparison were U12-type cintrons and U2-type cintrons from the uOGCs with U12-type cintrons, so that all cintrons are sampled from the same clusters. The mean local protein sequence identity of U12-type cintrons is slightly greater than that of U2-type cintrons for all clade comparisons (Figure 13). These differences, except between animals and plants, were found to be statistically significant through a sampling test (see Methods). This difference is largest between vertebrates and drosophila, with a mean of 65 percent identity for the U12-type cintron set and a mean of 52.2 percent identity for the U2-type cintron set, which equates to a difference of about 2 conserved amino acids in the local region surrounding the cintron. We conclude that U12-type cintrons tend to have a small increase in local protein sequence conservation compared to U2-type cintrons.
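The local conservation measure can be sketched in Python as follows, assuming two already-aligned protein sequences of equal length and an intron position given as an alignment column index. The function name, the treatment of gap columns as mismatches, and the example sequences are assumptions made for illustration. The last lines redo the arithmetic behind the reported difference: 12.8 percentage points over a window of up to 20 residues corresponds to roughly 2 to 3 residues.

```python
def local_identity(seq_a: str, seq_b: str, intron_pos: int, flank: int = 10) -> float:
    """Percent identity in up to `flank` columns on each side of an intron.

    seq_a, seq_b: aligned protein sequences of equal length ('-' for gaps).
    intron_pos:   alignment column immediately after the intron position.
    The window is truncated at the sequence ends, so the longest available
    segment is used. Gap columns count as mismatches (an assumption).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    start = max(0, intron_pos - flank)
    end = min(len(seq_a), intron_pos + flank)
    if start >= end:
        raise ValueError("intron position lies outside the alignment")
    matches = sum(
        1 for a, b in zip(seq_a[start:end], seq_b[start:end])
        if a == b and a != "-"
    )
    return 100.0 * matches / (end - start)


# Hypothetical aligned sequences (not real data): 5 mismatches at columns 10-14.
seq_a = "A" * 30
seq_b = "A" * 10 + "C" * 5 + "A" * 15

print(f"{local_identity(seq_a, seq_b, 15):.1f}")  # full 20-column window: 75.0
print(f"{local_identity(seq_a, seq_b, 3):.1f}")   # truncated 13-column window

# Reported means: 65 vs 52.2 percent identity over a window of up to 20 residues.
residue_difference = (65.0 - 52.2) / 100 * (2 * 10)
print(f"{residue_difference:.2f} residues")
```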