Josue Chinchilla-Vargas1*, Francesca Bertolini2, K. J. Stalder1, J. P. Steibel3, M. F. Rothschild1
1 Iowa State University, Department of Animal Science, Ames, Iowa, 50011.
2 National Institute of Aquatic Resources, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark
3 Department of Animal Science, Michigan State University, East Lansing, Michigan 48824
Modified from a manuscript published in Livestock Science: 244, 104398
Abstract
Breed associations and registries maintain breed purity by enforcing certain conformational characteristics that define the breed and by cataloguing the pedigree of every animal in the registry. Furthermore, niche markets are often built on specialized products from heritage breeds, for which breed purity must be guaranteed. Genomic technology and the progressively lower cost of genotyping can help assess breed purity by estimating breed composition. In this research, genotypes from 648 pigs of 11 breeds were used to develop marker panels to estimate breed composition, with special emphasis on Mangalitsa pigs as a heritage breed. Two sets of panels were created. The first set was based on Fst scores calculated individually for the ~31,000 available markers across the pig genome; panels were composed of the 10, 50, 100, 500 and 1000 markers with the highest Fst scores. The second set was composed of randomly selected markers and had the same numbers of markers as the Fst-derived panels. Two statistical methods, linear regression and random forest, were then applied to the marker panels to estimate the breed composition of 107 pigs, including 47 individuals known to have Mangalitsa background. Fst-selected markers appeared to be better than random markers at identifying Mangalitsa individuals, regardless of the method used to estimate breed composition. However, random markers were more accurate at estimating breed composition for non-Mangalitsa individuals.
When the results were compared across methods, linear regression produced more accurate estimates of breed composition than random forest. However, both methods lacked accuracy when estimating breed composition for crossbred individuals. It must also be noted that these methods focused on estimating the breed composition of Mangalitsa pigs; different markers should be selected if other breeds are the focus, and prediction accuracy will depend on the breeds available as references for the Fst calculations.
The results presented in this study allow us to conclude that: 1) Random forest was effective at classifying individuals into breeds but, compared with the linear regression method, not at estimating breed composition. 2) Markers filtered using Fst scores are more effective at identifying Mangalitsa breed composition but less effective at identifying other breeds. 3) If Fst-filtered markers that distinguish Mangalitsa from other breeds are used to estimate breed composition for individuals of other breeds, a greater number of markers is needed.
Keywords: Mangalitsa; Mangalica; Swine; Breed Composition; Random Forest; Linear Regression
Introduction
Livestock breeds have been developed through continuous natural and artificial selection over long periods of time, often targeting specific traits of interest that consequently became more prevalent in the population. Conserving the diversity of breeds with different traits and adaptations can play an important role in developing livestock adapted to specific climates and production systems (Hall and Bradley, 1995) and in meeting the increased demand for animal-source foods expected in the coming decades (Nardone et al., 2010). However, maintaining between-breed diversity requires maintaining within-breed purity. Breed associations and registries maintain breed purity by enforcing certain conformational and performance characteristics and by cataloguing the pedigree of every animal approved for registry within the breed (Funkhouser et al., 2017).
Before the genomic era, a common method to screen for breed purity in white pigs, in addition to pedigree, was to perform test matings to determine whether white boars would sire only white progeny (Giuffra et al., 1999; Marklund et al., 1998). However, this procedure was time- and resource-intensive (Funkhouser et al., 2017), and other methods using molecular data have therefore been developed for multiple species (Bertolini et al., 2018, 2015; Funkhouser et al., 2017; Huang et al., 2014; Jacobs et al., 2018; Munoz et al., 2020; Schiavo et al., 2020). As a consequence of breed formation, population bottlenecks and within-breed selection for specific productive or adaptive traits, allele frequencies have changed and in some cases genetic variants have become fixed (Qanbari and Simianer, 2014). Therefore, the genetic heterogeneity among populations and breeds makes genotypes at loci that have been under strong selection pressure more useful for estimating the breed composition of an individual (Gorbach et al., 2010; Kuehn et al., 2011).
In the present study, the focus was on the Mangalitsa breed of pigs, which originated as a lard breed in the Carpathian Basin of Hungary and Romania. Mangalitsa pigs are characterized by a hairy fleece similar to that of a sheep (Oroian and Petrescu-Mag, 2014). Although hardy and producing meat and fat of desirable quality, animals of this breed tend to be slow growing (Nistor et al., 2012; Petrovic et al., 2010). Today, Mangalitsa breeders wish to maintain breed purity in order to develop specialized niche markets. In recent years, genomic tools, including several medium- and high-density commercial SNP chip panels, have been developed for several livestock species, including pigs (Nicolazzi et al., 2015). SNP chips normally include thousands of markers across the genome, and smaller panels can be generated by reducing the number of markers to address specific questions, such as the evaluation of individual animal breed purity. Several statistical approaches have been applied to identify the most discriminating markers among the thousands available on commercial SNP chips. Among these approaches, Fst analysis measures the standardized variance in allele frequencies among populations (Weir and Cockerham, 1984). This approach has been shown to be a simple and effective tool for identifying informative genetic markers and population structure in humans and livestock species, including pigs (Bennasir et al., 2010; Bertolini et al., 2018, 2015; Bowcock et al., 1994; Hulsegge et al., 2013; Schiavo et al., 2020; Wilkinson et al., 2011).
These informative marker panels can be coupled with other techniques to classify or assign individuals to groups or breeds. Among these allocation tests, random forest (RF) is an algorithm for classification and regression based on a large number of weakly correlated decision trees (Breiman, 2001; Chen and Ishwaran, 2012; Hastie et al., 2009). In this method, each decision tree is built from a bootstrap sample of the data set, and a random subset of all predictors is considered when determining the best split at each node. Therefore, all trees in a forest are different. For each tree, approximately one third of the observations are not included in the bootstrap sample; these observations are called out-of-bag (OOB) data. The OOB data are then used to estimate prediction accuracy. For a particular tree, each OOB observation is given an outcome prediction. The overall prediction for each individual is then obtained by counting the predictions over all trees for which the individual was out-of-bag, and the outcome with the most predictions is the individual's predicted outcome (Meng et al., 2009). Previous research has shown that RF can effectively assign breeds to individuals based on genotypes (Bertolini et al., 2018, 2015; Jacobs et al., 2018; Schiavo et al., 2020). Additionally, random forest can estimate the probability that an observation belongs to a specific class, which we argue can be interpreted as an estimate of breed composition. With this rationale, this study also evaluates the effectiveness of random forest at estimating breed composition.
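The out-of-bag idea can be illustrated with a minimal Python sketch. This is not the authors' code: the function names and the toy votes are assumptions made for illustration. It draws bootstrap samples, measures the fraction of observations left out of each, and then aggregates hypothetical per-tree predictions by majority vote.

```python
import random
from collections import Counter


def oob_fraction(n_obs, n_trees, seed=0):
    """Average fraction of observations left out of a bootstrap sample."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trees):
        # A bootstrap sample draws n_obs indices with replacement.
        in_bag = {rng.randrange(n_obs) for _ in range(n_obs)}
        total += (n_obs - len(in_bag)) / n_obs
    return total / n_trees


def oob_majority_vote(votes):
    """Most frequent class among the trees for which the animal was OOB."""
    return Counter(votes).most_common(1)[0][0]


# Hypothetical sizes: 500 training animals, 200 trees.
frac = oob_fraction(n_obs=500, n_trees=200)
# Hypothetical votes from three trees that did not see this animal.
predicted = oob_majority_vote(["Mangalitsa", "Duroc", "Mangalitsa"])
print(f"mean OOB fraction: {frac:.3f}")
print(f"OOB prediction: {predicted}")
```

The simulated fraction should sit near (1 - 1/n)^n, which approaches 1/e (about 0.37) as n grows, consistent with the "approximately one third" stated in the text.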
A second method, which has been used successfully to estimate breed composition in pigs (Funkhouser et al., 2017; Huang et al., 2014), was also used. In this linear regression method, a test animal’s genotypes are regressed onto allele frequencies derived from reference animals (Funkhouser et al., 2017). Additionally, quadratic programming is used to impose linear constraints on the solution of the regression equation so that the estimate of each breed’s proportion is between 0 and 1.
The objectives of this research were to i) develop and compare approaches to identify a marker subset that would effectively identify pigs with sufficient Mangalitsa influence to be included in the herd registry; ii) evaluate the potential and accuracy of using the probabilities that random forest assigns to a pig of belonging to each breed as proxies for breed composition in purebred and crossbred pigs; and iii) compare the performance of the random forest and linear regression (Funkhouser et al., 2017) methods for estimating breed composition.
Materials and methods
Animal care and welfare
Animal care and use approval was not needed for this study because all data utilized was sourced from existing databases and no live animals were used.
Animal genotype data sets
Genotypes of Duroc (n=111), Hampshire (n=102), Landrace (n=96) and Yorkshire (n=114) individuals genotyped with the PorcineSNP60 SNP chip were provided by the National Swine Registry (NSR). Genotypes from the Berkshire (n=44), Hereford (n=22), Large Black (n=3), Meishan (n=52) and Spotted (n=10) breeds were obtained from the USDA Meat Animal Research Center (USMARC) through the National Animal Germplasm Program genomic data request tool (https://agrin.ars.usda.gov/genomic_data_decision_tool_page_dev?language=EN&record_source=US); these genotypes were produced with the GGP PorcineHD array, which contains approximately 80,000 markers. Additionally, 23 Pietrain genotypes from a commercial genetics company were used. Mangalitsa genotypes were provided by the US Mangalitsa Breed Organization and Registry (MBOAR). Pietrain and Mangalitsa genotypes were produced with the GGP Porcine v1 array, which contains approximately 50,000 markers. The Mangalitsa data set included 96 individuals: 48 pure Mangalitsa animals with no grandparents in common, except for 2 individuals that shared one grandparent (Group 1), and 48 individuals related to those in Group 1 (Group 2). Group 2 also included individuals of unknown ancestry that appeared to be pure (n=5) and 4 crossbred individuals ranging from 50% to 87.5% Mangalitsa. Based on pedigree information, these pigs were crosses of Mangalitsa with Red Wattle, Mulefoot and Large Black.
Genotypes were processed and formatted with SNPipeline (https://github.com/cbkmephisto/SNPipeline) and SNPware (https://github.com/josuechinchilla/SNPware). Because the genotypes were produced using three different marker panels, once all genotypes had been transformed to genotype matrices, the first step was to retain only the markers common to all panels, which reduced the number to ~32,000 markers distributed across all 18 autosomes plus chromosome X. Quality control (QC) was then performed using plink 1.7 (Purcell et al., 2007) to filter out individuals with a coverage of less than 85% and markers with a call rate of less than 90%. After QC, 648 individuals and 31,089 markers were retained. Finally, before the downstream analyses, marker positions were updated to the Sus scrofa genome assembly version 11.1 (https://www.ncbi.nlm.nih.gov/assembly/GCF_000003025.6/) using in-house scripts and the new marker coordinates provided by Neogen Genomics (Lincoln, Nebraska).
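The two QC thresholds can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the PLINK implementation: "coverage" is read here as the per-animal call rate, animals are filtered before markers, and the data layout (a dictionary of genotype lists with None marking a missing call) and all names are hypothetical.

```python
def call_rate(calls):
    """Fraction of non-missing genotype calls (None marks a missing call)."""
    return sum(c is not None for c in calls) / len(calls)


def quality_control(genotypes, min_animal_rate=0.85, min_marker_rate=0.90):
    """Filter {animal_id: [0/1/2 or None per marker]} by call rate.

    Assumed order: animals are filtered first, then each marker is
    evaluated on the animals that remain.
    """
    kept_animals = {animal: calls for animal, calls in genotypes.items()
                    if call_rate(calls) >= min_animal_rate}
    if not kept_animals:
        return {}, []
    n_markers = len(next(iter(kept_animals.values())))
    kept_markers = [
        j for j in range(n_markers)
        if call_rate([calls[j] for calls in kept_animals.values()]) >= min_marker_rate
    ]
    filtered = {animal: [calls[j] for j in kept_markers]
                for animal, calls in kept_animals.items()}
    return filtered, kept_markers


# Toy data: three hypothetical animals typed at ten markers.
toy = {
    "pig_a": [0, 1, 2, 0, 1, 2, 0, 1, 2, 0],
    "pig_b": [0, 1, 2, None, 1, 2, 0, 1, 2, 0],
    "pig_c": [None, None, None, None, None, 2, 0, 1, 2, 0],
}
filtered, kept_markers = quality_control(toy)
print(sorted(filtered), kept_markers)
```

In the toy data, pig_c is dropped for a 50% call rate, after which marker 3 is dropped because only one of the two remaining animals has a call.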
After QC, the dataset was divided into a training population and a validation population. The training population was used for SNP selection (Fst analysis), to train and cross-validate models (in the case of random forest) and to calculate allele frequencies for each breed (in the case of linear regression), while breed composition was estimated on the validation population. For Mangalitsa, the individuals in Group 2 and a random set of non-Mangalitsa pigs were selected as the validation population. For the Duroc, Hampshire, Landrace and Yorkshire breeds, 10 pigs were randomly chosen from each breed so that approximately 10% of the available individuals were used for validation. Because few Hereford and Pietrain genotypes were available, only 5 individuals of each of these breeds were randomly chosen for validation so that enough pigs of these breeds remained in the training population. Similarly, 10 Berkshire individuals were randomly assigned to the validation set. Large Black and Spotted samples were not used in downstream analyses because of their limited sample sizes. Because a number of markers with high Fst scores were not successfully genotyped in Meishan individuals, the Meishan breed was used only for the Fst calculation. Table 5.1 presents the number of individuals from each breed used to calculate Fst and those used for training and validation.
Determining marker subsets for analyses
All purebred Mangalitsa (Group 1) and all purebred individuals from the other breeds in the training group were used to calculate Fst scores. The Fst for each marker was calculated with Plink 1.9 (Purcell et al., 2007) between two populations: the allele frequencies of Mangalitsa pigs from Group 1 were compared against the allele frequencies of all the other breeds combined, as done by Zsolnai et al. (2013).
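The two-population comparison can be illustrated with a simplified per-marker Fst. The sketch below uses the heterozygosity-based definition, Fst = (H_T - H_S) / H_T, not the Weir and Cockerham estimator cited earlier, so its values would differ from those in this study. The function names and genotype vectors are hypothetical.

```python
def allele_freq(genotypes):
    """Frequency of the counted allele from 0/1/2 genotype codes."""
    return sum(genotypes) / (2 * len(genotypes))


def fst_two_groups(target, others):
    """Simplified Fst for one marker between a target breed and all others pooled."""
    p1, p2 = allele_freq(target), allele_freq(others)
    n1, n2 = len(target), len(others)
    # Pooled allele frequency, weighted by sample size.
    p_bar = (n1 * p1 + n2 * p2) / (n1 + n2)
    h_total = 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        return 0.0  # monomorphic marker carries no information
    # Mean expected heterozygosity within groups, weighted by sample size.
    h_within = (n1 * 2 * p1 * (1 - p1) + n2 * 2 * p2 * (1 - p2)) / (n1 + n2)
    return (h_total - h_within) / h_total


# Hypothetical marker fixed for opposite alleles in the two groups.
fixed = fst_two_groups([2, 2, 2, 2], [0, 0, 0, 0])
# Hypothetical marker with identical allele frequencies in both groups.
shared = fst_two_groups([1, 1, 0, 2], [1, 1, 2, 0])
print(fixed, shared)
```

A marker fixed for different alleles in the target breed and in the pooled comparison group reaches the maximum score, which is the property that makes high-Fst markers attractive for a breed-specific panel.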
Once Fst was calculated for each marker, 5 panels were created using the 10, 50, 100, 500, and 1000 markers with the highest Fst scores. In order to objectively compare the accuracy for the panels selected with Fst, a second set of panels obtained using the same training group and with the same number of markers as the Fst panels was created by randomly selecting markers across the genome.
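A minimal sketch of how the two sets of panels could be assembled is given below. The marker names and Fst scores are made up, and the helper functions are hypothetical rather than part of the software used in this study.

```python
import random

PANEL_SIZES = (10, 50, 100, 500, 1000)


def top_fst_panel(fst_by_marker, size):
    """Markers with the highest Fst scores, best first."""
    ranked = sorted(fst_by_marker, key=fst_by_marker.get, reverse=True)
    return ranked[:size]


def random_panel(marker_ids, size, seed=0):
    """Same-sized panel drawn at random without replacement."""
    return random.Random(seed).sample(sorted(marker_ids), size)


# Hypothetical scores for 2000 markers (the study used ~31,000).
fst_scores = {f"snp{i}": i / 2000 for i in range(2000)}

fst_panels = {size: top_fst_panel(fst_scores, size) for size in PANEL_SIZES}
random_panels = {size: random_panel(fst_scores, size) for size in PANEL_SIZES}
print(fst_panels[10])
```

Because the ranking is computed once, each Fst panel is nested within the next larger one; the random panels in this sketch are drawn independently for each size, which is an assumption, since the text does not say whether the random panels were nested.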
Linkage disequilibrium (LD) between selected SNPs was moderate (r² > 0.25) in only a few pairs of markers, indicating that most of the selected SNPs captured different fractions of the variance. Although marker pruning based on LD is often performed before further analyses when designing custom marker panels, this was deemed unnecessary because decision trees in random forest are built with markers chosen at random from the available set to minimize correlations between trees, which limits the effect of LD between markers on prediction accuracy. In the case of the linear regression method, we followed the methods used by Funkhouser et al. (2017) and no LD filter was applied.
Additionally, the locations of the 10 markers with the highest Fst scores were examined to identify quantitative trait loci (QTL) and genes located within 0.5 megabases (Mb) upstream or downstream of each marker using the NCBI genome browser (https://www.ncbi.nlm.nih.gov/genome/gdv/) and release 41 of QTLdatabase (https://www.animalgenome.org/cgi-bin/QTLdb/index).
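The ±0.5 Mb window search amounts to an interval-overlap test, sketched below. The record layout and the example QTL are hypothetical; in this study the lookup was done through the genome browser and the QTL database rather than with code like this.

```python
WINDOW_BP = 500_000  # 0.5 Mb on either side of the marker


def qtl_near_marker(chrom, position, qtl_records, window=WINDOW_BP):
    """Return QTL that overlap [position - window, position + window].

    qtl_records is an iterable of (chrom, start_bp, end_bp, trait) tuples.
    """
    lower, upper = position - window, position + window
    return [record for record in qtl_records
            if record[0] == chrom and record[1] <= upper and record[2] >= lower]


# Made-up QTL records for illustration only.
toy_qtl = [
    ("4", 97_000_000, 97_100_000, "trait_a"),   # overlaps the window
    ("4", 98_000_000, 98_200_000, "trait_b"),   # starts beyond the window
    ("1", 97_300_000, 97_400_000, "trait_c"),   # different chromosome
]
hits = qtl_near_marker("4", 97_376_691, toy_qtl)
print(hits)
```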
Breed composition analyses
Two methods were implemented to determine breed composition: a machine learning approach using the random forest algorithm and a linear regression method.
Random forest
Random forest was implemented using the R package randomForest (Liaw and Wiener, 2002). Each marker panel was used to predict breed composition using 500 trees, with the number of predictors tried at each split set to the square root of the number of markers in the SNP panel; both are the default settings of the algorithm. Table 5.2 shows the number of predictors used for each panel. Additionally, as part of the random forest algorithm, approximately 1/3 of the individuals used for training were left out of each tree and served as an internal cross-validation set to calculate breed prediction accuracy in terms of OOB error. OOB error is computed by taking, as the predicted value for the ith observation, the most frequent class predicted among the trees that were not fit using that observation, and it is a valuable tool for estimating prediction accuracy (Bertolini et al., 2015; Hastie et al., 2009). Random forest produced two results: 1) it assigned a breed to each individual and 2) it generated probabilities for each pig of belonging to each of the breeds present in the reference data.
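As a small worked example of the settings above, the Python sketch below computes the square-root rule for each panel size, assuming the value is rounded down as in the randomForest default for classification, and turns a set of tree votes into class proportions of the kind interpreted here as breed composition. The vote counts are invented and the function names are hypothetical.

```python
import math
from collections import Counter

PANEL_SIZES = (10, 50, 100, 500, 1000)


def default_mtry(n_markers):
    """Square root of the number of predictors, rounded down (at least 1)."""
    return max(1, math.isqrt(n_markers))


def vote_proportions(tree_votes, breeds):
    """Share of trees voting for each breed, read as a class probability."""
    counts = Counter(tree_votes)
    return {breed: counts[breed] / len(tree_votes) for breed in breeds}


mtry_by_panel = {size: default_mtry(size) for size in PANEL_SIZES}

# Invented votes from a 500-tree forest for one hypothetical pig.
votes = ["Mangalitsa"] * 400 + ["Duroc"] * 100
proportions = vote_proportions(votes, ["Mangalitsa", "Duroc", "Yorkshire"])
print(mtry_by_panel)
print(proportions)
```

Under the rounding-down assumption the panels of 10, 50, 100, 500 and 1000 markers would try 3, 7, 10, 22 and 31 predictors per split. The vote proportions sum to 1 across breeds, which is what allows them to be read as a composition.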
Linear regression
The linear regression method was implemented through the R package breedTools (https://github.com/funkhou9/breedTools). With this method, a test animal’s genotypes are regressed onto allele frequencies derived from the reference set of animals (Funkhouser et al., 2017). Additionally, quadratic programming is used to impose a set of linear constraints on the solution of the regression equation so that the estimated proportion of each breed is between 0 and 1. Exact details are given in “Estimation of genome-wide and locus specific breed composition in pigs” (Funkhouser et al., 2017).
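The constrained regression can be sketched in plain Python under several assumptions that are not taken from breedTools: genotypes are scaled to allele dosage divided by 2 so that they share the scale of allele frequencies, the breed proportions are additionally required to sum to 1, and projected gradient descent stands in for a quadratic programming solver. All names and numbers are hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection onto {b : b_i >= 0, sum(b) = 1}."""
    cumulative, theta = 0.0, 0.0
    for j, u_j in enumerate(sorted(v, reverse=True), start=1):
        cumulative += u_j
        candidate = (cumulative - 1.0) / j
        if u_j - candidate > 0:
            theta = candidate  # the last j satisfying the condition sets the shift
    return [max(x - theta, 0.0) for x in v]


def estimate_breed_composition(y, freqs, n_iter=2000):
    """Minimise ||y - F b||^2 subject to b >= 0 and sum(b) = 1.

    y: scaled genotypes (dosage / 2), one per marker.
    freqs: one row per marker, one allele frequency per reference breed.
    """
    k = len(freqs[0])
    b = [1.0 / k] * k
    # Conservative step: the sum of squared entries bounds the curvature.
    step = 1.0 / sum(f * f for row in freqs for f in row)
    for _ in range(n_iter):
        grad = [0.0] * k
        for y_i, row in zip(y, freqs):
            residual = sum(f * b_j for f, b_j in zip(row, b)) - y_i
            for j, f in enumerate(row):
                grad[j] += residual * f
        b = project_to_simplex([b_j - step * g for b_j, g in zip(b, grad)])
    return b


# Hypothetical allele frequencies at 6 markers for 3 reference breeds.
freqs = [
    [0.9, 0.1, 0.5],
    [0.1, 0.8, 0.4],
    [0.2, 0.3, 0.9],
    [0.8, 0.2, 0.1],
    [0.5, 0.9, 0.2],
    [0.3, 0.6, 0.7],
]
true_mix = [0.75, 0.25, 0.0]  # a noise-free 75% / 25% cross
y = [sum(f * t for f, t in zip(row, true_mix)) for row in freqs]
estimate = estimate_breed_composition(y, freqs)
print([round(b, 3) for b in estimate])
```

With noise-free toy data the estimate should recover the mixture used to generate it; with real genotypes, which take only the values 0, 0.5 and 1 on this scale, the fit is approximate and depends on how well the reference frequencies separate the breeds.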
Results and discussion
Marker subsets for analyses
Table 5.3 shows the number of markers per chromosome for all marker panels used in this study, both Fst-filtered and randomly chosen.
When 10 Fst-selected markers were used, three markers were located on chromosome 1, two on chromosome 2, and one on each of chromosomes 4, 7, 10, 16 and 17. When 10 random markers were used, two markers were located on each of chromosomes 1, 4 and 14, and one on each of chromosomes 2, 3, 7 and X. When 50 markers were used, markers were located on all chromosomes except 12 and X for Fst-filtered markers and except 5, 6, 10 and 12 for random markers. All chromosomes except 12 and X were represented in the panel of 100 Fst-filtered markers, while in the panel of 100 random markers only chromosome 17 was not represented. For the panels of 500 and 1000 markers, all chromosomes were represented regardless of the marker selection strategy. Table 5.4 shows the chromosome, position and gene in which each of the 10 markers with the highest Fst scores is located, along with the number of QTL located within 0.5 Mb of each marker. Fst scores for these markers ranged from 0.74 to 0.82.
Only two of the ten markers were not located in intragenic regions. Even though the marker selection strategy used in the present study focused on the Mangalitsa breed, one of the selected markers was in a gene previously detected by Schiavo et al. (2019) as one of the most discriminating regions across commercial European pig breeds. This gene is PDE7B, which belongs to a gene family that has been linked to meiotic resumption of mammalian oocytes (Gupta et al., 2017).
It must be noted that all 10 markers were within 0.5 Mb upstream or downstream of at least 1 QTL. The marker located on chromosome 4 at base pair 97,376,691 had 38 QTL within 0.5 Mb upstream or downstream. The QTL near this marker were associated with traits including adipocyte diameter, average daily gain (ADG), last-rib backfat, harvest body weight and carcass length. In total, 67 QTL were near the 10 markers. The full list of QTL in proximity to the 10 markers with the highest Fst scores and their details are shown in Table 5.6.
Previous research followed a similar marker selection strategy to identify Mangalitsa pigs and test parentage (Zsolnai et al., 2013) and identified 24 markers that accurately differentiated among Mangalitsa coat colors and between Mangalitsa and commercial white pigs. None of the 24 markers reported were represented in our Fst-filtered panels. However, this may be explained by the difference in the breeds used to calculate the Fst scores and the difference in the objectives of the research.
Random forest
The OOB errors for each breed, along with the average OOB error, when 10 random and 10 Fst-filtered markers were used are shown in panel A of Figure 5.1. When 10 random markers were used, OOB error was distributed among all breeds, with the Hereford breed having the highest error (55%), followed by Mangalitsa (41%). The lowest OOB error was obtained for Duroc (9%). When Fst-filtered markers were used, Berkshire had the highest OOB error (52%), followed by Duroc (51%). As expected, when Fst-filtered markers were used, Mangalitsa showed an OOB error of 0%, as the Fst-filtered markers were selected to be the most discriminating for the Mangalitsa breed.
When 50 random markers were used, the overall OOB error dropped to 2% for the Berkshire. Duroc and Hampshire breeds had an OOB error of 0% and Hereford had the highest OOB error of 22%, while the Mangalitsa breed had an OOB error of 4.5%. When 50 Fst-filtered markers were used, the only breeds that had an OOB error greater than 0% were Berkshire, Landrace, Yorkshire and Hampshire with 16.0%, 7.1%, 4.9% and 3.1%, respectively.
Interestingly, all breeds that tend to show red pigmentation showed an OOB error of 0.0% likely due to the recessive nature of red from MC1R. Given that three of the 50 markers with the