
5 Infection Timing

5.3 Entropy Calculation

The diversity of the sequences in the alignment is summarized by computing the Shannon entropy at each position in the alignment and then reporting summary statistics. This section first explains what Shannon entropy is with the aid of several examples. The implementation of the entropy calculation is then described and, lastly, a number of tests designed to check that the computation functions as expected are presented.

5.3.1 Definition and Explanation of Shannon Entropy

The Shannon entropy measures the amount of information present in a set of observations from a random variable. It is defined so that the maximum value is achieved when the observations are uniformly distributed across the entire domain and the minimum value (zero) is achieved when all observations have the same value. The explicit formula is given by

$$H(X) = -\sum_{i=1}^{n} P(x_i) \cdot \log_b\big(P(x_i)\big).$$

In this equation, 𝑃(𝑥_𝑖) is the probability of observing the value 𝑥_𝑖 which is estimated by the number of times that observation occurs divided by the total number of observations of all values, and 𝑏 is the base of the logarithm used. Entropy also has the property that it can be summed across independent positions and still be interpretable.
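The formula can be sketched in a few lines of Python. This is only an illustration of the definition, not the R code used by the pipeline; the function name shannon_entropy is hypothetical, and the probabilities are estimated as relative frequencies exactly as described above.

```python
import math
from collections import Counter


def shannon_entropy(observations, base=2):
    """Plug-in Shannon entropy of a collection of observed values."""
    counts = Counter(observations)
    total = sum(counts.values())
    entropy = 0.0
    for count in counts.values():
        p = count / total  # P(x_i), estimated as a relative frequency
        entropy -= p * math.log(p, base)
    return entropy


print(shannon_entropy("ACGT"))  # four equally frequent values: the maximum for four symbols
print(shannon_entropy("AAAA"))  # a single repeated value: the minimum, zero
```

Values that never occur contribute nothing to the sum, which is why the loop only visits observed values.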

When 𝑏 is chosen as 2, the interpretation of 𝐻 is that it counts the smallest number of yes/no questions that you have to ask before you can accurately state the value of the random variable.

Consider a position where A, C, G and T are equally likely. Figure 66 shows an optimal scheme, in the form of a decision tree, for finding out the nucleotide using only yes/no questions. In all cases two questions are needed, hence the Shannon entropy is 2. More algorithmically, each of the bases A, C, G and T occurs in 25% of the cases, and for each of them the tree in Figure 66 needs two questions to deduce the base. Putting this together, the Shannon entropy is computed as 2 = 2 ⋅ 0.25 + 2 ⋅ 0.25 + 2 ⋅ 0.25 + 2 ⋅ 0.25. This is the same answer as obtained by plugging the values into the equation for 𝐻(𝑋): in this case 𝑃(𝑥_𝑖) = 0.25 and 𝑙𝑜𝑔_2(𝑃(𝑥_𝑖)) = −2 for all values of 𝑥_𝑖, so that 𝐻(𝑋) = −4 ⋅ 0.25 ⋅ (−2) = 2.

Figure 66: An optimal decision tree to accurately state the base for a single position under a uniform distribution. In all cases, two questions are needed so that the Shannon entropy (when defined using a logarithm of base 2) for this case is 2 = 2 ⋅ 0.25 + 2 ⋅ 0.25 + 2 ⋅ 0.25 + 2 ⋅ 0.25.

Figure 67: An optimal decision tree to accurately state the base for a single position when the bases A, C, G and T are distributed with frequencies (0.5, 0.25, 0.125, 0.125). In half the cases (A), one question is needed, in a quarter of the cases (C), two questions are needed, in an eighth of the cases (G), three questions are needed and the last eighth of the cases (T) also require three questions, so that the Shannon entropy (when defined using a logarithm of base 2) for this case is 1 ⋅ 0.5 + 2 ⋅ 0.25 + 3 ⋅ 0.125 + 3 ⋅ 0.125 = 1.75.

A more complex example is presented in Figure 67 where the distribution is no longer uniform. By designing a more complex decision tree, the average number of yes/no questions needed to ascertain the base can be reduced. In 50% of the cases, there will be an A and using the tree in Figure 67, only one question is needed to deduce that the base is an A. In 25% of the cases, there will be a C and using the tree in Figure 67, two questions are needed to deduce that the base is a C. The base G will occur in 12.5% of the cases and 3 questions will be needed according to the tree in Figure 67. Similarly, the last 12.5% of cases will be a T, also requiring 3 questions. Putting this together, the Shannon entropy is computed as 1 ⋅ 0.5 + 2 ⋅ 0.25 + 3 ⋅ 0.125 + 3 ⋅ 0.125 = 1.75. This smaller number reflects the fact that there is less information in that case since on average you need to ask fewer yes/no questions to obtain the correct answer.
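The two worked examples can be checked numerically. The sketch below compares the average number of questions asked by each decision tree with the value given by the entropy formula; the function names are hypothetical, and the question counts per base are read off the descriptions of Figure 66 and Figure 67.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def expected_questions(probabilities, questions):
    """Average number of yes/no questions when outcome i needs questions[i]."""
    return sum(p * q for p, q in zip(probabilities, questions))


uniform = [0.25, 0.25, 0.25, 0.25]   # A, C, G, T as in Figure 66
skewed = [0.5, 0.25, 0.125, 0.125]   # A, C, G, T as in Figure 67

print(entropy_bits(uniform), expected_questions(uniform, [2, 2, 2, 2]))  # 2.0 2.0
print(entropy_bits(skewed), expected_questions(skewed, [1, 2, 3, 3]))    # 1.75 1.75
```

The two numbers agree exactly here because every probability is a power of one half. For other distributions the best achievable average number of questions is at least the entropy and less than the entropy plus one.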

Shannon entropy has a useful property in that the entropies of independent variables can be added together to obtain the entropy of their joint distribution. This is illustrated in Figure 68, where two positions are each independently uniformly distributed. Parallel to the case presented in Figure 66, all cases are resolved using the same number of questions, in this case 4, so that the Shannon entropy is 4, exactly double that of the single-position case presented in Figure 66.
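The additivity property can be illustrated with a small sketch, assuming two independent positions that are each uniform over A, C, G and T. The joint distribution is built as the product of the two single-position distributions; all names are hypothetical.

```python
import math
from itertools import product


def entropy_bits(probabilities):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


bases = "ACGT"
single = {base: 0.25 for base in bases}

# Independence: the probability of a pair is the product of the marginals.
joint = {a + b: single[a] * single[b] for a, b in product(bases, repeat=2)}

h_single = entropy_bits(single.values())
h_joint = entropy_bits(joint.values())
print(h_single, h_joint)  # 2.0 4.0
```

If the two positions were dependent, the joint entropy would be smaller than the sum of the two single-position entropies, so summing per position entropies would overstate the diversity.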

Figure 68: An optimal decision tree to accurately state the base for two positions when bases at the two positions are independently uniformly distributed. In all cases, four questions are needed, so that the Shannon entropy (when defined using a logarithm of base 2) for this case is 4, which is the sum of the entropy of two positions each of which is uniformly distributed.

5.3.2 Implementation

The pipeline includes a step that computes the Shannon entropy as a convenient measure of the diversity in the alignment. The average entropy (across the positions) and the standard deviation of the entropy (across the positions) are reported. The entropy calculation is performed by a script called computeEntropyFromAlignedFasta.R, which is controlled by setting the environment variables computeEntropyFromAlignedFasta_inputFilename and computeEntropyFromAlignedFasta_outputDir. The computeEntropyFromAlignedFasta.R script internally sets the values of inputFilename and outputDir from these environment variables. inputFilename is parsed to obtain the path to the input file, the name of the file with its extension but without the path, the name of the file without its extension, and the extension only. The outputDir variable is checked to ensure that it specifies a valid directory. If it is not specified, the path deduced from inputFilename is used. The name of the output file is constructed by appending “.entropy.txt” to the end of the inputFilename.

The fasta file is read in and a consensus matrix is constructed using the consensus function from the seqinr package by specifying the method argument as “profile”. The consensus matrix contains a count of the number of times each character occurs at each position. IUPAC characters are handled by adding fractional amounts of the letters they represent to the consensus matrix. For example, if at position 10 a sequence has an N (representing either an A, C, G or T), then for position 10, 0.25 is added to each of the counts of A’s, C’s, G’s and T’s. Gaps are excluded from the entropy calculations.
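The fractional handling of IUPAC characters can be sketched as follows. This is a Python illustration of the idea only, not the seqinr consensus function or the pipeline's R code; the function profile_counts and the lookup table IUPAC are hypothetical names, and the sketch assumes that all sequences have the same length and that any character outside the table (including the gap character) contributes nothing.

```python
# Standard IUPAC nucleotide codes and the bases each one represents.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def profile_counts(sequences):
    """Per position counts of A, C, G and T with ambiguity codes split evenly."""
    length = len(sequences[0])
    profile = [{base: 0.0 for base in "ACGT"} for _ in range(length)]
    for sequence in sequences:
        for position, character in enumerate(sequence.upper()):
            represented = IUPAC.get(character, "")  # gaps map to nothing
            for base in represented:
                profile[position][base] += 1.0 / len(represented)
    return profile


for column in profile_counts(["AN-", "AC-"]):
    print(column)
```

In the second column of this example the N adds 0.25 to each base and the C adds a full count to C, while the third column, which contains only gaps, stays at zero for every base.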

The entropy.empirical function from the package entropy with the logarithm base 2 as unit is applied to each position in the consensus matrix to compute the Shannon entropy for each position.

The summary statistics listed in Table 33 are computed and saved to the output file. The last step in the script is to print the name (including the path) of the output file.
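The summary step can be sketched in Python as shown below. This is an assumption-laden illustration rather than the R script itself: the function summarize is hypothetical, the keys follow the R names listed in Table 33, the quartiles use linear interpolation between order statistics (assumed here to correspond to the default behavior of R's quantile function), and the standard deviation uses the n − 1 denominator.

```python
import statistics


def summarize(entropies):
    """Summary statistics of a list of per position entropies (needs >= 2 values)."""
    q1, median, q3 = statistics.quantiles(entropies, n=4, method="inclusive")
    return {
        "Min.": min(entropies),
        "1st Qu.": q1,
        "Median": median,
        "Mean": statistics.mean(entropies),
        "3rd Qu.": q3,
        "Max.": max(entropies),
        "SD": statistics.stdev(entropies),  # sample standard deviation
    }


print(summarize([0.0, 0.5, 1.0, 1.5, 2.0]))
```

When every position has the same entropy, all of the location statistics coincide and the standard deviation is zero.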

Table 33: Summary statistics calculated on the entropies. An entropy is computed at each position in the alignment and these statistics summarize those per position entropies. *The R script writes the results to a file and changes the variable names to conform to R variable naming restrictions. Variable names that start with a number get an ‘X’ prepended to them and spaces are replaced by underscores.

| Explanation | Name in R script* | Name in pipeline script |
|---|---|---|
| The number of sequences in the alignment. | N | entropy_seqs |
| The number of positions in the alignment. | K | entropy_sites |
| Across all positions, the minimum entropy. | Min. | entropy_min |
| The 1st quartile (25th percentile) of the per position entropies. | 1st Qu. | entropy_q1 |
| The median of the entropies across all positions. | Median | entropy_median |
| The average of the entropies across all positions. | Mean | mean_entropy |
| The 3rd quartile (75th percentile) of the per position entropies. | 3rd Qu. | entropy_q3 |
| The maximum entropy found at any of the positions. | Max. | entropy_max |
| The standard deviation of the entropies calculated at each position. | SD | sd_entropy |

The pipeline calls the computeEntropyFromAlignedFasta.R script using the backtick notation. The output (to STDOUT) from the script is parsed for the name of the output file. The output file is read and parsed, initializing the variables listed in the last column of Table 33. Of the metrics listed in Table 33, the average of the entropies and the standard deviation of the entropies are added to the identify_founders.tab with the names mean.entropy and sd.entropy.

5.3.3 Tests and Examples

A number of alignments were generated to test the behavior of the entropy calculation script, as shown in Table 34. Five edge cases were considered. Tests 1 to 3 check the behavior of the script when each sequence is just a single repeated base. When all the sequences consist only of As (test 1), all entropies are zero since there is no variation. When each sequence is a different repeated base (test 2), the entropy is maximized at every position, as illustrated in the example presented in Figure 66. This maximization occurs because the positions are assumed to be independent, while they often are not, highlighting a shortcoming of measuring diversity in this way. Test 3 is based on a dataset with an intermediate amount of entropy, while still consisting of sequences that are just a single repeated base. Tests 10 and 11 check that adding either a sequence composed only of gaps or a position composed only of gaps has no effect on the results: by design, gaps are ignored when the entropy is computed.
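The edge cases in tests 1, 2, 10 and 11 can be mimicked with a short Python sketch. It is an independent illustration, not the tested R script: the function column_entropies is hypothetical, the alignments are rebuilt from the descriptions in Table 34, and dropping positions that contain only gaps is an assumption made here so that such a position cannot influence the summary.

```python
import math
from collections import Counter


def column_entropies(sequences, gap="-"):
    """Per position entropy in bits, ignoring gaps and all-gap positions."""
    entropies = []
    for column in zip(*sequences):
        counts = Counter(character for character in column if character != gap)
        total = sum(counts.values())
        if total == 0:
            continue  # assumption: a position made only of gaps is dropped
        entropy = 0.0
        for count in counts.values():
            p = count / total
            entropy -= p * math.log2(p)
        entropies.append(entropy)
    return entropies


test1 = ["A" * 37] * 3                                   # only A's
test2 = [base * 45 for base in "ACGT"]                   # one base per sequence
test10 = test2 + ["-" * 45]                              # extra all-gap sequence
test11 = [sequence + "-" for sequence in test2]          # extra all-gap position

print(set(column_entropies(test1)), set(column_entropies(test2)))
print(column_entropies(test10) == column_entropies(test2))
print(column_entropies(test11) == column_entropies(test2))
```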

Four cases were generated by randomly sampling bases from a uniform distribution in which each base has a 25% chance of occurring at any position in any sequence. The four cases differ in the number of sequences and positions they include. As expected, when using a small number of sequences (only four sequences, as in test cases 4 and 6), large variances are observed in the per position entropies. The chance of sampling one particular letter four times is (1/4)^4 = 1/256, or approximately 0.0039, so the chance that all four letters at a position are identical, whichever letter that is, is 4/256 = 1/64, or approximately 0.0156. This did not occur in the case with only 10 positions. The minimum entropy when looking at only 10 positions was 0.8113, achieved when one letter occurs 3 times, another letter occurs once and the other two letters do not occur. However, when 1000 positions were considered, an entropy of zero was observed among the 1000 four-letter draws.
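The possible outcomes at a single position with four uniformly sampled sequences can be enumerated exactly. The sketch below is an illustration with hypothetical names: it lists every composition pattern of four letters (for example, three copies of one letter and one of another), its probability, and its entropy, and then computes the chance of seeing no zero-entropy position among 10 and among 1000 independent positions.

```python
import math
from collections import Counter
from fractions import Fraction
from itertools import product


def entropy_from_counts(counts):
    """Shannon entropy in bits from a tuple of positive counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts)


patterns = Counter()
for draw in product("ACGT", repeat=4):  # all 256 equally likely draws
    pattern = tuple(sorted(Counter(draw).values(), reverse=True))
    patterns[pattern] += 1

for pattern, n_draws in sorted(patterns.items(), reverse=True):
    print(pattern, Fraction(n_draws, 256), round(entropy_from_counts(pattern), 4))

p_zero = Fraction(patterns[(4,)], 256)    # all four letters identical
p_none_in_10 = float(1 - p_zero) ** 10    # no zero-entropy position in 10
p_none_in_1000 = float(1 - p_zero) ** 1000
print(p_zero, p_none_in_10, p_none_in_1000)
```

The enumeration shows that the smallest non-zero entropy is the 0.8113 of the three-and-one pattern, and that a position with entropy zero is unlikely among 10 positions but almost certain to appear among 1000.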

As the number of sequences increases, it is expected that all the summary measures of entropy (all the quantiles as well as the average) will converge to the entropy of the population distribution from which sampling occurred. Since the uniform distribution under which each base has a 25% chance of occurring implies an entropy of 2 per position (as illustrated in the examples presented in Figure 66 and Figure 68), it is expected that with enough uniformly random sequences, these statistics will converge to 2. As expected, in both cases involving 400 sequences the minimum entropy exceeds 1.95 and the average exceeds 1.99. Together, cases 4 through 7 show that the calculations performed by the pipeline behave as expected when the number of sequences is increased or decreased, providing another basic check of these properties.
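The convergence argument can be explored with a small simulation. The sketch below is not the procedure used to generate the test datasets: the function names, the seed and the number of positions are arbitrary choices made for illustration, and Python's random module stands in for whatever sampler was actually used.

```python
import math
import random
import statistics
from collections import Counter


def column_entropy(column):
    """Plug-in Shannon entropy in bits of the letters in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def simulate_mean_entropy(n_sequences, n_positions, seed=1):
    """Mean per position entropy of uniformly random A/C/G/T columns."""
    rng = random.Random(seed)
    entropies = [
        column_entropy([rng.choice("ACGT") for _ in range(n_sequences)])
        for _ in range(n_positions)
    ]
    return statistics.mean(entropies)


mean_small = simulate_mean_entropy(n_sequences=4, n_positions=200)
mean_large = simulate_mean_entropy(n_sequences=400, n_positions=200)
print(mean_small, mean_large)
```

With only four sequences the estimated entropy at a position can never exceed 2 and is usually well below it, so the mean is biased downwards; with 400 sequences the mean should lie just below 2.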

Cases 8 and 9 introduce gaps randomly across the sequences over all positions. Since gaps are ignored, the expected effect is only that the variance might increase by a small amount due to the reduced number of observations. Indeed, the statistics are comparable between the cases with gaps and those similar to them but without gaps (cases 4 and 8 both concern 4 short sequences, and cases 7 and 9 both concern 400 long sequences), but with slightly elevated standard deviations (0.46 vs 0.56 and 0.0051 vs 0.0061).

The last two cases (numbers 12 and 13) are the datasets simulated based on trees derived from real-world data. The low diversity sample is extremely homogeneous, with the most common sequence accounting for 97.6% (851 out of 872) of the dataset. In total the low diversity sample contains only 20 unique sequences. The high diversity timepoint sample is also highly homogeneous when compared with the contrived test cases. In the high diversity dataset, there are a total of 183 unique sequences (out of 691) with the most frequent sequence accounting for 12.2% (84 of 691) of the sample. The entropy statistics reflect these relatively low levels of diversity well, with average entropies of 0.0006 and 0.1853 respectively.

Table 34: Results from running the entropy calculation script on a set of simulated datasets.

| Test description | N | K | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | SD |
|---|---|---|---|---|---|---|---|---|---|
| 1. Three sequences of length 37 composed entirely of A’s. | 3 | 37 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2. Four sequences of length 45, one only A’s, one only C’s, one only G’s and the last only T’s. | 4 | 45 | 2 | 2 | 2 | 2 | 2 | 2 | 0 |

3. Eight sequences of length 13 in the proportions

In document Development of a data processing toolkit for the analysis of next-generation sequencing data generated using the primer ID approach (Page 147-153)