4. Chapter 4: MM subgroups
4.3.2. Consensus peaks for PC and MM subgroups
Since the purpose of the analysis is to identify differences between MM subgroups, it is important to consider all subgroup-specific features. The number of samples per subgroup is unbalanced; therefore, similarly to how consensus peaks for primary PC and MM samples were produced, consensus peak sets were first obtained for each subgroup and then merged (as described in the Materials and Methods).
As was the case with the PC and MM samples (see Chapter 3), 28,231,242 down-sampled, filtered, shifted single ends were used per sample. The number of balanced consensus peaks found for each subgroup is shown in Table 4-1. In general, the called subgroup consensus peaks may be approaching saturation as the number of samples increases, as in the HD group. The PC group, despite having 5 samples, still has a very high peaks-per-sample ratio and would likely benefit from more samples and deeper sequencing. PC chromatin accessibility and RNA-seq expression were examined with respect to CD19 status and donor ID (not shown), with the highest variance being due to donor ID; it is possible that these two covariates produce the heterogeneity in the PC samples, leading to the high peaks-per-sample ratio. Despite having 13 samples, the HD group still yields a considerable number of peaks per sample, which may be due to the high variability between these samples arising from different copy numbers of chromosomal arms.
Subgroups   Number of samples   Subgroup peaks (balanced)   Peaks/sample
MAF         2                   104,903                     52,452
HD          13                  224,475                     17,267
MMSET       4                   97,631                      24,408
CCND1       4                   73,899                      18,475
PC          5                   188,261                     37,652
Table 4-1: Consensus peaks per sample for each subgroup.
Subgroups: MM subgroups based on cytogenetics and PC. Subgroup peaks (balanced): PC and MM subgroup consensus peaks produced at equal sequencing depth per sample.
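The Peaks/sample column of Table 4-1 can be reproduced from the other two columns. The short Python sketch below is only an illustration of that arithmetic: the dictionary and function names are hypothetical, and rounding to the nearest integer (halves rounded up) is an assumption that happens to match the tabulated values.

```python
# Values copied from Table 4-1: subgroup -> (number of samples, balanced consensus peaks).
table_4_1 = {
    "MAF": (2, 104_903),
    "HD": (13, 224_475),
    "MMSET": (4, 97_631),
    "CCND1": (4, 73_899),
    "PC": (5, 188_261),
}


def peaks_per_sample(n_samples, n_peaks):
    """Consensus peaks divided by samples, rounded to the nearest integer.

    Integer arithmetic is used so that exact halves round up (assumed convention).
    """
    return (2 * n_peaks + n_samples) // (2 * n_samples)


ratios = {name: peaks_per_sample(n, peaks) for name, (n, peaks) in table_4_1.items()}

# List the subgroups from the highest to the lowest peaks-per-sample ratio.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:6s} {ratio:>7,d}")
```

Sorting by the ratio makes the pattern discussed in the text easier to see: the subgroups with few samples sit at the top and HD, with 13 samples, at the bottom.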
All the balanced subgroup consensus peak sets were merged, yielding a joint set of 295,238 consensus peaks, referred to as the subgroup MM and PC consensus peaks. These are regions of high chromatin accessibility in at least one of the samples considered, called at the same sequencing depth per sample.
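To illustrate what merging subgroup peak sets into a joint set involves, the following Python sketch collapses overlapping intervals from several peak lists into single regions. It is a generic interval union under assumed conventions (half-open coordinates, touching intervals merged); the actual procedure is the one described in the Materials and Methods and may differ. The function name and the toy coordinates are hypothetical and are not data from this study.

```python
def merge_peaks(peak_sets):
    """Union of several peak sets.

    Each peak is a (chrom, start, end) tuple with half-open coordinates.
    Overlapping or touching intervals on the same chromosome are collapsed
    into one region.
    """
    all_peaks = sorted(peak for peaks in peak_sets for peak in peaks)
    merged = []
    for chrom, start, end in all_peaks:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Extend the previous region instead of opening a new one.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


# Toy, made-up coordinates for three groups.
hd = [("chr1", 100, 200), ("chr1", 500, 600)]
maf = [("chr1", 150, 250), ("chr2", 10, 50)]
pc = [("chr1", 590, 700)]

joint = merge_peaks([hd, maf, pc])
print(joint)
```

Because overlapping peaks from different groups collapse into one region, the joint set is smaller than the sum of the individual sets, which is consistent with the 295,238 merged regions being far fewer than the sum of the per-subgroup counts in Table 4-1.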
Of the 295,238 total regions, 45,322 are accessible only in PC samples (Figure 4-2 A, white area, bottom right). As can be seen in Figure 4-2 A, 44,296 regions are accessible in all MM subgroups, of which 5,062 do not intersect with PC (Figure 4-2 B). There are therefore 39,234 (44,296 – 5,062) regions common to all MM subgroups and PC.
Figure 4-2: Consensus peak regions overlap for each subgroup.
Consensus peak regions for each MM subgroup: HD (red), CCND1 (blue), MAF (green), MMSET (purple) and PC (in white area, bottom right) overlapping the consensus peaks for PC and MM subgroups.
A: Chromatin accessible regions for all subgroups including PC regions. B: Excluding PC.
After removing PC chromatin accessible regions, 111,117 peaks remained (Figure 4-2 B). The largest sets of subgroup-specific chromatin accessible regions occur in HD, with more than half of the total (59,123), and in MAF (10,180). A very high proportion of MM-only regions are common to both: 8,278, meaning that one quarter of MAF accessible chromatin is also accessible in HD. HD and MMSET share 7,296 MM-only peaks, so a third of MMSET peaks are also found in HD. Since the HD subgroup has many samples, and more reads can pile up in the same regions due to hyperdiploidy, both factors can contribute to a larger set of consensus peaks. It is therefore to be expected that the chromatin accessibility profile for HD will partially recapitulate those of the other subgroups. Furthermore, 5,062 MM-only chromatin accessible peaks are common to all MM subgroups.
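The region counts quoted in this section can be cross-checked with a few lines of arithmetic. The Python sketch below uses only numbers stated in the text; the variable names are hypothetical and the percentages are simple ratios, not results of any additional analysis.

```python
# Counts quoted in the text.
mm_only = 111_117                  # regions left after removing PC accessible regions
all_mm_subgroups = 44_296          # regions accessible in all MM subgroups
all_mm_subgroups_no_pc = 5_062     # of those, regions not intersecting PC
hd_specific = 59_123               # HD-specific MM-only regions
maf_specific = 10_180              # MAF-specific MM-only regions

# Regions common to all MM subgroups and PC.
shared_all_mm_and_pc = all_mm_subgroups - all_mm_subgroups_no_pc

# Share of the MM-only regions that are specific to HD or to MAF.
hd_fraction = hd_specific / mm_only
maf_fraction = maf_specific / mm_only

print(f"Common to all MM subgroups and PC: {shared_all_mm_and_pc:,}")
print(f"HD-specific share of MM-only regions: {hd_fraction:.1%}")
print(f"MAF-specific share of MM-only regions: {maf_fraction:.1%}")
```

The HD-specific share comes out just above one half, in line with the statement that HD accounts for more than half of the MM-only regions.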
The annotations of the consensus peak regions and of the different subgroup-specific subsets were studied (see Materials and Methods). The distributions are generally similar to each other and to a random set of genomic regions (Figure 4-3 and Figure 4-4). Most of the regions are intronic (between 37% and 45% of the total), followed by intergenic regions, which are highly underrepresented in the consensus peaks for PC and MM subgroups and in the subset of peaks for all subgroups (18% to 28%) compared with randomly generated genomic regions representative per chromosome (40%; Figure 4-3, “RANDOM_ALL” category in cyan, and Figure 4-4 D).
Open chromatin in PC and MM subgroups is enriched relative to random regions in promoters (9-16% vs. 5%) and in coding sequences (10-13% vs. 5%; Figure 4-3, labelled “cds” in green, and Figure 4-4 C). The 5’ UTR, a non-coding part of the mRNA involved in translational regulation, is also enriched for accessible chromatin compared with the random background (5-12% vs. 2%; Figure 4-3, labelled “5UTRs” in dark yellow, and Figure 4-4 B). These enrichments can be explained by the fact that promoters can extend from 1 kb upstream to 1 kb downstream of the TSS; a portion of the sites annotated as 5’ UTR (and perhaps CDS) might therefore in fact be promoters. In addition, unannotated TSSs involved in MM and PC might overlap annotated exons (CDS), so these CDS regions might in fact be acting as promoters of expressed genes. As seen in Chapter 3, around half of the DAMM regions were annotated or unannotated TSSs, and promoter accessibility tends to be necessary for gene expression. Finally, the 3’ UTR proportion is even across the different MM subgroups, PC and the random background (3-4%; Figure 4-3, labelled “3UTRs” in light red, and Figure 4-4 A).
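The argument that a peak annotated as 5’ UTR may in fact lie within a promoter can be made concrete with a small Python sketch. It assumes a promoter window of 1 kb on either side of the TSS, as stated in the text; the function names, the half-open coordinate convention and the example gene coordinates are hypothetical and chosen only for illustration.

```python
def promoter_window(tss, strand, upstream=1000, downstream=1000):
    """Half-open (start, end) window around a TSS position.

    Upstream and downstream are interpreted relative to the strand.
    """
    if strand == "+":
        return max(0, tss - upstream), tss + downstream + 1
    return max(0, tss - downstream), tss + upstream + 1


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


# Hypothetical gene on the + strand: TSS at 10,000, 5' UTR spanning 10,000-10,300.
window = promoter_window(10_000, "+")
utr5 = (10_000, 10_300)
peak = (10_150, 10_450)  # made-up accessible region

in_promoter = overlaps(*peak, *window)
in_utr5 = overlaps(*peak, *utr5)
print(window, in_promoter, in_utr5)
```

In this toy case the same peak overlaps both the 5’ UTR and the promoter window, so it would contribute to both categories, which is the situation the text proposes as an explanation for the 5’ UTR enrichment.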
Figure 4-3: Annotation of consensus peak regions. (ND: PC)
Genomic annotation of the consensus peak regions. One region can overlap multiple genomic categories on both strands, but each genomic category was counted only once per region. ALL: all consensus peak regions. Subgroup regions: CCND1, HD, MAF, MMSET and PC. RANDOM_ALL: randomly generated sequences simulating a sample equal to all consensus peak regions per chromosome. UTR: untranslated region. 3UTRs: 3 prime end UTR. 5UTRs: 5 prime end UTR. CDS: coding sequence.
Figure 4-4: Ratio scatterplots showing annotation of consensus peaks for PC and MM subgroups. (ND: PC)
Proportion of consensus peak regions corresponding to each genomic annotation: A: 3 prime end UTR, B: 5 prime end UTR, C: coding sequence, D: intergenic, E: introns, F: promoters. One region can overlap multiple genomic categories on both strands but each genomic category was counted only once per region. The groups on the x-axis are: All consensus peak regions (ALL). Regions overlapping different subgroups: CCND1, HD, MAF, MMSET and PC (“ND” label). A random generation of sequences simulating a sample equal to all consensus peak regions per chromosome (RANDOM_ALL).
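The counting rule stated in the captions of Figure 4-3 and Figure 4-4 (a region may overlap several categories, but each category is counted only once per region) can be sketched as follows. This is a minimal Python illustration, not the annotation pipeline used here: the function name and toy input are hypothetical, and taking the number of regions as the denominator is an assumption.

```python
from collections import Counter


def annotation_proportions(region_annotations):
    """Proportion of regions overlapping each genomic category.

    region_annotations holds one list of category labels per region; a label
    may be repeated (for example, one hit per strand), but it is counted at
    most once for that region.
    """
    counts = Counter()
    for categories in region_annotations:
        counts.update(set(categories))  # de-duplicate within the region
    n_regions = len(region_annotations)
    return {category: n / n_regions for category, n in counts.items()}


# Toy input: four made-up regions and the categories each one overlaps.
toy_regions = [
    ["intron", "intron", "promoter"],  # intron hit on both strands
    ["intergenic"],
    ["promoter", "5UTR", "cds"],
    ["intron"],
]

proportions = annotation_proportions(toy_regions)
for category, value in sorted(proportions.items()):
    print(f"{category:10s} {value:.2f}")
```

Under this rule the proportions need not sum to 1, because a single region can contribute to several categories at once.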
The chromatin accessibility profiles were obtained as specified in the Materials and Methods section and can be seen in Figure 4-5. In general, the cancer state is characterized by a general opening of chromatin (enrichment of regions above a log2 fold change of 0 in the first column of Figure 4-5 for each subgroup). CCND1 and HD samples have a large proportion of regions that are already open in PC and open up even further (regions in the top right quadrant in the first column of Figure 4-5). MMSET has an enrichment of regions becoming accessible that are in inactive chromatin in PC (regions in the top left quadrant in the first column of Figure 4-5). MAF seems to have a more even distribution in this regard. Finally, the subgroups are highly correlated in terms of their fold change in accessibility compared to normal. This, however, may partly be a result of spurious correlation of ratios: even if two MM subgroups are not correlated with each other, each may be individually correlated with a third variable, here the PC signal shared by every fold change.
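The spurious correlation of ratios mentioned above can be demonstrated with a small simulation. The Python sketch below is purely illustrative and uses simulated values, not data from this study: three independent log2 signals stand in for PC and two MM subgroups, and the distribution parameters, seed and variable names are arbitrary assumptions.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_regions = 5000

# Three mutually independent simulated log2 accessibility signals.
pc = [rng.gauss(8, 1) for _ in range(n_regions)]
subgroup_a = [rng.gauss(8, 1) for _ in range(n_regions)]
subgroup_b = [rng.gauss(8, 1) for _ in range(n_regions)]

# Log2 fold changes against the same PC reference (a ratio is a difference in log space).
lfc_a = [a - p for a, p in zip(subgroup_a, pc)]
lfc_b = [b - p for b, p in zip(subgroup_b, pc)]

r_signal = pearson(subgroup_a, subgroup_b)  # close to 0: the signals are independent
r_lfc = pearson(lfc_a, lfc_b)               # clearly positive: both share the PC term

print(f"correlation of raw signals:  {r_signal:.2f}")
print(f"correlation of fold changes: {r_lfc:.2f}")
```

With equal variances for all three signals, the expected correlation between the two fold changes is 0.5 even though the underlying subgroup signals are independent, because the shared PC term contributes half of the variance of each fold change.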
Figure 4-5: Subgroup chromatin accessibility profiles. (ND: PC)
Details for the subgroup MM and PC consensus peak regions. The first column shows the average rlog (normalized) chromatin accessibility for the PC samples. The remaining columns and rows show the log2 fold change in chromatin accessibility signal between the samples of each specified subgroup and the PC samples. Signals are distinguished by whether the likelihood ratio test (LRT) is significant, i.e. whether the effect of the subgroup, accounting for batch, is significant.