Chapter 6 Comparison of the two platforms: RNAseq and microarray
6.3.1 Detected genes in the data sets
The number of expressed genes that can be correctly detected is the most important quality criterion for gene expression profiling technologies, so we compared only the expressed genes identified in the three data sets. Different numbers of genes were investigated and identified as expressed in each data set, owing to the different technologies, sample sizes and quality filtering criteria (see Table 6.1). The microarray used almost 49 thousand probes (48784), but ~20% of them target a gene already covered by another probe, so ~38 thousand unique genes were profiled in total, of which over 33 thousand had valid identifiers. In total, 13202 genes were selected as expressed in the microarray data. Restricted by the library preparation protocol, our RNAseq data set could in theory investigate every transcript with a poly-A tail, so every mRNA can be considered an investigated gene. Among them, 14507 genes were considered expressed in the cartilage samples. The RT-PCR data investigated 551 genes with valid identifiers and detected 427 genes expressed in the cartilage samples.
Comparison of the detected (expressed) genes in the data sets revealed discrepancies between the two high-throughput platforms (Figure 6.1). The microarray data detected 3474 genes that were not identified as expressed in the RNAseq data; 72 of them were also detected in the RT-PCR data. Among the genes expressed in the RNAseq data, 4779 were not detected in the microarray data, 24 of which were detected with RT-PCR. A further 56 genes were detected only with RT-PCR.
Only a fraction of the differentially expressed genes in the RT-PCR data were also identified as differentially expressed by the other two technologies. However, the RT-PCR data validated more differentially expressed genes from the RNAseq data than from the microarray data. Furthermore, comparing the differentially expressed genes identified in the two high-throughput data sets, more than half of the up-regulated genes (104/203) and 86% of the down-regulated genes (172/200) in the microarray data were also identified in the RNAseq data, whereas the microarray data identified fewer than 30% of the differentially expressed genes (272/1028) in the RNAseq data.
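The overlap proportions quoted above follow directly from the reported counts. The short sketch below recomputes them; the variable and function names are illustrative only and are not taken from the original analysis.

```python
# Overlap counts between the microarray and RNAseq data, as reported in the text.
up_shared, up_array = 104, 203           # microarray up-regulated genes, also in RNAseq
down_shared, down_array = 172, 200       # microarray down-regulated genes, also in RNAseq
rnaseq_shared, rnaseq_total = 272, 1028  # RNAseq DE genes, also found by microarray


def fraction(shared: int, total: int) -> float:
    """Return the share of `total` genes that the other platform also identified."""
    return shared / total


up_frac = fraction(up_shared, up_array)
down_frac = fraction(down_shared, down_array)
rnaseq_frac = fraction(rnaseq_shared, rnaseq_total)

print(f"microarray up-regulated genes found by RNAseq:   {up_frac:.1%}")
print(f"microarray down-regulated genes found by RNAseq: {down_frac:.1%}")
print(f"RNAseq DE genes found by microarray:             {rnaseq_frac:.1%}")
```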
| Data set / Technology | Sample size | Criteria to select expressed genes | Genes total | Genes with valid ID | Expressed genes | Differentially expressed genes |
|---|---|---|---|---|---|---|
| RT-PCR | OA=12, NOF=12 | Median Ct < 40 | 569 | 551 | 427 | 117 |
| Microarray | OA=9, NOF=10 | Probes with valid flag value in ≥ 80% of either group of samples | 37838 (48784 probes) | 33425 | 13202 | 403 |
| RNAseq | OA=10, NOF=6 | Genes with more than 1 molecule detected in ≥ 80% of either group of samples | All mRNAs | 55765 | 14507 | 1028 |
Table 6.1 Number of genes in the 3 data sets. The table presents, for each data set, the technology used to generate the data, the sample size, the criteria used to select expressed genes, the total number of genes interrogated, the number of genes whose identifiers can be converted to ENSEMBL identifiers, the number of genes expressed, and the number of genes whose expression changed significantly (fold change ≥ 2 or ≤ -2, p-value ≤ 0.05) in OA cartilage samples compared with NOF samples. As RNAseq uses no primers or probes, we consider every mRNA to have been interrogated in the analysis. With the same fold change and p-value thresholds, RNAseq identified more than twice as many differentially expressed genes as the microarray, despite the similar number of genes found expressed in cartilage using both technologies.
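The selection rules in Table 6.1 can be illustrated with a minimal sketch. It is not the code used to produce the table: the function names are hypothetical, and the signed fold change convention (the ratio when OA is higher, the negative inverse ratio otherwise) is an assumption made so that a threshold of "≥ 2 or ≤ -2" has a concrete meaning.

```python
from typing import Sequence


def passes_in_group(detected: Sequence[bool], min_fraction: float = 0.8) -> bool:
    """True if the gene is detected in at least `min_fraction` of the samples."""
    return sum(detected) >= min_fraction * len(detected)


def is_expressed(oa: Sequence[bool], nof: Sequence[bool],
                 min_fraction: float = 0.8) -> bool:
    """A gene counts as expressed if EITHER sample group passes the threshold."""
    return passes_in_group(oa, min_fraction) or passes_in_group(nof, min_fraction)


def signed_fold_change(mean_oa: float, mean_nof: float) -> float:
    """Assumed convention: ratio if OA >= NOF, otherwise the negative inverse ratio."""
    ratio = mean_oa / mean_nof
    return ratio if ratio >= 1 else -1.0 / ratio


def is_differentially_expressed(fold_change: float, p_value: float,
                                fc_cutoff: float = 2.0,
                                p_cutoff: float = 0.05) -> bool:
    """Apply the thresholds from Table 6.1: |fold change| >= 2 and p-value <= 0.05."""
    return abs(fold_change) >= fc_cutoff and p_value <= p_cutoff


# Made-up RNAseq molecule counts for one gene (10 OA and 6 NOF samples).
oa_counts = [5, 7, 3, 9, 2, 4, 6, 8, 0, 3]
nof_counts = [0, 1, 0, 2, 0, 1]
# RNAseq criterion: "more than 1 molecule detected" in a sample.
expressed = is_expressed([c > 1 for c in oa_counts], [c > 1 for c in nof_counts])
```

Here the toy gene is detected in 9 of 10 OA samples but only 1 of 6 NOF samples, so it still counts as expressed because one group passes the 80% threshold.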
Among the 427 genes detected in the RT-PCR data, 274 were detected by all of the technologies. Their expression was higher than that of the other genes in the RT-PCR data (Figure 6.2), while the 56 genes detected only by RT-PCR were expressed at the lowest levels. The genes that were not expressed in the RNAseq data had lower expression levels than the other genes in the RT-PCR data. The expression of the 56 genes unique to RT-PCR was also checked in the RNAseq data, as these genes could be detected using RNAseq but did not pass the criteria to be considered expressed. None of the 56 genes had more than 0.5 molecule copies detected in the RNAseq data. Compared with the expressed genes, which had around 20 copies or more, these genes were expressed at very low levels (Table 6.2). Comparison of the Ct values of the 56 genes with those of all expressed genes in the RT-PCR data confirmed the relatively low expression of the 56 genes.
Figure 6.1: Comparison between detected genes. A. Venn diagram of expressed genes in each data set. Of the 427 genes expressed in the RT-PCR data, 56 were not found expressed in the RNAseq or microarray data, and 274 were found expressed in all 3 data sets. More expressed genes overlap between the microarray and RT-PCR data than between the RNAseq and RT-PCR data. B. Venn diagram of genes up-regulated in OA cartilage in each data set. Six genes identified in the RT-PCR data were also identified in the RNAseq data, whereas only 2 were identified in the microarray data. C. Venn diagram of genes down-regulated in OA cartilage in each data set. Nine genes identified in the RT-PCR data were also identified in the RNAseq data, including the 4 genes identified in the microarray data.
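The region counts of such a Venn diagram can be derived from the three sets of expressed gene identifiers with ordinary set operations. The sketch below is illustrative only: the function name is hypothetical, the gene identifiers are made-up placeholders rather than genes from the study, and the region labels are borrowed from the group names used in Figure 6.2.

```python
def venn_regions(pcr: set, array: set, rnaseq: set) -> dict:
    """Split the RT-PCR-detected genes by which other platforms also detect them."""
    return {
        "All 3 sets": len(pcr & array & rnaseq),
        "Array PCR": len((pcr & array) - rnaseq),
        "RNAseq PCR": len((pcr & rnaseq) - array),
        "PCR only": len(pcr - array - rnaseq),
    }


# Placeholder identifiers, NOT real data from the three data sets.
pcr_genes = {"g1", "g2", "g3", "g4", "g5"}
array_genes = {"g1", "g2", "g3", "g6"}
rnaseq_genes = {"g1", "g2", "g4", "g7"}

regions = venn_regions(pcr_genes, array_genes, rnaseq_genes)
for label, count in regions.items():
    print(f"{label}: {count}")
```

Because the four regions partition the RT-PCR gene set, their counts should always sum to the number of genes detected by RT-PCR, which gives a quick consistency check on any reported Venn diagram.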
Figure 6.2 Relative expression of genes in the RT-PCR data grouped by intersections of the 3 data sets. The figure shows the Ct values of genes in the RT-PCR data. The genes are separated into 8 groups, with Ct values for OA and NOF samples denoted separately within each of the following categories: All 3 sets: genes detected in all 3 data sets; Array PCR: genes detected in both the microarray and the RT-PCR data but not in the RNAseq data; PCR only: genes detected in the RT-PCR data only; RNAseq PCR: genes detected in both the RNAseq and the RT-PCR data but not in the microarray data. The asterisks indicate the significance of the comparisons between the groups; "***" means p-value < 0.001. Apart from the genes detected in all 3 data sets, the genes detected by RNAseq had lower Ct values, meaning higher expression, than the other 2 groups of genes. Genes detected by RT-PCR only had the highest Ct values, meaning they had the lowest expression levels. It implies that the