Spermatogenesis-related Gene Selection with DNA microarray Data and Gene Ontology


Procedia Environmental Sciences 8 (2011) 609 – 614

doi:10.1016/j.proenv.2011.10.094

ICESB 2011: 25-26 November 2011, Maldives


Jiesi Shi^a, Weixiang Liu^{b,∗}, Tianxue Gong^a, Lan Tao^a, Aifa Tang^c

^a College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China

^b Department of Biomedical Engineering, School of Medicine, Shenzhen Key Lab of Biomedical Engineering, Shenzhen University, Shenzhen, China

^c Shenzhen Key Lab of Male Reproduction and Genetics, Peking University Shenzhen Hospital, Shenzhen, China

Abstract

Spermatogenesis-related genes are essential for mammalian male reproduction. To select spermatogenesis-related genes from microarray data, we integrate GeneRank with Gene Ontology (GO) term semantic similarity. Using this method, we rank the genes of a microarray dataset, and the results indicate that the method provides a useful framework for spermatogenesis-related gene selection.

Keywords: spermatogenesis; gene selection; GeneRank; microarray; Gene Ontology (GO); semantic similarity

1.Introduction

Sperm development, termed spermatogenesis, is an essential stage of the mammalian male reproductive process. Identifying spermatogenesis-related genes will contribute to our understanding of the biology of spermatogenesis and human reproduction [1].

Microarray data offer a promising route to spermatogenesis analysis. Recently, we isolated testes from Balb/C mice aged 4, 9, 18, 35 and 54 days and 6 months. cRNAs prepared from these testis samples were hybridized to the commercially available GeneChip Mouse Genome 430 2.0 Array (Affymetrix Inc.), which contains 34,000 known mouse genes and 8,000 unknown genes or ESTs (expressed sequence tags) and thus spans the whole mouse genome. In mining the microarray data, we identified 2058 transcripts that are gradually up-regulated from four days to six months in the testis samples.

∗ Corresponding author. Tel.: 13424302216

E-mail: wxliu@szu.edu.cn

© 2011 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of the Asia-Pacific Chemical, Biological & Environmental Engineering Society (APCBEES) Open access under CC BY-NC-ND license.


The analysis of a microarray experiment is more robust when prior information is included. Gene Ontology (GO) annotations are a well-established source of such additional information. Since the relationships between GO terms directly reflect the associations between gene products, much effort has gone into studying the semantic similarity of GO terms as a measure of the similarity between genes [2].

To rank genes by combining gene expression information with a network structure, Morrison et al. proposed GeneRank, a customised version of the PageRank algorithm [3]. However, the gene network in that work is constructed in a simple way. To make fuller use of the available annotation resources, we combine GeneRank with the method of Wang et al. [4], which measures similarity based on the graph structure of GO. According to existing studies and the annotations, the top-ranked genes play important roles in spermatogenesis, which supports the proposed method.

2.Methods

2.1.GeneRank Model

GeneRank applies search-engine technology to generate prioritized gene lists automatically by combining microarray experimental information with prior knowledge about the underlying network [3]. If a gene is connected to many highly ranked genes, it should be ranked high, even if the experimental data alone would rank it low. $W \in \mathbb{R}^{N \times N}$ is a symmetric matrix representing the undirected gene network, i.e. $W^T = W$. The vector $ex = (ex_1, ex_2, \ldots, ex_N)^T$, with $ex_i \ge 0$ and $i = 1, 2, \ldots, N$, reflects the expression information of gene $g_i$. The degree of gene $g_i$ is

$$\deg_i = \sum_{j=1}^{N} w_{ij} = \sum_{j=1}^{N} w_{ji} \qquad (1)$$

Assume the vector $r^*$ is the solution of GeneRank. Then, using a matrix decomposition technique, we have

$$(I - d \cdot W^T D^{-1})\, r^* = (1 - d) \cdot ex \qquad (2)$$

where $D = \mathrm{diag}(\deg_1, \deg_2, \ldots, \deg_N)$ is a diagonal matrix and $d \in [0, 1]$ is the damping factor, which plays an important role in the GeneRank model. As $d$ approaches 0, the ranking depends on the expression data alone. Conversely, as $d$ approaches 1, the model emphasizes the network information.
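To make the role of Eq. (2) concrete, the following is a minimal Python sketch, not the authors' implementation. It replaces a direct linear solve with the fixed-point iteration $r \leftarrow (1 - d) \cdot ex + d \cdot W^T D^{-1} r$, whose limit satisfies Eq. (2) when $d < 1$. The function name `generank`, the toy network and the rule that a gene with zero degree gets degree 1 are assumptions made for illustration.

```python
def generank(W, ex, d=0.5, tol=1e-12, max_iter=10_000):
    """Iterate r <- (1 - d) * ex + d * W^T D^{-1} r until it stops changing.

    The fixed point satisfies (I - d W^T D^{-1}) r = (1 - d) ex, i.e. Eq. (2).
    """
    n = len(ex)
    # deg_i = sum_j w_ij, Eq. (1). Assumption: an isolated gene gets degree 1
    # so that the division below is always defined.
    deg = [sum(row) or 1.0 for row in W]
    r = list(ex)
    for _ in range(max_iter):
        new_r = [
            (1 - d) * ex[i] + d * sum(W[j][i] * r[j] / deg[j] for j in range(n))
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new_r, r)) < tol:
            return new_r
        r = new_r
    return r


# Toy network (hypothetical): gene 0 is linked to genes 1 and 2.
W = [[0.0, 1.0, 1.0],
     [1.0, 0.0, 0.0],
     [1.0, 0.0, 0.0]]
ex = [0.1, 0.8, 0.1]  # normalized expression scores

r_expr = generank(W, ex, d=0.0)   # expression only
r_net = generank(W, ex, d=0.85)   # network emphasized
ranking = sorted(range(len(ex)), key=lambda i: r_net[i], reverse=True)
```

In this toy case, `d=0.0` returns the expression vector unchanged, while `d=0.85` lifts gene 0 above gene 1 because gene 0 is connected to the strongly expressed gene 1.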

2.2.Functional Similarity

Wang et al [4] proposed a new method to measure the similarity based on the graph structure of GO.

A GO term $A$ is represented as $DAG_A = (A, T_A, E_A)$, where $T_A$ is the set of GO terms in $DAG_A$, including term $A$ and all of its ancestor terms, and $E_A$ is the set of edges (semantic relations).

First, define $S_A(t)$, the S-value of GO term $t$ related to term $A$, as in (3). Here $w_e$ is the semantic contribution factor for the edge $e \in E_A$ linking term $t$ with its child term $t'$. The semantic value of term $A$, $SV(A)$, is calculated as in (4). The semantic similarity between terms $A$ and $B$, $S_{GO}(A, B)$, is defined as in (5). Then, define the semantic similarity between one GO term $go$ and a GO term set $GO = \{go_1, go_2, \ldots, go_k\}$ as in (6). Therefore, the functional similarity between two genes $G_1$ and $G_2$ is defined as in (7), where genes $G_1$ and $G_2$ are annotated by the GO term sets $GO_1 = \{go_{11}, go_{12}, \ldots, go_{1m}\}$ and $GO_2 = \{go_{21}, go_{22}, \ldots, go_{2n}\}$, respectively.

$$\begin{cases} S_A(A) = 1 \\ S_A(t) = \max\{\, w_e \times S_A(t') \mid t' \in \text{children of}(t) \,\} & \text{if } t \ne A \end{cases} \qquad (3)$$

$$SV(A) = \sum_{t \in T_A} S_A(t) \qquad (4)$$

$$S_{GO}(A, B) = \frac{\sum_{t \in T_A \cap T_B} \left( S_A(t) + S_B(t) \right)}{SV(A) + SV(B)} \qquad (5)$$

$$Sim(go, GO) = \max_{1 \le i \le k} S_{GO}(go, go_i) \qquad (6)$$

$$Sim(G_1, G_2) = \frac{\sum_{1 \le i \le m} Sim(go_{1i}, GO_2) + \sum_{1 \le j \le n} Sim(go_{2j}, GO_1)}{m + n} \qquad (7)$$

3.Experimental Result

3.1.Data and Preprocessing

Since we use GO annotations to create the matrix W, genes without annotations are removed. Starting from the microarray data matrix of 2058 genes × 6 arrays, we first filter out the 550 genes without annotations. Consequently, the data matrix X that we analyze has 1508 genes × 6 arrays.

3.2.Biological Analysis and Discussions

There are two main steps in our approach: creating W with GO term semantic similarity, and obtaining the rank list with GeneRank.

1) Create W with GO-terms semantic similarity

To obtain the GO term semantic similarity by Wang's method, we use GOSemSim [5], an R package that contains functions to estimate the semantic similarity of GO terms, gene products and gene clusters. The semantic similarity score matrix S is used as W, the input to GeneRank in the next step. Each element of S lies between 0 and 1; the higher the value, the more similar the two genes.
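The entries of S follow Eqs. (3) to (7). As an illustration only, the sketch below evaluates those equations in plain Python on a tiny made-up ontology; it is not GOSemSim and does not read real GO data. The term names, the `parents` mapping and the single edge weight of 0.8 are assumptions chosen for the example.

```python
def s_values(term, parents):
    """S-values S_A(t) for every t in T_A (term A and its ancestors), Eq. (3)."""
    s = {term: 1.0}
    frontier = [term]
    while frontier:
        child = frontier.pop()
        for parent, w_e in parents.get(child, []):
            contrib = w_e * s[child]
            # Keep the maximum contribution over all children of `parent`.
            if contrib > s.get(parent, 0.0):
                s[parent] = contrib
                frontier.append(parent)
    return s


def term_sim(a, b, parents):
    """S_GO(A, B), Eq. (5); the denominators are SV(A) and SV(B), Eq. (4)."""
    sa, sb = s_values(a, parents), s_values(b, parents)
    shared = set(sa) & set(sb)
    return sum(sa[t] + sb[t] for t in shared) / (sum(sa.values()) + sum(sb.values()))


def term_to_set_sim(go, go_set, parents):
    """Sim(go, GO), Eq. (6): best match of one term against a term set."""
    return max(term_sim(go, other, parents) for other in go_set)


def gene_sim(go1, go2, parents):
    """Sim(G1, G2), Eq. (7), for two genes given as lists of GO terms."""
    total = sum(term_to_set_sim(g, go2, parents) for g in go1)
    total += sum(term_to_set_sim(g, go1, parents) for g in go2)
    return total / (len(go1) + len(go2))


# Hypothetical ontology: child -> [(parent, w_e), ...]; 0.8 is an assumed weight.
parents = {
    "A": [("root", 0.8)],
    "B": [("root", 0.8)],
    "C": [("A", 0.8), ("B", 0.8)],
}

sim_ab = term_sim("A", "B", parents)
sim_genes = gene_sim(["A", "C"], ["B"], parents)
```

Filling a matrix with `gene_sim` for every pair of annotated genes would give a symmetric S with ones on the diagonal and all entries in [0, 1], which is the form of W described above.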

2) Get rank list with GeneRank

There are three parameters for GeneRank: ex, d and W. First, we set $ex_i = \max(x_i) / \min(x_i)$, in the spirit of fold-change, to capture the change in gene expression of the $i$th probe in the data matrix X. Alternatively, $ex_i$ is set to the absolute value of the expression vector $x_i$, i.e. $ex_i = |x_i|$, which reflects the overall gene expression across all time points. In each case, $ex_i$ is normalized before being input to GeneRank.

Second, the optimal choice of d is data-dependent, and d = 0.5 has been suggested as an appropriate choice for general use [3], while Google uses d = 0.85 in PageRank [6]. Therefore, we run the experiments below with both d = 0.5 and d = 0.85.

Third, we set W = S, where S is the semantic similarity matrix obtained from GOSemSim. For comparison, we also set W to the negative Euclidean distance between each pair of gene expression vectors. Normalization is again required to guarantee $w_{ij} \in [0, 1]$.

Additionally, we obtain another rank list using only the value $r_i = \max(x_i) / \min(x_i)$, without GeneRank, for comparison.
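The parameter choices above can be sketched in a few lines of Python. The text does not specify the normalizations, so this example assumes that ex is divided by its sum and that the negative distances are min-max scaled over the off-diagonal pairs, with a zero diagonal; the toy matrix `X` and the function names are likewise hypothetical.

```python
import math


def fold_change(x):
    """max(x_i) / min(x_i) for one probe's expression vector."""
    return max(x) / min(x)


def normalize_sum(values):
    """Assumed normalization for ex: scale the scores to sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def neg_euclid_network(X):
    """W from negative Euclidean distances, rescaled to [0, 1].

    Assumptions: min-max scaling over off-diagonal pairs and w_ii = 0.
    """
    n = len(X)
    neg = [[-math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]
    off_diag = [neg[i][j] for i in range(n) for j in range(n) if i != j]
    lo, hi = min(off_diag), max(off_diag)
    span = (hi - lo) or 1.0  # guard against all distances being equal
    return [
        [0.0 if i == j else (neg[i][j] - lo) / span for j in range(n)]
        for i in range(n)
    ]


# Toy data (hypothetical): 3 probes x 6 time points, all values positive.
X = [
    [1.0, 1.0, 2.0, 4.0, 8.0, 16.0],
    [1.0, 1.0, 2.0, 4.0, 8.0, 15.0],
    [5.0, 5.0, 5.0, 5.0, 5.0, 5.0],
]

ex = normalize_sum([fold_change(x) for x in X])
W = neg_euclid_network(X)
```

With this scaling, the closest pair of probes receives weight 1 and the most distant pair receives weight 0, so W stays symmetric with $w_{ij} \in [0, 1]$.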

3) Results and analysis

To analyze the genes' biological function, we combine GO and NCBI (GPL1261) to obtain the annotations for the probe sets.

We roughly judge whether the top 100 and top 200 genes in each ranking list are related to spermatogenesis by matching the keywords "sperm" and "testis" against the annotations (combining annotations from NCBI with annotations from BRB-ArrayTools [7]). The counts for the nine experiments are listed in Table I. Compared with fold-change alone, the best result is obtained by GeneRank with $ex_i = \max(x_i) / \min(x_i)$, d = 0.85 and W constructed from GO term semantic similarity. The results with $ex_i = \max(x_i) / \min(x_i)$ are clearly much better than those with $ex_i = |x_i|$, which indicates that the change in gene expression is more representative of a developmental process than the overall expression level.
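The keyword counting described above amounts to a substring match over the top-ranked annotations. The following Python sketch shows one way to do it; the function name `count_hits` and the example annotation strings are hypothetical, and case-insensitive matching is an assumption.

```python
def count_hits(annotations, keywords, top_k):
    """Count how many of the first top_k annotations mention any keyword."""
    return sum(
        any(kw in text.lower() for kw in keywords)
        for text in annotations[:top_k]
    )


# Hypothetical annotations, already sorted by rank (best first).
ranked_annotations = [
    "protamine 1; spermatid development",
    "testis expressed gene 3",
    "ribosomal protein L5",
    "spermatogenesis associated 4; testis",
]

sperm = count_hits(ranked_annotations, ["sperm"], top_k=3)
testis = count_hits(ranked_annotations, ["testis"], top_k=3)
either = count_hits(ranked_annotations, ["sperm", "testis"], top_k=3)
```

Because the match is on substrings, "sperm" also matches words such as "spermatid" and "spermatogenesis", which is why this check is only a rough indicator of relevance.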


Among the nine ranking lists, we find that the top 5 genes selected by GeneRank with $ex_i = \max(x_i) / \min(x_i)$ are the same regardless of the other parameters, and also the same as the top 5 genes selected only by $r_i = \max(x_i) / \min(x_i)$. These 5 genes are named group I. Furthermore, the top 5 lists obtained by GeneRank with $ex_i = |x_i|$ are also consistent with one another. Correspondingly, these 5 genes are named group II. The probe sets, gene symbols and GenBank accessions (GB Access) of these two gene groups are listed in Table II, and their expression levels are plotted in Figure 1.

Now we analyze the top 5 genes of the two groups through literature search of articles extracted from PubMed and annotations from NCBI and GO.

The first and second genes in group I, and also the third and fifth genes in group II, are Prm1 and Prm2, short for protamine 1 and protamine 2, respectively. RT-PCR revealed amplicons for them in all spermatids except step 3 round spermatids. Sperm chromatin compaction in the sperm head is achieved when histones are replaced by protamines during spermatogenesis. Haploinsufficiency of the PRM1 or PRM2 gene causes infertility in mice [8].

Table I. Numbers of genes whose annotations contain the two keywords in the top 100 and top 200 lists, respectively. The column titled "sperm+testis" lists the number of genes related to either "sperm" or "testis".

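| Method | ex | W | d | Sperm, top 100 | Sperm, top 200 | Testis, top 100 | Testis, top 200 | Sperm+testis, top 100 | Sperm+testis, top 200 |
|---|---|---|---|---|---|---|---|---|---|
| GeneRank | $ex_i = \max(x_i)/\min(x_i)$ | semantic similarity | | 17 | 31 | 66 | 134 | 79 | 155 |
| GeneRank | $ex_i = \max(x_i)/\min(x_i)$ | semantic similarity | | 21 | 37 | 66 | 127 | 81 | 151 |
| GeneRank | $ex_i = \max(x_i)/\min(x_i)$ | negative Euclidean distance | | 21 | 36 | 66 | 125 | 81 | 149 |
| GeneRank | $ex_i = \max(x_i)/\min(x_i)$ | negative Euclidean distance | | 21 | 37 | 65 | 125 | 80 | 150 |
| GeneRank | $ex_i = \lvert x_i \rvert$ | semantic similarity | | 19 | 31 | 50 | 95 | 65 | 116 |
| GeneRank | $ex_i = \lvert x_i \rvert$ | semantic similarity | | 24 | 35 | 44 | 91 | 63 | 118 |
| GeneRank | $ex_i = \lvert x_i \rvert$ | negative Euclidean distance | | 25 | 35 | 41 | 88 | 63 | 116 |
| GeneRank | $ex_i = \lvert x_i \rvert$ | negative Euclidean distance | | 18 | 26 | 42 | 88 | 63 | 116 |
| Fold-Change | $r_i = \max(x_i)/\min(x_i)$ | | | 21 | 38 | 65 | 124 | 80 | 150 |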

Table II. The top 5 genes in group I and group II, listing the probe ID, gene symbol and GenBank accession (GB Access): (a) the top 5 genes in group I, selected by GeneRank with $ex_i = \max(x_i) / \min(x_i)$ or only by $r_i = \max(x_i) / \min(x_i)$; (b) the top 5 genes in group II, selected by GeneRank with $ex_i = |x_i|$.

(a) Group I

| Rank | Probe ID | Gene Symbol | GB Access |
|---|---|---|---|
| 1 | 1437054_x_at | Prm1 | AV209063 |
| 2 | 1439379_x_at | Prm1 | AV209010 |
| 3 | 1429513_at | 1700019M22Rik | AK006132 |
| 4 | 1451976_s_at | Cklf | AF401531 |
| 5 | 1432503_a_at | Pdcl2 | AK006040 |

(b) Group II

| Rank | Probe ID | Gene Symbol | GB Access |
|---|---|---|---|
| 1 | 1421682_a_at | Tcte3 | NM_011560 |
| 2 | 1421683_at | Tcte3 | NM_011560 |
| 3 | 1448105_at | Prm2 | NM_133711 |
| 4 | 1417020_at | Spata4 | NM_008933 |
| 5 | 1437054_x_at | Prm1 | AA138616 |


Figure 1. The expression levels of group I and group II: (a) the top 5 genes in group I, selected by GeneRank with $ex_i = \max(x_i) / \min(x_i)$ or only by $r_i = \max(x_i) / \min(x_i)$; (b) the top 5 genes in group II, selected by GeneRank with $ex_i = |x_i|$.

The third gene in group I, 1700019M22Rik, is annotated with the nucleotide title "Mus musculus adult male testis cDNA".

The fourth gene in group I, Cklf, the abbreviation of chemokine-like factor, is one of the rapidly evolving testis-expressed genes [9].

The fifth gene in group I, Pdcl2, is short for phosducin-like 2. P. Lopez et al. identified a member of the phosducin-like protein family that is predominantly expressed in male and female germ cells. Phosducin-like 2 proteins exert a function in germ cell maturation [10].

The first and second genes in group II are Tcte3, the abbreviation of t-complex-associated testis expressed 3. This gene, from the mouse t-complex region, is expressed specifically in testicular germ cells and encodes a putative light chain of the outer dynein arm of cilia and sperm flagella [11].

The fourth gene in group II, Spata4, is the abbreviation of spermatogenesis associated 4. The mouse Spata4 sequence has been identified as significantly changed in cryptorchidism [12].

4.Conclusion

In this paper, GeneRank is used to select spermatogenesis-related genes by combining gene expression with GO. The results indicate that GeneRank provides a useful framework for spermatogenesis-related gene selection. The output ranking list depends on the choice of the three parameters. The best result is obtained with $ex_i = \max(x_i) / \min(x_i)$, d = 0.85 and W constructed from GO term semantic similarity.

Acknowledgements

This work was partially supported by the National Natural Science Foundation of China (No. 60903113) and the SZU R/D Fund (No. 201054).

References

[1] Aifa Tang, Zhendong Yu, Yaoting Gui, Xin Guo, Yun Long and Zhiming Cai, “Identification and Characteristics of a Novel Testis-Specific Gene, Tsc24, in Human and Mice”, Biol. Pharm. Bull., vol.29, pp.2187-2191, 2006.


[2] Yang Yang, “A New Similarity Measure over Gene Ontology with Application to Protein Subcellular Localization”, International Conference on Biomedical Engineering and Informatics (BMEI 2010), pp.2452-2456, 2010.

[3] Julie L Morrison, Rainer Breitling, Desmond J Higham and David R Gilbert, “GeneRank: Using search engine technology for the analysis of microarray experiments”, BMC Bioinformatics 2005, vol.6, pp. 233.

[4] James Z. Wang, Zhidian Du, Rapeeporn Payattakool, Philip S. Yu and Chin-Fu Chen, “A new method to measure the semantic similarity of GO terms”, Bioinformatics. vol.23, no.10, pp.1274-1281, 2007.

[5] Guangchuang Yu, Fei Li, Yide Qin, Xiaochen Bo, Yibo Wu and Shengqi Wang. “GOSemSim: an R package for measuring semantic similarity among GO terms and gene products”. Bioinformatics vol.26, no.7, pp.976-978, 2010.

[6] Page L, Brin S, Motwani R and Winograd T, “The PageRank citation ranking: bringing order to the web”, Tech rep Stanford Digital Library Technologies Project, 1998

[7] R. Simon, A. Lam, M. Li, M. Ngan, S. Menenzes, and Y. Zhao,“Analysis of gene expression data using BRB-Array Tools,” Cancer Informatics, vol.3, pp.11-17, 2007.

[8] F. Tuttelmann, P. Krenkova, S. Romer, A.R. Nestorovic, M. Ljujic, A. Stambergova, M. Macek Jr, M. Macek Sr, E. Nieschlag, J. Gromoll and M. Simoni, “A common haplotype of protamine 1 and 2 genes is associated with higher sperm counts”, International Journal of Andrology, vol.33, no.1, pp.e240-e248, 2010

[9] Leslie M. Turner, Edward B. Chuong, and Hopi E. Hoekstra “Comparative Analysis of Testis Protein Evolution in Rodents” Genetics, vol.179, no.4, pp.2075-2089, 2008

[10] Pascal Lopez, Ruken Yaman, Luis A. Lopez-Fernandez, Frederique Vidal, Daniel Puel, Philippe Clertant, François Cuzin, and Minoo Rassoulzadegan, “A Novel Germ Line-specific Gene of the Phosducin-like Protein (PhLP) Family”, Journal of Biological Chemistry, vol.278, no.3, pp.1751–1757, 2003.

[11] S. Rashid, P. Grzmil, J. Drenckhahn, A. Meinhardt, I. Adham, W. Engel, and J. Neesen, “Disruption of the murine dynein light chain gene Tcte3-3 results in asthenozoospermia,” Reproduction, vol.139, no.1, pp.99-111, 2010.

[12] S. Liu, C. Ai, Z. Ge, H. Liu, B. Liu, S. He, Z. Wang et al., “Molecular cloning and bioinformatic analysis of SPATA4 gene,” Journal of Biochemistry and Molecular Biology, vol.38, no.6, pp.739, 2005.