Part I Text Mining
5.3 Large-Scale Network Generation, Analysis, and Applications
5.3.5 Network Schemes: A Means for Exploiting Context
Network schemes are a means of specifying text-mining contexts (Figure 5.12). Network schemes are defined as Petri nets, which are bipartite graphs with one set of vertices called places and another set of vertices called transitions. Places are defined by biological entities, such as genes/proteins, cell types, tissues, diseases, or organisms. Places can be occupied by specific entities, by alternatives between entities, or by more general expressions (e.g. any tissue, three or more organisms). Transitions can be defined as co-occurrences in sentences or abstracts or as RelEx relations. RelEx relations can be further restricted to specific types of relations, which are defined by respective subsets of relation restriction terms (e.g. the type down-regulation contains the words inhibit, down-regulate, block, inactivate, etc.).
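The bipartite structure described above can be illustrated with a minimal Python sketch. All class, field, and variable names here are hypothetical and chosen for illustration only; they are not taken from the implementation behind this chapter. General expressions such as "three or more organisms" are not modeled, and the term set contains only the four words named in the text.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Place:
    """Vertex of the first kind: a slot for a biological entity."""
    name: str
    entity_class: str                 # e.g. "protein", "tissue", "disease"
    allowed: frozenset = frozenset()  # empty set means "any entity of this class"

    def accepts(self, entity: str) -> bool:
        return not self.allowed or entity in self.allowed


@dataclass(frozen=True)
class Transition:
    """Vertex of the second kind: a co-occurrence or a RelEx relation."""
    name: str
    kind: str                                   # "cooccurrence" or "relex"
    restriction_terms: frozenset = frozenset()  # empty set means "any relation"

    def accepts(self, relation_word: str) -> bool:
        return not self.restriction_terms or relation_word in self.restriction_terms


@dataclass
class NetworkSchema:
    """Bipartite graph: arcs only ever join a place and a transition."""
    places: dict = field(default_factory=dict)
    transitions: dict = field(default_factory=dict)
    arcs: set = field(default_factory=set)

    def add_place(self, place: Place) -> None:
        self.places[place.name] = place

    def add_transition(self, transition: Transition) -> None:
        self.transitions[transition.name] = transition

    def connect(self, place_name: str, transition_name: str) -> None:
        if place_name not in self.places or transition_name not in self.transitions:
            raise ValueError("an arc must join a known place and a known transition")
        self.arcs.add((place_name, transition_name))


# Illustrative schema: any protein down-regulates MMP13 in the context of any disease.
DOWN_REGULATION = frozenset({"inhibit", "down-regulate", "block", "inactivate"})

schema = NetworkSchema()
schema.add_place(Place("protein_1", "protein"))
schema.add_place(Place("protein_2", "protein", frozenset({"MMP13"})))
schema.add_place(Place("context", "disease"))
schema.add_transition(Transition("relation", "relex", DOWN_REGULATION))
for place_name in ("protein_1", "protein_2", "context"):
    schema.connect(place_name, "relation")
```

Keeping places and transitions in separate dictionaries and validating every arc in `connect` is one simple way to guarantee that the graph stays bipartite.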
Network schemes take two inputs: (1) a network which provides, for each relation, links to the articles in which evidence for the relation was found, and (2) context annotations for the literature articles. The network has been generated by RelEx (Section 5.2); each relation in the RelEx network is labeled with the abstracts from which it was extracted. Context annotations have been compiled automatically by exact matching of non-gene and non-protein dictionaries (Section 3.5.1) against titles, abstracts, and MeSH annotations of all publications in MEDLINE. The matching returned, for each abstract, a set of identifiers for the objects found in that abstract.
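As an illustration of this annotation step, the following Python sketch matches a synonym dictionary against a text and returns the set of identifiers found. It is a simplified stand-in, not the matching procedure of Section 3.5.1: the function name, the identifier format, and the choice of case-insensitive whole-word matching are assumptions made for this example.

```python
import re


def annotate(text, dictionary):
    """Return the identifiers whose synonyms occur in `text`.

    `dictionary` maps a synonym to an identifier. Matching is exact on
    whole words and ignores case (both choices are assumptions).
    """
    lowered = text.lower()
    found = set()
    for synonym, identifier in dictionary.items():
        # Lookarounds prevent matches inside longer words,
        # e.g. "arthritis" inside "osteoarthritis".
        pattern = r"(?<!\w)" + re.escape(synonym.lower()) + r"(?!\w)"
        if re.search(pattern, lowered):
            found.add(identifier)
    return found


# Hypothetical dictionary entries and identifiers.
context_dictionary = {
    "cartilage": "TISSUE:cartilage",
    "articular cartilage": "TISSUE:articular_cartilage",
    "arthritis": "DISEASE:arthritis",
    "osteoarthritis": "DISEASE:osteoarthritis",
}

abstract = "MMP13 is elevated in articular cartilage of osteoarthritis patients."
annotations = annotate(abstract, context_dictionary)
```

For the example abstract, the result contains the identifiers for cartilage, articular cartilage, and osteoarthritis, but not the one for arthritis, because the word boundary check rejects matches inside longer words.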
[Figure 5.12 diagram labels: Protein 1, Relation, Protein 2, Relation Type, Relation Context, Organism, Bodypart, Cell, Tissue, Disease, Synonym Dictionaries, RelEx, NEI, Network Schema]
Figure 5.12: Network schema used for exploiting context in text-mining networks: Relations are extracted from texts by co-occurrence search or RelEx. Context annotations from the respective texts are extracted by named entity identification with appropriate synonym dictionaries. Thus, relations can be restricted to those with specific contexts for network generation, or single relations or subnetworks can be analyzed for common and statistically overrepresented contexts.
specification. This represents a flexible means for defining contexts of interest. Instances of the specified schema can then be searched by text mining and reported to the user.
Second, network schemes can be used as templates to be filled with contexts derived automatically from texts. This provides a means to detect common contexts and identify frequent patterns in the respective biological contexts. Statistical analysis is then applied to rank and report relevant common contexts from the expanded network schemes.
The statistical significance of a context annotation c can be determined by Fisher's exact test (Fisher, 1932), which makes use of the hypergeometric distribution:
\[
p_c(x \ge r;\, N, n, k) \;=\; 1 - \sum_{i=0}^{r-1} \frac{\binom{k}{i}\binom{N-k}{n-i}}{\binom{N}{n}}
\]
where N is the number of abstracts in the entire set used for analysis (i.e. the abstracts subjected to RelEx analysis); n is the number of abstracts in the selected subset (i.e. those containing a relation which matches the applied network schema); k is the number of abstracts in the entire set containing annotation c; and p_c(x ≥ r) is the probability of observing r or more abstracts with annotation c in the selected subset by chance.
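The p-value can be computed directly from these four counts. The following Python sketch uses only the standard library; the function name is hypothetical. It sums the upper tail (i = r, ..., min(n, k)) instead of subtracting the lower tail from 1. The two forms are mathematically identical, but the upper-tail sum avoids the loss of precision that the subtraction would cause for very small p-values.

```python
from math import comb


def context_p_value(N, n, k, r):
    """P(X >= r) for a hypergeometric X.

    N: abstracts in the entire set
    n: abstracts in the selected subset
    k: abstracts in the entire set carrying annotation c
    r: abstracts in the selected subset carrying annotation c
    """
    if not (0 <= n <= N and 0 <= k <= N and r >= 0):
        raise ValueError("counts must satisfy 0 <= n, k <= N and r >= 0")
    # Upper tail; equals 1 - sum_{i=0}^{r-1} of the same terms.
    upper = sum(comb(k, i) * comb(N - k, n - i) for i in range(r, min(n, k) + 1))
    return upper / comb(N, n)


# Toy numbers: 10 abstracts, 4 selected, 3 annotated, 2 of the selected annotated.
p = context_p_value(N=10, n=4, k=3, r=2)
```

For these toy numbers, the two upper-tail terms are 3 · 21 = 63 and 1 · 7 = 7, and the denominator is 210, so p = 70/210 = 1/3.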
This second approach can be used for detecting contexts, such as diseases or tissues, for which a specific relation has been described. The approach is based on the same statistical principles as gene ontology overrepresentation analysis (Section 9.1). Yet, the network
schema approach is more generally applicable as it directly exploits the literature and thus does not require manual annotations. Similarly, it is more flexible, as the non-gene and non-protein dictionaries used for compiling context annotations contain more entries and are more fine-grained than gene ontology. Furthermore, it does not rely on information for single genes but focuses on articles discussing specific relations or interaction events. The approach can be applied for single relations/interactions or larger network modules.
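The ranking of overrepresented contexts for a single relation or a network module might be sketched as follows. The data layout (a mapping from abstract identifiers to sets of context annotations), the function names, and the toy data are assumptions made for illustration; the p-value is the hypergeometric tail probability defined above.

```python
from collections import Counter
from math import comb


def tail_probability(N, n, k, r):
    """Hypergeometric P(X >= r), summed over the upper tail."""
    upper = sum(comb(k, i) * comb(N - k, n - i) for i in range(r, min(n, k) + 1))
    return upper / comb(N, n)


def rank_contexts(annotations, selected, threshold=1.0):
    """Rank context annotations by overrepresentation in `selected`.

    annotations: dict mapping every abstract id to its set of context ids
    selected:    ids of the abstracts that support the relation(s) of interest
    Returns (p_value, context) pairs, most significant first.
    """
    N, n = len(annotations), len(selected)
    k_counts = Counter(c for contexts in annotations.values() for c in contexts)
    r_counts = Counter(c for a in selected for c in annotations[a])
    scored = ((tail_probability(N, n, k_counts[c], r), c) for c, r in r_counts.items())
    return sorted(pair for pair in scored if pair[0] <= threshold)


# Toy data: six abstracts, three of which support the relation of interest.
toy_annotations = {
    "a1": {"osteoarthritis", "cartilage"},
    "a2": {"osteoarthritis"},
    "a3": {"osteoarthritis", "mice"},
    "a4": {"mice"},
    "a5": {"mice"},
    "a6": {"cartilage"},
}
ranked = rank_contexts(toy_annotations, selected={"a1", "a2", "a3"})
```

In this toy set, osteoarthritis is ranked first (p = 1/20), followed by cartilage (p = 16/20) and mice (p = 19/20). Applying a `threshold` corresponds to reporting only the highest-ranked annotations.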
In the following, the application of network schemes for the detection of common contexts is described by two examples, namely the IL1 pathway and the MMP13 interaction network as extracted from the RelEx network (see Section 5.3.4). These networks are small
subnetworks of the entire RelEx network. By use of network schemes, these subnetworks can be functionally categorized.
IL1 Pathway:
disease: inflammation, rheumatoid arthritis, arthritis, osteoarthritis, acute phase reaction, septic shock, inflammatory response, synovitis, endotoxemia, thymoma, infection, fever
tissue: articular cartilage, vascular endothelium, cartilage, epidermis, Media, smooth vascular muscle, respiratory mucosa
cell: cultured cells, macrophages, cell line, monocytes, tumor cells cultured, fibroblasts, hela cells, macrophages peritoneal, granulocyte, chondrocytes, alveolar macrophages, keratinocytes, osteoblasts, jurkat cells, leukocytes mononuclear, epithelial cells, 3t3 cells, endothelial cells, osteoclasts, microglia, synoviocytes, neutrophils, kupffer cells, mesangial cells, astrocytes, cell line transformed, t cell, smooth muscle cells, bone marrow cells, u937 cells
body part: synovial membrane, glomerular mesangium, veins, gingiva, pulmonary alveoli, joint
organism: mice, mice inbred c3h, mice inbred balb c, mice knockout
MMP13 interaction network (red):
disease: osteoarthritis knee, osteoarthritis, rheumatoid arthritis, chondrosarcoma, periodontitis, osteosarcoma, carcinoma squamous cell, odontogenic cysts, arthritis
tissue: cartilage, tissue, cartilage articular, extracellular matrix, Bone
cell: chondrocytes, osteoblasts, fibroblasts, cells cultured, neutrophils, squamous cell
body part: synovial membrane, bone, parathyroid, skull, tibia
Table 5.7: Results of applying network schemes for the detection of common contexts: manually generated network of the IL1 pathway (presented in Section 5.3.4) and MMP13 interaction network as extracted from the RelEx network (restricted with term set red). The table shows only the highest-ranked annotations (p-value ≤ 10^-7 for the IL1 pathway, p-value ≤ 10^-5 for the MMP13 interaction network).
The results are shown in Table 5.7. BioCarta, which served as one data source for generating the manually curated IL1 network, provides a description for each pathway. The first sentences of the description for the IL1 pathway are given in the following:
Interleukin-1 (IL-1) is a pro-inflammatory cytokine that signals primarily through the type 1 IL-1 receptor (IL-1R1). The activities of IL-1 include induction of fever, expression of vascular adhesion molecules, and roles in arthritis and septic shock. The inflammatory activities of IL-1 are partially derived by transcriptionally inducing expression of cytokines such as TNF-alpha and interferons, as well as inducing the expression of other inflammation-related genes.
The terms found by the network-schema context analysis (Table 5.7) agree well with the BioCarta description. Several terms from the above description, such as fever, arthritis, and septic shock, are contained in the disease section of the overrepresented annotations. The section on cell types contains several cell types that play a role in the immune system. The overrepresentation of mice in the organism annotation indicates that phenomena induced by IL1 are often studied in this model organism.
The second example investigates context overrepresentation for interactions of MMP13. The top-ranked annotations indicate that these interactions are frequently discussed in the literature in the context of osteoarthritis and cartilage. It is known that MMP13 is involved in the destabilization of the joint cartilage collagen network and is used as a marker for hyper-catabolism in cartilage degradation (Section 6.4). Thus, it is plausible that MMP13 interactions are closely related to this context.
The network scheme approach thus makes it possible to detect common contexts for interactions. By this approach, individual interactions or network modules can be functionally characterized.