2.2 Approaches to represent chemical structures
2.2.1 Generation of computer molecular representations and descriptors from structure
2.2.1.1 Molecular Graph
Molecular graphs serve as a convenient model for representing chemical structures in a computer. In a molecular graph (usually an undirected, connected multigraph), the nodes correspond to the atoms and the edges to the bonds. Nodes and edges are labelled with the atom types and bond types, respectively. A graph represents only the topology of a molecule, that is, the way the nodes (atoms) are connected, and is therefore less suitable for modelling properties that are determined by molecular geometry, conformation or stereochemistry. The molecular graph can distinguish between structural isomers (compounds with the same molecular formula but non-isomorphic graphs); however, it normally contains no information about the 3D arrangement of the atoms and therefore cannot distinguish between conformational isomers or stereoisomers.
Thus a given graph may be drawn in many different ways and may not obviously correspond to a “standard” chemical diagram. The complexity of chemical systems is significantly reduced, and some aspects are lost, whenever they are modelled as graphs. It is also necessary to have a means of converting the molecular graph to and from a computer-readable format. This can be achieved in a variety of ways; common methods are linear notations and graph matrices (Leach & Gillet, 2007).
The connectivity of atoms through bonds gives rise to adjacency and distance matrices. The polynomials generated from these matrices may be treated as signatures of the molecules. Several molecular descriptors are then calculated from these matrices and polynomials (e.g. the degree of a node, eigenvalues, distance-based topological indices, etc.).
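As an illustration of the matrices described above, the following is a minimal sketch in plain Python, not taken from any cited work. It builds the adjacency and distance matrices for the hydrogen-suppressed graphs of two structural isomers, n-butane and isobutane, and derives two simple descriptors: the node degrees and the Wiener index (the sum of topological distances over all atom pairs), used here as one example of a distance-based topological index. The function names and the atom numbering are assumptions made for this example only.

```python
from collections import deque


def adjacency_matrix(n_atoms, bonds):
    """Symmetric matrix whose entries are bond orders (0 means no bond)."""
    adj = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j, order in bonds:
        adj[i][j] = order
        adj[j][i] = order
    return adj


def distance_matrix(adj):
    """Topological distances (bonds on the shortest path), found by BFS."""
    n = len(adj)
    dist = [[None] * n for _ in range(n)]
    for start in range(n):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if adj[u][v] and dist[start][v] is None:
                    dist[start][v] = dist[start][u] + 1
                    queue.append(v)
    return dist


def node_degrees(adj):
    """Number of bonded neighbours of each atom."""
    return [sum(1 for entry in row if entry) for row in adj]


def wiener_index(dist):
    """Sum of the distances between all unordered pairs of atoms."""
    n = len(dist)
    return sum(dist[i][j] for i in range(n) for j in range(i + 1, n))


# Hydrogen-suppressed graphs: four carbon atoms, bonds as (atom, atom, order).
n_butane = [(0, 1, 1), (1, 2, 1), (2, 3, 1)]   # C-C-C-C chain
isobutane = [(0, 1, 1), (1, 2, 1), (1, 3, 1)]  # central carbon with 3 neighbours

for name, bonds in (("n-butane", n_butane), ("isobutane", isobutane)):
    adj = adjacency_matrix(4, bonds)
    dist = distance_matrix(adj)
    print(name, node_degrees(adj), wiener_index(dist))
```

The two isomers share the molecular formula C4H10 but give different degree sequences and different Wiener indices, which is how a purely topological representation separates structural isomers.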
Linear Notations
Chemical line notations represent chemical structures as compact linear strings of alphanumeric symbols that are easily handled by computers and allow fast manual coding/decoding by trained users (faster than drawing a structure). Table 2.1 shows some examples of different line notations for the molecule 1H-Indole-2,3-dione.
Table 2.1: Different line notations for the molecule 1H-Indole-2,3-dione.
Systematic Name: 1H-Indole-2,3-dione
Name synonyms: Isatin; Indole-2,3-dione; Isatic acid lactam; Isatine; 2,3-Diketoindoline; Dioxo-dihydroindole; Dioxoindoline; Indolinedione; 2,3-Dihydro-1H-indole-2,3-dione; 3-Hydroxy-2-oxoindole
WLN: T56 BMVVJ
ROSDAL: 1=2-3=4-5-6-1-9N-8-7-6,7=10O,8=11O
SLN: O=C(N(H)C[2]=CC=CC=C(@4)@2)C[4]=O
SMILES: O=C(C1=O)Nc2c1cccc2
Unique SMILES: [NH]1[C]([C]([c]2[cH][cH][cH][cH][c]12)=[O])=[O]
InChI: InChI=1/C8H5NO2/c10-7-5-3-1-2-4-6(5)9-8(7)11/h1-4H,(H,9,10,11)
InChIKey: JXDYKVIHCLTXOP-UHFFFAOYSA-N
An early (1949) and remarkably compact fragment-based line notation that became quite widely used was the Wiswesser Line Notation (WLN). This notation uses a complex set of rules to represent different functional groups (with more than 40 symbols) and the way they are connected in the molecule, which makes it difficult to code and error-prone. Most of the complexity of the notation lies in determining the order in which the symbols are to occur so as to achieve not only a complete and unambiguous representation of the structure but also a unique, or canonical, representation (Wiswesser, 1954). From 1985 on, the Representation of Organic Structures Description Arranged Linearly (ROSDAL) notation was developed by Welford, Barnard and Lynch with support from the Beilstein Institute. The ROSDAL generation process is straightforward: six rules suffice to encode an organic molecule as an unambiguous, but not unique, alphanumeric string. Nevertheless, its use was restricted to the Beilstein-DIALOG system (Barnard et al., 1989).
In 1988, Weininger, at the US Environmental Protection Agency (USEPA) Environmental Research Laboratory, released the Simplified Molecular Input Line Entry System (SMILES) for chemical data processing, which has since become a widely adopted standard chemical notation. Compared with WLN and ROSDAL, SMILES is more intuitive and uses a set of six very basic rules to convert a structure into a string: (1) atoms are represented by their atomic symbols; (2) hydrogen atoms are omitted; (3) neighbouring atoms stand next to each other; (4) double and triple bonds are denoted by "=" and "#", respectively; (5) branches are enclosed in parentheses; (6) rings are described by assigning the same digit to the two connecting ring atoms (Weininger, 1988). Information about chirality and geometrical isomerism can also be included in the SMILES notation. The absolute stereochemistry at chiral atoms is indicated using the "@" symbol, and geometrical (cis–trans) isomerism about double bonds is indicated using slashes.
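The six SMILES rules listed above can be sketched as a very small reader that turns a SMILES string back into a list of atoms and bonds. This is a simplified illustration under stated assumptions, not the Daylight implementation: it only accepts unbracketed organic-subset atoms, the bond symbols "-", "=", "#" and ":", parentheses, and single-digit ring closures. It ignores implicit hydrogens and stereochemistry, and it treats any unmarked bond between two lowercase (aromatic) atoms as aromatic. The names `parse_smiles` and `bond_order` are hypothetical.

```python
import re

# Tokens of the supported subset; two-letter halogens must be tried first.
TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]|[-=#:]|[()]|\d")
BOND_ORDER = {"-": 1, "=": 2, "#": 3, ":": 1.5}


def bond_order(symbol, atom_a, atom_b):
    """Explicit symbol wins; otherwise aromatic (1.5) between lowercase atoms."""
    if symbol:
        return BOND_ORDER[symbol]
    return 1.5 if atom_a.islower() and atom_b.islower() else 1


def parse_smiles(smiles):
    """Return (atoms, bonds); bonds are (index_a, index_b, order) tuples."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError("unsupported SMILES feature in %r" % smiles)
    atoms, bonds = [], []
    prev = None          # index of the atom the next atom will bond to
    pending = None       # explicit bond symbol waiting for its second atom
    branch_stack = []    # atoms to return to when a branch closes (rule 5)
    open_rings = {}      # ring digit -> (atom index, bond symbol) (rule 6)
    for tok in tokens:
        if tok == "(":
            branch_stack.append(prev)
        elif tok == ")":
            if not branch_stack:
                raise ValueError("unbalanced ')'")
            prev = branch_stack.pop()
        elif tok in BOND_ORDER:
            pending = tok
        elif tok.isdigit():
            if prev is None:
                raise ValueError("ring digit before any atom")
            if tok in open_rings:
                partner, opening_symbol = open_rings.pop(tok)
                symbol = pending or opening_symbol
                order = bond_order(symbol, atoms[partner], atoms[prev])
                bonds.append((partner, prev, order))
            else:
                open_rings[tok] = (prev, pending)
            pending = None
        else:  # an atom symbol (rules 1 and 3)
            atoms.append(tok)
            index = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, index, bond_order(pending, atoms[prev], tok)))
            prev, pending = index, None
    if open_rings or branch_stack:
        raise ValueError("unclosed ring or branch")
    return atoms, bonds


atoms, bonds = parse_smiles("O=C(C1=O)Nc2c1cccc2")  # SMILES from Table 2.1
print(len(atoms), "heavy atoms;", len(bonds), "bonds")  # 11 heavy atoms; 12 bonds
```

For the SMILES string of 1H-Indole-2,3-dione in Table 2.1, the sketch recovers 11 heavy atoms (C8, N, O2) and 12 bonds, two of which are ring-closure bonds, consistent with the bicyclic structure.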
The notation was later extended, mainly by Daylight Chemical Information Systems Inc., and several languages were derived from it, such as SMiles ARbitrary Target Specification (SMARTS) for substructural pattern searching and SMIles ReaKtion Specification (SMIRKS) for encoding reaction transformations (James et al., 2011). A special extension of SMILES is Unique SMILES (USMILES), a canonical and unambiguous representation of a structure, guaranteed by a proprietary algorithm (Weininger et al., 1989). This has led to the use of different generation algorithms and/or different implementations, and thus different SMILES strings for the same compound can be found.
The SYBYL Line Notation (SLN), a product of Tripos Inc., is a specification for unambiguously describing the structure of chemical molecules using short ASCII strings (Ash et al., 1997). SLN was inspired by the SMILES notation but differs from it in several ways. SLN can specify molecules, molecular queries and reactions in a single line notation, whereas SMILES handles these through language extensions. SLN also supports relative stereochemistry and can distinguish mixtures of enantiomers from pure molecules with pure but unresolved stereochemistry. In SMILES, aromaticity is considered to be a property of both atoms and bonds, whereas in SLN it is a property of bonds only.
The most recent line notation is the IUPAC International Chemical Identifier (InChI) (Stein et al., 2003). Like the USMILES notation, InChI provides a canonical serialization of molecular structure; unlike SMILES, its main objective is to identify a compound in a unique and non-proprietary manner. SMILES, on the other hand, is more human-readable for an expert and can be used for substructure search and analysis. InChI also allows tautomeric forms to be detected and mobile hydrogen atoms to be grouped together. Every InChI starts with the fragment “InChI=” followed by the version number. Structural information is organized in six layers and sub-layers, each describing a different aspect of the molecule. A special form of line notation developed to facilitate searching is the InChIKey, a condensed version of the InChI created by hashing with the Secure Hash Algorithm (SHA-256), with a fixed length of 27 characters. InChI and InChIKey are currently used by several public and commercial databases as well as by scientific journals.
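The layered structure of an InChI and the fixed-length layout of an InChIKey can be illustrated with plain string handling. The sketch below is a hedged example, not part of the official InChI software: it only splits the strings from Table 2.1 into their visible parts and does not generate or validate identifiers. The function names and dictionary keys are labels chosen for this example.

```python
INCHI = "InChI=1/C8H5NO2/c10-7-5-3-1-2-4-6(5)9-8(7)11/h1-4H,(H,9,10,11)"
INCHIKEY = "JXDYKVIHCLTXOP-UHFFFAOYSA-N"


def split_inchi(inchi):
    """Split an InChI into version, formula and (prefix, content) layers."""
    prefix, sep, body = inchi.partition("=")
    if prefix != "InChI" or not sep:
        raise ValueError("an InChI must start with 'InChI='")
    version, formula, *layers = body.split("/")
    # Each remaining layer starts with a one-letter prefix, e.g. 'c' for
    # connectivity and 'h' for hydrogen atoms.
    return version, formula, [(layer[0], layer[1:]) for layer in layers]


def split_inchikey(key):
    """Split a 27-character InChIKey into its hyphen-separated blocks."""
    blocks = key.split("-")
    if len(key) != 27 or [len(b) for b in blocks] != [14, 10, 1]:
        raise ValueError("expected blocks of 14, 10 and 1 characters")
    first, second, last = blocks
    return {
        "skeleton_hash": first,            # hash of the connectivity layers
        "other_layers_hash": second[:8],   # hash of the remaining layers
        "standard_flag": second[8],        # 'S' standard, 'N' non-standard
        "version_flag": second[9],         # 'A' for InChI version 1
        "protonation_flag": last,          # 'N' for a neutral species
    }


version, formula, layers = split_inchi(INCHI)
print(version, formula)
for layer_prefix, content in layers:
    print(layer_prefix, content)
print(split_inchikey(INCHIKEY))
```

Because the InChIKey has a fixed length and alphabet, it is convenient as a database or web search key, whereas the full InChI keeps the structural layers readable.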
Viewing, editing and converting chemical formats
Numerous computer applications are available to handle molecular structure information. Because chemical data has its own specific requirements, many formats have been developed to facilitate information exchange (Gasteiger, 2003). The most widely used file formats in chemistry are summarized in Table 2.2.
When working with chemical information in cheminformatics, creating, querying, modifying and saving representations of chemical structures are very important tasks. For that purpose, several molecule editors are available to manipulate chemical structure representations in either 2D or 3D. Typically, molecule editors support reading and writing at least one of the file formats mentioned above, and they can mainly be divided into stand-alone programs and web-based applications (applets) (Gasteiger, 2003; Gunda, 2011). Table 2.3 summarizes the most widely used molecule editors/viewers. The wide variety of chemical structure representations in use has inevitably resulted in a need to interconvert them. Open Babel (O’Boyle et al., 2008, 2011) and JOELib (Guha et al., 2006) are freely available open-source tools specifically designed for converting between file formats.
Table 2.2: Summary of the most widely used file formats for exchanging chemical information.
MDL Molfile (*.mol): The MDL Molfile contains information about the atoms, bonds, connectivity and coordinates of a molecule. It is the most widely used connection table format. http://accelrys.com/
Structure-Data file – SDfile (*.sdf): The Structure-Data file is an extension of the MDL Molfile containing one or more compounds and the ability to include associated data. http://accelrys.com/
Reaction-Data file – Rdfile (*.rdf): The Reaction-Data file is an extension of the MDL Molfile containing one or more sets of reactions. http://accelrys.com/
SMILES (*.smi): SMILES is the most widely used linear text format which can describe the connectivity and chirality of a molecule. http://www.daylight.com
Canonical SMILES (*.can): The Canonical SMILES format (can) produces a canonical representation of the molecule in SMILES format. http://www.daylight.com
Chemical Markup Language – CML (*.cml): CML is an open standard for representing molecular and other chemical data. The open source project includes XML Schema, source code for parsing and working with CML data, and an active community. CML data files are accepted as input of many chemical applications. http://www.xml-cml.org
IUPAC’s InChI (*.inchi): IUPAC’s InChI file format, which contains only structure definitions in a unique and predictable ASCII character string. http://www.iupac.org/inchi/