Chapter 4 New infrastructures for structural data
4.2. Objectives
4.3.1. Establishing relationships with atomic models
The complementary nature of atomic data and 3D-EM maps was realised during the creation of the BioImage database, where we first created a data model for the organisation of combined X-ray/EM studies [132]. The result of a fitting experiment is, in most cases, a new model (at “pseudo-atomic” resolution) that can now be deposited in the Protein Data Bank (PDB). However, the whole experimental procedure can only be characterised in detail if both the 3D-EM data and the atomic models are fully described. This will be handled by appropriate cross-references between the PDB and the EMD, the new infrastructure for the deposition of 3D-EM data.
4.3.2. Architecture
The Electron Microscopy Database (EMD) is being designed and developed from the very beginning to be fully integrated and compatible with the structural data in the PDB, enabling future tools and services to provide, when possible, structural information and knowledge regardless of the resolution level achieved by the experimental method.
The policy for new macromolecular structural data submissions to PDB and EMD will be the following:
Atomic models obtained by high-resolution 3D-EM should be deposited in the PDB.
Atomic models obtained by fitting atomic coordinate data into 3D-EM maps should be deposited in the PDB.
3D-EM maps used in fitting experiments should be deposited in the EMD database. Appropriate links between PDB models and 3D-EM maps in the EMD will be provided.
Therefore, the MSD data model will enable access to macromolecular structural data, regardless of its nature.
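The deposition policy above can be read as a simple routing rule. The following is a minimal sketch of that rule, assuming a hypothetical `DataKind` enumeration and a hypothetical `target_archive` function; neither name belongs to the MSD or EMD software.

```python
from enum import Enum, auto


class DataKind(Enum):
    """Hypothetical categories of data produced in a 3D-EM study."""
    ATOMIC_MODEL_HIGH_RES_EM = auto()  # atomic model obtained by high-resolution 3D-EM
    ATOMIC_MODEL_FITTED = auto()       # atomic coordinates fitted into a 3D-EM map
    EM_MAP = auto()                    # the 3D-EM map itself


def target_archive(kind: DataKind) -> str:
    """Return the archive that receives a submission under the stated policy."""
    if kind in (DataKind.ATOMIC_MODEL_HIGH_RES_EM, DataKind.ATOMIC_MODEL_FITTED):
        return "PDB"
    if kind is DataKind.EM_MAP:
        return "EMD"
    raise ValueError(f"unknown data kind: {kind!r}")
```

In this reading, a fitting study produces two submissions: the fitted model goes to the PDB, the map goes to the EMD, and the two entries are cross-referenced.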
4.3.3. Phases and methodology
Working in collaboration with the MSD group at the European Bioinformatics Institute, and building on our previous results obtained during the design and development of the BioImage database (see chapter 3), we redesigned, extended and developed a new infrastructure for the organisation of 3D-EM structural data to be managed and publicly accessible at the EBI.
4.3.3.a. Definition
The first stage of the development of the new database was devoted to defining the relevant data and complementary information to be archived. Building on information previously compiled for the BioImage database, the Content and Preservation Description information for 3D-EM studies was compiled.
The main purpose of the EMD is to provide a central repository for 3D-EM maps, i.e. structural data reconstructed by 3D-EM, plus additional descriptive information (or meta-descriptors) and additional data files. One 3D-EM map corresponds to one EMD entry (i.e. a single accession code). Apart from 3D-EM maps, other complementary information will be stored:
1. Textual descriptors: Together with the 3D-EM maps, a set of textual annotations has been defined. These cover all aspects of the experimental procedure (from sample preparation through image acquisition and processing), a detailed description of the biological specimen being studied, and reference data (authors, bibliographic references, etc.). Appropriate links to other biological databases have also been identified (e.g. NCBI taxonomy, Gene Ontology, InterPro). Meta-descriptors cover all areas needed to characterise the results of a 3D-EM experiment: the biological sample being studied, the experimental conditions (sample preparation, data acquisition and data processing), and the structural results in terms of a 3D-EM map, as well as any administrative and reference data (such as bibliographic references). These descriptors have been categorised as “mandatory”, i.e. those that should always be provided by the author, or “optional”, i.e. those that are desirable to store but that the author may decide not to provide. This categorisation lets authors choose the level of detail with which they describe the results of their experiments, while ensuring a common minimum description for all data in the database.
2. Complementary data files: additional data files that might provide supplementary and relevant information on the experiment performed. These can be further classified as:
Supplementary figures, for illustrating important aspects of the resulting structures or the experimental intermediate data.
3D surface data (masks), for iso-surface rendering purposes, provided in a binary map format.
Structure factors (only in crystallographic experiments) and layer line data (only in helical reconstructions). Sending these data to the EMD is optional, although we strongly encourage depositing them.
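The content of an entry described above (one map, a set of mandatory and optional descriptors, and optional complementary files) can be sketched as a small data structure. This is an illustrative sketch only: the class name `EMDEntry`, its field names, and the three descriptor names in `MANDATORY_DESCRIPTORS` are assumptions chosen for the example, not the actual EMD data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical mandatory descriptor names; the real set is defined by the EMD data model.
MANDATORY_DESCRIPTORS = ("sample_name", "reconstruction_method", "authors")


@dataclass
class EMDEntry:
    """One 3D-EM map plus its descriptors and complementary files."""
    accession_code: str
    map_file: str
    descriptors: Dict[str, str] = field(default_factory=dict)
    supplementary_figures: List[str] = field(default_factory=list)
    masks: List[str] = field(default_factory=list)
    structure_factors: List[str] = field(default_factory=list)  # crystallographic experiments only
    layer_lines: List[str] = field(default_factory=list)        # helical reconstructions only

    def missing_mandatory(self) -> List[str]:
        """List the mandatory descriptors that are absent or empty."""
        return [name for name in MANDATORY_DESCRIPTORS if not self.descriptors.get(name)]
```

A submission tool could call `missing_mandatory()` before accepting a deposition, leaving all optional descriptors and complementary files to the author's discretion.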
4.3.3.b. Database design and integration
An entity-relationship model was created for 3D electron microscopy data using Oracle Designer 2000. This model was subsequently integrated with the existing model of atomic coordinate data in the MSD (containing over 400 tables that describe the results of NMR and X-ray crystallography experiments). MSD data are maintained and managed at the EBI in a relational database implemented with the Oracle database system (Oracle Corporation, www.oracle.com).
Appropriate relationships have been carefully analysed for those entities representing biological information, both in terms of integration with the MSD and in the context of other relevant biological databases. This integration with existing biological databases is essential for providing cross-references, and it is a valuable resource for establishing a common nomenclature through the adoption of widely used controlled vocabularies and ontologies.
Because of the special characteristics of 3D-EM structural data (i.e. maps), a further analysis of the requirements for map representation and storage was performed. Currently, there is no standard format for 3D maps in the 3D-EM community: several proprietary volume formats are used by the different software packages for three-dimensional reconstruction by electron microscopy (e.g. MRC, Brandeis, Duchy, Synu, EM, IVE, IMAGIC, BMD, PIC, SUPRIM, Semper, Spider). Nevertheless, a single 3D-EM map format has been adopted by the EMD: the CCP4 (Collaborative Computational Project Number 4 for Protein Crystallography, Daresbury, UK) map format [154], which is used in both the X-ray crystallography and electron microscopy domains.
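To illustrate what adopting a single map format means in practice, the sketch below reads a few fields from a CCP4 map header. It assumes the commonly documented layout (a 1024-byte header whose first ten 4-byte words are integers, followed by six floating-point cell parameters, with the characters "MAP " in word 53) and assumes little-endian byte order by default; the function name `read_ccp4_header` is hypothetical, and the format specification [154] remains the authoritative reference.

```python
import struct

HEADER_SIZE = 1024  # bytes: 256 four-byte words


def read_ccp4_header(raw: bytes, byte_order: str = "<") -> dict:
    """Extract grid and cell information from the start of a CCP4 map file."""
    if len(raw) < HEADER_SIZE:
        raise ValueError("header is shorter than 1024 bytes")
    # Words 1-10: columns, rows, sections, data mode, start indices, grid sampling.
    nc, nr, ns, mode, ncstart, nrstart, nsstart, nx, ny, nz = struct.unpack_from(
        byte_order + "10i", raw, 0
    )
    # Words 11-16: cell edge lengths followed by cell angles.
    cell = struct.unpack_from(byte_order + "6f", raw, 40)
    # Word 53 (byte offset 208) identifies the file type.
    if bytes(raw[208:212]) != b"MAP ":
        raise ValueError("missing 'MAP ' identifier in word 53")
    return {
        "dimensions": (nc, nr, ns),
        "mode": mode,
        "start": (ncstart, nrstart, nsstart),
        "sampling": (nx, ny, nz),
        "cell_lengths": cell[:3],
        "cell_angles": cell[3:],
    }
```

A production reader would also inspect the machine stamp that follows the "MAP " word to determine byte order instead of assuming it.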
4.3.3.c. Development of EMD interfaces
An important aspect for the success of this kind of initiative, which is often neglected, is the interaction and collaboration with the scientific community that produces the data. It is essential to avoid any potential obstacle (either technical or sociological) in the way of the data from the author’s laboratory to the database. At the end of the day, the value of a database is the value of the data it contains.
Data ingest
Appropriate tools for data conversion should be used during ingest so that the 3D-EM maps are stored and managed homogeneously in the archive. The submission system converts uploaded maps to the CCP4 format using Image Science's EM2EM map conversion utility (see http://www.ImageScience.de/em2em/).
Data dissemination
EMD data will be disseminated as a set of files. 3D-EM maps and complementary data files are distributed in their archive formats (e.g. CCP4 for 3D-EM maps), while the textual descriptors require the development of an XML file format. This XML file format is intended for data distribution and download, not for data management. The EMD XML file format is described in terms of its corresponding XML Schema.
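As an illustration of distributing textual descriptors as XML, the following minimal sketch serialises a set of descriptors with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`emdEntry`, `descriptor`, `accessionCode`, `name`) and the function `descriptors_to_xml` are hypothetical; the actual names are those defined by the EMD XML Schema.

```python
import xml.etree.ElementTree as ET


def descriptors_to_xml(accession_code: str, descriptors: dict) -> str:
    """Serialise an entry's textual descriptors as an XML string."""
    # Hypothetical root element carrying the accession code as an attribute.
    root = ET.Element("emdEntry", attrib={"accessionCode": accession_code})
    for name, value in descriptors.items():
        # One child element per descriptor, with the value as element text.
        child = ET.SubElement(root, "descriptor", attrib={"name": name})
        child.text = value
    return ET.tostring(root, encoding="unicode")
```

Because the file is meant for distribution only, such an export step can be regenerated from the relational database at any time, and the database remains the managed copy.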
A release lock-in period (of up to 4 years) can be placed on the 3D-EM map by the author, while the descriptive information will be released immediately (after it has been reviewed by the authors). 3D-EM maps should be sent to the 3D-EM MSD to obtain an accession code.
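The release rule can be sketched as a date calculation. This sketch assumes that the lock-in period is expressed in whole years counted from the deposition date, which the text does not specify; the function name `map_release_date` and the constant `MAX_HOLD_YEARS` are illustrative.

```python
from datetime import date

MAX_HOLD_YEARS = 4  # upper limit of the lock-in period stated in the text


def map_release_date(deposition: date, hold_years: int) -> date:
    """Return the date on which the 3D-EM map becomes public."""
    if not 0 <= hold_years <= MAX_HOLD_YEARS:
        raise ValueError(f"hold must be between 0 and {MAX_HOLD_YEARS} years")
    target_year = deposition.year + hold_years
    try:
        return deposition.replace(year=target_year)
    except ValueError:
        # 29 February deposited, target year is not a leap year: use 28 February.
        return deposition.replace(year=target_year, day=28)
```

Under these assumptions a hold of zero years releases the map together with its descriptive information, and any request beyond four years is rejected.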