File formats - Python Programming for Biology_

When data is stored in a plain text file, interpreting the individual characters is trivial, but to understand what the contents of the file actually mean we have to understand how the data inside the file is structured. This is just like written language, where knowing the alphabet is not enough; we also need to understand concepts like words, sentences and punctuation. Ultimately it is the decision of the computer programmer as to how the data in saved files is structured. However, where it is important that files should be understood by a variety of programs, the data will be represented in a standardised, and hopefully documented, way. The data standard for a file is often referred to as a file format. Virtually all plain text formats organise the stored data into lines, which are separated by special end-of-line control characters at the end of each line.

A very common file structure is to have one record of data per line, often with a single header line at the top of the file to describe the contents of the lines. One possibility is that the file represents a table, with each line describing one row. The fields (or cells) in each row, one for each column of the table, may be demarcated in various ways: according to position within the row, i.e. the position of a character relative to the start of the line, or by special separating characters, like commas or whitespace (tabs, blank spaces etc.). An alternative to a fixed order of fields is that the lines consist of pairs of named keys (identifiers) and corresponding values. Here the keys will generally come from a fixed, allowed set, and in some instances the data value addressed by a single key may span many lines.
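As a rough illustration of these three layouts, the following sketch splits a single line into fields by separator, by character position, and as a key with its value. The function names, the sample lines and the column positions are invented for this example and do not correspond to any particular file format.

```python
def parse_delimited(line, sep=","):
    """Split one record into fields using a separator character."""
    return [field.strip() for field in line.rstrip("\n").split(sep)]


def parse_fixed_width(line, columns):
    """Cut fields out of a line by character position.

    columns is a list of (start, end) index pairs; the positions used
    below are hypothetical.
    """
    return [line[start:end].strip() for start, end in columns]


def parse_key_value(line, sep="="):
    """Split a line into a key and its value at the first separator."""
    key, value = line.split(sep, 1)
    return key.strip(), value.strip()


# Separator-based fields
csv_fields = parse_delimited("ALA, 1, 12.5\n")

# Position-based fields: characters 0-2, 3-6 and 7-12
fixed_fields = parse_fixed_width("ALA   1  12.5", [(0, 3), (3, 7), (7, 13)])

# A key with its value
pair = parse_key_value("resolution = 1.80")
```

Note that all three functions return text; converting a field such as '12.5' into a number is a separate step that depends on what the column is known to contain.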

Another common file structure is to have tags that identify and specify the beginning and end of a record. Often these tags can be nested, one inside the other, thus denoting containment. For example, the XML (eXtensible Markup Language) data standard uses tags where the record starts with text like ‘<NAME>’ and ends with the text ‘</NAME>’, where here ‘NAME’ is the identifier for the element.
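A minimal sketch of reading nested tags, using the xml.etree.ElementTree module from the Python standard library. The element names and text values here are made up for illustration and are not taken from a real data file.

```python
import xml.etree.ElementTree as ET

# A tiny, hypothetical XML record: the CHAIN element is nested
# (contained) inside the PROTEIN element.
xml_text = "<PROTEIN><NAME>Lysozyme</NAME><CHAIN><NAME>A</NAME></CHAIN></PROTEIN>"

root = ET.fromstring(xml_text)

# find() with a plain tag name looks only at direct children of root
protein_name = root.find("NAME").text

# A path with '/' steps down through the containment hierarchy
chain_name = root.find("CHAIN/NAME").text
```

The same identifier, 'NAME', appears twice, but its meaning is clear from the element that contains it.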

Sometimes a programming language will have its own, inbuilt formats for representing data structures created in that language. This process is referred to as serialisation. In general the serialisation format will be specific to the language in question, and may require special modules to be installed. However, if data is only going to be used in a single language environment, using serialisation can offer an efficient means to store the active data, in terms of both speed and programming ease. Python has a serialisation method which is referred to as pickling. Such ‘pickle’ files may be textual (as with the original pickle protocol shown here; the default protocol in Python 3 is binary), but they are not so easy for a person to interpret. For example, the Python list data structure:

x = [1, 2, 'a', 'b', True, None]

is saved by the pickling method as:

(lp1
I1
aI2
aS'a'
aS'b'
aI01
aNa.

Given most file formats, it is normally easier to write files than it is to read them; it is easier to extract data from a controlled and standardised in-memory representation (e.g. Python structures) than it is to interpret someone else’s text, which may not be standardised or even fully understood. So when writing a program to read a file you need to parse the file (determine its syntactic structure) and also confirm that the content is valid, however that may be defined. When writing a file you just have to make sure that you are following the rules for the file layout.

A common programming paradigm is to first read one or more files, do some processing and then write out one or more files. Although the processing is normally the major objective of the program, it is not uncommon to have situations where most effort needs to be spent creating the code to do the reading and writing of the files, especially for simple programs. If there is already an existing piece of tested code, such as the importable BioPython module, which can be used to read and write files, then this is often used in preference to spending time writing something new. See Chapter 11 for examples that use BioPython to read and write files.
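The read, process and write pattern can be sketched as follows, assuming a made-up two-column 'name,value' layout with one header line. The function names are hypothetical, and io.StringIO objects stand in for real files so that the example is self-contained. Notice how much more code the reader needs than the writer.

```python
import io


def read_records(file_obj):
    """Parse 'name,value' lines, skipping the header, and validate each one."""
    records = []
    file_obj.readline()  # discard the header line describing the columns

    for line_number, line in enumerate(file_obj, start=2):
        line = line.strip()
        if not line:
            continue  # ignore blank lines

        fields = line.split(",")
        if len(fields) != 2:
            message = "line %d: expected 2 fields, found %d"
            raise ValueError(message % (line_number, len(fields)))

        name, value_text = fields
        value = float(value_text)  # raises ValueError if not a number
        records.append((name.strip(), value))

    return records


def write_records(file_obj, records):
    """Write records back out in the same hypothetical layout."""
    file_obj.write("name,value\n")
    for name, value in records:
        file_obj.write("%s,%.2f\n" % (name, value))


# Read, process (double each value), then write
in_file = io.StringIO("name,value\nalpha,1.5\nbeta,2\n")
records = read_records(in_file)
doubled = [(name, value * 2) for name, value in records]

out_file = io.StringIO()
write_records(out_file, doubled)
output_text = out_file.getvalue()
```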

Figure 6.1. An example of position (column) formatted data. This extract is from a Protein Data Bank file which is used to represent data about molecules and their three-dimensional structures.

Figure 6.2. An example of a plain text format that uses keys and values. Here the last value extends over multiple lines. This extract is a fragment of a file format called mmCIF. As with the PDB format, mmCIF is used to represent data about biological molecules and their structure.

Figure 6.3. An example of a file format containing tagged elements. A truncated fragment of an XML file from the Protein Data Bank (PDBML).

Reading files

Using ‘open’

When reading a file within a Python program, you need to specify the location and the name of the file, collectively termed the path to the file. The file’s path is then used by the inbuilt open() function, or similar, to get access to the file’s contents. The path can be absolute, which is to say that it starts at the top or root of the operating-system file hierarchy, or the path can be relative to the current working directory (folder). Initially the current working directory is the directory where the Python interpreter starts, although it can be changed from inside the code if needed. In Python the file path is just a string, so, for example, you could have:

path = 'examples/dataFile.txt'
fileObj = open(path)
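The difference between absolute and relative paths can be seen in the following sketch, which uses the standard os and tempfile modules. The temporary directory and the file contents are created purely for the example; the file is first written (using the 'w' mode of open()) so that there is something to read.

```python
import os
import shutil
import tempfile

# Create a small file to read, inside a temporary directory
work_dir = tempfile.mkdtemp()
abs_path = os.path.join(work_dir, "dataFile.txt")
out_file = open(abs_path, "w")
out_file.write("first line\n")
out_file.close()

# An absolute path starts at the root of the file hierarchy
is_absolute = os.path.isabs(abs_path)

# A relative path is resolved against the current working directory,
# which can be queried with getcwd() and changed with chdir()
original_dir = os.getcwd()
os.chdir(work_dir)
file_obj = open("dataFile.txt")
first_line = file_obj.readline()
file_obj.close()

# Restore the working directory and tidy up the temporary files
os.chdir(original_dir)
shutil.rmtree(work_dir)
```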

The open function passes back what is termed a file handle, a Python object used to access and represent the open file. Here the file handle is stored in a variable called fileObj. It might be tempting to use the more natural variable name ‘file’, but in Python 2 the word ‘file’ is already the name of the built-in type (class) of such objects. Hence, it is best to avoid overriding the internal name; although Python 3 no longer defines ‘file’, avoiding it remains a sensible habit. An alternative to the descriptive but verbose ‘fileObj’ is the shorter ‘fh’, meaning ‘file handle’.

There is the possibility that a program may fail to open the file with the specified path. For example, the stated file may simply not exist (e.g. the name was wrong), or you may not have permission to read it. Under such circumstances, when you try to open it the function will raise an error, a standard Python exception (IOError; in Python 3 this is an alias of OSError, which has more specific subclasses such as FileNotFoundError and PermissionError). Otherwise, if all goes well, when you are done reading a file you can explicitly say you are finished by using close(), a method of the file object class. A listing of methods associated with file objects is given in Appendix 2.
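One way to deal with a file that cannot be opened is to catch the exception, as in this minimal sketch. The function name and the deliberately non-existent path are invented for the example, and returning None on failure is just one possible design choice.

```python
def read_first_line(path):
    """Return the first line of a file, or None if it cannot be opened."""
    try:
        file_obj = open(path)
    except IOError as err:
        # Reached if the file is missing or cannot be read
        print("Could not open %s: %s" % (path, err))
        return None

    line = file_obj.readline()
    file_obj.close()
    return line


# This path is assumed not to exist
result = read_first_line("no_such_directory/no_such_file.txt")
```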

fileObj.close()

Usually you do not have to explicitly close a file, because when the file object is no longer referenced, such as when the function that opened it returns or the variable fileObj is assigned to something else, Python will automatically close it. However, it is generally a good habit to explicitly close a file handle when you know you are done, just in case it is important. Also, this lets someone who reads your code know that you have finished using the file.
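Python also provides the 'with' statement, which closes the file automatically at the end of the indented block, even if an error occurs inside it. The sketch below is one way to use it; the temporary file and its contents are made up for the example.

```python
import os
import tempfile

# Make an empty temporary file to work with
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

# The file is closed as soon as each 'with' block ends
with open(path, "w") as out_file:
    out_file.write("ATGC\n")

with open(path) as file_obj:
    lines = file_obj.readlines()

# The variable still exists here, but the file it refers to is closed
closed_after_block = file_obj.closed

os.remove(path)
```

This form makes the point at which the file is finished with visible in the layout of the code, without the need for a separate close() call.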

The open() function can take an optional second argument: a ‘mode’, specifying how the file should be opened. These modes include ‘r’ to read from a file and ‘w’ to write to a file, but, as you might have spotted in the above example, the ‘r’ is optional, given that it is the default. We could still have stated the reading mode explicitly:

fileObj = open(path, 'r')
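The two modes can be compared in a short sketch like the one below, where the temporary directory and file name are invented for the example. One point worth noticing is that opening an existing file with 'w' discards its previous contents.

```python
import os
import shutil
import tempfile

work_dir = tempfile.mkdtemp()
path = os.path.join(work_dir, "modes.txt")

# 'w' creates the file if it does not exist
file_obj = open(path, "w")
file_obj.write("one\n")
file_obj.close()

# Opening with 'w' again empties the file before writing
file_obj = open(path, "w")
file_obj.write("two\n")
file_obj.close()

# Explicit reading mode
file_obj = open(path, "r")
contents = file_obj.read()
file_obj.close()

# Default mode, which is also reading
file_obj = open(path)
same_contents = file_obj.read()
file_obj.close()

shutil.rmtree(work_dir)
```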

In document Python Programming for Biology_ - Tim J. Stevens (Page 95-98)