Reading Files - Dealing with Files - Python for Bioinformatics

Dealing with Files

5.1 Reading Files

Reading a file is a three-step process in Python:

1. Open the file: The built-in function open creates a filehandle, which is used to refer to the file throughout its lifetime. The open function takes two parameters: the name of the file and the opening mode. The file name is a string, in most cases including the system path. When the full system path is included, that absolute path is used by the program. If you enter just the file name (without any path), it is treated as a path relative to the current working directory.² The second parameter accepts the following values: “r” to read, “w” to write and “a” to append data at the end of a file. The default value is “r”. If you want to open a file for both reading and writing, use “r+”.

¹ These kinds of files are often called CSV files and they are covered on page 92.

² Use os.getcwd() in case you need to know the current path.

Using open to create a file handle for reading a file:

>>> fh = open('/home/sb/Readme.txt')
>>> fh

Entering fh at the prompt displays a file object, not the file's contents: fh is not the file, but a reference to it. Since reading mode is the default option, it was omitted.
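The opening modes described in step 1 can be tried out with a short sketch like the one below. It is only an illustration under stated assumptions: the file name notes.txt and the temporary directory are hypothetical, chosen so that no real file is touched.

```python
import os
import tempfile

# Work in a throwaway directory; "notes.txt" is a hypothetical file name.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "notes.txt")

# "w" creates the file (or empties an existing one) for writing.
fh = open(path, "w")
fh.write("first line\n")
fh.close()

# "a" appends at the end, keeping what is already there.
fh = open(path, "a")
fh.write("second line\n")
fh.close()

# "r" is the default mode, so it can be omitted.
fh = open(path)
content = fh.read()
fh.close()

# "r+" allows reading and writing through the same file handle.
fh = open(path, "r+")
first = fh.readline()      # read the first line
fh.seek(0, os.SEEK_END)    # jump to the end before writing
fh.write("third line\n")
fh.close()

fh = open(path)
final_content = fh.read()
fh.close()
print(final_content)
```

Note that “w” discards any previous content of the file, which is why “a” is used here to add the second line.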

2. Read the file: Once the file is opened, we can read its contents. There are several ways to read a file; these are the most used:

read(n): Reads up to n characters from the file (n bytes if the file was opened in binary mode). Without parameters, it reads the whole file.³

readline(): Returns a string with only one line from the file, including '\n' as an end-of-line marker. When it reaches the end of the file, it returns an empty string.

readlines(): Returns a list where each element is a string with a line from the file.

3. Close the file: Once we are done with the file, we close it with filehandle.close(). If we don’t close it, Python will do it when the program finishes. However, it is considered good programming practice to close it explicitly.
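The three steps can be put together in a short sketch that compares the reading methods. The file sample.txt and its contents are hypothetical. The final with statement is not covered in the text above; it is shown as a common alternative that closes the file automatically.

```python
import os
import tempfile

# Create a small hypothetical file to read from.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "sample.txt")
with open(path, "w") as out:
    out.write("alpha\nbeta\ngamma\n")

# read(n) returns at most n characters; read() returns everything left.
fh = open(path)
first_three = fh.read(3)
rest = fh.read()
fh.close()

# readline() returns one line per call and '' at the end of the file.
fh = open(path)
line1 = fh.readline()
line2 = fh.readline()
line3 = fh.readline()
at_end = fh.readline()
fh.close()

# readlines() returns a list with one string per line.
fh = open(path)
all_lines = fh.readlines()
fh.close()

# With a with statement the file is closed when the block ends.
with open(path) as fh:
    closed_inside = fh.closed
closed_after = fh.closed
```

Each method keeps the '\n' at the end of the lines it returns, so it usually has to be stripped before the text is used.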

5.1.1 Example of File Handling

Let’s suppose we have a file called seqA.fas that contains:⁴

>O00626|HUMAN Small inducible cytokine A22.
MARLQTALLVVLVLLAVALQATEAGPYGANMEDSVCCRDYVRYRLPLRVVKHFYWTSDSCPRPGVVLLTFRDKEICADPR
VPWVKMILNKLSQ

³ Due to the amount of memory it could take, it is not advisable to read the whole file in this way unless you are sure of the file size. To process big files, there are better strategies, like reading one line at a time.

⁴ It is a FASTA file with one entry. The first line has a > followed by the sequence name and description. The following lines have the sequence (DNA or amino acids). For more information on FASTA files, please see http://www.ncbi.nlm.nih.gov/BLAST/fasta.shtml.

From this file we need the name and the sequence. A first approach is to read the file with read():

>>> fh = open('/home/sb/bioinfo/seqA.fas')
>>> fh.read()
'>O00626|HUMAN Small inducible cytokine A22.\nMARLQTALLVVLVLLAVALQATEAGPYGANMEDSVCCRDYVRYRLPLRVVKHFYWTSDSCPRPGVVLLTFRDKEICADPR\nVPWVKMILNKLSQ\n'

The goal here is to have two variables, one with the sequence name and the other with the sequence itself. Listing 5.1 shows a way to do it using read():

Listing 5.1: FastaRead.py: First try to read a FASTA file
1 fh = open('/home/sb/bioinfo/seqA.fas')
2 myfile = fh.read() #myfile is a string
3 name = myfile.split('\n')[0][1:]
4 sequence = ''.join(myfile.split('\n')[1:])
5 print("The name is: %s"%name)
6 print("The sequence is: %s"%sequence)
7 fh.close()

The first line opens the file in read mode and creates a file handle that we call fh. On line 2, the whole file is read with read() and the resulting string is stored in memory under the name myfile. The next step is to separate the name from the sequence. Since the name is after the “>” symbol and before the first '\n', this information can be used to get the data we want (line 3). The sequence is obtained by joining the elements that result from splitting the myfile string, leaving out the first element (line 4).

The problem with this code is that it uses the read() method to read the whole file at once. This is a potential problem if there is not enough memory available to accommodate the file’s contents. This is why it is better to use readline() (unless you know that you can handle the size of the file):

Listing 5.2: Read a FASTA file using readline()
1 fh = open('/home/sb/seqA.fas')
[...]
10 print("The name is: %s"%name)
11 print("The sequence is: %s"%sequence)
12 fh.close()

Code explanation: The first line is identical to the first line of the previous code listing (listing 5.1). In the second line we use the readline() method to read the first line of the FASTA file. From this line we take the substring between the “>” and the first '\n' (line 3). In this case we don’t need to use the index function to search for the '\n' character because we know it is at the end of the line returned by readline(). From line 5 to 9, there is a loop that calls readline() repeatedly to finish reading the file. The exit condition is line=="", the value that readline() returns at the end of the file.
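The readline() loop described in the explanation can be sketched as follows. This is a reconstruction from the prose, not the original listing: the variable name first_line and the while True / break layout are assumptions, and a temporary copy of the example entry stands in for /home/sb/seqA.fas so that the code runs anywhere.

```python
import os
import tempfile

# Hypothetical stand-in for seqA.fas, written to a temporary directory.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "seqA.fas")
with open(path, "w") as out:
    out.write(">O00626|HUMAN Small inducible cytokine A22.\n")
    out.write("MARLQTALLVVLVLLAVALQATEAGPYGANMEDSVCCRDYVRYRLPLRVVKHFYWTSDS"
              "CPRPGVVLLTFRDKEICADPR\n")
    out.write("VPWVKMILNKLSQ\n")

fh = open(path)
first_line = fh.readline()   # header line, e.g. '>name description\n'
name = first_line[1:-1]      # drop the leading '>' and the trailing '\n'
sequence = ""
while True:
    line = fh.readline()
    if line == "":           # readline() returns '' at the end of the file
        break
    sequence += line.replace("\n", "")
print("The name is: %s" % name)
print("The sequence is: %s" % sequence)
fh.close()
```

Only one line is held in memory at a time besides the growing sequence string, which is the point of preferring readline() over read() for large files.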

Although this version is more efficient than listing 5.1, it could be rewritten to make it easier to read:

Listing 5.3: FastaRead.py: Reads FASTA file, sequentially
1 fh = open('/home/sb/seqA.fas')
2 name = fh.readline()[1:-1]
3 sequence = ""
4 for line in fh:
5     sequence += line.replace('\n','')
6 print("The name is: %s"%name)
7 print("The sequence is: %s"%sequence)
8 fh.close()

Code explanation: The FirstLine variable that was present in listing 5.2 is omitted and the result of fh.readline()[1:-1] is assigned directly to name. The construct for x in filehandle (line 4) is the clearest and most efficient way to iterate through all the lines of a file. At this point we can add to our protein net charge calculation program (code listing 4.14) the ability to take a FASTA-format sequence as input, instead of entering it manually.
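The same line-by-line iteration can be wrapped in a reusable function, sketched below. The function name read_single_fasta, the use of a with statement and the shortened test sequence are assumptions made for illustration; they are not part of the listings in this section.

```python
import os
import tempfile

def read_single_fasta(path):
    """Return (name, sequence) for a FASTA file that holds one entry."""
    with open(path) as fh:
        # Header line: drop the leading '>' and the end-of-line marker.
        name = fh.readline()[1:].rstrip("\n")
        sequence = ""
        # Iterating over the file handle yields one line at a time.
        for line in fh:
            sequence += line.rstrip("\n")
    return name, sequence

# Hypothetical, shortened FASTA entry used only to exercise the function.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "seqA.fas")
with open(path, "w") as out:
    out.write(">O00626|HUMAN Small inducible cytokine A22.\n")
    out.write("MARLQTALLV\n")
    out.write("VLVLLAVAL\n")

name, sequence = read_single_fasta(path)
print("The name is: %s" % name)
print("The sequence is: %s" % sequence)
```

Using rstrip("\n") instead of the [1:-1] slice is a small design choice: the slice would cut off a real character if the line did not end with '\n'.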

Listing 5.4: Calculate the net charge, reading the input from a file
1 fh = open('/home/sb/prot.fas')
[...]
9 for aa in sequence:
10     charge += AACharge.get(aa,0)
11 print(charge)
12 fh.close()

Code explanation: The code is essentially the same as that in listing 4.14, with the difference that the first 5 lines are similar to those of listing 5.3 and are used to fill the sequence variable with the string that is read from the FASTA file. The only difference is on line 2, where the first line of the file is read as input, but not stored in any variable.
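A minimal sketch of this idea follows. The charge table is a hypothetical, simplified one (whole charges for four residues only) and is not the AACharge table of listing 4.14, whose values are not shown in this section. The function name and the test peptide are also assumptions.

```python
import os
import tempfile

# Hypothetical, simplified charge table; NOT the values from listing 4.14.
AA_CHARGE = {"D": -1, "E": -1, "K": 1, "R": 1}

def net_charge_from_fasta(path):
    """Sum the charges of the residues in a single-entry FASTA file."""
    with open(path) as fh:
        fh.readline()              # read the header line without storing it
        sequence = ""
        for line in fh:
            sequence += line.rstrip("\n")
    charge = 0
    for aa in sequence:
        # Residues missing from the table contribute 0.
        charge += AA_CHARGE.get(aa, 0)
    return charge

# Hypothetical peptide written to a temporary file for the demonstration.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "prot.fas")
with open(path, "w") as out:
    out.write(">demo hypothetical peptide\n")
    out.write("MKKDE\n")
    out.write("RAG\n")

charge = net_charge_from_fasta(path)
print(charge)
```

With this simplified table the peptide MKKDERAG has three positive residues (K, K, R) and two negative ones (D, E), so the sketch prints 1.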
