BIO440 Genetics Laboratory DNA sequencing

DNA sequencing is the process of determining the precise order of the nucleotide bases in a particular DNA molecule. In 1977, two methods of DNA sequencing were independently published. Maxam and Gilbert used a chemical cleavage protocol, while Fred Sanger designed a procedure similar to DNA replication. Gilbert and Sanger shared the 1980 Nobel Prize in Chemistry (with Paul Berg), but Sanger's method became the standard because of its practicality. This was Sanger's second Nobel Prize; his first was for working out how to determine the sequence of amino acids in proteins.

The Sanger method involves creating DNA fragments terminated with dideoxynucleotides (ddNTPs). These ddNTPs lack a 3' OH on the deoxyribose, which prevents DNA polymerase from adding more nucleotides (this method is also called "chain termination sequencing"). Traditionally, this involved performing four separate reactions (one for each of the 4 bases). The DNA fragments generated from each dideoxy reaction were separated by gel electrophoresis under conditions that allow DNA molecules that differ in length by only one nucleotide to be resolved. To visualize the DNA, it was typically labeled with ³²P or ³⁵S. This method is depicted on the next page.

Today, several modifications of the Sanger method allow us to sequence DNA much faster. Instead of using radioactively labeled nucleotides, we now use fluorescent labels. Thus, we can now run all 4 ddNTP reactions in one lane of a gel instead of 4 separate lanes. Each fragment ending in a dideoxyA (ddA) is labeled with a red fluor, those ending in ddT are labeled with a yellow fluor, etc. Additionally, this lets us do all four dideoxy reactions in one tube simultaneously, because the different fluorescent dyes are attached only to the ddNTPs. Every fragment that gets terminated with a ddA is thus labeled with the red dye that is attached to that ddNTP. This method is called fluorescent dye-terminator cycle sequencing, and it uses PCR to incorporate ddNTPs in a primer extension sequencing reaction.
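To make the chain-termination idea concrete, here is a minimal Python sketch (it is not part of the lab protocol). It assumes a short, made-up template, ignores the primer, and counts fragment lengths from the first base added after the primer. The function name and the template are illustrative only.

```python
def chain_termination_fragments(template_3to5):
    """Return the new strand (5'->3') and, for each ddNTP, the fragment lengths it produces."""
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    # The polymerase reads the template 3'->5' and builds the new strand 5'->3'.
    new_strand = "".join(complement[base] for base in template_3to5)
    fragments = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        # Wherever a base occurs, the matching ddNTP can be incorporated instead
        # of the normal dNTP, ending the chain at that length.
        fragments[base].append(length)
    return new_strand, fragments


# Hypothetical template, written 3'->5'
new_strand, fragments = chain_termination_fragments("TACGGTCA")
print(new_strand)
for base in "ACGT":
    print("dd" + base, fragments[base])
```

Each list corresponds to one lane of a traditional four-lane gel: the lane for a given ddNTP contains one band for every position where that base occurs in the new strand.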

The PCR reaction consists of DNA template, primer, a special DNA polymerase, unlabeled dNTPs, fluorescently labeled ddNTPs, and buffer. When the PCR is complete, the reaction mix contains a population of PCR fragments of different lengths, each terminating in a fluorescent-dye-containing ddNTP. Each ddNTP carries a different fluorescent dye that emits at a characteristic wavelength, so the identity of the dye corresponds to the final base on that fragment. The entire reaction is run in a single lane on a polyacrylamide gel, so that the fragments separate according to size. The fragments run past a laser detector at the bottom of the gel, and the emission wavelength of each fragment is recorded. This is depicted below.

The sequence data are usually converted to chromatogram form, and various software programs allow rapid analysis of the .scf (standard chromatogram format) files of these chromatograms. A sample chromatogram is depicted below. As you can see, each of the bases emits at a different wavelength, and the chromatogram can be read from left to right (early to late, hence 5' to 3' relative to the newly synthesized strand).
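The read-out logic described above can be sketched in a few lines of Python. This is only an illustration: the detector events are invented, and the green and blue dye assignments are assumptions (only red = ddA and yellow = ddT are given in the text).

```python
# Dye-to-base assignments. Red and yellow follow the text; green and blue are assumed.
DYE_TO_BASE = {"red": "A", "yellow": "T", "green": "G", "blue": "C"}


def call_bases(detections):
    """detections: list of (fragment_length, dye) tuples, in any order."""
    # Shorter fragments migrate faster, so they pass the detector first.
    ordered = sorted(detections, key=lambda event: event[0])
    return "".join(DYE_TO_BASE[dye] for _, dye in ordered)


# Hypothetical detector events
reads = [(3, "green"), (1, "red"), (4, "blue"), (2, "yellow")]
sequence = call_bases(reads)  # 5' -> 3' on the newly synthesized strand
print(sequence)
```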

Sequencing reaction for the Licor DNA sequencer.

The automated sequencer that we have lets us sequence a plasmid insert in two directions. We need to set up 4 PCR reactions, each of which contains a different ddNTP. In our PCR reactions, the forward primer and the reverse primer are each labeled with a different fluorescent dye. Thus, we can read the sequence of one strand in one direction and the sequence of the other strand in the other direction. This is called Simultaneous Bidirectional Sequencing.

Each PCR reaction is 6 µl in volume. The molar amount of template used is based on the size of the insert between the priming sites. This equals the size of the cloned insert + 100 bp. The table below gives guidelines:

Insert size (bp)      Amount of template desired
300-600 bp            50-100 fmol
600-1200 bp           125-225 fmol
1200-1800 bp          250-300 fmol
>1800 bp              300-500 fmol

The mass corresponding to fmol amounts of 500 bp and 1000 bp inserts is shown in the table below:

Template required     500 bp insert     1 kb insert
50 fmol               17 ng             33 ng
100 fmol              33 ng             66 ng
150 fmol              50 ng             100 ng
200 fmol              67 ng             135 ng
250 fmol              82 ng             165 ng
300 fmol              100 ng            200 ng
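For sizes that are not in the table, the conversion can be worked out directly. The sketch below assumes an average molecular weight of 660 g/mol per base pair of double-stranded DNA, a common approximation that is not stated in this handout. The function name is illustrative.

```python
AVG_MW_PER_BP = 660.0  # g/mol per base pair; assumed average for dsDNA


def fmol_to_ng(fmol, length_bp, mw_per_bp=AVG_MW_PER_BP):
    """Mass in ng of `fmol` femtomoles of a dsDNA fragment `length_bp` long."""
    # fmol x (g/mol) gives units of 1e-15 g; converting to ng (1e-9 g)
    # leaves a factor of 1e-6.
    return fmol * length_bp * mw_per_bp * 1e-6


for fmol in (50, 100, 150, 200, 250, 300):
    print(fmol, "fmol:",
          round(fmol_to_ng(fmol, 500), 1), "ng (500 bp),",
          round(fmol_to_ng(fmol, 1000), 1), "ng (1 kb)")
```

The results should land close to the tabulated values. Small differences are expected from rounding and from the exact molecular weight used.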

1. Add the following components to a 0.2 ml tube to prepare the template/primer mix for each template:

dsDNA (your plasmid)                          ______ µl
700 nm-emitting forward primer (1 pmol/µl)    1.5 µl
800 nm-emitting reverse primer (1 pmol/µl)    1.5 µl
sterile distilled water                       ______ µl
__________________________________________________
Total volume                                  13 µl

2. Label a set of 4 tubes A, C, G, and T. Add 3 µl of the A reagent to tube A, 3 µl of the T reagent to tube T, etc. This has been done for you.

3. Mix your plasmid template/primer mixture by gently pipetting up and down. Add 3 µl to each of the 4 tubes, using a new tip for each addition. After each addition, mix the template/primer/reagent mixture by gently pipetting up and down twice.

4. Cap your set of 8 tubes (make sure the caps are all the way down on each tube), and move them to a thermal cycler at 4 °C.

5. Begin the PCR. At the end of the PCR, add 3 µl of loading dye/formamide stop solution to each reaction. Denature at 92 °C for 2 minutes, then chill on ice. Load the sequencing gel.
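The blank volumes in the template/primer mix recipe above can be worked out as in the sketch below. It assumes the usual spectrophotometric conversion for double-stranded DNA (an A260 of 1.0 corresponds to about 50 ng/µl at a 1 cm path length), which is not stated in this handout. The example readings are made up.

```python
NG_PER_UL_PER_A260 = 50.0  # assumed conversion for dsDNA, 1 cm path length
MIX_VOLUME_UL = 13.0       # total template/primer mix volume from the recipe
PRIMER_VOLUME_UL = 1.5     # volume of each of the two primers


def plan_template_mix(a260, dilution_factor, desired_ng):
    """Return (stock concentration in ng/µl, plasmid volume in µl, water volume in µl)."""
    concentration = a260 * NG_PER_UL_PER_A260 * dilution_factor  # undiluted stock
    plasmid_ul = desired_ng / concentration
    # Water makes up whatever is left after the plasmid and the two primers.
    water_ul = MIX_VOLUME_UL - 2 * PRIMER_VOLUME_UL - plasmid_ul
    if water_ul < 0:
        raise ValueError("plasmid stock is too dilute to fit in the mix")
    return concentration, plasmid_ul, water_ul


# Hypothetical reading: A260 = 0.05 on a 1:20 dilution, 100 ng of template wanted
conc, plasmid_ul, water_ul = plan_template_mix(a260=0.05, dilution_factor=20, desired_ng=100)
print(conc, "ng/µl;", plasmid_ul, "µl plasmid;", water_ul, "µl water")
```

The water volume is the same "10 µl - vol. plasmid" that the observation sheet asks for, since the two primers together take up 3 µl of the 13 µl mix.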

Observations and Analyses - DNA sequencing Due 10/18/07

(note: there is a second part of this observation and analysis that will be completed using software in class)

Name: ___________________________________________

Plasmid number ____________

A260 of 1:20 dilution of plasmid: _________________________

Concentration of undiluted plasmid DNA ___________________(ng/µl)

Desired molar amount of template (from table) ______________________

Volume of plasmid that gives desired mass __________________µl

Volume of water to use in sequencing reaction (10 µl - vol. plasmid) ___________µl

Additional questions:

1. Examine the chromatograms of your sequenced plasmid. Describe the chromatograms. How does the quality of the two chromatograms change as you go from the beginning of the sequence to the end? What do these changes represent physically? Why does the quality change?

2. What is the function of the primer in a set of 4 sequencing reactions?

3. What role does the gel play in the sequencing process?

4. The structure of AZT, which is used in treating HIV infections, is shown below. Based on what you know about DNA sequencing, how do you think AZT works to stop the spread of HIV?

5. Draw a gel below, with the bands that you would expect to see if you sequenced the following template:

template

3' G A C T G A A G C T G A 5'

BIO440 Fall 2007 DNA Sequencing Results Part 2

In this part of the project, we will start with raw data from the LiCor sequencer. We will clean up the raw data, and then determine whether or not our sequences really are 16S rRNA sequences. If they are 16S rRNA sequences, we will determine what kind of organisms they came from, and hence get an idea of the phylogenetic diversity of isolates from Boiling Springs Lake. For the purposes of this project, you are to create an electronic copy of your analysis, and submit it via email to your beloved instructor. This should be a Word document entitled 'XXX(your initials)seqanalysis', e.g. MSWseqanalysis. There is a form on the class website to use for this; you can download it and then type into the spaces. There are italicized, bolded regions where you are to fill in your results.

This exercise should introduce you to the type of data that you will get and the kinds of analysis you will have to carry out in order to interpret your results. It will also introduce you to two different types of sequence analysis and manipulation software: Sequencher, an intuitive but unreasonably expensive program; and Staden, a powerful and free but non-intuitive package of programs. You will learn how to use the software on example files that are on your desktop, and then once you have generated your own .scf files from the sequencer upstairs you will analyze your actual data.

Note: The protocols below are generalized protocols. Because this is real data (and a real research project), not all of the sequences will necessarily conform exactly to this process. You may need to be creative and try a few different approaches in order to get this to work. Remember that patience is a virtue.

Cleaning up the sequence.

We are going to start with the raw data from the sequencer. This data consists of the sequence of both strands of our insert (and sometimes of the plasmid vector).

Summary. We want to:

-Open Sequencher, the sequence analysis software.
-Open up the forward and reverse sequence data files.
-Align the two sequences.
-Trim away vector sequences and any poorly sequenced regions at the ends.

Note: This may actually be the most problematic part of the entire process.

Details.

-Open Sequencher, the sequence analysis software. The icon for this program is located in the pop-up menu. We will need to use ‘demo mode’ for this process if we do it all at once (and you can't save your results in demo mode). This first time I just want you to understand how the software works, so use the demo version so that you know what to expect.

Open Sequencher, start a new project, and under ‘File’, select ‘Import sequences’. The files are in the folder on the desktop entitled BIO440 sequencing, which contains single curve files (chromatograms from the sequencing gels). To see these files in Sequencher you will also need to change the ‘Files of Type’ box at the bottom of the select screen from ‘*.ABI’ to ‘all’.

-Open up the forward and reverse sequence data files for your first sequence. The files will end in '.ab1'. Files that you generate from the LiCor Sequencer upstairs will have the ending '.scf'.

Each group should clean up one of the sequences in this project to become familiar with the Sequencher software, and (if possible) save the final cleaned consensus sequence in a Word document. You will then compare your cleaned sequence to my version.

Align the forward and reverse sequences. This is actually an alignment between one of the sequences and the reverse complement of the other sequence. This is done by highlighting the two sequences and selecting the ‘Assemble automatically’ button. You should get a screen that has a ‘Contig [0001]’ icon. Select this icon.

If you don’t get the icon, go to ‘Assembly parameters’, and slide the ‘minimum match percentage’ bar a little to the left, then try again.
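To see why the alignment has to use the reverse complement of one read, here is a small Python sketch. The two reads are invented for illustration, and this is not how Sequencher itself is implemented.

```python
# Translation table mapping each base to its complement (upper and lower case)
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse so the result again reads 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


forward_read = "ATGCCAGT"  # hypothetical read from the forward primer
reverse_read = "ACTGGCAT"  # hypothetical read of the opposite strand

# As written, the two reads look unrelated. After taking the reverse
# complement of one of them, they describe the same stretch of DNA.
print(reverse_complement(reverse_read))
print(reverse_complement(reverse_read) == forward_read)
```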

When you select the ‘contig’, you should get a diagram displaying the alignment. Take a look at the diagram and see if it makes sense. If it does, then select ‘Bases’ for a more detailed view of the alignment.

At the top of the screen, the alignment of the two sequences will be displayed. At the bottom of the screen, the consensus sequence will be displayed. Where the two sequences are in perfect agreement, the consensus sequence is unmarked. Where there is a discrepancy between the two sequences, the consensus sequence is marked with an asterisk. There may be many discrepancies at the beginning and the end of the consensus sequence, because these regions represent the very ends (poor quality) of one or the other of the sequencing reactions. Scroll through the sequence to verify this. Where one reaction is of the highest quality, the other reaction is of the lowest quality.
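The asterisk marking can be mimicked with a short Python sketch. It assumes the two reads are already aligned to the same length and simply flags every position where they differ. The reads are made up, and Sequencher's own consensus rules may well be more elaborate.

```python
def mark_discrepancies(read_a, read_b):
    """Return a string with '*' wherever two aligned reads disagree, ' ' elsewhere."""
    if len(read_a) != len(read_b):
        raise ValueError("reads must be aligned to the same length")
    return "".join(" " if a == b else "*" for a, b in zip(read_a, read_b))


# Hypothetical aligned reads; N marks a base the sequencer could not call
fwd = "ANGCCAGTTC"
rev = "ATGCCAGTNN"
marks = mark_discrepancies(fwd, rev)
print(fwd)
print(rev)
print(marks)
```

In this toy example the asterisks fall near the ends, where one read or the other has uncalled bases, while the middle of the alignment is unmarked.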

Go to the middle region of the sequence, where there are few asterisks. Using your cursor, select a base on the consensus sequence that is NOT marked with an asterisk. Then select ‘Show chromatograms’ to see the chromatograms representing the raw sequence data. The two chromatograms should agree well. By changing the base that is selected in the consensus sequence, you can examine how the chromatograms change in the different parts of the sequence.

How does the quality of the two chromatograms change as you go from the beginning of the consensus sequence to the end? What do these changes represent physically? Why does the quality change?

Now you want to trim away the poorly sequenced regions at the ends of the sequence. These are the regions with numerous asterisks. (Note that at each end, one of the two sequences is probably of very high quality [don't trim] and the other of very low quality [get rid of it].) To do this, use the cursor to highlight the poor-quality data, and delete it. If Sequencher asks you if you want to 'fill from left (or right)', select yes. Use the chromatograms to try to resolve any discrepancies in the remaining portion of the sequence. If you can, try to obtain at least 900 bp of consensus sequence.

-Trim away vector sequences. To do this, we will search for the primer sequences, i.e. the primers that you used for the original 16S amplification. The primers are degenerate; that is, there are some places where there is more than one nucleotide in the 'conserved' target site. For example, the forward primer is

5' cctacgggrsgcagcag 3'

where the r stands for a puRine (A, G) and the s represents "Strongly H-bonding" (C, G).

I realize that this is a somewhat clunky approach - can you think of a better way to do this?

All of the strains were produced using 341F as the forward primer; some had 1525R as the reverse primer and others had 1391R as the reverse primer.

primer sequences = 1391R

5' gacgggcggtgtgtgc 3'
5' gacgggcggtgtgtac 3'

in opposite orientation, this =

5' gcacacaccgcccgtc 3'
5' gtacacaccgcccgtc 3'

primer sequences = 341F

5' CCT ACG GGR SGC AGC AG 3'

Note: this primer is a mixture of 4 slightly different primers.

What are the 4 primer sequences?

What is the reverse complement of these sequences?

After trimming these sequences away, paste the final cleaned consensus sequence into your document.
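Searching for a degenerate primer by hand means checking every variant. One possible shortcut, sketched below in Python, is to turn the degenerate codes into a pattern that matches all variants at once. The toy primer and read are invented (they are not the class primers), only the two degenerate codes used in this handout are included, and keeping the primer itself while dropping everything upstream is just one possible choice.

```python
import re

# Only the codes needed here: R = A or G, S = C or G (others omitted for brevity)
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "S": "[CG]"}


def primer_to_regex(primer):
    """Build a regular expression that matches every variant of a degenerate primer."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def trim_before_primer(sequence, primer):
    """Drop everything upstream of the first primer match; return None if there is no match."""
    match = primer_to_regex(primer).search(sequence.upper())
    return sequence[match.start():] if match else None


toy_primer = "ACRSG"            # hypothetical degenerate primer
toy_read = "TTTTGGACGCGTTAACC"  # hypothetical vector sequence followed by insert
trimmed = trim_before_primer(toy_read, toy_primer)
print(trimmed)
```

For the other end of the insert, the same search would be run with the reverse primer in its opposite orientation.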

Analyzing sequence data using the Staden program

Using Staden

We can use the Staden programs to process sequence data instead of the Sequencher program. The advantages of using Staden are that it is free, the vector clipping function is easy and works, and you can use it on computers at home. Also, it’s more powerful. However, Staden is not as user-friendly as Sequencher - it is somewhat clunky and takes getting used to.

The Staden Package has 5 programs:

PreGap4 – for removing vector sequences & aligning different sequence files

Gap4 – for studying aligned sequences

Trev – Trace viewer

Spin – a set of functions including looking for restriction sites, translating nt to aa, and some alignment functions

Console – I don’t yet know what this does

To clip vector sequences and remove poor quality sequencing data from a raw sequence

To start: You need a folder on the desktop that has the sequences you want to analyze and the file pGemT.txt. This is a text file (not a Word file) with the sequence of the pGEM-T vector. The sequences you want to analyze can be in .ab1 or .scf format. You need to know the name of this folder, e.g. ‘My BIO440 seqs’.

Start Pregap4. This should open a program that gives you a window with a big blank screen with three folder tabs up top. These tabs have the names “Files to Process”, “Configure Modules” and “Textual output”.

Under “Configure Modules” set up the general configuration by making sure that the following boxes are checked (it may be a good idea to uncheck the other boxes):

Estimate base accuracies
Trace format conversion
Initialize experiment files
Augment experiment files
Quality clip
Screen for unclipped vector
Cloning vector clip
Gap4 Shotgun Assembly

With Cloning vector clip, you want to deselect it and then reselect it. When reselected, on the right hand side of the window you will be prompted to select a cloning vector. Using the browse button, select the pGemT.txt file. Then select ‘save these parameters’.

After selecting Gap4 Shotgun Assembly, the program that will align your forward and reverse sequences, you need to select ‘create new database’ and where it says GAP4 database name, type ‘Align’. When looking at your actual sequence data, you might type your strain name (e.g. 341F17) instead of Align. Then select ‘save these parameters’.

Under “Files to Process” select ‘Add files’. A new window opens.

Change ‘Files of type’ from ABI (*.ab1) to Any (*.*).

In that window, select Desktop from the icons on the left hand side, and then select your folder (e.g. ‘My BIO440 seqs’).

Then highlight the files you want to analyze and select Open, or double-click. For example, you might have the files sample1.ab1, sample2.ab1, HSU1.scf, and HSU2.scf.

Then, click run. Some stuff should happen, and you should see the phrase ‘Processing finished’

Using Trev to view vector clip, quality clip, and to edit your sequence

Now go to the desktop and look in your folder (e.g. ‘My BIO440 seqs’). Pregap4 has created some new files. You are interested in the .exp files. There should be a .exp file for each of the sequences that you analyzed. To look at one, double-click the .exp file, and it will open in Trev. You may have to open Trev first, then open your file.

The sequence should be color-coded – the crosshatched area is bad sequence, the pink area = vector, the light grey area = good sequence data and the dark grey area = not as good seq data. You might not see any vector on the practice files, but should see some on your experimental files.

Under view, select ‘display edits’.

Under Edit, select ‘sequence’. This will cause a new line of sequence data to appear – the edit line. Go to the right hand side of the vector, and using the mouse select the right-most nt of the vector. Use the delete (not backspace) key to remove the vector sequence, and/or poor quality sequence.

Then, go to the far right of the sequence file. Using the quality of the chromatogram, you can delete the poor quality region of the sequence file. Note that you are editing the .exp file and not the original .ab1 or .scf file.

Then, save the .exp file under the file menu. Finally, select File… ‘save as’ … Plain text…and give it a name that corresponds to your sequence followed by the extension .txt. For example sample1.txt

You should now have a text file that contains your sequence which has the vector and poor quality sequence data trimmed.

Note that this is not your aligned sequence, but only your sequence in one direction. To look at your forward and reverse aligned sequences, we will use the program Gap4.

Using Gap4 to examine your aligned sequences.

On the desktop, select Start….Staden Package….Gap4. From the Gap4 window, select File…Open.

In the 440 sequencing folder you have been using, there should be a file with the name Align.0.aux. Open this file.

Select Edit…..Edit contig. Then select OK. The aligned bases should open in a new window.

Select Settings……Trace Display….Embed Traces size 5.

Select Settings……Highlight disagreements by background color.

Now, edit your sequence …I will leave the details of this up to your discretion and experimentation.

Once your editing is complete, go to the left hand side of the consensus sequence and select the first nt. Then, holding down the Shift key, go to the right hand side and select the right-most nt. Copy the selected sequence, and paste it into your Word document. Once you have your new, edited sequence, continue with the GenBank search described next.

Finding the Closest Match in GenBank, and aligning the sequences.

First, blastn search the nucleotide sequence to verify that it is a 16S sequence. Use the discontiguous megablast program at

http://www.ncbi.nlm.nih.gov/

What is the closest match in GenBank? What is the GenBank accession number? What are the corresponding nucleotides in the GenBank sequence? (This information can be obtained from looking at the alignments in the BLAST output.)

If the closest match isn’t from a cultured organism (genus and species will be named), then what is the closest match which is to a cultured organism? What is the Genbank accession number?

Does this appear to be a 16S sequence?

List the publication information (i.e., authors, title, and journal/date, if any). Also, list the information giving taxonomic details of the organism in the entry. What is the percent identity with the closest match, and with the closest cultured organism? Across how many nucleotides?

Next, use the forward and reverse sequences alone (not the aligned forward and reverse sequence) and carry out discontiguous megablast searches on each of these two sequences. Did all three searches (forward, reverse, and aligned) yield the same result? Describe your results.

Analysis in the Ribosomal Database Project. The RDP is at: http://rdp.cme.msu.edu/index.jsp

We are going to perform a 'Sequence Match' analysis with the small subunit sequences in the database. This allows us to find the sequences most similar to our new sequence. Paste the sequence into the provided space, leave default settings as they are, and select submit sequences. See the 'Seq. Match Info' to interpret your results.

In your own words, compare the output and utility of the blastn and the RDP analysis programs.

Next, we will create an alignment with similar 16S sequences.

At RDPII, go to the 'Online Analyses' page, and use the 'Sequence Aligner' function. Click on run, cut and paste your sequence into the space provided, choose HTML format as output, and include 10 sequences. Leave the other defaults as they are.

Examine the results. Is this a good alignment? Were gaps inserted? Were identities to other organisms apparent? Did the sequence match up to other sequences in the database? How closely? What do your results indicate?

Next, use the 'Classifier' program at the RDP to assign a sequence to the taxonomical hierarchy at the RDP. Interpret the output.

Print out the results or paste them into your report.

Write a brief description of the organism that your sequence is likely to have come from.

Do you think that it is likely this organism was isolated from Boiling Springs Lake, or do you think that this organism may represent a contaminant that was introduced during the isolation procedure?