BIO440 Genetics Laboratory DNA sequencing
DNA sequencing is the process of determining the precise order of the nucleotide bases in a particular DNA molecule. In 1977, two methods of DNA sequencing were published independently. Maxam and Gilbert used a chemical cleavage protocol, while Fred Sanger designed a procedure similar to DNA replication. Gilbert and Sanger shared in the 1980 Nobel Prize in Chemistry, but Sanger’s method became the standard because of its practicality. This was Sanger's second Nobel Prize; his first was for working out how to determine the sequence of amino acids in proteins.
The Sanger method involves creating DNA fragments terminated with dideoxynucleotides (ddNTPs). These ddNTPs lack a 3' OH on the deoxyribose, which prevents DNA polymerase from adding more nucleotides (this method is also called "chain termination sequencing"). Traditionally, this involved performing four separate reactions (one for each of the 4 bases). The DNA fragments generated from each dideoxy reaction were separated by gel electrophoresis under conditions that allow DNA molecules that differ in length by only one nucleotide to be resolved. To visualize the DNA, it was typically labeled with ³²P or ³⁵S. This method is depicted on the next page.
Today, several modifications of the Sanger method allow us to sequence DNA much faster. Instead of using radioactively labeled nucleotides, we now use fluorescent labels. Thus, we can now run all 4 ddNTP reactions in one lane of a gel instead of 4 separate lanes. Each fragment ending in a dideoxyA (ddA) is labeled with a red fluor, those ending in ddT are labeled with a yellow fluor, etc. Additionally, this lets us do all four dideoxy reactions in one tube simultaneously, because the different fluorescent dyes are attached only to the ddNTPs. Every fragment that gets terminated with a ddA is thus labeled with the red dye that is attached to that ddNTP. This method is called fluorescent dye-terminator cycle sequencing, and it uses PCR to incorporate ddNTPs in a primer extension sequencing reaction.
The PCR reaction consists of DNA template, primer, a special DNA polymerase, unlabeled dNTPs, fluorescently labeled ddNTPs, and buffer. When the PCR is complete, the reaction mix contains a population of PCR fragments of different lengths, each terminating in a fluorescent-dye-containing ddNTP. Each ddNTP base carries a different fluorescent dye that emits at a characteristic wavelength, so the identity of the dye corresponds to the final base on that fragment. The entire reaction is run in a single lane on a polyacrylamide gel, so that the fragments separate according to size. The fragments run past a laser detector at the bottom of the gel, and the emission wavelength of each fragment is recorded. This is depicted below.
The sequence data are usually converted to chromatogram form, and various software programs allow rapid analysis of the .scf (standard chromatogram format) files of these chromatograms. A sample chromatogram is depicted below. As you can see, each of the bases emits at a different wavelength, and the chromatogram can be read from left to right (early to late, hence 5' to 3' relative to the newly synthesized strand).
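To make the read-out concrete, here is a minimal Python sketch of the idea behind dye-terminator sequencing. The template and the function names are made up for illustration, and the primer is left out for simplicity. Every possible terminated fragment of the new strand is generated, the fragments are sorted by length as the gel would sort them, and the dye-labeled final base of each is read in order.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def new_strand(template_3to5):
    """Return the newly synthesized strand (5'->3') for a template written 3'->5'."""
    return "".join(COMPLEMENT[base] for base in template_3to5)

def terminated_fragments(template_3to5):
    """Each prefix of the new strand is one fragment; its last base is the ddNTP."""
    strand = new_strand(template_3to5)
    return [strand[:n] for n in range(1, len(strand) + 1)]

def read_gel(fragments):
    """Shortest fragments reach the detector first, so sort by length and
    record the terminal (dye-labeled) base of each fragment."""
    return "".join(frag[-1] for frag in sorted(fragments, key=len))

template = "TACGGT"  # hypothetical template, written 3'->5'
fragments = terminated_fragments(template)
print(fragments)
print(read_gel(fragments))  # sequence of the new strand, 5'->3'
```

Reading the terminal bases from shortest to longest fragment reproduces the new strand 5' to 3', which is why the chromatogram is read from early to late.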
Sequencing reaction for the Licor DNA sequencer.
The automated sequencer that we have lets us sequence a plasmid insert in two directions. We need to set up 4 PCR reactions, each of which contains a different ddNTP. In our PCR reactions, the forward primer and the reverse primer are each labeled with a different fluorescent dye. Thus, we can read the sequence of one strand in one direction and the sequence of the other strand in the other direction. This is called Simultaneous Bidirectional Sequencing.
Each PCR reaction is 6 µl in volume. The molar amount of template used is based on the size of the insert between the priming sites. This equals the size of the cloned insert + 100 bp. The table below gives guidelines:
Insert size (bp)     Amount of template desired
300-600 bp           50-100 fmol
600-1200 bp          125-225 fmol
1200-1800 bp         250-300 fmol
>1800 bp             300-500 fmol
The mass corresponding to fmol amounts of 500 bp and 1000 bp inserts is shown in the table below:
Template required    500 bp insert    1 kb insert
50 fmol              17 ng            33 ng
100 fmol             33 ng            66 ng
150 fmol             50 ng            100 ng
200 fmol             67 ng            135 ng
250 fmol             82 ng            165 ng
300 fmol             100 ng           200 ng
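For insert sizes that are not in the table, the conversion can be sketched in Python. This assumes the common approximation of about 660 g/mol per base pair of double-stranded DNA, which is not stated in this handout but gives values within a few nanograms of the table; the function name is illustrative.

```python
AVG_BP_MASS = 660.0  # g/mol per base pair of dsDNA (assumed approximation)

def fmol_to_ng(fmol, length_bp, bp_mass=AVG_BP_MASS):
    """Mass in ng of `fmol` femtomoles of double-stranded DNA `length_bp` long."""
    # 1 fmol x 1 g/mol = 1e-15 g = 1e-6 ng
    return fmol * length_bp * bp_mass * 1e-6

# Compare with the table: columns are fmol, 500 bp insert, 1 kb insert.
for fmol in (50, 100, 150, 200, 250, 300):
    print(fmol, fmol_to_ng(fmol, 500), fmol_to_ng(fmol, 1000))
```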
1. Add the following components to a 0.2 ml tube to prepare the template/primer mix for each template:
dsDNA (your plasmid)                         _____________ µl
700 nm-emitting forward primer (1 pmol/µl)   _____1.5_____ µl
800 nm-emitting reverse primer (1 pmol/µl)   _____1.5_____ µl
sterile distilled water                      _____________ µl
__________________________________________________
Total volume                                 13 µl
2. Label a set of 4 tubes A, C, G, and T. Add 3 µl of the A reagent to tube A, 3 µl of the T reagent to tube T, etc. This has been done for you.
3. Mix your plasmid template/primer mixture by gently pipetting up and down. Add 3 µl to each of the 4 tubes, using a new tip for each addition. After addition, mix your template/primer/reagent mixture by gently pipetting up and down twice.
4. Cap your set of 8 tubes (make sure caps are all the way down on each tube), and move to a thermal cycler at 4 °C.
5. Begin PCR reaction. At the end of the PCR reaction, add 3 µl of loading dye/formamide stop solution to each reaction. Denature at 92 °C for 2 minutes, then chill on ice. Load sequencing gel.
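The volumes for the template/primer mix in step 1 can be worked out as in the following Python sketch. It assumes the usual conversion of 50 ng/µl per A260 unit for double-stranded DNA and the 1:20 dilution used on the worksheet; the function names and the example readings are hypothetical.

```python
def plasmid_concentration(a260, dilution_factor=20, ng_per_ul_per_a260=50.0):
    """Concentration (ng/ul) of the undiluted plasmid from the A260 of a dilution.
    Assumes 50 ng/ul per A260 unit for double-stranded DNA."""
    return a260 * dilution_factor * ng_per_ul_per_a260

def template_primer_mix(desired_mass_ng, conc_ng_per_ul,
                        total_ul=13.0, primer_ul=1.5):
    """Volumes (ul) for the 13 ul template/primer mix: plasmid, two primers, water."""
    plasmid_ul = desired_mass_ng / conc_ng_per_ul
    water_ul = total_ul - 2 * primer_ul - plasmid_ul  # i.e. 10 ul - vol. plasmid
    if water_ul < 0:
        raise ValueError("plasmid too dilute: template volume exceeds 10 ul")
    return {"plasmid": plasmid_ul, "forward primer": primer_ul,
            "reverse primer": primer_ul, "water": water_ul}

conc = plasmid_concentration(0.05)       # hypothetical A260 reading
print(template_primer_mix(100.0, conc))  # hypothetical 100 ng of template
```

If the plasmid is so dilute that more than 10 µl would be needed, the sketch raises an error rather than returning a negative water volume.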
Observations and Analyses - DNA sequencing Due 10/18/07
(note: there is a second part of this observation and analysis that will be completed using software in class)
Name: ___________________________________________
Plasmid number ____________
A260 of 1:20 dilution of plasmid: _________________________
Concentration of undiluted plasmid DNA ___________________(ng/µl)
Desired molar amount of template (from table) ______________________
Volume of plasmid that gives desired mass __________________µl
Volume of water to use in sequencing reaction (10 µl - vol. plasmid) ___________µl
Additional questions:
1. Examine the chromatograms of your sequenced plasmid. Describe the chromatograms. How does the quality of the two chromatograms change as you go from the beginning of the sequence to the end? What do these changes represent, physically–why does the quality change?
2. What is the function of the primer in a set of 4 sequencing reactions?
3. What role does the gel play in the sequencing process?
4. The structure of AZT, which is used in treating HIV infections, is shown below. Based on what you know about DNA sequencing, how do you think AZT works to stop the spread of HIV?
5. Draw a gel below, with the bands that you would expect to see if you sequenced the following template:
template
3' G A C T G A A G C T G A 5'
BIO440
Fall 2007 DNA Sequencing Results Part 2
In this part of the project, we will start with raw data from the LiCor
sequencer. We will clean up the raw data, and then determine whether or not
our sequences really are 16S rRNA sequences. If they are 16S rRNA
sequences, we will determine what kind of organisms they came from, and
hence, an idea of the phylogenetic diversity of isolates from Boiling Springs
Lake. For the purposes of this project, you are to create an electronic copy of
your analysis, and submit it via email to your beloved instructor. This should
be a Word document entitled 'XXX(your initials)seqanalysis', e.g.,
MSWseqanalysis. There is a form on the class website to use for this; you
can download it and then type into the spaces. There are italicized, bolded
regions where you are to fill in your results.
This exercise should introduce you to the type of data that you will get and
the kinds of analysis you will have to carry out in order to interpret your
results. It will also introduce you to two different types of sequence analysis
and manipulation software: Sequencher, an intuitive but unreasonably
expensive program; and Staden, a powerful and free but non-intuitive
package of programs. You will learn how to use the software on example
files that are on your desktop, and then once you have defined your own scfs
from the sequencer upstairs you will analyze your actual data.
Note: The protocols below are generalized protocols. Because this is real
data (and a real research project), not all of the sequences will
necessarily conform exactly to this process. You may need to be creative/try
a few different approaches in order to get this to work. Remember that
patience is a virtue.
Cleaning up the sequence.
We are going to start with the raw data from the sequencer. This data
consists of the sequence of both strands of our insert (and sometimes of the
plasmid vector).
Summary. We want to:
-Open Sequencher, the sequence analysis software.
-Open up the forward and reverse sequence data files.
-Align the two sequences.
-Trim away vector sequences and any poorly sequenced regions at the ends.
Note: This may actually be the most problematic part of the entire process.
Details.
-Open Sequencher, the sequence analysis software. The icon for this
program is located in the pop-up menu. We will need to use ‘demo mode’
for this process, if we do it all at once (and you can't save your results in
demo mode). This first time I just want you to understand how the software
works, so use the demo version so that you know what to expect.
Open Sequencher and start a new project. Under ‘File’, select ‘Import
sequences’. The files are in the folder on the desktop entitled BIO440
sequencing, which contains single curve files (chromatograms from the
sequencing gels). To see these files in Sequencher you will also need to
change the ‘Files of Type’ box at the bottom of the select screen from
‘*.ABI’ to ‘all’.
-Open up the forward and reverse sequence data files for your first sequence.
The files will end in '.ab1'. Files that you generate from the LiCor
Sequencer upstairs will have the ending '.scf'.
Each group should clean up one of the sequences in this project to become
familiar with the Sequencher software, and (if possible) save the final
cleaned consensus sequence in a word document. You will then compare
your cleaned sequence to my version.
Align the forward and reverse sequences. This is actually an alignment
between one of the sequences and the reverse complement of the other
sequence. This is done by highlighting the two sequences and selecting the
‘Assemble automatically’ button. You should get a screen that has a ‘Contig
[0001]’ icon. Select this icon.
If you don’t get the icon, go to ‘Assembly parameters’, and slide the
‘minimum match percentage’ bar a little to the left, then try again.
When you select the ‘contig’, you should get a diagram displaying the
alignment. Take a look at the diagram and see if it makes sense. If it does,
then select ‘Bases’ for a more detailed view of the alignment.
At the top of the screen, the alignment of the two sequences will be
displayed. At the bottom of the screen, the consensus sequence will be
displayed. Where the two sequences are in perfect agreement, the consensus
sequence is unmarked. Where there is a discrepancy between the two
sequences, the consensus sequence is marked with an asterisk. There may be
many discrepancies at the beginning and the end of the consensus sequence,
because these regions represent the very ends (poor quality) of one or the
other of the sequencing reactions. Scroll through the sequence to verify this.
Where one reaction is of the highest quality, the other reaction is of the
lowest quality.
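The comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Sequencher works internally: it assumes the two reads already cover exactly the same region with no gaps, and it writes 'N' at disagreeing positions, which is a choice made for this sketch.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def consensus_with_flags(read_a, read_b):
    """Compare two already-aligned reads of equal length.
    Returns (consensus, flags): agreeing positions keep the base and get a
    space in `flags`; disagreements become 'N' and get '*'."""
    if len(read_a) != len(read_b):
        raise ValueError("reads must be aligned to the same length")
    consensus, flags = [], []
    for a, b in zip(read_a.upper(), read_b.upper()):
        same = a == b
        consensus.append(a if same else "N")
        flags.append(" " if same else "*")
    return "".join(consensus), "".join(flags)

forward = "ACGTTGCA"       # made-up forward read
reverse_read = "TGCATCGT"  # made-up reverse read, as it comes off the sequencer
consensus, flags = consensus_with_flags(forward, reverse_complement(reverse_read))
print(consensus)
print(flags)
```

The reverse read has to be reverse-complemented before the comparison, which is the same step Sequencher performs when it assembles the two files.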
Go to the middle region of the sequence, where there are few
asterisks. Using your cursor, select a base on the consensus sequence that is
NOT marked with an asterisk. Then select ‘Show chromatograms’ to see the
chromatograms representing the raw sequence data. The two chromatograms
should agree well. By changing the base that is selected in the consensus
sequence, you can examine how the chromatograms change in the different
parts of the sequence.
How does the quality of the two chromatograms change as you go from
the beginning of the consensus sequence to the end? What do these
changes represent, physically–why does the quality change?
Now you want to trim away the poorly sequenced regions at the ends of the
sequence. These are the regions with numerous asterisks. (Note that in these
regions one of the two sequences is probably of very high quality [don't
trim it] and the other of very low quality [get rid of it].) To do this, use
the cursor to highlight the poor-quality data, and delete it. If Sequencher
asks you if you want to 'fill from left (or right)', select yes.
Use the chromatograms to try to resolve any discrepancies in the remaining
portion of the sequence. If you can, try to obtain at least 900 bp of
consensus sequence.
-Trim away vector sequences. To do this, we will search for the primer
sequences -- i.e. the primers that you used for the original 16S amplification.
The primers are degenerate, that is, there are some places where there is
more than one nucleotide in the 'conserved' target site. For example, the
forward primer is
5' cctacgggrsgcagcag 3'
where the r stands for a puRine (A, G) and the s represents "Strongly
H-bonding" (C, G).
I realize that this is a somewhat clunky approach - can you think of a better
way to do this?
All of the strains were produced using 341F as the forward primer; some
had 1525R as the reverse primer and others had 1391R as the reverse primer.
primer sequences = 1391R
5' gacgggcggtgtgtgc 3'     5' gacgggcggtgtgtac 3'
in opposite orientation, this =
5' gcacacaccgcccgtc 3'     5' gtacacaccgcccgtc 3'
primer sequences = 341F
5' CCT ACG GGR SGC AGC AG 3'
note this primer is a mixture of 4 slightly different primers
What are the 4 primer sequences? What is the reverse complement of these sequences?
After trimming these sequences away, paste the final cleaned consensus
sequence into your document.
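One possible alternative to searching for each primer variant by hand is to turn the degenerate primer into a regular expression. The Python sketch below is an illustration under stated assumptions: the read is made up, the function names are hypothetical, the two 1391R variants listed above are written in shorthand with an R, and the primer sites themselves are kept in the trimmed result.

```python
import re

# IUPAC codes needed for the primers in this handout; the full table has more.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "S": "[CG]"}
COMPLEMENT = str.maketrans("ACGTRYS", "TGCAYRS")

def reverse_complement(seq):
    """Reverse complement that also handles degenerate codes (R <-> Y, S <-> S)."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def primer_regex(primer):
    """Turn a degenerate primer into a compiled regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))

def trim_to_primers(seq, forward, reverse):
    """Return the part of seq from the forward primer site through the
    reverse primer site (found as its reverse complement), or None."""
    seq = seq.upper()
    fwd = primer_regex(forward).search(seq)
    if fwd is None:
        return None
    rev = primer_regex(reverse_complement(reverse)).search(seq, fwd.end())
    if rev is None:
        return None
    return seq[fwd.start():rev.end()]

FORWARD_341F = "CCTACGGGRSGCAGCAG"
REVERSE_1391R = "GACGGGCGGTGTGTRC"  # shorthand for the two 1391R variants

# Made-up read: vector-like flanks around a very short "insert".
read = "TTTT" + "CCTACGGGAGGCAGCAG" + "ACGTACGT" + "GTACACACCGCCCGTC" + "AAAA"
print(trim_to_primers(read, FORWARD_341F, REVERSE_1391R))
```

A real read may contain sequencing errors inside the primer sites, in which case an exact pattern match like this fails and the trimming has to be done by eye.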
Analyzing sequence data using the Staden program
Using Staden
We can use the Staden programs to process sequence data instead of the Sequencher program. The advantages of using Staden are that it is free, the vector clipping function is easy and works, and you can use it on computers at home. Also, it’s more powerful. However, Staden is not as user-friendly as Sequencher - it is somewhat clunky and takes getting used to.
The Staden Package has 5 programs
PreGap4 – for removing vector sequences & aligning different sequence files
Gap4 – for studying aligned sequences
Trev – Trace viewer
Spin – a set of functions including looking for restriction sites, translating nt to aa, and some alignment functions
Console – I don’t yet know what this does
To clip vector sequences and remove poor quality sequencing data from a raw sequence
To start: You need a folder on the desktop that has the sequences you want to analyze and the file pGemT.txt (a plain text file, not a Word file, containing the sequence of the pGEM-T vector). The sequences you want to analyze can be in .ab1 or .scf format. You need to know the name of this folder, e.g. ‘My BIO440 seqs’.
Start PreGap4. This should open a window with a big blank screen and three folder tabs at the top. These tabs have the names “Files to Process”, “Configure Modules” and “Textual output”.
Under “Configure Modules” set up the general configuration by making sure that the following boxes are checked (it may be a good idea to uncheck the other boxes):
Estimate base accuracies
Trace format conversion
Initialize experiment files
Augment experiment files
Quality clip
Screen for unclipped vector
Cloning vector clip
With Cloning vector clip, you want to deselect it and then reselect it. When it is reselected, you will be prompted on the right hand side of the window to select a cloning vector. Using the browse button, select the pGemT.txt file. Then select ‘save these parameters’.
Gap4 shotgun assembly
After selecting Gap4 shotgun assembly, the program that will align your forward and reverse sequences, you need to select ‘create new database’ and, where it says GAP4 database name, type ‘Align’. When looking at your actual sequence data, you might type your strain name (e.g. 341F17) instead of Align. Then select ‘save these parameters’.
Under “Files to Process” select ‘Add files’. A new window opens.
Change ‘Files of type’ from ABI (*.ab1) to Any (*.*).
In that window, select Desktop from the icons on the left hand side, and then select your folder (e.g. My 440 Seqs).
Then highlight the files you want to analyze and select open, or double-click. For example, you might have the files sample1.ab1, sample2.ab1, HSU1.scf and HSU2.scf.
Then, click run. Some stuff should happen, and you should see the phrase ‘Processing finished’
Using Trev to view vector clip, quality clip, and to edit your sequence
Now go to the desktop and look in your folder (My seqs)…pregap4 has created some new files. You are interested in the .exp files. There should be a .exp file for each of the sequences that you analyzed. To look at one, doubleclick the .exp file, and it will open in Trev. You may have to open Trev first, then open your file.
The sequence should be color-coded – the crosshatched area is bad sequence, the pink area = vector, the light grey area = good sequence data and the dark grey area = not as good seq data. You might not see any vector on the practice files, but should see some on your experimental files.
Under view, select ‘display edits’.
Under Edit, select ‘sequence’. This will cause a new line of sequence data to appear – the edit line. Go to the right hand side of the vector, and using the mouse select the right-most nt of the vector. Use the delete (not backspace) key to remove the vector sequence, and/or poor quality sequence.
Then, go to the far right of the sequence file. Using the quality of the chromatogram, you can delete the poor quality region of the sequence file. Note that you are editing the .exp file and not the original .ab1 or .scf file.
Then, save the .exp file under the file menu. Finally, select File… ‘save as’ … Plain text…and give it a name that corresponds to your sequence followed by the extension .txt. For example sample1.txt
You should now have a text file that contains your sequence which has the vector and poor quality sequence data trimmed.
Note that this is not your aligned sequence, but only your sequence in one direction. To look at your forward and reverse aligned sequences, we will use the program Gap4.
Using GAP4 to examine your aligned sequences.
On the desktop, select Start….Staden Package….Gap4. From the Gap4 window, select File…Open.
In the 440 sequencing folder you have been using, there should be a file with the name Align.0.aux. Open this document.
Select Edit…..Edit contig. Then select OK. The aligned bases should open in a new window.
Select Settings……Trace Display….Embed Traces size 5.
Select Settings……Highlight disagreements by background color.
Now, edit your sequence …I will leave the details of this up to your discretion and experimentation.
Once your editing is complete, go to the left hand side of the consensus sequence and select the first nt. Then, holding down the Shift key, go to the right hand side and select the right-most nt. Copy the selected sequence, and paste it into your Word document. Once you have your new, edited sequence…
Finding the Closest Match in GenBank, and aligning the sequences.
First, blastn search the nucleotide sequence to verify that it is a 16S sequence. Use the discontiguous megablast program at
http://www.ncbi.nlm.nih.gov/
What is the closest match in GenBank? What is the GenBank accession number? What are the corresponding nucleotides in the GenBank sequence? (This information can be obtained from looking at the alignments in the blast output.)
If the closest match isn’t from a cultured organism (genus and species will be named), then what is the closest match which is to a cultured organism? What is the Genbank accession number?
Does this appear to be a 16S sequence?
List the publication information (i.e., authors, title, and journal/date, if any). Also, list the information giving taxonomic details of the organism in the entry. What is the percent identity with the closest match, and with the closest cultured organism? Across how many nucleotides?
Next, use the forward and reverse sequences alone (not the aligned forward and reverse sequence) and carry out discontiguous megablast searches on each of these two sequences. Did all three searches (forward, reverse, and aligned) yield the same result? Describe your results.
Analysis in the Ribosomal Database Project. The RDP is at: http://rdp.cme.msu.edu/index.jsp
We are going to perform a 'Sequence Match' analysis with the small subunit sequences in the database. This allows us to find the sequences most similar to our new sequence. Paste the sequence into the provided space, leave default settings as they are, and select submit sequences. See the 'Seq. Match Info' to interpret your results.
In your own words, compare the output and utility of the blastn and the RDP analysis programs.
Next, we will create an alignment with similar 16S sequences.
At RDPII, go to the 'Online Analyses' page, and use the 'Sequence Aligner' function. Click on run, cut and paste your sequence into the space provided, choose HTML format as output and include 10 sequences. Leave the other defaults as they are.
Examine the results. Is this a good alignment? Were gaps inserted? Were identities to other organisms apparent? Did the sequence match up to other sequences in the database? How closely? What do your results indicate?
Next, use the 'Classifier' program at the RDP to assign a sequence to the taxonomical hierarchy at the RDP. Interpret the output.
Print out the results or paste them into your report.
Write a brief description of the organism that your sequence is likely to have come from.
Do you think that it is likely this organism was isolated from Boiling Springs Lake, or do you think that this organism may represent a contaminant that was introduced during the isolation procedure?