5 PART III: INPUT DATA TYPES AND FILE FORMAT
5.1.3 Sequence Input Data
General Considerations (Sequence Data)
The sequence data must consist of two or more sequences of equal length. All sequences must be aligned; you may use the built-in alignment system for this purpose. Nucleotide and amino acid sequences should be written in IUPAC single-letter codes. Sequences can be written in any combination of upper- and lower-case letters. Special symbols for alignment gaps, missing data, and identical sites can also be included in the sequences.
Special Symbols
Blank spaces and tabs are frequently used to format data files, so MEGA simply ignores them. The ASCII characters period (.), dash (-), and question mark (?) are generally used as special symbols to represent identity to the first sequence, alignment gaps, and missing data, respectively.
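The handling of whitespace and the identity symbol described above can be sketched in a few lines of Python. This is only an illustration of the stated rules, not MEGA's own code; the function name expand_identical and its parameter are hypothetical.

```python
def expand_identical(seqs, identical="."):
    """Drop blanks/tabs, check lengths, and expand identity symbols.

    Assumption: `identical` is the symbol meaning "same as the first
    sequence at this site" (a period by default, as in the text).
    """
    # Blank spaces and tabs carry no meaning, so remove them first.
    cleaned = ["".join(s.split()) for s in seqs]
    if len(cleaned) < 2 or len({len(s) for s in cleaned}) != 1:
        raise ValueError("need two or more sequences of equal length")
    first = cleaned[0]
    if identical in first:
        raise ValueError("the first sequence cannot contain the identity symbol")
    expanded = [first]
    for seq in cleaned[1:]:
        # Copy the first sequence's residue wherever the identity symbol appears.
        expanded.append("".join(ref if ch == identical else ch
                                for ch, ref in zip(seq, first)))
    return expanded
```

Gap (-) and missing-data (?) symbols are left untouched, since they describe the site itself rather than referring back to the first sequence.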
IUPAC single letter codes
Nucleotide or amino acid sequences should be written in IUPAC single-letter codes. The single-letter codes supported in MEGA are as follows.
Symbol | Name | Remarks
DNA/RNA
A | Adenine | Purine
G | Guanine | Purine
C | Cytosine | Pyrimidine
T | Thymine | Pyrimidine
U | Uracil | Pyrimidine
R | Purine | A or G
Y | Pyrimidine | C or T/U
M | | A or C
K | | G or T
S | Strong | C or G
W | Weak | A or T
H | Not G | A or C or T
B | Not A | C or G or T
V | Not U/T | A or C or G
D | Not C | A or G or T
N | Ambiguous | A or C or G or T
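As an illustration, the nucleotide codes listed above can be held in a lookup table that maps each symbol to the bases it may stand for. The sketch below is not part of MEGA; the names IUPAC_NUCLEOTIDES and compatible are hypothetical, and treating U as equivalent to T is an assumption made for simplicity.

```python
# Each IUPAC nucleotide code mapped to the set of bases it can represent.
# Assumption: U (RNA) is treated as T so DNA and RNA share one table.
IUPAC_NUCLEOTIDES = {
    "A": "A", "G": "G", "C": "C", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}


def compatible(a, b):
    """Return True if two nucleotide codes share at least one possible base."""
    # Codes may be written in upper or lower case, so normalise first.
    bases_a = set(IUPAC_NUCLEOTIDES[a.upper()])
    bases_b = set(IUPAC_NUCLEOTIDES[b.upper()])
    return bool(bases_a & bases_b)
```

For example, R (A or G) is compatible with A but not with Y (C or T/U).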
Protein
A | Alanine | Ala
C | Cysteine | Cys
D | Aspartic Acid | Asp
E | Glutamic Acid | Glu
F | Phenylalanine | Phe
G | Glycine | Gly
H | Histidine | His
I | Isoleucine | Ile
K | Lysine | Lys
L | Leucine | Leu
M | Methionine | Met
N | Asparagine | Asn
P | Proline | Pro
Q | Glutamine | Gln
R | Arginine | Arg
S | Serine | Ser
T | Threonine | Thr
V | Valine | Val
W | Tryptophan | Trp
Y | Tyrosine | Tyr
* | Termination | *
Keywords for Format Statement (Sequence data)
Command | Setting | Remark | Example
DataType | DNA, RNA, nucleotide, protein | Specifies the type of data in the file | DataType=DNA
NSeqs | A count | Number of sequences | NSeqs=85
NTaxa | A count | Synonymous with NSeqs | NTaxa=85
NSites | A count | Number of nucleotides or amino acids |
Property | Exon, Intron, Coding, Noncoding, and End | Specifies whether a domain is protein coding. Exon and Coding are synonymous, as are Intron and Noncoding. End specifies that the domain with the given name ends at this point. | Property=cyt_b
Indel | Single character | Use a dash (-) to identify insertions/deletions in sequence alignments | Indel = -
Identical | Single character | Use a period (.) to show identity with the first sequence | Identical = .
MatchChar | Single character | Synonymous with the Identical keyword | MatchChar = .
Missing | Single character | Use a question mark (?) to indicate missing data | Missing = ?
CodeTable | A name | Gives the name of the code table for the protein-coding domains of the data | CodeTable = Standard
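To show how these keyword settings might be read by a program, here is a minimal Python sketch. It assumes the settings are written as a single statement that starts with !Format and ends with a semicolon; that statement syntax is not shown in this section, and the function name parse_format is hypothetical.

```python
import re


def parse_format(statement):
    """Parse 'Key=Value' settings from a format statement into a dict.

    Assumption: the statement looks like '!Format Key=Value ... ;'.
    Keys are lower-cased so lookups are case-insensitive.
    """
    body = statement.strip()
    if not body.lower().startswith("!format") or not body.endswith(";"):
        raise ValueError("expected a statement of the form '!Format ... ;'")
    body = body[len("!format"):-1]
    # Spaces around '=' are tolerated, as in 'Indel = -'.
    pairs = re.findall(r"(\w+)\s*=\s*(\S+)", body)
    settings = {key.lower(): value for key, value in pairs}
    # The count-valued keywords from the table are converted to integers.
    for key in ("nseqs", "ntaxa", "nsites"):
        if key in settings:
            settings[key] = int(settings[key])
    return settings
```

A caller could then compare settings["nseqs"] with the number of sequences actually read, which is one simple consistency check such a table makes possible.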
Defining Genes and Domains
Writing Command Statements for Defining Genes and Domains
The MEGA format can easily designate genes and domains within molecular sequence data. In this format, attributes of individual sites (and of groups of sites, termed domains) are specified within the data "on the spot" rather than in an attributes block before or after the actual data, as is the case in some other data formats. An example of a three-sequence dataset written in MEGA format is shown below. The sequences consist of three genes named FirstGene, SecondGene, and ThirdGene for two groups of organisms (Mammals and Birds). (Note that genes and domains can also be defined interactively through the Setup/Select Genes/Domain dialog box.)
!Gene=FirstGene Domain=Exon1 Property=Coding;
#Human_{Mammal} ATGGTTTCTAGTCAGGTCACCATGATAGGTCTCAAT
#Mouse_{Mammal} ATGGTTTCTAGTCAGGTCACCATGATAGGTCCCAAT
#Chicken_{Aves} ATGGTTTCTAGTCAGCTCACCATGATAGGTCTCAAT
!Gene=SecondGene Domain=Intron Property=Noncoding;
#Human ATTCCCAGGGAATTCCCGGGGGGTTTAAGGCCCCTTTAAAGAAAGAT
#Mouse GTAGCGCGCGTCGTCAGAGCTCCCAAGGGTAGCAGTCACAGAAAGAT
#Chicken GTAAAAAAAAAAGTCAGAGCTCCCCCCAATATATATCACAGAAAGAT
!Gene=ThirdGene Domain=Exon2 Property=Coding;
#Human ATCTGCTCTCGAGTACTGATACAAATGACTTCTGCGTACAACTGA
#Mouse ATCTGATCTCGTGTGCTGGTACGAATGATTTCTGCGTTCAACTGA
#Chicken ATCTGCTCTCGAGTACTGCTACCAATGACTTCTGCGTACAACTGA
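The "on the spot" layout above lends itself to a simple line-by-line reader. The following Python sketch is an illustration under stated assumptions rather than MEGA's parser: it assumes each command statement and each taxon occupies its own line, that only Gene, Domain, and Property settings matter, and the function name parse_mega_blocks is hypothetical.

```python
def parse_mega_blocks(text):
    """Collect per-taxon sequences and 1-based, inclusive domain site ranges."""
    sequences = {}   # taxon name -> concatenated sequence
    domains = []     # one dict per command statement, in file order
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("!"):
            # Command statement such as "!Gene=G Domain=D Property=P;".
            items = line[1:].rstrip(";").split()
            fields = dict(item.split("=", 1) for item in items if "=" in item)
            current = {"gene": fields.get("Gene"),
                       "domain": fields.get("Domain"),
                       "property": fields.get("Property"),
                       "start": None, "end": None}
            domains.append(current)
        elif line.startswith("#"):
            parts = line[1:].split(None, 1)
            # Drop an optional "_{Group}" suffix so blocks join by taxon name.
            taxon = parts[0].split("{", 1)[0].rstrip("_")
            residues = "".join(parts[1].split()) if len(parts) > 1 else ""
            before = len(sequences.get(taxon, ""))
            sequences[taxon] = sequences.get(taxon, "") + residues
            if current is not None:
                if current["start"] is None:
                    current["start"] = before + 1
                current["end"] = len(sequences[taxon])
    return sequences, domains
```

Applied to the three-gene example, this would report FirstGene at sites 1 to 36, SecondGene at 37 to 83, and ThirdGene at 84 to 128, because each block extends the sequences accumulated so far.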
Keywords for Command Statements (Genes/Domains)
Command | Setting | Remark | Example
Domain | A name | Defines a domain with the given name | Domain=first_exon
Gene | A name | Defines a gene with the given name | Gene=cytb
Property | Exon, Intron, Coding, Noncoding, and End | Specifies the protein-coding attribute for a domain. Exon and Coding are synonymous, as are Intron and Noncoding. End specifies that the domain with the given name ends at this point. | Property=cytb
CodonStart | A number | Specifies the site where the next first-codon position will be found in a protein-coding domain | CodonStart=2
Defining Groups
Writing Command Statements for Defining Groups of Taxa
The MEGA format allows you to assign taxa to groups in sequence data files as well as in distance data files. The name of the group is written in a set of curly brackets following the taxon name. The group name can be attached to the taxon name with an underscore or simply appended to it. Note that there must be no spaces between the taxon name and the group name. (Groups of taxa can also be defined interactively through a dialog box.) The following example designates human and mouse as members of the group Mammal and assigns chicken to the group Aves.
!Gene=FirstGene Domain=Exon1 Property=Coding;
#Human_{Mammal} ATGGTTTCTAGTCAGGTCACCATGATAGGTCTCAAT
#Mouse_{Mammal} ATGGTTTCTAGTCAGGTCACCATGATAGGTCCCAAT
#Chicken_{Aves} ATGGTTTCTAGTCAGCTCACCATGATAGGTCTCAAT
!Gene=SecondGene Domain=Intron Property=Noncoding;
#Human ATTCCCAGGGAATTCCCGGGGGGTTTAAGGCCCCTTTAAAGAAAGAT
#Mouse GTAGCGCGCGTCGTCAGAGCTCCCAAGGGTAGCAGTCACAGAAAGAT
#Chicken GTAAAAAAAAAAGTCAGAGCTCCCCCCAATATATATCACAGAAAGAT
!Gene=ThirdGene Domain=Exon2 Property=Coding;
#Human ATCTGCTCTCGAGTACTGATACAAATGACTTCTGCGTACAACTGA
#Mouse ATCTGATCTCGTGTGCTGGTACGAATGATTTCTGCGTTCAACTGA
#Chicken ATCTGCTCTCGAGTACTGCTACCAATGACTTCTGCGTACAACTGA
Setup/Select Taxa & Groups
Data | Setup/Select Taxa & Groups
This invokes the Setup/Select Taxa & Groups dialog box for including or excluding taxa, defining groups of taxa, and editing names of taxa and groups.
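For readers who process such files outside the dialog box, the curly-bracket group syntax described earlier can be split into taxon and group names with a short Python helper. This is a sketch of the stated naming rules only; the function names split_taxon_group and group_members are hypothetical and not part of MEGA.

```python
def split_taxon_group(label):
    """Split a label such as '#Human_{Mammal}' into ('Human', 'Mammal').

    Returns (taxon, None) when no group is given in curly brackets.
    """
    name = label.lstrip("#")
    if any(ch.isspace() for ch in name):
        raise ValueError("no spaces are allowed between taxon and group name")
    if name.endswith("}") and "{" in name:
        taxon, group = name[:-1].split("{", 1)
        # The group may be attached with an underscore or simply appended.
        if taxon.endswith("_"):
            taxon = taxon[:-1]
        return taxon, group
    return name, None


def group_members(labels):
    """Map each group name to the list of taxa assigned to it."""
    groups = {}
    for label in labels:
        taxon, group = split_taxon_group(label)
        if group is not None:
            groups.setdefault(group, []).append(taxon)
    return groups
```

With the labels from the example, this yields Human and Mouse under Mammal and Chicken under Aves.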
Labelling Individual Sites
Site Label
Individual sites in nucleotide or amino acid data can be labeled to construct non-contiguous sets of sites. Site labels can be specified in the input data file, and the Setup Genes and Domains dialog can also be used to assign or edit them. This is shown in the following example of three sequences, in which sites in ThirdGene are labeled with a '+' mark. An underscore marks the absence of any label.
!Gene=FirstGene Domain=Exon1 Property=Coding;
#Human_{Mammal} ATGGTTTCTAGTCAGGTCACCATGATAGGTCTCAAT
#Mouse_{Mammal} ATGGTTTCTAGTCAGGTCACCATGATAGGTCCCAAT
#Chicken_{Aves} ATGGTTTCTAGTCAGCTCACCATGATAGGTCTCAAT
!Gene=SecondGene Domain=AnIntron Property=Noncoding;
#Human ATTCCCAGGGAATTCCCGGGGGGTTTAAGGCCCCTTTAAAGAAAGAT
#Mouse GTAGCGCGCGTCGTCAGAGCTCCCAAGGGTAGCAGTCACAGAAAGAT
#Chicken GTAAAAAAAAAAGTCAGAGCTCCCCCCAATATATATCACAGAAAGAT
!Gene=ThirdGene Domain=Exon2 Property=Coding;
#Human ATCTGCTCTCGAGTACTGATACAAATGACTTCTGCGTACAACTGA
#Mouse ATCTGATCTCGTGTGCTGGTACGAATGATTTCTGCGTTCAACTGA
#Chicken ATCTGCTCTCGAGTACTGCTACCAATGACTTCTGCGTACAACTGA
!Label +++__-+++-a-+++-L-+++-k-+++123+++-_-+++---+++;
Each site can be associated with only one label. A label can be a letter or a number.
For analyses that require codons, MEGA includes only those codons in which all three positions are given the same label. This site labeling system facilitates the analysis of specific sites, as is often required when comparing sequences of regulatory elements, intron-splice sites, and antigen recognition sites in genes such as those of the Major Histocompatibility Complex.
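The codon rule stated above (all three positions must carry the same label) can be illustrated with a short Python sketch. It is an assumption-laden example, not MEGA's implementation: the function name codons_with_uniform_label is hypothetical, and the reading frame is assumed to start at the site given by codon_start.

```python
def codons_with_uniform_label(labels, wanted, codon_start=1):
    """Return 1-based start sites of codons fully labeled with `wanted`.

    `labels` holds one label character per site; `codon_start` is the
    1-based site of the first first-codon position (an assumption here).
    """
    starts = []
    # Step through the label string one codon (three sites) at a time.
    for i in range(codon_start - 1, len(labels) - 2, 3):
        if labels[i] == labels[i + 1] == labels[i + 2] == wanted:
            starts.append(i + 1)
    return starts
```

Applied to the label string of the example with wanted set to '+', every other codon qualifies, because the intervening codons contain other labels or unlabeled sites.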
Labeled Sites
Sites in a sequence alignment can be categorized and labeled with user-defined symbols. Each category is represented by a letter or a number. Each site can be assigned to only one category, although any combination of categories can be selected for analysis.
Labeled sites work independently of and in addition to genes and domains, thus allowing complex subsets of sites to be defined easily.
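Selecting sites by any combination of categories amounts to filtering the alignment columns by their labels. The Python sketch below shows the idea for a single sequence; the function name select_sites is hypothetical, and the label string is assumed to hold exactly one label character per site.

```python
def select_sites(sequence, labels, categories):
    """Keep only the residues whose site label is in `categories`."""
    if len(sequence) != len(labels):
        raise ValueError("sequence and label string must have equal length")
    # Each site belongs to at most one category, so a membership test suffices.
    return "".join(residue for residue, label in zip(sequence, labels)
                   if label in categories)
```

For instance, select_sites("ATCTGC", "+_a+1_", {"+", "1"}) keeps the first, fourth, and fifth sites. The same filter could be applied after restricting the sites to a gene or domain, which reflects how labels work in addition to those definitions.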