
2. USING SEQUENCES

TABLE OF CONTENTS

This Using Sequences section tells how to name and find sequences, and lists all logical names used in HUSAR/GCG.

TYPES OF SEQUENCE FILES

Database sequences

Single sequence files

List files (formerly Files of Sequence Names)

Multiple Sequence Format files (MSF)

USING DATABASE SEQUENCES

Specifying Database Sequences by Name

Specifying Database Sequences by Accession Number

USING SINGLE SEQUENCE FILES

Creating and Editing Single Sequences

Specifying Single Sequence Files

Specifying Sequence Type

USING LIST FILES

Creating and Editing List Files by Hand

Programs that Create List Files

Specifying List Files

USING MULTIPLE SEQUENCE FORMAT (MSF) FILES

Programs that Create MSF Files

Editing MSF Files

FINDING AND COPYING DATABASE SEQUENCE FILES

Finding Database Sequences

Copying Sequences from the Databases

Viewing Database Sequences

Viewing Sequences in your directories

Reformatting Sequence Files to GCG Format

Using Personal Databases

Refining a Sequence List

DATABASE LOGICAL NAMES FOR THE HUSAR/GCG PACKAGE

Nucleic Acid Databases

Subdivision of the Nucleic Acid Databases

Protein Databases

Subdivisions of the Translated Nucleic Acid Sequences Databases

IRX


TYPES OF SEQUENCE FILES

The HUSAR Package works with many different types of sequences:

Database sequences

Includes sequences from different databases such as EMBL, GenBank, SwissProt, PIR and several others (for a complete list, see the Database Logical Names for the HUSAR/GCG Package section of this chapter).

Single sequence files

Includes individual sequence files in your personal directories. These sequences include ones created with SeqEd, reformatted sequences, those you copied from a database into your personal directories, and those created with other software and reformatted to use with the HUSAR/GCG programs.

List files (formerly Files of Sequence Names)

Includes a list of sequence specifications (sequence names) but no sequence data. List files also can include sequence specifications containing wildcards and nested list files (or list files within list files).

Multiple Sequence Format files (MSF)

Includes two or more sequences aligned together. MSF files are created by HUSAR/GCG programs such as PileUp, LineUp, Clustal, MultAlign, MAlign2MSF and Tree2MSF.


USING DATABASE SEQUENCES

The HUSAR/GCG package provides you access to the following nucleotide and protein sequence databases.

LOGICAL NAME    DESCRIPTION

GeAll           EMBL + EmNew + GbOnly
GenBank         Nucleic Acid Sequences
EMBL            Nucleic Acid Sequences
EmNew           EMBL Updates since last release
EmLast          Last Week Update of EMBL Database
SwissPir        SwissProt + PirOnly + SwNew
SwissPirPlus    SwissPir + Part of TREMBL
PIR             Protein Sequences
PIROnly         PIR Sequences not contained in SwissProt
SwissProt       Protein Sequences
SwNew           SwissProt Updates since Last Release
MIPSX/PATCHX    Protein Sequences
TREMBL          Translated EMBL
GenPept         Translated GenBank
VecBase         Vector Sequences
VectorDB        Vector Sequences
HIV-Base        Human Retroviruses Nucleic Acid Sequences
HIV-Prot        Human Retroviruses Protein Sequences
NRL3d           Protein Sequences from Protein Data Bank (PDB)
Kabat           Nucleic Acid Sequences of Immunological Interest
KabatProt       Protein Sequences of Immunological Interest
Ecoli           E. coli Nucleic Acid Sequences
RNA5s           Ribosomal 5S RNA Sequences
AluBase         Alu Sequences
Yeast           Yeast Complete Genome
Yeast_EST       Yeast Expressed Sequence Tags (TIGR)
YeastProt       Yeast Protein Sequences (MIPS)
EPD             Eukaryotic Promoter Sequences (-499:+100)
SBASE           Protein Domains

In addition to these logical names, there are a number of abbreviated names that have the same meanings but require less typing. There are also a number of logical names that refer to the individual subsections of either the GenBank or EMBL data libraries. For example, "gb_ba" refers to only those sequences in the bacterial subsection of GenBank. All of these additional logical names are listed at the end of this section where each data collection is listed and briefly described.

To find more information about the databases, read the release notes that accompany each database release by typing % about databasename, for example % about embl.

Specifying Database Sequences by Name

You can specify database sequence entries by name. Note, however, that a sequence name is subject to change from release to release of the database. For instance, let’s say an existing database sequence is merged with another sequence. The complete merged sequence may acquire the name of the second sequence while the first sequence name is omitted. A more stable way of tracking a sequence from release to release is by its accession number, as is described in the next task.

To specify a database sequence entry by name, choose one of the following:

Note: Database names are case-insensitive. That is, you can type them in uppercase, lowercase, or mixed case.

Single Sequence

The name of a sequence in a database consists of a database name (such as "GenBank") followed by a colon (:) followed by a sequence name (Dro5s). For example, if a HUSAR/GCG program prompts you for a sequence, you could respond with GenBank:Dro5s. You will notice that in most cases there is more than one logical name that refers to a database. Thus, you could refer to this same sequence as GB:Dro5s.

Multiple Sequences

If a program prompt asks you "What sequence(s)?", it implies that the program can accept multiple sequences. You can specify multiple sequences in the databases using an asterisk (*) wildcard. For example, GeALL:Hiv* refers to all sequences in GeALL whose names start with "Hiv". Or, GenBank:* refers to all sequences in the GenBank database.

Specifying Database Sequences by Accession Number

The names of entries in the major data collections change, and the same entry may have a different name in EMBL and GenBank. Because of this, sequences are increasingly referred to by accession number in publications. These accession numbers are more stable than the entry names, and they are consistent between EMBL and GenBank. The SWISS-PROT protein database has cross-references to EMBL based on accession numbers. When a sequence is first entered into EMBL, GenBank, or SWISS-PROT, it is assigned a new primary accession number. If that sequence is ever merged with another sequence, the accession number of the original sequence becomes a secondary accession number in the newly merged sequence.

The GCG Package lets you name sequences by accession number. The GCG Package does not distinguish between primary and secondary accession numbers, as long as the number you use does not occur in more than one entry. If you use an accession number that does not occur or that occurs more than once, the GCG Package acts as if the sequence you have named cannot be found.

If the number you use to name a sequence has become a secondary accession number, there is no guarantee that it is exactly the same as the sequence that someone else has cited using that same number. You can only be sure that some or all of the original sequence is now contained in the entry you have found. You can tell if a number is a secondary accession number by fetching the sequence and reading the documentation at the top. The primary accession number always comes before the secondary numbers.

The syntax for specifying sequences by accession number is the same as for naming sequences by entry name. For example, entering the sequence gb:v00580 is equivalent to entering the sequence gb:humrep2. The specification consists of a database logical name, followed by a colon, followed by an accession number. Accession numbers start with one or two alphabetic characters and end with five or six digits. In PIR, GenBank, SwissProt and EMBL, accession numbers are always six characters long. You cannot use wildcards to specify sequences by accession number.
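The naming rules above can be illustrated with a short Python sketch. It is not part of HUSAR/GCG; the function name parse_database_spec and the regular expression are assumptions made for illustration. The pattern simply encodes "one or two letters followed by five or six digits", so an entry name that happens to have that shape would be misread as an accession number.

```python
import re

# Assumed accession pattern, taken from the description above:
# one or two letters followed by five or six digits.
ACCESSION_RE = re.compile(r"^[A-Za-z]{1,2}\d{5,6}$")


def parse_database_spec(spec):
    """Split 'database:entry' and guess what kind of entry it names."""
    database, sep, entry = spec.partition(":")
    if not sep or not database or not entry:
        raise ValueError(f"expected 'database:entry', got {spec!r}")
    return {
        "database": database.lower(),            # logical names ignore case
        "entry": entry,
        "wildcard": "*" in entry,                # only entry names may use '*'
        "accession": bool(ACCESSION_RE.match(entry)),
    }
```

For example, parse_database_spec("gb:v00580") reports an accession number, whereas parse_database_spec("GeALL:Hiv*") reports a wildcard entry name in the database geall.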

If you don’t know the database of the accession number, type % typedata -REFerence accessionnumber, for example % typedata -REFerence j00411. The program finds the sequence file in the appropriate database and displays its reference information (that is, everything but the sequence itself) on your screen. The first line of this reference information tells you the database in which the sequence resides. If you also want to see the sequence information, use % typedata without the -REFerence parameter. Or, if you want to copy the sequence to your directory, use the Fetch program.


USING SINGLE SEQUENCE FILES

Much of the work you perform may revolve around single sequences, which are sequence files stored in your personal directories. There are two ways to create single sequence files: 1) by using SeqEd or 2) by using a text editor and the Reformat program.

You can store single database sequences in your personal directories as well as import single sequences created by other sequence analysis software and reformat them to use with the Wisconsin Package. For more information on importing sequences, see the "Reformatting Sequence Files to GCG Format" section in this chapter.

Creating and Editing Single Sequences

You can create sequences from scratch in the Wisconsin Package or edit existing sequences. Each sequence must have a "type" associated with it, denoting the sequence as either a nucleotide or a protein. To specify the sequence type, you can add the parameter -NUCleotide or -PROtein to the command line when you run SeqEd or Reformat. If you forget to do so, the programs determine the type for you based on the symbols in the sequence. Note that because nucleotide and protein sequences share some symbols, the programs can guess incorrectly at the sequence type.

If you use SeqEd, a screen-oriented editor, you can enter and check a sequence rapidly. For more information on SeqEd, see menu 2, Sequence Editing and Manipulation, of the Program Manual or the Full Description section in this User’s Guide.

If you use the text editor of your choice to create a file, then you must reformat it into GCG format using the Reformat program.

1) Type the sequence information in the text editor of your choice, for example pico or vi. Include the following information:

Heading. (optional) May contain any number of lines of text at the top of the file describing the sequence.

Dividing Line. Consists of a single line containing two periods in succession (..) to separate header information from the sequence. This line is required only if you include header information.

Sequence. Contains the sequence information in any format. Each line of the sequence cannot be longer than 512 characters.

2) Save the file.

3) Use Reformat to rewrite the sequence file into GCG format. To do so, type % reformat -NUCleotide filename, or % reformat -PROtein filename. For more information on Reformat, see menu 13, File Transfer and Sequence Formatting, of the Program Manual.

Note: You can use a text editor to modify existing sequence files, although we do not recommend this method. Once you modify a sequence with a text editor, the checksum of the sequence changes, and HUSAR/GCG programs will not recognize the sequence. Therefore, if you use a text editor to modify a sequence, you must use the Reformat program to rewrite the file into GCG format.
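If you prepare the text file with a script rather than an editor, the layout described in step 1 (optional heading, dividing line, sequence lines of at most 512 characters) can be sketched as follows. This is a minimal illustration in Python; the function name build_raw_sequence_text and the default line width of 60 are assumptions, and the result still has to be passed through Reformat.

```python
MAX_LINE = 512  # line-length limit stated in step 1 above


def build_raw_sequence_text(sequence, heading=None, width=60):
    """Return text for a pre-Reformat file: optional heading, '..', sequence."""
    if not 0 < width <= MAX_LINE:
        raise ValueError("width must be between 1 and 512")
    parts = []
    if heading:
        if ".." in heading:
            # Assumption: a '..' inside the heading could be mistaken
            # for the dividing line, so it is rejected here.
            raise ValueError("heading must not contain '..'")
        parts.append(heading.rstrip("\n"))
        parts.append("..")  # dividing line, needed only with a heading
    compact = "".join(sequence.split())  # drop whitespace inside the sequence
    parts.extend(compact[i:i + width] for i in range(0, len(compact), width))
    return "\n".join(parts) + "\n"
```

Write the returned text to a file, then run % reformat -NUCleotide filename or % reformat -PROtein filename on it as described in step 3.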


Specifying Single Sequence Files

If you specify a sequence file in response to a program prompt, choose one of the following:

Single Sequence

If you are running a program in the directory containing the sequence file, type the name of the file, for example gamma.seq. If the sequence file is in a directory other than the one where you are currently running the program, type the directory and file specification, for example, /smith/test/gamma.seq.

Multiple Sequences

If a HUSAR/GCG program’s sequence prompt asks you "What sequence(s)?" instead of "What sequence?", the (s) implies that the program can accept a group of sequences. Groups of sequences can be named with expressions containing wildcard characters; for example, geall:ad*, where the "*" (asterisk) is the wildcard. This expression refers to all nucleic acid sequences whose names start with the characters "Ad" and continue with anything (or nothing). In another example, the expression geall:* refers to all nucleic acid sequences. Sequence names such as geall:* are said to be ambiguous. You can also use wildcards in conjunction with your private sequence files. For example, gam* refers to all the sequence files in your directory starting with "gam".

Sequence entry names can be specified ambiguously. For database names, only the sequence name field can contain wildcards; for file names, any ambiguous file specification that can be interpreted by the operating system is acceptable. You cannot specify a group of sequences ambiguously by accession number.

Specifying Sequence Type

In previous versions of the Wisconsin package, you could specify the sequence type, that is, if it was a nucleotide or protein, on the command line when you ran a program. In Version 8.1 of the Package, however, the sequence type is an inherent part of the sequence, and it cannot be changed from the command-line. To change the sequence type, you must change the sequence file itself.

Use the Reformat program to change the sequence type. Type % reformat -NUCleotide filename or % reformat -PROtein filename.

Otherwise you can edit the file using the text editor of your choice, for example pico.

1) Find the sequence type listed after "Type:" in the line preceding the sequence information.

2) Change the type to "N" for nucleotide or "P" for protein.

3) Save and exit the file.
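The manual edit in steps 1 to 3 can also be sketched in Python. This is only an illustration under stated assumptions: the function name set_sequence_type is hypothetical, and the sketch assumes that the "Type:" field sits on the first line ending in two periods (..), as it does in the MSF example in this chapter. A description line that itself ends in ".." would be mistaken for that line.

```python
import re


def set_sequence_type(text, new_type):
    """Rewrite the Type: field on the first line that ends in '..'."""
    if new_type not in ("N", "P"):
        raise ValueError("type must be 'N' (nucleotide) or 'P' (protein)")
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            # Replace only the single letter that follows 'Type:'.
            updated, count = re.subn(
                r"(Type:\s*)[A-Za-z]\b",
                lambda m: m.group(1) + new_type,
                line,
                count=1,
            )
            if count == 0:
                raise ValueError("no Type: field on the dividing line")
            lines[i] = updated
            return "\n".join(lines) + "\n"
    raise ValueError("no dividing line found")
```

Using Reformat remains the documented way to change the type; the sketch only mirrors the text-editor procedure.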


USING LIST FILES

(Formerly Files of Sequence Names)

A list file, formerly known as a file of sequence names, is what its name implies: a file containing a list of sequences. List files are helpful for naming groups of sequences whose names do not have characters in common: that is, those sequences for which you cannot use a wildcard to name multiple sequences. You will also find list files useful for specifying sequences from multiple locations (such as different databases, or single sequences and MSF sequences in your personal directories) in one file as input to a program. List files can contain any number of the following types of sequences:

- Single sequences from the database or your personal directories, for example, GB_In:Dro5s or /smith/projekt/gamma.seq.

- Database sequence names using asterisk (*) wildcards, for example GenBank:Hum*. Note that you cannot use wildcards to include multiple sequences from your personal directories, for example /smith/projekt/*.seq.

- Names of other list files, for example, @hsp70.list.

- Sequences in multiple sequence format (MSF) files, for example pileup.msf{ssa} or pileup.msf{*}.

You can use list files with any program that accepts multiple sequences as input. A program prompt asking "What sequence(s)?" implies that the program accepts multiple sequences. Below is an example of a list file.

    This is an example of a list file for heat shock proteins (HSP70).
    The list contains
    - single and ambiguous sequences
    - nested lists
    - databases ambiguous specifications
    - msf specifications
    ..
    gb_ov:xlhsp70 Begin:488 End:2428 Strand:+ Circ:F Wgt:1.00 !heat shock protein 70
    gb_ov:chhkhsp Begin:392 End:2293 Strand:+ Circ:F Wgt:1.00 !J02579 Chicken 70 kd
    gb_in:the70hsp Begin:445 End:2382 Strand:+ Circ:F Wgt:1.00 !T.annulata (hsp 70.1)
    gb_in:ldhsp70 Begin:165 End:2123 Strand:+ Circ:F Wgt:1.00 !Leishmania hsp70 gene
    gb_pl:M27825 Begin:591 End:2618 Strand:+ Circ:F Wgt:1.00 !B.lactucae (hsp70)
    sw:hsp*
    /smith/project/my_hsp70.seq
    /smith/project/@hsp70.list
    /smith/project/pileup_hsp.msf{*}

In addition to sequence specifications, each entry in a list file may optionally contain sequence attributes. These attributes include:

Begin:n. An editable field showing the base position you want to start with, where n= 1 to the length of the sequence.

End: n. An editable field showing the base position you want to end with, where n = 1 to the length of the sequence.

Strand: + or -. An editable field defining the forward or reverse complement nucleic acid sequence strand, where + = forward strand and - = reverse strand.


Circ:T or F. An editable field defining the strand as linear or circular, where T = circular and F = linear.

Wgt: n.n. An editable field defining the sequence weight, or the significance of the sequence in comparison to other sequences. That is, you may not want all sequences accounted for equally to determine a result. Therefore, you can weight some greater than others. This attribute is of use only when you are using two or more sequences in the analysis.

Join: Sequence_name. An editable field indicating that the sequence segment should be concatenated with the next sequence in the list that has an identical Join:Sequence_name attribute. Several contiguous sequences specified in a list file with the same Join:Sequence_name attribute can be concatenated together. (Assemble and Translate are the only GCG programs that use the Join attribute.)

Note At the moment, only Assemble, Clustal, CodonFrequency, MAlign, Motif, MultAlign, PileUp, ProfileMake, Translate and Tree use some or all of these sequence attributes in the command-line version of HUSAR/GCG.
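To make the structure of a list file concrete, here is a minimal Python sketch that reads one. It is not how the HUSAR/GCG programs parse list files; the function name parse_list_file and the regular expression are assumptions for illustration. The sketch treats everything up to the first line ending in two periods (..) as description, skips entries that start with an exclamation point, and collects the Begin, End, Strand, Circ, Wgt and Join attributes described above.

```python
import re

# Attribute names as described above; a space after the colon is tolerated.
ATTRIBUTE_RE = re.compile(r"(Begin|End|Strand|Circ|Wgt|Join):\s*(\S+)",
                          re.IGNORECASE)


def parse_list_file(text):
    """Return a list of (specification, attributes, comment) tuples."""
    entries = []
    in_list = False
    for raw in text.splitlines():
        line = raw.strip()
        if not in_list:
            # Assumption: the first line ending in '..' is the dividing line.
            in_list = line.endswith("..")
            continue
        if not line or line.startswith("!"):
            continue  # blank line or commented-out entry
        body, _, comment = line.partition("!")
        parts = body.split(None, 1)
        if not parts:
            continue
        spec = parts[0]
        rest = parts[1] if len(parts) > 1 else ""
        attrs = {k.capitalize(): v for k, v in ATTRIBUTE_RE.findall(rest)}
        entries.append((spec, attrs, comment.strip()))
    return entries
```

Each returned tuple holds the sequence specification exactly as written (database entry, file name, @list file, or MSF specification), so nested list files and wildcards still have to be resolved by the caller.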

Creating and Editing List Files by Hand

When you create a list file with a text editor, follow the steps below.

1) Open a new file with the text editor of your choice, for example pico.

2) Type the appropriate information. A list file contains the following optional and required elements (see the list file example earlier in this section):

Description. (optional) Contains informative text, including the date of creation, describing what is in the file.

Dividing Line. (required) Includes two periods ( .. ) that must appear on the line preceding the sequence list.

Sequence List. (required) Includes the single sequence from your personal directory or a database, sequence specifications using wildcards, MSF files, or list files. You must provide the database or directory specification. You can add sequences in any order.

Sequence Attributes. (optional) Can include the begin and end position, indicate the forward or reverse strand, define the strand as linear or circular, give the sequence a weight in comparison with other sequences in the list.

Sequence Comments. (optional) Includes an exclamation point ( ! ) followed by a short comment or definition of the sequence(s) or list file.

3) Save and exit the file.

If you "comment out" unwanted sequences instead of deleting them, you can use them at a later time. To do so, type an exclamation point (!) in front of the name of each sequence you do not want.
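Commenting out entries can also be scripted. The following Python sketch is an illustration only; the function name comment_out is hypothetical, and it assumes that the first line ending in two periods (..) is the dividing line and that specifications are matched exactly as typed.

```python
def comment_out(list_text, unwanted):
    """Prefix '!' to list entries whose specification is in `unwanted`."""
    unwanted = set(unwanted)
    result = []
    in_list = False
    for line in list_text.splitlines():
        stripped = line.strip()
        if not in_list:
            # Description lines and the dividing line are copied unchanged.
            in_list = stripped.endswith("..")
        elif stripped and not stripped.startswith("!"):
            if stripped.split()[0] in unwanted:
                line = "!" + line
        result.append(line)
    return "\n".join(result) + "\n"
```

Removing the leading exclamation point later restores the entry, which is the advantage over deleting the line.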

Programs that Create List Files

Some HUSAR/GCG programs can produce output in list file format. Any program that creates multiple sequence output files and can organize those sequence specifications in a list file supports the -LIStfile parameter. You can then use that list file as input to other programs.

Programs that can create list output files, their parameters (if necessary), and their locations in the Program Manual are listed below.

Program         Parameter       Program Manual Chapter
                (if necessary)

Assemble        -LIStfile       Sequence Editing and Manipulation
Corrupt         -LIStfile       Sequence Editing and Manipulation
FastA           -NOALIGN        Data Base Searching
FindPatterns    -NAMes          Pattern Recognition and Composition Analysis
Framesearch     -NoAlign        Data Base Searching
FromEMBL        -LIStfile       File Transfer and Sequence Formatting
FromFastA       -LIStfile       File Transfer and Sequence Formatting
FromGenBank     -LIStfile       File Transfer and Sequence Formatting
FromIG          -LIStfile       File Transfer and Sequence Formatting
FromPIR         -LIStfile       File Transfer and Sequence Formatting
IRX                             Data Base Utilities
LineUp                          Multiple Sequence Alignment
Motifs          -NAMes          Protein Sequence Analysis
Names                           Data Base Utilities
Pretty          -UGLy           Multiple Sequence Alignment
ProfileSearch                   Multiple Sequence Alignment
Reformat        -LIStfile       File Transfer and Sequence Formatting
Sample          -LIStfile       Sequence Editing and Manipulation
Simplify        -LIStfile       Sequence Translation and Conversion
StringSearch                    Data Base Utilities
TFastA          -NOALIGN        Data Base Searching
Translate       -LIStfile       Sequence Translation and Conversion
WordSearch                      Data Base Searching
TWordSearch                     Data Base Searching

Specifying List Files

When you name a list file to a HUSAR/GCG program, you must precede its name with an @ (at) character. If the list file contains the name of another list file, that second file’s name must also be preceded by an @ character, for example, @hsp70.list. The name of a list file should not be ambiguous.

Note: You cannot use wildcards to specify a list file. For example, you cannot specify @hsp*.list.


USING MULTIPLE SEQUENCE FORMAT (MSF) FILES

Several HUSAR/GCG programs write out files that have many sequences aligned together. These multiple sequence format files are often referred to as MSF files in our documentation. MSF files include not only the sequence name but also the sequence itself, which is usually aligned with the other sequences in the file. Several HUSAR/GCG programs, including Clustal, LineUp, MultAlign, PileUp, and Reformat, can create MSF files. You can specify a single sequence within an MSF file, a subset of sequences, or all sequences. Like other sequences, the sequences in MSF files can be used with other HUSAR/GCG sequence analysis programs.

Below is an example of an MSF file created with PileUp.

    Globin peptides

    globin.msf  MSF: 168  Type: P  January 31, 1996 16:30  Check: 5085 ..

    Name: hbhagf  Len: 168  Check: 6042  Weight: 1.00
    Name: hbrlam  Len: 168  Check: 3856  Weight: 1.00
    Name: hbbhum  Len: 168  Check: 5373  Weight: 1.00
    Name: hbghum  Len: 168  Check: 7818  Weight: 1.00
    Name: hbahum  Len: 168  Check: 5322  Weight: 1.00
    Name: myohum  Len: 168  Check: 9191  Weight: 1.00
    Name: mycrhi  Len: 168  Check: 7483  Weight: 1.00

    //

            1                                                   50
    hbhagf  PITDHGQPPT LSEGDKKAIR ESW...PQIY KNFEQNSLAV LLEFLKKFPK
    hbrlam  PIVDSGSVAP LSAAEKTKIR SAW...APVY SNYETSGVDI LVKFFTSTPA
    hbbhum       ...VH LTPEEKSAVT ALW...GKV. .NVDEVGGEA LGRLLVVYPW
    hbghum       ...GH FTEEDKATIT SLW...GKV. .NVEDAGGET LGRLLVVYPW
    hbahum        ...V LSPADKTNVK AAW...GKVG AHAGEYGAEA LERMFLSFPT
    myohum        ...G LSDGEWQLVL NVW...GKVE ADIPGHGQEV LIRLFKGHPE
    mycrhi        ...S LQPASKSALA SSWKTLAKDA ATIQNNGATL FSLLFKQFPD

            51                                                 100
    hbhagf  AQDSFPKFSA KKS..HLEQD PAVKLQAEVI INAVNHTIGL MDKEAAMKKY
    hbrlam  AQEFFPKFKG MTSADQLKKS ADVRWHAERI INAVNDAVAS MDDTEKMSMK
    hbbhum  TQRFFESFGD LSTPDAVMGN PKVKAHGKKV LGAFSDGLAH LDN...LKGT
    hbghum  TQRFFDSFGN LSSASAIMGN PKVKAHGKKV LTSLGDAIKH LDD...LKGT
    hbahum  TKTYFPHF.D LSHGSA.... .QVKGHGKKV ADALTNAVAH VDD...MPNA
    myohum  TLEKFDKFKH LKSEDEMKAS EDLKKHGATV LTALGGILKK KGH...HEAE
    mycrhi  TRNYFTHFGN MSDA.EMKTT GVGKAHSMAV FAGIGSMIDS MDDADCMNGL

You may find the following components in an MSF file:

Description. (optional) Contains informative text describing what is in the file. You can add this information to the top of the MSF file using a text editor.

Dividing Line. (required) Must include the following attributes:

MSF. Displays the number of bases or residues in the multiple sequence alignment.

Checksum. Displays an integer value that characterizes the contents of the file.

Two periods (..). Acts as a divider between the descriptive information and the following sequence information.

Name/Weight. (required) Must include the name of each sequence included in the alignment, as well as its length and checksum (both non-editable) and weight (editable). The checksum of the individual sequences is important as a safety measure to ensure that you do not change the sequence data inadvertently. If this has happened, you will not be able to use the sequence(s) within the MSF file. You then can use the Reformat program to reformat the sequences and create a new checksum to reflect the file’s edited contents.

Separating Line. (required) Must include two slashes (//) to divide the name/weight information from the sequence alignment.

Multiple Sequence Alignment. (required) Must include each sequence named in the above Name/Weight lines. This alignment allows you to view the relationship among sequences.
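The components listed above can be illustrated with a small Python sketch that reads the Name/Weight lines of an MSF file. It is not part of HUSAR/GCG; the function name read_msf_names and the regular expression are assumptions based on the layout of the example file. Name lines that begin with an exclamation point are skipped, because HUSAR/GCG programs ignore such commented-out sequences.

```python
import re

# Layout assumed from the example: Name: x  Len: n  Check: n  Weight: n.nn
NAME_RE = re.compile(
    r"^\s*Name:\s*(\S+)\s+Len:\s*(\d+)\s+Check:\s*(\d+)\s+Weight:\s*([\d.]+)"
)


def read_msf_names(text):
    """Return [(name, length, checksum, weight)] from the MSF header."""
    records = []
    seen_divider = False
    for line in text.splitlines():
        stripped = line.strip()
        if not seen_divider:
            # The dividing line ends in two periods (..).
            seen_divider = stripped.endswith("..")
            continue
        if stripped == "//":
            break  # the alignment itself starts after the separating line
        if stripped.startswith("!"):
            continue  # commented-out sequence
        match = NAME_RE.match(line)
        if match:
            name, length, check, weight = match.groups()
            records.append((name, int(length), int(check), float(weight)))
    return records
```

The sketch only reads the header; it does not verify the checksums or read the alignment.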

Programs that Create MSF Files

Programs that can create MSF files, their parameters (if necessary), and their locations in the Program Manual are listed below.

Program         Parameter       Program Manual Chapter
                (if necessary)

Clustal         -MSF            Multiple Sequence Alignment
LineUp          -MSF            Multiple Sequence Alignment
MAlign2MSF                      Multiple Sequence Alignment
MultAlign       -MSF            Multiple Sequence Alignment
PileUp                          Multiple Sequence Alignment
Reformat        -MSF            File Transfer and Sequence Formatting
Tree2MSF                        Multiple Sequence Alignment

Note: If you use % reformat -MSF to create an MSF file, it does not align the sequences.

Editing MSF Files

To edit an MSF file, use LineUp. For more information, see the "Multiple Sequence Alignment" chapter of the Program Manual.

You also can use a text editor to modify an MSF file. If you do so, however, the file’s checksum changes, and HUSAR/GCG programs will not recognize the file. Therefore, if you use a text editor to modify an MSF file, you must use the Reformat program with the -MSF parameter to rewrite it into GCG format.

Specifying MSF Sequences

Single Sequence

The way to specify a sequence in an MSF file is to name the file, then follow it with the name of the sequence entered within curly brackets ({ and }). For example, if the name of the file above were globin.msf, you could specify the sequence hbrlam within it by entering the expression globin.msf{hbrlam}.

Multiple Sequences

You can also specify the sequences within an MSF file ambiguously by using wildcard characters within the curly brackets. For example, globin.msf{hb*} specifies all sequences in globin.msf beginning with "hb", whereas globin.msf{*} specifies every sequence in globin.msf.


Note You cannot use wildcard characters in the file name part of an MSF sequence specification (that is, you cannot specify glo*.msf). You can use wildcards only between the curly brackets. Also, an MSF sequence specification must contain a sequence name or ambiguous sequence expression within curly brackets; the file name alone is not enough. HUSAR/GCG programs will ignore any sequence in an MSF file when the name/weight line for that sequence (at the top of the file, before the actual alignment) begins with an ’!’. "Commenting-out" sequences in this manner provides an easy way to refine the list of sequences in an MSF file for use with HUSAR/GCG programs.
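The rules for MSF sequence specifications can be summarized in a short Python sketch. The function name select_msf_sequences is hypothetical, and matching names without regard to case is an assumption rather than documented behavior. The sketch rejects a specification without curly brackets and one with a wildcard in the file name, then returns the sequence names that the expression between the brackets selects.

```python
import re
from fnmatch import fnmatchcase

# file name, then a name or wildcard expression inside curly brackets
MSF_SPEC_RE = re.compile(r"^([^{}]+)\{([^{}]+)\}$")


def select_msf_sequences(spec, names_in_file):
    """Return (file name, matching sequence names) for 'file{pattern}'."""
    match = MSF_SPEC_RE.match(spec)
    if not match:
        raise ValueError("an MSF specification needs a name or pattern in {}")
    filename, pattern = match.groups()
    if "*" in filename:
        raise ValueError("wildcards are allowed only inside the curly brackets")
    pattern = pattern.lower()
    chosen = [n for n in names_in_file if fnmatchcase(n.lower(), pattern)]
    return filename, chosen
```

With the sequence names from globin.msf, the specification globin.msf{hb*} selects the five sequences whose names begin with "hb", and globin.msf{*} selects all seven.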


FINDING AND COPYING DATABASE SEQUENCE FILES

Finding Database Sequences

You can use the HUSAR/GCG sequence identification programs to look through any set of entries you name to find all the entries that contain some common feature.

You can search for entries using different attributes:

1) Accession number, for example J00441

2) Name, for example, Dro5s. You can also use wildcards within the name, for example Bov*.

3) Reference information, such as a gene or contributing author, for example globin or Weissmann

4) Sequence pattern, such as GAATTC

5) Similarity to query sequence.

Finding Sequences by Accession Number

You can find a sequence if you know only its accession number by using the Fetch or TypeData program.

The Fetch program finds the sequence in the appropriate database and copies it to your current directory. Type % fetch accessionnumber, for example % fetch j00411. TypeData finds the sequence file in the appropriate database and displays its reference information on your screen. Type % typedata -reference accessionnumber, for example % typedata -reference j00411. It does not copy the sequence to your directory. (If you want to see the sequence information in addition to the reference information, omit the -reference parameter.)

Finding Sequences by Name

You can find database sequences with similar names using the Names program. Names shows you what set of sequences is implied by any sequence specification.

1) Type % names. The program displays the prompt "Names for what GCG data file(s)?"

2) Type the characters you want to search for. For example, % names bov* displays every Wisconsin Package sequence entry or data file (see Chapter 4, Using Data Files, for more information) whose name begins with the letters "bov".

The sequence specification can include a database logical name to refine your search. For example, % names OM:bov* searches only the Other Mammals division of EMBL + GenBank, whereas % names bov* searches every division of GeAll, SWISSPROT, and PIR, as well as every GCG data file.

The program displays the prompt "What (file of filenames) output file (* TERM *)?"

3) Press <Return> to display output on your screen or type the name of a file you want to write the output to. The Names output file is a list file. You can use the output file of Names as input to any program that supports naming multiple sequences.

If you write the output to a file rather than to your screen, the Names program also documents each sequence with the first 132 characters of reference information, including the sequence name, number of base pairs, definition, and accession number. For more information, see the "Data Base Utilities" chapter of the Program Manual.

Finding Sequences by Text Pattern

You can search the database for textual reference, for example, "human" or "keratin". This type of search scans the reference section of each database entry searching for a word or phrase you specify: it does not scan the sequence section. To find sequences by text patterns, use the IRX program.

1) Type % irx. After starting IRX, you are asked whether to display a help screen. Type ’y’ or ’n’. Afterwards you will see a list of databases available under IRX.

2) IRX is a menu-driven system that is generally easy to use. The main problem with IRX is its inability to understand cursor (arrow) keys. Use only the keys that are listed at the top of each screen. Use D (down line) and U (up line) to move the cursor and press <Enter> to select a database.

You will find yourself in the menu "Question Input" where you can enter your questions using different searching techniques:

- Entering simple keywords, e.g. keratin

- Using Boolean Operators ( AND, OR, NOT ), e.g. keratin AND human

- Restricting keywords to specific fields, e.g. keratin [de,kw] AND human [os]

- Using the "proximity search", if you enter non-alphanumeric characters, e.g. "c-myc"

- Using wildcard characters (*), e.g. interl*

In any case, you will get a list of sequence entries satisfying the given request. Use the command ’w’ (write) to store the results to a file (e.g. a list file). For more information on IRX, see the Data Base Utilities chapter in the Program Manual or the Full Description section in this User’s Guide.

Finding Sequences by Sequence Pattern

To find sequences with similar sequence patterns (for example, all sequence patterns with GAATTC or YRYRYRYR), use the FindPatterns program. For more information on FindPatterns, see the Pattern Recognition and Composition Analysis chapter of the Program Manual.


Finding Sequences Similar to a Query Sequence

To compare your query sequence to other sequences in the databases, choose one of the following programs.

BLAST (BlastN, BlastP, Blastx, TBlastN and TBlastX). Identifies database sequences with similarity to any query sequence. The BLAST programs are usually faster than other database searching programs. The search uses the statistics of Karlin and Altschul to identify segments whose similarity would not be expected to occur by chance.

FastA (TFastA). Searches databases in a manner similar to BLAST. FastA allows gaps to occur in the segments found. It is usually slower than BLAST, but it can be more sensitive, particulary when searching nucleotide sequence databases.

For more information about the BLAST programs, FastA, WordSearch, and QuickSearch, see the Data Base Searching chapter of the Program Manual.

Copying Sequences from the Databases

You can copy sequences from the databases to your directory using the Fetch program.

Single Sequence

To copy a sequence from the databases, type % fetch entryname, for example, % fetch In:Dro5s. If you do not know the database in which a sequence resides, you can simply type the sequence name and Fetch will find it. However, if you do this, Fetch searches through a number of directories, which takes longer and may find files you are not interested in.

Multiple Sequences

To copy multiple sequences, use a wildcard in the specification, for example, % fetch hum* or % fetch Vi:HIV*. You can also copy multiple sequences from the databases by creating a list file of the sequences of interest. This method is useful if the sequence names do not have characters in common. Then, to copy the sequences from the database, type % fetch @listfilename, for example % fetch @hsp70.list. The sequences in the list file are copied to your current directory.

Viewing Database Sequences

You may want to read the reference information associated with a sequence or view the sequence itself. You can easily view the contents of sequence files by using the TypeData program.

To view database sequences, enter % typedata entry_name, for example % typedata GB_IN:Dro5S. The sequence data, including reference information, scroll on your screen. You cannot edit a file using the TypeData command. To control screen output, choose from the following:

1) To prevent sequence data from scrolling off your screen, type typedata filename | less.

2) To temporarily stop the scrolling of the data, press <Ctrl>s.

3) To resume scrolling, press <Ctrl>q.

4) To exit from TypeData, press <Ctrl>c.

Viewing Sequences in your directory

To view the contents of single sequence files, MSF files, or list files in your directories, enter % less filename, for example % less gamma.seq. The sequence data, including reference information, is displayed one screen at a time. To advance from screen to screen, press the <Space Bar>.

Reformatting Sequence Files to GCG Format

At some point in your work with HUSAR/GCG, you may need to reformat sequence files to GCG format. This may happen when:

1) You create a sequence file using an automated sequencer.

2) You obtain a sequence directly from a database service (such as EMBL, GenBank, or PIR e-mail services) or through another program (such as Staden or IntelliGenetics).

3) You create a sequence file using a text editor.

4) You modify a GCG-formatted sequence file using a text editor. (Note that this is not a recommended practice.)

You can use a number of differently formatted sequences with the Wisconsin Package: sequences created with a text editor or automated sequencer; sequences in a different software format (for example Staden or IntelliGenetics); or sequences in the database formats of GenBank, EMBL, SWISS-PROT, or PIR.

Each sequence in the Wisconsin Package must have a "type" associated with it, denoting the sequence as either a nucleotide or a protein. To specify the sequence type, you can add the parameter -NUCleotide or -PROtein to the command line when you run Reformat, FromStaden, FromEMBL, FromFastA, FromGenBank, FromPIR, or FromIG. If you do not, the programs determine the type for you based on the symbols in the sequence. Note that because nucleotide and protein sequences share some symbols, the programs can guess the sequence type incorrectly.
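To see why a guess based on symbols alone can go wrong, consider the following minimal Python sketch. It is an illustration only: the symbol set, the 0.85 threshold, and the function name guess_sequence_type are assumptions made for this example and are not the rule used by Reformat or the other programs.

```python
# Symbols counted as "nucleotide-like" in this sketch (an assumption).
NUCLEOTIDE_SYMBOLS = set("ACGTUN")


def guess_sequence_type(sequence, threshold=0.85):
    """Guess 'nucleotide' or 'protein' from the symbols in a sequence.

    The sequence is called a nucleotide sequence when at least `threshold`
    of its letters are nucleotide-like; the threshold is hypothetical.
    """
    letters = [char for char in sequence.upper() if char.isalpha()]
    if not letters:
        raise ValueError("sequence contains no letters")
    nucleotide_like = sum(char in NUCLEOTIDE_SYMBOLS for char in letters)
    fraction = nucleotide_like / len(letters)
    return "nucleotide" if fraction >= threshold else "protein"


if __name__ == "__main__":
    print(guess_sequence_type("GAATTCGGATCC"))  # a restriction-site-rich DNA stretch
    print(guess_sequence_type("MKTAYIAKQR"))    # a peptide with many non-nucleotide letters
    # The peptide Gly-Ala-Thr-Thr-Ala-Cys-Ala uses only letters that are
    # also nucleotide symbols, so a symbol-based guess misclassifies it.
    print(guess_sequence_type("GATTACA"))
```

Because G, A, T, and C are valid amino acid symbols as well as nucleotide symbols, the last call reports a peptide as a nucleotide sequence. Stating the type explicitly with -NUCleotide or -PROtein avoids this.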

To reformat sequence files, choose one of the following:

Sequences with no format.

If you create or modify a sequence using an automated sequencer or text editor, use the Reformat program to rewrite the sequence file to GCG format. Use Reformat when the file contains only sequence information and is in no particular format.

Sequences from a database service or another program.

1) FromStaden

Reformat sequences from Staden format to GCG format.

2) FromEMBL

Reformat sequences from the distribution (flat file) format of the EMBL or SWISS-PROT databases to GCG format.

3) FromGenBank

Reformat sequences in the flat file format of the GenBank database to GCG format.

4) FromFastA

Reformat sequences in FastA format to GCG format.

5) FromPIR

Reformat sequences from the protein database of the Protein Identification Resource (PIR) to GCG format.

6) FromIG

Reformat sequences from IntelliGenetics format to GCG format.

Note: You can also reformat sequences from GCG to Staden, PIR, FastA, and IntelliGenetics formats.

Using Personal Databases

You can create your own personal databases, similar to GenBank and EMBL databases, for searching with the Wisconsin Package. This option is a particular advantage if you frequently work with large list files. A large set of sequences is more compact to store and faster to search if it is assembled into a database. Thus you can convert your large list files into databases for faster searching capabilities. When sequences are assembled into a database, all HUSAR/GCG programs work with them exactly as they work with the public databases (GenBank, EMBL, etc.).

The program DataSet creates personal databases from any set of sequences you specify.

Type % dataset. The program displays the prompt "Assemble DATASET from what sequence(s)?"

Type the sequence specification of the list file you want to convert to a database, that is, an at sign (@) followed by the list file name. Alternatively, type a file specification from a public database using an asterisk (*) wildcard. For example, SW:Hs70* would create a database of all 70 kD heat shock protein sequences in SWISS-PROT.

The program displays the prompt "What should I call the database?"

Type the logical name you want to refer to the database, for example HSP. This prompt sets the logical name of your personal database. Your personal database logical names are automatically assigned in a shell script called .datasetrc in your home directory.

Specifying a personal database you created using DataSet is the same as specifying a sequence from a public database such as GeAll. To do so, type the logical name of your database, followed by a colon (:), followed by the sequence(s) of interest. For example, using the example above, you could type HSP:Hs70_Brelc to specify a single sequence in the database.

If you also want to use your database with the BLAST programs, you must additionally run PressDB (for nucleic acid sequences) or SetDB (for protein sequences).

Refining a Sequence List

You can refine MSF or list files to fit your analysis needs. There are three ways to refine sequence lists.

You can use the output file from one program as input to another to refine a sequence list. For example, you could identify human globin sequences with IRX. The output list from this session with IRX could then be refined with FindPatterns to show only those globin sequences containing an EcoRI site. You could then use WordSearch to compare the output list of globin sequences from FindPatterns to a sequence of your own that you think is similar to the globin sequences.

You can combine two or more list files by using a text editor such as pico. Note: You cannot combine MSF files in this way.

You can suppress any item in a list by putting an exclamation point (!) at the beginning of the line on which the item occurs.
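The last two methods can be pictured with a short Python sketch that combines the entries of several lists and skips every line suppressed with an exclamation point. It is a simplified illustration, not a HUSAR/GCG program: it assumes each list has already been reduced to one sequence specification per line, it ignores any heading text a real list file may carry, and the function names and the entry names in the example are hypothetical.

```python
def active_entries(lines):
    """Return the entries of one list that are not suppressed with '!'."""
    entries = []
    for line in lines:
        entry = line.strip()
        # Skip blank lines and lines whose first non-blank character is '!'.
        if not entry or entry.startswith("!"):
            continue
        entries.append(entry)
    return entries


def combine_lists(*lists):
    """Combine several lists, keeping the first occurrence of each entry."""
    seen = set()
    combined = []
    for lines in lists:
        for entry in active_entries(lines):
            if entry not in seen:
                seen.add(entry)
                combined.append(entry)
    return combined


if __name__ == "__main__":
    first = ["SW:Hs70_Human", "!SW:Hs70_Mouse"]   # second entry is suppressed
    second = ["SW:Hs70_Human", "SW:Hs70_Yeast"]
    for entry in combine_lists(first, second):
        print(entry)
```

Dropping repeated entries is a design choice of the sketch; combining list files in a text editor keeps duplicates unless you remove them yourself.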


DATABASE LOGICAL NAMES FOR THE HUSAR/GCG PACKAGE

The tables below list the important database logical names for use with the HUSAR/GCG Package.

Nucleic Acid Databases

Logical Name  Abbreviation  Description

GeAll         -             EMBL plus EmNew (EMBL daily updates) plus GbOnly (GenBank, not in EMBL)
EMBL          em            EMBL Nucleotide Sequence Database (European Bioinformatics Institute, EBI)
EmNew         emn           EMBL (daily updates)
EmLast        eml           EMBL (last week)
EmAll         -             EMBL plus EmNew (daily updates)
GenBank       gb            Nucleotide Sequence Database (National Center for Biotechnology Information, NCBI)
GenEMBL       ge            EMBL plus GenBank (not in EMBL)
GbOnly        gbo           GenBank (not in EMBL)
VecBase       vec           Vecbase library
VectorDB      ve            Vector library (National Center for Biotechnology Information, NCBI)
HIV-Base      hiv           Human Retrovirus Nucleic Acid Sequences (National Laboratory, Los Alamos, USA)
KabatBase     kabat         Nucleotide Sequence Database of Immunological Interest (Institute of Cancer Research, New York; National Institute of Allergy and Infectious Diseases, Bethesda)
Yeast         yeast         Complete genome of yeast (Martinsried Institute for Protein Sequences, Max-Planck-Institute for Biochemistry, Martinsried, Germany, MIPS)
Yeast_EST     -             Expressed Sequence Tags of yeast (The Institute for Genomic Research, TIGR, Rockville, USA)
Ecoli         eco           E. coli Nucleic Acid Sequences (subset of EMBL)
RNA5s         rna           RNA5S library by Berlin RNA DataBank (Institut fuer Biochemie, FU-Berlin, Berlin, Germany)
archae        -             Archaebacteria seq
eubac         -             Eubacteria seq
eukar         -             Eukaryotes seq
pseudo        -             Eukar. pseudogenes seq
AluBase       alu           Alu-Base library (MBCRR, Dana-Farber Cancer Institute, Boston, USA)
epd           -             Eukaryotic Promoter Database (-499:+100) (Institut Suisse de Recherches Experimentales sur le Cancer, Lausanne, Switzerland)

Subdivision of the Nucleic Acid Databases

EMBL           GenBank  GeAll                                             Description

EM_Ba          GB_Ba    Bacterial, Bacteria, Ba                           Bacterial sequences
EM_EST         GB_EST   EST                                               Expressed sequence tag sequences
EM_Hum         -        -                                                 Human sequences
EM_In          GB_In    Invertebrate, In                                  Invertebrate sequences
EM_OR          -        Organelle, Or                                     Organelle sequences
EM_Om          GB_Om    Other_Mammalian, OtherMammal, OtherMamm           Non-rodent, non-primate mammalian sequences
EM_Ov          GB_Ov    Other_Vertebrate, OtherVertebrate, OtherVert, Ov  Non-mammalian vertebrate sequences
EM_Pat         GB_Pat   Patent                                            Sequences from patents and patent applications
EM_Ph          GB_Ph    Phage, Ph                                         Phage sequences
EM_Pl, EM_Fun  GB_Pl    Plant, Pl                                         Plant and fungal sequences
-              GB_Pr    Primate, Pr                                       Primate sequences
EM_Ro          GB_Ro    Rodent, Ro                                        Rodent sequences
-              GB_St    Structural_RNA, Structural, ST                    Structural RNA sequences (such as rRNAs)
EM_STS         GB_STS   STS                                               Sequence-tagged site sequences
EM_Sy          GB_Sy    Synthetic                                         Synthetic sequences (plasmids, vectors)
EM_Un          GB_UN    Unannotated                                       New (not yet fully annotated) sequences
EM_Vi          GB_Vi    Viral, Vi                                         Viral sequences

Protein Databases

Logical Name  Abbreviation  Description

SwissProt     swiss, sw     Protein sequences (Amos Bairoch, University of Geneva, Switzerland, and European Bioinformatics Institute, EBI)
SwissNew      swnew         Updates of SwissProt library
SwissPir      -             SwissProt plus SwNew plus PirOnly
SwissPirPlus  -             SwissPir plus part of TrEMBL
PIR           -             Protein Identification Resource (National Biomedical Research Foundation (NBRF), Martinsried Institute for Protein Sequences (MIPS) and International Protein Information Database in Japan (JIPID))
PIR1          -             Annotated/classified entries of PIR library
PIR2          -             Annotated entries of PIR library
PIR3          -             Unverified entries of PIR library
PIR4          -             Unencoded or untranslated entries of PIR library
TrEMBL        trem          Translated sequences from EMBL supplementing the SWISS-PROT Protein Sequence Data Bank
GenPept       gp            Translated sequences from GenBank
MIPSX         -             Martinsried Institute for Protein Sequences (MIPS) protein database based on a merged database (PIR, MIPS, SwissProt, EMTrans, GBTrans, Kabat, and PSeqIP; Max-Planck-Institute for Biochemistry, Martinsried, Germany)
PATCHX        -             MIPSX minus PIR (Max-Planck-Institute for Biochemistry, Martinsried, Germany)
SBASE         sb            Collection of annotated protein domain sequences (ABC Institute for Biochemistry and Protein Research, Hungary and International Centre for Genetic Engineering and Biotechnology, Italy)
HIV-Prot      hivp          Human retroviruses protein library (National Laboratory, Los Alamos, USA)
KabatProt     kabatp        Protein sequence database of Immunological Interest (Institute of Cancer Research, New York; National Institute of Allergy and Infectious Diseases, Bethesda)
YeastProt     -             Yeast Protein library (Martinsried Institute for Protein Sequences, MIPS)
NRL3d         nrl3d         A sequence database derived from the 3-dimensional structure of proteins deposited with the Brookhaven National Laboratory’s Protein Data Bank (National Biomedical Research Foundation (NBRF), Washington, USA)

Subdivisions of the Translated Nucleic Acid Sequences Databases

TrEMBL      GenPept  Description

-           gp_ba    bacterial division
-           gp_est   est division
tremfun     -        fungi division
tremhum     -        human division
treminv     gp_in    invertebrate division
tremmam     gp_om    other mammalian division
tremmhc     -        MHC division
tremorg     -        organelle division
-           gp_pat   patent division
tremphg     gp_ph    phage division
trempln     gp_pl    plant division
-           gp_pr    primate division
trempro     -        prokaryote division
-           gp_st    rna division
tremrod     gp_ro    rodent division
tremsynth   gp_sy    synthetic division
tremunc     gp_un    unclassified division
tremvrl     gp_vi    viral division
tremvrt     gp_ov    other vertebrate division
-           gp_sts   sts division
tremimmuno  -        immunoglobulin and T-cell receptor division
trempseudo  -        pseudo protein division
tremsmalls  -        small sequences division


IRX

The following sequence databases are also available under IRX: EMBL, EmNew, GenBank, GbOnly, SwissProt, SwNew, PIR, PirOnly, TrEMBL, GenPept, Nrl3D, YeastProt, EPD, KabatBase, KabatProt, SBASE, HIV-Base, HIV-Prot and Ecoli. Additionally, you can access the following databases.

Name                                                 Abbreviation  Description

Transcription Factor Database                        transfac      Database on eukaryotic cis-acting regulatory DNA elements and trans-acting factors (Gesellschaft fuer Biotechnologische Forschung mbH, Braunschweig, GBF)
Papillomavirus                                       papnuc        Papillomaviruses nucleotide library (Deutsches Krebsforschungszentrum, DKFZ, Heidelberg, Germany)
Papillomavirus Protein Database                      papprot       Papillomaviruses protein library (Deutsches Krebsforschungszentrum, DKFZ, Heidelberg, Germany)
Prosite Database                                     prosite       Sequence motifs in the PROSITE Dictionary of Protein Sites and Patterns (Amos Bairoch, Geneva, Switzerland; European Bioinformatics Institute, EBI)
Reference Library DataBase                           rldb          Reference Library DataBase on hybridisation probes (Imperial Cancer Research Fund, ICRF, London, England)
Listing of Molecular Biology Databases               limb          Information about the contents and details of databases related to molecular biology (Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, Los Alamos, USA)
Sequence Analysis Bibliographic Reference Data Bank  seqanalref    Bibliographic reference data bank relative to papers dealing with sequence analysis (Amos Bairoch, Geneva, Switzerland)
Enzyme Data Bank                                     enzyme        Definition of enzyme classes, subclasses and sub-subclasses (Amos Bairoch, University of Geneva, Switzerland)
Database of Structure-Sequence Alignment             hssp          Homology-derived structures of proteins, derived database merging structural (2-D and 3-D) and sequence information (1-D) (European Molecular Biology Laboratory, EMBL, Germany)
Families of Structurally Similar Proteins            fssp          A database of protein structure families with similar folding motifs, based on 3D alignments of protein structures (European Molecular Biology Laboratory, EMBL, Germany)


Printed: October 24, 1996 11:26 (1162)