1. INTRODUCTION

TABLE OF CONTENTS

The Introduction to the HUSAR/GCG User’s Guide describes basic information you need to get started with the HUSAR/GCG Sequence Analysis Software Package.

INTRODUCTION

How This Guide Is Organized

Additional Documentation

Conventions Used in This Guide

USING HUSAR/GCG PROGRAMS

Starting Off

Running Programs

Answering Program Questions

Stopping Programs

Identifying Sequences

Finding Sequences in the Databases

Command Line Options

Abbreviating Command Line Qualifiers

Command Line Control

Local Data Files

Graphics Configuration

Customizing Your HUSAR/GCG Environment at Start - Your .gcgrc File

Screen Versus File Output

Documentation at Your Terminal: GenHelp or ?

Running Programs in the Batch Queue

Data for the Sample Sessions

INTRODUCTION

The User’s Guide contains general information for everyone who uses the Heidelberg Unix Sequence Analysis Resources (HUSAR), which are based on a Convex port of the Unix version of the Wisconsin Package (GCG, Genetics Computer Group Inc., Madison, Wisconsin, USA). The programs that make up GCG and HUSAR are tools for biologists and genetic researchers -- for you, the user.

At the time of this writing, HUSAR incorporates the whole Convex-GCG-Unix-version 8.1 and, in addition, programs developed or implemented at our site, for instance IRX, the Blast programs, the multiple alignment programs Tree, Clustal, MultAlign, BoxAlign and MAlign, the RNA secondary structure programs FoldAnalyze, FoldSplit and Poland, the protein sequence analysis program FoldClass, and many others. We always try, as far as possible, to make such programs "look like GCG" so that the whole system, including the general features of the package described in detail in this guide, is easier to use. Note, however, that some of the newly implemented programs do not fit completely into the GCG package. An example is IRX, which is a fully interactive program.

This guide, like the Program Manual, is based on the original GCG documentation.

How This Guide Is Organized

The nine sections in the User’s Guide are as follows:

- Introduction describes the organization and conventions used in this guide, gives information about using programs, and lists the differences from the previous version.

- Using Sequences describes how to name sequences and lists all database names used in HUSAR.

- Using Programs describes the basics of working with HUSAR/GCG programs and their command line options, and shows a typical program description.

- Using Data Files describes the concept of a local data file and lists all available local data files.

- Using Graphics describes the graphics hardware used by the GCG Package and how you can create graphic output using GCG programs.

- Using Batch Queues describes how to run HUSAR/GCG programs in the background, freeing your terminal for other work.

- Short Descriptions briefly describes every command in the HUSAR/GCG Package.

- Full Descriptions are provided for the most important programs of the HUSAR/GCG Package. You will find all complete program descriptions in the Program Manual or online using the ? or genhelp commands.

- Glossary defines some terms found in this guide.

Additional Documentation

In addition to the User’s Guide, the HUSAR/GCG Package comes with the following documentation:

The Program Manual describes the detailed operation of each HUSAR/GCG program.

For each program, the manual describes file restrictions, gives examples of how to run the program, and lists the command line options you can enter. Whenever possible, the description also includes examples of graphic output that you can either view on the screen or print out. There are fifteen sections in the Program Manual, corresponding to HUSAR’s menu system. The complete table of contents tells you where HUSAR/GCG programs are located by chapter; the index following the table of contents lists programs alphabetically. You can also download and print program descriptions yourself. Descriptions are available both as text and as PostScript files. With fetch program.doc or fetch program.ps, you can copy either file into your local directory. The command copydesc is no longer supported.

The Online Help displays a detailed description of each program on your terminal screen. Just type ? or genhelp. The listing includes the same information as the Program Manual entry, except for graphical output, if there is any.

The WPI Guide provides step-by-step instructions on how to use the Wisconsin Package Interface (WPI). WPI is a graphical user interface to the GCG Wisconsin Sequence Analysis Package, which we have adapted to work with the additional programs in HUSAR as well. It does not replace the command line interface you may be familiar with, but rather supplements it. Furthermore, two brief introductions, Introduction to the UNIX Operating System and Introduction to the HUSAR Program Package, are available on request from our site.

Conventions Used in This Guide

The information listed here describes special fonts and conventions that are used in this guide.

Typewriter Font

Responses you type are shown with bold Courier (bold typewriter) letters. For example, you get a help index of all on-line HUSAR/GCG programs by entering the command

% genhelp

Notice that the % (percent sign) UNIX prompt is in the Courier font and is not in bold. This means that UNIX presented the % prompt on your screen to tell you to continue; your response was to type genhelp. In short, prompts from the UNIX operating system and from the HUSAR/GCG Package are shown in Courier; your responses are shown in bold Courier. (See also the Fonts topic below.)

Commands and Qualifiers

HUSAR/GCG commands, like operating system commands, can only be abbreviated as mentioned in the menu (see below); qualifiers can be abbreviated. If a qualifier is shown in this document as only partly bold, the bold part indicates the fewest characters you must type to specify the qualifier. For example, the expression

% mapsort -CHEck

means you type mapsort and -che to get a list of the MapSort command line and qualifiers.

Fonts

There are several fonts used in GCG documentation:

- This bold font (Avant Garde) is used for HUSAR/GCG manual names and sections in the manual; for example, "The Mapping section of the Program Manual."

- This italic font is used for emphasis, for syntax of commands, and for key words; for example, "The dendrogram is a key concept that you must understand before you can read the graphic output."

- This Courier font (typewriter, non-bold) is used for system messages and session examples; for example, "The message All done! appears when you are through running this program." (See also the Typewriter Font topic above.)

- This bold Courier font (bold typewriter) is used for responses you type in and the names of HUSAR/GCG or UNIX commands and/or qualifiers; for example, "You enter genhelp to see an index of HUSAR/GCG programs on your terminal screen; enter genmanual to see an index of Program Manual sections." (See also the Typewriter Font topic above.)

File Names

File names are usually shown in lower case letters in HUSAR/GCG documentation; for example, gamma.seq.

USING HUSAR/GCG PROGRAMS

This section describes how to run HUSAR/GCG commands.

Starting Off

Before you can run any HUSAR/GCG program, you must initialize the HUSAR/GCG Package by typing the command % husar or, on the GENIUS, by selecting HUSAR from the menu system. Current news messages may appear on your terminal, followed by the HUSAR banner and the main menu. This takes a little while, but when the HUSAR prompt HUSAR % is displayed, you are able to start running HUSAR/GCG programs.

The procedure looks like this:

% husar

* * * * * * * * *   HH    HH  UU    UU   SSSSS    AAAAAA   RRRRRRR
* WELCOME TO THE *  HH    HH  UU    UU  SS   SS  AA    AA  RR    RR
* * * * * * * * *   HH    HH  UU    UU  SS       AA    AA  RR    RR   Heidelberg
                    HHHHHHHH  UU    UU   SSSSS   AAAAAAAA  RRRRRRR    Unix
                    HH    HH  UU    UU       SS  AA    AA  RR  RR     Sequence
                    HH    HH  UU    UU  SS   SS  AA    AA  RR   RR    Analysis
                    HH    HH   UUUUUU    SSSSS   AA    AA  RR    RR   Resources Release 4.0

based on the GCG program package version Unix8.1 (1996).
Copyright Genetics Computer Group, Inc. All rights reserved.

+---+
| News 27: FoldClass: prediction of protein foldclasses and domains
| News 26: *** how to make databases available for Blast ***
| News 25: PrettyPlot+PrettyBox: programs for displaying alignments
| News 24: && MAlign for aligning sequences of various lengths &&
| News 23: new database GenPept contains translated GenBank sequences
| News 20: *** HUSAR version 4.0 ***
+---+
| News 17-18-19: what’s new, changed and missing !
| EmNew last update: Fri Jan 19 20:06
| HUSAR support: mail $genmanager or Tel.: 06221/42-2334 or -2349
+---+

You are in the main-menu of HUSAR. Enter your choice of one of the following items:

 1 [fas] DNA fragment assembly
 2 [sqm] sequence editing and manipulation
 3 [tra] sequence translation and conversion
 4 [mpg] mapping
 5 [spc] sequence pair-comparison
 6 [ali] multiple sequence alignment
 7 [evo] evolutionary analysis
 8 [dbs] data base searching
 9 [dbu] data base utilities
10 [pat] pattern recognition and composition analysis
11 [rna] RNA secondary structure
12 [pro] protein sequence analysis
13 [fts] file transfer and sequence formatting
14 [gra] graphics support and batch utilities
15 [fil] file support utilities
16 [inf] information about data base releases, >> NEWS <<, etc.

 0 [men] MENU    [?] GENHELP    [e] EXIT

HUSAR %

Running Programs

At the HUSAR prompt you can run any HUSAR/GCG program by typing its name or its abbreviation in lower case letters. If you are not familiar with the names and functions of the programs, you can switch to a sub-menu of your choice by typing its number or its three-letter abbreviation. For example, to switch to the DATA BASE SEARCHING sub-menu, type either 8 or dbs. The sub-menu looks like this:

HUSAR % 8

You are in the sub-menu ’Data Base Searching’.

Enter your choice of one of the following items (+ generates graphics):

 1 BLASTN      : fast comparison of nucl. seq. to DNA sequence database
 2 TBLASTN     : fast comparison of pept. seq. to DNA sequence database
 3 BLASTP      : fast comparison of pept. seq. to prot. sequence database
 4 BLASTX      : comparison of nucl. seq. to prot. sequence database
 5 PAM         : generates PAM matrices for BLAST
 6 FASTA       : search of DNA and protein data bases
 7 TFASTA      : search of DNA data bases using a peptide sequence
 8+WORDsearch  : similarity between a sequence and any group of sequences
 9 SEGMENTS    : displays the output of WORDSEARCH
10 QUICKSEarch : quick search for nucleotide sequence in GeAll
11+QUICKSHOW   : displays the results of QuickSearch
12 BLASTALIGN  : display results from BLAST as multiple alignment
13 BLIMPS      : compares protein/DNA sequence to BLOCKS database
14+TWORDsearch : WORDSEARCH with 6-frame translation of database
15 TSEGMENTS   : displays the output of TWORDSEARCH
16+FRAMESEARCH : optimal alignment between prot. sequence and DNA database
                 or DNA sequence and prot. database including frameshifts
   [q]         : return to main-menu

DBS %

As you can see above, the prompt changes according to the three-letter abbreviation of the sub-menu. If you type 0 or men, the sub-menu listing reappears. With q you can return to the main-menu. At any position in HUSAR you can call the Online Help by typing ? or genhelp. Within a specific sub-menu you can start a program either by typing its name or the corresponding number. The upper case letters in the program’s name define a possible abbreviation (e.g. word for WORDsearch), but you must type them in lower case!

Note that you can run a program by typing its name (or abbreviation) wherever you are in HUSAR, not only in the sub-menu the program belongs to.

The ’+’ sign in front of a program’s name marks programs that create graphical output.
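The menu navigation described above can be modelled in a few lines of Python. This is only an illustrative sketch, not part of HUSAR: the table is copied from the main menu listing, while the names MAIN_MENU and resolve_menu_choice are hypothetical. It assumes, as the text states, that abbreviations must be typed in lower case.

```python
# Main menu entries copied from the listing in this guide:
# number -> (three-letter abbreviation, title).
MAIN_MENU = {
    1: ("fas", "DNA fragment assembly"),
    2: ("sqm", "sequence editing and manipulation"),
    3: ("tra", "sequence translation and conversion"),
    4: ("mpg", "mapping"),
    5: ("spc", "sequence pair-comparison"),
    6: ("ali", "multiple sequence alignment"),
    7: ("evo", "evolutionary analysis"),
    8: ("dbs", "data base searching"),
    9: ("dbu", "data base utilities"),
    10: ("pat", "pattern recognition and composition analysis"),
    11: ("rna", "RNA secondary structure"),
    12: ("pro", "protein sequence analysis"),
    13: ("fts", "file transfer and sequence formatting"),
    14: ("gra", "graphics support and batch utilities"),
    15: ("fil", "file support utilities"),
    16: ("inf", "information about data base releases"),
}


def resolve_menu_choice(answer, menu=MAIN_MENU):
    """Return the sub-menu number for a typed number or abbreviation.

    Returns None when the answer matches nothing. Abbreviations are
    compared exactly, so upper case input is not accepted.
    """
    text = answer.strip()
    if text.isdigit():
        number = int(text)
        return number if number in menu else None
    for number, (abbreviation, _title) in menu.items():
        if text == abbreviation:
            return number
    return None


if __name__ == "__main__":
    for typed in ("8", "dbs", "DBS"):
        print(typed, "->", resolve_menu_choice(typed))
```

Typing 8 and typing dbs select the same sub-menu, which is the behaviour the text describes.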

Answering Program Questions

Most programs in GCG/HUSAR prompt you for information: they usually require a sequence for input, a name for an output file, the beginning and ending positions of the portion of sequence you want to use, and other pieces of information particular to individual programs.

You can answer most program prompts with a yes or no, a number, a letter, or a filename. Some prompts display several alternatives for you to choose from. The following example displays some of the various types of prompts.

% map

Map displays both strands of a DNA sequence with restriction sites shown above the sequence and possible protein translations shown below.

(Linear) MAP of what sequence ? ggamma.seq (***)
*** Most programs require you to provide a filename as input.

Begin (* 1 *) ? <Return> (***)
End (* 1700 *) ? <Return> (***)
*** Press <Return> to accept default answers.

Select the enzymes: Type nothing or "*" to get all enzymes. Type "?" for help on which enzymes are available and how to select them.

Enzyme(* * *): <Return>

What protein translations do you want:

a) frame 1 b) frame 2 c) frame 3 d) frame 4 e) frame 5 f) frame 6

t)hree forward frames s)ix frames o)pen frames only n)o protein translation q)uit

Please select (capitalize for 3-letter) (* t *): s (***)
*** Or, type the appropriate answer to override the default.

What should I call the output file (* ggamma.map *) ?

%

When appropriate, GCG/HUSAR programs supply default answers for program prompts. You can accept the default answer by pressing <Return>, or type a different response. The default answer is always displayed between parentheses and asterisks. In the example above, default answers are provided for many of the prompts, for example begin (* 1 *), end (* 1700 *), enzyme (* * *), protein translation (* t *), and name of output file (* ggamma.map *).

Note: If you press <Return> without typing an answer in response to a prompt that does not have a default, the program will stop.
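The default-answer convention can be illustrated with a short Python sketch. It is not how the programs themselves are implemented; the function names are hypothetical, and the only assumption taken from the guide is that a default appears between "(*" and "*)" in the prompt text.

```python
import re

# A default is shown between "(*" and "*)", for example "Begin (* 1 *) ?".
DEFAULT_PATTERN = re.compile(r"\(\*\s*(.*?)\s*\*\)")


def default_from_prompt(prompt):
    """Return the default answer embedded in a prompt, or None if there is none."""
    match = DEFAULT_PATTERN.search(prompt)
    return match.group(1) if match else None


def resolve_answer(prompt, typed):
    """Return the typed answer, or the prompt's default for an empty answer.

    A result of None models the case described in the note above: an empty
    answer to a prompt without a default, after which the program stops.
    """
    typed = typed.strip()
    if typed:
        return typed
    return default_from_prompt(prompt)


if __name__ == "__main__":
    print(resolve_answer("Begin (* 1 *) ?", ""))
    print(resolve_answer("Please select (capitalize for 3-letter) (* t *):", "s"))
    print(resolve_answer("(Linear) MAP of what sequence ?", ""))
```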

Stopping Programs

To stop a program, choose from the following:

1) Press <Ctrl>c to quit a program prematurely.

2) Press <Ctrl>z to suspend a program you want to work on later.

To bring a suspended job to the foreground, where you can work with it again, type % fg %job_number, for example % fg %6. If you cannot remember which programs you suspended, type % jobs to list the jobs and job numbers.

Identifying Sequences

Defining a Sequence Precisely

Every sequence in the HUSAR/GCG Package has a name. A sequence can be defined precisely with a name, a strand (if the sequence is a DNA sequence), a checksum for the sequence, and a pair of coordinates that define a range of interest within the sequence. Likewise, all HUSAR/GCG programs identify sequences in their output with a name, strand, checksum, and range. (A few programs allow you to define ranges that cross from the end of the sequence into the beginning.)
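This guide does not say how the checksum is computed. As an illustration only, the Python sketch below follows the position-weighted sum that is commonly described for GCG-format sequence files (weights cycling from 1 to 57, result taken modulo 10000). Treat the algorithm and the function name gcg_checksum as assumptions rather than a statement about the HUSAR implementation.

```python
def gcg_checksum(sequence):
    """Position-weighted character sum, modulo 10000.

    Assumed algorithm: each character is upper-cased, multiplied by a
    weight that cycles through 1..57 along the sequence, and summed.
    """
    total = 0
    for index, char in enumerate(sequence):
        weight = (index % 57) + 1  # 1, 2, ..., 57, 1, 2, ...
        total += weight * ord(char.upper())
    return total % 10000


if __name__ == "__main__":
    print(gcg_checksum("ACGT"))
```

Because characters are upper-cased first, the same sequence typed in lower case yields the same checksum in this sketch.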

Naming a Sequence

All programs that handle sequences require you to identify sequences by naming a file (such as gamma.seq) that contains the sequence or by naming a sequence in a database. Sequences from the database (such as the GenBank) are named with the name of the database, followed by a colon, and then the name of the sequence; for instance, GB:HumRep2. (See the "Using Sequences" section of the User’s Guide for more information about naming sequences.)

Naming a Group of Sequences: List Files (formerly Files of Sequence Names)

Some HUSAR/GCG programs are designed to operate on a group of sequences or files. Our documentation and programs use words like sequence(s), file(s), or file name(s) instead of sequence, file, or file name to indicate that ambiguous file specification is allowed. These programs let you name either a list file or an ambiguous sequence group to identify the sequences you want.

A list file, also called an indirect file specification, can contain any number of individual sequence names, ambiguous sequence names, or names of other list files. (Note that many HUSAR/GCG programs do not support indirect file specification.) List files can be nested up to five deep.

When you enter the name of a list file for a HUSAR/GCG program, put an @ (at) character in front of its name to distinguish it from a standard UNIX file name. You cannot use wild cards to specify list files ambiguously.

Many HUSAR/GCG programs that analyze groups of files can write output into a list file, which is suitable for input to any other HUSAR/GCG program that supports indirect file specification.

You can use Fetch, a GCG program, to get the file hsp70.list to see an example of a list file.
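The nesting of list files can be pictured with the following Python sketch. It is a simplified model under stated assumptions: list files are represented as an in-memory dictionary instead of files on disk, each entry is a single name, and an entry beginning with @ refers to another list file. The real list file layout is not reproduced here (fetch hsp70.list to see an actual example), and the names expand_list and MAX_NESTING are hypothetical.

```python
MAX_NESTING = 5  # the guide allows list files to be nested up to five deep


def expand_list(name, list_files, depth=1):
    """Expand a list file into a flat list of sequence names.

    list_files maps a list-file name to its entries (a stand-in for
    reading the files from disk). Raises ValueError when the nesting
    goes deeper than MAX_NESTING.
    """
    if depth > MAX_NESTING:
        raise ValueError(f"list files nested more than {MAX_NESTING} deep at {name!r}")
    names = []
    for entry in list_files[name]:
        if entry.startswith("@"):
            # An @ prefix marks another list file: expand it recursively.
            names.extend(expand_list(entry[1:], list_files, depth + 1))
        else:
            names.append(entry)
    return names


if __name__ == "__main__":
    example = {
        "all.list": ["gamma.seq", "@db.list"],
        "db.list": ["GB:HumRep2"],
    }
    print(expand_list("all.list", example))
```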

For more information about naming a group of sequences, see the "Using Sequences" section of the User’s Guide.

Naming Sequences in Multiple Sequence Format (MSF) Files

Some HUSAR/GCG programs write a group of aligned sequences into a single multiple sequence format (MSF) file. Like other sequences, the sequences in an MSF file can be used with many HUSAR/GCG sequence analysis programs.

For example, answering a program’s sequence prompt with pileup.msf{ssa4} instructs the program to use the sequence ssa4 in the MSF file pileup.msf. All of the sequences in this file can be identified with the pileup.msf{*} specification. (See the "Using Sequences" section of the User’s Guide for more information on MSF files.)
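The sequence specifications introduced so far (a plain file name, a database entry such as GB:HumRep2, a list file such as @hsp70.list, and an MSF entry such as pileup.msf{ssa4}) can be told apart by their syntax. The Python sketch below shows one way to classify them; the function name and the returned dictionary layout are hypothetical, and the sketch makes no attempt to cover every form described in the "Using Sequences" section.

```python
import re

# file{entry}, for example pileup.msf{ssa4} or pileup.msf{*}
MSF_PATTERN = re.compile(r"(.+)\{(.+)\}")


def parse_sequence_spec(spec):
    """Classify a sequence specification by its syntax."""
    if spec.startswith("@"):
        return {"kind": "list", "file": spec[1:]}
    match = MSF_PATTERN.fullmatch(spec)
    if match:
        return {"kind": "msf", "file": match.group(1), "entry": match.group(2)}
    if ":" in spec:
        database, entry = spec.split(":", 1)
        return {"kind": "database", "database": database, "entry": entry}
    return {"kind": "file", "file": spec}


if __name__ == "__main__":
    for spec in ("gamma.seq", "GB:HumRep2", "@hsp70.list", "pileup.msf{ssa4}"):
        print(spec, "->", parse_sequence_spec(spec))
```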

Selecting a Range of Interest within a Sequence

The characters of a sequence are numbered from 1 to the length of the sequence. Most programs allow you to select a part of a sequence. You are asked for the beginning and ending coordinates of the range of interest. The correct answers to these prompts are integers within the range from 1 to the length of the sequence.
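As a small worked example of the 1-based, inclusive coordinates, the Python sketch below extracts a range of interest from a sequence string. The function name select_range is hypothetical, and the sketch deliberately rejects ranges that wrap around the end of the sequence, which the guide says only a few programs allow.

```python
def select_range(sequence, begin=1, end=None):
    """Return the characters from begin to end, counted from 1, inclusive.

    end defaults to the sequence length. Raises ValueError for
    coordinates outside 1..len(sequence) or with begin > end.
    """
    length = len(sequence)
    if end is None:
        end = length
    if not (1 <= begin <= end <= length):
        raise ValueError(f"range {begin}..{end} is outside 1..{length}")
    # Python slices are 0-based and exclude the end index.
    return sequence[begin - 1:end]


if __name__ == "__main__":
    print(select_range("ACGTACGT", 2, 5))
```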

Selecting the Strand

If a program’s operation is different on the opposite strand of a DNA sequence, the program asks if you want to use the reverse strand. You can answer Yes or No. If you answer Yes, the output from the program has reverse of in front of references to the sequence name.
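To make the idea of the reverse strand concrete, the Python sketch below computes the reverse complement of a DNA sequence and builds a label in the style described above. It assumes that "reverse strand" means the reverse complement and handles only the four unambiguous bases; the function names are hypothetical and the exact wording of real program output may differ.

```python
# Complement table for the four unambiguous bases, in both cases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_strand(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def sequence_label(name, reverse):
    """Prefix the sequence name with "reverse of" when the reverse strand is used."""
    return f"reverse of {name}" if reverse else name


if __name__ == "__main__":
    print(sequence_label("gamma.seq", True), reverse_strand("AAGC"))
```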

Making New Sequence Files

The SeqEd program is a screen-oriented editor that allows you to enter and check a sequence rapidly. You can also use any text editor to enter the sequence and then use the Reformat program to put the edited file into GCG format. All sequence files are kept in this format. The Program Manual entries for SeqEd and Reformat give more details on creating sequence files.

Naming New Sequence Files

Make up a name that reflects the gene or the function. Try to think of a name that would give a colleague in your field an idea of what the sequence is without having to read the file. It is very helpful to name all nucleic acid sequence files with the file name extension .seq and all peptide sequence files with .pep.

Finding Sequences in the Databases

HUSAR provides access to several databases. The most important are EMBL and GenBank for nucleotide sequences, and SwissProt and PIR for protein sequences. You can find the complete list of currently supported databases online in menu 16 under DATABASES_AVAILABLE. The "Using Sequences" section of the User’s Guide explains how to search for sequences in these databases using any of four different attributes: file names, key words, sequence patterns, or similarity to other sequences.

Command Line Options

A number of HUSAR/GCG programs have special features that can be used by running the program with a command line that includes both the name of the program and an additional word or parameter, called a command line option. For instance, the program Map normally writes its output with 60 sequence characters per line so that the output looks reasonable on your terminal screen. The command line % map -WIDth=100 sets Map to write output with 100 sequence characters per line.

The command line options available for each program are described at the end of the program’s entry in the Program Manual. As an alternative, you can list a brief description of the command line options whenever you run a program by including the -CHEck option on the command line. For example, you could type % map -CHE to list the options for the Map command.

Abbreviating Command Line Qualifiers

Command line options are different from commands in the sense that options are case insensitive and can be abbreviated. The convention used in this documentation is to present the fewest number of letters that you can type in by using the bold Courier letters and by capitalizing those letters.

For example, in the above topic the -CHEck option means that you can type % map -che to indicate that you want to list the command line options for the Map command. Because options are case insensitive, you could also type % map -CHEck or % map -chE or even % map -CHEC and the command line options will still display.
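The abbreviation rule can be expressed as a short Python sketch. It is an illustration of the convention as documented here, not the package's own parser: the capitalized leading letters of the documented spelling (for example CHE in -CHEck) give the minimum abbreviation, matching is case insensitive, and anything typed beyond the minimum must continue the full option name. The function names are hypothetical.

```python
def minimum_length(documented):
    """Count the leading capital letters of a documented option such as -CHEck."""
    count = 0
    for char in documented.lstrip("-"):
        if not char.isupper():
            break
        count += 1
    return count


def matches_option(typed, documented):
    """Return True if the typed option is a valid abbreviation of the documented one.

    Any value after "=" (as in -WIDth=100) is ignored for matching.
    """
    full_name = documented.lstrip("-").lower()
    word = typed.lstrip("-").split("=", 1)[0].lower()
    required = max(minimum_length(documented), 1)
    return len(word) >= required and full_name.startswith(word)


if __name__ == "__main__":
    for typed in ("-che", "-CHEck", "-chE", "-CHEC", "-ch"):
        print(typed, matches_option(typed, "-CHEck"))
```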

Command Line Control

The most frequently used HUSAR/GCG programs can read their parameter values either interactively or from the command line. This command line control is very helpful for running HUSAR/GCG programs repetitively. The use of command line control is discussed in detail in the "Using Programs" section of the User’s Guide.

Local Data Files

Programs that require non-sequence data can use files in your local directory or in the public directory. You can customize the program’s operation by thoughtful modification of such data files. For instance, you might want a mapping program to identify TAATA boxes or zDNA nucleation sites as well as restriction enzyme sites. You could use Fetch to copy the public data file used by the mapping programs. Your copy of the file, like the public version, is called enzyme.dat. Because your copy of the file belongs to you, you can change it to suit your needs with a text editor. The presence of this enzyme.dat file in your local directory means that the mapping programs will use it automatically, instead of using the public version. Files that can be found either locally or in the public HUSAR/GCG database are referred to in our documentation as local data files.

Local data files are discussed in detail in the "Using Data Files" section of this guide.
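The lookup order for local data files (your own copy first, the public version otherwise) can be sketched in Python as follows. The function name find_data_file and the directory arguments are hypothetical; where the public data files actually live on a HUSAR system is not stated in this section.

```python
from pathlib import Path


def find_data_file(name, local_dir, public_dir):
    """Return the path of a data file, preferring the local copy.

    Looks in local_dir first and falls back to public_dir. Returns
    None if the file exists in neither directory.
    """
    for directory in (local_dir, public_dir):
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    # "public_data" is a placeholder directory name for this demonstration.
    print(find_data_file("enzyme.dat", ".", "public_data"))
```

With this ordering, fetching enzyme.dat into your working directory is enough to make your edited copy take precedence.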

Graphics Configuration

To use a HUSAR/GCG program with graphics output you must define your graphics configuration before you run the program. Your graphics configuration includes those devices, such as a workstation or printer, on which you display the graphic result of running a HUSAR/GCG program. Your graphics configuration remains set for your entire HUSAR/GCG session, unless you set it to something else. Setting up your graphics configuration is discussed in the "Using Graphics" section of the User’s Guide.

Customizing Your HUSAR/GCG Environment at Start - Your .gcgrc File

You can customize your HUSAR/GCG operating environment by creating and editing a file called .gcgrc in your home directory.

When you initialize the HUSAR/GCG environment, the last step in the initialization process is the execution of your .gcgrc file, if it exists. The .gcgrc file is not executed until after the logical name and symbol service is fully initialized and after all environment table entries needed by the HUSAR/GCG Package are set. Because of this, the .gcgrc file is the best place to put your customized commands.

The .gcgrc file can contain any legal shell commands. For example, if you want all HUSAR/GCG programs to display the summary of command line qualifiers and parameters, you could have a one-line .gcgrc file that looks like this:

# This is an example of a .gcgrc file.
#
# The line in this file lists the command summary
# every time you use a HUSAR/GCG command.
#
comcheck

You can use any text editor to create the .gcgrc file, but you must put the .gcgrc file in your home directory.

Screen Versus File Output

Most HUSAR/GCG programs write a text file containing the output from running the program. You can examine these output files, rename them, delete them, and use them as input to other HUSAR/GCG programs. In addition, you can edit output files with any text editor and send them to other scientists via electronic mail.

While almost all HUSAR/GCG programs write their results into a text file, you may wish to see the results directly on your screen or to standard output. To write your results to your terminal screen, you answer the program question What should I call the output file ? with the response Term for terminal.

Documentation at Your Terminal: GenHelp or ?

You can display all the available on-line help for the GCG Package with GenHelp or ?. The command % genhelp lists the programs in the Program Manual alphabetically by name. The entire Program Manual is available on-line.

Running Programs in the Batch Queue

Normally you run programs interactively by typing the program name at the % prompt and answering the questions asked by the program. Most programs run in a few seconds, but when a program takes a long time to execute, you may wish to run it in the batch queue. In any event, using the batch queue frees your terminal for other work. If you call a program that tends to take a lot of computing time (like FastA or TBlastN), you are asked by default whether to run it in batch mode. You can run any other program in batch mode as well. For more information and an example of how to run the Map program in the batch queue, see the "Using Batch Queues" section of this guide.

Data for the Sample Sessions

The data used in the example session for each program in the Program Manual are provided with the HUSAR/GCG Package. These files can be retrieved using Fetch. For example, if the file gamma.seq is used for an example session, use the command

% fetch gamma.seq

to make a copy of the file in your directory. By using your own copy, you can run the program exactly as shown in the example session.

HUSAR 4.0 and HUSAR 3.0 DIFFERENCES

HUSAR Version 4.0 is based on Wisconsin Package (GCG) version 8.1 and runs on Convex Exemplar (SPP). GCG is introducing a graphical user interface (Wisconsin Package Interface, WPI), which is also available in HUSAR and includes almost all HUSAR/GCG programs. To start WPI, type "wpi" on the command line (in HUSAR). See "Graphics under X-Windows" in "what’s changed" (News No.18) for a description of how to prepare your session for X-Windows.

The main differences between HUSAR version 3.0 and version 4.0 are summarized in menu 16 under the topic NEWS no. 17 - 19. Online, you can also fetch these files under

whats_new.txt whats_changed.txt whats_missing.txt

Printed: October 24, 1996 13:53 (1162)