454 Sequencing System Software Manual, v 2.5p1


Part B – GS Run Processor, GS Reporter, GS Run Browser, GS Support Tool

August 2010

For life science research only. Not for use in diagnostic procedures.


Table of Contents

1. GS Run Processor
1.1 Data Processing Pipelines
1.2 Image Processing
1.2.1 Launching the image processing pipeline
1.2.2 Image Processing Output
1.3 Signal Processing
1.3.1 Launching the signal processing pipeline
1.3.2 Signal Processing Output
1.4 Filters
1.4.1 Signal Filters and Corrections
1.4.2 Read Quality Filters and Read Trimming
1.4.2.1 Read rejecting filters
1.4.2.2 Read Trimming Filters
1.4.3 Signal Processing Adjustable Filter Parameters
2. GS Reporter
2.1 gsReporter Executable
2.2 GS Reporter output
2.2.1 GS Reporter Metrics Files
2.2.2 GS Reporter FNA and QUAL Files
2.2.3 Organization of a Data Processing Directory
3. GS Run Browser
3.1 Launching the GS Run Browser
3.2 Opening a Data Set
3.3 Overview of the GS Run Browser Interface
3.3.1 The Global Action Area Buttons
3.3.2 The Tabs
3.3.3 The Buttons and Plots
3.3.4 The Navigation and Data Capture Buttons
3.3.5 The Mouse Functions
3.4 The Overview Tab
3.4.1 Sequencing Run Area
3.4.2 Run Processor Results Area
3.4.3 Run Processor Manager
3.5 The Wells Tab
3.5.1 Wells Tab Features and Functionalities
3.5.1.1 Well Categories
3.5.1.2 Color chart
3.5.1.3 Flows
3.5.1.4 Image area
3.5.1.5 Average well density summary
3.5.2 Well Flowgrams (for Wells Generating Library Reads)
3.5.2.1 Well Flowgram Features and Functionalities
3.5.3 Well Tri-Flowgrams (for Wells Generating Control DNA Reads)
3.5.3.1 Well Tri-Flowgram Features and Functionalities
3.5.4 Location Flowgrams
3.5.4.1 Location Flowgram Features and Functionalities
3.5.5 Subtraction Raw Flowgrams
3.5.5.1 Subtraction Raw Flowgrams Features and Functionalities
3.6 The Signals Tab
3.6.1 Signals Tab Features and Functionalities
3.7 The Reads Tab
3.7.1 Reads Tab Features and Functionalities
3.8 The Control DNA Tab
3.8.1 Control DNA Tab Features and Functionalities
3.8.2 Control DNA Consensus Flowgrams
3.8.2.1 Control DNA Consensus Flowgram Features and Functionalities
3.9 The Filters Tab
3.9.1 The Filters Tab Features and Functionalities
4. GS Support Tool
5. Data Management Topics
5.1 backupScript.sh
5.2 post-analysis script
6. GS Run Processor Appendices
6.1 gsRunProcessor executable
6.2 startGsProcessor script
6.3 GS Run Processor Manager
6.4 GS Run Processor Log Files - gsRunProcessor.log and gsRunProcessor_err.log
6.5 GS Reporter Metric File Contents Descriptions
6.6 Phred-equivalent Base Quality Scores
7. GS Run Browser Appendices
7.1 Control DNA Keys and Sequences for the Different Chemistries
7.2 Titration Runs Calculations for Older Chemistries
8. Glossary
9. Index


1. GS RUN PROCESSOR

GS FLX System Users: At the time of this writing, version 2.5 of the GS Run Processor has not yet been optimized for use with images acquired on the Genome Sequencer FLX Instrument. Such images should NOT be processed or reprocessed with the version 2.5 GS Run Processor at this time.

The GS Run Processor application is identical for both the GS Junior and Genome Sequencer FLX Instruments. However, due to differences in Instrument hardware, some references in this manual will be specific to either the GS Junior or the Genome Sequencer FLX Instruments. There are three main differences: the On-Instrument computing resources, the available data processing pipelines and the PicoTiterPlate (PTP).

The GS Junior Instrument does not have On-Instrument computing capability, but instead includes an Attendant PC with all of the 454 Sequencing Software components installed. It is designed to perform image and signal processing, as well as data analysis. The 454 Sequencing Software can optionally be installed on a DataRig or cluster.

Therefore, references to On-Instrument software or computation are specific to the Genome Sequencer FLX Instrument, while references to a DataRig or cluster can apply to either instrument.

The Genome Sequencer FLX Instrument supports five data processing pipelines. On the GS Junior Instrument, there are three available data processing pipelines. The supported pipelines are detailed in section 1.1.

The PicoTiterPlate (PTP) Device on the Genome Sequencer FLX Instrument supports division into multiple (2, 4, 8 or 16) regions. The GS Junior Instrument PTP Device only supports a single region. Therefore, any references to multiple regions are specific to the Genome Sequencer FLX Instrument.

The GS Run Processor application performs the data processing of a sequencing Run to convert raw images into signal intensity values. This is done in two steps: image processing and signal processing. During a sequencing Run, the CCD camera on the GS Junior or Genome Sequencer FLX Instrument takes an image of the PicoTiterPlate Device for each nucleotide flow of the sequencing Run protocol. This image capturing step generates the image files (.pif) for all the flows which can then be processed. The image analysis step finds raw wells (a well containing a DNA fragment that produced light due to base incorporations during the sequencing Run) across the entire PTP Device, and implements algorithms to normalize the background. This is followed by signal processing which filters, corrects and trims the raw flow signals to produce high quality sequence information.

The core of the GS Run Processor application is the gsRunProcessor executable, which contains algorithms for determining location of the loading regions of the PTP Device, for image processing, signal correction, basecalling and Run metrics generation. The components of the GS Run Processor application are described in Table 1 below:

Component | Description
gsRunProcessor executable | Contains the core algorithms of the data processing application
startGsProcessor script | Calls the gsRunProcessor executable
Launching scripts (runImagePipe, runAnalysisPipe, runAnalysisFilter, etc.) | Provides a command line interface to the gsRunProcessor executable
Processing scripts (imageProcessing.xml, signalProcessing.xml, etc.) | A set of XML files passed to the gsRunProcessor executable containing default commands and parameters for various processing pipeline options
gsRunProcessorManager executable | Provides a simple job batching system for the gsRunProcessor executable in shared user environments
gsReporter executable | Extracts reports and other output data from the processed CWF files

Table 1: GS Run Processor components

There are some environmental variables that control the gsRunProcessorManager application.

One key variable is the GS_LAUNCH_MODE, which can operate in one of the modes listed in Table 2:

Value | Description
SINGLE | Starts a single copy of the gsRunProcessor (Default)
MULTI | Starts multiple copies of the gsRunProcessor, equal to the number of processors in the current workstation (Non-cluster)
MPI | Uses ‘mpiexec’ for launching jobs on a compute cluster. Refer to the System Administrator’s Guide for details on configuring the gsRunProcessor suite for use on a non-Titanium cluster.
GSRPM | Starts the job using the gsRunProcessorManager. Will submit jobs to the same job queue as users who use gsRunBrowser to submit processing jobs. Recommended as a multi-user job queuing system.

Table 2: gsRunProcessorManager Environmental Variable GS_LAUNCH_MODE

If GS_LAUNCH_MODE is set to GSRPM, the jobs are launched via the gsRunProcessorManager and the queued jobs can be monitored and aborted via the gsRunProcessorManagerCtrl command. This is the recommended configuration and is also the default.
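As an illustration of how the launch mode might be selected from a wrapper script, the following minimal Python sketch builds an environment with GS_LAUNCH_MODE set to one of the values in Table 2. The helper name launch_environment and the constant VALID_LAUNCH_MODES are hypothetical and are not part of the 454 software; only the variable name and its four values come from this manual.

```python
import os

# The four values listed in Table 2 of this manual.
VALID_LAUNCH_MODES = ("SINGLE", "MULTI", "MPI", "GSRPM")


def launch_environment(mode, base_env=None):
    """Return a copy of the environment with GS_LAUNCH_MODE set.

    Hypothetical helper: it only prepares the environment and does not
    start any processing job itself.
    """
    if mode not in VALID_LAUNCH_MODES:
        raise ValueError("unknown GS_LAUNCH_MODE: %r" % (mode,))
    env = dict(os.environ if base_env is None else base_env)
    env["GS_LAUNCH_MODE"] = mode
    return env


if __name__ == "__main__":
    env = launch_environment("GSRPM", {"PATH": "/usr/bin"})
    print(env["GS_LAUNCH_MODE"])
```

The returned dictionary could then be passed as the env argument of subprocess.run when calling one of the launch scripts, so that the mode applies to that job only rather than to the whole login session.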

The gsRunProcessorManager manages the gsRunProcessor application and launches and queues all processing jobs. The gsRunProcessorManager daemon is implemented as a System V init script. If the Data Processing software was installed with system-level privileges, the gsRunProcessorManager will automatically start when the system is restarted. The following command (run as root) can be used to start the manager manually if necessary:

/etc/init.d/gsRunProcessorManager start

This command will return the job_ids of currently queued processing jobs. The following command can be used to abort a job that is in the job queue:

gsRunProcessorManagerCtrl abort job_id

1.1 Data Processing Pipelines

The data processing steps can be configured as part of the sequencing Run using the Run Wizard (see GS Sequencer application, in Part A of this manual), or invoked post-Run using the GS Run Processor Manager tool in the GS Run Browser application (see Section 3.4.3), or by using the command line interface (CLI) on an Attendant PC, DataRig or cluster (described below). A data processing pipeline specifies how to carry out the data processing with respect to:

• When to process (during or post sequencing Run)
• Where to process (on-instrument, Attendant PC, DataRig or cluster)
• Which processing steps (no processing, image processing, signal processing, full processing)
• What to process (standard shotgun, Paired End or Amplicon libraries)

The 454 Sequencing System Software supports five data processing pipelines:

• No Processing
• Image Processing
• Full Processing Standard
• Full Processing Amplicon
• Signal Processing – Filter Only

The data processing pipeline options are different for the Genome Sequencer FLX and GS Junior Instruments. The pipeline options available for the Genome Sequencer FLX Instrument are illustrated in Figure 1. The pipeline options available for the GS Junior Instrument are illustrated in Figure 2.


Figure 1: Data Processing Pipeline Options for the Genome Sequencer FLX Instrument


Figure 2: Data Processing Pipeline Options for the GS Junior Instrument

The “No Processing”, “Image Processing”, “Full Processing Standard” and “Full Processing Amplicon” pipelines are available through the Run Wizard on the Genome Sequencer FLX Instrument. The “No Processing”, “Full Processing Standard” and “Full Processing Amplicon” pipelines are available through the Run Wizard on the GS Junior Attendant PC.

The “No Processing”, “Full Processing Standard” and “Full Processing Amplicon” pipelines are also available via the GS Run Processor Manager Tool accessed via the GS Run Browser interface.

All of the data processing pipelines can also be accessed through the CLI using launch script commands. There is an additional launch script command, runAnalysisFilter, for running the signal processing filtering step only. When any processing job is launched, a pipeline processing script is called to specify the function calls to the gsRunProcessor executable. The processing scripts are XML-based text files located in /etc/gsRunProcessor on a DataRig, or in $RIGDIR/config on the Genome Sequencer FLX Instrument CPU or GS Junior Attendant PC.

The output report generation (gsReporter) sections of the processing scripts can be configured by the user to alter the default output reports generated. This is discussed in Section 2.2. The XML script file specifying the filter parameters in the signal processing step can also be customized by the user by first generating a template XML file using the --template option of the gsRunProcessor command (see Section 1.4.3 for a full description of the filtering process and adjustment of filter parameters). The pipelines and associated scripts are listed in Table 3 below:

Pipeline | Launch Script Command | Default Pipeline Processing Script
No Processing | (none) | (none)
Image Processing | runImagePipe | imageProcessingOnly.xml
Full Processing Standard | runAnalysisPipe | signalProcessing.xml (if passed a D_ directory); fullProcesssing.xml (if passed an R_ directory)
Full Processing Amplicon | runAnalysisPipeAmplicons | signalProcessingAmplicons.xml (if passed a D_ directory); fullProcesssingAmplicons.xml (if passed an R_ directory)
Signal Processing – Filter Only | runAnalysisFilter | No default (created locally using --template option of gsRunProcessor)

Table 3: Data Processing Pipeline Launch Commands and Processing Scripts
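The selection rule in Table 3 (the default script depends on the launch command and on whether an R_ or a D_ directory is passed) can be written as a small lookup. The Python sketch below is illustrative only: the function default_script and the dictionary DEFAULT_SCRIPTS are hypothetical names, the script file names are copied as printed in Table 3, and treating the runImagePipe RUN_DIRECTORY as an R_ directory is an assumption.

```python
import os

# Launch command -> {directory prefix -> default script}, copied from Table 3.
# File names are spelled exactly as printed in the table.
DEFAULT_SCRIPTS = {
    "runImagePipe": {"R_": "imageProcessingOnly.xml"},  # R_ prefix assumed
    "runAnalysisPipe": {
        "D_": "signalProcessing.xml",
        "R_": "fullProcesssing.xml",
    },
    "runAnalysisPipeAmplicons": {
        "D_": "signalProcessingAmplicons.xml",
        "R_": "fullProcesssingAmplicons.xml",
    },
}


def default_script(launch_command, source_directory):
    """Return the default processing script for a command and directory."""
    name = os.path.basename(os.path.normpath(source_directory))
    prefix = name[:2]
    try:
        return DEFAULT_SCRIPTS[launch_command][prefix]
    except KeyError:
        raise ValueError(
            "no default script for %s with directory %r"
            % (launch_command, source_directory)
        )


if __name__ == "__main__":
    print(default_script("runAnalysisPipe", "/data/R_example_run/"))
```

runAnalysisFilter is deliberately absent from the lookup because, per Table 3, it has no default script and always needs an explicit --pipe=XMLFILE argument.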

For Genome Sequencer FLX Instrument users, it is recommended to carry out image processing on the Instrument and signal processing at an off-Instrument location (a DataRig), because signal processing is computationally intensive and may take over 40 h to complete on the Instrument. This would prevent Instrument use for subsequent Runs for significant periods of time.

The "Paired End" pipeline, with its associated launch script "runAnalysisPipePairedEnd", has been merged into the standard pipeline. Use "runAnalysisPipe" to process Paired End Runs via the command line, or select "Processing for Shotgun or Paired End" from the GS Run Browser or instrument GUIs.

An additional command, startGsProcessor, is used to prepare the directory structure for GS Run Processor jobs. It is a wrapper script that is called initially by all the ‘run’ launch script commands. Its use will be transparent to most users. It is described in the appendix Section 6.2 for reference.

The data flow for the image and signal processing is illustrated in Figure 3.


Figure 3: Data Flow for Image and Signal Processing

The directory structure for a Genome Sequencer FLX sequencing Run on a DataRig, after the data processing and report generation steps are carried out, is described below for a two-region sequencing Run: (Note that for a GS Junior sequencing Run, the directory structure would be similar, except for the multiple regions).

Figure 4: Sequencing Run Data file structure after full Data Processing and default report generation


1.2 Image Processing

In the image processing step, the initial PPi or ATP flow is used to define the PTP regions that are loaded and the shape of the background across the plate. The specific background for each nucleotide type is also computed from the key flows and these are combined to subtract the background for each signal producing location on the plate for the key flow images. The sum of all signal locations across the initial key flow images defines the total set of initial raw wells. Well centers are thus defined at consistent locations across the remaining series of nucleotide flows (distinct image files) comprising the sequencing Run. Flow signal information is extracted for each raw well and organized into composite well format (cwf) for further processing.

There is one CWF file generated for each sequencing run / PTP Device on a GS Junior Instrument, while for the Genome Sequencer FLX Instrument, there is one CWF file generated for each region of the PTP Device. The total size of the raw images for a 2-region, 200 cycle GS FLX Titanium Run is about 30GB. For a GS Junior Titanium Run, the total size of the raw images is about 2.5GB.

The image processing computation is fast enough to be run concurrently with the sequencing Run. There are no adjustable parameters for the image processing step.

The following files are generated by the sequencing Run image capture and are used as input to the image processing step:

• dataRunParams.parse
• ptpImage.pif
• runLog.parse
• imageLog.parse
• aaLog.txt
• rawImages (subdirectory) - containing the image data in .pif formatted files
  o 00000.pif (initial PPi or ATP flow image)
  o 00001.pif, 00002.pif, 00003.pif, … (sequential nucleotide flow images)

The “.pif” file format was developed at 454 Life Sciences Corporation for storing image data from the Genome Sequencer FLX Instrument and continues to be used for the GS Junior Instrument. Currently, all image data is stored in 16 bit unsigned integers, or 2 bytes per pixel.

Valid image data is limited to the first 14 bits.
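The storage figures above (2 bytes per pixel, 14 valid bits) translate into simple arithmetic, sketched below in Python. This is not a .pif reader: the file layout is not described in this manual, the function names are hypothetical, and reading "the first 14 bits" as the 14 low-order bits of each 16-bit value is an assumption.

```python
BYTES_PER_PIXEL = 2                   # 16-bit unsigned integers, per the text
VALID_BITS = 14                       # valid image data, per the text
PIXEL_MASK = (1 << VALID_BITS) - 1    # 0x3FFF, assuming the low-order bits


def pixel_data_bytes(width, height):
    """Size in bytes of the pixel data for one image of the given dimensions."""
    return width * height * BYTES_PER_PIXEL


def valid_pixel_value(raw):
    """Keep only the 14 bits assumed to carry valid image data."""
    if not 0 <= raw <= 0xFFFF:
        raise ValueError("expected a 16-bit unsigned value")
    return raw & PIXEL_MASK


if __name__ == "__main__":
    # The dimensions here are arbitrary example values, not instrument specs.
    print(pixel_data_bytes(1024, 1024))
    print(valid_pixel_value(0xFFFF))
```

Under this reading, the largest valid pixel intensity is 16383.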

For the 454 Sequencing System Software Applications version 2.3 and higher, all the image files are used by data processing applications, thus it is strictly required that these files not be altered.

1.2.1 Launching the image processing pipeline

There are three ways to launch the image processing step of the GS Run Processor application:

• From the GS Run Processor Manager tool accessed via the GS Run Browser application (see Section 3.4.3)
• From the command line interface (CLI)

From the command line, the application can be launched by the runImagePipe command, which has the following command line structure:

runImagePipe [options] RUN_DIRECTORY

The RUN_DIRECTORY specified must contain a dataRunParams.parse file, the imageLog.parse file and the complete contents of the rawImages subdirectory. The options for the runImagePipe command are listed in Table 4.

Option | Description
--progress | Displays real-time progress of the job’s processing.
--pipe=XMLFILE | Specifies a pipeline processing script file. (Specifying the .xml extension is optional.)
--reg=REGION_NUM | Specifies processing of data from a particular region of the PTP. Regions can be specified as a range or as a comma-separated list ("1-4" or "1,2,5,9"). (Note: Multiple regions are only supported on the Genome Sequencer FLX Instrument.)
-verbose | Provides verbose log output useful for troubleshooting.

Table 4: runImagePipe command options
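To make the two accepted forms of the region argument concrete, the following Python sketch expands a specification such as "1-4" or "1,2,5,9" into a list of region numbers. It is an illustration of the syntax only and not the parser used by the launch scripts; the function name parse_regions is hypothetical, and accepting a mix of ranges and single numbers in one list is an assumption.

```python
def parse_regions(spec):
    """Expand a region specification into a sorted list of region numbers.

    Accepts a range ("1-4") or a comma-separated list ("1,2,5,9").
    """
    regions = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first_text, last_text = part.split("-", 1)
            first, last = int(first_text), int(last_text)
            if first > last:
                raise ValueError("descending range: %r" % (part,))
            regions.update(range(first, last + 1))
        else:
            regions.add(int(part))
    return sorted(regions)


if __name__ == "__main__":
    print(parse_regions("1-4"))
    print(parse_regions("1,2,5,9"))
```

On a GS Junior Instrument the PTP Device has a single region, so only one region number would apply there.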

1.2.2 Image Processing Output

The image processing step creates an output data directory of the format:

D_...imageProcessingOnly

The following files are generated by the image processing and are used as input to the signal processing step.

Output | Description
dataRunParams.xml | Contains the results of the region finding, a list of the Control DNA sequences detected in the data set, and meta-data about the sequencing Run.
Region/regionnum.cwf | Composite Wells Format (CWF) files - Contains the raw flowgram information from the processed images, run metrics, run meta-data, all intermediate results and low-resolution image for each base flow.
gsRunProcessor.log | Contains logged messages from the image processing.
gsRunProcessor_err.log | Contains logged error messages from the image processing.

Table 5: Image processing output files

1.3 Signal Processing

The signal processing step of the GS Run Processor application analyzes the signal data for each flow for all active wells of each loading region of the PicoTiterPlate Device, using the data generated during the image processing step and stored in the .cwf files. The signal processing uses different internal parameters and function calls for signal processing of standard shotgun, Paired End, or cDNA libraries compared to Amplicon libraries and thus the library type must be specified for signal processing jobs.

The signal processing performs a series of normalization, correction, and quality filtering steps and then outputs the remaining (high quality) signals into flowgrams for each well (read). The signal processing step also generates base calls with associated quality scores for the individual reads, and outputs this data as Standard Flowgram Format (or SFF) files that contain all the sequence trace data for the reads. All the data analysis applications (GS De Novo Assembler, GS Reference Mapper, GS Amplicon Variant Analyzer) use these SFF files as input. In a full processing or signal processing job, the GS Run Processor software performs the following operations:

• Screen out ghost wells (Amplicon pipelines only, wellScreener)
• Balance the signal strengths of the different nucleotides (nukeSignalStrengthBalancer)
• Correct for interwell cross-talk between neighboring wells (blowByCorrector)
• Correct for anomalous signal spikes or interruptions due to a reagent flow valve event (nucValveEventBalancer)
• Correct for known out-of-phase errors (CAFIE – Carry Forward and Incomplete Extension, cafieCorrector)
• Rebalance the signal strengths of the different nucleotides (nukeSignalStrengthBalancer)
• Correct for signal droop (individualWellScaler)
• Subtract residual background signal (mostLikelyErrorSubtractor)
• Screen out ghost wells (wellScreener)
• Filter (pass or fail) the processed reads based on signal quality (Key Pass, Dot and Mixed filters)
• Trim read ends for low quality and primer sequence (Signal Intensity, Primer and TrimBack Valley filters)
• Generate flowgrams and base called sequences with corresponding quality scores for all individual, high quality reads (i.e. those which passed all filters) and output to CWF and SFF files, one per PTP region processed. (Note: For the GS Junior Instrument, the entire PTP is a single region)

• Version 2.5 of the GS Run Processor contains significant enhancements to the basecaller and CAF/IE corrector modules. For any data set that was processed with earlier versions of the GS Run Processor, which will be used for data analysis with V2.5, signal reprocessing is strongly recommended.

• The Amplicons configuration of the signal processing is important because Amplicon libraries tend to generate much stronger signals than other library types. They consist of shorter fragments that amplify to a greater extent during emPCR Amplification. Very strong signals, coupled with the fact that many wells in Runs with Amplicon libraries contain similar sequences, can result in ‘ghost wells’, areas where light is detected but that do not truly correspond to DNA sequencing events. The special signal processing configuration for Amplicons includes two additional well screening steps (see above) that help remove these ghost wells. This configuration is triggered by using the runAnalysisPipeAmplicons CLI launch script command or by selecting ‘Full Processing for Amplicons’ or ‘Signal Processing for Amplicons’ from the Run Wizard or the GS Run Processor Manager tool in the GS Run Browser application.

• With 454 Sequencing Software applications version 2.01 or higher, while a signal processing job is in progress, the cwf files will have a .tmp extension. Users can verify if the pipeline has successfully finished by checking for .cwf files without .tmp suffixes in the regions sub-directory of the processing data (D_...) directory.
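The completion check described in the last note can be scripted. The Python sketch below is a minimal illustration under stated assumptions: the function name signal_processing_finished is hypothetical, the caller supplies the path of the regions sub-directory, and an in-progress file is assumed to be recognizable by a name ending in ".tmp".

```python
from pathlib import Path


def signal_processing_finished(regions_dir):
    """Return True if .cwf files exist and no .tmp files remain.

    regions_dir is the regions sub-directory of a D_ data directory.
    """
    names = [p.name for p in Path(regions_dir).iterdir() if p.is_file()]
    has_cwf = any(name.endswith(".cwf") for name in names)
    has_tmp = any(name.endswith(".tmp") for name in names)
    return has_cwf and not has_tmp


if __name__ == "__main__":
    import sys

    if len(sys.argv) == 2:
        print(signal_processing_finished(sys.argv[1]))
```

A check like this only reports whether output files are in their final form; the gsRunProcessor.log and gsRunProcessor_err.log files remain the place to look for processing messages and errors.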

1.3.1 Launching the signal processing pipeline

There are two commands that can be used to launch the signal processing step of the GS Run Processor application:

runAnalysisPipe [options] SOURCE_DIRECTORY

runAnalysisPipeAmplicons [options] SOURCE_DIRECTORY

The SOURCE_DIRECTORY can be either an ‘R_’ directory to run both the image processing and signal processing steps on the data set, or a ‘D_’ directory to run only the signal processing step. If a run (‘R_’) directory is specified, the *.pif files must be present in the rawImages subdirectory of the Run directory. If a data (‘D_’) directory is specified, the directory must contain image processed results. If the data present in a ‘D_’ directory has also been signal processed, the runAnalysisPipe command will produce a warning message (note that this is only a warning; processing will still proceed).

An additional command can be used to launch just the quality filter portion of the signal processing step on data that has already been processed through a previous iteration of the signal processing step and deemed in need of filtering adjustment. The steps for modification of the signal processing pipeline script (XMLFILE) and the adjustable parameters are described in the Data Processing Background (Section 1.4.3).

The filter-only processing command is:

runAnalysisFilter --pipe=XMLFILE SOURCE_DIRECTORY

The signal processing and filtering command line arguments and output are described below:

Argument | Description
SOURCE_DIRECTORY | Path to an ‘R_’ (full processing) or a ‘D_’ directory (signal or filter processing).

Option | Description
--progress | Displays real-time progress of the job’s processing.
--pipe=XMLFILE | Specifies a pipeline processing script file for signal processing using customized filter parameters (runAnalysisPipe) or to specify the input filter parameters for a filter-only signal processing job (runAnalysisFilter). (Specifying the .xml extension is optional.)
--reg=REGION_NUM | Only process data from a particular region of the PTP Device. Regions can be specified as a range or as a comma-separated list ("1-4" or "1,2,5,9"). (Note: Multiple regions are only supported on the Genome Sequencer FLX Instrument.)
-verbose | Provides verbose log output useful for troubleshooting.

1.3.2 Signal Processing Output

The signal processing step creates an output data directory of the format:

D_...signalProcessing

The following files are generated by the signal processing and are used by the gsReporter to generate reports and output files for further data analysis:

Output | Description
region/regionnum.cwf | Composite Wells Format (CWF) files - Contains the corrected flowgram information, processing metrics from the data processing and low-resolution image for each base flow. One file per PTP region.
gsRunProcessor.log | Contains messages from the image processing.
gsRunProcessor_err.log | Contains error messages from the image processing.

1.4 Filters

1.4 Filters

The stringent filtering algorithms used to capture and discard poor quality reads ensure that the resulting high quality data retained yields a better assembly even with a lower number of reads. The reads can then be used for further data analysis such as de novo assembly using the GS De Novo Assembler application, mapping to a reference using the GS Reference Mapper application (Part C) and sequence variant analyses using the GS Amplicon Variant Analyzer (Part D).

1.4.1 Signal Filters and Corrections

• Calculate and apply nucleotide normalization – normalizes the signal strengths of different base incorporations (nukeSignalStrengthBalancer).

• Calculate and apply inter-well crosstalk correction – corrects individual wells for the additional signal intensity conveyed by neighboring high intensity signal wells (blowByCorrector).

• Calculate and apply reagent flow event balancer – corrects anomalous signal spikes due to reagent valve events (valveEventBalancer).

• Calculate and apply out-of-phase error corrections (CAFIE - CArry Forward & Incomplete Extension – cafieCorrector)

  o Carry Forward occurs when a trace amount of nucleotide remains in a well after the apyrase wash, perpetuating premature nucleotide incorporations for specific sequence combinations during the next base flow. While this affects a small percentage of DNA strands per bead (~2%), it causes those strands to continue to incorporate nucleotides out-of-phase with respect to the rest of the strands and contributes to signal ‘noise’.

  o Incomplete Extension occurs when some DNA strands on a bead fail to incorporate during the appropriate base flow. This can be due to reactivity differences (it happens more often for ‘T’ flows, for example) and/or reagent and substrate local concentrations (it happens more for wells on the far end of the plate relative to the base flow direction). The strands that fail to incorporate must wait another flow cycle to continue sequencing and thus those strands will incorporate out-of-phase with the rest of the strands.

• Re-calculate and re-apply nucleotide normalization – normalizes the signal strengths of different base incorporations (nukeSignalStrengthBalancer).

• Calculate and apply signal droop correction – corrects for signal reduction during the eight-hour sequencing Run exposure (individualWellScaler).

• Apply residual background subtraction and rescaling (mostLikelyErrorSubtractor).

• Calculate and apply well density correction – calculates the signal per base to filter out ghost wells (wellScreener).

The remaining corrected wells are considered to contain valid sequence information, albeit of varying quality. They are counted as Raw Wells in the totalRawWells output metric. Next, signal quality filters and read trimming are carried out to retain only high quality reads for further data analysis.

1.4.2 Read Quality Filters and Read Trimming

The order in which the read quality and read trimming filters are applied is shown in Table 6.


Table 6: Signal quality filtering and read trimming order

1.4.2.1 Read rejecting filters

Read rejecting filters are applied as a pass/fail test and quickly discard no-information or low information active wells.

• Key Pass Filter – fails for reads that do not start with a valid 4-base ‘key’ sequence corresponding to either a library read: ‘GACT’ for Rapid Library or ‘TCAG’ for FLX Standard or a Control DNA read, ‘CATG’ (GS Junior and GS FLX Titanium chemistry; see Section 7.1 for other chemistries) (numKeyPass metric)

• Dots Filter – fails for reads with too many negative flows. This filter discards reads that are too short or have too many poor incorporations or interruptions. A ‘dot’ is an instance of three successive negative flows (no signal for three base flows, denoted as ‘N’). Discards reads with the last positive flow occurring at less than 84 flows (~30-50 bp). Also discards reads with more than 5% of the flows before the last positive flow occurring as dots. (numDotFailed metric)

• Mixed Filter – fails for reads with too many positive flows. This filter discards reads with too many nucleotide incorporations, possibly occurring from a bead carrying two or more DNA fragments attached, a well containing more than one DNA bead, signal contamination from a neighboring well or a low signal-to-noise ratio well. Discards a read if more than 70% of the flows occurring before the last positive flow recorded a positive signal. Additionally, if the normalized flow signal value is between 0.45 and 0.75, it is considered ambiguous. If the read has more than 20% ambiguous or less than 30% unambiguous positive flows (either below 0.45 or above 0.75) the read is discarded. (numMixedFailed metric)
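The pass/fail logic of the Key Pass and Mixed filters can be sketched in a few lines of Python. This is an interpretation for illustration, not the gsRunProcessor implementation: the function names are hypothetical, the positive_threshold value of 0.5 is an assumed cutoff that the manual does not state, the window is taken to include the last positive flow, and "unambiguous positive" is read here as a signal above 0.75.

```python
# Keys quoted above for GS Junior and GS FLX Titanium chemistry.
VALID_KEYS = ("GACT", "TCAG", "CATG")


def passes_key(bases):
    """Key Pass check: the read must start with one of the valid keys."""
    return bases.upper().startswith(VALID_KEYS)


def fails_mixed(signals, positive_threshold=0.5):
    """Mixed filter sketch on normalized flow signals.

    positive_threshold is an assumed value, not taken from the manual.
    """
    positive_flows = [i for i, s in enumerate(signals) if s >= positive_threshold]
    if not positive_flows:
        return False  # nothing to judge here; other filters handle this case
    window = signals[: positive_flows[-1] + 1]
    total = len(window)
    positive = sum(1 for s in window if s >= positive_threshold)
    ambiguous = sum(1 for s in window if 0.45 <= s <= 0.75)
    clear_positive = sum(1 for s in window if s > 0.75)
    return (
        positive / total > 0.70
        or ambiguous / total > 0.20
        or clear_positive / total < 0.30
    )


if __name__ == "__main__":
    print(passes_key("TCAGACGT"))
    print(fails_mixed([1.0] * 10))
```

A read in which every flow is positive, as in the second call, is the typical signature of a mixed well and is rejected by the 70% rule.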

DO NOT mix samples prepared with the Rapid Library protocol with ones prepared


1.4.2.2 Read Trimming Filters

After the read rejecting filters have been applied, the remaining reads are assessed for quality of the signal intensity information as a function of the nucleotide flow. This information can be visualized in a Signal Flowgram (see example in Figure 5).

In the Signal Flowgram, the x-axis corresponds to the individual nucleotide flows of the sequencing Run and the y-axis is a measure of signal intensity. The processed signal flowgrams have the following characteristics:

• Each nucleotide type is plotted in a different color allowing for easy ‘reading’ of the sequence trace data from visual inspection of the flowgram.

• Nucleotide incorporations are normalized and quantized such that a signal intensity near 1.0 corresponds to a single nucleotide incorporation and a signal intensity near 2.0 corresponds to a dinucleotide homopolymer incorporation.

Figure 5: Normalized signal flowgram
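How a normalized flowgram is ‘read’ can be shown with a deliberately naive Python sketch that rounds each flow signal to the nearest homopolymer length. The actual basecaller in gsRunProcessor is more sophisticated and also assigns quality scores; the function name call_bases is hypothetical, and the cyclic flow order T, A, C, G used here is an assumption that is not stated in this section.

```python
FLOW_ORDER = "TACG"  # assumed cyclic nucleotide flow order


def call_bases(signals, flow_order=FLOW_ORDER):
    """Naive base calling: round each signal to a homopolymer length."""
    bases = []
    for index, signal in enumerate(signals):
        count = int(max(signal, 0.0) + 0.5)  # round half up, never negative
        nucleotide = flow_order[index % len(flow_order)]
        bases.append(nucleotide * count)
    return "".join(bases)


if __name__ == "__main__":
    flowgram = [1.02, 0.05, 0.97, 0.10, 0.03, 1.10, 0.00, 0.95, 2.04]
    print(call_bases(flowgram))  # prints TCAGTT
```

In this example the first eight flows spell the ‘TCAG’ key, and the final signal near 2.0 is read as a two-base homopolymer.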

As discussed in the Signal Filters and Corrections section above (Section 1.4.1), several chemical- and system-related effects can gradually degrade the signal of some wells over the course of the sequencing Run, towards the 3’ end. The pipeline implements filters that trim back, from the 3’ end, reads which show borderline signal intensity values.


Any trimmed read that is shorter than the minLength parameter described below (default value of 50) is rejected as too short. The number of reads rejected due to a particular trimming filter is output in the metrics file generated by the gsReporter application (Section 6.5).

•

Signal Intensity Filter – trims reads that lose signal ‘crispness’ near the end, possibly from overall signal droop and/or CAFIE error accumulation that leads to a low signal-to-noise ratio (as shown in Figure 6). The 3’ end of a read is trimmed such that fewer than 3% of the remaining flows have borderline intensity for incorporation (0.5-0.7 on a 0-1.0 scale). (numTrimmedTooShortQuality metric)

Figure 6: Signal intensity filter

•

Primer Filter – trims the end of a read when it matches a 454 Sequencing System Adaptor sequence. (numTrimmedTooShortPrimer metric)

•

TrimBack Valley Filter – Filters or trims reads with many off-peak signal intensities. A Valley flow is defined as an intermediate signal intensity, i.e., a signal intensity occurring in the valley between the peaks for 1-mer and 2-mer incorporations, or the 2-mer and 3-mer, etc. The signal distribution of all reads of the Run is used to define the peaks of the homopolymer incorporations and, relative to these, the valleys or borderline zones for classification of intermediate signals.

o A Valley filter is applied to discard a read if more than 4 borderline valley flows occur before the last trimmed or 320th flow. This removes reads with many homopolymer errors within the first 320 flows. (numTrimmedTooShortQuality metric)

o The TrimBack filter scans reads backwards from the end of the read and trims flows until the number of valley flows is < 1.25% (4 occurrences/ 320 flows). This trimming is used to retain the higher quality part of a read. (numTrimmedTooShortQuality metric)

Parameters are available to turn off or adjust the application of this filter; see Section 1.4.3.
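The numeric rules of the trimming filters in this section can be sketched in Python. This is a simplified illustration under stated assumptions, not the pipeline's code: the function names are hypothetical, the borderline test is applied directly to the normalized signal value, and the valley classification (which the pipeline derives from the signal distribution of the whole Run) is assumed to be supplied as a precomputed list of booleans. The 0.5-0.7 window, the 3% limit, and the limit of 4 valley flows in 320 flows (4/320 = 1.25%) are the values quoted in the text.

```python
def trim_signal_intensity(signals, low=0.5, high=0.7, max_fraction=0.03):
    """Return how many flows are kept after trimming from the 3' end.

    Flows are dropped from the end until fewer than max_fraction of the
    remaining flows have a borderline intensity (low <= signal <= high).
    """
    keep = len(signals)
    while keep > 0:
        borderline = sum(low <= s <= high for s in signals[:keep])
        if borderline / keep < max_fraction:
            break
        keep -= 1
    return keep


def valley_filter_rejects(is_valley, bad_flow_threshold=4, last_flow_to_test=320):
    """Return True if more than bad_flow_threshold valley flows occur
    within the first last_flow_to_test flows.

    is_valley: one boolean per flow (assumed to be computed elsewhere).
    """
    return sum(is_valley[:last_flow_to_test]) > bad_flow_threshold
```

The trimming loop re-counts borderline flows for every candidate length, which is quadratic but keeps the rule easy to read; a running count would be the obvious optimization.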

1.4.3 Signal Processing Adjustable Filter Parameters

Some of the quality filters may be turned off or adjusted to control the stringency of the output.

This section provides instructions for customizing the filter parameters for signal processing.

•

Identify the Data Processing folder (‘D_’) of the sequencing Run whose reads are to be re-filtered or the Run folder (‘R_’) of a sequencing Run whose reads will be filtered using the customized filter parameters.

•

Generate a template file in the Run or data directory by typing the following command, from within the fullProcessing folder:

gsRunProcessor --template=filterOnly > filterTemplate.xml

This will create a template file for shotgun library processing in the current directory. There are additional templates for Paired End and Amplicons experiments, which can be generated by using the template arguments “filterOnlyPairedEnd” and “filterOnlyAmplicons” in the command above. The XML output filename (to the right of the “>” symbol) can be any valid string ending with ‘.xml’.

•

Open the file with a text editor. An easy-to-use text editor called “nedit” is present on the Genome Sequencer FLX Instrument; on the GS Junior Attendant PC, a similar program called “gedit” is present. To use nedit (or gedit) to edit the file, type the following command:

nedit filterTemplate.xml

Within the template, scroll down to the <qualityFilter> and <baseCaller> sections (Figure 7 and Figure 8 below). All of the user-adjustable Quality Filter parameters that govern the filtering of high quality reads for the sequencing Run are adjusted in these sections. Both sections contain some adjustable parameters by default. Additional parameters can be added to the <qualityFilter> section. As changes are made, note that:

o Increasing a stringency setting will reduce the number of filtered high quality reads by eliminating the lowest quality reads from among the previously filtered high quality reads. This may increase overall accuracy of the filtered high quality reads.

o Decreasing a stringency setting will increase the number of passed filter high quality reads by allowing lower quality reads to pass through the quality filter. This may reduce overall accuracy of the filtered high quality reads.

After the changes have been made, save and exit the text editor.

•

Change directory, to the parent Run folder directory:

cd ..

•

Launch the processing job specifying the modified filter parameter script:

o To launch a filter only signal processing job use the following command structure:

runAnalysisFilter --pipe=/path_To_filterTemplate.xml /path_To_DATA_DIRECTORY

If the job is successfully launched, a new Data Processing directory with the name convention D_..filterTemplate will be created in the Run directory.

Figure 7: Signal processing filter adjustment XML code for Shotgun library processing (comments removed)

The quality filter parameters are modifiable. Their default values and the recommended modifications are described below. For more information regarding Amplicon filter parameters, please refer to Application Note APN-10002: Amplicon Sequencing; Experimental Designs, Guidelines and Tips, available at http://454.com/my454.

doDotCheck
Checks for reads with too many negative flows.

Value                                  Effect
True – Default All                     Dots filtering enabled
False                                  Dots filtering disabled

doMixedCheck
Checks for reads with too many positive flows.

Value                                  Effect
True – Default All                     Mixed filtering enabled
False                                  Mixed filtering disabled

doClassifierCheck
Checks if reads start with a valid 4-base ‘key’ sequence (Key Pass).

Value                                  Effect
True – Default All                     Key Pass filtering enabled
False                                  Key Pass filtering disabled

doShortSignalCheck
A signal intensity filter that trims reads that lose signal ‘crispness’ near the end.

Value                                  Effect
True – Default All                     Signal intensity filtering enabled
False                                  Signal intensity filtering disabled

doPrimerTrimming
Trims the end of a read when it matches a 454 Sequencing System Adaptor sequence.

Value                                  Effect
True – Default All                     Primer trimming enabled
False                                  Primer trimming disabled

doValleyFilterTrimBack
Enables or disables the TrimBack Valley Filter.

Value                                  Effect
True – Default Shotgun, Paired End     Read trimming enabled
False – Default Amplicon               Read trimming disabled

minConsensusSignal
The minimum average intensity of the positive key flows to be considered as a potential well. Acceptable values are any integer >= 1.

Value                                  Effect
20 - Default Shotgun, full processing
60 - Default otherwise
> Default                              Increasing this value can eliminate dim candidate wells with poor signal quality. This can yield more high-quality bases and Reads from fewer candidates. However, if the value is too high, too many good candidates are discarded and throughput declines.

useBicubic
Images are upsampled during image processing, doubling their size in each dimension. This parameter impacts whether bicubic or bilinear upsampling is performed for this step.

Value                                  Effect
True - Default All                     Bicubic upsampling is performed
False                                  Bilinear upsampling is performed

vfScanLimit
Controls how much of the read is scanned with the valley filter. Acceptable values are any integer > 0.

Value                                  Effect
< Default                              Amplicon Runs will have more reads, but bases called from the un-filtered end of the flowgram may have higher error rates.
4,096 - Default Shotgun
700 - Default Amplicon
> Default                              Amplicon Runs will have fewer reads, but they will tend to have higher accuracy.

vfScanAllFlows
Modifies the meaning of the valley flow parameters.

Value                                  Effect
True                                   The ratio of vfBadFlowThreshold to vfLastFlowToTest is taken as a score threshold that is applied over the entire read.
False - Default Shotgun                For amplicon runs, a maximum of vfBadFlowThreshold valley flows can be seen within the first vfLastFlowToTest nucleotide flows.
tiOnly - Default Amplicon              Equivalent to true for both GS Junior and Genome Sequencer FLX Titanium runs.
flxOnly                                Equivalent to true for Genome Sequencer FLX Standard runs only.

vfLastFlowToTest
Represents the number of flows over which the reads are monitored for the presence of valley flows. The maximum value for this parameter is equal to the number of nucleotide flows in the Run (most stringent), and the minimum value is “0” (to turn off the Valley Filter). This parameter’s behavior has been largely superseded by the vfScanLimit parameter.

Value                                  Effect
400                                    Higher stringency
320                                    Default
168                                    Lower stringency

vfBadFlowThreshold
If vfScanAllFlows is false, then this is literally the number of flows that can fall within a valley in the valley filter before the read is rejected (for amplicons) or trimmed (for shotgun reads). In the more common case where vfScanAllFlows is true, this value forms the numerator of the valley filter score. In general this parameter should not need to be adjusted; the preferred method to adjust the valley filter threshold is the vfTrimBackScaleFactor parameter.

Value                                  Effect
2                                      Higher stringency
4                                      Default
6                                      Lower stringency

vfTrimBackScaleFactor
If vfScanAllFlows is true, then flows are scored against the flow valleys using an exponential function. The sum of scores is multiplied by vfTrimBackScaleFactor before being compared to the maximum allowable score ratio formed from vfBadFlowThreshold and vfLastFlowToTest. If the scaled score exceeds the maximum, then the read is trimmed (shotgun) or discarded (amplicon). Acceptable values are floating point numbers >= 0.

Value                                  Effect
1.6 - Default Shotgun
3.0 - Default Amplicon
> Default                              Increases the stringency of the valley filter. This yields higher accuracy, but also shorter read lengths and fewer high-quality reads.

vfUseRollingWindows
This parameter estimates the valley thresholds between the 0, 1, 2, and 3-mers dynamically across the run using a window of some number of flows, rather than using one set of thresholds for the entire run.

Value                                  Effect
True - Default                         Use rolling windows enabled. This generally makes the valley filter less stringent, yielding more high-quality reads and bases, but slightly shorter average read lengths.
False                                  Use rolling windows disabled

vfMaxFailedPercent
The percentage of “valley” flows in a read which cause the read to be rejected entirely. This parameter can be used to tune the failed percent. A range between “90” and “100” is recommended.

Value                                  Effect
100                                    Default
90                                     Higher stringency

vfTrimBackMinimumLength
When a read is trimmed such that it would be shorter than the value of this parameter, the read is rejected entirely. Users may set this parameter to a lower number if they are attempting to sequence very short reads.

Value                                  Effect
>84                                    Higher stringency
84                                     Default
<84                                    Lower stringency

minLength
Represents the minimum length of reads acceptable after all quality filtering steps.

Value                                  Effect
50                                     Default
84                                     Higher stringency (old default)

useAmpliconsPrimers
Only search for the sequencing primers appropriate for Amplicon experiments.

Value                                  Effect
True - Default                         Enabled
False                                  Disabled

useCorrectionGlobalLimit
Enables or disables a global limit on the SFF flowgram correction limit.

Value                                  Effect
True - Default                         Enabled
False                                  Disabled

While it is possible to turn off the read rejecting and signal intensity filters (Key Pass, Dot, Mixed and Short Signal), this will result in output of sub-standard quality data that is NOT recommended for use in subsequent data analysis. For example, to turn off Dots and Mixed filter, add the following lines under the <qualityFilter> section of the Shotgun Processing template (Figure 7).

<doDotCheck>false</doDotCheck>

<doMixedCheck>false</doMixedCheck>

<doClassifierCheck>false</doClassifierCheck>

<doShortSignalCheck>false</doShortSignalCheck>
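Instead of editing the template by hand, the same change can be scripted. The following Python sketch sets or adds parameter elements under a <qualityFilter> section using the standard library XML parser. It is only an illustration: the function name is hypothetical, and the enclosing XML structure of a real template (shown in Figure 7) is not reproduced here, so the script simply searches for the first <qualityFilter> element wherever it occurs. Comments in the original file are not preserved by this parser.

```python
import xml.etree.ElementTree as ET


def set_quality_filter_params(xml_text, params):
    """Return xml_text with each name/value pair in params written as a
    child element of the first <qualityFilter> section found."""
    root = ET.fromstring(xml_text)
    if root.tag == "qualityFilter":
        section = root
    else:
        section = root.find(".//qualityFilter")
    if section is None:
        raise ValueError("no <qualityFilter> section found")

    for name, value in params.items():
        node = section.find(name)
        if node is None:
            # Parameter not present by default: add it to the section.
            node = ET.SubElement(section, name)
        node.text = str(value)

    return ET.tostring(root, encoding="unicode")
```

A typical use would read filterTemplate.xml, call the function with, for example, {"vfTrimBackScaleFactor": 2.0}, and write the result to a new file that is then passed to the processing command. It is worth keeping the unmodified template alongside for comparison.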

2. GS REPORTER

The GS Reporter application is identical for both the GS Junior and Genome Sequencer FLX Instruments. However, due to differences in Instrument hardware, some references in this manual will be specific to either the GS Junior or the Genome Sequencer FLX Instruments. The main difference is the PicoTiterPlate (PTP) Device.

The GS Reporter is an application that can extract read trace information and Run metrics from the CWF files produced by the GS Run Processor. The GS Reporter application is invoked by the signal processing or ‘Full Processing’ jobs and is called automatically by the signal processing launch scripts to produce default files and reports.

2.1 gsReporter Executable

The gsReporter executable generates various human-readable files which can be used to examine the results of a sequencing Run. These include files pertaining to read data, FASTA files (*.fna), associated base quality score files (*.qual), and legacy files (*.wells, etc.). The gsReporter also generates files containing the Run metrics in text (*.txt) and comma-separated value (*.csv) files.

Figure 9: gsReporter section of an XML processing script (options should appear on one line)

The default output files of the gsReporter are specified in the gsReporter section of the signal processing XML pipeline script (Figure 9) and are indicated in the options table below by the shaded rows. The default location of the output files can be redirected to alternate locations, using the --out option of the gsReporter command, or to standard output, using the --console option, for piping of the gsReporter command in a script. The gsReporter will attempt to generate all output files specified even if it does not have the required information. For example, generating a FASTA file on an image-processed only CWF file will result in an empty .fna file and a warning message will be sent to the console.

gsReporter [OPTIONS] SOURCE_FILES

Command          Description
gsReporter       An application that extracts read trace information and Run metrics from CWF files to generate various human-readable files, including FASTA files (*.fna), associated base quality score files (*.qual), legacy files (*.wells, etc.; files generated from pre-2.0 software), and Run metrics files in text (*.txt) and comma-separated value (*.csv) formats.

Argument         Description
SOURCE_FILES     List of CWF data files to be processed.

Option Description

--console Redirects output to standard output. Used when piping the gsReporter command in a script.

--out Redirects output to an alternate (non-default) output location. By default, the GS Reporter application attempts to write its output files to the locations they would have been placed by earlier versions of the software.

--dump Dumps all XML in the corrected CWF file as a single document. (binary)

--info Shows summary information about the CWF file.

--legacy Simulates the output of a GS Junior or GS FLX sequencing Run processed with pre-2.0 software.

-a, --all Generates all output files.

--fna Generates one FNA file per library key per region for a corrected CWF file.

--qual Generates one qual file per library key per region for a corrected CWF file.

--meta Extracts the meta information about a region’s data as an XML file.

--metrics Extracts the metrics information about a region’s data as an XML file.

--wells Generates a .wells file per region.

--cfValues Generates one cfValues file per region for a corrected CWF file. This is a binary file that contains information about the CAFIE correction data. This file is used by the gsSupportTool for troubleshooting Runs. (binary)

--454RuntimeMetricsAll.txt Generates a 454RuntimeMetricsAll.txt file for the Run.

--454RuntimeMetricsAll.csv Generates a 454RuntimeMetricsAll.csv file for the Run.

--454BaseCallerMetrics.txt Generates a 454BaseCallerMetrics.txt file for the Run.

--454BaseCallerMetrics.csv Generates a 454BaseCallerMetrics.csv file for the Run.

--454RuntimeMetrics.txt Generates one 454RuntimeMetrics.txt per key per region.

--454RuntimeMetrics.csv Generates one 454RuntimeMetrics.csv per key per region.

--454QualityFilterMetrics.txt Generates a 454QualityFilterMetrics.txt file for the Run.

--454QualityFilterMetrics.csv Generates a 454QualityFilterMetrics.csv file for the Run.

--cafieMetrics Generates one cafieMetrics.csv file per region. (binary)

--droopEstimate Generates one droopEstimate file per region. (binary)

--trimInfo Generates one trimInfo file per region. (binary, not used)

--xy Generates one text file containing a list of well locations, per region.

--csv Generates one file with the flows written out as comma-separated values per region.

--analysisParms.parse Generates an approximation of a pre-2.0 analysisParms.parse.

--revisedRegions.parse Generates a single revisedRegions.parse file.

The binary files are an intermediate product of processing and do not contain metric information. They are used by the pipeline processing job.

2.2 GS Reporter output

The default output files of the gsReporter application are the following:

•

454RuntimeMetricsAll.txt/.csv

•

region.key.454RuntimeMetrics.txt/.csv

•

454QualityFilterMetrics.txt/.csv

•

454BaseCallerMetrics.txt/.csv

•

454AllControlMetrics.txt/.csv

•

region.librarykey.454Reads.fna

•

region.librarykey.454Reads.qual

where ’.txt/.csv’ denotes that both tab-delimited (*.txt) and comma-delimited (*.csv) files are output, and ‘region.key.’ denotes that one file per region-key combination is output.

The .fna and .qual files for the Control DNA reads are no longer automatically generated by the default gsReporter options but can be generated by using the following constructs:

gsReporter --fna=control 1.cwf

or

gsReporter --qual=TCAG 1.cwf

The ‘454AllControlMetrics.txt’ file contains the summary statistics for the Control DNA metrics and is useful in situations in which multiple control keys are used in a Run (see Section 7.1 for more details on this). It is generated by default, but is not generated when using the --legacy option to gsReporter.
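When Control DNA files are needed for several regions, the gsReporter constructs for --fna and --qual can be assembled in a script. The Python sketch below only builds the argument list; the helper name is hypothetical, and it uses nothing beyond the option forms quoted in this section (--fna=VALUE, --qual=VALUE, followed by the CWF source files). Running the result requires a machine on which the gsReporter executable is installed.

```python
import shlex


def gs_reporter_command(cwf_files, fna=None, qual=None):
    """Build a gsReporter argument list.

    fna / qual: value passed to --fna= / --qual= (for example "control"
    or a key such as "TCAG"); None leaves the option out.
    """
    cmd = ["gsReporter"]
    if fna is not None:
        cmd.append(f"--fna={fna}")
    if qual is not None:
        cmd.append(f"--qual={qual}")
    cmd.extend(cwf_files)
    return cmd


if __name__ == "__main__":
    cmd = gs_reporter_command(["1.cwf"], fna="control")
    # Print the command line for review; it could be executed with
    # subprocess.run(cmd, check=True) where gsReporter is available.
    print(shlex.join(cmd))
```

Building a list rather than a single string avoids shell quoting problems if a CWF path contains spaces.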

2.2.1 GS Reporter Metrics Files

The gsReporter application metrics file contents are shown, in part, below. The full contents are described in the appendix Section 6.5 along with the conventions for date, directory and other specifications.

Examples of the Metrics file sections are shown in txt format, in the Figures below.

Figure 10: 454RuntimeMetricsAll.txt comment header section

Figure 11: 454RuntimeMetricsAll.txt Run conditions group section

Figure 12: 454QualityFilterMetrics.txt key group section

Figure 13: 454BaseCallerMetrics.txt basecall results section

Figure 14: 454BaseCallerMetrics.txt Region Key Group

2.2.2 GS Reporter FNA and QUAL Files

The GS Reporter application also generates FASTA files (*.fna) and associated base quality files (*.qual) for the high quality reads generated. These are described below.

Output                              Description
region.librarykey.454Reads.fna      The (trimmed) nucleotide sequences for the filtered reads of that region and key.
region.librarykey.454Reads.qual     The nucleotide quality scores (Phred-equivalent) for the high quality (filtered and trimmed) reads of that region and key. For a description of the process to compute individual base quality scores, see appendix Section 6.6.
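Because the .fna and .qual files describe the same reads, they are usually read together. The Python sketch below parses FASTA-style text and the matching quality text (header lines starting with ">", followed by sequence lines or by whitespace-separated integer scores) and converts a Phred-equivalent score Q to an error probability using the standard relation P = 10^(-Q/10). The function names and the sample read names in the usage note are made up for illustration; the exact header fields written by gsReporter are not assumed.

```python
def parse_fna(text):
    """Parse FASTA text into {read_name: sequence}."""
    reads, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]  # first token of the header line
            reads[name] = ""
        elif line and name is not None:
            reads[name] += line
    return reads


def parse_qual(text):
    """Parse quality text into {read_name: [int, ...]}."""
    scores, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            scores[name] = []
        elif line and name is not None:
            scores[name].extend(int(q) for q in line.split())
    return scores


def phred_to_error_probability(q):
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10)
```

A useful sanity check after parsing is that every read has exactly one quality score per base, for example by comparing len(reads[name]) with len(scores[name]) for each read name.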

2.2.3 Organization of a Data Processing Directory

Once the data processing has been completed and the metrics and report files have been generated, the resulting directory structure can be quite complex. Table 7 below can be used to locate specific files within their expected directory if default locations were used.

Organization of a Data Processing Directory ( "D_" )

Table columns: File name; Generated by gsRunProcessor During Processing; Generated by Default; Can Be Generated Directly by gsReporter From CWF Files; Generated by gsReporter's "--legacy" Option; No Longer Generated; Notes

gsRunProcessor.log

gsRunProcessor_err.log 1

454DataProcessingDir.xml 2

region.KEYL.454Reads.fna 3,7

region.KEYL.454Reads.qual 3,7

region.KEYC.454Reads.fna 3,6,7

region.KEYC.454Reads.qual 3,6,7

454BaseCallerMetrics.csv 3,4,7

454BaseCallerMetrics.txt 3,7

454QualityFilterMetrics.csv 3,4,7

454QualityFilterMetrics.txt 3,7

454RuntimeMetricsAll.csv 3,4,7

454RuntimeMetricsAll.txt 3,7

454AllControlMetrics.txt 3,7

analysisParms.parse 3,5

revisedRegions.parse 4

454BaseCallerThresholds.txt

error.baseCaller2

error.bbcSelfTrain

error.cafieCorrection

regions/

region0X.cwf

region0X.metrics.xml

region0X.meta.xml

region0X.processingHistory.xml

region0X.wells

region0X.wells.KEY.454RuntimeMetrics.csv 3,4,7

region0X.wells.KEY.454RuntimeMetrics.txt 3,7

region0X.wells.cafieMetrics.csv 3,4,7

region0X.wells.cfValues 3,7

region0X.wells.droopEstimate 3,7

region0X.wells.incValues 3,7

region0X.wells.mleCorrectionInfo 3,7

region0X.wells.trimInfo 3,7

sff/

ACCNOPREX.sff

Table 7: Organization, in the ‘D_’ directory, of the files output by the GS Run Processor and the GS Reporter applications. The following codes are used in the file names: ‘region’ is the region number, ‘0X’ is the zero-padded region number, ‘KEYL’ is the 3 letter library sequencing key, ‘KEYC’ is the 3 letter Control DNA sequencing key, ‘ACCNOPRE’ is the accession number prefix.

Notes:

1 - This file is only generated if there are warnings or errors generated during processing.

2 - This file is only created after the base calling step is run and signifies to the data analysis software that the sff directory contains files suitable for further data analysis.

3 - The default generation of these files can be controlled by adjusting the gsReporter options in the pipeline processing scripts.

4 - These files are generated for legacy purposes only. It is recommended that new applications use X.metrics.xml and X.meta.xml extracted from the CWF file via gsRunProcessor for report generating purposes.

5 - This file is deprecated and generated for legacy purposes only. X.processingHistory.xml and X.meta.xml are the canonical record of parameters used to process a Run and a Run's metadata, respectively.

6 – These .fna and .qual files were generated by default in previous versions of the software.

7 - These files may only be generated after all processing is complete. Specifically, the data to create these files are not available after the image processing only step of gsRunProcessor.
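The layout in Table 7 can be checked from a script. The Python sketch below lists a few of the files named in the table for one Data Processing directory. It is a minimal illustration that assumes the default layout described above (read and metrics files at the top level of the ‘D_’ directory, CWF files under regions/, SFF files under sff/); the function name and the keys of the returned dictionary are made up for this example.

```python
from pathlib import Path


def summarize_data_processing_dir(d_dir):
    """Return a small summary of the files found in a D_ directory."""
    d = Path(d_dir)
    return {
        "fna": sorted(p.name for p in d.glob("*.454Reads.fna")),
        "qual": sorted(p.name for p in d.glob("*.454Reads.qual")),
        "cwf": sorted(p.name for p in (d / "regions").glob("*.cwf")),
        "sff": sorted(p.name for p in (d / "sff").glob("*.sff")),
        # Note 1: the error log exists only if warnings or errors occurred.
        "has_error_log": (d / "gsRunProcessor_err.log").exists(),
        # Note 2: this file is created only after the base calling step.
        "base_calling_done": (d / "454DataProcessingDir.xml").exists(),
    }
```

An empty "fna" list together with a non-empty "cwf" list would be consistent with Note 7, that is, a directory produced by the image processing only step.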

3. GS RUN BROWSER

The GS Run Browser application is identical for both the GS Junior and Genome Sequencer FLX Instruments. However, due to differences in Instrument hardware, some references in this manual will be specific to either the GS Junior or the Genome Sequencer FLX Instruments. The main difference is the PicoTiterPlate (PTP) Device layout.

The GS Run Browser application allows the user to interactively view and analyze the results of sequencing Runs performed on a GS Junior or Genome Sequencer FLX Instrument, to assess the general quality of a Run or for troubleshooting when results are sub-optimal. It also allows one to launch data processing of the raw Run data (via gsRunProcessorManager), and to prepare a data package that can be sent to Roche Customer Support for troubleshooting (via gsSupportTool, described in Section 4). It is available on the Genome Sequencer FLX Instrument, on the GS Junior Attendant PC, and on a DataRig or GS FLX Titanium cluster.

The application has the following tabs:

•

Overview: Run summary information

•

Wells: Raw images of the PTP Device captured during the Run; Locations and status of all identified active wells, well density information for raw wells and key pass wells, and CAFIE correction data

•

Signals: Raw or subtracted flowgrams for selected positions on the images; or fully processed flowgrams for all the data-generating wells detected during the Run

•

Reads: Read length and quality statistics for the sample library or the Control DNA reads

•

Control DNA: Consensus flowgrams and accuracy metrics for the Control DNA sequence reads

•

Filters: Raw signal statistics for the sample library or the Control DNA reads

3.1 Launching the GS Run Browser

The GS Run Browser can be launched by double-clicking the desktop icon located on the GS Junior Attendant PC desktop.

For all other hardware, including the Genome Sequencer FLX Instrument, open a terminal window and type the following command to launch the gsRunBrowser:

gsRunBrowser

If the optional path to a data set is given on the command line, that data set will be loaded and shown in the application. Otherwise, an introduction window will be presented (Figure 15), ready to load a data set.

Figure 15: The GS Run Browser main window, just after launching the application, when no data set directory is specified on the command line or launched from the desktop GS Run Browser icon.

3.2 Opening a Data Set

Clicking on the ‘Open’ button or on the ‘Open a Data Set’ text will open the ‘Open a Data Set’ dialog, prompting the user to select a data set to display (Figure 16). Either a Run directory (R_...), a data processing directory (D_...), or individual cwf (.cwf) or wells (.wells) files may be selected:

•

R_ directory – will load only the general Run information and the captured images, if available.

•

D_ directory – will load the general Run information, the captured images, and all the data processing results.

•

.cwf files – contain the uncorrected flowgrams from the image processing step or the corrected flowgrams from the signal processing step.

•

.wells files – legacy files generated by pre-version 2.0 software that contain well data.
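The four kinds of data set listed above can be told apart by name alone, which is handy when a script needs to decide what to pass to the application. The Python sketch below is a hypothetical helper, not part of the GS Run Browser: it relies only on the naming conventions stated in this section (an ‘R_’ or ‘D_’ directory name prefix, or a .cwf or .wells file extension) and does not check that the path exists or is valid.

```python
from pathlib import PurePath


def classify_data_set(path):
    """Classify a path by the naming conventions for data sets.

    Returns one of "cwf file", "wells file", "Run directory",
    "data processing directory", or None if no convention matches.
    """
    p = PurePath(path)
    if p.suffix == ".cwf":
        return "cwf file"
    if p.suffix == ".wells":
        return "wells file"
    if p.name.startswith("R_"):
        return "Run directory"
    if p.name.startswith("D_"):
        return "data processing directory"
    return None
```

Only the last path component is examined, so a ‘D_’ directory nested inside an ‘R_’ directory is reported as a data processing directory.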

Figure 16: The ‘Open a Data Set’ dialog, used to select the dataset

The ‘Open’ button of the ‘Open a Data Set’ dialog is enabled only when a valid data set and the correct directory or file type has been selected. A progress bar will appear at the upper right corner of the GS Run Browser window while the data set is being loaded, showing the steps of the loading. If the data set fails to open, an error message window will appear (not shown).

The application does not allow multiple .cwf files to be selected from different data processing directories. If you attempt to do so, the open button will be disabled and an error message will be displayed if you hover the mouse over the open button.

A list of recently opened data sets is always available. To access the list of previously opened data sets click the ‘Reopen Recent Data Set’ text from the introduction page, or click the ‘Open’ button while pressing the Shift key. The full path to a data set and the last time the data set was modified is displayed in a tooltip for reference (Figure 17).

Figure 17: Pop up list of recent data sets, ready to reopen

3.3 Overview of the GS Run Browser Interface

After a sequencing Run data set has been loaded, it will be displayed in the GS Run Browser window (Figure 18). The full path of the data set is listed in the status area at the top of the page. Active data tabs are populated, inactive data tabs are grayed-out.

Figure 18: General view of the GS Run Browser window

3.3.1 The Global Action Area Buttons

Four main buttons are always available in the right hand tool bar (see Figure 18):

•

The Exit button closes the application.

•

The Open button allows a user to browse the file system to find a data set to open. Clicking this button while pressing the Shift key opens a pop-up menu allowing the user to choose from a list of data sets recently opened with the GS Run Browser. Only one data set may be opened at a time; when opening a new data set, a message will appear asking the user to confirm closing of the current data.

•

The About button opens a dialog window providing version information about the GS Run Browser application and a button to access the GS Support Tool used to help in the troubleshooting of potential sequencing Run issues (see Section 4 for more details about the Support Tool).

•

The Help button will provide instructions for where to find further information on the software, or will link you directly to a searchable version of the 454 Sequencing System Software Manual.

3.3.2 The Tabs

The GS Run Browser application displays the various aspects of the sequencing Run results in a series of 6 tabs, as listed below.

The Overview tab contains:

•

a summary of the sequencing Run on the left

•

the GS Run Processor results, if available, on the top right

•

the Run Processor Manager interface, which allows the user to launch a new data processing job of the currently open sequencing Run, on the bottom right.

The Wells tab displays three layers of information:

•

the raw images captured during the sequencing Run

•

the regions found during image processing

•

the wells found during image processing.

This tab also gives the user access to well flowgrams by right-clicking on areas of the PTP Device image.

The Signals tab provides statistics on raw signal intensity and homopolymer length distributions across all the wells of the PicoTiterPlate Device for any individual flow, for either sample library or Control DNA reads.

The Reads tab provides statistics on read length and quality score results, for either sample library or Control DNA reads.

The Control DNA tab displays the accuracy metrics for the Control DNA reads (% match to their reference sequences) and other Control DNA information.

The Filters tab shows the quality filter information for the Run, including the number of wells that failed and passed each filter, for either sample library or Control DNA reads.

3.3.3 The Buttons and Plots

The images, plots and data tables displayed in the various tabs of the GS Run Browser application are scrollable and/or zoomable graphical elements. They share certain common buttons and functions, e.g. to perform the scrolling and zooming. When they do appear, these graphic elements have some or all of the following features (see Figure 19 for an example of a well flowgram window, which has most of these elements):

•

A column of buttons along the upper left edge of the graphic elements, used for navigation (including various zooming functions) and/or to save snapshot images or text files of the displayed data (see Section 3.3.4 for details). For the image area of the Wells tab, this is replaced by a unique set of controls that are overlaid near the upper left corner of the image.

•

Mouse functions (pointing, clicking or dragging the mouse, touchpad, pen, etc. over the graphical element) to view data values and adjust the zoom level.

Figure 19: An example Well Flowgram window showing many of the common graphical element functions

3.3.4 The Navigation and Data Capture Buttons

The buttons appearing to the left of an element have common general functions. In some cases their specific meaning is adjusted to the context of the element: