Sequencing Analysis Software User Guide

ILLUMINA PROPRIETARY Catalog # SY-960-1501 Part # 15006500 Rev. A

FOR RESEARCH USE ONLY

Sequencing Analysis Software User Guide

For Pipeline Version 1.5 and CASAVA Version 1.0

Genome Analyzer Pipeline v1.5 and CASAVA v1.0 Software User Guide iii

Notice

This document and its contents are proprietary to Illumina, Inc. and its affiliates (“Illumina”), and are intended solely for the contractual use of its customers and for no other purpose than to use the product described herein. This document and its contents shall not be used or distributed for any other purpose and/or otherwise communicated, disclosed, or reproduced in any way whatsoever without the prior written consent of Illumina, Inc.

For the proper use of this product and/or all parts thereof, the instructions in this document must be strictly and explicitly followed by experienced personnel. All of the contents of this document must be fully read and understood prior to using the product or any of the parts thereof.

FAILURE TO COMPLETELY READ AND FULLY UNDERSTAND AND FOLLOW ALL OF THE CONTENTS OF THIS DOCUMENT PRIOR TO USING THIS PRODUCT, OR PARTS THEREOF, MAY RESULT IN DAMAGE TO THE PRODUCT, OR PARTS THEREOF, AND INJURY TO ANY PERSONS USING THE SAME.

RESTRICTIONS AND LIMITATION OF LIABILITY

This document is provided “as is,” and Illumina assumes no responsibility for any typographical, technical or other inaccuracies in this document. Illumina reserves the right to periodically change information that is contained in this document and to make changes to the products, processes, or parts thereof described herein without notice.

Illumina does not assume any liability arising out of the application or the use of any products, component parts, or software described herein. Illumina does not convey any license under its patent, trademark, copyright, or common-law rights nor the similar rights of others. Illumina further reserves the right to make any changes in any processes, products, or parts thereof, described herein without notice. While every effort has been made to make this document as complete and accurate as possible as of the publication date, no warranty of fitness is implied, nor does Illumina accept any liability for damages resulting from the information contained in this document.

ILLUMINA MAKES NO REPRESENTATIONS, WARRANTIES, CONDITIONS, OR COVENANTS, EITHER EXPRESS OR IMPLIED (INCLUDING WITHOUT LIMITATION ANY EXPRESS OR IMPLIED WARRANTIES OR CONDITIONS OF FITNESS FOR A PARTICULAR PURPOSE, NON-INFRINGEMENT, MERCHANTABILITY, DURABILITY, TITLE, OR RELATED TO THE PERFORMANCE OR NONPERFORMANCE OF ANY PRODUCT REFERENCED HEREIN OR PERFORMANCE OF ANY SERVICES REFERENCED HEREIN).

This document may contain references to third-party sources of information, hardware or software, products or services, and/or third-party web sites (collectively the “Third-Party Information”). Illumina does not control and is not responsible for any Third-Party Information, including, without limitation, the content, accuracy, copyright compliance, compatibility, performance, trustworthiness, legality, decency, links, or any other aspect of Third-Party Information. Reference to or inclusion of Third-Party Information in this document does not imply endorsement by Illumina of the Third-Party Information or of the third party in any way.

Illumina, illuminaDx, Solexa, Making Sense Out of Life, Oligator, Sentrix, GoldenGate, GoldenGate Indexing, DASL, BeadArray, Array of Arrays, Infinium, BeadXpress, VeraCode, IntelliHyb, iSelect, CSPro, and GenomeStudio are registered trademarks or trademarks of Illumina, Inc. All other brands and names contained herein are the property of their respective owners.

Revision History

Part Number    Revision Letter    Date
15006500       A                  August 2009
15003807       A                  April 2009
1005359        A                  December 2008
1004759        A                  June 2008
1003881        A                  January 2008

Chapter 1 Overview . . . 1

Analysis Computing Systems. . . 2

What’s New . . . 5

Reporting Problems . . . 6

Technical Assistance . . . 7

Chapter 2 Core Pipeline Concepts . . . 9

Analysis Modules . . . 10

Use of Modules . . . 11

Running the Pipeline Modules. . . 12

Understanding the Run Folder. . . 13

Run Folder Structure . . . 15

Images Folder . . . 15

Data Folder . . . 16

Run Folder Naming . . . 17

File Naming . . . 18

Configuration/Parameters . . . 18

Calibration and Input Parameters . . . 19

Quality Scoring . . . 19

Image Offsets . . . 19

Frequency Cross-Talk Matrix . . . 20

Phasing/Prephasing Estimates. . . 21

Sample Information . . . 21

Alignment Algorithms . . . 22

ELAND Algorithm Description. . . 22

Chapter 3 Using GOAT Starting with Image Analysis . . . 25

Invoking GOAT for Image Analysis . . . 26

Running a GOAT Image Analysis . . . 27

Paired Reads . . . 27

Parallelization Switch . . . 28

Nohup Command . . . 28

Command Line Options for GOAT . . . 29

General Options . . . 29

GOAT Options. . . 30

Makefile Targets . . . 32

Chapter 4 Using Bustard Starting with Base Calling . . . 35

Invoking Bustard for Base Calling . . . 37

Running Pipeline Base Calling . . . 38

Starting with SCS Image Analysis Data . . . 38

Starting with IPAR Image Analysis Data . . . 38

Parallelization Switch . . . 40

Nohup Command . . . 40

Command Line Options for Bustard . . . 41

General Options . . . 41

Bustard Options. . . 42

Makefile Targets . . . 43

Chapter 5 Using GERALD for Sequence Alignment . . . 45

Running GERALD as a Standalone Program . . . 47

Standard GERALD Analysis . . . 47

GERALD Configuration File . . . 47

Lane-Specific Options . . . 49

Optional Parameters . . . 49

Paired-End Analysis Options . . . 50

GERALD Parameters . . . 51

ANALYSIS Variables. . . 51

Analysis Parameters . . . 52

Lane-by-Lane Parameters . . . 52

USE_BASES Option . . . 53

QCAL_SOURCE Option. . . 54

Make Option . . . 55

Rerunning the Analysis . . . 56

Building an SRF Archive. . . 56

ELAND Alignments . . . 57

Missing Bases in ELAND . . . 58

Using ANALYSIS eland_extended . . . 58

Using ANALYSIS eland_pair . . . 59

Using ANALYSIS eland_tag . . . 61

Using ANALYSIS eland_rna . . . 62

Preparing the Reference Genome . . . 65

Chapter 6 Pipeline Analysis Output . . . 67

Results Summary . . . 68

Cluster Intensity . . . 73

Error Rates . . . 74

Text-Based Analysis Results . . . 75

GERALD Output Files in Temp Folder . . . 76

Interpretation of Run Quality . . . 77

Summary.htm. . . 77

IVC.htm . . . 81

All.htm and Error.htm . . . 81

Chapter 7 Advanced Pipeline Usage . . . 83

Running Bustard as a Standalone Program . . . 84

Filtering Parameters . . . 85

Analysis of Multiplexed Sequencing Runs . . . 86

GERALD Analysis . . . 86

Split_on_index.py Script . . . 87

Chapter 8 Using CASAVA . . . 89

Use Cases . . . 91

CASAVA Workflow. . . 92

Hardware and Software Requirements . . . 92

Expected Performance . . . 94

Estimating Build Depth . . . 94

CASAVA Input Files . . . 95

Export.txt Files. . . 95

Run.conf.xml . . . 95

Pair.xml . . . 95

Config.xml . . . 96

Summary.htm. . . 96

Methods . . . 96

Duplicate Removal. . . 96

Final Set of Reads . . . 96

Allele Calling . . . 96

SNP Calling . . . 97

CASAVA Read Start Counting Method . . . 97

Running CASAVA . . . 98

Examples . . . 101

Results Directory . . . 101

Running Specific Use Cases . . . 103

DNA Sequencing Analysis for Large Genomes . . . 103

DNA Sequencing Analysis for Small Genomes . . . 105

RNA Sequencing . . . 106

CASAVA Output Files . . . 109

Build Directory . . . 109

Build Web Page. . . 110

CASAVA Build . . . 111

Appendix A Requirements and Software Installation for Pipeline . . . . 115

System Requirements. . . 116

Network Infrastructure . . . 116

Analysis Computer. . . 117

Installation Prerequisites . . . 119

Setting Up Email Reporting . . . 119

Installing the Pipeline Software . . . 121

Compiling on Other Platforms. . . 121

Directory Setup . . . 121

Appendix B Analysis Output File Descriptions . . . 123

Output File Types . . . 124

Intensity Files . . . 125

Main Sequence Files from Bustard . . . 127

Optional Files from Bustard. . . 127

Efficiency . . . 128

Intermediate Output Data Files . . . 129

Output File Formats . . . 130

Configuration/Parameters File Format. . . 132

.Params File . . . 132

Config.xml Files . . . 132

RunInfo.xml File . . . 135

Appendix C Using Parallelization in Pipeline . . . 137

“Make” Utilities . . . 138

Customizing Parallelization . . . 138

Parallelization Limitations . . . 142

Memory Limitations . . . 142

Appendix D Reference Files for Eland_rna and CASAVA. . . 143

Eland_rna Reference Files . . . 144

Eland_rna Genome Files . . . 144

Abundant Sequences Files . . . 144

Splice Junction Set . . . 145

Human Eland_rna Reference Files . . . 145

Mouse Eland_rna Reference Files . . . 146

Rat Eland_rna Reference Files . . . 146

CASAVA Reference Files . . . 148

CASAVA Genome Files . . . 148

Genome Size File. . . 148

Exon Coordinates Set . . . 148

Human CASAVA Reference Files. . . 148

Mouse CASAVA Reference Files . . . 149

Rat CASAVA Reference Files . . . 149

Generating Reference Files . . . 150

Getting Data Files . . . 150

Genome Sequence Files . . . 152

Abundant Sequence Files . . . 152

Splice Junction Set . . . 153

List of Figures

Figure 1 Three Steps of Data Analysis . . . 2

Figure 2 Data Analysis Workflow . . . 3

Figure 3 Phasing and Prephasing . . . 10

Figure 4 Pipeline Modules . . . 11

Figure 5 SCS Real Time Analysis Run Folder Directory Structure . . . 13

Figure 6 IPAR/Pipeline Run Folder Directory Structure . . . 14

Figure 7 Frequency Cross-Talk Matrix and Phasing File Locations . . . 20

Figure 8 CASAVA Workflow. . . 92

Figure 9 CASAVA Build Directory . . . 109

Figure 10 Build Web Page . . . 110

Figure 11 Summary.htm File . . . 110

Figure 12 SNPs Graphs in Home.html . . . 111

Figure 13 Statistics Graphs in Home.html . . . 111

Figure 14 Chromosome.snp.txt File Opened in Excel . . . 113

Figure 15 Chromosome_genes_count.txt File Opened in Excel . . . 114

Figure 16 Run Folder Structure and Output File Types . . . 124

Figure 17 UCSC Genome Bioinformatics Web Page . . . 150

Figure 18 Selected Genome Web Page (Human) . . . 151

Figure 19 Annotation Database Files Web Page, Files . . . 151

Figure 20 Index (Data set by chromosome) Web Page . . . 152

List of Tables

Table 1 Illumina Customer Support Contacts . . . 7

Table 2 File Naming Components . . . 18

Table 3 GERALD Configuration File Parameters . . . 48

Table 4 GERALD Configuration File Lane-Specific Options . . . 49

Table 5 GERALD Configuration File Optional Parameters . . . 49

Table 6 GERALD Configuration File Paired-End Analysis Options . . . 50

Table 7 ANALYSIS Variables . . . 51

Table 8 Analysis Parameters . . . 52

Table 9 Lane-by-Lane Parameters. . . 53

Table 10 USE_BASES Options . . . 53

Table 11 QCAL_SOURCE Variable Values . . . 54

Table 12 Parameters for ANALYSIS eland_extended . . . 58

Table 13 Parameters for ANALYSIS eland_pair . . . 60

Table 14 Parameters for ANALYSIS eland_rna . . . 64

Table 15 Example of Relative Orientation Statistics Table . . . 72

Table 16 Example of Insert Size Statistics Table . . . 72

Table 17 Example of Insert Statistics Table . . . 73

Table 18 Text-Based Analysis Results . . . 75

Table 19 Example of Lane Results Summary . . . 77

Table 20 Example of Expanded Lane Summary . . . 77

Table 21 Expected Performance for Typical CASAVA Projects . . . 94

Table 22 Data Volumes Per Experiment . . . 116

Table 23 Intermediate Output File Descriptions . . . 129

Table 24 Final Output File Formats . . . 130

Table 25 Intermediate Output File Formats . . . 131

Table 26 Human Genome Reference Files for Eland_rna . . . 145

Table 27 Mouse Genome Reference Files for Eland_rna . . . 146

Table 28 Rat Genome Reference Files for Eland_rna . . . 147

Table 29 Human Genome Reference Files for CASAVA for RNA Sequencing . . . 149

Table 30 Mouse Genome Reference Files for CASAVA for RNA Sequencing . . . 149

Table 31 Human Genome Reference Files for CASAVA for RNA Sequencing . . . 149

Chapter 1 Overview

Topics

2 Introduction

3 Genome Analyzer Pipeline Software

4 Pipeline Workflow

4 CASAVA Software

5 What’s New

6 Reporting Problems

7 Technical Assistance

Introduction

This user guide documents the Genome Analyzer Pipeline Software and the CASAVA Software. The Genome Analyzer Pipeline Software performs offline data analysis of a sequencing run. The CASAVA Software package performs post-sequencing analysis of data from reads aligned to the reference genome by Pipeline. The basic functionalities of these modules are described below.

Analysis of Sequencing Data

After the Genome Analyzer generates the sequencing images, the data is analyzed in three steps: image analysis, base calling, and sequence analysis (Figure 1).

1. Image analysis—Uses the raw TIF files to locate clusters on the image, and outputs the cluster intensity, X,Y positions, and an estimate of the noise for each cluster. The output from image analysis provides the input for base calling.

2. Base calling—Uses cluster intensities and noise estimates to output the sequence of bases read from each cluster, a confidence level for each base, and whether the read passes filtering.

3. Sequence analysis—Allows for alignment to a reference sequence and visualization of the result.

Figure 1 Three Steps of Data Analysis

Analysis Computing Systems

The different analysis steps can be performed by different analysis computing systems:

• Sequencing Control Software (SCS) real time analysis, which runs on the Genome Analyzer instrument computer. SCS real time analysis performs real-time image analysis and base calling.

• The Genome Analyzer Pipeline Software (Pipeline), which runs on a Linux analysis server. Pipeline can perform off-line image analysis, base calling, and sequence analysis.

NOTE

With the launch of GA_IIx, IPAR has become functionally obsolete. However, if you have not upgraded to SCS 2.4 or later you can still use IPAR 1.3 and older versions of SCS to

Figure 2 Data Analysis Workflow

The standard workflow is to perform image analysis and base calling using SCS real time analysis, after which Pipeline performs alignment using the base calling results (Figure 2).

Genome Analyzer Pipeline Software

The Genome Analyzer Pipeline Software is a set of utilities designed to perform a complete offline data analysis of a sequencing run. It is supplied as source code and scripts.

The output data produced by the Genome Analyzer Pipeline Software are stored in a hierarchical folder structure called the Run Folder. The Run Folder includes all data folders generated from the Genome Analyzer and the data analysis software. For a detailed description of the Run Folder structure, see Understanding the Run Folder on page 13.

The Pipeline requires a Linux system with specific processing and data storage capacity. For specific requirements, see System Requirements on page 116.

NOTE

IPAR, SCS, and Pipeline image analysis may yield slightly different results, due to minor variations in libraries and algorithms used. The differences are negligible compared to experimental variation.

Pipeline Workflow

The image data from a sequencing run are saved on the Genome Analyzer computer in a folder structure organized by lane and tile number. The data are transferred to a network location for analysis after the sequencing run is complete or by mirroring the data to the storage location while the run progresses. SCS real time analysis and IPAR also transfer their output data to a network location for analysis by Pipeline after the run is complete.

The following is an overview of the Pipeline workflow.

Installation

1. Install the Pipeline prerequisites on a suitable Linux system. See Installation Prerequisites on page 119.

2. Install the Pipeline software and compile the Pipeline using the “make” command. See Installing the Pipeline Software on page 121.

3. Set up the “Instruments” directory for parameters files. See Directory Setup on page 121.

Running the Analysis

1. Navigate (via the command line) to the Run Folder location.

2. Create a configuration file that specifies what analysis should be done for each lane. See GERALD Parameters on page 51 and GERALD Configuration File on page 47.

3. Run a check on the Run Folder. See Running a GOAT Image Analysis on page 27.

4. Add command line options, generate the analysis folder, and corresponding makefiles. See Command Line Options for GOAT on page 29.

5. Change to the analysis directory and start your analysis by executing makefiles.

Analysis Output

1. View the analysis results of your run. See Visual Analysis Summary on page 68 and Text-Based Analysis Results on page 75.

2. Interpret the run quality. See Interpretation of Run Quality on page 77.

CASAVA Software

The CASAVA Software v1.0 (CASAVA) provides analysis for three basic use cases:

• DNA Sequencing for large genomes.

• DNA Sequencing for small genomes (data sets).

• RNA Sequencing.

All types of analysis take export.txt files from Pipeline as input and produce a set of allele calls for Single Nucleotide Polymorphisms (SNPs). In addition, RNA Sequencing analysis provides counts for exons, genes and splice

What’s New

Important Changes in Pipeline v1.5

• New quality tables supporting flow cell v4 and cluster generation chemistry v4.

• Support for quality tables for 1.4 mm flow cell and cluster generation chemistry v2.

Important Changes in Pipeline v1.4

• Improved the estimation of the alignment scores of longer reads.

• Can start from SCS real time analysis base calling data.

• Bustard supports the binary intensity data format generated by SCS real time analysis (with the --CIF option).

• The format of the Firecrest output has changed. The intensity and noise files are now generated cycle by cycle.

• The Pipeline uses the data from the file RunInfo.xml (normally generated by SCS) to identify the boundaries of the reads (including index reads).

• PhageAlign produces export files.

Important Changes in Pipeline v1.3

• The quality scoring scheme has changed to the Phred scoring scheme, encoded as an ASCII character by adding 64 to the Phred value. The Phred score of a base is:

    Q_phred = -10 log₁₀(e)

    where e is the estimated probability of a base being wrong.

• The Bustard output formats have changed; a new file format called "qseq.txt" is used to store read IDs, sequence and quality information, as well as filter information.

• The old Bustard output formats can be produced optionally with the "--with-seq", "--with-prb", "--with-siq2", and "--with-qval" options.

• A new build system is used. The installation still involves installing the prerequisites and then typing "make" and "make install" in the top-level pipeline folder. The executables can now be found in the bin/ directory (e.g. bin/goat_pipeline.py).

• For base-call auto-calibration, the option "--with-qval" needs to be specified at the goat_pipeline.py or bustard.py command line.

• The GERALD analysis modes "expression" and "eland" are deprecated. They are replaced by "eland_tag" and "eland_extended", respectively.

• The "CONTAM_FILE" feature in PhageAlign mode is deprecated.
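The Phred scoring and ASCII encoding described in the first item of the v1.3 list above can be sketched in a few lines of Python. This is only an illustration of the formula and the offset of 64 stated in the text; the function names are hypothetical, and rounding the score to the nearest integer is an assumption, not something this guide specifies.

```python
import math

ASCII_OFFSET = 64  # offset stated in the text for the ASCII encoding


def phred_score(error_probability: float) -> int:
    """Return Q_phred = -10 * log10(e), rounded to an integer (assumption)."""
    if not 0.0 < error_probability <= 1.0:
        raise ValueError("error probability must be in the range (0, 1]")
    return round(-10 * math.log10(error_probability))


def encode_quality(score: int) -> str:
    """Encode a Phred score as one ASCII character by adding the offset."""
    return chr(score + ASCII_OFFSET)


def decode_quality(character: str) -> int:
    """Recover the Phred score from one encoded ASCII character."""
    return ord(character) - ASCII_OFFSET
```

For example, an estimated error probability of 0.001 corresponds to a Phred score of 30, which this sketch encodes as the character with ASCII code 94.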

Reporting Problems

Contact Illumina Technical Support to report any issues with the Pipeline.

When reporting an issue, it is critical to capture all the output and error messages produced by a run. This is done by redirecting the output using “nohup” or the facilities of a cluster management system. For an explanation of “nohup,” see Running a GOAT Image Analysis on page 27.

It helps to attach the makefile corresponding to the part of the Pipeline that is causing the problem. If there are GERALD-related issues, it helps to post the config.txt file found in the GERALD output folder. For problems relating to specific tiles or files, it is useful to send the output of “wc -l” and “ls -l” on these files.

Technical Assistance

For technical assistance, contact Illumina Customer Support.

MSDSs

Material safety data sheets (MSDSs) are available on the Illumina website at http://www.illumina.com/msds.

Product Documentation

If you require additional product documentation, you can obtain PDFs from the Illumina website. Go to http://www.illumina.com/documentation. When you click on a link, you will be asked to log in to iCom. After you log in, you can view or save the PDF.

If you do not already have an iCom account, then click New User on the iCom login screen and fill in your contact information. Indicate whether you wish to receive the iCommunity newsletter (a quarterly newsletter with articles about, by, and for the Illumina Community), illumiNOTES (a monthly newsletter that provides important product updates), and announcements about upcoming user meetings. After you submit your registration information, an Illumina representative will create your account and email login instructions to you.

Frequently Asked Questions

Frequently asked questions are available online.

Go to http://www.illumina.com/FAQs, and click on Software, then on Genome Analyzer Pipeline Software.

Table 1 Illumina Customer Support Contacts

Contact                           Number
Toll-free Customer Hotline        1-800-809-ILMN (1-800-809-4566)
International Customer Hotline    1-858-202-ILMN (1-858-202-4566)
Illumina Website                  http://www.illumina.com
Email                             [email protected]

Chapter 2 Core Pipeline Concepts

Topics

10 Introduction

10 Analysis Modules

13 Understanding the Run Folder

15 Run Folder Structure

15 Images Folder

16 Data Folder

17 Run Folder Naming

18 File Naming

18 Configuration/Parameters

19 Calibration and Input Parameters

19 Quality Scoring

19 Image Offsets

20 Frequency Cross-Talk Matrix

21 Phasing/Prephasing Estimates

21 Sample Information

22 Alignment Algorithms

22 ELAND Algorithm Description

Introduction

Analysis modules perform the specific tasks of image analysis, base calling, and sequence alignment. During an analysis run, the Pipeline generates a defined folder structure that captures the output of an instrument run in text files and also contains the configuration files. Configuration files contain calibration and input settings that optimize your analysis run, and the alignment programs perform the sequence analysis. This chapter describes these core concepts of the Genome Analyzer Pipeline Software.

Analysis Modules

The Pipeline is divided into the following modules:

• Firecrest is the module used for image analysis. Firecrest identifies cluster positions, sharpens and enhances clusters through image filtering, removes background noise, detects clusters based on morphological features on the image, and extracts intensities.

• Bustard is the module used for base calling. Bustard deconvolves the signal from the clusters and applies correction for cross-talk, phasing, and prephasing.

    • Frequency cross-talk: The Genome Analyzer uses two lasers and four filters to detect four dyes attached to the four types of nucleotide, respectively. The emission spectra of these four dyes overlap so that the four images are not independent. Pipeline uses a frequency cross-talk matrix to correct for this cross-talk (for more information, see Frequency Cross-Talk Matrix on page 20).

    • Phasing/Prephasing: Depending on the efficiency of the fluidics and chemistry of the sequencing reactions, a small number of molecules in each cluster may run ahead of (prephasing) or fall behind (phasing) the current incorporation cycle (see Figure 3). This effect is mitigated by applying corrections during the base calling step (for more information, see Phasing/Prephasing Estimates on page 21).

Figure 3 Phasing and Prephasing

• Generation of Recursive Analyses Linked by Dependency (GERALD) is the module used for sequence alignment and metrics visualization. The following two alignment programs work within the GERALD module:

    • Efficient Large-Scale Alignment of Nucleotide Databases (ELAND) is very fast and aligns reads with up to two errors relative to the reference in the first 32 bases, or in more bases with ELAND extended. This algorithm is used for any reference larger than 100 Kbases.

    • PhageAlign performs an exhaustive alignment (all possible alignments up to arbitrary edit distances), but is slow.
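The ELAND acceptance criterion mentioned above (no more than two errors in the first 32 bases) can be illustrated with a simple mismatch count. This is not the ELAND algorithm, which is described later in this guide. It is only a hedged sketch of the criterion for a read already placed against a candidate reference window, and all names are hypothetical.

```python
SEED_LENGTH = 32     # number of leading bases considered, per the text
MAX_MISMATCHES = 2   # maximum number of errors allowed, per the text


def count_mismatches(read: str, reference_window: str,
                     length: int = SEED_LENGTH) -> int:
    """Count differing positions in the first `length` bases of both strings."""
    compared = min(length, len(read), len(reference_window))
    return sum(
        1
        for read_base, ref_base in zip(read[:compared], reference_window[:compared])
        if read_base != ref_base
    )


def within_error_limit(read: str, reference_window: str) -> bool:
    """Return True if the read differs from the window in at most two bases."""
    return count_mismatches(read, reference_window) <= MAX_MISMATCHES
```

A real aligner must also find the candidate positions efficiently across the whole reference; the sketch deliberately leaves that out.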

Use of Modules

The use of these modules depends on the input data the Pipeline starts with (Figure 4):

• Starting with base calling data generated by SCS real time analysis (RTA), use the GERALD.pl script. Chapter 5, Using GERALD for Sequence Alignment, describes the use of GERALD.

• Starting with image analysis data generated by SCS real time analysis (or IPAR), use the bustard.py script. After the Bustard module is finished, the bustard.py script calls the subscript for GERALD (GERALD.pl) automatically. Chapter 4, Using Bustard Starting with Base Calling, describes the use of bustard.py.

• Starting with images from the Genome Analyzer, use the script goat_pipeline.py, named after the General Oligo Analysis Tool (GOAT). The goat_pipeline.py script calls the subscripts for three Pipeline modules: Firecrest, Bustard (bustard.py), and GERALD (GERALD.pl). Chapter 3, Using GOAT Starting with Image Analysis, describes the use of GOAT.

Figure 4 Pipeline Modules

Typically, the analysis begins with the alignment script GERALD.pl, using base calling data generated by SCS real time analysis. However, if you need to reanalyze data, you can start with one of the other scripts and use different parameters.

NOTE

When you start with image analysis data, you must invoke Bustard. The goat_pipeline.py script cannot be used to start with image analysis data in Pipeline v1.3 and later.

Running the Pipeline Modules

The Pipeline is divided into modules that are managed by the “make” utility.

The “make” utility is commonly used to build executables from source code and is designed to model dependency trees by specifying dependency rules for files. These dependencies are stored in a file called a makefile. Each Pipeline module is a collection of Perl or Python scripts and C++ executables, and has its own makefile associated with the analysis task.
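The idea of dependency rules can be illustrated with a toy example: given which step depends on which, a valid execution order follows from the rules. The sketch below uses Python's standard graphlib module (Python 3.9 or later). The step names are hypothetical placeholders and do not correspond to actual targets in the Pipeline makefiles.

```python
from graphlib import TopologicalSorter

# Hypothetical rules: each key depends on the steps listed in its value set.
DEPENDENCIES = {
    "image_analysis": set(),
    "base_calling": {"image_analysis"},
    "alignment": {"base_calling"},
}


def execution_order(dependencies: dict) -> list:
    """Return the steps in an order that satisfies every dependency rule."""
    return list(TopologicalSorter(dependencies).static_order())
```

Calling execution_order(DEPENDENCIES) lists image analysis first, then base calling, then alignment, because each step is emitted only after the steps it depends on. The “make” utility additionally checks which files are out of date; that part is not modeled here.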

“Make” has a dual purpose within the Pipeline software:

• To build executables from source code

• To perform data analysis steps using the software

A run of the Pipeline is a two-stage process:

1. Generate the folders and makefiles using one of the above scripts.

2. Start the Pipeline analysis by executing “make.”

The process is described for the different workflows in Chapter 3, Using GOAT Starting with Image Analysis; Chapter 4, Using Bustard Starting with Base Calling; and Chapter 5, Using GERALD for Sequence Alignment.

Understanding the Run Folder

The Pipeline operates in a specific directory called the Run Folder where the images and analysis output files are saved by default in a consistent hierarchical structure. A Run Folder containing SCS real time analysis data is very similar to a Run Folder containing IPAR analysis data, or a Run Folder containing only Pipeline analysis data.

Figure 5 illustrates a typical Run Folder after SCS image analysis and base calling, and Pipeline alignment.

Figure 5 SCS Real Time Analysis Run Folder Directory Structure

Figure 6 illustrates a typical Run Folder after IPAR image analysis and Pipeline base calling and alignment, or one containing only Pipeline analysis data.

Figure 6 IPAR/Pipeline Run Folder Directory Structure

The standardized structure, file naming conventions, and file formats of the Run Folder allow for the following:

• Encoding sufficient information to trace the history of the data in the Run Folder back to the laboratory notebook without confusion between instruments, experiments, or sites.

• Standardized input and output enabling component software to operate flawlessly, regardless of the instrument generating the data.

• Capturing and encoding enough information to independently reanalyze the data at any time, in such a way that existing extractions of sequence and related data are preserved, and parameters used during any point of the extraction process are captured and related to the subsequent output data.

• Subsequent analyses to be stored in the Run Folder.

• The software tools and other user software to implement and enforce these structures and standards.

Run Folder Structure

The Run Folder contains the Images folder and Data folder as illustrated in Figure 5 and Figure 6 above.

• The Data folder contains Image Analysis folders and the Image Analysis folders contain Basecall folders which contain Sequence folders. The Data folder is created by the Genome Analyzer when a run starts. Any analysis performed on the data, including SCS real time analysis and IPAR analysis, is saved within the Data folder.

• The Images folder holds the images from every tile for all cycles of sequencing. The Images folder will not be present if only analysis data, not the images, are copied to the analysis server after SCS real time analysis or IPAR analysis. There is an option to send images to a second networked run folder apart from the main/default network destination.

Each run of the main Pipeline analysis modules creates a subdirectory in the Data folder of the Run Folder as follows (see Figure 5 and Figure 6 above):

• Each run of the Pipeline image analysis software (Firecrest) creates a new image analysis output folder in the Data folder.

• Each run of the Pipeline base calling software (Bustard) creates a new subdirectory in the image analysis subdirectory on which the base calls are based, resulting in a tree-like structure of analyses.

• Parameters and versions for any given analysis run are logged in the folder structure to make it possible to reconstruct any previous analysis run.

You can do multiple analyses of the data using different analysis parameters and the results will not be overwritten. The default naming convention for folders generated by the Pipeline consists of the number of cycles run, the version of the software used for the operation (Firecrest, Bustard), the date the analysis initiated, and the login of the user. If the user initiates a second analysis on the same day, a new folder structure is created and the results from the previous analysis are not overwritten.

Images Folder

The Images folder contains a subfolder for each lane that has been sequenced. The folders are named using the following convention where the lane number is padded to three digits:

L<lane number>

For example, L001 contains the images taken in the first lane.
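The lane folder convention is simple enough to express as a pair of helper functions. The following is a minimal sketch, assuming only the L<lane number> pattern with three-digit zero padding stated above; the function names are hypothetical and are not part of the Pipeline software.

```python
import re


def lane_folder_name(lane: int) -> str:
    """Build the lane folder name, padding the lane number to three digits."""
    if lane < 0:
        raise ValueError("lane number must not be negative")
    return f"L{lane:03d}"


def parse_lane_folder(name: str) -> int:
    """Extract the lane number from a folder name such as L001."""
    match = re.fullmatch(r"L(\d{3})", name)
    if match is None:
        raise ValueError(f"not a lane folder name: {name!r}")
    return int(match.group(1))
```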

Each lane folder contains a subfolder for each cycle of sequencing. Each image-cycle subfolder contains four images for every tile, one for each of the four bases.

The image-cycle subfolders are named using the convention C<cycle number>.<iteration number>. Cycle number is indexed and represents the nth cycle.

Within each image-cycle subfolder are four tif files for each tile. These files are named using the following convention:

<sample>_<lane>_<tile>_<base>.tif

In the example, s_1_67_g.tif, the “s” is the default sample-ID.
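The naming conventions above can be sketched in Python as follows. This is only an illustration: the function names and the regular expression are assumptions made for this example and are not part of the Pipeline, and the pattern assumes that the base is written as a single lower-case letter, as in s_1_67_g.tif.

```python
import re

# Pattern for <sample>_<lane>_<tile>_<base>.tif.
# Assumption: the base is one lower-case letter (a, c, g, or t).
IMAGE_NAME = re.compile(
    r"^(?P<sample>[^_]+)_(?P<lane>\d+)_(?P<tile>\d+)_(?P<base>[acgt])\.tif$"
)


def parse_image_name(filename):
    """Return the fields of an image file name, or None if it does not match."""
    match = IMAGE_NAME.match(filename)
    if match is None:
        return None
    fields = match.groupdict()
    fields["lane"] = int(fields["lane"])
    fields["tile"] = int(fields["tile"])
    return fields


def lane_folder(lane):
    """Lane folder name: 'L' followed by the lane number padded to three digits."""
    return "L%03d" % lane


print(parse_image_name("s_1_67_g.tif"))
print(lane_folder(1))
```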

Data Folder

The Data folder contains a hierarchical structure that consists of the image analysis output folder, then the base calling output folder, and then the sequence alignment output folder.

A new subfolder is generated each time a set of images is processed by the image analysis module (Firecrest), IPAR, or SCS real time analysis. The data are kept in one file per tile for raw intensities and one file per tile for cluster noise. Firecrest and IPAR use the extension _int.txt for the intensity files and the extension _nse.txt for the noise files. SCS real time analysis reports image analysis results in the binary .cif format (intensities) and .cnf format (noise).

The Data folder contains one config.xml file in each image analysis folder generated as a result of analyzing sets of images. The config.xml file explicitly records which cycle-image folders were used to generate the raw intensities and noise files, and any parameters used. For a detailed description of the parameters file, see Configuration/Parameters on page 18.

Image Analysis Folders

The image analysis folders have the following naming structure:

• The image analysis folder generated by SCS real time analysis is called Intensities.

• The image analysis folder generated by IPAR is called IPAR_1.3.

• Each image analysis folder generated by Firecrest is named using the following convention:

C<first cycle>-<last-cycle>_<analysis module><analysis module-version>_<date>_<user>

For example, C1-27_Firecrest1.8.20_31-07-2006_myuser.2 contains the second version of an analysis of cycles 1–27 performed using version 1.8.20 of the Firecrest analysis module, run by the user “myuser” on the 31st of July 2006.
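As a rough illustration of how such a folder name breaks down, the following Python sketch parses the Firecrest example above. The function name, the dictionary keys, and the regular expression are assumptions made for this example. The sketch also assumes that the date is written as DD-MM-YYYY and that a trailing ".<number>" is the counter for repeated analyses, as the example suggests.

```python
import re
from datetime import datetime

# C<first cycle>-<last cycle>_<module><version>_<date>_<user>[.<counter>]
ANALYSIS_FOLDER = re.compile(
    r"^C(?P<first>\d+)-(?P<last>\d+)"
    r"_(?P<module>[A-Za-z]+)(?P<version>\d+(?:\.\d+)*)"
    r"_(?P<date>\d{2}-\d{2}-\d{4})"
    r"_(?P<user>[^_]+?)(?:\.(?P<rerun>\d+))?$"
)


def parse_analysis_folder(name):
    """Split a Firecrest folder name into its fields, or return None."""
    match = ANALYSIS_FOLDER.match(name)
    if match is None:
        return None
    rerun = match.group("rerun")
    return {
        "first_cycle": int(match.group("first")),
        "last_cycle": int(match.group("last")),
        "module": match.group("module"),
        "version": match.group("version"),
        # Assumption: day-month-year order, as in the example above.
        "date": datetime.strptime(match.group("date"), "%d-%m-%Y").date(),
        "user": match.group("user"),
        "rerun": int(rerun) if rerun is not None else None,
    }


print(parse_analysis_folder("C1-27_Firecrest1.8.20_31-07-2006_myuser.2"))
```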

NOTE

Sample-IDs must not contain any underscores. Underscores are used as separators between the different identifiers of the filename to allow easy splitting by any software reading these filenames.


Base Calling Folders

Each image analysis folder may hold multiple sequence folders with the output of different runs of a base caller package. The base calling folders have the following naming structure:

• The base calling folder generated by SCS real time analysis is called BaseCalls.

• Each base calling folder generated by Bustard is named using the following convention:

<analysis module><analysis module-version>_<date>_<user>[.<version-number>]

For example, the folder name Bustard1.8.8_08-11-2005_myuser.3 represents the third run of the Bustard base caller on 8th of November 2005 by the user “myuser.”

Each base calling folder also holds a config.xml that records any relevant information about the run of the base caller module.

Run Folder Naming

The top level Run Folder name is generated using three fields to identify the <ExperimentName>, separated by underscores. For example, YYMMDD_machinename_NNNN. You should not deviate from the Run Folder naming convention, as this may cause Pipeline to stop.

1. The first field is a six-digit number specifying the date of the run. The YYMMDD ordering ensures that a numerical sort of Run Folders places the names in chronological order.

2. The second field specifies the name of the sequencing machine. It may consist of any combination of upper or lower case letters, digits, or hyphens, but may not contain any other characters (especially not an underscore). It is assumed that the sequencing instrument is synonymous with the PC controlling it, and that the names assigned to the instruments are unique across the sequencing facility.

3. The third field is a four-digit counter specifying the experiment ID on that instrument. Each instrument should be capable of supplying a series of consecutively numbered experiment IDs (incremental unique index) from the onboard sample tracking database or a LIMS.

A Run Folder named 070108_instrument1_0147 indicates experiment number 147, run on instrument 1, on the 8th of Jan 2007. While the date and instrument name specify a unique Run Folder for any number of instruments, the addition of an experiment ID ensures both uniqueness and the ability to relate the contents of the Run Folder back to a laboratory notebook or LIMS.
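The three-field convention can be checked with a short Python sketch such as the one below. The function name and the regular expression are assumptions made for this illustration and are not taken from the Pipeline code. Any further underscore-separated fields after the first three are returned unchanged.

```python
import re
from datetime import datetime

# YYMMDD_machinename_NNNN, optionally followed by further fields.
RUN_FOLDER = re.compile(
    r"^(?P<date>\d{6})_(?P<machine>[A-Za-z0-9-]+)_(?P<counter>\d{4})"
    r"(?:_(?P<extra>.+))?$"
)


def parse_run_folder(name):
    """Split a Run Folder name into date, machine name, and experiment ID."""
    match = RUN_FOLDER.match(name)
    if match is None:
        raise ValueError("not a valid Run Folder name: %r" % name)
    extra = match.group("extra")
    return {
        "date": datetime.strptime(match.group("date"), "%y%m%d").date(),
        "machine": match.group("machine"),
        "experiment_id": int(match.group("counter")),
        "extra": extra.split("_") if extra else [],
    }


print(parse_run_folder("070108_instrument1_0147"))

# Because the date is written as YYMMDD, a plain sort is chronological.
print(sorted(["070813_ILMN-1_0217", "070108_instrument1_0147"]))
```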

NOTE

It is desirable to keep Experiment-IDs (or Sample-ID) and instrument names unique within any given enterprise. You should establish a convention under which each machine is able to allocate Run Folder names independently of other machines to avoid naming conflicts.


Additional information can be captured in the Run Folder name in fields separated by an underscore from the first three fields. For example, you may want to capture the flow cell number in the Run Folder name as follows:

YYMMDD_machinename_XXXX_FCYYY.

File Naming

The Pipeline uses the following format for file naming:

Some files are split on a read basis, leading to the file naming:

When a given file type is split on a read basis, the read always appears in the name, even for single-read analysis.

Example: s_5_1_0030_qseq.txt is a valid filename.

Exception: for image (.tif) files, the <tile> location can have less than four digits.

Configuration/Parameters

The Data Folder and subfolders, and the top level Image folder can all contain a configuration file (config.xml), and the top level Run Folder a related .params file. This is intended to contain any parameter data specific to the given level of information held in the folder.

NOTE

When publishing the data to a public database, it is desirable to extend the exclusivity globally, for instance by prefixing each machine with the identity of the sequencing center.

Table 2 File Naming Components

Component       Description
<sample>        Alphanumeric string (always “s”)
<lane>          Single-digit number identifying a flow cell lane
<read>          Single-digit number identifying the read (starts at 1)
<tile>          Four-digit number identifying a tile location in a flow cell lane
<cycle>         Two- or three-digit number identifying a sequencing cycle
<id>            Single-digit number to distinguish files; for example, the different reads of a paired-end read
<type>          Alphabetical string identifying the type of content stored in the file
<filesuffix>    Suffix to identify the traditional file type


Calibration and Input Parameters

For an optimal analysis run, the Pipeline needs a number of calibration and input parameters. By default, the Pipeline auto-generates these parameters for each analysis.

For samples with biased base compositions, as encountered in many tag-based (for example, Digital Gene Expression) or microRNA applications, auto-calibration does not provide perfect results. For such samples, you need to dedicate one lane of the flow cell to a control sample and use the --control-lane command option to generate analysis parameters. For a detailed description, see Command Line Options for GOAT on page 29.

Quality Scoring

Base quality value calibration now uses a pre-determined calibration table in Bustard, supplied with the software. Custom calibration (lane auto-calibration, calibration using a control lane, or specification of an external calibration table) is still supported but not generally recommended and, in particular, lane auto-calibration is no longer the default (see Filtering Parameters on page 85).

Since Pipeline release 1.3, the quality scoring scheme is the Phred scoring scheme, encoded as an ASCII character by adding 64 to the Phred value. A Phred score of a base is:

Q_phred = -10 log₁₀(e)

where e is the estimated probability of a base being wrong.
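The relationship between the error probability, the Phred value, and the ASCII encoding can be sketched in Python as follows. The function names are chosen for this example only. The offset of 64 is the one stated above.

```python
import math


def phred_from_error(e):
    """Phred value for an estimated error probability e (0 < e <= 1)."""
    return -10.0 * math.log10(e)


def error_from_phred(q):
    """Estimated error probability for a Phred value q."""
    return 10.0 ** (-q / 10.0)


def encode_quality(q):
    """ASCII character for a Phred value: the value plus 64."""
    return chr(int(round(q)) + 64)


def decode_quality(ch):
    """Phred value for an ASCII quality character."""
    return ord(ch) - 64


print(phred_from_error(0.01))   # error probability of 1%
print(encode_quality(30), decode_quality("h"))
```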

Image Offsets

There are small pixel offsets among the four differently colored images taken of each tile. These are due to slightly different optical paths for the four images collected from each tile. The Pipeline uses a file to correct for this, and also corrects for linear rescaling of the image.

Each analysis run creates a file called default_offsets.txt in the Data subfolder of the current Run Folder. The default_offsets.txt file is used for subsequent analysis of the same run. Another default_offsets.txt file is located in Instruments/<instrument>; its values are updated during the first run only.

The default_offsets.txt file contains four lines, corresponding to A, C, G, and T respectively, with six values each, using the A image as a reference. The following is an example of a typical default_offsets.txt file:

The first two columns in a row correspond to the translational offset of X and Y of the four images (in pixels). Since channel A is the reference (first line), the offsets for A are zero.

The slightly different optical paths for the four images collected from each tile result in slightly different scales of the images. This is corrected in the next two columns, which indicate scale factors applied to the image.

• A scale factor of 0 indicates that the image does not need to be rescaled.

• A scale factor of 0.001 for a 1000 x 1000 pixel image indicates that images taken in the corresponding frequency channel tend to be one pixel larger than the reference channel.

The last two values are set to zero.
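The layout described above can be read with a few lines of Python. This is a sketch under stated assumptions: the values are assumed to be separated by whitespace, the two scale columns are assumed to be in X then Y order, and the numbers in OFFSETS_TEXT are invented for illustration only and are not taken from a real run. The function names are likewise hypothetical.

```python
CHANNELS = ("A", "C", "G", "T")

# Invented values for illustration only; not from a real default_offsets.txt.
OFFSETS_TEXT = """\
0.00 0.00 0.0000 0.0000 0 0
0.12 -0.30 0.0010 0.0010 0 0
-0.25 0.40 0.0005 0.0005 0 0
0.33 0.10 0.0000 0.0000 0 0
"""


def parse_offsets(text):
    """Read four lines of six values each, one line per channel (A, C, G, T)."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if len(rows) != 4 or any(len(row) != 6 for row in rows):
        raise ValueError("expected four lines with six values each")
    table = {}
    for channel, row in zip(CHANNELS, rows):
        values = [float(v) for v in row]
        table[channel] = {
            "offset_x": values[0],
            "offset_y": values[1],
            "scale_x": values[2],   # assumed order: X scale, then Y scale
            "scale_y": values[3],
        }
    return table


def size_difference(scale_factor, image_size_px):
    """Size difference in pixels relative to the reference channel."""
    return scale_factor * image_size_px


offsets = parse_offsets(OFFSETS_TEXT)
print(offsets["A"])                    # reference channel: zero offsets
print(size_difference(0.001, 1000))    # about one pixel
```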

Frequency Cross-Talk Matrix

The Genome Analyzer uses two different lasers to excite the dye attached to each nucleotide. The emission spectra of these four dyes overlap, so the four images are not independent. As in Sanger sequencing, the frequency cross-talk has to be deconvolved using a frequency cross-talk matrix.

The frequency cross-talk is estimated during the base calling run and captured in a file called s_matrix.txt. The s_matrix.txt file is located in the base calling folder as shown in Figure 7.

Figure 7 Frequency Cross-Talk Matrix and Phasing File Locations

The following is an example of a typical s_matrix.txt file:

The lines starting with a greater than symbol (“>”) specify the order of the rows and columns in terms of the bases they represent.

The matrix elements show how the C, A, T, and G dyes/nucleotides (columns) cross-talk into the C, A, T, and G channels. A normal matrix should be diagonally dominant (diagonal elements tend to be the largest values) with the exception of the top-left and bottom-right corners (A/C and G/T cross-talk respectively). These are not as well-separated due to the fact that both corresponding dyes are excited by the same laser.
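To make the idea of deconvolution concrete, the following Python sketch treats the observed channel intensities as the cross-talk matrix applied to the true dye signals and recovers the dye signals by solving the linear system. It is a generic illustration, not the Bustard implementation. The matrix values are invented, and the function names are assumptions made for this example.

```python
def apply_crosstalk(matrix, dye_signal):
    """Observed channel intensities: matrix (rows = channels) times dye signal."""
    return [sum(m * s for m, s in zip(row, dye_signal)) for row in matrix]


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are left untouched.
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("cross-talk matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


# Invented matrix: strong cross-talk within each pair of dyes that share
# a laser, none between the pairs.
CROSSTALK = [
    [1.0, 0.3, 0.0, 0.0],
    [0.6, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.1],
    [0.0, 0.0, 0.7, 1.0],
]

true_signal = [80.0, 5.0, 0.0, 12.0]
observed = apply_crosstalk(CROSSTALK, true_signal)
print(observed)
print(solve(CROSSTALK, observed))   # recovers the true signal up to rounding
```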

Phasing/Prephasing Estimates

Depending on the efficiency of the fluidics and the sequencing reactions, a small number of molecules in each cluster may run ahead (prephasing) or fall behind (phasing) the current incorporation cycle. This effect can be mitigated by applying corrections during the base calling step.

The phasing estimates are produced before a run of the base caller module and captured in a file called phasing.xml. The phasing.xml file is located in the Phasing folder as shown in Figure 7.

As the estimation uses statistical averaging over many clusters and sequences to estimate the correlation of signal between different cycles, the phasing estimates tend to be more accurate for tiles with larger numbers of clusters and a mixture of different sequences. Samples containing only a small number of different sequences do not produce reliable estimates.

Sample Information

Depending on the application, a reference genome may be supplied for the read sequences to be aligned against.


Alignment Algorithms

The Pipeline provides two alignment algorithms: PhageAlign and ELAND.

• PhageAlign performs an exhaustive alignment and always finds the best match but is very slow.

• Efficient Large-Scale Alignment of Nucleotide Databases (ELAND) is very fast and should be used to match a large number of reads against the human genome with no more than two mismatches in the first 32 bases.

Mismatches include technical errors, such as PCR or polymerase errors, as well as bona fide SNPs relative to the reference sequence.

ELAND is much faster than PhageAlign but will only detect matches that have at most two differences between the matching region and the first 32 bases of the read. This means that ELAND is less sensitive than PhageAlign, which will always find a best match for your reads, although possibly not a unique one. Consider the following points when using ELAND:

• If your data is noisy, not all of it is going to align, and you may not get good results.

• Error rates based on ELAND output underestimate the true error rate. Since reads with more than two errors in the first 32 bases do not get aligned, they do not contribute to the calculation.

The eland_extended, eland_pair, and eland_rna modes, which are built around the ELAND core, all permit reads of more than 32 bases to align with more than 2 errors, provided there are no more than 2 errors in the first 32 bases.

ELAND Algorithm Description

This section provides a detailed description of the ELAND algorithm.

Alignment Score Calculation

ELAND gives a set of candidate alignments based on the first 32 bases of each read. These candidates are extended to give an ungapped alignment of each candidate. A 'match descriptor' string in the output encodes which bases in the read matched the genome and which were mismatches.

The base quality values and the positions of the mismatches in a candidate alignment are used to give a probability score (p-value) to each candidate.

This is the probability that the candidate position in the genome aligned to would, if its bases were sequenced at error rates that correspond to the read's quality values, give rise to the observed read. This way the contribution of each base is weighted according to its quality.

NOTE

A consequence of this is that the best alignment does not necessarily have the least number of mismatches, although an exact match will always beat any alignment containing mismatches.
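The effect described in the note can be reproduced with a simple quality-weighted model. The sketch below is an assumption made for illustration and is not the formula used by ELAND, which this guide does not give: each matching base contributes (1 - e) and each mismatching base contributes e/3, where e is the error probability implied by the base quality value. The function name is hypothetical.

```python
def candidate_probability(qualities, mismatch_positions):
    """Probability of observing the read from a candidate position.

    Assumed model (not taken from ELAND): a match contributes (1 - e),
    a mismatch contributes e / 3, with e = 10 ** (-Q / 10).
    """
    mismatches = set(mismatch_positions)
    p = 1.0
    for i, q in enumerate(qualities):
        e = 10.0 ** (-q / 10.0)
        p *= e / 3.0 if i in mismatches else 1.0 - e
    return p


# A 32-base read: 30 high-quality bases followed by 2 low-quality bases.
qualities = [30] * 30 + [5, 5]

exact = candidate_probability(qualities, [])
one_high_quality_mismatch = candidate_probability(qualities, [0])
two_low_quality_mismatches = candidate_probability(qualities, [30, 31])

# The exact match wins, but two mismatches at low-quality bases score
# higher than a single mismatch at a high-quality base.
print(exact, two_low_quality_mismatches, one_high_quality_mismatch)
```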


The alignment score of a read is computed from the p-values of the candidate alignments. The candidate with the highest p-value is the best candidate, and its alignment score is its p-value as a fraction of the sum of the p-values of all the candidates. This is also known as a Bayes' Theorem inversion. The alignment score is expressed on the Phred scale; that is, Q20 corresponds to a 1% chance of the alignment being wrong, Q30 to 0.1%, and so on.
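The calculation described above can be written as a short Python sketch. The function name is chosen for this example, and the sketch covers only the normalization and the conversion to the Phred scale as described in the text, without any further correction.

```python
import math


def alignment_score(p_values):
    """Phred-scaled alignment score of the best candidate.

    The best candidate's p-value is taken as a fraction of the sum of
    all candidate p-values; the score is the Phred value of the
    remaining probability that the alignment is wrong.
    """
    best = max(p_values)
    total = sum(p_values)
    p_wrong = (total - best) / total
    if p_wrong <= 0.0:
        # A single candidate leaves no probability for any alternative.
        return float("inf")
    return -10.0 * math.log10(p_wrong)


print(alignment_score([0.99, 0.01]))    # close to Q20
print(alignment_score([0.999, 0.001]))  # close to Q30
print(alignment_score([0.5]))           # single candidate: infinite
```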

Rest-of-Genome Correction

If only one candidate alignment is found, the scoring scheme above would give an infinite Phred score. MAQ deals with this by giving such cases an arbitrarily high score of 255. ELAND instead applies a correction, known as the 'rest-of-genome correction', that depends on the average base quality of the read, the read length, and the size of the genome. This gives a scoring scheme with the following properties:

• Single-candidate alignments for longer reads will score more highly than single-candidate alignments of shorter reads.

• Single-candidate alignments for better quality reads will score more highly than single-candidate alignments of lower quality reads.

• Single-candidate alignments to shorter genomes will score more highly than single-candidate alignments to longer genomes.

Rest-of-Genome Correction and Read Length

Prior to Pipeline v1.4 a constant read length was used in the rest-of-genome calculation. Since the p-value of candidate alignments tends to decrease as read length increases, even for high quality reads, this means that more alignments tended to score low with respect to the rest-of-genome correction. The upshot was that more reads tended to fail the alignment score threshold as the read length increased, leading to a disturbing decrease in the reported percentage of reads aligned when the same reads were reanalysed at longer read lengths. This was fixed by computing a different rest-of-genome correction for each read length.

Unreported Unique Alignments

A line in an export file will only contain alignment information if the alignment score for that read exceeds a threshold. The primary purpose of this threshold is to retain only alignments that are markedly better than any other possible alignment for the read.

Pipeline reduces alignment quality to a single confidence score. Read quality, the number of mismatches in the best alignment, and the presence of other candidate alignments all contribute to the calculation of that score.

Therefore, changes in any of these three variables will affect whether the alignment passes the alignment quality threshold. So even if only a single candidate alignment has been found for a read, it may still fail the alignment quality threshold, and not be reported in Export.txt and Sorted.txt, for one of two reasons:

• Low base quality values.

• Excessive number of mismatches in the candidate alignment. There will be at most 2 mismatches in the seed (the first 32 bases of the read, unless ELAND_SEED_LENGTH has been modified) but potentially there can be any number of mismatches in the remainder of the read.

NOTE

The alignment score of a read and the p-values of the candidate alignments for the read are not the same. The former is computed from the latter.

For most applications, this is the desired behavior in both cases. For example, you would not want to use a read with 10 mismatches for SNP calling, even if it is the only candidate found. The same applies for a read of poor base quality.

Indels Causing Mismatches

A key reason for an excessive number of mismatches is that the aligned read has an indel with respect to the reference. Since the released version of ELAND does not support gapped alignments, this can get misinterpreted as a string of mismatches.

Spurious mismatch errors due to non-detected indels will falsely inflate the observed error rate in later cycles. However, the effect of this phenomenon is not large, at least for the rate of indel polymorphism we expect to see on human genomic data. To give an idea of its extent, performing gapped alignment to paired 125 base human genomic data resulted in an error rate estimate that was 0.12% lower than that obtained from the ungapped alignments using the current version of ELAND.


Chapter 3 Using GOAT Starting with Image Analysis

Topics

26 Introduction
26 Invoking GOAT for Image Analysis
27 Running a GOAT Image Analysis
27 Standard GOAT Analysis
27 Paired Reads
28 Parallelization Switch
28 Nohup Command
29 Command Line Options for GOAT
29 General Options
30 GOAT Options
31 Paired Reads
32 Makefile Targets


Introduction

This section describes the typical analysis run and command line options for GOAT (General Oligo Analysis Tool). Use GOAT when you want to perform Pipeline analysis starting with the raw image files.

The image data should be organized within a standard Run Folder directory structure as described in Run Folder Structure on page 15. To successfully initiate image analysis, you need four images for each tile, for each cycle, and a parameters (.params) file in the Run Folder.

Invoking GOAT for Image Analysis

Although several different software programs are involved in an analysis run, a single command generates the analysis folders, then a second command (“make recursive”) can be used to start a complete analysis.

Below is the standard invocation of Pipeline when doing image analysis.

Arguments contained in brackets [ ] are optional.

/path-to-pipeline/bin/goat_pipeline.py
    <run-folder-directory> [<run-folder-directory2>]
    [--offsets=/path/default_offsets.txt|auto]
    [--cycles=1-25|auto] [--tiles=s_1,s_2_0003,...]
    [--control-lane=5] [--flow-cell=v4|1.4mm]
    [--matrix=mymatrix.txt|auto|auto<n>]
    [--phasing=0.01|auto|auto<n>] [--prephasing=0.01]
    [--with-sig2] [--with-seq] [--with-prb]
    [--with-qhg] [--with-qval]
    [--directory=/path/C1-14_Firecrest1.4_01-08-2006_user]
    [--GERALD=/path/config.txt] [--make]

Some of the arguments above have sample values displayed.

The only compulsory argument is the path to the Run Folder that is to be analyzed. The path can also point to any folder containing TIFF images that are to be analyzed. Alternatively, you can provide a space-separated list of TIFF filenames.

If you are analyzing data generated using SCS v2.4 or earlier, you need to specify the option --flow-cell, so Pipeline knows the type of flow cell and associated chemistry that has been used.

NOTE

Standard use of the Pipeline performs sequence alignment on base calling data. This is described in Chapter 5, Using GERALD for Sequence Alignment.

If you want to perform base calling and sequence alignment, but no image analysis, refer to Chapter 4, Using Bustard Starting with Base Calling.


Running a GOAT Image Analysis

Standard GOAT Analysis

A standard analysis for image analysis, base calling and alignment consists of calling the goat_pipeline.py script to generate an analysis directory including a Makefile to be processed by the “make” command, and then executing the “make” command.

Start a standard analysis run using the following command format:

/path-to-pipeline/bin/goat_pipeline.py [--GERALD=/path/config.txt] [--make] <run-folder>

In this example, we will perform analysis on a Run Folder named "070813_ILMN-1_0217_1234" with a previously prepared GERALD config file stored in the run folder named "config.txt":

1. Type the following command to run a check on the Run Folder, report all detected folders and parameters files, and fill in any missing configuration options.

/path-to-pipeline/bin/goat_pipeline.py --GERALD=/data/070813_ILMN-1_0217_1234/config.txt /data/070813_ILMN-1_0217_1234

Illumina recommends running this script before generating the makefile to check for data integrity and consistency. It scans all the images folders and prints diagnostic output about the images and parameters files. No files or directories are modified on the data drive as a result of this command.

2. Add --make to the command listed above to create an analysis directory in the Run Folder. If you specify the --GERALD option, you will create the GERALD analysis folder and the corresponding makefile.

/path-to-pipeline/bin/goat_pipeline.py --GERALD=/data/070813_ILMN-1_0217_1234/config.txt --make /data/070813_ILMN-1_0217_1234

3. Change to the newly generated directory (for example, /data/070813_ILMN-1_0217_1234/Data/C1-26_Firecrest) and type the “make recursive” command. This command starts the actual analysis:

make recursive

For more information on “make recursive,” see Makefile Targets on page 32.

The primary outputs are the sequences read with per-base quality values (_qseq files) and, if alignment was performed, the alignments (see Pipeline Analysis Output on page 67). These files can be found in the GERALD folder.

The output files containing data statistics and histograms, used for quality control, can also be found in the GERALD folder.

A new output directory is created each time you rerun the analysis, so there is no need to remove any previous analysis files.
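If the two goat_pipeline.py invocations are driven from a script, the argument list can be assembled as in the following Python sketch. The helper name goat_command is an assumption made for this example, and /path-to-pipeline is the placeholder used in this guide, not a real location. The sketch only builds and prints the command lines; the resulting list could be passed to subprocess.run to execute them.

```python
import shlex


def goat_command(run_folder, gerald_config=None, make=False,
                 script="/path-to-pipeline/bin/goat_pipeline.py"):
    """Build the argument list for a goat_pipeline.py invocation."""
    args = [script]
    if gerald_config:
        args.append("--GERALD=" + gerald_config)
    if make:
        # Without --make, nothing is written to the Run Folder.
        args.append("--make")
    args.append(run_folder)
    return args


run_folder = "/data/070813_ILMN-1_0217_1234"
config = run_folder + "/config.txt"

# Step 1: check only. Step 2: create the analysis directory and makefile.
for make in (False, True):
    command = goat_command(run_folder, gerald_config=config, make=make)
    print(" ".join(shlex.quote(arg) for arg in command))
```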

Paired Reads

The standard method to analyze paired-read data assumes that you have a single Run Folder containing the images or image analysis files for both reads, with a continuously incremented cycle count.

• For Genome Analyzer software SCS 1.0 and Pipeline version 0.3 and later, the Pipeline automatically knows where the second read starts.

• For older versions of Genome Analyzer instrument and analysis software, use the option --new-read-cycle to identify the start of the second read.

For a description of the --new-read-cycle option, see Command Line Options for GOAT on page 29.

An alternative analysis method assumes that both reads of a pair are stored in two separate Run Folders. Specify both folders as arguments to goat_pipeline.py (see Invoking GOAT for Image Analysis on page 26). This generates output only in the first Run Folder and the second folder is not touched. This means that the analysis will have to be performed starting from images. The two Run Folders will not work with IPAR or RTA data.

Parallelization Switch

If your system supports automatic load-sharing to multiple CPUs, you can parallelize the analysis run to <n> different processes by using the “make” utility parallelization switch.

make recursive -j n

For more information on parallelization, see Using Parallelization in Pipeline on page 137.

Nohup Command

You should use the Unix nohup command to redirect the standard output and keep the “make” process running even if your terminal is interrupted or if you log out. The standard output will be saved in a nohup.out file and stored in the location where you are executing the makefile.

nohup make recursive -j n &

The optional “&” tells the system to run the analysis in the background, leaving you free to enter more commands.


Command Line Options for GOAT

You can invoke the goat_pipeline.py script with a number of optional command line arguments.

General Options

Any of the following general options can be included in any order on a single command line.

--make

The --make command creates the analysis directory and a makefile in the relevant analysis directory. You can start the analysis by changing to the directory and typing “make.” If this option is omitted, the Pipeline will not write any information to your Run Folder.

--new-read-cycle=<cycle>

Use this command to denote a new read in a paired-end run. The calculation of the matrix correction and the application of the phasing correction will be reset at the specified cycle.

--GERALD=<config.txt>

Use this command to start the GERALD makefile generator after the Bustard folder is created and to pass the relevant analysis information to GERALD.

You can specify multiple GERALD files by repeating the option with different configuration file names. For each GERALD configuration file specified, a separate GERALD subfolder is generated (under the same Bustard folder) with that configuration. For more information on the GERALD configuration file, see GERALD Configuration File on page 47.

--tiles=<tile>|<lane>[,<tile>|<lane>,...]

Use this command to select certain tiles for analysis. For example, specifying --tiles=s_1,s_2_01,s_3_0001,s_5_0002 selects all tiles in lane 1, all tiles starting with “01” in lane 2, position 1 in lane 3, and position 2 in lane 5.

You can also specify certain tiles for analysis from every lane. For example, specifying --tiles=_0010,_0020 selects only tiles 10 and 20 from every lane.
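One way to read the selection rules above is as a lane match combined with a prefix match on the tile number. The following Python sketch implements that reading. It is an interpretation of the examples in this section, not the Pipeline's own matching code, and the function name is hypothetical. Tile names are assumed to have the form s_<lane>_<four-digit tile>.

```python
def tile_selected(tile_name, selectors):
    """Return True if a tile such as 's_3_0001' matches any selector."""
    _sample, lane, tile = tile_name.split("_")
    for selector in selectors:
        if selector.startswith("_"):
            # Lane-independent selector such as '_0010'.
            lane_wanted, tile_prefix = None, selector[1:]
        else:
            # Lane selector ('s_1') or lane plus tile prefix ('s_2_01').
            fields = selector.split("_")
            lane_wanted = fields[1]
            tile_prefix = fields[2] if len(fields) > 2 else ""
        if lane_wanted in (None, lane) and tile.startswith(tile_prefix):
            return True
    return False


selectors = "s_1,s_2_01,s_3_0001,s_5_0002".split(",")
for name in ("s_1_0042", "s_2_0150", "s_2_0250", "s_3_0001", "s_4_0001"):
    print(name, tile_selected(name, selectors))
```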

--cycles=<cycle>[-<cycle>[,<cycle>[-<cycle>...]]]

Use this command to select certain cycles for analysis. For example, use --cycles=3-31 to include only cycles 3 through 31 in the analysis.

NOTE

If you skip cycles in the middle of a read, you cannot use ELAND to align the data.


Using the value “auto” tells the Pipeline to automatically select the first cycle to the last available cycle in all tiles and to make sure that all tiles have equal read lengths, regardless of the state of data acquisition/mirroring. --cycles cannot be used when starting from IPAR analysis files; the use of --cycles with the goat_pipeline.py will only work if the data is analysed from images.
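The syntax of the --cycles value can be illustrated with a small Python parser. This is a sketch of the option grammar shown above, not code from the Pipeline. The function name is an assumption, and the value “auto” is simply returned as None here because the Pipeline resolves it itself.

```python
def parse_cycles(spec):
    """Expand a --cycles value such as '1-25' or '1-3,7,10-11' into a list.

    Returns None for 'auto', which the Pipeline resolves on its own.
    """
    if spec == "auto":
        return None
    cycles = []
    for part in spec.split(","):
        first, _, last = part.partition("-")
        start = int(first)
        end = int(last) if last else start
        if end < start:
            raise ValueError("invalid cycle range: %r" % part)
        cycles.extend(range(start, end + 1))
    return cycles


print(parse_cycles("3-31"))
print(parse_cycles("1-3,7,10-11"))
```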

--compression=<method>

Use “--compression” to reduce the size of the Firecrest output. Allowed values are “none” and “gzip” (the default).

In Pipeline version 0.3 and later, the intensity files are compressed by default.

For previous versions, you must specify “--compression=none” on the command line.

GOAT Options

Use the following options with the goat_pipeline.py script.

--nobasecall

Use --nobasecall to skip the base calling step in the analysis.

--offsets=<filename> | auto | default

Use --offsets=<filename> to specify a certain default offset file. If no offset file is specified, the Pipeline will create one in the Instruments folder.

--control-lane=<n>

Use this command to select a lane <n> that is to be used to estimate phasing and matrix correction for all other lanes. This option is synonymous with --phasing=auto<n> --matrix=auto<n>. Control lanes are necessary for samples with skewed base compositions.

--flow-cell=v4|1.4mm

Use the --flow-cell command to specify the type of flow cell and associated chemistry that was used for the sequencing run. Enter “v4” for a flow cell v4 and “1.4mm” for a 1.4 mm flow cell.

The sequencing chemistries for the 1.4 mm flow cell and flow cell v4 are different, and Pipeline has to adjust base call calibration based on the chemistry. If you used SCS v2.5 or later for sequencing, Pipeline can determine the flow cell type and chemistry used from the AnalysisInfo.xml file, as long as this file is in the standard location (<run-folder>/Config/).

If you used an earlier version of SCS, or if the AnalysisInfo.xml file is not in the standard location, you need to specify the option --flow-cell. This only needs to be done for the first Pipeline analysis module used; for subsequent modules, this information will be stored in config.xml.

--matrix=<filename> | auto | auto<n> | lane

Use the --matrix command to specify the frequency cross-talk matrix file, where filename refers to the path of the matrix file.

If no matrix is specified, or if you set the value to the default behavior “auto,” the Pipeline auto-generates the matrix. A value of auto<n>, where <n> is a lane number between 1 and 8, is analogous to the --phasing=auto<n> option and allows the matrix estimation to be derived from only one lane.

The value lane calculates a separate correction for each lane from data in that lane alone.

--phasing=<x> | auto | auto<n>

Use the --phasing command to apply a particular phasing correction. If you set the value to the default behavior “auto,” the Pipeline auto-generates the phasing and prephasing values.

A value of auto<n>, where <n> is a lane number between 1 and 8, uses the automated phasing estimates from the corresponding lane. This is useful for samples with an uneven base composition (such as in gene expression), for which the current phasing estimator does not work reliably and phasing needs to be estimated from a single control lane.

You can specify a phasing value directly. For example, --phasing=0.01 indicates a phasing correction with a rate of 1% per cycle (1% of molecules in a cluster fall behind the other molecules). In this case, the option is normally combined with the --prephasing option.

--prephasing=<x>

Use the --prephasing command to apply a particular prephasing correction.

For example, using --prephasing=0.01 sets a correction for prephasing with a prephasing rate of 1% per cycle.

The command --prephasing=auto is not recognized. Use --phasing=auto instead. By default, the Pipeline auto-generates phasing and prephasing estimates.
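To get a feel for what a per-cycle rate of 1% means over a whole read, the following Python sketch uses a deliberately simplified model: in every cycle a fraction of molecules falls behind (phasing) and a fraction runs ahead (prephasing), and molecules that have left the main phase are not counted again even if they drift back. This model and the function name are assumptions made for illustration; it is not the correction algorithm applied by the base caller.

```python
def in_phase_fraction(phasing, prephasing, cycles):
    """Fraction of molecules still in phase after a number of cycles.

    Simplified model: each cycle, a fraction `phasing` falls behind and
    a fraction `prephasing` runs ahead; returns to phase are ignored.
    """
    return (1.0 - phasing - prephasing) ** cycles


# With --phasing=0.01 --prephasing=0.01, the in-phase signal decays
# steadily with read length.
for n in (1, 18, 36):
    print(n, round(in_phase_fraction(0.01, 0.01, n), 3))
```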

--with-sig2, --with-seq, --with-prb, --with-qhg, --with-qval

Use these commands to generate the sig2, seq, prb, qhg, and qval files respectively. These files are not generated by default with the introduction of the qseq files.

Paired Reads

The following additional variations on the goat_pipeline.py and bustard.py options are supported for paired reads.

--phasing=<read>:<value>, --phasing=<read>:<read>

Use either of these option formats to specify phasing for one specific read of a pair.

The following example uses the default phasing option for read 1 but uses base phasing estimates from lane 5 for read 2:

--phasing=1:auto --phasing=2:auto5

The following example uses the phasing estimate for the second read and applies it to both read 1 and read 2:

--phasing=1:2
