Notice. DNA Sequencing Module User Guide

GenomeStudio™ DNA Sequencing Module v1.0 User Guide

An Integrated Platform for Data Visualization and Analysis

FOR RESEARCH ONLY

Notice

This publication and its contents are proprietary to Illumina, Inc., and are intended solely for the contractual use of its customers and for no other purpose than to operate the system described herein. This publication and its contents shall not be used or distributed for any other purpose and/or otherwise communicated, disclosed, or reproduced in any way whatsoever without the prior written consent of Illumina, Inc.

For the proper operation of this system and/or all parts thereof, the instructions in this guide must be strictly and explicitly followed by experienced personnel. All of the contents of this guide must be fully read and understood prior to operating the system or any of the parts thereof.

FAILURE TO COMPLETELY READ AND FULLY UNDERSTAND AND FOLLOW ALL OF THE CONTENTS OF THIS GUIDE PRIOR TO OPERATING THIS SYSTEM, OR PARTS THEREOF, MAY RESULT IN DAMAGE TO THE EQUIPMENT, OR PARTS THEREOF, AND INJURY TO ANY PERSONS OPERATING THE SAME.

Illumina, Inc. does not assume any liability arising out of the application or use of any products, component parts, or software described herein.

Illumina, Inc. further does not convey any license under its patent, trademark, copyright, or common-law rights nor the similar rights of others. Illumina, Inc. further reserves the right to make any changes in any processes, products, or parts thereof, described herein without notice. While every effort has been made to make this guide as complete and accurate as possible as of the publication date, no warranty of fitness is implied, nor does Illumina accept any liability for damages resulting from the information contained in this guide.

© 2008 Illumina, Inc. All rights reserved. Illumina, Solexa, Making Sense Out of Life, Oligator, Sentrix, GoldenGate, DASL, BeadArray, Array of Arrays, Infinium, BeadXpress, VeraCode, IntelliHyb, iSelect, CSPro, iScan, and GenomeStudio are registered trademarks or trademarks of Illumina. All other brands and names contained herein are

Revision History

Part Number Revision Date

11319092 Rev. A November 2008

List of Figures

Figure 1 GenomeStudio Application Unzipping
Figure 2 Illumina GenomeStudio Installation, Modules Tab
Figure 3 Accept License Agreement
Figure 4 Installation Progress dialog box
Figure 5 Registration of the GenomeStudio Software
Figure 6 DNA Sequencing Data Analysis Workflows
Figure 7 CASAVA Project Directory
Figure 8 Parsed Folder
Figure 9 Chromosome Folder
Figure 10 Starting New DNA Sequencing Project
Figure 11 GenomeStudio Project Wizard - Welcome
Figure 12 GenomeStudio Project Wizard—Location
Figure 13 GenomeStudio Project Wizard—CASAVA Build Selection
Figure 14 GenomeStudio Project Wizard—Genome and Chromosomes
Figure 15 GenomeStudio Recent Projects Pane
Figure 16 Open Existing Project
Figure 17 GenomeStudio Main Window and Illumina Genome Viewer
Figure 18 Data Plots Shown in Chromosome Slide Show Mode
Figure 19 Data Plot Displayed Above the Chromosome
Figure 20 Moving the View Region in ICB
Figure 21 Zooming in to Specific Regions in ICB
Figure 22 Low Resolution Stacked Alignment Plot
Figure 23 Blue and Yellow Sequence Reads in Stacked Alignment View
Figure 24 Stacked alignment view, individual sequences
Figure 25 Find Genes Window
Figure 26 IGV Data Workspace, Stacked Alignment Plots
Figure 27 IGV Data Workspace, Data Plots
Figure 28 Favorite Data Plots Form
Figure 29 GenomeStudio Main Window Data Tables
Figure 30 DNA Sequencing Sequences Table
Figure 31 DNA Sequencing Samples Table
Figure 32 DNA Sequencing Lanes Table
Figure 33 DNA Sequencing Alleles Table
Figure 34 Jump to IGV
Figure 35 GenomeStudio Sequencing Reports Dialog Box
Figure 36 Final Report Dialog Box
Figure 37 ADT Report Dialog Box

Chapter 1

Overview

Topics

Introduction
Audience and Purpose
Input Files
Installing the DNA Sequencing Module
DNA Sequencing Module Workflow

Introduction

This manual describes the Illumina GenomeStudio DNA Sequencing Module. The DNA Sequencing Module facilitates data analysis from Illumina whole-genome resequencing experiments. Resequencing on the Illumina Genome Analyzer enables cost-effective reassembly of genomes and discovery of Single Nucleotide Polymorphisms (SNPs).

The DNA Sequencing Module enables two types of data analysis:

• Visualizing consensus reads in the reassembled genome.

• Discovering and confirming SNPs.

You can zoom in on regions with interesting SNPs or patterns, and after you have identified interesting SNPs, you can export the data for secondary analysis.

Audience and Purpose

This guide is written for researchers who would like to use the DNA Sequencing Module to analyze data generated by resequencing on the Illumina Genome Analyzer.

This guide includes procedures and user interface information specific to the DNA Sequencing Module. For information about the GenomeStudio Framework, the common user interface and functionality available in all GenomeStudio Modules, see the GenomeStudio Framework User Guide.

Input Files

The DNA Sequencing Module uses output files from the CASAVA module of the Genome Analyzer Pipeline analysis software. See Chapter 2, Input Files for a description of these files and how to access them.

Installing the DNA Sequencing Module

To install the DNA Sequencing Module on your computer:

1. Put the GenomeStudio CD into your CD drive.

2. Do one of the following:

• If the Illumina GenomeStudio Installation screen (Figure 2) appears, continue to step 3.

• If the CD does not load automatically, double-click the GenomeStudio<version>.exe icon in the GenomeStudio directory of the CD you received.

The GenomeStudio application suite unzips (Figure 1).

Figure 1 GenomeStudio Application Unzipping

The Illumina GenomeStudio Installation dialog box appears (Figure 2).

Figure 2 Illumina GenomeStudio Installation, Modules Tab

3. Read the software license agreement in the right-hand side of the Illumina GenomeStudio Installation window.

4. In the GenomeStudio Product area, select DNA Sequencing Module.

5. In the Serial Number area, enter your serial number for the DNA Sequencing Module.

6. [Optional] Enter the serial numbers for any additional GenomeStudio modules for which you have licenses.

NOTE

Select additional GenomeStudio modules if you have licenses for them and want to install them now.

NOTE

Serial numbers are in the format ####-####-####-#### and can be found on an insert included with your GenomeStudio CD.

7. Click Install.

The Accept License Agreement dialog box appears (Figure 3).

Figure 3 Accept License Agreement

8. Click Yes to accept the software license agreement.

The DNA Sequencing Module is installed on your computer, along with any additional GenomeStudio modules or plug-in algorithms you selected.

The Installation Progress dialog box notifies you that installation is complete (Figure 4).

Figure 4 Installation Progress dialog box

9. Click OK.

10. In the Illumina GenomeStudio Installation dialog box, click Exit.

You can now start a new project using the DNA Sequencing Module.

If you have not previously registered the GenomeStudio Software, you are prompted to do so (Figure 5). Fill in your contact details, and click Register. You may choose not to register; in that case, you will be prompted every time you start the GenomeStudio Software.

Figure 5 Registration of the GenomeStudio Software

See Chapter 3, Starting a Project, for information about starting a new DNA Sequencing project.

DNA Sequencing Module Workflow

The DNA Sequencing Module provides the flexibility to perform different kinds of analyses. The basic workflow allows you to view SNPs and the sequence reads in the genomic context. You can view the relevant data in a table format and export the SNP calls.

Figure 6 DNA Sequencing Data Analysis Workflows

Chapter 2

Input Files

Topics

Introduction
Contents of the CASAVA Project Directory
Parsed_xx_xx_xx folder
Stats Folder
Conf Folder
Html Folder
Accessing CASAVA Build Files
Storage Locations
Accessing Linux Volume
Moving CASAVA Build Files

Introduction

The DNA Sequencing Module uses output files generated in the CASAVA module of the Genome Analyzer Pipeline analysis software. This chapter describes the folder structure and some of the essential files needed for DNA Sequencing analysis, and provides instructions on how to access them.

Contents of the CASAVA Project Directory

The required input files from the CASAVA build are stored in a CASAVA project directory (Figure 7).

Figure 7 CASAVA Project Directory

The CASAVA project directory should at a minimum contain four folders:

• conf folder

• html folder

• Parsed_xx_xx_xx folder

• stats folder

NOTE

For a brief overview of how these files are generated in Pipeline and CASAVA, see Appendix A Generating CASAVA Output Files.

NOTE

There may be more folders, such as the export folder, but these are not used by the DNA Sequencing Module and provide no additional run information.
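As an illustration only, the minimum folder layout described above could be checked with a short script before starting a project. The following Python sketch is not part of GenomeStudio or CASAVA. The function name missing_casava_folders is hypothetical, and the sketch assumes the folder names are spelled exactly as listed (conf, html, stats, and one folder whose name begins with Parsed_).

```python
from pathlib import Path

# Folders with fixed names; the Parsed_xx_xx_xx folder is checked separately.
REQUIRED_FOLDERS = ("conf", "html", "stats")


def missing_casava_folders(project_dir):
    """Return the names of required folders that are absent from project_dir."""
    root = Path(project_dir)
    missing = [name for name in REQUIRED_FOLDERS if not (root / name).is_dir()]
    # The Parsed folder name carries the run start date, so match it by prefix.
    if not any(p.is_dir() for p in root.glob("Parsed_*")):
        missing.append("Parsed_xx_xx_xx")
    return missing
```

An empty list means that all four folders were found. Extra folders, such as the export folder, are ignored.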

The section below describes the folder structure of this directory, and mentions some of the essential files for running the DNA Sequencing Module.

Parsed_xx_xx_xx folder

At the top level of the CASAVA project directory, an important folder for the DNA Sequencing Module is the Parsed_xx_xx_xx folder (Figure 8), where xx_xx_xx is the date the run was started.

Figure 8 Parsed Folder

The Parsed_xx_xx_xx folder contains chromosome folders (c1, c2,…). Each chromosome folder at a minimum contains the following file types:

• A c*.snp.txt file, which contains the SNP calls from CASAVA.

• A number of sub-folders representing bins of data for each chromosome (Figure 9). The number of bins is variable per chromosome depending upon its size. Each bin contains one sorted.txt file of reads data (as well as other files that are not used by the DNA Sequencing Module).

Figure 9 Chromosome Folder
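To make the chromosome folder layout concrete, the sketch below walks a Parsed_xx_xx_xx folder and reports, for each chromosome folder, whether a c*.snp.txt file is present and how many bin sub-folders contain a sorted.txt file. It is a hedged illustration in Python, not a GenomeStudio or CASAVA tool. The function name summarize_parsed_folder is hypothetical, and the sketch assumes every sub-folder of the Parsed folder is a chromosome folder.

```python
from pathlib import Path


def summarize_parsed_folder(parsed_dir):
    """Map each chromosome folder name to (has_snp_file, bins_with_reads)."""
    summary = {}
    for chrom_dir in sorted(Path(parsed_dir).iterdir()):
        if not chrom_dir.is_dir():
            continue
        # SNP calls from CASAVA are stored in a c*.snp.txt file.
        has_snp_file = any(chrom_dir.glob("c*.snp.txt"))
        # Each bin sub-folder is expected to hold one sorted.txt file of reads.
        bins_with_reads = sum(
            1
            for bin_dir in chrom_dir.iterdir()
            if bin_dir.is_dir() and (bin_dir / "sorted.txt").is_file()
        )
        summary[chrom_dir.name] = (has_snp_file, bins_with_reads)
    return summary
```

A chromosome folder that reports no SNP file or zero bins is a candidate for closer inspection before you load the build.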

Stats Folder

The stats folder contains statistical information, such as the runs_summary.xml file, which shows which lanes from which run were aggregated and called for a CASAVA build.

Conf Folder

The conf folder contains information about the configuration of the project, such as the project.conf file.

Html Folder

The html folder contains HTML and image files with additional information about the project.

Accessing CASAVA Build Files

Storage Locations

Both the CASAVA build files and the DNA Sequencing project files can consume a large amount of disk space, especially with large projects. For best performance, the DNA Sequencing Module needs rapid access to these files. Consider the following when determining storage location:

• DNA Sequencing project files should be stored locally (for a description of these files, see Chapter 3, Starting a Project).

• CASAVA build files should not be stored on the same physical drive as the DNA Sequencing project files (this includes separate partitions on the same physical volume). Sharing a drive slows down performance when both types of files need to be accessed at the same time.

Depending on the configuration of your workstation and network, you may want to store the CASAVA build files either locally or on the network. Consider the following:

• If you have separate physical drives locally, you can save the DNA Sequencing project files on one drive, and the CASAVA build files on the other.

• If you have a fast network connection, you may leave the CASAVA build files on the network.

• If you have a slow network connection and only one physical drive, consider purchasing another hard drive.

These are not hard rules, and you may want to test what works
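As a rough aid for applying the storage guidance above, the following Python sketch compares the drive portion of two Windows paths. It is an assumption-laden illustration, and the function name same_drive is hypothetical. It only detects a shared drive letter or UNC share. It cannot tell whether two different drive letters are partitions on the same physical disk, so that case still has to be checked by hand.

```python
from pathlib import PureWindowsPath


def same_drive(path_a, path_b):
    """Return True if both Windows paths share a drive letter or UNC share."""
    # PureWindowsPath parses Windows-style paths on any operating system.
    drive_a = PureWindowsPath(path_a).drive.lower()
    drive_b = PureWindowsPath(path_b).drive.lower()
    return drive_a != "" and drive_a == drive_b
```

If the function returns True for your project location and your CASAVA build location, consider moving one of them.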

Accessing Linux Volume

CASAVA runs on a Linux computer, while the DNA Sequencing Module runs on Windows. To copy or access data on the Linux volume from the Windows computer, make sure of the following:

• File sharing using NFS or Samba is turned on for the Linux machine.

• The Windows machine has mounted the Linux volume and mapped it as a network drive.

Moving CASAVA Build Files

Most likely, the CASAVA build files are not located on the workstation that runs the GenomeStudio Software. If you want to move the build files to your local drive, perform the following steps before running the DNA Sequencing Module:

1. Navigate to the location where the CASAVA project directory is saved (such as \\workstation-579\Data\DNA Seq\c19 PE-DNA in Figure 7).

2. Identify the CASAVA output directory, which should contain the conf, html, Parsed_xx_xx_xx, and stats folders.

3. Copy the CASAVA build files to the hard drive of the workstation that runs the GenomeStudio Software.

4. If you want to compare multiple samples in the DNA Sequencing Module, repeat steps 1–3 for all samples you want to analyze.

NOTE

Certain folders, such as the export folder, do not have to be copied, since they are not used by the DNA Sequencing Module.

NOTE

If you want to combine two or more CASAVA builds in one sample, you will have to rerun CASAVA to combine the data.
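The copy step above can also be scripted. The sketch below uses the Python standard library function shutil.copytree and skips the export folder, which the DNA Sequencing Module does not use. It is an illustration under stated assumptions rather than an Illumina-provided tool: the function name copy_casava_build is hypothetical, the destination folder must not already exist, and any file or folder named export at any level of the tree is skipped.

```python
import shutil


def copy_casava_build(source_dir, destination_dir, skip=("export",)):
    """Copy a CASAVA project directory, leaving out folders that are not needed."""
    # ignore_patterns matches names at every level of the directory tree.
    return shutil.copytree(
        source_dir,
        destination_dir,
        ignore=shutil.ignore_patterns(*skip),
    )
```

To compare multiple samples, call the function once per CASAVA project directory, each with its own destination folder.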

Chapter 3

Starting a Project

Topics

Introduction
Creating a New Project
Starting the DNA Sequencing Module
Choosing a Project Location
Selecting the CASAVA Build
Selecting the Genome and Chromosomes
Opening an Existing Project
Opening a Recent Project
Browsing to an Existing Project
Project Windows

Introduction

To perform a DNA sequencing analysis, you first must create a project. In a project, you define samples by selecting output files from DNA sequencing experiments performed on the Genome Analyzer. You can load multiple samples in one project to compare results from different samples.

The following section, Creating a New Project, provides step-by-step instructions for defining your project.

If you want to open an existing project, see the section Opening an Existing Project.

Creating a New Project

Follow the instructions in this section to create a new project using the GenomeStudio Project Wizard.

Starting the DNA Sequencing Module

1. In the New Project area, double-click DNA Sequencing (Figure 10).

Figure 10 Starting New DNA Sequencing Project

The GenomeStudio Project Wizard—Welcome dialog box appears (Figure 11).

Figure 11 GenomeStudio Project Wizard - Welcome

2. Click Next to advance to the GenomeStudio Project Wizard—Location dialog box.

Choosing a Project Location

In the GenomeStudio Project Wizard—Location dialog box (Figure 12), perform the following steps to choose a project location:

1. Browse to the location where you want to save your project under Projects Repository, or select a location from the dropdown list.

2. Enter a name for your project in the Project Name text field.

The full path for your project appears beneath the name you enter.

NOTE

This is not the location where your CASAVA output is stored. For best performance, store your project files on a separate drive from the CASAVA output files.

Figure 12 GenomeStudio Project Wizard—Location

3. Click Next to advance to the GenomeStudio Project Wizard—CASAVA Build Selection dialog box.

Selecting the CASAVA Build

In the GenomeStudio Project Wizard—CASAVA Build Selection dialog box (Figure 13), perform the following steps to select your project data:

1. Browse to the location with the sequencing data folder under Repository, or select a location from the drop-down menu.

The available data folders appear in the CASAVA Builds pane.

2. Select the desired data folder in the CASAVA Builds pane.

3. Add the build data to the project using the Add Build to Project button.

The selected CASAVA build will now appear in the Project

NOTE

Browse to the parent of the sequencing data folder. Make sure not to browse into the data folder.
Figure 13 GenomeStudio Project Wizard—CASAVA Build Selection

4. If you have another folder you want to select data from, repeat steps 1–3.

5. Once you are done adding data, click Next to advance to the GenomeStudio Project Wizard—Genome and Chromosomes dialog box.

Selecting the Genome and Chromosomes

In the GenomeStudio Project Wizard—Genome and Chromosomes dialog box (Figure 14), select the genome and build that was used in the Pipeline to align the reads, and select the chromosomes you wish to view.
1. Select the proper genome and build in the Genome dropdown list.

Figure 14 GenomeStudio Project Wizard—Genome and Chromosomes

2. Select the chromosomes you want to look at. By default, all chromosomes are selected. You can deselect chromosomes using the checkboxes, or click Deselect All, then select the chromosomes you want to include. Click Select All to include all chromosomes.

3. Click Finish.

NOTE

It is essential that you use the same genome build in the DNA Sequencing Module as you used in the Genome Analyzer Pipeline analysis.

If you need to download a different build, see the GenomeStudio Framework User Guide.

NOTE

Selecting fewer chromosomes will increase speed. Consider this if your CASAVA build only contains reads aligned to a subset of chromosomes, or if you want to look at a limited amount of data to get an idea of the quality.

GenomeStudio now loads the sequence files. This may take minutes to hours, depending on the size of the data files and the number of chromosomes being viewed.

Opening an Existing Project

Follow the instructions in this section to open an existing DNA Sequencing project in the DNA Sequencing Module.

Opening a Recent Project

If you have recently opened or created a project and it is visible in the Recent Projects pane of the GenomeStudio main window, double-click the project. The project opens in the DNA Sequencing Module (Figure 15).

Figure 15 GenomeStudio Recent Projects Pane

Browsing to an Existing Project

If you want to open an existing project, perform the following steps:

1. Click File | Open Project, or click the Open button on the toolbar.

The Open Project dialog box appears (Figure 16).

NOTE

If you want to clean up the Recent Projects list, position the cursor in the Recent Projects pane and right-click to remove projects or clear the entire Recent Projects list.

Figure 16 Open Existing Project

2. Browse to the Projects Repository where the project is saved.

3. Select the <Project Name>.bsc file and click Open.

The project opens in the DNA Sequencing Module.
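If a projects repository holds many projects, it can help to list the available <Project Name>.bsc files before browsing. The Python sketch below is a convenience illustration and not a GenomeStudio feature. The function name find_project_files is hypothetical, and the sketch assumes project files use a lowercase .bsc extension.

```python
from pathlib import Path


def find_project_files(repository):
    """Return all .bsc project files below the given repository, sorted by path."""
    # rglob searches the repository folder and all of its sub-folders.
    return sorted(Path(repository).rglob("*.bsc"))
```

Each returned path can then be selected in the Open Project dialog box.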

Project Windows

You have now generated a project, and two windows are opened automatically: the GenomeStudio Main Window and the Illumina Genome Viewer (IGV; Figure 17).

Figure 17 GenomeStudio Main Window and Illumina Genome Viewer

The GenomeStudio Main Window displays four data tables derived from the DNA sequencing data (Figure 17):

• DNA Sequencing sequences table

• DNA Sequencing samples table

• DNA Sequencing lanes table

• DNA Sequencing alleles table

Use the GenomeStudio Main Window for in-depth data analysis, sorting, and filtering of the data in these tables. See Chapter 5, Analyzing Numerical Data for instructions.

Use the Illumina Genome Viewer for visualizing data in these tables in the context of the genome. See Chapter 4, Using the IGV to Visualize Data for instructions.

Chapter 4

Using the IGV to Visualize Data

Topics

Introduction
Loading an Active Genome for Visualizing Data
IGV View Modes
Using the Illumina Genome Viewer
Chromosome Slide Show Mode
Using the Illumina Chromosome Browser
Launching the Illumina Chromosome Browser
Moving the View Region
Zooming in to Specific Regions
Zooming in Stacked Alignments
Finding Genes
Changing Data Selection or Appearance

Introduction

This chapter describes visualization functions of the Illumina Genome Viewer (IGV) in the DNA Sequencing Module. You can view sequence reads and SNPs at the whole genome level, or drill down to the gene or nucleotide level.

The instructions in this chapter are specific for visualizing DNA Sequencing data. For more general information about the IGV, see the GenomeStudio Framework User Guide.

Loading an Active Genome for Visualizing Data

It is essential that you use the same genome build in the DNA Sequencing Module as you used in the Genome Analyzer Pipeline analysis. The Active Genome is shown in the top left corner of the IGV main window. If you need to load a different genome, go to Edit | Preferences, and select the proper Active Genome in the Genome tab. If you need to download a different build, follow the instructions in the GenomeStudio Framework User Guide.

IGV View Modes

The DNA Sequencing Module provides two views for different levels of resolution:

• The Chromosome Slide Show Mode of the Illumina Genome Viewer (IGV) allows you to view results from one data set plotted on up to four chromosomes. Use this view if you want to have a chromosome view of your experiment and to decide which regions merit further investigation.

• Within the Illumina Chromosome Browser (ICB), you can view results from several data sets, and zoom in to gene and nucleotide level. Use the ICB if you want to see sequence reads and SNPs at the nucleotide level, or to correlate your data with gene information.

NOTE

If you need to re-open the IGV from the GenomeStudio Main Window, select Tools | Show Genome Viewer.

Using the Illumina Genome Viewer

The default view is the Illumina Genome Viewer—Chromosome Slide Show Mode. This section explains how to use this viewer specifically to view DNA Sequencing sequence reads, SNPs, and the consensus sequence. For more general information about the IGV, see the GenomeStudio Framework User Guide.

Chromosome Slide Show Mode

The Chromosome Slide Show Mode of the IGV can display up to four data sets at a time, on up to four chromosomes (Figure 18).

To display a data plot, or select how many chromosomes you want to view in Chromosome Slide Show Mode, click Chromosome Slide Show Mode in the IGV. To select a chromosome, click the Jump to a Specific Chromosome button.

NOTE

You can copy plots to other applications by right-clicking the plot and selecting the appropriate copy option.

Figure 18 Data Plots Shown in Chromosome Slide Show Mode

From the Chromosome Slide Show Mode, you can zoom out to view the data at genome level using the Whole Genome View Mode, or zoom in using the Illumina Chromosome Browser (described in the next section).

Using the Illumina Chromosome Browser

The Illumina Chromosome Browser (ICB) allows you to explore data by chromosome or by gene, down to the nucleotide level.

This section explains how to use the ICB specifically to view DNA Sequencing sequence reads, SNPs, and the consensus sequence. For general information about the ICB, see the GenomeStudio Framework User Guide.

Launching the Illumina Chromosome Browser

Launch the ICB by double-clicking on a chromosome in the IGV (Figure 18). The data plot will appear above the chromosome in the ICB.

Figure 19 Data Plot Displayed Above the Chromosome

Moving the View Region

To change the region being viewed, move the red rectangle indicating the viewing region on the chromosome graphic, click the navigation arrows, or drag or scroll the base position below (Figure 20). For greater detail about these options, see the GenomeStudio Framework User Guide.

Figure 20 Moving the View Region in ICB

Zooming in to Specific Regions

To move and resize the View Region, do any of the following:

• Double-click a peak in the data plot.

The View Region in the data plot zooms in on the peak.

• Click the zoom buttons in the ICB toolbar.

• Double-click a cytogenetic band on the displayed chromosome.

The View Region is fitted to the cytoband.

• Double-click a gene of interest.

This changes the View Region size to fit the gene.

• Position the cursor inside the red rectangle and scroll using the mouse wheel.

This changes the View Region by small increments.

Figure 21 Zooming in to Specific Regions in ICB

For greater detail about these options, see the GenomeStudio Framework User Guide.

Zooming in Stacked Alignments

The stacked alignment plot has additional features when zooming in. A low-resolution stacked alignment data plot is shown in Figure 22. The title of each plot shows the number of bases represented by one pixel at that resolution (6 kb in Figure 22). The height of each bar represents the number of sequence reads in the region. This is determined by calculating the number of overlapping reads for each base pair in a four-pixel-wide region and taking the maximum value.

Figure 22 Low Resolution Stacked Alignment Plot
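The bar-height rule described for the low-resolution plot can be sketched in a few lines of Python. This is an interpretation of the description, not the viewer's actual code. The function name bar_heights and its parameters are hypothetical, and reads are assumed to be given as (start, length) pairs in chromosome coordinates.

```python
from itertools import accumulate


def bar_heights(reads, region_start, region_end, bases_per_pixel, pixels_per_bar=4):
    """Return one height per bar: the maximum per-base read depth inside the bar."""
    n_bases = region_end - region_start + 1
    # Difference array: +1 where a read begins, -1 just past where it ends.
    diff = [0] * (n_bases + 1)
    for start, length in reads:
        lo = max(start, region_start) - region_start
        hi = min(start + length - 1, region_end) - region_start
        if lo <= hi:  # skip reads that fall entirely outside the region
            diff[lo] += 1
            diff[hi + 1] -= 1
    # Running sum gives the number of overlapping reads at each base.
    depth = list(accumulate(diff[:n_bases]))
    bar_width = bases_per_pixel * pixels_per_bar
    return [max(depth[i:i + bar_width]) for i in range(0, n_bases, bar_width)]


# Three short reads in a 12-base region, one base per pixel.
print(bar_heights([(1, 5), (3, 5), (10, 2)], 1, 12, bases_per_pixel=1))  # [2, 2, 1]
```

At the resolution shown in Figure 22, bases_per_pixel would be about 6000, so each bar would summarize roughly 24,000 bases.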

Once you have zoomed in far enough, the stacked alignments are shown as individual sequence reads. Tags aligning to the forward strand of the chromosome are colored blue by default; reverse-aligned tags are colored yellow (you can customize the colors; see Changing Data Selection or Appearance). The two reads of paired-end sequencing are linked by a thin, grey, dotted line (Figure 23).

Figure 23 Blue and Yellow Sequence Reads in Stacked Alignment View

If you zoom in to a sufficient level of resolution, the individual sequences of the tags are shown, in alignment with the reference sequence of the chromosome (Figure 24). Blue and yellow coloring and paired-end linking lines remain the same.

Additionally, mismatched nucleotides (compared to reference sequence) show up in red, and allele calls are shown below the reference sequence.

Figure 24 Stacked alignment view, individual sequences
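The highlighting of mismatched nucleotides can be pictured as a position-by-position comparison of a read with the reference sequence. The Python sketch below is a simplified illustration, not the viewer's implementation. It assumes an ungapped alignment, and the function name mismatch_positions and its arguments are hypothetical.

```python
def mismatch_positions(read, reference, start):
    """Return chromosome positions where the read differs from the reference.

    reference is the reference sequence for the span covered by the read,
    and start is the chromosome position of the first base of the read.
    """
    return [
        start + offset
        for offset, (read_base, ref_base) in enumerate(zip(read.upper(), reference.upper()))
        if read_base != ref_base
    ]
```

A mismatch found this way is only a candidate. Whether it is reported as a SNP depends on the allele calls made in CASAVA.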

NOTE

The red mismatches are not always called in the Alleles table because the allele calls in CASAVA take into account the alignment score. Some differences do not have good enough alignment scores to indicate a SNP.

Finding Genes

You can also zoom in directly to a gene using the Find Genes function:

1. Go to Find | Find Genes in the ICB Toolbar.

2. Select the checkbox Search for Gene by Name.

3. Enter gene name in the Find What text box.

4. Select the appropriate Gene Identifier.

5. Click Find.

The search results appear at the bottom of the window.

6. Double-click the gene you are looking for in the search results.

ICB zooms in to the selected gene.

Figure 25 Find Genes Window

The gene you select appears in red in the ICB. Information about the selected gene is shown in the Gene ID and Gene Details tabs below the ICB main window. You can click on active links to view more gene information; see the GenomeStudio Framework User Guide for information.

Changing Data Selection or Appearance

You can select which data to view and the appearance of that data using the IGV Data Workspace, Stacked Alignment Plots tab (Figure 26).

Select the Stacked Alignment Plot tab. The first sample listed is selected by default in the Sample Name area. If you want to change the default selection or add samples to your display, select the associated checkbox next to the sample name.

The appearance of the data plot can be configured using the Properties panel. Click Update to view the changes.

Figure 26 IGV Data Workspace, Stacked Alignment Plots

For general information about the IGV Data Workspace, see the GenomeStudio Framework User Guide.

Viewing Allele Call Scores

You can view allele call scores using the IGV Data Workspace, Data Plots tab (Figure 27).

Figure 27 IGV Data Workspace, Data Plots

To select which data you want to display, right-click in the Data Plots area and select Add. The Add Favorite Data Plots Form appears (Figure 28). To change the sample selection or add samples to your display, select the associated checkbox next to the sample name. You can choose which scores and associated subcolumns to view.

Figure 28 Favorite Data Plots Form

The default Data Plot Type is Scatter. Additional data plots are Line and Bar. Click Update to view the changes.

Chapter 5

Analyzing Numerical Data

Topics

Introduction
Numerical Data in the GenomeStudio Main Window
Sequences Table
Samples Table
Lanes Table
Alleles Table
Using Data in the GenomeStudio Software
Viewing Interesting SNPs in IGV
Running Reports
Final Reports
ADT Reports
Custom Reports

Introduction

The DNA Sequencing Module supports numerical data analysis as well as visual analysis. This chapter explains the following topics:

• Analyzing numerical data in the GenomeStudio Main Window.

• Viewing the most promising SNPs in the IGV.

Numerical Data in the GenomeStudio Main Window

First go to the GenomeStudio Main Window. Within the GenomeStudio Main Window, there are four data tables (Figure 29):

• DNA Sequencing sequences table, which contains information about the sequence reads that were generated.

• DNA Sequencing samples table, which contains information about which samples were used.

• DNA Sequencing lanes table, which contains information about which lanes were used.

• DNA Sequencing alleles table, which contains information about which SNPs were found.

Figure 29 GenomeStudio Main Window Data Tables

The type of information that can be found in the different tables is described in the sections below.

Sequences Table

The DNA Sequencing sequences table contains 14 columns to provide information about all the sequence reads that were generated (Figure 30).

Figure 30 DNA Sequencing Sequences Table

The content of each column is explained below.

Column           Content
Index            A number generated by the DNA Sequencing Module to keep entries distinct. It has no experimental meaning.
Sample Name      The name of the sample.
Machine Name     The name of the machine used for sequencing.
Run Index        The run index number.
Lane             The lane on the flow cell that generated the read.
Cluster ID       The tile and the XY coordinates of the cluster within the tile, which together provide a unique cluster ID.
Sequences        The sequence of the tag, as determined by the Genome Analyzer.
Quality          The per-base quality score. For a single base call, a quality value of 30 is very good, 20 is good, and 10 is still usable. For more information, see the Sequencing Analysis Software User Guide For Pipeline Version 1.1 and CASAVA.
Chr ID           The chromosome to which the sequence read is aligned. ChrM indicates a mitochondrial DNA alignment.
Start            The start position of the alignment of the sequence read on the indicated chromosome.
Sequence Length  The length of the sequence.
Strand           The strand of the chromosome to which the read aligns (F = forward, R = reverse).
SE Score         The alignment score using only the single read.
PE Score         The alignment score of the paired-end reads combined (empty for a single read).

Samples Table

The DNA Sequencing samples table contains 16 columns to provide sample information (Figure 31).

Figure 31 DNA Sequencing Samples Table

Column                Content
Index                 A number generated by the DNA Sequencing Module to keep entries distinct. It has no experimental meaning.
Read                  The paired-end read number.
Number of Lanes       The number of lanes that were combined in this sample.
Sample Target         The target (reference) to which the sample was aligned.
Sample Type           The type of alignment that was performed prior to the CASAVA analysis.
Avg. Cluster Raw      The average number of clusters per tile before filtering.
Avg. % PF Cluster     The average percentage of clusters passing filter.
Avg. % Error Rate PF  The average error rate for clusters passing filter that could be aligned.
Avg. % Phasing        The average percentage of phasing in clusters. Phasing occurs when a small number of molecules in each cluster falls behind the current incorporation cycle in a sequencing run.
Avg. % PrePhasing     The average percentage of prephasing in clusters. Prephasing occurs when a small number of molecules in each cluster runs ahead of the current incorporation cycle in a sequencing run.
Avg. SNP Detected     The average number of SNPs detected.

Lanes Table

The DNA Sequencing lanes table contains 16 columns to provide lane information (Figure 32).

Figure 32 DNA Sequencing Lanes Table

Column           Content
Chip ID          The ID of the flow cell used for sequencing.
Lane Number      The number of the lane on the flow cell.
Read             The paired-end read number.
Machine Name     The name of the machine used for sequencing.
Run Date         The date of the sequencing run.
Length           The length of the sequencing run (two lengths for paired-end sequencing).
Filter           The filter used to filter out bad clusters.
Clusters (Raw)   The total number of clusters per tile on the lane.
% PF Cluster     The percentage of clusters passing filter.
% Align PF       The percentage of clusters passing filter that could be aligned to the genome.
% Error Rate PF  The error rate for clusters passing filter that could be aligned.
% Phasing        The percentage of phasing in clusters. Phasing occurs when a small number of molecules in each cluster falls behind the current incorporation cycle in a sequencing run.
% PrePhasing     The percentage of prephasing in clusters. Prephasing occurs when a small number of molecules in each cluster runs ahead of the current incorporation cycle in a sequencing run.
Yield (kBases)   The total yield of sequence (in kBases) from the lane.

Alleles Table

The DNA Sequencing alleles table contains 15 columns to provide information about the discovered SNPs (Figure 33).

Figure 33 DNA Sequencing Alleles Table

Column        Content
Chr ID        The chromosome to which the sequence read is aligned. ChrM indicates a mitochondrial DNA alignment.
Position      The position of the SNP on the indicated chromosome.
A Bases       The number of A bases called on the reads.
C Bases       The number of C bases called on the reads.
G Bases       The number of G bases called on the reads.
T Bases       The number of T bases called on the reads.
Call          The base called.
Bases Used    The bases used for making the call.
Total Bases   The total bases called at that position.
Call 1 Score  The score of the call.
Call 2 Score  The score of the second call, if heterozygote.
Ref Base      The reference base at that position.
Call Type     The call type:
              ` SNP_diff: a homozygous call that differs from the reference.
              In heterozygote calls, the strongest call is given first:
              ` SNP_het1: the heterozygote has the reference base as the stronger call and a non-reference base as the minor allele.
              ` SNP_het2: the heterozygote has a non-reference base as the major call and the reference base as the minor allele.
              ` SNP_het_other: a heterozygous call where both of the called alleles are different from the reference base.

Using Data in the GenomeStudio Software

Using the GenomeStudio Software, you can perform many different types of analysis on your data. Some of the most useful ones for DNA Sequencing applications are:

` Sorting numerical data

` Filtering numerical data

` Exporting your data

To export the SNP selection after manipulating your data, click Export Displayed Data to a File.

In addition, you can create graphs and count histograms from the data tables, and filter within these graphs. See the GenomeStudio Framework User Guide for information on how to set up these analyses.

Viewing Interesting SNPs in IGV

If you have found interesting SNPs in the DNA Sequencing alleles table and you want to view them in their genomic context, do the following:

1. Select the SNP you want to jump to.

Figure 34 Jump to IGV

The interesting SNP will now display in the IGV.
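The Call Type values described for the alleles table follow a simple decision rule based on the reference base and the called bases. The following Python sketch is illustrative only and is not part of GenomeStudio or CASAVA. The function name `classify_call` and its arguments are hypothetical, and it assumes that `call1` is the stronger call and that `call2` is empty for a homozygous call.

```python
from typing import Optional


def classify_call(ref_base: str, call1: str, call2: Optional[str] = None) -> Optional[str]:
    """Return the Call Type label for one position, or None if the call matches the reference.

    Hypothetical helper: call1 is the stronger call, call2 the weaker call
    (None or "" for a homozygous call).
    """
    ref, first = ref_base.upper(), call1.upper()
    second = call2.upper() if call2 else None

    if second is None or second == first:
        # Homozygous call: it is a SNP only if it differs from the reference.
        return "SNP_diff" if first != ref else None
    if first == ref:
        # The reference base is the stronger call.
        return "SNP_het1"
    if second == ref:
        # A non-reference base is the stronger call; the reference is the minor allele.
        return "SNP_het2"
    # Neither called allele matches the reference.
    return "SNP_het_other"
```

For example, `classify_call("A", "G", "A")` returns `"SNP_het2"`, because the non-reference base G is the stronger call.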

Running Reports

The DNA Sequencing Module has the ability to generate reports. To see a list of available reports, select Analysis | Reports from the toolbar menu to open the GenomeStudio Reports dialog box (Figure 35).

Figure 35 GenomeStudio Sequencing Reports Dialog Box

Final Reports

The DNA Sequencing Module can generate a final report, which you can populate with the information provided in the tables.

1. Select Analysis | Reports from the toolbar menu.


2. Select Final Report in the DNA Sequencing Reports dialog box.

3. Specify a file name and location under Report File Name.

4. Click OK.

The Final Report Dialog Box opens (Figure 36).

Figure 36 Final Report Dialog Box

5. Select the data you want to appear in the final report.

6. When you are finished, click OK.

You can now review the report.

ADT Reports

The DNA Sequencing Module can generate Illumina Assay Design Tool (ADT) reports containing the regions where you have found SNPs.

You can generate an ADT dbSNP report or an ADT sequence report:

` The dbSNP ADT Report contains information from the Alleles table in the GenomeStudio Software, and provides a surrounding chromosomal region for the SNP. Use this report if you want to target an area but not necessarily the exact SNP you identified. dbSNP reports are only supported for human.


` The Sequence ADT Report provides the SNP and the flanking sequences based on a reference genome. Use the sequence list for all non-human submissions and for any human submissions for which you want to assay the particular identified SNP. Note that the SNP_Name in a sequence list cannot begin with "rs" or "cg", because this conflicts with existing IDs.

The reports can be sent to Illumina, who can help you devise custom SNP assays.
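Before submitting a sequence list, it may help to check the SNP_Name values against the prefix restriction described above. The following Python sketch is a hypothetical helper, not part of GenomeStudio or the Assay Design Tool. It assumes that the comparison should ignore case and that an empty name is also invalid; the guide itself only states that names cannot begin with "rs" or "cg".

```python
RESERVED_PREFIXES = ("rs", "cg")


def is_valid_snp_name(name: str) -> bool:
    """Return True if the name is non-empty and does not start with a reserved prefix."""
    stripped = name.strip()
    # Assumption: the check is case-insensitive and empty names are rejected.
    return bool(stripped) and not stripped.lower().startswith(RESERVED_PREFIXES)


def find_invalid_names(names):
    """Return the names that would conflict with existing IDs."""
    return [n for n in names if not is_valid_snp_name(n)]
```

Calling `find_invalid_names(["snp_chr1_1000", "rs12345"])` returns `["rs12345"]`, which would need to be renamed before submission.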

Generating the ADT Report

To run an ADT report, perform the following:

1. Select Analysis | Run Reports in the toolbar menu.

The DNA Sequencing Reports dialog box opens.

2. Select ADT dbSNP Report or ADT Sequence Report in the DNA Sequencing Reports dialog box.

3. Specify a file name and location under Report File Name.

4. Click OK.

The ADT Report Dialog Box opens (Figure 37).

Figure 37 ADT Report Dialog Box

5. If you want a header for the report:

a. Select the Add Header to Report checkbox.

b. Fill out the ADT Report dialog box.

c. To work on the file later, press Save Info to File.

d. To load a previously saved file, press Load Info from File.

6. When you are finished, press Generate Report.

You can now review the report.

Submitting the ADT Reports

When you are satisfied with the reports, send the reports to Illumina.

Submitting via the web is preferred, since it provides rapid turnaround and 24-hour access. To submit the ADT report:

1. Log in to iCom.illumina.com.

2. Select Prelim assay design tool (ADT).

3. In the ADT interface, enter the necessary file information and attach your ADT report.

After the file has been scored, an email notification is returned to you.

If you need assistance with ADT reports or our Assay Design Tool, please contact our Technical Support Team.

Custom Reports

The DNA Sequencing Module has the ability to implement custom reports. To view a list of available custom reports, select Analysis | Run Reports from the toolbar menu. The DNA Sequencing Reports dialog box opens (Figure 35).

Specify a file name and location under Report File Name, and run the custom report by selecting it from the dropdown list.


Appendix A

Generating CASAVA Output Files

Introduction

This appendix explains how to generate a CASAVA build and call SNPs using Pipeline, ELAND, and CASAVA.

Basic Workflow

The basic workflow is as follows:

1. Take Bustard reads from DNA sequencing experiment(s).

2. Align the reads to the genome with ELAND (the alignment part of Pipeline). Use ELAND_extended or ELAND_pair.

3. Aggregate lanes into samples with CASAVA.

4. Call SNPs with CASAVA.

The output files can now be viewed in the GenomeStudio Software.

Running ELAND

Perform the following steps to align your reads to the genome:

1. Locate the Bustard reads you want to use.

2. Perform alignment by running ELAND_extended or ELAND_pair.

3. The result is an s_<lane>_export.txt file for each lane of the flow cell.

Running CASAVA

Perform the following steps to build a consensus and call SNPs:

1. Open the summary.htm file(s) for a particular sample, and decide which lanes to use based on the quality information in the headline. A sample can consist of one or many lanes, taken from one or many flow cells.

2. Locate the appropriate s_<lane>_export.txt file from ELAND.

3. Feed the _export file(s) into CASAVA to bin/sort, aggregate, and perform SNP calling.

NOTE

If you want to add or remove lanes, you must rerun CASAVA to make a new build.
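Step 2 above, locating the export file for each chosen lane, can be scripted. The following Python sketch only collects files that follow the s_<lane>_export.txt naming given in this appendix; it does not run ELAND or CASAVA. The function name `find_export_files` and the idea of a single ELAND output folder are assumptions made for illustration.

```python
from pathlib import Path
from typing import Dict, Iterable


def find_export_files(eland_dir: str, lanes: Iterable[int]) -> Dict[int, Path]:
    """Map each requested lane number to its s_<lane>_export.txt file.

    Raises FileNotFoundError if a requested lane has no export file in eland_dir.
    """
    base = Path(eland_dir)
    found = {}
    for lane in lanes:
        path = base / f"s_{lane}_export.txt"
        if not path.is_file():
            raise FileNotFoundError(f"No export file for lane {lane}: {path}")
        found[lane] = path
    return found
```

Failing early on a missing file makes it clear which lane has not been aligned yet, before any files are passed on to CASAVA.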


Appendix B

Troubleshooting

Introduction

Use this troubleshooting guide to assist you with any questions you may have about the DNA Sequencing Module.

Frequently Asked Questions

Table 1 lists frequently asked questions and associated responses.

Table 1 Frequently Asked Questions

1. Question: The sequence reads do not seem to align at all with the genome.

Response: You may have used different genome builds for alignment in the Pipeline Analysis software and the DNA Sequencing Module. Make sure you use the same build in both; see the respective Pipeline Analysis User Guide or GenomeStudio Framework User Guide to change the build you are using.

2. Question: How do I combine multiple samples into one?

Response: You cannot combine multiple samples into one in the DNA Sequencing Module. You will have to rerun CASAVA to combine the data from multiple samples in one CASAVA build. Note: you can compare multiple samples.

3. Question: Some data looks bad. How do I remove it from the build?

Response: You will have to rerun CASAVA without the bad data to generate a new CASAVA build.

4. Question: SNPs are called at areas where the coverage dips.

Response: These SNPs may not be real SNPs, but small indels. A small indel will cause a short run of SNP calls (~indel+4) with a concomitant dip in coverage.

Check whether the apparent SNP can be explained by a short indel.
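One way to find candidates for the check described in response 4 is to look for short runs of closely spaced SNP calls in exported alleles data. The following Python sketch is a rough heuristic, not a GenomeStudio or CASAVA feature. The function name `find_snp_runs` and the thresholds `max_gap` and `min_calls` are illustrative assumptions rather than values from this guide, and the sketch does not look at coverage, so flagged runs still need to be inspected in the IGV.

```python
from typing import Dict, List, Tuple


def find_snp_runs(
    snps: List[Tuple[str, int]], max_gap: int = 5, min_calls: int = 3
) -> List[Tuple[str, int, int, int]]:
    """Group SNP calls that lie close together on the same chromosome.

    snps is a list of (chr_id, position) pairs. Returns (chr_id, start, end, count)
    for each run of at least min_calls calls in which neighbouring calls are
    no more than max_gap bases apart.
    """
    by_chr: Dict[str, List[int]] = {}
    for chr_id, pos in snps:
        by_chr.setdefault(chr_id, []).append(pos)

    runs = []
    for chr_id, positions in by_chr.items():
        positions.sort()
        start = prev = positions[0]
        count = 1
        for pos in positions[1:]:
            if pos - prev <= max_gap:
                count += 1
            else:
                # The gap is too large: close the current run and start a new one.
                if count >= min_calls:
                    runs.append((chr_id, start, prev, count))
                start, count = pos, 1
            prev = pos
        if count >= min_calls:
            runs.append((chr_id, start, prev, count))
    return runs
```

For example, three calls on the same chromosome at positions 100, 101, and 103 are reported as one run, while isolated calls are ignored.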
