• No results found

Essential Computing for Bioinformatics First Steps in Computing: Course Overview

N/A
N/A
Protected

Academic year: 2021

Share "Essential Computing for Bioinformatics First Steps in Computing: Course Overview"

Copied!
45
0
0

Loading.... (view fulltext now)

Full text

(1)

MARC: Developing Bioinformatics Programs

July 2009

Alex Ropelewski

PSC-NRBSC

Bienvenido Vélez

UPR Mayaguez

Reference: How to Think Like a Computer Scientist: Learning with Python

Essential Computing for Bioinformatics

First Steps in Computing: Course Overview

(2)

• The following material is the result of a curriculum development effort to provide a set

of courses to support bioinformatics efforts involving students from the biological sciences, computer science, and mathematics departments. They have been developed as a part of the NIH funded project “Assisting Bioinformatics Efforts at Minority Schools” (2T36 GM008789). The people involved with the curriculum development effort include:

• Dr. Hugh B. Nicholas, Dr. Troy Wymore, Mr. Alexander Ropelewski and Dr. David

Deerfield II, National Resource for Biomedical Supercomputing, Pittsburgh Supercomputing Center, Carnegie Mellon University.

• Dr. Ricardo González Méndez, University of Puerto Rico Medical Sciences Campus.

• Dr. Alade Tokuta, North Carolina Central University.

• Dr. Jaime Seguel and Dr. Bienvenido Vélez, University of Puerto Rico at Mayagüez.

• Dr. Satish Bhalla, Johnson C. Smith University.

• Unless otherwise specified, all the information contained within is Copyrighted © by

Carnegie Mellon University. Permission is granted for use, modify, and reproduce these materials for teaching purposes.

(3)

Course Overview

Introduction to Programming (Today)

Why learn to Program?

The Python Interpreter

Software Development Process

Numbers, Strings, Operators, Expressions

Control structures, decisions, iteration and

recursion

Outline

(4)

Course Overview

Essential Computing for Bioinformatics

Course Description

Educational Objectives

Major Course Modules

Module Descriptions

(5)

Essential Computing for Bioinformatics

Course Description

This course provides a broad introductory discussion

of essential computer science concepts that have wide

applicability in the natural sciences. Particular

emphasis will be placed on applications to

Bioinformatics. The concepts will be motivated by

practical problems arising from the use of

bioinformatics research tools such as genetic

sequence databases. Concepts will be discussed in a

weekly lecture and will be practiced via simple

programming exercises using Python, an easy to learn

and widely available scripting language.

(6)

Educational Objectives

Awareness of the mathematical models of computation and their

fundamental limits

Basic understanding of the inner workings of a computer system

Ability to extract useful information from various bioinformatics data

sources

Ability to design computer programs in a modern high level language

to analyze bioinformatics data.

Experience with commonly used software development environments

and operating systems

Experience applying computer programming to solve bioinformatics

problems

(7)

Major Course Modules

Module

Lecture

MARC

Lecture

First Steps in Computing: Course Overview 1

Using Bioinformatics Data Sources 2

Mathematical Computing Models 3 5

High-level Programming (Python): Flow Control 6 2

High-level Programming (Python): Container Objects 7 3

High-level Programming (Python): Files 8 4

High-level Programming (Python): BioPython 9

(8)

Main Advantages of Python

Familiar to C/C++/C#/Java Programmers

Very High Level

Interpreted and Multi-platform

Dynamic

Object-Oriented

Modular

Strong string manipulation

Lots of libraries available

Runs everywhere

Free and Open Source

Track record in bioInformatics (BioPython)

(9)

Using BioInformatics Data Sources

Searching Nucleotide Sequence Databases

Searching Amino Acid Sequence Database

Performing BLAST Searches

Using Specialized Data Sources

IDEA: How can we expedite data collection and analysis?

... writing programs to automate parts of the process.

9 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center

Reference:

Bioinformatics for Dummies (Ch 1-4)

(10)

Mathematical Computing

Goal:General Awareness

What is Computing?

Mathematical Models of Computing

Finite Automata

Turing Machines

The Limits of Computation

Church/Turing Thesis

What is an Algorithm?

Big O Notation

(11)

High-Level Programming (Python)

Downloading and Installing the Interpreter

Values, Expressions and Naming

Designing your own Functional Building

Blocks

Controlling the Flow of your Program

String Manipulation (Sequence Processing)

Container Data Structures

File Manipulation

11 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center

(12)

CS Fundamentals will be Interleaved Throughout the Course

Information Representation and Encoding

Computer Architecture

Programming Language Translation Methods

The Software Development Cycle

Fundamental Principles of Software

Engineering

Basic Data Structures for Bioinformatics

Design and Analysis of Bioinformatics

Algorithms

(13)

US Department of Labor, Bureau of Labor Statistics

Engineers, Life and Physical Scientists and Related Occupations.

Occupational Outlook Handbook, 2008-09 Edition

.

Biological scientists “…usually study allied disciplines

such as mathematics, physics, engineering and

computer science.

Computer courses are beneficial for

modeling and simulating biological processes, operating

some laboratory equipment and performing research in

the emerging field of bioinformatics

Why Learn to Program?

(14)

Why Learn to Program?

14

 

Need to compare output from a new run with an old run. (new hits in

database search)

 

Need to compare results of runs using different parameters. (Pam120

vs Blosum62)

 

Need to compare results of different programs (Fasta, Blast,

Smith-Waterman)

 

Need to modify existing scripts to work with new/updated programs and

web sites.

 

Need to use an existing program's output as input to a different

program, not designed for that program:

Database search -> Multiple Alignment

Multiple Alignment -> Pattern search

(15)

Bioinformatics Assembly Analyst

Responsibilities:

 Assembling genome sequence data using a variety of tools and parameters and performing the

experiments needed to evaluate sequencing strategies

 Using existing software and databases to analyze genomic data and correlating assemblies and

sequences with a variety of genetic and physical maps and other biological information

 Identifying problems and serving as point of contact for various groups to propose and

implement solutions

 Proposing and implementing upgrades to existing tools and processes to enhance analysis

techniques and quality of results

 Developing and implementing scripts to manipulate, format, parse, analyze, and display

genome sequence data; and developing new strategies for analysis and presentation of results.

Requirements:

 A bachelor's degree in biology or related field

 At least three years of experience in DNA sequencing and sequence analysis.

 Must possess solid knowledge of sequencing software and public sequencing databases.

 Knowledge of bioinformatics tools helpful.

Why Learn to Program?

(16)

C/C++

Language of choice for most large development projects

FORTRAN

Excellent language for math, not used much anymore

Java

Popular modern object oriented language

PERL

Excellent language for text-processing (bioperl.org)

PHP

Popular language used to program web interfaces

Python

Language easy to pick up and learn (biopython.org)

SQL

Language used to communicate with a relational database

Good Languages to Learn

In no particular order….

(17)

“Object Oriented” is simply a convenient way to organize your

data and the functions that operate on that data

A biological example of organizing data:

 

Human.CytochromeC.protein.sequence

 

Human.CytochromeC.RNA.sequence

 

Human.CytochromeC.DNA.sequence

Some things only make sense in the context that they are used:

 

Human,CytochromeC.DNA.intron

 

Human.CytochromeC.DNA.exon

 

Human.CytochromeC.DNA.sequence

 

Human.CytochromeC.protein.sequence

 

Human.CytochromeC.protein.intron

 

Human.CytochromeC.protein.exon

Python is Object Oriented

17

Meaningful

(18)

Go to

www.python.org

Go to DOWNLOAD section

Click on

Python 2.6.2 Windows installer

Save ~10MB file into your hard drive

Double click on file to install

Follow instructions

Start -> All Programs -> Python 2.6 -> Idle

Downloading and Installing Python

(19)

Idle: The Python Shell

(20)

Python as a Number Cruncher

20

>>> print 1 + 3

4

>>> print 6 * 7

42

>>> print 6 * 7 + 2

44

>>> print 2 + 6 * 7

44

>>> print 6 - 2 - 3

1

>>> print 6 - ( 2 - 3)

7

>>> print 1 / 3

0

>>>

/ and * higher precedence than + and -

integer division truncates fractional part Operators are left associative

(21)

Floating Point Expressions

21

>>> print 1.0 / 3.0

0.333333333333

>>> print 1.0 + 2

3.0

>>> print 3.3 * 4.23

13.959

>>> print 3.3e23 * 2

6.6e+023

>>> print float(1) /3

0.333333333333

>>>

Mixed operations converted to float

Scientific notation allowed

12 decimal digits default precision

(22)

String Expressions

22 >>> print "aaa" aaa >>> print "aaa" + "ccc" aaaccc >>> len("aaa") 3 >>> len ("aaa" + "ccc") 6 >>> print "aaa" * 4 aaaaaaaaaaaa >>> "aaa" 'aaa' >>> "c" in "atc" True >>> "g" in "atc" False >>> + concatenates string

len is a function that returns the length of its argument string

any expression can be an argument

* replicates strings

a value is an expression that yields itself

in operator finds a string inside another And returns a boolean result

(23)

23

>>> numAminoAcids = 20 >>> eValue = 6.022e23

>>> prompt = "Enter a sequence ->" >>> print numAminoAcids 20 >>> print eValue 6.022e+023 >>> print prompt Enter a sequence -> >>> print "prompt" prompt >>> >>> prompt = 5 >>> print prompt 5 >>>

= binds a name to a value

prints the value bound to a name

= can change the value associated with a name even to a different type

(24)

Values Have Types

24

>>> type("hello")

<type 'str'>

>>> type(3)

<type 'int'>

>>> type(3.0)

<type 'float'>

>>> type(eValue)

<type 'float'>

>>> type (prompt)

<type 'int'>

>>> type(numAminoAcids)

<type 'float'>

>>>

type is another function

the type of a name is the type of the

value bound to it

(25)

In Bioinformatics Words …

25 >>> codon=“atg” >>> codon * 3 ’atgatgatg’ >>> seq1 =“agcgccttgaattcggcaccaggcaaatctcaaggagaagttccggggagaaggtgaaga” >>> seq2 = “cggggagtggggagttgagtcgcaagatgagcgagcggatgtccactatgagcgataata” >>> seq = seq1 + seq2

>>> seq 'agcgccttgaattcggcaccaggcaaatctcaaggagaagttccggggagaaggtgaagacggggagtggggagttgagtc gcaagatgagcgagcggatgtccactatgagcgataata‘ >>> seq[1] 'g' >>> seq[0] 'a' >>> “a” in seq True >>> len(seq1) 60 >>> len(seq) 120

(26)

More Bioinformatics

Extracting Information from Sequences

26 >>> seq[0] + seq[1] + seq[2]

’agc’ >>> seq[0:3] ’agc’ >>> seq[3:6] ’gcc’ >>> seq.count(’a’) 35 >>> seq.count(’c’) 21 >>> seq.count(’g’) 44 >>> seq.count(’t’) 12 >>> long = len(seq) >>> pctA = seq.count(’a’) >>> float(pctA) / long * 100 29.166666666666668

Find the first codon from the sequence get ’slices’ from strings:

How many of each base does this sequence contain?

Count the percentage of each base on the sequence.

(27)

Additional Note About Python Strings

27

>>> seq=“ACGT”

>>> print seq

ACGT

>>> seq=“TATATA”

>>> print seq

TATATA

>>> seq[0] = seq[1]

Traceback (most recent call last):

File "<pyshell#33>", line 1, in <module>

seq[0]=seq[1]

TypeError: 'str' object does not support item assignment

seq = seq[1] + seq[1:]

Can replace

one whole

string with

another

whole string

Can

NOT

simply replace

a sequence

character with

another

sequence

character, but…

Can replace a whole string using substrings

(28)

How?

 

Precede comment with # sign

 

Interpreter ignores rest of the line

Why?

 

Make code more readable by others AND yourself?

When?

 

When code by itself is not evident

# compute the percentage of the hour that has elapsed

percentage = (minute * 100) / 60

 

Need to say something but Python cannot express it, such as

documenting code changes

percentage = (minute * 100) / 60 # FIX: handle float division

Commenting Your Code!

28

(29)

Problem Identification

 

What is the problem that we are solving

Algorithm Development

 

How can we solve the problem in a step-by-step manner?

Coding

 

Place algorithm into a computer language

Testing/Debugging

 

Make sure the code works on data that you already know the

answer to

Run Program

 

Use program with data that you do not already know the answer to.

Software Development Cycle

(30)

First, lets learn to SAVE our programs in a file:

From Python Shell: File -> New Window

From New Window: File->Save

Then, To run the program in the new window:

From New Window: Run->Run Module

Lets Try It With Some Examples!

(31)

What is the percentage composition of a nucleic

acid sequence

DNA sequences have four residues, A, C, G,

and T

Percentage composition means the percentage

of the residues that make up of the sequence

Problem Identification

(32)

Print the sequence

Count characters to determine how many A, C, G

and T’s make up the sequence

Divide the individual counts by the length of the

sequence and take this result and multiply it by

100 to get the percentage

Print the results

Algorithm Development

(33)

Coding

33

seq="ACTGTCGTAT"

print seq

Acount= seq.count('A')

Ccount= seq.count('C')

Gcount= seq.count('G')

Tcount= seq.count('T')

Total = len(seq)

APct = int((Acount/Total) * 100)

print 'A percent = %d ' % APct

CPct = int((Ccount/Total) * 100)

print 'C percent = %d ' % CPct

GPct = int((Gcount/Total) * 100)

print 'G percent = %d ' % GPct

TPct = int((Tcount/Total) * 100)

print 'T percent = %d ' % TPct

(34)

First SAVE the program:

From New Window: File->Save

Let’s Test The Program

(35)

Six Common Python Coding Errors:

 

Delimiter mismatch: check for matches and proper use.

 

Single and double quotes: ‘ ’ “ ”

 

Parenthesis and brackets: { } [ ] ( )

 

Spelling errors:

 

Among keywords

 

Among variables

 

Among function names

 

Improper indentation

 

Import statement missing

 

Function calling parameters are mismatched

 

Math errors:

 

Automatic type conversion: Integer vs floating point

 

Incorrect order of operations – always use parenthesis.

Testing / Debugging

(36)

seq='ACTGTCGTAT"

print seq;

Acount= seq.count('A')

Ccount= seq.count('C')

Gcount= seq.count('G')

Tcount= seq,count('T')

Total = Len(seq)

APct = int((Acount/Total) * 100)

print 'A percent = %d ' % APct

CPct = int((Ccount/Total) * 100)

print 'C percent = %d ' % Cpct

GPct = int(Gcount/Total) * 100)

primt 'G percent = %d ' % GPct

TPct = int((Tcount/Total) * 100)

print 'T percent = %d ' % TPct

Testing / Debugging

36
(37)

First, re-SAVE the program:

File->Save

Then RUN the program:

Run->Run Module

Then LOOK at the Python Shell Window:

If successful, the results are displayed

If unsuccessful, error messages will be

displayed

Let’s Test The Program

(38)

The program says that the composition is:

0%A, 0%C, 0%G, 0%T

The real answer should be:

20%A, 20%C, 20%G, 40%T

The problem is in the coding step:

Integer math is causing undesired rounding!

Testing/Debugging

(39)

seq="ACTGTCGTAT"

print seq

Acount= seq.count('A')

Ccount= seq.count('C')

Gcount= seq.count('G')

Tcount= seq.count('T')

Total =

float(

len(seq)

)

APct = int((Acount/Total) * 100)

print 'A percent = %d ' % APct

CPct = int((Ccount/Total) * 100)

print 'C percent = %d ' % CPct

GPct = int((Gcount/Total) * 100)

print 'G percent = %d ' % GPct

TPct = int((Tcount/Total) * 100)

print 'T percent = %d ' % TPct

Testing/Debugging

39
(40)

If the first line was changed to:

seq = “ACUGCUGUAU”

Would we get the desired result?

Let’s change the nucleic acid sequence from

DNA to RNA…

(41)

The program says that the composition is:

20%A, 20%C, 20%G, 0%T

The real answer should be:

20%A, 20%C, 20%G, 40%U

The problem is that we have not defined the problem

correctly!

We designed our code assuming input would be

DNA sequences

We fed the program RNA sequences

Testing/Debugging

(42)

What is the percentage composition of a nucleic

acid sequence

DNA sequences have four residues, A, C, G,

and T

In RNA sequences “U” is used in place of “T”

Percentage composition means the percentage

of the residues that make up of the sequence

Problem Identification

(43)

Print the sequence

Count characters to determine how many A, C, G,

T

and U’s

make up the sequence

Divide the individual

A,C,G

counts

and the sum of

T’s and U’s

by the length of the sequence and take

this result and multiply it by 100 to get the

percentage

Print the results

Algorithm Development

(44)

Testing/Debugging

44

seq="ACUGUCGUAU"

print seq

Acount= seq.count('A')

Ccount= seq.count('C')

Gcount= seq.count('G')

TUcount

= seq.count('T')

+ seq.count(‘U')

Total = float(len(seq))

APct = int((Acount/Total) * 100)

print 'A percent = %d ' % APct

CPct = int((Ccount/Total) * 100)

print 'C percent = %d ' % CPct

GPct = int((Gcount/Total) * 100)

print 'G percent = %d ' % GPct

TUPct

= int((

TUcount

/Total) * 100)

print 'T

/U

percent = %d ' %

TUPct

(45)

Extend your code to handle the nucleic acid

ambiguous sequence characters “N” and “X”

Extend your code to handle protein sequences

What’s Next

References

Related documents

(guidance for acceptable detailing of the junction between timber framed walls and external ground can also be found in Chapter 6.2 ‘External timber framed walls’)..

Apart from attending Gym Tonic sessions twice a week, she also attends Tango’s social activities such as a paper flower craft workshop, a smartphone photography workshop and a

The gendered nature of these negotiations was also complex, with the young men of the school project less likely to display support for feminist issues within

chaser of prescription drugs and devices in the US (due to new programmes such as the Medicare (the federal health- care programme for the aged and disabled) Part D pro- gramme

The Government’s 2007 decision to repeal section 67 of the Human Rights Act, which had shielded from scrutiny many actions taken by First Nations governments or the Federal

TG infection signi fi cantly improved biased swing, descent latency time, nociceptive behavior, and the Nissl-body distribution in striatal neurons in comparison to the PD

This Joint Use Agreement Agreement (the “Agreement”) is hereby entered into by and between the COACHELLA VALLEY UNIFIED SCHOOL DISTRICT , a school district organized and

Starting with QuickBooks 2007, the request processor will allow QuickBooks to close down as it should and gracefully handle the exception so the client SDK application continues