

3.3 Discussion

Understanding the mechanisms that control gene expression requires tools that can interrogate non-coding sequences containing the transcriptional regulatory code. APPLES provides a programming environment that allows users to perform multiple analyses using these sequences. The functionality described is particularly pertinent to the data analysis requirements of systems biology projects, such as the PRESTA project, which generate a wealth of expression data that provide a rich source of information for guiding sequence analysis. The pattern matching methods in APPLES facilitate these studies by providing a set of techniques that can be used to identify statistically significant sequence patterns that may contribute to regulating gene expression. Many excellent sequence analysis tools exist, but they are often released as stand-alone applications that cannot directly interface with other relevant utilities. This lack of interconnectivity makes it hard to combine several methods into a pipeline that meets all the requirements of a typical sequence analysis. For instance, the MEME suite (Bailey et al., 2006) offers a collection of some of the most popular sequence analysis tools, but combining these into a workflow is not straightforward. In contrast to sets of tools that are difficult to fit together in a workflow, the OO approach allows the use of multiple tools that are connected together with common sets of objects. OOP developments that are specifically geared towards sequence analysis are not common and often lack some of the methods needed to analyse clusters of sequences. For example, the TFBS (Lenhard and Wasserman, 2002) toolkit is an OOP approach for the analysis of non-coding sequence, yet it lacks some methods covered by APPLES. Also, unlike APPLES, TFBS is written using the old-style Perl OO system, which is more low level than MooseX, and therefore more difficult for novice programmers to use.

Figure 3.4: APPLES workflow for promoter analysis. Object orientated design allows users to link APPLES objects to generate workflows. The figure demonstrates the relationships between objects (blue rounded boxes) and methods (red boxes) that could link together in order to perform analysis of non-coding sequences. Arrows indicate links between methods and objects. The output from a method is sometimes an object that can be used as input to another method. A typical analysis of non-coding sequences will begin by assembling the sequences corresponding to the genes of interest. These sequences can be defined in a fasta file or retrieved from a genome database (DB) such as Ensembl. For a given set of orthologous sequences, evolutionarily conserved regions can be identified within APPLES. All sequences are defined as a Genomic Interval Set object, irrespective of origin or context, providing a common representation that can be used by any APPLES method that probes the sequence architecture. A set of sequences can then be used by a pattern matching model as input for de novo motif finding. Pattern matching model objects assess the statistical enrichment of a weight matrix within a sequence or set of sequences. The weight matrix that defines a pattern matching model is either user-defined, retrieved from the PSSM DB or derived from motif discovery programs. Weight matrix clustering can be used to cluster a group of matrices discovered using motif finding programs or any other set of weight matrix objects.

Figure 3.5: Sample script that uses APPLES objects. The script assesses the enrichment for the PIF3 binding site (Transfac record, M00434) within a set of gene clusters using the hypergeometric test. Parameters include the length of promoters to test and the value used to threshold the binomial scores for each promoter sequence. Any parameters such as promoter length and PSSM to test can be changed to assess the impact of this on results.
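The cluster enrichment test described in the caption of Figure 3.5 can be illustrated with a minimal sketch. It is written in Python for brevity and is not the APPLES Perl API: the function names, the dictionary layout and the assumption that each promoter carries a binomial p-value (smaller meaning a stronger match) are all hypothetical choices made for illustration.

```python
from math import comb


def hypergeom_tail(k, N, K, n):
    """P(X >= k) for n genes drawn without replacement from N genes,
    of which K carry the motif (upper tail of the hypergeometric test)."""
    upper = min(K, n)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def cluster_enrichment(promoter_pvalues, clusters, threshold):
    """Test each gene cluster for enrichment of motif-containing promoters.

    promoter_pvalues: gene -> binomial p-value for the motif in that promoter
                      (hypothetical input; assumed already computed).
    clusters:         cluster name -> iterable of gene identifiers.
    threshold:        promoters with p-value <= threshold count as hits.
    """
    background = set(promoter_pvalues)
    hits = {g for g, p in promoter_pvalues.items() if p <= threshold}
    results = {}
    for name, genes in clusters.items():
        # Only genes with a scored promoter can contribute to the test.
        members = set(genes) & background
        k = len(members & hits)
        results[name] = hypergeom_tail(k, len(background), len(hits), len(members))
    return results


# Toy data: 10 genes, 4 of which pass the binomial threshold.
scores = {f"g{i}": (0.001 if i < 4 else 0.5) for i in range(10)}
example = cluster_enrichment(
    scores, {"c1": ["g0", "g1", "g5"], "c2": ["g6", "g7"]}, threshold=0.01
)
```

Changing `threshold`, or the promoter length used when the per-promoter scores are computed, alters the set of hits and therefore the enrichment p-values, which is the kind of parameter exploration the caption refers to.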

3.3.1 Functional decomposition

A key feature of the APPLES software concept is the principled design of objects to represent and describe biological features that are permanent in nature. There will always be biological sequences, regulatory sequences, and sequence patterns, and because the methods are built around these entities the system is more robust and extendable. Modelling biological entities in a generalised manner will facilitate reuse of code and methods in the future, as the generality allows other users to extend current object models, if needed, to meet new requirements. The high degree of re-usability allows other scientists to start asking questions immediately, and as code need not be built from scratch, productivity is enhanced. Because care was taken with the initial design, the software will serve to answer not only questions associated with the current project but also future challenges, allowing users to build on past developments. This flexibility is important in order to exploit data from emerging technologies such as high-throughput sequencing, which will result in more sequenced genomes and expression data. For example, plant genomes are being sequenced so rapidly that, for a given gene, orthologs will eventually be available in enough genomes for motif finding to be performed using the promoters of these orthologs.
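The idea of a generalised sequence object that other users can extend may be sketched as follows. This is an illustrative Python analogue, not APPLES code (which is written in Perl with MooseX): the class names echo the Genomic Interval Set object, but `PromoterSet`, `from_fasta_text`, `gc_content` and `truncated` are hypothetical names, and the sketch assumes each sequence is stored with its gene-proximal end last.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GenomicInterval:
    label: str
    sequence: str


@dataclass
class GenomicIntervalSet:
    """Common representation for sequences, whatever their origin."""
    intervals: list = field(default_factory=list)

    @classmethod
    def from_fasta_text(cls, text):
        """Build a set from FASTA-formatted text."""
        intervals, label, chunks = [], None, []
        for line in text.splitlines():
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if label is not None:
                    intervals.append(GenomicInterval(label, "".join(chunks)))
                # Use the first word of the header as the label.
                label = (line[1:].split() or ["unnamed"])[0]
                chunks = []
            else:
                chunks.append(line.upper())
        if label is not None:
            intervals.append(GenomicInterval(label, "".join(chunks)))
        return cls(intervals)

    def gc_content(self):
        """A generic method: works on any subclass without modification."""
        total = sum(len(i.sequence) for i in self.intervals)
        gc = sum(i.sequence.count("G") + i.sequence.count("C") for i in self.intervals)
        return gc / total if total else 0.0


class PromoterSet(GenomicIntervalSet):
    """A user extension that adds a promoter-specific operation."""

    def truncated(self, length):
        """Keep only the last `length` bases of each promoter."""
        if length <= 0:
            raise ValueError("length must be positive")
        return PromoterSet(
            [GenomicInterval(i.label, i.sequence[-length:]) for i in self.intervals]
        )


promoters = PromoterSet.from_fasta_text(">g1 example\nACGT\nGG\n>g2\nTTTT\n")
short = promoters.truncated(2)
```

Because `PromoterSet` inherits from the general class, methods written against the general representation, such as `gc_content`, continue to work on the extended object.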

3.3.2 Future development

As with any software development, there are always improvements to be made to functionality. The biggest area of further development is the incorporation of more published pattern matching and motif finding tools in the APPLES framework. For example, the ability to invoke more de novo motif finding algorithms would be of great advantage, as the use of multiple methods can improve the identification of relevant motifs (Harbison et al., 2004). The bioinformatics field is littered with motif discovery programs; the more popular of these, which are not currently supported, include AlignAce (Hughes et al., 2000), Consensus (Hertz and Stormo, 1999) and Weeder (Pavesi et al., 2004).

Elaborate gene expression programs are likely to be orchestrated by the concerted action of multiple different motifs, rather than just one class of sequence patterns. A natural extension to current functionality would be to develop methods that can identify combinations of sequence motifs, associated within a set of sequences. This problem might be approached by identifying patterns in a de novo manner, or by examining combinatorial patterns of known TF binding site motifs within a regulatory sequence.
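One simple way to examine combinations of known motifs is to ask whether two motifs occur in the same sequences more often than expected by chance. The Python sketch below illustrates this idea only; it is not an existing APPLES method. The input layout (motif name mapped to the set of sequences in which it was detected) and the function names are hypothetical, and the hypergeometric test is one possible choice of statistic.

```python
from itertools import combinations
from math import comb


def overlap_pvalue(overlap, n_total, n_a, n_b):
    """P(overlap or more shared sequences) if motif B's n_b sequences were
    drawn at random from n_total sequences, n_a of which contain motif A."""
    upper = min(n_a, n_b)
    return sum(
        comb(n_a, i) * comb(n_total - n_a, n_b - i)
        for i in range(overlap, upper + 1)
    ) / comb(n_total, n_b)


def motif_pair_cooccurrence(motif_hits, n_sequences):
    """Score every motif pair for co-occurrence.

    motif_hits:  motif name -> set of sequence identifiers containing it
                 (hypothetical input; assumed to come from prior scanning).
    n_sequences: total number of sequences that were scanned.
    Returns (motif_a, motif_b, shared_count, p_value) tuples, smallest p first.
    """
    results = []
    for a, b in combinations(sorted(motif_hits), 2):
        shared = motif_hits[a] & motif_hits[b]
        p = overlap_pvalue(len(shared), n_sequences, len(motif_hits[a]), len(motif_hits[b]))
        results.append((a, b, len(shared), p))
    return sorted(results, key=lambda r: r[3])


# Toy data: three motifs detected across 10 scanned sequences.
pairs = motif_pair_cooccurrence(
    {"A": {"s0", "s1", "s2", "s3"}, "B": {"s0", "s1", "s5"}, "C": {"s8"}},
    n_sequences=10,
)
```

With many motifs the number of pairs grows quadratically, so a real analysis would also need to correct the resulting p-values for multiple testing.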

The key principle behind the design of APPLES was that it should be easy to use and develop, which is why it was extended using Perl MooseX. Perl is one of the easiest languages for a novice to learn and is frequently used in biological research. The OO behaviour is implemented through MooseX, which is high-level compared with the OO systems of other languages. This ease of use does come at a cost, however, because the resulting code runs much more slowly than code written in other languages; this performance deficit could be reduced by writing CPU-intensive tasks in faster languages.
