Introduction - Machine learning methods for detecting structure in metabolic flow networks

In Chapter 2 it was discussed how, under a number of reasonable assumptions, the metabolism of an organism can be modelled as a high-dimensional polytope containing all feasible flux assignments. The two main groups of methods for analysing this polytope are unbiased methods, which attempt to decompose the flow network in order to understand every possible flux, and biased methods, which attempt to find the single biologically correct solution.

However, the gap between biased and unbiased methods is significant and under-investigated. Organisms are able to react to many situations, so there is considerable merit in characterising the full range of phenotypes of which they are capable. On the other hand, we want to interpret these responses in a manner that is biased towards the most important and likely phenotypes, and to downplay detail that is less important. This more aggressive approach avoids the overly cautious answers given by fully unbiased methods, which only pin the correct answer down to somewhere within a few tens or hundreds of continuous dimensions.

The aim of this chapter is therefore to introduce a set of sampling techniques which fill this gap by finding sets of solutions that are each optimal, but for a set of models which are similar without being identical, for instance different perturbations of a single model.

The most important property of these solution sets is that they exhibit variance that makes them suitable for characterising flow network properties during later analysis. Success against this goal can primarily be evaluated through how well these datasets serve the other chapters. In addition, particular attention is paid to mimicking ‘biologically realistic’ variation. The only way of truly examining biologically realistic variation is to base the sampling on real biological datasets, which is touched on in section 3.2.5 but is outside the core scope of this chapter. However, even when the raw data is experimentally derived, transforming it into a usable model often requires extensive normalisation and data processing, and proving the correctness of this is often very difficult. For this reason the target of biological realism used in this chapter is mainly restricted to techniques which are biologically justifiable, in that they imitate a realistic biological mechanism. Many of the techniques here have strong parallels with existing unbiased techniques discussed in Chapter 2.

The final goal of this chapter is to demonstrate that these techniques can actually be implemented to provide data for use in demonstrating techniques in the other chapters. Since metabolic models typically have large numbers of variables, the required sample sizes were very large and, because the sampling relies on random perturbation, many modified models were needed. In addition, any kind of batch-based sampling requires that a large number of models is maintained and modified at any given time. These requirements meant that the actual implementation of software capable of supporting the sampling discussed in this chapter was a significant part of the challenge. This new software enabled the creation of what I believe to be the two largest metabolic network datasets of their kind, at approximately half a million samples each.

3.1.1 Overview

In order to characterise phenotypes that imitate nature in a useful and justifiable way, we need to design a distribution from which to draw them. This requires some bias towards more biologically beneficial solutions, but also methods to induce variation, so that a larger space is explored.

Previous efforts [148, 149] to explore the flux space have sometimes taken an approach of blind, uniform Monte Carlo sampling: they pick a set of fluxes, check whether it is feasible, and try again. This gives a very unbiased view of the shape of the flux space, but as described in section 2.4 there is no reason to expect a natural population to exhibit such a uniform distribution.
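The pick, check and retry loop of uniform sampling can be illustrated with a minimal sketch. The toy network, its bounds and all function names below are hypothetical and are not taken from [148, 149]. As a simplifying assumption, the steady-state constraint is satisfied by construction (the uptake flux is derived from the two freely sampled fluxes), so only the bounds need to be checked.

```python
import random

# Hypothetical toy network with one metabolite M and three reactions:
#   v_in: -> M      v_a: M ->      v_b: M ->
# Steady state for M reduces to v_in = v_a + v_b.
LB = {"v_in": 0.0, "v_a": 0.0, "v_b": 0.0}
UB = {"v_in": 10.0, "v_a": 8.0, "v_b": 8.0}


def propose(rng):
    """Pick a set of fluxes: draw the free fluxes uniformly, derive v_in."""
    v_a = rng.uniform(LB["v_a"], UB["v_a"])
    v_b = rng.uniform(LB["v_b"], UB["v_b"])
    return {"v_in": v_a + v_b, "v_a": v_a, "v_b": v_b}


def is_feasible(x):
    """Check that every flux lies within its lower and upper bound."""
    return all(LB[name] <= value <= UB[name] for name, value in x.items())


def rejection_sample(n, seed=0):
    """Keep proposing until n feasible flux sets have been accepted."""
    rng = random.Random(seed)
    accepted, tried = [], 0
    while len(accepted) < n:
        tried += 1
        x = propose(rng)
        if is_feasible(x):
            accepted.append(x)
    return accepted, tried


samples, tried = rejection_sample(50)
print(f"accepted {len(samples)} of {tried} proposals")
```

Every rejected proposal is wasted work, and in a model with many reactions the rejected fraction can become very large, which is one practical motivation for the optimisation-based approach.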

To increase the realism of the sample, a better approach is therefore to use Flux Balance Analysis to find a solution that is on the surface of the flux polytope for a given set of constraints. Doing so assumes some degree of biological optimality, which is a reasonable assumption [150], and it also has a performance advantage, since it allows us to avoid checking infeasible fluxes.

• Section 3.2 describes a selection of ways in which metabolic models can be modified to approximate real world variation. This defines the sampling space.

• Section 3.3 describes sampling strategies that can be used to define the draw distribution within the sampling space.

• Section 3.4 describes some concrete examples of pairings of techniques from section 3.2 and section 3.3.

Expressed mathematically, the aim of this chapter is to sample vectors of reaction rates x which, for a system with r reactions and m metabolites, where

lb := the vector of lower bounds, length r, (3.1)

ub := the vector of upper bounds, length r, (3.2)

c := the objective function vector, length r, (3.3)

A := the stoichiometric matrix, with m rows and r columns, (3.4)

satisfy the optimisation problem

Maximise x · c, (3.6)

subject to lb ≤ x ≤ ub, (3.7)

and Ax = 0 (3.8)

as previously described in section 2.8.3.1.
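As a minimal sketch of how this problem can be posed to a general-purpose linear programming solver, the following uses `scipy.optimize.linprog`. The four-reaction toy network, its bounds and the helper name `fba` are hypothetical illustrations rather than the software described in this chapter. Since `linprog` minimises, the objective vector is negated.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network (m = 2 metabolites, r = 4 reactions):
#   R1: -> M1    R2: M1 -> M2    R3: M2 -> (objective)    R4: M1 -> (by-product)
A = np.array([[1.0, -1.0, 0.0, -1.0],   # row for M1
              [0.0, 1.0, -1.0, 0.0]])   # row for M2
LB = [0.0, 0.0, 0.0, 0.0]
UB = [10.0, 6.0, 100.0, 100.0]
C = [0.0, 0.0, 1.0, 0.0]                # maximise the flux through R3


def fba(A, lb, ub, c):
    """Maximise x . c subject to A x = 0 and lb <= x <= ub."""
    A = np.asarray(A, dtype=float)
    res = linprog(
        -np.asarray(c, dtype=float),    # linprog minimises, so negate c
        A_eq=A,
        b_eq=np.zeros(A.shape[0]),      # steady state: A x = 0
        bounds=list(zip(lb, ub)),
        method="highs",
    )
    if not res.success:
        raise ValueError(f"no optimal solution found: {res.message}")
    return res.x, -res.fun


fluxes, objective = fba(A, LB, UB, C)
print("optimal fluxes:", fluxes)
print("objective value:", objective)
```

In this toy network the optimal objective value is fixed by the bound on R2, but the by-product flux R4 can still vary without changing it, so the optimal flux vector is not unique. Which of these alternative optima the solver returns depends on the solver.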

Variation induction, in section 3.2, describes methods of generating variance for this sampling, either by exploiting the typically underdetermined nature of this optimisation problem, or by modifying the problem itself: the upper and lower bounds, lb and ub, the objective function, c, or the network structure, A. Sampling priors, in section 3.3, describes the data sources that can be used to choose the values for these modifications in a way that is as biologically justifiable as possible. This means either attempting to infer values from experimental sources, or simulating biological processes. Finally, the case studies in section 3.4 give specific instantiations of these sampling techniques, the results of which are used in later chapters.
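The idea of inducing variation by modifying the problem can be sketched as follows, under loudly stated assumptions: only the upper bounds ub are perturbed, the perturbation is an independent uniform scaling factor per reaction, and the toy network and the function name `sample_perturbed_optima` are hypothetical. None of these choices is claimed to match the specific techniques of sections 3.2 to 3.4; the sketch only shows the general pattern of perturbing a model, re-optimising, and collecting the optimal flux vectors.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network (2 metabolites, 4 reactions):
#   R1: -> M1    R2: M1 -> M2    R3: M2 -> (objective)    R4: M1 -> (by-product)
A = np.array([[1.0, -1.0, 0.0, -1.0],
              [0.0, 1.0, -1.0, 0.0]])
LB = np.zeros(4)
UB = np.array([10.0, 6.0, 100.0, 100.0])
C = np.array([0.0, 0.0, 1.0, 0.0])


def sample_perturbed_optima(n, spread=0.5, seed=0):
    """Solve n randomly perturbed copies of the model.

    Assumption: each upper bound is scaled by an independent factor drawn
    uniformly from [1 - spread, 1 + spread]; this prior is illustrative only.
    """
    rng = np.random.default_rng(seed)
    fluxes, bounds_used = [], []
    for _ in range(n):
        ub = UB * rng.uniform(1.0 - spread, 1.0 + spread, size=UB.size)
        res = linprog(-C, A_eq=A, b_eq=np.zeros(A.shape[0]),
                      bounds=list(zip(LB, ub)), method="highs")
        if res.success:                 # skip any infeasible perturbation
            fluxes.append(res.x)
            bounds_used.append(ub)
    return np.array(fluxes), np.array(bounds_used)


samples, ubs = sample_perturbed_optima(20)
print("sample matrix shape:", samples.shape)
print("spread of the objective flux:", samples[:, 2].std())
```

Each row of `samples` is optimal for its own perturbed model, so the set as a whole varies even though every member lies on the surface of a feasible polytope.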
