Part III | Results
3.3 Results | Generating values for computer simulations
Generating values for computer simulations is a non-trivial task that is often misunderstood by non-statisticians. Choosing a proper statistical measure to describe a phenotype is not easy either, and there is no perfect protocol for the statistical analysis of any living system. Moreover, many software packages designed for medical and biological research do not provide access to their source code or lack comprehensive documentation of their procedures, and thus become a ‘black box’ even when used by a professional statistician. To illustrate how the uninformed use of statistical methods can affect experimental conclusions, we applied three different statistical methods to simulate fungi as a living system and visualized the outcomes. To this end, we ran computer simulations of the filamentous fungus Neurospora crassa growing on agar. We used three different ‘statistical modes’ and compared both the numerical and the visual outcomes.
In the first simulation mode, parameter values are drawn at random from the actual numerical data that have been described previously in the thesis and are available under the following hyperlinks:
1) MATLAB numerical arrays used for the simulations of Neurospora crassa
2) Raw numerical data with associated movies introducing Neurospora crassa growing on agar
I used the following names for the numerical vectors: x1 (velocities, parents), x2 (velocities, daughters and further generations), x3a (angles for the branches sent on the right side of the parent hyphae), x3b (angles for the branches sent on the left side of the parent hyphae), x4a (angles for the branches sent on the right side of the daughters and further generation hyphae), x4b (angles for the branches sent on the left side of the daughters and further generation hyphae), x5 (branching distances, parents), x6 (branching distances, daughters and further generation hyphae).
To draw parameter values at random from the collected numerical data, I wrote the following MATLAB function:
The function introduced in the Equation 17 panel returns the value of the relevant parameter (apical extension velocity, branching angle, or branching distance) from the list of actual measured values. An array is an n-dimensional rectangular container for storing arbitrary data. Its indices must be positive integers, and the number of its entries is fixed. The MATLAB® function ‘randi’, called with a single argument imax, returns a pseudorandom integer drawn from the discrete uniform distribution on the interval [1, imax], while ‘numel’ returns the number of elements, n, in the array x. By combining these two functions, I obtained the desired result: a pseudorandom index into the array and, through it, an actual measured value. In other words, I picked values pseudo-randomly from the existing arrays (velocities, branching angles, branching distances). An example of the function's implementation in the code is given in Equation 18.
The procedure applied for generating actual data in the fungi simulation program is illustrated in Figure 33.
$$\text{parameter} = x\big(\mathrm{randi}(\mathrm{numel}(x))\big)$$
where:
parameter is velocity, angle, or distance;
x is a numerical array x1, x2, x3a, x3b, x4a, x4b, x5, or x6;
randi is a MATLAB function (withdrawing an index at random);
numel is a MATLAB function (returning the number of elements in the numerical set).
Equation 17 Withdrawing at random the actual numbers from pre-defined numerical array
Equation 18 An example of implementing the MATLAB function for generating actual data
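The resampling step described above can be sketched as follows. This is a minimal illustration written in Python rather than the MATLAB used in the thesis; the function name `draw_measured_value` and the numbers in `x1` are hypothetical placeholders, not the thesis code or data.

```python
import random


def draw_measured_value(values, rng=random):
    """Return one element of `values`, chosen uniformly at random."""
    # randrange(len(values)) plays the role of randi(numel(x)),
    # except that Python indices start at 0 rather than 1.
    index = rng.randrange(len(values))
    return values[index]


# Hypothetical measurements standing in for an array such as x1
x1 = [0.21, 0.25, 0.19, 0.30, 0.27]

rng = random.Random(42)  # fixed seed so the run is reproducible
samples = [draw_measured_value(x1, rng) for _ in range(1000)]
```

Every simulated value produced this way is one of the measured values, so this mode never generates a number that was not observed in the laboratory.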
The second statistical method used for determining the stochastic components of the Neurospora crassa living system (apical extension velocities, branching angles, and branching distances) simply draws numbers from the most common parametric distribution family, the Gaussian distribution, described in the Review section of the thesis. I let MATLAB construct a Gaussian distribution from the given sets of data (MATLAB numerical arrays used for the simulations of Neurospora crassa), calculated the means and standard deviations, passed them to the program, and finally drew values pseudo-randomly from the resulting Gaussian distribution. The procedure is given by Equation 19 below.
$$\text{parameter} = \mathrm{random}(\text{'Normal'},\ \text{mean},\ \text{std\_deviation})$$
where:
parameter is velocity, angle, or distance;
random is a function that generates numbers from a statistical distribution;
'Normal' is the name of the statistical distribution;
mean is the mean value of a given parameter;
std_deviation is the standard deviation value.
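The Gaussian mode can be sketched in the same spirit. The example below is written in Python rather than MATLAB and is only an illustration: the helper names `fit_gaussian` and `draw_gaussian_value` and the angle values in `x3a` are assumptions made for this sketch.

```python
import random
import statistics


def fit_gaussian(values):
    """Return the sample mean and sample standard deviation (n - 1 denominator)."""
    return statistics.mean(values), statistics.stdev(values)


def draw_gaussian_value(mean, std, rng=random):
    """Draw one pseudorandom value from a Gaussian with the given mean and std."""
    return rng.gauss(mean, std)


# Hypothetical branching angles standing in for an array such as x3a
x3a = [58.0, 62.5, 71.0, 66.0, 60.5, 69.0]

mean, std = fit_gaussian(x3a)
rng = random.Random(0)  # fixed seed so the run is reproducible
samples = [draw_gaussian_value(mean, std, rng) for _ in range(10000)]
```

Unlike the first mode, this one can return values outside the measured range, because a Gaussian distribution has unbounded tails.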
Figure 33 Generating actual data values in the MATLAB fungi simulation program
An example of implementation of the above method in the MATLAB code is given in Equation 20.
The third, most sophisticated and complex method used for generating simulation values is the one based on the kernel estimation of the cumulative distribution function of a given parameter (apical extension velocity, branching angle, or branching distance). In this method, the probabilities of occurrence of specific values are found using non-parametric estimates. The fundamental concepts associated with this method are illustrated in Figure 34.
The fundamental concepts underlying computational statistical inference involve the Probability Distribution Function (PDF), the Cumulative Distribution Function (CDF), and the Inverse (Transform) of the Cumulative Distribution Function (ICDF). The methodology proposed in this thesis applies the PDF, CDF, and ICDF in sequence. The established procedure produces an estimate of the real-world data linked to the growth patterns of the entire population of Neurospora crassa. The ICDF constructed in this study is based entirely on the collected laboratory data and uses a kernel estimator to approximate the ICDF directly.
Equation 20 Generating angle values by withdrawing numbers at random from Gaussian distribution based on the measurements of filamentous fungus, Neurospora crassa
In biostatistics, a random (stochastic) variable is denoted by an upper-case letter, e.g. X. A random variable is defined as a variable whose value varies randomly according to a specific probability distribution (Moore et al., 2012). A random variable may be discrete or continuous. A discrete random variable can take only a countable number of values, while a continuous random variable can take any value in a whole interval of numbers (Zwillinger, 1957, 2000, Khasminskii, 1992). In the practice of biostatistics, the probability associated with a value of a discrete random variable is given as a numerical (discrete) value. For a continuous random variable, however, the probability is given as the area under the probability density function, calculated from minus infinity to x (Zwillinger, 1957, 2000, Khasminskii, 1992).
Figure 34 Illustration of the statistical methodology used for producing real-world data based on the sample (laboratory) data
The cumulative distribution function defines the probability of a random variable X being less than or equal to x and is described using slightly different mathematical notation for the discrete and the continuous case (Zwillinger, 1957, reprinted 2000).
If the probability (mass) function p(x) of a discrete random variable X has a value specified for every numerical value x such that (Zwillinger, 1957, reprinted 2000):
$$
\begin{aligned}
(a)\quad & p(x) = \mathrm{Prob}[X = x] \\
(b)\quad & p(x) \ge 0 \\
(c)\quad & \sum_{x} p(x) = 1
\end{aligned}
$$
Equation 21 Defining Cumulative Distribution Function for a Discrete Random Variable - Assumptions
then the cumulative distribution function F(x) of the discrete random variable X is defined for every number x in the following way (Zwillinger, 1957, reprinted 2000):
$$F(x) = \mathrm{Prob}[X \le x] = \sum_{y \,\mid\, y \le x} p(y)$$
Equation 22 Defining Cumulative Distribution Function for a Discrete Random Variable – Mathematical Definition
and the CDF defined that way has the following mathematical properties (Zwillinger, 1957, reprinted 2000):
$$
\begin{aligned}
(a)\quad & \lim_{x \to -\infty} F(x) = 0 \\
(b)\quad & \lim_{x \to \infty} F(x) = 1 \\
(c)\quad & \text{if } a \text{ and } b \text{ are real numbers and } a < b, \text{ then } F(a) \le F(b) \\
(d)\quad & \mathrm{Prob}[a \le X \le b] = \mathrm{Prob}[X \le b] - \mathrm{Prob}[X < a] = F(b) - F(a^{-})
\end{aligned}
$$
*whereat $a^{-}$ is the first value below a that is assumed by the random variable X, i.e. the largest attainable value less than a
**expression (d) is valid for real numbers a and b with a < b
Equation 23 Cumulative Distribution Function for a Discrete Random Variable – Mathematical Properties
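As a small worked illustration of the discrete case, the sketch below builds the cumulative distribution function of a fair six-sided die by adding up the probability mass function. It is written in Python with exact fractions; the die is a hypothetical example chosen for this sketch and is not thesis data.

```python
from fractions import Fraction
from itertools import accumulate

# Hypothetical example: a fair six-sided die
values = [1, 2, 3, 4, 5, 6]
pmf = [Fraction(1, 6)] * 6  # every p(x) >= 0 and the p(x) sum to 1

# Running sums of the pmf give the CDF at each attainable value
cdf = list(accumulate(pmf))


def F(x):
    """Cumulative distribution function: Prob[X <= x]."""
    return sum((p for v, p in zip(values, pmf) if v <= x), Fraction(0))


# Prob[3 <= X <= 5] = F(5) - F(2), because 2 is the largest attainable value below 3
prob_3_to_5 = F(5) - F(2)  # equals 1/2 (outcomes 3, 4 and 5)
```

The function F is 0 below the smallest value, 1 from the largest value onward, and never decreases in between.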
If the probability density function f(x) of a continuous random variable X has a value specified for every numerical value x such that (Zwillinger, 1957, reprinted 2000):
$$
\begin{aligned}
(a)\quad & \mathrm{Prob}[a \le X \le b] = \int_{a}^{b} f(x)\,dx, \quad \text{whereat } a, b \in \mathbb{R} \text{ and } a \le b \\
(b)\quad & f(x) \ge 0 \quad \text{for all } x \\
(c)\quad & \int_{-\infty}^{\infty} f(x)\,dx = 1 \\
(d)\quad & \mathrm{Prob}[X = c] = 0 \quad \text{for any } c
\end{aligned}
$$
Equation 24 Defining Cumulative Distribution Function for a Continuous Random Variable - Assumptions
then the cumulative distribution function F(x) of the continuous random variable X is defined for every number x in the following way (Zwillinger, 1957, reprinted 2000):
$$F(x) = \mathrm{Prob}[X \le x] = \int_{-\infty}^{x} f(y)\,dy$$
Equation 25 Defining Cumulative Distribution Function for a Continuous Random Variable - Mathematical Definition
and the CDF defined that way has the following mathematical properties (Zwillinger, 1957, reprinted 2000):
$$
\begin{aligned}
(a)\quad & \lim_{x \to -\infty} F(x) = 0 \\
(b)\quad & \lim_{x \to \infty} F(x) = 1 \\
(c)\quad & \text{if } a, b \in (-\infty, \infty) \text{ and } a < b, \text{ then } F(a) \le F(b) \\
(d)\quad & \mathrm{Prob}[a < X \le b] = \mathrm{Prob}[X \le b] - \mathrm{Prob}[X \le a] = F(b) - F(a), \quad \text{whereat } a, b \in (-\infty, \infty) \text{ and } a < b \\
(e)\quad & \text{the probability density function can be found from the cdf: } f(x) = \frac{dF(x)}{dx} \ \text{(if the derivative exists)}
\end{aligned}
$$
Equation 26 Cumulative Distribution Function for a Continuous Random Variable – Mathematical Properties
The inverse transform is a method of sampling pseudorandom numbers, i.e. numbers that imitate random numbers. Pseudorandom numbers are not entirely random, as they are partially determined by initial sets of values known as seeds (Luby, 1996). Pseudorandom numbers are used in general computational practice for their reproducibility and speed.
Here, the inverse transform method is used to generate numbers pseudo-randomly from the distributions describing the growth of Neurospora crassa. Number generation is done through the cumulative distribution functions of these distributions (Figure 34). The main idea behind the inverse transform method is to sample uniformly a pseudorandom number that falls into the interval between 0 and 1. Even sampling at random still means sampling according to some specified distribution. Sampling uniformly at random therefore means, in practice, drawing a sample from a uniform distribution, in which every element is equally probable. For example, if the distribution consists of 6 discrete numbers, the probability of drawing each element is 1/6.
In the case of a non-uniform distribution, these probabilities might not be equal, e.g. the first two elements might each be drawn with a probability of 1/3 and the remaining four with a probability of 1/12 each, so that the probabilities still sum to 1. In computational practice, the inverse transform method means computing the cumulative distribution function and inverting it (Figure 34). It is a straightforward method, especially for a discrete random variable, because the individual probabilities are simply added up to compute the cumulative distribution function (Figure 34, left panel; Equations 21-23). For a continuous random variable, however, the probability density function needs to be integrated (Equations 24-26), which is why the method is often considered computationally inefficient. It is still widely used, though, as it allows universal sampling solutions to be built for real-world problems, and real-world applications of statistics have been the primary objective of this thesis.
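For the discrete case, the idea of adding up probabilities and inverting the result can be sketched as below. The example is written in Python and uses an illustrative six-element distribution with probabilities of 1/3 for the first two elements and 1/12 for each of the remaining four, chosen so that they sum to 1. The element labels and the function name `sample_element` are hypothetical.

```python
import bisect
import random
from fractions import Fraction
from itertools import accumulate

elements = ["a", "b", "c", "d", "e", "f"]
probabilities = [Fraction(1, 3), Fraction(1, 3)] + [Fraction(1, 12)] * 4

# Adding up the individual probabilities gives the CDF: 1/3, 2/3, 3/4, 5/6, 11/12, 1
cdf = list(accumulate(probabilities))


def sample_element(u):
    """Map a uniform number u in [0, 1) to the first element whose CDF value exceeds u."""
    return elements[bisect.bisect_right(cdf, u)]


rng = random.Random(3)  # fixed seed so the run is reproducible
draws = [sample_element(rng.random()) for _ in range(60000)]
share_of_a = draws.count("a") / len(draws)  # should be close to 1/3
```

Each element occupies a slice of the interval [0, 1) whose length equals its probability, so a uniformly sampled u falls into that slice with exactly that probability.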
If X is a random variable, and its distribution is characterized by the cumulative distribution function F, then the problem that inverse transform solves is generating random values of X according to this distribution. Therefore, in computational practice, the inverse transform method starts with generating a random number u from the standard uniform distribution whose range is between 0 and 1. Then, the value x is computed so that F(x) = u. Subsequently, x is assumed to be the random number drawn from the distribution given by F. This process is known as “generating nonuniform random variables” (Zwillinger, 1957, 2000, Khasminskii, 1992).
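The two steps described above (draw u, then find x with F(x) = u) can be sketched for a continuous distribution as follows. This Python example is an illustration under stated assumptions: the CDF is inverted numerically by bisection, the Gaussian with mean 64.5 and standard deviation 5 is a hypothetical stand-in for F, and the function name `invert_cdf` is introduced only for this sketch.

```python
import random
from statistics import NormalDist


def invert_cdf(cdf, u, lo, hi, tol=1e-9):
    """Find x in [lo, hi] with cdf(x) close to u by bisection.

    Assumes cdf is non-decreasing and that cdf(lo) <= u <= cdf(hi).
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(mid) < u:
            lo = mid  # the solution lies to the right of mid
        else:
            hi = mid  # the solution lies at or to the left of mid
    return 0.5 * (lo + hi)


# Hypothetical distribution standing in for F
dist = NormalDist(mu=64.5, sigma=5.0)

rng = random.Random(1)  # fixed seed so the run is reproducible
u = rng.random()        # step 1: uniform number in [0, 1)
x = invert_cdf(dist.cdf, u, lo=dist.mean - 50.0, hi=dist.mean + 50.0)  # step 2: solve F(x) = u
```

Bisection only needs the CDF to be evaluable and monotone, which is why the same approach also works when no closed-form inverse is available.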
The mathematical definition of the Inverse Transform Method is as follows:
If X has probability density function f(x) and its cumulative distribution function is given by:
$$F(x) = \int_{-\infty}^{x} f(u)\,du$$
Equation 27 Cumulative Distribution Function for random variable X
and if Y is uniformly distributed on the interval [0,1):
$$Y \sim U[0,1)$$
Equation 28 Defining Inverse Transform – Fulfilling Uniformity Condition
then the inverse transform method converts uniform random variables into ones from the alternative distribution in the following way:
$$X = F^{-1}(Y)$$
Equation 29 Inverse Transform of Cumulative Distribution Function
and X has probability density function f(x).
In practice, accurate and precise statistical methods are crucial for the description of a phenotype, and thus also for professional computer simulations of biological species. Generating data that imitate real-world measurements always involves specifying a random variable that follows some probability distribution. Probability Mass Functions, Probability Density Functions (PDFs), and Cumulative Distribution Functions (CDFs) are essential for describing how data are distributed. Importantly, there is often an unspoken assumption when making statistical inferences about discrete uniform distributions whose mass points are associated with the observed values of a random sample. This implicit assumption is that the observed values at the mass points are the hotspots that indicate the direction of further data generation. In that respect, the mass distribution function of a discrete random variable becomes a dynamic adaptive model for generating simulation values.
Most of the modern statistical and machine learning software available on the market uses functions for generating pseudorandom numbers that are based on various parametric distribution families. In some cases, however, using these methods without a deeper understanding is a source of scientific bias: it leads to incorrect assessments of variability and to missing many important data patterns in the populations. Statistical errors of this kind can have serious consequences not only for real-world laboratory data but also for the human economy, e.g. wrong allocation of resources, incorrect information processing, and thus incorrect perception of the measurement results.
Improving statistical measurement methods has been mentioned by the WHO as one of the biggest challenges in Public Health in the post-2015 era, especially in the field of non-communicable diseases (LSTM Leverhulme Lecture 2015).
Correct application of statistical methods when describing a phenotype is of crucial importance, especially when there is a need to generate values that mimic the collected data in detail. In the study presented here, a kernel estimate is used to approximate the Cumulative Distribution Function of the collected data. The inversion method is then applied to the CDF constructed that way, converting uniform random values into simulation values that follow the estimated distribution. The kernel estimate also allows the CDF to be approximated at points other than the measured ones.
Simulation values in the ‘kernel mode’ were generated according to the procedure given in the MATLAB tutorial (Kernel Estimate for Custom Distributions). The function ‘ksdensity’, shown below, was used to estimate the inverse of the CDF in the fungi simulation program.
| Label | Description |
|---|---|
| r | estimated values of the predefined function; the values are returned as a vector of the same dimension as x |
| ksdensity | kernel density estimator – MATLAB function |
| x | sample data that is returned as a column vector |
| u | uniformly sampled pseudorandom number that falls into the interval [0,1) |
| ‘function’ | function to be estimated; in MATLAB (version R2013a) it can be probability density function, cumulative distribution function, inverse cumulative distribution function, survivor function, or cumulative hazard function |
| ‘icdf’ | the inverse of the cumulative distribution function |
| ‘width’, .35 | the bandwidth of the kernel-smoothing window defined as a function of the number of points in x; this property is given as the comma-separated pair consisting of ‘width’ and a scalar, which regulates the amount of smoothing |

Table 7 Kernel estimation of the inverse of the Cumulative Distribution Function (CDF) - description of the properties included in the MATLAB procedure given in Equation 30
r = ksdensity(x, u, 'function', 'icdf', 'width', .35)
Equation 30 Generating values for the fungi simulation program by creating custom distributions
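The idea behind the kernel mode can be sketched as follows. This Python example is not the implementation of MATLAB's ‘ksdensity’; it is a minimal illustration that assumes a Gaussian kernel, smooths the empirical CDF with a fixed bandwidth, and inverts the result by bisection. The function names `kernel_cdf` and `kernel_icdf` and the angle values in `x3a` are hypothetical.

```python
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used as the smoothing kernel


def kernel_cdf(x, data, width):
    """Kernel estimate of the CDF: the average of Gaussian CDFs centred on the data points."""
    return sum(_STD_NORMAL.cdf((x - xi) / width) for xi in data) / len(data)


def kernel_icdf(u, data, width, tol=1e-9):
    """Invert the kernel CDF estimate at probability u by bisection."""
    lo = min(data) - 10.0 * width  # far enough out that the estimated CDF is ~0
    hi = max(data) + 10.0 * width  # far enough out that the estimated CDF is ~1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kernel_cdf(mid, data, width) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical branching angles standing in for an array such as x3a
x3a = [58.0, 62.5, 71.0, 66.0, 60.5, 69.0]

rng = random.Random(7)  # fixed seed so the run is reproducible
u = rng.random()        # uniform number in [0, 1)
r = kernel_icdf(u, x3a, width=5.0)
```

Because the estimated CDF is smooth, r can take values between and slightly beyond the measured points, with the bandwidth controlling how far the simulated values spread around each measurement.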
An example implementation of the procedure is given in Equation 31 below:
| Label | Description |
|---|---|
| r (branching angle, extension velocity, branching distance) | estimated values of the predefined function; the values are returned as a vector of the same dimension as x; the functions available in the fungi simulation program include: branching angle, apical extension velocity, and branching distance |
| ksdensity | kernel density estimator – MATLAB function |
| x (x1, x2, x3a, x3b, x4a, x4b, x5, x6) | sample data that is returned as a column vector; in the fungi program sample data vectors are as follows: x1 (apical extension velocity of parent hyphae), x2 (apical extension velocity of daughter hyphae), x3a (branching angles for parent hyphae, right side), x3b (branching angles for parent hyphae, left side), x4a (branching angles for daughter hyphae, right side), x4b (branching angles for daughter hyphae, left side), x5 (branching distances for parent hyphae), x6 (branching distances for daughter hyphae) |
| u | uniformly sampled pseudorandom number that falls into the interval [0,1) |
| ‘function’ | function to be estimated; in MATLAB (version R2013a) it can be probability density function, cumulative distribution function, inverse cumulative distribution function, survivor function, or cumulative hazard function |
| ‘icdf’ | the inverse of the cumulative distribution function |
| ‘width’, 5 | the bandwidth of the kernel-smoothing window defined as a function of the number of points in x (x1, x2, x3a, x3b, x4a, x4b, x5, x6); this property is given as the comma-separated pair consisting of ‘width’ and a scalar, which regulates the amount of smoothing; the choice of scalar value in the fungi simulation program is based on the data distributions assessed by using descriptive statistics (5-number summary: minimum, 1st quartile, median, 3rd quartile, maximum) and supported by general observations of Neurospora crassa growing on agar in laboratory conditions. For the individual sample data, the specific values are as follows: 0.1 for apical extension velocities (x1, x2), 5 for the branching angles (x3 and x4), and 20 for branching distances (x5 and x6). |

Table 8 Kernel estimation of the inverse of the Cumulative Distribution Function (CDF) - description of the properties included in the MATLAB procedure given in Equation 31
Equation 31 Implementation of the ‘ksdensity’ procedure in the fungi simulation programme. Here, the