CHAPTER 3: RESEARCH METHODOLOGY
3.6 Descriptive statistical procedures
3.6.1 Numerical summary
The most common descriptive statistical procedure is averaging the data values. The
average is obtained by adding all of the data values and dividing the sum by the number
of values (n). The result is known as the mean (X̅), which shows the central tendency of
a group of values. Analysts are usually also interested in the extent to which the values
are dispersed around the mean. The standard deviation (S) measures this dispersion: it is
the square root of the sum of squared differences between the measured values and their
mean, divided by n – 1. The term n – 1 is called the degrees of freedom, which indicates
the number of data items that are independent of one another. The variance (S²) of the
sample data is the standard deviation squared.
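As a minimal sketch of the numerical summary described above, the following Python snippet computes the mean, sample variance and sample standard deviation by hand and cross-checks them against the standard library. The data values are hypothetical and serve only as an illustration.

```python
import statistics

# Hypothetical sample of five measurements (illustrative values only).
values = [4.0, 8.0, 6.0, 5.0, 7.0]

n = len(values)
mean = sum(values) / n                           # mean: sum of values / number of values
squared_diffs = [(x - mean) ** 2 for x in values]
variance = sum(squared_diffs) / (n - 1)          # sample variance, n - 1 degrees of freedom
std_dev = variance ** 0.5                        # standard deviation = sqrt(variance)

# The standard library's sample statistics use the same n - 1 divisor.
library_variance = statistics.variance(values)
library_std_dev = statistics.stdev(values)

print(mean, variance, std_dev)
```

Dividing by n instead of n – 1 would give the population variance, which `statistics.pvariance` computes.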
In population statistics, the symbols used for the mean, variance and standard deviation
differ from those used in sample statistics. The symbol for the population mean is μ,
whereas the symbols used for the population variance and population standard deviation
are σ² and σ, respectively.
Visualizing data through charts and graphs is usually the first task in data analysis
since this task enables analysts to observe the basic features of the data, including unusual
observations and unique patterns. Indeed, some variations observed in the data can be
which are used to visualize the relationship between two variables. Time series plots are
often used to present chronological data such as the number of road traffic fatalities, in
which the number of fatalities is plotted as a function of time.
3.6.2 Probability distributions
A random variable is a variable that can take a different value in each trial of an
experiment, the exact outcome being a chance or random event. The random variable
is called a discrete variable if only certain values are possible. The number of rooms in a
hotel is an example of a discrete variable. In contrast, a continuous variable is a variable
that can take any value within a certain range. The weight of students in a class is an
example of a continuous variable.
The probability distribution of a discrete random variable is a list of all possible values
that a variable can take, along with the probability of each. The expected value of a
random variable is the mean value that the variable assumes over a large number of trials.
The expected value of a discrete probability distribution can be determined using the
following equation:
$$E(X) = \sum X \cdot P(X) \qquad (3.1)$$
where:
E(X) = Expected value of random variable X
X = Random variable
P(X) = Probability of the random variable X.
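Equation (3.1) can be illustrated with a short Python sketch. The distribution below is hypothetical, and the function name `expected_value` is an assumption made for this example only.

```python
# Hypothetical discrete distribution: value -> probability (illustrative only).
distribution = {0: 0.1, 1: 0.3, 2: 0.4, 3: 0.2}


def expected_value(dist):
    """Return E(X), the sum of X * P(X) over all possible values of X."""
    total_probability = sum(dist.values())
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in dist.items())


e_x = expected_value(distribution)
print(e_x)
```

The check on the total probability reflects the requirement that the list of possible values, with their probabilities, is complete.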
A distribution that represents many real-life variables measured on a continuous scale
is the normal distribution. Only the mean and the standard deviation are required to
identify a specific normal distribution. The distribution follows a bell-shaped curve
which is symmetrical about the mean, as shown in Figure 3.2. The probabilities of the values
drawn from a normal distribution that fall within specific limits are first determined by
converting these limits into standard deviation units called Z-scores. The Z-score of any
X value is the number of standard deviations from the central value of the curve (μ) to
that value. The formula used to calculate the Z-scores is given by:
$$Z = \frac{X - \mu}{\sigma} \qquad (3.2)$$
where:
X = Value of interest
μ = Mean
σ = Standard deviation.
The normal distribution table is constructed using the Z-score values, and this table
can be used to determine the area under the normal distribution curve between the centre
of the curve (μ) and the value of interest X. If random variable X has a normal distribution,
then the Z-scores of the random variable have a normal distribution with mean (μ) = 0
and standard deviation (σ) = 1.
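The conversion in Equation (3.2) and the table lookup can be sketched in Python as follows. The mean, standard deviation and value of interest are hypothetical, and `statistics.NormalDist` is used here in place of a printed normal distribution table.

```python
from statistics import NormalDist


def z_score(x, mu, sigma):
    """Number of standard deviations between x and the mean mu."""
    return (x - mu) / sigma


# Hypothetical population parameters and value of interest (illustrative only).
mu, sigma = 170.0, 10.0
x = 185.0

z = z_score(x, mu, sigma)

# The standard normal distribution has mean 0 and standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Area under the curve between the centre (mu) and the value of interest x,
# which is the quantity a normal distribution table typically reports.
area_centre_to_x = standard_normal.cdf(z) - 0.5

print(z, round(area_centre_to_x, 4))
```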
3.6.3 Sampling distribution
A sampling distribution is defined as the distribution of all possible values of the
sample statistic that can be obtained from the population for a given sample size. Let us
consider the following example, whereby a random sample of 100 people is taken from a
population. The height of each person in the sample is measured and the mean height is
calculated. This sample mean can be regarded as one value drawn from the sampling
distribution of all possible sample means of samples of size 100 that can be
taken from the population. Similarly, it is assumed that each sample statistic that is
computed from the sample data can be considered as having been drawn from a sampling
distribution.
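To make the idea of "all possible sample means" concrete, the following Python sketch enumerates every sample of a given size from a very small population. The population of five heights and the sample size of two are hypothetical; they are chosen only so that the full sampling distribution can be listed.

```python
from itertools import combinations
from statistics import mean

# Hypothetical population of five heights in cm (illustrative only).
population = [160, 165, 170, 175, 180]
sample_size = 2

# Every possible sample of the given size, drawn without replacement.
all_samples = list(combinations(population, sample_size))

# The sampling distribution of the mean: one sample mean per possible sample.
sample_means = [mean(sample) for sample in all_samples]

population_mean = mean(population)
mean_of_sample_means = mean(sample_means)

print(len(all_samples), population_mean, mean_of_sample_means)
```

In this small case the mean of the sampling distribution coincides with the population mean, which is the property that makes the sample mean a useful estimate.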
Figure 3.2: Normal distribution curve with population mean µ
According to the central limit theorem, the sampling distribution of the sample means
will approximate the normal distribution if the sample size is sufficiently large. The
standard error of the sample mean is given by $\sigma/\sqrt{n}$, where σ is the population
standard deviation and n is the sample size. It should be noted that the sampling
distribution will approximate a normal distribution regardless of the shape of the
population distribution from which the samples are drawn. The central limit theorem greatly
assists analysts in data analysis since it enables one to compute the probability of
various sample results based on the probabilities of the normal distribution curve.
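The central limit theorem can be checked informally with a small simulation. The sketch below is illustrative only: the exponential population, the number of repeated samples and the random seed are all assumptions chosen for the example, not part of the study.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

n = 100             # size of each sample
num_samples = 2000  # number of repeated samples (assumed for illustration)

# Hypothetical skewed population: exponential with rate 1,
# which has mean 1 and standard deviation 1.
population_sigma = 1.0

sample_means = []
for _ in range(num_samples):
    sample = [random.expovariate(1.0) for _ in range(n)]
    sample_means.append(statistics.mean(sample))

# Standard error predicted by sigma / sqrt(n).
theoretical_se = population_sigma / n ** 0.5

# Standard deviation of the simulated sample means.
observed_se = statistics.stdev(sample_means)

print(theoretical_se, observed_se, statistics.mean(sample_means))
```

Although the population is strongly skewed, the spread of the simulated sample means should lie close to the theoretical standard error.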
However, further analysis is required if the sample size is small. In this case, it is
assumed that the population under investigation is normally distributed. The population
standard deviation is unknown and must therefore be estimated using the sample standard
deviation. The t-distribution is then used: t values demarcate the sampling distribution
area, and only the degrees of freedom (df) are required. Once the degrees of freedom are
known, the t values that exclude the desired percentages of the curve can be determined.
3.6.4 Estimation
There are two types of estimates which can be determined from a forecasting
process: the point estimate and the interval estimate. The point estimate of a population
parameter is a single value calculated from the sample data that estimates the unknown
population value. Examples include the sample mean, variance and standard deviation. In
contrast, an interval estimate, or confidence interval, is an interval within which the
population parameter of interest is highly likely to lie. The confidence interval is
determined by demarcating an interval around the point estimate and it is generally
computed using either the normal distribution or the t-distribution.
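A point estimate and a normal-based interval estimate can be sketched in Python as follows. The sample data, the 95% confidence level and the function name `normal_confidence_interval` are assumptions made for this illustration. For a small sample, a t value with the appropriate degrees of freedom would replace the Z value used here.

```python
from statistics import NormalDist, mean, stdev


def normal_confidence_interval(sample, confidence=0.95):
    """Return the point estimate of the mean and a normal-based interval."""
    n = len(sample)
    point_estimate = mean(sample)
    standard_error = stdev(sample) / n ** 0.5
    # Z value that leaves (1 - confidence) / 2 in each tail of the curve.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * standard_error
    return point_estimate, (point_estimate - margin, point_estimate + margin)


# Hypothetical sample of 35 observations (illustrative only).
sample = [50 + (i % 7) for i in range(35)]

estimate, (lower, upper) = normal_confidence_interval(sample)
print(estimate, lower, upper)
```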
3.6.5 Hypothesis testing
A hypothesis is an educated guess regarding the population and it is usually a part of
statistical analysis. Hypothesis testing is required to determine whether the findings of a
study support or reject the hypothesis made at the beginning of the study. The steps
involved in hypothesis testing are described briefly as follows:
(1) Formulate the null hypothesis (H0) and alternative hypothesis (Ht). The null
hypothesis is the outcome of the study that an analyst wants to nullify or refute.
In contrast, the alternative hypothesis is one of the possible outcomes expected in
the study, and it is also known as the working hypothesis. If H0 is rejected, then
Ht is accepted.
(3) Calculate the standard error of the sample mean (the standard deviation divided
by $\sqrt{n}$). Following this, define the lower and upper limits for rejection of H0.
The lower limit is obtained by subtracting the product of the t value (at α = 0.05 or
0.1) and the standard error from the population mean (μ), and the upper limit is
obtained by adding the same product to μ.
(4) If the sample mean (X̅) is lower than the lower limit or higher than the upper
limit for H0 rejection, reject the null hypothesis (H0) in favour of the alternative
hypothesis (Ht).
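The rejection limits described in the steps above can be sketched in Python. The sample values, the hypothesized mean and the function name `two_tailed_t_test` are hypothetical. The t value is passed in as if read from a t table; 2.306 is the two-tailed value for α = 0.05 with 8 degrees of freedom.

```python
from statistics import mean, stdev


def two_tailed_t_test(sample, mu_0, t_critical):
    """Compare the sample mean with rejection limits around mu_0.

    t_critical is the tabulated t value for the chosen alpha and
    n - 1 degrees of freedom (supplied by the caller).
    """
    n = len(sample)
    x_bar = mean(sample)
    standard_error = stdev(sample) / n ** 0.5   # sample std dev / sqrt(n)
    lower = mu_0 - t_critical * standard_error  # lower limit for H0 rejection
    upper = mu_0 + t_critical * standard_error  # upper limit for H0 rejection
    reject_h0 = x_bar < lower or x_bar > upper
    return x_bar, (lower, upper), reject_h0


# Hypothetical sample of nine observations, so df = 8 (illustrative only).
sample = [52, 55, 51, 54, 53, 56, 52, 55, 54]

x_bar, limits, reject_h0 = two_tailed_t_test(sample, mu_0=50.0, t_critical=2.306)
print(x_bar, limits, reject_h0)
```

Passing the critical value as an argument keeps the sketch within the standard library, which has no t-distribution function.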
3.6.6 Correlation analysis
One of the essential tasks in the development of forecasting models is to examine the
relationship between two variables. There are two techniques used for this purpose,
namely correlation analysis and regression analysis.
3.6.7 Scatter diagrams
One of the simplest methods used to examine the relationship between two variables
is to plot a scatter diagram. Scatter diagrams show whether the dependent variable
(plotted on the ordinate or Y-axis of the diagram) tends to increase or decrease with
changes in the independent variable (plotted on the abscissa or X-axis of the diagram).
Hence, a preliminary conclusion can be drawn regarding the relationship between both of
the variables (x and y). There are various types of scatter diagrams – some show whether
the relationship between the two variables is strong or weak while others show whether
the relationship is positive or negative. Some scatter diagrams will reveal whether the
relationship is linear or non-linear while others may indicate that there is no relationship
at all between the two variables. One can draw a straight line that passes through most of
the data points in order to judge the strength of the linear
relationship between two variables. If most of the data points lie in close proximity to the
straight line, then there is a strong linear relationship between the variables. Likewise, if
most of the data points lie farther away from the straight line, the linear relationship
between the variables is rather weak. The scatter diagram is a good visual representation
of the relationship between two variables. However, there are cases whereby more details
are required and this can be achieved by measuring the correlation coefficient of the two
variables.
3.6.8 Correlation coefficient
The correlation coefficient is a measure of the strength of the linear relationship that
exists between the two variables of interest. The correlation coefficient value varies
between -1 and +1. If there is a positive relationship between the two variables (such that
an increase in x will result in an increase in y) the correlation coefficient will have a
positive value. Similarly, if there is a negative relationship between the two variables
(such that an increase in x will result in a decrease in y), the correlation coefficient will
have a negative value. Correlation coefficient values of +1 and -1 indicate a perfectly
linear relationship in the positive and negative direction, respectively. In contrast, the
correlation coefficient value is 0 if there is no linear relationship between the two
variables. The correlation coefficient is represented by r and is determined using the
following formula:
𝑟 = 𝑛𝛴𝑋𝑌 − (𝛴𝑋) (𝛴𝑌)