
CHAPTER 3: RESEARCH METHODOLOGY

3.6 Descriptive statistical procedures

3.6.1 Numerical summary

The most common descriptive statistical procedure is averaging the data values. The
average is obtained by adding all of the data values and dividing the sum by the number
of values (n). The result is known as the mean (X̅), which shows the central tendency of
a group of values. Analysts are usually also interested in the extent to which the values
are dispersed around the mean. The standard deviation (S) measures this dispersion: it
is obtained by summing the squared differences between the measured values and their
mean, dividing the sum by n - 1, and taking the square root. The term n - 1 is called the
degrees of freedom, which indicates the number of data items that are independent of one
another. The variance (S²) of the sample data is the standard deviation squared.
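The calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used in the study; the function name `summarize` and the data values are hypothetical.

```python
import math

def summarize(values):
    """Return the sample mean, variance and standard deviation of `values`."""
    n = len(values)
    mean = sum(values) / n
    # Sum of squared differences between each value and the mean
    sum_sq = sum((x - mean) ** 2 for x in values)
    # Divide by the degrees of freedom (n - 1) to get the sample variance
    variance = sum_sq / (n - 1)
    std_dev = math.sqrt(variance)
    return mean, variance, std_dev

# Hypothetical data values, for illustration only
data = [4, 8, 6, 5, 7]
mean, variance, std_dev = summarize(data)
print(mean, variance, std_dev)
```

Dividing by n - 1 rather than n is what distinguishes the sample variance from the population variance.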

In population statistics, the symbols used for the mean, variance and standard deviation
differ from those used in sample statistics. The symbol for the population mean is μ,
whereas the symbols used for the population variance and population standard deviation
are σ² and σ, respectively.

Visualizing data through charts and graphs is usually the first task in data analysis

since this task enables analysts to observe the basic features of the data, including unusual

observations and unique patterns. Indeed, some variations observed in the data can be
[...] which are used to visualize the relationship between two variables. Time series plots are

often used to present chronological data such as the number of road traffic fatalities, in

which the number of fatalities is plotted as a function of time.

3.6.2 Probability distributions

A random variable is a variable that can take different values over different trials of an
experiment, the exact outcome being a chance or random event. The random variable
is called a discrete variable if only certain values are possible. The number of rooms in a
hotel is an example of a discrete variable. In contrast, a continuous variable is a random
variable that can take any value within a certain range. The weight of students in a class
is an example of a continuous variable.

The probability distribution of a discrete random variable is a list of all possible values

that a variable can take, along with the probability of each. The expected value of a

random variable is the mean value that the variable assumes over a large number of trials.

The expected value of a discrete probability distribution can be determined using the
following equation:

$$E(X) = \sum X \, P(X) \tag{3.1}$$

where:

E(X) = Expected value of random variable X

X = Random variable

P(X) = Probability of the random variable X.
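As an illustration of Equation (3.1), the following sketch computes the expected value of a small discrete distribution. The function name `expected_value` and the probabilities are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def expected_value(distribution):
    """Expected value of a discrete distribution given as {value: probability}."""
    total = sum(distribution.values())
    # The probabilities of all possible values must sum to 1
    if abs(total - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # E(X) = sum of X * P(X) over all possible values
    return sum(x * p for x, p in distribution.items())

# Hypothetical distribution: number of rooms booked and its probability
rooms_booked = {0: 0.1, 1: 0.3, 2: 0.4, 3: 0.2}
print(expected_value(rooms_booked))  # 0(0.1) + 1(0.3) + 2(0.4) + 3(0.2), about 1.7
```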

A distribution that represents many real-life variables measured on a continuous scale
is the normal distribution. Only the mean (μ) and the standard deviation (σ) are
required to identify a specific normal distribution. The distribution follows a bell-shaped
curve which is symmetrical, as shown in Figure 3.2. The probabilities of the values

drawn from a normal distribution that fall within specific limits are first determined by

converting these limits into standard deviation units called Z-scores. The Z-score of any

X value is the number of standard deviations from the central value of the curve (μ) to

that value. The formula used to calculate the Z-scores is given by:

$$Z = \frac{X - \mu}{\sigma} \tag{3.2}$$

where:

X = Value of interest

μ = Mean

σ = Standard deviation.

The normal distribution table is constructed using the Z-score values, and this table

can be used to determine the area under the normal distribution curve between the centre

of the curve (μ) and the value of interest X. If random variable X has a normal distribution,

then the Z-scores of the random variable have a normal distribution with mean (μ) = 0

and standard deviation (σ) = 1.
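The Z-score conversion and the table lookup can be sketched as follows. The example uses `NormalDist` from Python's standard `statistics` module in place of a printed normal distribution table; the mean, standard deviation and value of interest are hypothetical.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations between x and the mean mu."""
    return (x - mu) / sigma

# Hypothetical population: mean 170, standard deviation 10
mu, sigma = 170.0, 10.0
x = 185.0

z = z_score(x, mu, sigma)
# Area under the standard normal curve between the centre (Z = 0) and z;
# this is the quantity a normal distribution table usually lists.
area = NormalDist().cdf(z) - 0.5
print(z, round(area, 4))
```

For Z = 1.5 the area between the centre of the curve and the value of interest is about 0.4332, which matches the entry found in standard normal tables.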

3.6.3 Sampling distribution

A sampling distribution is defined as the distribution of all possible values of the
sample statistic that can be obtained from the population for a given sample size.
Consider the following example, in which a random sample of 100 people is taken from a
population. The height of each person in the sample is measured and the mean height is
calculated. This sample mean can be regarded as one value drawn from the sampling
distribution of all possible sample means of samples of size 100 that can be taken from
the population. Similarly, each sample statistic that is computed from the sample data
can be considered as having been drawn from a sampling distribution.

Figure 3.2: Normal distribution curve with population mean µ

According to the central limit theorem, the sampling distribution of the sample means
will approximate the normal distribution if the sample size is sufficiently large. The
standard error of the sample mean is given by $\sigma_{\bar{X}} = \sigma / \sqrt{n}$, where
σ is the population standard deviation and n is the sample size. It should be noted that
the sampling distribution will approximate a normal distribution regardless of the shape
of the population distribution from which the samples are drawn. The central limit
theorem greatly facilitates data analysis since it enables one to compute the probability
of various sample results based on the probabilities of the normal distribution curve.
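A small simulation can make the central limit theorem concrete. The sketch below is only an illustration under stated assumptions: the population is taken to be exponential with mean 1 and standard deviation 1 (a strongly skewed, non-normal shape), and the sample size and number of samples are arbitrary choices.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n = 100             # size of each sample
num_samples = 2000  # number of samples drawn from the population

# Draw repeated samples from a skewed population and record each sample mean
sample_means = []
for _ in range(num_samples):
    sample = [random.expovariate(1.0) for _ in range(n)]
    sample_means.append(statistics.fmean(sample))

# Standard error predicted by sigma / sqrt(n), with sigma = 1 for this population
theoretical_se = 1.0 / math.sqrt(n)
# Standard deviation of the simulated sampling distribution
observed_se = statistics.stdev(sample_means)

print(statistics.fmean(sample_means), observed_se, theoretical_se)
```

The mean of the sample means should lie close to the population mean, and their standard deviation should lie close to the standard error σ/√n, even though the population itself is not normal.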

However, further analysis is required if the sample size is small. In this case, it is
assumed that the population under investigation is normally distributed. The population
standard deviation is unknown and must therefore be estimated using the sample
standard deviation (S). The t-distribution is then used to demarcate the sampling
distribution area, and only the degrees of freedom (df) are required to specify it.
Once the degrees of freedom are known, the t values that exclude the desired percentages
of the curve can be determined.

3.6.4 Estimation

Two types of estimates can be determined from a forecasting process: point estimates
and interval estimates. The point estimate of a population parameter is a single value
calculated from the sample data that estimates the unknown population value. Examples
include the sample mean, variance and standard deviation. In contrast, an interval
estimate, or confidence interval, is an interval within which the population parameter of
interest is highly likely to lie. The confidence interval is determined by demarcating an
interval around the point estimate, and it is generally computed using either the normal
distribution or the t-distribution.
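The following sketch shows one way to compute a point estimate and a confidence interval for a population mean. It assumes a sample large enough for the normal distribution to apply; for a small sample the Z value would be replaced by the corresponding t value. The function name and the data are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def normal_confidence_interval(sample, confidence=0.95):
    """Point estimate and normal-based confidence interval for the mean."""
    n = len(sample)
    point_estimate = mean(sample)
    # Standard error estimated from the sample standard deviation
    standard_error = stdev(sample) / math.sqrt(n)
    # Z value that leaves (1 - confidence) / 2 in each tail
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * standard_error
    return point_estimate, point_estimate - margin, point_estimate + margin

# Hypothetical sample of 40 observations
sample = list(range(1, 41))
estimate, lower, upper = normal_confidence_interval(sample)
print(estimate, lower, upper)
```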

3.6.5 Hypothesis testing

A hypothesis is an educated guess regarding the population and it is usually a part of
statistical analysis. Hypothesis testing is required to determine whether the findings of a
study support or contradict the hypothesis made at the beginning of the study. The steps
involved in hypothesis testing are described briefly as follows:

(1) Formulate the null hypothesis (H0) and alternative hypothesis (Ht). The null

hypothesis is the outcome of the study that an analyst wants to nullify or refute.

In contrast, the alternative hypothesis is one of the possible outcomes expected in

the study, and it is also known as the working hypothesis. If H0 is rejected, then

Ht is accepted.

(3) Calculate the standard error of the sample mean, $S / \sqrt{n}$, where S is the
sample standard deviation and n is the sample size. Following this, define
the lower and upper limits for rejection of H0. This can be done by subtracting
from and adding to the population mean (μ) the product of the t value (at α =
0.05 or 0.1) and the standard error.

(4) If the sample mean (X̅) is lower than the lower limit or higher than the upper
limit for H0 rejection, reject the null hypothesis (H0) in favour of the alternative
hypothesis (Ht).
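The rejection-limit calculation in the steps above can be sketched as follows. This is an illustrative two-tailed test under stated assumptions: the standard error is estimated from the sample standard deviation, and the t value is supplied by the caller from a t table (2.306 is the tabulated two-tailed value for α = 0.05 with 8 degrees of freedom). The function name, the sample and the hypothesised mean are hypothetical.

```python
import math
from statistics import mean, stdev

def two_tailed_t_test(sample, mu_0, t_critical):
    """Return the sample mean, rejection limits and decision for H0: mean = mu_0."""
    n = len(sample)
    x_bar = mean(sample)
    # Standard error of the sample mean, estimated from the sample
    standard_error = stdev(sample) / math.sqrt(n)
    # Limits around the hypothesised mean beyond which H0 is rejected
    lower = mu_0 - t_critical * standard_error
    upper = mu_0 + t_critical * standard_error
    reject_h0 = x_bar < lower or x_bar > upper
    return x_bar, lower, upper, reject_h0

# Hypothetical sample of nine observations (df = 8) and hypothesised mean of 50
sample = [52, 55, 49, 58, 54, 57, 53, 56, 52]
x_bar, lower, upper, reject_h0 = two_tailed_t_test(sample, mu_0=50, t_critical=2.306)
print(x_bar, round(lower, 3), round(upper, 3), reject_h0)
```

Here the sample mean of 54 lies above the upper limit of about 52.174, so H0 is rejected in favour of Ht.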

3.6.6 Correlation analysis

One of the essential tasks in the development of forecasting models is to examine the

relationship between two variables. There are two techniques used for this purpose,

namely correlation analysis and regression analysis.

3.6.7 Scatter diagrams

One of the simplest methods used to examine the relationship between two variables

is to plot a scatter diagram. Scatter diagrams show whether the dependent variable

(plotted on the ordinate or Y-axis of the diagram) tends to increase or decrease with

changes in the independent variable (plotted on the abscissa or X-axis of the diagram).

Hence, a preliminary conclusion can be drawn regarding the relationship between both of

the variables (x and y). There are various types of scatter diagrams – some show whether

the relationship between the two variables is strong or weak while others show whether

the relationship is positive or negative. Some scatter diagrams will reveal whether the

relationship is linear or non-linear while others may indicate that there is no relationship

at all between the two variables. One can draw a straight line that passes through most of
the data points in order to assess the linear
relationship between two variables. If most of the data points lie in close proximity to the

straight line, then there is a strong linear relationship between the variables. Likewise, if

most of the data points lie farther away from the straight line, the linear relationship

between the variables is rather weak. The scatter diagram is a good visual representation

of the relationship between two variables. However, there are cases whereby more details

are required and this can be achieved by measuring the correlation coefficient of the two

variables.

3.6.8 Correlation coefficient

The correlation coefficient is a measure of the strength of the linear relationship that

exists between the two variables of interest. The correlation coefficient value varies

between -1 and +1. If there is a positive relationship between the two variables (such that

an increase in x will result in an increase in y) the correlation coefficient will have a

positive value. Similarly, if there is a negative relationship between the two variables

(such that an increase in x will result in a decrease in y), the correlation coefficient will

have a negative value. Correlation coefficient values of +1 and -1 indicate a perfectly
linear relationship in the positive and negative directions, respectively. In contrast, the

correlation coefficient value is 0 if there is no linear relationship between the two

variables. The correlation coefficient is represented by r and is determined using the

following formula:

$$r = \frac{n\sum XY - (\sum X)(\sum Y)}{\sqrt{n\sum X^{2} - (\sum X)^{2}}\,\sqrt{n\sum Y^{2} - (\sum Y)^{2}}}$$
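The correlation coefficient can be computed directly from the raw sums. The sketch below uses the standard computational form of the Pearson correlation coefficient, whose numerator is nΣXY - (ΣX)(ΣY) and whose denominator is the product of √(nΣX² - (ΣX)²) and √(nΣY² - (ΣY)²). The function name and the data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical paired observations
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(pearson_r(xs, ys), 4))
```

For these values the numerator is 30 and the denominator is √50 × √30, giving r of about 0.7746, a moderately strong positive linear relationship.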

In document A time series analysis of road traffic fatalities in Malaysia / Yusria Darma (Page 140-147)