Many engineering measurements have an element of randomness. Statistics is a branch of applied mathematics involving the analysis, interpretation, and presentation of data including some degree of randomness or uncertainty. In descriptive statistics, the important features or properties of a set of data are summarized or described.
Continuous random variable: A variable vector x, whose elements can have any of a continuum of values for each measurement, or observation.
Population: all possible measurements of a random variable
Sample: finite number of measurements. Assume that N measurements have been made of a random variable, denoted as the vector x, with elements x(n), n = 1, ..., N.
Frequency distribution of a random variable: indicates the relative frequency with which specific values of this random variable occur in the population.
Histogram: frequency distribution of sample data, indicating the frequency with which sample values fall within ranges of values, called bins.
• Range of values: xmin ≤ x ≤ xmax
• Number of bins: m
• Bin width: ∆x = (xmax − xmin)/m
• Histogram value in bin k: N(k), the number of samples with value xmin + (k − 1)∆x ≤ x < xmin + k∆x (the last bin, k = m, also includes x = xmax). This is also known as the absolute frequency.
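The bin definitions above can be applied directly, without any histogram function. The following is a minimal sketch under stated assumptions: the data vector x and the choice m = 4 are hypothetical, and the last bin is taken to include xmax so that every sample is counted exactly once.

```matlab
% Manual histogram built from the bin definitions (illustrative sketch).
x = [2.0 2.2 2.6 3.0 3.1 3.4 3.9 4.0];   % hypothetical sample data
m = 4;                                   % number of bins (arbitrary choice)
xmin = min(x);
xmax = max(x);
dx = (xmax - xmin)/m;                    % bin width

Nk = zeros(1,m);                         % absolute frequency in each bin
for k = 1:m
    lo = xmin + (k-1)*dx;                % lower edge of bin k
    hi = xmin + k*dx;                    % upper edge of bin k
    if k < m
        Nk(k) = sum(x >= lo & x < hi);   % half-open bin [lo, hi)
    else
        Nk(k) = sum(x >= lo & x <= xmax);  % last bin also includes xmax
    end
end
disp(Nk)                                 % displays 2 1 3 2
```

The counts add up to the number of samples, which is a quick check that no value was dropped or counted twice.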
The Matlab commands for generating and plotting a histogram are:
Command Description
N = hist(x) Returns the row vector histogram N of the values in x using 10 bins. If x is a matrix, returns a histogram matrix N operating on the columns of x.
N = hist(x,m) Returns the row vector histogram N of the values in x using m bins.
N = hist(x,xc) Returns the row vector histogram N of the values in x using bin centers specified by xc.
N = histc(x,edges) Counts the number of values in vector x that fall between the elements in the edges vector (which must contain monotonically non-decreasing values). N is a length(edges) vector containing these counts. N(k) counts the value x(i) if edges(k) <= x(i) < edges(k+1). The last bin counts any values of x that match edges(end). Values outside the range of edges are not counted. Use -inf and inf in edges to include all non-NaN values.
[N,xc] = hist(...) Also returns the position of the bin centers in xc.
hist(...) Without output arguments produces a histogram bar plot of the results.
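As a rough illustration of how these commands are called, the sketch below applies hist and histc to a small hypothetical vector (the data and the edges vector are assumptions chosen so the counts are easy to check by hand).

```matlab
% Calling hist and histc on a small data set (illustrative sketch).
x = [2.0 2.2 2.6 3.0 3.1 3.4 3.9 4.0];   % hypothetical sample data

[N, xc] = hist(x, 4);     % counts N and bin centers xc for 4 equal-width bins

edges = [2 2.5 3 3.5 4];  % explicit bin edges, non-decreasing
Nc = histc(x, edges);     % Nc(k) counts edges(k) <= x < edges(k+1);
                          % the last element counts values equal to edges(end)
disp(xc)                  % bin centers: 2.25 2.75 3.25 3.75
disp(Nc)                  % displays 2 1 3 1 1
```

Note that histc returns one more element than there are intervals, because the final element is reserved for values that equal the last edge (here the single sample 4.0). How hist assigns a value that lies exactly on a boundary between two bins, such as 3.0 here, is not shown because it may depend on the implementation.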
For example, the following commands produce the 25-bin histograms shown in Figure 7.4, operating on the random data vectors data1 and data2.
% Histogram plots
Note the differences in the nature of the two distributions. Random data data1 is called a uniform distribution, as it has roughly equal, or uniform, frequency in all bins. Random data data2 has a bell-shaped distribution, called a Gaussian distribution or Normal distribution.
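The two kinds of distribution can be reproduced with stand-in data. The sketch below is an assumption, not the original data: data1 is generated here as uniform on [2, 4] and data2 as Gaussian with mean 3 and standard deviation 1, with a hypothetical sample count of 500.

```matlab
% Stand-in data (assumed; the original data1 and data2 are not reproduced here).
n = 500;                          % hypothetical number of samples
data1 = 2 + 2*rand(1,n);          % uniform random values on [2, 4]
data2 = 3 + randn(1,n);           % Gaussian random values, mean 3, std 1

% 25-bin absolute frequency histograms, one per subplot
subplot(2,1,1), hist(data1,25), ...
   title('Histogram of data1'), xlabel('x'), ylabel('N(k)')
subplot(2,1,2), hist(data2,25), ...
   title('Histogram of data2'), xlabel('x'), ylabel('N(k)')
```

With a few hundred samples, the upper plot should show bars of roughly equal height, while the lower plot should peak near x = 3 and fall off on both sides.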
Relative frequency histogram: The histogram value in bin k is normalized by the total number, N, of data samples: N(k)/N.
The hist function is limited in its ability to produce relative frequency histograms. It is better to generate such plots with the bar function, defined as follows:
Figure 7.4: Absolute frequency histograms
bar(X,Y) Draws the columns of the M-by-N matrix Y as M groups of N vertical bars. The vector X must be monotonically increasing or decreasing.
bar(Y) Uses the default value of X=1:M.
bar(x,y) For vector inputs, length(y) bars are drawn, with each bar centered on a value of x.
bar(X,Y,width) Specifies the width of the bars. Values of width > 1 produce overlapped bars. The default value is width=0.8.
For example, the following commands produce relative histogram plots shown in Figure 7.5 for the random data vectors data1 and data2.
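% Relative frequency histogram plots
% (the first lines are reconstructed; 25 bins assumed, matching Figure 7.4)
[N1,x1] = hist(data1,25);        % bin counts and bin centers for data1
[N2,x2] = hist(data2,25);        % bin counts and bin centers for data2
rfreq1 = N1/length(data1);       % normalize counts by the number of samples
rfreq2 = N2/length(data2);
subplot(2,1,1),bar(x1,rfreq1),...
title('Relative histogram of data1'), grid,...
xlabel('x'), ylabel('relative frequency'),...
subplot(2,1,2),bar(x2,rfreq2),...
title('Relative histogram of data2'), grid,...
xlabel('x'), ylabel('relative frequency')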
Figure 7.5: Relative frequency histograms (panels: "Relative histogram of data1" and "Relative histogram of data2"; horizontal axis x, vertical axis relative frequency)
Measures of central tendency
The mean and median describe the middle, or center, of the range of the random variable.
The sample mean of vector x having N elements is

$$\bar{x} = \frac{1}{N} \sum_{n=1}^{N} x(n)$$
The population mean µ (mu) is the value of $\bar{x}$ for an infinite number of measurements, denoted by

$$\mu = \lim_{N \to \infty} \bar{x}$$

where the right-hand side is read as “the limit of x bar as N approaches infinity.”
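The limit can be explored numerically by computing the sample mean for increasing N. This is a sketch under an assumed setup: the samples are drawn with rand, which is uniform on [0, 1] with population mean 0.5, and the sample sizes in Nvals are arbitrary.

```matlab
% Sample mean for increasing numbers of measurements (illustrative sketch).
% rand draws from a uniform distribution on [0,1], population mean 0.5.
Nvals = [10 100 1000 10000 100000];   % hypothetical sample sizes
xbar = zeros(size(Nvals));
for i = 1:length(Nvals)
    x = rand(1,Nvals(i));             % Nvals(i) random measurements
    xbar(i) = mean(x);                % sample mean for this sample size
end
disp([Nvals' xbar'])                  % sample size next to its sample mean
```

The exact values change from run to run, but the sample means for the larger N should typically lie closer to 0.5 than those for the smaller N.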
Median: Middle value of a set of random samples, sorted by value (rank ordered). If there is an even number of samples, then the median is the average of the two middle sample values.
Command Description
mean(x) Returns the sample mean of the elements of the vector x. Returns a row vector of the sample means of the columns of matrix x.
median(x) Returns the median of the elements of the vector x. Returns a row vector of the medians of the columns of matrix x.
sort(x) Returns a vector with the values of vector x in ascending order. Returns a matrix with each column of matrix x in ascending order.
For example:
These results can be confirmed by observing the plots of the data in Figure 7.1, in which the data values can be seen to be centered on 3.0. For the two data sets considered, the values of the mean and median are very close. This is not necessarily the case for other data sets. Also note that the results above show that mean(data1) = sum(data1)/length(data1).
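To see how the mean and median can differ, consider the following sketch on a small hypothetical data set containing one outlier (the vectors x and y are assumptions chosen so the values can be verified by hand).

```matlab
% Mean and median on small hand-checkable data (illustrative sketch).
x = [3 1 4 2 100];        % hypothetical data with one outlier
xs = sort(x);             % rank-ordered values: 1 2 3 4 100
m1 = mean(x);             % (1+2+3+4+100)/5 = 22
m2 = median(x);           % middle value of xs: 3
m3 = sum(x)/length(x);    % same value as mean(x)

y = [3 1 4 2];            % even number of samples
m4 = median(y);           % average of the two middle values: (2+3)/2 = 2.5
```

The single outlier pulls the mean up to 22, while the median stays at 3, which is why the two measures agree closely only for data that are roughly symmetric about their center.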
Measures of variation
Measures of variation: indicate the degree of deviation of random samples from the measure of central tendency.
Referring again to the plots of our random data sets data1 and data2 in Figure 7.1, observe that data2 has greater variation from the mean.
The sample standard deviation of vector x having N elements is

$$s = \sqrt{\frac{1}{N-1} \sum_{n=1}^{N} \left( x(n) - \bar{x} \right)^2}$$

The sample variance $s^2$ is the square of the standard deviation.
std(x) Returns the sample standard deviation of the elements of the vector x. Returns a row vector of the sample standard deviations of the columns of matrix x.
For example:
>> std(data1)
ans = 0.5989
>> std(data2)
ans = 0.9408
Thus, the variation of data2 is greater than that of data1, as we concluded from observation of the plotted data values.
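The value returned by std can be checked against a direct calculation. The sketch below assumes the usual sample definition with N − 1 in the denominator, which is the default normalization of std, and uses a small hypothetical vector.

```matlab
% Sample standard deviation computed from its definition (illustrative sketch).
x = [2 4 4 4 5 5 7 9];                % hypothetical sample data
N = length(x);
xbar = mean(x);                       % sample mean: 40/8 = 5
s = sqrt(sum((x - xbar).^2)/(N-1));   % sqrt(32/7), N-1 normalization
s2 = s^2;                             % sample variance
fprintf('s = %.4f, std(x) = %.4f\n', s, std(x))
```

Both printed values should agree, and s2 should match var(x) to within rounding error.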