Two Numeric Variables
Follow these steps to construct a scatter plot:
Label the horizontal axis (x-axis) with the name of the influencing variable (called the independent variable, x).
Label the vertical axis (y-axis) with the name of the variable being influenced (called the dependent variable, y).
Plot each pair of data values (x; y) from the two numeric variables as coordinates on an x–y graph.
Example 2.6 Grocery Shoppers Survey – Amount Spent by Number of Store Visits
Refer to the dataset in Table 2.1.
Construct a scatter plot for the amount spent on groceries and the number of visits to the grocery store per shopper by the sample of 30 shoppers surveyed.
Management Question
By inspection of the scatter plot, describe the nature of the relationship between the number of visits and amount spent.
Solution
To construct the scatter plot, we need to define the x and y variables. Since the number of visits is assumed to influence the amount spent on groceries in a month, let x = number of visits and y = amount spent.
On a set of axes, plot each pair of data values for each shopper. For example, for shopper 1, plot x = 3 visits against y = R946; for shopper 2, plot x = 5 visits against y = R1 842.
The results of the scatter plot are shown in Figure 2.8.
[Scatter plot. Horizontal axis: Number of visits to store (0 to 7). Vertical axis: Amount spent last month (R) (0 to 2 500).]
Figure 2.8 Scatter plot – monthly amount spent on groceries against number of visits
Management Interpretation
There is a moderate, positive linear relationship between the number of visits to a grocery store in a month and the total amount spent on groceries last month per shopper. The more frequent the visits, the larger the grocery bill for the month. There is only one possible outlier – shopper 13, who spent R2 136 over four visits.
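The construction steps can be sketched in Python. This is a minimal illustration only: it uses just the three shoppers whose values are quoted in the text (shoppers 1, 2 and 13), because the full 30-shopper dataset of Table 2.1 is not reproduced here. The names `shoppers`, `scatter_coordinates` and `draw_scatter` are illustrative, and the drawing function assumes the matplotlib library is installed.

```python
# Only the three shoppers quoted in the text are included; the full
# 30-shopper dataset (Table 2.1) is not reproduced here.
# shopper id -> (number of visits, amount spent in rand)
shoppers = {1: (3, 946), 2: (5, 1842), 13: (4, 2136)}


def scatter_coordinates(data):
    """Return the (x, y) pairs: x = number of visits, y = amount spent."""
    return [(visits, spent) for visits, spent in data.values()]


def draw_scatter(points):
    """Draw the scatter plot (assumes matplotlib is installed)."""
    import matplotlib.pyplot as plt

    xs, ys = zip(*points)
    fig, ax = plt.subplots()
    ax.scatter(xs, ys)
    ax.set_xlabel("Number of visits to store")    # independent variable, x
    ax.set_ylabel("Amount spent last month (R)")  # dependent variable, y
    ax.set_xlim(0, 7)
    ax.set_ylim(0, 2500)
    return fig


points = scatter_coordinates(shoppers)
print(points)
```

Calling `draw_scatter(points)` returns a figure with the axes labelled as in the steps; with all 30 shoppers loaded into `shoppers`, the same call would draw the full scatter plot.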
Trendline Graph
A trendline graph plots the values of a numeric random variable over time.
Such data is called time series data. The x-variable is time and the y-variable is a numeric measure of interest to a manager (such as turnover, unit cost of production, absenteeism or share prices).
Follow these steps to construct a trendline graph:
The horizontal axis (x-axis) represents the consecutive time periods.
The values of the numeric random variable are plotted on the vertical (y-axis) opposite their time period.
The consecutive points are joined to form a trendline.
Trendline graphs are commonly used to identify and track trends in time series data.
Example 2.7 Factory Absenteeism Levels Study
Refer to the time series data in Table 2.10 on weekly absenteeism levels at a car manufacturing plant. (See Excel file C.2.2 – factory absenteeism.)
Table 2.10 Data on employee-days absent for a car manufacturing plant
Week 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Absent 54 58 94 70 61 61 78 56 49 55 95 85 60 64 99 80
Week 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
Absent 62 78 88 73 65 84 92 70 59 65 105 84 80 90 112 94
Produce a trendline plot of the weekly absenteeism levels (number of employee-days absent) for this car manufacturing plant over a period of 32 weeks.
Management Question
By an inspection of the trendline graph, describe the trend in weekly absenteeism levels within this car manufacturing plant over the past 32 weeks.
Solution
To plot the trendline, plot the weeks (x = 1, 2, 3, …, 32) on the x-axis. For each week, plot the corresponding employee-days absent on the y-axis. After plotting all 32 y-values, join the points to produce the trendline graph as shown in Figure 2.9.
Figure 2.9 Trendline graph for weekly absenteeism levels – car manufacturing plant
Management Interpretation
Over the past 32 weeks there has been a modest increase in absenteeism, with an upturn occurring in more recent weeks. A distinct ‘monthly’ pattern exists, with absenteeism in each month generally low in weeks one and two, peaking in week three and declining moderately in week four.
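The trend and the 'monthly' pattern described above can be checked numerically from the Table 2.10 data. The sketch below is one possible check, not the method used in the text: it compares the mean absenteeism of the first and last 16 weeks, and averages the data by position within each four-week block. Treating weeks 1 to 4, 5 to 8 and so on as 'months' is an assumption made for illustration, and the drawing function assumes matplotlib is installed.

```python
# Employee-days absent per week, weeks 1 to 32 (Table 2.10).
absent = [54, 58, 94, 70, 61, 61, 78, 56, 49, 55, 95, 85, 60, 64, 99, 80,
          62, 78, 88, 73, 65, 84, 92, 70, 59, 65, 105, 84, 80, 90, 112, 94]
weeks = list(range(1, len(absent) + 1))  # x-values: consecutive time periods


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


# Overall trend: compare the first 16 weeks with the last 16 weeks.
first_half = mean(absent[:16])
second_half = mean(absent[16:])

# Assumption: each consecutive block of four weeks is one 'month'.
# by_position[0] averages weeks 1, 5, 9, ...; by_position[2] weeks 3, 7, 11, ...
by_position = [mean(absent[pos::4]) for pos in range(4)]


def draw_trendline(x, y):
    """Join consecutive points to form the trendline (assumes matplotlib)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot(x, y, marker="o")
    ax.set_xlabel("Week")
    ax.set_ylabel("Employee-days absent")
    return fig


print(first_half, second_half)  # 69.9375 81.3125
print(by_position)              # [61.25, 69.375, 95.375, 76.5]
```

The higher mean in the second half is consistent with a modest upward trend, and the position averages are lowest in weeks one and two, peak in week three and fall back in week four.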
Lorenz Curve
A Lorenz curve plots the cumulative frequency distributions (ogives) of two numeric random variables against each other. Its purpose is to show the degree of inequality between the values of the two variables.
For example, the Lorenz curve can be used to show the relationship between:
the value of inventories against the volume of inventories held by an organisation
the spread of the total salary bill amongst the number of employees in a company
the concentration of total assets amongst the number of companies in an industry
the spread of the taxation burden amongst the total number of taxpayers.
A Lorenz curve shows what percentage of one numeric measure (such as inventory value, total salaries, total assets or total taxation) is accounted for by given percentages of the other numeric measure (such as volume of inventory, number of employees, number of companies or number of taxpayers). The degree of concentration or distortion can be clearly illustrated by a Lorenz curve. It is commonly used as a measure of social/economic inequality. It was originally developed by M Lorenz (1905) to represent the distribution of income amongst households.
Follow these steps to construct a Lorenz curve:
Identify intervals (similar to a histogram) for the y-variable, for which the distribution across a population is being examined (e.g. salaries across employees).
Calculate the total value of the y-variable per interval (total value of salaries paid to all employees earning less than R1 000 per month; total value of salaries paid to all employees earning between R1 001 and R2 000 per month; etc.).
Calculate the total number of objects (e.g. employees, households or taxpayers) that fall within each interval of the y-variable (number of employees earning less than R1 000 per month; number of employees earning between R1 001 and R2 000 per month; etc.).
Derive the cumulative frequency percentages for each of the two distributions above.
Scale each axis (x and y) from 0% to 100%.
For each interval of the y-variable, plot each pair of cumulative frequency percentages on the axes and join the coordinates (similar to a scatter plot).
If the distributions are similar or equal, the Lorenz curve will result in a 45° line from the origin of both axes (called the line of uniformity or the line of equal distribution). The more unequal the two distributions, the more bent (concave or convex) the curve becomes.
A Lorenz curve always starts at coordinate (0%; 0%) and ends at coordinate (100%; 100%).
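The calculation behind these steps can be sketched as follows. The interval data below is hypothetical and chosen only to illustrate the arithmetic (it is not taken from any table in this chapter), and the function name `lorenz_points` is illustrative. The intervals are assumed to be listed in ascending order of the y-variable.

```python
def lorenz_points(counts, totals):
    """Return Lorenz curve coordinates as a list of
    (cumulative % of objects, cumulative % of value) pairs.

    counts: number of objects (e.g. employees) in each interval
    totals: total value (e.g. salaries paid) in each interval
    Intervals must be in ascending order of the y-variable.
    """
    n_total, v_total = sum(counts), sum(totals)
    points = [(0.0, 0.0)]  # the curve always starts at (0%; 0%)
    n_running = v_running = 0
    for n, v in zip(counts, totals):
        n_running += n
        v_running += v
        points.append((100 * n_running / n_total, 100 * v_running / v_total))
    return points


# Hypothetical example: 100 employees in four salary intervals.
employees_per_interval = [40, 30, 20, 10]
salaries_per_interval = [20_000, 45_000, 60_000, 75_000]

curve = lorenz_points(employees_per_interval, salaries_per_interval)
print(curve)
# [(0.0, 0.0), (40.0, 10.0), (70.0, 32.5), (90.0, 62.5), (100.0, 100.0)]
```

In this hypothetical case the lowest-paid 40% of employees receive 10% of the total salary bill, so the curve lies well below the 45° line of equal distribution.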
Example 2.8 Savings Balances versus Number of Savers Study
A bank wished to analyse the value of savings account balances against the number of savings accounts of a sample of 64 bank clients.
The two numeric frequency distributions and their respective percentage ogives (for the value of savings balances and number of savings accounts) are given in Table 2.11.
Table 2.11 Percentage cumulative frequency distributions of savings balances across savers
Construct the Lorenz curve of savings account balances against the number of savings accounts (savers).
Management Question
Are there equal proportions of savers across all levels of savings account balances?
Comment by inspecting the pattern of the Lorenz curve.
Solution
Figure 2.10 shows the Lorenz plot of the percentage of savers ogive against the percentage of savings balances ogive.
[Lorenz curve plot. Axes: Cumulative % of savings accounts; Cumulative % of total savings.]
Figure 2.10 Lorenz curve of distribution of savings balances across savers
Management Interpretation
The diagonal (45°) line shows an equal distribution of total savings across all savers (e.g. 60% of savings account clients hold 60% of the total value of savings).
In this example, an unequal distribution is evident. For example, almost half (47%) of all savers hold only 18% of total savings. At the top end of the Lorenz curve, it can be seen that the biggest 5% of savings accounts represent 22% of total savings. Overall, this bank has a large number of small savers, and a few large savers.
2.4 The Pareto Curve
A useful application of the bar chart – especially in quality control studies – is called a Pareto curve. A Pareto curve is a combination of a sorted bar chart and a cumulative categorical frequency table. In a sorted bar chart the categories on the x-axis are placed in decreasing order of frequency (or importance).
As a tool in quality management, its purpose is to graphically identify and separate the ‘critical few’ problems from the ‘trivial many’ problems (the 20/80 rule). For example, what are the top three causes of machine failure out of a possible 25 causes – and what percentage of failures do they represent? This allows a manager to focus on the few critical issues and address these issues ahead of the remaining many trivial issues.
Follow these steps to construct a Pareto curve:
Construct a categorical frequency table for the categorical random variable.
Rearrange the categories in decreasing order of frequency counts (or percentages).
Calculate the cumulative frequency counts (or cumulative percentages) starting from the highest frequency category (on the left) to the lowest frequency category (on the right).
Plot both the bar chart (using the left y-axis for the frequency counts or percentages) and the percentage cumulative frequency polygon (using the right y-axis) on the same x–y axes.
Example 2.9 Customer Complaints Study
A customer service manager has analysed 300 customer complaints received over the past year into eight categories, as shown in the categorical frequency table in Table 2.12.
(See Excel file C.2.3 – customer complaints.)
Table 2.12 Categorical frequency table of customer complaints
Code Description Count
1 Poor product knowledge 26
2 Product options limited 47
3 Internet site frequently down 12
4 Slow response times 66
5 Unfriendly staff 15
6 Non-reply to queries 22
7 Cost of service is excessive 82
8 Payment options limited 30
Total 300
Use the table to construct a Pareto curve.
Management Questions
1 From the Pareto curve, identify the top three customer complaints.
2 What percentage of all complaints received do these top three complaints represent?
3 What is the least important complaint and what percentage of customers complained about this issue?
Solution
The categorical variable is ‘customer complaints’, which is grouped into eight categories.
Table 2.13 shows the summary table for the calculation of the Pareto curve where the categories have been sorted by frequency count.
Table 2.13 Pareto curve – cumulative % table of sorted customer complaints
Code Description Count Cumulative count Cumulative percent
7 Cost of service is excessive 82 82 27%
4 Slow response times 66 148 49%
2 Product options limited 47 195 65%
8 Payment options limited 30 225 75%
1 Poor product knowledge 26 251 84%
6 Non-reply to queries 22 273 91%
5 Unfriendly staff 15 288 96%
3 Internet site frequently down 12 300 100%
Total 300
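The sorted cumulative table can be reproduced from the counts in Table 2.12 with a short calculation. The sketch below is illustrative (the names `complaints` and `pareto_table` are not from the text) and rounds each cumulative percentage to the nearest whole percent, which matches the values shown in Table 2.13.

```python
# Complaint counts from Table 2.12: code -> (description, count).
complaints = {
    1: ("Poor product knowledge", 26),
    2: ("Product options limited", 47),
    3: ("Internet site frequently down", 12),
    4: ("Slow response times", 66),
    5: ("Unfriendly staff", 15),
    6: ("Non-reply to queries", 22),
    7: ("Cost of service is excessive", 82),
    8: ("Payment options limited", 30),
}


def pareto_table(categories):
    """Sort categories by decreasing count and add cumulative columns.

    Returns rows of (code, description, count, cumulative count,
    cumulative percent rounded to a whole number).
    """
    total = sum(count for _, count in categories.values())
    ordered = sorted(categories.items(), key=lambda item: item[1][1], reverse=True)
    rows, running = [], 0
    for code, (description, count) in ordered:
        running += count
        rows.append((code, description, count, running, round(100 * running / total)))
    return rows


table = pareto_table(complaints)
for row in table:
    print(row)

top_three_codes = [row[0] for row in table[:3]]  # [7, 4, 2]
top_three_percent = table[2][4]                  # 65
```

The bars of the Pareto curve are the counts in this sorted order (left y-axis), and the cumulative percentages form the polygon plotted against the right y-axis.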