Aggregation and Statistical Functions - 2008 Dashboard Best Practices

There are many functions that aggregate or summarize information in terms of single measures, such as average values. The most common of these are detailed in the following sections.

MAX,MIN, AVERAGE, AVERAGEA, MEDIAN, and MODE

MAX(number1,number2,...) MIN(number1,number2,...) AVERAGE(number1,number2,...) AVERAGEA(Value1,Value2,...) MEDIAN(number1,number2,...) MODE(number1,number2,...)

MAXreturns the largest value found in a list of numbers or ranges of cells.

MINreturns the lowest value found in a list of numbers or ranges of cells. It is important to

note that negative values are always lower than any positive value, no matter how small they are. Positive values are always greater than any negative value.

Here are some sample calculations:

=MAX(10,20,45) returns 45 =MAX(10,20,SQRT(10000)) returns 100 =MIN(-986,0.002,1200,0) returns -986 =MIN(986,-0.002,1200,0) returns -0.002

83 Mathematical and Statistical Functions in Xcelsius 2008

4 AVERAGEreturns the statistical mean or average value over a set of numbers or range of cells.

Here is a sample calculation:

=AVERAGE(240, 180) returns 210

There are some important subtleties to keep in mind when using AVERAGEand the similar

function AVERAGEA_.AVERAGE_{tries to be intelligent about how values are computed. Suppose}

you are tracking average monthly sales for the year. You just finished the month of July and have seven months of data in cells A1:A7. Of course, you can use a formula like this:

=AVERAGE(A1:A7)

It would be time-consuming and error prone if you rewrote this formula every time an additional month’s worth of data were added. It would be a whole lot simpler if you could write a formula that accommodates all 12 months in a year and intelligently ignores the blank cells or cells with text labels. Fortunately, AVERAGEdoes precisely that. For example, the

following formula:

=AVERAGE(A1:A12)

sensibly computes averages based on the non-blank cells. There are times when you want to take into account the blank cells in a range.

The function AVERAGEA_{is a little more literal in its computation of averages. It treats empty}

cells, cells with text labels, and cells with a FALSEvalue as if they are zero values. It treats TRUEvalues as if they are the value 1, and it treats numeric values as whatever the numeric

value happens to be.

Understanding theMEDIAN_{function is fairly straightforward. For a dataset, it arranges the}

data points from lowest to highest. Then it removes both the highest and lowest data points. Your dataset shrinks a bit. The function repeats this process of removing the highest and lowest data points until there is only one member or data point left standing. That member is the value returned by MEDIAN_{. If there is an even number of data points,}MEDIAN_{returns an}

appropriate interpolated value. The order of the data members doesn’t change the outcome of the computed median. The following is an example:

MEDIAN(15,12,13) = MEDIAN(12,13,15).

The following are sample calculations:

=MEDIAN(1,9,5,6,8) returns 6 =MEDIAN(1,3,99,2,3) returns 3 =MEDIAN(1,2,976,977,978) returns 976 =MEDIAN(1,3,11,11) returns 7

MODEreturns the most common value in a dataset. MODEtends to be most useful for coarse-

grained classification. Chapter 9, “Xcelsius and Statistics,” addresses some practical issues to keep in mind when working with MODE.

84 Chapter 4 Embedded Spreadsheets: The Secret Sauce of Xcelsius 2008

COUNT,COUNTA, and COUNTIF

COUNT(value1,value2,...) COUNTA(value1,value2,...) COUNTIF(range,criteria)

COUNTreturns the number of cells that contain numbers and also numbers within a list of

arguments.

COUNTAindicates how many values are in a list of arguments.

COUNTIFcounts the number of cells that meet the criteria specified in the argument. COUNT_{tallies the number of numeric entries found in the arguments supplied to}COUNT_.

For example, if there are 13 cells populated with numeric values, and the cell D29 contains the value 1.239, then the following results would be returned:

=COUNT(A1:B3,G1:K4) returns 13 =COUNT(A1:B3,D29,G1:K4) returns 14 =COUNT(A1:B3,D29,G1:K4)+2+78-65 returns 15 COUNTA_{is similar to}COUNT_{except that non-blank cells are also counted. With}COUNTA_,

non-blank cells can be text, numeric, or any kind of formula.

For example, open a new spreadsheet and enter the following in cells A1, A2, and A3:

99 (cell B1) =IF(B1<23,”B1 is smaller than 23”,””) (cell B2) B3 has text in it (cell B3)

Leave the cell B4 blank, and you will find the following:

=COUNTA(B1:B4) returns 3 =COUNT(B1:B4) returns 1

COUNTIFtells you how many cells match a specific criterion for a range of cells that’s identi-

fied. For example, you may want to see how many of your 40 sales reps have reached their quota of $50,000. If you have this data in cells A1:A40, then your count of how many achieved their target would be computed using this formula:

=COUNTIF(A1:A40,”>=50000”)

This formula works but it is a bit restrictive because it is hardwired to the threshold of $50,000. That may be the target for this month, but next month, you might want to up it to $53,000. A better construction would be to remove the hardwired dependency by placing the threshold in a separate cell, such as B1. Your formula could then look like this:

=COUNTIF(A1:A40,”>=”&B1)

If you define a named range for the threshold (cell B1) as SalesQuotaand your sales data

(cells A1:A40) as SalesTeamData, you could write your formula as follows: =COUNTIF(SalesTeamData,”>=”&SalesQuota)

This formula is a lot easier to understand than the first COUNTIFexample given a few lines

earlier. It is also more flexible because you can control the formula’s behavior externally. For example, because the threshold is isolated to a single cell, you could set the threshold value from a slider or dial on the Xcelsius canvas.

85 Mathematical and Statistical Functions in Xcelsius 2008

4 If you want to start getting creative at the spreadsheet level, you can come up with all sorts

of interesting metrics or measures for your dashboards. For example, you could find the number of sales reps who performed above average by using either of the following formulas:

=COUNTIF(A1:A40,”>=”&AVERAGE(A1:A40))

=COUNTIF(SalesTeamData,”>=”&AVERAGE(SalesTeamData))

SUM, SUMIF,SUMPRODUCT, and SUMSQ

SUM(number1,number2,...)

SUMIF(range,criteria[,sum_range]) SUMPRODUCT(array1,array2,array3,...) SUMSQ(number1,number2,...)

If you are accustomed to working with Excel, you are already familiar with SUM. SUMcan

accept a range of cells as its “arguments.” You can have formulas like these:

=SUM(A1:B12) =SUM($A$1:C19)

SUMcan also accept multiple arguments separated by commas: =SUM(DailyExpensesForJanuary,DailyExpensesForFebruary)

Like essentially all other Xcelsius-supported Excel functions, SUMevaluates its arguments

before passing them to the function. Therefore, if you encounter a formula like this:

=SUM(3.14,12,1+2+3)

1+2+3is evaluated and converted to 6 so that the formula is effectively treated as SUM(3.14,12,6)_.

C A U T I O N

There are some differences between the way Xcelsius computes with SUMand the way

Excel computes with SUM. In Excel, SUMtreats text that resembles a number as a number. In Xcelsius 2008, SUMblissfully ignores text that resembles numbers. The same holds true

for true/false values and also for dates that appear as text. The following Xcelsius 2008 examples make this clear:

=SUM(50,25,”35”) returns 75 =SUM(3.14,TRUE) returns 3.14 =SUM(3.14,6=3*2) returns 3.14 =SUM(“12/31/2009”) returns 0

If you were to compute these in Excel, the results would be 110,4.14,4.14, and 40178, respectively.

One of the reasons the spreadsheet metaphor is popular is because it is easy to work with numeric data in rows and columns. It is easy to think about multiplying all the numbers in one column (such as price) by the corresponding numbers (such as units sold) in a second column, to arrive at a combined quantity in a third column (such as sales). While you can get all this detail on a line-by-line basis, sometimes all you need is the total result. The

86 Chapter 4 Embedded Spreadsheets: The Secret Sauce of Xcelsius 2008

SUMPRODUCTfunction exists for such circumstances. If you didn’t have a third column, you

could write a formula like either of the following:

=SUMPRODUCT(A1:A99,B1:B99) =SUMPRODUCT(Price,Units)

In one fell swoop, you could calculate the aggregate sales. This is a powerful method for directly and efficiently computing numbers you may need. You can make SUMPRODUCTeven

more versatile, as shown in the following example.

Assume that you are tracking the changes in price and quantity sold for a single product over time. For the sake of simplicity, consider only 12 lines of such data, 1 for each month of the year. The formula SUMPRODUCT(Price,Units)results in a single number for the whole year.

What if you want to calculate cumulative sales as of any month and graph only the relevant data? Say that a reportable period includes all months on or before some point in time of interest, as set by clicking a visual control on a dashboard. It might be convenient to tag these in a third column, alongside price and units (as in column H in Figure 4.6).

Figure 4.6 SUMPRODUCTof price, units, and reportable period calculates cumulative sales.

The month number (input cell J6) is set by a Spinner component on the canvas. In column H, the cells are set to either 1or 0, depending on whether the month number in column A

is less than or equal to the value in cell J6. The following formula appears in cell H3:

=IF($A3<>$J$6,1,0)

J6 is an absolute reference, so its cell coordinate won’t change as you replicate the formula. A3 is referenced as $A3, so only the column is fixed. This way, you can freely replicate the formula down the column. If you want to replicate the formula to other columns, you want the formula to reference column A. That’s why you see the $ symbol in front of the column letter.

87 Mathematical and Statistical Functions in Xcelsius 2008

4 If you define Priceas cells F3:F14, Unitsas cells G3:G14, and ReportableMonthsas cells

H3:H14, you can compute the cumulative sales with this formula:

=SUMPRODUCT(Price,Units,ReportableMonths)

Needless to say, there are plenty of other ways to calculate the cumulative sales, but this is one idea of how you can use SUMPRODUCT_.

There is one thing you must keep in mind with SUMPRODUCT: All the arguments supplied to it

have to be identically sized. For example, if Priceis 12 cells stacked up in a vertical column, Unitsand ReportableMonthsmust also each be 12 cells stacked up in a vertical column. You

cannot combine different-sized cell ranges in SUMPRODUCT_{. The following formula, for exam-}

ple, will not work:

=SUMPRODUCT(1.5,A1:A12,B1:B12)

Because 1.5is a single number, you can pull it out of the function and apply it correctly by

using this formula:

=1.5*SUMPRODUCT(A1:A12,B1:B12)

N O T E

In Figure 4.6 and the ch04_ContextSwitchingExample.xlffile, the Radio Button component allows you to switch between displaying price and units. Effectively, this gives you two views from a single chart. I call this context switching. I explain context switching in detail in Chapter 5, “Using Charts and Graphs to Represent Data.”

SUMIF_{is a very useful facility for gathering certain kinds of information.}SUMIF_{works a little}

like COUNTIF, except that it performs a sum of the range that matches a criterion rather than

counting the number of times a match is found. You could have a list of numbers in, say, cells A1:A100. These numbers could be insurance claims. Perhaps there might be some deductible, possibly $500. If you want to find the sum of these claims for which there would be no reimbursement, you could use a formula like this:

=SUMIF(A1:A100,”<=500”)

Of course, it would be better to remove the hardwired reference to the $500 deductible and park it in another cell, such as B1. Your formula would then look like this:

=SUMIF(A1:A100,”<=”&B1)

Notice that the range (cells A1:A100) is the same cells used for the testing and the summa- tion. There may be circumstances in which the test for inclusion in the conditional sum doesn’t directly involve the value being summed. For such situations, there is a second form of SUMIFthat incorporates three arguments: a range of cells for testing against some criteria,

the test criteria, and a corresponding set of cells that would be summed when matches are found. A formula for estimating potential exposure if customers having a lower credit rating than a given threshold default on their loans could be something like this:

88 Chapter 4 Embedded Spreadsheets: The Secret Sauce of Xcelsius 2008

Figure 4.7 shows how this might be applied in a dashboard. The SUMIFformula is now

incorporated into a more full-featured formula, as you can see here:

=”Potential exposure is

➥”&DOLLAR(SUMIF(CustRating,”<”&(Alert_Threshold_Pct*(MAX(CustRating) ➥-MIN(CustRating))+MIN(CustRating)),LoanAmt),0)

Notice in the dashboard shown in Figure 4.7 that the radio button controls context switching and that conditional formatting is applied by using the vertical slider.

Figure 4.7 SUMIFis used to compute potential exposure of out- standing loans to customers.

SUMSQreturns the sum of the squares of a series or range of numbers. It looks like an eso-

teric function that would be used infrequently. Actually, it is very practical. The concept is simple: The square of a number, whether positive or negative, always results in a positive value, and you can use this to your advantage. You may be populating a range of cells with some values. How can you determine whether the range is non-empty? You could try a formula like this:

=IF(SUM(C3:L3)<>0,”The cells are non-empty”,”The cells are empty”)

The <>is a “Not Equals” comparison operator. The formula almost works, but it is not

guaranteed to always work correctly. Suppose there are precisely five values of +1_{and five}

values of -1? The sum would result in the value zero, and the formula would fail. The fol-

lowing would be a better formula:

89 Mathematical and Statistical Functions in Xcelsius 2008

LARGE,SMALL, and RANK

LARGE(array,k) SMALL(array,k)

RANK(number,ref,order)

LARGEreturns the kth largest value in a dataset. SMALLreturns the kth smallest value in a dataset. RANK_{returns the rank of a number in a list of numbers.}

Displaying information in a horizontal bar chart can quickly become dizzying if you don’t organize the sequence of data for the chart. You could, of course, sort the data from largest to smallest or vice versa. Hand sorting doesn’t work very well, though, because you would have to manually sort the data every time it changes. There is a way you can actually sort the data, sort of. The technique doesn’t really involve sorting the data but rather picking items from an array of cells, by largest to smallest or smallest to largest, one at a time.

You could have a series of formulas in a column of cells like the following:

=LARGE($A$1:$A$10,1) =LARGE($A$1:$A$10,2) =LARGE($A$1:$A$10,3) ...

=LARGE($A$1:$A$10,10)

For all intents and purposes, this list of cells is automatically sorted, even as the values of the chart data (cells A1:A10) change. Plotting this sorted data on a bar chart works well, but there is one minor detail: Bar charts stack their data in reverse order of the way it appears on the spreadsheet. Consequently, instead of having a chart resembling a tornado (with the largest items appearing at the top and the smallest at the bottom), you would get a chart resembling a pyramid. There’s an easy way around this problem. Instead of ordering your sequence from largest to smallest, you do it the other way around:

=LARGE($A$1:$A$10,10) =LARGE($A$1:$A$10,9) =LARGE($A$1:$A$10,8) ...

=LARGE($A$1:$A$10,1)

Figure 4.8 shows an illustration of this. As in previous examples in this chapter, you can switch the data series by using context switching.

Alternatively, you could use the SMALLfunction and reverse the lookup index to achieve the

same effect as you get with LARGE_.

RANKis kind of complementary to SMALLand LARGE. Instead of plucking out values from a

range of cells, it lets you know how high up the food chain the value of a specific cell is.

N O T E

Duplicate items in a range of cells are given exactly the same rank. This can cause some complications if you need to uniquely distinguish each and every item on a list (that is, keep track of all the individual duplicates).

90 Chapter 4 Embedded Spreadsheets: The Secret Sauce of Xcelsius 2008

4 STDEVSTDEV(number1,number2,...)andVAR VAR(number1,number2,...)

STDEVestimates standard deviation based on a sample, ignoring text and logical values. VAR_{estimates variance based on a sample, ignoring logical values and text.}

Data can be voluminous. You might be collecting data for, say, insurance claim amounts, the real cost of installing fiber-optic cable on a property-by-property basis, or profit on product sales. If you want to analyze the overall import of such data to your business, it would make sense to aggregate these quantities and compute total expenses or whatever item you are measuring.

There are times, however, when it is more appropriate to characterize a statistical profile in terms of individual transactions or events. At its most rudimentary level, the AVERAGEfunction

gives you this kind of measure. It may not be informative enough. To help prove the point, let’s look at some specifics. Supposed you have the following sales data for six transactions:

120, 75, 86, 128, 92, 102

The average sale here is 100.5. You can verify this by entering values in your spreadsheet

and using the AVERAGEfunction. It may be that for the next 100 or so transactions, you’ll

really average just a tad over $100 per transaction. Assuming that the data you have is representative of future transactions, what would you guess the seventh transaction would be? Lacking any additional information, your probable best guess would be $100.50. How cer-

tain are you that you will get the value correct? If all the numbers in your sample were pretty close to the average, you’d be more certain than if they were all over the place.

Figure 4.8 Auto-sorted tornado- style bar charts can be created using

LARGE(ChartData,

ReverseRank Index).

91 Mathematical and Statistical Functions in Xcelsius 2008

4 Consider another dataset:

98, 110, 97, 102, 95, 101

Here, the average sale is also 100.5. If you are projecting the seventh transaction, you could

use the same $100.50 estimate. Assuming that this data is representative of future transactions, how far off do you think your estimate would be? There is a measure known as standard deviation that gives you a handle on this:

=STDEV(120, 75, 86, 128, 92, 102) returns approx 20.3 =STDEV(98, 110, 97, 102, 95, 101) returns approx 5.3

Notice in the first dataset that four of the six transactions fall between 100.5-20.3_and 100.5+20.3. In the second dataset, four of the six transactions fall between 100.5-5.3and 100.5+5.3. Perhaps standard deviation could be a good measure to characterize your histo-

rical data in quantitative terms. It turns out that as your sample size grows, the reliability of standard deviation as a way to characterize your data increases.

As a rule of thumb, roughly two-thirds of your data will fall between your average plus or minus the standard deviation, and close to 95 percent of your data will fall between your average plus or minus two times the standard deviation.

Variance and standard deviation are closely related measures. The choice between the usage of one metric and the other is largely a matter of industry practice and convenience. In the world of risk management and the actuarial sciences, variance is more popular than standard deviation:

=VAR(120, 75, 86, 128, 92, 102) returns approx 414.3 =VAR(98, 110, 97, 102, 95, 101) returns approx 28.3

N O T E

The variance is approximately the square of the standard deviation.

RAND, NORMINV, and NORMDIST

RAND()

NORMDIST(x,mean,standard_dev,cumulative) NORMINV(probability[,mean=0,standard_dev=1])

RAND_{is a pseudo-random number generator. It returns a decimal number between}0_and1_.

Every time the RANDfunction is recalculated, it is virtually guaranteed to return a different

number between 0and 1. What that number will be is as good as anybody’s guess, and this

is what makes RANDsuitable for simple probability analysis.

One of the characteristics of the RAND_{function is that the values it returns are pretty much}

In document 2008 Dashboard Best Practices (Page 101-111)