Aggregates and Derived Data - teradata-sqlunleash

The various aggregates can work on any column. However, most of the aggregates only work with numeric data. The COUNT function might be the primary one used on either character or numeric data. The aggregates can also be used with derived data.

The following table is used to demonstrate derived data and aggregation:

Figure 5-2

This SELECT totals the salaries for all employees and show what the total salaries will be if everyone is given a 5% or a 10% raise:

SELECT SUM(salary) (Title 'Salary Total', Format '$$$$,$$9.99') ,SUM(salary*1.05) (Title '+5% Raise', Format '$$$$,$$9.99') ,SUM(salary*1.10) (Title '+10% Raise', format '$$$$,$$9.99') ,AVG(salary) (Title 'Average Salary', format '$$$$,$$9.99') ,SUM(salary) / COUNT(salary) (Title 'Computed Average Salary') ( format '$$$$,$$9.99')

FROM Employee_table;

1 Row Returned Salary

Total +5% Raise +10%

Raise_ Average

Salary Computed

Average Salary

$421,039.38 $442,091.35 $463,143.32 $46,782.15 $46,782.15

Notice that since both TITLE and FORMAT require parentheses, they can share the same set. Also, the AVG function and dividing the SUM by the COUNT provide the same answer.

GROUP BY

It has been shown that aggregates produce one row of output with one value per aggregate.

However, the above SELECT is inconvenient if individual aggregates are needed based on different values in another column, like the class code. For example, you might want to see each aggregate for freshman, sophomores, juniors, and seniors.

The following SQL might be run once for each unique value specified in the WHERE clause for class code, here the aggregates only work on the senior class (‘SR’):

SELECT SUM(Grade_Pt) AS "Total"

,AVG(Grade_Pt) AS "Average"

,MIN(Grade_Pt) AS Smallest ,MAX(Grade_Pt) AS Highest ,COUNT(Grade_Pt) AS "Count"

FROM Student_table WHERE class_code = 'SR' ; 1 Row Returned

Total Average Smallest Largest Count

6.35 3.175 3.00 3.35 2

Although this technique works for finding each class, it is not very convenient. The first issue is that each unique class value needs to be known ahead of time for each execution.

Second, each WHERE clause must be manually modified for the different values needed.

Lastly, each time the SELECT is executed, it produces a separate output. In reality, it might be better to have all the results in a single report format.

Since the results of aggregates are incorporated into a single output line, it is necessary to create a way to provide one line returned per a unique data value. To provide a unique value, it is necessary to select a column with a value that groups various rows together. This column is simply selected and not used in an aggregate. Therefore, it is a not an aggregated column.

However, when aggregates and "non-aggregates" (normal columns) are selected at the same time, a 3504 error message is returned to indicate the mixture and that the non-aggregate is not part of an associated group. Therefore, the GROUP BY is required in the SQL statement to identify every column selected that is not an aggregate.

The resulting output consists of one line for all aggregate values for each unique data value stored in the column(s) named in the GROUP BY. For example, if the department number is used from the Employee table, the output consists of one line per department with at least one employee working in it.

The next SELECT uses the GROUP BY to create one line of output per unique value in the class code column:

SELECT Class_code

,SUM(Grade_Pt) AS "Total"

,AVG(Grade_Pt) AS "Average"

,MIN(Grade_Pt) AS Smallest ,MAX(Grade_Pt) AS Highest ,COUNT(Grade_Pt) AS "Count"

FROM Student_table GROUP BY Class_code ; 5 Rows Returned

Class_code Total Average Smallest Largest Count

FR 6.88 2.29 0.00 4.00 2

? ? ? ? ? 0

JR 5.85 2.925 1.90 3.95 2

SR 6.35 3.175 3.00 3.35 2

SO 5.80 2.9 2.00 3.80 2

Notice that the null value in the class code column is returned. At first, this may seem contrary to the aggregates ignoring nulls. However, class code is not being aggregated and is selected as a "unique value." All the aggregate values on the grade point for this row are null, except for COUNT. Although, the COUNT is zero and this does indicate that the null value is ignored. The COUNT value initially starts at zero, so: 0 + 0 = 0.

The GROUP BY is only required when a non-aggregate column is selected along with one or more aggregates. Without both a non-aggregate and a GROUP BY clause, the aggregates return only one row. Whereas, with a non-aggregate and a GROUP BY clause designating the column(s), the aggregates return one row per unique value in the column, as seen above.

Additionally, more than one non-aggregate column can be specified in the SELECT and in the GROUP BY clause. The normal result of this is that more rows are returned. This is because one row appears whenever any single column value changes, the combination of each column constitutes a new value. Remember, all non-aggregates selected with an aggregate must be included in the GROUP BY, or a 3504 error is returned.

As an example, the last name might be added as a second non-aggregate. Then, each combination of last name and class code are compared to other students in the same class.

This combination creates more lines of output. As a result, each aggregate value is primarily the aggregation of a single row. The only time multiple rows are processed together is when multiple students have the same last name and are in the same class. Then they group together based on the values in both columns being equal.

This SELECT demonstrates the correct syntax when using multiple non-aggregates with aggregates and the output is one line of output for each student:

SELECT Last_name ,Class_code

,SUM(Grade_Pt) AS "Total"

,AVG(Grade_Pt) AS "Average"

,MIN(Grade_Pt) AS Smallest ,MAX(Grade_Pt) AS Highest ,COUNT(Grade_Pt) AS "Count"

FROM Student_table GROUP BY 1, 2 ; 10 Rows Returned

Last_name Class_code Total Average Smallest Largest Count

Johnson ? ? ? ? ? 0

Last_name Class_code Total Average Smallest Largest Count

Thomas FR 4.00 4.00 4.00 4.00 1

Smith SO 2.00 2.00 2.00 2.00 1

McRoberts JR 1.90 1.90 1.90 1.90 1

Larkins FR 0.00 0.00 0.00 0.00 1

Phillips SR 3.00 3.00 3.00 3.00 1

Delaney SR 3.35 3.35 3.35 3.35 1

Wilson SO 3.80 3.80 3.80 3.80 1

Bond JR 3.95 3.95 3.95 3.95 1

Hanson FR 2.88 2.88 2.88 2.88 1

Beyond showing the correct syntax for multiple non-aggregates, the above output reveals that it is possible to request too many non-aggregates. As seen above, every output line is a single row. Therefore, every aggregated value consists of a single row. Therefore, the aggregate is meaningless because it is the same as the original data value. Also notice that without an ORDER BY, the GROUP BY does not sort the output rows.

Like the ORDER BY, the number associated with the column's relative position within the SELECT can also be used in the GROUP BY. In the above example, the two columns are the first ones in the SELECT and therefore, it is written using the shorter format: GROUP BY 1,2.

Caution: Using the shorter technique can cause problems if the location of a non-aggregate is changed in the SELECT list and the GROUP BY is not changed. The most common problem is a 3504 error message indicating that a non-aggregate is not included in the GROUP BY, so the SELECT does not execute.

As previously shown, the default for a column heading is the column name. It is not very pretty to see the name of the aggregate and column used as a heading. Therefore, an alias is suggested in all tools or optionally, a TITLE in BTEQ to define a heading.

Also seen earlier, a COUNT on the grade point for department null is zero. Actually, this is misleading in that 1 row contains a null not zero rows. But, because of the null value, the row is not counted. A better technique might be the use of COUNT(*), for a row count. Although this implies counting all columns, in reality it counts the row. The objective of this request is to find any column that contains a non-null data value.

Another method to provide the same result is to count any column that is defined as NOT NULL. However, since it takes time to determine such a column and its name is longer than typing an asterisk (*), it is easier to use the COUNT(*).

Again, the GROUP BY clause creates one line of output per unique value, but does not perform a sort. It only creates the distinct grouping for all of the columns specified.

Therefore, it is suggested that you always include an ORDER BY to sort the output.

The following might be a better way to code the previous request, using the COUNT(*) and an ORDER BY:

SELECT Class_code

,SUM(Grade_Pt) AS "Total"

,AVG(Grade_Pt) AS "Average"

,MIN(Grade_Pt) AS Smallest ,MAX(Grade_Pt) AS Highest ,COUNT(*) AS "Count"

FROM Student_table GROUP BY 1 ORDER BY 1 ; 5 Rows Returned

Class_code Total Average Smallest Largest Count

? ? ? ? ? 1

FR 6.88 2.29 0.00 4.00 3

JR 5.85 2.925 1.90 3.95 2

SO 5.80 2.9 2.00 3.80 2

SR 6.35 3.175 3.00 3.35 2

Now the output is sorted by the class code with the null appearing first, as the lowest "value."

Also notice the count is one for the row containing mostly NULL data. The COUNT(*) counts the row.

In document teradata-sqlunleash (Page 69-73)