GROUP BY: WHERE YOUR DATA BELONGS

---12

SELECT

COUNT(DISTINCT bk_publisher) AS pub_count FROM books;

pub_count ---9

The results are different because in the ﬁ rst query the DISTINCT keyword made sure that only distinct results of the COUNT function were returned. In this case, it is redundant because an aggregate function without the GROUP BY clause (see later in this chapter) always returns only a single record. In the second query, it makes sure that only distinct values are counted prior to the result being returned.

Even though some aggregate functions can be used with character data (or date and time data types), they are really designed for numbers. They accept neither XML nor binary data types as input.

GROUP BY: WHERE YOUR DATA BELONGS

By now, you might be wondering how you would go about selecting data based on the aggregated values. For instance, how do you ﬁ nd all publishers whose average book price is above $20? Surely, you could use the trusty WHERE clause, right? Uh, no. Try executing a query like this:

SELECT bk_publisher FROM books

WHERE AVG(bk_price) > 20;

You get an error message that in the context of Microsoft SQL Server 2008 reads as follows:

An aggregate cannot appear in the WHERE clause unless it is in subquery contained in a HAVING clause or a select list, and the column being aggregated in an outer reference.

c05.indd 144

c05.indd 144 3/15/2011 12:23:53 PM3/15/2011 12:23:53 PM

www.it-ebooks.info

Yet there is a way to ask the database engine such a question, and get a result, by using the GROUP BY clause (with the HAVING clause not far behind), but let’s get the basics nailed down ﬁ rst.

Aggregating results by rolling them up to a single value is useful, as is having all the data stored in the individual records, but sometimes a middle ground is needed. Using the WHERE clause with an aggregate function allows us to aggregate a subset of the results. For example, we can count books published by a particular publisher. To do the same for all publishers, we would have to run as many queries as there are distinct publishers in our table; moreover, we would need to know the publishers by name to ask for them, and we may not know the names of all the publishers. If only we could ask the database for aggregated information grouped by some criterion without supplying hard-coded values. Here’s where the GROUP BY clause comes in.

The GROUP BY clause only works in tandem with aggregate functions and appears after all the SELECT, WHERE, and JOIN keywords in the query. To produce an aggregated price list by pub-lisher (to discover how much you have spent buying books from this particular pubpub-lisher) without ever knowing what publishers are in the table, use the following query:

SELECT

bk_publisher

, SUM(bk_price) AS books_total FROM books GROUP BY bk_publisher;

bk_publisher books_total

--- ---Ace Trade 16.29

Broadway 18.48 Gallimard 69.26 Greenbook Publications 8.89 MacMillan 38.88 Microsoft 24.99 Orb Books 16.29 Picador 10.20 Wiley 151.63

We know that there are exactly nine distinct publishers in our BOOKS table, so nine records were returned. Only for one publisher, Wiley, have we multiple books in the table, and the prices of these books were summed up and attributed to Wiley — along with other publishers and their summaries, respectively. All of this was provided in a single query, without having to spell names of the speciﬁ c publishers. This is the beauty of the declarative nature of SQL.

The returned results in the previous query are sorted alphabetically by BK_

PUBLISHER, in descending order; this is the default. To sort it in a different order, you would need to use an additional ORDER BY clause, which is covered later in this chapter.

Every aggregate function, or combination thereof, could be used with GROUP BY clause. For exam-ple, to calculate not only SUM, but also AVG, MI, MAX, and COUNT, you could include it all in a single query:

c05.indd 145

c05.indd 145 3/15/2011 12:23:54 PM3/15/2011 12:23:54 PM

www.it-ebooks.info

SELECT

bk_publisher

, SUM(bk_price) AS books_total , AVG(bk_price) AS avg_price , MIN(bk_price) AS min_price , MAX(bk_price) AS max_price ,COUNT(bk_price) AS book_count FROM books

The results are predictable. With the exception of Wiley books, all other publishers are represented by a single book (output of the COUNT() function) and their total, min, max, and average prices are identical across the board. The differences are shown in Wiley’s case, where the product of the aggregate functions over prices for its four books is calculated.

It is possible to GROUP BY more than one column, but you need to understand how the set logic works to be able to formulate a question. Adding an additional column to your GROUP BY when selecting from a single table would produce a result set requiring additional manual steps to make sense of. For example, simply adding BK_TITLE to BK_PUBLISHER in the query asking to sum-marize BK_PRICE will give you results indistinguishable from a standard SELECT query:

SELECT

bk_publisher AS publisher , SUM(bk_price) AS books_total FROM books Greenbook Publications 8.89 Wiley 29.99

The order appears to be wrong. Should it not be ordered by BK_PUBLISHER? No, by default it is ordered by the last column speciﬁ ed in the GROUP BY clause; in this case, by BK_TITLE (even

c05.indd 146

c05.indd 146 3/15/2011 12:23:55 PM3/15/2011 12:23:55 PM

www.it-ebooks.info

though the column itself is not on the select list). The column BK_TITLE does not have to appear on the select list to be included in the GROUP BY clause, but the opposite is not true: all columns listed in the GROUP BY clause must be included in the SELECT list (with the exception of the aggregated function column).

You won’t be able to use an aggregate SUM() function in the standard SELECT query which includes non-aggregated columns on the list because the database engine would get confused by being asked to produce both record-level and aggregate results without a GROUP BY, but sum of a single value is, by deﬁ nition, this value.

If your question involves data from another table JOIN(ed) together, the results are much more interesting. To ﬁ nd the total price of the books published by a particular publisher per shelf in your bookcase, you might use the following query:

SELECT

SUM(bk_price) AS books_total , bk_publisher AS publisher FROM books INNER JOIN location

A total of nine records were returned, a sign that we have grouped the results by nine distinct pub-lishers. In addition, we can see price distribution by shelf, with all Wiley books being stored on shelf

#5, totaling $151.63.

GROUP BY could be used all by itself without any aggregate functions to support. A query, such as the following, would return a list of publishers grouped by, well, publisher:

SELECT bk_publisher AS publisher

c05.indd 147 3/15/2011 12:23:55 PM3/15/2011 12:23:55 PM

www.it-ebooks.info

Each publisher now appears only once on the list, and the result is sorted alphabetically, in default ascending order. If you’re saying that the same could be achieved by simply selecting BK_PUBLISHER with the DISTINCT keyword, you are absolutely correct. In fact, there is a good chance that behind the scenes RDBMSs would treat both statements in the same way, prepar-ing the same execution plan (see Chapter 9 for more information on execution plans and query optimization).

In document Discovering SQL (Page 178-182)