Performance Opportunities - teradata-sqlunleash

The Teradata optimizer has always had options available to it when performing SQL. It always attempts to use the most efficient path to provide the answer set. This is true for aggregation as well.

When performing aggregation, the main shortcut available might include the use of a secondary index. The index row is maintained in a subtable. This row contains the row ID (row hash + uniqueness value) and the actual data value stored in the data row.

Therefore, an index row is normally much shorter than a data row. Hence, more index rows exist in an index block than in a data block.

As a result, the read of an index block makes more values available than the actual data block. Since I/O is the slowest operation on all computer systems, less I/O equates to faster processing. If the optimizer can obtain all the values it needs for processing by using the secondary index, it will. This is referred to as a "covered query."

The creation of a secondary index is covered in this book as part of the Data Definition Language (DDL) chapter.

Chapter 6: Subquery Processing

Subquery

The subquery is a commonly used technique and powerful way to select rows from one table based on values in another table. It is predicated on the use of a SELECT statement within a SELECT and takes advantage of the relationships built into a relational database. The basic concept behind a subquery is that it retrieves a list of values that are used for comparison against one or more columns in the main query. To accomplish the comparison, the subquery is written after the WHERE clause and normally as part of an IN list.

In an earlier chapter, the IN was used to build a value list for comparison against the rows of a table to determine which rows to select. The next example illustrates how this technique can be used to SELECT all the columns for rows containing any of the three different values 10, 20 and 30:

SELECT * FROM My_table

WHERE column1 IN ( 10, 20, 30 ) ; 4 Rows Returned

Column1 Column2 _

10 A row with 10 in column1

30 A row with 30 in column1

10 A row with 10 in column1

20 A row with 20 in column1

As powerful as this is, what if the three values turned into a thousand values. That is too much work and too many opportunities to forget one of the values. Instead of writing the values manually, a subquery can be used to generate the values automatically.

The coding technique of a subquery replaces the values previously written in the IN list with a valid SELECT. Then the subquery SELECT dynamically generates the value list. Once the values have been retrieved, it eliminates the duplicates by automatically performing a DISTINCT.

The following is the syntax for a subquery:

SELECT <column-name>

[,<column-name>]

FROM <table-name>

-- the subquery starts here to form the list of values to compare for the IN WHERE <column-name>[, <column-name2> [,<column-nameN> ]] IN

( SELECT <column-name> [,<column-name2>[, <column-nameN> ] ] FROM <table-name> )

;

Conceptually, the subquery is processed first so that all the values are expanded into the list for comparison with the column specified in the WHERE clause. These values in the

subquery SELECT can only be used for comparison against the column or columns referenced in the WHERE.

Columns inside the subquery SELECT cannot be returned to the user via the main SELECT.

The only columns available to the client are those in the tables named in the main (first) FROM clause. The query in parentheses is called the subquery and it is responsible for building the IN list.

At the writing of this document, Teradata allows up to 64 tables in a single query. Therefore, if each SELECT accessed only one table, a query might contain 63 subqueries in a single statement.

The next two tables are used to demonstrate the functionality of subqueries:

Figure 6-1

Figure 6-2

The next SELECT uses a subquery to find all customers that have an order of more than

$10,000.00:

SELECT Customer_name ,Phone_number FROM Customer_Table

WHERE Customer_number IN ( SELECT Customer_number FROM Order_table WHERE Order_total > 10000 ) ; 3 Rows Returned

Customer_name Phone_number Billy's Best Choice 555-1234

XYZ Plumbing 347-8954

Databases N-U 322-1012

This is an appropriate place to mention that the columns being compared between the main and subqueries must be from the same domain. Otherwise, if no equal condition exists, no rows are returned. The above SELECT uses the customer number (FK) in the Order table to match the customer number (PK) in the Customer table. They are both customer numbers and therefore have the opportunity to compare equal from both tables.

The next subquery swaps the queries to find all the orders by a specific customer:

SELECT Order_number ,Order_total

FROM Order_Table

WHERE Customer_number IN ( SELECT Customer_number

FROM Customer_table WHERE Customer_name LIKE 'Bill%');

3 Rows Returned

Order_number Order_total

123456 12347.53

123512 8005.91

Notice that the Customer table is used in the main query to answer a customer question and the Order table is used in the main query to answer an order question. However, they both compare on the customer number as the common domain between the two tables.

Both of the previous subqueries work fine for comparing a single column in the main table to a value list in the subquery. Thus, it is possible to answer questions like, "Which customer

has placed the largest order?" However, it cannot answer this question, "What is the maximum order for each customer?"

To make Subqueries more sophisticated and powerful, they can compare more than one column at a time. The multiple columns are referenced in the WHERE clause, of the main query and also enclosed in parentheses.

The key is this: if multiple columns are named before the IN portion of the WHERE clause, the exact same number of columns must be referenced in the SELECT of the subquery to obtain all the required values for comparison.

Furthermore, the corresponding columns (outside and inside the subquery) must

respectively be of the same domain. Each of the columns must be equal to a corresponding value in order for the row to be returned. It works like an AND comparison.

The following SELECT uses a subquery to match two columns with two values in the subquery to find the highest dollar orders for each customer:

SELECT Customer_number AS Customer ,Order_number

,Order_total FROM Order_table

WHERE (customer_number, order_total) IN

(SELECT customer_number, MAX(order_total) FROM order_table GROUP BY 1) ;

4 Rows Returned

Customer_number Order_number Order_total

11111111 123546 12347.53

57896883 123777 23454.84

31323134 123552 5111.47

87323456 123585 15231.62

Although this works well for MIN and MAX type of values (equalities), it does not work well for finding values greater than or less than an average. For this type of processing, a Correlated subquery is the best solution and will be demonstrated later in this chapter.

Since 64 tables can be in a single Teradata SQL statement, as mentioned previously, this means that a maximum of 63 subqueries can be written into a single statement. The following shows a 3-table access using two separate subqueries. Additional subqueries simply follow the same pattern.

From the above tables, it is also possible to find the customer who has ordered the single highest dollar amount order. To accomplish this, the Order table must be used to determine the maximum order. Then, the Order table is used again to compare the maximum with each order and finally, compared to the Customer Table to determine which customer placed the order.

The next subquery can be used to find them:

SELECT Customer_name ,Phone_number

FROM Customer_Table WHERE customer_number IN

(SELECT customer_number FROM Order_Table WHERE Order_total IN

(SELECT MAX(Order_total) FROM Order_Table) ) ; 1 Row Returned

Customer_name Phone_number

XYZ Plumbing 347-8954

It is now known that XYZ Plumbing has the highest dollar order. What is not known is the amount of the order. Since the order total is in the Order table, which is not referenced in the main query, it cannot be part of the SELECT list.

In order to see the order total, a join must be used. Joins will be covered in the next chapter.

Using NOT IN

As seen in a previous chapter, when using the IN and a value list, the NOT IN can be used to find all of the rows that do not match.

Using this technique, the subquery above can be modified to find the customers without an order. The only changes made are to 1) add the NOT before the IN and 2) eliminate the WHERE clause in the subquery:

SELECT Customer_name ,Phone_number FROM Customer_Table

WHERE Customer_number NOT IN

( SELECT Customer_number FROM Order_table) ; 1 Row Returned

Customer_name Phone_number Databases R Us 322-1012

Caution needs to be used regarding the NOT IN when there is a potential for including a NULL in the value list. Since the comparison of a NULL and any other value is unknown, and the NOT of an unknown is still an unknown no rows are returned. Therefore when there is potential for a NULL in the subquery, it is best to also code a compound comparison as seen in the following SELECT:

SELECT Customer_name ,Phone_number FROM Customer_Table

WHERE Customer_number NOT IN

( SELECT Customer_number FROM Order_table WHERE customer_number IS NOT NULL) ;

In document teradata-sqlunleash (Page 87-92)