Correlated Subquery Processing - teradata-sqlunleash

The correlated subquery is a very powerful tool. It is an excellent technique to use when there is a need to determine which rows to SELECT based on one or more values from another table. This is especially true when the value for comparison is based on an aggregate. It combines subquery processing and join processing into a single request.

For example, one Teradata user has the need to bill their customers and incorporate the latest payment date. Therefore, the latest date needs to be obtained from the table. So, the payment date is found using the MAX aggregate in the subquery. However, it must be the latest payment date for that customer, which might be different for each customer. The processing involves the subquery locating the maximum date only for one customer account.

The correlated subquery is perfect for this processing. It is more efficient and faster than using a normal subquery with multiple values. One reason for its speed is that it can perform some processing steps in parallel, as seen in an EXPLAIN. The other reason is that it only finds the maximum date when a particular account is read for processing, not for all accounts like a normal subquery.

The operation for a correlated subquery differs from that of a normal subquery. Instead of comparing the selected subquery values against all the rows in the main query, the correlated subquery works backward. It first reads a row in the main query, and then goes into the subquery to find all the rows that match the specified column value. Then, it gets the next row in the main query and retrieves all the subquery rows that match the next value in this row. This processing continues until all the qualifying rows from the main SELECT are satisfied.

Although this sounds terribly inefficient and is inefficient on other databases, it is extremely efficient in Teradata. This is due to the way the AMPs handle this type of request. The AMPs are smart enough to remember and share each value that is located.

Thus, when a second row comes into the comparison that contains the same value as an earlier row, there is no need to re-read the matching rows again. That operation has already been done once and the AMPs remember the answer from the first comparison.

The following is the syntax for writing a correlated subquery:

SELECT <column-name>

[,<column-name> ]

FROM <table-name1> [ AS <table-alias-name>]

WHERE <column-name> { = | < | > | <= | >= | <> } ( SELECT { MIN | MAX | AVG }(<column-name>) FROM <table-name2> [AS <table-alias-name> ]

WHERE <table-name1>.<column-name>=<table-name2>.<column-name> )

;

The subquery does not have a semi-colon of its own. The SELECT in the subquery is all part of the same primary query and shares the one semi-colon.

The aggregate value is normally obtained using MIN, MAX or AVG. Then this aggregate value is in turn used to locate the row or rows within a table that compares equals, less than or greater than this value.

This table is used to demonstrate correlated subqueries:

Figure 6-4

Using the above table, this Correlated subquery finds the highest paid employee in each department:

SELECT Last_name ,First_name ,Dept_no

,Salary ( format '$$$$,$$9.99' ) FROM Employee_Table AS emp WHERE Salary =

( SELECT MAX(Salary) FROM Employee_table AS emt WHERE emp.Dept_no = emt.Dept_no )

ORDER BY 3,1 ; 5 Rows Returned

Last_name First_name Dept_no Salary _

Smythe Richard 10 $64,300.00

Chambers Mandee 100 $48,850.00

Smith John 200 $48,000.00

Larkins Loraine 300 $40,200.00

Strickling Cletus 400 $54,500.00

Notice that both of the tables have been assigned alias names (emp for the main query and emt for the correlated subquery). Because the same Employee table is used in the main query and the subquery, one of them must be assigned an alias. The aliases are used in the subquery to qualify and match the common domain values between the two tables. This coding technique "correlates" the main query table to the one in the subquery.

The following Correlated subquery uses the AVG function to find all employees who earn less than the average pay in their department:

SELECT Last_name ,First_name ,Dept_no

,Salary ( format '$$$$,$$9.99' ) FROM Employee_Table AS emp

WHERE Salary <= ( SELECT AVG(Salary) FROM Employee_table AS e WHERE emp.Dept_no = e.Dept_no )

ORDER BY 3,1 ; 5 Rows Returned

Last_name

_ First_name Dept_no Salary _

Smythe Richard 10 $64,300.00

Chambers Mandee 100 $48,850.00

Coffing Billy 200 $41,888.88

Larkins Loraine 300 $40,200.00

Reilly William 400 $36,000.00

Earlier in this chapter, it was indicated that a column from the subquery cannot be

referenced in the main query. This is still true. However, nothing is wrong with using one or more column references from the main query within the subquery to create a Correlated subquery.

EXISTS

Another powerful resource that can be used with a correlated subquery is the EXISTS. It provides a true-false test within the WHERE clause.

In the syntax that follows, it is used to test whether or not a single row is returned from the subquery SELECT:

SELECT '<character-literal>'

WHERE EXISTS ( SELECT <column-name> [,<column-name> ] FROM <table-name>

WHERE <column-name> { = | < | > | <= | >= | <>}

( SELECT {MIN | MAX | AVG }(<column-name>) FROM <table-name>) )

;

If a row is found, the EXISTS test is true, and conversely, if a row is not found, the result is false. When a true condition is determined, the value in the SELECT is returned from the main query. When the condition is determined to be false, no rows are selected.

Since EXISTS returns one or no rows, it is a fast way to determine whether or not a condition is present within one or more database tables. The correlated subquery can also be part of a join to add another level of test. It has potential to be very sophisticated.

As an example, to find all customers that have not placed an order the NOT IN subquery can be used. Remember, when you use the NOT IN clause the NULL needs to be considered and eliminated using the IS NOT NULL check in the subquery. When using the NOT EXISTS with a correlated subquery, the same answer is obtained, it is faster than a normal subquery and there is no concern for getting a null into the subquery. These next examples show the EXISTS and the NOT EXISTS tests.

Notice that the next SELECT is the same correlated subquery as seen earlier, except here it is utilizing the subquery to find all customers with orders:

SELECT Customer_name FROM Customer_table AS CUST

WHERE EXISTS ( SELECT * FROM Order_table AS OT

WHERE CUST.Customer_number = OT.Customer_number ) ; 4 Rows Returned

Customer_name _ Ace Consulting Databases R Us Billy's Best Choice XYZ Plumbing

By changing the EXISTS to a NOT EXISTS, the next SELECT finds all customers without orders:

SELECT Customer_name FROM Customer_table AS CUST

WHERE NOT EXISTS ( SELECT * FROM Order_table AS OT WHERE CUST.Customer_number = OT.Customer_number ) ; 1 Row Returned

Customer_name Acme Products

Since the Customer and Order tables are used in the above Correlated subquery, the table names did not require an alias. However, it was done to shorten the names to ease the equality coding in the subquery.

An added benefit of this technique (NOT EXISTS) is that the presence of a NULL does not affect the performance. Notice that in both subqueries, the asterisk (*) is used for the

columns. Since it is a true or false test, the columns are not used and it is the shortest way to code the SELECT. If the column in the subquery table is a Primary Index or a Unique

Secondary Index, the correlated subquery can be very fast.

The examples in this chapter only use a single column for the correlation. However, it is common to use more than one column from the main query in the correlated subquery.

Although the techniques presented in this last chapter seem relatively simple, they can be very powerful. Understanding subqueries and Correlated subqueries can help you unleash the power.

Chapter 7: Join Processing

Join Processing

A join is the combination of two or more tables in the same FROM of a single SELECT statement. When writing a join, the key is to locate a column in both tables that is from a common domain. Like the correlated subquery, joins are normally based on an equal comparison between the join columns.

An example of a common domain column might be a customer number. Whether it

represents a particular customer, as the primary key, in the Customer table, or the customer that placed a specific order, as a foreign key, in the Order table, it represents the same entity in both tables. Without a common value, a match cannot be made and therefore, no rows can be selected using a join. An equality join returns matching rows.

Any answer set that a subquery can return, a join can also provide. Unlike the subquery, a join lists all of its tables in the same FROM clause of the SELECT. Therefore, columns from multiple tables are available for return to the user. The desired columns are the main factor in deciding whether to use a join or a subquery. If only the columns come from a single table are desired, a subquery or a join work fine. However, if columns from more than one table are needed, a join must be used. In Version 2 Release 3, the number of tables allowed in a single join increased from sixteen (16) to sixty-four (64) tables.

In document teradata-sqlunleash (Page 94-98)