IMPACT OF TABLE ACCESS ORDER ON INDEX DESIGN In the two previous examples the local predicates, together with the indexes

Indexing for Table Joins

IMPACT OF TABLE ACCESS ORDER ON INDEX DESIGN In the two previous examples the local predicates, together with the indexes

available, leave no choice with regard to the starting point. Unfortunately, this isn’t always the case as many people have discovered to their cost. We will now consider one of these less straightforward cases by means of a case study.

In the ﬁrst part of this case study we will focus entirely on nested-loop joins. Once we have become familiar with the issues involved, and in particular the potential problems associated with this join method, we will continue the case study by focusing on the merge scan join and the hash join. Finally we will compare the advantages and disadvantages of the two techniques from the index design point of view.

Case Study

A global company has 800,000 domestic customers and 200,000 customers in 100 other countries. Basic data for all its 1,000,000 customers are stored in table CUST. The operational invoice table INVOICE contains 20,000,000 rows. The customers, both domestic and nondomestic, each have an average of 20 invoices. A new query transaction is about to be implemented to ﬁnd, in descending order, high value invoices for a speciﬁed foreign country, based on the IEUR column in the invoice table—we will call these large invoices. The user input consists of the high value limit for the invoice total, together with the foreign country code.

The query can be implemented either as a two-table join SELECT or by using one of the many program alternatives that implement the join logic with two single-table SELECTs. The program can build anested-loop structure; program A, for instance, may start with table CUST (CUST being the outer table); program B may use the INVOICE table as the outer table. We have this freedom of choice because the operation, in join terminology, is aninner join; we are only interested in customers that have at least one large invoice. If the requirement had been to obtain customer information, together with their invoices if any, this would have been anouter join; the program would then, of course,have toaccess table CUST ﬁrst.

In the following analysis, program A accesses table CUST ﬁrst (CUST is the outer table), while program B starts with table INVOICE (INVOICE is the outer table). Program C, with a two-table join SELECT, is neutral; if this alternative is chosen, we are leaving the optimizer to choose both the join method and the table access order. The table level decisions, which index to use and so forth, are, of course, made by the optimizer in all three cases.

Before we analyze the three programs, take a moment to consider which program you would choose. Some programmers would intuitively choose program A because it seems more natural to start with table CUST; one customer has many invoices. On the other hand, we have to consider the requirement to provide the high value invoices in descending order; to do this it is possible to include ORDER BY IEUR DESC in program B. The DBMS then takes care of the sort if indeed it is not rendered unnecessary by the use of an index on table INVOICE. Program B seems a bit more convenient.

Program A: Outer Table CUST SQL 8.3

DECLARE CURSORC CURSOR FOR Program A SELECT CNO, CNAME, CTYPE

FROM CUST

WHERE CCTRY = :CCTRY DECLARE CURSORI CURSOR FOR SELECT INO, IEUR

FROM INVOICE WHERE IEUR > :IEUR

AND

CNO = :CNO OPEN CURSORC

FETCH CURSORC while CCTRY = :CCTRY OPEN CURSORI

FETCH CURSORI while IEUR > :IEUR CLOSE CURSORI

CLOSE CURSORC

SORT THE RESULT ROWS BY IEUR DESC

Program B: Outer Table INVOICE SQL 8.4

DECLARE CURSORI CURSOR FOR Program B SELECT CNO, INO, IEUR

FROM INVOICE WHERE IEUR > :IEUR ORDER BY IEUR DESC OPEN CURSORI

FETCH CURSORI while IEUR > :IEUR SELECT CNAME, CTYPE

FROM CUST

WHERE CNO = :CNO AND

CCTRY = :CCTRY DO END

Program C: Optimizer Chooses Outer Table SQL 8.5

Program C DECLARE CURSORJ CURSOR FOR

SELECT CNAME, CTYPE, INO, IEUR FROM CUST, INVOICE

WHERE IEUR > :IEUR AND

CCTRY = :CCTRY AND

CUST.CNO = INVOICE.CNO ORDER BY IEUR DESC

OPEN CURSORJ

FETCH CURSORJ while IEUR > :IEUR and CCTRY = :CCTRY CLOSE CURSORJ

If only one of the tables had local predicates, starting with that table would seem like a good idea from the performance point of view, as we saw in SQL 8.1 and 8.2. As both tables now have local predicates, it is not obvious which program would be faster—unless we know how the tables are indexed.

For the purpose of this case study let us assume that, apart from the primary and foreign key indexes, the only additional index is (CCTRY) on table CUST, as shown in Figure 8.3. CCTRY is thecoderepresenting the country. Now most readers would probably vote for program A; we appear to have an appropriate index on the customer table but not on the invoice table.

1,000,000 rows 20,000,000 rows P F,C P,C CNO CNO INO CCTRY CUST INVOICE FF(CCTRY = :CCTRY) = 10% 20 invoices per average customer

As usual, when creating a new program, we should check whether the current indexes are adequate, even with the worst input. If they aren’t, we should design the best affordable index, probably starting by designing and evaluating theidealindex. To form an understanding of the actual difference in performance between these three programs, let us compare the estimates, using the QUBE, to see how they would perform withthe current indexes shown in Figure 8.3.

We will make the estimates using the highest ﬁlter factors, assuming that these represent the worst case.

FF (CCTRY = :CCTRY) max 10%. FF (IEUR > :IEUR) max 0.1%

Current Indexes

The current indexes are shown in Figure 8.3. This diagram is shown again for each of the three programs, together with the appropriate access path involved and the touches required to each index and table.

Program A: Outer Table CUST

Step 1: Access Table CUST via Nonclustering Index CCTRY The DBMS

must find all the customers from the specified country. It can do this either by a full table scan or by using the index CCTRY (Fig. 8.4). With a high filter factor for the predicate CCTRY=:CCTRY, a full table scan will be faster; with a very low filter factor, an index scan will be faster. If the optimizer selects the access path only once for many executions, it may choose an index scan based on a filter factor of 1% (one divided by the number of distinct values of the column CCTRY). If the optimizer selects the access pathevery timebased on actual user input, it will undoubtedly choose a table scan as long as it is aware of the true filter factor.

SQL 8.6

DECLARE CURSORC CURSOR FOR Program A

SELECT CNO, CNAME, CTYPE

FROM CUST

WHERE CCTRY = :CCTRY

DECLARE CURSORI CURSOR FOR

SELECT INO, IEUR

FROM INVOICE

WHERE IEUR > :IEUR AND

CNO = :CNO OPEN CURSORC

FETCH CURSORC while CCTRY = :CCTRY

OPEN CURSORI

FETCH CURSORI while IEUR > :IEUR CLOSE CURSORI

CLOSE CURSORC

1,000,000 rows 20,000,000 rows P START F,C P,C CNO CNO INO CCTRY CUST INVOICE 100 KT FF(CCTRY = :CCTRY) = 10% 20 invoices per average customer

2 MT 100 KT

2 MT

Figure 8.4 Program A with current indexes.

Index CCTRY TR = 1 TS = 10% × 1,000,000 Table CUST TR = 100,000 TS = 0 Fetch 10% × 1,000,000 = 100,000 × 0.1 ms LRT TR = 100,001 TS = 100,000 100,001 × 10 ms 100,000 × 0.01 ms 1000 s + 1 s + 10 s = 1000 s

According to the QUBE, this step contributes 1000 s to the response time! The large number of TRs for the customer table occurs because the index CCTRY is not the clustering index.

So much for using the “obvious” access path. With a ﬁlter factor of 10%, 100,000 TRs (17 min) is clearly not an option. Even a ﬁlter factor of 1% would require 10,000 TRs (1.7 min). The alternative, a full table scan, would only take

1×10 ms+1 m×0.01 ms=10 s

A filter factor of 0.1% with the index scan would require 1000 TRs, tak- ing the same length of time as the table scan. This is, of course, because one TR, according to the QUBE, takes 1000 times longer than one TS. A full table scan is faster than a nonfat, nonclustered index scan for filter factors greater than 0.1%. We must be very careful, however, when making sweeping state- ments such as this, based solely on figures produced by the QUBE. It should also be understood that the CPU time used would be very much greater with the table scan. Nevertheless, it should be appreciated that very low filter factors are required these days for nonfat, nonclustered index scans to be preferred to table scans.

Together with the FETCH cost, the LRT for this step will be 20 s.

Step 2: Access Table INVOICE via Clustering Index CNO When a cus-

In document Interscience Relational Database Index Design and the Optimizers pdf (Page 156-162)