Beyond SQL Essentials:
The Need for Speed When Accessing SAS ® Data
Transcript
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.
Beyond SQL Essentials: The Need for Speed When Accessing SAS® Data Transcript
Copyright © 2010 SAS Institute Inc. Cary, NC, USA. All rights reserved. Printed in the United States of America. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.
Book code E1408, course code RLSPSQLS, prepared date 16Apr2010. RLSPSQLS_001
ISBN 978-1-60764-436-1
Lecture Description ... iv
Prerequisites ... v
Accessibility Tips ... v
Beyond SQL Essentials: The Need for Speed When Accessing SAS® Data ... 1
1. Review of the LIBNAME Statement and Performance Measures ... 7
2. Optimizing Queries ... 20
3. Summary ... 43
Appendix A Demonstration Programs ... A-1
1. Turning on Options to Benchmark Queries ... A-3 2. Using Functions in WHERE Criteria ... A-4 3. Using Derived Columns in WHERE Criteria ... A-5
of manipulation techniques and resource cost benefits are designed to help programmers choose the most appropriate technique for their situation.
To learn more…
For information on other courses in the curriculum, contact the SAS Education Division at 1-800-333-7660, or send e-mail to [email protected]. You can also find this information on the Web at support.sas.com/training/ as well as in the Training Course Catalog.
For a list of other SAS books that relate to the topics covered in this
Course Notes, USA customers can contact our SAS Publishing Department at 1-800-727-3228 or send e-mail to [email protected]. Customers outside the USA, please contact your local SAS office.
Also, see the Publications Catalog on the Web at support.sas.com/pubs for a complete list of books and a convenient order form.
Accessibility Tips
If you are using a screen reader, such as Freedom Scientific’s JAWS, you may want to configure your punctuation settings so that characters used in code samples (comma, ampersand, semicolon, percent) are announced. Typically, the screen reader default for the character & is to read “and.” For clarity in code samples, you may want to configure your screen reader to read & as “ampersand.” In addition, depending on your verbosity options, the character & might be omitted. The same is true for some commas before a code variable. To confirm code lines, you may choose to read some lines character by character. When testing this scenario with Adobe Acrobat Reader 9.1 and JAWS 10, ampersands before SAS macro names were announced only when in character-reading mode.
for Speed When Accessing SAS Data
1. Review of the LIBNAME Statement and Performance Measures ... 7 2. Optimizing Queries ... 20 3. Summary ... 43
Beyond SQL Essentials:
The Need for Speed When Accessing SAS
®Data
Welcome to this SAS e-Lecture titled Beyond SQL Essentials: The Need for Speed When Accessing SAS® Data. My name is Betsy, and I work in the Education Division at SAS. I will be delivering this
presentation that was developed by my colleagues Mark and Linda. In this lecture, I would like to explain some of the factors affecting SQL query performance to help you write SQL queries that run faster when you work with SAS data as your data source. This lecture is the second of a two-part series of lectures dealing with the topic of query optimization. This lecture discusses accessing and working with SAS data sets. The second lecture presents accessing relational database management systems.
Reference Materials
To access the transcript for this lecture:
1. Go to the table of contents on the left side of the viewer.
2. Select Reference.
3. Select Transcript.
2
Before we begin the lecture, let me mention that we have included a transcript so that you can print all of the technical information provided in this lecture. To access the transcript, select Reference and then Transcript in the table of contents on the left side of the viewer. You can print this transcript now for use when viewing the lecture or print it later to keep as a reference. Also, note that Appendix A in the
transcript contains copies of the programs used for the demonstrations in this lecture.
Navigation Help
For information on how to navigate this lecture:
1. Go to the upper-right corner of the browser.
2. Select Help.
3
If you need help with the navigation of this lecture, please select Help in the upper-right corner of the browser.
1. Review of the LIBNAME Statement and Performance Measures
2. Optimizing Queries
3. Summary
4
Beyond SQL Essentials: The Need for Speed When Accessing SAS
®Data
We’ll get started in this lecture with a review of the LIBNAME statement. We’ll also discuss resource statistics that can be used to better optimize or tune your queries.
Then in section 2, we’ll discuss some of the common factors that effect SQL query performance. We will look at examples that point out techniques and tips for optimizing your queries.
Finally, in section 3, we’ll review the most important concepts talked about in this lecture and point out how you can continue learning more about SAS and PROC SQL.
1. Review of the LIBNAME Statement and Performance Measures
2. Optimizing Queries
3. Summary
5
Beyond SQL Essentials: The Need for Speed When Accessing SAS
®Data
Let’s start with a quick review of selected SAS SQL essentials.
Objectives
Use the LIBNAME statement to access data stored in SAS.
Gather and interpret SQL query performance statistics.
6
Specifically, we’ll revisit the SAS LIBNAME statement, which we will use to access data stored in native SAS. We’ll review and expand on gathering and interpreting SQL query performance statistics. In later segments, we’ll leverage these statistics to fine-tune our SQL queries to execute faster.
The LIBNAME Statement
The LIBNAME statement
establishes a libref, which points to a SAS library
enables data sets to be referenced with a two-level name
makes operating environment-specific references unnecessary throughout the code.
7
You must have the appropriate operating system authentication and permissions to successfully assign a libref.
A SAS library is a collection of SAS files. The LIBNAME statement assigns a short (eight characters or less) nickname that we can later use to access a SAS library. In SAS, this nickname is referred to as a libref. System storage location information, such as the fully qualified path for directory-based operating systems, is often complex and would be tedious to repeat in our program as we access different data files in a library. After the libref is assigned, you can reference the data sets in the library using a two-level name consisting of the libref and the name of the table, separated by a period. This makes it unnecessary to repeat system storage location information each time that you refer to a data set.
Remember that the SAS System is still bound by all the normal operating system authentication and permissions restrictions. So to read or write to a data set, you’ll still need to be logged on (authenticated) to the system where the data resides, and your user ID must have appropriate permissions.
The LIBNAME Statement
General form of the LIBNAME statement:
Example (Windows operating system):
8
LIBNAME libref 'SAS-data-library location' <options>;
customer_dim.sas7bdat
employee_addresses.sas7bdat employee_data.sas7bdat
s:\workshop
In its simplest form, a LIBNAME statement begins with the keyword LIBNAME followed by the libref that we want to assign and the information that the operating system will need to locate the SAS library.
How the location is designated is specific to the operating system.
Directory-based operating systems such as Windows and UNIX use a filename extension to indicate the format of a file’s contents. On these systems, SAS tables are stored in individual files using a .sas7bdat extension, as you can see in this example.
Non-directory based systems, such as IBM mainframes running z/OS, do not use file extensions and store tables differently.
The LIBNAME Statement
Use a two-level name to access a permanent table in a query.
9
libname orion 's:\workshop';
proc sql;
select CUSTOMER_COUNTRY, CUSTOMER_NAME from orion.customer_dim
; quit;
customer_dim.sas7bdat
employee_addresses.sas7bdat employee_data.sas7bdat
…
s:\workshop
If we wanted to access one of the tables in the workshop folder, we would submit this LIBNAME statement to assign the libref orion to the storage location. Then in the query, we simply access the table using the two-level name of orion.customer_dim.
That’s a quick review of the LIBNAME statement. Now let’s quickly review ways to obtain performance statistics in SAS for our queries.
Performance Benchmarking
The STIMER or FULLSTIMER options report CPU, memory, and I/O usage.
General form of the STIMER and FULLSTIMER options:
The statistics that are reported might differ among operating environments.
10
options STIMER;
options FULLSTIMER;
As we try to identify areas where we can speed up processing, we need to gather and interpret
performance statistics. Fortunately, this capability is built into SAS. There are options that we can request that print statistics concerning CPU, memory, and input/output resource usage for each DATA or PROC step that executes. Here is the syntax for these options. They are specified in an OPTIONS statement.
The type of statistics written to the SAS log varies, depending on the operating system. The FULLSTIMER option prints all of the performance statistics that are available to SAS, while the STIMER option reports a subset of the performance statistics.
Performance Benchmarking
The STIMER PROC SQL statement option reports performance statistics for each SQL statement in the current SQL invocation.
General form of the STIMER SQL option:
To use the STIMER SQL option, the SAS system options STIMER or FULLSTIMER must also be in effect.
11
PROC SQL STIMER;
If you have already invoked the STIMER or FULLSTIMER SAS system option, you can also invoke PROC SQL with the STIMER option. The STIMER option causes PROC SQL to provide CPU, memory, and input/output statistics for each individual statement executed during the PROC SQL invocation, enabling an even more granular analysis of resource utilization for benchmarking.
Let’s look at an example that uses these options.
Performance Benchmarking
Example: Capturing performance statistics for an SQL query:
12
options fullstimer;
proc sql stimer;
select product_ID, SUM(Quantity) 'Total Sold' from orion.bigdata
where Product_ID=240100300006 group by product_id
; quit;
In this example, we specify the FULLSTIMER system option with this OPTIONS statement. Then we invoke PROC SQL using the STIMER option.
Turning on Options to Benchmark Queries
This demonstration illustrates how to turn on options to benchmark queries.
13
Now it’s time for a demonstration. My colleague Linda will be presenting the demonstrations in this lecture. In this demo, Linda will take the query on the last slide and submit it so that we can see exactly what types of performance statistics are provided. Linda, I’ll turn things over to you.
(rlspsqls_section1_demo) Hi, my name is Linda.
We will be executing this PROC SQL query twice. In this first program, we have only the FULLSTIMER system option turned on. In the second program we have added the STIMER option to the PROC SQL statement. Let’s submit each program and compare the statistics that are generated.
1. I will submit the first program with this LIBNAME statement and the OPTIONS statement. Here is my output. If I go to my log, I see a set of statistics. There is CPU usage, as well as real time and memory statistics.
2. Let’s go back to the editor and run our second program with the addition of the STIMER option. I’ll go to my log. Now I see a lot more information. I see a set of statistics listed for the PROC SQL statement, a set of statistics for the SELECT statement, and a set of statistics for the QUIT statement.
As we discussed before, and I can see here, the STIMER option generated more granular statistics.
Now you might be wondering why these statistics and the statistics from the first program do not add up exactly? For example, why was less CPU time used in the second example if the query being executed is exactly the same in both programs? Actually, if you are experimenting with different efficiency
averages, not on one individual execution.
Also be aware that a QUIT statement is very different than starting a new SAS session. Step boundaries, such as the QUIT statement, do not guarantee a release on memory. If I were really trying to obtain performance information about this query, I would want to benchmark it properly. However, the purpose of this demonstration is simply to see and compare the types of statistics that are generated when you use FULLSTIMER and STIMER.
Performance Benchmarking
The statistics provided can vary with the operating system.
Windows XP
z/OS
14
NOTE: SQL Statement used (Total process time):
real time 15.23 seconds user cpu time 0.23 seconds system cpu time 0.17 seconds
Memory 170k OS Memory 8560k
NOTE: The SQL Statement used the following resources:
CPU time - 00:00:00.12 Elapsed time - 00:00:00.14 EXCP count - 403
Task memory - 7250K (0K data, 7250K program) Total memory - 16564K (4000K data, 12564K program)
As we discussed earlier, the types of statistics written to the SAS log do vary, depending on the operating system. I’m not going to demonstrate several operating systems to you in this lecture, but to give you an idea of what the statistics look like on different platforms, take a look at these two logs. The first log shows the statistics generated in a Windows XP environment. The second example shows statistics generated in z/OS.
Performance Benchmarking
Here are the definitions for the different performance statistics:
15
Statistic Description
Real-Time Time spent processing a SAS job, also referred to as elapsed time or clock time
User CPU Time CPU time spent executing SAS code
System CPU Time CPU time spent performing operating system tasks (system overhead)
Memory Memory required to execute a step
OS Memory Maximum amount of memory that a step requests from the operating system
Here is a quick explanation of the different performance statistics. If you would like more information, it can be found in the SAS Companion documentation for your operating system.
• Real-time is the time that it took to process a SAS job if you were looking at the clock on the wall. It is also referred to as elapsed time.
• The user CPU time is the amount of CPU time that the system spent processing a SAS job.
• The system CPU time is the amount of CPU time that was spent performing operating system tasks to support the execution of the SAS code. This is also referred to as system overhead.
• Memory is the amount of memory that is required to execute a step.
• OS Memory is the total amount of memory that was requested by a step from the operating system.
Different statistics are important to different customer sites. You will have to find out which of these statistics is the most important for your situation. In this lecture, our focus is on reducing elapsed time.
We will look at some of the small, sometimes inadvertent choices that we might make when writing our queries. These choices might impede processing speed. Our goal will be to write our programs with the PROC SQL Optimizer in mind. If you are not familiar with the PROC SQL Optimizer, we will discuss it in the next section. Our reward for doing so will be faster results from our queries.
Multi-Threading
CPU time can exceed the elapsed time on reports using the FULLSTIMER option when multi-threading is in effect.
16
CPU Time 10 sec. 10 sec.
Total CPU Time
= 20 seconds
Elapsed Time=10 seconds
Just a quick aside…
Be aware that in SAS®9, there are several procedures that can leverage computers with multiple CPUs and an operating system that is capable of running more than one thread simultaneously. For example, PROC SQL will frequently run multi-threaded sorts. When this happens, the CPU time on FULLSTIMER reports often exceeds the elapsed time.
If you are not familiar with multi-threading, it’s similar to calculating man-hours. If two people work together for one hour, two man-hours are expended, but only one clock hour has actually elapsed.
Similarly, if PROC SORT uses two threads running on separate CPUs simultaneously, the CPU time expended would be the sum of the CPU seconds consumed by each CPU. For example, if CPU #1 used 10 seconds and CPU #2 simultaneously used 10 seconds on the same sort, the total CPU time expended on the sort is 20 seconds, but only 10 seconds of real-time elapsed.
Be aware that this might be the case when you are looking at performance statistics.
1. Review of the LIBNAME Statement and Performance Measures
2. Optimizing Queries
3. Summary
17
Beyond SQL Essentials: The Need for Speed When Accessing SAS
®Data
We mentioned earlier that sometimes queries can be rewritten to improve performance. Let’s take a look at some performance tips and techniques that can help to optimize the queries that you write.
Objectives
Identify factors that effect the performance of query processing.
State the role of the PROC SQL Query Optimizer.
State the effect of referencing indexed columns in WHERE clauses.
List the effects of using derived columns in WHERE clauses.
Construct efficient logic conditions in WHERE clauses.
18
There is often more than one way of constructing a query, especially a WHERE clause, to achieve the desired results. Sometimes the syntax we choose prevents the Optimizer from considering all available options. We are going to take a look at three examples and how they can be accomplished by writing the most efficient query code.
• In the first example, we’ll be talking about the use of SAS functions versus comparison operators in WHERE clauses.
• In the second example, we will be creating a new column in our query and we’ll see the effects of the placement of the calculation in a query.
• In the third example, we’ll look at how we might be able to effect processing time by rearranging the order of our criteria in the WHERE clause when we are specifying multiple criteria.
Performance Factors
Factors affecting performance include
CPU
Memory
Input/Output (I/O).
Other factors affecting performance include
data location relative to processing
platform processing capability
concurrent processes on servers.
19
Bottlenecks in CPU, memory, or input/output can significantly increase the elapsed time required to execute our programs. The way that our queries are written can effect all three of these resources.
Although these contribute to many of our SQL query performance issues, in the real world other environmental factors can also significantly degrade performance, as you see here.
For instance, moving data from one location to another for processing introduces network latency, which can significantly degrade performance. When moving data is absolutely necessary, performance improves if you can minimize the size of the data being moved.
Also consider where and when your queries will be processed.
• If you are accessing data from outside of SAS, you should ask, "Where do I want my queries to be processed? In SAS or in the outside database?" For example, relational database systems are usually very efficient at subsetting and joining tables. Frequently when you join or subset relational database tables, efficiency will be better if the joins and subsetting happen inside the database prior to bringing the data into SAS for processing.
• If you have access to more than one platform running SAS, for example, UNIX and Windows, you should ask which platform has more and faster CPUs, larger amounts of RAM, and speedier disk storage systems. These platforms will frequently process your queries faster than systems with more limited resources.
• Finally, if your SAS platform hosts multiple users or batch processes, you should investigate when the system is under lighter processing loads. Performance will be better when more resources are available
Some of these factors we can control and others we cannot. In the remainder of this lecture, we will look at ways to write our queries to work best with the environmental factors we have in place.
The PROC SQL Query Optimizer
The PROC SQL Query Optimizer attempts to produce results as efficiently as possible.
20
Results PROC SQL
Query Optimizer
SQL is a high-level, set-oriented, declarative language. Therefore, we write SQL queries that describe the results that we want, without explicitly explaining how to access and process the data. What does this mean? In other words, the SQL code you submit is not actually the code that executes. The PROC SQL Query Optimizer attempts to produce the requested results as efficiently as possible.
To accomplish its mission, the Query Optimizer considers possible alternative methods to achieve the specified results and estimates the cost of each method. It then chooses a strategy designed to most efficiently access and process the data required to provide those results. The Query Optimizer first transforms the SQL query that you write into a series of “conjunctive normal form” expressions, which are a kind of logical algebraic notation. These expressions are then manipulated algebraically and data flow is analyzed to reveal the strategy that will return your specified results at a minimum cost in memory, CPU, and I/O. The SQL processor then executes this plan and returns your results.
Let’s take a look at some ways that we can help the Optimizer do its job better by rewriting our queries to be more efficient.
Example 1: Business Task
Generate a report showing the total quantity sold for product line 22.
Example PRODUCT_ID values Partial Desired Output
21
Total Quantity Sold for Product Line 22
Total PRODUCT_ID Sold ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ 220100100005 23 220100100010 38 220100100013 183 220100100014 268 220100100016 275 220100100017 292 220100100005
240500200003
In our first example, we are going to generate a report showing the total quantity sold for a particular group of products. The first two digits of the Product_ID column defines different product lines or types. For example, this 22 represents Clothes and Shoes, and this 24 represents Sports items. We are interested in product line 22. To target only this group of data in our query, we will need to write
WHERE criteria that tells SAS to include only those rows where Product_ID values start with a value of 22.
where int(Product_ID/10000000000)=22
Example 1: WHERE Clause Solutions
Possible WHERE clause solutions:
Method 1: Use a function.
Method 2: Use comparison operators.
22
where Product_ID between 220101400000 and 220101400009
Example:
22010140065/ 10000000000=22.0101400065 INT(22.010400065)=22
There are several techniques that I can think of to solve this, such as this first method. Here is a creative use of the INT function that you might not readily think to use. In effect, we will use the INT function to substring our numeric values. This WHERE clause asks SAS to take the values of Product_ID and divide them by 10 billion. The result set will be values that contain two-integer digits and 10 decimal places, as you can see in this example. We will then tell SAS to use the INT function to delete that long trail of decimal places. For the remainder of the comparison in the WHERE clause, SAS will then be comparing a two-digit value to the value 22 on the other side of the equal sign here. If the INT value is equal to 22, then we want SAS to process those rows in the query. That’s the WHERE clause that we will use in our solution.
This second method is more straightforward and might be the solution that comes to your mind first. Here we are asking for Product_ID values between this value and another value, which is all the values that start with 22 billion. That will get us our product line values of 22.
Using Indexes
Specifying the MSGLEVEL=I system option causes SAS to write a message to the SAS log whenever an index is used.
General form of the MSGLEVEL= system option:
N = (Default) prints notes, warnings, and error messages only.
I = prints additional notes pertaining to indexes, merging, and sorting.
23
options MSGLEVEL = N | I;
Before we move into a demonstration to execute and compare these two methods for solving our business task, let me show you another tool that can help to fine-tune our queries. The orion.bigdata table that we are using in our example is indexed on the numeric data column named Product_ID. There is a very informative SAS system option that helps detect whether the PROC SQL Query Optimizer chooses to use an index to improve query performance. It is called MSGLEVEL=I. This will produce an
informational note in the SAS log whenever an index is used during processing. Here is the syntax for this option. We will use this option in our demonstration program and I’ll show you the messages that it produces in the log.
Using Functions in WHERE Criteria
This demonstration illustrates the effect of using functions versus using comparison operators in WHERE criteria specified in a PROC SQL query.
24
Now, a demonstration of the function versus comparison operator solutions that we talked about previously
Part A: (rlspsqls_Section2_demo1)
• We start our program with an OPTIONS statement to turn on FULLSTIMER statistics. Notice that the MSGLEVEL= option is set to I so that SAS will report whether it used any indexes during the processing of our query.
• We are including the Product_ID and the sum of Quantity values, which we are labeling as Total Sold in our results.
• Our data source is a SAS table named orion.bigdata_indexed. Notice that this table is indexed on the numeric data column named Product_ID.
• Next we have a WHERE clause as explained on the previous slide.
• Finally, we are grouping the results by Product_ID values. We are grouping so that we get a Total Sold value for each Product_ID.
Let’s submit this query. I’ll go to my log. I don’t see any errors or warnings, so that’s good. I also don’t see any information or INFO notes indicating index usage. Why? Well, what has happened is that the inclusion of the INT function in this WHERE clause precluded SAS from using the PRODUCT_ID index for subsetting. As a result, every row in the entire table was read and processed. This method of reading data is sometimes called a “full table scan” and is not generally the most efficient method for retrieving small subsets from large data tables.
Part B:
In this modified example, we replaced the INT function with a BETWEEN-AND operator. Let’s submit this program and see what happens.
I’ll go to my log…Great! Now SAS was able to use the index.
So, what have we learned in this demonstration? As you can see, recoding the WHERE criteria using the BETWEEN-AND operator enables PROC SQL to use the Product_ID index and dramatically
improves processing efficiency. One tip that you should take away from this is that you should always try to think of a way to recode your query if your indexes are not being used. Also, with the BETWEEN- AND example, SAS is not having to perform a calculation in the WHERE clause. That will also speed up performance. In general, you will find that functions are convenient to use, but not very efficient.
Now, it’s difficult to say how much more efficient one method was over another, because we ran the two queries back to back. All we can say at this point is that one method precluded the index from being used and the other allowed the index to be used. In order to get a true reading of how much more efficient the BETWEEN-AND example was, we would have to do proper benchmarking. We are not going to start and stop three sessions here, and average the results. You will have to take my word for it that in
benchmarking these two techniques, we found that the BETWEEN-AND example ran over 75% faster than the INT function example. Of course, this was the result with this particular example. There are many factors to consider. Think about all of those environmental factors that could change the extent of the saved resources. With that said, you should try to eliminate the use of calculations, especially involving functions, in your WHERE clauses and try to tweak your queries to use indexes if they are available.
Example 2: Business Task
Generate a report of the total number of shoes and clothes sold in the U.S. and Germany.
25
The source data named bigtext contains indexes on the Countryand Product_Category columns.
Our second example illustrates the referencing of derived columns. Our report needs to contain the column named Category. This column concatenates the values of Country with the values of Product_Category. To create the report, we will use a data set named bigtext. It is important to note that this data set contains two indexes: one on the Country column and one on the
Product_Category column.
Let’s take a look at how to create this new, derived column.
Task 1: Creating Derived Columns
Step 1: Create the new column using the CATX function.
General form of the CATX function:
Example:
The value that is returned by the CATX function has a length of 200.
26
select catx("-",Country,Product_Category) as Category length=20
CATX(separator, string-1,… string-n)
To create the new Category column, we will use the CATX function. This function will enable us to concatenate the values from the Country and Product_Category columns by placing hyphens between the values as we saw in the report on the previous slide. This function will also remove any leading and trailing blanks in the values being concatenated. There are several other CAT functions if you need to only remove the leading blanks or only remove the trailing blanks. The CATX function is the one that we need to use in our example.
Notice that the result returned by the CATX function will have a length of 200 by default. To specify a shorter length, use the LENGTH= option in the SELECT statement as we did here.
Task 2: WHERE Clause Solutions
Step 2: Specify WHERE criteria to subset the data based on specific Countryand Product_Category values.
Method 1:
Method 2:
27
where catx("-",Country,Product_Category) = "US-Clothes"
or
where calculated Category = "US-Clothes" length=20
where Country = "US" and Product_Category = "Clothes"
Next, we want to subset on Country and Product_Category values. There are a couple of
approaches that we can take in specifying the WHERE criteria. Let’s take a look at the first method where we are deriving the value in the WHERE clause. This method might come to mind first, because we are having to concatenate values in the SELECT clause. We can subset based on the concatenated values by respecifying the expression that we had in our SELECT clause. A more convenient way to reference the new column is to use the CALCULATED keyword. This is actually the recommended way to reference the column so that SAS does not have to calculate the column in both the SELECT and WHERE clauses.
When it gets to the WHERE clause, this calculation has already been done. However, with either way of coding this calculation, during processing, SAS will not be able to apply the WHERE criteria to
determine which rows to read for processing because there is no Category column in the data source.
SAS will have to read all the rows from the data source before it can subset the data based on the calculated values.
Based on what we learned in the last section, in the second method, we take the approach that it is not necessary for SAS to calculate the Category column to subset the rows that we need. We only need the columns from our source table, Country and Product_Category, in the WHERE clause. This way, SAS does not have to calculate anything in the WHERE clause. It seems as though this would save some significant processing time.
Task 2: WHERE Clause Solutions
Alternate Step 2: Specify the HAVING clause instead of a WHERE clause to subset data values.
Method 3:
28
having Category in ("GB-Clothes",
"GB-Shoes", "US-Shoes", "US-Clothes")
At this point, it is even possible that you might have thought of this solution. Because there is a GROUP BY clause in this query, why not use only a HAVING clause to subset the groups? Remember that a HAVING clause executes subsetting criteria against groups of data, whereas a WHERE clause executes subsetting criteria against individual rows of data. Either method will produce the same result set. Then, what would the resources be like to process our query using a HAVING clause instead of a WHERE clause?
Using Derived Columns in WHERE Criteria
This demonstration illustrates the effect of using derived columns in WHERE criteria.
29
Let’s switch to a demonstration where we can compare these three approaches.
(rlspsqls_section2_demo2)
In this demo, we want to compare the three techniques discussed earlier for subsetting our data based on a column that is appropriately indexed. The premise is that if SAS is able to make use of the index, then the query will process faster. So, we want to do all that we can in our query to help SAS be able to use the index. We will execute three queries, each one using a different technique to subset our data by Country and Product_Category values. We will be executing them all within the same SAS session, which is not good benchmarking practice. However, for the purpose of this demo, this method will work. Again, you’ll have to take our word for it that in our proper benchmarking of the queries, we found similar differences between the techniques. The point here is that you don’t need to focus on the exact resource statistics that you will see in the log, but rather you should focus on the differences between them.
The first query that we will execute is the one that we found performed the worst, and the last query that we will execute is the one that performed the best. So, we’ll improve on the query as we move along.
Also, remember that there are two indexes on the orion.bigtext data set. They are named COUNTRY and PRODUCT_CATEGORY.
In this first query, we are calculating a new column in our SELECT clause, and we are using a HAVING clause in place of a WHERE clause to subset our data based on the calculated variable. Notice that the
CALCULATED keyword. The idea is that the WHERE clause preprocesses the data, if you will. It subsets the data before any summarization takes place. So, in theory, the WHERE clause should be faster than using the HAVING clause to subset our data.
In this last example, we are replacing the CALCULATED keyword in the WHERE clause with our multiple criteria. The only columns being referenced on this WHERE clause come from our source table.
They are not derived columns. Since SAS is not having to do any calculations while subsetting the data, we expect this technique to be the fastest.
We will go ahead and execute all of these queries at once, which again is not best practice, but it is a viable way of getting our point across. Here is our output for the HAVING clause. There is our calculated keyword in the WHERE clause and our referencing source columns example. Now I will go to our log and look at this first example. We have 17.14 seconds for real time. If I scroll down and compare that to the second and the third examples, I see that this first example was the most expensive to process.
Remember that criteria on a HAVING clause is applied to groups of data, not to individual rows of data.
When this query begins to execute, all of the rows from orion.bigtext must be read and summarized to produce an intermediate result set. Then the intermediate results are subset by the HAVING clause.
The subsetting of the data comes too late. In theory, it should be much more efficient to subset the input data first with a WHERE clause, and then process the rest of the query, including any summarization.
Let’s take a look at the second example that does this. In the second example, notice that neither index was selected for WHERE clause optimization. As we discussed earlier, this is because the column being referenced in the WHERE clause is a derived column. The indexes are based on two columns coming from our source data. The Optimizer has no index to use in this case.
Then we rewrote the query, in this third example, in a way that gave the Optimizer the option of using the COUNTRY and PRODUCT_CATEGORY indexes and also kept SAS from having to perform a calculation in the WHERE clause. Now, that’s more like it!
Once again, and we keep driving this fact home, we can’t really rely on the performance statistics that you see here, but I can tell you from benchmarking the methods, between the second and third methods, the third method reduced our real time processing by about 40%. This query “tune-up” provided some tangible efficiency improvements.
What did we learn from this demonstration? We learned that if you only want certain rows from your source data to be processed, use a WHERE clause to extract those rows, not a HAVING clause. It was also reinforced that it is more efficient to specify only source variables (or columns) in your WHERE criteria when possible.
That’s it for this demo.
Let’s look at one more optimization example before leaving this section.
Example 3: Ordering WHERE Criteria
Write a query to extract information about clothes produced in the U.S. where more than six items were purchased at one time.
Example WHERE clause:
30
where Product_Category="Clothes"
and Country="US"
and Quantity > 6
;
1
2 3
In our last two examples, we used data sets that were indexed and talked about how to help the Optimizer make use of indexes while processing WHERE clauses. We did not talk about this, but the Base SAS engine will try to optimize the order of evaluation if indexes are present. In our third example, we will be using a table that is not indexed and we will focus on the possibility of improving query performance by ordering our WHERE criteria appropriately.
In certain special cases, the order in which the expressions are written in a WHERE clause can affect processing efficiency. However, it is good to remember that the Base SAS engine will not necessarily honor the sequence in which we write our expressions in all cases. As the complexity of the WHERE expression increases, the order of WHERE clause conditions becomes less likely to affect the efficiency of the query.
However, this technique is definitely worth considering for simple subsetting queries, especially those intended to be executed many times in a production environment. Coding a WHERE clause so that the conditions most likely terminate processing earlier in the expression might well improve efficiency, and in any case, will never adversely impact efficiency. If efficiency is important, you should always perform benchmarking to determine whether the code changes you are making actually prove effective.
Let’s take a look at a simple example where ordering of criteria in the WHERE clause does have an impact on performance. We are querying a table named orion.bigtext_unindexed to extract information about clothes produced in the United States where more than six items were purchased at one time. Coding the WHERE clause is not difficult. Let’s say this is our first attempt at writing our query. In
Example 3: WHERE Clauses and Unindexed Data
The distribution of Product_Category values is as follows:
31
Product Category Count Percent
Clothes 199226 21
Assorted Sports Articles 147013 15
Outdoors 141553 15
Shoes 130869 14
Other (Various) 333008 35
In order to know the best way to order our criteria, we have to first understand how our data values are distributed in our data set. We will need to produce some sort of frequency report either with PROC SQL or with other SAS procedures such as PROC FREQ.
Here is a report that shows how the values of Product_Category are distributed in our data set. From these frequency counts, we can see that clothing items comprise approximately 21% of the entire
database.
Example 3: WHERE Clauses and Unindexed Data
The distribution of Country values is as follows:
32
Country Count Percent
US 519131 55
GB 113016 12
ES 64549 7
CA 43748 5
Other 211225 22
In this next report, we are looking at the distribution of Country values. We see that U.S.-produced items comprise about 55% of the entire database.
Example 3: WHERE Clauses and Unindexed Data
The distribution of Quantity values is as follows:
33
Quantity Count Percent
1 500172 52.6
2 320383 33.7
3 86995 9.1
4 29472 3.1
5 11193 1.2
6 2960 0.3
Other 494 0.1
In this last report, we see Quantity values of 1 through 6, and then values over 6 are lumped into one last row labeled Other. What we find is that only one-tenth of one percent of all items had a quantity greater than 6.
The conclusion that we come to after looking at these reports is that subsetting by Quantity first should reject the largest number of rows earlier in processing. The next criteria should be checking for
Product_Category values of Clothes and the last criteria should check for Country values of US.
Example 3: Ordering WHERE Criteria
Rearrange the criteria in the WHERE statement to execute conditions in the order that will eliminate the most rows first.
Example WHERE clause:
34
where Quantity > 6
and Product_Category="Clothes"
and Country="US"
Here is the new order of the WHERE criteria. There is no value in demonstrating the performance differences between this new ordering and our original ordering because, once again, we would have to run multiple benchmarking sessions to do so. We did, however, perform our own benchmarking and this is what we found.
When we executed the query as we first wrote it with Product_Category, Country, and Quantity listed in the WHERE clause, our results looked reasonable. If this were a one-shot query, then we would move on at this point. However, if this is a query that will be run frequently, it might be worth our time and effort to investigate the best order in which to specify the WHERE criteria to yield the best performance.
When we did rearrange the order of the columns in the WHERE clause to reject the largest number of rows first, we achieved a nice 30% savings in both CPU time and real time.
As I stated at the beginning of this example, in general, you cannot predict, nor predictably influence, the order of evaluation of conditions in a SAS WHERE clause. The SAS Base engine will try to optimize the order of evaluation if indexes are present to give it some help, but there were no indexes in this last example. In special cases, you can do as we did here, that is, experimentally determine how the system happens to be working for that particular scenario and release of software. It is difficult to generalize from any specific case. Oh, and in case you are wondering, the standard mathematical trick of using
parentheses to force evaluation of certain expressions first does not work.
WHERE Clause Generalizations
Experiment with programs to be run in a production environment.
For performance statistics to be useful, you need to benchmark programs correctly.
Simple modifications to queries can save significant resources.
When writing queries, keep resource usage in mind.
35
So, can we draw any general conclusions from our discussions in this section? Absolutely.
• We learned that if you intend to use a program in a production environment, it might be worthwhile to try different variations of your program, as we did, to help tune your query to get the best results.
• We learned that if you do benchmark programs, you need to benchmark correctly so that your performance statistics are useful and accurate.
In the course of the section, we saw three interesting examples of making simple modifications to our queries that caused significant resource savings.
Hopefully the examples in this section will make you more aware of how to write queries with resource usage in mind.
We have only used SAS data up to this point in the examples, but be aware that some of the same principles hold true when you access data from other data sources outside of SAS.
1. Review of the LIBNAME Statement and Performance Measures
2. Optimizing Queries
3. Summary
36
Beyond SQL Essentials: The Need for Speed When Accessing SAS Data
This concludes our lecture discussion.
Objectives
List the topics presented in this e-lecture.
37
Before we finish, I would like to do a quick review of what was presented in the lecture.
Summary
In this e-lecture, we
reviewed the LIBNAME statement for accessing data stored in SAS
specified and interpreted SQL query performance statistics
discussed proper benchmarking techniques
discovered environmental factors that might degrade performance
discussed the SQL Query Optimizer’s purpose and function
investigated ways to improve WHERE clause processing.
38
In this e-lecture, we discussed a variety of topics to help improve the performance of your SAS PROC SQL queries and to help you be more aware of writing efficient queries.
• We spent time in the first section reviewing the LIBNAME statement. Specifically, we discussed its purpose and how it can be used to access data stored in SAS. We also talked about different options in SAS that provide performance statistics that can help you when benchmarking your programs.
That led to a discussion on proper benchmarking techniques.
• We discussed how environmental factors can contribute to the performance of your queries. As we discussed, these factors might be out of your control, but there are other factors that you can control.
That led to a discussion of the SQL Query Optimizer.
• We defined the role of the SQL Query Optimizer and that prompted us to look at ways that we can help the Optimizer do its job better by rewriting our queries to be more efficient.
• Specifically, we investigated WHERE clause optimization when accessing data in a SAS data set.
Related e-Lectures
These e-lectures might also be of interest to you:
Beyond SQL Essentials: The Need for Speed When Accessing RDBMS Data
Processing Relational Databases Using SAS®9.2
Indexing SAS®Data Sets for WHERE Optimization 1: Index Qualification and Data Profiling
Indexing SAS®Data Sets for WHERE Optimization 2: Index Application and Performance
For a complete list of available e-lectures and other SAS training products, visit
support.sas.com/training
39 39
This lecture focused on tips and techniques for optimizing queries that access SAS data sets. If you frequently use SAS/ACCESS engines to process relational database data in SAS, or are interested in delving deeper into WHERE clause processing, these e-lectures might also be of interest to you. The first lecture listed here, Beyond SQL Essentials: The Need for Speed When Accessing RDBMS Data, is the companion lecture to this one. As stated earlier, some of the same principles discussed in this lecture apply when you access relational databases, but there are additional considerations that you should take into account. These are discussed in this second lecture.
For a complete list of available e-lectures and other SAS training products, please visit the SAS Web site at support.sas.com/training.
Learning More
More information can be found at:
support.sas.com/resources/papers/
A topic of special interest is performance.
40
You can also find a wealth of information at support.sas.com/resources/papers. A topic in particular that you can find at this site is Performance. These papers can help you make your queries run even faster, if you still feel the "need for speed."
Comments?
We would like to hear what you think.
Do you have any comments about this e-lecture?
Did you find the information in this e-lecture useful?
What other e-lectures would you like SAS to develop in the future?
Please e-mail your comments to
Or, you can fill out the short evaluation form.
41
If you have any comments about this lecture or e-lectures in general, we would appreciate receiving your input. You can use the e-mail address listed here to provide that feedback, or you can complete the short evaluation form at the end of this lecture.
Credits
Beyond SQL Essentials: The Need for Speed When Accessing SAS®Data was developed by Mark Jordan and Linda Mitterling.
Additional contributions were made by Lewis Church, David Ghan, Richard Hallquist, Marty Hobbs, Cynthia Johnson, and Kent Reeve.
42
This concludes the SAS e-Lecture Beyond SQL Essentials: The Need for Speed When Accessing SAS® Data. I hope you found the material in this lecture to be helpful to your work assignments.
Copyright
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries.
® indicates USA registration. Other brand and product names are trademarks of their respective companies.
Copyright © 2010 by SAS Institute Inc., Cary, NC 27513, USA.
All rights reserved.
43
Thank you for your time.
Programs
... A-3 ... A-4 ... A-5 1. Turning on Options to Benchmark Queries
2. Using Functions in WHERE Criteria
3. Using Derived Columns in WHERE Criteria
/*******************************************************************
*Section 1 Demo
*Data used: orion.bigdata
*Program: rlspsqls_section1_demo
*Purpose: Demonstrate the resource statistics generated by using the
* FULLSTIMER system option and the STIMER option on the
* PROC SQL statement
*********************************************************************/
libname orion 's:\workshop';
options fullstimer;
*Using just FULLSTIMER;
proc sql;
select product_ID, SUM(Quantity) 'Total Sold' from orion.bigdata
where Product_ID=240100300006 group by product_id
; quit;
*Adding the STIMER option on the PROC statement;
proc sql stimer;
select product_ID, SUM(Quantity) 'Total Sold' from orion.bigdata
where Product_ID=240100300006 group by product_id
; quit;
/*******************************************************************
*Section 2 Demo 1
*Data used: orion.bigdata_indexed
*Program: rlspsqls_section2_demo1
*Purpose: Demonstrate that often queries can be recoded to made use
* of indexes to improve processing efficiency.
*********************************************************************/
*** Part A - Using the INT Function;
options fullstimer msglevel=I;
proc sql;
select product_ID, SUM(Quantity) 'Total Sold' from orion.bigdata_indexed
where int(Product_ID/10000000000)=22 group by product_id
; quit;
*** Part B - Using the BETWEEN-AND Operator;
options fullstimer msglevel=i;
proc sql;
select product_ID, SUM(Quantity) 'Total Sold' from orion.bigdata_indexed
where Product_ID between 220101400000 and 220101400009 group by product_id
; quit;
/*******************************************************************
*Section 2 Demo 2
*Data used: orion.bigtext
*Program: rlspsqls_section2_demo2
*Purpose: To illustrate that typically if SAS uses an index for
* processing subsetting criteria, then a query will process
* faster. Therefore, queries should be written in a way that
* gives the Optimizer the option of using indexes if they
* are available.
*********************************************************************/
libname orion 's:\workshop';
options nonumber nodate fullstimer msglevel=I;
***Example 1: Using the HAVING clause;
title "Using the HAVING clause";
proc sql;
select catx("-",Country,Product_Category) as Category length=20, sum(Quantity) "Total Quantity"
from orion.bigtext group by Category
having Category in ("GB-Clothes",
"GB-Shoes", "US-Shoes", "US-Clothes")
; quit;
***Example 2: Using the CALCULATED keyword;
title "Using the CALCULATED keyword in the WHERE statement";
proc sql;
select catx("-",Country,Product_Category) as Category length=20, sum(Quantity) "Total Quantity"
from orion.bigtext
where calculated Category in ("GB-Clothes", "GB-Shoes", "US-Shoes", "US-Clothes") group by Category
; quit;
***Example 3: Referencing only source columns;
title "Referencing only source columns in the WHERE statement";
proc sql;
select catx("-",Country,Product_Category) as Category length=20, sum(Quantity) "Total Quantity"
from orion.bigtext
where Country in("US", "GB") and
Product_Category in ("Clothes", "Shoes") group by Category
; quit;