Writing SQL

© Copyright 2007

Pegasystems Inc., Cambridge, MA

All rights reserved.

This document describes products and services of Pegasystems Inc. It may contain trade secrets and proprietary information. The document and product are protected by copyright and distributed under licenses restricting their use, copying, distribution, or transmittal in any form without prior written authorization of Pegasystems Inc.

This document is current as of the date of publication only. Changes in the document may be made from time to time at the discretion of Pegasystems. This document remains the property of Pegasystems and must be returned to it upon request. This document does not imply any commitment to offer or deliver the products or services described.

This document may include references to Pegasystems product features that have not been licensed by your company. If you have questions about whether a particular capability is included in your installation, please consult your Pegasystems service consultant.

For Pegasystems trademarks and registered trademarks, all rights reserved. Other brand or product names are trademarks of their respective holders.

Although Pegasystems Inc. strives for accuracy in its publications, this document may contain technical inaccuracies or typographical errors. Changes are periodically added to the information herein; Pegasystems Inc. may make improvements and/or changes to the information described herein at any time.

This document is the property of:
Pegasystems Inc.
101 Main Street
Cambridge, MA 02142-1590
Phone: (617) 374-9600
Fax: (617) 374-9620
www.pega.com

PegaRULES Process Commander
Document: Writing SQL
Software Version 5.1
Updated: January 11, 2007

Contents

Overview
  Using SQL in PRPC
Practical Suggestions when Writing SQL for PRPC
  Understand the SQL operation
  Perform a PRPC DB Trace
  Determine a Query Optimization Plan
  Test on all supported DB Platforms
Recommendations for Writing SQL
  Use Bind Variables
  Write queries that operate on small result sets
  Create and use indexes effectively
  Avoid non-indexable search conditions
  Avoid Arithmetic Operators or Functions in the WHERE clause
  Use Joins on primary key columns for complex data sets
  Avoid resource-intensive queries
  Reduce transaction costs
  Reduce the number of network round trips
  Build complex database operations into a stored procedure
Methods to obtain SQL query plans on various databases
  SQL Server
  Oracle
  DB2


Overview

By following some simple guidelines, you can write efficient SQL statements that use the PegaRULES database with minimal performance impact. This document provides a set of guidelines for writing SQL in Pega applications, and applies to versions 4.x and 5.x of Process Commander.

These guidelines help developers create reasonable SQL requests to the PRPC database. While this paper only scratches the surface of SQL writing and tuning, it is meant as a guide to building SQL and running analysis tools on queries. When in doubt, ask your local DBA to assist with tuning tasks.

Structured Query Language (SQL) grew out of the relational model proposed at IBM by E.F. Codd, whose concepts were published in Communications of the ACM in June 1970. SQL has since become the de facto data manipulation language for relational database management systems (RDBMS). SQL is a set-based programming language: requests to the database describe data operations in terms of sets, returning output in one or more result sets of rows and columns. One of the most common operations in SQL is combining two sets or tables into a new logical table; this operation is called a join. Joins typically use a common key column to relate data from each of the component tables.

The SQL 92 ANSI standard was defined so that all DBMSs could comply with one language syntax. However, not all platforms completely support the standard, creating inconsistencies in SQL that have to be accounted for at the programming interface. In addition, certain databases implement some of their most useful features in proprietary SQL calls. Using these feature sets can cause incompatibility among database platforms unless precautions are taken to unify the access API (usually through stored procedures or views).

Using SQL in PRPC

The PRPC application currently uses basic queries based on filter conditions to retrieve data to be used in rules processing or to populate UI lists. Inserts or updates are typically full updates of a specific row with simple filter conditions. Standard queries in PRPC are not complex join statements. Developers typically code queries in a pseudo-SQL language where table names are replaced by object names and filter conditions are simple set language constructs designed to retrieve data.

PRPC also makes use of a few stored procedures and triggers to automatically populate dependent tables and execute database-centric functionality, such as generation of unique ids and usage aggregation.

Practical Suggestions when Writing SQL for PRPC

• Understand the SQL operation
• Perform a PRPC Database Trace
• Determine a query optimization plan


• Test on all supported DB platforms

Understand the SQL operation

While this point may seem trivial, it is next to impossible to tune a SQL query if the functional goal of the SQL statement is not understood. When you have determined the appropriate boundaries of the question the query will answer, and have an understanding of the schema being queried, you may choose among a variety of expressions to return the answer (there is usually more than one way in SQL to get the same result set). Armed with these data points, you can build a SQL query by breaking the resulting data set down into joining data sets which, when combined, provide the complete result set that answers the original question. With the help of an explain plan on the statement (database in debug mode), you can substitute SQL statements or sub-expressions with similar ones for better performance.

It is also very important to set a goal for the expected performance of the query to provide a concrete target for tuning efforts. Without a goal, tuning efforts can get mired in shifting expectations from end users. Examples of targets are:

• The query has to return within 50 ms.
• The query cannot consume more than 10% CPU.
• The query must not request more than 5000 I/Os.

NOTE: Even if a SQL statement works well in a single-user or small-dataset environment, it will not necessarily operate at the same level of efficiency in a production environment. Production scale and data size must therefore be considered when designing and testing the query.

Perform a PRPC DB Trace

The PRPC application developer writes queries at a logical level in the application, using the object definitions of the PRPC items being worked with. This style of development masks the creation of the actual SQL needed for database interactions; the SQL is generated by the rules or workflows created by the user to process work. To determine the SQL generated and submitted to the database, the PRPC application has a database tracing facility that dumps out the SQL along with any supplied parameters. This facility can be enabled at a global level or for an individual user (user-level tracing is new in 5.1).

By enabling this feature, a developer can view the SQL queries passed to the database, including the parameters used. With the SQL statement in hand, you can make sure only the data needed for the query is being requested. If unexpected SQL appears, determine its cause (for example, a new workflow or rule that generates it). Then determine whether the SQL is necessary; if it is, tune it to minimize the resources used at the DB. It is also possible to run this SQL in isolation (using a database query tool like SQL*Plus for Oracle or MS Query Analyzer for SQL Server) to determine the performance characteristics of the statement.

For more information on the PRPC Database Tracer, see the Using Performance Tools technology paper on the Pega Developer Network:

http://pdn.pega.com/DevNet/PRPCv4/SystemTools/documents/UsingPerformanceTools.pdf


Determine a Query Optimization Plan

As explained earlier, the DBMS compiles the SQL request into a query plan that will be executed to return result sets to the user. These query plans can be exposed to the developer through tools such as MS Query Analyzer for SQL Server. The query plan consists of a graphical (in the case of SQL Server) or a text-based hierarchical tree that shows the query’s execution path.

Below is a sample of a query plan from a query made against a PRPC system running on SQL Server. In this case, the direction of the arrows shows the direction of transformation of data, while the nodes of the tree represent simple database operations to be performed at each transformation step. By clicking on a node in MS Query Analyzer, one can determine:

• The type of operation (in this case, a clustered index scan).
• The estimated number of rows retrieved.
• The I/O cost and the percentage cost of this step vs. the whole plan.

Figure 1: Sample SQL Server query plan shown in MS Query Analyzer

Information in the above query plan can be used to determine if there are any additional indexes needed, or whether a change of join type is necessary.

Test on all supported DB Platforms

Since the various database platforms have vastly different underlying architectures, it is quite common for a performance fix (such as the addition of an index) to produce very different results on different platforms. The only way to be certain of the effect of a common change is to test it on all supported database platforms and evaluate the behavior. Debugging and plan-display tools (see the Determine a Query Optimization Plan section) are described for each of the major supported platforms in the Methods to obtain SQL query plans on various databases section.


Recommendations for Writing SQL

• Use bind variables
• Write queries that operate on small result sets
• Create and use indexes effectively
• Avoid non-indexable search conditions
• Avoid arithmetic operators or functions in the WHERE clause
• Use joins on primary key columns for complex data sets
• Avoid resource-intensive queries
• Reduce transaction costs
• Reduce the number of network round trips
• Build complex database operations into a stored procedure

Use Bind Variables

When SQL statements are sent to the database, they are compiled into byte-coded execution plans that the database engine uses to process raw data. Most database management systems have a special cache (called the statement cache) that keeps a hash map of the SQL text and its associated compiled form in memory, with the expectation that the SQL will be re-used in the near future. The statement cache is maintained because most optimizers take some time (usually a few milliseconds) to parse and process a SQL request. Over a million SQL requests to the database, that parsing time becomes significant.

When a filter criterion like WHERE pxName = 'John Doe' is sent to the DBMS for processing, the SQL compilation engine considers the statement a unique occurrence based on the text in the query string. It then compiles and caches this occurrence.

The next time a similar SQL statement arrives at the DB with a clause that includes a different parameter value (WHERE pxName = 'Sally Smith'), it is also treated as a unique SQL statement, causing another compile and build of the SQL. Note that the only difference between these two cases is the literal value in the SQL expression. These recompiles of SQL statements with different parameters can be very costly for a DBMS operating at high transaction rates.
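As an illustrative sketch (the table and column names below are hypothetical, and :name uses Oracle-style bind syntax), the contrast looks like this:

-- Each distinct literal yields a distinct statement text, so each one
-- is parsed and cached separately:
SELECT pzInsKey FROM pc_work WHERE pxName = 'John Doe';
SELECT pzInsKey FROM pc_work WHERE pxName = 'Sally Smith';

-- With a bind variable, the statement text is identical on every execution,
-- so the cached compiled plan can be reused:
SELECT pzInsKey FROM pc_work WHERE pxName = :name;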

Bind variables are substitution variables that are used in place of literals (such as 'John Doe' or 'Sally Smith') and have the effect of sending exactly the same SQL text to the database each time the query is executed. Because you can change the value of the bind variable as many times as necessary without changing the actual text of the SQL command, the database can repeatedly use the same statement without incurring the overhead of reparsing the command each time it is executed.


Therefore, statements with all bind variables are compiled once by the DBMS and the compiled form is stored in the statement cache. This compiled SQL can now be reused with different parameter values being “bound” to the already compiled SQL and submitted for execution. The following is an example of a bind variable used in PRPC.

SELECT other_col1 AS "other_col1",
       other_col2 AS "other_col2"
  FROM Table1
 WHERE my_column = {MyPage.MyProperty}

In the example above, PRPC will replace {MyPage.MyProperty} with a bind variable that can substitute the filter values each time the query is called. At the database level, the optimizer recognizes the identical SQL and does not reparse the query.

The above methodology will reduce the number of soft and hard parses per SQL statement. Ideally, the system should be tuned to have exactly 1 hard parse per parameterized SQL string sent to the DB.

Write queries that operate on small result sets

To improve query performance, limit the amount of data on which the query operates to the smallest possible set. This includes limiting both columns and rows. Operating on smaller result sets reduces the resources consumed by the query and increases the effectiveness of indexes. Reducing the result set includes limiting the columns to those that are absolutely necessary and using highly selective WHERE clauses.

The "selectivity" of data in a particular column determines how effective that column is in a WHERE clause for limiting the number of rows returned. Columns with a large proportion of unique values relative to the total number of rows in the table are highly selective; they benefit greatly from an index and are effective at reducing result sets. For example, a LastName column containing numerous unique entries would benefit from an index.

Columns with a low proportion of unique values, like a column containing only the values M (Male) or F (Female), have low selectivity and are not effective in limiting result sets unless the data has a skewed distribution, that is, far more rows contain one value than the other.
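As a hypothetical sketch (table and column names invented for illustration), prefer an explicit column list and a selective filter over an unrestricted SELECT *:

-- Requests only the needed columns and filters on a highly selective column:
SELECT pyID, pxUpdateDateTime
  FROM pc_work
 WHERE pyLastName = 'Ramirez';

-- By contrast, this returns every column and filters on a low-selectivity
-- column, producing a large result set:
SELECT * FROM pc_work WHERE pyGender = 'F';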

Create and use indexes effectively

As described in the Perform a PRPC DB Trace section, SQL queries can be analyzed to determine the columns that need to be indexed to return the data in an expedient fashion. In addition, after you design and create an index, you need to re-analyze the query plans to ensure that the new index is being used. If an index change has not altered the plan, more analysis is necessary to determine why the index is not being considered.

While appropriate indexes improve query performance, there is a cost: indexes must be maintained during INSERT, UPDATE, and DELETE operations, adding overhead to them. Therefore, when considering a new index, review the DML (Data Manipulation Language) statements against the table to ensure they are not adversely affected.

Note that adding indexes to a table may cause performance issues for unrelated SQL queries on the same table. This is because the database optimizer has a new set of indexes to consider while executing any query on that table. Considering the new index (or statistical information about it) may lead the optimizer to choose another, possibly less optimal, plan. In this case, make sure to review a system-wide profile to identify potential slow-running SQL caused by a changed query plan.
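A minimal sketch of this workflow, assuming a hypothetical table and column (the index name and column are invented for illustration):

-- Index a selective column that appears in WHERE clauses...
CREATE INDEX ix_pc_work_lastname ON pc_work (pyLastName);

-- ...then re-examine the query plan (see the platform-specific methods later
-- in this document) to confirm the optimizer actually uses the new index.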

Avoid Non-indexable search conditions

Search conditions that include rows in the output set usually take advantage of underlying indexes for good performance. Using an inclusion operator allows the database to look up the rows that satisfy the criteria in an index and then fetch the queried columns from the physical table. On the other hand, exclusion search conditions (like those listed below) prevent the optimizer from using an index on the searched column. For example, using the operator != causes the database to scan all rows in a table to identify the rows not meeting the condition. Scanning all rows involves excessive I/O, considerably slowing down queries.

Indexable search conditions:

• = (equals)
• > (greater than)
• < (less than)
• <= (less than or equal to)
• >= (greater than or equal to)
• BETWEEN
• LIKE '<literal>%'

Non-indexable search conditions:

• != (not equal to)
• !> (not greater than)
• !< (not less than)
• NOT EXISTS
• NOT IN
• NOT LIKE
• OR
• LIKE '%<literal>'

NOTES:

• The IN clause works well on Oracle. On SQL Server, however, the IN clause is implemented using OR and should be avoided.


• When using the LIKE clause, placing the wildcard (%) after the literal still allows an index on the searched column to be used.

When using one of the non-indexable constructs, try to restructure the query to change the operation to one that can use an appropriate index, as in the sketch below.
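For example, a hypothetical query against an invented pc_work table (column names are illustrative only):

-- A leading wildcard is non-indexable and forces a full scan:
SELECT pyID FROM pc_work WHERE pyLabel LIKE '%claim';

-- A trailing wildcard allows an index range scan on pyLabel:
SELECT pyID FROM pc_work WHERE pyLabel LIKE 'claim%';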

Avoid Arithmetic Operators or Functions in the WHERE clause

Using an arithmetic operator or function on a column in the WHERE clause will automatically cause the optimizer to reject the use of an index. In this case, it is important to restructure the query so that the operator or function is applied to the right-hand side (RHS) of the WHERE expression.

For example:

Select * from PC_WORK where pyEffortActual*2 = 10;

Should be changed to:

Select * from PC_WORK where pyEffortActual = 10/2;

In the first example, the optimizer has to fetch each value of pyEffortActual from the table, calculate pyEffortActual*2, and compare the result with 10; this makes it impossible to use the index.

In the second example, the optimizer can easily determine that the query is equivalent to one with the clause where pyEffortActual = 5, so it can build an execution plan that simply looks up the entry in an index.

Use of a function such as SQRT(pyEffortActual) on the left-hand side (LHS) of the expression will also cause the optimizer to disregard the index. This is because every value in the index would have to be recomputed using the operator or function before the selected rows could be determined, versus a B-tree walk to the index node that contains a pointer to the row data in the table.

Select * from PC_WORK where SQRT(pyEffortActual) = 2;

Should be changed to:

Select * from PC_WORK where pyEffortActual = 2*2;

NOTE: Only Oracle has function-based indexes, that is, indexes that can be created and maintained based on the return value of a function. SQL Server has a concept of indexed views that can serve a similar purpose.
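As a brief illustration (assuming an Oracle database and the PC_WORK column above; the index name is invented), a function-based index might be declared like this:

-- Oracle only: index the computed value so a predicate such as
-- WHERE SQRT(pyEffortActual) = 2 can use an index.
CREATE INDEX ix_pc_work_sqrt_effort ON PC_WORK (SQRT(pyEffortActual));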

Use Joins on primary key columns for complex data sets

One of the advantages of using a relational database management system is the ability to retrieve virtual sets of related data based on relationships between the key columns of two related tables.

For example, you may want to obtain the sum of the number of days past due for all items in the work list assigned to user X. Instead of retrieving all the work objects in user X's work list and then summing the days in the application server, you can write a single inner join query that correlates the two tables using the pzInsKey of the work objects and lets the database sum the values in the days-past-due column.
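A minimal sketch of such a query, with hypothetical assignment and work tables and invented column names:

-- Join the assignment table to its work objects on the key column and
-- let the database compute the aggregate, returning a single row:
SELECT SUM(w.days_past_due)
  FROM pc_assign_worklist a
 INNER JOIN pc_work w
    ON w.pzInsKey = a.pxRefObjectKey
 WHERE a.pxAssignedOperatorID = 'UserX';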


Avoid resource intensive queries

While databases do an excellent job of offloading application data processing to separate hardware, excessively intensive queries can affect overall database performance. For example, a select statement that requires a scan of 100,000 rows can fill and potentially exhaust the buffer cache, causing other reusable data sets to be purged from it. As a result, queries that would otherwise have hit the buffer cache, due to their frequent requests for a smaller data set, must constantly retrieve their data from disk instead. The same contention can be seen for common locks acquired by the long-running query, and for read and write I/Os to the disk subsystem. Typically, such resource behavior is discovered during scale and performance testing of the system.

Reduce transaction costs

DB2 and SQL Server do not have a row versioning mechanism like Oracle. On these two platforms, each action query (typically DML) is performed as an atomic action. This means that the database moves from one consistent state to the next, and each state transition is logged in the transaction logs. Therefore, while one action is occurring, no other action can occur simultaneously.

To accomplish this, SQL Server and DB2 take locks on every request to the DB. Keeping the time these locks are held to a minimum, and reducing the amount of data written to the transaction log for the currently executing transaction, is essential to good database performance. Therefore, small atomic transactions work well for these databases. Note that in Oracle, due to row versioning, reducing transaction length has little effect on database performance.

Reduce the number of network round trips

Database operations often involve multiple queries. In this case, it is sometimes possible to batch the queries to minimize the network round trips needed for the operation. At times, it may also be possible to re-order the sequence of operations to eliminate additional queries that cause delay. A common methodology is to use stored procedures to encapsulate multiple statements; this is described in further detail in the next section.

Build complex database operations into a stored procedure

As noted in the Reduce the number of network round trips section, if the SQL needed to complete a database operation is complex (requiring multiple SQL statements) or very large, consider using a stored procedure. The advantage of stored procedures is that they provide a clean, common application interface for processing data in the DB while keeping the complexity close to the data, reducing potential network round trips. The disadvantage is that the stored procedure may have to be re-implemented and tested on every supported database platform.

Stored procedures also provide a useful way to take advantage of database-specific features. For example, in Oracle, aggregation can be done using a variety of analytical functions; by encapsulating the report output in a stored procedure, the same aggregation can be done with more conventional SQL and multiple statements on other database platforms. For example, in PRPC v5.2, the license usage aggregation function performs a complex data summarization task by using a stored procedure called from the application.
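A minimal sketch of the pattern, in Oracle PL/SQL with invented table, column, and procedure names (this is not the actual PRPC usage-aggregation procedure):

-- Encapsulate a multi-statement summarization behind one callable interface,
-- so the application makes a single round trip instead of several.
CREATE OR REPLACE PROCEDURE summarize_usage (p_day IN DATE) AS
BEGIN
  -- Remove any previous summary for the day, then rebuild it.
  DELETE FROM usage_summary WHERE usage_day = TRUNC(p_day);

  INSERT INTO usage_summary (usage_day, operator_id, action_count)
    SELECT TRUNC(usage_time), operator_id, COUNT(*)
      FROM usage_detail
     WHERE usage_time >= TRUNC(p_day)
       AND usage_time <  TRUNC(p_day) + 1
     GROUP BY TRUNC(usage_time), operator_id;

  COMMIT;
END;
/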

Methods to obtain SQL query plans on various databases

Most database systems provide multiple approaches to obtaining SQL query plans, depending on the origination point of the SQL (command-line query or program), the statistics available for the plan, and the user interface used to display the plan output.

SQL Server

SQL Server creates execution plans that are displayed on screen rather than stored in database tables for later query.

For a graphical display of the execution plan, you can enter the query in the SQL Server Query Analyzer tool. Once the query has been entered, highlight it and click the Display Estimated Execution Plan button:

Figure 2: The Display Estimated Execution Plan button in SQL Server Query Analyzer

The execution plan displayed will be very similar to Figure 1 earlier in this document. You can click any node on the execution plan tree to see detailed statistics for that step of the plan. To see a text-based version of a SQL Server query plan, run the following script in SQL Server Query Analyzer:

SET SHOWPLAN_TEXT ON
GO
<< Enter the query here >> (for example, SELECT * FROM PR_HISTORY)
GO

By using the Profiler utility in SQL Server, you can get real-time execution plans for SQL statements being executed by an application. Run this utility on the database server machine using an administrative account. Start a trace by choosing the default profile template and then add the following Performance items from the Events tab:

• Execution Plan
• Show Plan All
• Show Plan Statistics
• Show Plan Text

When you run the trace, you will get a tabular UI with information on the SQL statements currently being run on the database. You may choose to save the trace to a file to help with post-run analysis.


Oracle

Oracle uses a SQL-centric approach to generating and displaying execution plans: SQL is used to place plan data in a table, after which the plan data can be viewed with another SQL query. Typically, Oracle hides these implementation details under the covers for the SQL*Plus user. To obtain an explain plan in SQL*Plus, use the commands:

SQL> set autotrace traceonly explain;
SQL> select * from pr_history;

(Replace the select with the SQL command you intend to profile.)

This produces a text-based representation of the explain plan. To turn off autotrace for the session, use the command:

SQL> set autotrace off;

Note that if you use a tool like SQL Developer (a free download from Oracle), you can click a button to get the execution plan, much as in SQL Server Query Analyzer.

Oracle can also maintain the explain plans of all SQL statements currently in the statement cache. This allows a user to query the statement cache for SQL that is being run by an application in real time, rather than from a command line. Oracle provides a special view called V$SQL_PLAN that gives access to the plan information, and it can be combined with the DBMS_XPLAN package to provide information on SQL statements currently executing on the system. The steps below describe how to set up an Oracle database for profiling running SQL.

1. Run the create view script below as the SYS user with SYSDBA privileges.

create or replace view dynamic_plan_table
as
select rawtohex(address) || '_' || child_number statement_id,
       sysdate timestamp, operation, options, object_node,
       object_owner, object_name, 0 object_instance,
       optimizer, search_columns, id, parent_id, position,
       cost, cardinality, bytes, other_tag, partition_start,
       partition_stop, partition_id, other, distribution,
       cpu_cost, io_cost, temp_space, access_predicates,
       filter_predicates
  from v$sql_plan;

2. Grant the following privilege to all users using the SYS account:

grant select on dynamic_plan_table to PUBLIC;

3. Run the following SQL statement from the user that owns the PRPC schema:

select plan_table_output
  from TABLE( dbms_xplan.display(
         'sys.dynamic_plan_table',
         (select rawtohex(address) || '_' || child_number x
            from v$sql
           where sql_text = 'select * from pr_history'),
         'serial'));

where the quoted text 'select * from pr_history' is the SQL statement for which you want to get the query plan.


DB2

DB2 provides two methods to view the query plan of a SQL statement: visual and command-line based. In DB2, you must also set up the appropriate plan tables in the database unless you are using DB2 Control Center; if the database host machine has DB2 Control Center installed, the plan tables are generated automatically. To set up the plan tables manually, run the script EXPLAIN.DDL from the $DB2/sqllib/misc directory:

db2 -tf EXPLAIN.DDL

In addition, the visual method requires the installation of the DB2 Command Center client application on your PC. Once this tool has been configured to point to the appropriate DB2 instance, you can use the ‘execute and access plan’ option to view the access plan for any SQL being sent to the database from that tool.

To get a query plan for a statement using command-line tools, first delete all rows from the EXPLAIN_INSTANCE table. Then set one of the registers below in the DB2 CLP to enable capturing of snapshot or explain data for the SQL statement to be analyzed:

- SET CURRENT EXPLAIN SNAPSHOT YES
- SET CURRENT EXPLAIN SNAPSHOT EXPLAIN

Then use:

EXPLAIN PLAN FOR SELECT * FROM PR_HISTORY;

where SELECT * FROM PR_HISTORY is the SQL statement you want to tune. Once the statement to be analyzed has been run, the db2exfmt tool formats the output into an easy-to-read form. Below is a sample command line that can be used to obtain an explain plan for an already-run SQL statement:

db2exfmt -d pega -w -1 -n % -s % -# 0
