Teradata SQL—Unleash the Power Chapter 1- Teradata Parallel Architecture Chapter 2-Fundamental SQL Using SELECT Chapter 3-On-line HELP and SHOW Commands Chapter 4-Data Conversions
Chapter 5-Aggregation
Chapter 6-Subquery Processing Chapter 7-Join Processing
Chapter 8-Date and Time Processing Chapter 9-Character String Processing Chapter 10-OLAP Functions
Chapter 11-SET Operators Chapter 12-Data Manipulation Chapter 13-Data Interrogation Chapter 14-View Processing Chapter 15-Macro Processing Chapter 16-Transaction Processing
Chapter 17-Reporting Totals and Subtotals Chapter 18-Data Definition Language Chapter 19-Temporary Tables Chapter 20-Trigger Processing Chapter 21-Stored Procedures
Chapter 1:
Teradata Parallel Architecture
Teradata Introduction
The world's largest data warehouses commonly use the superior technology of NCR's Teradata relational database management system (RDBMS). A data warehouse is normally loaded directly from operational data. The majority, if not all of this data will be collected on-line as a result of normal business operations. The data warehouse therefore acts as a central repository of the data that reflects the effectiveness of the methodologies used in running a business.
As a result, the data loaded into the warehouse is mostly historic in nature. To get a true representation of the business, normally this data is not changed once it is loaded. Instead, it is interrogated repeatedly to transform data into useful information, to discover trends and the effectiveness of operational procedures. This interrogation is based on business rules to determine such aspects as profitability, return on investment and evaluation of risk.
For example, an airline might load all of its maintenance activity on every aircraft into the database. Subsequent investigation of the data could indicate the frequency at which certain parts tend to fail. Further analysis might show that the parts are failing more often on certain models of aircraft. The first benefit of the new found knowledge regards the ability to plan for the next failure and maybe even the type of airplane on which the part will fail. Therefore, the part can be on hand when and maybe where it is needed, or the part might be proactively changed prior to its failure.
If the information reveals that the part is failing more frequently on a particular model of aircraft, this could be an indication that the aircraft manufacturer has a problem with the design or production of that aircraft. Another possible cause is that the maintenance crew is doing something incorrectly and contributing to the situation. Either way, you cannot fix a problem if you do not know that a problem exists. There is incredible power and savings in this type of knowledge.
Another business area where the Teradata database excels is in retail. It provides an environment that can store billions of sales. This is a critical capability when you are
recording and analyzing the sales of every item in every store around the world. Whether it is used for inventory control, marketing research or credit analysis, the data provides an insight into the business. This type of knowledge is not easily attainable without detailed data that records every aspect of the business. Tracking inventory turns, stock replenishment, or predicting the number of goods needed in a particular store yields a priceless perspective into the operation of a retail outlet. This information is what enables one retailer to thrive while others go out of business.
Teradata is flourishing with the realization that detail data is critical to the survival of a business in a competitive, lower margin environment. Continually, businesses are forced to do more with less. Therefore, it is vital to maximize the efforts that work well to improve profit and minimize or correct those that do not work.
One computer vendor used these same techniques to determine that it cost more to sell into the desktop environment than was realized in profit. Prior to this realization, the sales effort had attempted to make up the loss by selling more computers. Unfortunately, increased sales meant increased losses. Today, that company is doing much better and has made a huge step into profitability by discontinuing the small computer line.
Teradata Architecture
The Teradata database currently runs normally on NCR Corporation's WorldMark Systems in the UNIX MP-RAS environment. Some of these systems consist of a single processing node (computer) while others are several hundred nodes working together in a single system. The NCR nodes are based entirely on industry standard CPU processor chips, standard internal and external bus architectures like PCI and SCSI, and standard memory modules with 4-way interleaving for speed.
At the same time, Teradata can run on any hardware server in the single node environment when the system runs Microsoft NT and Windows 2000. This single node may be any computer from a large server to a laptop.
Whether the system consists of a single node or is a massively parallel system with
hundreds of nodes, the Teradata RDBMS uses the exact same components executing on all the nodes in parallel. The only difference between small and large systems is the number of processing components.
When these components exist on different nodes, it is essential that the components
communicate with each other at high speed. To facilitate the communications, the multi-node systems use the BYNET interconnect. It is a high speed, multi-path, dual redundant
communications channel. Another amazing capability of the BYNET is that the bandwidth increases with each consecutive node added into the system. There is more detail on the BYNET later in this chapter.
Teradata Components
As previously mentioned, Teradata is the superior product today because of its parallel operations based on its architectural design. It is the parallel processing by the major components that provide the power to move mountains of data. Teradata works more like the early Egyptians who built the pyramids without heavy equipment using parallel,
coordinated human efforts. It uses smaller nodes running several processing components all working together on the same user request. Therefore, a monumental task is completed in record time.
Teradata operates with three major components to achieve the parallel operations. These components are called: Parsing Engine Processors, Access Module Processors and the Message Passing Layer. The role of each component is discussed in the next sections to provide a better understanding of Teradata. Once we understand how Teradata works, we will pursue the SQL that allows storage and access of the data.
Parsing Engine Processor (PEP or PE)
The Parsing Engine Processor (PEP) or Parsing Engine (PE), for short, is one of the two primary types of processing tasks used by Teradata. It provides the entry point into the database for users on mainframe and networked computer systems. It is the primary director task within Teradata.
As users "logon" to the database they establish a Teradata session. Each PE can manage 120 concurrent user sessions. Within each of these sessions users submit SQL as a request for the database server to take an action on their behalf. The PE will then parse the SQL statement to establish which database objects are involved. For now, let's assume that the database object is a table. A table is a two-dimensional array that consists of rows and columns. A row represents an entity stored in a table and it is defined using columns. An example of a row might be the sale of an item and its columns include the UPC, a description and the quantity sold.
Any action a user requests must also go through a security check to validate their privileges as defined by the database administrator. Once their authorization at the object level is verified, the PE will verify that the columns requested actually exist within the objects referenced.
Next, the PE optimizes the SQL to create an execution plan that is as efficient as possible based on the amount of data in each table, the indices defined, the type of indices, the selectivity level of the indices, and the number of processing steps needed to retrieve the data. The PE is responsible for passing the optimized execution plan to other components as the best way to gather the data.
An execution plan might use the primary index column assigned to the table, a secondary index or a full table scan. The use of an index is preferable and will be discussed later in this chapter. For now, it is sufficient to say that a full table scan means that all rows in the table must be read and compared to locate the requested data.
Although a full table scan sounds really bad, within the architecture of Teradata, it is not necessarily a bad thing because the data is divided up and distributed to multiple, parallel components throughout the database. We will look next at the AMPs that perform the parallel disk access using their file system logic. The AMPs manage all data storage on disks. The PE has no disks.
Activities of a PE:
Convert incoming requests from EBCDIC to ASCII (if from an IBM mainframe) Parse the SQL to determine type and validity
Validate user privileges
Optimize the access path(s) to retrieve the rows
Build an execution plan with necessary steps for row access Send the plan steps to Access Module Processors (AMP) involved
Access Module Processor (AMP)
The next major component of Teradata's parallel architecture is called an Access Module Processor (AMP). It stores and retrieves the distributed data in parallel. Ideally, the data rows of each table are distributed evenly across all the AMPs. The AMPs read and write data and are the workhorses of the database. Their job is to receive the optimized plan steps, built by the PE after it completes the optimization, and execute them. The AMPs are designed to work in parallel to complete the request in the shortest possible time.
Optimally, every AMP should contain a subset of all the rows loaded into every table. By dividing up the data, it automatically divides up the work of retrieving the data. Remember, all work comes as a result of a users' SQL request. If the SQL asks for a specific row, that row exists in its entirety (all columns) on a single AMP and other rows exist on the other AMPs.
If the user request asks for all of the rows in a table, every AMP should participate along with all the other AMPs to complete the retrieval of all rows. This type of processing is called an all AMP operation and an all rows scan. However, each AMP is only responsible for its rows, not the rows that belong to a different AMP. As far as the AMPs are concerned, it owns all of the rows. Within Teradata, the AMP environment is a "shared nothing" configuration. The AMPs cannot access each others' data rows, and there is no need for them to do so. Once the rows have been selected, the last step is to return them to the client program that initiated the SQL request. Since the rows are scattered across multiple AMPs, they must be consolidated before reaching the client. This consolidation process is accomplished as a part of the transmission to the client so that a final comprehensive sort of all the rows is never performed. Instead, all AMPs sort only their rows (at the same time – in parallel) and the Message Passing Layer is used to merge the rows as they are transmitted from all the AMPs.
Therefore, when a client wishes to sequence the rows of an answer set, this technique causes the sort of all the rows to be done in parallel. Each AMP sorts only its subset of the rows at the same time all the other AMPs sort their rows. Once all of the individual sorts are complete, the BYNET merges the sorted rows. Pretty brilliant!
Store and retrieve data rows using the file system Aggregate data
Join processing between multiple tables
Convert ASCII returned data to EBCDIC (IBM mainframes only) Sort and format output data
Message Passing Layer (BYNET)
The Message Passing Layer varies depending on the specific hardware on which the Teradata database is executing. In the latter part of the 20th century, most Teradata
database systems executed under the UNIX operating system. However, in 1998, Teradata was released on Microsoft's NT operating system. Today it also executes under Windows 2000. The initial release of Teradata, on the Microsoft systems, is for a single node.
When using the UNIX operating system, Teradata supports up to 512 nodes. This massively parallel system establishes the basis for storing and retrieving data from the largest
commercial databases in the world, Teradata. Today, the largest system in the world consists of 176 nodes. There is much room for growth as the databases begin to exceed 40 or 50 terabytes.
For the NCR UNIX systems, the Message Passing Layer is called the BYNET. The amazing thing about the BYNET is its capacity. Instead of a fixed bandwidth that is shared among multiple nodes, the bandwidth of the BYNET increases as the number of nodes increase. This feat is accomplished as a result of using virtual circuits instead of using a single fixed cable or a twisted pair configuration.
To understand the workings of the BYNET, think of a telephone switch used by local and long distance carriers. As more and more people place phone calls, no one needs to speak slower. As one switch becomes saturated, another switch is automatically used. When your phone call is routed through a different switch, you do not need to speak slower. If a natural or other type of disaster occurs and a switch is destroyed, all subsequent calls are routed through other switches. The BYNET is designed to work like a telephone switching network. An additional aspect of the BYNET is that it is really two connection paths, like having two phone lines for a business. The redundancy allows for two different aspects of its
performance. The first aspect is speed. Each path of the BYNET provides bandwidth of 10 Megabytes (MB) per second with Version 1 and 60 MB per second with Version 2. Therefore the aggregate speed of the two connections is 20MB/second or 120MB/second. However, as mentioned earlier, the bandwidth grows linearly as more nodes are added.
Using Version 1 any two nodes communicate at 40MB/second (10MB/second * 2 BYNETs * 2 nodes). Therefore, 10 nodes can utilize 200MB/second and 100 nodes have
2000MB/second available between them. When using the version 2 BYNET, the same 100 nodes communicate at 12,000MB/second (60MB/second * 2 BYNETs * 100 nodes).
The second and equally important aspect of the BYNET uses the two connections for availability. Regardless of the speed associated with each BYNET connection, if one of the connections should fail, the second is completely independent and can continue to function at its individual speed without the other connection. Therefore, communications continue to pass between all nodes.
Although the BYNET is performing at half the capacity during an outage, it is still operational and SQL is able to complete without failing. In reality, when the BYNET is performing at only 10MB/second per node, it is still a lot faster than many normal networks that typically transfer messages at 10MB per second.
All messages going across the BYNET offer guaranteed delivery. So, any messages not successfully delivered because of a failure on one connection automatically route across the other connection. Since half of the BYNET is not working, the bandwidth reduces by half. However, when the failed connection is returned to service, its topology is automatically configured back into service and it begins transferring messages along with the other connection. Once this occurs, the capacity returns to normal.
A Teradata Database
Within Teradata, a database is a storage location for database objects (tables, views, macros, and triggers). An administrator can use Data Definition Language (DDL) to establish a database by using a CREATE DATABASE command.
A database may have PERMANENT (PERM) space allocated to it. This PERM space establishes the maximum amount of disk space for storing user data rows in any table located in the database. However, if no tables are stored within a database, it is not required to have PERM space. Although a database without PERM space cannot store tables, it can store views and macros because they are physically stored in the Data Dictionary (DD) PERM space and require no user storage space. The DD is in a "database" called DBC. Teradata allocates PERM space to tables, up to the maximum, as rows are inserted. The space is not pre-allocated. Instead, it is allocated, as rows are stored in blocks on disk. The maximum block size is defined either at a system level in the DBS Control Record, at the database level or individually for each table. Like PERM, the block size is a maximum size. Yet, it is only a maximum for blocks that contain multiple rows. By nature, the blocks are variable in length. So, disk space is not pre-allocated; instead, it is allocated on an as needed basis, one sector (512 bytes) at a time. Therefore, the largest possible wasted disk space in a block is 511 bytes.
A database can also have SPOOL space associated with it. All users who run queries need workspace at some point in time. This SPOOL space is workspace used for the temporary storage of rows during the execution of user SQL statements. Like PERM space, SPOOL is defined as a maximum amount that can be used within a database or by a user. Since PERM is not pre-allocated, unused PERM space is automatically available for use as SPOOL. This maximizes the disk space throughout the system.
It is a common practice in Teradata to have some databases with PERM space that contain only tables. Then, other databases contain only views. These view databases require no PERM space and are the only databases that users have privileges to access. The views in these databases control all access to the real tables in other databases. They insulate the actual tables from user access. There will be more on views later in this book.
The newest type of space allocation within Teradata is TEMPORARY (TEMP) space. A database may or may not have TEMP space, however, it is required if Global Temporary Tables are used. The use of temporary tables is also covered in more detail later in the SQL portion of this book.
A database is defined using a series of parameter values at creation time. The majority of the parameters can easily be changed after a database has been created using the MODIFY DATABASE command. However, when attempting to increase PERM or TEMP space maximums, there must be sufficient disk space available even though it is not immediately allocated. There may not be more PERM space defined that actual disk on the system. A number of additional database parameters are listed below along with the user parameters in the next section. These parameters are tools for the database administrator and other experienced users when establishing databases for tables and views.
PERMANENT TEMPORARY SPOOL ACCOUNT FALLBACK JOURNAL DEFAULT JOURNAL
Teradata Users
In Teradata, a user is the same as a database with one exception. A user is able to logon to the system and a database cannot. Therefore, to authenticate the user, a password must be established. The password is normally established at the same time that the CREATE USER statement is executed. The password can also be changed using a MODIFY USER
command.
Like a database, a user area can contain database objects (tables, views, macros and triggers). A user can have PERM and TEMP space and can also have spool space. On the other hand, a user might not have any of these types of space, exactly the same as a database.
The biggest difference between a database and a user is that a user must have a password. This similarity between the two makes administering the system easier and allows for default values that all databases and users can inherit.
The next two lists regard the creation and modification of databases and users.
{ CREATE | MODIFY } DATABASE or USER (in common)
PERMANENT TEMPORARY SPOOL ACCOUNT FALLBACK JOURNAL DEFAULT JOURNAL
{ CREATE | MODIFY } USER (only)
PASSWORD
STARTUP
DEFAULT DATABASE
By no means are these all of the parameters. It is not the intent of this chapter, nor the intent of this book to teach database administration. There are reference manuals and courses available to use. Teradata administration warrants a book by itself.
Symbols Used in this Book
Since there are no standard symbols for teaching SQL, it is necessary to understand some of the symbols used in our syntax diagrams throughout this book.
Figure 1-1
DATABASE Command
When users negotiate a successful logon to Teradata, they are automatically positioned in a default database as defined by the database administrator. When an SQL request is
executed, by default, it looks in the current database for all referenced objects.
There may be times when the object is not in the current database. When this happens, the user has one of two choices to resolve this situation. One solution is to qualify the name of the object along with the name of the database in which it resides. To do this, the user simply associates the database name to the object name by connecting them with a period (.) or dot as shown below:
<database-name>.<table-name>
The second solution is to use the database command. It repositions the user to the specified database. After the database command is executed, there is no longer a need to qualify the objects in that database. Of course, if the SQL statement references additional objects in another database, they will have to be qualified in order for the system to locate them. Normally, you will DATABASE to the database that contains most of the objects that you need. Therefore it reduces the number of object names requiring qualification.
The following is the syntax for the DATABASE command. DATABASE <database-name>
;
If you are not sure what database you are in, either the HELP SESSION or SELECT DATABASE command may be used to make that determination. These commands and other HELP functions are covered in the SQL portion of this book.
Use of an Index
Although a relational data model uses Primary Keys and Foreign Keys to establish the relationships between tables, that design is a Logical Model. Each vendor uses specialized techniques to implement a Physical Model. Teradata does not use keys in its physical model. Instead, Teradata is implemented using indices, both primary and secondary. The Primary Index (PI) is the most important index in all of Teradata. The performance of Teradata can be linked directly to the selection of this index. The data value in the PI column(s) is submitted to the hashing function. The resulting row hash value is used to map the row to a specific AMP for data distribution and storage.
To illustrate this concept, I have on several occasions used two decks of cards. Imagine if you will, fourteen people in a room. To the largest, most powerful looking man in the room, you give one of the decks of cards. His large hands allow him to hold all fifty-two cards at
one time, with some degree of success. The cards are arranged with the ace of spades continuing through the king of spades in ascending order. After the spades, are the hearts, then the clubs and last, the diamonds. Each suit is arranged starting with the ace and ascending up to the king. The cards are partitioned by suit.
The other deck of cards is divided among the other thirteen people. Using this procedure, all cards with the same value (i.e. aces) all go to the same person. Likewise, all the deuces, treys and subsequent cards each go to one of the thirteen people. Each of the four cards will be in the same order as the suits contained in the single deck that went to the lone man: spades, hearts, clubs and diamonds. Once all the cards have been distributed, each of the thirteen people will be holding four cards of the same value (4*13=52). Now, the game can begin.
The requests in this game come in the form of "give-me," one or more cards.
To make it easy for the lone player, we first request: give-me the ace of spades. The person with four aces finds their ace, as does the lone player with all 52 cards, both on the top other their cards. That was easy!
As the difficulty of the give-me requests increase, the level of difficulty dramatically increases for the lone man. For instance, when the give-me request is for all of the twos, one of the thirteen people holds up all four of their cards and they are done. The lone man must locate the 2 of spades between the ace and trey. Then, go and locate the 2 of hearts, thirteen cards later between the ace and trey. Then, find the 2 of clubs, thirteen cards after that, as well as the 2 of diamonds, thirteen cards after that to finally complete the request.
Another request might be give-me all of the diamonds. For the thirteen people, each person locates and holds up one card of their cards and the request is finished. For the lone person with the single deck, the request means finding and holding up the last thirteen cards in their deck of fifty-two. In each of these give-me requests, the lone man had to negotiate all fifty two cards while the thirteen other people only needed to determine which of the four cards applied to the request, if any. This is the same procedure used by Teradata. It divides up the data like we divided up the cards.
As illustrated, the thirteen people are faster than the lone man. However, the game is not limited to thirteen players. If there were 26 people who wished to play on the same team, the cards simply need to be divided or distributed differently.
When using the value (ace through king) there are only 13 unique values. In order for 26 people to play, we need a way to come up with 26 unique values for 26 people. To make the cards more unique, we might combine the value of the card (i.e. ace) with the color.
Therefore, we have two red aces and two black aces as well as two sets for every other card. Now when we distribute the cards, each of the twenty-six people receives only two cards instead of the original four. The distribution is still based on fifty-two cards (2 times 26). At the same time, 26 people is not the optimum number for the game. Based on what has been discussed so far, what is the optimum number of people?
If your answer is 52, then you are absolutely correct.
With this many people, each person has one and only one card. Any time a give-me is requested of the participants, their one card either qualifies or it does not. It doesn't get any simpler or faster than this situation.
As easy as this sounds, to accomplish this distribution the value of the card alone is not sufficient to manifest 52 unique values. Neither is using the value and the color. That
combination only gives us a distribution of 26 unique values when 52 unique values are desired.
To achieve this distribution we need to establish still more uniqueness. Fortunately, we can use the suit along with the value. Therefore, the ace of spades is different than the ace of hearts, which is different from the ace of clubs and the ace of diamonds. In other words, there are now 52 unique identities to use for distribution.
To relate this distribution to Teradata, one or more columns of a table are chosen to be the Primary Index.
Primary Index
The Primary Index can consist of up to sixteen different columns. These columns, when considered together, provide a comprehensive technique to derive a Unique Primary Index (UPI, pronounced as "you-pea") value as we discussed previously regarding the card analogy. That is the good news.
To store the data, the value(s) in the PI are hashed via a calculation to determine which AMP will own the data. The same data values always hash the same row hash and therefore are always associated with the same AMP.
The advantage to using up to sixteen columns is that row distribution is very smooth or evenly based on unique values. This simply means that each AMP contains the same number of rows. At the same time, there is a downside to using several columns for a PI. The PE needs every data value for each column as input to the hashing calculation to directly access a particular row. If a single column value is missing, a full table scan will result because the row hash cannot be recreated. Any row retrieval using the PI column(s) is always an efficient, one AMP operation.
Although uniqueness is good in most cases, Teradata does not require that a UPI be used. It also allows for a Non-Unique Primary Index (NUPI, pronounced as new-pea). The potential downside of a NUPI is that if several duplicate values (NUPI dups) are stored, they all go to the same AMP. This can cause an uneven distribution that places more rows on some of the AMPs than on others. This means that any time an AMP with a larger number of rows is involved, it has to work harder than the other AMPs. The other AMPs will finish before the slower AMP. The time to process a single user request is always based on the slowest AMP. Therefore, serious consideration should be used when making the decision to use a NUPI. Every table must have a PI and it is established when the table is created. If the CREATE TABLE statement contains: UNIQUE PRIMARY INDEX( <column-list> ), the value in the column(s) will be distributed to an AMP as a UPI. However, if the statement reads:
PRIMARY INDEX ( <column-list> ), the value in the column(s) will be distributed as a NUPI and allow duplicate values. Again, all the same values will go to the same AMP.
If the DDL statement does not specify a PI, but it specifies a PRIMARY KEY (PK), the named column(s) are used as the UPI. Although Teradata does not use primary keys, the DDL may be ported from another vendor's database system.
A UPI is used because a primary key must be unique and cannot be null. By default, both UPIs and NUPIs allow a null value to be stored unless the column definition indicates that null values are not allowed using a NOT NULL constraint.
Now, with that being said, when considering JOIN accesses on the tables, sometimes it is advantageous to use a NUPI. This is because the rows being joined between tables must be on the same AMP. If they are not on the same AMP, one of the rows must be moved to the
same AMP as the matching row. Teradata will use one of two different strategies to temporarily move rows. It can copy all needed rows to all AMPs or it can redistribute them using the hashing mechanism on the column defined as the join domain that is a PI. However, if neither join column is a PI, it might be necessary to redistribute all participating rows from both tables by hash code to get them together on a single AMP.
Planning data distribution, using access characteristics, can reduce the amount of data movement and therefore improve join performance. This works fine as long as there is a consistent number of duplicate values or only a small number of duplicate values. The logical data model needs to be extended with usage information in order to know the best way to distribute the data rows. This is done during the physical implementation phase before creating tables.
Secondary Index
A Secondary Index (SI) is used in Teradata as a way to directly access rows in the data, sometimes called the base table, without requiring the use of PI values. Unlike the PI, an SI does not effect the distribution of the data rows. Instead, it is an alternate read path and allows for a method to locate the PI value using the SI. Once the PI is obtained, the row can be directly accessed using the PI. Like the PI, an SI can consist of up to 16 columns.
In order for an SI to retrieve the data row by way of the PI, it must store and retrieve an index row. To accomplish this Teradata creates, maintains and uses a subtable. The PI of the subtable is the value in the column(s) that are defined as the SI. The "data" stored in the subtable row is the previously hashed value of the real PI for the data row or rows in the base table. The SI is a pointer to the real data row desired by the request. An SI can also be unique (USI, pronounced as you-sea) or non-unique (NUSI, pronounced as new-sea). The rows of the subtable contain the row hashed value of the SI, the actual data value(s) of the SI, and the row hashed value of the PI as the row ID. Once the row ID of the PI is obtained from the subtable row, using the hashed value of the SI, the last step is to get the actual data row from the AMP where it is stored. The action and hashing for an SI is exactly the same as when starting with a PI. values can generate small hash values and small data values can produce large hash values. So, to overcome the issue associated with a hashed value, there is a range feature called Value Ordered NUSIs. At this time, it may only be used with a four byte or smaller numeric data column. Based on its functionality, a Value Ordered NUSI is perfect for date processing. See the HYPERLINK
"http://www.books24x7.com/viewer.asp?bkid=5558&destid=661" \l "661" \t "_parent" µDDL chapter§ in this book for more details on USI and NUSI usage.
Chapter 2: Fundamental SQL Using SELECT Fundamental Structured Query Language (SQL)
The access language for all modern relational database systems (RDBMS) is Structured Query Language (SQL). It has evolved over time to be the standard. The ANSI SQL group defines which commands and functionality all vendors should provide within their RDBMS. There are three levels of compliance within the standard: Entry, Intermediate and Full. The three level definitions are based on specific commands, data types and functionalities. So, it is not that a vendor has incorporated some percentage of the commands; it is more that each command is categorized as belonging to one of the three levels. For instance, most data types are Entry level compliant. Yet, there are some that fall into the Intermediate and Full definitions.
Since the standard continues to grow with more options being added, it is difffiles, it has always been more efficient to read the entire file and not use an index if more than 15% of the records were needed. This is compounded with Teradata because the "file" is read in parallel instead of all data from a single file. So, the efficiency percentage is probably closer to being less than 3% of all the rows in order to use the NUSI.
If the SQL does not use a NUSI, you should consider dropping it, due to the fact that the subtable takes up PERM space with no benefit to the users. The Teradata EXPLAIN is covered in this book and it is the easiest way to determine if your SQL is using a NUSI. Furthermore, the optimizer will never use a NUSI without STATISTICS.
There has been another evolution in the use of NUSI processing. It is called NUSI
Bitmapping. This means that if a table has two different NUSI indices and individually they are weakly selective, but together they can be bitmapped together to eliminate most of the non-conforming rows; it will use the two different NUSI columns together because they become highly selective. Therefore, many times, it is better to use smaller individual NUSI indices instead of a large composite (more than one column) NUSI.
There is another feature related to NUSI processing that can improve access time when a value range comparison is requested. When using hash values, it is impossible to determine any value within the range. This is because large data values can generate small hash values and small data values can produce large hash values. So, to overcome the issue associated with a hashed value, there is a range feature called Value Ordered NUSIs. At this time, it may only be used with a four byte or smaller numeric data column. Based on its functionality, a Value Ordered NUSI is perfect for date processing. See the DDL chapter in this book for more details on USI and NUSI usage.
Chapter 2:
Fundamental SQL Using SELECT
Fundamental Structured Query Language (SQL)
The access language for all modern relational database systems (RDBMS) is Structured Query Language (SQL). It has evolved over time to be the standard. The ANSI SQL group defines which commands and functionality all vendors should provide within their RDBMS. There are three levels of compliance within the standard: Entry, Intermediate and Full. The three level definitions are based on specific commands, data types and functionalities. So, it is not that a vendor has incorporated some percentage of the commands; it is more that each command is categorized as belonging to one of the three levels. For instance, most data types are Entry level compliant. Yet, there are some that fall into the Intermediate and Full definitions.
Since the standard continues to grow with more options being added, it is difficult to stay fully ANSI compliant. Additionally, all RDBMS vendors provide extra functionality and options that are not part of the standard. These extra functions are called extensions because they extend or offer a benefit beyond those in the standard definition.
At the writing of this book, Teradata was fully ANSI Entry level compliant based on the 1992 Standards document. NCR also provides much of the Intermediate and some of the Full capabilities. This book indicates feature by feature which SQL capabilities are ANSI and which are Teradata specific, or extensions. It is to NCR's benefit to be as compliant as possible in order to make it easier for customers of other RDBMS vendors to port their data warehouse to Teradata.
As indicated earlier, SQL is used to access, store, remove and modify data stored within a relational database, like Teradata. The SQL is actually comprised of three types of
statements. They are: Data Definition Language (DDL), Data Control Language (DCL) and Data Manipulation Language (DML). The primary focus of this book is on DML and DDL. Both DDL and DCL are, for the most part, used for administering an RDBMS. Since the SELECT statement is used the vast majority of the time, we are concentrating on its functionality, variations and capabilities.
Everything in the first part of this chapter describes ANSI standard capabilities of the SELECT command. As the statements become more involved, each capability will be designated as either ANSI or a Teradata Extension.
Basic SELECT Command
Using the SELECT has been described like playing the game, Jeopardy. The answer is there; all you have to do is come up with the correct question.
The basic structure of the SELECT statement indicates which column values are desired and the tables that contain them. To aid in the learning of SQL, this book will capitalize the SQL keywords. However, when SQL is written for Teradata, the case of the statement is not important. The SQL statements can be written using all uppercase, lowercase or a
combination; it does not matter to the Teradata PE.
The SELECT is used to return the data value(s) stored in the columns named within the SELECT command. The requested columns must be valid names defined in the table(s) listed in the FROM portion of the SELECT.
The following shows the format of a basic SELECT statement. In this book, the syntax uses expressions like: <column-name> (see Figure 1-1) to represent the location of one or more names required to construct a valid SQL statement:
SEL[ECT] <column-name> [,<column-name] FROM <table-name> ;
The structure of the above command places all keywords on the left in uppercase and the variable information such as column and table names to the right. Like using capital letters, this positioning is to aid in learning SQL. Lastly, although the use of SEL is acceptable in Teradata, with [ECT] in square brackets being optional, it is not ANSI standard.
Lastly, when multiple column names are requested in the SELECT, a comma must separate them. Without the separator, the optimizer cannot determine where one ends and the next begins.
The following syntax format is also acceptable: SEL[ECT] <column-name> FROM <table-name> ;
Both of these SELECT statements produce the output report, but the above style is easier to read and debug for complex queries. The output display might appear as:
3 Rows Returned <column-name> aaaaaaaaaaaaaaaaaa bbbbbbbbbbbbbbbbbb cccccccccccccccccc
In the output, the column name becomes the default heading for the report. Then, the data contained in the selected column is displayed once for each row returned.
The next variation of the SELECT statement returns all of the columns defined in the table indicated in the FROM portion of the SELECT.
SEL[ECT] *
FROM <table-name> ;
The output of the above request uses each column name as the heading and the columns are displayed in the same sequence as they are defined in the table. Depending on the tool used to submit the request, care should be taken, because if the returned display is wider than the media (i.e. terminal=80 and paper=133); it may be truncated.
At times, it is desirable to select the same column twice. This is permitted and to accomplish it, the column name is simply listed in the SELECT column list more than once. This
technique might often be used when doing aggregations or calculating a value, both are covered in later chapters.
The table below is used to demonstrate the results of various requests. It is a small table with a total of ten rows for easy comparison.
For Example: the next SELECT might be used with Figure 2-1, to display the student number, the last name, first name, the class code and grade point for all of the students in the Student table:
Figure 2-1
SELECT *
FROM Student_Table ; 10 Rows returned
Student_ID Last_Name First_Name Class_Code Grade_Pt
423400 Larkins Michael FR 0.00 125634 Hanson Henry FR 2.88 280023 McRoberts Richard JR 1.90 260000 Johnson Stanley ? ? 231222 Wilson Susie SO 3.80 234121 Thomas Wendy FR 4.00 324652 Delaney Danny SR 3.35 123250 Phillips Martin SR 3.00
Student_ID Last_Name First_Name Class_Code Grade_Pt
322133 Bond Jimmy JR 3.95
333450 Smith Andy SO 2.00
Notice that Johnson has question marks in the grade point and class code columns. Most client software uses the question mark to represent missing data or an unknown value (NULL). More discussion on this condition will appear throughout this book. The other thing to note is that character data is aligned from left to right, the same as we read it and numeric is from right to left, from the decimal.
This SELECT returns all of the columns except the Student ID from the Student table: SELECT First_name ,Last_name ,Class_Code ,Grade_Pt FROM Student_Table ; 10 Rows returned
First_Name Last_Name Class_Code Grade_Pt
Michael Larkins FR 0.00 Henry Hanson FR 2.88 Richard McRoberts JR 1.90 Stanley Johnson ? ? Susie Wilson SO 3.80 Wendy Thomas FR 4.00 Danny Delaney SR 3.35 Martin Phillips SR 3.00 Jimmy Bond JR 3.95 Andy Smith SO 2.00
There is no short cut for selecting all columns except one or two. Also, notice that the columns are displayed in the output in the same sequence they are requested in the SELECT statement.
WHERE Clause
The previous "unconstrained" SELECT statement returned every row from the table. Since the Teradata database is most often used as a data warehouse, a table might contain millions of rows. So, it is wise to request only certain types of rows for return. By adding a WHERE clause to the SELECT, a constraint is established to potentially limit which rows are returned based on a TRUE comparison to specific criteria or set of conditions.
SEL[ECT] { * | <column-name> } [,<column-name> ]
FROM <table-name>
WHERE <column-name> <comparison> <data-value>
;
Figure 2-2
The following SELECT can be used to return the students with a B (3.0) average or better from the Student table:
SELECT Student_ID ,Last_Name ,Grade_Pt FROM Student_Table WHERE Grade_Pt >= 3.0 ; 5 Rows returned
Student_ID Last_Name Grade_Pt
231222 Wilson 3.80
234121 Thomas 4.00
324652 Delaney 3.35
123250 Phillips 3.00
322133 Bond 3.95
Without the WHERE clause, the AMPs return all of the rows in the table to the user. More and more Teradata user systems are getting to the point where they are storing billions of rows in a single table. There must be a very good reason for needing to see all of them. More simply put, you will always use a WHERE clause whenever you want to see only a portion of the rows in a table.
Compound Comparisons ( AND / OR )
Many times a single comparison is not sufficient to specify the desired rows. To add more functionality to the WHERE it is common to use more than one comparison. The multiple condition checks and column names are not separated by a comma, like column names. Instead, they must be connected using a logical operator.
The following is the syntax for using the AND logical operator: SEL[ECT] <column-name>
[,<column-name> ] FROM <table-name>
WHERE <column-name> <comparison> <data-value> { AND | OR } <column-name> <comparison> <data-value>
;
Notice that the column name is listed for each comparison separated by a logical operator; this will be true even when it is the same column being compared twice. The AND signifies
that each individual comparison on both sides of the AND must be true. The final result of the comparison must be TRUE for a row to be returned.
Figure 2-3
When using AND, different columns must be used because a single column can never contain more than a single data value.
Therefore, it does not make good sense to issue the next SELECT using an AND on the same column because no rows will ever be returned.
SELECT Last_Name ,First_Name FROM Student_Table
WHERE Grade_Pt = 3.0 AND Grade_Pt = 4.0; No rows found
The above SELECT will never return any rows. It is impossible for a column to contain more than one value. No student has a 3.0 grade average AND a 4.0 average. They might have one or the other, but not both. It might contain one or the other, but never both at the same time. The AND operator indicates both must be TRUE and should never be used between two comparisons on the same column.
By substituting an OR logical operator for the previous AND, rows will now be returned. The following is the syntax for using OR:
SELECT Student_ID ,Last_Name ,First_Name ,Grade_Pt FROM Student_Table
WHERE Grade_Pt = 3.0 OR Grade_Pt = 4.0 ; 2 Rows returned
Student_ID Last_Name First_Name Grade_Pt
234121 Thomas Wendy 4.00
123250 Phillips Martin 3.00
The OR signifies that only one of the comparisons on each side of the OR needs to be true for the entire test to result in a true and the row to be selected.
When using the OR, the same column or different column names may be used. In this case, it makes sense to use the same column because a row is returned when a column contains either of the specified values as opposed to both values as seen with AND.
It is perfectly legal and common practice to combine the AND with the OR in a single SELECT statement.
The next SELECT contains both an AND as well as an OR: SELECT *
FROM Student_Table
WHERE Grade_Pt = 3.0 OR Grade_Pt = 4.0 AND Class_Code = 'FR' ; 2 Rows returned
Student_ID Last_Name First_Name Class_Code Grade_Pt
234121 Thomas Wendy FR 4.00
123250 Phillips Martin SR 3.00
At first glance, it appears that the comparison worked correctly. However, upon closer evaluation it is incorrect because Phillips is a senior and not a freshman.
When mixing AND with OR in the same WHERE clause, it is important to know that the AND is evaluated first. The previous SELECT actually returns all rows with a grade point of 3.0. Hence, Phillips was returned. The second comparison returned Thomas with a grade point of 4.0 and a class code of ‘FR’.
When it is necessary for the OR to be evaluated before the AND the use of parentheses changes the priority of evaluation. A different result is seen when doing the OR first. Here is how the statement should be written:
SELECT Last_Name ,Class_Code ,Grade_Pt FROM Student_Table
WHERE ( Grade_Pt = 3.0 OR Grade_Pt = 4.0 ) AND Class_Code = 'FR' ; 1 Row returned
Last_Name Class_Code Grade_Pt
Thomas FR 4.00
Now, only Thomas is returned and the output is correct.
Impact of NULL on Compound Comparisons
NULL is an SQL reserved word. It represents missing or unknown data in a column. Since NULL is an unknown value, a normal comparison cannot be used to determine whether it is true or false. All comparisons of any value to a NULL result in an unknown; it is neither true nor false. The only valid test for a null uses the keyword NULL without the normal
comparison symbols and is explained in this chapter.
When a table is created in Teradata, the default for a column is for it to allow a NULL value to be stored. So, unless the default is over-ridden and NULL values are not allowed, it is a good idea to understand how they work.
A SHOW TABLE command (chapter 3) can be used to determine whether a NULL is allowed. If the column contains a NOT NULL constraint, you need not be concerned about the presence of a NULL because it is disallowed.
Figure 2-5
Figure 2-6
For most comparisons, an unknown (null) is functionally equivalent to a false because it is not a true. Therefore, when using any comparison symbol a row is not returned when it contains a NULL.
At the same time, the next SELECT does not return Johnson because all comparisons against a NULL are unknown:
SELECT *
FROM Student_Table
WHERE Grade_Pt = NULL AND Class_Code = NULL ; No rows found
As seen in the above Truth tables, a comparison test cannot be used to find a NULL. To find a NULL, it becomes necessary to make a slight change in the syntax of the conditional comparison. The coding necessary to find a NULL is seen in the next section.
Using NOT in SQL Comparisons
It can be fairly straightforward to request exactly which rows are needed. However,
sometimes rows are needed that contain any value other than a specific value. When this is the case, it might be easier to write the SELECT to find what is not needed instead of what is needed. Then convert it to return everything else. This might be the situation when there are 100 potential values stored in the database table and 99 of them are needed. So, it is easier to eliminate the one value than it is to specifically list the desired 99 different values
individually.
Either of the next two SELECT formats can be used to accomplish the elimination of the one value:
SEL[ECT] <column-name> FROM <table-name>
WHERE <column-name> NOT <comparison> <data-value> ;
Or
SEL[ECT] <column-name> FROM <table-name>
WHERE NOT ( <column-name> <comparison> <data-value> ) ;
This second version of the SELECT is normally used when compound conditions are required. This is because it is usually easier to code the SELECT to get what is not wanted and then to enclose the entire set of comparisons in parentheses and put one NOT in front of it. Otherwise, with a single comparison, it is easier to put NOT in front of the comparison operator without requiring the use of parentheses.
The next SELECT uses the NOT with an AND comparison to display seniors and lower classmen with grade points less than 3.0:
SELECT Last_Name ,First_Name ,Class_Code ,Grade_Pt FROM Student_Table
WHERE NOT ( Grade_Pt >= 3.0 AND Class_Code <> 'SR' ) ; 6 Rows returned
Last_Name First_Name Class_Code Grade_Pt
McRoberts Richard JR 1.90 Hanson Henry FR 2.88 Delaney Danny SR 3.35 Larkins Michael FR 0.00 Phillips Martin SR 3.00 Smith Andy SO 2.00
Without using the above technique of a single NOT, it is necessary to change every
individual comparison. The following SELECT shows this approach, notice the other change necessary below, NOT AND is an OR:
Since you cannot have conditions like: NOT >= and NOT <>, they must be converted to < (not < and not =) and = (not, not =). It returns the same 5 rows, but also notice that the AND is now an OR: SELECT Last_Name ,First_Name ,Class_Code ,Grade_Pt FROM Student_Table
WHERE Grade_Pt < 3.0 OR Class_Code = 'SR' ; 6 Rows returned
Last_Name First_Name Class_Code Grade_Pt
McRoberts Richard JR 1.90
Last_Name First_Name Class_Code Grade_Pt Delaney Danny SR 3.35 Phillips Martin SR 3.00 Larkins Michael FR 0.00 Smith Andy SO 2.00 Figure 2-7
To maintain the integrity of the statement, all portions of the WHERE must be changed, including AND, as well as OR. The following two SELECT statements illustrate the same concept when using an OR:
SELECT Last_Name FROM Student_Table
WHERE NOT ( Grade_Pt >= 3.0 OR Grade_Pt <= 2.0 ) ; Or
SELECT Last_Name FROM Student_Table
WHERE Grade_Pt < 3.0 AND Grade_Pt > 2.0 ; 1 Row returned
Last_Name
Hanson
In the earlier Truth table, the NULL value returned an unknown when checked with a comparison operator. When looking for specific conditions, an unknown was functionally equivalent to a false, but really it is an unknown.
These two Truth tables can be used together as a tool when mixing AND and OR together in the WHERE clause along with NOT.
Figure 2-8
Figure 2-9
There is an issue associated with using NOT. When a NOT is done on a true condition, the result is a false. Likewise, the NOT of a false is a true. However, when a NOT is done with an unknown, the result is still an unknown. Whenever a NULL appears in the data for any of
the columns being compared, the row will never be returned and the answer set will not be what is expected.
It takes some practice and consideration when using NOT. Another area where care must be taken is when allowing NULL values to be stored in one or both of the columns. As
mentioned earlier, previous versions of Teradata had no concept of "unknown" and if a compare didn't result in a true, it was false. With the emphasis on ANSI compatibility the unknown was introduced.
If NULL values are allowed and there is potential for the NULL to impact the final outcome of compound tests, additional tests are required to eliminate them. One way to eliminate this concern is to never allow a NULL value in any columns. However, this may not be
appropriate and it will require more storage space because a NULL can be compressed. Therefore, when a NULL is allowed, the SQL needs to simply check for a NULL.
Therefore, using the expression IS NOT NULL is a good technique when NULL is allowed in a column and the NOT is used with a single or a compound comparison. This does require another comparison and could be written as:
SELECT Last_Name ,First_Name ,Class_Code ,Grade_Pt FROM Student_Table
WHERE NOT ( Grade_Pt >= 3.0 AND Grade_Pt IS NOT NULL AND Class_Code <> 'SR' AND Class_Code IS NOT NULL )
;
7 Rows returned
Last_Name First_Name Class_Code Grade_Pt
Larkins Michael FR 0.00 Hanson Henry FR 2.88 McRoberts Richard R 1.90 Johnson Stanley ? ? Delaney Danny SR 3.35 Phillips Martin SR 3.00 Smith Andy SO 2.00
Notice that Johnson came back this time and did not appear previously because of the NULL values.
Later in this book, the COALESCE will be explored as another way to eliminate NULL values directly in the SQL instead of in the database.
Multiple Value Search (IN)
Previously, it was shown that adding a WHERE clause to the SELECT limited the returned rows to those that meet the criteria. The IN comparison is an alternative to using one or more OR comparisons on the same column in the WHERE clause of a SELECT statement and the IN comparison also makes it a bit easier to code:
SEL[ECT] <column-name> [,<column-name> ] FROM <table-name>
WHERE <column-name> IN (<value-list>) ;
The value list normally consists of multiple values separated by commas. When the value in the column being compared matches one of the values in the list, the row is returned. The following is an example for the alternative method when any one of the conditions is enough to satisfy the request using IN:
SELECT Last_Name ,Class_Code ,Grade_Pt FROM Student_Table WHERE Grade_Pt IN ( 2.0, 3.0, 4.0 ) ; 3 Row returned
Last_Name Class_Code Grade_Pt
Phillips SR 3.00
Thomas FR 4.00
Smith SO 2.00
The use of multiple conditional checks as well as the IN can be used in the same SELECT request. Considerations include the use of AND for declaring that multiple conditions must all be true. Earlier, we saw the solution using a compound OR.
Using NOT IN
As seen earlier, sometimes the unwanted values are not known or it is easier to eliminate a few values than to specify all the values needed. When this is the case, it is a common practice to use the NOT IN as coded below.
The next statement eliminates the rows that match and return those that do not match: SELECT Last_Name
,Grade_Pt FROM Student_Table
WHERE Grade_Pt NOT IN ( 2.0, 3.0, 4.0 ) ; 6 Rows returned Last_Name Grade_Pt McRoberts 1.90 Hanson 2.88 Wilson 3.80 Delaney 3.35
Last_Name Grade_Pt
Larkins 0.00
Bond 3.95
The following SELECT is a better way to make sure that all rows are returned when using a NOT IN:
SELECT Last_Name ,Class_Code ,Grade_Pt FROM Student_Table
WHERE Grade_Pt NOT IN ( 2.0, 3.0, 4.0 ) OR Grade_Pt IS NULL ; 7 Rows returned
Last_Name Class_Code Grade_Pt
Larkins FR 0.00 Hanson FR 2.88 McRoberts JR 1.90 Johnson ? ? Wilson SO 3.80 Delaney SR 3.35 Bond JR 3.95
Notice that Johnson came back in this list and not the previous request using the NOT IN. You may be thinking that if the NULL reserved word is used within the IN list it will cover the situation. Unfortunately, you are forgetting that this comparison always returns an unknown. Therefore, the next request will NEVER return any rows:
SELECT Last_Name ,Class_Code ,Grade_Pt FROM Student_Table
WHERE Grade_Pt NOT IN ( 2.0, 3.0, 4.0, NULL ) ; No Rows found
Making this mistake will cause no rows to ever be returned. This is because every time the column is compared against the value list the NULL is an unknown and the Truth table shows that the NOT of an unknown is always an unknown for all rows.
If you are not sure about this, do an EXPLAIN (chapter 3) of the NOT IN and a subquery to see that the AMP step will actually be skipped when a NULL exists in the list. There are also extra AMP steps to compensate for this condition. It makes the SQL VERY inefficient.
Using Quantifiers Versus IN
There is another alternative to using the IN. Quantifiers can be used to allow for normal comparison operators without requiring compound conditional checks.
The following is equivalent to an IN: SEL[ECT] <column-name> [,<column-name> ] FROM <table-name>
WHERE <column-name> = ANY (<value-list>) ;
This next request uses ANY instead of IN: SELECT Last_Name
,Class_Code ,Grade_Pt FROM Student_Table
WHERE Grade_Pt = ANY ( 2.0, 3.0, 4.0 ) ; 3 Row returned
Last_Name Class_Code Grade_Pt
Phillips SR 3.00
Thomas FR 4.00
Smith SO 2.00
Using a qualifier, the equivalent to a NOT IN is: SEL[ECT] <column-name>
[,<column-name> ] FROM <table-name>
WHERE <column-name> NOT = ALL (<value-list>) ;
Notice that like adding a NOT to the compound condition, all elements need to be changed here as well. To reverse the = ANY, it becomes NOT = ALL. This is important, because the NOT = ANY selects all the rows except those containing a NULL. The reason is that as soon as a value is not equal to any one of the values in the list, it is returned.
The following SELECT is converted from an earlier NOT IN: SELECT Last_Name
,Grade_Pt FROM Student_Table
WHERE Grade_Pt NOT = ALL ( 2.0, 3.0, 4.0 ) ;
6 Rows returned
Last_Name Grade_Pt
McRoberts 1.90
Last_Name Grade_Pt
Hanson 2.88
Wilson 3.80
Delaney 3.35
Bond 3.95
Multiple Value Range Search (BETWEEN)
The BETWEEN comparison can be used as another technique to request multiple values for a column that are all in a specific range. It is easier than writing a compound OR comparison or a long value list of sequential numbers when using the IN.
This is a good time to point out that this chapter is incrementally adding new ways to compare for values within a WHERE clause. However, all of these techniques can be used together in a single WHERE clause. One method does not eliminate the ability to use one or more of the others using logical operators between each comparison.
The next SELECT shows the syntax format for using the BETWEEN: SEL[ECT] <column-name>
[,<column-name> ] FROM <table-name>
WHERE <column-name> BETWEEN <low-value> AND <high-value>
;
The first and second values specified are inclusive for the purposes of the search. In other words, when these values are found in the data, the rows are included in the output.
As an example, the following code returns all students whose grade points of 2.0, 4.0 and all values between them:
SELECT Grade_Pt FROM Student_Table
WHERE Grade_Pt BETWEEN 2.0 and 4.0 ; 7 Rows returned Grade_Pt 3.00 2.88 4.00 3.80 3.95 3.35 2.00
Notice that due to the inclusive nature of the BETWEEN, both 2.0 and 4.0 were included in the answer set. The first value of the BETWEEN must be the lower value, otherwise, no rows will be returned. This is because it looks for all values that are greater or equal to the first value and less than or equal to the second value.
A BETWEEN can also be used to search for character values. When doing this, care must be taken to insure that rows are received with the values that are needed. The system can only compare character values that are the same length. So, if one column or value is shorter than the other, the shortest will automatically be padded with spaces out to the same length as the longer value.
Comparing ‘CA’ and ‘CALIFORNIA’ never constitutes a match. In reality, the database is comparing ‘CA ’ with ‘CALIFORNIA ‘ and they are not equal. Sometimes, it is easier to use the LIKE comparison operator which will be covered in the next section. Although, easier to code, it does not always mean faster to execute. There is always a trade-off to consider. The next SELECT finds all of the students whose last name starts with an L:
SELECT Last_Name FROM Student_Table
WHERE Last_Name BETWEEN 'L' AND 'LZ' ; 1 Row returned
Last_Name
Larkins
In reality, the WHERE could have used BETWEEN ‘L’ and ‘M’ as long as no student's last name was ‘M’. The data needs to be understood when using BETWEEN for character comparisons.
Character String Search (LIKE)
The LIKE is used exclusively to search for character data strings. The major difference between the LIKE and the BETWEEN is that the BETWEEN looks for specific values within a range. The LIKE is normally used when looking for a string of characters within a column. Also, the LIKE has the capability to use "wildcard" characters.
Figure 2-10
The next SELECT finds all rows that have a character string that begins with ‘Sm’: SELECT *
FROM Student_Table
WHERE Last_Name LIKE 'sm%' ; 1 Row returned
Student_ID Last_Name First_Name Class_Code Grade_Pt
333450 Smith Andy SO 2.00
The fact that the ‘s’ is in the first position dictates its location in the data. Therefore, the ‘m’ must be in the second position. Then, the ‘%’ indicates that any number of characters (including none) may be in the third and subsequent positions. So, if the WHERE clause contained: LIKE ‘%sm’, it only looks for strings that end in "SM." On the other hand, if it were written as: LIKE ‘%sm%’, then all character strings containing "sm" anywhere are returned.
Also, remember that in Teradata mode, the database is not case sensitive. However, in ANSI mode, the case of the letters must match exactly and the previous request must be written as ‘Sm%’ to obtain the same result. Care should be taken regarding case when working in ANSI mode. Otherwise, case does not matter.
The ‘_’ wildcard can be used to force a search to a specific location in the character string. Anything in that position is considered a match. However, a character must be in that position.
The following SELECT uses a LIKE to find all last names with an "A" in the second position of the last name:
SELECT *
FROM Student_Table
WHERE Last_Name LIKE ('_a%' ) ; 2 Rows returned
Student_ID Last_Name First_Name Class_Code Grade_Pt
423400 Larkins Michael FR 0.00
125634 Hanson Henry FR 2.88
In the above example, the "_" allows any character in the first position, but requires a character to be there.
The keywords ALL, ANY, or SOME can be used to further define the values being searched. They are the same quantifiers used with the IN. Here, the quantifiers are used to extend the flexibility of the LIKE clause.
Normally, the LIKE will look for a single set of characters within the data. Sometimes, that is not sufficient for the task at hand. There will be times when the characters to search are not consecutive, nor are they in the same sequence.
The next SELECT returns rows with both an ‘s’ and an ‘m’ because of the ALL. /* set session transaction BTET in BTEQ */
SELECT * FROM Student_Table
WHERE Last_Name LIKE ALL ('%S%', '%m%' ) ; 3 Rows returned
Student_ID Last_Name First_Name Class_Code Grade_Pt
280023 McRoberts Richard JR 1.90
234121 Thomas Wendy FR 4.00
333450 Smith Andy SO 2.00
It does not matter if the ‘s’ appears first or the ‘m’ appears first, as long as both are contained in the string.
Below, ANSI is case sensitive and only 1 row returns due to the fact that the ‘S’ is uppercase, so Thomas and McRoberts are not returned:
SELECT * FROM Student_Table
WHERE Last_Name LIKE ALL ('%S%', '%m%' ) ; 1 Rows returned
Student_ID Last_Name First_Name Class_Code Grade_Pt
333450 Smith Andy SO 2.00
If, in the above statement, the ALL quantifier is changed to ANY (ANSI standard) or SOME (Teradata extension), then a character string containing either of the characters, ‘s’ or ‘m’, in either order is returned. It uses the OR comparison.
This next SELECT returns any row where the last name contains either an ‘s’ or an ‘m’: /* set session transaction ANSI in BTEQ */
SELECT * FROM Student_Table
WHERE Last_Name LIKE ANY ('%s%', '%m%' ) ; 8 Rows returned
Student_ID Last_Name First_Name Class_Code Grade_Pt
423400 Larkins Michael FR 0.00 125634 Hanson Henry FR 2.88 280023 McRoberts Richard JR 1.90 260000 Johnson Stanley ? ? 231222 Wilson Susie SO 3.80 234121 Thomas Wendy FR 4.00 333450 Smith Andy SO 2.00 123250 Phillips Martin SR 3.00
Always be aware of the issue regarding case sensitivity when using ANSI Mode. It will normally affect the number of rows returned and usually reduces the number of rows. There is a specialty operation that can be performed in conjunction with the LIKE. Since the search uses the "_" and the "%" as wildcard characters, how can you search for actual data that contains a "_" or "%" in the data?
Now that we know how to use the wildcard characters, there is a way to take away the special meaning and literally make the wildcard characters an ‘_’ and a ‘%’. That is the purpose of ESCAPE. It tells the PE to not match anything, but instead, match the actual character of ‘_’ or ‘%’.
The next SELECT uses the ESCAPE to find all table names that have an "_" in the 8th
position of the name from the Data Dictionary. SELECT Tablename
FROM DBC.tables
AND Databasename = 'mikel' ; 2 Rows returned
Tablename _______
Student_Table
Student_Course_Table
In the above output, the only thing that matters is the ‘_’ in position eight because of the first seven ‘_’ characters are still wildcards.
Derived Columns
The majority of the time, columns in the SELECT statement exist within a database table. However, sometimes it is more advantageous to calculate a value than to store it.
An example might be the salary. In the employee table, we store the annual salary.
However, a request comes in asking to display the monthly salary. Does the table need to be changed to create a column for storing the monthly salary? Must we go through and update all of the rows (one per employee) and store the monthly salary into the new column just so we can select it for display?
The answer is no, we do not need to do any of this. Instead of storing the monthly salary, we can calculate it from the annual salary using division. If the annual salary is divided by 12 (months per year), we "derive" the monthly salary using mathematics.
Figure 2-11
These math functions have a priority associated with their order of execution when mixed in the same formula. The sequence is basically the same as their order in the chart. All
exponentiation is performed first. Then, all multiplication and division is performed and lastly, all addition and subtraction is done. Whenever two different operands are at the same priority, like addition and subtraction, they are performed based on their appearance in the equation from left to right.
Although the above is the default priority, it can be over-ridden within the SQL. Normally an equation like 2+4*5 yields 22 as the answer. This is because the 4*5 = 20 is done first and then the 2 is added to it. However, if it is written as (2+4)*5, now the answer becomes 30 (2+4=6*5=30).
The following SELECT shows these and the results of an assortment of mathematics: SELECT 2+4*5, (2+4)*5
,2+4/5, (2+4)/5 ,2+4.0/5, (2+4.0)/5 ,10**9 ;
2+4*5 (2+4)*5 2+4/5 (2+4)/5 2+4.0/5 (2+4.0)/5 10**9
22 30 2 1 2.8 1.2 1000000000
Note: starting with integer values, as in the above, the answer is an integer. If decimals are used, the result is a decimal answer. Otherwise, a conversion can be used to change the characteristics of the data before being used in any calculation. Adding the decimal makes a difference in the precision of the final answer. So, if the SQL is not providing the answer expected from the data, convert the data first (CAST function later in this book).
The next SELECT shows how the SQL can be written to implement the earlier example with annual and monthly salaries:
SELECT salary (format 'ZZZ,ZZ9.99') ,salary/12 (format 'Z,ZZ9.99') FROM Pay_Table ; 2 Rows returned salary salary/12 48,024.00 4,002.00 10,800.00 900.00
Since the column name is the default column heading, the derived column is called salary/12, which is not probably what we wish to see there. The next section covers the usage of an alias to temporarily change the name of a column during the life of the SQL. Derived data can be used in the WHERE clause as well as the SELECT. The following SQL will only return the columns when the monthly salary is greater than $1,000.00:
SELECT salary (format 'ZZZ,ZZ9.99') ,salary/12 (format 'Z,ZZ9.99') FROM Pay_Table WHERE salary/12 > 1000 ; 1 Row returned salary salary/12 48,024.00 4,002.00