performance
2.2.2 Primary and foreign keys
2.3.3.4 Local application caching
Caching means to temporarily store data, from a database, somewhere in memory. Obviously, you would want to cache data that is frequently used, changed as little as possible, and is small in size. The perfect type of data for caching is static data. Static data is generally physically small, it is much in demand, and it is rarely changed.
Also, you can store static data in memory on a database server computer or on a middle tier computer such as an application or web server. You can even send copies of static data to client computers, even those on the other side of the world, connected to your database over an Internet connection.
2.3.4
Denormalizing and special purpose objects
Some relational databases allow creation of specialized objects, allowing storage of things like summaries, and copies of data. The result is reduction in conflicting requirements of customer facing OLTP applications, and reporting such as in a data warehouse. The ultimate objective is increasing overall database performance. The simplest forms of copies are copies of tables, views, or temporary tables. Let’s examine different types of database objects in use in relational databases:
Views: A view does not store data but stores a query. That query can be executed at will, by querying the view whenever you want. This is not a performance issue in itself. One of the problems with views is that they are often used to filter and sort data from tables. When a view is read, someone writing the query should be aware of how the underlying view query works, in order to avoid extra work in the view and extra work in the query against the view. The worst scenario is major reduction filtering or sorting against a view, which finds all records in a large table anyway. The equivalent filter against a table would read only the filtered portion. The view will always read the entire table. Where the underlying table is extremely large, 1 record read from the view could still retrieve millions of records from the underlying table because only the view uses no filtering. Another problem with views is the tendency for developers to sometimes over- use them, embedding layer upon layer of views within each other. This can simply become too complex for any database to handle. This is an example of a view:
CREATE VIEW VIEW_ChartOfAccounts AS
SELECT coa.coa#, coa.type, coa.subtype, coa.text, sum(gl.dr), sum(gl.cr)
FROM coa, generalledger gl WHERE coa.coa# = gl.coa#
GROUP BY coa.coa#, coa.type, coa.subtype, coa.text;
The above script is pseudocode and creates a view, which joins two tables shown in the data model of Figure 2.4.
Note: Pseudocode is a type of coding used to explain a concept. Pseudocode will not function 100% in any particular environment, data- base, programming language, or on any specific hardware platform. The intention of presenting a concept using pseudocode is to demonstrate—not to provide bulletproof example coding.
Temporary tables: These can be used to form intermediary functions per session and are manually or automatically emptied and dropped after each use.
A session is a connection to a database, between a user and a data- base.
Temporary tables can sometimes help to eliminate intermediary functionality and eliminate duplication. The result is less I/O activity on primary tables because temporary tables are read after they have first been filled. Temporary tables are superseded in modern relational databases by other more exotic and more versatile database objects, such as materialized views. This is a simple pseudocode example of a temporary table created to drop all its records when the session creat- ing it terminates:
CREATE TABLE TEMP_ChartOfAccounts AS
SELECT coa.coa#, coa.type, coa.subtype, coa.text, sum(gl.dr), sum(gl.cr)
FROM coa, generalledger gl WHERE coa.coa# = gl.coa#
GROUP BY coa.coa#, coa.type, coa.subtype, coa.text DELETE ON DISCONNECT;
Intermediary tables: A table with more permanent data content than a temporary table. That data is retained for different sessions to make use of and would like be deleted manually when so required. These tables could also be used to contain summary information. Summary records consist of groups of many other records.
Materialized views and clusters can also be used, which are regen- erated periodically and automatically. These types of objects are com- monly used in data warehouses but can sometimes be useful in OLTP databases.
Intermediary tables are useful as long as less than real time response is acceptable for the summarized information.
Cluster: Places most commonly accessed indexed fields, sometimes including some data fields, together into the same physical space. They would also be commonly ordered in a specified order. This is a method of splitting active and inactive data by fields, making more frequent use fields more readily available, in pre-sorted, indexed, and clustered form. So, the most active data is copied to a presorted, pre- organized cluster. Less active data remains only in underlying table (or tables), from which the cluster has been created. Clusters help performance by organizing in desired order, and allowing for less I/O time on the most highly active fields in a table.
Clustered and non-clustered indexes: A clustered index contains actual data values in leaf nodes. A non-clustered index contains addresses to data values, in leaf nodes, not the actual data values in an index. Both these types of indexes use a BTree structure. So, a clus- tered index creates an index out of the fields in a table, sorting table data in the order of the index, and physically stored in the order of a specified group of one or more fields (in the same table). In other words, it turns a table into an index, but retaining all the fields in the table. The result is a table logically sorted based on the index. This can help with performance only when reading the data in the order of the index. Updating can be relatively efficient, depending on how flexible the index structure is. BTree indexes are generally amenable to frequent small changes. Bitmap, hash key, and ISAM indexes cause changes to be made outside of the indexed structure—this is known as overflow. Subsequent reads into overflow space cause bouncing around the disk, which is highly inefficient.
A clustered single table index is also known in some relational databases as an index organized table (IOT).
Indexed view: This allows creation of either a clustered or non-clus- tered index against a view. The view still consists of a query, which will be executed against underlying tables, every time the view is accessed.
Some relational databases allow creation of what is called a materi- alized view. A materialized view does not contain a query, but is a copy of data. The data copy in the materialized view can be updated automatically, periodically, or even in real-time. Additionally, materi- alized views usually allow automated changes to queries, as they are submitted to a database, effectively rewriting the query (query rewrite) during the optimization process. This can help to improve performance if a materialized view contains a summary (much fewer records) of data in underlying tables.
Materialized views are mostly applicable in data warehouses but can be utilized in OLTP databases under certain circumstances. As with views, materialized views, and index views—these objects will not help performance if over-used or used to cater to overlaying a new table design on top of a poorly designed data model.
SQL Server does not allow explicit creation of, or even direct access to materialized views. In SQL Server, materialized views are all inter- nalized within Analysis Services and cannot be used outside of the
context of the analytical processing engine. In other words, you can’t explicitly build a materialized view, you can’t explicitly access a materi- alized view (using a query), and you can’t force the optimizer to use or ignore query rewrite. In essence, materialized views are utilized in SQL Server 2005 but completely contained within Analysis Services.
Other factors: This includes more architectural aspects, such as par- titioning at the table and index level, parallel processing, clustering, replication, duplication, and mirrored (or standby) databases.
Most of these performance solutions are more physical rather than logi- cal forms of denormalization, as they require creation of objects not part of the standard normalization process. These specialized objects do not reduce logical complexity, but can very well do the opposite by increasing com- plexity. This is simply because more database objects are being created. What specialized objects do accomplish is to place data sets into purpose built constructs. These constructs can be used to access data in the required order, without jeopardizing access speed to the underlying tables, which may be very busy doing other things like servicing a busy website. So, in addition to allowing access to data in required order, extra objects such as materialized views can help to reduce multi-user conflict issues.
2.4
Extreme denormalization in data warehouses
Data warehouses deal with such enormous quantities of data that they can sometimes require specialized data modeling techniques. Some data ware- houses use specialized modeling technique called the dimensional data model.
The dimensional data model is sometimes also called the dimensional- fact model.
The dimensional data model contains tables built to form a star (a star schema) or a snowflake (a snowflake schema). Some data warehouses may use 3rd normal form schemas (just like an OLTP data model relational schema).
Many data warehouses contain hybrid schemas using a combination of two or even all three data model types including star, snowflake, and 3rd normal form schemas. In general, for large amounts of data, and the inten- sity of data warehouse analytical reporting, the most efficient data model for a data warehouse is a star schema. A snowflake is a slightly more granu- lar structural form of a star schema, containing more tables, and is thus less
efficient than a star schema. A 3rd normal form schema is a highly granular and broken down schema, requiring many tables in join queries—and con- sequently provides very poor performance.
So, let’s briefly describe the data warehouse dimensional model for rep- resenting data. The dimensional model consists of dimensions and facts. A fact is some historical or archived record, about some type of activity. A dimension is literally a dimension of a fact, or something describing a fact.
OLTP transactional databases use a normalized relational data model, which requires a large number of small operations. Data warehouses on the other hand require small numbers of large transactions, for data loading and reporting. The requirements are thus completely different with respect to performance—a data warehouse does lots of joins, on lots of tables, and reads lots of data in each transaction. Performing data warehouse type reports on a normalized OLTP database would likely cause the customer facing OLTP applications to perform very poorly.
A data mart is a term often bandied around in data warehouse terminol- ogy. A data mart is simply a subsection of a data warehouse.
Let’s take a quick look at an OLTP relational data model for a container shipping company. This company ships containers of luxury goods. The company owns a number of small container ships. They ship goods between a European port and a number of West African ports. Figure 2.13 shows this OLTP database relational data model.
The meanings of tables shown in Figure 2.13 are as follows:
CONSIGNEE. The party receiving the container contents.
SENDER. The party shipping the container contents.
VESSEL. The ship on which the container was transported.
VOYAGE. Ships making voyages can call at multiple ports. Individ- ual containers or groups of containers are transported on all or some legs of the entire voyage.
SHIPMENT. Contents of part of a container, or even one or more containers, are shipped from one port to another. For the sake of sim- plicity we assume a shipment as being an individual container shipped from one port to another.
GROUP. A group of containers are transported from one port to another as a single shipment.
CONTAINER. An individual container.
TYPE. A container can be refrigerated, open-topped, or a flatbed, amongst numerous other types.
DEPOT. A container depot or container port. Figure 2.13
OLTP relational model for tracking container shipments
DAMAGE. Containers can sustain damage.
SCRAP. Damaged containers can become irreparably damaged and have a scrap value.
LOST. Containers can be stolen or sometimes even lost at sea. In fact loss of containers at sea happens often. Additionally these containers being sealed can float just below the water, sometimes doing really nasty things to smaller craft.
Now let’s convert the relational model to a data warehouse dimensional- fact model.
2.4.1
The dimensional data model
A table relationship model is inappropriate to the requirements of a data warehouse, even a denormalized one. Another modeling technique used for data warehouses is called dimensional modeling. In layman’s terms a dimensional model consists of facts and dimensions. What does that mean? What is a fact and what is a dimension? A fact is a single iteration in a his- torical record. A dimension is something used to dig into, divide, and col- late those facts into something useful. That isn’t really layman’s terms now is it? Let’s try to explain this a little more easily by example.
Let’s explain dimensional modeling in small steps. Figure 2.14 shows the same table relationship as that shown in Figure 2.13, but with a slight difference. Vaguely, facts are the equivalent of transactional tables and dimensions are the equivalent of static data. Therefore in Figure 2.14 the fact tables are colored gray and the dimensions tables are not. Note how the facts represent historical or archived data and dimensions represent smaller static data tables. It follows that dimension tables will generally be small and fact tables can become frighteningly huge. What does this tell us? Fact tables will always be appended to and dimension tables can be changed, preferably not as often as the fact tables are appended to. The result is many very small tables related to data in groups from very large tables.
The most desirable result when modeling for a data warehouse using dimensions and facts is called a star schema. Figure 2.15 and Figure 2.16 show slightly modified, pseudo-type star schema versions of the normalized table relationship diagrams in Figure 2.13 and Figure 2.14. In Figure 2.15 we can see that all dimensions would be contained within a single fact table, containing shipping history records of containers. Each record in the fact table would have foreign key values to all related dimension tables.
Every star or snowflake schema always has a fact table. A single data warehouse can consist of multiple fact tables and thus multiple star and or snowflake schemas.
Figure 2.15 simply contains another fact table or another subset of the data contained in the normalized table relationship structure in Figure 2.13 and Figure 2.14. It is quite conceivable that the two fact table sets in Figure Figure 2.14
Highlighting dimensions and facts for container shipments
2.14 and Figure 2.15 should be merged into a single table, separating used, damaged, scrapped, and lost containers by an appropriate type field.
There could be a small problem with the fact table shown in Figure 2.15. Damaged, scrapped, and lost containers could either be a fact table or part of the container dimension. This decision would depend on exactly how often containers are damaged, scrapped or lost. It is more than likely that this type of thing occurs frequently in relation to other dimensions, but not necessarily in relation to the high frequency of container shipments.
The star schemas shown in Figure 1.8 and Figure 1.9 show that there are two potential fact tables for the relational schema shown in Figure 2.13 and Figure 2.14.