Datawarehouse
Names
INTRODUCTION
3
HIGH LEVEL STANDARD
7
PROJECTION STANDARDS
11
CONSTRAINT STANDARD
17
ATTRIBUTE STANDARD
18
GENERAL RENAMING RULES
23
TIME STANDARD
24
Introduction
Licence
As these are generic software documentation standards, they will be covered by the 'Creative Commons Zero v1.0 Universal' CC0 licence.
Warranty
The author does not make any warranty, express or implied, that any statements in this document are free of error, or are consistent with particular standard of merchantability, or they will meet the requirements for any particular application or environment. They should not be relied on for solving a problem whose incorrect solution could result in injury or loss of property. If you do use this material in such a manner, it is at your own risk. The author disclaims all liability for direct or consequential damage resulting from its use.
This is really a sample document, illustrating particular choices taken. As each DW environment is different, so these choices will also be different. There is no universal standard for naming conventions.
Purpose
This Naming standards document defines the standards to follow for schemas, subject areas, projections, attributes, constraints, boilerplate columns, renaming, and time. This document is needed to make it easier for Business users to understand column and table names, and reduce the cost of egregious renaming.
Audience
The primary audience of this document are any staff who use the DW, or do design development or maintenance on the DW. It will also be useful to Business Intelligence end users.
Assumptions
It is assumed that the naming conventions will support a variety of DW design patterns such as Data Vault, Kimball, and source standards.
Approach
The naming standard will be based on common DW design patterns and other essential design decisions.
Related Documents
There are many documents describing the DW design methodologies such as Kimball, Inmon, etc. Some are below.
Author Reference Publisher Year
Author Reference Publisher Year
CJ Date Temporal Data & the Relational Model Morgan Kaufmann
2002 Ralph Kimball The Data Warehouse Toolkit: The Complete
Guide to Dimensional Modelling
Wiley 2002
Definitions
Term Source Definition
Aggregate Kimball An aggregate is a summary table using group by, often based on fact and/or dimension tables.
Bridge Kimball A bridge table to capture many-to-many relationships, such as when a fact table row can be associated with more than one value in a dimension.
Data Model Chris Date The data model must represent demonstrably true statements about the business area. That is, the entities must represent things that mean something to the business, and the relationships between entities must represent meaningful links.
Data Vault Linstedt Data Vault is a database method that is designed to provide long-term historical storage of data coming in from multiple operational systems. It provides a DW pattern that supports the Inmon goals of Subject-orientation, non-volatility, integration and Time-variance.
Dimension Kimball An independent entity in a dimensional model that serves as an entry point or as a mechanism for slicing and dicing the additive measure located in the fact table of the dimensional model. For example, all months, quarters, years, etc., make up a time dimension. Based on Measure Theory.
Disemvowel DB This is a function that removes all vowels from a text string. E.g. A sentence such as: “The quick brown fox jumps over the lazy dog” would, after being disemvowelled, become: “Th qck brwn fx jmps vr th lzy dg”.
EAV DB An Entity-Attribute-Value table. This is a type of RDF table. Data is recorded in only three columns: Entity: the item being described; Attribute or parameter: a foreign key into a table of attribute definitions; Value of the attribute. This pattern is very popular with business users, as they have the opportunity to redefine data in an application. Often, much of the useful reporting data are in these columns. However, it can be quite difficult to pivot this data into usable reporting tables. See the Resource Description Framework
Term Source Definition
(RDF) as an example.
Fact Kimball A business performance measurement, typically numeric and additive, that is stored in a fact table. Based on Measure Theory.
Hub Linstedt A hub represents a core business concept such as Customer, Vendor, Sale or Product. The hub table is formed around the business key, as well as the source system keys.
Integrated Inmon Integration is the process of mapping dissimilar codes to a common base, developing consistent data element presentations and delivering this standardized data as broadly as possible.
Invariant Algebra An invariant is a property of a mathematical object that remains unchanged when transformations of a certain type are applied to the object. As an example in a DW, the count of network nodes should be identical in a source database, as well as the target DW, at the same point in time.
Join Codd A join is a binary operator on two relations or database tables. Link Linstedt A link represents a natural business relationship between business
keys.
Logical name Chris Date These are full words from a standard business vocabulary. This
contrasts with physical names that are often abbreviated due to name length limitations, etc.
Metadata DB Metadata is "data about data". While not often used in reporting, these tables are important in DW standards, and for generating and describing DW components.
Name Hygiene
Computer Science
The prevention of accidental capture of identifiers, or name collision. This occurs especially when generating identifiers or names. In programming languages, these are solved using hygienic macros. Non-volatile Inmon Non-volatile design is essential. Non-volatility literally means that once
a row is written, it is never modified. This is necessary to preserve incremental net change history. This, in turn, is required to represent data as of any point in time. When a data row is updated, the past information is destroyed. A fact or total that included the unmodified data can never be recreated.
Object DB For the purposes of the DW, an object is a projection over a table or a table join. They are often used in applications to layer and rename physical column names to attribute names. Consequently, the end user is more familiar with the attribute name, rather than the column
Term Source Definition
name.
Pivot DB A pivot table is the transformation of an EAV table into columnar form. Projection Codd A projection is a unary operation that selects a subset of attribute
names. In database terms, this is a select. This can be implemented as tables, views, XML messages, etc.
RDF W3C The Resource Description Framework (RDF) is a family of World Wide Web Consortium (W3C) specifications originally designed as a
metadata data model. It has come to be used as a general method for conceptual description of information and/or knowledge management, which is implemented in web resources, and other formats. An RDF triple is a statement about resources in the form of a
subject-predicate-object expression. DF is a more general form of the original EAV Entity-Attribute-Value design pattern, where the subject is an entity, predicate is an attribute and object is a value. A set of such RDF triples forms an RDF graph.
Relational Algebra
Codd Relational algebra is an offshoot of first-order logic and of algebra of sets concerned with operations over relations.
Satellite Linstedt A satellite contains the descriptive information (context) for a business key.
Subject-orientation
Inmon Subject-orientation mandated a cross-functional slice of data drawn from multiple sources to support a diversity of needs. This is a departure from serving only either the vertical application views of data (supply-side) or the overlapping departmental needs for data (demand side).
Time Variant Inmon Time variance calls for storage of multiple copies of the underlying detail in aggregations of differing periodicity and/or time frames. There may be detail for seven years along with weekly, monthly and
quarterly aggregates of differing duration. This is critical for maintaining the consistency of reported summaries over time.
Tags
Business Intelligence ; Data Mapping ; Metadata ; Standards ; Data Transformation ; Data
Warehouse ; Database ; Fact / Dimension ; Data Load ; Data Model ; Data Vault ; Database Design ; Extract Load Transform - ELT ; Extract Transform Load - ETL ; Inmon ; Kimball ; Massive Parallel Processing - MPP ; Netezza ; Oracle ; Data Integration ; Data Lineage ; Data Traceability ; Time Variant ; Metadata Glossary ; Hub / Satellite ; Projection ;
High Level Standard
Layer Standard
These are all defined as schemas in the DW. This assumes a simple 4 layer DW design. The final layers start with A*, so they will appear first in any schema listing.
Layer Aligned Area
Schema Name Description
Gold Layer
Business AU_GOLD_ DECISION
This is a schema for tables and views that are available to Business Decision Makers. It represents the final state of the transformed data, suitably cleansed for the primary customer of the DW, the business decision maker.
Silver Layer
Business AG_SILVER_ INTEGRATION
This is a schema that enables integration across multiple source schemas. It represents transformed Lead data into more valuable Silver data.
Iron Layer
Business FE_IRON_ METADATA
This is a schema for all metadata data.
Lead Layer Source PB_ <Source System Code>_ <DB Name/Path>_ <Schema Name/File>_ <Load Type>
This contains the raw loaded source data. The primary need is for a unique definition of the data. This naming pattern supports this. The name defines the important characteristics of the load. The first code identifies the source system. The next code identifies the DBName for databases, or the address for web sources or the directory path for file sources. The DBName must always be based on the ultimate
production database name, not the development name. In the case of directory or path names, the protocol prefix will need to be stripped off, and illegal characters like ":/" will need to be substituted with underscores. The next code identifies the Schema Name for databases, or the page for web sources, or the file for file based sources. The next code identifies the load type. This more precise method of identifying sources will enable a much large number of different sources to be identified, and to be able to manage multiple loads from the same source, so that they do not impact on each other. This approach is more
self-documenting, with a clearer self documenting data lineage.
Load Type
This is needed to distinguish how the data has been loaded. If data is brought in by separate ETL or CDC processes, it will be subject to different time variance and recovery processes. This is even an issue within CDC, where data can be either refreshed or mirrored. If the target is not distinct, then it
may be impossible to properly reason about its state, leading to incorrect replication and, ultimately, erroneous business decisions.
Load Type
Load Name Description
CDC Change Data Capture
This is the default process for loading data.
ELT Extract Load Transform
A common pattern in DW is to load the data in before transformation. This results in faster loads in an MPP appliance.
ETL Extract Transform Load
ETL is used to load data into the DW. The transformation occurs in a separate ETL tool, which is a common anti-pattern which leads to poor single threaded ETL performance.
MNL Manually
loaded
This is practical for data that does not need to be loaded more than once, or very infrequently, which is generally invariant data.
REF Reference Used mainly for schema analysis, and possibly prototyping, but not to be used for reporting. A throw away design area.
Just as there can be alternate means of loading, there can also be alternate source types from the same source. In almost all cases, there is not a strict isomorphism between source and target, and the source type plays an important part in determining the amount of change that occurs between the source and target. In other words, the different source types have different constraints that dictate how well the data is loaded. If the source type is also a file type, then they should conform to standard file extensions. The source type may also be appended to the name, but it is not
mandatory.
File Type
File Type Name Description
DOC Microsoft Word Document HPR Hyperion
HTML Hypertext Markup Language
MDB Microsoft Access Database MSP Microsoft SharePoint NTZ Netezza
ODS Operational Data Store Many source systems currently provide a ODS layer. While well meaning, this anti-pattern often creates a reconciliation
File Type
File Type Name Description
nightmare, as rarely are the ODS mappings back to the source system actually documented.
ORA Oracle
XLS Microsoft Excel Spreadsheet XML Extensible Markup
Language
Subject Area Standard
The physical name can be used in schema names. The Subject Areas will have a three letter code. The Subject Area codes can be used in projection names, or in primary key values. These subject area names are examples only.
SA Code
Subject Area (logical)
Subject Area (physical) Layer
AAT Authentication and Authorization
Authentication_Authorization, Athntctn_Athrztn
Iron
GLS Glossary Glossary Silver
BUS Business Business Gold
CTL Control CONTROL Iron
DQ Data Quality DATA_QUALITY Iron
DVT Data Vault DATA VAULT Silver
ERL Entity & Reference Links
Entity_Reference_Links Iron
GLD General Ledger General_Ledger Gold
HRS Human Resources Human_Resources Gold
INV Invariant INVARIANT Iron
LNG Lineage DATA Lineage Iron
OPS Operations Operations Gold
SA Code
Subject Area (logical)
Subject Area (physical) Layer
Management
QLT Quality Quality Iron
SLN Source Lineage Source_LinEage Silver
TRC Traceability TRACEABILITY Iron
Note that some metadata can be placed in other layers. However, if the metadata is shared across schemas or layers, then it is best to place in a separate schema. Lineage can be considered the column to column mappings. Traceability can be considered the actual row value to row value mappings.
Projection Standards
A projection can be thought of as a subset of columns. These are a series of codes or types that can be used as part of a projection name. These codes will help to make the purpose of projection more transparent, through consistent usage of the name patterns. The next chapter will give examples of how they can be combined to create projection names.
Logical Projection Type (LPT)
Logical means the primary purpose of this projection type within a given methodology. This purpose will help determine the correct set of columns (or attributes) for the projection. All new projections MUST have a Logical Projection Type as a prefix, except SAL tables copied from source tables. This is necessary to avoid name collisions, as most of these projection types can be used in any
layer/schema. As far as possible, each type should belong within a type set, so the types are
mutually exclusive, and all types form a complete type set. For example, within the Inmon type set, there can only be 3 types: Ephemeral, Invariant or Time Variant.
LPT Code
LPT Name Standard Definition
A, ACT
Active (current) Source A projection of the logical active columns for an active table, filtered to show current data only, and excludes deleted rows. An active column has a number of distinct values > 0 (greater than zero). An active table has at least one row. Some redundancy here, as this is also defined by the column filter type. However, this will be retained as it is now standard practice.
B, BRD
Bridge, Helper Kimball A join over tables that can be used as a bridge in a Kimball schema.
C, CUB
Cube Kimball A projection that is a join of a Kimball Fact table, and related Kimball Dimension tables. This is useful when the users do not know how to join tables. This does not use GROUP BY, so it is different to an Aggregate.
D, DMN
Dimension Kimball A join over tables that can be used as a dimension in a Kimball schema. Typical columns are codes which have low cardinality, which allow for grouping. A dimension can be either a singleton view (over a single table), or based on a set of joined tables.
E, EPH
Ephemeral Inmon Something that is true at an instant in time, but has no duration or longevity. That is, it only lasts momentarily, or for a very short time. A typical example is a transaction or trade, which is effectively instantaneous. Typically, this data
LPT Code
LPT Name Standard Definition
is not mutable. That is, it cannot be updated, except for error correction.
F, FCT
Fact Kimball A join over tables that can be used as a fact in a Kimball schema. Typical columns are numeric which represent either continuous data or countable data. This is often based on a single table.
G, AGR
Aggregate Kimball A join over tables that can be used as an aggregate or summary. This denormalisation can be used in all other standards. This uses GROUP BY, so it is different to a Cube. H,
HUB
Hub Data
Vault
A join over tables that can be used as a hub in a Data Vault schema.
I, INV Invariant Inmon Something that is always true. For example, system setup data does not change over the life of a system, so this can be considered invariant. This data is not mutable. That is, it cannot be updated, except for error correction. K,
KEY
Key Source This represents cleansed Source tables that have had a surrogate and/or a distribution key added. A surrogate key is needed for Fact/Dimension joins. A distribution key is critical for adequate Netezza performance. Note that these tables would still retain their source names.
L, LNK
Link Data
Vault
A join over tables that can be used as a link in a Data Vault schema.
M, MTD
Metadata
Metadata
A join over tables that can be used for metadata. Examples include A&A, Glossary, Data Quality, Data Lineage, Data Traceability, Invariants, etc.
N, NUB
Nub Source This represents Source tables that have been cleansed, but without column name changes. For example, cleansing can discard boilerplate columns, de-duplicate rows, add defaults values (e.g. N/A for nulls), convert types (e.g. Text -> Dates), fix column lengths, etc.
O, OTH
Other Source This represents some source system based column and/or table grouping which is not defined as a table/view the source RDBMS. For example, this could represent a manually maintained subject area grouping, or some set of data in the application layer.
LPT Code
LPT Name Standard Definition
P, PVT
Pivot (from RDF/EAV)
Source A join over RDF/EAV tables that pivot or flatten the EAV data into multiple tables in columnar form. They can then be joined to other tables.
R, RDF Resource Description Framework Metadata
A join over source tables used to describe source data that is in RDF form. Typically, there are 4 tables needed: Resource (Entity) which defines tables and primary keys, Predicate (Relation) which defines relationships, ResourceType which defines type of each attribute (e.g. date, char, etc.) and RDF (Value) which contains the RDF triples. This data can be used to pivot into standard tables.
S, STL Satellite Data
Vault
A join over tables that can be used as a satellite in a Data Vault schema.
T, TMV
Time Variant (Historical)
Inmon A projection of the logical active columns for an active table, which shows current and historical data, and excludes deleted rows. An active column has a number of distinct values > 0 (greater than zero). This is only required for source tables. The non-source DW Design Patterns all support Time Variance. This data is mutable. That is, it changes over time, and a new row is created whenever the source data changes. Some redundancy here, as this is also defined by the column filter type. However, this will be retained as it is now standard practice.
U, UCN
Unconditional, (Audit)
Source A projection of all columns for all tables, without filtering the column set or the row set. Therefore, this will show both current and historical data. This also shows deleted rows. Some redundancy here, as this is also defined by the column filter type. However, this will be retained as it is now standard practice.
Z, ZWK
ZWork
Metadata
These projections satisfy technical requirements, and not reporting requirements. These should not be visible to reporting users. Many will be required by the RDBMS to define projections like indices, etc.
This hierarchy shows what standards the logical projection types belong to. The projection designer needs to ensure that the projection conforms to these projection types. For example, if the
projection is a fact table, it must begin with F_*, and no other prefix.
Logical
Projection
Metadata
M
Metadata
R RDF
Z Work
Source
A Active
K Key
N Nub
O Other
P Pivot
U Uncond
Kimball
B Bridge
C Cube
D
Dimension
F Fact
G
Aggregate
Data Vault
H Hub
L Link
S Satellite
Inmon
E
Ephemeral
I Invariant
T Time
Variant
Physical Projection Type (PPT)
All physical types are defined, which will help minimize name collisions. The designer may choose not to add any suffix, at the risk of subsequent name collisions. Not defining the Physical Projection Type suffix is unsafe practice when generating projections automatically.
PPT Code
PPT Name Oracle Netezza Comment
A Abstract An ‘Abstract’ type is also provided, even though this does
not exist as a type in any database. The *_A type can be a table or a view, etc. This will enable the designer to replace a view with a table, without having to change the name. This may help reduce impacts on downstream users.
F Function True True
G Package True
I Index True True
L DB Link True
M Materialised True True Avoid using this type. Use V instead.
O Object View True
P Procedure True True
Q Queue True
R Synonym True True While a synonym can point to any other physical type, it still needs to be managed as a particular type. Use *_A, if *_R is too restrictive.
S Sequence True True
T Table True True This covers sub-types such as temporary and external.
U Type True
V View True True Views can be implemented as materialized or
dematerialized.
X XML True
The hierarchy below shows which environments physical projection types belong in. Physical Projection A Abstract Shared Types F Function I index M Materialized P Procedure R Synonym S Sequence T Table V View Oracle Only Types G Package O Object View L Link DB Q Queue U type X XML Z Trigger
Constraint Standard
Constraints will be implemented as a single character suffix. Note that these are RDBMS specific.
Constraint Code
CT Suffix Constraint Type Oracle Netezza Description
_C Check True
_F Foreign Key True True
_H Hash Expression True
_O Only View Read True
_P Primary Key True True
_R Referential Integrity True
_S Supplemental Logging True
_U Unique Key True
_V View Check True
Constraint Name Patterns
This shows what would be the most common examples. It is assumed that there will be a maximum of 9 unique column sets on any given table. Given that these are logically equivalent to the primary key, more than this is unlikely. Note that the Distribution Key is not a constraint, but a part of the table structure, so it cannot be altered with an add or drop statement. Therefore, there no naming standard applies to a Distribution Key as a constraint.
Constraint Pattern Example Base Projection
Result Constraint
Primary Key
<tableName>_P D_NODE_T D_NODE_T_P
Foreign Key
< tableName>_<integer>_F D_NODE_T D_NODE_T_1234_F
Unique < tableName> _Unique<integer>_U
Attribute Standard
Attribute Data Type (ADT)
These standards only apply to new attributes created for the DW. All source column names and data types should remain unchanged. All non-source columns MUST have an attribute data type. The DB data type can be varied where it makes sense. Some choices are provided. For example, an
enumeration can be INTEGER or VARCHAR. However, a date must be DATE. Similarly, length can also be varied where it makes sense.
Suffix Description Oracle Data Type Netezza Data type
Definition
_A, _AMT
Amount NUMBER(28,10) FLOAT(15) Any currency or monetary
amount, including balances, prices, etc. This can enable fractions of cents. _B, _BL Binary Large Object BLOB BINARY VARYING
Any Binary Large Object such as an image, audio or video. For example, well-known binary (WKB), which is used to define geometric objects in binary. _C, _CD Enumeration (aka code or List of Values) VARCHAR2(30), NUMBER VARCHAR(30), INTEGER An enumeration is a collection of items that is a complete, ordered listing of all of the items in that collection. For example, Frequency which can be Day, Week, Month, etc.
DAY_OF_WEEK_C can be VARCHAR2(9) or
NUMBER(1,0). _D,
_DT
Date DATE DATE Any date that does not
include time. That is, only YYYY-MM-DD.
_E, _EX
Explain VARCHAR2(255) VARCHAR(255) A description or definition
statement. _F,
_FL
Float NUMBER(28,18) FLOAT(15) Any numeric that is not an
integer, a degree or a rate. _G,
_DG
Degree NUMBER(10,7) FLOAT(15) A number that can be used
for latitude or longitude. _I, _ID Identifier (aka Name or business key) VARCHAR2(80), NUMBER VARCHAR(80), INTEGER
An identifier is a name that identifies (that is, labels the identity of) a unique object. This is always provided by the source system. It may even be a surrogate in the source, but is still an id. The length or type may be changed if needed. E.g. NODE_I
_J Geometry SDO_GEOMETRY ST_GEOMETRY A type that defines a
geometic object. This type varies between platforms. The types can be converted to WKB or WKT format. _K, _KN Kind VARCHAR2(255), NUMBER VARCHAR(255), INTEGER
Any code that must follow rules. E.g. A FSA must be in SAAA format.
_N, _NT
Note VARCHAR2(2000) VARCHAR(2000) Text or string is a sequence
of characters. This can be used for any definition or description, except for Identifiers, Codes or Explanations. Oracle limit is 4,000; Netezza limit is 64,000.
_O, _OC
O’clock TIME TIME Any moment in a 24 hour
period, independent of date. _P,
_PCT
Percentage NUMBER(28,18) FLOAT(15) A percentage as a fractional
number. _R,
_RT
Rate/Percentage NUMBER(28,18) FLOAT(15) A rate as any fractional number.
_S, _SK
Surrogate Key (Natural)
NUMBER(28,0) BIGINT An identifier that is based on
the natural key of a table. On non-time variant tables, this is identical to the Unique
Surrogate Key (_U). For time variant tables, this will change when there is a change to the key. _T,
_TS
Timestamp TIMESTAMP(6) TIMESTAMP A date and time down to the
micro-second. That is, YYYY-MM-DD HH:MM:SS.FFFFFF. _U,
_UK
Unique Surrogate Key
NUMBER(28,0) BIGINT A unique identifier for each
row of a table. Incremented whenever a new row is added.
_W, _WH
While NUMBER(28,0) BIGINT Any time duration measured
in seconds. _X,
_XM
XML CLOB VARCHAR XML Document. For
example, well-known text (WKT), which is used to define geometric objects in XML.
_Y, _YN
Boolean (aka, flag or indicator) VARCHAR(1 CHAR) 'Y' | 'N', VARCHAR(5 CHAR) 'True' | 'False', BIT, 0 | 1 NUMBER(1, 0), 0 | 1 CHAR(1), CHAR(5), BYTEINT
An attribute type with only two possible values: yes or no. It must be expressed as a question with a clear true or false value e.g.
is_deleted_YN. _Z,
_ZI
Integer NUMBER INTEGER Any integer. This excludes
-monetary or fractional numeric. This could be count or sequential value, etc. Z is the default symbol for integer in Maths.
The longer attribute type (e.g. _AMT) may be used if there less than 27 characters in the name. Otherwise, the shorter attribute type must be used, even if that means truncating the name to 28 characters.
Why such large Numeric values for ORACLE?
Oracle implements NUMBER as a variable length column. So, the column size in bytes for a particular numeric data value NUMBER(p), where p is the precision of a given value, can be calculated using the following formula:
ROUND((length(p)+s)/2))+1
where s equals zero if the number is positive and s equals 1 if the number is negative. Therefore, the large number sizes will not impact storage, as they are not fixed.
Boilerplate Physical Columns
System Column Description
SOURCE_CHANGED_ON_TS The date and time when the record was last modified in the source system.
SOURCE_CREATED_ON_TS The date and time when the record was initially created in the source system.
EFFECTIVE_AT_TS This column stores the moment date and time at which the record represents a true value. A value is either assigned by the Unified Data Store or extracted from the source. Note that this column is not needed on most tables. It will only be on event type tables, and certain kinds of fact tables.
EFFECTIVE_FROM_TS This column stores the date and time from which the record represents a true value. A value is either assigned by the Unified Data Store or extracted from the source. This column will be on all SCD2 tables.
EFFECTIVE_TO_TS This column stores the date and time after which the record does not represents a true value. The EFFECTIVE_TO_T of the previous row MUST = EFFECTIVE_FROM_T of the next row – 1 TIME UNIT. A value is either assigned by the Unified Data Store or extracted from the source. This column will be on all SCD2 tables.
IS_DELETED_YN This boolean indicates the deletion status of the record in the source system. A value of YT indicates the record is deleted from the source system and logically deleted from the source aligned layer. A value of NF indicates that the record is active.
PROCESS_INSERT_ID System field. This column is the unique identifier for the specific ETL batch process used to create insert or update this data row. Both INSERT_PROCESS_ID and UPDATE_PROCESS_ID are necessary in order to be able to easily back out incorrectly loaded data. PROCESS_UPDATE_ID This column is the unique identifier for the specific process used to
update this row. Both INSERT_PROCESS_ID and
UPDATE_PROCESS_ID are necessary in order to be able to easily back out incorrectly loaded data.
<ForeignKeyName>
_DISTRIBUTION_ID, _DSTR_I
ForeignKeyName is the name of the Primary Key on the table that is being used for distribution over the Netezza nodes.
System Column Description
ROW_NATURAL_ID Concatenated artificial primary key. This is constructed out of the columns that represent the true table key. These values are cast into string type.
SOURCE_SYSTEM_C This should be a valid 3 character source system code. TRANSACTION_TYPE_C This code indicates the kind of load transaction used for this
record. There are 4 possible values: I (Insert), U (Update), D (Delete ), R (Refresh).
A flag IS_CURRENT_YN can be added to views, but it will not be needed on the physical tables, as this can be derived from high date for on EFFECTIVE_TO_TS.
Primary and Foreign Key Names
This standard applies to all views, in all layers. They also apply to all tables not in SAL.
The primary key standard is needed to support the automatic key matching in Tableau and other tools.
Name Pattern Description
Primary Key Name
<Logical Table name>_I
Apply in all layers.
Foreign Key Name
<Logical Table name>_I
The only exception is when a child table has the two relationships to the parent table. In this case, the most important relationship should have the same name. All other keys will have a prefix.
Foreign Unique Column Name <Logical TableName> Unique <n> ID
This acts like a "foreign key" column that uses the nth unique key of a parent table. N ranges from 1 to 9. These key types must also match easily in Tableau.
Unique Column Name <Logical TableName> Unique <n> ID
This is the nth unique key for this table. N ranges from 1 to 9. This kind of column can also be called an Alternate Primary Key.
General Renaming Rules
These are global rules for all names, whether they are used for schemas, projections, attributes, etc.
Logical to Physical Name Mapping Automated Function
This describes the function that changes a logical name to physical name. The key idea is to minimise all renaming, as this only introduces confusion, and forces users to lookup reference lists. The rules are only applied when needed, which is not often. The default is NO change at all.
For Oracle, MaxNameLength = 30, and Netezza, MaxNameLength = 128. Restrict the length to (MaxNameLength - 4) chars, so that 2 chars can be prefixed and 2 chars can be suffixed e.g. F_*_T (for fact table).
1. If the name length is 26 or less, then use the name as is, after taking out illegal characters e.g. "Unique GNAFs incl Proposed" -> "Unique GNAFs incl Proposed" or
UNIQUE_GNAFS_INCL_PROPOSED
2. If the deduplicated, disemvowelled name length is 26 or less, then de-duplicate and disemvowel the name e.g. "ERP Project Template Parameters" -> "ERP Prjct Tmplt Prmtrs" or ERP_PRJCT_TMPLT_PRMTRS
3. Else de-duplicate, disemvowel, truncate all words to a max of 4 letters and truncate whole name <= 26 e.g. "ERP Actual Burdened Total Cost Amount Total" -> "ERP Actl Brdn Ttl Cst Amn" or ERP_ACTL_BRDN_TTL_CST_AMN
Logical to Physical Name Definitions
Disemvowelling means removing vowels from a word. e.g. Assignment -> Assgnmnt. The first char is always retained, regardless of whether is a vowel or not. All other vowels are removed.
De-duplication removes repeated consonants. E.g. Access -> ACCSS -> ACS.
Truncation means removing all but the first 4 letters E.g. Burdened -> BRDND -> BRDN.
Logical to Physical Name Mapping Manual Process
This applies where a manual approach is being used. More flexibility is available here, while retaining the overall goal of not forcing the user to look up some dictionary of valid abbreviations. For
example, DESC is a standard abbreviation for DESCRIPTION. So "ACCESS TECHNOLOGY DESCRIPTION" could be "ACCESS TECHNOLOGY DESC". However, please note that DESC could also be DESCENDING, or some other thing. So this approach should be used for standard abbreviations only. The
automated function above would become "ACCSS TCHNLGY DSCRPTN".
However, only use this approach when the name is approaching the Max Name Length (e.g. 26 for Oracle). For example, if the Logical name is Appointment, turning this into APNT may create confusion, as this could also be Apparent, or many other words starting with Ap*n*t. Use this approach wisely.
Time Standard
Time Quantum or Time Delta
The DW time quantum will be a second. That is, this will be the smallest unit of time in the data.
High and Low Dates
The high date is 31-12-2999 00:00:00.
The low date will be the start time for Epoch or UNIX time. That is: (UTC), 1 January 1970 or 01-01-1970 00:00:00. However, in general, this low date should not be used. Instead, the low date of a row should be the date that the row was first created in the source system. Only use low date when a date is mandatory, and the data is missing from the source system.
Time Zone
This is EST or Australian Eastern Standard Time.
Table types and Additional Time keys
History Type Definition Additional Time Keys Example Time
Variant (aka Duration)
Data that is true over a period of time. Must contain valid from and valid to as part of primary key. This information is almost always based on load date, rather than on a specific business defined event date.
EFFECTIVE_FROM_TS, EFFECTIVE_TO_TS Normally an SCD2 Dim table. Ephemeral (aka Activity)
Data that is true at a point in time. Must contain valid timestamp as part of primary key. This information is based on business information such as transaction data, or effective date. As a last resort, it can be based on load date.
EFFECTIVE_AT_TS This occurs on transaction grain Fact tables, as well as some source tables.
Invariant (aka No History)
Data that is always true or data whose change is not of interest. Date is not part of key.
Not Applicable Equivalent to an SCD1 Dim table.
Time periods must be non-overlapping
That is, the EFFECTIVE_TO_TS of the previous row = EFFECTIVE_FROM_TS of the next row – 1 TIME QUANTUM. If there is no clear distinction between the timestamps, then a query can result in 2 rows of data being true for the overlapping time period. This will cause great confusion for any analyst querying the data, and impact the credibility of the DW. The rule is to modify the previous EFFECTIVE
_FROM_TS to the next EFFECTIVE _TO_TS – 1 TIME QUANTUM. This is a fundamental standard that must be adhered to.
Platform Constraints
Summary
Most of these standards are dictated by platform constraints. In many cases, this is obvious, as in the set of Physical Projection Types is clearly dictated by the Oracle and Netezza DB engines. However, there are some other standards that are more subtle. These are listed below.
Standard Platform Rationale
Time Quantum
Oracle Logging
The minimum granularity of Oracle Logging is 1 second. Therefore, this is the finest/smallest discrete time unit available.
Upper Case only in Target
Datastage This ETL tool does not support mapping to case sensitive object names. So, for physical tables loaded by Datastage, they will be in upper case. Note that this does not apply to the CDC tool. Column count
limit
Oracle Oracle has a 1,000 column limit for tables and views.
Column size limit
Oracle Oracle has a 4K column size limit, for VARCHAR.
Row size limit Oracle Also known as record size. No limit. Column size
limit
Netezza Netezza has a 64K column size limit.
Row size limit Netezza Netezza has a 65,535K total row size table limit. This is the sum of the column lengths in a row.
Column count limit
Netezza Netezza have a 1,600 column limit for tables and views.
Oracle
The following list of rules applies to both quoted and nonquoted Schema Object Name identifiers unless otherwise indicated:
1. Names must be from 1 to 30 bytes long.
2. Nonquoted identifiers cannot be Oracle Database reserved words. Quoted identifiers can be reserved words, although this is not recommended.
3. The Oracle has many pseudo reserved words with special meanings, such as DIMENSION,
SEGMENT, ALLOCATE, DISABLE, and so forth. These words are not reserved, but as Oracle uses them internally in specific ways, this may lead to unpredictable results.
4. Use ASCII characters in names.
5. Nonquoted identifiers must begin with an alphabetic character from your database character set. Quoted identifiers can begin with any character.
6. Nonquoted identifiers can contain only alphanumeric characters from your database character set and the underscore (_), dollar sign ($), and pound sign (#). Oracle strongly discourages you from using $ and # in nonquoted identifiers.
7. Within a namespace, no two objects can have the same name. The following schema objects share one namespace: Tables, Views, Sequences, Private synonyms, alone procedures, Stand-alone stored functions, Packages, Materialized views and User-defined types. Each of the following schema objects has its own namespace: Indexes, Constraints, Clusters, Database triggers, Private database links, Dimensions. Because tables and views are in the same namespace, a table and a view in the same schema cannot have the same name. However, tables and indexes are in different namespaces. Therefore, a table and an index in the same schema can have the same name. Each of the following nonschema objects also has its own namespace: User roles, Public synonyms, Public database links, Tablespaces, Profiles, Parameter files (PFILEs) and server parameter files (SPFILEs). Because the objects in these namespaces are not contained in schemas, these namespaces span the entire database.
8. Nonquoted identifiers are not case sensitive. Oracle interprets them as uppercase. Quoted identifiers are case sensitive.
9. Columns in the same table or view cannot have the same name. However, columns in different tables or views can have the same name.
10. Procedures or functions contained in the same package can have the same name, if their arguments are not of the same number and datatypes. Creating multiple procedures or functions with the same name in the same package with different arguments is called overloading the procedure or function.
Netezza
Netezza objects include tables, views, and columns. Follow these naming conventions: 1. A name must be from 1 to 128 characters long.
2. A name must begin with a letter (A through Z), diacritic marks, or non-Latin characters (200-377 octal).
3. A name cannot begin with an underscore (_). Leading underscores are reserved for system objects.
4. Names without quotes are not case sensitive. For example, CUSTOMER and Customer are the same, but object names are converted to lowercase when they are stored in the Netezza database. However, if a name is enclosed in quotation marks, then it is case sensitive.