• No results found

Data Warehouse Names

N/A
N/A
Protected

Academic year: 2021

Share "Data Warehouse Names"

Copied!
27
0
0

Loading.... (view fulltext now)

Full text

(1)

Datawarehouse

Names

(2)

INTRODUCTION

3

HIGH LEVEL STANDARD

7

PROJECTION STANDARDS

11

CONSTRAINT STANDARD

17

ATTRIBUTE STANDARD

18

GENERAL RENAMING RULES

23

TIME STANDARD

24

(3)

Introduction

Licence

As these are generic software documentation standards, they will be covered by the 'Creative Commons Zero v1.0 Universal' CC0 licence.

Warranty

The author does not make any warranty, express or implied, that any statements in this document are free of error, or are consistent with particular standard of merchantability, or they will meet the requirements for any particular application or environment. They should not be relied on for solving a problem whose incorrect solution could result in injury or loss of property. If you do use this material in such a manner, it is at your own risk. The author disclaims all liability for direct or consequential damage resulting from its use.

This is really a sample document, illustrating particular choices taken. As each DW environment is different, so these choices will also be different. There is no universal standard for naming conventions.

Purpose

This Naming standards document defines the standards to follow for schemas, subject areas, projections, attributes, constraints, boilerplate columns, renaming, and time. This document is needed to make it easier for Business users to understand column and table names, and reduce the cost of egregious renaming.

Audience

The primary audience of this document are any staff who use the DW, or do design development or maintenance on the DW. It will also be useful to Business Intelligence end users.

Assumptions

It is assumed that the naming conventions will support a variety of DW design patterns such as Data Vault, Kimball, and source standards.

Approach

The naming standard will be based on common DW design patterns and other essential design decisions.

Related Documents

There are many documents describing the DW design methodologies such as Kimball, Inmon, etc. Some are below.

Author Reference Publisher Year

(4)

Author Reference Publisher Year

CJ Date Temporal Data & the Relational Model Morgan Kaufmann

2002 Ralph Kimball The Data Warehouse Toolkit: The Complete

Guide to Dimensional Modelling

Wiley 2002

Definitions

Term Source Definition

Aggregate Kimball An aggregate is a summary table using group by, often based on fact and/or dimension tables.

Bridge Kimball A bridge table to capture many-to-many relationships, such as when a fact table row can be associated with more than one value in a dimension.

Data Model Chris Date The data model must represent demonstrably true statements about the business area. That is, the entities must represent things that mean something to the business, and the relationships between entities must represent meaningful links.

Data Vault Linstedt Data Vault is a database method that is designed to provide long-term historical storage of data coming in from multiple operational systems. It provides a DW pattern that supports the Inmon goals of Subject-orientation, non-volatility, integration and Time-variance.

Dimension Kimball An independent entity in a dimensional model that serves as an entry point or as a mechanism for slicing and dicing the additive measure located in the fact table of the dimensional model. For example, all months, quarters, years, etc., make up a time dimension. Based on Measure Theory.

Disemvowel DB This is a function that removes all vowels from a text string. E.g. A sentence such as: “The quick brown fox jumps over the lazy dog” would, after being disemvowelled, become: “Th qck brwn fx jmps vr th lzy dg”.

EAV DB An Entity-Attribute-Value table. This is a type of RDF table. Data is recorded in only three columns: Entity: the item being described; Attribute or parameter: a foreign key into a table of attribute definitions; Value of the attribute. This pattern is very popular with business users, as they have the opportunity to redefine data in an application. Often, much of the useful reporting data are in these columns. However, it can be quite difficult to pivot this data into usable reporting tables. See the Resource Description Framework

(5)

Term Source Definition

(RDF) as an example.

Fact Kimball A business performance measurement, typically numeric and additive, that is stored in a fact table. Based on Measure Theory.

Hub Linstedt A hub represents a core business concept such as Customer, Vendor, Sale or Product. The hub table is formed around the business key, as well as the source system keys.

Integrated Inmon Integration is the process of mapping dissimilar codes to a common base, developing consistent data element presentations and delivering this standardized data as broadly as possible.

Invariant Algebra An invariant is a property of a mathematical object that remains unchanged when transformations of a certain type are applied to the object. As an example in a DW, the count of network nodes should be identical in a source database, as well as the target DW, at the same point in time.

Join Codd A join is a binary operator on two relations or database tables. Link Linstedt A link represents a natural business relationship between business

keys.

Logical name Chris Date These are full words from a standard business vocabulary. This

contrasts with physical names that are often abbreviated due to name length limitations, etc.

Metadata DB Metadata is "data about data". While not often used in reporting, these tables are important in DW standards, and for generating and describing DW components.

Name Hygiene

Computer Science

The prevention of accidental capture of identifiers, or name collision. This occurs especially when generating identifiers or names. In programming languages, these are solved using hygienic macros. Non-volatile Inmon Non-volatile design is essential. Non-volatility literally means that once

a row is written, it is never modified. This is necessary to preserve incremental net change history. This, in turn, is required to represent data as of any point in time. When a data row is updated, the past information is destroyed. A fact or total that included the unmodified data can never be recreated.

Object DB For the purposes of the DW, an object is a projection over a table or a table join. They are often used in applications to layer and rename physical column names to attribute names. Consequently, the end user is more familiar with the attribute name, rather than the column

(6)

Term Source Definition

name.

Pivot DB A pivot table is the transformation of an EAV table into columnar form. Projection Codd A projection is a unary operation that selects a subset of attribute

names. In database terms, this is a select. This can be implemented as tables, views, XML messages, etc.

RDF W3C The Resource Description Framework (RDF) is a family of World Wide Web Consortium (W3C) specifications originally designed as a

metadata data model. It has come to be used as a general method for conceptual description of information and/or knowledge management, which is implemented in web resources, and other formats. An RDF triple is a statement about resources in the form of a

subject-predicate-object expression. DF is a more general form of the original EAV Entity-Attribute-Value design pattern, where the subject is an entity, predicate is an attribute and object is a value. A set of such RDF triples forms an RDF graph.

Relational Algebra

Codd Relational algebra is an offshoot of first-order logic and of algebra of sets concerned with operations over relations.

Satellite Linstedt A satellite contains the descriptive information (context) for a business key.

Subject-orientation

Inmon Subject-orientation mandated a cross-functional slice of data drawn from multiple sources to support a diversity of needs. This is a departure from serving only either the vertical application views of data (supply-side) or the overlapping departmental needs for data (demand side).

Time Variant Inmon Time variance calls for storage of multiple copies of the underlying detail in aggregations of differing periodicity and/or time frames. There may be detail for seven years along with weekly, monthly and

quarterly aggregates of differing duration. This is critical for maintaining the consistency of reported summaries over time.

Tags

Business Intelligence ; Data Mapping ; Metadata ; Standards ; Data Transformation ; Data

Warehouse ; Database ; Fact / Dimension ; Data Load ; Data Model ; Data Vault ; Database Design ; Extract Load Transform - ELT ; Extract Transform Load - ETL ; Inmon ; Kimball ; Massive Parallel Processing - MPP ; Netezza ; Oracle ; Data Integration ; Data Lineage ; Data Traceability ; Time Variant ; Metadata Glossary ; Hub / Satellite ; Projection ;

(7)

High Level Standard

Layer Standard

These are all defined as schemas in the DW. This assumes a simple 4 layer DW design. The final layers start with A*, so they will appear first in any schema listing.

Layer Aligned Area

Schema Name Description

Gold Layer

Business AU_GOLD_ DECISION

This is a schema for tables and views that are available to Business Decision Makers. It represents the final state of the transformed data, suitably cleansed for the primary customer of the DW, the business decision maker.

Silver Layer

Business AG_SILVER_ INTEGRATION

This is a schema that enables integration across multiple source schemas. It represents transformed Lead data into more valuable Silver data.

Iron Layer

Business FE_IRON_ METADATA

This is a schema for all metadata data.

Lead Layer Source PB_ <Source System Code>_ <DB Name/Path>_ <Schema Name/File>_ <Load Type>

This contains the raw loaded source data. The primary need is for a unique definition of the data. This naming pattern supports this. The name defines the important characteristics of the load. The first code identifies the source system. The next code identifies the DBName for databases, or the address for web sources or the directory path for file sources. The DBName must always be based on the ultimate

production database name, not the development name. In the case of directory or path names, the protocol prefix will need to be stripped off, and illegal characters like ":/" will need to be substituted with underscores. The next code identifies the Schema Name for databases, or the page for web sources, or the file for file based sources. The next code identifies the load type. This more precise method of identifying sources will enable a much large number of different sources to be identified, and to be able to manage multiple loads from the same source, so that they do not impact on each other. This approach is more

self-documenting, with a clearer self documenting data lineage.

Load Type

This is needed to distinguish how the data has been loaded. If data is brought in by separate ETL or CDC processes, it will be subject to different time variance and recovery processes. This is even an issue within CDC, where data can be either refreshed or mirrored. If the target is not distinct, then it

(8)

may be impossible to properly reason about its state, leading to incorrect replication and, ultimately, erroneous business decisions.

Load Type

Load Name Description

CDC Change Data Capture

This is the default process for loading data.

ELT Extract Load Transform

A common pattern in DW is to load the data in before transformation. This results in faster loads in an MPP appliance.

ETL Extract Transform Load

ETL is used to load data into the DW. The transformation occurs in a separate ETL tool, which is a common anti-pattern which leads to poor single threaded ETL performance.

MNL Manually

loaded

This is practical for data that does not need to be loaded more than once, or very infrequently, which is generally invariant data.

REF Reference Used mainly for schema analysis, and possibly prototyping, but not to be used for reporting. A throw away design area.

Just as there can be alternate means of loading, there can also be alternate source types from the same source. In almost all cases, there is not a strict isomorphism between source and target, and the source type plays an important part in determining the amount of change that occurs between the source and target. In other words, the different source types have different constraints that dictate how well the data is loaded. If the source type is also a file type, then they should conform to standard file extensions. The source type may also be appended to the name, but it is not

mandatory.

File Type

File Type Name Description

DOC Microsoft Word Document HPR Hyperion

HTML Hypertext Markup Language

MDB Microsoft Access Database MSP Microsoft SharePoint NTZ Netezza

ODS Operational Data Store Many source systems currently provide a ODS layer. While well meaning, this anti-pattern often creates a reconciliation

(9)

File Type

File Type Name Description

nightmare, as rarely are the ODS mappings back to the source system actually documented.

ORA Oracle

XLS Microsoft Excel Spreadsheet XML Extensible Markup

Language

Subject Area Standard

The physical name can be used in schema names. The Subject Areas will have a three letter code. The Subject Area codes can be used in projection names, or in primary key values. These subject area names are examples only.

SA Code

Subject Area (logical)

Subject Area (physical) Layer

AAT Authentication and Authorization

Authentication_Authorization, Athntctn_Athrztn

Iron

GLS Glossary Glossary Silver

BUS Business Business Gold

CTL Control CONTROL Iron

DQ Data Quality DATA_QUALITY Iron

DVT Data Vault DATA VAULT Silver

ERL Entity & Reference Links

Entity_Reference_Links Iron

GLD General Ledger General_Ledger Gold

HRS Human Resources Human_Resources Gold

INV Invariant INVARIANT Iron

LNG Lineage DATA Lineage Iron

OPS Operations Operations Gold

(10)

SA Code

Subject Area (logical)

Subject Area (physical) Layer

Management

QLT Quality Quality Iron

SLN Source Lineage Source_LinEage Silver

TRC Traceability TRACEABILITY Iron

Note that some metadata can be placed in other layers. However, if the metadata is shared across schemas or layers, then it is best to place in a separate schema. Lineage can be considered the column to column mappings. Traceability can be considered the actual row value to row value mappings.

(11)

Projection Standards

A projection can be thought of as a subset of columns. These are a series of codes or types that can be used as part of a projection name. These codes will help to make the purpose of projection more transparent, through consistent usage of the name patterns. The next chapter will give examples of how they can be combined to create projection names.

Logical Projection Type (LPT)

Logical means the primary purpose of this projection type within a given methodology. This purpose will help determine the correct set of columns (or attributes) for the projection. All new projections MUST have a Logical Projection Type as a prefix, except SAL tables copied from source tables. This is necessary to avoid name collisions, as most of these projection types can be used in any

layer/schema. As far as possible, each type should belong within a type set, so the types are

mutually exclusive, and all types form a complete type set. For example, within the Inmon type set, there can only be 3 types: Ephemeral, Invariant or Time Variant.

LPT Code

LPT Name Standard Definition

A, ACT

Active (current) Source A projection of the logical active columns for an active table, filtered to show current data only, and excludes deleted rows. An active column has a number of distinct values > 0 (greater than zero). An active table has at least one row. Some redundancy here, as this is also defined by the column filter type. However, this will be retained as it is now standard practice.

B, BRD

Bridge, Helper Kimball A join over tables that can be used as a bridge in a Kimball schema.

C, CUB

Cube Kimball A projection that is a join of a Kimball Fact table, and related Kimball Dimension tables. This is useful when the users do not know how to join tables. This does not use GROUP BY, so it is different to an Aggregate.

D, DMN

Dimension Kimball A join over tables that can be used as a dimension in a Kimball schema. Typical columns are codes which have low cardinality, which allow for grouping. A dimension can be either a singleton view (over a single table), or based on a set of joined tables.

E, EPH

Ephemeral Inmon Something that is true at an instant in time, but has no duration or longevity. That is, it only lasts momentarily, or for a very short time. A typical example is a transaction or trade, which is effectively instantaneous. Typically, this data

(12)

LPT Code

LPT Name Standard Definition

is not mutable. That is, it cannot be updated, except for error correction.

F, FCT

Fact Kimball A join over tables that can be used as a fact in a Kimball schema. Typical columns are numeric which represent either continuous data or countable data. This is often based on a single table.

G, AGR

Aggregate Kimball A join over tables that can be used as an aggregate or summary. This denormalisation can be used in all other standards. This uses GROUP BY, so it is different to a Cube. H,

HUB

Hub Data

Vault

A join over tables that can be used as a hub in a Data Vault schema.

I, INV Invariant Inmon Something that is always true. For example, system setup data does not change over the life of a system, so this can be considered invariant. This data is not mutable. That is, it cannot be updated, except for error correction. K,

KEY

Key Source This represents cleansed Source tables that have had a surrogate and/or a distribution key added. A surrogate key is needed for Fact/Dimension joins. A distribution key is critical for adequate Netezza performance. Note that these tables would still retain their source names.

L, LNK

Link Data

Vault

A join over tables that can be used as a link in a Data Vault schema.

M, MTD

Metadata

Metadata

A join over tables that can be used for metadata. Examples include A&A, Glossary, Data Quality, Data Lineage, Data Traceability, Invariants, etc.

N, NUB

Nub Source This represents Source tables that have been cleansed, but without column name changes. For example, cleansing can discard boilerplate columns, de-duplicate rows, add defaults values (e.g. N/A for nulls), convert types (e.g. Text -> Dates), fix column lengths, etc.

O, OTH

Other Source This represents some source system based column and/or table grouping which is not defined as a table/view the source RDBMS. For example, this could represent a manually maintained subject area grouping, or some set of data in the application layer.

(13)

LPT Code

LPT Name Standard Definition

P, PVT

Pivot (from RDF/EAV)

Source A join over RDF/EAV tables that pivot or flatten the EAV data into multiple tables in columnar form. They can then be joined to other tables.

R, RDF Resource Description Framework Metadata

A join over source tables used to describe source data that is in RDF form. Typically, there are 4 tables needed: Resource (Entity) which defines tables and primary keys, Predicate (Relation) which defines relationships, ResourceType which defines type of each attribute (e.g. date, char, etc.) and RDF (Value) which contains the RDF triples. This data can be used to pivot into standard tables.

S, STL Satellite Data

Vault

A join over tables that can be used as a satellite in a Data Vault schema.

T, TMV

Time Variant (Historical)

Inmon A projection of the logical active columns for an active table, which shows current and historical data, and excludes deleted rows. An active column has a number of distinct values > 0 (greater than zero). This is only required for source tables. The non-source DW Design Patterns all support Time Variance. This data is mutable. That is, it changes over time, and a new row is created whenever the source data changes. Some redundancy here, as this is also defined by the column filter type. However, this will be retained as it is now standard practice.

U, UCN

Unconditional, (Audit)

Source A projection of all columns for all tables, without filtering the column set or the row set. Therefore, this will show both current and historical data. This also shows deleted rows. Some redundancy here, as this is also defined by the column filter type. However, this will be retained as it is now standard practice.

Z, ZWK

ZWork

Metadata

These projections satisfy technical requirements, and not reporting requirements. These should not be visible to reporting users. Many will be required by the RDBMS to define projections like indices, etc.

(14)

This hierarchy shows what standards the logical projection types belong to. The projection designer needs to ensure that the projection conforms to these projection types. For example, if the

projection is a fact table, it must begin with F_*, and no other prefix.

Logical

Projection

Metadata

M

Metadata

R RDF

Z Work

Source

A Active

K Key

N Nub

O Other

P Pivot

U Uncond

Kimball

B Bridge

C Cube

D

Dimension

F Fact

G

Aggregate

Data Vault

H Hub

L Link

S Satellite

Inmon

E

Ephemeral

I Invariant

T Time

Variant

(15)

Physical Projection Type (PPT)

All physical types are defined, which will help minimize name collisions. The designer may choose not to add any suffix, at the risk of subsequent name collisions. Not defining the Physical Projection Type suffix is unsafe practice when generating projections automatically.

PPT Code

PPT Name Oracle Netezza Comment

A Abstract An ‘Abstract’ type is also provided, even though this does

not exist as a type in any database. The *_A type can be a table or a view, etc. This will enable the designer to replace a view with a table, without having to change the name. This may help reduce impacts on downstream users.

F Function True True

G Package True

I Index True True

L DB Link True

M Materialised True True Avoid using this type. Use V instead.

O Object View True

P Procedure True True

Q Queue True

R Synonym True True While a synonym can point to any other physical type, it still needs to be managed as a particular type. Use *_A, if *_R is too restrictive.

S Sequence True True

T Table True True This covers sub-types such as temporary and external.

U Type True

V View True True Views can be implemented as materialized or

dematerialized.

X XML True

(16)

The hierarchy below shows which environments physical projection types belong in. Physical Projection A Abstract Shared Types F Function I index M Materialized P Procedure R Synonym S Sequence T Table V View Oracle Only Types G Package O Object View L Link DB Q Queue U type X XML Z Trigger

(17)

Constraint Standard

Constraints will be implemented as a single character suffix. Note that these are RDBMS specific.

Constraint Code

CT Suffix Constraint Type Oracle Netezza Description

_C Check True

_F Foreign Key True True

_H Hash Expression True

_O Only View Read True

_P Primary Key True True

_R Referential Integrity True

_S Supplemental Logging True

_U Unique Key True

_V View Check True

Constraint Name Patterns

This shows what would be the most common examples. It is assumed that there will be a maximum of 9 unique column sets on any given table. Given that these are logically equivalent to the primary key, more than this is unlikely. Note that the Distribution Key is not a constraint, but a part of the table structure, so it cannot be altered with an add or drop statement. Therefore, there no naming standard applies to a Distribution Key as a constraint.

Constraint Pattern Example Base Projection

Result Constraint

Primary Key

<tableName>_P D_NODE_T D_NODE_T_P

Foreign Key

< tableName>_<integer>_F D_NODE_T D_NODE_T_1234_F

Unique < tableName> _Unique<integer>_U

(18)

Attribute Standard

Attribute Data Type (ADT)

These standards only apply to new attributes created for the DW. All source column names and data types should remain unchanged. All non-source columns MUST have an attribute data type. The DB data type can be varied where it makes sense. Some choices are provided. For example, an

enumeration can be INTEGER or VARCHAR. However, a date must be DATE. Similarly, length can also be varied where it makes sense.

Suffix Description Oracle Data Type Netezza Data type

Definition

_A, _AMT

Amount NUMBER(28,10) FLOAT(15) Any currency or monetary

amount, including balances, prices, etc. This can enable fractions of cents. _B, _BL Binary Large Object BLOB BINARY VARYING

Any Binary Large Object such as an image, audio or video. For example, well-known binary (WKB), which is used to define geometric objects in binary. _C, _CD Enumeration (aka code or List of Values) VARCHAR2(30), NUMBER VARCHAR(30), INTEGER An enumeration is a collection of items that is a complete, ordered listing of all of the items in that collection. For example, Frequency which can be Day, Week, Month, etc.

DAY_OF_WEEK_C can be VARCHAR2(9) or

NUMBER(1,0). _D,

_DT

Date DATE DATE Any date that does not

include time. That is, only YYYY-MM-DD.

_E, _EX

Explain VARCHAR2(255) VARCHAR(255) A description or definition

statement. _F,

_FL

Float NUMBER(28,18) FLOAT(15) Any numeric that is not an

(19)

integer, a degree or a rate. _G,

_DG

Degree NUMBER(10,7) FLOAT(15) A number that can be used

for latitude or longitude. _I, _ID Identifier (aka Name or business key) VARCHAR2(80), NUMBER VARCHAR(80), INTEGER

An identifier is a name that identifies (that is, labels the identity of) a unique object. This is always provided by the source system. It may even be a surrogate in the source, but is still an id. The length or type may be changed if needed. E.g. NODE_I

_J Geometry SDO_GEOMETRY ST_GEOMETRY A type that defines a

geometic object. This type varies between platforms. The types can be converted to WKB or WKT format. _K, _KN Kind VARCHAR2(255), NUMBER VARCHAR(255), INTEGER

Any code that must follow rules. E.g. A FSA must be in SAAA format.

_N, _NT

Note VARCHAR2(2000) VARCHAR(2000) Text or string is a sequence

of characters. This can be used for any definition or description, except for Identifiers, Codes or Explanations. Oracle limit is 4,000; Netezza limit is 64,000.

_O, _OC

O’clock TIME TIME Any moment in a 24 hour

period, independent of date. _P,

_PCT

Percentage NUMBER(28,18) FLOAT(15) A percentage as a fractional

number. _R,

_RT

Rate/Percentage NUMBER(28,18) FLOAT(15) A rate as any fractional number.

_S, _SK

Surrogate Key (Natural)

NUMBER(28,0) BIGINT An identifier that is based on

the natural key of a table. On non-time variant tables, this is identical to the Unique

(20)

Surrogate Key (_U). For time variant tables, this will change when there is a change to the key. _T,

_TS

Timestamp TIMESTAMP(6) TIMESTAMP A date and time down to the

micro-second. That is, YYYY-MM-DD HH:MM:SS.FFFFFF. _U,

_UK

Unique Surrogate Key

NUMBER(28,0) BIGINT A unique identifier for each

row of a table. Incremented whenever a new row is added.

_W, _WH

While NUMBER(28,0) BIGINT Any time duration measured

in seconds. _X,

_XM

XML CLOB VARCHAR XML Document. For

example, well-known text (WKT), which is used to define geometric objects in XML.

_Y, _YN

Boolean (aka, flag or indicator) VARCHAR(1 CHAR) 'Y' | 'N', VARCHAR(5 CHAR) 'True' | 'False', BIT, 0 | 1 NUMBER(1, 0), 0 | 1 CHAR(1), CHAR(5), BYTEINT

An attribute type with only two possible values: yes or no. It must be expressed as a question with a clear true or false value e.g.

is_deleted_YN. _Z,

_ZI

Integer NUMBER INTEGER Any integer. This excludes

-monetary or fractional numeric. This could be count or sequential value, etc. Z is the default symbol for integer in Maths.

The longer attribute type (e.g. _AMT) may be used if there less than 27 characters in the name. Otherwise, the shorter attribute type must be used, even if that means truncating the name to 28 characters.

Why such large Numeric values for ORACLE?

Oracle implements NUMBER as a variable length column. So, the column size in bytes for a particular numeric data value NUMBER(p), where p is the precision of a given value, can be calculated using the following formula:

(21)

ROUND((length(p)+s)/2))+1

where s equals zero if the number is positive and s equals 1 if the number is negative. Therefore, the large number sizes will not impact storage, as they are not fixed.

Boilerplate Physical Columns

System Column Description

SOURCE_CHANGED_ON_TS The date and time when the record was last modified in the source system.

SOURCE_CREATED_ON_TS The date and time when the record was initially created in the source system.

EFFECTIVE_AT_TS This column stores the moment date and time at which the record represents a true value. A value is either assigned by the Unified Data Store or extracted from the source. Note that this column is not needed on most tables. It will only be on event type tables, and certain kinds of fact tables.

EFFECTIVE_FROM_TS This column stores the date and time from which the record represents a true value. A value is either assigned by the Unified Data Store or extracted from the source. This column will be on all SCD2 tables.

EFFECTIVE_TO_TS This column stores the date and time after which the record does not represents a true value. The EFFECTIVE_TO_T of the previous row MUST = EFFECTIVE_FROM_T of the next row – 1 TIME UNIT. A value is either assigned by the Unified Data Store or extracted from the source. This column will be on all SCD2 tables.

IS_DELETED_YN This boolean indicates the deletion status of the record in the source system. A value of YT indicates the record is deleted from the source system and logically deleted from the source aligned layer. A value of NF indicates that the record is active.

PROCESS_INSERT_ID System field. This column is the unique identifier for the specific ETL batch process used to create insert or update this data row. Both INSERT_PROCESS_ID and UPDATE_PROCESS_ID are necessary in order to be able to easily back out incorrectly loaded data. PROCESS_UPDATE_ID This column is the unique identifier for the specific process used to

update this row. Both INSERT_PROCESS_ID and

UPDATE_PROCESS_ID are necessary in order to be able to easily back out incorrectly loaded data.

<ForeignKeyName>

_DISTRIBUTION_ID, _DSTR_I

ForeignKeyName is the name of the Primary Key on the table that is being used for distribution over the Netezza nodes.

(22)

System Column Description

ROW_NATURAL_ID Concatenated artificial primary key. This is constructed out of the columns that represent the true table key. These values are cast into string type.

SOURCE_SYSTEM_C This should be a valid 3 character source system code. TRANSACTION_TYPE_C This code indicates the kind of load transaction used for this

record. There are 4 possible values: I (Insert), U (Update), D (Delete ), R (Refresh).

A flag IS_CURRENT_YN can be added to views, but it will not be needed on the physical tables, as this can be derived from high date for on EFFECTIVE_TO_TS.

Primary and Foreign Key Names

This standard applies to all views, in all layers. They also apply to all tables not in SAL.

The primary key standard is needed to support the automatic key matching in Tableau and other tools.

Name Pattern Description

Primary Key Name

<Logical Table name>_I

Apply in all layers.

Foreign Key Name

<Logical Table name>_I

The only exception is when a child table has the two relationships to the parent table. In this case, the most important relationship should have the same name. All other keys will have a prefix.

Foreign Unique Column Name <Logical TableName> Unique <n> ID

This acts like a "foreign key" column that uses the nth unique key of a parent table. N ranges from 1 to 9. These key types must also match easily in Tableau.

Unique Column Name <Logical TableName> Unique <n> ID

This is the nth unique key for this table. N ranges from 1 to 9. This kind of column can also be called an Alternate Primary Key.

(23)

General Renaming Rules

These are global rules for all names, whether they are used for schemas, projections, attributes, etc.

Logical to Physical Name Mapping Automated Function

This describes the function that changes a logical name to physical name. The key idea is to minimise all renaming, as this only introduces confusion, and forces users to lookup reference lists. The rules are only applied when needed, which is not often. The default is NO change at all.

For Oracle, MaxNameLength = 30, and Netezza, MaxNameLength = 128. Restrict the length to (MaxNameLength - 4) chars, so that 2 chars can be prefixed and 2 chars can be suffixed e.g. F_*_T (for fact table).

1. If the name length is 26 or less, then use the name as is, after taking out illegal characters e.g. "Unique GNAFs incl Proposed" -> "Unique GNAFs incl Proposed" or

UNIQUE_GNAFS_INCL_PROPOSED

2. If the deduplicated, disemvowelled name length is 26 or less, then de-duplicate and disemvowel the name e.g. "ERP Project Template Parameters" -> "ERP Prjct Tmplt Prmtrs" or ERP_PRJCT_TMPLT_PRMTRS

3. Else de-duplicate, disemvowel, truncate all words to a max of 4 letters and truncate whole name <= 26 e.g. "ERP Actual Burdened Total Cost Amount Total" -> "ERP Actl Brdn Ttl Cst Amn" or ERP_ACTL_BRDN_TTL_CST_AMN

Logical to Physical Name Definitions

Disemvowelling means removing vowels from a word. e.g. Assignment -> Assgnmnt. The first char is always retained, regardless of whether is a vowel or not. All other vowels are removed.

De-duplication removes repeated consonants. E.g. Access -> ACCSS -> ACS.

Truncation means removing all but the first 4 letters E.g. Burdened -> BRDND -> BRDN.

Logical to Physical Name Mapping Manual Process

This applies where a manual approach is being used. More flexibility is available here, while retaining the overall goal of not forcing the user to look up some dictionary of valid abbreviations. For

example, DESC is a standard abbreviation for DESCRIPTION. So "ACCESS TECHNOLOGY DESCRIPTION" could be "ACCESS TECHNOLOGY DESC". However, please note that DESC could also be DESCENDING, or some other thing. So this approach should be used for standard abbreviations only. The

automated function above would become "ACCSS TCHNLGY DSCRPTN".

However, only use this approach when the name is approaching the Max Name Length (e.g. 26 for Oracle). For example, if the Logical name is Appointment, turning this into APNT may create confusion, as this could also be Apparent, or many other words starting with Ap*n*t. Use this approach wisely.

(24)

Time Standard

Time Quantum or Time Delta

The DW time quantum will be a second. That is, this will be the smallest unit of time in the data.

High and Low Dates

The high date is 31-12-2999 00:00:00.

The low date will be the start time for Epoch or UNIX time. That is: (UTC), 1 January 1970 or 01-01-1970 00:00:00. However, in general, this low date should not be used. Instead, the low date of a row should be the date that the row was first created in the source system. Only use low date when a date is mandatory, and the data is missing from the source system.

Time Zone

This is EST or Australian Eastern Standard Time.

Table types and Additional Time keys

History Type Definition Additional Time Keys Example Time

Variant (aka Duration)

Data that is true over a period of time. Must contain valid from and valid to as part of primary key. This information is almost always based on load date, rather than on a specific business defined event date.

EFFECTIVE_FROM_TS, EFFECTIVE_TO_TS Normally an SCD2 Dim table. Ephemeral (aka Activity)

Data that is true at a point in time. Must contain valid timestamp as part of primary key. This information is based on business information such as transaction data, or effective date. As a last resort, it can be based on load date.

EFFECTIVE_AT_TS This occurs on transaction grain Fact tables, as well as some source tables.

Invariant (aka No History)

Data that is always true or data whose change is not of interest. Date is not part of key.

Not Applicable Equivalent to an SCD1 Dim table.

Time periods must be non-overlapping

That is, the EFFECTIVE_TO_TS of the previous row = EFFECTIVE_FROM_TS of the next row – 1 TIME QUANTUM. If there is no clear distinction between the timestamps, then a query can result in 2 rows of data being true for the overlapping time period. This will cause great confusion for any analyst querying the data, and impact the credibility of the DW. The rule is to modify the previous EFFECTIVE

(25)

_FROM_TS to the next EFFECTIVE _TO_TS – 1 TIME QUANTUM. This is a fundamental standard that must be adhered to.

(26)

Platform Constraints

Summary

Most of these standards are dictated by platform constraints. In many cases, this is obvious, as in the set of Physical Projection Types is clearly dictated by the Oracle and Netezza DB engines. However, there are some other standards that are more subtle. These are listed below.

Standard Platform Rationale

Time Quantum

Oracle Logging

The minimum granularity of Oracle Logging is 1 second. Therefore, this is the finest/smallest discrete time unit available.

Upper Case only in Target

Datastage This ETL tool does not support mapping to case sensitive object names. So, for physical tables loaded by Datastage, they will be in upper case. Note that this does not apply to the CDC tool. Column count

limit

Oracle Oracle has a 1,000 column limit for tables and views.

Column size limit

Oracle Oracle has a 4K column size limit, for VARCHAR.

Row size limit Oracle Also known as record size. No limit. Column size

limit

Netezza Netezza has a 64K column size limit.

Row size limit Netezza Netezza has a 65,535K total row size table limit. This is the sum of the column lengths in a row.

Column count limit

Netezza Netezza have a 1,600 column limit for tables and views.

Oracle

The following list of rules applies to both quoted and nonquoted Schema Object Name identifiers unless otherwise indicated:

1. Names must be from 1 to 30 bytes long.

2. Nonquoted identifiers cannot be Oracle Database reserved words. Quoted identifiers can be reserved words, although this is not recommended.

3. The Oracle has many pseudo reserved words with special meanings, such as DIMENSION,

SEGMENT, ALLOCATE, DISABLE, and so forth. These words are not reserved, but as Oracle uses them internally in specific ways, this may lead to unpredictable results.

(27)

4. Use ASCII characters in names.

5. Nonquoted identifiers must begin with an alphabetic character from your database character set. Quoted identifiers can begin with any character.

6. Nonquoted identifiers can contain only alphanumeric characters from your database character set and the underscore (_), dollar sign ($), and pound sign (#). Oracle strongly discourages you from using $ and # in nonquoted identifiers.

7. Within a namespace, no two objects can have the same name. The following schema objects share one namespace: Tables, Views, Sequences, Private synonyms, alone procedures, Stand-alone stored functions, Packages, Materialized views and User-defined types. Each of the following schema objects has its own namespace: Indexes, Constraints, Clusters, Database triggers, Private database links, Dimensions. Because tables and views are in the same namespace, a table and a view in the same schema cannot have the same name. However, tables and indexes are in different namespaces. Therefore, a table and an index in the same schema can have the same name. Each of the following nonschema objects also has its own namespace: User roles, Public synonyms, Public database links, Tablespaces, Profiles, Parameter files (PFILEs) and server parameter files (SPFILEs). Because the objects in these namespaces are not contained in schemas, these namespaces span the entire database.

8. Nonquoted identifiers are not case sensitive. Oracle interprets them as uppercase. Quoted identifiers are case sensitive.

9. Columns in the same table or view cannot have the same name. However, columns in different tables or views can have the same name.

10. Procedures or functions contained in the same package can have the same name, if their arguments are not of the same number and datatypes. Creating multiple procedures or functions with the same name in the same package with different arguments is called overloading the procedure or function.

Netezza

Netezza objects include tables, views, and columns. Follow these naming conventions: 1. A name must be from 1 to 128 characters long.

2. A name must begin with a letter (A through Z), diacritic marks, or non-Latin characters (200-377 octal).

3. A name cannot begin with an underscore (_). Leading underscores are reserved for system objects.

4. Names without quotes are not case sensitive. For example, CUSTOMER and Customer are the same, but object names are converted to lowercase when they are stored in the Netezza database. However, if a name is enclosed in quotation marks, then it is case sensitive.

References

Related documents

Like researchers who work in the field of English as an international language (EIL), García, Johnson, and Seltzer clearly support an approach to TESOL teacher preparation that

Same as taxpayer’s building Envelope 90.1-2001 minimum requirements Same as taxpayer’s building * Interior Lighting 90.1-2001 minimum requirements Same as taxpayer’s building *

Resolution 11 – is to empower the directors to offer and grant awards pursuant to the Sembcorp Industries Performance Share Plan 2010 and the Sembcorp Industries Restricted Share

Of course there were substantial ‘umbrella’ organisations such as the Co-operative Union (established in 1870) and the Co-operative Wholesale Societies (English:

3:00 PM Islands of Venice Boat Tour - Murano, Isola di Burano Saturday, August 20, 2016 - Vicenza, Venezia. Sunday, August 21, 2016 - Venezia, Murano.. All options our exquisite

In this paper, the influence of construction details on the overall thermal insulation and energy demand in Belgian houses is examined.. The research was carried out for

allocations: Provided further, That up to $120,000,000 shall be available only: (1) for adjustments in the allocations for public housing agencies, after application for an

Our evaluation shows that SCOO can significantly improve SpMV performance compared to existing formats of the Cusp library for large unstructured matrices. We have further presented