Data Warehouse
A Data warehouse is a relational database containing fact
and dimension tables.
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making proces - Bill Inmon
Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a particular subject.
Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.
Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts with a transactions system, where often only the most recent data is kept. For example, a transaction system may hold the most recent address of a customer, where a data warehouse can hold all addresses associated with a customer.
Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered.
A data warehouse is a
copy of transaction data specifically structured for query and analysis. - Ralph Kimball
We will know how the data warehouse going to build in general
Any database or data warehouse development starts and ends with below steps.
Requirement gathering - Gathering requirements from business documents, discussion with functional teams, analysts and end user to identify reporting / analysis requirements. What they actually looking for, which data they required for analyzing etc. In this phase they will do the recovery plan for data warehouse in case of disasters.
Data Modeling - This model is the foundation of any data warehouse. By the requirements gathered in first step, conceptual model will be derived. Based on the conceptual model Logical model will be designed. Finally it turns into physical data model.
e.g., Conceptual Model - Includes all major entities, relationships and does not contain much detailed level of information about attributes.
Logical Model - A logical data model describes the data in as much detail as possible, without regard to how they will be physical implemented in the database. Includes all required entities, relationships, attributes, key groups that represent business information and define business rules.
Physical Model - How the model will be built in the database. A physical database model shows all table structures, including column name, column data type, column constraints, primary key, foreign key, and relationships between tables.
Once the data modeling has been done the next phase is the Schema design.
Here Schema is nothing but structure of the database. How the table structure should be to fit for the relationships between dimensions and facts tables to facilitate for query performance. This will be constructed by dividing the tables.
Star Schema:
A star schema is the one in which a central fact table is sourrounded by denormalized dimensional tables. A star schema can be simple or complex. A simple star schema consists of one fact table where as a complex star schema have more than one fact table.
Snow Flake Schema:
A snow flake schema is an enhancement of star schema by adding additional dimensions. Snow flake schema are useful when there are low cardinality attributes in the dimensions.
Galaxy/ Integrated/ Hybride Schema:
Galaxy schema contains many fact tables with some common dimensions (conformed dimensions). This schema is a combination of many data marts.
Fact Constellation Schema:
The dimensions in this schema are segregated into independent dimensions based on the levels of hierarchy. For example, if geography has five levels of hierarchy like teritary, region, country, state and city; constellation schema would have five dimensions instead of one.
The schema's are formed by the fact and dimension tables. There are many types of dimension tables are there which you can find in google. Here I am explaining only few.
Dimension Table
Dimension tables are used to describe descriptive information (dimension) of the field of a fact table. Think dimensions as things or objects. If you take product as example, product should have some dimensions like shape, colour, name etc.
e.g., The time dimension would contain every hour, day, week, month, quarter and year that has occurred since you started your business operations.
Conformed Dimension:
Conformed Dimensions are dimensions that has the same values for all areas of the business.
e.g., Suppose you have a “shipping” data mart (telling you what you’ve shipped to whom and when) and a “sales” data mart (telling you who has purchased what and when). Both marts require a “customer” dimension and a “time” dimension. Since April usually means the same thing to every department in the organization.
Junk Dimension
The junk dimension is simply a structure which is used to store un wanted data which is not useful for report generating purpose. the junk dimensions like text or flags, non-generic comments or yes/no or true/false indicators.
e.g., consider a trade company which consists of fact about trading that take places in a share trading firm. In these there may be some facts like mode of trade(which indicates whether the user is trading through phone or online)which will be not related to any of the dimensions such as account,date,indices,amount of share etc.
so these unrelated facts are removed from the fact table and stored as a separate dimension as a junk dimension which will be useful to provide extra information.
Degenerated Dimension
As The name interprets , Its not a Dimension. But rather can
be Quoted as a primary key of the fact, This key is not a foreign key to any
other dimensions rather belongs solely as a primary key to the Fact.
e.g., Let's take example of a Sales Invoices. Each invoice
contains the following details
Invoice Date
Customer ID
Products within the Invoice
Reference to Order Number(s)
Invoice Number
Invoice Line Numbers (which are multiple lines in single Invoice)
Invoice Line Amount
Invoice Total Amount
Customer ID
Products within the Invoice
Reference to Order Number(s)
Invoice Number
Invoice Line Numbers (which are multiple lines in single Invoice)
Invoice Line Amount
Invoice Total Amount
When we model the
above following Dimensional Modeling standards, we get following distinct Dimensions:
Time Dimension -
representing the Invoice Date
Customer Dimension - representing Customer ID
Product Dimension - representing Products within the Invoice
Order Dimension - representing Orders
Invoice Dimension representing Invoice Number & Invoice Line Numbers
Customer Dimension - representing Customer ID
Product Dimension - representing Products within the Invoice
Order Dimension - representing Orders
Invoice Dimension representing Invoice Number & Invoice Line Numbers
We can observe
that Only candidate attributes are Invoice Number and Invoice Line Numbers. But,
this is at the granularity of the FACT, which stores references to all above said
Dimensions as well as the measures i.e. Invoice Line Amount, Invoice Total
Amount. We can degenerate these two into
facts.
Def: If a table contains the values, which
are neither dimension nor measures is called degenerate dimensions.
Ex : invoice number,
empno
Slowly Changing Dimensions
Dimensions that changing over time are known as SCD. Prices of products, names of people or address may change over time for some reason. These are few examples in SCD.
These are categorized into three types namely SCD1, SCD2 and SCD3.
SCD1: The new record replaces the original record. No trace of the old record exists.In other words no history is kept.
Ex:
The below one is the Original table
Key
|
Name
|
State
|
1001
|
Tanuj
|
A.P
|
After Tanuj moved from AP to Karnataka, the new information
replaces the new record, and we have the following table:
Key
|
Name
|
State
|
1001
|
Tanuj
|
Karnataka
|
Advantages:
- This is the easiest way to handle the SCD problem, since
there is no need to keep track of the old information.
Disadvantages:
- All history is lost. By applying this methodology, it is
not possible to trace back in history. For example, in this case, the company
would not be able to know that Tanuj lived in AP before.
SCD2: Creating an another additional record. Therefore, both the original and the new record will be
present. The newe record gets its own primary key.
The below one is the Original table
Key
|
Name
|
State
|
1001
|
Tanuj
|
A.P
|
After Tanuj moved from AP to Karnataka, the new information added
as a new record.
Key
|
Name
|
State
|
1001
|
Tanuj
|
A.P
|
1002
|
Tanuj
|
Karnataka
|
Advantages:
- This allows us to accurately keep all historical
information.
Disadvantages:
- This will cause the size of the table to grow fast. In
cases where the number of rows for the table is very high to start with,
storage and performance can become a concern.
- This necessarily complicates the ETL process
SCD3: Adding two columns to indicate the particular attribute of interest, one indicating the original value, and one indicating the current value.
To accommodate SCD3, we will now have the following columns:
(Customer Key, Name, Original
State, Current
State, Current State)
Key
|
Name
|
Original
State
|
Current
State
|
Current
State
|
1001
|
1001
|
A.P
|
Karnataka
|
Jan -15 - 2010
|
Advantages:
.
Disadvantages:
- SCD3 will not be able to keep all history where an
attribute is changed more than once. For example, if Tanuj later moves to Kerala
on December 15, 2010, the Karnataka information will be lost.
Fact Table
Fact tables contain keys to dimension tables as well as measurable facts.
e.g .,If you take same example product, which can have quantity, weight, price etc . All these values are measurable.
Fact table store different types of measures.
Additive facts: Measures that can be added across dimensions.
e.g.,
Date
|
Store
|
Product
|
Sales_amount
|
The purpose of this table is to record the sales for each product in each store on a daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because you can sum up this fact along any of the three dimensions present in the fact table -- date, store, and product. For example, the sum of Sales_Amount for all 7 days in a week represent the total sales amount for that week.
Semi Additive facts: Measures can be added across few dimensions not with others
e.g.,
Date
|
Account
|
Current_bal
|
Profit_margin
|
Current Balance of single account cannot be added for month. Here current balance will be for the particular day. At end of the each day you can sum up current balance. If you try to sum up current balance for month it will give wrong data, because day by day the balance will change. There will be no meaning, but can be summed up across all the accounts.
Non Additive facts: Measures cannot be added across dimensions
e.g., Avg, Ratio, variance & Percentage
Hence, the relation between a fact and dimension table is one to many.
Factless Fact Table
A factless fact table is a fact table that does not have any measures.
e.g.,
How many students attended a particular class on a particular day?
One can easily answer the above questions with the use of factless fact table. In the above question we can model three types of facts. Class ID, Date ID and Student ID.