Data Warehouse
A Data warehouse is a relational database containing fact
and dimension tables.
A data warehouse is a
subject-oriented, integrated, time-variant and non-volatile collection of data
in support of management's decision making proces - Bill Inmon
Subject-Oriented: A data
warehouse can be used to analyze a particular subject area. For example,
"sales" can be a particular subject.
Integrated: A data warehouse
integrates data from multiple data sources. For example, source A and source B
may have different ways of identifying a product, but in a data warehouse,
there will be only a single way of identifying a product.
Time-Variant: Historical data is
kept in a data warehouse. For example, one can retrieve data from 3 months, 6
months, 12 months, or even older data from a data warehouse. This contrasts
with a transactions system, where often only the most recent data is kept. For
example, a transaction system may hold the most recent address of a customer,
where a data warehouse can hold all addresses associated with a customer.
Non-volatile: Once data is in the
data warehouse, it will not change. So, historical data in a data warehouse
should never be altered
.
A data warehouse is a
copy of transaction data specifically structured for query and analysis. - Ralph Kimball
We will know how the data warehouse
going to build in general
Any database or data warehouse
development starts and ends with below steps.
Requirement gathering - Gathering requirements
from business documents, discussion with functional teams, analysts and end
user to identify reporting / analysis requirements. What they actually
looking for, which data they required for analyzing etc. In this phase they
will do the recovery plan for data warehouse in case of disasters.
Data Modeling - This model is the foundation of any data warehouse. By the
requirements gathered in first step, conceptual model will be derived. Based on
the conceptual model Logical model will be designed. Finally it turns into
physical data model.
e.g.,
Conceptual Model - Includes all major entities, relationships and
does not contain much detailed level of information about attributes.
Logical Model - A logical data model describes the data in as much
detail as possible, without regard to how they will be physical implemented in
the database. Includes all required entities, relationships, attributes, key
groups that represent business information and define business rules.
Physical Model - How the model will be built in the database. A physical
database model shows all table structures, including column name, column data
type, column constraints, primary key, foreign key, and relationships between
tables.
Once the data modeling has been done the next phase is the Schema design.
Here Schema is nothing but structure
of the database. How the table structure should be to fit for the relationships
between dimensions and facts tables to facilitate for query performance. This
will be constructed by dividing the tables.
Star Schema:
A star schema is the one in which a central fact table is sourrounded by
denormalized dimensional tables. A star schema can be simple or complex. A
simple star schema consists of one fact table where as a complex star schema
have more than one fact table.
Snow Flake Schema:
A snow flake schema is an enhancement of star schema by adding additional
dimensions. Snow flake schema are useful when there are low cardinality
attributes in the dimensions.
Galaxy/ Integrated/ Hybride Schema:
Galaxy schema contains many fact tables with some common dimensions (conformed
dimensions). This schema is a combination of many data marts.
Fact Constellation Schema:
The dimensions in this schema are segregated into independent dimensions based
on the levels of hierarchy. For example, if geography has five levels of
hierarchy like teritary, region, country, state and city; constellation schema
would have five dimensions instead of one.
The schema's are formed by the fact and dimension tables. There are many
types of dimension tables are there which you can find in google. Here I am
explaining only few.
Dimension Table
Dimension tables are used to describe descriptive information (dimension) of
the field of a fact table. Think
dimensions as things or objects. If you take product as example, product should
have some dimensions like shape, colour, name etc.
e.g., The time dimension would
contain every hour, day, week, month, quarter and year that has occurred since
you started your business operations.
Conformed Dimension:
Conformed Dimensions are dimensions that has the same values for all areas of
the business.
e.g., Suppose you have a “shipping” data mart (telling you what you’ve shipped
to whom and when) and a “sales” data mart (telling you who has purchased what
and when). Both marts require a “customer” dimension and a “time” dimension. Since
April usually means the same thing to every department in the organization.
Junk Dimension
The junk dimension is simply a structure which is used to store un wanted
data which is not useful for report generating purpose. the junk dimensions
like text or flags, non-generic comments or yes/no or true/false indicators.
e.g., consider a trade company which consists of fact about trading that
take places in a share trading firm. In these there may be some facts like mode of trade(which indicates
whether the user is trading through
phone or online)which will be not related to any of the dimensions such as
account,date,indices,amount of share etc.
so these unrelated facts are removed from the fact table and stored as a
separate dimension as a junk dimension which will be useful to provide extra
information.
Degenerated Dimension
As The name interprets , Its not a Dimension. But rather can
be Quoted as a primary key of the fact, This key is not a foreign key to any
other dimensions rather belongs solely as a primary key to the Fact.
e.g., Let's take example of a Sales Invoices. Each invoice
contains the following details
Invoice Date
Customer ID
Products within the Invoice
Reference to Order Number(s)
Invoice Number
Invoice Line Numbers (which are multiple lines in single Invoice)
Invoice Line Amount
Invoice Total Amount
When we model the
above following Dimensional Modeling standards, we get following distinct Dimensions:
Time Dimension -
representing the Invoice Date
Customer Dimension - representing Customer ID
Product Dimension - representing Products within the Invoice
Order Dimension - representing Orders
Invoice Dimension representing Invoice Number & Invoice Line Numbers
We can observe
that Only candidate attributes are Invoice Number and Invoice Line Numbers. But,
this is at the granularity of the FACT, which stores references to all above said
Dimensions as well as the measures i.e. Invoice Line Amount, Invoice Total
Amount. We can degenerate these two into
facts.
Def: If a table contains the values, which
are neither dimension nor measures is called degenerate dimensions.
Ex : invoice number,
empno
Slowly Changing Dimensions
Dimensions that changing over time are known as SCD. Prices of products, names of people or address may change over time for some reason. These are few examples in SCD.
These are categorized into three types namely SCD1, SCD2 and SCD3.
SCD1:
The new record replaces the original record. No trace of the old record exists.In other words no history is kept.
Ex:
The below one is the Original table
Key
|
Name
|
State
|
1001
|
Tanuj
|
A.P
|
After Tanuj moved from AP to Karnataka, the new information
replaces the new record, and we have the following table:
Key
|
Name
|
State
|
1001
|
Tanuj
|
Karnataka
|
Advantages:
- This is the easiest way to handle the SCD problem, since
there is no need to keep track of the old information.
Disadvantages:
- All history is lost. By applying this methodology, it is
not possible to trace back in history. For example, in this case, the company
would not be able to know that Tanuj lived in AP before.
SCD2: Creating an another additional record. Therefore, both the original and the new record will be
present. The newe record gets its own primary key.
The below one is the Original table
Key
|
Name
|
State
|
1001
|
Tanuj
|
A.P
|
After Tanuj moved from AP to Karnataka, the new information added
as a new record.
Key
|
Name
|
State
|
1001
|
Tanuj
|
A.P
|
1002
|
Tanuj
|
Karnataka
|
Advantages:
- This allows us to accurately keep all historical
information.
Disadvantages:
- This will cause the size of the table to grow fast. In
cases where the number of rows for the table is very high to start with,
storage and performance can become a concern.
- This necessarily complicates the ETL process
SCD3: Adding two columns
to indicate the particular attribute of
interest, one indicating the original value, and one indicating the current
value.
To accommodate SCD3, we will now have the following columns:
(Customer Key, Name, Original
State, Current
State, Current State)
After Tanuj moved from AP to KA, the original information gets updated, and
we have the following table (assuming the effective date of change is January
15, 2010):
Key
|
Name
|
Original
State
|
Current
State
|
Current
State
|
1001
|
1001
|
A.P
|
Karnataka
|
Jan -15 - 2010
|
Advantages:
- This does not increase the size of the table, since new information is
updated. - This allows us to keep some part of history
.
Disadvantages:
- SCD3 will not be able to keep all history where an
attribute is changed more than once. For example, if Tanuj later moves to Kerala
on December 15, 2010, the Karnataka information will be lost.
Fact Table
Fact tables contain keys to dimension tables as well as measurable facts.
e.g .,If you take same example product, which can have quantity, weight,
price etc . All these values are measurable.
Fact table store different types of measures.
Additive facts: Measures that can be added across dimensions.
e.g.,
Date
|
Store
|
Product
|
Sales_amount
|
The purpose of this table is to record the sales for each product in each store on a daily basis. Sales_Amount is
the fact. In this case, Sales_Amount is an additive fact, because you
can sum up this fact along any of the three dimensions present in the fact
table -- date, store, and product. For example, the sum of Sales_Amount
for all 7 days in a week represent the total sales amount for that week.
Semi Additive facts: Measures can be added across few dimensions not with
others
e.g.,
Date
|
Account
|
Current_bal
|
Profit_margin
|
Current Balance of single
account cannot be added for month. Here current balance will be for the particular day. At end of the each day you can sum up current balance. If you try to sum up current balance for month it will give wrong data, because day by day the balance will change. There will be no meaning, but can be summed
up across all the accounts.
Non Additive facts: Measures cannot be added across dimensions
e.g., Avg, Ratio, variance &
Percentage
Hence, the relation between a fact and
dimension table is one to many.
Factless Fact Table
A factless fact table is a fact table that does not have any measures.
e.g.,
How many students attended a particular class on a particular day?
One can easily answer the above questions
with the use of factless fact table. In the above question we can model three types of facts. Class ID, Date ID and Student ID.