What
is ETL?
ETL is used to migrate data from one database to
another, to form data marts and data warehouses and also to convert databases
from one format or type to another.
As we know the data coming from different places may be
unrelated. So first it should be Extract from all the sources and Tranform into
a common format and load it to data warehouse.
ETL is short for extract, transform,
load. These
three are database functions combined into one tool to pull data out from one
DB and place into another.
Extract is the
process of reading the data from different sources like Relational databases, flat files, IMS etc. Data will be extracted to one area before transforming. Which is called Staging area.
Transform
is the process of converting the extracted data from its previous
form into the required format. First cleansing will done on source data.
Cleansing is nothing but cleaning unwanted data or junk data.
e.g., Like nulls,
zeros, zero length strings, and corrupted rows, apostrophe or a comma or an
extra space.
This kind of data will affect reports generated from the
data warehouse via OLAP tools. The ultimate goal of data cleansing is to
improve the organization's confidence in their data.
Transformation occurs by using certain rules to meet the
business and technical needs of the target database.
·
e.g., Selecting
only certain columns to load (or selecting null columns not to load).
·
Translating coded
values (if the source system stores 1 for male and 2 for female in one source, A for male, B for
female in other source, but the warehouse stores M for male and F for female.
Hence this will make common format).
·
11th March 2007
can be stored as '03/11/07' or '11/03/07' among other formats. A data
warehousing project would require the different date formats to be transformed
to a uniform standard before being entered in the data warehouse
There are many rules in transformation but I mentioned only
to as an example. Google if you want to know more.
Load
is the process of writing the data into the target database. The
information from data sources are loaded and stored in a form of tables. There
are two types of tables in the database structure: fact tables and dimensions
tables. Before loading into the tables its very important to find the aggregate
values.
Performance improvement will purely depends on the
aggregatables. Finding the aggregates and storing into their fact tables is the
first step. All related dimension are stored into the dimension tables.
Aggregations will help in speeding up the query time in business decisions.
·
e.g., Sales is a
numeric value in fact table. Total sales revenue is the aggregate value. Before
loading aggregate values should define and move to the tables.
ETL tools available in the market as of now are
Commercial ETL Vendors
- Informatica
- Information Server
- Ab Initio
- BusinessObjects Data Integrator
- Oracle Data Integreator
- Oracle Warehouse Builder
- Integration Services
- Sybase ETL
- SAS ETL
- Pervasive ETL
- Adeptia ETL
- DBSoftLab
- Pentaho Data Integration
- Jasper
- Talend
- Clover ETL
** If you find this
ETL stuff as interesting, then go deep about ETL and chose your ETL tool and be
successful.**
No comments:
Post a Comment