Wednesday 16 May 2012

ETL


What is ETL?

ETL is used to migrate data from one database to another, to form data marts and data warehouses and also to convert databases from one format or type to another.

As we know the data coming from different places may be unrelated. So first it should be Extract from all the sources and Tranform into a common format and load it to data warehouse.




ETL is short for extract, transform, load. These three are database functions combined into one tool to pull data out from one DB and place into another.


           Extract is the process of reading the data from different sources like Relational  databases, flat files, IMS etc. Data will be extracted to one area before transforming. Which is called Staging area.


            Transform is the process of converting the extracted data from its previous form into the required  format.  First cleansing will done on source data. Cleansing is nothing but cleaning unwanted data or junk data.

 e.g., Like nulls, zeros, zero length strings, and corrupted rows, apostrophe or a comma or an extra space.


This kind of data will affect reports generated from the data warehouse via OLAP tools. The ultimate goal of data cleansing is to improve the organization's confidence in their data.


Transformation occurs by using certain rules to meet the business and technical needs of the target database.


·         e.g., Selecting only certain columns to load (or selecting null columns not to load).


·         Translating coded values (if the source system stores 1 for male and 2 for       female in one source, A for male, B for female in other source, but the warehouse stores M for male and F for female. Hence this will make common format).


·         11th March 2007 can be stored as '03/11/07' or '11/03/07' among other formats. A data warehousing project would require the different date formats to be transformed to a uniform standard before being entered in the data warehouse




There are many rules in transformation but I mentioned only to as an example. Google if you want to know more.


               Load is the process of writing the data into the target database. The information from data sources are loaded and stored in a form of tables. There are two types of tables in the database structure: fact tables and dimensions tables. Before loading into the tables its very important to find the aggregate values.


Performance improvement will purely depends on the aggregatables. Finding the aggregates and storing into their fact tables is the first step. All related dimension are stored into the dimension tables. Aggregations will help in speeding up the query time in business decisions.


·         e.g., Sales is a numeric value in fact table. Total sales revenue is the aggregate value. Before loading aggregate values should define and move to the tables.

ETL tools available in the market as of now are

Commercial ETL Vendors


  • Informatica
  • Information Server
  • Ab Initio
  • BusinessObjects Data Integrator
  • Oracle Data Integreator
  • Oracle Warehouse Builder
  • Integration Services
  • Sybase ETL
  • SAS ETL
  • Pervasive ETL
  • Adeptia ETL
  • DBSoftLab
Open Source ETL Vendors

  • Pentaho Data Integration
  • Jasper
  • Talend
  • Clover ETL



** If you find this ETL stuff as interesting, then go deep about ETL and chose your ETL tool and be successful.**

No comments:

Post a Comment