I am going to give my understanding what ETL means and how it fits in the grand scheme of things. I find most people talking about it assume that we know the basic functions of ETL and how it fits into Data-warehousing. First, ETL stands for Extract, Transform and Load.
ETL addresses the problem of moving data from many different data sources, which is the Extract; then Transform it into data that does not duplicate information and make it consistent in format with the Data-warehouse, and Load the data into the data-warehouse. The data-warehouse usually contains data from many different sources and is accessible to many part of the organization that need such access.
An example might help. Inside a company many different departments have information in their systems on customers, suppliers, and their people. Often being able to put this data together can really provide some new insights and assistance in making decisions. However these systems generally cannot share data, the ETL process takes this data from these disparate sources and places it in a data-warehouse that contains information about customers, suppliers, and people in a consistent format so you can look at things a different ways.
The ETL process is quite a challenge and maintaining the integrity of the data in the data-warehouse and the source data. An example might be the payable system has payment history of customers and order processing has order history, the external information about financial health of a customer and other information on the customer. Putting this information in one spot and making it accessible to the account manager of that customer in a timely manner can be invaluable. The process puts this data in consistent fashion in a data-warehouse.
I understand that designing this process and implementing it in one step can be a very long and difficult process. By breaking this project into steps, that provides value to quickly, is something that Project X has been doing successfully for some time.
I hope that Stephen and Graham will review this post carefully and correct my misconceptions.
A very good article. There are still a lot challenges in ETL today even though we have some great tools. These challenges are data cleansing, data quality and exception handling. These challenges related to questions like: how to remove the dirty data? How to consolidate different format addresses into one standard address? How to automate the ETL exception handling process? When and how to implement the referential integrity? And etc. I am expecting more interesting articles.
Another challenge is that the tools in this area have really evolved in a relatively short period of time.
This is great for future development, but there is a heck of a lot of custom code out there.
And to add insult to injury, the data warehouse vendors have some pretty slick systems that can do a lot of transformation in the database without having to do large data moves.
I thought I had already commented on this but clearly I didn’t post it properly.
The description Jim gives is a good one. The day I read it the first time I was looking in to a data warehouse and at addresses. They can be excrutiating. One system might store addresses as a series of lines of text while another system would break the address down in to Street_numnber, Street_name, etc.