The Business Intelligence Guide
   BI Strategy | BI Program | BI Projects | BI Data | BI Infrastructure | BI User Tools | BI Vendors | Resource Guides | Articles | BI Blog | BIG Bookstore

Get a FREE Sample of the
The Total BI Guide
THE LOGICAL ORGANIZATION

and receive the
TLO NEWSLETTER
Just enter your details below

Email:
Name:
 
HOME
 
Business Intelligence
BI Strategy
BI Program Guide
BI Tools
- Dashboards
- Scorecards
- Operational BI
- Analytics
BI Software Solutions
Data Management
Databases
DataWarehouses
Decision Support
Marketing Tools
Industry Solutions
Case Studies
BI Surveys & Awards
RESOURCES
ARTICLES
BI EVENTS
BI NEWS
Sitemap

About the Author

View Gail La Grouw's profile on LinkedIn

Google+ Gail La Grouw

Bookmark and Share

Data Integration


Most organizations accumulate a staggering volume of data. This data is commonly spread across multiple databases and operational systems. This distributed data architecture makes it difficult to maintain data consistency and retrieve it across the business.

By integrating your data into a unified database or data view, Business Intelligence [BI] tools can help manage the data and make it readily available to BI applications, such as scorecards, dashboards and analytic tools.

Data Integration Solutions

A typcial data integration solution will enable data developers to combine data from multiple operational systems across a company, integrate similar databases from different sources, an pull syndicated information or RSS feeds into a database.

 

ETL

Extract, Transform and Load [ETL] is a toolset that automatically performs the forementioned three actions between data source and data destination [data warehouse]:

  • Extracting data from outside sources
  • Transforming it to fit business needs
  • Loading it into the data warehouse.

Data Extraction

Data warehousing projects consolidate data from different source systems. Often, each separate system uses a different way of organizing the data [data format].

Common data source formats include:

  • Relational databases and flat files
  • Non-relational database structures - IMS, VSAM or ISAM.

Extraction converts the data into a format for transformation processing.

Transformation

During transformation, a series of rules or functions are applied to the extracted data to derive the desired data to be loaded. Some data sources will require very little manipulation of data; others may require one or more of the following transformations :

  • Selecting only certain columns to load (or selecting null columns not to load)
  • Translating coded values (e.g., if the source system stores 1 for male and 2 for female, but the warehouse stores M for male and F for female), this is called data cleansing
  • Encoding free-form values (e.g., mapping "Male" and "1" and "Mr" into M)
  • Deriving a new calculated value (e.g., sale_amount = qty * unit_price)
  • Joining together data from multiple sources (e.g., lookup, merge, etc.)
  • Summarizing multiple rows of data (e.g., total sales for each store, and for each region)
  • Generating surrogate key values
  • Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
  • Splitting a column into multiple columns (e.g., putting a comma-separated list specified as a string in one column as individual values in different columns)

Load

Loading data into the data warehouse [DW] depends on the requirements of the organization. The loading process ranges widely to include paramters such as:

Data Versioning - whether new data overwrites existing information to provide a cumulative, updated data set.

Load Schedule - whether data is added incrementally real time, or batch loaded hourly or overnight.

The loading sequence is a strategic decision based on the time available and the business needs. More complex systems maintain a history and audit trail of all changes to the data. Significant operational problems can occur with improperly designed ETL systems.

 

ETL Issues

ETL challenges include:

Expectations - the range of data values or data quality in an operational system may not be visible to the designers at the time validation and transformation rules are specified. This is avoided by completing data profiling of a source during data analysis.

Scalability - The scalability of an ETL system also needs to be established during analysis. This includes:

  • The volumes of data to be processed within Service Level Agreements, (SLAs).
  • The time available to extract from source systems may change, resulting in less time.
  • Increasing volumes of data may require designs that can scale from daily batch to intra-day micro-batch to integration with message queues or real-time change data capture (CDC) for continuous transformation and update.

Consistency - ensuring the data being uploaded is relatively consistent. Different source databases have different update cycles [minutes, days, weeks]. This means the ETL system may be required to hold back certain data until all sources are available to be synchronized. Data reconciliation may also be required between data in a source system and that of a downstream system, such as the general ledger, establishing synchronization and reconciliation points as necessary.

 

Parallel Processing

Parellel Processing is a more recent development in ETL software, improving the performance of ETL processes when dealing with large volumes of data.

There are 3 main types of ETL parallelism:

Data - splitting a single sequential file into smaller data files to provide parallel access.

Pipeline - allowing simultaneous running of several components on the same data stream. An example would be looking up a value on record 1 at the same time as adding together two fields on record 2.

Component - the simultaneous running of multiple processes on different data streams in the same job. Sorting one input file while performing a deduplication on another file would be an example of component parallelism.

All three types of parallel processing are usually combined in a single job.

 

Data Integration Products

Current Enterprise Application Integration tools include ETL tools as well as data profiling, data quality and metadata capabilities.

Microsoft SQL Server 2005 Integration Services

 

Back To Top

For The World's Leading Guide To BI Strategy, Program & Technology


Data Index | Data Defintion | Meta Data | Data Management | MDM | Data Governance | Data Cleansing | Normalization | Data Integration | Data Growth | Data Solutions

Bookmark and Share
 


THE LOGICAL ORGANIZATION

A Strategic Guide To Corporate Performance Using Business Intelligence

THE COMPLETE BI MANUAL
FOR MANAGERS & CONSULTANTS

The Logical Organization Book Cover