Tuesday, March 29, 2011

Data integration

Data integration: retrieve data from different sources and assemble it in a unified way.
Data integration focuses mainly on databases. A database is an organized collection of data. It's similar to a file system, which is an organizational structure for files so they're easy to find, access and manipulate.
There's the common data storage method, also known as data warehousing. Using this method, all the data from the various databases you intend to integrate are extracted, transformed and loaded. That means that the data warehouse first pulls all the data from the various data sources. Then, the data warehouse converts all the data into a common format so that one set of data is compatible with another. Then it loads this new data into its own database. When you submit your query, the data warehouse locates the data, retrieves it and presents it to you in an integrated view.

A data warehouse is a database that stores information from other databases using a common format.

Descriptions of data are called metadata. Metadata is useful for naming and defining data as well as describing the relationship of one set of data to other sets. Data integration systems use metadata to locate the information relevant to queries.

The warehouse must have a database large enough to store data gathered from multiple sources. Some data warehouses include an additional step called a data mart. The data warehouse takes over the duties of aggregating data, while the data mart responds to user queries by retrieving and combining the appropriate data from the warehouse.

One problem with data warehouses is that the information in them isn't always current. That's because of the way data warehouses work -- they pull information from other databases periodically. If the data in those databases changes between extractions, queries to the data warehouse won't result in the most current and accurate views. If the data in a system rarely changes, this isn't a big deal. For other applications, though, it's problematic.