Tuesday, March 29, 2011

Data integration

Data integration: retrieve data from different sources and assemble it in a unified way.
Data integration focuses mainly on databases. A database is an organized collection of data. It's similar to a file system, which is an organizational structure for files so they're easy to find, access and manipulate.
The most common approach is data warehousing. Using this method, all the data from the various databases you intend to integrate is extracted, transformed and loaded: the data warehouse first pulls the data from the various data sources, then converts it all into a common format so that one set of data is compatible with another, and finally loads this new data into its own database. When you submit a query, the data warehouse locates the data, retrieves it and presents it to you in an integrated view.
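
As a rough illustration, here is a minimal JDBC sketch of that extract-transform-load cycle. The connection URLs, credentials and table names (orders, fact_sales) are invented for the example; a real warehouse load would also batch inserts and handle errors.

import java.sql.*;

public class WarehouseLoader
{
    public static void main(String[] args) throws SQLException
    {
        try (Connection source = DriverManager.getConnection(
                 "jdbc:mysql://src.example.com/sales", "user", "pw");
             Connection warehouse = DriverManager.getConnection(
                 "jdbc:mysql://dw.example.com/warehouse", "user", "pw"))
        {
            // Extract: pull the raw rows from one source database.
            ResultSet rs = source.createStatement()
                .executeQuery("SELECT customer, amount, sale_date FROM orders");

            PreparedStatement load = warehouse.prepareStatement(
                "INSERT INTO fact_sales (customer, amount, sale_date) VALUES (?, ?, ?)");

            while (rs.next())
            {
                // Transform: conform each row to the warehouse's common format
                // (trimmed, upper-cased names in this toy example).
                load.setString(1, rs.getString("customer").trim().toUpperCase());
                load.setBigDecimal(2, rs.getBigDecimal("amount"));
                load.setDate(3, rs.getDate("sale_date"));

                // Load: write the converted row into the warehouse's own database.
                load.executeUpdate();
            }
        }
    }
}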

A data warehouse is a database that stores information from other databases using a common format.

Descriptions of data are called metadata. Metadata is useful for naming and defining data as well as describing the relationship of one set of data to other sets. Data integration systems use metadata to locate the information relevant to queries.
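
As a toy sketch, metadata can be as simple as a catalogue mapping a logical data set name to where the data physically lives; the names and locations below are invented.

import java.util.HashMap;
import java.util.Map;

public class MetadataCatalog
{
    public static void main(String[] args)
    {
        // Each entry names a data set and records where the data physically lives.
        Map<String, String> catalog = new HashMap<>();
        catalog.put("customer_orders", "jdbc:mysql://src.example.com/sales, table orders");
        catalog.put("daily_sales", "warehouse database, table fact_sales");

        // A query about customer orders is routed using the metadata entry.
        System.out.println(catalog.get("customer_orders"));
    }
}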

The warehouse must have a database large enough to store data gathered from multiple sources. Some data warehouses include an additional layer called a data mart: the data warehouse takes over the duties of aggregating data, while the data mart responds to user queries by retrieving and combining the appropriate data from the warehouse.

One problem with data warehouses is that the information in them isn't always current. That's because of the way data warehouses work -- they pull information from other databases periodically. If the data in those databases changes between extractions, queries to the data warehouse won't result in the most current and accurate views. If the data in a system rarely changes, this isn't a big deal. For other applications, though, it's problematic.

Thursday, March 24, 2011

Load balancing

In networking, load balancing is a technique to distribute workload evenly across two or more computers, network links, CPUs, hard drives, or other resources, in order to get optimal resource utilization, maximize throughput, minimize response time, and avoid overload. Using multiple components with load balancing, instead of a single component, may increase reliability through redundancy. The load balancing service is usually provided by a dedicated program or hardware device (such as a multilayer switch or a DNS server).
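
As a sketch of the basic idea, here is round-robin, the simplest balancing policy: each incoming request is handed to the next backend in turn. The backend addresses are made up, and real balancers also track server health and load, which this omits.

import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class RoundRobinBalancer
{
    private final List<String> backends;
    private final AtomicInteger next = new AtomicInteger();

    public RoundRobinBalancer(List<String> backends)
    {
        this.backends = backends;
    }

    // Hand back the next backend in rotation; AtomicInteger keeps it thread-safe.
    public String pick()
    {
        int i = Math.floorMod(next.getAndIncrement(), backends.size());
        return backends.get(i);
    }

    public static void main(String[] args)
    {
        RoundRobinBalancer lb = new RoundRobinBalancer(
            List.of("10.0.0.1", "10.0.0.2", "10.0.0.3"));
        for (int k = 0; k < 6; k++)
        {
            System.out.println("request " + k + " -> " + lb.pick());
        }
    }
}
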
F5 Networks manufactured and sold one of the very first load balancing products, called BIG-IP. If a server went down or became overloaded, BIG-IP directed traffic away from that server to other servers that could handle the load.
F5's BIG-IP product is based on a network appliance (either virtual or physical), which runs F5's Traffic Management Operating System (TMOS), which runs on top of Linux. This appliance can then run one or more product modules (depending on the appliance selected), which provide the BIG-IP functionality.

Wednesday, March 23, 2011

LDAP

What is LDAP?
  • Lightweight Directory Access Protocol
  • Based on X.500
  • Directory service (RFC1777)
  • Stores attribute based data
  • Data is generally read more often than written
    • No transactions
    • No rollback
  • Hierarchical data structure
    • Entries are in a tree-like structure called Directory Information Tree (DIT)

Attribute abbreviations

uid  User ID
cn   Common Name
sn   Surname
l    Locality
ou   Organisational Unit
o    Organisation
dc   Domain Component
st   State or Province
c    Country

For example, the distinguished name uid=jdoe,ou=People,dc=example,dc=com combines several of these attributes.

The Lightweight Directory Access Protocol (LDAP) is an application protocol for reading and editing directories over an IP network.
A client starts an LDAP session by connecting to an LDAP server, called a Directory System Agent (DSA), by default on TCP port 389. The client then sends an operation request to the server, and the server sends responses in return. With some exceptions, the client does not need to wait for a response before sending the next request, and the server may send the responses in any order.
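
As a sketch of such a session using Java's built-in JNDI LDAP provider; the server address, base DN and search filter below are hypothetical.

import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.directory.*;

public class LdapSearch
{
    public static void main(String[] args) throws Exception
    {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        // Connect to the DSA on the default port 389 (ldaps://...:636 for SSL).
        env.put(Context.PROVIDER_URL, "ldap://ldap.example.com:389");

        DirContext ctx = new InitialDirContext(env); // opens the LDAP session

        // Send a search operation: find the entry with uid=jdoe under the People OU.
        SearchControls sc = new SearchControls();
        sc.setSearchScope(SearchControls.SUBTREE_SCOPE);
        NamingEnumeration<SearchResult> results =
            ctx.search("ou=People,dc=example,dc=com", "(uid=jdoe)", sc);

        while (results.hasMore())
        {
            SearchResult r = results.next();
            // Each result carries the entry's attributes (cn, sn, ...).
            System.out.println(r.getNameInNamespace() + " : " + r.getAttributes());
        }
        ctx.close();
    }
}
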
A common alternate method of securing LDAP communication is using an SSL tunnel. This is denoted in LDAP URLs by using the URL scheme "ldaps". The default port for LDAP over SSL is 636. The use of LDAP over SSL was common in LDAP Version 2 (LDAPv2) but it was never standardized in any formal specification.

Tuesday, March 22, 2011

Data integration

Data integration is a set of procedures, techniques, and technologies used to design and build processes that extract, restructure, move, and load data in either operational or analytic data stores either in real time or in batch mode.

Metadata is the “data” about the data; it is the business and technical definitions that give the data its meaning.

A major function of data integration is to integrate disparate data into a single view of information.

ETL Data integration
ETL is the collection and aggregation of transactional data, extracted from multiple sources and conformed into databases used for reporting and analytics.
Most of the cost and maintenance of complex data integration processing occurs in bulk data movement. ETL has experienced explosive growth in both frequency and size in the past 15 years. In the mid-1990s, pushing 30GB to 40GB of data on a monthly basis was considered a large effort; by the twenty-first century, moving a terabyte of data on a daily basis was a requirement. In addition to standard flat file and relational data formats, data integration environments now need to handle XML and unstructured data formats. With these new formats, along with the exponential growth of transactional data, multi-terabyte data integration processing environments are not unusual.

ETL: Extract, Transform and Load

ETL stands for extract, transform and load: reading data from its source, cleaning it up and formatting it uniformly, and then writing it to the target repository to be exploited.
The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, an Excel spreadsheet—even a message queue.

These processes enable companies to move data from multiple sources, reformat and cleanse it, and load it into another database, a data mart or a data warehouse for analysis, or into another operational system to support a business process.
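
For instance, here is a small sketch of one piece of the cleansing step: two sources spell the same date differently, and the transform conforms both to a single format before loading. The formats and values are invented.

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class DateCleanser
{
    // Two invented source formats: say, a CRM export and an ERP extract.
    private static final DateTimeFormatter[] SOURCE_FORMATS = {
        DateTimeFormatter.ofPattern("MM/dd/yyyy"),
        DateTimeFormatter.ofPattern("yyyy-MM-dd"),
    };

    // Try each known source format until one parses; the result is uniform.
    static LocalDate parse(String raw)
    {
        for (DateTimeFormatter f : SOURCE_FORMATS)
        {
            try
            {
                return LocalDate.parse(raw.trim(), f);
            }
            catch (Exception ignored)
            {
                // fall through to the next source format
            }
        }
        throw new IllegalArgumentException("unrecognised date: " + raw);
    }

    public static void main(String[] args)
    {
        // Both rows end up in the same uniform format.
        System.out.println(parse("03/22/2011")); // from source A
        System.out.println(parse("2011-03-22")); // from source B
    }
}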

At OTC, ETL is mostly used to transfer data from Informix to Vertica, to move data within a database, and to extract data reports from a database. In my next post I will explain how OTC uses Vertica as a market data warehouse.

Tuesday, March 8, 2011

How to use MySQL in Linux

1. sudo apt-get install mysql-client mysql-server   # install the MySQL client and server
2. sudo apt-get install php5-mysql   # the MySQL extension for PHP (only needed for PHP applications)
3. mysql -u root -p   # log in as root; prompts for the password
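
If you then want to reach the server from Java (as in the JDBC post below), a minimal connection check looks like this, assuming the MySQL Connector/J jar is on the classpath; the database name and password are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class MysqlCheck
{
    public static void main(String[] args) throws Exception
    {
        // Connect to the local server; database name and password are placeholders.
        Connection conn = DriverManager.getConnection(
            "jdbc:mysql://localhost:3306/test", "root", "your_password");

        ResultSet rs = conn.createStatement().executeQuery("SELECT VERSION()");
        if (rs.next())
        {
            System.out.println("Connected to MySQL " + rs.getString(1));
        }
        conn.close();
    }
}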

Monday, March 7, 2011

JDBC Informix

1. Add the Informix JDBC driver, ifxjdbc.jar, as an external jar on the Java Build Path.
2. Coding:
package db;

import java.sql.*;

public class jdbc_informix
{
    public static void main(String[] args)
    {
        Connection conn = null;
        String userName = "xx";
        String password = "xx";
        // URL format: jdbc:informix-sqli://host:port/database:INFORMIXSERVER=server
        String url = "jdbc:informix-sqli://db.test.ps:1111/db_name:INFORMIXSERVER=server_name";

        // Load the Informix JDBC driver.
        try
        {
            Class.forName("com.informix.jdbc.IfxDriver");
            System.out.println("Driver OK");
        }
        catch (Exception e)
        {
            System.out.println("FAILED: failed to load Informix JDBC driver.");
            return;
        }

        // Connect, run a query and print the first column of the first row.
        try
        {
            conn = DriverManager.getConnection(url, userName, password);
            System.out.println("Database connection established");

            Statement s = conn.createStatement();
            ResultSet rs = s.executeQuery("select * from table_name");

            if (rs.next())
            {
                String nameVal = rs.getString(1);
                System.out.println(nameVal);
            }
            else
            {
                System.out.println("error: no rows returned");
            }

            conn.close();
        }
        catch (SQLException e)
        {
            System.out.println("FAILED: failed to connect! " + e);
        }
    }
}