
This Data Warehousing site aims to help people get a good high-level understanding of what it takes to implement a successful data warehouse project. A lot of the information is from my personal experience as a business intelligence professional, both as a client and as a vendor. This site is divided into five main areas.

- Tools:

(As the old Chinese saying goes, "To accomplish a goal, make sure the proper tools are selected." This is especially true when the goal is to achieve business intelligence. Given the complexity of the data warehousing system and the cross-departmental implications of the project, it is easy to see why the proper selection of business intelligence software and personnel is very important. This section discusses these selections, grouped by topic.

General Considerations

When we evaluate which business intelligence tool to use, the first determination is the Buy vs. Build decision. We can use the following table to compare the two approaches.
Categories compared for Buy vs. Build:

- Cost
- Implementation Time
- Documentation
- Functionality / Features
- Tailored for the exact needs
- Reliance on third-party

Clearly, each approach has its own advantages and disadvantages, and it is often wise to consider each component individually. For example, it is clearly not viable to write a relational database from scratch. Therefore, we may have a case where the hardware and the database are bought, but other tools are built in-house. In general, deciding which approach to take depends on the following criteria:

- User technical skills
- Requirements
- Available budget
- Time

Because each tool in the business intelligence arena has different functionalities, the criteria for the "Buy vs. Build" decision are different for each type. We will get into a more detailed discussion for each tool later.

Should we decide to purchase an existing third-party business intelligence tool, we must then decide which one to buy. Often there are a number of choices to pick from, some well known and others less so. In addition to tool functionalities, which we will discuss in the following sections, there are several considerations that we should take into account when evaluating tool vendors in general:

Tool Vendor's Stability: More than anything else, this is probably the most important measure. In my opinion, this is even more important than the current functionalities that the tool itself provides, for the simple reason that if the company is going to be around for a while, it will be able to make enhancements to its business intelligence tool. On the other hand, if the company is likely to be out of business in six months, then it doesn't matter that it has state-of-the-art features, because sooner or later these features will be out of date. Some ways to gauge a company's stability are:

- What type of office space is it occupying? Is it wasting money by renting the most expensive office space in the area just so that it can be noticed? Or is it plowing all its money back into R&D so that the product can be improved?
- The background of senior management. The company might be new, but if it has seasoned veterans from major companies like IBM, Oracle, and Microsoft, to name a few, it is more likely to be successful because top management has seen how it's done right.

Support: What type of support is offered? It is industry standard for vendors to charge an annual support fee that is 15-20% of the software product license. Will any software issues be handled promptly?

Professional Services: This includes consulting and education. What type of consulting proposal does the vendor give? Are the personnel requirements and consulting rates reasonable? Is the vendor going to put in someone fresh out of college and charge $200/hr for that person? It might be wise to speak with members of the consulting team before signing on the dotted line. On the education front, what type of training is available? And how willing is the consulting team to do knowledge transfer? Does the consulting team purposely hold back information so that either 1) you will need to send more people to the vendor's education classes, or 2) you will need to hire additional consultants to make any changes to the system?
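To put the 15-20% annual support fee mentioned above in perspective, here is a minimal sketch of the arithmetic. The license price and the five-year horizon are made-up numbers for illustration, and the function name is hypothetical; a real quote may price support differently.

```python
def license_plus_support(license_cost: float, support_rate: float, years: int) -> float:
    """One-time license cost plus a yearly support fee charged as a fraction of it."""
    if not 0 <= support_rate <= 1:
        raise ValueError("support_rate must be a fraction between 0 and 1")
    return license_cost + license_cost * support_rate * years


# Hypothetical license price; compare both ends of the 15-20% range.
license_cost = 100_000.0
for rate in (0.15, 0.20):
    total = license_plus_support(license_cost, rate, years=5)
    print(f"support at {rate:.0%} for 5 years: total spend {total:,.0f}")
```

Over several years the support fee can approach the license price itself, which is one reason the quality of that support deserves scrutiny before purchase.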



Buy vs. Build

The only choices here are what type of hardware and database to purchase, as there is basically no way that one can build hardware/database systems from scratch.

Database/Hardware Selections
In selecting the database/hardware platform, there are several items that need to be carefully considered:

- Scalability: How can the system grow as your data storage needs grow? Which RDBMS and hardware platform can handle large sets of data most efficiently? To get an idea of this, one needs to determine the approximate amount of data that is to be kept in the data warehouse system once it's mature, and base any testing numbers on that.
- Parallel Processing Support: The days of multi-million dollar supercomputers with one single CPU are gone, and nowadays the most powerful computers all use multiple CPUs, where each processor can perform a part of the task, all at the same time. When I first started working with massively parallel computers in 1993, I had thought that it would be the best way for any large computations to be done within 5 years. Indeed, parallel computing is gaining popularity now, although a little slower than I had originally thought.
- RDBMS/Hardware Combination: Because the RDBMS physically sits on the hardware platform, there are going to be certain parts of the code that are hardware platform-dependent. As a result, bugs and bug fixes are often hardware dependent.

True Case: One of the projects I have worked on was with a major RDBMS provider paired with a hardware platform that was not so popular (at least not in the data warehousing world). The DBA constantly complained about a bug not being fixed because the support level for the particular type of hardware that the client had chosen was Level 3, which basically meant that no one in the RDBMS support organization would fix any bug particular to that hardware platform.
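The scalability point above asks for an approximate size of the mature data warehouse. A rough back-of-the-envelope estimate can be sketched as follows. The table names, row counts, row widths, and the overhead factor (for indexes, aggregates, and staging copies) are all assumptions chosen for illustration, not measurements.

```python
from dataclasses import dataclass


@dataclass
class FactTable:
    name: str
    rows_per_day: int
    bytes_per_row: int


def projected_size_gb(tables, retention_days, overhead_factor=2.0):
    """Estimate mature warehouse size in GB.

    overhead_factor is an assumed multiplier covering indexes,
    summary tables, and staging space on top of the raw rows.
    """
    raw_bytes = sum(t.rows_per_day * t.bytes_per_row for t in tables) * retention_days
    return raw_bytes * overhead_factor / 1024**3


# Hypothetical workload: two fact tables kept for three years.
tables = [
    FactTable("sales", rows_per_day=2_000_000, bytes_per_row=200),
    FactTable("web_clicks", rows_per_day=10_000_000, bytes_per_row=80),
]
estimate = projected_size_gb(tables, retention_days=3 * 365)
print(f"Projected size after 3 years: {estimate:,.0f} GB")
```

An estimate like this is only a starting point for test volumes; the real numbers should come from the requirements and from sample loads on the candidate platform.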

Popular Relational Databases

- Oracle
- Microsoft SQL Server
- IBM DB2
- Teradata
- Sybase
- MySQL

Popular OS Platforms

- Linux
- FreeBSD
- Microsoft

ETL Tools

Buy vs. Build

When it comes to ETL tool selection, it is not always necessary to purchase a third-party tool. This determination largely depends on three things:

- Complexity of the data transformation: The more complex the data transformation is, the more suitable it is to purchase an ETL tool.
- Data cleansing needs: Does the data need to go through a thorough cleansing exercise before it is suitable to be stored in the data warehouse? If so, it is best to purchase a tool with strong data cleansing functionalities. Otherwise, it may be sufficient to simply build the ETL routine from scratch.
- Data volume: Available commercial tools typically have features that can speed up data movement. Therefore, buying a commercial product is a better approach if the volume of data transferred is large.
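For a sense of what "building the ETL routine from scratch" can look like when transformations are simple and volume is small, here is a minimal sketch using only the Python standard library. The source data, the `fact_orders` table, and the cleansing rule (drop rows with no amount) are all invented for illustration, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; in practice this would be a file or a source system.
SOURCE_CSV = """order_id,customer,amount,order_date
1, alice ,19.99,2024-01-05
2,BOB,,2024-01-06
3,carol,5.00,2024-01-07
"""


def extract(text):
    """Extract: read the raw rows as dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform and cleanse: skip rows with no amount, normalize names and types."""
    clean = []
    for row in rows:
        if not row["amount"].strip():
            continue  # assumed cleansing rule: a row without an amount is unusable
        clean.append((
            int(row["order_id"]),
            row["customer"].strip().title(),
            float(row["amount"]),
            row["order_date"],
        ))
    return clean


def load(conn, rows):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, order_date TEXT)"
    )
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute(
    "SELECT order_id, customer, amount FROM fact_orders ORDER BY order_id"
).fetchall()
print(loaded)
```

A routine like this is easy to write, but every new cleansing rule, source format, and performance requirement adds hand-maintained code, which is where the three criteria above start to favor a purchased tool.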

ETL Tool Functionalities

While the selection of a database and a hardware platform is a must, the selection of an ETL tool is highly recommended but not strictly required. When you evaluate ETL tools, it pays to look for the following characteristics:

- Functional capability: This includes both the 'transformation' piece and the 'cleansing' piece. In general, typical ETL tools are geared towards either strong transformation capabilities or strong cleansing capabilities, but they are seldom very strong in both. As a result, if you know your data is going to be dirty coming in, make sure your ETL tool has strong cleansing capabilities. If you know there are going to be a lot of different data transformations, it then makes sense to pick a tool that is strong in transformation.
- Ability to read directly from your data source: Each organization has a different set of data sources. Make sure the ETL tool you select can connect directly to your source data.
- Metadata support: The ETL tool plays a key role in your metadata because it maps the source data to the destination, which is an important piece of the metadata. In fact, some organizations have come to rely on the documentation of their ETL tool as their metadata source. As a result, it is very important to select an ETL tool that works with your overall metadata strategy.

Popular Tools

- IBM WebSphere Information Integration (Ascential DataStage)
- Ab Initio
- Informatica
- Talend


OLAP Tools

Buy vs. Build

OLAP tools are geared towards slicing and dicing of the data. As such, they require a strong metadata layer, as well as front-end flexibility. Those are typically difficult features for any home-built system to achieve. Therefore, my recommendation is that if OLAP analysis is part of your charter for building a data warehouse, it is best to purchase an existing OLAP tool rather than creating one from scratch.

OLAP Tool Functionalities

Before we speak about OLAP tool selection criteria, we must first distinguish between the two types of OLAP tools, MOLAP (Multidimensional OLAP) and ROLAP (Relational OLAP).

1. MOLAP: In this type of OLAP, a cube is aggregated from the relational data source (data warehouse). When a user generates a report request, the MOLAP tool can generate the report quickly because all data is already pre-aggregated within the cube.
2. ROLAP: In this type of OLAP, instead of pre-aggregating everything into a cube, the ROLAP engine essentially acts as a smart SQL generator. The ROLAP tool typically comes with a 'Designer' piece, where the data warehouse administrator can specify the relationship between the relational tables, as well as how dimensions, attributes, and hierarchies map to the underlying database tables.
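The difference between the two approaches can be illustrated with a toy sketch. This is not how any particular OLAP product is implemented; the `sales` table, its two dimensions, and the helper names are assumptions made for the example. The MOLAP-style function aggregates everything once up front, while the ROLAP-style function generates SQL at request time.

```python
import sqlite3
from collections import defaultdict

# Hypothetical relational source standing in for the data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("East", "Widget", 100.0),
        ("East", "Gadget", 50.0),
        ("West", "Widget", 70.0),
        ("West", "Widget", 30.0),
    ],
)


def build_cube(conn):
    """MOLAP-style: pre-aggregate every combination of the two dimensions.

    None in a key position means "all values" of that dimension.
    """
    cube = defaultdict(float)
    for region, product, amount in conn.execute(
        "SELECT region, product, amount FROM sales"
    ):
        cube[(region, product)] += amount
        cube[(region, None)] += amount
        cube[(None, product)] += amount
        cube[(None, None)] += amount
    return dict(cube)


ALLOWED_DIMENSIONS = {"region", "product"}


def rolap_sql(dimensions):
    """ROLAP-style: generate the aggregate query when the report is requested."""
    unknown = set(dimensions) - ALLOWED_DIMENSIONS
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    if not dimensions:
        return "SELECT SUM(amount) FROM sales"
    cols = ", ".join(dimensions)
    return f"SELECT {cols}, SUM(amount) FROM sales GROUP BY {cols} ORDER BY {cols}"


cube = build_cube(conn)
molap_answer = cube[("West", None)]                          # dictionary lookup only
rolap_rows = conn.execute(rolap_sql(["region"])).fetchall()  # query runs at request time
print(molap_answer, rolap_rows)
```

The cube answers with a lookup but must be rebuilt when the source changes and grows with every added dimension; the SQL generator always reads current detail data but pays the aggregation cost on each request.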

Right now, there is a convergence between the traditional ROLAP and MOLAP vendors. ROLAP vendors recognize that users want their reports fast, so they are implementing MOLAP functionalities in their tools; MOLAP vendors recognize that many times it is necessary to drill down to the most detailed level of information, a level that traditional cubes do not reach for performance and size reasons.

So what are the criteria for evaluating OLAP vendors? Here they are:

- Ability to leverage parallelism supplied by RDBMS and hardware: This would greatly increase the tool's performance, and help load the data into the cubes as quickly as possible.
- Performance: In addition to leveraging parallelism, the tool itself should be quick both in terms of loading the data into the cube and reading the data from the cube.
- Customization efforts: More and more, OLAP tools are used as advanced reporting tools, especially in ROLAP implementations. In such cases, the ease of front-end customization becomes an important factor in the tool selection process.
- Security Features: Because OLAP tools are geared towards a number of users, making sure people see only what they are supposed to see is important. By and large, all established OLAP tools have a security layer that can interact with the common corporate login protocols. There are, however, cases where large corporations have developed their own user authentication mechanism and have a "single sign-on" policy. For these cases, having a seamless integration between the tool and the in-house authentication can require some work. I would recommend that you have the tool vendor team come in and make sure that the two are compatible.
- Metadata support: Because OLAP tools aggregate the data into the cube and sometimes serve as the front-end tool, it is essential that the tool works with the metadata strategy/tool you have selected.

Popular Tools

- Business Objects
- Cognos
- Hyperion
- Microsoft Analysis Services
- MicroStrategy
- Pentaho
- Palo OLAP Server

Reporting Tools

Buy vs. Build

There is a wide variety of reporting requirements, and whether to buy or build a reporting tool for your business intelligence needs depends heavily on the type of requirements. Typically, the determination is based on the following:

- Number of reports: The higher the number of reports, the more likely that buying a reporting tool is a good idea. This is not only because reporting tools typically make creating new reports easier (by offering re-usable components), but also because they already have report management systems that make maintenance and support functions easier.
- Desired Report Distribution Mode: If the reports will only be distributed in a single mode (for example, email only, or over the browser only), we should then strongly consider the possibility of building the reporting tool from scratch. However, if users will access the reports through a variety of different channels, it would make sense to invest in a third-party reporting tool that already comes packaged with these distribution modes.
- Ad Hoc Report Creation: Will the users be able to create their own ad hoc reports? If so, it is a good idea to purchase a reporting tool. These tool vendors have accumulated extensive experience and know the features that are important to users who are creating ad hoc reports. A second reason is that the ability to allow for ad hoc report creation necessarily relies on a strong metadata layer, and it is simply difficult to come up with a metadata model when building a reporting tool from scratch.

Reporting Tool Functionalities

Data is useless if all it does is sit in the data warehouse. As a result, the presentation layer is of very high importance. Most of the OLAP vendors already have a front-end presentation layer that allows users to call up pre-defined reports or create ad hoc reports. There are also several report tool vendors. Either way, pay attention to the following points when evaluating reporting tools:

- Data source connection capabilities: In general there are two types of data sources: one is the relational database, the other is the OLAP multidimensional data source. Nowadays, chances are good that you might want to have both. Many tool vendors will tell you that they offer both options, but upon closer inspection, it is possible that the tool is especially good with one type, while connecting to the other type of data source becomes a difficult exercise in programming.
- Scheduling and distribution capabilities: In a realistic data warehousing usage scenario, all that senior executives have time for is to come in on Monday morning and look at the most important weekly numbers from the previous week (say the sales numbers), and that's how they satisfy their business intelligence needs. All the fancy ad hoc and drilling capabilities will not interest them, because they do not touch these features. Based on the above scenario, the reporting tool must have scheduling and distribution capabilities. Weekly reports are scheduled to run on Monday morning, and the resulting reports are distributed to the senior executives either by email or web publishing. There are claims by various vendors that they can distribute reports through various interfaces, but based on my experience, the only ones that really matter are delivery via email and publishing over the intranet.
- Security Features: Because reporting tools, similar to OLAP tools, are geared towards a number of users, making sure people see only what they are supposed to see is important. Security can reside at the report level, folder level, column level, row level, or even individual cell level. By and large, all established reporting tools have these capabilities. Furthermore, they have a security layer that can interact with the common corporate login protocols. There are, however, cases where large corporations have developed their own user authentication mechanism and have a "single sign-on" policy. For these cases, having a seamless integration between the tool and the in-house authentication can require some work. I would recommend that you have the tool vendor team come in and make sure that the two are compatible.
- Customization: Every one of us has had the frustration of spending an inordinate amount of time tinkering with some office productivity tool only to make the report/presentation look good. This is definitely a waste of time, but unfortunately it is a necessary evil. In fact, a lot of times, analysts will wish to take a report directly out of the reporting tool and place it in their presentations or reports to their bosses. If the reporting tool offers them an easy way to pre-set the reports to adhere exactly to the corporate standard, it makes the analysts' jobs much easier, and the time savings are tremendous.
- Export capabilities: The most common export needs are to Excel, to a flat file, and to PDF, and a good report tool must be able to export to all three formats. For Excel, if the situation warrants it, you will want to verify that the reporting format, not just the data itself, will be exported out to Excel. This can often be a time-saver.
- Integration with the Microsoft Office environment: Most people are used to working with Microsoft Office products, especially Excel, for manipulating data. Before, people used to export the reports into Excel, and then perform additional formatting / calculation tasks. Some reporting tools now offer a Microsoft Office-like editing environment for users, so all formatting can be done within the reporting tool itself, with no need to export the report into Excel. This is a nice convenience to the users.

Popular Tools

- SAP Business Objects
- MicroStrategy
- IBM Cognos
- Actuate
- Jaspersoft
- Pentaho


Metadata Tools

Buy vs. Build

Only in the rarest of cases does it make sense to build a metadata tool from scratch. This is because doing so requires resources that are intimately familiar with the operational, technical, and business aspects of the data warehouse system, and such resources are difficult to come by. Even when such resources are available, there are often other tasks that can provide more value to the organization than building a metadata tool from scratch. In fact, the question is often whether any type of metadata tool is needed at all. Although metadata plays an extremely important role in a successful data warehousing implementation, this does not always mean that a tool is needed to keep all the "data about data." It is possible to, say, keep such information in the repository of other tools used, in text documentation, or even in a presentation or a spreadsheet. Having said the above, though, it is the author's belief that having a solid metadata foundation is one of the keys to the success of a data warehousing project. Therefore, even if a metadata tool is not selected at the beginning of the project, it is essential to have a metadata strategy; that is, a plan for how metadata in the data warehousing system will be stored.


Metadata Tool Functionalities

This is the most difficult tool to choose, because there is clearly no standard. In fact, it might be better to call this a selection of the metadata strategy. Traditionally, people have put the data modeling information into a tool such as ERWin or Oracle Designer, but it is difficult to extract information out of such data modeling tools. For example, one of the goals for your metadata selection is to provide information to the end users. Clearly this is a difficult task with a data modeling tool. So what typically happens is that additional effort is spent to create a layer of metadata that is aimed at the end users. While this allows the end users to gain the required insight into what the data and reports they are looking at mean, it is clearly inefficient because all that information already resides somewhere in the data warehouse system, whether it be the ETL tool, the data modeling tool, the OLAP tool, or the reporting tool.

There are efforts among data warehousing tool vendors to unify on a metadata model. In June of 2000, the OMG released a metadata standard called CWM (Common Warehouse Metamodel), and some of the vendors such as Oracle have claimed to have implemented it. This standard incorporates the latest technology such as XML, UML, and SOAP, and, if accepted widely, is truly the best thing that can happen to the data warehousing industry. As of right now, though, the author has not really seen that many tools leveraging this standard, so clearly it has not quite caught on yet.

So what does this mean for your metadata efforts? In the absence of anything else, I would recommend that whatever tool you choose for your metadata support supports XML, and that whatever other tool needs to leverage the metadata also supports XML. Then it is a matter of defining your DTD across your data warehousing system.
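As a minimal sketch of what exchanging metadata as XML might look like, the example below writes a few source-to-target column mappings to XML and reads them back with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`warehouse_metadata`, `column_mapping`, `source`, `target`, `description`) are invented for illustration; they are not taken from CWM or from any vendor's format. Note that `ElementTree` does not validate against a DTD, so enforcing the agreed DTD would need a separate validating parser.

```python
import xml.etree.ElementTree as ET


def mappings_to_xml(mappings):
    """Serialize source-to-target column mappings into an XML string."""
    root = ET.Element("warehouse_metadata")
    for m in mappings:
        el = ET.SubElement(
            root, "column_mapping", source=m["source"], target=m["target"]
        )
        ET.SubElement(el, "description").text = m["description"]
    return ET.tostring(root, encoding="unicode")


def xml_to_mappings(xml_text):
    """Parse the XML back into plain dictionaries another tool could consume."""
    root = ET.fromstring(xml_text)
    return [
        {
            "source": el.get("source"),
            "target": el.get("target"),
            "description": el.findtext("description", default=""),
        }
        for el in root.iter("column_mapping")
    ]


# Hypothetical mappings, as an ETL tool might document them.
mappings = [
    {
        "source": "crm.customers.cust_nm",
        "target": "dw.dim_customer.customer_name",
        "description": "Customer name, trimmed and title-cased",
    },
    {
        "source": "erp.orders.ord_amt",
        "target": "dw.fact_orders.amount",
        "description": "Order amount in USD",
    },
]

xml_text = mappings_to_xml(mappings)
round_tripped = xml_to_mappings(xml_text)
print(xml_text)
```

The point of the round trip is that the end-user-facing descriptions and the technical mappings live in one document that the ETL, OLAP, and reporting tools could all read, instead of being re-entered in each tool.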
At the same time, there is no need to worry about criteria that are typically important for the other tools, such as performance and support for parallelism, because the size of the metadata is typically small relative to the size of the data warehouse.

Data Warehouse Team Personnel


There are two areas of discussion: The first is whether to use external consultants or hire permanent employees. The second is what type of personnel is recommended for a data warehousing project.

The pros of hiring external consultants are:

1. They are usually more experienced in data warehousing implementations. The fact of the matter is, even today, people with extensive data warehousing backgrounds are difficult to find. With that, when there is a need to ramp up a team quickly, the easiest route to go is to hire external consultants.

The pros of hiring permanent employees are:

1. They are less expensive. With hourly rates for experienced data warehousing professionals running from $100/hr and up, and even more for Big-5 or vendor consultants, hiring permanent employees is a much more economical option.
2. They are less likely to leave. Consultants, whether they are on contract, via a Big-5 firm, or from one of the tool vendor firms, are likely to leave at a moment's notice. This makes knowledge transfer very important. Of course, the flip side is that these consultants are much easier to get rid of, too.

The following roles are typical for a data warehouse project:

- Project Manager: This person will oversee the progress and be responsible for the success of the data warehousing project.
- DBA: This role is responsible for keeping the database running smoothly. Additional tasks for this role may be to plan and execute a backup/recovery plan, as well as performance tuning.
- Technical Architect: This role is responsible for developing and implementing the overall technical architecture of the data warehouse, from the backend hardware/software to the client desktop configurations.
- ETL Developer: This role is responsible for planning, developing, and deploying the extraction, transformation, and loading routine for the data warehouse.
- Front End Developer: This person is responsible for developing the front-end, whether it be client-server or over the web.
- OLAP Developer: This role is responsible for the development of OLAP cubes.
- Trainer: A significant role is the trainer. After the data warehouse is implemented, a person on the data warehouse team needs to work with the end users to get them familiar with how the front end is set up so that the end users can get the most benefit out of the data warehouse system.
- Data Modeler: This role is responsible for taking the data structure that exists in the enterprise and modeling it into a schema that is suitable for OLAP analysis.
- QA Group: This role is responsible for ensuring the correctness of the data in the data warehouse. This role is more important than it appears, because bad data quality turns away users more than any other reason, and often is the start of the downfall for the data warehousing project.

The above is a list of roles, and one person does not necessarily correspond to only one role. In fact, it is very common for one person on a data warehousing team to take on multiple roles. For a typical project, it is common to see teams of 5-8 people. Any data warehousing team that contains more than 10 people is definitely bloated.

Please note that this site is vendor neutral. Some business intelligence vendor names will be mentioned, but this should not be considered an endorsement from this site.)

The selection of business intelligence tools and the selection of the data warehousing team. Tools covered are:

  - Database, Hardware
  - ETL (Extraction, Transformation, and Loading)
  - OLAP
  - Reporting
  - Metadata

- Steps: This section contains the typical milestones for a data warehousing project, from requirement gathering and query optimization to production rollout and beyond. I also offer my observations on the data warehousing field.

- Business Intelligence: Business intelligence is closely related to data warehousing. This section discusses business intelligence, as well as the relationship between business intelligence and data warehousing.

- Concepts: This section discusses several concepts particular to the data warehousing field. Topics include:

  - Dimensional Data Model
  - Star Schema
  - Snowflake Schema
  - Slowly Changing Dimension
  - Conceptual Data Model
  - Logical Data Model
  - Physical Data Model
  - Conceptual, Logical, and Physical Data Model
  - Data Integrity
  - What is OLAP
  - MOLAP, ROLAP, and HOLAP
  - Bill Inmon vs. Ralph Kimball

- Business Intelligence Conferences: Lists upcoming conferences in the business intelligence / data warehousing industry.

- Glossary: A glossary of common data warehousing terms.

This site is updated frequently to reflect the latest technology, information, and reader feedback. Please bookmark this site now.