
What is a three-tier data warehouse?

A data warehouse can be thought of as a three-tier system in which a middle tier provides usable data in a secure way to end users. On either side of this middle tier are the end users and the back-end data stores. The three tiers are:
1. The data tier
2. The application tier
3. The presentation tier

In a data warehouse, the three-tier architecture can be described as follows:
1. The source layer, where data from the operational systems lands.
2. The integration layer, where the data is stored after cleansing and transformation.
3. The dimension layer, on which the actual presentation layer stands.

Why do we use a surrogate key?

A surrogate key is a system-generated key that cannot be edited by the user. It is the primary key of a dimension table in the warehouse, typically populated from a sequence that generates the numbers for the primary key column.

Surrogate keys are system-generated integer keys. They are extremely useful when handling Type 2 data, i.e. when storing historical information. For example, consider a table in which a person and his location are stored. When the location changes and we want to keep a historical record of the old value, the new row is stored with a new surrogate key that lets us uniquely identify each version of the record. This is also why OLTP keys are not used directly in the warehouse and a separate dimension (surrogate) key is maintained.

The surrogate key is mainly used for tracking changes to the data; through it we can easily identify the most recently updated version of a record.
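As a minimal sketch of the idea, assuming a person/location dimension like the example above (table, column, and sequence names are illustrative, and sequence syntax varies by database):

    -- Illustrative only: a Type 2 dimension keyed by a surrogate key.
    -- Table, column, and sequence names are assumptions, not from any specific tool.
    CREATE SEQUENCE person_dim_seq;

    CREATE TABLE person_dim (
        person_key      INTEGER PRIMARY KEY,   -- surrogate key from the sequence
        person_id       INTEGER NOT NULL,      -- natural (OLTP) key
        person_name     VARCHAR(100),
        location        VARCHAR(100),
        effective_date  DATE,
        expiry_date     DATE,
        current_flag    CHAR(1)
    );

    -- When the person's location changes, expire the old row ...
    UPDATE person_dim
       SET expiry_date = CURRENT_DATE, current_flag = 'N'
     WHERE person_id = 42 AND current_flag = 'Y';

    -- ... and insert a new row with a new surrogate key, preserving history.
    INSERT INTO person_dim
    VALUES (NEXT VALUE FOR person_dim_seq, 42, 'Jane Doe', 'Chicago',
            CURRENT_DATE, DATE '9999-12-31', 'Y');

Because each historical version gets its own surrogate key, fact rows can point at the exact version of the dimension that was current when the transaction occurred.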

What is a metadata extension?

Informatica allows end users and partners to extend the metadata stored in the repository by associating information with individual objects in the repository. For example, when you create a mapping, you can store your contact information with the mapping. You associate information with repository metadata using metadata extensions. Informatica client applications can contain the following types of metadata extensions:

Vendor-defined: Third-party application vendors create vendor-defined metadata extensions. You can view and change the values of vendor-defined metadata extensions, but you cannot create, delete, or redefine them.

User-defined: You create user-defined metadata extensions using PowerCenter/PowerMart. You can create, edit, delete, and view user-defined metadata extensions, and you can also change their values.

Data Warehousing Concepts


The IBM Cognos tool Decision Stream, known in its latest version as Data Manager, is one of the ETL development tools available to support data delivery. Data Manager has all of the functionality needed to extract the data from the source system, create the necessary programming to transform the data, and deliver it to the warehouse.

Extract the source data
Once the BI analyst has worked with the business user to determine the requirements, an ETL programmer begins by writing a SQL statement against the source system(s) to gather all the required elements. It is a best practice to simply pull the data from the source system without attempting to make any changes to it; in the extract process, the best performance comes from a simple select statement. The Decision Stream tool calls this the data stream.

Transformations of extracted data
The temporary work table has all the data elements required to support the transformation step. The transformation step is the "meat and potatoes" of building a structured data warehouse. All of the elements extracted from the source system(s) are used in this step. The Decision Stream tool offers derivations, dimensional lookups, and functions to provide the programmer with all the tools needed to transform the data to meet the business requirements. The tool calls this the transformation model.

A derivation is a calculation type of operation. An example would be adding 19000000 to dates that are in the CYYMMDD format, or multiplying the quantity by the unit price to get a total gross price. There are many supported calculation types within a Decision Stream derivation, including but not limited to date math, type conversion, control structures such as if..then..else, and Boolean operators.

Dimensional lookups are used to create a link to the necessary control files. An example would be using the transaction date and the transaction currency to include the active foreign exchange rate for the transaction. Dimensional lookups often use the actual value of the key data to bring back the surrogate identifier of the record containing that key in the master or control files.

A function does everything else. Often a function is used at the beginning of the process to get literal or variable data used within the process, or at the end to validate that the transformations were successful. An example would be retrieving the beginning and ending date ranges used in the current process, then updating those date ranges at the end of the transformation. Comparing the total extracted against the total transformed is another common function.

Load the final delivery table(s)
Lastly, the delivery module provides the method for the data to be delivered to a table or multiple tables. While this seems like an afterthought, it is important to structure the columns so that the most-used columns are at the beginning of the delivery. It is also a best practice to avoid delivering transition data: a field that is only used as a source (such as the date in CYYMMDD format) should not be delivered to the final table. Utilizing the IBM Cognos Decision Stream tool gives the ETL developer all the necessary tools to gather, manipulate, and deliver the data needed to meet the reporting needs of the business.
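As a rough illustration of the extract, derivation, and lookup steps described above, here is a plain SQL sketch; it is not Decision Stream/Data Manager syntax, and the table and column names (src_sales, fx_rate_dim, sales_fact) are assumptions:

    -- Illustrative SQL only; not Decision Stream/Data Manager syntax.
    -- Table and column names are assumptions.

    -- Extract: a simple select against the source, with no transformation
    -- (in Decision Stream terms, the data stream).
    SELECT order_no, trans_date_cyymmdd, currency_code, quantity, unit_price
    FROM   src_sales;

    -- Transform: derivations (date math, gross price) plus a dimensional lookup
    -- that returns the surrogate key of the active foreign exchange rate.
    -- fx_rate_dim is assumed to store its validity range as YYYYMMDD integers.
    INSERT INTO sales_fact (order_no, trans_date, gross_price, fx_rate_key)
    SELECT s.order_no,
           s.trans_date_cyymmdd + 19000000   AS trans_date,    -- CYYMMDD -> YYYYMMDD
           s.quantity * s.unit_price         AS gross_price,   -- derivation
           f.fx_rate_key                                       -- dimensional lookup
    FROM   src_sales s
    JOIN   fx_rate_dim f
      ON   f.currency_code = s.currency_code
     AND   s.trans_date_cyymmdd + 19000000
           BETWEEN f.effective_yyyymmdd AND f.expiry_yyyymmdd;

In the tool these steps would be expressed as a data stream, derivations, and a dimension lookup rather than hand-written SQL, but the logic is the same.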

Testing the Data Warehouse


Testing the data warehouse and business intelligence system is critical to success. Without testing, the data warehouse could produce incorrect answers and quickly lose the faith of the business intelligence users. Effective testing requires putting together the right processes, people, and technology and deploying them in productive ways.

Data Warehouse Testing Responsibilities
Who should be involved with testing? The right team is essential to success:
- Business analysts gather and document requirements.
- QA testers develop and execute test plans and test scripts.
- Infrastructure people set up test environments.
- Developers perform unit tests of their deliverables.
- DBAs test for performance and stress.
- Business users perform functional tests, including user acceptance tests (UAT).

Business Requirements and Testing
When should your project begin to think about testing? The answer is simple: at the beginning of the project. Successful testing begins with the gathering and documentation of requirements. Without requirements there is no measure of system correctness. Expect to produce a Requirements Traceability Matrix (RTM) that cross-references data warehouse and business intelligence features to business requirements. The RTM is a primary input to the test plan.

Data Warehousing Test Plan
The test plan, typically prepared by the QA testers, describes the tests that must be performed to validate the data warehousing and business intelligence system. It describes the types of tests and the coverage of required system features. Test cases are the details that enable implementation of the test plan. A test case itemizes the steps that must be taken to test the system along with the expected results. A test execution log tracks each test along with the result (pass or fail) of each test item.

Testing Environments and Infrastructure
Multiple environments must typically be created and maintained to support the system during its lifecycle:
- Development
- QA
- Staging / Performance
- Production

These kinds of tools can facilitate testing and problem correction:
- Automated test tool
- Test data generator
- Test data masker
- Defect manager
- Automated test scripts

Unit Testing for the Data Warehouse
Developers perform tests on their deliverables during and after the development process. A unit test is performed on individual components and is based on the developer's knowledge of what should be developed. Unit testing should definitely be performed before deliverables are turned over to QA; tested components are likely to have fewer bugs.

QA Testers Perform Many Types of Tests
QA testers design and execute a number of tests:
- Integration test: test the system's operation from beginning to end, focusing on how data flows through the system. This is sometimes called "system testing" or "end-to-end testing".
- Regression test: validate that the system continues to function correctly after being changed, to avoid "breaking" the system.

Can the Data Warehouse Perform?
Tests can be designed and executed that show how well the system performs with heavy loads of data:
- Extract performance test: test the performance of the system when extracting a large amount of data.
- Transform and load performance test: test the performance of the system when transforming and loading a large amount of data. Testing with a high volume is sometimes called a "stress test".
- Analytics performance test: test the performance of the system when manipulating the data through calculations.
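As one concrete example, an integration (end-to-end) test often reconciles the row count extracted from the source with the row count loaded into the warehouse for the same business date. This is only a sketch; the table and column names (src_sales, sales_fact, trans_date) and the date value are assumptions:

    -- Illustrative integration-test query; table, column, and date values are assumptions.
    -- The row count that left the source should match the row count that reached
    -- the warehouse for the same business date.
    SELECT s.src_rows, f.fact_rows,
           CASE WHEN s.src_rows = f.fact_rows THEN 'PASS' ELSE 'FAIL' END AS result
    FROM  (SELECT COUNT(*) AS src_rows
           FROM src_sales
           WHERE trans_date = DATE '2024-01-31') s
    CROSS JOIN
          (SELECT COUNT(*) AS fact_rows
           FROM sales_fact
           WHERE trans_date = DATE '2024-01-31') f;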

Business Users Test Business Intelligence
Does the system produce the results desired by business users? The main concern is functionality, so business users perform functional tests to make sure that the system meets business requirements. The testing is performed through the user interface (UI), which includes data exploration and reporting.
- Correctness test: the system must produce correct results. The measures and supporting context need to match numbers in other systems and be calculated correctly.
- Usability test: the system should be as easy to use as possible. This involves a controlled experiment about how business users can use the business intelligence system to reach stated goals.
- Performance test: the system must be able to return results quickly without bogging down other resources.
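A correctness test can be expressed in the same reconciling style, comparing a warehouse measure with the number computed directly from the source system. Again a sketch, with assumed table, column, and date values:

    -- Illustrative correctness-test query; names and dates are assumptions.
    -- The gross sales measure in the warehouse should match the total computed
    -- directly from the source system for the same period.
    SELECT w.warehouse_total, s.source_total,
           w.warehouse_total - s.source_total AS difference
    FROM  (SELECT SUM(gross_price) AS warehouse_total
           FROM sales_fact
           WHERE trans_date BETWEEN DATE '2024-01-01' AND DATE '2024-01-31') w
    CROSS JOIN
          (SELECT SUM(quantity * unit_price) AS source_total
           FROM src_sales
           WHERE trans_date BETWEEN DATE '2024-01-01' AND DATE '2024-01-31') s;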

Business Intelligence Must Be Believed
Quality must be baked into the data warehouse or users will quickly lose faith in the business intelligence produced, and it then becomes very difficult to get people back on board. Putting the quality in requires both the testing described in this article and data quality at the source, described in the article Data Sources for Data Warehousing, to launch a successful data warehousing / business intelligence effort.

FUNDAMENTALS OF DATA WAREHOUSE TESTING


Description
This course introduces the student to the phases of testing and validation in a data warehouse or other decision support systems project. Students will learn the role of the testing process as part of a software development project, see how business requirements become the foundation for test cases and test plans, develop a testing strategy, develop audience profiles, and learn how to develop and execute effective tests, all as part of a data warehouse / decision support initiative. Students will be able to apply the data warehouse concepts in a series of related exercises that enable them to create and refine the various artifacts of testing for their data warehouse programs.

What Makes This Certified Course Unique
This ICCP-certified course provides participants with a practical, in-depth understanding of how to create accurate data models for complex business intelligence solutions. Hands-on workshops throughout the course reinforce the learning experience and provide attendees with concrete results that can be utilized in their organizations.

Course Objectives
Upon completion of this course, students will be able to:
- Review the fundamental concepts of data warehousing and its place in an information management environment
- Learn about the role of the testing process as part of software development and as part of data warehouse development
- Learn what test strategies, test plans, and test cases are and how to develop them, specifically for data warehouses and decision support systems
- Create effective test cases and scenarios based on business and user requirements for the data warehouse
- Plan and coordinate usability testing for data warehousing
- Conduct reviews and inspections for validation and verification
- Participate in the change management process and document relevant changes to decision support requirements

Prerequisites
Experience as a test analyst or business analyst, or experience in the testing process

Audience
Testing analysts, business analysts, project managers, business staff members who will participate in the testing function, data warehouse architects, and data analysts

Course Topics:

Understanding Business Intelligence
- Analyze the current state of the data warehousing industry
- Data warehousing fundamentals
- Operational data store (ODS) concepts
- Data mart fundamentals
- Defining metadata and its critical role in data warehousing and testing

Key Principles in Testing
- Introduction
- Testing concepts
- Overview of the testing and quality assurance phases

Project Management Overview
- Basic project management concepts
- Project management in software development and data warehousing
- Testing and quality assurance as part of software project management

Requirements Definition for Data Warehouses
- Requirements management workflow
- Characteristics of good requirements for decision support systems
- Requirements-based testing concepts and techniques

Audiences in Testing
- Audiences and their profiles
- User profiles
- Customer profiles
- Functional profiles
- Testing strategies by audience
- Test management overview

Risk Analysis and Testing
- Risk analysis overview for testing

Test Methods and Testing Levels
- Static vs. dynamic tests
- Black, grey and white box testing
- Prioritizing testing activities
- Testing from unit to user acceptance

Test Plans and Procedures
- Writing and managing test plans and procedures
- Test plan structure and test design specifications

Test Cases Overview
- Test case components
- Designing test scenarios for data warehouse usage
- Creating and executing test cases from scenarios

Validation and Verification
- Validating customer needs for decision support
- Tools and techniques for validation, verification and assessment

Acceptance Testing for Data Warehouses
- Ways to capture informal and formal user issues and concerns
- Test readiness review
- Iterative testing for data warehouse projects

Reviews and Walk-throughs
- Reviews versus walkthroughs
- Inspections in testing and quality assurance

Testing Traceability
- Linking tests to requirements with a traceability matrix
- Change management in decision support systems and testing

Test Execution and Documentation
- Managing the testing and quality assurance process
- Documentation for the testing process

Conclusion
- Summary, advanced exercises, and resources for further study

To learn more about how EWSolutions can provide our World-Class Training for your company, or to request a quote, please contact David Marco, our Director of Education, at DMarco@EWSolutions.com or call him at 630.920.0005 ext. 103.

Should You Use An ETL Tool?


The Extract, Transformation, and Load (ETL) system is the most time-consuming and expensive part of building a data warehouse and delivering business intelligence to your user community. A decade ago the majority of ETL systems were hand-crafted, but the market for ETL software has steadily grown and the majority of practitioners now use ETL tools in place of hand-coded systems. Does it make sense to hand-code an ETL system in 2008, or is an ETL tool a better choice? Kimball Group now generally recommends using an ETL tool, but a custom-built approach can still make sense. This article summarizes the advantages and disadvantages of ETL tools and offers advice on making the choice that's right for you.

ADVANTAGES OF ETL TOOLS

Visual flow and self-documentation. The single greatest advantage of an ETL tool is that it provides a visual flow of the system's logic. Each ETL tool presents these flows differently, but even the least appealing of these user interfaces compare favorably to custom systems consisting of stored procedures, SQL and operating system scripts, and a handful of other technologies. Ironically, some ETL tools have no practical way to print the otherwise-attractive self-documentation.

Structured system design. ETL tools are designed for the specific problem of populating a data warehouse. Although they are only tools, they do provide a metadata-driven structure to the development team. This is particularly valuable for teams building their first ETL system.

Operational resilience. Many of the home-grown ETL systems I've evaluated are fragile: they have too many operational problems. ETL tools provide functionality and practices for operating and monitoring the ETL system in production. You can certainly design and build a well-instrumented hand-coded ETL application, and ETL tool operational features are still maturing. Nonetheless, it's easier for a data warehouse / business intelligence team to build on the management features of an ETL tool to build a resilient ETL system.

Data-lineage and data-dependency functionality. We would like to be able to right-click on a number in a report and see exactly how it was calculated, where the data was stored in the data warehouse, how it was transformed, when the data was most recently refreshed, and what source system or systems underlay the numbers. Dependency is the flip side of lineage: we'd like to look at a table or column in the source system and know which ETL modules, data warehouse tables, OLAP cubes, and user reports might be affected by a structural change. In the absence of ETL standards that hand-coded systems could conform to, we must rely on ETL tool vendors to supply this functionality, though, unfortunately, few have done so to date.

Advanced data cleansing functionality. Most ETL systems are structurally complex, with many sources and targets. At the same time, requirements for transformation are often fairly simple, consisting primarily of lookups and substitutions. If you have a complex transformation requirement, for example if you need to de-duplicate your customer list, you should use a specialized tool. Most ETL tools either offer advanced cleansing and de-duplication modules (usually for a substantial additional price) or they integrate smoothly with other specialized tools. At the very least, ETL tools provide a richer set of cleansing functions than are available in SQL.

Performance. You might be surprised that performance is listed last under the advantages of ETL tools. It's possible to build a high-performance ETL system whether you use a tool or not. It's also possible to build an absolute dog of an ETL system whether you use a tool or not. I've never been able to test whether an excellent hand-coded ETL system outperforms an excellent tool-based ETL system; I believe the answer is that it's situational. But the structure imposed by an ETL tool makes it easier for an inexperienced ETL developer to build a quality system.

DISADVANTAGES OF ETL TOOLS

Software licensing cost. The greatest disadvantage of ETL tools in comparison to hand-crafted systems is the licensing cost for the ETL tool software. Costs vary widely in the ETL space, from several thousand dollars to hundreds of thousands of dollars.

Uncertainty. We've spoken with many ETL teams that are uncertain and sometimes misinformed about what an ETL tool will do for them. Some teams under-value ETL tools, believing they are simply a visual way to connect SQL scripts together. Other teams unrealistically over-value ETL tools, imagining that building the ETL system with such a tool will be more like installing and configuring software than developing an application.

Reduced flexibility. A tool-based approach limits you to the tool vendor's abilities and scripting languages.

Build a Solid Foundation

There are some over-arching themes in successful ETL system deployments regardless of which tools and technologies are used. Most important, and most frequently neglected, is the practice of designing the ETL system before development begins. Too often we see systems that just evolved without any initial planning. These systems are inefficient and slow, they break down all the time, and they're unmanageable. The data warehouse team has no idea how to pinpoint the bottlenecks and problem areas of the system.
A solid system design should incorporate the concepts described in detail in "Kimball University: The Subsystems of ETL Revisited," by Bob Becker. Good ETL system architects will design standard solutions to common problems such as surrogate key assignment. Excellent ETL systems will implement these standard solutions most of the time but offer enough flexibility to deviate from those standards where necessary. There are usually half a dozen ways to solve any ETL problem, and each one may be the best solution in a specific set of circumstances. Depending on your personality and fondness for solving puzzles, this can be either a blessing or a curse.

One of the rules you should try to follow is to write data as seldom as possible during the ETL process. Writing data, especially to the relational database, is one of the most expensive tasks that the ETL system performs. ETL tools contain functionality to operate on data in memory and guide the developer along a path that minimizes database writes until the data is clean and ready to go into the data warehouse table. However, the relational engine is excellent at some tasks, particularly joining related data. There are times when it is more efficient to write data to a table, even index it, and let the relational engine perform a join than it is to use the ETL tool's lookup or merge operators. We usually want to use those operators, but don't overlook the powerful relational database when trying to solve a thorny performance problem.

Whether your ETL system is hand-coded or tool-based, it's your job to design the system for manageability, auditability, and restartability. Your ETL system should tag all rows in the data warehouse with some kind of batch identifier or audit key that describes exactly which process loaded the data. Your ETL system should log information about its operations, so your team can always know exactly where the process is now and how long each step is expected to take. You should build and test procedures for backing out a load, and, ideally, the system should roll back transactions in the event of a midstream failure. The best systems monitor data health during extraction and transformation, and they either improve the data or issue alerts if data quality is substandard. ETL tools can help you with the implementation of these features, but the design is up to you and your team.

Should you use an ETL tool? Yes. Do you have to use an ETL tool? No. For teams building their first or second ETL system, the main advantages of visual tools are self-documentation and a structured development path. For neophytes, these advantages are worth the cost of the tool. If you're a seasoned expert, perhaps a consultant who has built dozens of ETL systems by hand, it's tempting to stick to what has worked well in the past. With this level of expertise, you can probably build a system that performs as well, operates as smoothly, and perhaps costs less to develop than a tool-based ETL system. But many seasoned experts are consultants, so you should think objectively about how maintainable and extensible a hand-crafted ETL system might be once the consultant has moved on. Don't expect to reap a positive return on investment in an ETL tool during the development of your first system. The advantages will come as that first phase moves into operation, as it's modified over time, and as your data warehouse grows with the addition of new business process models and associated ETL systems.
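To make the batch-identifier and logging advice concrete, here is a minimal sketch in generic SQL; the etl_batch control table, the audit_key column, and the staging and fact table names are assumptions rather than features of any particular ETL tool:

    -- Illustrative only; table and column names are assumptions.
    -- Each ETL run records itself in a control table ...
    CREATE TABLE etl_batch (
        audit_key    INTEGER PRIMARY KEY,
        process_name VARCHAR(100),
        started_at   TIMESTAMP,
        finished_at  TIMESTAMP,
        status       VARCHAR(20)      -- e.g. RUNNING, SUCCEEDED, FAILED
    );

    -- ... and every warehouse row carries the audit key of the run that loaded it,
    -- which makes it possible to trace, audit, or back out a specific load.
    INSERT INTO sales_fact (order_no, trans_date, gross_price, audit_key)
    SELECT order_no, trans_date, gross_price, 1001   -- 1001 = this run's audit key
    FROM   stage_sales;

    -- Backing out a failed or suspect load then becomes a targeted delete.
    DELETE FROM sales_fact WHERE audit_key = 1001;

Whether you build this by hand or let a tool manage it, the design decision to carry an audit key on every row is yours.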

Difference between Reference Data and Master Data


It is not unusual for people to use the terms reference data and master data interchangeably without understanding the differences. Let's try to understand the differences with the example of a sales transaction. A sales transaction contains information such as store, products sold, sales person, store name, sales date, customer, price, quantity, and so on. The attributes in this example can be separated into two types: factual (transactional) and dimensional information.

Price and quantity are measurable (factual) attributes of a transaction. Store, products sold, sales person, store name, sales date, and customer are dimensional attributes of a transaction. This dimensional data is already embedded in the transaction, and with these dimensional attributes we can successfully complete the transaction. Dimensional data that directly participates in a transaction is master data.

But is the list of dimensional attributes in the transaction complete? Asking a few analytical questions can help us discover the answer:
- What is the male-to-female ratio of customers purchasing at the store?
- What type of products are customers buying? For example: electronics, computers, toys.
- What type of store is it? For example: web store, brick and mortar, telesales, catalog sales.

These questions cannot be answered by the attributes in the transaction; this dimensional data is missing from the transactions. The missing dimensional data that does not directly participate in the transaction, but consists of attributes of the dimensions, is reference data.

Why is it important for an ETL person to understand the difference? Reference Data Management (RDM) was popular for a while, and then in the last few years the term Master Data Management (MDM) appeared. These terms mean different things, and they have significant implications for how the data is managed. But that will be a topic of discussion for a future post! I hope this article helps clear up at least some of the confusion.
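A minimal schema sketch makes the distinction concrete; all table and column names here are illustrative assumptions:

    -- Illustrative schema only; table and column names are assumptions.
    -- Master data: dimensional values that ride along on the transaction itself.
    CREATE TABLE sales_transaction (
        sale_id      INTEGER PRIMARY KEY,
        sale_date    DATE,            -- master data
        store_id     INTEGER,         -- master data
        product_id   INTEGER,         -- master data
        customer_id  INTEGER,         -- master data
        quantity     INTEGER,         -- factual / measurable
        price        DECIMAL(10,2)    -- factual / measurable
    );

    -- Reference data: attributes of the dimensions that never appear on the
    -- transaction but are needed to answer the analytical questions above.
    CREATE TABLE customer_dim (
        customer_id  INTEGER PRIMARY KEY,
        gender       CHAR(1)          -- reference data (male-to-female ratio)
    );

    CREATE TABLE product_dim (
        product_id   INTEGER PRIMARY KEY,
        product_type VARCHAR(50)      -- reference data (electronics, toys, ...)
    );

    CREATE TABLE store_dim (
        store_id     INTEGER PRIMARY KEY,
        store_type   VARCHAR(50)      -- reference data (web, brick and mortar, ...)
    );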

Different Types Of Data Warehouse Architectures
