
Data-Warehousing Concepts

Q. What is a Data Warehouse?

A data warehouse is a relational database that is designed for query and analysis rather than transaction processing. It holds integrated information collected from multiple sources and becomes the foundation for decision support and data analysis. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates the analysis workload from the transaction workload and enables an organization to consolidate data from several sources.

Characteristics of a Data Warehouse:

1. Subject Oriented: A data warehouse is designed to analyze a particular area of the business, e.g. Sales or Finance.
2. Integrated: A data warehouse puts data from disparate sources into a consistent format.
3. Static/Non-volatile: The data is kept for analysis, so it should not change once it has been entered into the warehouse.
4. Time Variant: Historical data has to be maintained in order to analyze business or market trends.

Q. What is a Data Mart?

A Data Mart is a subset of a data warehouse, or it can be a small data warehouse in its own right. It is a logical grouping of data warehouse dimensions and the related facts, created to meet the needs of a specific group of users or a specific set of requirements. It is a repository of data gathered from operational data and other sources that is designed to serve a particular community of knowledge workers, e.g. a costing Data Mart or a sales Data Mart.

A Data Mart tends to be tactical and aimed at meeting an immediate need, or at meeting the specific demands of a particular group of knowledge users in terms of analysis, content, presentation, and ease of use. Users of a Data Mart can expect data to be presented in terms that are familiar to them. Data Marts can be part of an enterprise-wide Data Warehouse.

Q. What are the different types of data-mart?

Data-marts are classified based on the data-sources used to build the data-mart.

The different types of data-mart are:

a) Dependent data-mart: built using the data-warehouse as a source.
b) Independent data-mart: built using the operational data sources.
c) Hybrid/Federated data-mart: built using both the data-warehouse and the operational data sources.
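As a minimal sketch of a dependent data-mart, the following Python example uses the standard sqlite3 module to build a small sales mart from a warehouse table. The table names (warehouse_sales, sales_mart), the columns and the sample rows are hypothetical and chosen only for illustration; a real data-mart would normally live in its own database and be loaded by an ETL tool.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical warehouse table holding data for every department.
    CREATE TABLE warehouse_sales (
        sale_date TEXT, department TEXT, region TEXT, amount REAL
    );
    INSERT INTO warehouse_sales VALUES
        ('2024-01-05', 'Sales',   'North', 120.0),
        ('2024-01-06', 'Sales',   'South',  80.0),
        ('2024-01-06', 'Costing', 'North',  45.0);

    -- Dependent data-mart: a subset built from the warehouse itself.
    CREATE TABLE sales_mart AS
        SELECT sale_date, region, amount
        FROM warehouse_sales
        WHERE department = 'Sales';
""")

# The mart exposes only the rows and columns its user group needs.
mart_rows = conn.execute(
    "SELECT region, amount FROM sales_mart ORDER BY region"
).fetchall()
print(mart_rows)
```

An independent data-mart would follow the same pattern but read from the operational tables directly instead of from warehouse_sales.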

Q. What is metadata?

Metadata is information about the data. It is the layer of the data warehouse that stores information about the various aspects of the warehouse: the source data, the transformed data, the date and time of data extraction, the target databases, the date and time of data loading, and how the structures and calculation rules are stored. It may also hold additional information on data sources, definitions, quality, transformations, date of last update, user access privileges, etc.

For example, in BO (Business Objects) the repository is the metadata. In an ETL tool, the metadata contains information about the source tables, target tables, transformations, mappings, etc.

The ideal situation is one in which everything in a Data Warehouse can be controlled through useful metadata: data loading, data cleansing, transformations, reporting, admin activities, security, etc.

Q. What is Data Warehousing? Why is it useful and important?

Data Warehousing is a way to convert a huge volume of data into useful information that can be used for making business decisions.

It is useful in the following situations:

a) Complex analysis
b) What-if analysis
c) Past and present trend analysis
d) Moving averages
e) Multidimensional analysis
f) Slice and dice of data
g) Drill down and drill up to different levels of data

Q. What are the differences between OLTP and OLAP (Data-Warehouse)?

The differences between OLAP and OLTP can be listed as below:

| Aspect | OLAP | OLTP |
|---|---|---|
| Definition | On-Line Analytical Processing | On-Line Transactional Processing |
| Example | Data Warehouse | ERP, legacy system |
| Data | Static. A time frame is decided for loading data into the data warehouse, so the data remains unchanged over a certain period. | Dynamic. Updates, deletions and modifications happen online, so the data is continuously changing and is not helpful for analysis or decision making. |
| History | Historical data is stored, which makes it possible to study the trend of the business over the past to help in analysis. | Old data is purged or archived. |
| Data Atomicity | Data is aggregated or summarized and stored at a higher level. | Data is stored at the most detailed (microscopic) level. |
| Normalization | Denormalized databases are used to keep detailed information in a single row. | Normalized databases. |
| Joins in queries | Fewer and easier joins, as the tables are denormalized. | More and more complex joins, as the tables are normalized. |
| User | Senior management or the sales and marketing head, who analyze business trends and make decisions. | Operational staff adding, modifying or deleting day-to-day transactions. |
| Performance | Faster and easier to use, as non-technical people can create ad-hoc reports. | Complex. |
| Read-write | Read-only data. | Data can be updated, modified or deleted. |
| Update Frequency | Updated at a fixed interval of time. | Continuously updated. |

Q. What are the major stages/steps in a Data Warehousing project?

The steps in a data warehousing project can be listed as:

a) Understanding the business
b) Understanding the present and future needs
c) Strategic planning
d) Design of the Data Warehouse
e) Extraction of data from different sources to a common staging area
f) Cleaning of data
g) Transformation of data
h) Transportation of data
i) Analysis of data (OLAP)
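The extraction, cleaning, transformation and transportation steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two source lists, the field names and the fixed currency rate are made up, and a real project would use an ETL tool rather than hand-written functions.

```python
# Hypothetical source systems delivering sales records in slightly different shapes.
source_erp = [
    {"cust": " C1 ", "amount": "100.50", "currency": "USD"},
    {"cust": "C2", "amount": "n/a", "currency": "USD"},  # bad amount
]
source_legacy = [
    {"cust": "c3", "amount": "200", "currency": "EUR"},
]

EUR_TO_USD = 1.10  # assumed fixed rate, for illustration only


def extract():
    """Extraction: pull records from all sources into a common staging area."""
    return source_erp + source_legacy


def clean(staging):
    """Cleaning: normalize customer IDs and drop rows whose amount is not numeric."""
    cleaned = []
    for row in staging:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # reject the record
        cleaned.append({**row, "cust": row["cust"].strip().upper(), "amount": amount})
    return cleaned


def transform(rows):
    """Transformation: bring all amounts into one currency (USD)."""
    result = []
    for row in rows:
        rate = EUR_TO_USD if row["currency"] == "EUR" else 1.0
        result.append({"cust": row["cust"], "amount_usd": round(row["amount"] * rate, 2)})
    return result


def load(rows, warehouse):
    """Transportation: move the transformed rows into the warehouse table."""
    warehouse.extend(rows)
    return warehouse


warehouse_sales = load(transform(clean(extract())), [])
print(warehouse_sales)
```

The record for C2 is rejected during cleaning, so only the C1 and C3 rows reach the warehouse list.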

Q. What are the different tools used in data-warehousing projects?

The tools used in data-warehousing projects can be categorized into three technical streams:

a) Database (Data Warehouse, Data Mart)
b) ETL (Extraction, Transformation, Load)
c) Reporting (OLAP tools)

Some tools and products available from different vendors are listed below:

| Category | Products |
|---|---|
| ETL Tools | ETI Extract, Informatica, IBM Visual Warehouse, Oracle Warehouse Builder, Ab Initio |
| OLAP Server | Oracle Express Server, Hyperion Essbase, IBM DB2 OLAP Server, Microsoft SQL Server OLAP Services, Seagate HOLOS, SAS/MDDB |
| OLAP Tools | Oracle Express Suite, Business Objects, Web Intelligence, SAS, Cognos Powerplay/Impromptu, KALIDO, MicroStrategy, Brio Query, MetaCube |
| Data Warehouse | Oracle, Informix, Teradata, DB2/UDB, Sybase, Microsoft SQL Server, RedBricks |
| Data Mining & Analysis | SAS Enterprise Miner, IBM Intelligent Miner, SPSS/Clementine, MARsc-Centurion |

Q. What is dimensional modeling?

It is a logical design technique used for building data-warehouses. It uses the concept of facts and dimensions, and it is intended to support end-user queries in a data-warehouse.

Dimensional modeling visualizes the data in terms of cubes. For example, suppose we want to measure the Sales of a company by Product, Customer, Location and Time. We can picture products, customers, locations and time as the axes of a cube (with four dimensions it is strictly a hypercube, since a physical cube has only x, y and z axes), and every point in the cube holds a sales value. This is a very simple way of representing the business.

Dimensional modeling is also commonly referred to as Star Schema, because the model has a large central table with many dimension tables surrounding it.

Q. What are the various available schemas for dimensional modeling?

a) Star Schema
b) Snowflake Schema
c) MultiStar Schema
d) Aggregate Schema

Q. What is a Star Schema?

The star schema is the simplest data warehouse schema. It is called a star schema because its diagram resembles a star, with points radiating from a centre. The centre of the star consists of one or more fact tables, and the points of the star are the dimension tables. The main advantage of a star schema is optimized performance: it keeps queries simple and provides fast response times because all the information about each level of a dimension is stored in one row of that dimension's table.
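A small star schema can be sketched with Python's built-in sqlite3 module. The table and column names below (fact_sales, dim_product, dim_store and their fields) and the sample rows are assumptions made for this illustration, not a prescribed design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: one denormalized row per product / per store.
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT
    );
    CREATE TABLE dim_store (
        store_key INTEGER PRIMARY KEY, store_name TEXT, city TEXT, country TEXT
    );
    -- Central fact table: foreign keys to the dimensions plus the measure.
    CREATE TABLE fact_sales (
        product_key  INTEGER REFERENCES dim_product(product_key),
        store_key    INTEGER REFERENCES dim_store(store_key),
        sales_amount REAL
    );
    INSERT INTO dim_product VALUES (1, 'Pen', 'Stationery'), (2, 'Lamp', 'Home');
    INSERT INTO dim_store VALUES (1, 'Store A', 'Pune', 'India'),
                                 (2, 'Store B', 'Lyon', 'France');
    INSERT INTO fact_sales VALUES (1, 1, 10.0), (1, 2, 15.0), (2, 1, 40.0);
""")

# Sales by category and country: exactly one join per dimension.
sales_by_category_country = conn.execute("""
    SELECT p.category, s.country, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_store   s ON s.store_key   = f.store_key
    GROUP BY p.category, s.country
    ORDER BY p.category, s.country
""").fetchall()
print(sales_by_category_country)
```

Because city and country sit in the same dim_store row, the query reaches any level of the store dimension with a single join.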

Q. What is a Snowflake Schema?

The snowflake schema is a more complex data warehouse schema than the star schema. A snowflake schema is a set of tables consisting of a single, central fact table surrounded by normalized dimension hierarchies. Each dimension level is represented in its own table. The snowflake schema implements dimensional data structures with fully normalized dimensions.
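The following sketch shows a snowflaked store dimension using Python's sqlite3 module. The store, city and country hierarchy, the table names and the sample rows are hypothetical and serve only to show one table per dimension level.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized dimension hierarchy: store -> city -> country,
    -- with one table per dimension level.
    CREATE TABLE dim_country (country_key INTEGER PRIMARY KEY, country_name TEXT);
    CREATE TABLE dim_city (
        city_key INTEGER PRIMARY KEY, city_name TEXT,
        country_key INTEGER REFERENCES dim_country(country_key)
    );
    CREATE TABLE dim_store (
        store_key INTEGER PRIMARY KEY, store_name TEXT,
        city_key INTEGER REFERENCES dim_city(city_key)
    );
    CREATE TABLE fact_sales (
        store_key INTEGER REFERENCES dim_store(store_key), sales_amount REAL
    );
    INSERT INTO dim_country VALUES (1, 'India'), (2, 'France');
    INSERT INTO dim_city VALUES (1, 'Pune', 1), (2, 'Mumbai', 1), (3, 'Lyon', 2);
    INSERT INTO dim_store VALUES (1, 'Store A', 1), (2, 'Store B', 2), (3, 'Store C', 3);
    INSERT INTO fact_sales VALUES (1, 10.0), (2, 20.0), (3, 5.0), (1, 2.5);
""")

# Reaching the country level takes a chain of joins through the hierarchy.
sales_by_country = conn.execute("""
    SELECT co.country_name, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_store   st ON st.store_key   = f.store_key
    JOIN dim_city    ci ON ci.city_key    = st.city_key
    JOIN dim_country co ON co.country_key = ci.country_key
    GROUP BY co.country_name
    ORDER BY co.country_name
""").fetchall()
print(sales_by_country)
```

The trade-off is visible in the query: the country name is stored only once per country, but every report at country level pays for two extra joins.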

Q. What is a MultiStar Schema?

A MultiStar Schema consists of several star schemas joined together to form a Data Warehouse.

Q. What is a measure/fact?

A performance indicator that is quantifiable and used to determine how well a business is operating. For example, measures can be Revenue, Revenue/Employee, Profit margin % etc.

In relational modeling, this is also called a fact.


The different types of fact are:

Additive: Data that can be aggregated by using the sum function, e.g. Sales.
Semi-additive: Data that cannot be aggregated directly over time, e.g. inventory levels, account balances.
Non-additive: Data that cannot be aggregated at all, e.g. Time.

Q. What are the different types of fact tables?

The different types of fact table can be listed as below:

a) Cumulative Fact Table: This type of fact table describes what has happened over a period of time. For example, it may describe the total sales by product by store by day. The facts in this type of fact table are mostly additive; the Sales example given above for additive facts belongs in a cumulative fact table.
b) Snapshot Fact Table: This type of fact table describes the state of things at a particular instant of time, and usually includes more semi-additive and non-additive facts. For example, the inventory and account balance figures given above for semi-additive facts belong in a snapshot fact table.

Q. What is a Fact-less Fact table?

A fact-less fact table is a fact table that contains only the primary keys of the dimension tables, present as foreign keys, and no individual fact columns.
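To make the distinction between additive and semi-additive facts described above more concrete, here is a small Python sketch. The daily rows, product names and quantities are invented for illustration.

```python
from collections import defaultdict

# Hypothetical daily rows: (day, product, units_sold, units_in_stock).
daily_rows = [
    ("Mon", "Pen", 5, 95),
    ("Tue", "Pen", 10, 85),
    ("Mon", "Lamp", 1, 19),
    ("Tue", "Lamp", 2, 17),
]

# Additive fact: units sold can be summed across every dimension, time included.
total_units_sold = sum(sold for _, _, sold, _ in daily_rows)

# Semi-additive fact: stock can be summed across products for one day ...
stock_per_day = defaultdict(int)
for day, _, _, stock in daily_rows:
    stock_per_day[day] += stock

# ... but not across days. Over time, take the closing value (or an average).
closing_stock = stock_per_day["Tue"]
misleading_sum_over_time = sum(stock_per_day.values())

print(total_units_sold, dict(stock_per_day), closing_stock, misleading_sum_over_time)
```

Summing the stock over both days gives 216 units, a number that never existed in the store room, whereas the closing stock of 102 is meaningful.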

Q. What is a dimension?

Dimensions are the different perspectives through which a person can analyze the business measures. Dimensions contain descriptive data about a business. Examples: Geography, Product, Company, Time.

Q. What are the types of dimensions? The different types of dimensions can be listed as below:

a. Conformed dimension - A dimension that can be shared by multiple fact tables is known as a conformed dimension. It has exactly the same meaning and content when it is referred to from different fact tables.
b. Junk dimension - A junk dimension is a collection of miscellaneous transactional codes, flags and/or text attributes that are unrelated to any particular dimension. In simple terms, it is a holding place for attributes that do not belong to any other dimension.
c. Degenerate dimension - A dimension that has only a single attribute. It is typically represented as a single field in the fact table. These dimensions are used when fact tables represent transactional data.

Q. What are Slowly Changing Dimensions?

Slowly Changing Dimensions (SCDs) are dimensions whose attribute values change slowly over time rather than staying fixed. Examples: the price of a Product, the address or phone number of an Employee or Customer.

Q. What are the different types of SCDs/Different methods to track changes in SCD?

The different methods/types of SCDs can be listed as below:

a) Type 1: Overwriting the old value with the new value.

In this type of SCD, the old value is lost. Example: the phone number of customer C1 changes from 2341233 to 55210456, and the old number is overwritten.

b) Type 2: Recording new values as new records.

In this type, the new values are stored in new records, so a Type 2 SCD keeps both the new and the old values. Example: the new phone number for customer C1 is recorded in a new row.

Issue in Type 2 tracking: in this example the Customer ID column now holds duplicate values, so it can no longer serve as the primary key. This is where a Surrogate Key comes into the picture: it acts as the primary key for the dimension table.

c) Type 3: Old values are stored in new columns.

In this method, when a value is updated in the dimension table, the old value is stored in a new column created for that purpose.
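The three methods can be sketched in Python using the phone number example for customer C1. The row layout is an assumption made for this illustration: the column names surrogate_key, status and previous_phone are hypothetical, and the status column is just one possible way to mark which Type 2 row is current.

```python
# Hypothetical customer dimension row; customer C1's phone number changes.
base_row = {"customer_id": "C1", "phone": "2341233"}
NEW_PHONE = "55210456"


def scd_type1(rows, customer_id, new_phone):
    """Type 1: overwrite in place; the old value is lost."""
    for row in rows:
        if row["customer_id"] == customer_id:
            row["phone"] = new_phone
    return rows


def scd_type2(rows, customer_id, new_phone):
    """Type 2: expire the current row and append a new one.

    A surrogate key keeps rows unique because customer_id now repeats.
    """
    for row in rows:
        if row["customer_id"] == customer_id and row["status"] == "current":
            row["status"] = "expired"
    next_key = max(row["surrogate_key"] for row in rows) + 1
    rows.append({"surrogate_key": next_key, "customer_id": customer_id,
                 "phone": new_phone, "status": "current"})
    return rows


def scd_type3(rows, customer_id, new_phone):
    """Type 3: keep the previous value in an extra column."""
    for row in rows:
        if row["customer_id"] == customer_id:
            row["previous_phone"] = row["phone"]
            row["phone"] = new_phone
    return rows


type1 = scd_type1([dict(base_row)], "C1", NEW_PHONE)
type2 = scd_type2([{"surrogate_key": 1, **base_row, "status": "current"}], "C1", NEW_PHONE)
type3 = scd_type3([dict(base_row)], "C1", NEW_PHONE)
print(type1, type2, type3, sep="\n")
```

Type 1 ends with one row and no history, Type 2 with two rows for C1 distinguished by surrogate key, and Type 3 with one row that remembers only the immediately preceding value.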

Q. What is a Surrogate key? Where is it used? Explain with example.

A surrogate key is a substitute for the natural primary key.

It is just a unique identifier or number for each row that can be used as the primary key of the table. The only requirement for a surrogate primary key is that it is unique for each row in the table. Data warehouses typically use a surrogate key (also known as an artificial or identity key) for the primary keys of the dimension tables. It is useful because:

a) The natural primary key (e.g. Customer Number in the Customer table) can change. Some tables have columns such as AIRPORT_NAME or CITY_NAME that the business users regard as the primary key. Not only can these values change, but indexing on a numerical value is also likely to perform better, so you could consider creating a surrogate key called, say, AIRPORT_ID. This key would be internal to the system; as far as the client is concerned, you may display only the AIRPORT_NAME.

b) Surrogate keys (SIDs) also help in tracking Slowly Changing Dimensions (SCDs). Example: as described above for Type 2 SCD, the phone number of customer C1 changes and is recorded in a new row, which makes the primary key, Customer ID, a duplicate. Here a surrogate key is useful for tracking the change in the dimension. Two new columns, Surrogate Key and Status, track the change, with the Status column indicating whether the phone number is current or expired.

Q. What are OLAP, MOLAP, ROLAP and HOLAP?

OLAP: On-Line Analytical Processing. A category of applications and technologies for collecting, managing, processing and presenting multidimensional data for analysis and management purposes.

MOLAP: When OLAP is based on a multidimensional server, it is called MOLAP. It is a high-performance, multidimensional data storage format in which the data is stored on the OLAP server, and it gives the best query performance for small to medium-sized data sets. An example is Oracle Express Objects: it has a server that contains the cubes, the cubes contain the data, and this server is then used for reporting.

ROLAP: With ROLAP, data remains in the original relational tables. A separate set of relational tables is used to store and reference aggregation data. ROLAP is ideal for large databases or legacy data that is infrequently queried. Example: Business Objects, where the relations of the Data Warehouse are stored in the Repository, which is actually a set of tables in an RDBMS.

HOLAP: It is the combination of ROLAP and MOLAP. An example is HOLOS.
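The multidimensional data that OLAP tools present can be pictured with a toy in-memory cube in Python. This is only a sketch of the slice, dice and drill-up operations; the dictionary layout, dimension names and figures are assumptions and do not reflect how any particular OLAP server stores its cubes.

```python
from collections import defaultdict

# Hypothetical cube cells keyed by (product, region, year); values are sales.
cube = {
    ("Pen", "North", 2023): 100,
    ("Pen", "South", 2023): 150,
    ("Pen", "North", 2024): 120,
    ("Lamp", "North", 2024): 300,
    ("Lamp", "South", 2024): 250,
}


def slice_cube(cells, year):
    """Slice: fix one dimension (year) to a single value."""
    return {key: value for key, value in cells.items() if key[2] == year}


def dice_cube(cells, products, regions):
    """Dice: keep a sub-cube by restricting several dimensions at once."""
    return {key: value for key, value in cells.items()
            if key[0] in products and key[1] in regions}


def roll_up_to_product(cells):
    """Drill up: aggregate away region and year, leaving totals per product."""
    totals = defaultdict(int)
    for (product, _region, _year), sales in cells.items():
        totals[product] += sales
    return dict(totals)


sales_2024 = slice_cube(cube, 2024)
pens_north = dice_cube(cube, {"Pen"}, {"North"})
by_product = roll_up_to_product(cube)
print(sales_2024, pens_north, by_product, sep="\n")
```

Drilling down is the reverse of roll_up_to_product: starting from the product totals and going back to the individual region and year cells.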

Q. What is aggregation in a Data Warehouse? What is aggregate navigation?

Usually in Data Warehouses, the fact tables store data at a very low-level grain, for example the sales of each product by customer by day by store. If this is our grain statement, and the base fact table holds 10 million records, then calculating the sales of one customer over a period of 3 years means reading many records (in fact, all the records of that customer for that period). This whole process would be slow.

To improve the performance of such queries, we design special tables in the Data Warehouse that hold data at a coarser grain, i.e. at a higher level of summarization. In our case we can design a table containing the monthly sales of each customer by each product. Calculating the total sales of a customer for 3 years then becomes very easy if this table is accessed. Such tables are called aggregates, as the data in them is aggregated.

The challenge for the front-end tools is to understand when to look in these tables and when to look in the base fact table. This is known as aggregate navigation. When an OLAP tool has an aggregate navigation feature, it can automatically select the right table depending on the query fired by the user, without the user knowing anything about the background processes. For example, Business Objects handles aggregate navigation through a feature known as aggregate aware. This feature allows the Designer to refer to the aggregate table when a query at a more summarized level than the base fact is fired.

Q. What are popular OLAP tools?

| OLAP Tool | Company |
|---|---|
| Business Objects | Business Objects |
| Powerplay | Cognos |
| SAS Software | SAS |
| Seagate Info | Crystal Decisions |
| Sagent Data Mart solutions | Sagent Technologies |
| Oracle Advanced Analytic Services | Oracle |
| MicroStrategy Intelligence Server | MicroStrategy |
| Microsoft Analysis Services | Microsoft |
| Intelligent Decision Server | IBM |
| Hummingbird Pablo | Hummingbird |
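Returning to aggregates and aggregate navigation as described in the aggregation answer above, the idea can be sketched in Python. The rows, the column names and the simple "first table that has all requested columns" rule are assumptions made for this illustration; real tools such as Business Objects use their own mechanisms.

```python
# Hypothetical base fact at the grain: customer, product, day.
base_fact = [
    {"customer": "C1", "product": "Pen", "day": "2024-01-03", "sales": 10},
    {"customer": "C1", "product": "Pen", "day": "2024-01-17", "sales": 5},
    {"customer": "C1", "product": "Lamp", "day": "2024-02-02", "sales": 40},
    {"customer": "C2", "product": "Pen", "day": "2024-01-09", "sales": 7},
]


def build_monthly_aggregate(rows):
    """Pre-summarize the base fact to the grain: customer, product, month."""
    agg = {}
    for row in rows:
        key = (row["customer"], row["product"], row["day"][:7])
        agg[key] = agg.get(key, 0) + row["sales"]
    return [{"customer": c, "product": p, "month": m, "sales": s}
            for (c, p, m), s in agg.items()]


monthly_aggregate = build_monthly_aggregate(base_fact)

# Each table advertises the columns it can answer questions about,
# ordered from the most summarized table to the most detailed one.
TABLES = [
    ("monthly_aggregate", {"customer", "product", "month"}, monthly_aggregate),
    ("base_fact", {"customer", "product", "day"}, base_fact),
]


def total_sales(filters):
    """Aggregate navigation: use the first (most summarized) table that has
    every column the query filters on, then sum its sales."""
    for name, columns, rows in TABLES:
        if set(filters) <= columns:
            total = sum(row["sales"] for row in rows
                        if all(row[col] == val for col, val in filters.items()))
            return name, total
    raise ValueError("no table can answer this query")


print(total_sales({"customer": "C1"}))
print(total_sales({"customer": "C1", "day": "2024-01-17"}))
```

The first query is answered from the monthly aggregate, while the second needs day-level detail and therefore falls through to the base fact table; the caller never has to name a table.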
