
In computing, online analytical processing, or OLAP (/ˈoʊlæp/), is an approach to answering multidimensional analytical (MDA) queries swiftly.[1] OLAP is part of the broader category of business intelligence, which also encompasses relational databases, report writing and data mining.[2] Typical applications of OLAP include business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas, with new applications emerging, such as agriculture.[4] The term OLAP was created as a slight modification of the traditional database term online transaction processing (OLTP).
OLAP tools enable users to analyze multidimensional data interactively from multiple perspectives. OLAP consists of three basic analytical operations: consolidation (roll-up), drill-down, and slicing and dicing.[6] Consolidation involves the aggregation of data that can be accumulated and computed in one or more dimensions. For example, all sales offices are rolled up to the sales department or sales division to anticipate sales trends. By contrast, drill-down is a technique that allows users to navigate through the details. For instance, users can view the sales by the individual products that make up a region's sales. Slicing and dicing is a feature whereby users can take out (slicing) a specific set of data of the OLAP cube and view (dicing) the slices from different viewpoints. These viewpoints are sometimes called dimensions (such as looking at the same sales by salesperson, by date, by customer, by product or by region, etc.).
Databases configured for OLAP use a multidimensional data model, allowing for complex analytical and ad hoc queries with a rapid execution time.[7] They borrow aspects of navigational databases, hierarchical databases and relational databases.
OLAP is typically contrasted to OLTP (online transaction processing), which is generally characterized by much less complex queries, in a larger volume, to process transactions rather than for the purpose of business intelligence or reporting. Whereas OLAP systems are mostly optimized for reads, OLTP has to process all kinds of queries (read, insert, update and delete).

Overview of OLAP systems


At the core of any OLAP system is an OLAP cube (also called a 'multidimensional cube' or a hypercube). It consists of numeric facts called measures that are categorized by dimensions. The measures are placed at the intersections of the hypercube, which is spanned by the dimensions as a vector space. The usual interface to manipulate an OLAP cube is a matrix interface, like pivot tables in a spreadsheet program, which performs projection operations along the dimensions, such as aggregation or averaging.
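As an illustration of such a projection, here is a minimal sketch in Python using pandas (the library choice and the tiny region/product data set are assumptions of this example):

```python
import pandas as pd

# A tiny cube as a flat table: two dimensions (region, product), one measure (sales).
data = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "product": ["Gadget", "Widget", "Gadget", "Widget"],
    "sales":   [120.0, 80.0, 200.0, 50.0],
})

# A pivot-table view projects the measure along the dimensions,
# here with summation as the projection operation.
print(data.pivot_table(values="sales", index="region",
                       columns="product", aggfunc="sum"))

# Averaging is simply a different projection along the same dimensions.
print(data.pivot_table(values="sales", index="region", aggfunc="mean"))
```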
The cube metadata is typically created from a star schema or snowflake schema or fact
constellation of tables in a relational database. Measures are derived from the records in the fact
table and dimensions are derived from the dimension tables.
Each measure can be thought of as having a set of labels, or metadata, associated with it.
A dimension is what describes these labels; it provides information about the measure.
A simple example would be a cube that contains a store's sales as a measure, and Date/Time as
a dimension. Each Sale has a Date/Time label that describes more about that sale.
For example:
Sales Fact Table
+-------------+---------+
| sale_amount | time_id |
+-------------+---------+
|     2008.10 |    1234 |----+
+-------------+---------+    |
                             |
Time Dimension               |
+---------+-------------------+
| time_id | timestamp         |
+---------+-------------------+
|    1234 | 20080902 12:35:43 |
+---------+-------------------+

Multidimensional databases

Multidimensional structure is defined as "a variation of the relational model that uses multidimensional structures to organize data and express the relationships between data".[8] The structure is broken into cubes, and the cubes are able to store and access data within the confines of each cube. "Each cell within a multidimensional structure contains aggregated data related to elements along each of its dimensions."[9] Even when data is manipulated it remains easy to access and continues to constitute a compact database format. The data still remains interrelated. The multidimensional structure is quite popular for analytical databases that use online analytical processing (OLAP) applications.[10] Analytical databases use these structures because of their ability to deliver answers to complex business queries swiftly. Data can be viewed from different angles, which gives a broader perspective of a problem, unlike other models.[11]

Aggregations
It has been claimed that for complex queries OLAP cubes can produce an answer in around 0.1%
of the time required for the same query on OLTP relational data.[12][13] The most important
mechanism in OLAP which allows it to achieve such performance is the use of aggregations.
Aggregations are built from the fact table by changing the granularity on specific dimensions and
aggregating up data along these dimensions. The number of possible aggregations is determined
by every possible combination of dimension granularities.
The combination of all possible aggregations and the base data contains the answers to every
query which can be answered from the data.[14]
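A small sketch in Python of what that combinatorial space looks like, assuming a hypothetical fact table with three single-level dimensions (so there are 2^3 = 8 possible group-bys, including the grand total):

```python
from itertools import combinations

import pandas as pd

dimensions = ["region", "product", "year"]
facts = pd.DataFrame({
    "region":  ["North", "North", "South"],
    "product": ["Gadget", "Widget", "Gadget"],
    "year":    [2007, 2008, 2008],
    "sales":   [120.0, 80.0, 200.0],
})

# Every subset of dimensions defines one possible aggregation (a "view").
# With hierarchies, each dimension would contribute one choice per level
# instead of the simple keep/drop used here.
for r in range(len(dimensions) + 1):
    for keys in combinations(dimensions, r):
        view = (facts.groupby(list(keys))["sales"].sum()
                if keys else facts["sales"].sum())  # () is the grand total
        print(keys, "->", view, sep="\n")
```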
Because usually there are many aggregations that can be calculated, often only a predetermined
number are fully calculated; the remainder are solved on demand. The problem of deciding
which aggregations (views) to calculate is known as the view selection problem. View selection
can be constrained by the total size of the selected set of aggregations, the time to update them
from changes in the base data, or both. The objective of view selection is typically to minimize
the average time to answer OLAP queries, although some studies also minimize the update time.
View selection is NP-complete. Many approaches to the problem have been explored, including greedy algorithms, randomized search, genetic algorithms and A* search.
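For instance, a greedy selection can be sketched as below. This is a toy illustration, not any product's algorithm: the dimension names, the view sizes (row counts) and the budget of three extra views are all assumptions.

```python
# Hypothetical sizes of each view, keyed by the group-by that defines it.
dims = ("region", "product", "year")
sizes = {
    dims: 1_000_000,                     # base data; always available
    ("region", "product"): 200_000,
    ("region", "year"): 150_000,
    ("product", "year"): 300_000,
    ("region",): 50,
    ("product",): 1_000,
    ("year",): 10,
    (): 1,                               # grand total
}

def answers(view, query):
    # A query can be computed from any view that retains all of its dimensions.
    return set(query) <= set(view)

def total_cost(materialized):
    # Cost of a query ~ size of the smallest materialized view that answers it.
    return sum(min(sizes[v] for v in materialized if answers(v, q))
               for q in sizes)

materialized = {dims}                    # start with only the base cuboid
for _ in range(3):                       # greedily materialize three more views
    best = min((v for v in sizes if v not in materialized),
               key=lambda v: total_cost(materialized | {v}))
    materialized.add(best)
    print("materialize", best, "-> total cost", total_cost(materialized))
```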

Types
OLAP systems have been traditionally categorized using the following taxonomy.
Multidimensional OLAP (MOLAP)
MOLAP (multi-dimensional online analytical processing) is the classic form of OLAP and is
sometimes referred to as just OLAP. MOLAP stores this data in an optimized multi-dimensional
array storage, rather than in a relational database.
Some MOLAP tools require the pre-computation and storage of derived data, such as consolidations; this operation is known as processing. Such MOLAP tools generally utilize a pre-calculated data set referred to as a data cube. The data cube contains all the possible answers to a given range of questions. As a result, they have a very fast response to queries. On the other hand, updating can take a long time depending on the degree of pre-computation. Pre-computation can also lead to what is known as data explosion.
Other MOLAP tools, particularly those that implement the functional database model, do not pre-compute derived data but make all calculations on demand, other than those that were previously requested and stored in a cache.

Advantages of MOLAP

- Fast query performance due to optimized storage, multidimensional indexing and caching.
- Smaller on-disk size of data compared to data stored in a relational database, due to compression techniques.
- Automated computation of higher-level aggregates of the data.
- It is very compact for low-dimension data sets.
- Array models provide natural indexing.
- Effective data extraction achieved through the pre-structuring of aggregated data.

Disadvantages of MOLAP

- Within some MOLAP solutions the processing step (data load) can be quite lengthy, especially on large data volumes. This is usually remedied by doing only incremental processing, i.e., processing only the data which have changed (usually new data) instead of reprocessing the entire data set.
- Some MOLAP methodologies introduce data redundancy.

Products
Examples of commercial products that use MOLAP are Cognos PowerPlay, Oracle Database OLAP Option, MicroStrategy, Microsoft Analysis Services, Essbase, TM1, Jedox, and icCube.

Relational OLAP (ROLAP)


ROLAP works directly with relational databases and does not require pre-computation. The base data and the dimension tables are stored as relational tables, and new tables are created to hold the aggregated information. It depends on a specialized schema design. This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a WHERE clause to the SQL statement, as sketched below. ROLAP tools do not use pre-calculated data cubes but instead pose the query to the standard relational database and its tables in order to bring back the data required to answer the question. ROLAP tools feature the ability to ask any question because the methodology is not limited to the contents of a cube. ROLAP also has the ability to drill down to the lowest level of detail in the database.
While ROLAP uses a relational database source, generally the database must be carefully designed for ROLAP use. A database which was designed for OLTP will not function well as a ROLAP database. Therefore, ROLAP still involves creating an additional copy of the data. However, since it is a database, a variety of technologies can be used to populate the database.
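A minimal sketch of the WHERE-clause point using Python's built-in sqlite3 module; the sales_fact table, its columns and the rows are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_fact (region TEXT, quarter TEXT, amount REAL);
    INSERT INTO sales_fact VALUES
        ('North', 'Q1', 120.0), ('North', 'Q2', 80.0),
        ('South', 'Q1', 200.0), ('South', 'Q2', 50.0);
""")

# Slicing the cube to Q1 is nothing more than a WHERE clause,
# with the aggregation computed on the fly by the relational engine.
for row in conn.execute("""SELECT region, SUM(amount)
                           FROM sales_fact
                           WHERE quarter = 'Q1'   -- the "slice"
                           GROUP BY region"""):
    print(row)
```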

Advantages of ROLAP

- ROLAP is considered to be more scalable in handling large data volumes, especially models with dimensions with very high cardinality (i.e., millions of members).
- With a variety of data loading tools available, and the ability to fine-tune the ETL code to the particular data model, load times are generally much shorter than with the automated MOLAP loads.
- The data are stored in a standard relational database and can be accessed by any SQL reporting tool (the tool does not have to be an OLAP tool).
- ROLAP tools are better at handling non-aggregatable facts (e.g., textual descriptions). MOLAP tools tend to suffer from slow performance when querying these elements.
- By decoupling the data storage from the multi-dimensional model, it is possible to successfully model data that would not otherwise fit into a strict dimensional model.
- The ROLAP approach can leverage database authorization controls such as row-level security, whereby the query results are filtered depending on preset criteria applied, for example, to a given user or group of users (SQL WHERE clause).

Disadvantages of ROLAP

- There is a consensus in the industry that ROLAP tools have slower performance than MOLAP tools. However, see the discussion below about ROLAP performance.
- The loading of aggregate tables must be managed by custom ETL code. The ROLAP tools do not help with this task. This means additional development time and more code to support.
- When the step of creating aggregate tables is skipped, query performance suffers because the larger detailed tables must be queried. This can be partially remedied by adding additional aggregate tables; however, it is still not practical to create aggregate tables for all combinations of dimensions/attributes.
- ROLAP relies on the general-purpose database for querying and caching, and therefore several special techniques employed by MOLAP tools are not available (such as special hierarchical indexing). However, modern ROLAP tools take advantage of the latest improvements in the SQL language, such as CUBE and ROLLUP operators, DB2 Cube Views, as well as other SQL OLAP extensions. These SQL improvements can mitigate the benefits of the MOLAP tools.
- Since ROLAP tools rely on SQL for all of the computations, they are not suitable when the model is heavy on calculations which don't translate well into SQL. Examples of such models include budgeting, allocations, financial reporting and other scenarios.

Performance of ROLAP
In the OLAP industry ROLAP is usually perceived as being able to scale for large data volumes, but suffering from slower query performance than MOLAP. The OLAP Survey, the largest independent survey across all major OLAP products, conducted over six years (2001 to 2006), consistently found that companies using ROLAP reported slower performance than those using MOLAP, even when data volumes were taken into consideration.
However, as with any survey, there are a number of subtle issues that must be taken into account when interpreting the results.

- The survey shows that ROLAP tools have 7 times more users than MOLAP tools within each company. Systems with more users will tend to suffer more performance problems at peak usage times.
- There is also a question about the complexity of the model, measured both in number of dimensions and richness of calculations. The survey does not offer a good way to control for these variations in the data being analyzed.

Downside of flexibility
Some companies select ROLAP because they intend to re-use existing relational database tables; these tables will frequently not be optimally designed for OLAP use. The superior flexibility of ROLAP tools allows this less-than-optimal design to work, but performance suffers. MOLAP tools, in contrast, would force the data to be re-loaded into an optimal OLAP design.

Hybrid OLAP (HOLAP)


The undesirable trade-off between additional ETL cost and slow query performance has ensured that most commercial OLAP tools now use a "Hybrid OLAP" (HOLAP) approach, which allows the model designer to decide which portion of the data will be stored in MOLAP and which portion in ROLAP.
There is no clear agreement across the industry as to what constitutes "Hybrid OLAP", except that a database will divide data between relational and specialized storage.[16] For example, for some vendors, a HOLAP database will use relational tables to hold the larger quantities of detailed data, and use specialized storage for at least some aspects of the smaller quantities of more-aggregate or less-detailed data. HOLAP addresses the shortcomings of MOLAP and ROLAP by combining the capabilities of both approaches. HOLAP tools can utilize both pre-calculated cubes and relational data sources.

Vertical partitioning
In this mode HOLAP stores aggregations in MOLAP for fast query performance, and detailed data in ROLAP to optimize the time of cube processing.

Horizontal partitioning
In this mode HOLAP stores some slice of data, usually the more recent one (i.e., sliced by the Time dimension), in MOLAP for fast query performance, and older data in ROLAP. Moreover, some dices can be stored in MOLAP and others in ROLAP, leveraging the fact that in a large cuboid there will be dense and sparse subregions.[17]

Products
The first product to provide HOLAP storage was Holos, but the technology also became available in other commercial products such as Microsoft Analysis Services, Oracle Database OLAP Option, MicroStrategy and SAP AG BI Accelerator. The hybrid OLAP approach combines ROLAP and MOLAP technology, benefiting from the greater scalability of ROLAP and the faster computation of MOLAP. For example, a HOLAP server may allow large volumes of detail data to be stored in a relational database, while aggregations are kept in a separate MOLAP store. Microsoft SQL Server 7.0 OLAP Services supported a hybrid OLAP server.

Comparison
Each type has certain benefits, although there is disagreement about the specifics of the benefits between providers.

- Some MOLAP implementations are prone to database explosion, a phenomenon causing vast amounts of storage space to be used by MOLAP databases when certain common conditions are met: high number of dimensions, pre-calculated results and sparse multidimensional data.
- MOLAP generally delivers better performance due to specialized indexing and storage optimizations. MOLAP also needs less storage space compared to ROLAP because the specialized storage typically includes compression techniques.[16]
- ROLAP is generally more scalable.[16] However, large-volume pre-processing is difficult to implement efficiently, so it is frequently skipped. ROLAP query performance can therefore suffer tremendously.
- Since ROLAP relies more on the database to perform calculations, it has more limitations in the specialized functions it can use.
- HOLAP encompasses a range of solutions that attempt to mix the best of ROLAP and MOLAP. It can generally pre-process swiftly, scale well, and offer good function support.

Other types
The following acronyms are also sometimes used, although they are not as widespread as the ones above:

- WOLAP - Web-based OLAP
- DOLAP - Desktop OLAP
- RTOLAP - Real-Time OLAP

APIs and query languages

Unlike relational databases, which had SQL as the standard query language and widespread APIs such as ODBC, JDBC and OLE DB, there was no such unification in the OLAP world for a long time. The first real standard API was the OLE DB for OLAP specification from Microsoft, which appeared in 1997 and introduced the MDX query language. Several OLAP vendors, both server and client, adopted it. In 2001 Microsoft and Hyperion announced the XML for Analysis specification, which was endorsed by most of the OLAP vendors. Since this also used MDX as a query language, MDX became the de facto standard.[18] Since September 2011, LINQ can be used to query SSAS OLAP cubes from Microsoft .NET.[19]

Products
History
The first product that performed OLAP queries was Express, which was released in 1970 (and acquired by Oracle in 1995 from Information Resources).[20] However, the term did not appear until 1993, when it was coined by Edgar F. Codd, who has been described as "the father of the relational database". Codd's paper[1] resulted from a short consulting assignment which Codd undertook for the former Arbor Software (later Hyperion Solutions, and in 2007 acquired by Oracle), as a sort of marketing coup. The company had released its own OLAP product, Essbase, a year earlier. As a result, Codd's "twelve laws of online analytical processing" were explicit in their reference to Essbase. There was some ensuing controversy, and when Computerworld learned that Codd was paid by Arbor, it retracted the article. The OLAP market experienced strong growth in the late 1990s, with dozens of commercial products going to market. In 1998, Microsoft released its first OLAP server, Microsoft Analysis Services, which drove wide adoption of OLAP technology and moved it into the mainstream.

Product comparison
OLAP Clients
OLAP clients include many spreadsheet programs like Excel, web applications, SQL-based tools, dashboard tools, etc.

Market structure
Below is a list of top OLAP vendors in 2006, with figures in millions of US dollars.

| Vendor                         | Global Revenue | Consolidated company |
|--------------------------------|----------------|----------------------|
| Microsoft Corporation          | 1,806          | Microsoft            |
| Hyperion Solutions Corporation | 1,077          | Oracle               |
| Cognos                         | 735            | IBM                  |
| Business Objects               | 416            | SAP                  |
| MicroStrategy                  | 416            | MicroStrategy        |
| SAP AG                         | 330            | SAP                  |
| Cartesis (SAP)                 | 210            | SAP                  |
| Applix                         | 205            | IBM                  |
| Infor                          | 199            | Infor                |
| Oracle Corporation             | 159            | Oracle               |
| Others                         | 152            | Others               |
| Total                          | 5,700          |                      |

Open-source

- Druid is a popular open-source distributed data store for OLAP queries that is used at scale in production by various organizations.
- Apache Kylin is a distributed data store for OLAP queries originally developed by eBay.
- Cubes is another lightweight open-source toolkit implementation of OLAP functionality in the Python programming language, with built-in ROLAP.
- Pinot is used at LinkedIn to deliver scalable real-time analytics with low latency. It can ingest data from offline data sources (such as Hadoop and flat files) as well as online sources (such as Kafka). Pinot is designed to scale horizontally.

An Online Analytical Processing (OLAP) server is based on the multidimensional data model. It allows managers and analysts to get an insight into the information through fast, consistent, and interactive access to information. This chapter covers the types of OLAP, operations on OLAP, and the differences between OLAP, statistical databases and OLTP.

Types of OLAP Servers

We have four types of OLAP servers:

- Relational OLAP (ROLAP)
- Multidimensional OLAP (MOLAP)
- Hybrid OLAP (HOLAP)
- Specialized SQL Servers

Relational OLAP
ROLAP servers are placed between the relational back-end server and client front-end tools. To store and manage warehouse data, ROLAP uses a relational or extended-relational DBMS.

ROLAP includes the following:

- Implementation of aggregation navigation logic
- Optimization for each DBMS back end
- Additional tools and services

Multidimensional OLAP
MOLAP uses array-based multidimensional storage engines for multidimensional views of data. With multidimensional data stores, the storage utilization may be low if the data set is sparse. Therefore, many MOLAP servers use two levels of data storage representation to handle dense and sparse data sets.

Hybrid OLAP (HOLAP)

Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher scalability of ROLAP and the faster computation of MOLAP. HOLAP servers allow storing large volumes of detailed information. The aggregations are stored separately in a MOLAP store.

Specialized SQL Servers


Specialized SQL servers provide advanced query language and query processing support for
SQL queries over star and snowflake schemas in a read-only environment.

OLAP Operations
Since OLAP servers are based on a multidimensional view of data, we will discuss OLAP operations on multidimensional data.
Here is the list of OLAP operations:

- Roll-up
- Drill-down
- Slice and dice
- Pivot (rotate)

Roll-up
Roll-up performs aggregation on a data cube in either of the following ways:

- By climbing up a concept hierarchy for a dimension
- By dimension reduction

Consider the following example of roll-up, performed by climbing up a concept hierarchy for the dimension location (a code sketch follows this list):

- Initially the concept hierarchy was "street < city < province < country".
- On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country.
- The data is grouped into countries rather than cities.
- When roll-up is performed, one or more dimensions from the data cube are removed.
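A minimal sketch of roll-up in Python with pandas, assuming a toy fact table held at city granularity with the city-to-country step of the hierarchy carried as a column:

```python
import pandas as pd

sales = pd.DataFrame({
    "country": ["Canada", "Canada", "Germany"],
    "city":    ["Toronto", "Vancouver", "Berlin"],
    "sales":   [400.0, 600.0, 350.0],
})

# Roll-up: climb the location hierarchy from city to country.
# The city level is aggregated away; one row remains per country.
print(sales.groupby("country", as_index=False)["sales"].sum())
```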

Drill-down
Drill-down is the reverse operation of roll-up. It is performed in either of the following ways:

- By stepping down a concept hierarchy for a dimension
- By introducing a new dimension

Consider the following example of drill-down, performed by stepping down a concept hierarchy for the dimension time (a code sketch follows this list):

- Initially the concept hierarchy was "day < month < quarter < year".
- On drilling down, the time dimension is descended from the level of quarter to the level of month.
- When drill-down is performed, one or more dimensions are added to the data cube.
- It navigates the data from less detailed data to highly detailed data.
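A matching sketch of drill-down with hypothetical pandas data: the detail is kept at month granularity, and drilling down simply descends from the quarter level to it.

```python
import pandas as pd

sales = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q1", "Q2"],
    "month":   ["Jan", "Feb", "Mar", "Apr"],
    "sales":   [100.0, 120.0, 90.0, 110.0],
})

# The rolled-up view, by quarter.
print(sales.groupby("quarter", as_index=False)["sales"].sum())

# Drill-down: descend the time hierarchy from quarter to month.
print(sales.groupby(["quarter", "month"], as_index=False)["sales"].sum())
```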

Slice
The slice operation selects one particular dimension from a given cube and provides a new sub-cube. For example (a code sketch follows):

- Slice is performed for the dimension "time" using the criterion time = "Q1".
- It forms a new sub-cube by selecting one or more dimensions.
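A minimal slice sketch in pandas; the cube data below is hypothetical:

```python
import pandas as pd

cube = pd.DataFrame({
    "time":     ["Q1", "Q1", "Q2", "Q2"],
    "location": ["Toronto", "Vancouver", "Toronto", "Vancouver"],
    "item":     ["Mobile", "Modem", "Mobile", "Modem"],
    "sales":    [605.0, 825.0, 680.0, 952.0],
})

# Slice: fix a single criterion on one dimension, time = "Q1".
# The result is a sub-cube over the remaining dimensions.
print(cube[cube["time"] == "Q1"])
```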

Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube. For example, a dice operation based on the following selection criteria involves three dimensions (a code sketch follows this list):

- (location = "Toronto" or "Vancouver")
- (time = "Q1" or "Q2")
- (item = "Mobile" or "Modem")

Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an alternative presentation of the data. For example, the item and location axes of a 2-D slice can be rotated, swapping rows and columns, as sketched below.
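A minimal pivot sketch in pandas; transposing the 2-D view is the rotation (the figures are hypothetical):

```python
import pandas as pd

slice_2d = pd.DataFrame({
    "item":     ["Mobile", "Mobile", "Modem", "Modem"],
    "location": ["Toronto", "Vancouver", "Toronto", "Vancouver"],
    "sales":    [605.0, 825.0, 512.0, 952.0],
})

# Original orientation: items on rows, locations on columns.
view = slice_2d.pivot_table(values="sales", index="item", columns="location")
print(view)

# Pivot (rotate): swap the axes, putting locations on rows and items on columns.
print(view.T)
```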

OLAP vs OLTP

| Sr.No. | Data Warehouse (OLAP) | Operational Database (OLTP) |
|--------|-----------------------|------------------------------|
| 1 | Involves historical processing of information. | Involves day-to-day processing. |
| 2 | OLAP systems are used by knowledge workers such as executives, managers and analysts. | OLTP systems are used by clerks, DBAs, or database professionals. |
| 3 | Useful in analyzing the business. | Useful in running the business. |
| 4 | It focuses on information out. | It focuses on data in. |
| 5 | Based on Star Schema, Snowflake Schema and Fact Constellation Schema. | Based on the Entity Relationship Model. |
| 6 | Contains historical data. | Contains current data. |
| 7 | Provides summarized and consolidated data. | Provides primitive and highly detailed data. |
| 8 | Provides a summarized and multidimensional view of data. | Provides a detailed and flat relational view of data. |
| 9 | Number of users is in hundreds. | Number of users is in thousands. |
| 10 | Number of records accessed is in millions. | Number of records accessed is in tens. |
| 11 | Database size is from 100 GB to 1 TB. | Database size is from 100 MB to 1 GB. |
| 12 | Highly flexible. | Provides high performance. |

Multidimensional OLAP (MOLAP) uses array-based multidimensional storage engines for multidimensional views of data. With multidimensional data stores, the storage utilization may be low if the data set is sparse. Therefore, many MOLAP servers use two levels of data storage representation to handle dense and sparse data sets.

Points to Remember:

- MOLAP tools process information with a consistent response time regardless of the level of summarizing or calculations selected.
- MOLAP tools need to avoid many of the complexities of creating a relational database to store data for analysis.
- MOLAP tools need the fastest possible performance.
- A MOLAP server adopts two levels of storage representation to handle dense and sparse data sets.
- Denser sub-cubes are identified and stored as array structures.
- Sparse sub-cubes employ compression technology.

MOLAP Architecture
MOLAP includes the following components:

- Database server
- MOLAP server
- Front-end tool

Advantages

- MOLAP allows the fastest indexing of the pre-computed summarized data.
- Helps users connected to a network who need to analyze larger, less-defined data.
- Easier to use; therefore MOLAP is suitable for inexperienced users.

Disadvantages

- MOLAP is not capable of containing detailed data.
- The storage utilization may be low if the data set is sparse.

MOLAP vs ROLAP

| Sr.No. | MOLAP | ROLAP |
|--------|-------|-------|
| 1 | Information retrieval is fast. | Information retrieval is comparatively slow. |
| 2 | Uses sparse arrays to store data sets. | Uses relational tables. |
| 3 | MOLAP is best suited for inexperienced users, since it is very easy to use. | ROLAP is best suited for experienced users. |
| 4 | Maintains a separate database for data cubes. | It may not require space other than that available in the data warehouse. |
| 5 | DBMS facility is weak. | DBMS facility is strong. |

A schema is a logical description of the entire database. It includes the name and description of records of all record types, including all associated data items and aggregates. Much like a database, a data warehouse also requires a schema to be maintained. A database uses the relational model, while a data warehouse uses the Star, Snowflake, and Fact Constellation schemas. In this chapter, we will discuss the schemas used in a data warehouse.

Star Schema

- Each dimension in a star schema is represented with only one dimension table.
- This dimension table contains the set of attributes.
- Consider the sales data of a company with respect to four dimensions, namely time, item, branch, and location.
- There is a fact table at the center. It contains the keys to each of the four dimensions.
- The fact table also contains the attributes, namely dollars sold and units sold.

Note: Each dimension has only one dimension table and each table holds a set of attributes. For example, the location dimension table contains the attribute set {location_key, street, city, province_or_state, country}. This constraint may cause data redundancy. For example, "Vancouver" and "Victoria" are both cities in the Canadian province of British Columbia. The entries for such cities cause data redundancy along the attributes province_or_state and country.

Snowflake Schema

- Some dimension tables in the Snowflake schema are normalized.
- The normalization splits up the data into additional tables.
- Unlike the Star schema, the dimension tables in a Snowflake schema are normalized. For example, the item dimension table of the star schema is normalized and split into two dimension tables, namely the item and supplier tables.
- Now the item dimension table contains the attributes item_key, item_name, type, brand, and supplier_key.
- The supplier key is linked to the supplier dimension table. The supplier dimension table contains the attributes supplier_key and supplier_type.

Note: Due to normalization in the Snowflake schema, the redundancy is reduced; therefore, it becomes easy to maintain and saves storage space.

Fact Constellation Schema

- A fact constellation has multiple fact tables. It is also known as a galaxy schema.
- Consider two fact tables, namely sales and shipping.
- The sales fact table is the same as that in the star schema.
- The shipping fact table has five dimensions, namely item_key, time_key, shipper_key, from_location, to_location.
- The shipping fact table also contains two measures, namely dollars cost and units shipped.
- It is also possible to share dimension tables between fact tables. For example, the time, item, and location dimension tables are shared between the sales and shipping fact tables.

Schema Definition
A multidimensional schema is defined using the Data Mining Query Language (DMQL). The two primitives, cube definition and dimension definition, can be used for defining the data warehouses and data marts.

Syntax for Cube Definition

define cube <cube_name> [<dimension_list>]: <measure_list>

Syntax for Dimension Definition

define dimension <dimension_name> as (<attribute_or_dimension_list>)
Star Schema Definition
The star schema that we have discussed can be defined using Data Mining Query Language
(DMQL) as follows:
define cube sales star [time, item, branch, location]:
dollars sold = sum(sales in dollars), units sold = count(*)
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)
Snowflake Schema Definition
Snowflake schema can be defined using DMQL as follows:
define cube sales snowflake [time, item, branch, location]:
dollars sold = sum(sales in dollars), units sold = count(*)
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier (supplier key, supplier
type))
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city (city key, city, province or state, country))
Fact Constellation Schema Definition
Fact constellation schema can be defined using DMQL as follows:

define cube sales [time, item, branch, location]:
dollars sold = sum(sales in dollars), units sold = count(*)
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)
define cube shipping [time, item, shipper, from location, to location]:
dollars cost = sum(cost in dollars), units shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper key, shipper name, location as location in cube sales,
shipper type)
define dimension from location as location in cube sales
define dimension to location as location in cube sales
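For comparison, a sketch of how the sales star schema above might be declared in plain SQL; the engine (SQLite via Python's sqlite3) and the column types are assumptions, while the table and column names follow the DMQL definitions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, day TEXT,
                               day_of_week TEXT, month TEXT, quarter TEXT,
                               year INTEGER);
    CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT,
                               brand TEXT, type TEXT, supplier_type TEXT);
    CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT,
                               branch_type TEXT);
    CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT,
                               city TEXT, province_or_state TEXT, country TEXT);

    -- The fact table at the center: keys to the four dimensions plus measures.
    CREATE TABLE sales_fact (
        time_key     INTEGER REFERENCES time_dim(time_key),
        item_key     INTEGER REFERENCES item_dim(item_key),
        branch_key   INTEGER REFERENCES branch_dim(branch_key),
        location_key INTEGER REFERENCES location_dim(location_key),
        dollars_sold REAL,
        units_sold   INTEGER
    );
""")
```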
Partitioning is done to enhance performance and facilitate easy management of data. Partitioning also helps in balancing the various requirements of the system. It optimizes the hardware performance and simplifies the management of the data warehouse by partitioning each fact table into multiple separate partitions. In this chapter, we will discuss different partitioning strategies.

Why is it Necessary to Partition?

Partitioning is important for the following reasons:

- For easy management,
- To assist backup/recovery,
- To enhance performance.

For Easy Management

The fact table in a data warehouse can grow up to hundreds of gigabytes in size. This huge size of the fact table is very hard to manage as a single entity. Therefore it needs partitioning.

To Assist Backup/Recovery
If we do not partition the fact table, then we have to load the complete fact table with all the
data. Partitioning allows us to load only as much data as is required on a regular basis. It
reduces the time to load and also enhances the performance of the system.
Note: To cut down on the backup size, all partitions other than the current partition can be
marked as read-only. We can then put these partitions into a state where they cannot be
modified. Then they can be backed up. It means only the current partition is to be backed up.

To Enhance Performance
By partitioning the fact table into sets of data, query procedures can be enhanced. Query performance is enhanced because the query now scans only those partitions that are relevant. It does not have to scan the whole data set.

Horizontal Partitioning
There are various ways in which a fact table can be partitioned. In horizontal partitioning, we have to keep in mind the requirements for manageability of the data warehouse.

Partitioning by Time into Equal Segments
In this partitioning strategy, the fact table is partitioned on the basis of time period. Here each time period represents a significant retention period within the business. For example, if the user queries for month-to-date data, then it is appropriate to partition the data into monthly segments. We can reuse the partitioned tables by removing the data in them.
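A minimal sketch of monthly segments in Python with pandas (the rows are hypothetical): each fact row is routed to exactly one per-month partition, so a month-to-date query touches a single partition.

```python
import pandas as pd

facts = pd.DataFrame({
    "sales_date": pd.to_datetime(["2013-08-03", "2013-09-03", "2013-09-15"]),
    "amount":     [3.67, 5.33, 2.50],
})

# Horizontal partitioning by time into equal (monthly) segments.
partitions = {str(month): rows for month, rows in
              facts.groupby(facts["sales_date"].dt.to_period("M"))}

# A month-to-date query now scans only the relevant partition.
print(partitions["2013-09"])
```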
Partition by Time into Different-sized Segments
This kind of partition is done where aged data is accessed infrequently. It is implemented as a set of small partitions for relatively current data and a larger partition for inactive data.

Points to Note

- The detailed information remains available online.
- The number of physical tables is kept relatively small, which reduces the operating cost.
- This technique is suitable where a mix of data dipping into recent history and data mining through the entire history is required.
- This technique is not useful where the partitioning profile changes on a regular basis, because repartitioning will increase the operating cost of the data warehouse.

Partition on a Different Dimension

The fact table can also be partitioned on the basis of dimensions other than time, such as product group, region, supplier, or any other dimension. Let's have an example.
Suppose a market function has been structured into distinct regional departments, e.g. on a state-by-state basis. If each region wants to query on information captured within its region, it would prove more effective to partition the fact table into regional partitions. This will cause the queries to speed up because they do not require scanning information that is not relevant.

Points to Note

- The query does not have to scan irrelevant data, which speeds up the query process.
- This technique is not appropriate where the dimensions are likely to change in the future. So, it is worth determining that the dimension will not change in the future.
- If the dimension changes, then the entire fact table would have to be repartitioned.

Note: We recommend performing the partition only on the basis of the time dimension, unless you are certain that the suggested dimension grouping will not change within the life of the data warehouse.
Partition by Size of Table
When there is no clear basis for partitioning the fact table on any dimension, then we should partition the fact table on the basis of its size. We can set a predetermined size as a critical point. When the table exceeds the predetermined size, a new table partition is created, as sketched below.
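A minimal sketch in plain Python; the critical size of two rows and the row-id metadata are assumptions made for illustration:

```python
MAX_ROWS = 2          # the predetermined critical size (an assumption)

partitions = [[]]     # each partition is a list of rows
for row_id in range(5):
    if len(partitions[-1]) >= MAX_ROWS:
        partitions.append([])            # size exceeded: start a new partition
    partitions[-1].append({"row_id": row_id})

# Metadata identifying what data each partition stores (see the note below).
metadata = [{"partition": i,
             "min_id": part[0]["row_id"],
             "max_id": part[-1]["row_id"]}
            for i, part in enumerate(partitions)]
print(metadata)
```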

Points to Note

- This partitioning is complex to manage.
- It requires metadata to identify what data is stored in each partition.


Partitioning Dimensions
If a dimension contains a large number of entries, then it is required to partition the dimensions. Here we have to check the size of a dimension.
Consider a large design that changes over time. If we need to store all the variations in order to apply comparisons, that dimension may be very large. This would definitely affect the response time.
Round Robin Partitions
In the round-robin technique, when a new partition is needed, the old one is archived. It uses metadata to allow a user access tool to refer to the correct table partition.
This technique makes it easy to automate table management facilities within the data warehouse.
Vertical Partition
Vertical partitioning splits the data vertically. It can be performed in the following two ways:

- Normalization
- Row Splitting

Normalization
Normalization is the standard relational method of database organization. In this method, repeated rows are collapsed into a single row, hence it reduces space. Take a look at the following tables that show how normalization is performed.

Table before Normalization

| Product_id | Qty | Value | sales_date | Store_id | Store_name | Location  | Region |
|------------|-----|-------|------------|----------|------------|-----------|--------|
| 30         | …   | 3.67  | 3-Aug-13   | 16       | sunny      | Bangalore | …      |
| 35         | …   | 5.33  | 3-Sep-13   | 16       | sunny      | Bangalore | …      |
| 40         | …   | 2.50  | 3-Sep-13   | 64       | san        | Mumbai    | …      |
| 45         | …   | 5.66  | 3-Sep-13   | 16       | sunny      | Bangalore | …      |

Table after Normalization

| Store_id | Store_name | Location  | Region |
|----------|------------|-----------|--------|
| 16       | sunny      | Bangalore | …      |
| 64       | san        | Mumbai    | …      |

| Product_id | Qty | Value | sales_date | Store_id |
|------------|-----|-------|------------|----------|
| 30         | …   | 3.67  | 3-Aug-13   | 16       |
| 35         | …   | 5.33  | 3-Sep-13   | 16       |
| 40         | …   | 2.50  | 3-Sep-13   | 64       |
| 45         | …   | 5.66  | 3-Sep-13   | 16       |
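The same split can be sketched in pandas, restricted to the columns whose values appear above:

```python
import pandas as pd

# Denormalized table: store attributes repeat on every sales row.
before = pd.DataFrame({
    "Product_id": [30, 35, 40, 45],
    "Value":      [3.67, 5.33, 2.50, 5.66],
    "sales_date": ["3-Aug-13", "3-Sep-13", "3-Sep-13", "3-Sep-13"],
    "Store_id":   [16, 16, 64, 16],
    "Store_name": ["sunny", "sunny", "san", "sunny"],
    "Location":   ["Bangalore", "Bangalore", "Mumbai", "Bangalore"],
})

# Normalization: collapse the repeated store rows into one row per store...
stores = before[["Store_id", "Store_name", "Location"]].drop_duplicates()

# ...keeping only the Store_id key on the sales side.
sales = before[["Product_id", "Value", "sales_date", "Store_id"]]

print(stores)
print(sales)
```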

Row Splitting
Row splitting tends to leave a one-to-one map between the partitions. The motive of row splitting is to speed up access to a large table by reducing its size.
Note: While using vertical partitioning, make sure that there is no requirement to perform a major join operation between two partitions.

Identify Key to Partition

It is very crucial to choose the right partition key. Choosing a wrong partition key will lead to reorganizing the fact table. Let's have an example. Suppose we want to partition the following table:
Account_Txn_Table
transaction_id
account_id
transaction_type
value
transaction_date
region
branch_name
We can choose to partition on any key. Two possible keys are:

- region
- transaction_date

Suppose the business is organized in 30 geographical regions and each region has a different number of branches. That will give us 30 partitions, which is reasonable. This partitioning is good enough because our requirements capture has shown that a vast majority of queries are restricted to the user's own business region.
If we partition by transaction_date instead of region, then the latest transactions from every region will be in one partition. Now the user who wants to look at data within his own region has to query across multiple partitions.
