
->Data warehouse and its features

1. Data warehousing provides architectures and tools for business executives to
systematically organize, understand, and use their data to make strategic
decisions.
2. Data warehouses collect data from a range of different data sources, such as
mainframe computers, minicomputers, personal computers, and office
automation software such as spreadsheets, and integrate this information in a
single place.
3. The data warehouse is thus an informational environment which does the
following:
● provides an integrated view of the enterprise.
● renders the enterprise’s current as well as historical data readily
available for making strategic decisions.
● makes decision making possible without hindering operational systems.
● makes the organization’s information consistent and easily accessible.
● provides a flexible, conducive and interactive source of strategic
information.
4. Features are:
● Immediate information delivery: Data warehouses reduce the time that
elapses between a request for information and its actual delivery to the
users. For example, a sales report that used to be produced once a month,
usually in the first week of the month, can with a data warehouse be
produced daily, enabling business analysts to exploit opportunities that
would otherwise have been missed.
● Integration of data from within and outside the organization: Data
warehouses combine data from multiple sources. The data is collected
from different departments like sales, marketing, finance, and
accounting. Besides this, data is also taken from external sources like
business magazines, news reports, surveys, etc.
● Provides an insight into the future: Data warehouses store large amounts
of historical information that enables decision makers to analyse the
prevailing trends in the market and produce goods according to
customers' demands.
● Enables users to look at the same data in different ways: A data
warehouse provides its users with tools for analysing and manipulating
data in many different ways. It lets users drill down into detailed data
with a click of the mouse, a task that could otherwise have taken a few
days with the traditional approach.
● Provides freedom from the dependency on IT: With data warehouses, users
no longer have to depend on the availability of IT professionals to
answer their queries. A manager who needs an ad hoc report can produce it
without the assistance of an IT specialist.
-> Data warehouse vs. data mart

● Definition
➢ Data Warehouse: A large repository of data collected from different organizations or departments within a corporation.
➢ Data Mart: A subtype of a data warehouse, designed to meet the needs of a certain user group.
● Usage
➢ Data Warehouse: Helps to make strategic decisions.
➢ Data Mart: Helps to make tactical decisions for the business.
● Objective
➢ Data Warehouse: To provide an integrated environment and a coherent picture of the business at a point in time.
➢ Data Mart: Mostly used in a business division at the department level.
● Designing
➢ Data Warehouse: The design process is quite difficult. It may or may not use a dimensional model, but it can feed dimensional models.
➢ Data Mart: The design process is easy. It is built around a dimensional model using a star schema.
● Data Handling
➢ Data Warehouse: Covers a large area of the corporation, which is why it takes a long time to process.
➢ Data Mart: Easy to use, design, and implement, as it handles only small amounts of data.
● Focus
➢ Data Warehouse: Broadly focused on all departments. It may even represent the entire company.
➢ Data Mart: Subject-oriented and used at a department level.
● Data type
➢ Data Warehouse: The stored data is always detailed when compared with a data mart.
➢ Data Mart: Built for particular user groups, so the data is short and limited.
● Subject-area
➢ Data Warehouse: To provide an integrated environment and a coherent picture of the business at a point in time.
➢ Data Mart: Mostly holds only one subject area, for example sales figures.
● Data storing
➢ Data Warehouse: Designed to store enterprise-wide decision data, not just marketing data.
➢ Data Mart: Dimensional modeling and star schema design are employed to optimize the performance of access layers.
● Data type
➢ Data Warehouse: Time variance and non-volatile design are strictly enforced.
➢ Data Mart: Mostly includes consolidation data structures to meet the subject area's query and reporting needs.
● Data value
➢ Data Warehouse: Read-only from the end users' standpoint.
➢ Data Mart: Transaction data, regardless of grain, fed directly from the data warehouse.
● Scope
➢ Data Warehouse: More helpful, as it can bring information from any department.
➢ Data Mart: Contains data of a specific department of a company. There may be separate data marts for sales, finance, marketing, etc. Has limited usage.
● Source
➢ Data Warehouse: Data comes from many sources.
➢ Data Mart: Data comes from very few sources.
● Size
➢ Data Warehouse: May range from 100 GB to 1 TB+.
➢ Data Mart: Less than 100 GB.
● Implementation time
➢ Data Warehouse: Can extend from months to years.
➢ Data Mart: Restricted to a few months.

-> Three-tier data warehouse architecture

Tier-1:

The bottom tier is a warehouse database server that is almost always a relational
database system. Back-end tools and utilities are used to feed data into the bottom
tier from operational databases or other external sources (such as customer profile
information provided by external consultants). These tools and utilities perform data
extraction, cleaning, and transformation (e.g., to merge similar data from different
sources into a unified format), as well as load and refresh functions to update the
data warehouse. The data is extracted using application program interfaces known as
gateways. A gateway is supported by the underlying DBMS and allows client programs to
generate SQL code to be executed at a server.

Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking
and Embedding, Database) by Microsoft and JDBC (Java Database Connectivity).
This tier also contains a metadata repository, which stores information about the
data warehouse and its contents.
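
The back-end flow described for the bottom tier (extract, clean, transform, load) can be sketched as follows, with Python's built-in sqlite3 module standing in for the warehouse database. The two source record lists, the table name sales_fact and all field names are hypothetical, chosen only for illustration.

```python
import sqlite3

# Hypothetical operational sources with inconsistent formats.
source_a = [{"cust": " Alice ", "amount": "120.50", "date": "2024-01-05"}]
source_b = [{"customer_name": "BOB", "total": 80, "day": "05/01/2024"}]


def transform_a(rec):
    """Clean a record from source A into the unified format."""
    return (rec["cust"].strip().title(), float(rec["amount"]), rec["date"])


def transform_b(rec):
    """Clean a record from source B; convert DD/MM/YYYY to an ISO date."""
    day, month, year = rec["day"].split("/")
    return (rec["customer_name"].strip().title(), float(rec["total"]),
            f"{year}-{month}-{day}")


def load(conn, rows):
    """Load unified rows into the (hypothetical) warehouse fact table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales_fact "
                 "(customer TEXT, amount REAL, sale_date TEXT)")
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(conn):
    """Extract from both sources, unify the format, and load the result."""
    rows = ([transform_a(r) for r in source_a]
            + [transform_b(r) for r in source_b])
    load(conn, rows)
    return conn.execute(
        "SELECT customer, amount, sale_date FROM sales_fact ORDER BY customer"
    ).fetchall()


warehouse = sqlite3.connect(":memory:")
loaded_rows = run_etl(warehouse)
```

After the run, both records sit in one table with the same name casing, numeric type and date format, which is the "unified format" the merge step aims for.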
Tier-2:

The middle tier is an OLAP server that is typically implemented using either a
relational OLAP (ROLAP) model or a multidimensional OLAP (MOLAP) model. A ROLAP model
is an extended relational DBMS that maps operations on multidimensional data to
standard relational operations. A MOLAP model is a special-purpose server that
directly implements multidimensional data and operations.

Tier-3:

The top tier is a front-end client layer, which contains query and reporting tools,
analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so
on).

OLAP Operations:

• Roll-up (also known as drill-up) performs aggregation on a data cube by climbing up
a dimensional hierarchy. For example, if the time dimension is defined by the
hierarchy week < month < quarter < year, a roll-up moves up this hierarchy and
aggregates the data at the coarser level (for instance, quarterly figures summed
into yearly figures).

• Drill-down is the reverse of roll-up: it navigates from less detailed data to more
detailed data, for example to see the data at a lower level of detail, such as
individual weeks.

• Slice: focus on a particular partition along one (or more) dimensions, i.e., on a
particular slice of the cube. This is comparable to a WHERE clause in SQL.

• Dice: partition the cube into smaller subcubes and aggregate the points in each.
This is comparable to a GROUP BY clause in SQL.

• Pivot (rotate): reorient the cube for visualization, for example turning a 3D cube
into a series of 2D planes.
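
These operations map fairly directly onto SQL, which is also the idea behind a ROLAP server. Below is a minimal sketch using Python's built-in sqlite3 module. The sales table, its columns and the figures are made up for illustration, and dice follows the description given above (partition and aggregate).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (year INT, quarter TEXT, region TEXT, amount INT)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    (2023, "Q1", "East", 100), (2023, "Q1", "West", 150),
    (2023, "Q2", "East", 120), (2023, "Q2", "West", 130),
    (2024, "Q1", "East", 200), (2024, "Q1", "West", 170),
])


def query(sql, params=()):
    """Run a query and return all rows as a list of tuples."""
    return conn.execute(sql, params).fetchall()


# Drill-down level: totals per quarter (more detailed).
by_quarter = query("SELECT year, quarter, SUM(amount) FROM sales "
                   "GROUP BY year, quarter ORDER BY year, quarter")

# Roll-up: climb the time hierarchy from quarter to year.
by_year = query(
    "SELECT year, SUM(amount) FROM sales GROUP BY year ORDER BY year")

# Slice: fix one dimension with a WHERE clause.
east_slice = query("SELECT year, quarter, amount FROM sales "
                   "WHERE region = ? ORDER BY year, quarter", ("East",))

# Dice (as described above): split into subcubes and aggregate each one.
by_region_year = query("SELECT region, year, SUM(amount) FROM sales "
                       "GROUP BY region, year ORDER BY region, year")

# Pivot: reorient the region/year totals into one row per region.
pivot = {}
for region, year, total in by_region_year:
    pivot.setdefault(region, {})[year] = total
```

Note that some textbooks define dice instead as a selection on two or more dimensions, which in SQL would be a WHERE clause with several conditions rather than a GROUP BY.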


Identification of Data Sources
• For every piece of information that has to be stored in the data warehouse, its source
must first be identified.
• Steps performed in source identification:
a) List every fact needed for analysis in the fact tables.
b) For every dimension table, list each attribute.
c) For each target data item, find the source system and the appropriate source data
item.
d) If there are multiple sources for the same data, choose the preferred source.
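
The outcome of these steps is essentially a target-to-source mapping. A minimal sketch of such a mapping in Python follows; the target items, system names and field names are hypothetical, and listing candidates in order of preference is just one possible convention.

```python
# Hypothetical target data items, each with candidate sources listed
# in order of preference (first entry = preferred source).
source_map = {
    "sales_fact.amount": [("order_system", "ORD_TOTAL"),
                          ("billing_system", "INV_AMT")],
    "customer_dim.name": [("crm", "CUST_NAME")],
    "product_dim.category": [],  # source not identified yet
}


def preferred_source(target):
    """Return the preferred (system, field) pair, or None if unmapped."""
    candidates = source_map.get(target, [])
    return candidates[0] if candidates else None


def unmapped_targets():
    """List target items that still have no identified source."""
    return sorted(t for t, sources in source_map.items() if not sources)
```

Calling unmapped_targets() gives a quick checklist of the target items for which step c) is still open.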

Extracting Data for Refreshing


• Data Extraction Techniques can be classified into two categories:
A) Immediate Data Extraction
• Data extraction happens in real time.
• It occurs as the transactions happen at the source databases and files.
• Immediate data extraction is further divided into three subcategories.
1. Capture through Transaction logs
• Makes use of the transaction logs of the DBMS.
• Whenever a business transaction adds, updates, or deletes a row in a database
table, the DBMS immediately records the change as an entry in the log file as well.
2. Capture through DB Triggers
• Only applicable to database applications.
• Triggers are special stored procedures that are stored in the database and fired when
a predefined event occurs.
• The output of the trigger program is written to a separate file that is used to
extract the data.
• The extracted data is then stored in the DW.
3. Capture in Source Application
• The source application itself is used to capture the data for the DW.
• All relevant applications that write to source files are modified to write all adds,
updates, and deletes to both the source files and the DB tables.
• Suited for databases, indexed files, flat files, and all other types of files.
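
Capture through database triggers can be sketched with SQLite via Python's built-in sqlite3 module. The orders table and the trigger names are hypothetical, and a change table (orders_changes) stands in here for the separate capture file mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);

-- Capture table standing in for the separate capture file.
CREATE TABLE orders_changes (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT, order_id INTEGER, amount REAL
);

CREATE TRIGGER orders_ins AFTER INSERT ON orders BEGIN
    INSERT INTO orders_changes (op, order_id, amount)
    VALUES ('I', NEW.id, NEW.amount);
END;
CREATE TRIGGER orders_upd AFTER UPDATE ON orders BEGIN
    INSERT INTO orders_changes (op, order_id, amount)
    VALUES ('U', NEW.id, NEW.amount);
END;
CREATE TRIGGER orders_del AFTER DELETE ON orders BEGIN
    INSERT INTO orders_changes (op, order_id, amount)
    VALUES ('D', OLD.id, OLD.amount);
END;
""")

# Ordinary business transactions on the source table.
conn.execute("INSERT INTO orders VALUES (1, 50.0)")
conn.execute("UPDATE orders SET amount = 75.0 WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 1")

# The extraction step reads the captured changes in the order they occurred.
captured = conn.execute(
    "SELECT op, order_id, amount FROM orders_changes ORDER BY seq"
).fetchall()
```

The source application is unchanged in this approach: it only writes to orders, and the triggers record every add, update and delete as it happens.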

B) Deferred Data Extraction


• Here data capture does not take place in real time.
• Capture is done at a later point of time.
• Two types of deferred data extraction:
1. Capture based on Date & Timestamp
• Each record is marked with a timestamp.
• The timestamp shows the date and time at which the source record was created or
updated.
• Data is usually extracted overnight, around midnight.
2. Capture by Comparing files
• It compares two snapshots of the source data.
• Ex: Sales data is extracted by comparing today's copy with the previous day's copy.
• This technique requires keeping prior copies of all relevant source data.
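
Both deferred techniques can be sketched in a few lines of Python. The record layout, the updated_at field and the sample values are hypothetical, and snapshots are simplified to dictionaries keyed by record id.

```python
from datetime import datetime

# Hypothetical source records carrying a last-modified timestamp.
source_rows = [
    {"id": 1, "amount": 100, "updated_at": datetime(2024, 1, 1, 9, 0)},
    {"id": 2, "amount": 250, "updated_at": datetime(2024, 1, 2, 14, 30)},
    {"id": 3, "amount": 80, "updated_at": datetime(2024, 1, 2, 18, 45)},
]


def extract_since(rows, last_run):
    """Timestamp-based capture: keep rows changed after the last extraction."""
    return [r for r in rows if r["updated_at"] > last_run]


def compare_snapshots(previous, current):
    """File-comparison capture: diff two snapshots keyed by record id."""
    inserts = {k: v for k, v in current.items() if k not in previous}
    deletes = {k: v for k, v in previous.items() if k not in current}
    updates = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    return inserts, updates, deletes


# Timestamp-based: everything touched since the previous midnight run.
changed = extract_since(source_rows, datetime(2024, 1, 2, 0, 0))

# Snapshot comparison: yesterday's copy versus today's copy.
yesterday = {1: 100, 2: 250, 3: 80}
today = {1: 100, 2: 300, 4: 60}
inserts, updates, deletes = compare_snapshots(yesterday, today)
```

One design point worth noting: the snapshot comparison also detects deleted records, which a timestamp on the surviving rows cannot show, at the cost of storing the prior copy.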
What Is an Attribute?

● An attribute is a data field, representing a characteristic or feature of a
data object.
● The type of an attribute is determined by the set of possible values—nominal,
binary, ordinal, or numeric—the attribute can have.
● Nominal Attributes
➢ Nominal means “relating to names.” The values of a nominal attribute are
symbols or names of things. Each value represents some kind of category,
code, or state, and so nominal attributes are also referred to as
categorical.
● Binary Attributes
➢ A binary attribute is a nominal attribute with only two categories or
states: 0 or 1, where 0 typically means that the attribute is absent,
and 1 means that it is present.
➢ Binary attributes are referred to as Boolean if the two states
correspond to true and false.
● Ordinal Attributes
➢ An ordinal attribute is an attribute with possible values that have a
meaningful order or ranking among them, but the magnitude between
successive values is not known.
● Numeric Attributes
➢ A numeric attribute is quantitative; that is, it is a measurable
quantity, represented in integer or real values. Numeric attributes can
be interval-scaled or ratio-scaled.
● Interval-Scaled Attributes
➢ Interval-scaled attributes are measured on a scale of equal-size units.
The values of interval-scaled attributes have order and can be positive,
0, or negative. Thus, in addition to providing a ranking of values, such
attributes allow us to compare and quantify the difference between
values.
● Ratio-Scaled Attributes
➢ A ratio-scaled attribute is a numeric attribute with an inherent
zero-point. That is, if a measurement is ratio-scaled, we can speak of a
value as being a multiple (or ratio) of another value. In addition, the
values are ordered, and we can also compute the difference between
values, as well as the mean, median, and mode.
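
The attribute type determines which summary operations are meaningful. A small sketch using Python's statistics module illustrates this; the attribute names and sample values are hypothetical.

```python
import statistics

hair_color = ["black", "brown", "black", "blond"]       # nominal
size = ["small", "large", "medium", "small", "large"]   # ordinal
temp_celsius = [10.0, 20.0, 30.0]                       # interval-scaled
years_experience = [2, 4, 8]                            # ratio-scaled

# Nominal: values are just names, so only counting (the mode) makes sense.
nominal_mode = statistics.mode(hair_color)

# Ordinal: values can be ranked, so the median is meaningful.
order = ["small", "medium", "large"]
ranks = sorted(order.index(v) for v in size)
ordinal_median = order[ranks[len(ranks) // 2]]

# Interval-scaled: differences and the mean are meaningful, ratios are not
# (20 degrees C is not "twice as warm" as 10, since 0 is not a true zero).
interval_diff = temp_celsius[1] - temp_celsius[0]
interval_mean = statistics.mean(temp_celsius)

# Ratio-scaled: an inherent zero point makes ratios meaningful.
ratio = years_experience[2] / years_experience[0]
```

Here the ratio states that 8 years of experience is four times 2 years, a statement that has no counterpart for the Celsius temperatures.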