Outline
Introduction
View - I: Data Access Crisis
View - II: Operational Applications vs. Analytical Applications
View - III: Data Integration
View - IV: Text Book View
Overview
Conclusion
Introduction
You might have noticed that when you go to a supermarket, your bill is processed by a
computerized billing system. When you go to a railway station, you book your ticket in
a computerized system. Airline tickets, too, are booked in a computerized system.
Computerization is now used almost everywhere. All these organizations collect data
effortlessly, and the volume of data they collect every day is large.
Take the example of a railway reservation system. Except for use in managerial
decisions, the details of who booked a ticket on which train are sent to an archival
file after the commencement of the journey. As a result, huge volumes of effortlessly
generated data are sent to archival files without being put to any proper use.
Particularly in a business environment, this kind of data can be used to achieve
better profitability and better performance. With that in mind, a new technology has
been developed that makes use of historical data and brings out useful information
from it. This technology, which we shall learn in this course, is called Data
Warehousing. Data Warehousing essentially means extracting useful and valuable
information from huge volumes of data for the purpose of decision making.
In this lesson we will give a general overview of the whole concept of Data
Warehousing, starting with the introductory concepts and then proceeding to the
detailed components of Data Warehousing.
View – I:
Data Access Crisis
We have four views on Data Warehousing. The first one is the Data Access Crisis.
Let's go through its details.
Top managers, analysts and knowledge workers in our enterprises need more and better
information.
Information technology itself has made possible revolutions in the way that
organizations today operate throughout the world.
Despite the availability of more and more powerful computers on everyone's desks
and communication networks that span the globe, executives and decision makers
can't get their hands on critical information that already exists in the organization.
“Data in Jail”
But for the most part, this data is locked up in a myriad of computer systems and is
exceedingly difficult to get at.
Data Poor
Only a small fraction of the data that is captured, processed and stored in the
enterprise is actually available to executives and decision makers.
Technologies for the manipulation and presentation of data have exploded, yet
large segments of the enterprise are still "data poor."
Data Warehousing
The term that has come to characterize this new technology is "data warehousing."
Its aim is to provide organizations with flexible, effective and efficient means of
getting at the sets of data that have come to represent one of the organization's
most critical and valuable assets.
View - II:
Operational Applications vs. Analytical Applications
Database Applications
A business process involves a series of events whose nature and frequency may
differ. We call these events transactions.
Database applications need highly efficient and optimized execution of a large number
of atomic transactions and near fault-tolerant availability of data. Examples of such
transactions are:
–A product is manufactured
–One account is credited and another debited
–A seat is reserved
–An order is booked
–An invoice is generated
–A payment is posted
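To make the idea of an atomic transaction concrete, here is a minimal sketch of the "one account is credited and another debited" case. It assumes an in-memory SQLite database and a hypothetical account table; the table, the account names and the transfer function are illustrative and not part of the original lesson.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Debit src and credit dst as one atomic transaction."""
    try:
        # "with conn" commits if the block succeeds and rolls back on an exception.
        with conn:
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src))
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst))
            (balance,) = conn.execute(
                "SELECT balance FROM account WHERE id = ?", (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False

# Hypothetical accounts used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 50)])
conn.commit()

ok = transfer(conn, "A", "B", 30)        # both updates are applied
failed = transfer(conn, "A", "B", 500)   # both updates are rolled back
balances = dict(conn.execute("SELECT id, balance FROM account"))
```

Either both updates take effect or neither does, which is the property an operational system is tuned to guarantee for every transaction.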
Operational Data
Limitations
- Operational data does not show the way a manager looks at the process of booking
orders, picking shipments and invoicing customers.
- Operational systems are optimized to carry out each transaction efficiently and
correctly.
- Managers' concerns are sales, volumes and margins, viewed by product, by division
and over time.
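The gap between transaction records and the manager's view can be sketched as follows. The order rows, field names and the total_by helper are assumptions made for illustration; the point is only that the manager's questions are aggregates over many transactions, grouped by business dimensions.

```python
from collections import defaultdict

# Hypothetical operational records: one row per booked order.
orders = [
    {"product": "Pen", "division": "North", "month": "2024-01", "amount": 120},
    {"product": "Pen", "division": "South", "month": "2024-01", "amount": 80},
    {"product": "Ink", "division": "North", "month": "2024-02", "amount": 200},
    {"product": "Pen", "division": "North", "month": "2024-02", "amount": 50},
]

def total_by(rows, *dims):
    """Sum the order amounts grouped by the given dimension names."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dims)
        totals[key] += row["amount"]
    return dict(totals)

sales_by_product = total_by(orders, "product")
sales_by_division_and_month = total_by(orders, "division", "month")
```

An operational system stores and updates the individual rows; the grouped totals are what an analytical application is expected to deliver.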
Analytical Applications
A class of applications
–That support analysts and knowledge workers in their efforts to gain insight into
data.
–Through fast, interactive access to a wide range of corporate information.
–That transform the raw data so that it reflects the real dimensionality of the
enterprise.
FASMI (Fast Analysis of Shared Multidimensional Information)
-FAST
-Most responses within 5 seconds; none should exceed 20 seconds.
-ANALYSIS
-Time series analyses, cost allocation, exception alerting (without having to do any
programming)
-SHARED
-MULTIDIMENSIONAL
-Single most important feature
-INFORMATION
Analytical Data
-Historical data
-Static in nature
View – III:
Data Integration
Data from multiple sources can be integrated in two ways:
-In-Advance-Integration
-On-Demand-Integration
On-Demand-Integration
This is also called the Lazy model, the Query-driven model or a "Virtual" system.
To be precise, in On-Demand-Integration, given a query, we find the relevant
information sources where the data is available, generate a sub-query for each of the
sources, integrate the results obtained from the different sources, and return them
to the requesting application.
To elaborate, imagine a query that needs to access four or five different data
sources: one may be in Informix, another in Oracle and another in Sybase. The user
gives a query; the On-Demand-Integration system receives that query, interprets it
and generates different sub-queries: one for Informix, another for Oracle and another
for Sybase. These sub-queries are sent to the respective systems. The data is
collected from those systems and integrated, and the integrated result is sent to the
user.
The diagram explains the concept. Consider three sources: an Oracle source, an
Informix source and a Sybase source. The client gives a query. The system that
receives the query breaks it into sub-queries and sends the Oracle sub-query to the
Oracle source, the Informix sub-query to the Informix source and the Sybase sub-query
to the Sybase source.
Each source has a wrapper that is specific to the DBMS present in that source. The
sub-query is sent to the wrapper, understood there, and the data comes back to the
wrapper. This data is sent to the integrator, which integrates the results obtained
from all the sources and finally sends the result back to the client. So the client
feels as if it is sending a query that can be understood by Sybase, Informix and
Oracle simultaneously. In reality, the system decomposes this query into sub-queries.
This concept is called On-Demand-Integration because the sub-query generation, the
retrieval of data from the different sources, the integration of the results and their
delivery to the client are all carried out as and when the client gives a query.
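The wrapper and integrator flow described above can be sketched as follows. This is only an illustration under stated assumptions: the three "sources" are plain Python lists standing in for Oracle, Informix and Sybase databases, and the Wrapper class, the field names and integrate_on_demand are hypothetical names, not a real product API.

```python
class Wrapper:
    """Translates a common query into the vocabulary of one source (hypothetical)."""

    def __init__(self, source_name, rows, train_field, passenger_field):
        self.source_name = source_name
        self.rows = rows                      # stand-in for the source database
        self.train_field = train_field        # source-specific column names
        self.passenger_field = passenger_field

    def run_sub_query(self, train_no):
        # The sub-query is evaluated only now, when the client asks.
        return [
            {"train_no": row[self.train_field],
             "passenger": row[self.passenger_field],
             "source": self.source_name}
            for row in self.rows
            if row[self.train_field] == train_no
        ]

def integrate_on_demand(wrappers, train_no):
    """Send one sub-query per source, then merge the partial results."""
    combined = []
    for wrapper in wrappers:
        combined.extend(wrapper.run_sub_query(train_no))
    return sorted(combined, key=lambda r: r["passenger"])

wrappers = [
    Wrapper("oracle", [{"TRAIN": 101, "PAX": "Asha"}, {"TRAIN": 202, "PAX": "Ravi"}],
            "TRAIN", "PAX"),
    Wrapper("informix", [{"train_id": 101, "name": "Meena"}], "train_id", "name"),
    Wrapper("sybase", [{"trn": 303, "pname": "John"}, {"trn": 101, "pname": "Zoya"}],
            "trn", "pname"),
]

answer = integrate_on_demand(wrappers, 101)
```

The client sees a single result in a common format, although each source was queried separately and in its own terms.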
In-Advance-Integration
When the user poses a query, the system evaluates it directly against the new
database and the result is returned to the client.
This is called the Eager model because the system anticipates the possible set of
queries in advance, gets the data from multiple sources, prepares the results and
keeps them ready. All of this processing is carried out before a user makes any
demand.
This is called analysis-driven because the possible set of analyses that a user can
carry out is estimated in advance, and all the results are generated and kept.
This is also called a Materialized system because all the consolidated, filtered data
is physically stored in a new database.
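A minimal sketch of in-advance integration, assuming two hypothetical local sources (plain Python lists) and an in-memory SQLite database playing the role of the new database. The table and function names are illustrative assumptions.

```python
import sqlite3

# Hypothetical local sources, each holding (product, amount) rows.
source_a = [("Pen", 120), ("Ink", 150)]
source_b = [("Pen", 80)]

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (product TEXT, amount INTEGER, source TEXT)")

def integrate_in_advance(db, sources):
    """Copy data from every source into the new database before any query arrives."""
    for name, rows in sources.items():
        db.executemany(
            "INSERT INTO sales VALUES (?, ?, ?)",
            [(product, amount, name) for product, amount in rows])
    db.commit()

def total_sales(db, product):
    """Answer a user query from the materialized data only."""
    (total,) = db.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM sales WHERE product = ?",
        (product,)).fetchone()
    return total

integrate_in_advance(warehouse, {"a": source_a, "b": source_b})

# Simulate the sources becoming unreachable after the load.
source_a.clear()
source_b.clear()

pen_total = total_sales(warehouse, "Pen")
ink_total = total_sales(warehouse, "Ink")
```

Because the data was copied before the query was posed, the answers do not depend on the sources being reachable at query time.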
Materialized Systems
To be more precise, in a Materialized System, data coming from local sources is
integrated into a single new database.
In a materialized system, we have two different ways of storing the data.
Universal DBMS - A universal DBMS migrates data from local systems to an Object
Relational or Object Oriented database system that can handle novel types of data.
Data Warehouse - A data warehouse imports data from local sources for the
purposes of OLAP and data mining.
Let us have a complete picture of all the different kinds of integration that are
possible. Integration can be done in two different ways: materialized and virtual.
The Materialized System is called the Eager model because, before any query is
posed, the system evaluates the possible set of queries that the user may send, gets
the data and keeps it ready.
Virtual systems fall into three categories: search engines, multi-DBMS systems and
mediated integration systems.
This gives you a complete picture of where the Data Warehouse lies among
integration systems.
In an On-demand system, the data is picked up from the different sources only after
the query is posed. Hence the materialized system is preferable when network
connectivity is unreliable: if the network connection is not available at the moment
the query is posed, connecting to the different sources and getting the data is not
possible.
The response time to queries is another important aspect. In the materialized system
the data is precompiled and kept, so the response is given as soon as the user poses
a query.
Whereas in the Lazy model, processing starts only after the query is given, so it
takes a long time to get the answers back from the different sources.
Thus we have learnt that a data warehouse is a materialized integration of data from
multiple sources.
View – IV:
Text Book View
- W. H. Inmon, 1993
Having understood what a Data Warehouse essentially is, let us have an overview of
the whole course on Data Warehousing and Data Mining.
Let us first ask ourselves what the motivation should be for understanding and
studying Data Warehousing, and for having it as a subject of study.
Hence it is appropriate for any IT professional to be acquainted with the cutting edge
technology of Data Warehousing.
We must first design a data warehouse, so we should know the techniques of Data
Warehouse design, just as we do for a DBMS or any other business application.
It is essential for us to understand data modeling. But when studying Data
Warehousing, the concepts that we have learned in data modeling and other courses
cannot be trivially extended to Data Warehousing concepts.
Because Data Warehousing is a new technology, its data modeling aspects are
altogether different. The core of data modeling in Data Warehousing is what we call
the multi-dimensional data model. The way the data is organized and stored in a Data
Warehouse can be viewed as a multi-dimensional model: it has different business
dimensions, and the core data is stored as a multi-dimensional array over these
dimensions. Dimensional modeling is therefore the first step of Data Warehouse
design. Unless we design the dimensions, it is not possible to design the Data
Warehouse.
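The multi-dimensional model can be pictured with a small sketch. It assumes three hypothetical business dimensions (product, division, month) and a handful of made-up sales figures; the nested list plays the role of the multi-dimensional array.

```python
# Hypothetical business dimensions; each value gets a position along its axis.
products = ["Pen", "Ink"]
divisions = ["North", "South"]
months = ["Jan", "Feb"]

# cube[p][d][m] holds the sales amount for one combination of dimension values.
cube = [[[0 for _ in months] for _ in divisions] for _ in products]

facts = [
    ("Pen", "North", "Jan", 120),
    ("Pen", "South", "Jan", 80),
    ("Ink", "North", "Feb", 200),
    ("Pen", "North", "Feb", 50),
]
for product, division, month, amount in facts:
    cube[products.index(product)][divisions.index(division)][months.index(month)] += amount

def cell(product, division, month):
    """Look up a single cell of the cube by its dimension values."""
    return cube[products.index(product)][divisions.index(division)][months.index(month)]

def total_for_product(product):
    """Aggregate over the division and month dimensions for one product."""
    plane = cube[products.index(product)]
    return sum(sum(row) for row in plane)
```

The dimensions have to be fixed before the array can even be allocated, which is why dimensional modeling comes first in the design.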
Recall that in our DBMS courses we learnt about different schemas. In the same
manner, in Data Warehouse data modeling we have different Data Warehouse schemas:
the Star Schema, the Snowflake Schema and the Constellation Schema. The most popular
among these is the Star Schema. The essential feature of the Star Schema is that each
dimension is represented as a single dimension table, which is nothing but a
relational table. In the Snowflake Schema, by contrast, the dimension table is
normalized in the same way as you have learnt in a DBMS course, so a single dimension
table is decomposed into sub-tables. That is why it is called a Snowflake Schema.
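The difference between the two schemas can be sketched with SQLite. All table and column names here (fact_sales, dim_product, sf_product, sf_category) are hypothetical; the sketch only shows one product dimension kept as a single table (star) and the same dimension normalized into two tables (snowflake).

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE fact_sales (product_id INTEGER, amount INTEGER);

-- Star: the product dimension is one denormalized table.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);

-- Snowflake: the same dimension normalized into product and category tables.
CREATE TABLE sf_category (category_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE sf_product (product_id INTEGER PRIMARY KEY, name TEXT, category_id INTEGER);

INSERT INTO fact_sales VALUES (1, 120), (2, 200), (1, 80), (3, 40);
INSERT INTO dim_product VALUES
    (1, 'Pen', 'Stationery'), (2, 'Ink', 'Stationery'), (3, 'Lamp', 'Electrical');
INSERT INTO sf_category VALUES (10, 'Stationery'), (20, 'Electrical');
INSERT INTO sf_product VALUES (1, 'Pen', 10), (2, 'Ink', 10), (3, 'Lamp', 20);
""")

# Star schema: one join from the fact table to the dimension table.
star = db.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales f JOIN dim_product d ON f.product_id = d.product_id
    GROUP BY d.category ORDER BY d.category
""").fetchall()

# Snowflake schema: an extra join through the normalized sub-table.
snowflake = db.execute("""
    SELECT c.category, SUM(f.amount)
    FROM fact_sales f
    JOIN sf_product p ON f.product_id = p.product_id
    JOIN sf_category c ON p.category_id = c.category_id
    GROUP BY c.category ORDER BY c.category
""").fetchall()
```

Both queries return the same totals; the snowflake form avoids repeating the category name in every product row at the cost of one more join.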
Another major aspect of Data Warehousing is the OLAP engine. The OLAP engine
provides all the different kinds of analytical processing that can be carried out on
a multi-dimensional data model, or in other words on a Data Cube. There are different
kinds of OLAP engines: one is MOLAP, another is ROLAP, and a third is a combination
of MOLAP and ROLAP, which we call H(ybrid)OLAP. To distinguish between MOLAP and
ROLAP: MOLAP is a kind of OLAP engine that treats the data as a multi-dimensional
data cube. It describes the data in terms of dimensions and views the data as a
multi-dimensional array. ROLAP, with R standing for relational, assumes that the Data
Cube is physically stored as a relational table.
Hence, even though ROLAP views the Data Cube in a multi-dimensional fashion, the
physical storage makes use of the relational DBMS and relational table concepts.
Naturally, HOLAP makes use of the advantages of both MOLAP and ROLAP.
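The storage difference between MOLAP and ROLAP can be illustrated with a small sketch. It is not how any particular OLAP product works; it assumes made-up facts over two hypothetical dimensions and stores them once as an array and once as a relational table in SQLite.

```python
import sqlite3

products = ["Pen", "Ink"]
months = ["Jan", "Feb"]
facts = [("Pen", "Jan", 120), ("Ink", "Jan", 200), ("Pen", "Feb", 50)]

# MOLAP-style storage: a dense array indexed by dimension positions.
array = [[0] * len(months) for _ in products]
for product, month, amount in facts:
    array[products.index(product)][months.index(month)] += amount

molap_by_month = {
    month: sum(array[i][j] for i in range(len(products)))
    for j, month in enumerate(months)
}

# ROLAP-style storage: the same cube kept as rows of a relational table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cube (product TEXT, month TEXT, amount INTEGER)")
db.executemany("INSERT INTO cube VALUES (?, ?, ?)", facts)

rolap_by_month = dict(
    db.execute("SELECT month, SUM(amount) FROM cube GROUP BY month"))
```

The user-facing answer is the same in both cases; only the physical representation and the way the aggregate is computed differ.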
Overview of this course –III
Next, let us distinguish between a centralized Data Warehouse and a Data Mart. A
Data Mart arises when the Data Warehousing requirement for the whole enterprise is
too big to be handled at once. So for each department, or even for each individual
manager, a smaller Data Warehouse is designed that caters to his or her requirements,
and this is termed a Data Mart. The most popular concept of a Data Mart thus
essentially points to a Data Warehouse on a smaller scale.
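One simple way to picture a Data Mart is as a department-specific subset of the warehouse. The sketch below assumes a hypothetical sales table with a department column and a hypothetical build_data_mart helper; real data marts may be designed independently rather than derived this way.

```python
import sqlite3

dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales (department TEXT, product TEXT, amount INTEGER)")
dw.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("Marketing", "Pen", 120), ("Finance", "Ink", 200), ("Marketing", "Ink", 80)])
dw.commit()

def build_data_mart(db, department):
    """Copy only one department's rows into its own smaller table."""
    table = "mart_" + department.lower()
    if not table.isidentifier():
        # Table names cannot be bound as SQL parameters, so validate them.
        raise ValueError("unsafe table name")
    db.execute(f"CREATE TABLE {table} (product TEXT, amount INTEGER)")
    db.execute(
        f"INSERT INTO {table} SELECT product, amount FROM sales WHERE department = ?",
        (department,))
    db.commit()
    return table

mart = build_data_mart(dw, "Marketing")
mart_rows = dw.execute(
    f"SELECT product, amount FROM {mart} ORDER BY product").fetchall()
```

The mart holds only what one department needs, so it is smaller and simpler to query than the enterprise-wide warehouse.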
But please remember that the concept of a Data Mart is different from the concept of
Data Marting. In the same way, a Data Warehouse is the warehouse itself, whereas
Data Warehousing is all the technology involved in creating, maintaining and
analyzing a Data Warehouse.
Having understood the core Data Warehousing concepts, the main question now is how
we populate a Data Warehouse with data. One of the major aspects of a Data Warehouse
is that the data it contains is necessarily read only: users cannot change or update
the data in a Data Warehouse. The only way its data changes is through what we
normally call data loading or data refreshing.
Since the Data Warehouse is basically an integrated system, it obtains data from
different sources. Data loading is therefore an integration process that requires a
separate sub-system. We call that sub-system the ETL system. ETL stands for Extract,
Transform and Load.
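The three ETL steps can be sketched as three small functions. The two sources, their layouts (one with amounts as text in rupees, one with amounts in paise) and the sales table are assumptions made purely for illustration.

```python
import sqlite3

def extract():
    """Extract: read raw rows from two hypothetical sources with different layouts."""
    billing = [{"item": " pen ", "rupees": "120"}, {"item": "INK", "rupees": "150"}]
    orders = [("Pen", 8000)]  # amount recorded in paise
    return billing, orders

def transform(billing, orders):
    """Transform: clean the names and convert every amount to whole rupees."""
    rows = [(r["item"].strip().title(), int(r["rupees"])) for r in billing]
    rows += [(name.strip().title(), paise // 100) for name, paise in orders]
    return rows

def load(db, rows):
    """Load: append the cleaned rows to the warehouse table."""
    db.execute("CREATE TABLE IF NOT EXISTS sales (product TEXT, amount INTEGER)")
    db.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    db.commit()

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(*extract()))

totals = dict(
    warehouse.execute("SELECT product, SUM(amount) FROM sales GROUP BY product"))
```

After the load, rows that came from differently formatted sources can be analyzed together because they share one schema and one unit.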
So any Data Warehouse system must have an ETL process as its front end. As a
result, we can have a three-tier architecture of a Data Warehouse, preceded by the
ETL process:
• The first level is the ETL process,
• The next level is the Data Warehouse server,
• The next level is the OLAP engine,
• The final tier contains Data Mining, Visualization and Report Generation.
The ETL process is not part of the three-tier architecture, in the sense that it is a
front end to the Data Warehouse.
So tier-I is the Data Warehouse server, tier-II is the OLAP engine and tier-III is
where the user interacts with the whole Data Warehouse through Data Mining,
Visualization and Reporting systems.
Essentially, I have outlined all major aspects of the Data Warehousing process.
Conclusion
We have seen that Data Warehousing is an essential concept for data integration, for
easy access to data for managerial decisions, for analytical processing and for the
data required by knowledge workers. We have also understood the different components
of Data Warehousing, such as Data Marting, OLAP operations and Dimensional modeling.
In the subsequent lessons each of these concepts will be elaborated in detail.