1 Extract and load the data: Data extraction pulls data out of the source
systems and makes it available to the data warehouse, whereas data load takes the
extracted data and loads it into the data warehouse.
2 Clean and transform the data: This process performs consistency checks on the loaded
data and then structures it for query performance and for minimizing operational costs.
3 Back up and archive the data: Data is backed up regularly, and older data is
removed from the system in a format that allows it to be quickly restored if required.
4 Query management: This process manages queries and speeds them up by directing each
query to the most effective data source; it also monitors the actual query profiles.
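The four processes above can be sketched end-to-end in a few lines. Everything below is a hypothetical stand-in: the sample rows, the consistency check, and the summary query are illustrative, and an in-memory SQLite database plays the role of the warehouse.

```python
import sqlite3

# Hypothetical source rows, as if extracted from an operational system.
source_rows = [
    ("2024-01-05", "widget", 3),
    ("2024-01-05", "widget", -1),   # bad quantity: fails the consistency check
    ("2024-01-06", "gadget", 2),
]

# Load: stage extracted rows in the warehouse (an in-memory DB here).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (day TEXT, product TEXT, qty INTEGER)")

# Clean and transform: keep only rows that pass the consistency check.
clean = [r for r in source_rows if r[2] > 0]
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", clean)

# Query management: a summary structured for fast query performance.
totals = warehouse.execute(
    "SELECT product, SUM(qty) FROM sales GROUP BY product ORDER BY product"
).fetchall()
print(totals)  # [('gadget', 2), ('widget', 3)]
```

Backup and archiving are omitted here; in practice they would be scheduled jobs against the warehouse store rather than application code.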
DTS can function independently of SQL Server and can be used as a stand-alone
tool to transfer data from Oracle to any other ODBC- or OLE DB-compliant database.
Accordingly, DTS can extract data from operational databases for inclusion in a data
warehouse or data mart for query and analysis.
In the illustration, the transaction data resides on an IBM DB2 transaction server.
A package is created using DTS to transfer and clean the data from the DB2 transaction
server and to move it into the data warehouse or data mart. In this example, the relational
database server is SQL Server 7.0, and the data warehouse is using OLAP Services to
provide analytical capabilities. Client programs (such as Excel) access the OLAP Services
server using the OLE DB for OLAP interface, which is exposed through a client-side
component called Microsoft PivotTable service. Client programs using PivotTable service
can manipulate data in the OLAP server and can even change individual cells.
Clustering: It is the method by which like records are grouped together, usually to
give the end user a high-level view of what is going on in the database. There are
mainly two types.
Hierarchical and Non-Hierarchical Clustering: The hierarchy of clusters is usually
viewed as a tree where the smallest clusters merge together to create the next highest
level of clusters and so on.
[Figure: hierarchy of clusters; elongated clusters]
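The bottom-up (agglomerative) merging just described can be sketched directly: start with one cluster per record and repeatedly merge the closest pair. The 1-D points and the single-linkage distance below are illustrative assumptions, not part of the text.

```python
# Agglomerative hierarchical clustering: start with singleton clusters and
# repeatedly merge the closest pair, building the tree bottom-up.
def single_link(a, b):
    # Distance between clusters = distance of their closest members.
    return min(abs(x - y) for x in a for y in b)

def agglomerate(points, k):
    clusters = [[p] for p in points]   # smallest clusters: one point each
    while len(clusters) > k:
        # Find the closest pair of clusters and merge them.
        i, j = min(
            ((i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)
    return clusters

groups = agglomerate([1.0, 1.2, 5.0, 5.1, 9.0], k=3)
print(sorted(sorted(g) for g in groups))  # [[1.0, 1.2], [5.0, 5.1], [9.0]]
```

Stopping at k clusters is one cut through the tree; letting the loop run to a single cluster records the full hierarchy.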
2. Next Generation Techniques: These represent techniques such as Trees, Networks
and Rules that have only been widely used since the early 1980s.
Neural Networks: Neural networks consist of a number of neurons that are
interconnected--often in complex ways--and then organized into layers. Neurons are very
simple processing units that compute a linear combination of a number of inputs and then
perform a simple mathematical process on the result to produce an output.
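The neuron described above (a linear combination of inputs passed through a simple function) can be written in a few lines. The weights, inputs, and the choice of the logistic function as the "simple mathematical process" are illustrative assumptions.

```python
import math

def neuron(inputs, weights, bias):
    # Linear combination of the inputs...
    s = sum(w * x for w, x in zip(weights, inputs)) + bias
    # ...then a simple mathematical process (here, the logistic function).
    return 1.0 / (1.0 + math.exp(-s))

def layer(inputs, weight_rows, biases):
    # A layer is just several neurons fed the same inputs.
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

out = layer([1.0, 0.5],
            weight_rows=[[0.4, -0.2], [0.1, 0.9]],
            biases=[0.0, -0.3])
print([round(o, 3) for o in out])
```

Stacking such layers, with each layer's outputs becoming the next layer's inputs, gives the interconnected multi-layer organization the text mentions.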
• Classes: Stored data is used to locate data in predetermined groups. For example,
a restaurant chain could mine customer purchase data to determine when
customers visit and what they typically order. This information could be used to
increase traffic by having daily specials.
• Sequential patterns: Data is mined to anticipate behavior patterns and trends. For
example, an outdoor equipment retailer could predict the likelihood of a backpack
being purchased based on a customer's purchase of hiking shoes.
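The sequential-pattern idea can be illustrated with a toy counter over hypothetical purchase sequences: for each customer, count how often one item is later followed by another, and surface the most frequent pair. The data below is invented for the sketch.

```python
from collections import Counter

# Hypothetical per-customer purchase sequences, in time order.
sequences = [
    ["sleeping bag", "hiking shoes", "backpack"],
    ["hiking shoes", "backpack"],
    ["tent", "sleeping bag"],
    ["hiking shoes", "water bottle", "backpack"],
]

# Count how often item b appears somewhere after item a in a sequence.
follows = Counter()
for seq in sequences:
    for i, a in enumerate(seq):
        for b in seq[i + 1:]:
            follows[(a, b)] += 1

# The most frequent pair suggests a sequential pattern worth acting on.
pattern, count = follows.most_common(1)[0]
print(pattern, count)  # ('hiking shoes', 'backpack') 3
```

Real sequential-pattern miners (e.g., Apriori-style algorithms) generalize this from pairs to longer subsequences with support thresholds.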
The process of data mining consists of three stages: (1) the initial exploration, (2)
model building or pattern identification with validation/verification, and (3) deployment
(i.e., the application of the model to new data in order to generate predictions).
Stage 1: Exploration. This stage usually starts with data preparation, which may
involve cleaning data, data transformations, selecting subsets of records and, in the
case of data sets with large numbers of variables ("fields"), performing some
preliminary feature selection operations to bring the number of variables into a
manageable range (depending on the statistical methods being considered).
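A very simple form of the preliminary feature selection mentioned above is dropping near-constant fields, since they carry almost no information. The records and the variance threshold below are illustrative assumptions.

```python
# Stage 1 sketch: drop fields whose values barely vary across records.
records = [
    {"age": 25, "region": 1, "active": 1},
    {"age": 40, "region": 1, "active": 0},
    {"age": 33, "region": 1, "active": 1},
]

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

keep = [
    field for field in records[0]
    if variance([r[field] for r in records]) > 0.01
]
print(keep)  # "region" never varies, so it is dropped
```

More principled selection criteria (correlation with the target, information gain) follow the same pattern: score each field, keep those above a threshold.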
Stage 2: Model building and validation. This stage involves considering various
models and choosing the best one based on their predictive performance (i.e., explaining
the variability in question and producing stable results across samples).
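Model comparison by predictive performance can be sketched with a held-out sample: score each candidate on data it was not fitted to and keep the one with the smallest error. The toy data and the two hand-specified candidate models are illustrative assumptions.

```python
# Stage 2 sketch: choose among candidate models by held-out prediction error.
train = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.1)]
holdout = [(5, 10.0), (6, 11.9)]   # kept aside to test stability across samples

models = {
    "constant": lambda x: 5.0,       # predicts the same value everywhere
    "linear":   lambda x: 2.0 * x,   # y ~ 2x, fitted by eye to the train set
}

def mse(model, data):
    # Mean squared error: how much unexplained variability remains.
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

best = min(models, key=lambda name: mse(models[name], holdout))
print(best)  # the linear model explains the held-out variability better
```

Repeating the holdout split several times (cross-validation) checks that the winner produces stable results across samples, as the text requires.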
Stage 3: Deployment. This final stage involves using the model selected as best
in the previous stage and applying it to new data in order to generate predictions or
estimates of the expected outcome.
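Deployment itself is the simplest of the three stages to sketch: the chosen model is applied, unchanged, to records it has never seen. The model and the new records below are hypothetical.

```python
# Stage 3 sketch: apply the model selected in Stage 2 to new data.
def best_model(x):
    return 2.0 * x                    # hypothetical model chosen in Stage 2

new_data = [7, 8, 9]                  # records the model has never seen
predictions = [best_model(x) for x in new_data]
print(predictions)  # [14.0, 16.0, 18.0]
```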