1 Extract and load the data: Data extraction pulls data out of the source
systems and makes it available to the data warehouse, whereas data load takes the
extracted data and loads it into the data warehouse.
2 Clean and transform the data: This process performs consistency checks on the loaded
data and then structures it for query performance and for minimizing operational costs.
3 Back up and archive the data: Data is backed up regularly, and older data is
removed from the system in a format that allows it to be quickly restored if required.
4 Query management: This process manages queries and speeds them up by directing each
query to the most effective data source; it also monitors the actual query profiles.
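The four processes above can be sketched end-to-end in a few lines. Everything below is a hypothetical stand-in: the sample rows, the consistency check, and the summary query are illustrative, and an in-memory SQLite database plays the role of the warehouse.

```python
import sqlite3

# Hypothetical source rows, as if extracted from an operational system.
source_rows = [
    ("2024-01-05", "widget", 3),
    ("2024-01-05", "widget", -1),   # bad quantity: fails the consistency check
    ("2024-01-06", "gadget", 2),
]

# Load: stage extracted rows in the warehouse (an in-memory DB here).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (day TEXT, product TEXT, qty INTEGER)")

# Clean and transform: keep only rows that pass the consistency check.
clean = [r for r in source_rows if r[2] > 0]
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", clean)

# Query management: a summary structured for fast query performance.
totals = warehouse.execute(
    "SELECT product, SUM(qty) FROM sales GROUP BY product ORDER BY product"
).fetchall()
print(totals)  # [('gadget', 2), ('widget', 3)]
```

Backup and archiving are omitted here; in practice they would be scheduled jobs against the warehouse store rather than application code.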
DTS can function independently of SQL Server and can be used as a stand-alone
tool to transfer data from Oracle to any other ODBC- or OLE DB-compliant database.
Accordingly, DTS can extract data from operational databases for inclusion in a data
warehouse or data mart for query and analysis.
In the illustration, the transaction data resides on an IBM DB2 transaction server.
A package is created using DTS to transfer and clean the data from the DB2 transaction
server and to move it into the data warehouse or data mart. In this example, the relational
database server is SQL Server 7.0, and the data warehouse is using OLAP Services to
provide analytical capabilities. Client programs (such as Excel) access the OLAP Services
server using the OLE DB for OLAP interface, which is exposed through a client-side
component called Microsoft PivotTable service. Client programs using PivotTable service
can manipulate data in the OLAP server and can even change individual cells.
Clustering: It is the method by which like records are grouped together, usually to
give the end user a high-level view of what is going on in the database. There are
mainly two types.
Hierarchical and Non-Hierarchical Clustering: The hierarchy of clusters is usually
viewed as a tree where the smallest clusters merge together to create the next highest
level of clusters and so on.
[Figure: hierarchy of clusters; elongated clusters]
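The bottom-up (agglomerative) merging just described can be sketched directly: start with one cluster per record and repeatedly merge the closest pair. The 1-D points and the single-linkage distance below are illustrative assumptions, not part of the text.

```python
# Agglomerative hierarchical clustering: start with singleton clusters and
# repeatedly merge the closest pair, building the tree bottom-up.
def single_link(a, b):
    # Distance between clusters = distance of their closest members.
    return min(abs(x - y) for x in a for y in b)

def agglomerate(points, k):
    clusters = [[p] for p in points]   # smallest clusters: one point each
    while len(clusters) > k:
        # Find the closest pair of clusters and merge them.
        i, j = min(
            ((i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)
    return clusters

groups = agglomerate([1.0, 1.2, 5.0, 5.1, 9.0], k=3)
print(sorted(sorted(g) for g in groups))  # [[1.0, 1.2], [5.0, 5.1], [9.0]]
```

Stopping at k clusters is one cut through the tree; letting the loop run to a single cluster records the full hierarchy.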
2. Next Generation Techniques: These represent techniques such as Trees, Networks
and Rules that have only been widely used since the early 1980s.
Neural Networks: Neural networks consist of a number of neurons that are
interconnected--often in complex ways--and then organized into layers. Neurons are very
simple processing units that compute a linear combination of a number of inputs and then
perform a simple mathematical process on the result to produce an output.
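The neuron described above (a linear combination of inputs passed through a simple function) can be written in a few lines. The weights, inputs, and the choice of the logistic function as the "simple mathematical process" are illustrative assumptions.

```python
import math

def neuron(inputs, weights, bias):
    # Linear combination of the inputs...
    s = sum(w * x for w, x in zip(weights, inputs)) + bias
    # ...then a simple mathematical process (here, the logistic function).
    return 1.0 / (1.0 + math.exp(-s))

def layer(inputs, weight_rows, biases):
    # A layer is just several neurons fed the same inputs.
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

out = layer([1.0, 0.5],
            weight_rows=[[0.4, -0.2], [0.1, 0.9]],
            biases=[0.0, -0.3])
print([round(o, 3) for o in out])
```

Stacking such layers, with each layer's outputs becoming the next layer's inputs, gives the interconnected multi-layer organization the text mentions.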
• Classes: Stored data is used to locate data in predetermined groups. For example,
a restaurant chain could mine customer purchase data to determine when
customers visit and what they typically order. This information could be used to
increase traffic by having daily specials.
• Sequential patterns: Data is mined to anticipate behavior patterns and trends. For
example, an outdoor equipment retailer could predict the likelihood of a backpack
being purchased based on a customer's purchase of hiking shoes.
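The sequential-pattern idea can be illustrated with a toy counter over hypothetical purchase sequences: for each customer, count how often one item is later followed by another, and surface the most frequent pair. The data below is invented for the sketch.

```python
from collections import Counter

# Hypothetical per-customer purchase sequences, in time order.
sequences = [
    ["sleeping bag", "hiking shoes", "backpack"],
    ["hiking shoes", "backpack"],
    ["tent", "sleeping bag"],
    ["hiking shoes", "water bottle", "backpack"],
]

# Count how often item b appears somewhere after item a in a sequence.
follows = Counter()
for seq in sequences:
    for i, a in enumerate(seq):
        for b in seq[i + 1:]:
            follows[(a, b)] += 1

# The most frequent pair suggests a sequential pattern worth acting on.
pattern, count = follows.most_common(1)[0]
print(pattern, count)  # ('hiking shoes', 'backpack') 3
```

Real sequential-pattern miners (e.g., Apriori-style algorithms) generalize this from pairs to longer subsequences with support thresholds.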
The process of data mining consists of three stages: (1) the initial exploration, (2)
model building or pattern identification with validation/verification, and (3) deployment
(i.e., the application of the model to new data in order to generate predictions).
Stage 1: Exploration. This stage usually starts with data preparation, which may
involve cleaning data, data transformations, selecting subsets of records and, in the
case of data sets with large numbers of variables ("fields"), performing some
preliminary feature selection operations to bring the number of variables into a
manageable range (depending on the statistical methods being considered).
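A very simple form of the preliminary feature selection mentioned above is dropping near-constant fields, since they carry almost no information. The records and the variance threshold below are illustrative assumptions.

```python
# Stage 1 sketch: drop fields whose values barely vary across records.
records = [
    {"age": 25, "region": 1, "active": 1},
    {"age": 40, "region": 1, "active": 0},
    {"age": 33, "region": 1, "active": 1},
]

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

keep = [
    field for field in records[0]
    if variance([r[field] for r in records]) > 0.01
]
print(keep)  # "region" never varies, so it is dropped
```

More principled selection criteria (correlation with the target, information gain) follow the same pattern: score each field, keep those above a threshold.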
Stage 2: Model building and validation. This stage involves considering various
models and choosing the best one based on their predictive performance (i.e., explaining
the variability in question and producing stable results across samples).
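Model comparison by predictive performance can be sketched with a held-out sample: score each candidate on data it was not fitted to and keep the one with the smallest error. The toy data and the two hand-specified candidate models are illustrative assumptions.

```python
# Stage 2 sketch: choose among candidate models by held-out prediction error.
train = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.1)]
holdout = [(5, 10.0), (6, 11.9)]   # kept aside to test stability across samples

models = {
    "constant": lambda x: 5.0,       # predicts the same value everywhere
    "linear":   lambda x: 2.0 * x,   # y ~ 2x, fitted by eye to the train set
}

def mse(model, data):
    # Mean squared error: how much unexplained variability remains.
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

best = min(models, key=lambda name: mse(models[name], holdout))
print(best)  # the linear model explains the held-out variability better
```

Repeating the holdout split several times (cross-validation) checks that the winner produces stable results across samples, as the text requires.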
Stage 3: Deployment. This final stage involves using the model selected as best
in the previous stage and applying it to new data in order to generate predictions or
estimates of the expected outcome.
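Deployment itself is the simplest of the three stages to sketch: the chosen model is applied, unchanged, to records it has never seen. The model and the new records below are hypothetical.

```python
# Stage 3 sketch: apply the model selected in Stage 2 to new data.
def best_model(x):
    return 2.0 * x                    # hypothetical model chosen in Stage 2

new_data = [7, 8, 9]                  # records the model has never seen
predictions = [best_model(x) for x in new_data]
print(predictions)  # [14.0, 16.0, 18.0]
```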