Q1) Bottom-up Data Warehouse development
Although Data Warehouses can be designed in a number of different ways, they all share several important characteristics.
Most Data Warehouses are Subject Oriented. This means that the information in the Data Warehouse is stored in a way that allows it to be connected to objects or events that occur in reality.
Another characteristic that is frequently seen in Data Warehouses is called Time Variance. A time-variant Data Warehouse allows changes in the information to be monitored and recorded over time.
Data Warehouses are also integrated: the data produced by all the programs that a particular institution uses is stored in the Data Warehouse and combined into a consistent whole.
The first Data Warehouses were developed in the 1980s.
As societies entered the information age, there was a large demand for efficient
methods of storing information.
Many of the systems that existed in the 1980s were not powerful enough to store
and manage large amounts of data.
The systems that existed at the time took too long to process and report information, and many were not designed to analyze or report information at all.
In addition, the computer programs that were necessary for reporting information were both costly and slow. To solve these problems, companies began designing computer databases that placed an emphasis on managing and analyzing information. These were the first Data Warehouses, and they could obtain data from a variety of different sources, including PCs and mainframes.
Spreadsheet programs have also played an important role in the development of
Data Warehouses. By the end of the 1990s, the technology had greatly
advanced, and was much lower in cost.
The technology has continued to evolve to meet the demands of those who are
looking for more functions and speed.
There are four advances in Data Warehouse technology that have allowed it to evolve. These advances are offline operational databases, real-time Data Warehouses, offline Data Warehouses, and integrated Data Warehouses.
The offline operational database is a system in which the information within the
database of an operational system is copied to a server that is offline.
When this is done, the operational system will perform at a much higher level. As
the name implies, a real time Data Warehouse system will be updated every time
an event occurs. For example, if a customer orders a product, a real time Data
Warehouse will automatically update the information in real time.
Another important concept that is related to Data Warehouses is called data
transformation. As the name suggests, data transformation is a process in which
information transferred from specific sources is cleaned and loaded into a
repository.
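As a sketch of this idea, the snippet below cleans a handful of raw source records (trimming whitespace, parsing numbers, dropping duplicates and unparseable rows) before they would be loaded into a repository. The field names and sample rows are illustrative assumptions, not details from the text.

```python
# Minimal sketch of a data-transformation step: rows pulled from a source
# are cleaned and de-duplicated before loading. Field names are assumptions.

def transform(raw_records):
    """Clean raw source rows and return rows ready for loading."""
    seen = set()
    cleaned = []
    for rec in raw_records:
        name = rec.get("customer", "").strip().title()
        try:
            amount = float(rec.get("amount", ""))
        except ValueError:
            continue  # drop rows whose amount cannot be parsed
        key = (name, amount)
        if name and key not in seen:  # skip blanks and duplicates
            seen.add(key)
            cleaned.append({"customer": name, "amount": amount})
    return cleaned

raw = [
    {"customer": "  alice smith ", "amount": "120.50"},
    {"customer": "  alice smith ", "amount": "120.50"},  # duplicate row
    {"customer": "bob jones", "amount": "not-a-number"},  # bad value
]
print(transform(raw))
```

A real pipeline would also handle type conversions, code standardization, and missing values, but the shape is the same: extract, clean, then load.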
Multicube
The first figure shows a data cube for multidimensional analysis of sales data with respect to annual sales per item type for each All Electronics branch. Each cell holds an aggregate data value corresponding to a data point in multidimensional space.
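The aggregation behind such a cube can be sketched with a plain dictionary keyed by one (branch, item type, year) combination per cell; the sample sales rows below are made-up illustrative values, not figures from the text.

```python
from collections import defaultdict

# Each cube cell aggregates sales for one (branch, item_type, year) point.
sales = [
    ("Vancouver", "home entertainment", 2009, 605),
    ("Vancouver", "home entertainment", 2009, 280),
    ("Vancouver", "computer", 2009, 825),
    ("Toronto", "computer", 2010, 968),
]

cube = defaultdict(int)
for branch, item, year, amount in sales:
    cube[(branch, item, year)] += amount  # one cell per dimension combination

print(cube[("Vancouver", "home entertainment", 2009)])  # 885
```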
Dimensionality Reduction
Basic heuristic methods of attribute subset selection include the following techniques, some of which are illustrated in the figure below.
1. Stepwise forward selection: The procedure starts with an empty set of attributes. The best of the original attributes is determined and added to the set. At each subsequent step, the best of the remaining attributes is added.
2. Stepwise backward elimination: The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination: The stepwise forward selection and backward elimination methods can be combined so that, at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
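The stepwise forward selection procedure can be sketched as a greedy loop. Here `score()` stands in for whatever attribute merit measure the mining task uses (for example, information gain); the attribute names and toy weights are illustrative assumptions.

```python
# Sketch of stepwise forward selection: start empty, repeatedly add the
# attribute that most improves the merit score of the chosen subset.

def forward_select(attributes, score, k):
    """Greedily add the best-scoring attribute until k are chosen."""
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy merit measure: each attribute has a fixed usefulness weight.
weights = {"age": 0.9, "income": 0.7, "zip": 0.1, "id": 0.0}
score = lambda subset: sum(weights[a] for a in subset)

print(forward_select(weights, score, 2))  # ['age', 'income']
```

Backward elimination is the mirror image: start from the full set and repeatedly drop the attribute whose removal hurts the score least.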
Numerosity Reduction
In numerosity reduction, data are replaced or estimated by alternative, smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering, sampling, and the use of histograms.
Sampling
Sampling can be used as a data reduction technique since it allows a large data set
to be represented by a much smaller random sample (or subset) of the data.
Suppose that a large data set, D, contains N tuples.
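Drawing a simple random sample of s tuples from D without replacement can be sketched as follows; N = 1000 and s = 10 are arbitrary illustrative values.

```python
import random

# Sketch of simple random sampling without replacement: draw s of the
# N tuples in D so that every tuple is equally likely to be selected.
random.seed(42)  # fixed seed so the sketch is reproducible

D = list(range(1000))         # stand-in for a data set of N = 1000 tuples
s = 10                        # desired sample size, s << N
sample = random.sample(D, s)  # without replacement: no tuple drawn twice

print(len(sample))  # 10
```

The reduced representation costs O(s) storage instead of O(N), which is what makes sampling attractive for numerosity reduction.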
Q5) Describe K-means method for clustering. List its advantages and drawbacks.
Answer:
K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms
that solve the well known clustering problem.
The procedure follows a simple and easy way to classify a given data set through a
certain number of clusters (assume k clusters) fixed a priori.
The main idea is to define k centroids, one for each cluster.
The basic steps of k-means clustering are simple. In the beginning we determine the number of clusters K and we assume the centroids, or centers, of these clusters. We can take any random objects as the initial centroids, or the first K objects in sequence can also serve as the initial centroids.
Then the K-means algorithm will repeat the three steps given below until convergence, i.e. until the result is stable (no object moves to another group):
o Determine the centroid coordinates.
o Determine the distance of each object to the centroids.
o Group the objects based on the minimum distance.
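The three-step loop can be sketched in a few lines for 2-D points; the sample points and the choice k = 2 are illustrative assumptions.

```python
import math

# Sketch of the k-means loop: assign each object to its nearest centroid,
# recompute centroids as group means, repeat until no object moves.

def kmeans(points, k, iters=100):
    centroids = points[:k]  # first k objects serve as initial centroids
    for _ in range(iters):
        # Assign each object to the group of its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            groups[i].append(p)
        # Recompute each centroid as the mean of its group.
        new = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
               for i, g in enumerate(groups)]
        if new == centroids:  # stable: no object moved group
            break
        centroids = new
    return centroids, groups

pts = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids, groups = kmeans(pts, 2)
print(sorted(centroids))
```

Note that the result depends on the initial centroids, which is one reason the method struggles with overlapping clusters and outliers, as listed below.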
Advantages:
With a large number of variables, K-means may be computationally faster than
hierarchical clustering (if K is small).
K-means may produce tighter clusters than hierarchical clustering, especially if the
clusters are globular.
The K-means method as described has the following drawbacks:
It does not do well with overlapping clusters.
The clusters are easily pulled off-center by outliers.