
SYSTEM ANALYSIS AND DESIGN

RESEARCH REPORT

DATA MINING

26.11.2016
Seyit Mert AYVAZ
2012555008
Introduction

Data is a term that emerged alongside the rapid development of
computers. As computers evolved, data was separated into large and
diverse collections, and this separation led researchers to study data
from different perspectives. Over time, large databases became too hard
to work with directly. Both scientists and the employees of large
companies began searching for a solution, and as a result the data
mining process took its place in the world of informatics.

In this report, I approach data mining from various angles. The
report covers almost every major topic concerning data mining. First,
introductory topics are treated on a preferential basis, such as the
term data mining, its architecture, and the mining process. Since it is
important to understand how data mining evolved, its history and
milestones are reviewed. Afterwards comes what I consider the most
important topic, the scope of data mining. As important as it is for
academics, it is equally important for the business world, so the
effects of data mining and current studies are described for both
sides, academic and business. Finally, some ideas and future trends are
explained briefly.

I hope that this report will be beneficial to its readers and to
anyone curious about data mining.

Page 2 of 29
TABLE OF CONTENTS
1. Introduction to Data Mining ................................... 4
   1.1. What is Data Mining? ..................................... 4
        1.1.1. Automatic Discovery ............................... 4
        1.1.2. Prediction ........................................ 5
        1.1.3. Grouping .......................................... 5
        1.1.4. Actionable Information ............................ 5
   1.2. Architecture of Data Mining ............................. 6
        1.2.1. Data Sources ...................................... 7
        1.2.2. Database or Data Warehouse Server ................. 7
        1.2.3. Data Mining Engine ................................ 7
        1.2.4. Pattern Evaluation Modules ........................ 7
        1.2.5. Graphical User Interface .......................... 7
        1.2.6. Knowledge Base .................................... 7
   1.3. Data Mining Processes ................................... 8
        1.3.1. Problem definition ................................ 8
        1.3.2. Data exploration .................................. 9
        1.3.3. Data preparation .................................. 9
        1.3.4. Modeling .......................................... 9
        1.3.5. Evaluation ........................................ 9
        1.3.6. Deployment ........................................ 9
2. History of Data Mining ....................................... 10
   2.1. Foundations of Data Mining ............................. 10
   2.2. Evolution in Data Mining for Business .................. 11
   2.3. Milestones of Data Mining .............................. 12
3. Scope of Data Mining ......................................... 15
   3.1. Usage of Data Mining Techniques ........................ 16
        3.1.1. Association ...................................... 16
        3.1.2. Classification ................................... 16
        3.1.3. Clustering ....................................... 17
        3.1.4. Prediction ....................................... 17
        3.1.5. Sequential Patterns .............................. 17
        3.1.6. Decision Trees ................................... 17
   3.2. Data Mining in Academia ................................ 18
        3.2.1. Science and Engineering .......................... 18
        3.2.2. Medical Data Mining .............................. 19
        3.2.3. Spatial Data Mining .............................. 19
        3.2.4. Pattern Mining ................................... 20
        3.2.5. Human Rights ..................................... 20
        3.2.6. Sensor Data Mining ............................... 20
   3.3. Data Mining in Business ................................ 20
4. Future of Data Mining ........................................ 23
   4.1. Distributed/Collective Data Mining (DDM) ............... 23
   4.2. Ubiquitous Data Mining (UDM) ........................... 23
   4.3. Hypertext and Hypermedia Data Mining ................... 23
   4.4. Multimedia Data Mining ................................. 24
   4.5. Time Series/Sequence Data Mining ....................... 24

1. Introduction to Data Mining

Before anything else, you have to study and understand some terms
related to data mining, such as data, information and knowledge. Since
all studies related to data mining also build on these terms, it is
important to grasp the relation between data, information and
knowledge.
Data: data are any facts, numbers, or text that can be processed by a
computer. Today, organizations are accumulating vast and growing
amounts of data in different formats and different databases. This
includes:
- operational or transactional data, such as sales, cost, inventory,
  payroll, and accounting
- nonoperational data, such as industry sales, forecast data, and
  macroeconomic data
- metadata: data about the data itself, such as logical database
  design or data dictionary definitions
Information: the patterns, associations, or relationships among all this
data can provide information. For example, analysis of retail point of sale
transaction data can yield information on which products are selling and
when.
Knowledge: information can be converted into knowledge about
historical patterns and future trends. For example, summary information
on retail supermarket sales can be analyzed in light of promotional efforts
to provide knowledge of consumer buying behavior. Thus, a manufacturer
or retailer could determine which items are most susceptible to
promotional efforts.

1.1. What is Data Mining?

Data mining is an interdisciplinary subfield of computer science. It is
the computational process of discovering patterns in large data sets
involving methods at the intersection of artificial intelligence,
machine learning, statistics, and database systems.

The overall goal of the data mining process is to extract information from a
data set and transform it into an understandable structure for further use.
Data mining uses sophisticated mathematical algorithms to segment the
data and evaluate the probability of future events. Data mining is also
known as Knowledge Discovery in Data (KDD).

In general, the key properties of data mining can be summarized as:

- Automatic discovery of patterns
- Prediction of likely outcomes
- Creation of actionable information
- Focus on large data sets and databases

To understand these properties better, we can expand on them as
follows.

1.1.1. Automatic Discovery
Data mining is accomplished by building models. A model uses an
algorithm to act on a set of data. The notion of automatic discovery
refers to the execution of data mining models. Data mining models can
be used to mine the data on which they are built, but most types of
models are generalizable to new data. The process of applying a model
to new data is known as scoring.
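As a toy illustration of scoring (all numbers below are invented, and
the "model" is deliberately trivial), a model built on one data set can
be applied to records it has never seen:

```python
# Illustrative sketch: "scoring" means applying a model built on one
# data set to new, previously unseen records.

# Hypothetical training data: (years_of_education, income)
training = [(12, 30000), (12, 34000), (16, 52000), (16, 48000), (20, 70000)]

# "Build" a trivial model: average income per education level.
buckets = {}
for edu, income in training:
    buckets.setdefault(edu, []).append(income)
model = {edu: sum(v) / len(v) for edu, v in buckets.items()}

def score(record_edu):
    """Apply (score) the model against a new record's education level."""
    # Fall back to the closest known education level.
    nearest = min(model, key=lambda e: abs(e - record_edu))
    return model[nearest]

print(score(16))  # predicted income for a new record with 16 years
```

Real data mining tools build far richer models, but the two-phase
shape (build on historical data, score new data) is the same.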

1.1.2. Prediction
Many forms of data mining are predictive. For example, a model might
predict income based on education and other demographic factors.
Predictions have an associated probability (How likely is this prediction to
be true?). Prediction probabilities are also known as confidence. Some
forms of predictive data mining generate rules, which are conditions that
imply a given outcome. For example, a rule might specify that a person
who has a bachelor's degree and lives in a certain neighborhood is likely to
have an income greater than the regional average.

1.1.3. Grouping
Other forms of data mining identify natural groupings in the data. For
example, a model might identify the segment of the population that has
an income within a specified range, that has a good driving record, and
that leases a new car on a yearly basis.
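A minimal sketch of how such natural groupings can be found
automatically, here a one-dimensional k-means-style loop over invented
income values (real clustering tools handle many dimensions and many
edge cases this sketch ignores):

```python
# Minimal 1-D k-means-style sketch (k = 2) illustrating "natural grouping".
incomes = [21, 23, 25, 24, 80, 78, 83, 79]  # hypothetical values, in $1000s

centers = [min(incomes), max(incomes)]       # naive initialisation
for _ in range(10):                          # a few refinement rounds
    groups = [[], []]
    for x in incomes:
        # bool(...) is 0 or 1: assign x to its closer center's group
        groups[abs(x - centers[0]) > abs(x - centers[1])].append(x)
    centers = [sum(g) / len(g) for g in groups]

print(sorted(groups[0]), sorted(groups[1]))  # the two discovered segments
```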

1.1.4. Actionable Information
Data mining can derive actionable information from large volumes of
data. For example, a town planner might use a model that predicts
income based on demographics to develop a plan for low-income housing.
A car leasing agency might use a model that identifies customer
segments to design a promotion targeting high-value customers.

The actual data mining task is the automatic or semi-automatic
analysis of large quantities of data to extract previously unknown,
interesting patterns such as groups of data records (cluster analysis),
unusual records (anomaly detection), and dependencies (association rule
mining). This usually involves using database techniques such as
spatial indices. These patterns can then be seen as a kind of summary
of the input data, and may be used in further analysis or, for example,
in machine learning and predictive analytics. For example, the data
mining step might identify multiple groups in the data, which can then
be used to obtain more accurate prediction results by a decision
support system. Neither the data collection, data preparation, nor
result interpretation and reporting is part of the data mining step,
but they do belong to the overall KDD process as additional steps.

In other words, data mining (sometimes called data or knowledge
discovery) is the process of analyzing data from different perspectives
and summarizing it into useful information - information that can be
used to increase revenue, cut costs, or both. Data mining software is
one of a number of analytical tools for analyzing data. It allows users
to analyze data from many different dimensions or angles, categorize
it, and summarize the relationships identified. Technically, data
mining is the process of finding correlations or patterns among dozens
of fields in large relational databases.
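As a toy illustration of finding a correlation between two fields (the
figures below are invented), the Pearson coefficient can be computed
directly from its definition:

```python
# Sketch: measuring the correlation between two fields of a data set
# (hypothetical advertising spend vs. sales figures).
import math

ads   = [10, 12, 15, 17, 20]
sales = [100, 115, 135, 160, 180]

n = len(ads)
mean_a, mean_s = sum(ads) / n, sum(sales) / n
cov   = sum((a - mean_a) * (s - mean_s) for a, s in zip(ads, sales)) / n
std_a = math.sqrt(sum((a - mean_a) ** 2 for a in ads) / n)
std_s = math.sqrt(sum((s - mean_s) ** 2 for s in sales) / n)

r = cov / (std_a * std_s)   # Pearson correlation coefficient
print(round(r, 3))          # close to 1.0: strong positive correlation
```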

1.2. Architecture of Data Mining

Figure 1.2.1: Architecture of data mining levels.

The major components of any data mining system are the data source,
data warehouse server, data mining engine, pattern evaluation module,
graphical user interface and knowledge base. To get a better
understanding of these, we will examine what each component is and
what purpose it serves.

1.2.1. Data Sources: Databases, data warehouses, the World Wide Web
(WWW), text files and other documents are the actual sources of data.
A large quantity of historical data is necessary for data mining to be
successful. Organizations generally store data in databases or data
warehouses. Data warehouses may contain one or more databases, text
files, spreadsheets or other kinds of information repositories.
Sometimes, data may reside even in plain text files or spreadsheets.
The World Wide Web, or the Internet, is another big source of data.

Why does the data go through different processes?

The data needs to be cleaned, integrated and selected before being
passed to the database or data warehouse server. Because the data comes
from different sources and in different formats, it cannot be used
directly for the data mining process: it might not be complete and
reliable. So the data first needs to be cleaned and integrated. Again,
more data than required will be collected from different data sources,
and only the data of interest needs to be selected and passed to the
server. These processes are not as simple as they may seem; a number of
techniques may be performed on the data as part of cleaning,
integration and selection.
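A hypothetical sketch of these three steps on two toy record sources
(the field names and values are all invented):

```python
# Sketch of integration, cleaning, de-duplication and selection
# before data is loaded into a warehouse server.
source_a = [{"id": 1, "name": "Ann",  "sales": 120},
            {"id": 2, "name": "Bob",  "sales": None}]   # incomplete record
source_b = [{"id": 3, "name": "Cara", "sales": 95},
            {"id": 1, "name": "Ann",  "sales": 120}]    # duplicate of source_a

merged = source_a + source_b                             # integration
cleaned = [r for r in merged if r["sales"] is not None]  # cleaning
unique = {r["id"]: r for r in cleaned}.values()          # de-duplication
selected = [{"id": r["id"], "sales": r["sales"]}         # selection of
            for r in unique]                             # fields of interest

print(selected)
```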

1.2.2. Database or Data Warehouse Server: The database or data
warehouse server contains the actual data that is ready to be
processed. Hence, the server is responsible for retrieving the relevant
data based on the data mining request of the user.

1.2.3. Data Mining Engine: The data mining engine is the core
component of any data mining system. It consists of a number of modules
for performing data mining tasks, including association,
classification, characterization, clustering, prediction and
time-series analysis.

1.2.4. Pattern Evaluation Modules: The pattern evaluation module is
mainly responsible for measuring the interestingness of a pattern by
using a threshold value. It interacts with the data mining engine to
focus the search towards interesting patterns.

1.2.5. Graphical User Interface: The graphical user interface module
mediates between the user and the data mining system. It helps the
user use the system easily and efficiently without knowing the real
complexity behind the process. When the user specifies a query or a
task, this module interacts with the data mining system and displays
the result in an easily understandable manner.

1.2.6. Knowledge Base: The knowledge base is helpful throughout the
whole data mining process. It might be useful for guiding the search or
evaluating the interestingness of the result patterns. The knowledge
base might even contain user beliefs and data from user experiences
that can be useful in the process of data mining. The data mining
engine might get inputs from the knowledge base to make the results
more accurate and reliable.

1.3. Data Mining Processes

Figure 1.3.1: Phases of the Cross-Industry Standard Process for Data
Mining (CRISP-DM) process model.

Many organizations in various industries, including manufacturing,
marketing, chemicals and aerospace, are taking advantage of data mining
to increase their business efficiency. Therefore, the need for a
standard data mining process has grown accordingly. A data mining
process must be reliable, and it must be repeatable by business people
with little or no background in data mining. As a result, in 1996 the
Cross-Industry Standard Process for Data Mining (CRISP-DM) was first
announced, after many workshops and contributions from over 300
organizations.
The Cross-Industry Standard Process for Data Mining is an iterative
process that typically involves the following phases:

1.3.1. Problem definition


A data mining project starts with the understanding of the business
problem. Data mining experts, business experts, and domain experts work
closely together to define the project objectives and the requirements
from a business perspective. The project objective is then translated into a
data mining problem definition.
In the problem definition phase, data mining tools are not yet
required.

1.3.2. Data exploration
Domain experts understand the meaning of the metadata. They
collect, describe, and explore the data. They also identify quality problems
of the data. A frequent exchange with the data mining experts and the
business experts from the problem definition phase is vital.
In the data exploration phase, traditional data analysis tools, for example,
statistics, are used to explore the data.

1.3.3. Data preparation


Domain experts build the data model for the modeling process. They
collect, cleanse, and format the data because some of the mining
functions accept data only in a certain format. They also create new
derived attributes, for example, an average value.
In the data preparation phase, data is tweaked multiple times in no
prescribed order. Preparing the data for the modeling tool by selecting
tables, records, and attributes is a typical task in this phase. The
meaning of the data is not changed.
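Creating a derived attribute such as an average value, as mentioned
above, can be sketched as follows (the customer IDs and purchase
amounts are invented):

```python
# Sketch: deriving a new attribute (average purchase value per customer)
# during data preparation.
purchases = {
    "C1": [25.0, 40.0, 55.0],   # hypothetical purchase amounts
    "C2": [10.0, 12.0],
}

# The derived attribute does not change the meaning of the raw data;
# it adds a new field computed from existing fields.
derived = {cid: sum(p) / len(p) for cid, p in purchases.items()}
print(derived)
```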

1.3.4. Modeling
Data mining experts select and apply various mining functions
because you can use different mining functions for the same type of data
mining problem. Some of the mining functions require specific data types.
The data mining experts must assess each model.
In the modeling phase, a frequent exchange with the domain experts from
the data preparation phase is required.

The modeling phase and the evaluation phase are coupled. They can
be repeated several times to change parameters until optimal values are
achieved. When the final modeling phase is completed, a model of high
quality has been built.

1.3.5. Evaluation
Data mining experts evaluate the model. If the model does not
satisfy their expectations, they go back to the modeling phase and rebuild
the model by changing its parameters until optimal values are achieved.
When they are finally satisfied with the model, they can extract business
explanations and evaluate the following questions:
Does the model achieve the business objective?
Have all business issues been considered?
At the end of the evaluation phase, the data mining experts decide how to
use the data mining results.

1.3.6. Deployment
Data mining experts use the mining results by exporting the results
into database tables or into other applications, for example, spreadsheets.
The Intelligent Miner products assist you in following this process.
You can apply the functions of the Intelligent Miner products
independently, iteratively, or in combination.

2. History of Data Mining

First of all, to understand the history and evolution of data mining,
it is important to identify its foundations and milestones and how
they evolved. None of these processes has a long background, apart
from the theories of several scientific fields such as statistics,
machine learning and artificial intelligence. In later sections, the
foundations and milestones will be expounded in detail. Here we will
look at the relation between statistics, machine learning, artificial
intelligence and data mining, and how that relation evolved.

Data mining's roots are traced back along three family lines:
classical statistics, artificial intelligence, and machine learning.
Statistics is the foundation of most technologies on which data
mining is built, e.g. regression analysis, standard distribution,
standard deviation, standard variance, discriminant analysis, cluster
analysis, and confidence intervals. All of these are used to study
data and data relationships.
Artificial intelligence, or AI, which is built upon heuristics as
opposed to statistics, attempts to apply human-thought-like processing
to statistical problems. Certain AI concepts were adopted by some
high-end commercial products, such as query optimization modules for
Relational Database Management Systems (RDBMS).
Machine learning is the union of statistics and AI. It could be
considered an evolution of AI, because it blends AI heuristics with
advanced statistical analysis. Machine learning attempts to let computer
programs learn about the data they study, such that programs make
different decisions based on the qualities of the studied data, using
statistics for fundamental concepts, and adding more advanced AI
heuristics and algorithms to achieve its goals.
Data mining, in many ways, is fundamentally the adaptation of
machine learning techniques to business applications. Data mining is best
described as the union of historical and recent developments in statistics,
AI, and machine learning. These techniques are then used together to
study data and find previously-hidden trends or patterns within.

2.1 Foundations of Data Mining

Data mining techniques are the result of a long process of research and
product development. This evolution began when business data was first
stored on computers, continued with improvements in data access, and
more recently, generated technologies that allow users to navigate
through their data in real time. Data mining takes this evolutionary
process beyond retrospective data access and navigation to prospective
and proactive information delivery. Data mining is ready for application in
the business community because it is supported by three technologies
that are now sufficiently mature:

Massive data collection
Powerful multiprocessor computers
Data mining algorithms

2.2. Evolution in Data Mining for Business

In the evolution from business data to business information, each new
step has built upon the previous one. For example, dynamic data access
is critical for drill-through in data navigation applications, and the
ability to store large databases is critical to data mining. From the
user's point of view, the four steps listed in Table 2.2.1 were
revolutionary because they allowed new business questions to be
answered accurately and quickly.

Data Collection (1960s)
  Business question: "What was my total revenue in the last five
  years?"
  Enabling technologies: computers, tapes, disks
  Product providers: IBM, CDC
  Characteristics: retrospective, static data delivery

Data Access (1980s)
  Business question: "What were unit sales in New England last March?"
  Enabling technologies: relational databases (RDBMS), Structured
  Query Language (SQL), ODBC
  Product providers: Oracle, Sybase, Informix, IBM, Microsoft
  Characteristics: retrospective, dynamic data delivery at record
  level

Data Warehousing & Decision Support (1990s)
  Business question: "What were unit sales in New England last March?
  Drill down to Boston."
  Enabling technologies: on-line analytic processing (OLAP),
  multidimensional databases, data warehouses
  Product providers: Pilot, Comshare, Arbor, Cognos, Microstrategy
  Characteristics: retrospective, dynamic data delivery at multiple
  levels

Data Mining (Emerging Today)
  Business question: "What's likely to happen to Boston unit sales
  next month? Why?"
  Enabling technologies: advanced algorithms, multiprocessor
  computers, massive databases
  Product providers: Pilot, Lockheed, IBM, SGI, numerous startups
  (nascent industry)
  Characteristics: prospective, proactive information delivery

Table 2.2.1: Steps in the Evolution of Data Mining. [1]

The core components of data mining technology have been under
development for decades, in research areas such as statistics,
artificial intelligence, and machine learning. Today, the maturity of
these techniques, coupled with high-performance relational database
engines and broad data integration efforts, makes these technologies
practical for current data warehouse environments.

2.3. Milestones of Data Mining

Figure 2.3.1: Milestones of data mining, related to its main topics.

The following are major milestones and firsts in the history of data
mining, as well as how it has evolved and blended with data science
and big data.

1763 Thomas Bayes' paper is published posthumously regarding a theorem
for relating current probability to prior probability, now called
Bayes' theorem. It is fundamental to data mining and probability,
since it allows understanding of complex realities based on estimated
probabilities.

1805 Adrien-Marie Legendre and Carl Friedrich Gauss apply regression
to determine the orbits of bodies about the Sun (comets and planets).
The goal of regression analysis is to estimate the relationships among
variables, and the specific method they used in this case is the
method of least squares. Regression is one of the key tools in data
mining.
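The method of least squares can be illustrated with a tiny made-up
data set, fitting a line y = a + b*x by the standard closed-form
formulas:

```python
# Ordinary least squares for a line y = a + b*x (data values invented).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.1, 5.9, 8.1]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Slope: covariance of x and y divided by variance of x.
b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
     / sum((x - mx) ** 2 for x in xs))
a = my - b * mx  # intercept so the line passes through the mean point

print(round(a, 2), round(b, 2))
```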

1936 This is the dawn of the computer age, which makes possible the
collection and processing of large amounts of data. In his 1936 paper
"On Computable Numbers", Alan Turing introduced the idea of a
universal machine capable of performing computations like our
modern-day computers. The modern-day computer is built on the concepts
pioneered by Turing.

1943 Warren McCulloch and Walter Pitts were the first to create a
conceptual model of a neural network. In a paper entitled "A logical
calculus of the ideas immanent in nervous activity", they describe the
idea of a neuron in a network. Each of these neurons can do three
things: receive inputs, process inputs and generate output.

1965 Lawrence J. Fogel formed a new company called Decision Science,
Inc. for applications of evolutionary programming. It was the first
company specifically applying evolutionary computation to solve
real-world problems.

1970s With sophisticated database management systems, it becomes
possible to store and query terabytes and petabytes of data. In
addition, data warehouses allow users to move from a
transaction-oriented way of thinking to a more analytical way of
viewing the data. However, extracting sophisticated insights from
these data warehouses of multidimensional models is very limited.

1975 John Henry Holland wrote "Adaptation in Natural and Artificial
Systems", the ground-breaking book on genetic algorithms. It is the
book that initiated this field of study, presenting the theoretical
foundations and exploring applications.

1980s HNC trademarks the phrase "database mining". The trademark was
meant to protect a product called the DataBase Mining Workstation, a
general-purpose tool for building neural network models that is no
longer available. It is also during this period that sophisticated
algorithms begin to learn relationships from data, allowing subject
matter experts to reason about what the relationships mean.

1989 The term Knowledge Discovery in Databases (KDD) is coined by
Gregory Piatetsky-Shapiro. It is also at this time that he co-founds
the first workshop, also named KDD.

1990s The term "data mining" appears in the database community. Retail
companies and the financial community use data mining to analyze data
and recognize trends, to increase their customer base and to predict
fluctuations in interest rates, stock prices and customer demand.

1992 Bernhard E. Boser, Isabelle M. Guyon and Vladimir N. Vapnik
suggested an improvement on the original support vector machine that
allows for the creation of nonlinear classifiers. Support vector
machines are a supervised learning approach that analyzes data and
recognizes patterns, used for classification and regression analysis.

1993 Gregory Piatetsky-Shapiro starts the newsletter Knowledge
Discovery Nuggets (KDnuggets). It was originally meant to connect
researchers who attended the KDD workshop. However, KDnuggets.com now
reaches a much wider audience.

2001 Although the term "data science" has existed since the 1960s, it
wasn't until 2001 that William S. Cleveland introduced it as an
independent discipline. As per "Building Data Science Teams", DJ Patil
and Jeff Hammerbacher later used the term to describe their roles at
LinkedIn and Facebook.

2015 In February 2015, DJ Patil became the first Chief Data Scientist at
the White House. Today, data mining is widespread in business, science,
engineering and medicine just to name a few. Mining of credit card
transactions, stock market movements, national security, genome
sequencing and clinical trials are just the tip of the iceberg for data mining
applications.

Present (2016) Finally, one of the most active techniques being explored
today is Deep Learning. Capable of capturing dependencies and
complex patterns far beyond other techniques, it is reigniting some of the
biggest challenges in the world of data mining, data science and artificial
intelligence. [2]

3. Scope of Data Mining

In this section, the scope will be examined according to the relations
between transactional and analytical systems, the levels of analysis,
and the tasks of data mining. Then the usage of data mining in
academia and in business will be explained.

While large-scale information technology has been evolving into
separate transaction and analytical systems, data mining provides the
link between the two, and mining software has been developed
continuously to serve it. Data mining software analyzes relationships
and patterns in stored transaction data based on open-ended user
queries. Several types of analytical software are available, such as
statistical, machine learning, and neural networks. Mostly, any of
four types of relationships are sought:

Classes: Stored data is used to locate data in predetermined groups.
For example, a restaurant chain could mine customer purchase data to
determine when customers visit and what they typically order. This
information could be used to increase traffic by having daily
specials.

Clusters: Data items are grouped according to logical relationships or
consumer preferences. For example, data can be mined to identify
market segments or consumer affinities.
Associations: Data can be mined to identify associations. The
beer-and-diapers example is a classic example of associative mining.
Sequential patterns: Data is mined to anticipate behavior patterns
and trends. For example, an outdoor equipment retailer could predict
the likelihood of a backpack being purchased based on a consumer's
purchase of sleeping bags and hiking shoes.
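The association relationship above can be quantified with the usual
support and confidence measures; the transactions below are invented
for illustration:

```python
# Sketch of the classic "beer and diapers" association, measured by
# support and confidence over hypothetical transactions.
transactions = [
    {"beer", "diapers", "crisps"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "crisps"},
    {"diapers", "milk"},
]

both = sum(1 for t in transactions if {"beer", "diapers"} <= t)
beer = sum(1 for t in transactions if "beer" in t)

support = both / len(transactions)  # how often the pair occurs at all
confidence = both / beer            # P(diapers | customer bought beer)
print(support, confidence)
```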

Data mining consists of five major elements:

1. Extract, transform, and load transaction data onto the data


warehouse system.
2. Store and manage the data in a multidimensional database system.
3. Provide data access to business analysts and information technology
professionals.
4. Analyze the data by application software.
5. Present the data in a useful format, such as a graph or table.

Different levels of analysis are available:

Artificial neural networks: Non-linear predictive models that learn
through training and resemble biological neural networks in
structure.
Genetic algorithms: Optimization techniques that use processes such
as genetic combination, mutation, and natural selection in a design
based on the concepts of natural evolution.
Decision trees: Tree-shaped structures that represent sets of
decisions. These decisions generate rules for the classification of a
dataset. Specific decision tree methods include Classification and
Regression Trees (CART) and Chi-Square Automatic Interaction Detection
(CHAID). CART and CHAID are decision tree techniques used for
classification of a dataset.
They provide a set of rules that you can apply to a new (unclassified)
dataset to predict which records will have a given outcome. CART
segments a dataset by creating 2-way splits while CHAID segments
using chi square tests to create multi-way splits. CART typically
requires less data preparation than CHAID.
Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k >= 1). Sometimes called the k-nearest neighbor technique.
Rule induction: The extraction of useful if-then rules from data based
on statistical significance.

Data visualization: The visual interpretation of complex relationships
in multidimensional data. Graphics tools are used to illustrate data
relationships.
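As a concrete sketch of the nearest neighbor method listed above, the snippet below classifies a record by majority vote among its k most similar historical records. The dataset, feature pairs, and segment labels are invented for illustration.

```python
from collections import Counter
import math

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among the k nearest training records.

    `train` is a list of ((features...), label) pairs; distance is Euclidean.
    """
    neighbors = sorted(train, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical historical data: (annual spend, visits per month) -> segment
history = [
    ((1200, 8), "loyal"), ((1100, 7), "loyal"), ((900, 6), "loyal"),
    ((150, 1), "casual"), ((200, 2), "casual"), ((100, 1), "casual"),
]

print(knn_classify(history, (1000, 7)))  # record resembling the "loyal" group
```

A new record is compared against all historical ones, and the majority class of its three nearest neighbors becomes its predicted segment.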

3.1. Usage of Data Mining Techniques

Viewed academically, this topic splits into two parts: the techniques used for mining, and significant studies applying those techniques in different areas.

Several major data mining techniques have been developed and used by researchers in recent data mining studies, including association, classification, clustering, prediction, sequential patterns, and decision trees. We will briefly examine these techniques in the following sections.

3.1.1. Association
Association is one of the best-known data mining techniques. In association, a pattern is discovered based on a relationship between items in the same transaction. That is why the association technique is also known as the relation technique. The association technique is used in market basket analysis to identify sets of products that customers frequently purchase together.

Retailers use the association technique to study customers' buying habits. Based on historical sales data, retailers might find that customers often buy crisps when they buy beer; they can therefore place beer and crisps next to each other to save customers time and to increase sales.
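A minimal sketch of this idea, using invented toy baskets: the function below scans transactions for one-to-one rules whose support and confidence clear given thresholds.

```python
from itertools import combinations

def association_rules(transactions, min_support=0.5, min_confidence=0.7):
    """Find simple one-to-one rules (lhs -> rhs) from a list of item sets."""
    n = len(transactions)
    items = {i for t in transactions for i in t}
    rules = []
    for a, b in combinations(sorted(items), 2):
        for lhs, rhs in ((a, b), (b, a)):
            both = sum(1 for t in transactions if lhs in t and rhs in t)
            lhs_count = sum(1 for t in transactions if lhs in t)
            support = both / n
            confidence = both / lhs_count if lhs_count else 0.0
            if support >= min_support and confidence >= min_confidence:
                rules.append((lhs, rhs, round(confidence, 2)))
    return rules

# Hypothetical shopping baskets
baskets = [
    {"beer", "crisps"}, {"beer", "crisps", "bread"},
    {"beer", "crisps"}, {"bread", "milk"},
]
print(association_rules(baskets))
```

For the baskets shown, the only rules passing both thresholds link beer and crisps, mirroring the retailer example above.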

3.1.2. Classification
Classification is a classic data mining technique based on machine learning. Basically, classification is used to assign each item in a set of data to one of a predefined set of classes or groups. The classification method makes use of mathematical techniques such as decision trees, linear programming, neural networks, and statistics. In classification, we develop software that can learn how to classify data items into groups. For example, given all records of employees who left the company, classification can predict who will probably leave in a future period. In this case, we divide the records of employees into two groups named "leave" and "stay", and then ask our data mining software to classify the employees into the separate groups.
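As an illustrative sketch of this leave/stay example (the employee records and the single "years at company" feature are invented), the code below learns the one-feature split that best separates the two groups, which is essentially what a decision tree does at each node.

```python
def learn_threshold(records):
    """Learn the single-feature cut that best separates two classes.

    `records` is a list of (value, label) pairs.
    Returns (cut, label_below_cut, label_at_or_above_cut).
    """
    best = (0.0, None, None, None)  # (accuracy, cut, low_label, high_label)
    for cut in sorted({v for v, _ in records}):
        low = [lbl for v, lbl in records if v < cut]
        high = [lbl for v, lbl in records if v >= cut]
        if not low or not high:
            continue
        low_label = max(set(low), key=low.count)    # majority label below cut
        high_label = max(set(high), key=high.count)  # majority label above cut
        acc = (low.count(low_label) + high.count(high_label)) / len(records)
        if acc > best[0]:
            best = (acc, cut, low_label, high_label)
    return best[1:]

# Hypothetical records: (years at company, outcome)
past = [(0.5, "leave"), (1.0, "leave"), (1.5, "leave"),
        (4.0, "stay"), (5.0, "stay"), (7.0, "stay")]
cut, low_label, high_label = learn_threshold(past)

def predict(years):
    return low_label if years < cut else high_label

print(cut, predict(0.8), predict(6.0))
```

The learned rule ("below four years: leave, otherwise: stay") can then be applied to current employees to flag likely departures.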

3.1.3. Clustering
Clustering is a data mining technique that automatically groups objects with similar characteristics into meaningful or useful clusters. The clustering technique defines the classes and puts objects in each class, while in the classification technique, objects

are assigned into predefined classes. To make the concept clearer, we
can take book management in the library as an example. In a library,
there is a wide range of books on various topics available. The
challenge is how to keep those books in a way that readers can take
several books on a particular topic without hassle. By using the
clustering technique, we can keep books that have some kinds of
similarities in one cluster or one shelf and label it with a meaningful
name. If readers want books on that topic, they need only go to that shelf instead of searching the entire library.
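The shelf analogy can be sketched with a plain k-means implementation; the 2-D points below are invented stand-ins for object features.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on 2-D points; returns the k centroids, sorted."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: (p[0] - centroids[i][0]) ** 2
                                              + (p[1] - centroids[i][1]) ** 2)
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return sorted(centroids)

points = [(1, 1), (1.5, 2), (1, 1.5),   # one tight group
          (8, 8), (8.5, 8), (9, 9)]     # another tight group
print(kmeans(points, 2))
```

The two returned centroids land at the means of the two visible groups; no class labels are supplied in advance, which is the key difference from classification.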

3.1.4. Prediction
Prediction, as its name implies, is a data mining technique that discovers relationships between independent variables and between dependent and independent variables. For instance, the prediction technique can be used in sales to forecast future profit: if we treat sales as an independent variable, profit can be treated as a dependent variable. Then, based on historical sales and profit data, we can draw a fitted regression curve that is used for profit prediction.
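A minimal sketch of fitting such a regression curve, with invented sales and profit figures, using ordinary least squares for a straight line:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

# Hypothetical history: units sold vs. profit (here exactly profit = 2*sales + 5)
sales = [10, 20, 30, 40, 50]
profit = [25, 45, 65, 85, 105]
a, b = fit_line(sales, profit)
print(a, b)          # fitted slope and intercept
print(a * 60 + b)    # predicted profit at 60 units sold
```

Once the line is fitted to history, predicting profit for a new sales level is a single evaluation of the line.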

3.1.5. Sequential Patterns


Sequential pattern analysis is a data mining technique that seeks to discover similar patterns, regular events, or trends in transaction data over a business period.

In sales, with historical transaction data, businesses can identify sets of items that customers buy together at different times of the year. Businesses can then use this information to offer customers better deals on those items, based on their purchasing frequency in the past.
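Under the assumption that each customer history is a time-ordered list of purchases, the sketch below counts how often one item is bought before another across customers; the item names are invented.

```python
from collections import Counter
from itertools import combinations

def frequent_sequences(histories, min_count=2):
    """Count ordered item pairs (a bought before b) across purchase histories."""
    counts = Counter()
    for seq in histories:
        seen = set()
        # combinations() preserves the order of items within each history,
        # so (a, b) always means "a appeared before b".
        for pair in combinations(seq, 2):
            if pair not in seen:
                seen.add(pair)        # count each pair once per customer
                counts[pair] += 1
    return [(pair, c) for pair, c in counts.items() if c >= min_count]

# Hypothetical purchase histories, each ordered by time
histories = [
    ["sleeping bag", "hiking shoes", "backpack"],
    ["sleeping bag", "backpack"],
    ["tent", "sleeping bag", "backpack"],
]
print(frequent_sequences(histories))
```

Here the pattern "sleeping bag, then backpack" recurs across customers, echoing the outdoor-retailer example earlier in the report.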

3.1.6. Decision trees


A decision tree is one of the most commonly used data mining techniques because its model is easy for users to understand. In the decision tree technique, the root of the tree is a simple question or condition that has multiple answers. Each answer then leads to a further set of questions or conditions that help us narrow down the data so that we can make a final decision based on it.
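This question-and-answer structure can be sketched directly as nested dictionaries; the questions and outcomes below are invented for illustration.

```python
# A decision tree as nested dicts: each node asks a question; leaves are outcomes.
tree = {
    "question": "years_at_company < 2?",
    "yes": {"question": "recently_promoted?",
            "yes": "stay",
            "no": "leave"},
    "no": "stay",
}

def decide(node, answers):
    """Walk the tree from the root, following the answer to each question."""
    while isinstance(node, dict):
        node = node["yes"] if answers[node["question"]] else node["no"]
    return node  # a leaf: the final decision

answers = {"years_at_company < 2?": True, "recently_promoted?": False}
print(decide(tree, answers))  # -> leave
```

Each path from the root to a leaf reads as an if-then rule, which is why decision tree models are considered easy to interpret.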

3.2. Data Mining in Academia

Since data mining algorithms were first developed and used in research, studies have diversified and been applied in many different areas.

3.2.1.Science and Engineering

In recent years, data mining has been used widely in the areas of
science and engineering, such as bioinformatics, genetics, medicine,
education and electrical power engineering.

In the study of human genetics, sequence mining helps address the
important goal of understanding the mapping relationship between the
inter-individual variations in human DNA sequence and the variability in
disease susceptibility. In simple terms, it aims to find out how changes in an individual's DNA sequence affect the risk of developing common diseases such as cancer, which is of great importance to
improving methods of diagnosing, preventing, and treating these
diseases. One data mining method that is used to perform this task is
known as multifactor dimensionality reduction.
In the area of electrical power engineering, data mining methods
have been widely used for condition monitoring of high voltage
electrical equipment. The purpose of condition monitoring is to obtain
valuable information on, for example, the status of the insulation (or
other important safety-related parameters). Data clustering techniques
such as the self-organizing map (SOM) have been applied to vibration monitoring and analysis of transformer on-load tap-changers (OLTCs).
Using vibration monitoring, it can be observed that each tap change
operation generates a signal that contains information about the
condition of the tap changer contacts and the drive mechanisms.
Obviously, different tap positions will generate different signals.
However, there was considerable variability amongst normal condition
signals for exactly the same tap position. SOM has been applied to
detect abnormal conditions and to hypothesize about the nature of the
abnormalities.
Data mining methods have been applied to dissolved gas analysis
(DGA) in power transformers. DGA, as a diagnostics for power
transformers, has been available for many years. Methods such as SOM have been applied to analyze generated data and to determine trends
which are not obvious to the standard DGA ratio methods (such as
Duval Triangle).
In educational research, data mining has been used to study the factors leading students to choose to engage in behaviors which reduce their learning, and to understand factors influencing university
student retention. A similar example of social application of data mining
is its use in expertise finding systems, whereby descriptors of human
expertise are extracted, normalized, and classified so as to facilitate the
finding of experts, particularly in scientific and technical fields. In this
way, data mining can facilitate institutional memory.
Data mining methods have also been applied to biomedical data facilitated by domain ontologies, to mining clinical trial data, and to traffic analysis using SOM.
In adverse drug reaction surveillance, the Uppsala Monitoring Centre
has, since 1998, used data mining methods to routinely screen for
reporting patterns indicative of emerging drug safety issues in the WHO
global database of 4.6 million suspected adverse drug reaction
incidents. Recently, similar methodology has been developed to mine
large collections of electronic health records for temporal patterns
associating drug prescriptions to medical diagnoses.

Data mining has been applied to software artifacts within the realm
of software engineering: Mining Software Repositories.
3.2.2. Medical Data Mining

Some machine learning algorithms can be applied in the medical field as second-opinion diagnostic tools and as tools for the knowledge extraction phase in the process of knowledge discovery in databases.
One of these classifiers, the Prototype Exemplar Learning Classifier (PEL-C), is able to discover syndromes as well as atypical clinical cases.

In 2011, in the case of Sorrell v. IMS Health, Inc., the Supreme Court of the United States ruled that pharmacies may share information with outside companies. This practice was authorized under
the 1st Amendment of the Constitution, protecting the "freedom of
speech." However, the passage of the Health Information Technology
for Economic and Clinical Health Act (HITECH Act) helped to initiate the
adoption of the electronic health record (EHR) and supporting
technology in the United States. The HITECH Act was signed into law on
February 17, 2009 as part of the American Recovery and Reinvestment
Act (ARRA) and helped to open the door to medical data mining. Prior to the signing of this law, an estimated 20% of United States-based physicians were using electronic patient records. Søren Brunak notes that the patient record becomes as information-rich as possible and thereby maximizes the data mining opportunities. Hence, electronic patient records further expand the possibilities of medical data mining, opening the door to a vast source of medical data analysis.

3.2.3. Spatial Data Mining

Spatial data mining is the application of data mining methods to spatial data. The end objective of spatial data mining is to find patterns
in data with respect to geography. So far, data mining and Geographic
Information Systems (GIS) have existed as two separate technologies,
each with its own methods, traditions, and approaches to visualization
and data analysis. Particularly, most contemporary GIS have only very
basic spatial analysis functionality. The immense explosion in
geographically referenced data occasioned by developments in IT,
digital mapping, remote sensing, and the global diffusion of GIS
emphasizes the importance of developing data-driven inductive
approaches to geographical analysis and modeling.

3.2.4. Pattern mining


"Pattern mining" is a data mining method that involves finding
existing patterns in data. In this context patterns often means
association rules. The original motivation for searching association rules
came from the desire to analyze supermarket transaction data, that is,
to examine customer behavior in terms of the purchased products. For
example, an association rule "beer potato chips (80%)" states that
four out of five customers that bought beer also bought potato chips.

In the context of pattern mining as a tool to identify terrorist activity,
the National Research Council provides the following definition:
"Pattern-based data mining looks for patterns (including anomalous
data patterns) that might be associated with terrorist activity these
patterns might be regarded as small signals in a large ocean of noise."
Pattern mining includes new areas such as Music Information Retrieval (MIR), where patterns seen in both the temporal and non-temporal domains are imported into classical knowledge discovery search methods.

3.2.5. Human Rights


Data mining of government records, particularly records of the justice system (i.e., courts, prisons), enables the discovery of systemic
human rights violations in connection to generation and publication of
invalid or fraudulent legal records by various government agencies.

3.2.6. Sensor Data Mining


Wireless sensor networks can be used for facilitating the collection of
data for spatial data mining for a variety of applications such as air
pollution monitoring. A characteristic of such networks is that nearby
sensor nodes monitoring an environmental feature typically register
similar values. This kind of data redundancy due to the spatial
correlation between sensor observations inspires techniques for in-network data aggregation and mining. By measuring the spatial correlation between data sampled by different sensors, a wide class of specialized algorithms can be designed to perform spatial data mining more efficiently.

3.3. Data Mining in Business

In business, data mining is the analysis of historical business activities, stored as static data in data warehouse databases. The goal
is to reveal hidden patterns and trends. Data mining software uses
advanced pattern recognition algorithms to sift through large amounts
of data to assist in discovering previously unknown strategic business
information. Examples of how businesses use data mining include performing market analysis to identify new product bundles, finding the root cause of manufacturing problems, preventing customer attrition, acquiring new customers, cross-selling to existing customers, and profiling customers more accurately.

In today's world, raw data is being collected by companies at an exploding rate. For example, Walmart processes over 20 million point-of-sale transactions every day. This information is stored in a
centralized database, but would be useless without some type of data
mining software to analyze it. If Walmart analyzed their point-of-sale
data with data mining techniques they would be able to determine
sales trends, develop marketing campaigns, and more accurately
predict customer loyalty.

Categorization of the items available in the e-commerce site is a
fundamental problem. A correct item categorization system is essential for the user experience, as it helps determine the items relevant to the user for search and browsing. Item categorization can be formulated as a
supervised classification problem in data mining where the categories
are the target classes and the features are the words composing some
textual description of the items. One approach is first to find groups of similar items and place them together in a latent group. Given a new item, it is first classified into a latent group (coarse-level classification); a second round of classification then finds the category to which the item belongs.
Every time a credit card or a store loyalty card is used, or a warranty card is filled in, data is collected about the user's behavior. Many people find the amount of information that companies such as Google, Facebook, and Amazon store about us disturbing
and are concerned about privacy. Although there is the potential for our
personal data to be used in harmful, or unwanted, ways it is also being
used to make our lives better. For example, Ford and Audi hope to one
day collect information about customer driving patterns so they can
recommend safer routes and warn drivers about dangerous road
conditions.
Data mining in customer relationship management (CRM)
applications can contribute significantly to the bottom line. Rather than
randomly contacting a prospect or customer through a call center or
sending mail, a company can concentrate its efforts on prospects that
are predicted to have a high likelihood of responding to an offer. More
sophisticated methods may be used to optimize resources across
campaigns so that one may predict to which channel and to which offer
an individual is most likely to respond (across all potential offers).
Additionally, sophisticated applications could be used to automate
mailing. Once the results from data mining (potential
prospect/customer and channel/offer) are determined, this
"sophisticated application" can either automatically send an e-mail or a
regular mail. Finally, in cases where many people will take an action
without an offer, "uplift modeling" can be used to determine which
people have the greatest increase in response if given an offer. Uplift
modeling thereby enables marketers to focus mailings and offers on
persuadable people, and not to send offers to people who will buy the
product without an offer. Data clustering can also be used to
automatically discover the segments or groups within a customer data
set.
Businesses employing data mining may see a return on investment, but they also recognize that the number of predictive models can
quickly become very large. For example, rather than using one model
to predict how many customers will churn, a business may choose to
build a separate model for each region and customer type. In situations
where a large number of models need to be maintained, some
businesses turn to more automated data mining methodologies.
Data mining can be helpful to human resources (HR) departments in
identifying the characteristics of their most successful employees.
Information such as the universities attended by highly successful employees can help HR focus recruiting efforts accordingly.
Additionally, Strategic Enterprise Management applications help a
company translate corporate-level goals, such as profit and margin
share targets, into operational decisions, such as production plans and
workforce levels.
Market basket analysis relates to data mining use in retail sales. If a clothing store records the purchases of customers, a data mining system could identify those customers who favor silk shirts over cotton ones. Although some relationships may be difficult to explain, taking advantage of them is easier. The example deals with association rules within transaction-based data. Not all data are transaction-based, and logical or inexact rules may also be present within a database.
Market basket analysis has been used to identify the purchase patterns of the Alpha Consumer. Analyzing the data collected on this type of user has allowed companies to predict future buying trends and forecast supply demands.
Data mining is a highly effective tool in the catalog marketing industry. Catalogers have a rich database of history of their
customer transactions for millions of customers dating back a number
of years. Data mining tools can identify patterns among customers and
help identify the most likely customers to respond to upcoming mailing
campaigns.
Data mining for business applications can be integrated into a
complex modeling and decision making process. LIONsolver uses
Reactive business intelligence (RBI) to advocate a "holistic" approach
that integrates data mining, modeling, and interactive visualization into
an end-to-end discovery and continuous innovation process powered by
human and automated learning.
In the area of decision making, the RBI approach has been used to
mine knowledge that is progressively acquired from the decision maker,
and then self-tune the decision method accordingly. The relation
between the quality of a data mining system and the amount of
investment that the decision maker is willing to make was formalized by
providing an economic perspective on the value of extracted
knowledge in terms of its payoff to the organization. This decision-
theoretic classification framework was applied to a real-world
semiconductor wafer manufacturing line, where decision rules for
effectively monitoring and controlling the semiconductor wafer
fabrication line were developed.[3]

4.Future of Data Mining

Over recent years data mining has been establishing itself as one of
the major disciplines in computer science with growing industrial
impact. Undoubtedly, research in data mining will continue and even increase over the coming decades. In this section we will examine future trends and applications of data mining.

4.1. Distributed/Collective Data Mining (DDM)


One area of data mining which is attracting a good amount of
attention is that of distributed and collective data mining. Much of the
data mining which is being done currently focuses on a database or
data warehouse of information which is physically located in one place.
However, information may be spread across different physical locations. This is known generally as distributed data mining (DDM). The goal, therefore, is to effectively mine distributed data located in heterogeneous sites.
Examples of this include biological information located in different
databases, data which comes from the databases of two different firms,
or analysis of data from different branches of a corporation, the
combining of which would be an expensive and time-consuming
process.
Distributed data mining (DDM) offers an alternative to traditional analysis approaches by combining localized data analysis with a global data model. More specifically, this means performing local data analysis to generate partial data models, and then combining the local data models from different data sites in order to develop the global model. This
global model combines the results of the separate analyses. Often the
global model produced, especially if the data in different locations has
different features or characteristics, may become incorrect or
ambiguous. This problem is especially critical when the data in
distributed sites is heterogeneous rather than homogeneous.

4.2. Ubiquitous Data Mining (UDM)


The advent of laptops, palmtops, cell phones, and wearable computers is making ubiquitous access to large quantities of data possible. Advanced analysis of data for extracting useful knowledge is the next natural step in the world of ubiquitous computing. Accessing and analyzing data from a ubiquitous computing device offers many challenges. For example, UDM introduces additional cost due to communication, computation, security, and other factors. So one of the objectives of UDM is to mine data while minimizing the cost of ubiquitous presence.

4.3. Hypertext and Hypermedia Data Mining

Hypertext and hypermedia data mining can be characterized as
mining data which includes text, hyperlinks, text mark-ups, and various
other forms of hypermedia information. As such, it is closely related to both web mining and multimedia mining, which are quite close to it in terms of content and applications. While the World Wide Web is
substantially composed of hypertext and hypermedia elements, there
are other kinds of hypertext/hypermedia data sources which are not
found on the web. Examples of these include the information found in
online catalogues, digital libraries, online information databases, and the like. Some of the important data mining techniques used for hypertext and hypermedia data mining include classification (supervised learning), clustering (unsupervised learning), semi-structured learning, and social network analysis.

In the case of classification, or supervised learning, the process


starts off by reviewing training data in which items are marked as being
part of a certain class or group. This data is the basis from which the
algorithm is trained. One application of classification is in the area of
web topic directories, which can group similar sounding or spelled
terms into appropriate categories, so that searches will not bring up
inappropriate sites and pages.

Semi-supervised learning and social network analysis are other


methods which are important to hypermedia-based data mining. Semi-supervised learning is the case where there are both labelled and unlabeled documents, and there is a need to learn from both types of
documents. Social network analysis is also applicable because the web can be considered a social network; such analysis examines networks formed through collaborative association, whether between friends, academics doing research or serving on committees, or papers linked through references and citations.

4.4. Multimedia Data Mining


Multimedia Data Mining is the mining and analysis of various types
of data, including images, video, audio, and animation. As multimedia
data mining incorporates the areas of text mining, as well as
hypertext/hypermedia mining, these fields are closely related. Much of
the information describing these other areas also applies to multimedia
data mining. This field is also rather new, but holds much promise for
the future. Multimedia information, because of its nature as a large collection of multimedia objects, must be represented differently from
conventional forms of data. One approach is to create a multimedia
data cube which can be used to convert multimedia-type data into a
form which is suited to analysis using one of the main data mining
techniques, but taking into account the unique characteristics of the
data.

4.5. Time Series/Sequence Data Mining


Another important area in data mining centres on the mining of time
series and sequence-based data. Simply put, this involves the mining of
a sequence of data, which can either be referenced by time (time series, such as stock market and production process data) or simply ordered in a meaningful sequence. In general, one
aspect of mining time series data focuses on the goal of identifying
movements or components which exist within the data (trend analysis). These can include long-term trend movements, seasonal variations, cyclical variations, and random movements.
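Trend analysis of this kind often starts with a simple moving average, which smooths out random movements so the longer-term trend shows through; the monthly figures below are invented.

```python
def moving_average(series, window=3):
    """Smooth a numeric series to expose its trend component."""
    return [
        round(sum(series[i:i + window]) / window, 2)
        for i in range(len(series) - window + 1)
    ]

# Hypothetical monthly production figures: noise around a rising trend
monthly = [10, 13, 11, 14, 16, 15, 18, 20]
print(moving_average(monthly))
```

The smoothed values climb steadily even though the raw series jumps up and down, separating the trend movement from the random movement.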

Sequential pattern mining focuses on identifying sequences which occur frequently in a time series or sequence of data. This is particularly useful in the analysis of customers, where certain buying patterns can be identified, such as the likely follow-up purchase after a customer buys a certain electronics item or computer.

Conclusion

I began this report with the aim of presenting the various aspects of data mining as a whole. All the information given here was assembled from different sources to form a complete piece of research; clearly, some of the portions presented are not entirely original. The hypotheses and methods are explained accurately, and trends and future directions are presented as currently as possible. The statistical and theoretical data were checked carefully and verified with the help of various sources.
In addition, we can all see the importance of data mining in an increasingly globalized world. There are many techniques, studies, and software products that make life easier and increase companies' market values. In particular, enterprise resource planning and customer relationship management software, both built on data mining, have lately been taking a growing share of company budgets. Since data mining also underlies business intelligence, we will see many more studies related to it in the future.
I hope this research report can be beneficial for its readers and for people who are curious about data mining.

Glossary

cluster analysis: or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).

anomaly detection: anomaly detection (also outlier detection) is the


identification of items, events or observations which do not conform to an
expected pattern or other items in a dataset.

association rule mining: a method for discovering interesting relations between variables in large databases.

predictive analytics: encompasses a variety of statistical techniques from predictive modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future or otherwise unknown events.

classification: is the problem of identifying to which of a set of categories


(sub-populations) a new observation belongs, on the basis of a training set
of data containing observations (or instances) whose category
membership is known.
data warehouse: in computing, a data warehouse (DW or DWH), also
known as an enterprise data warehouse (EDW), is a system used for
reporting and data analysis, and is considered a core component of
business intelligence
time series analysis: comprises methods for analyzing time series data
in order to extract meaningful statistics and other characteristics of the
data.
threshold value: a cutoff level against which a computed score or measurement is compared in order to decide whether an observation or pattern is significant.

LIONsolver: an integrated software product for data mining, business intelligence, analytics, and modeling (LION: Learning and Intelligent OptimizatioN).

Reactive business intelligence (RBI): advocates a holistic approach that integrates data mining, modeling, and interactive visualization into an end-to-end discovery and continuous innovation process powered by human and automated learning.

VLSI Test: very large scale integration test.


IC: integrated circuit.

References
[1] www.thearling.com
[2] www.rayli.net
[3] www.wikipedia.org
[4] http://www.ibm.com/support/knowledgecenter/
[5] https://www.linkedin.com/pulse/what-does-future-hold-data-mining-thiensi-le
[6] http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm
[7] https://webdocs.cs.ualberta.ca/~zaiane/courses/cmput690/materials.shtml#dataware
[8] http://www.cs.bu.edu/~gkollios/dm07/lectnotes.html
[9] http://searchsqlserver.techtarget.com/definition/data-mining
[10] Pang-Ning Tan (Michigan State University), Michael Steinbach (University of Minnesota), Vipin Kumar (University of Minnesota), Introduction to Data Mining, March 25, 2006
[11] Dr. Sanjay Ranka (Professor, Computer and Information Science and Engineering, University of Florida, Gainesville), Introduction to Data Mining
