1. INTRODUCTION
1.1 SPATIAL DATA
Spatial data, also known as geospatial data or geographic information, is data that identifies the geographic location of features and boundaries on Earth, such as natural or constructed features, oceans, and more. Spatial data is usually stored as coordinates and topology, and is data that can be mapped. Spatial data is often accessed, manipulated, or analyzed through Geographic Information Systems (GIS).
According to [Ries1993], spatial information can be divided into three primary architectures: data, function, and organization. Data architecture describes the objects used as input to, or created by, a function. Functional architecture describes the activities performed between two types of data. Organizational architecture describes the mission, policies, and rules, which determine and shape the former two. By exploring each of these architectures,
one can develop a framework to establish spatial design requirements. Spatial
data can be displayed in different ways: point, line, polygon, surface, volume, and
pixel. Each of these display mechanisms has three components: location, shape,
and topology. These components help define the scope of the design
requirements for each Location Control Management level. Along with data, its
properties are equally important. Data properties describe the quality, or
condition of the spatial data.
A spatial preference query ranks objects based on the qualities of features in their spatial neighborhood. For example, using a real estate agency database of flats for lease, a customer may want to rank the flats with respect to the appropriateness of their location, defined by aggregating the qualities of other features (e.g., restaurants, cafes, hospitals, markets) within their spatial neighborhood. Such a neighborhood concept can be specified by the user via different functions. It can be an explicit circular region within a given distance from the flat. Another intuitive definition is to consider the whole spatial domain and assign higher weights to the features based on their proximity to the flat.
1.2 SPATIAL QUERY EVALUATION
In this project, we formally define spatial preference queries and propose appropriate indexing techniques and search algorithms for them. Extensive evaluation of our methods on both real and synthetic data reveals that an optimized branch-and-bound solution is efficient and robust with respect to different parameters. Spatial database systems manage large collections of geographic entities, which, apart from spatial attributes, contain non-spatial information (e.g., name, size, type, price, etc.). In this project, we study an interesting type of preference query, which selects the best spatial location with respect to the quality of facilities in its spatial neighborhood.
Given a set D of interesting objects (e.g., candidate locations), a top-k spatial
preference query retrieves the k objects in D with the highest scores. The score of
an object is defined by the quality of features (e.g., facilities or services) in its
spatial neighborhood. As a motivating example, consider a real estate agency
office that holds a database with available flats for lease. Here, a feature refers to a class of objects in a spatial map, such as specific facilities or services.
A customer may want to rank the contents of this database with respect to the quality of their locations, quantified by aggregating non-spatial characteristics of other features (e.g., restaurants, cafes, hospitals, markets) in the spatial neighborhood of the flat (defined by a spatial range around it). Quality may be subjective and query-parametric. For example, a user may define quality with respect to non-spatial attributes of the restaurants around a flat (e.g., whether they serve seafood, their price range, etc.).
As another example, a user (e.g., a tourist) may wish to find a hotel p that is close to a high-quality restaurant and a high-quality cafe.
FIGURE 1.1 EXAMPLES OF TOP-K SPATIAL PREFERENCE QUERIES
Figure 1.1 illustrates the locations of an object dataset D (hotels) in white and two feature datasets: the set F1 (restaurants) in gray and the set F2 (cafes) in black. Feature points are labeled with quality values that can be obtained from rating providers.
1.3 COMBINATION OF QUALITIES AND RELATIVE LOCATION OF POINTS
A simple score instance, called the range score, binds the neighborhood region to a circular region centered at p with a given radius (shown as a circle), and the aggregate function to SUM. For instance, the maximum qualities of the gray and black points within the circle of p1 are 0.9 and 0.6 respectively, so the score of p1 is score(p1) = 0.9 + 0.6 = 1.5. Similarly, we obtain score(p2) = 1.0 + 0.1 = 1.1 and score(p3) = 0.7 + 0.7 = 1.4. Hence, the hotel p1 is returned as the top result. In fact, the semantics of the aggregate function is relevant to the user's query. The SUM function attempts to balance the overall qualities of all features. For the MIN function, the top result becomes p3, with score(p3) = MIN{0.7, 0.7} = 0.7. It ensures that the top result has reasonably high qualities in all features. For the MAX function, the top result is p2, with score(p2) = MAX{1.0, 0.1} = 1.0. It is used to optimize the quality in a particular feature, but not necessarily in all of them.
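To make the range score concrete, the following is a minimal Python sketch. The coordinates, quality values, and radius are invented assumptions that loosely mirror Figure 1.1, chosen so that the scores above are reproduced; this is an illustration, not the project's actual implementation.

    import math

    # Hypothetical hotel locations and feature sets, loosely mirroring Figure 1.1.
    hotels = {"p1": (2.0, 2.0), "p2": (6.0, 2.0), "p3": (4.0, 6.0)}
    restaurants = [((1.5, 2.5), 0.9), ((6.5, 1.5), 1.0), ((4.5, 6.5), 0.7)]  # F1
    cafes = [((2.5, 1.5), 0.6), ((5.5, 2.5), 0.1), ((3.5, 5.5), 0.7)]        # F2

    def component_score(p, feature_set, radius):
        # Best quality among the features of one set within 'radius' of p.
        return max((q for loc, q in feature_set if math.dist(p, loc) <= radius),
                   default=0.0)

    def range_score(p, feature_sets, radius, agg=sum):
        # Aggregate the per-feature-set component scores with SUM, MIN, or MAX.
        return agg(component_score(p, fs, radius) for fs in feature_sets)

    for name, p in hotels.items():
        print(name, round(range_score(p, [restaurants, cafes], radius=1.5), 2))
    # SUM ranks p1 first (1.5); agg=min ranks p3 first (0.7); agg=max ranks p2 first (1.0).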
Object ranking is a popular retrieval task in various applications. In relational databases, we rank tuples using an aggregate score function on their attribute values. For example, a real estate agency maintains a database that contains information about flats available for rent. A potential customer wishes to view the top-10 flats with the largest sizes and lowest prices. In this case, the score of each flat is expressed by the sum of two qualities: size and price, after normalization to the domain [0, 1] (e.g., 1 means the largest size and the lowest price). In spatial databases, ranking is often associated with nearest neighbor (NN) retrieval. Given a query location, we are interested in retrieving the set of nearest objects to it that satisfy a condition (e.g., restaurants).
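As a rough illustration of this attribute-based ranking, the sketch below uses an invented list of flats and min-max normalization to map size and price to [0, 1] before summing them; both the sample data and the normalization scheme are assumptions made for illustration.

    # Hypothetical flats: (name, size in square metres, monthly rent).
    flats = [("A", 85, 1200), ("B", 60, 800), ("C", 110, 2000), ("D", 70, 950)]

    def normalize(value, lo, hi, invert=False):
        # Min-max normalization to [0, 1]; invert so lower raw values score higher.
        x = (value - lo) / (hi - lo) if hi > lo else 1.0
        return 1.0 - x if invert else x

    sizes = [s for _, s, _ in flats]
    prices = [p for _, _, p in flats]

    def score(flat):
        _, size, price = flat
        return (normalize(size, min(sizes), max(sizes))                     # larger is better
                + normalize(price, min(prices), max(prices), invert=True))  # cheaper is better

    top_k = sorted(flats, key=score, reverse=True)[:2]
    print([name for name, _, _ in top_k])  # ['A', 'D'] for this toy data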
Assuming that the set of interesting objects is indexed by an R-tree, we can apply distance bounds and traverse the index in a branch-and-bound fashion to obtain the answer. Nevertheless, it is not always possible to use multidimensional indexes for top-k retrieval. First, such indexes break down in high-dimensional spaces.
Second, top-k queries may involve an arbitrary set of user-specified attributes (e.g., size and price) from the possible ones (e.g., size, price, distance to the beach, number of bedrooms, floor, etc.), and indexes may not be available for all possible attribute combinations (i.e., they are too expensive to create and maintain).
Third, information for the different rankings to be combined (i.e., for different attributes) could appear in different databases (in a distributed database scenario), and unified indexes may not exist for them. Solutions for top-k queries focus on the efficient merging of object rankings that may arrive from different (distributed) sources. Their motivation is to minimize the number of accesses to the input rankings until the objects with the top-k aggregate scores have been identified. To achieve this, upper and lower bounds for the objects seen so far are maintained while scanning the sorted lists.
The most popular spatial access method is the R-tree, which indexes minimum bounding rectangles (MBRs) of objects. R-trees can efficiently process the main spatial query types, including spatial range queries, nearest neighbor queries, and spatial joins. Given a spatial region W, a spatial range query retrieves from D the objects that intersect W.
For instance, consider a range query that asks for all objects within the shaded area. Starting from the root of the tree, the query is processed by recursively following the entries whose MBRs intersect the query region. For instance, e1 does not intersect the query region, so its subtree is pruned from the search.
FIGURE 1.2 SPATIAL QUERIES ON R-TREES
Figure 1.2 shows a set D = {p1, p2, ..., p8} of spatial objects (e.g., points) and an R-tree that indexes them.
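The recursive pruning described above can be sketched as follows. This is a simplified illustration with a hand-built two-level tree, not a full R-tree implementation (insertion and node splitting are omitted, and the node layout is an assumption).

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        # A simplified R-tree node: an MBR plus child nodes (inner) or points (leaf).
        mbr: tuple                                     # (xmin, ymin, xmax, ymax)
        children: list = field(default_factory=list)
        points: list = field(default_factory=list)

    def intersects(a, b):
        # True if rectangles a and b overlap.
        return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

    def range_query(node, window, result):
        # Follow only the entries whose MBRs intersect the query window.
        if not intersects(node.mbr, window):
            return  # pruned, like entry e1 in Figure 1.2
        result.extend(p for p in node.points
                      if window[0] <= p[0] <= window[2] and window[1] <= p[1] <= window[3])
        for child in node.children:
            range_query(child, window, result)

    leaf1 = Node(mbr=(0, 0, 3, 3), points=[(1, 1), (2, 2)])
    leaf2 = Node(mbr=(5, 5, 8, 8), points=[(6, 6), (7, 7)])
    root = Node(mbr=(0, 0, 8, 8), children=[leaf1, leaf2])
    hits = []
    range_query(root, (4, 4, 9, 9), hits)
    print(hits)  # [(6, 6), (7, 7)]; leaf1's subtree is never visited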
2. DATA MINING
2.1 MOTIVATION
The major reason that data mining has attracted a great deal of attention in the information industry in recent years is the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge. The information and knowledge gained can be used
for applications ranging from business management, production control, and
market analysis, to engineering design and science exploration. Data mining can
be viewed as a result of the natural evolution of information technology. An
evolutionary path has been witnessed in the database industry in the
development of the following functionalities:
Data collection and database creation, data management (including data
storage and retrieval, and database transaction processing)
Data analysis and understanding (involving data warehousing and data
mining).
For instance, the early development of data collection and database creation mechanisms served as a prerequisite for the later development of effective mechanisms for data storage and retrieval, and query and transaction processing. With numerous database systems offering query and transaction processing as common practice, data analysis and understanding has naturally become the next target.
Since the 1960s, database and information technology has been evolving systematically from primitive file processing systems to sophisticated and powerful database systems. The research and development in database systems since the 1970s has led to relational database systems (where data are stored in relational table structures), data modeling tools, and indexing and data organization techniques. In addition, users gained convenient and flexible data access through query languages, query processing, and user interfaces. Methods for on-line transaction processing (OLTP), where a query is viewed as a read-only transaction, have contributed substantially to the evolution and wide acceptance of relational technology as a major tool for efficient storage, retrieval, and management of large amounts of data.
Database technology since the mid-1980s has been characterized by the popular
adoption of relational technology and an upsurge of research and development
activities on new and powerful database systems. These employ advanced data models such as extended-relational, object-oriented, object-relational, and deductive models. Application-oriented database systems, including spatial, temporal, multimedia, active, and scientific databases, knowledge bases, and office information bases, have flourished. Issues related to the distribution,
diversification, and sharing of data have been studied extensively. Heterogeneous
database systems and Internet-based global information systems such as the
World-Wide Web (WWW) also emerged and play a vital role in the information
industry.
The steady and amazing progress of computer hardware technology in the past three decades has led to powerful, affordable, and large supplies of computers, data collection equipment, and storage media. This technology provides a great
boost to the database and information industry, and makes a huge number of
databases and information repositories available for transaction management,
information retrieval, and data analysis. Data can now be stored in many different
kinds of databases and information repositories.
One data repository architecture that has emerged is the data warehouse, a
repository of multiple heterogeneous data sources, organized under a unified
schema at a single site in order to facilitate management decision making. Data
warehouse technology includes data cleaning, data integration, and On-Line
Analytical Processing (OLAP), that is, analysis techniques with functionalities such
as summarization, consolidation, and aggregation, as well as the ability to view
information from different angles. Although OLAP tools support multidimensional
analysis and decision making, additional data analysis tools are required for in-
depth analysis, such as data classification, clustering, and the characterization of
data changes over time. In addition, huge volumes of data can be accumulated
beyond databases and data warehouses. Typical examples include the World Wide Web and data streams, where data flow in and out like streams, as in applications such as video surveillance, telecommunications, and sensor networks. The
effective and efficient analysis of data in such different forms becomes a
challenging task. The abundance of data, coupled with the need for powerful data
analysis tools, has been described as a data rich but information poor situation.
The fast-growing, tremendous amount of data, collected and stored in large and
numerous data repositories, has far exceeded our human ability for
comprehension without powerful tools.
As a result, data collected in large data repositories become "data tombs": data archives that are seldom visited. Consequently, important decisions are often made based not on the information-rich data stored in data repositories but rather on a decision maker's intuition, simply because the decision maker
does not have the tools to extract the valuable knowledge embedded in the vast
amounts of data. In addition, consider expert system technologies, which typically
rely on users or domain experts to manually input knowledge into knowledge
bases. Unfortunately, this procedure is prone to biases and errors, and is
extremely time-consuming and costly. Data mining tools perform data analysis
and may uncover important data patterns, contributing greatly to business
strategies, knowledge bases, and scientific and medical research. The widening
gap between data and information calls for a systematic development of data
mining tools that will turn data tombs into golden nuggets of knowledge.
2.2 WHAT IS DATA MINING?
Simply stated, data mining refers to extracting or mining knowledge from large
amounts of data. The term is actually a misnomer. Remember that the mining of
gold from rocks or sand is referred to as gold mining rather than rock or sand
mining. Thus, data mining should have been more appropriately named
"knowledge mining from data," which is unfortunately somewhat long. "Knowledge mining," a shorter term, may not reflect the emphasis on mining from large amounts of data. Nevertheless, mining is a vivid term characterizing the process that finds a small set of precious nuggets from a great deal of raw material. Thus, such a misnomer that carries both "data" and "mining" became a popular choice. Many other terms carry a similar or slightly different meaning to data mining, such as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging.
Many people treat data mining as a synonym for another popularly used
term, Knowledge Discovery from Data, or KDD. Alternatively, others view data
mining as simply an essential step in the process of knowledge discovery.
Knowledge discovery is a process that consists of an iterative sequence of the following steps (a toy sketch of this pipeline appears after the list):
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the
database)
4. Data transformation (where data are transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations, for
instance)
5. Data mining (an essential process where intelligent methods are applied in
order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge, based on some interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation
techniques are used to present the mined knowledge to the user)
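As a toy illustration of steps 1 to 5, the following sketch cleans, selects, transforms, and "mines" an invented list of purchase records with a trivial frequency threshold; the record layout and the threshold are assumptions made purely for illustration.

    from collections import Counter

    # Hypothetical raw records: (customer_id, item, price); None marks missing data.
    raw = [(1, "bread", 2.0), (1, "milk", None), (2, "bread", 2.1),
           (2, "butter", 3.0), (3, "bread", 2.0), (3, "milk", 1.5)]

    # Step 1, data cleaning: drop records with missing values.
    cleaned = [r for r in raw if None not in r]

    # Steps 2-3, integration and selection: keep only the attribute of interest.
    items = [item for _, item, _ in cleaned]

    # Step 4, transformation: consolidate into per-item counts.
    counts = Counter(items)

    # Step 5, mining: report items bought at least twice (frequent patterns).
    print({item: n for item, n in counts.items() if n >= 2})  # {'bread': 3}

    # Steps 6-7, pattern evaluation and presentation, are left to the analyst.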
Steps 1 to 4 are different forms of data preprocessing, where the data are
prepared for mining. The data mining step may interact with the user or a
knowledge base. The interesting patterns are presented to the user, and may be
stored as new knowledge in the knowledge base. Note that according to this
view, data mining is only one step in the entire process, albeit an essential one
because it uncovers hidden patterns for evaluation. We agree that data mining is
a step in the knowledge discovery process. However, in industry, in media, and in
the database research milieu, the term data mining is becoming more popular
than the longer term of knowledge discovery from data. Therefore, in this book,
we choose to use the term data mining. We adopt a broad view of data mining
functionality: data mining is the process of discovering interesting knowledge
from large amounts of data stored in databases, data warehouses, or other
information repositories.
Based on this view, the architecture of a typical data mining system may have the following major components:
Figure 2.1 Architecture of a typical data mining system.
Database, data warehouse, World Wide Web, or other information repository:
This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.
Database or data warehouse server: The database or data warehouse server is responsible for fetching the relevant data, based on the user's data mining request.
Knowledge base: This is the domain knowledge that is used to guide the
search or evaluate the interestingness of resulting patterns. Such
knowledge can include concept hierarchies, used to organize attributes or
attribute values into different levels of abstraction. Knowledge such as user
beliefs, which can be used to assess a pattern's interestingness based on its
unexpectedness, may also be included. Other examples of domain
knowledge are additional interestingness constraints or thresholds, and
metadata (e.g., describing data from multiple heterogeneous sources).
Data mining engine: This is essential to the data mining system and ideally
consists of a set of functional modules for tasks such as characterization,
association and correlation analysis, classification, prediction, cluster
analysis, outlier analysis, and evolution analysis.
Pattern evaluation module: This component typically employs
interestingness measures and interacts with the data mining modules so as
to focus the search toward interesting patterns. It may use interestingness
thresholds to filter out discovered patterns. Alternatively, the pattern
evaluation module may be integrated with the mining module, depending
on the implementation of the data mining method used. For efficient data
mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining process, so as to confine the search to only the interesting patterns.
User interface: This module communicates between users and the data
mining system, allowing the user to interact with the system by specifying a
data mining query or task, providing information to help focus the search,
and performing exploratory data mining based on the intermediate data
mining results. In addition, this component allows the user to browse
database and data warehouse schemas or data structures, evaluate mined
patterns, and visualize the patterns in different forms. From a data
warehouse perspective, data mining can be viewed as an advanced stage of
on-line analytical processing (OLAP). However, data mining goes far beyond
the narrow scope of summarization-style analytical processing of data
warehouse systems by incorporating more advanced techniques for data
analysis.
Although there are many data mining systems on the market, not all of them
can perform true data mining. A data analysis system that does not handle large
amounts of data should be more appropriately categorized as a machine learning
system, a statistical data analysis tool, or an experimental system prototype. A
system that can only perform data or information retrieval, including finding
aggregate values, or that performs deductive query answering in large databases
should be more appropriately categorized as a database system, an information
retrieval system, or a deductive database system.
Data mining involves an integration of techniques from multiple disciplines
such as database and data warehouse technology, statistics, machine learning,
high-performance computing, pattern recognition, neural networks, data
visualization, information retrieval, image and signal processing, and spatial or
temporal data analysis. We adopt a database perspective in our presentation of
data mining in this book. That is, emphasis is placed on efficient and scalable data
mining techniques. For an algorithm to be scalable, its running time should grow
approximately linearly in proportion to the size of the data, given the available
system resources such as main memory and disk space. By performing data
mining, interesting knowledge, regularities, or high-level information can be
extracted from databases and viewed or browsed from different angles. The
discovered knowledge can be applied to decision making, process control,
information management, and query processing. Therefore, data mining is
considered one of the most important frontiers in database and information
systems and one of the most promising interdisciplinary developments in information technology.
2.3 MAJOR ISSUES IN DATA MINING
The major issues in data mining regarding mining methodology, user interaction,
performance, and diverse data types are introduced below:
Mining methodology and user interaction issues: These reflect the kinds of knowledge mined, the ability to mine knowledge at multiple granularities, the use of domain knowledge, ad hoc mining, and knowledge visualization.
Mining different kinds of knowledge in databases: Because different users can
be interested in different kinds of knowledge, data mining should cover a wide
spectrum of data analysis and knowledge discovery tasks, including data
characterization, discrimination, association and correlation analysis,
classification, prediction, clustering, outlier analysis, and evolution analysis (which
includes trend and similarity analysis). These tasks may use the same database in
different ways and require the development of numerous data mining
techniques.
Interactive mining of knowledge at multiple levels of abstraction: Because it is
difficult to know exactly what can be discovered within a database, the data
mining process should be interactive. For databases containing a huge amount of
data, appropriate sampling techniques can first be applied to facilitate interactive
data exploration. Interactive mining allows users to focus the search for patterns,
providing and refining data mining requests based on returned results.
Specifically, knowledge should be mined by drilling down, rolling up, and pivoting
through the data space and knowledge space interactively, similar to what OLAP
can do on data cubes. In this way, the user can interact with the data mining
system to view data and discovered patterns at multiple granularities and from
different angles.
Incorporation of background knowledge: Background knowledge, or information
regarding the domain under study, may be used to guide the discovery process
and allow discovered patterns to be expressed in concise terms and at different
levels of abstraction. Domain knowledge related to databases, such as integrity
constraints and deduction rules, can help focus and speed up a data mining
process, or judge the interestingness of discovered patterns.
Data mining query languages and ad hoc data mining: Relational query
languages (such as SQL) allow users to pose ad hoc queries for data retrieval. In a
similar vein, high-level data mining query languages need to be developed to
allow users to describe ad hoc data mining tasks by facilitating the specification of
the relevant sets of data for analysis, the domain knowledge, the kinds of
knowledge to be mined, and the conditions and constraints to be enforced on the
discovered patterns. Such a language should be integrated with a database or
data warehouse query language and optimized for efficient and flexible data
mining.
Presentation and visualization of data mining results: Discovered knowledge
should be expressed in high-level languages, visual representations, or other
expressive forms so that the knowledge can be easily understood and directly
usable by humans. This is especially crucial if the data mining system is to be
interactive. This requires the system to adopt expressive knowledge
representation techniques, such as trees, tables, rules, graphs, charts, crosstabs,
matrices, or curves.
Handling noisy or incomplete data: The data stored in a database may reflect
noise, exceptional cases, or incomplete data objects. When mining data
regularities, these objects may confuse the process, causing the knowledge model constructed to overfit the data. As a result, the accuracy of the discovered patterns can be poor. Data cleaning methods and data analysis methods that can
handle noise are required, as well as outlier mining methods for the discovery and
analysis of exceptional cases.
Pattern evaluation (the interestingness problem): A data mining system can
uncover thousands of patterns. Many of the patterns discovered may be
uninteresting to the given user, representing common knowledge or lacking
novelty. Several challenges remain regarding the development of techniques to assess the interestingness of discovered patterns, particularly with regard to
subjective measures that estimate the value of patterns with respect to a given
user class, based on user beliefs or expectations. The use of interestingness
measures or user-specified constraints to guide the discovery process and reduce
the search space is another active area of research.
Performance issues: These include efficiency, scalability, and parallelization of
data mining algorithms.
Efficiency and scalability of data mining algorithms: To effectively extract
information from a huge amount of data in databases, data mining algorithms
must be efficient and scalable. In other words, the running time of a data mining
algorithm must be predictable and acceptable in large databases. From a
database perspective on knowledge discovery, efficiency and scalability are key
issues in the implementation of data mining systems. Many of the issues
discussed above under mining methodology and user interaction must also
consider efficiency and scalability.
Parallel, distributed, and incremental mining algorithms: The huge size of many
databases, the wide distribution of data, and the computational complexity of
some data mining methods are factors motivating the development of parallel
and distributed data mining algorithms. Such algorithms divide the data into
partitions, which are processed in parallel. The results from the partitions are
then merged. Moreover, the high cost of some data mining processes promotes the need for incremental data mining algorithms that incorporate database updates without having to mine the entire data again from scratch. Such algorithms perform knowledge modification incrementally to amend and strengthen what was previously discovered.
Handling of relational and complex types of data: Because relational databases
and data warehouses are widely used, the development of efficient and effective
data mining systems for such data is important. However, other databases may
contain complex data objects, hypertext and multimedia data, spatial data,
temporal data, or transaction data. It is unrealistic to expect one system to mine
all kinds of data, given the diversity of data types and different goals of data
mining. Specific data mining systems should be constructed for mining specific
kinds of data.
Therefore, one may expect to have different data mining systems for different
kinds of data.
Mining information from heterogeneous databases and global information
systems:
Local- and wide-area computer networks (such as the Internet) connect many
sources of data, forming huge, distributed, and heterogeneous databases. The
discovery of knowledge from different sources of structured, semi-structured, or unstructured data with diverse data semantics poses great challenges to data
mining. Data mining may help disclose high-level data regularities in multiple
heterogeneous databases that are unlikely to be discovered by simple query
systems and may improve information exchange and interoperability in
heterogeneous databases. Web mining, which uncovers interesting knowledge
about Web contents, Web structures, Web usage, and Web dynamics, becomes a
very challenging and fast evolving field in data mining.
The above issues are considered major requirements and challenges for the
further evolution of data mining technology. Some of the challenges have been
addressed in recent data mining research and development, to a certain extent,
and are now considered requirements, while others are still at the research stage.
The issues, however, continue to stimulate further investigation and
improvement.
2.4 TRENDS IN DATA MINING
1. Application exploration: Early data mining applications focused mainly on
helping businesses gain a competitive edge.
2. Scalable and interactive data mining methods: In contrast with traditional data
analysis methods, data mining must be able to handle huge amounts of data
efficiently and, if possible, interactively.
3. Integration of data mining with database systems, data warehouse systems,
and Web database systems: Database systems, data warehouse systems, and the
Web have become mainstream information processing systems.
4. Standardization of data mining language: A standard data mining language or
other standardization efforts will facilitate the systematic development of data
mining solutions, improve interoperability among multiple data mining systems
and functions, and promote the education and use of data mining systems in
industry and society.
5. Visual data mining: Visual data mining is an effective way to discover knowledge from huge amounts of data.
6. Biological data mining: Although biological data mining can be considered
under application exploration or mining complex types of data, the unique
combination of complexity, richness, size, and importance of biological data
warrants special attention in data mining.
7. Data mining and software engineering: As software programs become increasingly bulky in size, sophisticated in complexity, and increasingly likely to originate from the integration of multiple components developed by different software teams, it is an increasingly challenging task to ensure software robustness and reliability.
8. Web mining: Web mining is the application of data mining techniques to
discover patterns from the Web. According to analysis targets, web mining can be
divided into three different types, which are Web usage mining, Web content
mining and Web structure mining.
9. Distributed data mining: Traditional data mining methods, designed to work at
a centralized location, do not work well in many of the distributed computing
environments present today (e.g., the Internet, intranets, local area networks,
high-speed wireless networks, and sensor networks). Advances in distributed data
mining methods are expected.
10. Real-time or time-critical data mining: Many applications involving stream
data (such as e-commerce, Web mining, stock analysis, intrusion detection,
mobile data mining, and data mining for counterterrorism) require dynamic data
mining models to be built in real time. Additional development is needed in this
area.
11. Graph mining, link analysis, and social network analysis: Graph mining, link
analysis, and social network analysis are useful for capturing sequential,
topological, geometric, and other relational characteristics of many scientific data
sets (such as for chemical compounds and biological networks) and social data sets (such as for the analysis of hidden criminal networks).
12. Multirelational and multidatabase data mining: Most data mining
approaches search for patterns in a single relational table or in a single database.
However, most real world data and information are spread across multiple tables
and databases.
13. New methods for mining complex types of data: Mining complex types of data is an important research frontier in data mining. Although progress has been
made in mining stream, time-series, sequence, graph, spatiotemporal,
multimedia, and text data, there is still a huge gap between the needs for these
applications and the available technology.
14. Privacy protection and information security in data mining: An abundance of
recorded personal information available in electronic forms and on the Web,
coupled with increasingly powerful data mining tools, poses a threat to our privacy and data security.
2.5 APPLICATIONS OF DATA MINING
Data Mining for Financial Data Analysis: a few typical cases:
1. Design and construction of data warehouses for multidimensional data
analysis and data mining
2. Loan payment prediction and customer credit policy analysis
3. Classification and clustering of customers for targeted marketing
4. Detection of money laundering and other financial crimes
Data Mining for the Retail Industry
A few examples of data mining in the retail industry:
1. Design and construction of data warehouses based on the benefits of data
mining
2. Multidimensional analysis of sales, customers, products, time, and region
3. Analysis of the effectiveness of sales campaigns
4. Customer retention: analysis of customer loyalty
5. Product recommendation and cross-referencing of items
Data Mining for the Telecommunication Industry
1. Multidimensional analysis of telecommunication data
2. Fraudulent pattern analysis and the identification of unusual patterns
3. Multidimensional association and sequential pattern analysis
4. Mobile telecommunication services
5. Use of visualization tools in telecommunication data analysis
Data Mining for Biological Data Analysis
1. Semantic integration of heterogeneous, distributed genomic and proteomic
databases
2. Alignment, indexing, similarity search, and comparative analysis of multiple nucleotide/protein sequences
3. Discovery of structural patterns and analysis of genetic networks and
protein pathways
4. Association and path analysis: identifying co-occurring gene sequences and
linking genes to different stages of disease development
5. Visualization tools in genetic data analysis
Data Mining in Other Scientific Applications
Data collection and storage technologies have recently improved, so that
today, scientific data can be amassed at much higher speeds and lower costs. This
has resulted in the accumulation of huge volumes of high-dimensional data,
stream data, and heterogeneous data, containing rich spatial and temporal
information. Consequently, scientific applications are shifting from the "hypothesize-and-test" paradigm toward a "collect and store data, mine for new hypotheses, confirm with data or experimentation" process. This shift brings
about new challenges for data mining.
Challenges:
1. Data warehouses and data preprocessing
2. Mining complex data types
3. Graph-based mining
4. Visualization tools and domain-specific knowledge
Data Mining for Intrusion Detection
The security of our computer systems and data is at continual risk. The extensive
growth of the Internet and increasing availability of tools and tricks for intruding
and attacking networks have prompted intrusion detection to become a critical
component of network administration.
The following are areas in which data mining technology may be applied or
further developed for intrusion detection:
1. Development of data mining algorithms for intrusion detection
2. Association and correlation analysis, and aggregation to help select and
build discriminating attributes
3. Analysis of stream data
4. Distributed data mining
5. Visualization and querying tools
Data Mining System Products and Research Prototypes
Data mining systems should be assessed based on the following multiple features:
1. Data types
2. System issues
3. Data sources
4. Data mining functions and methodologies
5. Coupling data mining with database and/or data warehouse systems
6. Scalability
7. Visualization tools
8. Data mining query language and graphical user interface
Additional Themes on Data Mining: Theoretical Foundations of Data Mining
1. Data reduction: In this theory, the basis of data mining is to reduce the
data representation
2. Data compression: According to this theory, the basis of data mining is to
compress the given data by encoding in terms of bits, association rules,
decision trees, clusters, and so on
3. Pattern discovery: In this theory, the basis of data mining is to discover
patterns occurring in the database, such as associations, classification
models, sequential patterns, and so on
4. Probability theory: This is based on statistical theory. In this theory, the
basis of data mining is to discover joint probability distributions of random
variables, for example, Bayesian belief networks or hierarchical Bayesian
models.
5. Microeconomic view: The microeconomic view considers data mining as
the task of finding patterns that are interesting only to the extent that they
can be used in the decision making process of some enterprise (e.g.,
regarding marketing strategies and production plans).
6. Inductive databases: According to this theory, a database schema consists
of data and patterns that are stored in the database.
Statistical Data Mining techniques:
1. Regression
2. Generalized linear model
3. Analysis of variance
4. Mixed effect model
5. Factor analysis
6. Discriminant analysis
7. Time series analysis
8. Survival analysis
9. Quality control
Visual and Audio Data Mining
Visual data mining discovers implicit and useful knowledge from large data sets
using data and/or knowledge visualization techniques.
In general, data visualization and data mining can be integrated in the following
ways:
1. Data visualization
2. Data mining result visualization
3. Data mining process visualization
4. Interactive visual data mining
Data Mining and Collaborative Filtering
A collaborative filtering approach is commonly used, in which products are
recommended based on the opinions of other customers. Collaborative
recommender systems may employ data mining or statistical techniques to search
for similarities among customer preferences.
Security of Data Mining
Data security-enhancing techniques have been developed to help protect data.
Databases can employ a multilevel security model to classify and restrict data
according to various security levels, with users permitted access to only their
authorized level. Privacy-sensitive data mining deals with obtaining valid data
mining results without learning the underlying data values.
