
Data Mining

BITS Pilani
Pilani|Dubai|Goa|Hyderabad

M1: Introduction to Data Mining


1.1 Data Mining Defined

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books

Why Data Mining?


The explosive growth of data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems, the Web, a computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks
Science: remote sensing, bioinformatics, scientific simulation
Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
"Necessity is the mother of invention": data mining, the automated analysis of massive data sets

BITS Pilani, Deemed to be University
under Section 3 of UGC Act, 1956

Why Data Mining?


A search engine (e.g., Google) receives hundreds of millions of queries every day. Each
query can be viewed as a transaction where the user describes her or his information need.
What novel and useful knowledge can a search engine learn from such a huge collection of
queries collected from users over time? Some patterns found in user search queries can
disclose invaluable knowledge that cannot be obtained by reading individual data items
alone.
For example, Google's Flu Trends uses specific search terms as indicators of flu activity. It
found a close relationship between the number of people who search for flu-related
information and the number of people who actually have flu symptoms. A pattern emerges
when all of the search queries related to flu are aggregated. Using aggregated Google
search data, Flu Trends can estimate flu activity up to two weeks faster than traditional
systems can. This example shows how data mining can turn a large collection of data into
knowledge that can help meet a current global challenge.
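
The aggregation idea behind the Flu Trends example can be sketched in a few lines. This is only an illustration, not Google's actual method: the query log, the case counts, and the helper names `weekly_flu_queries` and `pearson` are all made up for this example.

```python
from collections import Counter
from math import sqrt

# Hypothetical query log: (week number, query text). Toy data, not real queries.
query_log = [
    (1, "flu symptoms"), (1, "cheap flights"),
    (2, "flu remedy"), (2, "flu symptoms"), (2, "weather"),
    (3, "flu symptoms"), (3, "fever and flu"), (3, "flu shot"), (3, "news"),
]
# Hypothetical weekly counts of reported flu cases (also made up).
reported_cases = {1: 10, 2: 20, 3: 30}


def weekly_flu_queries(log, keyword="flu"):
    """Aggregate individual queries into one count of flu-related queries per week."""
    counts = Counter()
    for week, query in log:
        if keyword in query.lower():
            counts[week] += 1
    return counts


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


flu_counts = weekly_flu_queries(query_log)
weeks = sorted(reported_cases)
r = pearson([flu_counts[w] for w in weeks], [reported_cases[w] for w in weeks])
print(dict(flu_counts), r)
```

No single query says anything about flu activity; the relationship only becomes visible once the queries are counted per week and compared with the case counts.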


Evolution of Database Technology


1960s:
Data collection, database creation, IMS and network DBMS

1970s:
Relational data model, relational DBMS implementation

1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)

1990s:
Data mining, data warehousing, multimedia databases, and Web databases

2000s:
Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information systems

What Is Data Mining?


Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown, and potentially
useful) patterns or knowledge from huge amounts of data
Data mining: a misnomer?

Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge extraction,
data/pattern analysis, data archeology, data dredging, information harvesting,
business intelligence, etc.

Watch out: Is everything "data mining"? The following are not:
Simple search and query processing
(Deductive) expert systems

What is (not) Data Mining?


What is not Data Mining?

Look up a phone number in a phone directory
Query a Web search engine for information about "Amazon"

What is Data Mining?

Certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly in the Boston area)
Group together similar documents returned by a search engine according to their context (e.g., Amazon rainforest, Amazon.com)
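
Grouping search results by context can be illustrated with a very small sketch. It assumes word-overlap (Jaccard) similarity and a greedy grouping rule; both are choices made for this example, and real systems use more robust clustering. The result strings, the threshold, and the function names are hypothetical.

```python
def tokens(text):
    """Represent a document as the set of its lower-cased words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity: shared words divided by all distinct words."""
    return len(a & b) / len(a | b)


def group_documents(docs, threshold=0.2):
    """Greedy grouping: add each document to the first group whose first
    member (the seed) is similar enough, otherwise start a new group."""
    groups = []
    for doc in docs:
        for group in groups:
            if jaccard(tokens(doc), tokens(group[0])) >= threshold:
                group.append(doc)
                break
        else:
            groups.append([doc])
    return groups


# Made-up search results for the query "amazon".
results = [
    "amazon rainforest river wildlife",
    "amazon online shopping deals",
    "rainforest wildlife in the amazon basin",
    "amazon shopping cart and online orders",
]
groups = group_documents(results)
for group in groups:
    print(group)
```

Every result contains the word "amazon", so a plain keyword lookup cannot separate them; the grouping emerges only from comparing the documents with each other.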

Origins of Data Mining


Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
Traditional techniques may be unsuitable due to
Enormity of data
High dimensionality of data
Heterogeneous, distributed nature of data

[Figure: Data Mining at the overlap of Statistics/AI, Machine Learning/Pattern Recognition, and Database systems]


Data Mining in Business Intelligence


Layers of the pyramid, from top to bottom (the potential to support business decisions increases toward the top), with the typical role in parentheses:
Decision Making (End User)
Data Presentation: Visualization Techniques (Business Analyst)
Data Mining: Information Discovery (Data Analyst)
Data Exploration: Statistical Summary, Querying, and Reporting (Data Analyst)
Data Preprocessing/Integration, Data Warehouses (DBA)
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems (DBA)

Data Mining/KDD Process

Flow: Input Data -> Data Pre-Processing -> Data Mining -> Post-Processing -> Pattern / Information / Knowledge

Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis
Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
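
The three stages can be sketched as a toy pipeline. The example below assumes market-basket style transactions and uses simple pair counting as the mining step; the data, the `min_support` threshold, and the function names are illustrative assumptions, not part of the slides.

```python
from collections import Counter
from itertools import combinations


def preprocess(raw_transactions):
    """Pre-processing: normalize item names and drop empty transactions."""
    cleaned = []
    for items in raw_transactions:
        normalized = {item.strip().lower() for item in items if item.strip()}
        if normalized:
            cleaned.append(normalized)
    return cleaned


def mine_pairs(transactions):
    """Mining: count how many transactions contain each pair of items."""
    counts = Counter()
    for items in transactions:
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts


def postprocess(pair_counts, n_transactions, min_support=0.5):
    """Post-processing: keep pairs whose support reaches the threshold,
    ordered by decreasing support."""
    selected = [
        (pair, count / n_transactions)
        for pair, count in pair_counts.items()
        if count / n_transactions >= min_support
    ]
    return sorted(selected, key=lambda entry: (-entry[1], entry[0]))


# Made-up raw input with inconsistent spelling and one empty record.
raw = [
    ["Bread", "milk "],
    ["bread", "Milk", "eggs"],
    ["milk", "eggs"],
    ["bread", "milk"],
    [],
]
transactions = preprocess(raw)
patterns = postprocess(mine_pairs(transactions), len(transactions))
print(patterns)
```

Each function corresponds to one box of the process: cleaning the input, discovering candidate patterns, and then evaluating and selecting the patterns worth reporting.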


Multi-Dimensional View of Data Mining


Data to be mined
Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse,
transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media,
graphs & social and information networks
Knowledge to be mined (or: Data mining functions)
Characterization, discrimination, association, classification, clustering, trend/deviation, outlier
analysis, etc.
Descriptive vs. predictive data mining
Multiple/integrated functions and mining at multiple levels
Techniques utilized
Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition,
visualization, high-performance, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text
mining, Web mining, etc.

Data Mining & Machine Learning


According to Tom M. Mitchell, Chair of Machine Learning at Carnegie
Mellon University and author of the book Machine Learning (McGraw-Hill):
"A computer program is said to learn from experience E with respect to some class
of tasks T and performance measure P, if its performance at tasks in T, as
measured by P, improves with experience E."
We now have a set of objects to define machine learning:
Task (T), Experience (E), and Performance (P)
For a computer running a set of tasks, experience should lead to
performance increases (to satisfy the definition)
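
A minimal sketch can make T, E, and P concrete. It assumes a toy task (deciding whether a number is at least 50), a 1-nearest-neighbour learner, and accuracy as the performance measure; all of these, including the example data, are choices made for illustration only.

```python
def true_label(x):
    """Hidden rule behind task T (assumed for this toy example): is x at least 50?"""
    return x >= 50


def predict(examples, x):
    """1-nearest-neighbour: copy the label of the closest example seen so far."""
    nearest_x, nearest_label = min(examples, key=lambda ex: abs(ex[0] - x))
    return nearest_label


def accuracy(examples, test_points):
    """Performance measure P: fraction of test points classified correctly."""
    correct = sum(predict(examples, x) == true_label(x) for x in test_points)
    return correct / len(test_points)


test_points = range(100)
# Experience E: labelled examples (value, label), made up for this sketch.
experience = [(5, False), (70, True), (44, False), (55, True)]

p_small = accuracy(experience[:2], test_points)  # learner has seen 2 examples
p_large = accuracy(experience, test_points)      # learner has seen 4 examples
print(p_small, p_large)
```

Here performance rises from 0.88 to 1.0 as the learner sees more examples, which is what the definition asks for. More experience is not guaranteed to help in general; the examples were picked so that the effect is visible.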

Many data mining tasks are executed successfully with the help of machine
learning
Source: Machine Learning: Hands-On for Developers and Technical Professionals by Jason Bell, John Wiley & Sons

Data Mining on Diverse kinds of Data


Besides relational database data (from operational or analytical systems), there are many other
kinds of data that have diverse forms and structures and different semantic meanings.
Examples include:
time-related or sequence data (e.g., historical records, stock exchange data, and time-series
and biological sequence data),
data streams (e.g., video surveillance and sensor data, which are continuously transmitted),
spatial data (e.g., maps),
engineering design data (e.g., the design of buildings, system components, or integrated
circuits),
hypertext and multimedia data (including text, image, video, and audio data),
graph and networked data (e.g., social and information networks), and
the Web (a widely distributed information repository).
Diversity of data brings new challenges, such as handling special structures (e.g., sequences,
trees, graphs, and networks) and specific semantics (such as ordering, image, audio and video
contents, and connectivity).

Prescribed Text Books

Author(s), Title, Edition, Publishing House

T1: Tan P. N., Steinbach M. & Kumar V., Introduction to Data Mining, Pearson Education

T2: Jiawei Han, Micheline Kamber and Jian Pei, Data Mining: Concepts and Techniques, Third Edition, Morgan Kaufmann Publishers

