
DATA MINING

Kusrini

Definition of Data Mining

Gartner Group:
Data mining is the process of discovering meaningful
new correlations, patterns and trends by sifting
through large amounts of data stored in repositories,
using pattern recognition technologies as well as
statistical and mathematical techniques.

Hand et al.:
Data mining is the analysis of (often large)
observational data sets to find unsuspected
relationships and to summarize the data in novel
ways that are both understandable and useful to the
data owner

Evangelos Simoudis in Cabena et al:


Data mining is an interdisciplinary field bringing
together techniques from machine learning,
pattern recognition, statistics, databases, and
visualization to address the issue of information
extraction from large data bases

Berry and Linoff:


Data mining is the process of exploration and
analysis, by automatic or semi-automatic means,
of large quantities of data in order to discover
meaningful patterns and rules

Why data mining?


The explosive growth in data collection
The storing of data in data warehouses
The availability of increased access to data
from Web navigation and intranet

We have to find a more effective way to
use these data in decision support
processes than just using traditional query
languages.

On what kind of data?

Data warehouses
Transactional databases
Advanced database systems
Spatial and temporal
Time-series
Multimedia, text
WWW

Knowledge Discovery in Databases

1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be
combined)
3. Data selection (where data relevant to the analysis task are
retrieved from the database)
4. Data transformation (where data are transformed or
consolidated into forms appropriate for mining by performing
summary or aggregation operations, for instance)
5. Data mining (an essential process where intelligent
methods are applied in order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns
representing knowledge based on some interestingness measures)
7. Knowledge presentation (where visualization and
knowledge representation techniques are used to present the
mined knowledge to the user)
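
The first four steps can be illustrated on toy data. The following is only a minimal sketch and is not part of the original slides: the two record sources, the field names, and the cleaning rule are all hypothetical.

```python
from collections import defaultdict

# Hypothetical raw sources (not from the slides).
store_a = [
    {"customer": "c1", "amount": 120.0},
    {"customer": "c2", "amount": None},   # missing value (noise)
    {"customer": "c1", "amount": 80.0},
]
store_b = [
    {"customer": "c2", "amount": 50.0},
    {"customer": "c3", "amount": -10.0},  # inconsistent: negative sale
]

def clean(records):
    """Step 1: drop records with missing or inconsistent amounts."""
    return [r for r in records if r["amount"] is not None and r["amount"] >= 0]

def integrate(*sources):
    """Step 2: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, customers):
    """Step 3: keep only the records relevant to the analysis task."""
    return [r for r in records if r["customer"] in customers]

def transform(records):
    """Step 4: aggregate to one total per customer."""
    totals = defaultdict(float)
    for r in records:
        totals[r["customer"]] += r["amount"]
    return dict(totals)

prepared = transform(
    select(integrate(clean(store_a), clean(store_b)), {"c1", "c2"})
)
print(prepared)
```

The output of `transform` is the kind of consolidated table that the mining step (step 5) would then take as input.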

Tasks in Data Mining

Classification
Estimation
Prediction
Clustering
Association

Classification

Classification is the process of finding a
model (or function) that describes and
distinguishes data classes or concepts,
for the purpose of being able to use the
model to predict the class of objects
whose class label is unknown.
The derived model is based on the
analysis of a set of training data (i.e.,
data objects whose class label is known).

In classification, there is a target categorical variable,
such as income bracket, which, for example, could be
partitioned into three classes or categories: high
income, middle income, and low income. The data
mining model examines a large set of records, each
record containing information on the target variable as
well as a set of input or predictor variables.

Algorithm:

k-nearest neighbor classification
Decision tree
Naïve Bayesian classification
Support vector machines

Nearest Neighbor (KNN)

Nearest neighbor is an approach to finding a case by
computing the similarity between a new case and the
old cases, based on matching the weights of a number
of features.
Suppose we want to find a solution for a new patient
by reusing the solution given to an earlier patient.
To decide which earlier case to use, we compute the
similarity between the new patient's case and every
old patient's case.
The old case with the highest similarity is the one
whose solution is reused for the new patient.

As shown in the figure, there are two old patients,
A and B. When a new patient arrives, the solution
taken is that of the old patient closest to the new
patient.
Let d1 denote how close the new patient is to
patient A, and d2 how close the new patient is to
patient B.
Because d2 indicates a closer match than d1, the
solution of patient B is the one used for the new
patient.

Similarity formula
Similarity usually takes a value between 0 and 1.
A value of 0 means the two cases are completely
dissimilar; a value of 1 means the two cases are
identical.

Weights of the variables

Gender similarity

Education similarity

Religion similarity
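
The formula and the tables on these slides did not survive extraction. A form commonly used for this kind of case matching is a weighted average of per-attribute similarities, which keeps the result between 0 and 1. The sketch below assumes that form; the weights, the similarity table, and the patient records are hypothetical values chosen only for illustration.

```python
# Hypothetical attribute weights (the slide's weight table is not recoverable).
WEIGHTS = {"gender": 0.5, "education": 1.0, "religion": 0.75}

# Hypothetical similarity table for unequal values, in [0, 1].
# Pairs not listed here count as 0.0; equal values always count as 1.0.
VALUE_SIMILARITY = {
    "education": {frozenset(["bachelor", "master"]): 0.5},
}

def attribute_similarity(attribute, a, b):
    """Similarity of two values of one attribute."""
    if a == b:
        return 1.0
    return VALUE_SIMILARITY.get(attribute, {}).get(frozenset([a, b]), 0.0)

def case_similarity(new_case, old_case):
    """Weighted average of attribute similarities (assumed formula)."""
    total = sum(
        WEIGHTS[attr] * attribute_similarity(attr, new_case[attr], old_case[attr])
        for attr in WEIGHTS
    )
    return total / sum(WEIGHTS.values())

def most_similar(new_case, old_cases):
    """Name of the old case with the highest similarity to the new case."""
    return max(old_cases, key=lambda name: case_similarity(new_case, old_cases[name]))

old_cases = {
    "A": {"gender": "male", "education": "high school", "religion": "r1"},
    "B": {"gender": "female", "education": "master", "religion": "r2"},
}
new_case = {"gender": "female", "education": "bachelor", "religion": "r2"}

print(most_similar(new_case, old_cases))
```

With these made-up values patient B is the closest case, so B's solution would be reused for the new patient.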

System Design

Decision Tree

A decision tree is a structure that can be used to
divide a large collection of data into successively
smaller sets of records by applying a sequence of
decision rules. With each successive division, the
members of the resulting sets become more similar
to one another (Berry, Michael J.A., Linoff,
Gordon S., 2004).
A decision tree model consists of a set of rules for
dividing a heterogeneous population into smaller,
more homogeneous groups with respect to the
target variable.

Choose an attribute as the root
Create a branch for each value of that attribute
Divide the cases among the branches
Repeat the process for each branch until all
cases in the branch have the same class
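
The four steps above can be sketched as a short recursive function. The slides do not say how the root attribute is chosen; this sketch assumes information gain (as in ID3), and the four training records are hypothetical.

```python
import math
from collections import Counter

def entropy(rows, target):
    """Entropy of the class labels in a set of cases."""
    counts = Counter(row[target] for row in rows)
    total = len(rows)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting on one attribute."""
    total = len(rows)
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row for row in rows if row[attribute] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(rows, target) - remainder

def build_tree(rows, attributes, target):
    classes = [row[target] for row in rows]
    # Stop when all cases in the branch have the same class.
    if len(set(classes)) == 1:
        return classes[0]
    # No attribute left to split on: fall back to the majority class.
    if not attributes:
        return Counter(classes).most_common(1)[0][0]
    # Choose the attribute for this node (assumption: highest information gain).
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    # One branch per value; divide the cases among the branches and recurse.
    return {
        best: {
            value: build_tree(
                [r for r in rows if r[best] == value], remaining, target
            )
            for value in sorted({r[best] for r in rows})
        }
    }

# Hypothetical training cases.
rows = [
    {"student": "yes", "credit": "fair", "buys": "yes"},
    {"student": "yes", "credit": "excellent", "buys": "yes"},
    {"student": "no", "credit": "fair", "buys": "no"},
    {"student": "no", "credit": "excellent", "buys": "no"},
]
tree = build_tree(rows, ["student", "credit"], "buys")
print(tree)
```

In this toy data `student` separates the classes perfectly, so it becomes the root and each of its branches is already a leaf.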

Bayesian Classification

Bayesian classifiers are statistical classifiers
that can be used to predict class membership
probabilities.
Bayesian classification is based on Bayes'
theorem and has classification ability
comparable to decision trees and neural networks.
Bayesian classifiers have been shown to achieve
high accuracy and speed when applied to
databases.

X = (age = <=30, income = medium,
student = yes, credit_rating = fair)

Class: buys_computer = ?

We need to maximize P(X|Ci) P(Ci)
for i = 1, 2.
P(Ci) is the prior probability of
each class, based on the sample data:

P(buys_computer = yes) = 9/14 = 0.643
P(buys_computer = no) = 5/14 = 0.357

Compute P(X|Ci) for i = 1, 2:

P(age = <=30 | buys_computer = yes) = 2/9 = 0.222
P(age = <=30 | buys_computer = no) = 3/5 = 0.600
P(income = medium | buys_computer = yes) = 4/9 = 0.444
P(income = medium | buys_computer = no) = 2/5 = 0.400
P(student = yes | buys_computer = yes) = 6/9 = 0.667
P(student = yes | buys_computer = no) = 1/5 = 0.200
P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
P(credit_rating = fair | buys_computer = no) = 2/5 = 0.400

P(X|buys_computer = yes)
= 0.222 x 0.444 x 0.667 x 0.667 = 0.044

P(X|buys_computer = no)
= 0.600 x 0.400 x 0.200 x 0.400 = 0.019

P(X|buys_computer = yes) P(buys_computer = yes)
= 0.044 x 0.643 = 0.028

P(X|buys_computer = no) P(buys_computer = no)
= 0.019 x 0.357 = 0.007

Conclusion: buys_computer = yes
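
The arithmetic of this example can be checked with a few lines of Python. This is a sketch that uses only the fractions stated on the slides and multiplies the conditional probabilities together, which is the naive independence assumption; the variable names are illustrative.

```python
from fractions import Fraction as F

# Prior probabilities P(Ci) from the slides.
priors = {"yes": F(9, 14), "no": F(5, 14)}

# Conditional probabilities P(x_k | Ci) from the slides, in the order:
# age <= 30, income = medium, student = yes, credit_rating = fair.
likelihoods = {
    "yes": [F(2, 9), F(4, 9), F(6, 9), F(6, 9)],
    "no": [F(3, 5), F(2, 5), F(1, 5), F(2, 5)],
}

def score(label):
    """P(X|Ci) P(Ci), with P(X|Ci) taken as the product of the conditionals."""
    result = priors[label]
    for p in likelihoods[label]:
        result *= p
    return result

scores = {label: score(label) for label in priors}
prediction = max(scores, key=scores.get)

for label, value in scores.items():
    print(label, round(float(value), 3))
print("prediction:", prediction)
```

Exact fractions avoid the small rounding differences that appear when the three-decimal values are multiplied by hand.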

Department: systems, age: 26-30, salary: 46-50K

Status?

Estimation

Estimation is similar to classification except that the
target variable is numerical rather than categorical.
Models are built using complete records, which
provide the value of the target variable as well as
the predictors.
Example:

Estimating the grade-point average (GPA) of a graduate
student, based on that student's undergraduate GPA.
Estimating the systolic blood pressure reading of a
hospital patient, based on the patient's age, gender,
body-mass index, and blood sodium levels.

Method:

simple linear regression and correlation


multiple regression
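
As an illustration of the first method, simple linear regression fits a line y = intercept + slope * x by least squares and then reads the estimate off the line. The sketch below uses the GPA example; the three data points are hypothetical and were chosen so the numbers are easy to follow.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical complete records: (undergraduate GPA, graduate GPA).
undergraduate = [2.0, 3.0, 4.0]
graduate = [2.5, 3.0, 3.5]

slope, intercept = fit_line(undergraduate, graduate)

# Estimate the graduate GPA of a student whose undergraduate GPA is 3.5.
estimate = intercept + slope * 3.5
print(slope, intercept, estimate)
```

Multiple regression follows the same idea with several predictors, for example age, gender, body-mass index, and blood sodium level in the blood pressure example.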

Prediction

Prediction is similar to classification and
estimation, except that for prediction, the
results lie in the future.
Example:

Predicting the price of a stock three months into
the future
Predicting the percentage increase in traffic
deaths next year if the speed limit is increased
Predicting whether a particular molecule in drug
discovery will lead to a profitable new drug for a
pharmaceutical company

Method:

simple linear regression and correlation
multiple regression
k-nearest neighbor classification
Decision tree
Naïve Bayesian classification
support vector machines
