Data Mining

Kusrini
Definition of Data Mining
Gartner Group:
Data mining is the process of discovering meaningful
new correlations, patterns and trends by sifting
through large amounts of data stored in repositories,
using pattern recognition technologies as well as
statistical and mathematical techniques.
Hand et al.:
Data mining is the analysis of (often large)
observational data sets to find unsuspected
relationships and to summarize the data in novel
ways that are both understandable and useful to the
data owner.
Evangelos Simoudis in Cabena et al.:
Data mining is an interdisciplinary field bringing
together techniques from machine learning,
pattern recognition, statistics, databases, and
visualization to address the issue of information
extraction from large data bases.
Berry and Linoff:
Data mining is the process of exploration and
analysis, by automatic or semi-automatic means,
of large quantities of data in order to discover
meaningful patterns and rules.
Why data mining?
The explosive growth in data collection
The storing of data in data warehouses
The availability of increased access to data
from Web navigation and intranets
We have to find a more effective way to
use these data in the decision support
process than just using traditional query
languages
On what kind of data?
Data warehouses
Transactional databases
Advanced database systems
Spatial and temporal
Time-series
Multimedia, text
WWW
Knowledge Discovery in Databases
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be
combined)
3. Data selection (where data relevant to the analysis task are
retrieved from the database)
4. Data transformation (where data are transformed or
consolidated into forms appropriate for mining by performing
summary or aggregation operations, for instance)
5. Data mining (an essential process where intelligent
methods are applied in order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns
representing knowledge based on some interestingness measures)
7. Knowledge presentation (where visualization and
knowledge representation techniques are used to present the
mined knowledge to the user)
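The preparation steps above (cleaning, integration, selection, transformation) can be sketched on a tiny in-memory data set. This is only an illustration: the record fields, the city names, and the helper function names are hypothetical and do not come from the slides.

```python
# Hypothetical raw records from two sources (all names and values are made up).
sales = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 2, "amount": None},  # missing value, treated as noise
    {"customer_id": 1, "amount": 80.0},
    {"customer_id": 3, "amount": 200.0},
]
customers = {
    1: {"city": "Yogyakarta"},
    2: {"city": "Jakarta"},
    3: {"city": "Yogyakarta"},
}


def clean(records):
    """Step 1: drop records whose amount is missing."""
    return [r for r in records if r["amount"] is not None]


def integrate(records, lookup):
    """Step 2: combine each sale with the matching customer record."""
    return [{**r, **lookup[r["customer_id"]]} for r in records]


def select(records, city):
    """Step 3: keep only the records relevant to the analysis task."""
    return [r for r in records if r["city"] == city]


def transform(records):
    """Step 4: aggregate the total amount per customer."""
    totals = {}
    for r in records:
        totals[r["customer_id"]] = totals.get(r["customer_id"], 0.0) + r["amount"]
    return totals


prepared = transform(select(integrate(clean(sales), customers), "Yogyakarta"))
print(prepared)
```

The resulting per-customer totals are the kind of consolidated input that the data mining step (step 5) would then work on.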
Tasks in Data Mining
Classification
Estimation
Prediction
Clustering
Association
Classification
Classification is the process of finding a
model (or function) that describes and
distinguishes data classes or concepts,
for the purpose of being able to use the
model to predict the class of objects
whose class label is unknown.
The derived model is based on the
analysis of a set of training data (i.e.,
data objects whose class label is known).
In classification, there is a target categorical variable,
such as income bracket, which, for example, could be
partitioned into three classes or categories: high
income, middle income, and low income. The data
mining model examines a large set of records, each
record containing information on the target variable as
well as a set of input or predictor variables.
Algorithms:
k-nearest neighbor classification
Decision tree
Naive Bayesian classification
Support vector machines
Nearest Neighbor (KNN)
Nearest neighbor is an approach for finding a case by
computing the closeness between a new case and old
cases, based on matching the weights of a number of
existing features.
Suppose we want to find a solution for a new patient
by using the solutions of previous patients.
To decide which patient case to use, the closeness
between the new patient's case and every old patient
case is computed.
The old patient case with the greatest closeness is the
one whose solution is taken and applied to the new
patient's case.
As shown in the figure, there are two old patients,
A and B. When a new patient arrives, the solution
adopted is that of the old patient closest to the
new patient.
Suppose d1 is the closeness between the new patient
and patient A, and d2 is the closeness between the
new patient and patient B.
Because d2 represents a closer match than d1, the
solution of patient B is the one used for the new
patient.
Closeness formula
Closeness usually takes a value between 0 and 1.
A value of 0 means the two cases are entirely
dissimilar, whereas a value of 1 means the cases
are identical.
[Tables not preserved: weights between variables; closeness for gender; closeness for education; closeness for religion]
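The closeness formula and the weight and closeness tables were not preserved in this text, so the following is only a sketch under stated assumptions: closeness is taken to be the weighted average of per-attribute closeness values, which keeps the result between 0 and 1. All weights, closeness values, attribute values, and function names below are hypothetical.

```python
# Hypothetical attribute weights; the original weight table was not preserved.
WEIGHTS = {"gender": 0.5, "education": 1.0, "religion": 0.75}

# Hypothetical closeness between unequal education levels (symmetric pairs).
EDUCATION_CLOSENESS = {
    frozenset({"high school", "bachelor"}): 0.5,
    frozenset({"bachelor", "master"}): 0.75,
    frozenset({"high school", "master"}): 0.25,
}


def attribute_closeness(name, a, b):
    """Return a closeness in [0, 1] for one attribute of two cases."""
    if a == b:
        return 1.0
    if name == "education":
        return EDUCATION_CLOSENESS.get(frozenset({a, b}), 0.0)
    return 0.0  # assumption: the other attributes either match or do not


def similarity(new_case, old_case):
    """Weighted average of per-attribute closeness (assumed formula)."""
    total = sum(
        WEIGHTS[name] * attribute_closeness(name, new_case[name], old_case[name])
        for name in WEIGHTS
    )
    return total / sum(WEIGHTS.values())


def nearest_case(new_case, old_cases):
    """Return the label of the old case with the greatest closeness."""
    return max(old_cases, key=lambda label: similarity(new_case, old_cases[label]))


old_cases = {
    "A": {"gender": "female", "education": "high school", "religion": "X"},
    "B": {"gender": "male", "education": "bachelor", "religion": "Y"},
}
new_patient = {"gender": "male", "education": "master", "religion": "Y"}
print(nearest_case(new_patient, old_cases))
```

With these made-up values the new patient is closest to patient B, so B's solution would be reused, mirroring the two-patient illustration earlier in this section.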
System Design
Decision Tree
A decision tree is a structure that can be used to
divide a large collection of data into smaller sets
of records by applying a series of decision rules.
With each successive division, the members of the
resulting sets become more similar to one another
(Berry and Linoff, 2004).
A decision tree model consists of a set of rules for
dividing a heterogeneous population into smaller,
more homogeneous groups with respect to the
target variable.
Choose an attribute as the root
Create a branch for each of its values
Divide the cases among the branches
Repeat the process for each branch until all
cases in a branch have the same class
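The four steps above can be sketched as a short recursive function. The slides do not say how the root attribute is chosen; this sketch assumes the information gain criterion (as in ID3), and the toy data and function names are hypothetical.

```python
import math
from collections import Counter


def entropy(rows, target):
    """Shannon entropy of the class labels in rows."""
    counts = Counter(row[target] for row in rows)
    return -sum((n / len(rows)) * math.log2(n / len(rows)) for n in counts.values())


def split(rows, attribute):
    """Group rows by their value of the given attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row)
    return groups


def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting rows on attribute."""
    remainder = sum(
        len(group) / len(rows) * entropy(group, target)
        for group in split(rows, attribute).values()
    )
    return entropy(rows, target) - remainder


def build_tree(rows, attributes, target):
    labels = Counter(row[target] for row in rows)
    # Stop when all cases share one class, or fall back to the majority class
    # when no attributes remain.
    if len(labels) == 1 or not attributes:
        return labels.most_common(1)[0][0]
    # Choose the root attribute (assumed criterion: highest information gain).
    root = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != root]
    # One branch per value; the cases are divided among the branches and the
    # process repeats on each branch.
    return {
        root: {
            value: build_tree(group, remaining, target)
            for value, group in split(rows, root).items()
        }
    }


# Hypothetical toy data.
rows = [
    {"student": "yes", "credit": "fair", "buys": "yes"},
    {"student": "yes", "credit": "excellent", "buys": "yes"},
    {"student": "no", "credit": "fair", "buys": "no"},
    {"student": "no", "credit": "excellent", "buys": "no"},
]
tree = build_tree(rows, ["student", "credit"], "buys")
print(tree)
```

On this toy data the `student` attribute separates the classes perfectly, so it becomes the root and both branches end immediately in a leaf.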
Bayesian Classification
A Bayesian classifier is a statistical classifier
that can be used to predict the probability of
membership in a class.
Bayesian classification is based on Bayes'
theorem and has classification performance
comparable to decision trees and neural networks.
Bayesian classifiers have been shown to achieve
high accuracy and speed when applied to
databases.
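For reference, Bayes' theorem for a class C_i and an attribute vector X = (x_1, ..., x_n) can be written as below. The second equality is the naive (class-conditional independence) assumption; the slides do not state it explicitly, but it is the usual basis for multiplying per-attribute probabilities.

```latex
P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)},
\qquad
P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)
```

Because P(X) is the same for every class, comparing classes only requires the numerator P(X | C_i) P(C_i).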
X = (age <= 30, income = medium,
student = yes, credit_rating = fair)
Class buys_computer?
We need to maximize P(X|Ci) P(Ci)
for i = 1, 2
P(Ci) is the prior probability of
each class, based on the sample data:
P(buys_computer = yes) = 9/14 = 0.643
P(buys_computer = no) = 5/14 = 0.357
Compute P(X|Ci) for i = 1, 2
P(age <= 30 | buys_computer = yes) = 2/9 = 0.222
P(age <= 30 | buys_computer = no) = 3/5 = 0.600
P(income = medium | buys_computer = yes) = 4/9 = 0.444
P(income = medium | buys_computer = no) = 2/5 = 0.400
P(student = yes | buys_computer = yes) = 6/9 = 0.667
P(student = yes | buys_computer = no) = 1/5 = 0.200
P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
P(credit_rating = fair | buys_computer = no) = 2/5 = 0.400
P(X | buys_computer = yes)
= 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X | buys_computer = no)
= 0.600 x 0.400 x 0.200 x 0.400 = 0.019
P(X | buys_computer = yes) P(buys_computer = yes)
= 0.044 x 0.643 = 0.028
P(X | buys_computer = no) P(buys_computer = no)
= 0.019 x 0.357 = 0.007
Conclusion: buys_computer = yes
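The arithmetic of the worked example above can be checked with a few lines of Python. The fractions are the ones given in the example; the dictionary layout and the names `prior`, `likelihood`, and `score` are just one possible way to organize them.

```python
from math import prod

# Prior and conditional probabilities taken from the worked example
# (14 training cases: 9 with buys_computer = yes, 5 with buys_computer = no).
prior = {"yes": 9 / 14, "no": 5 / 14}
likelihood = {
    "yes": {
        "age<=30": 2 / 9,
        "income=medium": 4 / 9,
        "student=yes": 6 / 9,
        "credit_rating=fair": 6 / 9,
    },
    "no": {
        "age<=30": 3 / 5,
        "income=medium": 2 / 5,
        "student=yes": 1 / 5,
        "credit_rating=fair": 2 / 5,
    },
}


def score(label):
    """P(X|Ci) * P(Ci), multiplying the per-attribute probabilities."""
    return prod(likelihood[label].values()) * prior[label]


scores = {label: score(label) for label in prior}
prediction = max(scores, key=scores.get)
print({label: round(s, 3) for label, s in scores.items()}, prediction)
```

Rounded to three decimals the scores are 0.028 for yes and 0.007 for no, so the predicted class is yes, in agreement with the conclusion of the example.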
Department: systems, age: 26-30, salary: 46-50K
Status?
Estimation
Estimation is similar to classification except that the
target variable is numerical rather than categorical.
Models are built using complete records, which
provide the value of the target variable as well as
the predictors.
Example:
Estimating the grade-point average (GPA) of a graduate
student, based on that student's undergraduate GPA.
Estimating the systolic blood pressure reading of a
hospital patient, based on the patient's age, gender,
body-mass index, and blood sodium levels.
Method:
simple linear regression and correlation
multiple regression
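As a minimal sketch of simple linear regression, the GPA example above can be fitted by ordinary least squares. The GPA pairs are hypothetical values chosen only for illustration, and `fit_line` is an illustrative helper, not a library function.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical (undergraduate GPA, graduate GPA) pairs.
undergrad = [2.0, 3.0, 4.0]
graduate = [2.5, 3.0, 3.5]

slope, intercept = fit_line(undergrad, graduate)
# Estimate the graduate GPA of a student whose undergraduate GPA is 3.5.
estimate = intercept + slope * 3.5
print(slope, intercept, estimate)
```

The target here is numerical, which is what distinguishes estimation from classification; multiple regression extends the same idea to several predictors.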
Prediction
Prediction is similar to classification and
estimation, except that for prediction, the
results lie in the future.
Example:
Predicting the price of a stock three months into
the future
Predicting the percentage increase in traffic
deaths next year if the speed limit is increased
Predicting whether a particular molecule in drug
discovery will lead to a profitable new drug for a
pharmaceutical company
Method:
simple linear regression and correlation
multiple regression
k-nearest neighbor classification
decision tree
naive Bayesian classification
support vector machines
