Penelitian Datamining di BPPT (Datamining Research at BPPT)

Institut Teknologi Telkom, 15 April 2011

Dr. Anto Satriyo Nugroho
Center for Information & Communication Technology,
Agency for the Assessment & Application of Technology (PTIK-BPPT)
Email: asnugroho@gmail.com URL: http://asnugroho.net

Name: Anto Satriyo Nugroho
Birthday: October 1970
Education: received B.Eng (S1), M.Eng (S2) and Dr.Eng (S3) from Nagoya Institute of Technology, Japan (Electrical & Computer Eng.) in 1995, 2000 and 2003. Scholarships from the Government of Indonesia (S1) and the Japanese government, Monbukagakusho (S2 & S3).
Core competence
- Pattern Recognition, Artificial Intelligence & Datamining
- Biomedical Engineering & Bioinformatics/Computational Biology
- Research overview & publications can be seen at http://asnugroho.net/publist.html
Work experience
• 1990-now BPP Teknologi (Pusat Teknologi Informasi & Komunikasi)
• 2003-2004 Visiting Professor at School of Computer & Cognitive Sciences, Chukyo University, Japan
• 2004-2007 Visiting Professor at School of Life System Science & Technology, Chukyo University, Japan
• 2003-2007 Researcher at Institute for Advanced Studies of Artificial Intelligence, Chukyo University, Japan
• 2007-now Vice president of Indonesian Society for Softcomputing
• 2007-now Lecturer at Swiss German University, Al Azhar Univ. Indonesia

Research Projects (1995-2007, Japan)
1. Long Time Holter Electrocardiogram Compression using Wavelet Transform
2. Meteorological Forecasting (won first prize award, IEICE Competition 1999, Japan)
3. Handwriting Character Recognition (Japanese, Alphanumeric)
4. Bioinformatics: Tumor Suppressor Gene TP53 Status Classification based on mRNA profiles of cancer patients (collaboration with Faculty of Medicine, Hokkaido University, Japan)
5. Prediction of Interferon Efficacy for Chronic Hepatitis C using Support Vector Machines (collaboration with Faculty of Medicine, Nagoya University, Japan)
6. Confidence Margin based Feature Subset Selection for SVM
7. Computer Vision: Automatic System for locating text in scene image using stroke analysis neural network
*Topics set in red on the original slides are those related to datamining.

Research Projects (2007-now, Indonesia)
1. Coupling ground and airborne-based hyperspectral (HyMap) detection over rice canopy to predict leaf area index (LAI) and SPAD value using support vector machine (SVM) technique in irrigated wetland rice, West Java, Indonesia (collaboration with TISDA-BPPT & ERSDAC Japan)
2. Spatio Temporal Information Extraction & Visualization of Tropical Disease related web articles
3. Handwriting Javanese Character Recognition
4. License Number Plate Recognition for Intelligent Transportation System using Connected Component Analysis
5. TELUSUR HUKUM: Textmining for Indonesian law regulation analysis (Self Organizing Map, Association Rules)
6. Indonesian Automatic Document Reader to help people with visual impairment
7. Self Organizing Map for analyzing Indonesian ICT indicator data
8. Webmining of documents containing information related to Cyber Terrorism (supervision with Charles Lim, SGU)
9. Analysis of corruption crime news using Self Organizing Map
10. Malware Detection (co-supervision with Charles Lim, SGU)
11. Recursive Feature Elimination for Spam Identification using Support Vector Machine
12. Form Reader: integration of Optical Character Recognition & Optical Mark Recognition
13. Protein secondary structure prediction using Hidden Markov Model (co-supervision with Dr. Agus Buono, IPB)
14. Biometrics: Automated Fingerprint Identification System for National e-ID card
15. Computer Aided Diagnosis for Malaria Parasite Classification (collaboration with Lembaga Biologi Molekuler Eijkman)
16. Intelligent Searching of Al Quran verses (e-Fathurrahman)
17. Social Network Mining
18. Speaker Diarization

What is Datamining?
– Non-trivial extraction of implicit, previously unknown and potentially useful information from data
– Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
[Diagram: Data Mining at the intersection of Statistics; AI, Machine Learning, Pattern Recognition; and Database Technology, Parallel Computing, Distributed Computing.]

Why Mine Data? Commercial Viewpoint
• Lots of data is being collected and warehoused
– Web data, e-commerce
– purchases at department/grocery stores
– Bank/Credit Card transactions
• Computers have become cheaper and more powerful
• Competitive Pressure is Strong
– Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Slide source: Tan, Steinbach, Kumar, Introduction to Datamining, Pearson Int’l Edition

Why Mine Data? Scientific Viewpoint
• Data collected and stored at enormous speeds (GB/hour)
– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene expression data
– scientific simulations generating terabytes of data
• Traditional techniques infeasible for raw data
• Data mining may help scientists
– in classifying and segmenting data
– in Hypothesis Formation
Slide source: Tan, Steinbach, Kumar, Introduction to Datamining, Pearson Int’l Edition

Classifying Galaxies
Courtesy: http://aps.umn.edu
Class:
• Stages of Formation (Early, Intermediate, Late)
Attributes:
• Image features,
• Characteristics of light waves received, etc.
Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
Slide source: Tan, Steinbach, Kumar, Introduction to Datamining, Pearson Int’l Edition

[Figure: GenBank statistics. Source: http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html]

Microarray
• Measuring the expression of genes
• Possible to obtain the expression of thousands of genes
• Disease classification
Source: http://cmgm.stanford.edu/pbrown/array.html

Layanan Social Network (Social Network Services)
[Figure slide.]

Facebook statistics
• Source: http://www.digitalbuzzblog.com/facebook-statistics-stats-facts-2011/
• Mission: giving people the power to share and make the world more open & connected
• Users: more than 500M (used by more than 1 in 13 people on earth)
• Login every day: more than 250M
• Average friends: 130
• 48% of 18-34 year olds check Facebook when they wake up, with 28% doing so before even getting out of bed.
• The core 18-24 year old segment is now growing the fastest, at 74% year on year
• Almost 72% of all US internet users are now on Facebook, while 70% of the entire user base is located outside of the US.
• Over 700 billion minutes a month are spent on Facebook
• 20 million applications are installed per day, and over 250 million people interact with Facebook from outside the official website on a monthly basis across 2 million websites.
• Over 200 million people access Facebook via their mobile phone
• 48% of young people said they now get their news through Facebook.
• In just 20 minutes on Facebook, over 1 million links are shared, 2 million friend requests are accepted and almost 3 million messages are sent

Twitter Statistics
• Source: http://www.digitalbuzzblog.com/infographic-twitter-statistics-facts-figures/
• Number of accounts: more than 106M, increasing by 300K per day
• 3 billion requests per day
• Twitter users send 55M tweets per day (640 tweets per second)
• Most active:
– Thursday & Friday (16% of total tweets)
– 10-11 pm (4.8% of total tweets in an average day)

Challenges in Datamining
• Scalability
• Dimensionality
• Complex and Heterogeneous Data
• Data Quality
• Data Ownership and Distribution
• Privacy Preservation
• Streaming Data

Datamining & Knowledge Discovery
[Figure slide.]

Datamining Tasks
• Prediction (to predict the value of a particular attribute based on the value of other attributes)
– classification
– regression
– deviation detection
• Description (to derive patterns that summarize the underlying relationships in data)
– clustering
– association rule discovery
– sequential pattern discovery

Riset Datamining (Datamining Research)
• Meteorology: Fog Forecasting (1999-2002)
• Biomedicine: mRNA profile analysis for predicting the status of Tumour Suppressor Gene P53; Efficacy Prediction of Hepatitis C Therapy
• Remote Sensing: Coupling ground and airborne-based hyperspectral (HyMap) detection over rice canopy to predict leaf area index (LAI) and SPAD value
• Law: Intelligent Searching for Indonesian Law
• Webmining: Text Classification; Spatio Temporal Information Extraction & Visualization of Tropical Disease related web articles
• Finance: Exchange Rate Prediction between Rupiah and U.S. Dollars
• Social Network Mining
• Security & Networking: Malware Analysis; Spam Filtering

Agenda
• Coupling ground and airborne-based hyperspectral (HyMap) detection over rice canopy to predict leaf area index (LAI) and SPAD value using support vector machine (SVM) technique in irrigated wetland rice, West Java, Indonesia (collaboration with TISDA-BPPT & ERSDAC Japan)
• Spatio Temporal Information Extraction & Visualization of Tropical Disease related web articles
• TELUSUR HUKUM: Intelligent Searching using Association Analysis for Law Documents of Indonesian Government
• Self Organizing Map for analyzing Indonesian ICT indicator data

Latar Belakang (Background)
• It is difficult to determine rice production accurately in Indonesia.
• Hyperspectral technology is a new remote sensing technology that can record hundreds to thousands of bands in a single acquisition. With hyperspectral data, land cover can be identified in more detail, down to the type and condition of the crop.
• The research was carried out by Pusat Teknologi Inventarisasi Sumberdaya Alam, Badan Pengkajian dan Penerapan Teknologi (PTISDA BPPT).
Reference: “Prediksi Nilai Crop Variabel Tanaman Padi pada Data Hyperspectral menggunakan metode Support Vector Machine”, Yohan, Dept. Ilmu Komputer, Institut Pertanian Bogor, 2010

Tujuan Penelitian (Research Objective)
• The objective of this research is to predict the values of crop variables (LAI (leaf area index), SPAD, and Yield) using the Support Vector Machine (SVM) method.
[Figure labels: leaf area index; yield; chlorophyll (SPAD); nitrogen; phosphorus; carbon.]

[Figure: rice varieties (IR42, Ciherang, Ketan), growth stages (vegetative, reproductive, ripening) and study sites (Indramayu, Subang).]

Support Vector Machines
• Developed by Vapnik (1992)
• In principle, SVM works as a binary classifier
• SVM finds the best separating hyperplane by maximizing the margin (the margin is the minimum distance from the hyperplane to the training set)

Binary Classification
[Figure: Class -1 and Class +1 with several candidate discrimination boundaries (hyperplanes).]

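The definition of the margin above can be illustrated with a few lines of Python. This is only a sketch under stated assumptions: the four 2-D points and the two candidate hyperplanes are hypothetical, and the helper name `margin` is not from the slides. It shows that two hyperplanes can both separate the classes while having different margins; an SVM would prefer the larger one.

```python
import math

def margin(w, b, points):
    """Smallest distance from the hyperplane w.x + b = 0 to any point."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x in points)

# Hypothetical training set: class -1 on the left, class +1 on the right.
points = [(-2.0, 0.0), (-1.0, 1.0), (1.0, -1.0), (3.0, 0.5)]

# Two separating hyperplanes (vertical lines x1 = 0 and x1 = -0.5).
centered = margin((1.0, 0.0), 0.0, points)
shifted = margin((1.0, 0.0), 0.5, points)
print(centered, shifted)  # the centered line has the larger margin
```
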
Optimal Hyperplane in SVM
SVM works to find the hyperplane with the optimal (maximal) margin.
[Figure: Class -1 and Class +1 separated by the maximal-margin hyperplane, with margin d.]

SVM for Regression
SVM can be modified to solve regression problems by modifying its loss function.
Regularized error function:
$$\frac{1}{2}\sum_{i=1}^{l}\{f_i - y_i\}^2 + \frac{\lambda}{2}\|w\|^2$$
ε-insensitive error function:
$$E_\epsilon(f(x) - y) = \begin{cases} 0, & \text{if } |f(x) - y| < \epsilon; \\ |f(x) - y| - \epsilon, & \text{otherwise} \end{cases}$$
Modified regularized error function (to be minimized):
$$C\sum_{i=1}^{l} E_\epsilon(f_i - y_i) + \frac{1}{2}\|w\|^2$$

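The ε-insensitive error and the modified objective can be written out directly. The following is a minimal sketch, not the implementation used in the study: the function names and the sample numbers are assumptions chosen for illustration.

```python
def eps_insensitive(residual, eps):
    """E_eps(r): zero inside the eps-tube, |r| - eps outside it."""
    return max(0.0, abs(residual) - eps)

def svr_objective(w, predictions, targets, C, eps):
    """C * sum_i E_eps(f_i - y_i) + 0.5 * ||w||^2."""
    data_term = sum(eps_insensitive(f - y, eps)
                    for f, y in zip(predictions, targets))
    return C * data_term + 0.5 * sum(wi * wi for wi in w)

# Hypothetical values: the first prediction is exact, the second is off by 1.
value = svr_objective([3.0, 4.0], [1.0, 2.0], [1.0, 3.0], C=2.0, eps=0.5)
print(value)
```

Errors smaller than ε contribute nothing, which is what makes the solution depend only on a subset of the training points.
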
SVM for Regression
The solution can be obtained by introducing Lagrange multipliers $a_i$ and $\hat{a}_i$.
The prediction for new inputs can be made by applying the following equation:
$$f(x) = \sum_{i=1}^{l} (a_i - \hat{a}_i) K(x, x_i) + b$$
K is the kernel function (RBF kernel, polynomial kernel, etc.)

Experimental Results
• Observations were conducted in Subang and Indramayu
• Reflectance data were measured at heights of 10 cm and 50 cm
• Independent variables: 116 observed wavelengths
• Dependent variables: biophysical parameters, LAI and SPAD
• Performance of SVM and ANN was evaluated on HyMap data (independent from the training set)

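The prediction equation can be sketched as follows, assuming an RBF kernel of the form exp(-gamma * ||x - z||^2). The support vectors, coefficients and gamma below are hypothetical placeholders, not values from the experiments; `coeffs[i]` stands for the difference of the two Lagrange multipliers of training point i.

```python
import math

def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((xi - zi) ** 2 for xi, zi in zip(x, z)))

def svr_predict(x, support_vectors, coeffs, b, gamma):
    """f(x) = sum_i coeffs[i] * K(x, x_i) + b."""
    return sum(c * rbf_kernel(x, sv, gamma)
               for c, sv in zip(coeffs, support_vectors)) + b

# Hypothetical 1-D model with two support vectors.
prediction = svr_predict((0.0,), [(0.0,), (1.0,)], [2.0, -1.0], b=0.5, gamma=1.0)
print(prediction)
```
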
Kesimpulan (Conclusions)
• Overall, the predictions for the Indramayu area were more accurate than those for the Subang area.
• For Indramayu, LAI prediction (R2 > 0.9 and RMSE < 0.43) was satisfactory and SPAD prediction (R2 above 0.5) was fairly good, but Yield prediction was still unsatisfactory.
• For Subang, only LAI prediction was fairly good (R2 above 0.5); SPAD and Yield predictions were still unsatisfactory.
• The better results for Indramayu may be due to the weather at the time of data acquisition. In addition, less data was available from Subang than from Indramayu, and the rice varieties planted in Subang were more diverse.

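For readers who want to reproduce the two reported metrics, the sketch below computes RMSE and R2 for a list of measured and predicted values. It assumes R2 means the coefficient of determination (1 minus residual over total sum of squares); the slides do not say which definition was used, and the sample numbers are hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical measured and predicted LAI values.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 2.0, 3.0, 3.0]
print(rmse(measured, predicted), r_squared(measured, predicted))
```
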
Spatio Temporal Information Extraction & Visualization of Tropical Disease related web articles
Objective:
A system for analyzing the spatio-temporal spread of tropical diseases (dengue fever, bird flu, etc.) from Indonesian-language articles published on the internet
Academic achievements:
3 papers + 1 best paper award
Rival:
healthmap.org (Harvard-MIT Division of Health Sciences & Technology)

Latar Belakang (Background)
• Three important diseases in Indonesia
– Malaria
– Dengue hemorrhagic fever (Demam Berdarah Dengue)
– Bird flu (Flu Burung)
• Indonesia is one of the world's bird flu hotspots
– “Indonesia is currently the world’s avian-flu hotspot” (Nature 440, 726-727, 6 Apr 2006)
– “Bird flu outbreaks in Indonesia going unstudied” (http://www.nature.com/news/2006/060724/full/060724-12.html)
– “Indonesian bird-flu cluster rings alarm bells” (Nature 441, 554-555, 1 June 2006)
• Monitoring the spread of disease is an important step in tackling the problem.

Flu Burung di Journal Nature (Bird Flu in the Journal Nature)
[Figure: Nature 440, 726-727, 6 Apr 2006]

Tujuan (Objective)
To develop a system for monitoring the spread of infectious diseases with the following characteristics:
• Displays spatio-temporal data on the spread of infectious diseases in Indonesia, based on textual information available on the internet
• Uses datamining, text mining, pattern recognition and other computing technologies to find valuable information (knowledge discovery) in large-scale article collections
• The software is developed under an open source license, so that:
– it can be redistributed freely
– the source code can be viewed and modified
– users are free to make fixes and modifications

Mengapa Web Mining? (Why Web Mining?)
• Exploits the “power of the web”: a massive repository of text
• Google search results for Indonesian-language articles
– “Flu burung” 991,000 entries
– “Demam Berdarah” 377,000 entries
– “Malaria” 345,000 entries
• Makes optimal use of the “explicit structural markup” of web articles, both internal and external (links from one article to another)
• Software developed to date is mostly for English, so it cannot be used for semantic analysis of Indonesian

Tampilan Spatio Temporal (Spatio-Temporal Display)
[Figure: records laid out along a spatial axis and a temporal axis (2005, 2006, 2007). Each record shows Tanggal (date): 1 Januari 2006; Daerah (region): Jakarta Selatan; Penyakit (disease): Demam Berdarah Dengue; Jumlah penderita (number of cases): 16; Meninggal (deaths): 2; Berita (news): 1. detik.com, 2. Kompas, 3. Republika, each followed by a short article snippet.]
Mathematical model of infectious disease spread?

Arsitektur Sistem (System Architecture)
Tropical Disease MAP (TD-MAP)
• Crawler (NUTCH): seed URLs and articles on the internet feed the crawl cycle: inject, generate, fetch, updatedb, invertlinks, index. Data stores: CrawlDB, Segments, LinkDB, Index. The Index is queried by the Searcher.
• Text Classification, Clustering, etc.
• Information Extraction: Article → Tokenizer → POS Tagger → Sentence Analysis → Template Filler → Template Merger
• Output: extracted spatio-temporal information on disease spread, visualized in Google Earth

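To make the template-filling stage concrete, here is a deliberately simplified sketch. It is not the TD-MAP pipeline: it skips the tokenizer, POS tagger and sentence analysis and uses plain string matching instead. The gazetteer, the disease list, the field names and the example sentence are all hypothetical.

```python
import re

# Hypothetical lookup lists; a real system would use much larger resources.
DISEASES = ("demam berdarah", "flu burung", "malaria")
REGIONS = ("Jakarta Selatan", "Indramayu", "Subang")
DATE_RE = re.compile(
    r"\b\d{1,2} (?:Januari|Februari|Maret|April|Mei|Juni|Juli|Agustus|"
    r"September|Oktober|November|Desember) \d{4}\b"
)

def fill_template(sentence):
    """Fill a date/region/disease template from one sentence."""
    lower = sentence.lower()
    date = DATE_RE.search(sentence)
    return {
        "tanggal": date.group(0) if date else None,
        "daerah": next((r for r in REGIONS if r in sentence), None),
        "penyakit": next((d for d in DISEASES if d in lower), None),
    }

example = "Pada 1 Januari 2006, 16 warga Jakarta Selatan terserang demam berdarah."
print(fill_template(example))
```

Templates filled from several sentences or articles would then be combined by a merging step, which this sketch does not attempt.
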
Publikasi (Publications)
1. Desain Sistem Analisa Spatio-Temporal Penyebaran Penyakit Tropis memakai Web Mining, Proc. of Konferensi Nasional Sistem & Informatika 2008, pp.44-49, Inna Sindhu Beach Hotel, Sanur-Bali, Indonesia (Best Paper Award)
2. Text Classification using Support Vector Machine for Webmining based Spatio Temporal Analysis of the Spread of Tropical Disease, Proc. of International Conference on Rural Information & Communication Technology, pp.189-192, Bandung Institute of Technology, Indonesia, 17 June 2009
3. Analisa dan Rancang Bangun Modul Visualisasi pada Sistem Analisa Spatio-Temporal Penyebaran Penyakit Tropis, Proc. of Konferensi dan Temu Nasional Teknologi Informasi dan Komunikasi untuk Indonesia (eII2010), Institut Teknologi Bandung, Indonesia, 6 Mei 2010
4. A Study on Text Classification for Webmining Based Spatio Temporal Analysis of the Spread of Tropical Diseases, Proc. of International Conference on Advance Computer Science & Information System (ICACSIS) 2010, pp.311-314, Bali, Indonesia, 2010
Rival: healthmap.org (Harvard-MIT Division of Health Sciences & Technology)

Motivation
The aim of this study is to develop an intelligent searching system for Indonesian law documents with the following features:
• Searching based on words (keywords)
• Finding the relationships between keywords, so that the searching process can run more effectively
[Diagram: Law Documents (documents, verses) → Database → Searching Machine → Related Words]

“Telusur Hukum”
Searching the data in “Telusur Hukum”, developed by the Agency for the Assessment and Application of Technology (BPPT), has not been effective because it cannot provide the correlation between one paragraph and another.

Architecture
Law documents (verses) → Tokenization → Stopword Removal → Text to Vector Conversion → Association Analysis (Frequent Itemset Generation, then Rules Generation) → Association Rules

Proposed Method: Definitions
Association analysis is a data mining technique that describes the dependence between attributes.
Support (s) of a set of items is the fraction of transactions that contain all items in the set.
Confidence (c) of a rule X → Y is the fraction of the transactions containing X that also contain Y.

Apriori Algorithm
The Apriori algorithm finds high-frequency patterns (frequent itemsets). A high-frequency pattern is a pattern of items in a database whose frequency or support is above a certain threshold.

Experiment and Result: Generate Frequent Itemsets
No.  Verses
1    dokumen identitas administrasi pemerintah
2    urusan administrasi pemerintah identitas
3    urusan administrasi identitas dokumen
4    dokumen pemerintah administrasi

Minimum support = 3

1-itemset                 Count
dokumen                   3
identitas                 3
administrasi              4
urusan                    2
pemerintah                3

2-itemset                 Count
dokumen, identitas        2
dokumen, administrasi     3
dokumen, pemerintah       2
identitas, administrasi   3
identitas, pemerintah     2
administrasi, pemerintah  3

3-itemset                          Count
dokumen, administrasi, pemerintah  2

Experiment and Result: Rules Generation
Rule X → Y with X = {dokumen, administrasi} and Y = {pemerintah}:
Confidence: $c = \frac{\sigma(X \cup Y)}{\sigma(X)} = \frac{2}{3} = 0.67$
Support: $s = \frac{\sigma(X \cup Y)}{|T|} = \frac{2}{4} = 0.5$

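The worked example above can be checked with a short script. This is a sketch under stated assumptions: the four verses are copied from the table, the function names are illustrative, and `frequent_itemsets` enumerates every k-item combination by brute force rather than using Apriori's candidate pruning, which is adequate only for a toy data set.

```python
from itertools import combinations

verses = [
    {"dokumen", "identitas", "administrasi", "pemerintah"},
    {"urusan", "administrasi", "pemerintah", "identitas"},
    {"urusan", "administrasi", "identitas", "dokumen"},
    {"dokumen", "pemerintah", "administrasi"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    return count(itemset, transactions) / len(transactions)

def confidence(x, y, transactions):
    return count(x | y, transactions) / count(x, transactions)

def frequent_itemsets(transactions, k, min_count):
    """Brute-force k-itemsets with count >= min_count (no Apriori pruning)."""
    items = sorted(set().union(*transactions))
    result = {}
    for combo in combinations(items, k):
        n = count(set(combo), transactions)
        if n >= min_count:
            result[frozenset(combo)] = n
    return result

x, y = {"dokumen", "administrasi"}, {"pemerintah"}
print(support(x | y, verses), round(confidence(x, y, verses), 2))
print(frequent_itemsets(verses, 1, 3))
```
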
Experiment and Result
• 300 data
• Rules = 58
• Minimum support = 0.1
• Minimum confidence = 0.9

No.  Rule                                            s     c
1    Pendaftaran => Penduduk                         0.21  1
2    Penduduk Pelaksanaan 2006 => Instansi           0.15  1
3    Pendaftaran 2006 => Penduduk                    0.15  1
4    Instansi Pendaftaran => Penduduk                0.13  1
5    Pelaksanaan Pendaftaran => Penduduk             0.13  1
6    Dokumen Data => Kependudukan                    0.13  1
7    Instansi Pelaksanaan Pendaftaran => Penduduk    0.13  1
8    Dokumen Pelaksanaan => Instansi                 0.11  1
9    Instansi Pendaftaran 2006 => Penduduk           0.1   1
10   Pelaksanaan Pendaftaran 2006 => Instansi        0.1   1
11   Instansi Penduduk Pendaftaran 2006              0.1   1

Experiment and Result
[Figure: bar chart of the number of rules (0 to 100) versus minsup (0.1, 0.2, 0.3) for minconf = 1.0, 0.9, 0.8 and 0.7.]

Telusur Hukum Development
[Figure slides.]

Conclusion
• The “Telusur Hukum” system can browse law documents based on keywords supplied by the user.
• It also assists searching by providing highly associated keywords.
• The additional information on correlated keywords makes the searching process more effective.

Background
[Figure slide.]

Objective
• To map village data in DKI Jakarta based on the availability of telecommunication and internet facilities
• Method: Self Organizing Map
Data: Indonesian Village Potential Data for 2005
(Source: Badan Pusat Statistik Republik Indonesia (BPS), the Indonesian statistics agency)

Self Organizing Maps (SOM)
• Developed by Prof. Teuvo Kohonen (Helsinki University of Technology)
• Visualizes information that lives in a high-dimensional space
• Converts the nonlinear statistical relationships of data in a high-dimensional space into geometric relationships between their images in a low-dimensional space (a 1- or 2-dimensional grid of nodes)
• Projects:
– Improving the intelligent search feature of TELUSUR HUKUM
– Mapping the level of ICT progress (case study on the Jakarta area)
– Mapping infectious disease news and corruption crime news

Characteristics of SOM
• SOM is not meant for pattern recognition
• SOM is a clustering, visualization & abstraction method
• SOM is an unsupervised algorithm. If you want to use it for classification, you should use Learning Vector Quantization instead of SOM.
• Preprocessing is important & should not be overlooked
• In benchmarking studies, one should compare the computational speed too, not only the accuracy of the models.
• Difference between SOM and traditional Vector Quantization: the reference vectors of the nodes are determined so that the mapping is ordered and descriptive of the statistical distribution of the input dataset. This characteristic shows that SOM is a kind of nonparametric regression of the data (regression: fitting the data distribution to a mathematical equation)

Neuron Model
Input: $x_k = (x_{k1}, x_{k2}, \ldots, x_{kn})^T$
Training set / input data (K items): $\{x_k \mid k = 1, 2, \ldots, K\}$
Reference vectors (M items): $\{m_i \mid i = 1, 2, \ldots, M\}$, with $m_i = (m_{i1}, m_{i2}, \ldots, m_{in})^T$
Output: $y_i = \mathrm{dist}(x_k, m_i)$

Algoritma (Algorithm)
1. Prepare M neurons and initialize the reference vector of each neuron randomly. Set iteration t = 1.
2. Set k = 1.
3. Compute the distance between datum k of the training set and the reference vectors of all neurons, and choose the neuron whose reference vector is closest to that datum (this neuron is called the winner).
4. Update the reference vectors of the winner and of its neighbors, as defined by the neighborhood function.
5. Set k = k + 1. If k > K, continue to step 6; otherwise return to step 3.
6. Update (decrease) the learning rate and the radius of the neighborhood function. Set t = t + 1.
7. If the stop condition holds, END; otherwise return to step 2.

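The seven steps can be sketched in plain Python as follows. This is an illustrative sketch, not the code used in the projects: it assumes a 1-D grid of neurons, a fixed number of epochs as the stop condition, a linearly decreasing learning rate and radius, and a neighborhood that gives the winner and its grid neighbors the same update strength. All names and default values are hypothetical.

```python
import random

def squared_distance(x, m):
    return sum((xi - mi) ** 2 for xi, mi in zip(x, m))

def find_winner(x, reference_vectors):
    """Index of the neuron whose reference vector is closest to x."""
    return min(range(len(reference_vectors)),
               key=lambda i: squared_distance(x, reference_vectors[i]))

def train_som(data, n_neurons, n_epochs, alpha0=0.5, radius0=1, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    # Step 1: random initialization of the reference vectors.
    m = [[rng.random() for _ in range(dim)] for _ in range(n_neurons)]
    for t in range(n_epochs):
        # Step 6 (assumed schedule): shrink learning rate and radius over time.
        alpha = alpha0 * (1.0 - t / n_epochs)
        radius = round(radius0 * (1.0 - t / n_epochs))
        for x in data:                      # steps 2 and 5: loop over the data
            c = find_winner(x, m)           # step 3: winner neuron
            lo, hi = max(0, c - radius), min(n_neurons, c + radius + 1)
            for i in range(lo, hi):         # step 4: winner and grid neighbors
                m[i] = [mi + alpha * (xi - mi) for xi, mi in zip(x, m[i])]
    return m

# Hypothetical 2-D data forming two small groups.
data = [[0.1, 0.1], [0.2, 0.1], [0.9, 0.9], [0.8, 0.9]]
trained = train_som(data, n_neurons=2, n_epochs=20)
print(trained)
```
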
Step-1: Initialization
Random initialization, or choosing a subset of the training set.
[Figure: grid of neurons with reference vectors $m_1, \ldots, m_M$.]

Step-2: Find the winner neuron
Find the neuron with the smallest distance to the input pattern x:
$$c = \arg\min_i \| x - m_i \|$$
where c is the winner neuron and the distance is, e.g., the Euclidean distance.

Step-3: Update the winner & neighborhood
$N_c(t)$ is the neighborhood of the winner neuron at time t. Each reference vector is moved toward the input by a fraction $h_{ci}(t)$ of the difference $x(t) - m_i(t)$:
$$m_i(t+1) = m_i(t) + h_{ci}(t)\,[x(t) - m_i(t)]$$
[Figure: vectors $m_i(t)$, $x(t)$, $x(t) - m_i(t)$, $h_{ci}(t)[x(t) - m_i(t)]$ and the resulting $m_i(t+1)$.]

Example of Neighborhood Function (1)
$$h_{ci}(t) = \begin{cases} \alpha(t), & i \in N_c \\ 0, & \text{otherwise} \end{cases}$$
$0 < \alpha(t) < 1$ is a scalar-valued learning rate factor (a monotonically decreasing function).

Example of Neighborhood Function (2)
Gaussian function:
$$h_{ci}(t) = \alpha(t) \cdot \exp\left(-\frac{\| r_c - r_i \|^2}{2\sigma^2(t)}\right)$$
• $r_c$, $r_i$: location vectors of nodes c and i
• $\alpha(t)$: scalar-valued learning rate factor, $0 < \alpha(t) < 1$
• $\sigma(t)$: width of the kernel, corresponds to the radius of $N_c(t)$

Neighborhood Function at time t
[Figure: nested neighborhoods $N_c(t_1)$, $N_c(t_2)$, $N_c(t_3)$ for $t_1 < t_2 < t_3$.]

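The Gaussian neighborhood function and the update rule it plugs into can be written as two small functions. This is a minimal sketch with hypothetical names and sample values; grid positions are given as coordinate tuples.

```python
import math

def gaussian_neighborhood(r_c, r_i, alpha, sigma):
    """h_ci = alpha * exp(-||r_c - r_i||^2 / (2 * sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(r_c, r_i))
    return alpha * math.exp(-d2 / (2.0 * sigma ** 2))

def update(m_i, x, h):
    """m_i(t+1) = m_i(t) + h * (x(t) - m_i(t))."""
    return [m + h * (xv - m) for m, xv in zip(m_i, x)]

# The winner itself gets the full learning rate; distant nodes get less.
h_winner = gaussian_neighborhood((0, 0), (0, 0), alpha=0.5, sigma=1.0)
h_near = gaussian_neighborhood((0, 0), (1, 0), alpha=0.5, sigma=1.0)
h_far = gaussian_neighborhood((0, 0), (3, 4), alpha=0.5, sigma=1.0)
print(h_winner, h_near, h_far)
print(update([0.0, 0.0], [1.0, 2.0], h_winner))
```

Unlike the step-shaped function of example (1), every node is updated, but the update strength decays smoothly with grid distance from the winner.
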
Contoh Aplikasi SOM (Example Applications of SOM)
• Animal Dataset
• World poverty map
• WebSOM
• Experiments with numeric characters
• Experiments with Kanji characters
Source:
A.S. Nugroho, S. Kuroyanagi, A. Iwata: “Mathematical perspective of CombNET and its application to meteorological prediction,” Special Issue of Meteorological Society of Japan on Mathematical Perspective of Neural Network and its Application to Meteorological Problem, Meteorological Research Note, No.203, pp.77-107, October 2002 (Japanese Edition)

World Poverty Map
Source: http://www.cis.hut.fi/research/som-research/worldmap.html
(Based on World Bank statistics of countries in 1992. Altogether 39 indicators describing various quality-of-life factors, such as state of health, nutrition, educational services, etc., were used.)

World Poverty Map
A map of the world where countries have been colored with the color describing their poverty type (the color was obtained with the SOM in the previous figure).

Pemetaan Kemajuan TIK (Mapping ICT Progress)
• Objectives:
– Implement the SOM algorithm for clustering the Podes 2005 data for the Jakarta area, in order to build a model that represents the level of IT progress in the Jakarta area
– Obtain the characteristics of the clustered data
• Benefits: obtaining information on
– the grouping of villages based on ICT indicators
– the topology between the groups, which is expected to help the relevant parties in decision making

Experiments
• Data Preprocessing: DKI Jakarta Province data with 18 attributes and 267 records

SOM Results
[Figure slide.]

Publication
• The Mapping of Information Communication and Technology (ICT) Progress by Using Self Organizing Maps (SOM), Proc. of Second International Conference on Advances in Computing, Control, and Telecommunication Technologies - ACT 2010, pp.185-187, 2 December 2010, Bina Nusantara University, Jakarta-Indonesia
