
Data Mining:

8. Text Mining

Romi Satria Wahono


romi@romisatriawahono.net
http://romisatriawahono.net/dm
WA/SMS: +6281586220090

Romi Satria Wahono
SD Sompok Semarang (1987)
SMPN 8 Semarang (1990)
SMA Taruna Nusantara Magelang (1993)
B.Eng, M.Eng and Ph.D in Software Engineering from
Saitama University Japan (1994-2004)
Universiti Teknikal Malaysia Melaka (2014)
Research Interests: Software Engineering, Machine Learning
Founder and Coordinator of IlmuKomputer.Com
Researcher at LIPI (2004-2007)
Founder and CEO of PT Brainmatics Cipta Informatika

Course Outline
1. Introduction to Data Mining
2. The Data Mining Process
3. Data Preparation
4. Classification Algorithms
5. Clustering Algorithms
6. Association Algorithms
7. Estimation and Forecasting Algorithms
8. Text Mining

8. Text Mining
8.1 Text Mining Concepts
8.2 Text Clustering
8.3 Text Classification

8.1 Text Mining Concepts

How Text Mining Works
The fundamental step in text mining is converting unstructured text into semi-structured data.
Once the text is in semi-structured form, any of the standard analytics techniques can be applied to classify, cluster, and predict.
The conversion makes it possible to find patterns in the data and, better still, to train models that detect those patterns in new, unseen text.

Text Processing

The Data Mining Process

1. Dataset (data understanding and processing)
2. Data Mining Method (choose the method that suits the character of the data)
3. Knowledge (pattern / model / formula / tree / rule / cluster)
4. Evaluation (accuracy, AUC, RMSE, lift ratio, ...)

Data pre-processing (step 1): Data Cleaning, Data Integration, Data Reduction, Data Transformation, Text Processing
Data mining methods (step 2): Estimation, Prediction, Classification, Clustering, Association

Word, Token and Tokenization
Words are separated by a special character: a blank space.
Each word is called a token.
The process of discretizing words within a document is called tokenization.
For our purposes, each sentence can be considered a separate document, although what counts as an individual document may depend on the context.
For now, a document is simply a sequential collection of tokens.
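
The tokenization step described above can be sketched in a few lines of Python. This is a minimal illustration, not the RapidMiner operator: the function name `tokenize`, the regular expression, and the two sample sentences are assumptions made for this example. It lowercases the text and also drops punctuation, which goes slightly beyond splitting on blank spaces.

```python
import re


def tokenize(document: str) -> list[str]:
    """Split a document into lowercase word tokens."""
    # Lowercase first so that "This" and "this" become the same token.
    # The pattern keeps runs of letters, digits and apostrophes and
    # discards whitespace and punctuation.
    return re.findall(r"[a-z0-9']+", document.lower())


# Hypothetical sample sentences; each sentence is treated as one document.
documents = [
    "This is a book on data mining",
    "This book describes data mining and text mining using RapidMiner",
]

tokenized = [tokenize(doc) for doc in documents]
for tokens in tokenized:
    print(tokens)
```

Each document ends up as a sequential list of tokens, which is the form the later preprocessing steps work on.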

Matrix of Terms
We can impose some form of structure on this raw data by creating a matrix, where:
  each row represents one document
  the columns consist of all the tokens found in the two documents
  the cells of the matrix are the counts of the number of times a token appears in that document
Each token is now an attribute, in standard data mining parlance, and each document is an example.
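
A small sketch of how such a matrix of term counts could be built, assuming two hypothetical sample documents that are already lowercased and split on blank spaces. The variable names `vocabulary` and `matrix` are illustrative only.

```python
from collections import Counter

# Hypothetical documents, already tokenized on whitespace.
documents = [
    "this is a book on data mining".split(),
    "this book describes data mining and text mining using rapidminer".split(),
]

# Columns: every distinct token found in any document, sorted for a stable order.
vocabulary = sorted({token for doc in documents for token in doc})

# Rows: one vector of counts per document.
matrix = []
for doc in documents:
    counts = Counter(doc)  # a missing token counts as 0
    matrix.append([counts[token] for token in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Every row has the same length as the vocabulary, so each token acts as an attribute and each document as an example.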

Term Document Matrix (TDM)
The unstructured raw data is now transformed into a format that is recognized not only by human users as a data table but, more importantly, by the machine learning algorithms that require such tables for training.
This table is called a document vector or term document matrix (TDM), and it is the cornerstone of the preprocessing required for text mining.

TF-IDF
We could also have chosen to use the TF-IDF (term frequency-inverse document frequency) score of each term to create the document vector.
In the TF-IDF score:
  N is the number of documents that we are trying to mine
  N_k is the number of documents that contain the keyword k
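
The scoring formula itself is not reproduced in the text here. One common formulation, assumed for this sketch, is TF-IDF = (n_k / n) * log2(N / N_k), where N and N_k are as defined above, n_k is the number of times keyword k occurs in a document, and n is the total number of terms in that document. The symbols n_k and n, the function name `tf_idf`, and the sample documents are assumptions made for this example. Other variants (different logarithm base, smoothing) are also in use.

```python
import math


def tf_idf(documents: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: score} mapping per tokenized document."""
    n_docs = len(documents)  # N

    # N_k: number of documents that contain term k at least once.
    doc_freq: dict[str, int] = {}
    for doc in documents:
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1

    scores = []
    for doc in documents:
        n = len(doc)  # total number of terms in this document
        row = {}
        for term in set(doc):
            tf = doc.count(term) / n                  # n_k / n
            idf = math.log2(n_docs / doc_freq[term])  # log2(N / N_k)
            row[term] = tf * idf
        scores.append(row)
    return scores


# Hypothetical documents, already tokenized on whitespace.
documents = [
    "this is a book on data mining".split(),
    "this book describes data mining and text mining using rapidminer".split(),
]
scores = tf_idf(documents)
for row in scores:
    print(sorted(row.items()))
```

Under this formulation a term that occurs in every document scores zero, because log2(N / N_k) is zero when N_k equals N, while terms unique to one document get the highest weight.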

Stopwords
One thing to notice in the two sample text documents is the occurrence of common words such as "a", "this", "and", and other similar terms.
In larger documents we would expect even more such terms, which do not convey specific meaning.
Most grammatical necessities such as articles, conjunctions, prepositions, and pronouns may need to be filtered out before we perform additional analysis. Such terms are called stopwords.
Stopword filtering is usually the second step, immediately following tokenization.
After standard English stopword filtering, the document vector is significantly smaller.
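
Stopword filtering can be sketched as a simple set lookup. The stopword set below is a tiny illustrative sample chosen for this example, not a standard English list, and the function name `remove_stopwords` is hypothetical.

```python
# A tiny illustrative stopword set; real lists are much longer.
STOPWORDS = {"a", "an", "and", "is", "of", "on", "the", "this", "to"}


def remove_stopwords(tokens: list[str]) -> list[str]:
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token.lower() not in STOPWORDS]


tokens = "this is a book on data mining".split()
filtered = remove_stopwords(tokens)
print(filtered)  # ['book', 'data', 'mining']
```

Four of the seven tokens are removed, which shows how quickly the document vector shrinks once stopwords are filtered.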

Indonesian Stopwords
Search the web with the keyword: stopwords bahasa Indonesia
Download an Indonesian stopword list and use it in RapidMiner.

Stemming
Words such as recognized, recognizable, and recognition appear in different usages, but contextually they may all imply the same meaning, for example:
  Einstein is a well-recognized name in physics.
  The physicist went by the easily recognizable name of Einstein.
  Few other physicists have the kind of name recognition that Einstein has.
The so-called root of all these highlighted words is recognize.
By reducing the terms in a document to their basic stems, we simplify the conversion of unstructured text to structured data, because we then only take into account the occurrence of the root terms.
This process is called stemming. The most common stemming technique for text mining in English is the Porter method (Porter, 1980).
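
To make the idea concrete, here is a toy suffix-replacement sketch. It is deliberately NOT the Porter method: the rule table and the function name `toy_stem` are hand-picked assumptions that only cover the three example words above. A production stemmer applies many ordered rules and generally returns truncated stems rather than dictionary words.

```python
# Hand-picked (suffix, replacement) rules for illustration only.
# This is NOT the Porter algorithm, which applies several ordered
# phases of rules with conditions on the remaining stem.
SUFFIX_RULES = [
    ("izable", "ize"),
    ("ition", "ize"),
    ("ized", "ize"),
]


def toy_stem(word: str) -> str:
    """Map a word to a root form using the first matching suffix rule."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word  # no rule matched: leave the word unchanged


for word in ["recognized", "recognizable", "recognition"]:
    print(word, "->", toy_stem(word))
```

All three variants collapse to a single term, so the document vector needs one column for them instead of three.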

A Typical Sequence of Preprocessing Steps to Use in Text Mining

N-Grams
Some families of words in spoken and written language typically go together.
The word "Good" is usually followed by "Morning", "Afternoon", "Evening", "Night" or, in Australia, "Day".
Grouping such terms, called n-grams, and analyzing them statistically can yield new insights.
Search engines use word n-gram models for a variety of applications, such as automatic translation, identifying speech patterns, checking for misspellings, entity detection, and information extraction, among many other use cases.
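
Generating and counting word n-grams can be sketched as a sliding window over the token list. The function name `ngrams` and the sample sentence are assumptions made for this example.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return every run of n consecutive tokens, in order."""
    # If the list is shorter than n, the range is empty and so is the result.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample text, tokenized on whitespace.
tokens = "good morning and good evening and good night".split()

bigram_counts = Counter(ngrams(tokens, 2))
for bigram, count in bigram_counts.most_common(3):
    print(bigram, count)
```

Counting the n-grams is the simplest form of the statistical analysis mentioned above: frequent pairs point to words that tend to go together.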

RapidMiner Process of Text Mining

8.2 Text Clustering

Exercise
Run the experiment described in Matthew North's book (Data Mining for the Masses), Chapter 12 (Text Mining), pp. 189-215.
Dataset: Federalist Papers
Understand the text mining workflow that is carried out and relate it to the concepts already covered.

1. Business Understanding
Gillian is a historian and archivist, and she has recently curated an exhibit on the Federalist Papers, the essays that were written and published in the late 1700s.
The essays were published anonymously under the author name Publius, and no one really knew at the time if Publius was one individual or many.
Years later, after Alexander Hamilton died in 1804, some notes were discovered that revealed that he (Hamilton), James Madison and John Jay had been the authors of the papers.
The notes indicated specific authors for some papers, but not for others. Specifically, John Jay was revealed to be the author of papers 3, 4 and 5; Madison of paper 14; and Hamilton of paper 17. Paper 18 had no author named, but there was evidence that Hamilton and Madison worked on that one together.
Gillian would like to analyze paper 18's content in the context of the other papers with known authors, to see if she can generate some evidence that the suspected collaboration between Hamilton and Madison is in fact a likely scenario.
Having studied all of the Federalist Papers and other writings by the three statesmen who wrote them, Gillian feels confident that paper 18 is a collaboration that John Jay did not contribute to: his vocabulary and grammatical structure were quite different from those of Hamilton and Madison.

2. Data Understanding

Exercise
Run the experiment described in Vijay Kotu's book (Predictive Analytics and Data Mining), Chapter 9 (Text Mining), Case Study 1: Keyword Clustering, pp. 284-287.
Datasets:
1. http://sport.detik.com
2. http://travel.detik.com
Use an Indonesian stopword list (downloaded from the Internet).

8.3 Text Classification

Exercise
Run the experiment described in Vijay Kotu's book (Predictive Analytics and Data Mining), Chapter 9 (Text Mining), Case Study 2: Predicting the Gender of Blog Authors, pp. 287-301.
Dataset: blog-gender-dataset.xlsx
Split the data: 50% for training and 50% for testing.
Use the Naive Bayes algorithm.
Apply the resulting model to the testing data.
Measure its performance.
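
The workflow in this exercise (split, train Naive Bayes, apply, measure) can be sketched outside RapidMiner as follows. This is a minimal multinomial Naive Bayes with add-one smoothing, written under stated assumptions: the eight labeled toy documents and their "pos"/"neg" labels are invented for illustration and are not the blog dataset, and the function names `train_naive_bayes` and `predict` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (tokens, label) pairs. Returns the fitted counts."""
    label_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)  # label -> token counts
    vocabulary = set()
    for tokens, label in examples:
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary


def predict(model, tokens):
    """Return the label with the highest log posterior for the tokens."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in label_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: (tokens, label).
data = [
    ("great fun movie".split(), "pos"),
    ("boring slow plot".split(), "neg"),
    ("fun great acting".split(), "pos"),
    ("slow boring acting".split(), "neg"),
    ("great movie".split(), "pos"),
    ("boring plot".split(), "neg"),
    ("fun acting great".split(), "pos"),
    ("slow plot boring".split(), "neg"),
]

# 50% training, 50% testing; both halves contain both labels.
half = len(data) // 2
train, test = data[:half], data[half:]

model = train_naive_bayes(train)
correct = sum(predict(model, tokens) == label for tokens, label in test)
accuracy = correct / len(test)
print("accuracy:", accuracy)
```

On real data the split should be randomized or stratified, and accuracy alone may not be enough; the fixed first-half/second-half split here only keeps the sketch deterministic.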

Exercise
Using the concepts and techniques you have mastered, perform text classification on the dataset polarity data - small.
Take one article from the pos folder and test whether it is classified as negative or positive sentiment.

Exercise
Using the concepts and techniques you have mastered, perform text classification on the dataset polarity data.
Apply several feature selection methods, both filter and wrapper.
Compare several classification algorithms and choose the best one.

