
OPTIMUM PARAMETER SELECTION FOR KLD BASED AUTHORSHIP ATTRIBUTION IN GUJARATI
Parth Mehta, Prasenjit Majumder
parth_me@daiict.ac.in, p_majumder@daiict.ac.in
PROBLEM
Given an article and a list of candidate authors, an authorship attribution system must predict which author wrote the article. The
proposed method uses the implicit feature-weighting property of Kullback-Leibler divergence (KLD) to build author profiles that do not face
the following problems:

χ² method: Completely unreliable profiles when the number of authors increases.
Delta method [1]: The L1 norm is inefficient at capturing the distance between author profiles.
Z-Score [2]: Performance is limited by the source of the training set. If an author writes for two separate newspapers (or two columns
of the same newspaper), the Z-Score based method distinguishes between the two, which is not always desirable.
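For reference, the Delta measure of [1] compares z-scored relative frequencies of the most frequent terms and averages the absolute differences (an L1-style distance). The sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, the population standard deviation is used, every term is assumed to have non-zero variance across the corpus, and the `norm=2` branch is a plain Euclidean variant that may differ from the "L2" variant used on this poster.

```python
import math
import statistics


def z_scores(rel_freqs, means, stdevs):
    """Standardise one text's relative term frequencies against corpus statistics."""
    return [(f - m) / s for f, m, s in zip(rel_freqs, means, stdevs)]


def delta(query_freqs, author_freqs, corpus_freqs, norm=1):
    """Distance between a query text and an author profile.

    corpus_freqs is a list of per-text relative-frequency vectors that all
    use the same term order as query_freqs and author_freqs.
    Assumption: every term has non-zero standard deviation in the corpus.
    """
    columns = list(zip(*corpus_freqs))
    means = [statistics.mean(c) for c in columns]
    stdevs = [statistics.pstdev(c) for c in columns]
    zq = z_scores(query_freqs, means, stdevs)
    za = z_scores(author_freqs, means, stdevs)
    diffs = [abs(a - b) for a, b in zip(zq, za)]
    if norm == 1:
        # Mean absolute difference of z-scores (L1-style).
        return sum(diffs) / len(diffs)
    # Assumed Euclidean variant (L2).
    return math.sqrt(sum(d * d for d in diffs))
```

The candidate author with the smallest distance to the query text would be chosen.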
PROFILE BASED SYSTEMS
DATA
The articles used here are taken from supplements of the popular Gujarati newspaper Gujarat Samachar.
49 weekly periodicals
40 different authors (9 authors wrote two columns every week)
5039 total documents
9 different categories
WE PROPOSE
Because KLD implicitly weights terms by their frequencies, we propose
using KLD with a small number of the most frequent terms, for two reasons:
The most frequent terms tend to be common across all articles, so even a few articles can
build a reliable author profile. Systems based on specific vocabulary, by contrast, may find
a small training set insufficient to represent the entire set of specific vocabulary.
These frequent terms will also be common across articles of different genres written by
the same author, which might not be true for specific vocabulary.
RESULTS
For a given experiment, the techniques
that share all parameters (except
the one under experimentation) are shown in
the same colour. A marked bar shows a significant difference
compared to the maximum value.

[Figure: Effect of variation in smoothing parameter. Y-axis: accuracy (0 to 100). Bar labels in order: 0; 0.1; 0, L1; 0, L2; 0.01; 0.001; 0. Method groups: Z-Score, Delta, KLD, χ².]




[Figure: Effect of variation in number of terms. Y-axis: accuracy (50 to 100). Bar labels in order: tf > 10, df > 3; tf > 100, df > 3; Top 400; Top 100; tf > 10, df > 3; tf > 100, df > 3; tf > 1000, df > 3. Method groups: Z-Score, Delta (L2), KLD.]






[Figure: Effect of variation in size of training set. Y-axis: accuracy (0 to 100). X-axis: N = 10, N = 40, N = Nmax. Series: Delta (L2), Z-Score, KLD.]
EXPERIMENTS
While keeping the other parameters constant:
1. Find the best performing value of the smoothing parameter ()
2. Find the optimum number of terms (T)
3. Measure the effect of variation in the number of training texts (N)
We start with a reasonable guess of the parameter values, keeping in mind the results found in the literature.
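The one-parameter-at-a-time procedure above could be organised as in the sketch below. The `evaluate` callable is hypothetical (it stands for any routine that trains and tests an attribution method and returns its accuracy), and carrying the best value of each parameter forward to the next step is an assumption, not something the poster states.

```python
def sweep_one_at_a_time(evaluate, start, grid):
    """Tune parameters in grid order, varying one while the others stay fixed.

    evaluate: hypothetical callable taking the parameters as keyword
              arguments and returning an accuracy.
    start:    dict of initial guesses for every parameter.
    grid:     dict mapping a parameter name to its candidate values.
    """
    current = dict(start)
    history = {}
    for name, values in grid.items():
        # Score every candidate value with all other parameters held fixed.
        scores = {v: evaluate(**{**current, name: v}) for v in values}
        history[name] = scores
        # Assumption: keep the best value when moving to the next parameter.
        current[name] = max(scores, key=scores.get)
    return current, history
```

A call might pass `start={"lam": 0.1, "T": 400, "N": 10}` together with a grid of candidate values for each of the three parameters; these names are illustrative only.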
ANALYSIS
For each of the 9 authors with two distinct columns, we constructed the author profiles
as if each column were written by a separate author. We then analysed the top
50 terms that distinguish a particular profile from those of the other authors.
For the author Ashok Dave, the top five terms selected from "On a wednesday afternoon"
and "Encounter" are more or less the same for the KLD based method but are distinct for the
Z-Score based method.
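One way to obtain such a list of distinguishing terms for a KLD profile is to rank terms by their individual contribution to the divergence. The sketch below assumes this ranking criterion, additive smoothing with `lam > 0`, and pre-tokenised text; the function names are hypothetical and the poster does not specify how its top terms were selected.

```python
import math
from collections import Counter


def smoothed_profile(tokens, vocab, lam=0.01):
    """Additively smoothed relative frequencies of vocab terms (lam > 0)."""
    counts = Counter(tokens)
    denom = sum(counts[t] for t in vocab) + lam * len(vocab)
    return {t: (counts[t] + lam) / denom for t in vocab}


def top_distinguishing_terms(column_tokens, other_tokens, k=5, lam=0.01):
    """Rank terms by their contribution p * log(p / q) to KL(column || others)."""
    vocab = sorted(set(column_tokens) | set(other_tokens))
    p = smoothed_profile(column_tokens, vocab, lam)
    q = smoothed_profile(other_tokens, vocab, lam)
    contribution = {t: p[t] * math.log(p[t] / q[t]) for t in vocab}
    return sorted(vocab, key=lambda t: contribution[t], reverse=True)[:k]


def overlap(terms_a, terms_b):
    """Number of terms shared by two top-k lists."""
    return len(set(terms_a) & set(terms_b))
```

Running `top_distinguishing_terms` once per column and comparing the two lists with `overlap` would give a rough measure of how similar the two column profiles of one author are.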
CONCLUSION
For Gujarati newspaper articles, KLD based authorship attribution with proper
parameter selection is comparable to the current state-of-the-art Z-Score based method when
a sufficient number of articles is available as a training set, and outperforms it when the
training set is small.
ACKNOWLEDGEMENT
This research is supported in part by
the Cross Lingual Information Access project
funded by D.I.T., Government of India.
REFERENCES
[1] John Burrows. Delta: A measure of stylistic difference and a guide to likely authorship. In Literary and Linguistic Computing, 2002.
[2] Jacques Savoy. Authorship attribution based on specific vocabulary. In ACM Transactions on Information Systems, 2012.
