Beruflich Dokumente
Kultur Dokumente
Table 1: Data set characteristics containing total unigrams (TU) and common unigrams (CU) for Euclidean (Euc), cosine
similarity (CS), Manhattan (Man), hybrid TF-IDF (HTI) and phrase reinforcement (PR).
plays a very important role. All tweets about an important and Phrase Reinforcement approaches. Various distance
moment tend to cluster immediately after that moment’s metrics used for clustering are (1) Euclidean, (2) Cosine
occurrence. Thus in a lexical-temporal feature space, the Similarity, and (3) Manhattan. After clustering tweets, the
temporal axis time serves to separate tweets describing sim- 2 tweets closest to each cluster centroids are generated at
ilar moments that are, however, separated by time. Our the output. Tab. 4 presents the summaries generated us-
analysis shows that tweets describing an important moment ing different clustering techniques for IPL matches. Re-
in an event contain similar fields, e.g. for a cricket match, sults show that using Manhattan distance tends to favor the
name of the player, wicket and the type of play, four, six etc. short tweets as outputs. Cosine similarity and Euclidean
Simply put, among the tweets that contribute to a peak in distance consistently identify medium length tweets at their
traffic volume above the chosen threshold of µ + σ, we seek outputs. Hybrid TF-IDF and phrase reinforcement produce
to identify a tweet close to the cluster centroid in lexical the longest length tweets at their outputs, because they as-
feature space. signed more weight to tweet length than its contents. Only
marked moments have been identified. In Tab. 4, moments
3.5 Feature Extraction of IPL match at time instants 15 : 41, 15 : 42 and 17 : 46
We perform clustering using tweet timestamps metadata have not been identified by any clustering using any dis-
and the words tweets are composed of. Tab. 1 tabulates, tance metric. We have not included tables similar to Tab. 4
among other statistics, the total number of unique words for for the UCL and EURO matches for brevity reasons. Simi-
IPL, UCL and EURO matches. Thousands of unique words larly, moments of UCL match occurring at 18 : 45, 21 : 00,
decrease the efficiency of clustering but act as a source of 21 : 27, 21 : 28 and 21 : 29 were missed. It is pertinent
noise for detection of critical moments. So the most im- to mention that not all critical moments in close succession
portant words have been filtered based upon the words fre- were detected because local maxima requires three values to
quency. Fig. 2 is a semi-log plot of the histogram of unique identify the peak value. For the EURO data set, the yellow
words after cleaning the UCL match data. A threshold of card moment goes undetected by all clustering schemes. The
2.18 on a log10 scale has been selected to determine the lower number of tweets for yellow card moments and other unde-
bound of the local maxima. tected single instant moments were very low, so moments
were missed by all clustering schemes.
185 2.26717173 2.17609126
match
3
184 2.26481782 Words
2.17609126 threshold
Performance of different clustering schemes for IPL, UCL
chanc 176 2.24551267 2.17609126 and EURO matches is shown in Tab. 3. Results show that
log (Word Frequency)
2.5
172 2.23552845 2.17609126 maximum performance for IPL match is obtained for Eu-
robben 2 166 2.22010809 2.17609126
neuer 164 2.21484385 2.17609126 clidean and Cosine Similarity which identify 20 of 24 mo-
1.5 160 2.20411998 2.17609126 ments. For UCL, all clustering techniques identified 10 out
leagu 156 2.1931246 2.17609126
1 155 2.1903317 2.17609126
of 18 moments. For EURO, 7 out of 8 moments were iden-
147 2.16731733 2.17609126 tified for all clustering techniques. Comparison shows that
0.5 143 2.15533604 2.17609126 clustering using Euclidean distance identified the most mo-
muller 141 2.14921911 2.17609126
chelsea
0
141 2.14921911 2.17609126 ments compared to other metrics.
1
89
177
265
353
441
529
617
705
793
881
969
1057
1145
1233
1321
1409
1497
1585
1673
1761
1849
1937
2025
2113
2201
2289
2377
Table 2: Important moments identified by various methods Table 3: Important moments identified by IPL, UCL and
using mean (µ), standard deviation (σ) and median (M). EURO.
Performance Metrics: Recall, precision and F-measure
are the key measures used for the evaluation of LTC. The
following sections describe the results of LTC using ROUGE
relying on reference summaries of the matches.
4.1 Recall
In the context of information retrieval, recall is the frac-
tion of correctly classified instances to the total number of
identified instances. Tab. 5 shows the recall of key mo-
ments of IPL match. Results show that clustering using Eu-
clidean distance has highest recall value for moments of all
Time Important Mo- E C.S. M. H. P.R. categories. Hybrid TF-IDF and phrase reinforcement have
ments
lower recall statistics at the moments capturing “Fours” and
14:32 Match started X X X
14:59 Lee to M Hussey, X X X
“Sixes”. Clustering using cosine similarity gave lower re-
SIX. call statistics than Euclidian but higher recall than all other
14:59 Lee to M Vijay, X X clustering techniques.
FOUR
15:25 Bhatia to Vijay, X X X X Key Euclid. Cos. Manhttn. Hyb. Phrase
OUT, slower one and Mo- Sim. TF- Reinf.
what a catch ment IDF
15:33 Kallis to Raina, SIX X X X Match 1 1 1 1 1
start
15:41 Pathan to Raina, SIX Innings 1 1 1 1 1
15:42 Pathan to Raina, SIX End
15:43 1. Pathan to Raina, X X X X X Four 0.67 0.67 0.67 0.33 0.33
FOUR 2. MEK Six 0.86 0.71 0.57 0.29 0.43
hussey 50 Out 1 1 0.75 1 1
15:54 1. Narine to Raina, X X X X X Match 1 1 1 1 1
SIX, 2. Fifty Raina end
15:58 Kallis to Hussey, X X X X X
OUT Table 5: Recall of key moments of IPL match using various
16:06 Narine to Raina, SIX X clustering techniques.
16:11 Shakib Al Hasan to X X X X X
Raina, OUT
16:12 End of first inning X X X X X Tab-6 shows the recall of key moments of UCL and EURO
16:30 Hilfenhaus to Gamb- X X X X X matches identified by different clustering techniques. Re-
hir, OUT sults show that almost the same recall statistics are obtained
16:51 Ashwin to Bisla, SIX X X X for all clustering techniques. “Game Start” moment has a re-
17:16 Jakati to Bisla, call of 0.5 because all methods have successfully identified
FOUR (Twice) the start of EURO match but the start of UCL match could
17:21 Ashwin to Bisla, SIX, X
not be identified. “Yellow card” moment has a recall of 0.33
17:36 Morkel to Bisla, OUT X X X X
17:45 Bravo to Shukla, X X X X X because only a small number of tweets are available for this
OUT moment. No red card moment occurred in either UCL or
17:49 Bravo to Kallis, SIX X X X X EURO.
17:51 Ashwin to Pathan, X X X X X
OUT Key Euclid. Cos. Manhttn. Hyb. Phrase
18:02 Hilfenhaus to Kallis, X X X X X Mo- Sim. TF- Reinf.
ment IDF
OUT
Goal 0.78 0.78 0.78 0.78 0.78
18:06 9 required on 6 Balls X X X X Red 7 7 7 7 7
18:15 Bravo to Tiwary, X X X X X Card
FOUR, KKR won Yellow 0.33 0.33 0.33 0.33 0.33
the IPL Card
Total Detected Moments 20 19 14 14 15 Penalty 1 1 1 1 1
Game 0.5 0.5 0.5 0.5 0.5
Start
Table 4: Important moments of IPL match identified by Half 1 1 1 1 1
different methods. Time
Disallwd. 1 1 1 1 1
Goal
Match 1 1 1 1 1
End
0.15
Recall
0.4
0.1 0.3
0.2
0.05
0.1
0
0
Euclidean Cosine Similarity Manhattan Hybrid TF-IDF Phrase
Euclidean Cosine Similarity Manhattan Hybrid TF-IDF Phrase
Reinforcement
Reinforcement
Recall
Figure 4: Recall of UCL match using ROUGE-1, ROUGE-2, Figure 6: Precision of IPL match using ROUGE-1, ROUGE-
Phrase Reinforcement
ROUGE-SU and LTC. 2, ROUGE-SU and LTC. 0.22642
0.03966
0.05469
0.204492641
0.4
0.7
0.76087 0.81405 0.69744 0.69681
0.3 0.16667 0.175 0.14691 0.13904
0.6
Precision
Euclidean
0.5 Cosine Similarity Manhattan Hybrid TF-IDF Phrase Reinforcement 0.25
F-measure
Precision
Figure 8: Precision of EURO match using ROUGE-1, Figure 10: F-measure of UCL match using ROUGE-1,
ROUGE-2, ROUGE-SU and LTC. ROUGE-2, ROUGE-SU and LTC.
F-measure
from ROUGE-1, ROUGE-2, ROUGE-SU and manual evalu- 0.4
ation (labeled just ‘LTC’) of LTC for IPL, UCL and EURO
0.3
matches, respectively. Results showed that automatic post
summarization techniques gave poor performance relative to 0.2
LTC. An investigation of various matches showed that Eu- 0.1
clidean gave better performance for IPL with an F-measure
0
of 44%, followed by cosine similarity with F-measure of 40%. Euclidean Cosine Similarity Manhattan Hybrid TF-IDF Phrase
For UCL, Euclidean and cosine similarity gave an F-measure Reinforcement
of 34% and 29%, respectively. For EURO, Euclidean gave
best performance with an F-measure of 62% followed by Figure 11: F-measure of EURO match using ROUGE-1,
hybrid TF-IDF with an F-measure of 58%. Performance ROUGE-2, ROUGE-SU and LTC.
comparison of various distance metrics for clustering over
different matches showed that Euclidean outperformed all
other clustering schemes. Phrase Reinforcement
0.34008 various matches can identify these anomalies. Observation
0.45
ROUGE-1
0.03265
ROUGE-2
of total unigrams and common unigrams for various matches
ROUGE-SU 0.4 ROUGE-SU LTC 0.11392 have been found helpful in the investigation of such varying
0.425087108 performance gains. Total unigrams represent the number of
0.35
0.3
words (after data cleaning) in all tweets of a given match
in the collected data set. Common unigrams represent the
F-measure
0.25
tweet words of the collected data set matching the tweets
0.2 of ground truth data set of important moments. For analy-
0.15 sis, unigram precision P and recall R are given by P = w/t
0.1 and R = w/r, respectively. Here, w represents common un-
0.05
igrams, t represents the total unigrams and r represents the
number of words in the reference summary.
0
Euclidean Cosine Similarity Manhattan Hybrid TF-IDF Phrase
Reinforcement 5.1 IPL match
Figure 9: F-measure of IPL match using ROUGE-1, Fig. 12 presents the total and common unigram statistics
ROUGE-2, ROUGE-SU and LTC. for the IPL match along with the precision and recall for
different distance metrics. Results show that Euclidean has
406 common words out of 547 total words which yields a
high recall. Cosine similarity has the highest precision be-
cause the ratio of total unigrams to common unigrams is
5. DISCUSSION highest (0.79). Recall of cosine similarity is lower than that
In this section, we perform content analysis to identify of Euclidean because cosine similarity has only 349 common
ambiguities in performance gains of different clustering tech- unigrams. Similarly, Manhattan has a precision comparable
niques across different matches. Our results for IPL match to other schemes but it has the lowest recall of 0.19 be-
from Sec. 4 suggest that cosine similarity has the highest cause Manhattan produces the shortest summary. Phrase
precision with minimum recall and F-measure. For UCL reinforcement and hybrid TF-IDF produce summaries of
and EURO matches, Manhattan has highest precision with roughly similar length, but the number of common unigrams
lowest recall and F-measure. Individual content analysis of are greater for phrase reinforcement than hybrid TF-IDF.
Hence, phrase reinforcement has lower recall and precision small ratio between total unigrams and number of common
than hybrid TF-IDF. unigrams, as 60 out of 106 words are in common. Cosine
similarity has a lower precision than Euclidean and Man-
0.9
Precision Recall TU CU
600 hattan but has better precision than hybrid TF-IDF and
0.8 600
500 phrase reinforcement.
Precision and Recall Stats
0.7
500
Unigram Stats
0.6 400
180
0.5 0.8 Precision Recall TU CU
400 300 160
Unigram Stats
120
0.2 0.5
100 100
0.1 200 0.4 80
0 0 0.3 60
Euclidean Cosine Similarity Manhattan Hybrid TF-IDF
100 Phrase
0.2 40
Reinforcement
0.1 20
0.8 200
Approach Precision Recall F-measure
Unigram Stats