
MULTILINGUAL WIKIPEDIA ENRICHMENT

USING INFOBOX AND WIKIDATA ALIGNMENT

By

Ta Hoang Thang

Master of Science in Information Technology


School of Information Technology
Shinawatra University

SIU THE: SOIT 2015 - XX

MULTILINGUAL WIKIPEDIA ENRICHMENT


USING INFOBOX AND WIKIDATA ALIGNMENT

A Thesis Presented
By

Ta Hoang Thang

Master of Science in Information Technology


School of Information Technology
Shinawatra University
Academic Year 2015

Title: Multilingual Wikipedia Enrichment using Infobox and Wikidata Alignment
Author: Ta Hoang Thang
Program: Master of Science in Information Technology
Advisor: Assoc. Prof. Dr. Chutiporn Anutariya and Dr. Aekavute Sujarae
Academic Year: 2015

The Thesis is accepted by the School of Information Technology, Shinawatra
University in Partial Fulfillment of the Requirements for the Degree of Master of
Science in Information Technology
.......................................................Dean, School of Information Technology
(......................................................)
Committee:
.......................................................Advisor
(......................................................)
.......................................................Co- Advisor
(......................................................)
.......................................................Committee
(......................................................)
.......................................................Committee
(......................................................)
.......................................................External Examiner
(......................................................)

Acknowledgments
I would like to express my gratitude to the following people, who helped me
complete this study:
Associate Professor Dr. Chutiporn Anutariya and Dr. Aekavute Sujarae, for
their guidance and valuable comments. Their encouragement and insights have led
me to the completion of my study. I have gained a lot of knowledge and precious
experience from their kind support. I believe that completing this study will be
a great motivation for widening both my research ability and my professional career in
the future;
The thesis committee, for their patience and insightful comments, which
enriched and refined the focus of my study;
The librarians, for providing me with materials while I was doing my research;
All my classmates, for being there for me whenever I needed help;
All my friends, for their encouragement and moral support;
All my family members, for having confidence in me, for their
encouragement, for loving me as I am and for their support throughout my study.

Ta Hoang Thang


Abstract
Title: Multilingual Wikipedia Enrichment using Infobox and Wikidata Alignment
Author: Ta Hoang Thang
Program: Master of Science in Information Technology
Academic Year: 2015

Wikipedia offers a large knowledge base that has been contributed freely by
many international contributors in 287 languages. Every hour, Wikipedia receives
thousands of edits, which creates challenges in anti-vandalism and in keeping
content up to date between language editions. This thesis describes a new
approach to enriching Wikipedia content through a model for retrieving and analyzing
semantic relations and updating content. The model has three main steps.
First, infobox structures are aligned by matching their properties with Wikidata.
This step produces new infobox structures that remain compatible with the old ones,
because the newly added part can be saved in a hidden form in the same templates as
the original infoboxes. Next, the system detects missing interwiki links of articles,
based on the new infobox structures, by assessing the correlation of semantic relations
between infoboxes in different languages. The last step is to enrich the article
content from these semantic relations through interwiki links. Furthermore, the
semantic relations can also be contributed to Wikidata as statements. We can use
DBPedia and Wikidata as secondary sources when the system retrieves insufficient
semantic relations for certain articles. We also propose to make the taxonomy and
categories of Wikipedias more fine-grained, starting from very basic semantic rules.
The system semi-automatically detects articles and arranges them into categories
based on the English classification. Finally, we apply this model to enrich
biological articles, with some meaningful and positive results.
Keywords: Multilingual Wikis, Wikidata


Table of Contents

Acknowledgments
Abstract
Table of Contents
List of Figures
List of Tables

Chapter 1 Introduction
1.1 Background
1.2 Problem Statement
1.3 Objectives
1.4 Scope and Research Objects
1.5 Thesis Organization

Chapter 2 Literature Review
2.1 Wikipedia
2.2 Wikipedia Architecture
2.3 DBPedia
2.4 SPARQL
2.5 Interwiki Links in Wikipedia and Wikidata
2.6 Wikipedia Categories
2.7 Wikipedia Infoboxes
2.8 Multilingual Approaches
2.9 Summary

Chapter 3 Proposed Model for Multilingual Wikis
3.1 Introduction
3.2 General Model
3.3 Align Infobox Parameters with Wikidata Properties
3.4 Detect, Connect Missing Interwiki Links and Synthesize Semantic Relations
3.5 Enrich Semantic Relations for Articles and Wikidata

Chapter 4 Experiments and Obtained Results
4.1 Preparation Steps
4.2 Biological Domain
4.3 Results of Aligning Biological Species

Chapter 5 Conclusions and Recommendations
5.1 Conclusions
5.2 Recommendations and Future Works

References

Appendices
Appendix A Converter 1.1.6
Appendix B Category 1.0.8
Appendix C AutoWikiBrowser

Biography

List of Figures

Figure 2.1 The content structure of MediaWiki
Figure 2.2 The content structure of Graphium Stratiotes article
Figure 2.3 Ancient Roman scientists category at English Wikipedia
Figure 2.4 Poultry Template in Wikipedia
Figure 2.5 University Infobox about Shinawatra University
Figure 2.6 Several permissions by user groups in Wikipedia
Figure 2.7 Contribution mechanism of Wikipedia
Figure 2.8 Overview of DBpedia-Live extraction framework
Figure 2.9 The phases of Wikidata plan
Figure 2.10 Adding an interwiki link to Risk Management entity at Wikidata
Figure 2.11 The definition of property instance of (Property:P31)
Figure 2.12 Converting infobox by using bilingual parameter couples
Figure 3.1 General model for multilingual wikis
Figure 3.2 The graph of the semantic structure of Alcina article at Vietnamese Wikipedia in English
Figure 4.1 Most transcluded pages
Figure 4.2 Detect Interlinks 1.0 tool
Figure A.1 Converter 1.1.6
Figure B.1 Category 1.0.8
Figure B.2 Statistics of enriching 500 edits
Figure C.1 AutoWikiBrowser screen shot


List of Tables

Table 2.1 MediaWiki General Architecture
Table 2.2 Wikipedia Namespaces
Table 2.3 List of Language Editions of Mathematics Article
Table 2.4 The Structure of Mathematics Article (Q395)
Table 2.5 Some NLP Patterns which can Describe Category Names
Table 3.1 The Alignment between Template: Infobox School and Wikidata
Table 3.2 The Alignment between Infoboxes (Bản mẫu: Trường học in Vietnamese and Template: Infobox School in English) and Wikidata
Table 3.3 Article Titles about Barack Obama in some Languages
Table 3.4 Article Titles about Dog in Vietnamese and English
Table 3.5 The Semantic Structure of Alcina Article at Vietnamese Wikipedia in English
Table 3.6 Semantic Relations of Chó (vi) & Dog (en) Articles
Table 3.7 The Comparison Result between Dog Articles in Vietnamese and English
Table 3.8 The Alignment Table
Table 3.9 Some Data Values in English and Vietnamese
Table 3.10 The Categories of Asparagus Persicus in English and Vietnamese
Table 3.11 The Translation List and Enrichment List of Asparagus Persicus
Table 4.1 Alignment of Bản mẫu: Bảng phân loại (vi) and Template: Taxobox with Wikidata
Table 4.2 Results of Comparing Article Couples in Vietnamese and English
Table 4.3 Result of Enriching Vietnamese Articles by Categories, External Links and Bottom Templates

Chapter 1
Introduction
1.1 Background
Wikipedia is an encyclopedia that allows the public community to develop
content voluntarily in numerous languages (Anderson, 2011, p. 10-11; O'Sullivan,
2012, p. 85; Bieberstein, 2008). Its content differs between language editions
because of differences in language structure and editor contributions. Wikipedia
faces several difficulties, such as content management, anti-vandalism (Kittur, Suh,
Pendleton & Chi, 2007), verification of numeric data values and content
synchronization between its projects. Many researchers have retrieved semantic
relations from Wikipedia content in order to build larger semantic databases and to
improve Wikipedia based on the outcomes. Among the projects that extract Wikipedia
data, DBPedia is one that has been examined deeply by the research community
(Hellmann et al., 2014; Gurevych, Kim & Calzolari, 2013). DBPedia also includes the
relationships among entities (for example, articles, categories and templates), which
are linked into a large multilingual knowledge base. Other researchers concentrate
on specific languages, retrieving the common semantic relations and then adding the
missing content to each language as needed (Sorg & Cimiano, 2008).
Although many Wikipedia entities have interwiki links, many other entities, and the
relationships between them, remain unlinked to other language projects and still
need to be researched. From this perspective, research on multilingual wikis offers
enormous potential for future work.
1.2 Problem Statement
When editors create new entities in Wikipedia, mostly articles, categories and
templates, they need to arrange the content in a defined format. 1 This format
helps readers find the information they need easily, and it helps the management
staff and bots manage the content effectively. Because of the differences in
language structures and editor communities, the data of each Wikipedia language
edition gradually diverges from the others. We call this heterogeneity. In
Wikipedias with high collaborative quality, such as English Wikipedia, the
information is plentiful and the content structure is organized logically. In other
Wikipedias, especially those that lack contributors, content and structure remain
poor and limited. We cannot depend only on local editors to contribute information
to these Wikipedias. According to Wikipedia statistics, the number of editors who
make more than 5 edits, and the number who make more than 100 edits, have decreased
slightly in recent years. 2 Therefore, we need a model that can retrieve data
semantically from several high-quality Wikipedias and then contribute the needed
data to other ones semi-automatically. In this way, we can improve and enrich the
content of many languages without much human effort.

1 https://en.wikipedia.org/wiki/Help:Wiki_markup
We can use DBPedia to form a new model because it contains many semantic
datasets extracted from various Wikipedia languages. However, DBPedia depends on
entities that have interwiki links, mainly links to English entities. One drawback
is therefore that the relations of non-interlinked entities may be described
incompletely and imprecisely. Because its data is updated infrequently (Kittur, Suh,
Pendleton & Chi, 2007) and it does not accept broad contributions from the public
community, DBPedia in some cases cannot offer enough semantic relations to enrich
content effectively for all languages. Another DBPedia project, Live Extraction,
solves the problem of low update frequency, but it supports only English content
and depends primarily on update threads. On the other hand, the Wikidata project
allows editors to contribute semantic relations to its data openly (Vrandečić &
Krötzsch, 2014). We can therefore also use Wikidata to enrich Wikipedia content.
However, this project is still developing its semantic data and may not yet have
enough semantic rules to help much with content enrichment.
In this thesis, we propose a new model that addresses some of the disadvantages
of DBPedia by using infobox and Wikidata property alignment (Ta & Anutariya,
2014). This alignment produces aligned infobox structures whose content can be
updated openly on demand and reused by later researchers. Thus, the semantic
relations are updated whenever the aligned structures are updated. In addition,
these structures help speed up the unification of infobox properties in Wikidata.
We can consider this model an intermediate step that Wikidata needs for matching
and unifying the Wikipedia infoboxes of all languages. For articles that lack
interwiki links, we can connect them through semantic comparisons based on the
aligned structures. In general, we can enrich article content and synchronize the
common understanding between languages.

2 http://stats.wikimedia.org/EN/TablesWikipediaZZ.htm
1.3 Objectives
The objectives of this thesis are to create a general model for extracting
semantic relations from different entities based on datasets of multilingual
Wikipedias; to contribute the gathered results to Wikipedia language editions in a
form that is suitable for smaller-scale Wikipedias; and to enrich some basic data
types of articles for different language editions.
1.4 Scope and Research Objects
Previous researchers focused on how to extract semantic relations from
infobox properties in some languages (Nguyen, Moreira, Nguyen, Nguyen &
Freire, 2011; Tacchini, Schultz & Bizer, 2009; Rinser, Lange & Naumann, 2013;
Adar, Skinner & Weld, 2009). They defined methods to match the infobox properties
of various languages that have interwiki links. They used infobox extraction
algorithms to detect infobox structures, form ontologies (Auer & Lehmann, 2007) and
store the outcomes in external databases, such as DBPedia and YAGO. Comparing
these ontologies across languages was the key to enriching Wikipedia language
editions.
This thesis takes a different approach. Our purpose is to form a model that
can enrich all languages. We reuse a general model for enriching Wikipedia content
from our earlier research paper. This model extracts semantic relations from infobox
and Wikidata property alignment (Ta & Anutariya, 2014). Wikidata is used as a
central server to align infobox properties and translate terms between languages.
We try to match each infobox property with a Wikidata property, which includes
labels in many languages. Because we do not own the Wikipedia projects, we will
store the alignment results in Wikipedia templates only once the language
communities agree. Next, we identify the correlation of properties in different
languages and then add the missing properties to the Wikipedia language editions.
The more infobox properties of a language we can align with Wikidata, the more data
we can add to the content of that language. We also enrich other datasets such as
external links, images, geo-coordinates, categories and bottom templates.
We concentrate mainly on Wikipedias in several languages written in Latin-based
scripts, in particular Vietnamese Wikipedia and English Wikipedia. The accuracy of
article contents and the errors made by Wikipedia editors are beyond the scope of
this thesis.
1.5 Thesis Organization
Besides this chapter, this thesis includes four other chapters:
Chapter 2 Literature Review & Foundation: reviews related works that clarify
the state of current research, and lists several related technologies and
background knowledge.
Chapter 3 Proposed Model for Multilingual Wikis: describes in detail the
general model used to enrich Wikipedia content.
Chapter 4 Experiments and Obtained Results: shows how the model in Chapter 3
is implemented and executed.
Chapter 5 Conclusions and Recommendations: presents the conclusions and
future work.

Chapter 2
Literature Review
2.1 Wikipedia
Wikipedia's founders, Jimmy Wales and Larry Sanger, launched the website on
January 15, 2001 (Anderson, 2011, p. 42). As a free encyclopedia, Wikipedia allows
everyone to access and edit its article content. Wikipedia is now one of the most
popular websites and the largest reference work.
English was the only language at launch. Wikipedia then opened other
languages and gradually became a multilingual site. Currently, there are 287
language editions; all of them are built on the same technical framework but have
different content and editing practices. English Wikipedia is the biggest project,
with over 4.68 million articles and a depth (collaborative quality) of 887. 3
Wikipedia received 18 billion page views and approximately 500 million unique
visitors each month as of February 2014. As of May 2014, Wikipedia had 22 million
accounts, with over 73,000 active editors globally. Many projects extract semantic
relations from Wikipedia data, such as YAGO (Suchanek, Kasneci & Weikum, 2007),
FreeBase, DBPedia and Cycorp.
2.2 Wikipedia Architecture
2.2.1 MediaWiki general architecture. The Wikipedia architecture is based on
MediaWiki, an open source wiki engine written in PHP. MediaWiki is developed by
the Wikimedia Foundation and MediaWiki volunteers, and it is used by the
Wikimedia Foundation and other websites. The latest version of MediaWiki is
MediaWiki 1.25 alpha. 4 In general, the MediaWiki architecture contains four
layers: the User layer, the Network layer, the Logic layer and the Data layer. In
the Network layer, Squid is a high-performance proxy server that performs caching.
Table 2.1
MediaWiki General Architecture. 5

Layer            Components
User layer       web browser
Network layer    Squid; Apache webserver
Logic layer      MediaWiki's PHP scripts; PHP
Data layer       File system, MySQL Database (program and content) and Caching system

3 https://meta.wikimedia.org/wiki/List_of_Wikipedias
4 MediaWiki 1.25 alpha (September 25, 2014)


Besides the web interface, MediaWiki provides an API (Application Programming
Interface), which is the other main access point. The API allows client programs to
interact with the server to log in, retrieve data and submit edits through bot
accounts. In this way, editors can contribute to MediaWiki content more rapidly and
conveniently than by making many repetitive edits manually.
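
As an illustration of the API access point, the following minimal sketch builds the URL of a read-only query that asks for the language links of one page. It only constructs the request and does not contact any server. The endpoint (English Wikipedia) and the function name build_langlinks_url are assumptions chosen for this example; the parameters action=query, prop=langlinks, titles, lllimit and format are standard MediaWiki API parameters.

```python
from urllib.parse import urlencode

# Assumed endpoint for illustration; each wiki exposes its own api.php.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_langlinks_url(title, limit=500, endpoint=API_ENDPOINT):
    """Return the URL of a read-only API query for a page's language links."""
    params = {
        "action": "query",    # read data rather than edit
        "prop": "langlinks",  # ask for interlanguage links of the page
        "titles": title,
        "lllimit": limit,     # maximum number of language links to return
        "format": "json",
    }
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_langlinks_url("Mathematics"))
```

A bot would send this request with an HTTP client and read the JSON response; logging in and submitting edits use other API actions and are not shown here.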
2.2.2 Content organization. The basic content structure of MediaWiki is
divided into different namespaces, as shown in Figure 2.1.

[Figure 2.1 components: Namespaces; Pages/Articles; Categories; Templates; User, Groups and Access levels; Others]

Figure 2.1 The content structure of MediaWiki.


2.2.2.1 Namespaces. MediaWiki allows administrators to create different
namespaces for their wiki in order to manage content systematically. A namespace is
a prefix before an article name. For example, the link
http://localhost/mediawiki-1.22.2/index.php/User:Thang is in the User namespace and
refers to a user named Thang. Other namespaces in MediaWiki include User_Talk:,
Template_Talk:, Category:, Category_Talk:, Special: and Talk:. These namespaces can
be redirected and converted to other languages automatically. For example, when a
user browses the link http://localhost/mediawiki-1.22.2/index.php/User:Thang, it
will be redirected to http://localhost/mediawiki-1.22.2/index.php/Thành viên:Thang
in the Vietnamese version.
Interwiki link prefixes do not determine namespaces; instead, they link to
pages in other MediaWiki language projects. Some examples are en: for the English
version, fr: for the French version, th: for the Thai version and vi: for the
Vietnamese version. For instance, en:Mathematics is a link to the Mathematics
article in English Wikipedia. MediaWiki uses _ (underscore) between two words to
represent a blank space in URLs, so the namespace User_talk: is the same as
User talk:.
Table 2.2 lists all the namespaces defined in Wikipedia with their identity
numbers (IDs). Wikipedia contains two virtual namespaces: Special (-1) and
Media (-2).

5 https://www.mediawiki.org/wiki/Manual:MediaWiki_architecture

Table 2.2
Wikipedia Namespaces 7

ID     Subject namespace      ID     Talk namespace
0      (Main/Articles)        1      Talk
2      User                   3      User talk
4      Wikipedia              5      Wikipedia talk
6      File                   7      File talk
8      MediaWiki              9      MediaWiki talk
10     Template               11     Template talk
12     Help                   13     Help talk
14     Category               15     Category talk
100    Portal                 101    Portal talk
108    Book                   109    Book talk
118    Draft                  119    Draft talk
446    Education Program      447    Education Program talk
710    TimedText              711    TimedText talk
828    Module                 829    Module talk

Virtual namespaces
-1     Special
-2     Media

6 https://en.wikipedia.org/wiki/Wikipedia:Namespace
7 https://en.wikipedia.org/wiki/Template:Namespaces

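
The namespace prefixes and the underscore rule described above can be sketched in a few lines. This is a simplified illustration, not MediaWiki's own title parser: the function name split_title is hypothetical, only part of Table 2.2 is included, and localized namespace names (such as the Vietnamese ones) are not handled.

```python
# A subset of the namespace IDs from Table 2.2 (illustrative, not complete).
NAMESPACE_IDS = {
    "Talk": 1, "User": 2, "User talk": 3, "Wikipedia": 4, "Wikipedia talk": 5,
    "File": 6, "File talk": 7, "Template": 10, "Template talk": 11,
    "Help": 12, "Help talk": 13, "Category": 14, "Category talk": 15,
    "Special": -1, "Media": -2,
}


def split_title(full_title):
    """Split 'Prefix:Name' into (namespace_id, name).

    Underscores are treated as spaces, as in MediaWiki URLs. A title whose
    prefix is not a known namespace is placed in the main namespace (0).
    """
    text = full_title.replace("_", " ").strip()
    prefix, sep, rest = text.partition(":")
    if sep:
        for name, ns_id in NAMESPACE_IDS.items():
            if name.lower() == prefix.strip().lower():
                return ns_id, rest.strip()
    return 0, text


if __name__ == "__main__":
    print(split_title("User_talk:Thang"))
    print(split_title("Category:Cities in Thailand"))
    print(split_title("Mathematics"))
```

Note that an interwiki prefix such as en: is not a namespace, so this sketch leaves a title like en:Mathematics in the main namespace unchanged; resolving interwiki prefixes would need a separate lookup.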
2.2.2.2 Page/Article. Articles, or pages, are the most important content of
MediaWiki. They contain templates and are classified into categories. An article
has a history log, which shows user contributions in chronological order. It also
has a restriction level, which determines which groups can interact with the
article according to their group rights. To format articles, Wikipedia defines
Wiki markup (wikitext or wikicode), which includes syntax and keywords. Wiki markup
also supports some HTML elements. The main components of the article on Graphium
stratiotes, a butterfly species, are shown in Figure 2.2.

Figure 2.2 The content structure of Graphium Stratiotes article.


The source code of the above article is as follows.
{{Taxobox
| name = Graphium stratiotes
| status =
| regnum = [[Animal]]ia
| phylum = [[Arthropod]]a
| classis = [[Insect]]a
| ordo = [[Lepidoptera]]
| familia = [[Papilionidae]]
| genus = ''[[Graphium (butterfly)|Graphium]]''
| species = '''''G. stratiotes'''''
}}
'''''Graphium stratiotes''''' is the butterfly found in [[Borneo]] that belongs to [[Swallowtail butterfly|Swallowtail]] family.
==Subspecies==
* G. s. stratiotes
* G. s. sukirmani
==References==
*Collins, N.M., Morris, M.G., IUCN, 1985 ''Threatened Swallowtail Butterflies of the World'': the IUCN Red Data Book 1985 IUCN [http://ia600501.us.archive.org/4/items/threatenedswallo85coll/threatenedswallo85coll.pdf pdf]
{{Papilionidae-stub}}
[[Category:Graphium (butterfly)]]
[[Category:Animals described in 1887]]

In Figure 2.2, the article name is Graphium stratiotes, shown in bold and
italic text. The main content is part 4. This article uses two templates: Taxobox
({{Taxobox}}) and Papilionidae-stub ({{Papilionidae-stub}}) (2 and 3). The two
categories of this article are Graphium (butterfly) ([[Category:Graphium
(butterfly)]]) and Animals described in 1887 ([[Category:Animals described in
1887]]) (1).
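
To show how these components can be recognized in the source text, here is a minimal sketch that lists the template names and category names found in a piece of wikitext using regular expressions. It is an assumption-laden illustration: the sample text is abridged from the listing above, the function names are hypothetical, and real wikitext (nested templates, comments, nowiki sections) needs a proper parser.

```python
import re

# Abridged sample of the Graphium stratiotes source, for illustration only.
WIKITEXT = """{{Taxobox
| name = Graphium stratiotes
| genus = ''[[Graphium (butterfly)|Graphium]]''
}}
'''''Graphium stratiotes''''' is the butterfly found in [[Borneo]].
{{Papilionidae-stub}}
[[Category:Graphium (butterfly)]]
[[Category:Animals described in 1887]]
"""

# A template call starts with '{{' followed by its name.
TEMPLATE_RE = re.compile(r"\{\{\s*([^|{}\n]+)")
# A category tag is a link whose target starts with the 'Category:' prefix.
CATEGORY_RE = re.compile(r"\[\[\s*Category\s*:\s*([^\]|]+)")


def template_names(wikitext):
    """Return the names of templates called in the text, in order."""
    return [name.strip() for name in TEMPLATE_RE.findall(wikitext)]


def category_names(wikitext):
    """Return the names of categories the text is tagged with, in order."""
    return [name.strip() for name in CATEGORY_RE.findall(wikitext)]


if __name__ == "__main__":
    print(template_names(WIKITEXT))
    print(category_names(WIKITEXT))
```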

2.2.2.3 Category. Categories start with the namespace prefix Category:. For
example, [[Category:ABC]] is the Wiki markup of the category named ABC. To
categorize an article, editors add category tags to it, normally at the end.
Readers can then conveniently browse related articles or all articles in a certain
category. The category taxonomy follows hierarchical classifications, which are
organized as overlapping "trees". There is currently no strict standard for the
category taxonomy. Different Wikipedias can have different taxonomies, which are
established and developed from the editors' understanding. Wikipedia recommends
that its editors follow its guidelines in order to categorize properly. 8 A
category can have many child categories (subcategories) and belong to many parent
categories. Furthermore, a category can also include articles (pages in category)
and templates. There are two main kinds of categories:
Topic categories are named after a topic. For example, Category:Love
includes all articles relating to Love.
Set categories are named after a class (usually in the plural in English
Wikipedia). For example, Category:Cities in Thailand contains articles whose
subjects are cities in Thailand.

8 https://en.wikipedia.org/wiki/Wikipedia:Categorization#Administration_category

In Figure 2.3, the Ancient Roman scientists category includes a child
category (Ancient Roman astronomers) and an article (Lucilius Junior). It has four
parent categories: Roman science, Ancient scientists, Scientists by nationality
and Ancient Romans by occupation.
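
Because a category can have several parents as well as several children, the taxonomy is a graph rather than a single tree. The sketch below models this with the categories named in Figure 2.3; the class name CategoryGraph and its methods are hypothetical and are only meant to illustrate the structure.

```python
from collections import defaultdict


class CategoryGraph:
    """A category hierarchy in which each category may have many parents."""

    def __init__(self):
        self.parents = defaultdict(set)   # child category -> parent categories
        self.children = defaultdict(set)  # parent category -> child categories

    def add_subcategory(self, parent, child):
        self.parents[child].add(parent)
        self.children[parent].add(child)

    def ancestors(self, category):
        """Return every category reachable by following parent links."""
        seen = set()
        stack = [category]
        while stack:
            current = stack.pop()
            for parent in self.parents.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


graph = CategoryGraph()
for parent in ("Roman science", "Ancient scientists",
               "Scientists by nationality", "Ancient Romans by occupation"):
    graph.add_subcategory(parent, "Ancient Roman scientists")
graph.add_subcategory("Ancient Roman scientists", "Ancient Roman astronomers")

if __name__ == "__main__":
    print(sorted(graph.ancestors("Ancient Roman astronomers")))
```

Tracking visited categories in ancestors keeps the traversal finite even if editors accidentally create a cycle between categories.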

Figure 2.3 Ancient Roman scientists category at English Wikipedia.


2.2.2.4 Template. Templates are created to avoid repeating the same content
or code in many articles. They contain information, navigation links or alerts
about article status. Templates are embedded into articles and categories with the
syntax {{template_name}}. Pages in the template namespace begin with the prefix
Template:. In Figure 2.4, Template:Poultry shows articles related to the poultry
topic.

Figure 2.4 Poultry Template in Wikipedia


An infobox is a kind of template that contains structured information, which
is useful for analyzing semantic relations. DBPedia and other external projects
extract semantic knowledge mainly from Wikipedia infoboxes.

{{Infobox University
|name = Shinawatra University
|image_name = SIU_logo.jpg
|established = 1999
|Founder = Dr. Thaksin Shinawatra
|President = Prof. Dr. Voradej Chandarasorn
|city =[[Bangkok]]
|country = [[Thailand]]
|campus = [[Pathumthani]]
|website= http://www.siu.ac.th| Shinawatra University
}}

Figure 2.5 University Infobox about Shinawatra University.


An infobox is a fixed-format table that includes many parameters and their
corresponding data values. It presents a brief summary of some unifying aspect that
the articles share, and it sometimes improves navigation to other interrelated
articles. The Infobox University template in Figure 2.5 has parameters such as
name, image_name, founder and president. On the right of Figure 2.5, we can see the
rendered appearance of this template.
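
The parameter and value pairs of an infobox are what make it usable as structured data. The following minimal sketch turns an infobox call into a dictionary. It assumes one "|parameter = value" pair per line and no nested templates, and it lowercases parameter names as a simplification for this example (the sample mixes Founder with name); the function name parse_infobox is hypothetical.

```python
# Abridged sample based on the Infobox University example, for illustration.
INFOBOX = """{{Infobox University
|name = Shinawatra University
|established = 1999
|Founder = Dr. Thaksin Shinawatra
|city =[[Bangkok]]
|country = [[Thailand]]
}}"""


def parse_infobox(wikitext):
    """Return (template_name, {parameter: value}) for one template call."""
    body = wikitext.strip()
    if not (body.startswith("{{") and body.endswith("}}")):
        raise ValueError("expected a single {{...}} template call")
    lines = body[2:-2].split("\n")
    name = lines[0].strip()
    params = {}
    for line in lines[1:]:
        line = line.strip()
        # Skip anything that is not a '|parameter = value' line.
        if not line.startswith("|") or "=" not in line:
            continue
        key, _, value = line[1:].partition("=")
        params[key.strip().lower()] = value.strip()
    return name, params


if __name__ == "__main__":
    template, parameters = parse_infobox(INFOBOX)
    print(template)
    for key, value in parameters.items():
        print(key, "->", value)
```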
2.2.2.5 User, group and access levels. A user can belong to many groups, with
different access levels when interacting with the system. User groups include
unregistered users, new users, autoconfirmed users, administrators, bureaucrats,
stewards, bots, founders, researchers and ombudsmen. 9 The access levels of some
groups are shown in Figure 2.6. Wikipedia allows everybody to read its content
freely. The edit right is granted to all editors who are not blocked. Logging in to
the system gives users more rights when they interact with it.

9 https://en.wikipedia.org/wiki/Wikipedia:User_access_levels


Figure 2.6 Several permissions by user groups in Wikipedia.


2.2.3 Contribution mechanism. Wikipedia welcomes all editors who willingly
contribute their knowledge. Each article has its own history log, which stores
editor changes. The editor community is responsible for the accuracy, consistency
and validation of Wikipedia content. An article can be created and improved by
different users according to their interests and purposes.

Figure 2.7 Contribution mechanism of Wikipedia.


The procrastination principle is a guiding principle of Wikipedia (Baldwin,
Cave & Lodge, 2010, p. 538): it prefers to wait for problems to happen and then fix
them, instead of controlling the content strictly beforehand. Users can publish
their changes directly by pressing the Save button on an article. Therefore,
articles sometimes contain inaccuracies such as misspellings, ideological biases
and inappropriate text.
2.3 DBPedia
DBPedia is a project that extracts structured information from Wikipedia
and makes the data available to everyone, especially the research community.
To extract different types of Wikipedia content, DBPedia uses 19 extractors for
labels, abstracts, interlanguage links, images, redirects, disambiguation and so on
(Auer et al., 2007). DBPedia organizes its data into many datasets, which mainly
consist of RDF triples. Currently, the latest version of DBPedia is 3.9. 10
According to DBPedia, its English edition describes 4.0 million things. DBPedia
also has data for 119 languages, with 24.9 million things described.
The DBPedia knowledge base and its datasets can support computational
linguistics tasks (Mendes, Jakob & Bizer, 2012; Cabrio, Cojan, Gandon & Hallili,
2013). DBPedia datasets are helpful for research on the semantic relations of
Wikipedia. DBPedia is also a multilingual data source over which researchers can
develop question answering (Hahn et al., 2010). DBPedia datasets can serve as
format standards that we can refer to when developing our own datasets. They also
help us understand the relationships between data types in Wikipedia and how to
organize dataset structures.
However, DBPedia has two essential limitations: its datasets become obsolete
because they are not updated frequently (Morsey & Lehmann, 2011), and it lacks
effective support for non-English languages (de Melo & Weikum, 2010, October).
Furthermore, DBpedia Live can extract Wikipedia data in near real time, but it
supports only the English edition.

10 http://wiki.dbpedia.org/Changelog


Figure 2.8 Overview of DBpedia-Live extraction framework (from Morsey et al., 2012).
Figure 2.8 describes the extraction framework of DBPedia-Live. The extraction
manager takes its input from Wikipedia dumps and the Wikipedia OAI-PMH feed,
organized as an Article Queue and Page Collections. The system then extracts the
input data into different datasets using extractors and parsers. Next, the N-Triple
Serializer serializes the results to N-Triple dumps, while the SPARQL-Update
Destination writes the data to the Virtuoso triple store. Finally, the SPARQL
endpoint and the Linked Data interface serve the data on the web. We can access the
DBPedia database in four main ways: DBPedia apps, SPARQL clients, RDF browsers and
HTML browsers (Morsey, Lehmann, Auer, Stadler & Hellmann, 2012).
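
To make the serialization step concrete, the sketch below writes one extracted fact as a line in N-Triples syntax (subject, predicate, object, terminated by a full stop). It is a simplified illustration and not the DBPedia serializer: the function names are hypothetical, only IRI objects and plain or language-tagged literals are handled, and the example IRIs are used purely for demonstration.

```python
def escape_literal(text):
    """Escape the characters that N-Triples requires to be escaped in literals."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def to_ntriple(subject, predicate, obj, is_literal=False, lang=None):
    """Serialize one triple as an N-Triples line.

    subject and predicate are IRIs; obj is an IRI unless is_literal is True,
    in which case it is written as a quoted string with an optional language tag.
    """
    if is_literal:
        obj_term = '"' + escape_literal(obj) + '"'
        if lang:
            obj_term += "@" + lang
    else:
        obj_term = "<" + obj + ">"
    return "<{}> <{}> {} .".format(subject, predicate, obj_term)


if __name__ == "__main__":
    print(to_ntriple("http://dbpedia.org/resource/Bangkok",
                     "http://www.w3.org/2000/01/rdf-schema#label",
                     "Bangkok", is_literal=True, lang="en"))
```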
2.4 SPARQL
SPARQL is an RDF query language used for retrieving and manipulating data
in RDF databases. Currently, the latest version is SPARQL 1.1, which was released
on 21 March 2013. 11 SPARQL provides several query types: SELECT, CONSTRUCT, ASK
and DESCRIBE. SPARQL 1.1 supports result formats such as XML, JSON, CSV, TSV and
RDF (Prud'hommeaux & Seaborne, 2013).

11 http://www.w3.org/TR/rdf-sparql-query/

The following example shows how to use a SELECT query in SPARQL:
Data
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
_:a foaf:name "Thang" .
_:a foaf:email <mailto:thang@gmail.com> .
_:b foaf:name "Alex M" .
_:b foaf:email <mailto:alex@yahoo.com> .
_:c foaf:email <mailto:carol@love.net> .

SELECT query
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?email
WHERE
{ ?x foaf:name ?name . ?x foaf:email ?email }

Query Result
name        email
"Thang"     <mailto:thang@gmail.com>
"Alex M"    <mailto:alex@yahoo.com>


First, PREFIX declares a short name for a namespace. In the example above,
foaf: abbreviates http://xmlns.com/foaf/0.1/, which makes the query easier to
write. The query returns all people who have both a name and an email; _:c has no
name, so it does not appear in the result.
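
The matching behind this query can be imitated in plain Python. The sketch below is not a SPARQL engine; it is a minimal illustration, under the assumption that the data is held as a list of (subject, predicate, object) tuples, of how the two triple patterns are joined on the shared variable ?x. The function name select_name_email is hypothetical.

```python
FOAF = "http://xmlns.com/foaf/0.1/"

# The example data as (subject, predicate, object) tuples.
TRIPLES = [
    ("_:a", FOAF + "name", "Thang"),
    ("_:a", FOAF + "email", "mailto:thang@gmail.com"),
    ("_:b", FOAF + "name", "Alex M"),
    ("_:b", FOAF + "email", "mailto:alex@yahoo.com"),
    ("_:c", FOAF + "email", "mailto:carol@love.net"),
]


def select_name_email(triples):
    """Return (name, email) pairs for subjects that have both properties."""
    # Pattern 1: ?x foaf:name ?name
    names = [(s, o) for s, p, o in triples if p == FOAF + "name"]
    # Pattern 2: ?x foaf:email ?email
    emails = [(s, o) for s, p, o in triples if p == FOAF + "email"]
    # Join the two patterns on the shared variable ?x (the subject).
    return [(name, email)
            for subject_1, name in names
            for subject_2, email in emails
            if subject_1 == subject_2]


if __name__ == "__main__":
    for name, email in select_name_email(TRIPLES):
        print(name, email)
```

The subject _:c is dropped by the join because it matches only the email pattern.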
2.5 Interwiki Links at Wikipedia and Wikidata
An article can exist in many language editions. To link these editions
together, Wikipedia allowed editors to use language prefixes such as [[en: (English
Wikipedia) and [[fr: (French Wikipedia), as mentioned in Section 2.2.2.1 of this
chapter. However, this approach is obsolete because of its complexity. If an article
appears in 4 language editions, the article content in each language needs 3
language prefixes to link to the other editions, so 12 language prefixes (language
links) are needed in total. In addition, the editors of each language edition must
always keep these links correct. To simplify this task and counteract vandalism,
Wikipedia deployed the Wikidata project in 2012, which stores multilingual structured
data on its server (Erxleben, Günther, Krötzsch, Mendez & Vrandečić, 2014).
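The growth of the link count can be made concrete with a small calculation. The sketch below assumes that a central store needs exactly one link (sitelink) per language edition. The function names are ours and serve only as an illustration.

```python
def inline_language_links(editions):
    """Each of the n editions must link to the other n - 1 editions."""
    return editions * (editions - 1)


def central_sitelinks(editions):
    """Assumption: a central item stores one sitelink per edition."""
    return editions


for n in (4, 10, 100):
    print(n, inline_language_links(n), central_sitelinks(n))
```

For 4 editions the inline approach needs 12 links, as in the example above, while the number of centrally stored links grows only linearly with the number of editions.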

Figure 2.9 The phases of Wikidata plan. 12


The first phase (Phase 1), converting the interwiki links of Wikipedia articles to
Wikidata, has been completed. The second phase (Phase 2), unifying infobox
templates, is still in progress. Recently, Wikidata described Phase 3 (Lists) in its
technical proposal.13 This phase facilitates more complicated queries for supplying
common views of data and greatly scales down the maintenance tasks of Wikipedia
(Nguyen, 2013, p. 60).
When a user creates a new article (or template or category) that should have
interwiki links, the user must manually add it to the common database at Wikidata.
For example, Risk Management is an article in the English edition. When a user
translates it into Vietnamese under the name Quản lý rủi ro dự án, he/she must add
it to Wikidata by specifying the language edition and the article title in the form
shown in Figure 2.10.

12 https://www.wikidata.org/w/index.php?title=Wikidata:Introduction&oldid=42879390

13 https://meta.wikimedia.org/wiki/Wikidata/Technical_proposal

Figure 2.10 Adding an interwiki link to Risk Management entity at Wikidata.


Wikidata can be used as a data server to translate and detect terms/phrases in
different languages. For example, the term Mathematics refers to an article in
English. Through Wikidata, editors can easily find this term and its topic content
in other languages, as shown in Table 2.3.
Table 2.3
List of Language Editions of Mathematics Article.
Language | Code | Article name
Deutsch | Dewiki | Mathematik
English | Enwiki | Mathematics
Français | Frwiki | Mathématiques
ไทย | Thwiki | คณิตศาสตร์
Tiếng Việt | Viwiki | Toán học

Besides the interwiki links, each entity includes its abstract, statements,
alternative names and properties (Erxleben, Günther, Krötzsch, Mendez & Vrandečić,
2014). Table 2.4 shows these elements for the mathematics article.
Table 2.4
The Structure of Mathematics Article (Q395).
mathematics (Q395)
Abstract study of numbers, quantity, structure, relationships, etc.
Alternative names: math, maths

In other languages
Vietnamese | toán học
français | Mathématiques, science des nombres

Statements
part of | formal science
commons category | Mathematics
instance of | branch of science

Wikipedia pages linked to this item (196 entries)
Afrikaans | afwiki | Wiskunde
Alemannisch | alswiki | Mathematik

Defining the properties in different languages is very important for unifying
the infoboxes, because the Wikidata plan aims to use unified infoboxes for all
languages. Starting from English infoboxes, we can translate and compare infobox
properties (parameters) and their data values in other languages through the
Wikidata web interface and its API. In Figure 2.11, we have the property
"instance of", also known as "is a" or "is an", with a brief description and its
names in Vietnamese, French and Thai.

Figure 2.11 The definition of property instance of (Property:P31).
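As a hedged sketch of how the names of a property in several languages could be requested through the Wikidata API, the snippet below only builds the request URL for the wbgetentities action and performs no network call. The helper name labels_url and the chosen language list are ours.

```python
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"


def labels_url(entity_id, languages):
    """Build a wbgetentities request for labels, aliases and descriptions."""
    params = {
        "action": "wbgetentities",
        "ids": entity_id,
        "props": "labels|aliases|descriptions",
        "languages": "|".join(languages),
        "format": "json",
    }
    return API + "?" + urlencode(params)


url = labels_url("P31", ["en", "vi", "fr", "th"])
print(url)
```

Opening the printed URL in a browser or fetching it with any HTTP client should return a JSON document containing the labels of Property:P31 in the requested languages.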


2.6 Wikipedia Categories


The category taxonomy of English Wikipedia is considered the most fine-grained
taxonomy of all languages and is a research object of many authors. A popular way
of contributing to category taxonomies is for editors to translate category names
from English Wikipedia into the languages they know. Most of this translation is
done manually by the editors. Therefore, the category taxonomies of these languages
may be incomplete, may lack a clear naming convention for terms and labels, and may
lack interwiki links. Editors can also use the recently developed Content
Translation tool, which helps to improve multilingual contribution.14
Many studies have used WordNet to evaluate the correlation between the category
labels of English Wikipedia's category taxonomy and WordNet's words. Ponzetto and
Navigli reported positive outcomes: the approach reinforces this category system
with high correctness and can evaluate the quality of the manual classification in
English Wikipedia (Ponzetto & Navigli, 2009). However, this research did not support
non-English languages. Different languages may have different grammatical
structures, so performing the above research requires a WordNet version and some
NLP (Natural Language Processing) algorithms for each language. Obviously, the
implementation cost will be high if we apply it to all languages. Instead, as
mentioned above, editors prefer to create new category trees by comparing and
translating from languages with highly collaborative communities. Category labels
(titles) of English Wikipedia can be translated into other languages automatically
when they follow some naming conventions. 15
Many semantic relations have been found by extracting data from the English
category taxonomy through lexico-syntactic matching. The outcomes can derive isa
and notisa relations in a conceptual network (Ponzetto & Strube, 2007). The common
process is to get all the category names and remove the categories which are
related to Wikipedia management (for example, Category:Wikipedia categorization).
Next, we refine link identification, apply several methods (syntax-based,
connectivity-based, lexico-syntactic-based and inference-based methods), and then
compare the results with ResearchCyc. Similarly, according to other
14

https://www.mediawiki.org/wiki/Content_translation

15

https://en.wikipedia.org/wiki/Wikipedia:Category_names

research, some patterns can be extracted from the category taxonomy, namely
member_of, directed_by, located_in, attribute_of and R (Nastase & Strube, 2008).
The outputs were evaluated against ResearchCyc and by human judges, and they showed
that many patterns can be induced from the category taxonomy.
These patterns can be applied not only to English but also to other languages
without many changes to algorithms and methods. For example, the pattern X Y refers
to a category such as Information Technology in English, in which X = Information
and Y = Technology. In Vietnamese, the pattern becomes Y X with Y = Technology
= Công nghệ and X = Information = Thông tin, so the Vietnamese category name is
Công nghệ Thông tin. In Indonesian (Bahasa Indonesia), the pattern is also Y X, as
in Vietnamese (Teknologi informasi, X = Informasi, Y = Teknologi). The creation of
new categories in small and medium-sized language editions can rely on these
patterns to automate this task.
In Table 2.5, some basic NLP patterns point out the feasibility of translating
category names automatically from English to Vietnamese with a simple tool.
Table 2.5
Some NLP Patterns which can Describe Category Names.
English | Vietnamese
[X] in [Y] | [X] [Y]
Cities in France | Thành phố Pháp
X = Cities | X = Thành phố, Những thành phố (plural) *
Y = France | Y = Pháp
X Y | Y X
Information Technology | Công nghệ thông tin
X = Information | X = thông tin
Y = Technology | Y = Công nghệ
X by Y | X theo Y
Birds by country | Chim theo quốc gia
X = Birds | X = Chim, Những con chim (plural) *
Y = country | Y = quốc gia

* In Vietnamese Wikipedia, the editors prefer not to use plural for category names
in general [23].
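The three patterns of Table 2.5 can be sketched as a very small rule-based translator. This is a minimal illustration and not the tool itself. The dictionary TERMS is a hypothetical stand-in for term translations that would in practice come from interwiki links or Wikidata labels, and the function names are ours.

```python
import re

# Hypothetical term dictionary; real translations would be looked up elsewhere.
TERMS = {
    "Cities": "Thành phố", "France": "Pháp",
    "Information": "thông tin", "Technology": "Công nghệ",
    "Birds": "Chim", "country": "quốc gia",
}


def _lookup(*words):
    """Return the translations of all words, or None if any word is unknown."""
    if all(w in TERMS for w in words):
        return [TERMS[w] for w in words]
    return None


def translate_category(name):
    """Translate an English category name using the patterns of Table 2.5."""
    m = re.fullmatch(r"(.+) in (.+)", name)      # [X] in [Y] -> [X] [Y]
    if m:
        t = _lookup(*m.groups())
        return f"{t[0]} {t[1]}" if t else None
    m = re.fullmatch(r"(.+) by (.+)", name)      # X by Y -> X theo Y
    if m:
        t = _lookup(*m.groups())
        return f"{t[0]} theo {t[1]}" if t else None
    words = name.split()
    if len(words) == 2:                          # X Y -> Y X
        t = _lookup(*words)
        return f"{t[1]} {t[0]}" if t else None
    return None                                  # no pattern applies


for category in ("Cities in France", "Information Technology", "Birds by country"):
    print(category, "->", translate_category(category))
```

Names that match no pattern, or that contain a term missing from the dictionary, return None so that an editor can handle them manually.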

In 2006, Chernov and co-authors discussed the semantic relations between
Wikipedia categories. If a large number of pages from Category A contain links to
Category B, we can conclude that Category A has a semantic relation with Category
B. Their experiment compared the relationship between Category Country and other
categories. In 2010, an NER (Named-entity recognition) task was used to measure the
relevance score between a certain category and the Software category (Xu, Takeda,
Hamasaki & Wu, 2010). The estimation method divides this relationship into three
parts (S-Category, Parent Edge and Ancestor Path) and then uses an algorithm to
calculate the final score. The outcome can reach 80% correctness. However, the
research scope is limited to the English software categories. In short, the
correlation between two Wikipedia categories cannot be compared easily, and much
remains to be explored.
2.7 Wikipedia Infoboxes
An infobox, a type of template, contains semi-structured content which can be
retrieved as semantic relations. DBPedia Infobox Datasets include two types: the
Raw Infobox Dataset and the Mapping-based Dataset (Li & Sima, 2015). The Raw
Infobox Dataset cannot deal with different infoboxes using the same attribute
(property) name. The latter uses a new extraction method to overcome this drawback,
but it does not yet support all infoboxes and their properties. At the moment,
DBPedia still has inadequate support for non-English languages. Thus, for
non-English articles without interwiki links, DBPedia cannot produce the infobox
datasets.
Heterogeneity is what prevents the fusion of infobox datasets across Wikipedia
languages. Infobox templates regularly contain three main components: parameters
(or properties), parameter labels and their data values. When editors import
infoboxes from English Wikipedia into other languages, they can translate
everything or keep the English parameters and translate only the labels and data
values. Editors who dislike English parameters may create the parameters in their
own languages. In many cases, two or more infoboxes of a non-English Wikipedia
interlink to one English infobox. Consequently, the heterogeneity of infobox data
keeps growing. Aprosio and co-authors proposed a solution which automatically maps
Wikipedia infobox attributes to DBPedia properties in 14 different languages
(Aprosio, Giuliano & Lavelli, 2013). Many researchers stored their outputs in
external databases that others cannot easily reuse or follow up on. Generally,
infobox alignment for all languages is still a challenge.
The second phase (Phase 2) of the Wikidata plan is to collect the infobox
structure from different languages and store it at Wikidata. This phase reduces the
complexity and redundancy of infobox data across all languages. The first version of
Wikidata infoboxes was deployed, but it is still not available in practice. To
support the alignment process at this phase, as shown in Figure 2.12, we suggest
using bilingual parameter couples (property couples) in English and a non-English
language. English Wikipedia is the biggest Wikipedia, with millions of articles and
infoboxes. Because of this, the bilingual parameter couples need to have at least
one parameter in English to utilize its copious data. Then, we update the infobox
structure to use the two parameters through a semi-automated mechanism. In this
step, we also translate parameter labels into the non-English language if
applicable. We call this step the unification of the Wikipedia infobox structure.
Next, we can convert the data values of infobox parameters which have
interwiki links and, lastly, compare them and update the articles which use these
infoboxes in some languages. The update process offers two choices, and the
editors must choose one of them:

- If the data cannot be converted to the non-English language, editors still
update the data in the non-English article and leave the translation to the
community.
- Choose only the data which can be converted to the non-English language and
update the non-English articles with it.
[Figure 2.12 diagram. Parameters: A (English - en), B (Vietnamese - vi).
Parameters "has value" Data (convert data to vi using Wikidata).
Parameters "has value" Label (convert data to vi using Wikidata).]

Figure 2.12. Converting infobox by using bilingual parameter couples.


For example, Template:Infobox company in Vietnamese Wikipedia allows
editors to use whichever kind of parameter (English or Vietnamese) they prefer.

Step 1: Prepare the infoboxes with bilingual parameter couples and translate
parameter labels.
English Wikipedia
{{Infobox company
| label1 = [[Country]]
| data1 = {{{country|}}}
| label30 = [[Website]]
| data30 = {{{homepage|}}}
}}

Vietnamese Wikipedia
{{Infobox company
| label1 = [[Quốc gia]]
| data1 = {{{country|{{{quốc gia|}}}}}}
| label30 = [[Website]]
| data30 = {{{homepage|{{{trang chủ|}}}}}}
}}

We translate the [[Country]] label to the [[Quốc gia]] label using interwiki links.
The [[Website]] label in Vietnamese is the same as in English. We add a nested
default to the country and homepage parameters so that both parameters of each
bilingual couple can be used. The code fragment {{{country|{{{quốc gia|}}}}}}
means that either the parameter country or the parameter quốc gia can be used to
refer to a nation.
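The fallback behaviour of such nested parameters can be sketched in Python. This is a simplified model that we introduce for illustration only; it is not MediaWiki code. It assumes that the first parameter name supplied by the editor wins and that an empty string is returned when none is supplied.

```python
def resolve(args, names, default=""):
    """Model {{{a|{{{b|}}}}}}: return the first supplied parameter among names."""
    for name in names:
        if name in args:
            return args[name]
    return default


COUNTRY = ["country", "quốc gia"]  # the bilingual parameter couple

print(resolve({"country": "USA"}, COUNTRY))    # English parameter supplied
print(resolve({"quốc gia": "USA"}, COUNTRY))   # Vietnamese parameter supplied
print(repr(resolve({}, COUNTRY)))              # neither parameter supplied
```

Listing the English name first reflects the order of the nesting in the template: the Vietnamese parameter is consulted only when the English one is absent.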
Step 2: Using the infobox
An editor can write code such as:
{{Infobox company
| country = USA
| homepage = http://www.usa.us
}}

or
{{Infobox company
| quốc gia = USA
| trang chủ = http://www.usa.us
}}

or a mix of parameters:
{{Infobox company
| country = USA
| trang chủ = http://www.usa.us
}}

If the data values of these parameters are enclosed in [[ ]], they are internal
links which lead to articles and may have interwiki links. We can use a tool
[Appendix A] to translate these data values. For example, [[England]] in brackets
[[ ]] can be converted to Vietnamese as [[Anh]] or to Thai as
[[ประเทศอังกฤษ]].
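A minimal sketch of this conversion is given below. It is not the tool of Appendix A. It assumes that sitelinks are fetched with the wbgetentities action of the Wikidata API, and the dictionary SAMPLE is a hand-written, trimmed response in the shape that action returns, used here so that the example runs without network access. All function names are ours.

```python
import re
from urllib.parse import urlencode


def sitelink_query(title, source_site="enwiki", target_sites=("viwiki",)):
    """Build the request URL that asks Wikidata for the sitelinks of a page."""
    params = {
        "action": "wbgetentities", "sites": source_site, "titles": title,
        "props": "sitelinks", "sitefilter": "|".join(target_sites),
        "format": "json",
    }
    return "https://www.wikidata.org/w/api.php?" + urlencode(params)


def target_title(response, target_site):
    """Pick the page title of target_site from a wbgetentities response."""
    for entity in response.get("entities", {}).values():
        link = entity.get("sitelinks", {}).get(target_site)
        if link:
            return link["title"]
    return None


def translate_links(value, lookup):
    """Replace [[Title]] by [[Translated]] when lookup(title) finds a sitelink."""
    def swap(match):
        translated = lookup(match.group(1))
        return f"[[{translated}]]" if translated else match.group(0)
    return re.sub(r"\[\[([^\]|]+)\]\]", swap, value)


# Hand-written sample response (illustrative only, trimmed to what we read).
SAMPLE = {"entities": {"Q21": {"sitelinks": {"viwiki": {"site": "viwiki", "title": "Anh"}}}}}

result = translate_links(
    "[[England]]",
    lambda title: target_title(SAMPLE, "viwiki") if title == "England" else None,
)
print(result)
```

Links without a sitelink in the target language are left unchanged, which matches the first update choice listed earlier: the value is kept and its translation is left to the community.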
In some Wikipedias, community editors prefer parameters in their own language
to English parameters. We can use Wikidata properties to align with existing
parameters. For example, the homepage parameter can map to Property:P856 (official
website) on Wikidata. For the data values of these properties, we can use parsers
similar to the DBpedia parsers 16 to extract them and make comparisons. This idea
can already be found in some infoboxes of several Wikipedias, such as Russian
Wikipedia, Czech Wikipedia and Vietnamese Wikipedia. 17 18
In 2010, Kim, Weild and Choi offered a metadata synchronization platform which
converted infobox data from English to Korean. This research emphasized how to
synchronize the infoboxes between Korean Wikipedia and English Wikipedia. The
result was successful in updating Korean infoboxes. However, new translation
mechanisms have to be built when the approach is applied to other languages.
2.8 Multilingual Approaches
The technical core of DBPedia (Lehmann et al., 2012) has been used to form a new
extraction framework (Morsey, Lehmann, Auer, Stadler & Hellmann, 2012). With it, we
can inherit all the valuable mechanisms of DBPedia and add custom methods to
enhance the productivity of extracting semantic relations from Wikipedia. The
difficulty is that this requires a very deep understanding of the DBPedia
architecture as well as robust servers to run the extraction rapidly and smoothly.
We prefer to find a simpler solution for extracting semantic relations.
16 http://wiki.dbpedia.org/Internationalization/Guide#h152-4

17 Using {{#property:P18}} on Template:Thông tin khu dân cư at Vietnamese Wikipedia

18 https://www.wikidata.org/wiki/Property_talk:P856
In 2011, Nguyen and co-authors introduced WikiMatch, a new approach for
aligning infoboxes in different languages, with a case study that aligned infoboxes
in Vietnamese, Portuguese and English. WikiMatch works well under high
cross-language heterogeneity and with few data instances. A fixed-point-based
matching strategy still has to be investigated to improve its effectiveness.
Using the interwiki link system, Tyers and Pienaar developed a method for
building bilingual dictionaries, a simple and computationally inexpensive means of
retrieving word lists (Tyers & Pienaar, 2008). Future work must further improve the
precision with human evaluators. In 2010, de Melo and Weikum introduced MENTA, a
multilingual lexical knowledge base built by integrating multilingual information.
Heuristic linking functions are responsible for connecting Wikipedia articles,
categories, infoboxes, and WordNet synsets from multiple languages. This research
extracted semantic relations directly from Wikipedia in different languages. The
authors of MENTA also developed techniques to detect imprecise or wrong interlinks
(de Melo & Weikum, 2010). Adar and co-authors introduced Ziggurat for enriching
Wikipedia infoboxes by applying self-supervised learning. This automated system can
align and create Wikipedia infoboxes, fill in missing information, and detect
differences between parallel articles. Their experiments indicated that the method
is effective even in the absence of dictionaries. These studies relied on the
obsolete interwiki link mechanism, which Wikipedia has since removed and replaced
with Wikidata (Vrandečić & Krötzsch, 2014). This thesis uses Wikidata as a central
server to align infobox properties and translate terms among languages. Therefore,
our approach is different from previous ones.
2.9 Summary
This chapter pointed out the significant trends in research on enriching
Wikipedia content across languages. There are also many areas of Wikipedia whose
performance can be enhanced by applying a framework for multilingual wikis.
Besides, with its rapid development, the new Wikimedia project Wikidata opens many
opportunities for new research and applications in the future (Vrandečić, 2013).

Chapter 3
Proposed Model for Multilingual Wikis
3.1 Introduction
This chapter proposes a general model for multilingual wikis. The model may be
applied to different Wikipedia languages. We set English Wikipedia as the source
language for extracting and matching semantic relations. The semantic exploitation
focuses on data that carry the most valuable information, such as infoboxes and
(optionally) navigation templates. Other types of data can also be retrieved, such
as disambiguations, images, geographic coordinates, etc.
3.2 General Model

This framework includes 3 main processes, shown in Figure 3.1: the Alignment
process, the Comparison process and the Enrichment process.

[Figure 3.1 flowchart.
Alignment Process: Start; Align Infobox & Wikidata; decision A (no: Finish; yes:
Store aligned structure).
Comparison Process: Make comparisons & assess semantic relations; decision B (yes:
Connect missing interwiki links; no: Finish); Had interlinks; Synthesize semantic
relations.
Enrichment Process: Enrich Wikipedia & Wikidata; Finish.]

A: Can retrieve infobox properties and make alignment for infobox properties and
Wikidata properties (items)?
B: Are missing interwiki links found?

Figure 3.1 General model for multilingual wikis. Adapted from A Model for Enriching
Multilingual ... by T. H. Ta, 2014, p. 338. Copyright 2015 by Springer.

3.2.1 Alignment process. This process aligns the properties of infoboxes with
those of Wikidata. If the results are inadequate, DBPedia may be used in place of
Wikidata. However, DBPedia may produce an alignment that differs from the one based
on Wikidata. Therefore, to comply with Phase 2 of Wikidata, we should use Wikidata.
The aligned structure is then stored, hidden, inside the infoboxes so that their
usage is not affected. Editors can modify this structure as appropriate, so it is
publicly available to support later research. Moreover, we can align navigation
templates with Wikidata to extend the aligned structure. The significant advantage
of this process is that it creates aligned structures which make it easy to
retrieve semantic relations among Wikipedias with an uncomplicated mechanism and
which also support the Phase 2 development of Wikidata. The outcome directly
affects the Comparison process, which cannot operate if no aligned structures are
established.
3.2.2 Comparison process. Semantic relations are compared and their
correlation is assessed in order to detect missing interwiki links. Depending on
some assessments or matching algorithms, we can infer the interwiki links between
articles of different languages and then update the sitelinks at Wikidata. For
articles that already have interwiki links, none of the above is needed. Next, we
synthesize all semantic relations and other optional relations (categories, images,
geographic coordinates, etc.) to prepare for the next process.
3.2.3 Enrichment process. This last process enriches article content and
Wikidata statements after comparing the semantic relations gathered in the
Comparison process. We can also crosscheck data (semantic relations which come
mainly from infobox properties) among Wikipedias during enrichment so that
vandalism may be detected and prevented. We expect to enrich more data for
articles which obtain new interwiki links in the Comparison process.
3.3 Align Infobox Parameters with Wikidata Properties
A semi-automated tool will be created to support searching for and aligning the
semantic equivalences between Wikidata properties (or items, when there is no
relevant property) and infobox properties. First of all, we choose infoboxes of
non-English Wikipedias which have interwiki links with English Wikipedia. We do
this because these infoboxes tend to have more similar properties, even in
different languages. Then, we get all the properties from these infoboxes. Next, we
search for each property on Wikidata to find the corresponding property or item. If
we cannot find anything on Wikidata, we skip this property and mark it with the
specific label Unknown.
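The search step can be sketched as follows. This is a simplified outline under stated assumptions and not the tool itself. The parameter search stands for any function that returns candidate Wikidata identifiers for a property name (it could, for instance, wrap the wbsearchentities action of the Wikidata API). The stub stub_search and its small index are hypothetical, and taking the first candidate is only a placeholder for the human decision.

```python
def align_properties(infobox_properties, search):
    """Map each infobox property to a Wikidata id, or to "Unknown"."""
    alignment = {}
    for prop in infobox_properties:
        candidates = search(prop)
        # Placeholder choice: in the semi-automated tool an editor would
        # pick the best candidate instead of always taking the first one.
        alignment[prop] = candidates[0] if candidates else "Unknown"
    return alignment


# Hypothetical offline stand-in for a Wikidata search.
_INDEX = {"image": ["Property:P18"], "country": ["Property:P17"],
          "coordinates": ["Property:P625"]}


def stub_search(label):
    return _INDEX.get(label.lower(), [])


alignment = align_properties(
    ["Image", "Name", "Location", "Country", "Coordinates"], stub_search)
for prop, target in alignment.items():
    print(prop, "->", target)
```

Passing the search function as a parameter keeps the alignment logic testable offline and lets the same loop run against Wikidata or, optionally, DBPedia.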

The alignment between Template:Infobox school in English and Wikidata is shown
in Table 3.1.
Table 3.1
The Alignment between Template:Infobox School and Wikidata.
Properties of Template:Infobox school | Corresponding property at Wikidata
Image | Property:P18 - Image: a relevant illustration
Name | Unknown
Location | Unknown
Country | Property:P17 - Country: sovereign state of this item
Coordinates | Property:P625 - Coordinate location: geocoordinates
We can add more information to the alignment process, such as redirects and
related templates, which help to detect missing interwiki links more effectively,
as shown in Table 3.2.
Table 3.2
The Alignment between Infoboxes (Bản mẫu:Trường học in Vietnamese and
Template:Infobox school in English) and Wikidata.
 | Vietnamese | English | Wikidata
Template name | Trường học | Infobox school | Q5618975
Properties | Hình | Image | Property:P18
 | Tên | Name | NA
 | | Location | NA
 | Nước | Country | Property:P17
 | Coor | Coordinates | Property:P625
 | Hiệu trưởng trường | Principal | Q1056391
Redirects / Related templates | Bản mẫu:Đại học, Bản mẫu:Infobox University, Bản mẫu:Infobox university | Template:School, Template:Infobox HighSchool, Template:Infobox OtherEducation, Template:Infobox Private School |

* A tool helps to search for similar properties on Wikidata, and human decisions
determine the best Wikidata property for every infobox property.

We mainly rely on human judgment to supervise the execution and make the
final decisions for the alignment. In this thesis, we prefer to let editors
contribute freely to the aligned structure of infoboxes, in the same way that
semantic relations are developed on Wikidata. Thus, we can utilize the power of the
community to align more infoboxes than we could by ourselves. Besides, the meaning
of infobox properties is uncomplicated, so linguistic experts are not really
necessary to appraise this alignment. Problems of data correctness and data
management may arise when editors contribute content, but we leave them for future
research, which may offer better solutions for improving the accuracy of property
alignment with dictionaries, WordNet, translation, NLP algorithms and assessments
by linguistic experts.
If there is no alignment between Wikidata properties and infobox properties,
we can use DBPedia as an optional source to make the alignment. However, we do not
recommend this, because DBPedia will produce an aligned structure that does not
match the Wikidata metadata and Phase 2 of the Wikidata plan. 19
Aligned results can be stored in XML format in infoboxes between noinclude
tags, for example <noinclude>aligned results</noinclude>. This will
not affect the infoboxes which are embedded in Wikipedia articles. As mentioned
above in the Alignment process, these XML fragments can be reused in later research
and help the infobox alignment of Wikidata at Phase 2.
Here is the aligned structure of Template:Infobox school in XML format:
Template:Infobox school
...<noinclude><!--
<infobox lang="en" name="Template:Infobox school" synonyms=""
redirects="Template:School,Template:Infobox HighSchool,..."
wikidata="Q5618975" relationship="">
<properties>
<property name="image" synonyms="portrait, illustration, picture"
wikidata="Property:P18" description="a relevant illustration"
datatype="Commons media file">
</property>
...
</properties>
</infobox>
--></noinclude>
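A stored fragment of this kind can be read back with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree on a copy of the fragment in which the elided parts ("...") have simply been left out. The function name read_alignment is ours.

```python
import xml.etree.ElementTree as ET

# Copy of the aligned structure with the elided parts omitted.
ALIGNED = """
<infobox lang="en" name="Template:Infobox school" synonyms=""
         redirects="Template:School,Template:Infobox HighSchool"
         wikidata="Q5618975" relationship="">
  <properties>
    <property name="image" synonyms="portrait, illustration, picture"
              wikidata="Property:P18" description="a relevant illustration"
              datatype="Commons media file">
    </property>
  </properties>
</infobox>
"""


def read_alignment(xml_text):
    """Return the template's Wikidata item and its property alignment."""
    root = ET.fromstring(xml_text)
    mapping = {p.get("name"): p.get("wikidata") for p in root.iter("property")}
    return root.get("wikidata"), mapping


item, mapping = read_alignment(ALIGNED)
print(item, mapping)
```

In practice the fragment would first have to be cut out of the template source, between the comment markers inside the noinclude tags, before it is parsed.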

19

https://www.wikidata.org/w/index.php?title=Wikidata:Introduction&oldid=42871496

3.4 Detect, Connect Missing Interwiki Links and Synthesize Semantic Relations
We define two types of semantic relations:

- Semantic relations based on article structure: these semantic relations
are retrieved from the redirects, categories, external links, internal links,
images, videos and audio files of articles. They can be represented as RDF triples,
which are not always found in DBPedia because of its insufficient support and low
update frequency. When there are no such semantic relations on DBPedia, we create
them ourselves. The simple solution is to use a bot to get semantic relations from
article content through the APIs of Wikipedia.

- Semantic relations from infobox and navigation templates: we retrieve
semantic relations from infoboxes or any templates that have structured metadata.
RDF triples are set up from these semantic relations. Because infoboxes regularly
summarize the information of articles, these semantic relations are helpful for
detecting interwiki links.
The sample articles are classified into two groups: non-Latin-based alphabet
Wikipedias and Latin-based alphabet Wikipedias. We prefer to focus on the latter.
As stated in the introduction section, English Wikipedia has a high collaborative
quality. It may be a valuable source for identifying interwiki links with other
Wikipedias. Likewise, any Wikipedia with high collaborative quality, such as
German Wikipedia, French Wikipedia and Spanish Wikipedia, can be considered as a
source for finding interlinks. In this thesis, we compare articles of all
Wikipedias with English articles to search for interwiki links.
Suppose that we want to detect the interwiki link for an unlinked article in
Vietnamese. First, we would read the article. We must understand its content and
search for relevant articles in English using some chosen keywords. If we find a
matching article, we connect it to the Vietnamese article. This task requires an
understanding of English and Vietnamese and knowledge about the topic of the
article. However, we try to make this task simple enough for a machine by excluding
human, translation and NLP approaches to finding the similarities among articles
in various languages. Instead, we use the article name and its redirects. There is
a strong tendency to use the same or nearly the same article name across the
Latin-based alphabet Wikipedias. This may hold only for articles about cities,
people, biological species, proper nouns, acronyms, etc. Additionally, many
articles are translated from English into non-English languages, which reduces the
users' effort in building and developing articles from the beginning. Therefore, it
is easier to determine whether a certain article of a Latin-based alphabet
Wikipedia has a counterpart in English. In contrast, it is difficult to recognize
whether an article of a non-Latin-based alphabet Wikipedia has an English version,
because of the different alphabets.
3.4.1 Compared list. The most difficult task is to find which article in
language B has an interwiki link with an article A in language A. To do that, we
have to create a comparison list (candidate articles) of language B against which
article A will be compared. Suppose that language B has 4 million articles; it is
not feasible to execute a linear algorithm that matches A against all 4 million
articles of B.
Because of this difficulty, we must reduce the size of the comparison list.
Normally, when we search for an object, we use its name as the first search
criterion. In this case, the article name and its redirects can be used to define
the comparison list.
In Table 3.3, we have an article about Barack Obama, the current American
president. The name of this article in Vietnamese and French is Barack Obama, the
same as in English. However, in Thai and Chinese, the article name is quite
different: it appears in the native script and may confuse editors who cannot read
these non-Latin scripts.
Table 3.3
Article Titles about Barack Obama in some Languages.
Latin Wikipedias | French | Barack Obama
 | Vietnamese | Barack Obama
Non-Latin Wikipedias | Thai | บารัก โอบามา
 | Chinese | 贝拉克·奥巴马
Table 3.4
Article Titles about Dog in Vietnamese and English.
Vietnamese | English
Page name: Chó | Page: Dog
Redirects: Con chó, Canis lupus familiaris, Cún, Chú chó, Chó nhà, cẩu | Redirects: Canis familiaris, Dogs, Canis lupus familiaris, Canis Canis

In Table 3.4, with the article name Chó in Vietnamese, we can never find
an article with the same name in English because of the language difference.
However, if we search by redirects, we realize that the Chó article in Vietnamese
may be related to the Dog article in English, because both have the redirect
Canis lupus familiaris, which is the scientific name of the dog. Creating a
comparison list by searching for the name and its redirects works best for
Latin-based alphabet Wikipedias, which resemble each other in their use of article
names and redirects. This method typically yields one article in the comparison
list. It reduces the number of comparisons, but it may affect the outcome when no
matching results are found and the comparison list is empty. Thus, in our future
research, we will apply further methods which can detect and compare similarities
in the use of images, videos, categories, internal links and the semantic structure
of articles.
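The name-and-redirect comparison can be sketched with set intersection. The data below is a shortened version of Table 3.4, and the dictionary layout and function names are assumptions made for this illustration.

```python
def normalise(title):
    """Compare titles case-insensitively and ignore surrounding spaces."""
    return title.strip().casefold()


def shared_names(article_a, article_b):
    """Return the names (page name or redirects) that both articles share."""
    names_a = {normalise(t) for t in [article_a["name"], *article_a["redirects"]]}
    names_b = {normalise(t) for t in [article_b["name"], *article_b["redirects"]]}
    return names_a & names_b


vi_article = {"name": "Chó",
              "redirects": ["Con chó", "Canis lupus familiaris", "Chó nhà"]}
en_article = {"name": "Dog",
              "redirects": ["Canis familiaris", "Dogs",
                            "Canis lupus familiaris", "Canis Canis"]}

common = shared_names(vi_article, en_article)
print(common)
```

A non-empty intersection marks the English article as a candidate for the comparison list. To avoid scanning every article of language B, the normalised names of B could be stored once in a dictionary that maps each name to its article.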
3.4.2 Semantic relations based on article structure. Besides its name,
an article must be given a semantic structure which machines can understand when
they automatically execute matching processes. The simplest structure organizes
an article by its relationships to categories, images, terms, templates and other
elements. For example, in Vietnamese Wikipedia, the Alcina article does not have an
interwiki link. By reading its source code (wiki markup), we form its structure.
Here is the source code of the Alcina article at Vietnamese Wikipedia:
[[Tập tin:George Frideric Handel by Balthasar Denner.jpg|
250px|nhỏ|George Frideric Handel]]
'''Alcina''' là vở [[opera]] 3 màn của [[nhà soạn nhạc]]
[[người Anh gốc Đức]] [[George Frideric Handel]].
[[Người]] viết lời và kịch bản cho tác phẩm là [[Riccardo
Broschi]]. Ông dựa vào cốt truyện của bản [[anh hùng
ca]] [[Orlando Furioso]] của [[Ludovico Ariosto]]. Tác
phẩm được trình lần đầu tiên tại [[London]], [[Anh]] vào
năm [[1735]]<ref>Từ điển tác giả, tác phẩm âm nhạc phổ
thông, Vũ Tự Lân, xuất bản năm 2007</ref>.
==Chú thích==
{{tham khảo}}
[[Thể loại:Opera]]
In Table 3.5, we use the sitelinks of Wikidata to translate terms from Vietnamese
into English.
Table 3.5
The Semantic Structure of Alcina Article at Vietnamese Wikipedia in English.
Alcina (Vietnamese Wikipedia)
Term: link-to | [[opera]], [[composer]], [[George Frideric Handel]], [[human]], [[anh hùng ca]] (epic), [[Riccardo Broschi]], [[Orlando Furioso]], [[Ludovico Ariosto]], [[London]], [[England]], [[1735]]
Category: has-category | [[Category:Opera]]
Template: has-template | {{Reflist}}
Image: has-image | George Frideric Handel by Balthasar Denner.jpg
* Note: terms in bold do not have their own articles or interlinks
Then, from Table 3.5, we form the graph in Figure 3.2. All terms which could not
be translated into English are removed from this graph.

[Figure 3.2 graph. Nodes: Alcina, 1735, London, human, opera, England, composer,
George Frideric Handel, Ludovico Ariosto, Riccardo Broschi, Opera, Reflist,
Georg... .jpg. Edge types: links-to, has-category, has-image, has-template.]

Figure 3.2 The graph of the semantic structure of Alcina article at Vietnamese
Wikipedia in English.

The semantic relations in Figure 3.2 can be seen as weak relations because
Wikipedia article content depends on user contributions, so articles on the same
topic in different languages may form different semantic structures. Our first idea
was to compare these structures and conclude that interwiki links may exist among
articles. For the reason above, we cannot rely on these structures alone for
detecting interlinks. However, we can use them to support the assessment of missing
interwiki links of articles in the next section.
3.4.3 Semantic relations from infobox and navigation templates. In this
section, we primarily retrieve semantic relations from the infoboxes and navigation
templates embedded in articles. Other templates may be used if they provide
useful semantic relations. All articles are scanned in order to choose the ones
that contain infoboxes whose structure was aligned in the Alignment process. For
articles that have interwiki links to English Wikipedia, we simply retrieve the
semantic relations from the infobox properties. For the others, we detect the missing
interwiki links, connect them, and then synthesize the semantic relations. For
example, the Chó article in Vietnamese does not have an interwiki link to English
Wikipedia. After searching by its name, we find the candidate Dog article in
English (Table 3.4). Then, a bot reads the content of the two articles and collects
semantic relations from Template:Taxobox, which is aligned as described in Section 3.1.
Table 3.6
Semantic Relations of Chó (vi) & Dog (en) Articles.
Field                Vietnamese                    English
Language             vi                            en
Page_name            Chó                           Dog
Redirects            Con ch, Canis lupus           Canis familiaris, Dogs, Canis lupus
                     familiaris, Cn, Ch ch,        familiaris, Canis Canis, Domestic Dog,
                     Ch nh, Cu, Ch                 A man's best friend, Doggy, Dog
                                                   (Domestic), Dog groups, Dogs as our
                                                   pets, Dog
Name                                               Domestic dog
Type                 species                       species
Regnum               Động vật                      Animalia
Ordo                 Bộ Ăn thịt                    Carnivora
Familia              Họ Chó                        Canidae
Genus                Chi Chó                       Canis
Species              Sói xám                       Gray Wolf
Binomial
Binomial_authority
Synonyms

We can establish an assessment to compare the semantic relations of two articles.
In Table 3.6, we compare two articles, Chó in Vietnamese and Dog in English,
both of which describe a biological species. We set up our own assessment by
comparing five semantic relations: Regnum, Ordo, Familia, Genus and Species. The
detailed results are shown in Table 3.7.

Table 3.7
The Comparison Result between Dog Articles in Vietnamese and English.
PAGE      vi:Chó   en:Dog
Type:     species
RESULT    Score: 5/5   Percentage: 100%
DETAIL
Species (OK)   vi:Sói xám <> en:Gray Wolf
Genus (OK)     vi:Chi Chó <> en:Canis
Ordo (OK)      vi:Bộ Ăn thịt <> en:Carnivora
Familia (OK)   vi:Họ Chó <> en:Canidae
Regnum (OK)    vi:Động vật <> en:Animalia
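A minimal sketch of such an assessment is given below, assuming that the infobox values are already available as dictionaries. The name `compare_taxoboxes` and the `translate` callable are hypothetical: in the model the translation goes through Wikidata sitelinks, while here any function that maps a Vietnamese term to an English term (or to None) can be passed in.

```python
# Taxonomic ranks compared in the assessment of Table 3.7.
RANKS = ("regnum", "ordo", "familia", "genus", "species")


def compare_taxoboxes(vi_box, en_box, translate, ranks=RANKS):
    """Return (score, total, details) for two taxobox dictionaries.

    translate is a stand-in for the Wikidata sitelink lookup: it maps a
    Vietnamese term to its English counterpart, or returns None.
    """
    details = {}
    for rank in ranks:
        vi_value = vi_box.get(rank)
        en_value = en_box.get(rank)
        translated = translate(vi_value) if vi_value else None
        # A rank matches only when both sides exist and agree after translation.
        details[rank] = (translated is not None and en_value is not None
                         and translated.casefold() == en_value.casefold())
    score = sum(details.values())
    return score, len(ranks), details
```

With the values of Table 3.6 and a translation table covering the five Vietnamese terms, the function returns a score of 5 out of 5. The ratio score/total can then be checked against a threshold before a link is proposed.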


If all these semantic relations match, we can hypothesize that the two
articles should share an interwiki link. After that, a bot either connects them
automatically by adding sitelinks on Wikidata or places an alert template that
notifies editors and lets them make the final decision. This section does not
prescribe a fixed assessment for all articles, which are identified by their short
abstract (which may refer to a type). Different articles can have different types and
therefore need different assessments, based on the gathered semantic relations.
For example, if two articles in two languages contain Template:Infobox person, we
need an assessment with enough semantic relations to prove that the two articles
are about the same person. In some cases, two people have the same name,
birthday, nationality and sex, yet we cannot conclude that they are the same
person. Lastly, we aggregate all the semantic relations found for the next step.
3.5 Enrich Semantic Relations for Articles and Wikidata
With the semantic relations from Section 3.2, we enrich article infoboxes
and Wikidata statements by comparing the semantic relations of articles across
Wikipedias. We mainly enrich Wikipedia content from English to other languages.
Wikidata is used to translate terms between languages (Vrandečić & Krötzsch, 2014).
We can also enrich other data, such as categories, the external links section, the
gallery section, images, etc. For categories, we can create new ones from basic NLP
patterns based on existing English categories and, if needed, classify articles into
them following the English classification. Our purpose is to build a category
taxonomy for small-scale Wikipedias and to make their category classification
system more fine-grained.
3.5.1 Property enrichment. First, we compare the property data of infoboxes
between language editions (Adar, Skinner & Weld, 2009). For example, suppose we
have language A and language B. Language A contains Infobox AI and language B
contains Infobox BI, and AI and BI are interlinked. The properties of AI are AI_P1,
AI_P2, ..., AI_Pn, and the properties of BI are BI_P1, BI_P2, ..., BI_Pn. Each pair of
properties, such as AI_P1 and BI_P1, matches a Wikidata property, such as P1.
Table 3.8
The Alignment Table.
Language A      Language B      Wikidata
Infobox AI      Infobox BI      Q123
AI_P1           BI_P1           P51 (string)
AI_P2           BI_P2           NA*
AI_P3           BI_P3           P7 (time)
AI_P4           BI_P4           P34 (quantity)
* Not available
* Q123: A qualifier 20 with index number 123


In Table 3.8, we exclude the property pairs whose alignment status on
Wikidata is NA. For the other pairs, we compare their data values, which are
classified into different data types (Xu, Cheng & Qu, 2014; Erxleben, Günther,
Krötzsch, Mendez & Vrandečić, 2014). For the string or item data type, if property
AI_P1 is missing its data value, we can translate the data value of BI_P1 from
language B to language A by using the Wikidata server through our Converter tool
(Appendix A). If the translation is successful, we update AI_P1 with the new data
value. Otherwise, we drop the update. For other data types, we need to create
parsers that check the input terms and convert them correctly to the required
languages.
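The update rule for string and item properties can be sketched as below. This is an illustration under assumptions rather than the Converter tool itself: `enrich_properties`, the tuple layout of the alignment rows and the `translate` callable are hypothetical names, and `translate` stands in for the Wikidata lookup, returning None when no translation exists.

```python
def enrich_properties(target_box, source_box, alignment, translate):
    """Propose values for empty string/item properties of the target infobox.

    alignment rows are (target_prop, source_prop, wikidata_id, datatype);
    wikidata_id is None for pairs aligned as NA.
    """
    updates = {}
    for target_prop, source_prop, wikidata_id, datatype in alignment:
        if wikidata_id is None:
            continue  # NA pairs are excluded
        if datatype not in ("string", "item"):
            continue  # other data types need dedicated parsers
        if target_box.get(target_prop):
            continue  # only missing values are filled
        source_value = source_box.get(source_prop)
        if not source_value:
            continue
        translated = translate(source_value)
        if translated is not None:  # drop the update if translation fails
            updates[target_prop] = translated
    return updates
```

The function returns the proposed updates instead of writing them into the infobox, so that a bot or an editor can review them before saving.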
Table 3.9
Some Data Values in English and Vietnamese.
Vietnamese properties    English properties    Wikidata
dân số                   population            P1082 (quantity)
7.067.000                7,067,000             Population of Hanoi
quốc gia                 Country               P17 (item)
Thái Lan                 Thailand              Articles
mã số GND                GND identifier        P227 (string)
4029924-7                4029924-7             GND of Qatar
ngày sinh                date of birth         P569 (time)
29 tháng 11, 2000        29 November, 2000
29 tháng 11, 2000        November 29, 2000
29/11/2000               11/29/2000


20 https://www.wikidata.org/wiki/Wikidata:Glossary#Qualifier

Table 3.9 shows that data values can be represented differently in English and
Vietnamese. For the quantity data type, English uses a comma to separate groups of
three digits, whereas Vietnamese uses a full stop. An item data value (an entity or a
class) corresponds to an existing article in Wikipedia projects; for example, the
Thailand article in English has an interwiki link to the Thái Lan article in
Vietnamese. Data values can also take many forms, as in the date-time case. We
prioritize enriching infobox properties whose data values are strings, items and
quantities. For other data types, the enrichment depends heavily on translation or
conversion. We try to perform this step as much as possible.
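The kind of parser needed for quantity and time values can be sketched as follows. The sketch handles only the formats that appear in Table 3.9; the function names are hypothetical, and a real parser would need many more date layouts.

```python
import re

EN_MONTHS = ["January", "February", "March", "April", "May", "June", "July",
             "August", "September", "October", "November", "December"]


def quantity_en_to_vi(text):
    """Swap the English separators (',' and '.') for the Vietnamese ones."""
    return text.translate(str.maketrans(",.", ".,"))


def date_en_to_vi(text):
    """Convert the three English date layouts of Table 3.9 to Vietnamese."""
    text = text.strip()
    match = re.fullmatch(r"(\d{1,2})/(\d{1,2})/(\d{4})", text)
    if match:  # month/day/year becomes day/month/year
        month, day, year = match.groups()
        return f"{int(day)}/{int(month)}/{year}"
    match = re.fullmatch(r"(\d{1,2}) ([A-Za-z]+),? (\d{4})", text)
    if match:  # "29 November, 2000"
        day, month_name, year = match.groups()
    else:
        match = re.fullmatch(r"([A-Za-z]+) (\d{1,2}), (\d{4})", text)
        if not match:
            return None  # unknown layout: the update is dropped
        month_name, day, year = match.groups()  # "November 29, 2000"
    if month_name not in EN_MONTHS:
        return None
    return f"{int(day)} tháng {EN_MONTHS.index(month_name) + 1}, {year}"
```

Returning None for an unrecognized layout mirrors the rule used for failed translations: the value is left untouched rather than converted incorrectly.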
3.5.2 Wikidata statements enrichment. For now, we do not focus on
enriching Wikidata statements. However, we recommend performing this process
while enriching Wikipedia. For example, in Table 3.6, suppose that the
Binomial_authority property has the value Carl Linnaeus in English; we can then set
the Binomial_authority property value in Vietnamese if it does not exist. Then, if
Wikidata item Q144 lacks the statement taxon author, we can insert it with the value
Carl Linnaeus.
3.5.3 Category enrichment. Categories can be chosen as an enrichment source
for Wikipedia content. We use Wikidata to translate categories from English to other
languages. Then, we add these categories to the articles of those language editions.
The enrichment must comply with Wikipedia policies such as Categorization 21 and
Naming Conventions 22. We also created the Category tool to perform this step
(Appendix B).
For example, Asparagus persicus is a flowering plant. This species has articles
in English and Vietnamese. The English version has 9 categories and the
Vietnamese version has 2 categories. All these categories are shown in Table 3.10.
Table 3.10
The Categories of Asparagus Persicus in English and Vietnamese.
Asparagus persicus (Wikidata: Q4807699)
English category list            Vietnamese category list
Category:Asparagus               Thể loại:Chi Măng tây
Category:Flora of Turkey         Thể loại:Thực vật được mô tả năm 1875
Category:Flora of Iran
Category:Flora of Afghanistan
Category:Flora of Uzbekistan
Category:Flora of Tajikistan
Category:Flora of Kazakhstan
Category:Flora of China
Category:Flora of Russia

21 https://en.wikipedia.org/wiki/Wikipedia:Category_names
22 https://en.wikipedia.org/wiki/Wikipedia:Categorization

From Table 3.10, we translate the categories from English to Vietnamese. For
each English category (EC), we look it up on Wikidata and get the corresponding
Vietnamese category (VC). Next, if VC does not exist in the Vietnamese category list
and does not have any parent-child relationship with the categories of this list, we
add VC to the Enrichment list.

Table 3.11
The Translation List and Enrichment List of Asparagus Persicus.
Asparagus persicus (Wikidata: Q4807699)
English category list            Translation list
Category:Asparagus               Thể loại:Chi Măng tây
Category:Flora of Turkey         Thể loại:Thực vật Thổ Nhĩ Kỳ
Category:Flora of Iran           Thể loại:Thực vật Iran
Category:Flora of Afghanistan    Thể loại:Thực vật Afghanistan
Category:Flora of Uzbekistan     NA
Category:Flora of Tajikistan     NA
Category:Flora of Kazakhstan     Thể loại:Thực vật Kazakhstan
Category:Flora of China          Thể loại:Thực vật Trung Quốc
Category:Flora of Russia         Thể loại:Thực vật Nga

Vietnamese category list
Thể loại:Chi Măng tây
Thể loại:Thực vật được mô tả năm 1875

Enrichment list
Thể loại:Thực vật Thổ Nhĩ Kỳ
Thể loại:Thực vật Iran
Thể loại:Thực vật Afghanistan
Thể loại:Thực vật Kazakhstan
Thể loại:Thực vật Trung Quốc
Thể loại:Thực vật Nga

NA: Not available

In Table 3.11, we compare each item of the Translation list with the Vietnamese
category list and produce the Enrichment list. Then, we enrich the Asparagus
persicus article in Vietnamese with the categories of the Enrichment list.
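The construction of the Enrichment list can be sketched as below. The names are hypothetical and the sketch is not the Category tool of Appendix B: `translate` stands in for the Wikidata lookup of a category, and `related` stands in for the parent-child check between two categories, which by default reports no relationship.

```python
def build_enrichment_list(english_categories, vietnamese_categories,
                          translate, related=lambda a, b: False):
    """Return the translated categories that can be added to the article."""
    existing = set(vietnamese_categories)
    enrichment = []
    for english_category in english_categories:
        candidate = translate(english_category)
        if candidate is None:
            continue  # NA: no corresponding category on Wikidata
        if candidate in existing or candidate in enrichment:
            continue  # already present in the article
        if any(related(candidate, cat) for cat in vietnamese_categories):
            continue  # parent or child of an existing category
        enrichment.append(candidate)
    return enrichment
```

Applied to the lists of Table 3.11, the sketch skips Category:Asparagus (its translation already exists) and the two NA entries, and keeps the remaining six categories in their original order.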
Besides, we may create new categories from basic NLP patterns based on the
English category tree (Nastase & Strube, 2008). Next, as above, we add these new
categories to the article content. This makes the category taxonomies of
non-English language editions more fine-grained.
3.5.4 Other data enrichment. Many other kinds of data can be enriched, such as
the external links section, the gallery section, bottom templates, etc. For
Latin-based alphabet Wikipedias, we can enrich the external links section if we have
community consensus. External links offer readers more reference sources in case
they cannot find enough information in the articles. For bottom templates, we can
use Wikidata to translate template names from English to other languages and add
them to the article content. Similarly, the gallery section provides visual content,
so we can add this part to the article content.


Chapter 4
Experiments and Obtained Results
4.1 Preparation Steps
In Chapter 3, the proposed model can work with articles that already have
interwiki links. However, we prefer to use articles that lack interwiki links
because they let us demonstrate the process of detecting and connecting interwiki
links. As mentioned in Chapter 1, we focus on the alignment between Vietnamese
Wikipedia and English Wikipedia. We choose the infoboxes that are used in the
most articles. For Vietnamese Wikipedia, this list is shown in Figure 4.1.

Figure 4.1 Most transcluded pages. 23

23

https://vi.wikipedia.org/wiki/Special:MostTranscludedPages

4.2 Biological Domain
Vietnamese Wikipedia has passed 1 million articles, including many thousands of
biological articles that were mainly created by bots. These stub articles lack
interlinks because the bots generated them automatically from external databases.
Furthermore, many local editors have paid little attention to enriching these
unattractive articles. Therefore, we need a solution to this problem. One feasible
solution is to enrich these articles from other Wikipedias, for example English
Wikipedia. To do so, we first need to connect these articles to English articles. In
Figure 4.1, Taxobox (in bold) is embedded in 791888 pages in Vietnamese Wikipedia,
so applying our model to this infobox is worthwhile. We decided to choose as our
input the biological articles that contain Template:Taxobox and have no
interwiki links 24 to English Wikipedia. In Figure 4.2, our tool lets the user press the
No Interlinks button to get articles without interwiki links, then press the Detect
button to choose the articles that embed Taxobox and to search for them by article
name in English Wikipedia, producing the comparison list described in Section 3.4.1.

Figure 4.2 Detect Interlinks 1.0 tool.


The biological classification in the infoboxes of Wikipedia articles mainly
complies with the ICZN 25 and ICN 26 standards. First, we align Bản mẫu:Bảng phân loại
in Vietnamese and Template:Taxobox in English with Wikidata properties or items.
24

https://vi.wikipedia.org/w/index.php?title=c bit:Khng lin

wiki_wiki&limit=500&offset=0
25

http://iczn.org/iczn/index.jsp

26

http://www.iapt-taxon.org/nomen/main.php

Template:Taxobox is related to Template:Automatic taxobox and
Template:Speciesbox, so we also align these two. In Table 4.1, we manually perform
the alignment between the Wikipedia infoboxes and Wikidata. Because the meaning
of these infobox properties is not complex, the alignment is easy to make and
yields the desired results.
Table 4.1
Alignment of Bản Mẫu:Bảng Phân Loại (vi) and Template:Taxobox with Wikidata.
                    Vietnamese                  English                     Wikidata
Template name       Bảng phân loại              Taxobox                     Q52496
Properties          status_system               status_system               Property:P141
                    image, hình                 image                       Property:P18
                    range_map                   range_map                   Property:P181
                    binomial                    binomial                    Property:P225
                    species, loài               species                     Q7432
                    genus, chi                  genus                       Q34740
                    familia, họ                 familia                     Q35409
                    ordo, bộ                    ordo                        Q10861678
                    class, lớp                  class                       Q37517
                    regnum                      regnum                      Q36732
                    domain                      domain                      Q146481
Redirects           Bản mẫu:Phân loại khoa      Wikipedia:TX,
                    học, Bản mẫu:Taxobox        Wikipedia:TAXOBOX
Related templates   NA                          Template:Automatic taxobox,
                                                Template:Speciesbox

Here is the XML representation of Template:Taxobox after the alignment.
<infobox lang="en" name="Template:Taxobox" synonyms=""
redirects="Wikipedia:TX, Wikipedia:TAXOBOX, Template:Infobox virus,
Template:Infobox Taxobox" wikidata="Q52496" parent="" children=""
relationship="Template:Automatic taxobox, Template:Speciesbox">
<parameters>
<parameter name="status_system" alternativename="" synonyms="IUCN
conservation status" wikidata="Property:P141" description="conservation status
assigned by the International Union for Conservation of Nature"
datatype="Item"></parameter>
<parameter name="image" alternativename="" synonyms="portrait,
illustration, picture" wikidata="Property:P18" description="a relevant illustration; more
specific properties should be used when more description is required"
datatype="Commons media file"></parameter>
<parameter name="range_map" alternativename="" synonyms="range map
image" wikidata="Property:P181" description="range map of a taxon"
datatype="Commons media file"></parameter>
<parameter name="binomial" alternativename="" synonyms="taxon name"
wikidata="Property:P225" description="the scientific name of a taxon (in biology)"
datatype="String"></parameter>
<parameter name="binomial_authority" alternativename="" synonyms="taxon
author" wikidata="Property:P405" description="the author(s) that (optionally) may be
cited with the scientific name" datatype="Item"></parameter>
<parameter name="domain" alternativename="" synonyms=""
wikidata="Q146481" description="taxonomic rank" datatype="String"></parameter>
<parameter name="regnum" alternativename="" synonyms=""
wikidata="Q36732" description="taxonomic rank" datatype="String"></parameter>
<parameter name="phylum" alternativename="" synonyms=""
wikidata="Q38348" description="taxonomic rank" datatype="String"></parameter>
<parameter name="class" alternativename="" synonyms=""
wikidata="Q37517" description="taxonomic rank" datatype="String"></parameter>
<parameter name="ordo" alternativename="" synonyms=""
wikidata="Q10861678" description="taxonomic rank"
datatype="String"></parameter>
<parameter name="familia" alternativename="" synonyms="family"
wikidata="Q35409" description="taxonomic rank" datatype="String"></parameter>
<parameter name="genus" alternativename="" synonyms=""
wikidata="Q34740" description="taxonomic rank" datatype="String"></parameter>
<parameter name="species" alternativename="" synonyms=""
wikidata="Q7432" description="taxonomic rank" datatype="String"></parameter>
</parameters>
</infobox>
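An alignment file of this shape can be loaded with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree and only the element and attribute names that appear in the listing above; the function name `load_alignment` and the layout of the returned dictionary are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def load_alignment(xml_text):
    """Parse one <infobox> element into a plain dictionary."""
    root = ET.fromstring(xml_text)
    parameters = {}
    for parameter in root.iter("parameter"):
        synonyms = parameter.get("synonyms", "")
        parameters[parameter.get("name")] = {
            # Either a property ("Property:P225") or an item ("Q7432").
            "wikidata": parameter.get("wikidata"),
            "datatype": parameter.get("datatype"),
            "synonyms": [s.strip() for s in synonyms.split(",") if s.strip()],
        }
    return {
        "lang": root.get("lang"),
        "template": root.get("name"),
        "wikidata": root.get("wikidata"),
        "parameters": parameters,
    }
```

A bot could then look up `alignment["parameters"]["binomial"]["wikidata"]` to find the Wikidata property for an infobox parameter without parsing the XML again.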

As shown in Table 4.1, Template:Taxobox is related to Template:Automatic taxobox
and Template:Speciesbox, so we need to align these infoboxes too.
Template:Automatic taxobox
<infobox lang="en" name="Template:Automatic taxobox" synonyms="" redirects=""
wikidata="Q6705326" parent="" children="" relationship="Template:Taxobox">
<parameters>
<!-- Properties -->
<parameter name="status_system" alternativename="" synonyms="IUCN
conservation status" wikidata="Property:P141" description="conservation status
assigned by the International Union for Conservation of Nature"
datatype="Item"></parameter>

<parameter name="image" alternativename="" synonyms="portrait,
illustration, picture" wikidata="Property:P18" description="a relevant illustration; more
specific properties should be used when more description is required"
datatype="Commons media file"></parameter>
<parameter name="range_map" alternativename="" synonyms="range map
image" wikidata="Property:P181" description="range map of a taxon"
datatype="Commons media file"></parameter>
<parameter name="binomial" alternativename="" synonyms="taxon name"
wikidata="Property:P225" description="the scientific name of a taxon (in biology)"
datatype="String"></parameter>
<parameter name="binomial_authority" alternativename="" synonyms="taxon
author" wikidata="Property:P405" description="the author(s) that (optionally) may be
cited with the scientific name" datatype="Item"></parameter>
<parameter name="taxon" alternativename="" synonyms="latin name,
scientific name" wikidata="Property:P225" description="the scientific name of a taxon
(in biology)" datatype="String"></parameter>
<parameter name="authority" alternativename="" synonyms="taxon author"
wikidata="Property:P405" description="the author(s) that (optionally) may be cited
with the scientific name" datatype="Item"></parameter>
<!-- Q items -->
<parameter name="genus" alternativename="" synonyms=""
wikidata="Q34740" description="taxonomic rank" datatype="String"></parameter>
<parameter name="species" alternativename="" synonyms=""
wikidata="Q7432" description="taxonomic rank" datatype="String"></parameter>
</parameters>
</infobox>

Template:Speciesbox
<infobox lang="en" name="Template:Speciesbox" synonyms="" redirects=""
wikidata="Q14449650" parent="" children="" relationship="Template:Taxobox">
<parameters>
<!-- Properties -->
<parameter name="taxon" alternativename="" synonyms="latin name,
scientific name" wikidata="Property:P225" description="the scientific name of a taxon
(in biology)" datatype="String"></parameter>
<parameter name="authority" alternativename="" synonyms="taxon author"
wikidata="Property:P405" description="the author(s) that (optionally) may be cited
with the scientific name" datatype="Item"></parameter>
<!-- Q items -->
<parameter name="genus" alternativename="" synonyms=""
wikidata="Q34740" description="taxonomic rank" datatype="String"></parameter>
<parameter name="species" alternativename="" synonyms=""
wikidata="Q7432" description="taxonomic rank" datatype="String"></parameter>
</parameters>
</infobox>

4.3 Results of Aligning Biological Species
In Table 4.2, we ran the comparisons 4 times with 100 random couples,
4 times with 200 random couples and once with 1000 random couples. The share of
couples with at least 80% matching is not much different from the result of the
manual method. The matching percentage could be slightly higher because we
removed the articles related to monospecificity. We found that a large number of
couples need to be merged, probably because of mistakes by bots and editors;
merging them reduces the duplication of articles. The new interlinks found in this
case study amount to around 30%-40% of the couples, which shows that many
biology articles still lack interlinks. To connect the interwiki links of these articles,
we place a suggestion template in them and leave the judgment to the editor
community.
Table 4.2
Results of Comparing Article Couples in Vietnamese and English.
No.    No. Random   Manual        >=80%         =100%         Merge         New
       Couples      Matching      Matching      Matching      needed        interlinks
1      100          80            77            64            37            40
2      100          84            83            67            40            43
3      100          77            76            67            31            45
4      100          78            76            58            32            44
Mean                79.75%        78%           64%           35%           43%
5      200          165           163           120           98            65
6      200          164           156           118           89            67
7      200          155           149           119           81            68
8      200          160           158           130           85            73
Mean                80.5%         78.25%        60.88%        44.13%        34.13%
       1000         819 (81.9%)   788 (78.8%)   575 (57.5%)   463 (46.3%)   325 (32.5%)
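The Mean rows of Table 4.2 are consistent with dividing the summed counts of the four runs by the total number of couples compared. A small sketch of this arithmetic follows; the function name is hypothetical.

```python
def mean_percentage(counts, couples_per_run):
    """Mean share, in percent, of matching couples over several runs."""
    return 100 * sum(counts) / (couples_per_run * len(counts))


# Manual matching for the four runs with 100 couples (Table 4.2).
manual_100 = mean_percentage([80, 84, 77, 78], 100)
# Manual matching for the four runs with 200 couples (Table 4.2).
manual_200 = mean_percentage([165, 164, 155, 160], 200)
```

Here `manual_100` and `manual_200` correspond to the 79.75% and 80.5% entries of the Manual Matching column.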

However, a bot can automatically connect interwiki links for the articles with
at least 80% matching. The next step is to retrieve as many semantic relations as
possible to help enrich the article content. In this case study, the machine can
easily detect the missing interwiki links among articles because Latin-based
alphabet Wikipedias use similar infobox formats, article names and redirects.

For Table 4.3, we randomly chose 50 couples that have new interwiki links.
After connecting the interwiki links, we enriched the Vietnamese articles with
categories, external links and bottom templates. We did not enrich Taxobox
properties because most articles already have a full set of Taxobox properties: the
bots created these articles following a clear, universal format.
Table 4.3
Result of Enriching Vietnamese Articles by Categories, External Links and Bottom
Templates
No.     Bytes added    Article size         Bytes added/Article size
                       (after enriching)    (before enriching)
1       0              1421                 0.00%
2       253            2261                 12.60%
3       0              1371                 0.00%
4       0              1635                 0.00%
5       132            1656                 8.66%
6       170            1760                 10.69%
7       240            1497                 19.09%
8       140            1339                 11.68%
9       281            1010                 38.55%
10      394            1733                 29.42%
...
45      581            1755                 49.49%
46      364            1154                 46.08%
47      453            1615                 38.98%
48      489            1679                 41.09%
49      441            1626                 37.22%
50      551            1678                 48.89%
Mean    182.68         1478.43              14.10%


In some cases, we could not enrich any content because both the Vietnamese and
English articles are stubs; the mean article size is 1478.43 bytes. Thus, there is not
much data to exploit. On average, we enriched each article by +182.68 bytes, or
+14.10%. With 791888 pages containing Taxobox, we believe that this alignment can
contribute significant content to Vietnamese Wikipedia, at least in the
biological domain.
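The percentages in Table 4.3 match the ratio of the bytes added to the article size before enriching, where the size before enriching is the size after enriching minus the bytes added. This reading is inferred from the table rows, and the function name below is hypothetical.

```python
def enrichment_ratio(bytes_added, size_after):
    """Bytes added relative to the article size before enriching."""
    size_before = size_after - bytes_added
    return bytes_added / size_before


# Row 2 of Table 4.3: 253 bytes added, 2261 bytes after enriching.
row_2 = enrichment_ratio(253, 2261)
# Mean row of Table 4.3: 182.68 bytes added, 1478.43 bytes after enriching.
mean_row = enrichment_ratio(182.68, 1478.43)
```

Rounded to two decimal places as percentages, `row_2` and `mean_row` correspond to the 12.60% and 14.10% entries of the table.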


Chapter 5
Conclusions and Recommendations
5.1 Conclusions
Our proposed model is a new approach based on the property alignment
between Wikipedia infoboxes and Wikidata. It enriches the articles of all
Wikipedias, especially Latin-based alphabet Wikipedias, as well as Wikidata
statements.
The aligned structures of infoboxes are valuable sources for stakeholders, who
can retrieve semantic relations from them or reuse them in their own work openly
and independently. These structures can be updated by everyone, so the semantic
relations retrieved for the Enrichment process can be the latest ones. Therefore,
our model can mitigate the low update frequency of semantic relations, which was
a problem of DBpedia. Furthermore, these structures support the infobox
unification of Wikidata, because bot accounts can map infobox properties directly
to Wikidata without any translation tasks, semantic algorithms or human effort.
The comparison list is created by matching article titles between
languages. We can easily detect the required articles whose titles are proper
names, place names, scientific names, etc. in Latin-based alphabet languages,
thanks to their similar naming conventions. Thus, our model is most beneficial in
specific domains such as biological species (scientific names), places (cities,
towns), persons, chemical compounds, symbols, years, numbers, and asteroids.
Our model is a suitable solution for enriching articles that are stubs or
lack editors' attention in small and medium-scale Wikipedias. In this way, we can
gain as much enrichment as possible. In this thesis, we successfully enriched
the article content for biological species in some datasets. We proposed the
possibility of enriching Wikidata statements; however, we did not include this in
our implementation.
Finally, we believe that this thesis will open the way to further studies of the
correlation between Wikidata, Wikipedia and DBpedia.

5.2 Recommendations and Future Works
According to Phase 2 of the Wikidata plan, we believe that these aligned
structures may help Wikidata developers unify the infoboxes of all languages. In
this model, we can utilize the power of the community in property alignment, which
DBpedia inhibited. Nevertheless, our model is still in the development stage and
may not yet support the content Enrichment process completely.
In the Alignment process, we should use translation tools and parsers to
improve the property alignment. Furthermore, we need more algorithms to evaluate
the correlation of properties more exactly, and we should build on previous
research to widen the aligned property database. We will continue this work
in the future. Creating a comparison list by searching for the article name is still
not the best solution for detecting missing interlinks. Thus, we will also compare
semantic structures and other article data in this task. In the case study, we found
that our model works well with the biological articles of Latin-based alphabet
Wikipedias. However, applying it effectively to other domains requires considerable
effort to improve the model. That is why we will build more assessments for
different article domains in the Comparison process.
The Enrichment process depends on the gathered semantic relations. So, to
improve the content enrichment, we have to use more datasets, such as
geo-coordinates, person data, disambiguation, images, etc., to gain more enrichment
benefits. We will continue to deploy the enrichment for Wikidata statements in some
domains.


References
Adar, E., Skinner, M., & Weld, D. S. (2009). Information arbitrage across
multi-lingual Wikipedia. In Proceedings of the Second ACM International
Conference on Web Search and Data Mining (pp. 94-103). Retrieved from
http://dl.acm.org/citation.cfm?
id=1498813&dl=ACM&coll=DL&CFID=671618123&CFTOKEN=83150324
Anderson, J. J. (2011). Wikipedia: The company and its founders. (pp. 10-11, 42).
North Mankato, Minnesota: ABDO.
Aprosio, A. P., Giuliano, C., & Lavelli, A. (2013). Towards an automatic creation of
localized versions of DBpedia. The Semantic Web - ISWC 2013 (pp. 494-509).
Berlin: Springer.
Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., & Ives, Z. (2007).
Dbpedia: A nucleus for a web of open data. Berlin: Springer.
Auer, S., & Lehmann, J. (2007). What have innsbruck and leipzig in common?
extracting semantics from wiki content. The Semantic Web: Research and
Applications (pp. 503-517). Berlin: Springer.
Baldwin, R. & Cave, M. & Lodge, M. (2010). The Oxford handbook of regulation
Oxford handbooks in business and management - Oxford handbooks. Oxford
Handbooks Online. Retrieved from http://www.oxfordhandbooks.com/view/
10.1093/oxfordhb/9780199560219.001.0001/oxfordhb-9780199560219
Bieberstein, N. (2008). Executing SOA: A Practical Guide for the Service-Oriented
Architect. Upper Saddle River, N.J.: IBM Press.
Bühmann, L., & Lehmann, J. (2011). LOD2 Deliverable D3.3.1: Release of
Knowledge Base Enrichment Algorithms. Retrieved from
jens-lehmann.org/files/2011/lod2_deliverable_3.3.1.pdf
Cabrio, E., Cojan, J., Gandon, F., & Hallili, A. (2013). Querying multilingual dbpedia
with qakis. The Semantic Web: ESWC 2013 Satellite Events (pp. 194-198).
Berlin: Springer.
Chernov, S., Iofciu, T., Nejdl, W. & Zhou, X. (2006). Extracting Semantics
Relationships between Wikipedia Categories. In Proceedings of the First
Workshop on Semantic Wikis -- From Wiki To Semantics. Budva: Springer.

Dandala, B., Mihalcea, R., & Bunescu, R. (2012, June). Towards building a
multilingual semantic network: Identifying interlingual links in wikipedia. In
Proceedings of the First Joint Conference on Lexical and Computational
Semantics-Volume 1: Proceedings of the main conference and the shared task,
and Volume 2: Proceedings of the Sixth International Workshop on Semantic
Evaluation (pp. 30-37). Association for Computational Linguistics.
de Melo, G., & Weikum, G. (2010, July). Untangling the cross-lingual link structure
of Wikipedia. In Proceedings of the 48th Annual Meeting of the Association
for Computational Linguistics (pp. 844-853). Association for Computational
Linguistics.
de Melo, G., & Weikum, G. (2010, October). MENTA: inducing multilingual
taxonomies from wikipedia. In Proceedings of the 19th ACM International
Conference on Information and Knowledge Management (pp. 1099-1108).
Retrieved from http://dl.acm.org/citation.cfm?id=1871577
Erxleben, F., Günther, M., Krötzsch, M., Mendez, J., & Vrandečić, D. (2014).
Introducing Wikidata to the Linked Data Web. The Semantic Web - ISWC 2014
(pp. 50-65). Trentino: Springer International.
Gurevych, I., Kim, J., & Calzolari, N. (2013). The people's web meets NLP:
Collaboratively constructed language resources. Berlin: Springer Science &
Business Media.
Hahn, R., Bizer, C., Sahnwaldt, C., Herta, C., Robinson, S., Bürgle, M., ... & Scheel,
U. (2010). Faceted wikipedia search. In Business Information Systems (pp.
1-11). Berlin: Springer.
Hellmann, S., Bryl, V., Bühmann, L., Dojchinovski, M., Kontokostas, D., Lehmann,
J., ... & Zamazal, O. (2014). Knowledge Base Creation, Enrichment and
Repair. Linked Open Data--Creating Knowledge Out of Interlinked Data (pp.
45-69). Springer International.
Kim, E. K., Weidl, M., & Choi, K. S. (2010, April). Metadata Synchronization
between Bilingual Resources: Case Study in Wikipedia. In MSW (pp. 35-38).
Kittur, A., Suh, B., Pendleton, B. A., & Chi, E. H. (2007, April). He says, she says:
Conflict and coordination in Wikipedia. In Proceedings of the SIGCHI
Conference on Human Factors in Computing Systems (pp. 453-462).
Retrieved from http://dl.acm.org/citation.cfm?id=1240698
Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P. N., ... &
Bizer, C. (2015). DBpedia - A large-scale, multilingual knowledge base
extracted from Wikipedia. Semantic Web, 6(2), 167-195.
Li, H., & Sima, Q. (2015). Parallel mining of OWL 2 EL ontology from large linked
datasets. Knowledge-Based Systems. Retrieved from
https://dx.doi.org/10.1016/j.knosys.2015.03.023
Mendes, P. N., Jakob, M., & Bizer, C. (2012, May). DBpedia: A multilingual
cross-domain knowledge base. LREC. Istanbul: ELRA.
Morsey, M., Lehmann, J., Auer, S., Stadler, C., & Hellmann, S. (2012). Dbpedia and
the live extraction of structured data from wikipedia. Program, 46(2), 157-181.
Morsey, M., & Lehmann, J. (2011). LOD2 Deliverable 3.2.2 DBpedia-Live
Extraction. Retrieved from
http://jens-lehmann.org/files/2011/lod2_deliverable_3.2.2.pdf
Nastase, V., & Strube, M. (2008, July). Decoding wikipedia categories for
knowledge acquisition. AAAI. Chicago: AAAI Press.
Nguyen, T. H. (2013). Integrating structured data on the web. (Doctoral
dissertation). The University of Utah, Utah.
Nguyen, T., Moreira, V., Nguyen, H., Nguyen, H., & Freire, J. (2011). Multilingual
schema matching for wikipedia infoboxes. In Proceedings of the VLDB
Endowment, 5(2), 133-144.
O'Sullivan, D. (2012). Wikipedia: A new community of practice? Farnham,
England; Burlington, VT: Ashgate.
Ponzetto, S. P., & Navigli, R. (2009, July). Large-scale taxonomy mapping for
restructuring and integrating wikipedia. IJCAI, 9, 2083-2088.
Ponzetto, S. P., & Strube, M. (2007, July). Deriving a large scale taxonomy from
Wikipedia. AAAI, 7, 1440-1445. Vancouver: AAAI Press.
Prud'hommeaux, E., & Seaborne, A. (2008). SPARQL query language for RDF.
W3C. Retrieved from http://www.w3.org/TR/rdf-sparql-query/
Rinser, D., Lange, D., & Naumann, F. (2013). Cross-lingual entity matching and
infobox alignment in Wikipedia. Information Systems, 38(6), 887-907.

Sorg, P., & Cimiano, P. (2008, June). Enriching the crosslingual link structure of
wikipedia-a classification-based approach. In Proceedings of the AAAI 2008
Workshop on Wikipedia and Artificial Intelligence (pp. 49-54). Chicago:
Springer Science & Business Media.
Suchanek, F. M., Kasneci, G., & Weikum, G. (2007, May). Yago: A core of semantic
knowledge. In Proceedings of the 16th international conference on World
Wide Web (pp. 697-706). Retrieved from
http://dl.acm.org/citation.cfm?id=1242667
Syed, Z. S., & Finin, T. (2010). Approaches for Automatically Enriching Wikipedia.
AAAI Workshops. Retrieved from
https://www.aaai.org/ocs/index.php/WS/AAAIW10/paper/view/2036/2493
Ta, T. H., & Anutariya, C. (2014). A model for enriching multilingual Wikipedias
using infobox and Wikidata property alignment. In Semantic Technology (pp.
335-350). Chiang Mai: Springer International.
Tacchini, E., Schultz, A., & Bizer, C. (2009). Experiments with wikipedia
cross-language data fusion. In 5th Workshop on Scripting and Development
for the Semantic Web (SFSW2009). Tokyo: Springer.
Toma, I., Hangl, S., Caminero, F. J., & Date, C. D. (2010). Diversity-aware
extensions to collaborative systems. Retrieved from
http://render-project.eu/wp-content/uploads/2013/04/D4.1.2_2.0.pdf
Tyers, F. M., & Pienaar, J. A. (2008). Extracting bilingual word pairs from
Wikipedia. In LREC 2008, SALTMIL Workshop. Marrakech, Morocco.
Vrandečić, D. (2013). The rise of Wikidata. IEEE Intelligent Systems, 28(4), 90-95.
Vrandečić, D., & Krötzsch, M. (2014). Wikidata: A free collaborative
knowledgebase. Communications of the ACM, 57(10), 78-85.
Xu, D., Cheng, G., & Qu, Y. (2014). Preferences in Wikipedia abstracts: Empirical
findings and implications for automatic entity summarization. Information
Processing & Management, 50(2), 284-296.
Xu, L., Takeda, H., Hamasaki, M., & Wu, H. (2010). Typing software articles with
Wikipedia category structure. NII Technical Reports.


Appendix A
Converter 1.1.6
This tool translates terms in double brackets [[ ]] (internal links) from any
Wikipedia language edition to any other by following Wikidata sitelinks. With it, we
can translate the data values of infobox parameters into the language edition to
which we want to contribute the information.
To use this tool, follow these steps:
1. Choose the prefixes of the origin and destination languages: vi = Vietnamese,
   en = English, th = Thai, zh = Chinese, jp = Japanese, fr = French,
   es = Spanish, etc.
2. Paste the text to be converted and press the Convert button to get the
   result.

Figure A.1 Converter 1.1.6
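
The translation step described above can be sketched in Python as follows. This is only an illustrative sketch, not the actual source of Converter 1.1.6: the names convert_links and wikidata_sitelink are hypothetical, and the sketch assumes the public Wikidata wbgetentities API is used to look up sitelinks. Links whose target has no sitelink in the destination language are left unchanged, and a piped label (written in the source language) is dropped.

```python
import json
import re
import urllib.parse
import urllib.request

# Matches [[Title]] and [[Title|label]]; group 1 is the link target.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def wikidata_sitelink(title, src, dst):
    """Hypothetical helper: return the dst-wiki title that shares a
    Wikidata item with `title` on the src wiki, or None if there is none."""
    params = urllib.parse.urlencode({
        "action": "wbgetentities",
        "sites": f"{src}wiki",       # e.g. "viwiki"
        "titles": title,
        "props": "sitelinks",
        "format": "json",
    })
    request = urllib.request.Request(
        "https://www.wikidata.org/w/api.php?" + params,
        headers={"User-Agent": "converter-sketch/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    for entity in data.get("entities", {}).values():
        link = entity.get("sitelinks", {}).get(f"{dst}wiki")
        if link:
            return link["title"]
    return None


def convert_links(text, src, dst, lookup=wikidata_sitelink):
    """Replace every internal link in `text` with its dst-language title."""
    def replace(match):
        target = lookup(match.group(1).strip(), src, dst)
        if target is None:
            return match.group(0)    # no sitelink: keep the original link
        return f"[[{target}]]"       # the source-language label is dropped
    return LINK_RE.sub(replace, text)
```

Passing the lookup as a parameter keeps the link rewriting separate from the network call, so the rewriting can be checked offline with a small dictionary of known sitelinks.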


Appendix B
Category 1.0.8
This tool makes the category taxonomy more fine-grained by copying
classifications from the English Wikipedia. It checks all categories that have
interwiki links to the English edition and collects the English category
classifications as RDF triples. We can then use AWB (AutoWikiBrowser) to import
the triples into other language editions.

Figure B.1 Category 1.0.8
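
The collection of classifications as triples can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the tool's actual code: the function name category_triples, the input dictionaries, and the skos:broader predicate are all hypothetical choices made for illustration. A triple is emitted only when both the category and its English parent have a counterpart in the local edition.

```python
def category_triples(local_to_en, en_parents, predicate="skos:broader"):
    """Project English parent-category links onto a local edition.

    local_to_en: local category title -> English title (interwiki links)
    en_parents:  English category title -> list of English parent titles
    predicate:   assumed predicate name for the "has parent category" relation
    """
    # Invert the interwiki mapping to translate English parents back.
    en_to_local = {en: local for local, en in local_to_en.items()}
    triples = []
    for local, en in sorted(local_to_en.items()):
        for parent_en in en_parents.get(en, []):
            parent_local = en_to_local.get(parent_en)
            if parent_local is not None:   # skip parents with no local page
                triples.append((local, predicate, parent_local))
    return triples


# Usage with made-up example data:
interwiki = {
    "Thể loại:Hồ Việt Nam": "Category:Lakes of Vietnam",
    "Thể loại:Hồ theo quốc gia": "Category:Lakes by country",
}
parents = {
    "Category:Lakes of Vietnam": [
        "Category:Lakes by country",
        "Category:Landforms of Vietnam",
    ],
}
triples = category_triples(interwiki, parents)
```

The resulting list of (subject, predicate, object) tuples could then be serialized and handed to an editor such as AWB for import.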

We used this tool to contribute thousands of edits to the Vietnamese Wikipedia. We
then charted a random sample of 500 of these edits. On average, the sampled edits
added 39.53 bytes per article across 400 articles and 100 categories.

Figure B.2 Statistics of enriching 500 edits.


Appendix C
AutoWikiBrowser
AutoWikiBrowser (AWB) is a semi-automated MediaWiki editor that runs on the
Windows operating system. AWB makes editing tasks faster and more convenient.

Figure C.1 AutoWikiBrowser screenshot


Biography
Name:                   Ta Hoang Thang
Date of Birth:          02 November, 1985
Place of Birth:         Da Lat City, Lam Dong Province, Vietnam
Institutions Attended:
    2003-2008           Bachelor of Information Technology
                        Dalat University
                        Lam Dong, Vietnam
    2012-2014           Master of Information Technology
                        Shinawatra University
                        Bangkok, Thailand
Home Address:           43 Vo Truong Toan Street
                        Da Lat city, Lam Dong province, Vietnam
Email:                  tahoangthang@gmail.com
                        thangth@dlu.edu.vn