Bachelor Thesis
(i) the thesis comprises only my original work toward the Bachelor's degree,
(ii) due acknowledgement has been made in the text to all other material used.
Ahmed Soliman
XX July, 20XX
Acknowledgments
Text
Abstract
Contents
Acknowledgments V
1 Introduction 1
1.1 Section Name . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Another Section . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 Background 3
3 Location Detection Approaches 5
4 Conclusion 7
5 Future Work 9
Appendix 10
A Lists 11
List of Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
References 13
Chapter 1
Introduction
Chapter 2
Background
Chapter 3
Location Detection Approaches
In this chapter we introduce and describe several approaches for location detection on social media.
First, we tokenize all tweets in our training dataset and filter them: we remove URLs, mentions, and hashtags, and then discard every word identified as a stop word. The stop words are given by the list provided by the NLTK stopwords corpus. Once the stop words are removed, we perform lemmatization, in which the inflected forms of a word are reduced to a common base form, using Stanford CoreNLP.

Once the tokens have been extracted, we apply a simple heuristic algorithm called CALGARI [1]. The algorithm is based on the intuition that a model will perform better if it is trained on terms that are more likely to be used by users from a particular region than by users from the general population. The algorithm therefore assigns each term a score that indicates how strongly the term is associated with a particular location in our dataset. We explain how this score is calculated below.
Let $s(T)$ be a function that takes a term $T$ and calculates its score, $F(T)$ be the frequency of the term $T$ in our dataset, $f(T, c)$ be a function that counts how many times the term $T$ is used with class $c$, $N$ be the total number of different terms in our dataset, and $C$ be the set of classes (locations) in our dataset. For each term we need to evaluate this equation:

$$s(T) = \frac{\max_{c \in C} P(T \mid c)}{P(T)}$$

The term $P(T) = \frac{F(T)}{N}$, so we only need to know how to evaluate the numerator:

$$P(T \mid c_i) = \frac{f(T, c_i)}{\sum_j f(t_j, c_i)}$$
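As an illustration, the score computation above can be sketched in Python. This is a minimal sketch, assuming the per-class counts $f(T, c)$ have already been collected into a dictionary; it follows the chapter's definition of $N$ as the number of different terms, which rescales all scores by the same constant and therefore does not affect the ranking.

```python
from collections import Counter

def calgari_scores(term_class_counts):
    """Compute a CALGARI-style score for every term.

    term_class_counts maps (term, cls) -> f(T, c), the number of times
    the term occurs with that class (location).
    """
    class_totals = Counter()  # sum_j f(t_j, c) for each class c
    term_totals = Counter()   # F(T), the total frequency of each term
    for (term, cls), n in term_class_counts.items():
        class_totals[cls] += n
        term_totals[term] += n
    # N as defined in the text: the number of different terms in the dataset.
    n_terms = len(term_totals)
    scores = {}
    for term in term_totals:
        p_t = term_totals[term] / n_terms        # P(T) = F(T) / N
        p_t_given_c = max(                       # max over c of P(T | c)
            term_class_counts.get((term, cls), 0) / class_totals[cls]
            for cls in class_totals
        )
        scores[term] = p_t_given_c / p_t         # s(T)
    return scores
```

For example, `sorted(scores, key=scores.get, reverse=True)` then orders the terms from most to least region-specific.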
Now, after calculating a score for each term, the algorithm sorts the terms in non-increasing order of score and chooses the best 10,000 terms as features for our model.
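For concreteness, the preprocessing pipeline described at the start of this chapter can be sketched as follows. This is a minimal, dependency-free sketch: it substitutes a small inline stop-word list for the full NLTK stopwords corpus and omits the Stanford CoreNLP lemmatization step.

```python
import re

# Small inline stop-word list; the pipeline in the text uses the
# full NLTK stopwords corpus instead.
STOP_WORDS = {"the", "a", "an", "in", "on", "at", "is", "are", "to", "and", "of"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")

def preprocess(tweet):
    """Tokenize a tweet, dropping URLs, mentions, hashtags and stop words.

    Lemmatization (Stanford CoreNLP in the text) is omitted to keep the
    sketch self-contained.
    """
    text = URL_RE.sub(" ", tweet)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]
```

The resulting token lists, tagged with their users' locations, provide the counts $f(T, c)$ that the CALGARI score is computed from.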
Chapter 4
Conclusion
Chapter 5
Future Work
Text
Appendix
Appendix A
Lists
List of Figures
Bibliography
[1] W.G. Campbell. Form and style in thesis writing. Houghton Mifflin, 1954.
[2] S. Wenkang. An analysis of the current state of English majors' BA thesis writing [J]. Foreign Language World, 3, 2004.