
Media Engineering and Technology Faculty

German University in Cairo

Location Detection Over Social Media

Bachelor Thesis

Author: Ahmed Soliman


Supervisors: Sarah Elkasrawy

Submission Date: XX July, 20XX



This is to certify that:

(i) the thesis comprises only my original work toward the Bachelor Degree

(ii) due acknowledgement has been made in the text to all other material used

Ahmed Soliman
XX July, 20XX
Acknowledgments

Text

Abstract

Abstract

Contents

Acknowledgments

1 Introduction
1.1 Section Name
1.2 Another Section

2 Background

3 Location Detection Approaches
3.1 Profile location identification
3.2 Location detection by language
3.3 Machine learning approaches
3.3.1 Content-based Statistical Classifier

4 Conclusion

5 Future Work

Appendix

A Lists
List of Abbreviations
List of Figures

References
Chapter 1

Introduction

1.1 Section Name


Some sample text with an Acronym Without Citation (AC), some citation [1], and some
more Acronym With Citation [2] (AC2).

1.2 Another Section


Reference to Section 3.1, and reuse of AC and AC2, along with full use of Acronym With
Citation [2] (AC2).

Chapter 2

Background

Background

Chapter 3

Location Detection Approaches

In this chapter we introduce and describe several approaches for location detection over
social media.

3.1 Profile location identification

3.2 Location detection by language

3.3 Machine learning approaches

3.3.1 Content-based Statistical Classifier


In this section we describe our statistical location classifier, which is trained on terms
extracted from users' geotagged tweets.
We built this classifier for city-level location, for which we have ground truth. Each
user in our training dataset corresponds to one training example: the features are
extracted from the content of the user's tweets, and the target output is the geolocation
provided with those tweets. The number of classes in the trained model equals the total
number of locations (cities) in our training dataset.
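As a rough illustration of this setup, the sketch below groups geotagged tweets into one training example per user. The input layout, a list of `(user_id, text, city)` tuples, and the function name `build_training_examples` are assumptions made for the example. So is the rule that a user's first geotag decides the label, since the text does not say how users with several cities are handled.

```python
from collections import defaultdict


def build_training_examples(tweets):
    """Group geotagged tweets into one (text, city) example per user.

    `tweets` is an iterable of (user_id, text, city) tuples. This layout is
    hypothetical; the thesis does not specify its data format.
    """
    texts = defaultdict(list)
    city_of = {}
    for user_id, text, city in tweets:
        texts[user_id].append(text)
        # Assumption: the first geotag seen for a user is used as the label.
        city_of.setdefault(user_id, city)

    examples = [(" ".join(texts[user]), city_of[user]) for user in texts]
    classes = sorted(set(city_of.values()))  # one class per distinct city
    return examples, classes


tweets = [
    ("u1", "stuck in traffic", "Cairo"),
    ("u1", "sunset over the nile", "Cairo"),
    ("u2", "rain again", "London"),
]
examples, classes = build_training_examples(tweets)
print(len(examples), classes)
```

The number of classes is simply the number of distinct cities seen in the training data, matching the description above.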

3.3.1.1 Feature Extraction

First, we tokenize all tweets in our training dataset and filter them by removing URLs,
mentions, and hashtags. We then remove every token identified as a stop word, using the
list provided by the NLTK stopwords corpus. Once the stop words are removed, we lemmatize
the remaining tokens, reducing the different forms of a word to a common base form, using
Stanford CoreNLP.

Once the tokens have been extracted, we apply a simple heuristic algorithm called
CALGARI [1]. It is based on the intuition that a model performs better when it is trained
on terms that are more likely to be used by users from a particular region than by users
from the general population. The algorithm assigns each term a score that indicates how
much more likely the term is in its most characteristic location than in the dataset as a
whole. The score is calculated as follows:
Let $s(T)$ be the score of a term $T$, $F(T)$ the frequency of $T$ in our dataset,
$\operatorname{count}(T, c)$ the number of times $T$ is used with class $c$, $N$ the total
number of different terms in our dataset, and $C$ the set of classes (locations) in our
dataset. For each term we evaluate

$$s(T) = \frac{\max_{c \in C} P(T \mid c)}{P(T)}$$

The denominator is $P(T) = \frac{F(T)}{N}$, so it remains to evaluate the numerator. For
each class $c_i \in C$,

$$P(T \mid c_i) = \frac{\operatorname{count}(T, c_i)}{\sum_{j} \operatorname{count}(t_j, c_i)}$$

where the sum in the denominator runs over all terms $t_j$ in the dataset.
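The filtering and scoring steps of this subsection can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in the thesis. A small inline stop-word set stands in for the NLTK stopwords corpus so that the example runs without downloads, lemmatization with Stanford CoreNLP is omitted, and the names `clean_tokens` and `calgari_scores` are hypothetical. The score follows the definitions above: the largest per-class term probability divided by the term's overall frequency over the number of different terms.

```python
import re
from collections import Counter, defaultdict

# Stand-in for the NLTK stop-word corpus (assumption, to keep the sketch self-contained).
STOP_WORDS = {"the", "a", "an", "in", "is", "at", "to", "and", "of", "i"}

# URLs, @mentions and #hashtags are removed before tokenizing.
NOISE = re.compile(r"https?://\S+|[@#]\w+")


def clean_tokens(tweet):
    """Lowercase a tweet, strip URLs/mentions/hashtags, and drop stop words."""
    text = NOISE.sub(" ", tweet.lower())
    return [tok for tok in re.findall(r"[a-z']+", text) if tok not in STOP_WORDS]


def calgari_scores(examples):
    """Score every term as max over classes of P(T | c), divided by P(T).

    `examples` is an iterable of (tokens, city) pairs.
    """
    term_class = defaultdict(Counter)  # times term T is used with class c
    class_total = Counter()            # total term occurrences per class
    term_freq = Counter()              # F(T): frequency of T in the dataset
    for tokens, city in examples:
        for tok in tokens:
            term_class[tok][city] += 1
            class_total[city] += 1
            term_freq[tok] += 1

    n_terms = len(term_freq)           # number of different terms
    scores = {}
    for term, per_class in term_class.items():
        p_term = term_freq[term] / n_terms
        # Classes where the term never occurs contribute 0 and cannot be the max.
        best = max(count / class_total[c] for c, count in per_class.items())
        scores[term] = best / p_term
    return scores


examples = [
    (clean_tokens("Stuck in traffic at the Nile #Cairo http://t.co/x @friend"), "Cairo"),
    (clean_tokens("Coffee in London"), "London"),
]
scores = calgari_scores(examples)
```

Because the number of different terms is the same constant for every term, it rescales all scores equally and does not change their ranking.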

After computing a score for each term, the algorithm sorts the terms by score in
non-increasing order and keeps the top 10,000 terms as features for our model.
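The selection step can be sketched as below, assuming the scores are available as a dictionary from term to score. The function names are hypothetical. The alphabetical tie-breaking and the count-based feature vector are assumptions of this sketch, since the text does not specify how ties are resolved or how the selected terms are encoded for the model.

```python
def select_top_terms(scores, k=10_000):
    """Return the k highest-scoring terms, best first.

    Ties are broken alphabetically so the result is deterministic
    (an assumption of this sketch).
    """
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def to_feature_vector(tokens, vocabulary):
    """Count how often each selected term occurs in a user's tokens."""
    index = {term: i for i, term in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for tok in tokens:
        if tok in index:
            vector[index[tok]] += 1
    return vector


scores = {"nile": 1.25, "coffee": 0.83, "rain": 1.25}
vocabulary = select_top_terms(scores, k=2)
features = to_feature_vector(["rain", "rain", "tea"], vocabulary)
```

Terms outside the selected vocabulary, such as `tea` here, are ignored when a user's tokens are turned into features.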
Chapter 4

Conclusion

Conclusion

Chapter 5

Future Work

Text

Appendix

Appendix A

Lists

List of Abbreviations

AC Acronym Without Citation

AC2 Acronym With Citation [2]

List of Figures

Bibliography

[1] W.G. Campbell. Form and style in thesis writing. Houghton Mifflin, 1954.

[2] S. Wenkang. An analysis of the current state of English majors' BA thesis writing. Foreign Language World, 3, 2004.
