Diploma Thesis in Computer Science
submitted by
Alexander Markowetz
supervised by:
Prof. Dr. Bernhard Seeger, Philipps-Universität Marburg, Germany
and
Torsten Suel, Assistant Professor, Polytechnic University Brooklyn,
New York, USA
September 2004
Marburg an der Lahn
I affirm that I have written this thesis without outside help and without using any sources other than those indicated, and that the thesis has not been submitted in the same or a similar form to any other examination authority, nor been accepted by one as part of an examination. All passages taken verbatim or in substance from other sources are marked as such.
Alexander Markowetz
Abstract
This paper describes geographic search engines, which accept geographic coordinates in addition to the ordinary set of key terms. They allow the user to narrow a search to documents containing information about the specified geographic region. The development of an exploratory prototype of a geographic search engine is described step by step, presenting the theoretical background behind each decision as well as the alternatives as they arise. Before introducing geographic search engines, the paper provides a brief overview of traditional search engines.
In this work, every document receives a (possibly empty) geographic footprint: a collection of all regions the document provides information about. The main part of the paper focuses on the creation of these footprints. Different sources of information about the geographic foci of web documents are discussed, emphasizing the extraction of geographic references from documents and URLs, as well as exploiting whois entries. The process of extracting terms that might indicate geographic entities, as well as the matching between terms and entities, including the resolution of ambiguous cases, is discussed in detail. Once these initial footprints have been created for a body of documents, they can be enhanced by propagating their information along links and within documents from the same site or directory. While there is a multitude of options for query processing techniques, the prototype makes use of a straightforward approach.
The paper furthermore provides a brief overview of the possibilities of web mining on the extracted geographic information.
Contents
1 Foreword
2 Introduction
5 Addendum
5.1 Thanks
Chapter 1
Foreword
this prototype was going to not only use whois entries but also extract geographic information from the web documents.
At the time of writing, in late August 2004, the prototype is close to being deployed. The project has been full of (often not so pleasant) surprises and much harder than first imagined, but certainly worth the effort. I hope it will lay the necessary foundation for a new field of research:
Geographic Web Information Retrieval.
One brief note on the bibliography, whose name originates from biblion, meaning book or scripture in ancient Greek. In the 21st century, most information is no longer stored exclusively in books, and not even just in web pages. Much information is contained in code, prototypes, or even deployed production systems. For this reason, the bibliography has been extended to include academic prototypes, industrial solutions, and references to the companies that provide them.
Chapter 2
Introduction
The dynamic growth of the Internet has not only affected the number of documents or users online, or the number of authors that publish content online. It has also had a tremendous impact on the usage of the data from the Internet and the problems we can solve with it. Some queries and applications would not have made sense some five years ago, since the result sets would have been too small or even empty. Even the most general queries might have produced results that could still easily be skimmed by a human. In today's information landscape, even extremely specific queries will retrieve a substantial number of results. The focus of this paper is how to make queries more specific in geographic terms.
Although very brief, the history of the Internet can be divided into three periods:
early adopters In the early years of the commercial Internet, few companies found their way online. Their names, like Amazon, are legendary. The Internet was economically an experiment, with returns on investments not expected for years.
large corporate players In the next couple of years, national mail-order stores moved online and large corporations built web information systems. At this point, the Internet was already economically very interesting, but the resources for setting up a web site were still considerable. Few small and medium-size enterprises went online, and those that did mostly came from a mail-order background.
These three periods were reflected in the usage of web search engines:
Are there? During the period of early adopters, the interesting question was whether there was any information on, say, mountain bikes. Search engines were very crude and basically operated on a Boolean text model.
I need anything Today, search engines can be used for basically anything. They are used to find out what the local pizza store has on the menu and what the cinema is playing tonight. Searches can easily produce hundreds of thousands of answers. For one thing, such result sets require more complex ranking functions, mainly focusing on link structures. For another, they allow imposing more specific constraints on a search, since the answer set will still be of sufficient size. Such constraints might be "personal", "temporal" or "geographic", for example. The Internet has reached such a level of commercialization that web spam has become an industry of its own. Search engines therefore have to carefully avoid being spammed or trapped.
No Mining During the first period, the data on the web was too sparse for any web mining to make sense. Any form of mining only makes sense on large sets of data that cannot be handled manually by individuals anymore. This was not the case during this first period.
Global Mining on Large Phenomena The second period allowed for first web mining efforts, as long as the phenomenon to be investigated was large enough, that is, broad and not too specific.
Highly Specific Mining The amount of data on the web, and the fact that it covers even remote topics with a solid set of documents, make it possible to perform very specific web mining tasks. Queries can contain almost arbitrary constraints, especially on time and space, and still produce data of a size that supports statistical statements.
Both applications, web search and web mining, have reached the stage
where very specific queries finally make sense. Time and space, the two
most fundamental human conditions, are of key importance for taking these
applications to the next level. Their incorporation will also make a fundamental redesign of even the most basic techniques necessary. The impact can only be compared to the fundamental step databases took when they moved from a single dimension to multiple dimensions. Almost all areas of web search, even more than web mining, will have to be re-evaluated under these new premises.
In this paper we discuss geographic properties of Internet resources and the possibilities of imposing geographic restrictions on web search. We describe how we extracted geographic markers from web pages and used them to build an exploratory prototype of a geographic search engine. In addition, we provide a brief outlook on the geographic extension of web mining.
Chapter 3
Geographic Search Engines
Common search engines are already widely used for geographic search. Users often include names of geographic entities among the keywords to constrain the search to some region. Thus, they will compose queries such as:
– yoga brooklyn
– bed breakfast marburg
– scuba diving "long island"
This approach has several shortcomings:
• The user has to search extensively through all geographic terms. Thus, she will re-run slightly modified queries, such as:
– yoga brooklyn
– yoga "park slope"
– yoga "new york"
Some of the answers she will have to see over and over, simply because they appear in all three searches.
• The user must know good geographic terms. Therefore she needs to be familiar with the area.
• Some good pages might not contain any common geographic terms. Thus they will not show up in any of the above queries.
A geographic search engine, by contrast, will (pre-)compute the positions for all relevant pages. It looks for the same geographic hints the user would have included in her modified queries. Additionally, it searches for numeric codes and uses site and link analyses and external databases for additional geographic information.
In this paper, we discuss the development of an exploratory prototype of a geographic search engine step by step. At each stage, we will discuss alternative techniques from the literature.
d0  ich bin ich
d1  du bist du

Replacing each term by a term identifier, assigned in order of first appearance, yields:

d0  t0 t1 t0
d1  t2 t3 t2
d2  t2 t3 t4 t0
These are then sorted into the inverted index, which for every term tells us at what position in what document the term appears:
t0 d0,0,2 d2,3
t1 d0,1
t2 d1,0,2 d2,0
t3 d1,1 d2,1
t4 d2,2
The first entry, for example, tells us that term t0 appears in document d0 at positions 0 and 2, and in document d2 at position 3. One can imagine index structures or hash tables indexing the buckets for each term.
This is all a simple search engine needs. A search for ich would simply return all documents from the entry of term t0. A search for ich AND bin would translate into an intersection of the inverted index entries for t0 = {d0, d2} and t1 = {d0}, resolving to {d0}. Using the stored positions where the terms occur within a document, we can even form queries for sequences of terms such as ich bin.
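To make the mechanics concrete, the following Python sketch (not the prototype's actual code) builds such a positional inverted index and answers AND and phrase queries; d2's full text is not shown above, so only d0 and d1 are indexed.

from collections import defaultdict

docs = {
    "d0": ["ich", "bin", "ich"],   # t0 t1 t0
    "d1": ["du", "bist", "du"],    # t2 t3 t2
}

index = defaultdict(lambda: defaultdict(list))   # term -> document -> positions
for doc_id, terms in docs.items():
    for pos, term in enumerate(terms):
        index[term][doc_id].append(pos)

def search_and(*terms):
    """Boolean AND: intersect the document sets of all query terms."""
    doc_sets = [set(index[t]) for t in terms]
    return set.intersection(*doc_sets) if doc_sets else set()

def search_phrase(first, second):
    """Two-term phrase query: the terms must occur at adjacent positions."""
    hits = set()
    for doc_id in search_and(first, second):
        first_positions = set(index[first][doc_id])
        if any(p - 1 in first_positions for p in index[second][doc_id]):
            hits.add(doc_id)
    return hits

print(search_and("ich", "bin"))      # {'d0'}
print(search_phrase("ich", "bin"))   # {'d0'}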
The final step in executing a query in a search engine is the ranking of the answers. In our example this would be rather pointless, since the set of returned documents will never contain more than three documents. In real life however, most search engines produce thousands, if not millions, of answers for a simple query. Given that any human user will only be able to read the top fraction, the order in which the results are returned is crucial. Since we want to return first those documents that the user would find the most interesting and important, this order is called the importance ranking.
There is no single good function for imposing an importance ranking on sets of documents. Usually the ranking is a mixture of different functions, based on various measures, such as:
TF/IDF This measure counts how many times one of the key terms occurs in a document, its Term Frequency (TF). Not all terms appear equally often or are equally indicative; car, for example, is more frequent than bentley. A term's Inverse Document Frequency (IDF) indicates how frequent the term is over the entire body of documents. A small scoring sketch follows this list.
Relative Position When a query contains more than just one search term, the terms' relative position in a page is important. The closer the terms occur to each other in a document, the higher the page should rank.
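The thesis does not spell out a concrete weighting formula, so the following Python sketch combines the two quantities in the classic tf * log(N/df) way; function and parameter names are illustrative only.

import math
from collections import Counter

def tf_idf_score(query_terms, doc_terms, doc_freq, num_docs):
    """Score one document: sum of term frequency times inverse document frequency."""
    counts = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        tf = counts[term]                 # term frequency in this document
        df = doc_freq.get(term, 0)        # number of documents containing the term
        if tf == 0 or df == 0:
            continue
        idf = math.log(num_docs / df)     # rare terms like "bentley" weigh more
        score += tf * idf
    return score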
This is all there is to a basic search engine. A lot of the steps in this description are oversimplified and will usually not be carried out sequentially. The basic idea however should have become clear. Looking at an actually deployed search engine, we find three parts: an interface, a data component, and query processing.
Our geographic search engine will consist of the same three parts, none of which will look the same. We will describe these three components for a geographic search engine as we discuss the construction of our prototype.
The interface part will be looked at twice, once for simple interfaces and later for interactive interfaces with elaborate query control.
The data part will consist of the traditional crawl, indexed and ranked, but will additionally deal with geographic data. For our discussion, we will assume the reader to be familiar with the traditional techniques and focus on the latter.
Query processing will largely depend on the desired speed and behavior of the search engine. We will describe our baseline algorithms and briefly dip into further possibilities for query processing.
Before we describe how we constructed our geographic prototype, we will discuss what makes a good search engine and what process model we used.
First however, we will describe some simple interfaces. They come hand in hand with some simple use cases and will outline the different applications of geographic search.
In all cases, service provider and search engine would need close cooperation to automatically forward position information. Protocols should be fairly straightforward, but might depend on the appliances. The authors of [DCG04] propose an extra field in the HTTP protocol for geographic information. They implemented an early prototype that uses a GPS sensor to track the user's position and inserts this data into the HTTP requests to the server.
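As an illustration only: the exact field name used by [DCG04] is not given above, so the following Python sketch attaches the coordinates under a hypothetical Geo-Position header.

import urllib.request

lat, lon = 50.81, 8.77   # e.g. read from a GPS sensor; values are made up
request = urllib.request.Request(
    "http://www.example.org/search?q=yoga",           # placeholder URL
    headers={"Geo-Position": f"{lat:.2f};{lon:.2f}"}  # hypothetical header name
)
with urllib.request.urlopen(request) as response:
    body = response.read()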
How well they match human reasoning. Helping the user post good queries initially, to a search engine that conceptualizes the world in similar terms as she does.
Instead of deepening these extremely broad definitions, we will go straight to trying to capture or measure them, providing the definitions in their "measurement".
do. They usually give a group of users a set of keywords and have them manually evaluate web pages with regard to the keywords. Next, they evaluate the search engine's results against the "optimal" outcome, as produced by the user group.
This approach sounds straightforward and promises numerical results that allow the comparison of different search engine techniques. However, it has many drawbacks when it comes to putting it into practice:
• User groups are usually small and far from representative. A group of 25 volunteering computer science students will hardly produce any valuable output. Large and representative user studies require substantial financial resources.
• Any reasonable document base is too large for extensive human evaluation. Even moderate-size web crawls, like the one used for this study, contain several tens of millions of documents. Commercial crawls can easily amount to several billion documents. Any such collection of data is too large to be evaluated manually. Reducing it to a meaningful subset however would necessarily introduce bias.
• There are too many unknowns in a search. No user study could independently examine the impact of each variable. The area of application, type of query (navigational, undirected, etc.) and query length are just a few such factors. The fact that we will introduce several new variables over the course of this paper makes user studies particularly challenging.
and one that does not. The document space D consists of d documents. The set of results R contains r documents. The remaining d − r documents are called non-results. The search engine will determine a set of answers A, of which the user will read a subset K: the first k of these documents. The search engine's quality is then described by how K, R and D relate to each other.
The standard measures such as Precision and Recall can be found in textbooks such as [Cha02]. These measures however date back to a time when document spaces used to be rather small. Querying was about finding any relevant documents; that is, A was very small and would usually equal K. After the tremendous growth of the World Wide Web however, A turns out to be extremely large for most queries, usually much larger than K. Search engines differ not only in how well A resembles R, but more importantly in the order they impose on A. The latter determines which answers make it into K. This order however is not measured by precision or recall.
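For reference, the standard textbook definitions relate the answer set A and the set of relevant results R as follows:
• precision = |A ∩ R| / |A|
• recall = |A ∩ R| / |R|
Restricted to the first k answers K, one obtains |K ∩ R| / k, which is the quantity that actually depends on the order imposed on A.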
To conduct a meaningful user study, one has to invest tremendous resources. A large user group with a heterogeneous background is needed. The method of validating results against keywords has to be replaced by search tasks, where the user has to employ the search engine for small assignments, like "find the web site of company xyz". There must be several tasks from several basic categories, like "navigational search". This method of course adds an extra level of complexity. For geographic search engines, these queries would have to be performed for different locations, adding another dimension. This short discussion should illustrate how a meaningful user study of the analytic quality of a final product is already very expensive.
Using user studies for constructional quality during the design process of a search engine is even more expensive. One does not only have to evaluate one single system, but many different small systems, where one or a few parameters differ between set-ups, in order to find out how to set these parameters correctly. Since there is a whole range of different parameters, this method becomes so expensive that it is left only to the largest commercial players.
3.3.3 Hypotheses
The other way of supporting any claim about the quality of a search engine
is to state basic hypotheses and reason about them. Hypotheses are
• short
• simple
• easy to agree upon
assumptions on different subjects, such as:
• The quality and quantity of underlying data, such as the document
space
SHALLOW PENETRATION Any given user will only manage to see the first 10-200 answers. Some commercial search engines, such as Google [Goo], will only return the first, say, 1,000 answers.
The combination of these two axioms is often called the iceberg phenomenon, the notion being that the user only gets to see the tip of an iceberg of results.
UN-DETERMINED The user does not know what she is looking for most of the time. She hammers a two-term query into her keyboard just to get a first idea of what she could possibly find. This behavior contributes to the abundance problem and can be captured by the sentence:
We will keep adding other hypotheses throughout this paper. Also, there
are some that have been widely used already, but have to be adapted to a
geographic environment.
Personalized Search
One can try to overcome ABUNDANCE by imposing additional constraints on the result set, thus making it smaller. Many such proposed constraints are personal, that is, specific to the person of the user. Proposed techniques refer to the user's past preferences or those of her friends or of people with similar personal profiles. The user's geographic location, or a location that she is interested in, is one such personal property. Any of these techniques will help keep the iceberg small.
Interactive Search
The impact of SHALLOW PENETRATION cannot be reduced so easily.
After all, we are humans and cannot browse through thousands of answers.
However, one can try to circumvent this problem by interactive browsing
through results.
We can help the un-determined user re-adjust her search interactively as she
browses the results. She will be allowed to change search parameters after
each batch she receives, without having to restart the search or seeing the
same answer twice. Thus, she will be allowed to change direction while drilling into the iceberg, instead of consuming it in an iterator fashion from the top.
This will also enable her to circumvent web spam. Determined web masters have set up so-called Google traps that produce higher rankings for their website. As mentioned above, some search engines like Google only return the top 1,000 answers. Successful Google-trapping that conquers all possibly returned answers can thereby effectively block competitors from any attention by the user. Interactive browsing may help to circumvent such maliciously boosted answers.
Our geographic search engine is personal by nature. In addition, we will
show how to make it interactive and provide the necessary user interfaces.
• One SUN workstation parses out URLs from retrieved pages, removes
duplicates and adds them to the queue for crawling.
• One SUN workstation performs the DNS lookup and manages the
crawl.
• Seven Linux machines were used for storing the crawled pages.
3.4.2 whois
The whois service is an integral part of the Internet infrastructure. It is a distributed database that provides information for all registered domains. For each entry, it provides information about the registrant, as well as content, server and network related contact addresses. For all generic top-level domains, such as .com, .org, .net and .edu, it is maintained by the (commercial) operators. For all country-code domains, such as .de, .fr or .at, it is maintained by the national domain registration authorities, such as the German DENIC [DEN].
The whois service can be accessed via web front ends, such as UWHOIS [UWH], or via the UNIX command whois. Every whois entry should consist of four pieces of information:
Registrant It should contain the address, phone and email contact of the
(legal) person who has registered the domain. This information is
most commonly used in disputes over domain ownership.
admin-c This section should contain a contact for the person that is responsible for the content of the web site. It is usually identical with the registrant.
tech-c This contact is for the technical administrator of the server behind
the domain. Usually, only large companies and those with their own
IT-department will host their own web servers. The largest portion of
all domains is hosted in remotely run server farms. Hence, this contact
will usually not be identical with the above.
zone-c This section provides a contact for the administrator of the network through which the server is connected to the rest of the Internet. Since almost all domain registrants are connected by a third-party provider, this contact will almost certainly be different from the first two.
Table 3.2 shows the whois entry for die Fahrradschmiede, a small bike store in the center of Germany. Like most other small businesses, the company hosts with one of the two largest German hosting companies. For that reason, tech-c and zone-c point to the same contact.
Ideally, all whois entries should contain the above information in a well-structured text document. However, information is often incomplete and noisy. Before we turn to German domains, we will briefly look at some other whois databases:
.com domains are highly diverse, since they are registered by companies from all over the world. They usually contain some sort of information for most fields. However, they tend to be so unstructured that elaborate parsing techniques need to be deployed to extract a single field of information, such as the registrant's zip code or country of origin.
.co.uk entries are very sparse. They include the name of the registrant and the name of the registrant's agent. Addresses for either are very rare. There are no phone numbers, so no area codes to be extracted.
.at entries are of very good quality, including all four sections. The fields usually contain relevant information. They are highly structured.
.ch entries contain only two sections, a holder of the domain and a technical contact. The first likely corresponds to the registrant and admin-c, the second to tech-c and zone-c. Fields are rather structured.
.nl entries contain several address fields. Their size seems arbitrary. They can consist of an array of technical and administrative contacts, a registrant's and a registrar's contact. The fields are clearly separated, but don't always contain a complete address and phone number. Extraction however should not be too hard.
The whois entries of German .de domains are not only extremely complete, they are extremely structured as well. For most entries, all four distinct sections can be found. They are usually so well structured that parsing for certain fields of data is quite simple. This greatly facilitated the implementation and made .de domains an ideal test bed for our prototype.
domain: fahrradschmiede.de
descr: Die Fahrradschmiede
descr: Ringstr. 10
descr: D-35108 Allendorf (Eder)
descr: Germany
nserver: ns.schlund.de
nserver: ns2.schlund.de
status: connect
changed: 19990311 092200
source: DENIC
[admin-c]
Type: PERSON
Name: Christoph Michel
Address: Die Fahrradschmiede
Address: Ringstr. 10
City: Allendorf (Eder)
Pcode: 35108
Country: DE
Changed: 20000321 172103
Source: DENIC
[tech-c][zone-c]
Type: PERSON
Name: Puretec Hostmaster
Address: 1&1 Puretec GmbH
Address: Erbprinzenstr. 4-12
City: Karlsruhe
Pcode: 76133
Country: DE
Phone: 49 1908 70700+
Fax: 49 1805 001372+
Email: hostmaster@puretec.de
Changed: 20000927 160119
Source: DENIC
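As a rough illustration of how simple the parsing can be for such well-structured entries, the following Python sketch (not the prototype's code) splits a DENIC-style record into sections and key/value fields; the header block before the first section is skipped and combined section labels are kept as one key.

def parse_denic_entry(text):
    contacts = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("[") and line.endswith("]"):
            current = line.strip("[]")      # "[tech-c][zone-c]" stays one combined key
            contacts[current] = {}
        elif ":" in line and current is not None:
            key, value = line.split(":", 1)
            contacts[current][key.strip().lower()] = value.strip()
    return contacts

# For the entry above: contacts["admin-c"]["pcode"] -> "35108",
# contacts["admin-c"]["city"] -> "Allendorf (Eder)"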
The whois service dates back to a time when privacy on the Internet was not an issue and spam was an unheard-of phenomenon. Since then the Internet has changed dramatically, and it no longer seems appropriate to publish the addresses and email contacts of web masters. Spam authors in particular have taken to the whois database as an easy source of email addresses. National privacy laws, such as the German Bundesdatenschutzgesetz, easily collide with the whois service. For this reason, we expect the whois service to change dramatically or vanish over the course of the next years. Even now, the German DENIC refused to share a copy of their otherwise publicly accessible database. We ended up querying their whois server for the 650,000 domains we had touched with our crawl, out of the total of 7,800,000 .de domains. We conducted the retrieval slowly enough not to resemble a denial-of-service attack, and queried for about two weeks.
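Such a throttled retrieval can be sketched as a simple loop around the command-line client; the whois binary is assumed to be installed and the delay value is purely illustrative.

import subprocess
import time

def fetch_whois(domains, delay_seconds=2.0):
    """Query one domain at a time, pausing between requests."""
    for domain in domains:
        result = subprocess.run(["whois", domain], capture_output=True, text=True)
        yield domain, result.stdout
        time.sleep(delay_seconds)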
Area Codes
The first set contained the centroids of the polygons associated with the regions covered by area codes. In addition, it contained the name of the major city associated with each area code.
It is important to point out that the area codes are only loosely connected to Germany's administrative geography. That is, some cities might have more than one area code, while in other regions several cities share one area code. The dataset contained only the most significant city name for an area code.
The data came in a text file with 5,000 entries in the following format:
area code; city name; longitude, latitude
Table 3.3 shows a short extract from the actual data set.
the entry would point to a city. In the second, it would point to a village
and include the name of the city the village is associated with. Table 3.4
shows a short extract from the actual data set.
This data set proved to be extremely problematic and required enormous effort to clean up. It was intended to be used in geographic information systems (GIS). For this reason, the positions were the key attributes of this database; this assumes that there are no two towns on top of each other. The city names were a mere hint to the operator of the GIS as to which city she sees on her display. They were extremely dirty and incoherent, the reason probably being that they originated from several sources, agencies of the different German states, the so-called "Landesämter". For our application however, we needed correct city names, since we were going to parse for them. Since the errors in the data were so inconsistent, we ended up cleaning the data manually.
There were several sometimes closely connected errors in the data. Many
terms were abbreviated, often highly inconsistently. We had to replace all
abbreviations with the full term, and stored the replacement in an extra file.
We assumed that the abbreviations we found in this data were going to be
found again in the web pages and that we should store the mapping now,
so we could reuse it later. In that case, we could translate the abbreviation
to the full term, by looking it up in a table.
We dropped all little terms such as ”in” and ”auf”1 and their abbreviations,
since they would not be of much help. Sometimes, they had been abbreviated
and glued to the preceding term. So the data set would show a Furthi Wald
instead of a correct Furth i. Wald or a complete Furth im Wald. These
cases were particularly nasty to clean up, since not every trailing i is an
abbreviated in or im.
The cleaning of the data was extremely tedious, could not be automated, and took one person about ten days. In very hard cases, we had to look up the correct town name on the Internet. However, in our data-centric approach, good quality data to start from is of crucial importance. If the system had not returned the expected results later, it would have been impossible to tell whether this was due to faulty data or faulty design.
34639,schwarzenborn*knuell lager*schwarzenborn,9.4253880002587316,50.89763299981351
34639,schwarzenborn*knuell,9.4473499995065371,50.911501999164322
35037,marburg,8.7691719990471793,50.790983001039926
35041,marburg dagobertshausen,8.6951250011819887,50.819398000500954
35041,marburg dilschhausen,8.6579209990609325,50.816383000470793
35041,marburg elnhausen,8.6940449992263193,50.809825001040245
35041,marburg haddamshausen,8.7010059981941001,50.782691998169838
35041,marburg hermershausen,8.6908049979446815,50.786987998059928
35041,marburg marbach,8.7367700007645119,50.819699998081383
35041,marburg michelbach,8.7030469979329474,50.844195001501646
35041,marburg wehrda,8.7560919993076762,50.838240000501266
35041,marburg wehrshausen,8.7275279990758623,50.811936998521176
35043,marburg bauerbach,8.8327769986550972,50.818493000550703
35043,marburg bortshausen,8.7806929996489167,50.751035000158161
35043,marburg cappel,8.7613720007169569,50.779828000869543
35043,marburg cyriaxweimar,8.7196080015473143,50.784577001069927
35043,marburg ginseldorf,8.8220960016123797,50.841632001111442
35043,marburg gisselberg,8.7438499993194121,50.774928001878941
35043,marburg moischt,8.8259369984968465,50.774175001389246
35043,marburg ronhausen,8.7569309986701196,50.758347000748749
35043,marburg schroeck,8.8313359994932465,50.786459998689701
35066,frankenberg*eder doernholzhausen,8.8773020019242814,51.056517001379312
35066,frankenberg*eder friedrichshausen,8.8609800016807316,51.048979999669264
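Assuming the cleaned records keep the comma-separated layout shown above (zip code, normalized place name, longitude, latitude), parsing them is straightforward; this Python sketch is illustrative only.

def parse_place_record(line):
    zip_code, name, lon, lat = line.strip().split(",")
    return {"zip": zip_code, "name": name, "lon": float(lon), "lat": float(lat)}

record = parse_place_record(
    "35037,marburg,8.7691719990471793,50.790983001039926")
# -> {'zip': '35037', 'name': 'marburg', 'lon': 8.769..., 'lat': 50.790...}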
3.4.5 Umlauts
The German language makes use of a handful of special characters, the so-called umlauts. For historic reasons, there are various ways to incorporate umlauts in HTML pages. In addition, many keyboards do not have keys for umlauts, so users will type sequences of other characters when they actually mean to type an umlaut. In order to deploy a working search engine, we decided to get rid of umlauts altogether. HTML was originally designed for the 128 characters known as lower ASCII. They are encoded in a single byte, not making use of the topmost bit. Umlauts are not part of lower ASCII. Web page authors replaced them with sequences of lower-ASCII characters, a technique used by every German-speaking writer when she encounters a non-German keyboard. For that reason, several techniques were introduced later that allowed umlauts to be included in HTML documents. There were two underlying strategies:
• Special HTML tags that the browser resolves to the German umlaut. For historic reasons, there are two ways of using tags for umlauts.
– A tag that specifically names the umlaut
– A tag that gives the umlaut's number in Latin-1 encoding, using the upper 128 characters that can be represented in a byte.
• Directly using Latin-1 encoding and telling the browser about it. In
this case, umlauts can be directly typed in the HTML, using the upper
128 characters.
Table 3.5 demonstrates the different umlauts and the several ways of encoding them:
For indexing and recognition, we want to identify two equal words as
equal, no matter if they were written in different encodings. The encodings
in HTML and Latin-1 can easily be converted. The mapping to sequences
of lower-ASCII characters however is not bi-directional. Every ä can be
mapped to an ae, but not every occurrence of an ae was meant to represent
an umlaut. For that reason, we decided to translate every occurrence of
an umlaut to sequences of lower-ASCII characters. We translated the raw
HTML pages from the web crawl, as well as our tables with geographic data
for Germany.
This technique can be highly recommended, even for traditional search engines.
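A minimal sketch of such a normalization step might look as follows; the replacement table is abbreviated and the function name is illustrative.

REPLACEMENTS = {
    # named HTML tags, numeric Latin-1 references, raw Latin-1 characters
    "&auml;": "ae", "&#228;": "ae", "ä": "ae",
    "&ouml;": "oe", "&#246;": "oe", "ö": "oe",
    "&uuml;": "ue", "&#252;": "ue", "ü": "ue",
    "&szlig;": "ss", "&#223;": "ss", "ß": "ss",
    "&Auml;": "Ae", "Ä": "Ae",
    "&Ouml;": "Oe", "Ö": "Oe",
    "&Uuml;": "Ue", "Ü": "Ue",
}

def normalize_umlauts(text):
    """Rewrite every umlaut encoding to its lower-ASCII replacement sequence."""
    for encoded, plain in REPLACEMENTS.items():
        text = text.replace(encoded, plain)
    return text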
Programming Languages
Over the course of the project, we dealt with a wide range of data sets, from 100 KB to over a terabyte. Functions on that data ranged from simple mapping and string processing to complex computations that required more detailed modeling.
The intermediate text formats gave us great flexibility in choosing specific languages for different tasks. Over the course of the project, we used:
C and C++ for fast processing of the largest data sets, when speed matters.
Java for tasks that required detailed modeling, when speed is not a key
issue. Java proves to be much faster to code and debug than C.
Perl and Python for functions on geographic data sets and extraction of
features from larger text collections. Both languages are well suited
for string handling.
3.6 Geocoding
It is the underlying idea behind geographic search engines that web pages
provide information for certain areas. A website for Marburg’s tourist office
for example would contain information on Marburg. Every web page can
contain information for one, none, or even several such areas. This set of areas is called the page's geographic footprint.
The process of discovering and assigning the areas that compose the footprint is called geocoding [McC01]. It starts with geoparsing [MAHM03], the search for hints in web pages that will lead us to the areas. This step is highly dependent on a country's geography and the names of its towns, and can only be generalized to some degree. By geography, we mean how a country is organized into political and administrative geographic regions. Differences can be found in what entities there are, such as cities or states, how they relate to each other, and how they relate to other administrative codes, such as the national postal or phone system. As the authors of [McC01] pointed out:
In the next step, we try to geomap these hints to actual geographic entities, such as cities and villages. The sum of these entities then forms the footprint.
After initial footprints have been computed for all web pages in the crawl, we can further refine and improve them in post-processing steps.
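To make the two steps concrete, the following Python sketch (with hypothetical helper names) shows geoparsing and geomapping feeding a footprint; the gazetteer is assumed to be a simple dictionary from place names to candidate entries.

def geoparse(page_text, known_names):
    """Very crude hint finder: known place names occurring as tokens in the page."""
    tokens = set(page_text.lower().split())
    return [name for name in known_names if name in tokens]

def geomap(hint, gazetteer):
    """Resolve a hint to zero or more (name, lon, lat) entries; ambiguity is kept."""
    return gazetteer.get(hint, [])

def geocode(page_text, gazetteer):
    footprint = set()
    for hint in geoparse(page_text, gazetteer.keys()):
        footprint.update(geomap(hint, gazetteer))
    return footprint          # the page's (possibly empty) geographic footprint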
document, avoiding ambiguity. For this section, we will assume that meta information is provided by the author. Schemes with external collaborative editors, such as Internet directories, will be described in Section 3.6.2.
There are two basic levels of sophistication, one less formal and another very precise. Ever since its creation, HTML has contained a meta tag, where an author can enter a set of keywords that loosely describe the document's content. Table 3.6 contains an HTML header that makes use of meta tags. Later, it became necessary for authors to agree on fixed meanings for certain terms, often highly dependent on the content of the document, such as medicine, trade or product descriptions. In this semantic web, application-dependent ontologies form agreed-upon mappings between terms and meanings. As it turns out, for a geographic Internet, the difference between the two concepts becomes rather slim and can almost be ignored.
Currently, no search engine considers the text in HTML meta tags. As a result, they are almost never used. A smaller study [Mih03] found that 0.005 - 1.9% of all web pages contained any kind of standardized meta information. Of these, 1 - 3% were found to contain spatial information. However, a geographic semantic web as described by [Ege02] could have great potential. One could eliminate geoparsing from geocoding, since all required information would be readily available. There has been quite some work in this direction. All solutions however suffer from two significant problems:
Chicken and Egg After initializing such a service, web masters would have to annotate their web pages or enter geographic information in the directory service.
Web masters however are practical people and will only invest their time if they expect a significant gain in return. The gain would be more hits from users, usually led to the web site by a search engine. They would therefore wait for major search engines to make use of such information.
Search engines in turn wait for the web masters to make the first move. They would only focus on geographic meta data if enough web masters had annotated their pages. Otherwise, the data available to a geographic search engine would be too scarce to be of any use.
This problem could be overcome by either:
• A core search engine that does not rely on geographic meta information. It would be ready to use from the first day, and web masters would directly benefit from entering geographic meta data.
• A commercial search engine of significant size, which would have the leverage to de facto force web masters to implement meta data. Web masters would risk losing an intolerable portion of their traffic if they did not comply.
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<META NAME="geo.country" CONTENT="DE" />
<META NAME="geo.region" CONTENT="DE-HE" />
<META NAME="geo.placename" CONTENT="Marburg * Capelle * Michelbacher Mühle" />
<META NAME="geo.position" CONTENT="50.867;8.700" />
<META NAME="ICBM" CONTENT="50.867, 8.700" />
<META NAME="DC.title" CONTENT="eventax UmkreisFinder Marburg * Capelle * Michelbach
<META NAME="DC.coverage.spatial" SCHEME="DCMIPOINT" CONTENT="east=8.700; north=50.8
<META NAME="DC.coverage.spatial" SCHEME="ISO3166" CONTENT="DE" />
<meta name="keywords" content=" 2004 Umkreis Umgebung [...] Messen Sport" />
<meta name="description" content="Die Suchmaschine und Stadtportal für Marburg
<title>Die Suchmaschine für Marburg und Umgebung</title> <link rel="styleshee
<script language="JavaScript" type="text/Javascript">
[some JavaScript here]
</script>
</head>
• (partially) abbreviated
• highly ambiguous
In the case of telephone numbers, formats might vary and even contain
special characters:
• 06421 2821559
Numeric codes can be used in combination with textual codes or other numeric codes to reduce the risk of mistaking a random number for a numeric geographic code. They are also highly dependent on a country's administrative characteristics. Cell phones in Germany, for example, have an area code that points to a service provider, not an area. In the United States however, cell phone numbers usually contain a local area code.
Once a larger body of documents has been geocoded, one can try to infer geographically relevant terms. The probability of the occurrence of a term in a document might be strongly correlated with the document's geographic footprint. Using techniques such as those proposed in [PF02], it should be possible to detect informal names for landmarks or for events that are held in a specific location. Examples might be:
• Mermaid Parade for Coney Island, Brooklyn, NY, where this event is
held annually.
• whois directories
Business directories have been around in their printed form for a long time. They might cover all businesses, like the yellow pages, or focus on a specific business sector. They all have in common that the initiative to get listed lies with the client. Even if the directory approaches a company about getting listed, it is usually in the company's hands to agree to the listing. There is usually a fee associated with the listing. The fee has a two-sided impact on the content of the directory. On one hand, there is very little spam (misleading and redundant entries). On the other hand, the service is limited to commercial clients. Non-profit and private contacts will usually not be listed. This makes business directories an incomplete source of high quality data. They can very well be integrated into a geographic search engine.
The geographic search engine [inc03], as described in Section 3.9.1, is built around a business directory. As for most commercial search engines, the exact algorithm and data are not disclosed to the public. It is however quite clear that after having issued the query, the user is first confronted with a list of nearby entries from a business directory. The user must next click on an address and will then be presented with answers for that entry. The service thus appears to be more like a search inside the yellow pages.
Manually compiled directories are almost as old as the Internet. Commercial directories like yahoo! [Yah] or non-profit competitors like the Open Directory Project [] have been fierce competition for search engines. The directories are organized in a hierarchical structure, and web pages and sites are entered by hand. It is easy to organize the directory so that it contains a geographic component. Indeed, most directories allow the user to click into sections for regions or cities. The placement of an entry makes a statement about the geographic footprint of the Internet resource.
In contrast to business directories, there is usually no fee required for a placement, and it is the directory that takes the initiative by placing the entry for a web page, not its author. The directory is always far from complete, but does include non-profit and private pages and thus is much larger than business directories.
Manually compiled directories are a good source for generating geographic
footprints. The geographic search engine www.umkreisfinder.de [Eve] as
• Most web sites are about places close to the position of their owner.
• The admin-c section of the whois entry points to the address of the
company that owns the domain.
• Most small businesses don’t run their own servers, but rent remotely
hosted web space. The technical contact section of the whois entry
would point to this remote address.
• Even larger businesses that host their own servers usually do not operate their own network. The zone-c section would point to the address of the networking company, which is usually not close by.
Instead it is compiled by the web host from the (necessarily accurate) billing data provided by the registrant.
The whois service proves to be an open, free and rather reliable source of geographic data.
Post Processing
In the previous section, we have seen that we can find geographic references
in a sufficient fraction of all web pages. We would however like to increase
this fraction. Also, we don’t know if all geographic footprints are sufficiently
filled. They might be nonempty but still miss references to important regions
that a page provides information for. In this section, we demonstrate how
quantity and quality of a geocoded document body can be enhanced by
extending basic IR hypotheses to geographic features.
We have already touched on two important topics in the previous section. We learned that on every German web site there is at least one page that contains information about the legal entity that runs the site, including contact phone and address. Even though this information is concentrated on one single page, it somehow applies to the entire site. Also, we learned that geographic references found in the anchor text of a link might be propagated to the page that the link points to. But the difference between anchor text, the surrounding text and the rest of a document is not so great, so probably they all should somehow be propagated. In this section we will try to formalize and justify this notion by extending three basic assumptions from traditional web information retrieval to geographic properties. We will discuss them one after another, before discussing how we can actually propagate information between web pages.
INTRA-SITE This hypothesis states that two pages p1 and p2 are more likely to be similar if they belong to the same site/domain, p1.domain = p2.domain, than two random web pages p3 and p4 that belong to different domains, p3.domain ≠ p4.domain. The idea is that if one web page in a domain is about, say, "bicycles", any other is more likely to be on the same topic than a random page from the web. This hypothesis can be extended to smaller units than a site, forming an INTRA-SUBDOMAIN or an INTRA-DIRECTORY hypothesis.
RADIUS-ONE This hypothesis states that two pages p1 and p2 are more likely to be similar if they are linked, that is, if there is a link l(p1, p2), than two random pages that are not linked. Note that this hypothesis is symmetric; it does not matter whether the link is l(p1, p2) or l(p2, p1). If both links exist, the correlation between the two pages is expected to be even stronger.
RADIUS-TWO This hypothesis states that two pages p1 and p2 are more
likely to be similar if there is a page px that links to both pages:
∃l1: l1(px, p1) and ∃l2: l2(px, p2). This hypothesis is also known as COCITATION. There is a more detailed version of this hypothesis that argues that the correlation between p1 and p2 is the stronger, the closer l1 and l2 are within px.
We used the term "similar" in the above definitions without giving a detailed description. It could mean, for example, "on the same topic", such as "bicycles". We don't really need to narrow ourselves down to a more precise description, since we are going to replace it at this point by "geographically close". The notion is that if two pages are more likely to be similar, they are also more likely to hold information about the same geographic region. We will rephrase the three hypotheses in this geographic context:
INTRA-SITE Two pages are more likely to be about the same geographic region if they belong to the same site/domain.
RADIUS-ONE Two pages are more likely to be about the same geographic region if they are linked.
RADIUS-TWO Two pages are more likely to be about the same geographic region if there is a page that links to both of them.
• p1.fp := F(p1.fp, p2.fp)
• p2.fp := F(p2.fp, p1.fp)
The most basic function for F is a sum with a dampening factor f ∈ [0, 1]:
• p1.fp := p1.fp + f ∗ p2.fp
• p2.fp := p2.fp + f ∗ p1.fp
This technique works reasonably well. For more flexibility, the dampening factor f can be replaced by a factor f(p1, p2) that depends on p1 and p2. If propagating because of a RADIUS-TWO situation, for example, f(p1, p2) can be the higher, the closer the two links are within px. If propagating because of an INTRA-SITE situation, f(p1, p2) could be the higher, the closer p1 and p2 are. If they are just in the same domain, f(p1, p2) could be relatively low, but if they are also in the same directory, it could be much higher. As we will see in our implementation in Section 3.9, in some cases it makes
sense to not propagate from page to page, but to first build an aggregate
and propagate the aggregate.
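As one possible reading of the formulas above, footprints can be modelled as weighted sets of places; the sketch below applies the dampened sum in place (for a strictly simultaneous update in both directions, the footprints would be copied first).

def propagate(fp_target, fp_source, f=0.5):
    """fp_target := fp_target + f * fp_source, footprints as dicts place -> weight."""
    for place, weight in fp_source.items():
        fp_target[place] = fp_target.get(place, 0.0) + f * weight

# For a link l(p1, p2), propagation runs in both directions:
# propagate(p1_fp, p2_fp); propagate(p2_fp, p1_fp)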
A quite different but very interesting approach to exploiting the RADIUS-ONE hypothesis has been proposed in [DGS00]. The prototype that implements this approach is described in Section 3.9.5. Omitting the geo-extraction discussed in the previous section, they performed an initial mapping solely by analyzing whois entries. They propose an elaborate density-based approach for propagating geographic information. Focusing on the United States, they imposed an administrative hierarchy on the country, constructing a tree made of levels of "states", "counties" and "cities". Web pages are treated one by one. For every web page p they compute p's geographic scope by traversing the tree top down. If a region qualifies to be part of the geographic scope, the traversal will not continue into the region's sub-regions. For a region to qualify for inclusion in p's geographic scope, it has to meet two conditions (a rough sketch follows the list):
• There has to be a minimum number of links from this region, according
to the whois entries.
• The origins of these links have to be spread homogeneously over the region.
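A rough sketch of this top-down traversal, with a placeholder threshold and a placeholder homogeneity test (the original parameters are not given above):

from dataclasses import dataclass, field

@dataclass
class Region:
    name: str
    children: list = field(default_factory=list)

# links_from: region name -> number of whois-derived links originating there
def spread_is_homogeneous(region, links_from):
    # Placeholder test: no single child region may account for all of the links.
    counts = [links_from.get(child.name, 0) for child in region.children]
    return not counts or max(counts) < sum(counts)

def geographic_scope(region, links_from, min_links=10):
    """Return the regions that make up a page's geographic scope."""
    count = links_from.get(region.name, 0)
    if count >= min_links and spread_is_homogeneous(region, links_from):
        return [region.name]            # qualifies: do not descend into sub-regions
    scope = []
    for child in region.children:
        scope.extend(geographic_scope(child, links_from, min_links))
    return scope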
This approach however has one major flaw: the fixed hierarchy imposed on the data. Any hierarchy is rather arbitrary and application-dependent. A hierarchy that fits political and administrative needs, such as the above, might completely fail for other applications, such as ecological observations. Thus, this approach is not as flexible as the simple addition of geographic footprints discussed earlier.
There are many ways to infer a host's actual position from hardware and network properties. These techniques are mostly used to determine the position of a client's host, not a server's. In our scenario, this information is much easier to get; the most basic approach is simply to ask the user to fill out a form. For more information on user tracking, see Section 3.2. Hardware-based location techniques for clients are usually far from precise and not the focus of this discussion. They can however also be used to estimate a server's position and might be of use in combination with other techniques.
There are five basic techniques for determining a host's location.
Whois entries contain addresses for the network and host operator. From these, one can try to infer geographic information for a host, given its URL.
None of these techniques is very precise, and even their combination is not precise enough for a useful geographic search. There are several commercial enterprises, such as Digital Envoy [Dig], Quova [Quo] or Verifia [Ver], that offer IP-to-position services, mainly for determining users' positions.
They often claim precision to the street level, but fail to provide information to back up this doubtful claim. The authors of an academic study that focuses on analyzing router names and network delay encountered errors of twenty to several hundred miles [PS01]. The precision can therefore be expected to be more on a country or state level, which is reflected in the current applications of these technologies:
• Guessing the native language of a user.
1. Furth
2. Sachsenhausen
3. Weimar
4. Neustadt
5. Schwalbach
6. Essen
• There are sometimes predicates the town has earned over time, such
as Bad or Sankt. Predicates are usually common German terms and
precede the main term.
We will need these classifications of terms and towns for the extraction of
terms and their mapping to towns in the next sections.
cases. In the following, we will make the case that it is best to avoid false positives and rather drop a hint than include it in a page's footprint.
Assume a page p contains a doubtful hint h to a geographic entity, such as town t. Let us also assume that a user posts a query q for a position near t and that p could qualify as an answer. We will later see that we can attach certainties to h, but this first basic decision remains, whether to:
If we include the hint in the footprint, we run the risk that p makes it into the top-k answers for q. Again, there are two possibilities:
if h was meant to point to t, then we are fine and made a correct decision. However, we did not gain much, since according to the ABUNDANCE hypothesis, there would have been plenty of other good results.
if h was not meant to point to t, we have a bad answer in the top-k. If this happens for other pages too, we run the risk of cluttering the top-k beyond an acceptable fraction. This would render our search engine useless.
If we ignore the hint and p would have qualified for the top-k, again there are two possible scenarios:
if h was meant to point to t, we lose one result from the top-k. This is no big deal, since according to the ABUNDANCE hypothesis, we have plenty of other good results.
if h was not meant to point to t, we did the right thing.
As we can see, we have little to gain from including a doubtful hint, but a lot to lose. In conclusion, it is usually safer to completely drop an uncertain or ambiguous hint.
We will see later in this chapter that we have to be particularly careful in regions with a low web page density.
This is not the case for geographic search engines. The mapping from hints to positions, in combination with a proximity measure, often makes a re-tracing of errors impossible. Let us look at a simple example to illustrate this problem.
Assume a user lives in the area of Ostelbien, north-east of Leipzig, and queries for a travel agency. If we were not careful in our geo-extraction and geomapping, we might return completely useless results that point to travel agencies at the other end of the country. And the user might never find out why, nor might we.
As it turns out, there is a small village by the name of Last in Ostelbien. If we are not careful, we might assume every travel agency with last minute offers, a common English phrase on German tourism sites, to be locally relevant for Last and therefore for Ostelbien. The user will most likely never have heard of Last, since the town happens to have only 90 inhabitants. She will have no idea why all her answers consist of far-away results. The same applies to the designer of the search engine, who in addition might never find out about the problem, since it occurs only in queries for this particular region. This case makes a good example of discovering problems hidden within the data during the construction of a search engine.
The reason for these debugging problems is to be found in the mapping from terms to positions. One would have to take all tables with mappings from terms to positions and perform reverse lookups for all nearby towns to find the reasons for an erroneous mapping.
As a result, we will have to be especially careful when reasoning about the decisions we make in creating geographic footprints. In addition, we often have to check intermediate data sets for plausibility.
strong terms that are almost uniquely used in town names. They form a seed for a town name. Having found a strong term like Frankfurt on a web page tells us that we have found a town name. However, we might not know which Frankfurt.
weak terms that are common in the German language, but might help us later determine the precise city. So, if we find a Main on the same page on which we found the Frankfurt, we would be certain that this page is about Frankfurt, Main.
• The weak term Main maps to the strong terms Frankfurt and Offenbach, because of the two cities Frankfurt, Main and Offenbach, Main.
• The strong term Frankfurt is mapped to the two weak terms Main and Oder, through the existence of Frankfurt am Main and Frankfurt Oder.
We manually split the original set of terms into 3,000 weak and 55,000 strong terms. In addition, we created tables weak2strong and strong2weak that store the above n : m relationship in both directions. This process took several days.
The idea at this point is to first parse for all strong terms in a page, and in a second step for all weak terms that are associated with these strong terms. This method can reduce noise dramatically.
Weak terms can furthermore be categorized.
Weak terms can furthermore be categorized.
4 Frankfurt near the river Main
5 Frankfurt near the river Oder
• Other weak terms are just common German terms that are likely to appear anywhere in any document, without any geographic meaning. These terms only provide useful information when close to the main term. An oder 6 at the bottom of a page does not help us find out which Frankfurt was talked about much earlier on the page.
For this reason, we attached an integer distance to every weak term. When parsing for weak terms, the distance between a weak term and a matching strong term must not be more than distance for the weak term to qualify. This furthermore helped reduce the output from parsing, while increasing overall quality.
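The two-stage matching with a per-term distance limit might be sketched as follows; the data structures are illustrative, not the prototype's actual tables.

def match_towns(tokens, strong2weak, weak_distance, default_distance=5):
    """Find strong terms, then accept associated weak terms only within their distance."""
    matches = []
    for i, token in enumerate(tokens):
        if token not in strong2weak:
            continue
        weak_hits = []
        for weak in strong2weak[token]:
            limit = weak_distance.get(weak, default_distance)
            window = tokens[max(0, i - limit): i + limit + 1]
            if weak in window:
                weak_hits.append(weak)
        matches.append((token, weak_hits))   # e.g. ("frankfurt", ["main"])
    return matches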
In some rare cases, a town's main terms were so common that they had been sorted into weak. Under these circumstances, such a town name would not have been detected.
We therefore further refined the process by adding two additional tables: validators and killers. They allowed us to move the main terms that had ended up in weak to strong, without producing too many false hits.
validators map certain strong terms to other terms. After a strong term
has been detected, its environment is checked for the presence of one
of its validators. If no validator can be found, the term is discarded.
Most terms do not show up in this mapping, and will qualify for further
processing without the required presence of a validator.
killers are the exact opposite of validators. After a strong term has
been detected, its environment is checked for the presence of one of its
killers. If a killer is present, the term will be discarded.
To further increase flexibility, an integer distance was added to every entry.
It determines how far a validator or killer can be from a term and still affect
it.
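A minimal sketch of the validator and killer check might look as follows; the example table entries and the token-window distance handling are assumptions.

# Validators and killers for a detected strong term. A term listed in
# `validators` must have one of its validators nearby, otherwise it is
# dropped; a nearby killer always drops the term.
validators = {"wesel": ({"stadt", "rhein"}, 15)}   # term -> (validator set, distance)
killers    = {"essen": ({"trinken"}, 10)}          # term -> (killer set, distance)

def keep_term(term, pos, tokens):
    def window(dist):
        return set(tokens[max(0, pos - dist): pos + dist + 1])
    if term in killers:
        kill_set, dist = killers[term]
        if kill_set & window(dist):
            return False                        # a killer is close by: discard
    if term in validators:
        val_set, dist = validators[term]
        return bool(val_set & window(dist))     # require a validator nearby
    return True                                 # most terms need no validator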
Another problem was the common usage of town names as last names.
Many German last names originate in town names. They were usually meant
to describe the fact that a person's family originated from that town. Most
town-based last names like Marburger and Kirchhainer are in genitive form and will not
6 oder is also the German word for or
be detected by our parser, which searches for Marburg and Kirchhain exclusively.
In other cases however, there is no difference between a last name
and a town name.
To avoid falsely detecting town names when a page is actually talking
about a person's last name, we introduced another list of terms, the
general-killers. Any strong term within a distance environment
of a general-killer is immediately discarded, without further processing. We
manually compiled this data set from 3,000 first names and common titles,
like Herr, Frau or Dr.
Numeric Codes
Numeric codes such as area codes and zip codes can be extracted in three
steps:
2. Look them up in a table of all numeric codes and see if they appear.
If they do not, discard them. This step eliminates most false positives.
The five-digit format would allow for 100,000 zip codes. However,
there are only about 8,000 zip codes actually in use; all other five-digit
numbers can safely be eliminated.
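As a sketch of this lookup step, the roughly 8,000 valid codes can be kept in a set, and every five-digit candidate not contained in it is dropped; the file name and its one-code-per-line format are assumptions.

# Step 2 of the numeric-code extraction: keep only candidates that are
# actual zip codes, which removes most false positives.
import re

def load_zip_codes(path="zipcodes.txt"):
    # one valid zip code per line (assumed file format)
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def extract_zip_codes(text, valid_zips):
    candidates = re.findall(r"\b\d{5}\b", text)
    return [c for c in candidates if c in valid_zips]

# usage: extract_zip_codes("Philipps-Universitaet, 35032 Marburg", load_zip_codes())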
Abbreviations
Many terms that occur in compound city names are frequently abbreviated,
and there are often several abbreviations for the same term. The term Sankt
(Saint), for example, can be found abbreviated to St. or Skt. To treat abbreviations
correctly, we kept a table abbreviations that maps every abbreviation
to its full-length term. We then replaced all abbreviations in a page
before parsing it.
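A sketch of this replacement step; the table entries shown (including an English exonym, see the paragraph on Denglisch below) are examples only.

# Expand abbreviations before parsing, so that "St. Goar" and "Sankt Goar"
# are treated identically. The same table can also hold English city names.
import re

abbreviations = {"st.": "sankt", "skt.": "sankt", "munich": "münchen"}

def expand_abbreviations(text):
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(a) for a in abbreviations) + ")",
                         re.IGNORECASE)
    return pattern.sub(lambda m: abbreviations[m.group(0).lower()], text)

print(expand_abbreviations("Hotels in St. Goar"))   # -> "Hotels in sankt Goar"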
Denglisch
Germans freely mix English and German terms in their communications.
By its critics, this phenomenon has been called Denglisch, a mixture of
Deutsch and Englisch, German and English. Some major German
cities, such as München and Köln, have different names in English,
namely Munich and Cologne. For this reason, both languages have to be
treated equally and in parallel when analyzing German web pages. To map
the English terms to the correct city, we simply replace them with their
German counterpart whenever we encounter them. This process basically
came for free, since all we had to do was include the English terms in the
abbreviations table.
The problem is where the author intended a term to start and where to
end. For a human, this problem does not seem too hard, but can we
formalize it so that we can compute it efficiently?
Let us rephrase the problem: given a substring that we think refers to
a town name, how can we find out whether that is what the author meant to
express?
In the first case, the problem was easy: the terms were separated by a
non-letter character. We call this a strong delimiter. Strong delimiters are
all non-letter characters, including numbers, and the end of the string.
The term Frankfurt in the first example has two strong delimiters, a ”-” on
the left and an end-of-string on the right.
In the second example, it has only one strong delimiter, to the right.
The term kulmbach in the third example has two strong end-of-string delimiters,
while ulm has none.
In the fourth example, trier has no strong delimiters.
Detecting a term correctly is easy if it has two strong delimiters. But
what about the second example, where there is only one strong delimiter?
We think we can tell that frankfurt is a correctly recognized term, because
the string to its left contains another (non-geographic) term, fitness. We call
this a weak delimiter. Weak delimiters consist of all other geographic terms,
names of people and products, and common German or English words.
The term ulm in the third example has one weak delimiter, to its right 7.
The term trier also has a weak delimiter, to the right 8.
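The following sketch makes the delimiter rule concrete for a candidate substring: each end must be bounded either by a strong delimiter (a non-letter character or the string boundary) or by a weak delimiter (an adjacent known word). The small dictionary of known words is an assumption.

# Check whether a candidate town name inside a longer string (e.g. a URL
# component like "fitnessfrankfurt") is plausibly a term of its own.
known_words = {"fitness", "kasse", "bach", "frankfurt", "ulm", "trier", "kulmbach"}

def is_strong_delimiter(ch):
    return ch is None or not ch.isalpha()       # string boundary or non-letter

def has_weak_delimiter(s, start, end, left):
    # is the remaining text on that side itself a known word?
    rest = s[:start] if left else s[end:]
    return rest in known_words

def candidate_ok(s, start, end):
    left_ch  = s[start - 1] if start > 0 else None
    right_ch = s[end] if end < len(s) else None
    left_ok  = is_strong_delimiter(left_ch)  or has_weak_delimiter(s, start, end, True)
    right_ok = is_strong_delimiter(right_ch) or has_weak_delimiter(s, start, end, False)
    return left_ok and right_ok

s = "fitnessfrankfurt"
print(candidate_ok(s, s.find("frankfurt"), len(s)))   # True: weak + strong delimiter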
Parsing for geographic terms in URLs works as follows. We split the
URL into six sections:
• The TLD, which we ignore, since we only treat .de domains
Stemming
Stemming or conflation, the reduction of words to their roots, can be a great
help or a great burden, as in any other IR application. On the one hand, one
7 Bach is German for stream
8 Kasse is German for register
Every town t is associated with an ordered set terms of strong and weak
terms. We will use dot-notation to refer to a town's terms, so t.terms is
the set of terms associated with t. If the town is a village (t ∈ Villages),
it is associated with a city, t.city. If the town is a city (t ∈ Cities), it is
associated with a set of villages, t.villages. Terms are categorized as being
strong (∈ strong) or weak (∈ weak), according to Section 3.6.3. We can split
d.T according to this characterization and get d.S and d.W . Documents have
an importance ranking d.ir, similar to the rank in ordinary search engines.
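For reference, this notation can be mirrored in a small data model; the following is a sketch of the structures only, not the prototype's actual classes.

# Data model mirroring the dot-notation used in the text: towns with their
# ordered term lists, villages pointing to their city, cities to their
# villages, and documents with term sets and an importance ranking.
from dataclasses import dataclass, field

@dataclass
class Town:
    name: str
    terms: list             # ordered: main (strong) term first, descriptive later
    city: "Town" = None     # set if this town is a village
    villages: list = field(default_factory=list)   # set if this town is a city

@dataclass
class Document:
    url_id: int
    T: set                  # all geographic terms found on the page
    ir: float = 0.0         # importance ranking

    def split_terms(self, strong):
        S = {t for t in self.T if t in strong}      # d.S
        W = self.T - S                              # d.W
        return S, W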
At first, this problem does not seem so hard: matching a muenchen to the city
of München. The problems arise because the relationship between terms
and towns is n : m, not just n : 1. Strong terms can appear among the main
(early) or descriptive (later) terms of several cities.
There are several measures for how well the name of a town can be
matched from a set of terms. The most basic one simply counts the number of
terms that can be matched. If a town t's name consists of three terms t1, t2
and t3, so that t.terms = {t1, t2, t3}, and S = {t1, t2}, the town would be matched
with two terms, or 66%. This is just one of many measures and we will see
several others later. For now, this method provides a feeling for how to
measure a match.
A first naive algorithm would write out all towns t where t.terms contains
at least one strong term s from S. Since we will be assigning certitudes
to each detected town later, we could simply compute the certitude from the measure
9 marburger stadium is German for Marburg's stadium
for how well a town was matched. Poor matches would not receive much
certitude and would thus not show up prominently in the results.
t6 t6.terms = {waake, goettingen}, another small city with the same
characteristics.
Clearly the author of d did not mean to write about all these Göttingens.
• Most likely, she wanted to write about t1, the large city of Göttingen
in the state of Niedersachsen (Lower Saxony).
• The chance that she meant to write about any of the villages t7 - t9
is very small, since these are all tiny villages.
BB-First
BB-First tries to unite the completeness of a match with the exclusive intention
behind using a town's name. It does so by allowing every strong term
to cause at most one resulting town; hence it is exclusive. Additionally,
the town that a term matches to is determined by the quality of the match. In
addition, we try to follow our intuition and rather match a larger town than
a smaller town.
BB-First requires the towns to be sorted into categories according to their
size. It is actually an interesting question how to differentiate between large
and small towns. There are three basic measurements:
• One can also classify towns by the number of web sites that reside
in them. This measure can compare towns from a heterogeneous background. For
small towns, however, a single user who has registered several hundred
domains can introduce a significant misperception about the size and
importance of a town.
This measurement is easy to compute from whois records.
can easily be adapted to any level of hierarchy, inferred from various sources.
The idea behind BB-First is that, from a given set of strong terms, we
see what towns of the highest (biggest) category we can match. We then
write out the best-matched town and remove all its terms
from the set of strong terms. We run this step for as long as it produces
results. When it stops, we take the remaining strong terms and start the
algorithm over, trying to match towns from the next lower level.
This is what BB-First looks like for our application, with the two levels of
cities and villages:
The algorithm is based on two hypotheses:
BIG-IS-FULL There are more web pages about ”bigger” places. There are
more people, more companies, and more companies on the Internet in
larger cities than in smaller cities.
BIG-IS-DENSE The web pages about ”bigger” places usually rank higher.
Since the Internet is denser in larger places and large online companies
with high-ranking sites reside in larger cities, sites from a larger city
are on average expected to rank higher than those from small cities.
1  C := S                                          (a clone)
2  R := ∅                                          (the result initialisation)
3  E := {t ∈ Cities | t.terms ∩ S ≠ ∅}             (all possible new cities)
4  ∀e ∈ E : e.match(W ∪ C)                         (match them with all terms)
5  I := {i ∈ E | ∀e ∈ E : i.score ≥ e.score}       (those that match best)
6  S := S − ⋃_{i∈I} i.terms                        (decrease S)
7  R := R ∪ I                                      (add to result)
8  if (E ≠ ∅) go to line 3                         (repeat until no new cities)
9  else: proceed at next line
10 E := {t ∈ Villages | t.terms ∩ S ≠ ∅}           (repeat everything for villages)
11 ∀e ∈ E : e.match(W ∪ C)
12 I := {i ∈ E | ∀e ∈ E : i.score ≥ e.score}
13 S := S − ⋃_{i∈I} i.terms
14 R := R ∪ I
15 if (E ≠ ∅) go to line 10
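A compact rendering of these fifteen lines might look as follows; the match and score interface is sketched rather than taken from the prototype.

# BB-First: repeatedly match the best-scoring towns of one level against the
# remaining strong terms, first for cities, then for villages. `score(town,
# terms)` stands for the match-quality measure discussed in the text.
def bb_first(S, W, cities, villages, score):
    S = set(S)                       # work on a copy of the strong terms
    C = set(S)                       # the clone kept for matching
    R = []                           # result: matched towns
    for level in (cities, villages):
        while True:
            E = [t for t in level if set(t.terms) & S]
            if not E:
                break
            best = max(score(e, W | C) for e in E)
            I = [e for e in E if score(e, W | C) == best]
            for town in I:
                S -= set(town.terms)          # each strong term yields one town
            R.extend(I)
    return R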
BB-first to determine which of all possible towns has been matched best. We
will first describe the different measures that indicate the quality of a match,
before we show how to integrate them into a single measure. None of these
measures works on its own; we will discuss in which cases they fail and in
which they succeed.
match-count One method is to simply iterate over the matched entries of a
match and see which are not empty. We thereby count all terms that
have somehow been matched. This measure is rather intuitive, but
fails in certain cases.
If we have two towns a = (t1) and b = (t1, t2) and find that a document
d contains d.T = {t1}, clearly town a would be the better match,
even though a.match-count = b.match-count = 1.
If the towns were g = (t1, t2) and h = (t2, t1), g would be the better
match, even though g.match-count = h.match-count = 1. The
reason is that, most likely, we matched the main term in g and only a
descriptive term in h.
match-fraction This measure works similarly to the previous one, but divides
the result by the number of terms: match-fraction := match-count ÷
town.terms.length. It succeeds for the case of a = (t1) and b =
(t1, t2) if d.T = {t1}, but fails in case d.T = {t1, t2}, since we would
think that b is a longer, hence better, match, while the measure returns
a.match-fraction = b.match-fraction = 1. In the case of g =
(t1, t2) and h = (t2, t1), it also fails to give a preference to g.
match-first This measure is quite different from the above. It simply
iterates over the matched entries and looks for the first non-empty
one. This measure gives preference when a main term is matched, in
contrast to only a descriptive term being matched. Given d.T = {t1},
it will prefer g = (t1, t2) over h = (t2, t1). It will not be able to prefer
a = (t1) over b = (t1, t2), since a.match-first = b.match-first.
match-first-strong This measure is identical to the previous one, with the
only addition that the first matched term also has to be strong.
This measure makes sure that g = (s1, t2) gets preferred over h =
(w1, s2, t1), with s1 being strong and w1 being weak. The reason
behind this measure are predicates. As we said, these are usually
weak, in contrast to the main terms, which are often strong.
zip-match So far we have only looked at matching town names, but we
were also parsing for zip-codes. If we find a zip code for a town, clearly
this town makes a better match than another town with the same name
but a different zip-code.
term-count None of the above measures have taken into account, how
many times a term has been detected on a page. Assume d.T = {t1 :
As we already pointed out, none of these measures works on its own, and
even this list of measures is not complete. The task of determining a better
match, for resolving possible ties in the matching of terms and towns, is not so
hard. The goal is not a perfect match, but a good match that can
be computed reasonably fast and gets at least the most common cases right.
We created a very simple measure, called a matchtown's score, by mapping
each of the above measures to an integer interval 0 - 9 according to
its ”goodness” and multiplying the results. This score function proved
satisfactory.
It had to be ”manually” adapted for the case of Frankfurt Oder and Frankfurt
Main, two cities with the latter being several times more important to
Germany's economy and Internet infrastructure than the former. If in doubt,
finding nothing but a Frankfurt on a page, we wanted to choose the
latter. We therefore added a value of 1 to every score if the city's terms
contained the term Main.
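As a sketch, the combined score can be computed by clamping each individual measure to the range 0 - 9 and multiplying, with the manual bonus added on top; the prior scaling of each measure into that range is assumed to have happened elsewhere.

# Combine the individual match measures into one score: each measure is
# mapped to an integer 0..9 and the results are multiplied. A value of 1
# is added if the town's terms contain "main" (the Frankfurt tie-breaker).
def clamp09(x):
    return max(0, min(9, int(round(x))))

def combined_score(town, measures):
    # `measures` is a dict of measure values already scaled to roughly 0..9,
    # e.g. {"match-count": 2, "match-fraction": 0.66 * 9, "zip-match": 9}
    score = 1
    for value in measures.values():
        score *= clamp09(value)
    if "main" in town.terms:
        score += 1
    return score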
By similarly checking BB-first against pathological but reasonably common
cases, we found we had to modify it further.
BB-First+
The above simple version of BB-first strictly prefers cities over villages, by
matching against all possible cities first and only thereafter matching against
villages. The idea was that, according to BIG-IS-FULL, chances are much
higher that a web page points to a city than to a village. This situation however
changes if we have already found hints to the city that is associated with
the village. The hypothesis behind this observation is
CLUSTERS-LIKELIER This assumption is that if a page points to a town
t and contains another vague hint to either town a or town b, it most
likely points to a if a is significantly closer to t than b.
This observation already contributed to the nearby-town-count measure
and will once again be simplified to the city - villages relation.
The following real-world example illustrates the need for the adaptation
of BB-first that we will introduce below. Let us assume a page d with d.T =
{ frankfurt, sachsenhausen }. There are plenty of possibly matching towns,
but by applying the above score method, we can narrow
them down to the following more interesting ones:
c1 c1.terms = {frankfurt, main}, a large city.
c2 c2.terms = {frankfurt, oder}, a medium-size city.
c3 c3.terms = {sachsenhausen, weimar}, a small city.
v1 v1.terms = {sachsenhausen}, a borough of c1.
Notice the relation between c1 and v1. Everybody familiar with German
geography would, just from looking at { frankfurt, sachsenhausen }, infer that
the author meant to write about v1 and possibly about c1, and certainly
about none of the other towns. BB-first however would behave quite differently:
it would first match c1, which because of the partner-town-count receives
the highest score. Since it removes frankfurt from S, the term sachsenhausen
would be the only term left to match and would naturally match to
the city c3.
Taking a simplified version of CLUSTERS-LIKELIER into account, we
can modify the algorithm so that it produces a much better looking result. After
every extracted city c, we check whether we can find an associated village v from
c.villages for which there is a strong match. If we do, we include it in the
result and remove its terms from S. This algorithm is called BB-first+ and
can be seen in Table 3.9.
So far, our focus has been on matches between textual terms and towns,
but we have overlooked numeric terms, such as area codes or zip codes. Area
codes originate in a different data set and will be treated separately. First, we
will have to see how to adapt BB-first one more time, to allow for matching
that emphasizes zip codes.
BB-First++
As pointed out in Section 3.6.2, zip codes are a particularly strong hint for a
geographic location. In fact, they are so much stronger than textual terms
that we need to treat them separately, before proceeding with the textual
terms. We will adapt BB-first+ so that it first treats the set Z of zip codes found
in a document, in BB-first style, and then treats the textual terms
in BB-first+ style. Again, we have to respect the CLUSTERS-LIKELIER
hypothesis. We have to ensure that an already detected town's partners
receive preferred treatment. Partners of cities and villages are defined as:
• a town that has been extracted from the page’s domain’s whois entry
In the next section, we will show how to combine this information to infer
new information about one page from related pages. The whois entries will
also be integrated at this later stage. For now, we will focus on the first two
records that are stored in this text format:
url-id (city-name longitude latitude certitude)*
We have not described the certitude yet. It is a measure for how strongly
we believe that a page holds information about a town. It is computed as
a rule-based combination of measures similar to those in Section 3.6.6. It
ranges from 1 for the weakest to 255 for the strongest.
In a first step, we want to transform this into our coordinate system. We
also want to get rid of the town names, since they are of no further importance
at this point. As pointed out in Section 3.6.1, we want to use a 1024 × 1024
grid to store the footprint. We will create two initial footprints for each
page, one for the page-based record and one for the URL-based record. So
we transform each of the above entries from our text files into new entries in
another text file in the following format:
url-id (x-id y-id certitude)*
where x and y both range in the interval 0 - 1023.
When transforming a town's coordinates to our coordinate system, it
can happen that two towns map onto the same tile. In this case,
the tile's certitude is the sum over the certitudes of all towns that map to
this tile. We also make a distinction between cities and villages. In the case
of a village, the certitude of the village is only added to the tile (x, y) that
it maps to. In the case of a city, additionally 50% of its certitude is added
to the tiles (x + 1, y), (x − 1, y), (x, y + 1) and (x, y − 1), as well as 25%
to (x + 1, y + 1), (x − 1, y − 1), (x − 1, y + 1) and (x + 1, y − 1). We
make this distinction, as illustrated in Figure 3.1, because cities are usually
geographically more spread out, as well as having a bigger region that they
draw clients from.
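A sketch of this grid construction, including the smearing of a city's certitude into its eight neighboring tiles; the conversion of town coordinates into the 1024 × 1024 grid is not shown.

# Build an initial footprint on the 1024x1024 grid. Villages contribute
# only to their own tile; cities additionally add 50% of their certitude
# to the four edge neighbors and 25% to the four diagonal neighbors.
from collections import defaultdict

GRID = 1024

def add_town(footprint, x, y, certitude, is_city):
    footprint[(x, y)] += certitude
    if is_city:
        for dx, dy, share in [(1, 0, .5), (-1, 0, .5), (0, 1, .5), (0, -1, .5),
                              (1, 1, .25), (-1, -1, .25), (-1, 1, .25), (1, -1, .25)]:
            nx, ny = x + dx, y + dy
            if 0 <= nx < GRID and 0 <= ny < GRID:
                footprint[(nx, ny)] += share * certitude

footprint = defaultdict(float)
add_town(footprint, 512, 400, certitude=200, is_city=True)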
For the next and final step in building this exploratory prototype,
post processing and query processing, we will need a smaller and faster
implementation of a geographic footprint. It will need to support:
• A lot of neighboring tiles will have equal or at least similar values for
the certitude.
int getSize() A function that returns the size of the memory that the
footprint currently occupies. This function is needed for writing the
footprint out to disk and for optimizing memory usage.
1  C := S
2  R := ∅
3  E := {t ∈ Cities | t.terms ∩ S ≠ ∅}
4  ∀e ∈ E : e.match(W ∪ C)
5  I := {i ∈ E | ∀e ∈ E : i.score ≥ e.score} ∪ {v ∈ ⋃_{i∈I} i.villages | v.terms ∩ C ≠ ∅}    (add qualified villages)
6  S := S − ⋃_{i∈I} i.terms
7  R := R ∪ I
8  if (E ≠ ∅) go to line 3
9  else: proceed at next line
10 E := {t ∈ Villages | t.terms ∩ S ≠ ∅}
11 ∀e ∈ E : e.match(W ∪ C)
12 I := {i ∈ E | ∀e ∈ E : i.score ≥ e.score}
13 S := S − ⋃_{i∈I} i.terms
14 R := R ∪ I
15 if (E ≠ ∅) go to line 10
1  C := S
2  R := ∅
3  E := {t ∈ Cities | t.zip-terms ∩ Z ≠ ∅}            (extract cities for zips)
4  ∀e ∈ E : e.match(W ∪ C ∪ Z)
5  I := {i ∈ E | ∀e ∈ E : i.score ≥ e.score}
6  S := S − ⋃_{i∈I} i.terms
6  Z := Z − ⋃_{i∈I} i.zip-terms
7  R := R ∪ I
8  if (E ≠ ∅) go to line 3
9  E := {t ∈ Villages | t.zip-terms ∩ Z ≠ ∅}          (extract villages for zips)
10 ∀e ∈ E : e.match(W ∪ C ∪ Z)
11 I := {i ∈ E | ∀e ∈ E : i.score ≥ e.score}
12 S := S − ⋃_{i∈I} i.terms
13 Z := Z − ⋃_{i∈I} i.zip-terms
14 R := R ∪ I
15 if (E ≠ ∅) go to line 9
16 I := {p ∈ ⋃_{r∈R} r.partner-towns | p.terms ∩ C ≠ ∅}    (handle partners for zips)
17 S := S − ⋃_{i∈I} i.terms
18 R := R ∪ I
19 E := {t ∈ Cities | t.terms ∩ S ≠ ∅}                (extract cities for terms)
20 ∀e ∈ E : e.match(W ∪ C)
21 I := {i ∈ E | ∀e ∈ E : i.score ≥ e.score} ∪ {v ∈ ⋃_{i∈I} i.villages | v.terms ∩ C ≠ ∅}    (add qualified villages)
22 S := S − ⋃_{i∈I} i.terms
23 R := R ∪ I
24 if (E ≠ ∅) go to line 19
25 else: proceed at next line
26 E := {t ∈ Villages | t.terms ∩ S ≠ ∅}              (extract villages from terms)
27 ∀e ∈ E : e.match(W ∪ C)
28 I := {i ∈ E | ∀e ∈ E : i.score ≥ e.score}
29 S := S − ⋃_{i∈I} i.terms
30 R := R ∪ I
31 if (E ≠ ∅) go to line 26
loadFP This function reads a line from a text file and initializes a new bitmap.
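To illustrate what such an interface could look like, here is a minimal sketch of a footprint that stores only non-empty tiles and offers getSize and loadFP; the parsed line format is the one introduced above (url-id followed by x, y, certitude triples), everything else is an assumption.

# Minimal in-memory footprint: only non-empty tiles are stored, exploiting
# the fact that most of the 1024x1024 grid is empty.
import sys

class Footprint:
    def __init__(self):
        self.tiles = {}                 # (x, y) -> certitude

    def add(self, x, y, certitude):
        self.tiles[(x, y)] = self.tiles.get((x, y), 0) + certitude

    def get_size(self):
        # rough size in bytes of the memory currently occupied
        return sys.getsizeof(self.tiles) + 3 * 8 * len(self.tiles)

    @classmethod
    def load_fp(cls, line):
        # line format: "url-id x y certitude x y certitude ..."
        fields = line.split()
        url_id, values = int(fields[0]), fields[1:]
        fp = cls()
        for i in range(0, len(values), 3):
            fp.add(int(values[i]), int(values[i + 1]), int(values[i + 2]))
        return url_id, fp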
unite all these hypotheses and call the sum of the propagation operations
we perform according to these hypotheses Intra. Also, we have not incorporated
the geographic information retrieved from whois entries so far.
The main issues in this section are:
Of course, we could run each step several times and execute them in any
order and still fulfill the two requirements. One has however to be
careful not to propagate information too freely; otherwise pages will inherit
so much geographic information over distant links that every page's footprint
will cover the entire nation.
Another problem with serial execution of propagation is the echo phenomenon.
It can occur across links as well as within sites, but we will look at the case
within sites only. Say a site has 1000 pages and only one of them, p, has
a non-empty geographic footprint from the URL and page analyses. After
the first site propagation, all other pages receive f times p's footprint. After
another site propagation, these footprints echo back, and p receives roughly 1000
times a dampened copy of its own footprint. We can repeat these steps several times
and the values in p's footprint grow indefinitely. There has however been no real reason for p's footprint
to grow much in value, since no new geographic information
has actually been discovered; we are only moving p's initial values around. Echoes
are unavoidable in serial processing of propagation, but should be kept to a
minimum. We therefore stick to the minimum propagation:
Let us next see how to perform Allinks. It is supposed to deal with the propagation
of geographic information along links, according to the RADIUS-ONE
and RADIUS-TWO hypotheses. We could determine the cases where we would
have to propagate according to either of these two hypotheses and perform
them independently. We would split Allinks into two steps: Radius-1
and Radius-2. The RADIUS-ONE hypothesis, as pointed out in Section
3.6.2, is symmetric, and Radius-1 needs to be split into two steps, Link
and Backlink, that propagate information along links and in the opposite
direction. If we wished to perform Radius-2 with a freely chosen dampening factor,
we would have to implement this step explicitly. If however we
are happy as long as a sufficient amount of information is propagated according
to the RADIUS-TWO hypothesis, we have another option. We can observe
that if we first perform Backlink and then Link, with a dampening factor
of f each time, we achieve the same effect as performing a true Radius-2
propagation with a dampening factor of f². Let us look at an example: say
pa links to both pb and pc. The RADIUS-TWO hypothesis tells us that we
would like to exchange geographic information between these two pages. If
we perform Backlink, pb's geographic footprint will find its way to pa. The
next step, Link, will propagate all of pa's footprint to pc, including the traces
of pb's footprint. Equally, after the two steps, pb will have inherited some of
pc's footprint.
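A sketch of this propagation order; the footprint arithmetic is kept abstract, and the dampening factor f is a free parameter.

# Propagation along links: Backlink first, then Link, each with dampening
# factor f. Two pages pb, pc that share a common predecessor pa thereby
# exchange information with an effective factor of f*f, i.e. a Radius-2 effect.
def add_scaled(target_fp, source_fp, f):
    for tile, value in source_fp.items():
        target_fp[tile] = target_fp.get(tile, 0) + f * value

def backlink(pages, links, f):
    # links: list of (source, target) page ids; pages: id -> footprint dict
    for src, dst in links:
        add_scaled(pages[src], pages[dst], f)    # target's footprint flows back

def link(pages, links, f):
    for src, dst in links:
        add_scaled(pages[dst], pages[src], f)    # source's footprint flows forward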
In our implementation, we adapted this procedure and decided on the following
propagation order:
footprints with a dampening factor that depends on how close the pages are.
If they are within the same directory, they inherit more from each other, and
if they are just in the same site, they inherit relatively little. The problem
was that from some domains we had crawled as many as 100,000 pages. If
we performed an O(n²) algorithm on all sites, it would literally have taken
several weeks, if not months. This was clearly too slow, and we were looking
for an algorithm that runs in O(n).
The solution was an aggregation tree over every domain. First, we
rewrote all URLs from the crawl in a hierarchic format, starting with the
largest entity:
domain subdomain host path file
Next, we sorted the URLs according to these columns, so that all pages
within the same domain were next to each other, all pages within the same
subdomain within that domain were next to each other, and so forth. Then
we built a five-level tree over all entries of each domain, in linear time. The
tree was built bottom up with these levels:
1. A leaf level for individual pages.
2. A level of nodes for all directories.
3. A level of nodes for all hosts.
4. A level of nodes for all subdomains.
5. A root node for the entire domain.
Each node holds a footprint that is the sum over all footprints of the pages
at the leaf level rooted at this node. Since the entries are already sorted,
each tree can be built in linear time. First only the bottom-level footprints
are initialized; all others are empty. Next, aggregates are computed and
assigned to the parent node on the level above. Once one
level is complete, we move to the next level up and repeat. Bottom
up, we fill all entries, right to the top of the tree.
In the next step, we propagate information back to the leaves, by pushing
it down in a top-to-bottom manner. Every node's footprint is added to its
children's footprints, always processing all nodes of one level at a time, of
course with some dampening factor f < 1. The footprint from the root, for
example, will find its way f⁴ times into a leaf node, while the footprint of a host-level
node will contribute with a dampening factor of f². This way, pages
that are closer to each other contribute more to each other than pages that
are situated more distantly within the site. After we have done this, the leaf
nodes contain their correct and final values, and we can simply delete the
upper levels of the tree and move on to the next domain.
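A simplified sketch of this aggregation over one domain: footprints are summed bottom-up along the URL hierarchy and then pushed back down with a dampening factor f per level; the footprint type is again a plain tile-to-certitude dictionary, and the four-component path tuple is an assumption.

# Aggregation tree over one domain, levels: root > subdomain > host >
# directory > page. Built over the sorted URL list, then pushed top-down.
from collections import defaultdict

def site_propagation(pages, f=0.3):
    # pages: dict mapping a path tuple (subdomain, host, directory, file)
    # to a footprint dict {tile: certitude}; footprints are updated in place.
    nodes = defaultdict(lambda: defaultdict(float))   # ancestor prefix -> footprint
    children = defaultdict(set)                       # prefix -> child prefixes / pages

    # bottom-up: aggregate each page into all of its ancestor nodes
    for path, fp in pages.items():
        for depth in range(4):
            prefix = path[:depth]
            children[prefix].add(path[:depth + 1])
            for tile, value in fp.items():
                nodes[prefix][tile] += value

    # top-down: push each node's footprint to its children, one level at a time
    frontier = [()]                                   # start at the domain root
    while frontier:
        next_frontier = []
        for prefix in frontier:
            for child in children.get(prefix, ()):
                target = pages[child] if child in pages else nodes[child]
                for tile, value in nodes[prefix].items():
                    target[tile] = target.get(tile, 0) + f * value
                if child in children:
                    next_frontier.append(child)
        frontier = next_frontier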
So far we have yet to bring in the whois information. We could simply
add it to every page, but since we already build the aggregation tree described
above, we can simply add it to the root and then push it down with
Selection step In this step, the answer set is generated, mainly by inter-
secting entries from the inverted index.
Ranking step In this step, the answer set, as created in the previous step,
is ranked.
The first two steps will exclusively rely on information in main memory,
while only the last step may have to retrieve information from secondary
storage.
Query processing tries to return tuples as fast as possible, while having
to deal with a limited main memory. The general guideline is to shed tuples
that will not qualify as answers as early as possible.
Search engine query processing can, in contrast to database query processing,
make use of two search hypotheses, making the process a lot
easier. According to the SHALLOW-PENETRATION hypothesis, the user
will only see the top s answers, at most. The query therefore does not have
to be computed completely. As soon as it becomes obvious that an answer
will not make it into the top s, it can immediately be discarded.
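A sketch of exploiting this hypothesis: candidates are streamed through a bounded min-heap of size s, so anything that cannot enter the top s is dropped immediately; the scoring function and the candidate ids are assumptions.

# Keep only the top-s answers while scanning candidates (e.g. url-ids);
# anything that cannot enter the top s is discarded immediately.
import heapq

def top_s(candidates, s, score):
    heap = []                                    # min-heap of (score, candidate)
    for cand in candidates:
        sc = score(cand)
        if len(heap) < s:
            heapq.heappush(heap, (sc, cand))
        elif sc > heap[0][0]:
            heapq.heapreplace(heap, (sc, cand))  # evict the currently worst
        # otherwise the candidate is discarded right away
    return sorted(heap, reverse=True)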
According to the BATCH hypothesis, the user will first only see the first
b answers, read them and then proceed to the next batch. The speed of
query processing is therefore to be measured by how fast the first batch of
answers is produced. Since human readers are rather slow, there will be
plenty of time to compute the later batches. We tried to make the greatest
use of these hypotheses and designed the following query processing:
For every web page we kept a geographic footprint in main memory.
Since not all footprints would fit in main memory at the same time in their
original size, we called simplify on all geographic footprints until they did.
These footprints are of course only approximations of the actual geographic
footprints, which we keep in secondary storage. We therefore give them a new
name and call them m-prints.
Every query a user posts consists of two parts:
• A geographic position.
Zones The set of answers can be grouped into zones, according to their distance
to the query position. All entries that fall into the specified zone
qualify for output. The user may jump between zones of different sizes
after every batch of answers seen. This approach basically pushes the
steering into the selection step.
An actual balance The balance between the importance ranking and the
distance ranking can be readjusted after every batch the user has seen.
This approach basically leaves the steering in the ranking step.
The user does not actually have to be aware of which technique, or mix of the
two, is used.
There are various possible interfaces for both approaches, as shown in
Figure 3.2, Figure 3.3 and Figure 3.4.
Either way, the user gets to dynamically change the balance between
importance and distance. She can continue a query that returns not so
great results and steer it in a better direction, without having to start the
query over or having to see any answer twice. The iceberg is showing its
first crack. In addition, we expect web spam to become a lot harder, since
there are more dimensions the spammer will have to cover.
3.9.1 Google
The Search by location [inc03] by Google [Goo] is the most prominent prototype
of a geographic search engine. However, it does seem to limit itself to
pages from sites of companies that are listed in a business directory. Thus, it
Figure 3.2: One possible interface for interactive geographic search would
simply show the user two alternatives to the ”next” button: one to shift
the balance towards importance, the other towards distance. This approach
would work for either zones or a balanced ranking.
Figure 3.3: This interface would allow the user to directly change the balance
between importance and distance. It allows faster changes, but cannot be
realized in standard HTML. It can also be used either for zones or a balanced
ranking.
Figure 3.4: Another interface could allow the user to directly skip between
different zones. These could be measured in kilometers or conceptualize
space similarly to humans, like the one in this figure.
resembles a ”search inside the yellow pages” more than an actual geographic
search engine searching an entire web crawl.
After the user has entered a position and the key words, she is presented
with the addresses of several businesses in this area that carry related
information on their web sites. The user can then click through
to the actual website. Screenshots of these three steps and the user interface
are provided in Figure 3.5, Figure 3.6 and Figure 3.7.
It must be noted that Google's ordinary search engine allows narrowing a
search to sites from a particular country, a very broad, but still
geographic, search. It can be observed that the results all come from domains
that are either registered under the country's top-level domain, such as .de,
or contain a reference to the country somewhere in their whois entry.
3.9.3 www.search.ch
This commercial search engine focuses on Switzerland. It allows limiting
the search to one of Switzerland's cantons, which are equivalent to states or
counties. This is rather broad and not very flexible, as is a fixed grid's
nature. A user living on the border between two cantons will still have to
issue several queries. Nonetheless, in contrast to the previous prototypes,
this is a real geographic search engine that treats pages independently of
their web master's association with a business directory. Figure 3.11 shows
the search engine's interface with the option to specify a canton such as
Zürich. [Räb]
3.9.4 www.umkreisfinder.de
This German search engine specializes in geographic search. It does not,
however, extract geographic information, but relies solely on entries from
the Open Directory Service [Opea], where web sites are categorized by geography
as well as by subject. In comparison to the total Internet, the Open
Directory Service is quite sparse, and umkreisfinder is therefore very limited.
It seems to crawl the pages of sites that it knows a position for and
assigns the site's position to all pages that belong to it, implicitly making
use of the INTRA-SITE hypothesis. Umkreisfinder adds geographic markup
to its pages, as discussed in Section 3.6.2 and shown in Table 3.6.
Figure 3.5: Entering a position and some key words into Google’s interface.
Figure 3.6: Google shows relevant addresses from some sort of a business
directory.
Figure 3.7: After a click on a company’s address, Google finally shows rele-
vant web pages.
Figure 3.8: The initial interface of Yahoo!'s local search, where the user enters
an address and some key words.
Figure 3.10: Yahoo! shows the company's address, a map of the surroundings
and a link to the company's web site.
3.9.5 GeoSearch
This academic prototype of a geographic search engine [Gra03] searches
articles from 300 online newspapers such as www.nytimes.com. The user
interface is rather simple, allowing the user to enter the key words and a
zip code. Similar to a geographic footprint, the authors define a so-called
geographic scope of a web page. This concept is described in more detail in
Section 3.6.2. There are web sites of different geographic scopes: some, like the
New York Times 10, cover the entire US, others, like The Digital Missourian 11,
cover only a state, and some even cover only a town. The search engine's answers
to a query contain an icon next to every entry that shows the entry's geographic
scope. Figure 3.14 shows icons for two typical geographic scopes.
For a web page to show up in the answers to a query, it needs to contain
the key term, and its web site's geographic footprint needs to intersect the
position the user entered. Documents from sites with a national geographic
scope always fulfill the latter requirement. The importance ranking on the
final set of answers is then recomputed. There is however a major misconception
regarding the geographic scopes. Let us say a user is searching for pizza for
90210 (Beverly Hills, CA). The top answers all come from www.nytimes.com.
Even though this site's geographic scope covers the entire nation and it is supposed
to be an authority on pizza anywhere in the country, this is probably
not what the user was looking for. Geographic search engines are used to
find locally important documents, not globally important documents. If
the user wanted to find the latter, she would use a traditional search engine.
3.9.6 geotags.com
This prototype of a search engine [Dav99] relies on web pages being augmented
with geographic markup tags from [Dav01], as described in Section
3.6.2. It requires authors of web documents to implement these tags
and register them with the search engine. The engine will then crawl these
pages and index them. Due to the low commercial impact and the need to
register, the index is very sparse, empty for most regions. The interface,
as shown in Figure 3.15, allows the user to zoom to the desired position,
a somewhat time-consuming task. Alternatively, the user can store her
position in a cookie.
3.10 Impact
Geographic search may well prove to be the next generation in search technology
and dramatically reshape the landscape of today's media market. It will
shift traffic and revenue flows of e-commerce and reshape today's e-economy.
Geographic search might prove to be the killer application for broadband
Internet over cellular phones. The main applications to be launched
over broadband cellular phones, which usually integrate a PDA's functionality,
are centered around the user's position and are therefore called location based
services. The user's position can be determined from the cell she roams in,
or from an external GPS device. A typical application, as envisioned today,
would allow a user to search for the nearest car repair shop from a database
backed by the manufacturer. A useful service, especially if there is trouble
with the car in a distant region, but not a service we could not live without. A
geographic search engine however, which allows searching for anything, not
just car dealers, might just be the long-sought killer application of location
based services. It would allow searching for any given key words, including
a nearby ”car repair”, without restricting results to shops that are licensed by
the manufacturer.
Even if just installed on a local PC, geographic search engines' impact
might be dramatic. They will move traffic from popular web sites to smaller
(local) sites as well as reshape the entire advertisement market.
Under current conditions, the Internet poses an extreme ”winner takes all”
situation when it comes to traffic distribution. A few of the largest corporate
web sites, such as www.amazon.com, receive most of the traffic, while many
smaller companies with a more local focus, such as www.fahrradschmiede.de,
are happy about every single user that visits. Underlying this extremely unequal
distribution is a vicious cycle, where popular sites receive more
links, therefore more importance ranking in search engines, therefore more
Figure 3.16: The Spirit Project plans on recognizing a hand-drawn sketch
such as this one, to determine the region the user is interested in.
Other small stores without mail order, especially restaurants and services,
will not advertise in this advertising scheme. No local pizza parlor will ever
pay to have its ad shown every time a user enters pizza. Geographic search
engines would enable ad placement agencies to offer localized ad placement,
where an advertiser does not only specify the key words for which she wants to
advertise, but also a region within which she wants to advertise. Only if a user
types in the key words as well as a position within the specified region
will she be shown the advertiser's ad. This feature will reshape the type of
companies that place ads with the ad placement agency. Now it makes
sense for companies with a local focus to advertise online. The pizza parlor
from above will want to advertise every time a user enters pizza and queries
for a position within a radius of around 500 meters around the pizza shop.
The low cost and efficiency of online advertisement will move large amounts
of advertising budgets from print media, mainly from photocopied leaflets
in the case of our pizza parlor, to online advertisement.
Chapter 4
Geographic Data Mining
One of the initial ideas behind this project was to explore the possibilities
of geographic web mining. For one thing, it turned out that building the
geographic search engine was a larger task than initially thought. For another,
while developing the search engine, we learned a lot about the underlying
problems of geographic web mining, which made the latter look even harder.
Nonetheless, we will provide a brief overview of basic web mining and its
potential for a geographic extension.
Web mining has been around for a while, sometimes under an alias like
”market research”. The basic underlying assumption is that the Internet
always somehow reflects society. Studying the web can therefore to some
degree replace direct social research.
Done correctly, it can be cheaper and faster than traditional methods like
door-to-door or telephone surveys. Web mining allows the collection of information
that would otherwise not be available or would be expensive to create. Once
a web crawl has been collected and processed, results for mining queries can
be returned within hours, in contrast to the weeks a door-to-door survey
would take. In addition, its costs are somewhat constant (for operating the
infrastructure) and not linear in the number of queries, like door-to-door
surveys. For frequent use, a web mining system might therefore even prove
to be cheaper.
There are however some limits to web mining. Its view of society is always
somewhat distorted and subject to a time shift. Tips on Java programming,
for example, are more common on the web than knitting patterns, even
though in real life there might be more people knitting than programming.
Similarly, it may take an event in real life days or weeks to find its way onto
the Internet. Or the Internet news might precede the actual event, such as
when some president publicly ponders how to get people to the planet
Mars. These factors must be taken into account. It must however be noted
that traditional surveys are not without their flaws either, and their results
always undergo heavy interpretation.
The web has also become a part of the real world, with real web phenomena
to be studied and real money to be made. Here, web mining is a prime
source of information.
Web mining is already widely implemented, but not often advertised. It is
already a serious business, while still awaiting its big take-off. Large companies,
for example, use it to detect pressure groups before these make it to the
press, or try to detect new trends in youth culture. IBM, for example, offers
these services to customers as part of their Web Fountain project [IBM].
The extension of web mining to geographic features comes rather naturally.
A piece of German business wisdom states that ”all business is local”.
Geographic components are therefore a big improvement to web mining. The
geographic component can be used in web mining mainly for two reasons:
IBM's Web Fountain is already heading in this direction [LAH+ 04], and companies
like MetaCarta [Met02] offer geographic web mining to commercial
and governmental clients. An early predecessor of geographic web mining
focused on whois entries instead of actual web pages. Papers like
[Zoo01], [Kry00] or [SK01] mainly used this technique to reason about the
productivity of different regions during the boom of the new economy.
At first, geographic web mining seems rather straightforward. Web
mining, like any data mining, is inherently multidimensional, and the addition
of two more dimensions should pose no problem. The adaptation of
other techniques, such as the query techniques of [RGM03], should be straightforward.
The authors of [MAHM03] have shown a first infrastructure, and
many techniques for geographic data mining, such as finding geographic correlations
between incidents [PF02], can be directly applied. There is however
one major flaw, hidden in the data.
The geocoding underlying all geographic web mining is not yet at a level
that produces meaningful results. During geocoding, especially during
geomatching as described in Section 3.6.6, when resolving ambiguities, we
made many decisions based on the assumption that a page mentioning Göttingen
intends to talk about one of the towns with that name, not all of them. We had
argued that in these cases the bigger town should be matched exclusively,
and used the ABUNDANCE hypothesis to justify this decision. This however
was a hypothesis from a search engine background that has no justification
in a web mining environment. The exclusive use however is a phenomenon
that web mining should manage to take into account. Until these basic
decisions in geocoding are solved sufficiently, the data is simply too messy
to perform meaningful geographic web mining on it.
Chapter 5
Addendum
5.1 Thanks
I would like to thank my two advisors, Bernhard Seeger and Torsten Suel, and their
teams, for providing know-how, guidance and friendship throughout this thesis.
Thanks to Thomas Brinkhoff for co-authorship in earlier work about geographic
information retrieval.
I am in great debt to Yen Yu Chen and Xiaohui Long for their great spirit, and
cooperation in implementing our prototype.
Also thanks a lot to Jan Lapp, for support in implementing the earlier prototype
that relied entirely on whois entries.
Thanks a lot to Dimitris Papadias from Hong Kong University of Science and
Technology for inviting me to do my Ph.D. with him and for facilitating the appli-
cation. This certainly helped me ease my mind about the future and focus on this
thesis.
I would like to thank the Deutscher Akademischer Austauschdienst (DAAD) for some
financial support for the first three months of my stay in Brooklyn, and my parents
for generous support, covering whatever was lacking.
Thanks to Utku and Chin Chin in Brooklyn for keeping me smiling, and to
Sven Hahnemann for technical support before moving.
I owe a lot to Julinda Gllavata and Ermir Qeli in Marburg for sending me funnies,
and giving me support whenever it was needed. You are great friends.
Thanks to www.fahrradschmiede.de, Christoph and Mama Michel, for force-feeding
me cake and pasta all these years and to Brian and Tilly for cracking me up every
lunch break.
Thanks to Steffie, Niesch, Lolle and Tommy for being who you are.
Also thanks a lot to Horst Schmalz at www.edersee-tauchen.de, who called me every
single week for four months to check and make sure I was all right.
Thanks to the Krusas in Montauk, NY, for hosting me on my weekends and telling
me stories about getting lost inside sharks and being bitten by a 25 pound lobster.
Thanks to these and all other kind people. You were some great friends.
I explicitly do not want to thank the German DENIC for giving us no support
whatsoever. I truly believe that academic relations should not be handled by the
PR department, ever.
Also, no thanks to Lycos Europe and Web.DE, who did not even bother to reply
to any of our letters asking about possible cooperation. If more people were like
you, stories like ”Google” would never have happened.
Bibliography