Diploma Thesis in Computer Science
submitted by
Alexander Markowetz
supervised by:
Prof. Dr. Bernhard Seeger, Philipps-Universität Marburg, Germany
and
Torsten Suel, Assistant Professor, Polytechnic University Brooklyn,
New York, USA
September 2004
Marburg an der Lahn
I affirm that I have written this thesis without outside help and without using any sources other than those indicated, and that the thesis has not been submitted in the same or a similar form to any other examination authority, nor been accepted by one as part of an examination. All passages taken verbatim or in substance from other sources are marked as such.
Alexander Markowetz
Abstract
This paper describes geographic search engines, which accept geographic coordinates in addition to the ordinary set of key terms. They allow the user to narrow a search to documents containing information about the specified geographic region. The development of an exploratory prototype of a geographic search engine is described step by step, presenting the theoretical background behind each decision as well as the alternatives as they arise. Before introducing geographic search engines, the paper provides a brief overview of traditional search engines.
In this work, every document receives a (possibly empty) geographic footprint: a collection of all regions the document provides information about. The main part of the paper focuses on the creation of these footprints. Different sources of information about the geographic foci of web documents are discussed, emphasizing the extraction of geographic references from documents and URLs, as well as exploiting whois entries. The process of extracting terms that might indicate geographic entities, as well as the matching between terms and entities, including the resolution of ambiguous cases, is discussed in detail. Once these initial footprints have been created for a body of documents, they can be enhanced by propagating their information along links and within documents from the same site or directory. While there is a multitude of options for query processing techniques, the prototype makes use of a straightforward approach.
The paper furthermore provides a brief overview of the possibilities of web mining on the extracted geographic information.
Contents
1 Foreword
2 Introduction
5 Addendum
5.1 Thanks
Chapter 1
Foreword
this prototype was going to not only use whois entries but also extract geographic information from the web documents.
At the time of writing, in late August 2004, the prototype is close to being deployed. The project has been full of (often not so pleasant) surprises and much harder than first imagined, but certainly worth the effort. I hope it will lay the necessary foundation for a new field of research:
Geographic Web Information Retrieval.
One brief note on the bibliography, whose name originates from biblion, meaning book or scripture in ancient Greek. In the 21st century, most information is no longer stored exclusively in books, and not even just in web pages. Much information is contained in code, prototypes, or even deployed production systems. For this reason, the bibliography has been extended to include academic prototypes, industrial solutions, and references to the companies that provide them.
Chapter 2
Introduction
The dynamic growth of the Internet has not only affected the number of documents or users online, or the number of authors that publish content online. It has also had a tremendous impact on the usage of the data from the Internet and the problems we can solve with it. Some queries and applications would not have made sense some five years ago, since the result sets would have been too small or even empty. Even the most general queries might have produced results that could still easily be skimmed by a human. In today's information landscape, even extremely specific queries will retrieve a substantial number of results. The focus of this paper is how to make queries more specific in geographic terms.
Although very brief, the history of the Internet can be divided into three periods:
early adopters In the early years of the commercial Internet, few companies found their way online. Their names, like Amazon, are legendary. The Internet was economically an experiment, with returns on investments not expected for years.
large corporate players In the next couple of years, national mail-order stores moved online and large corporations built web information systems. At this point, the Internet was already economically very interesting, but the resources for setting up a web site were still considerable. Few small and medium-size enterprises went online, and those that did mostly came from a mail-order background.
These three periods were reflected in the usage of web search engines:
Are there? During the period of early adopters, the interesting question was whether there was any information on, say, mountain bikes. Search engines were very crude and basically operated on a Boolean text model.
I need anything Today, search engines can be used for basically anything. They are used to find out what the local pizza store has on the menu and what the cinema is playing tonight. Searches can easily produce hundreds of thousands of answers. For one thing, such result sets require more complex ranking functions, mainly focusing on link structures. For another, they allow imposing more specific constraints on a search, since the answer set will still be of sufficient size. Such constraints might be "personal", "temporal" or "geographic", for example. The Internet has reached such a level of commercialization that web spam has become an industry of its own. Search engines therefore have to carefully avoid being spammed or trapped.
No Mining During the first period, the data on the web was too sparse for any web mining to make sense. Any form of mining only makes sense on large sets of data that cannot be handled manually by individuals anymore. This was not the case during this first period.
Global Mining on Large Phenomena The second period allowed for first web mining efforts, as long as the phenomenon to be investigated was large enough, that is, broad and not too specific.
Highly Specific Mining The amount of data on the web, and the fact that it covers even remote topics with a solid set of documents, make it possible to perform very specific web mining tasks. Queries can contain almost arbitrary constraints, especially on time and space, and still produce data of a size that supports statistical statements.
Both applications, web search and web mining, have reached the stage
where very specific queries finally make sense. Time and space, the two
most fundamental human conditions, are of key importance for taking these
applications to the next level. Their incorporation will also make a fundamental redesign of even the most basic techniques necessary. The impact can only be compared to the fundamental step databases took when they moved from a single dimension to multiple dimensions. Almost all areas of web search, even more than web mining, will have to be re-evaluated under these new premises.
In this paper we discuss geographic properties of Internet resources and the possibilities of imposing geographic restrictions on web search. We describe how we extracted geographic markers from web pages and used them to build an exploratory prototype of a geographic search engine. In addition, we provide a brief outlook on the geographic extension of web mining.
Chapter 3
Geographic Search Engines
Common search engines are already widely used for geographic search. Users often include names of geographic entities among the keywords to constrain the search to some region. Thus, they will compose queries such as:
– yoga brooklyn
– bed breakfast marburg
– scuba diving "long island"
This approach has several shortcomings:
• The user has to search extensively through all geographic terms. Thus, she will re-run slightly modified queries, such as:
– yoga brooklyn
– yoga "park slope"
– yoga "new york"
Some of the answers she will have to see over and over, simply because they appear in all three searches.
• The user must know good geographic terms. Therefore she needs to be familiar with the area.
• Some good pages might not contain any common geographic terms. Thus they will not show up in any of the above queries.
A geographic search engine, by contrast, will (pre-)compute the positions for all relevant pages. It looks for the same geographic hints the user would have included in her modified queries. Additionally, it searches for numeric codes and uses site and link analyses and external databases for additional geographic information.
In this paper, we discuss the development of an exploratory prototype of a geographic search engine step by step. At each stage, we will discuss alternative techniques from the literature.
d0  ich bin ich
d1  du bist du

Replacing each term by a term identifier, assigned in order of first appearance, yields:

d0  t0 t1 t0
d1  t2 t3 t2
d2  t2 t3 t4 t0
These are then sorted into the inverted index, which for every term tells us at what position in what document the term appears:
t0 d0,0,2 d2,3
t1 d0,1
t2 d1,0,2 d2,0
t3 d1,1 d2,1
t4 d2,2
The first entry, for example, tells us that term t0 appears in document d0 at positions 0 and 2, and in document d2 at position 3. One can imagine index structures or hash tables indexing the buckets for each term.
This is all a simple search engine needs. A search for ich would simply return all documents from the entry of term t0. A search for ich AND bin would translate into an intersection of the inverted index entries for t0 = {d0, d2} and t1 = {d0}, resolving to {d0}. Using the stored positions where the terms occur within a document, we can even form queries for sequences of terms such as ich bin.
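To make the mechanics concrete, the following Python sketch (not the prototype's actual code) builds such a positional inverted index and answers AND and phrase queries; d2's full text is not shown above, so only d0 and d1 are indexed.

from collections import defaultdict

docs = {
    "d0": ["ich", "bin", "ich"],   # t0 t1 t0
    "d1": ["du", "bist", "du"],    # t2 t3 t2
}

index = defaultdict(lambda: defaultdict(list))   # term -> document -> positions
for doc_id, terms in docs.items():
    for pos, term in enumerate(terms):
        index[term][doc_id].append(pos)

def search_and(*terms):
    """Boolean AND: intersect the document sets of all query terms."""
    doc_sets = [set(index[t]) for t in terms]
    return set.intersection(*doc_sets) if doc_sets else set()

def search_phrase(first, second):
    """Two-term phrase query: the terms must occur at adjacent positions."""
    hits = set()
    for doc_id in search_and(first, second):
        first_positions = set(index[first][doc_id])
        if any(p - 1 in first_positions for p in index[second][doc_id]):
            hits.add(doc_id)
    return hits

print(search_and("ich", "bin"))      # {'d0'}
print(search_phrase("ich", "bin"))   # {'d0'}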
The final step in executing a query in a search engine is the ranking of the answers. In our example this would be rather pointless, since the set of returned documents will never contain more than three documents. In real life however, most search engines produce thousands, if not millions, of answers for a simple query. Given that any human user will only be able to read the top fraction, the order in which the results are returned is crucial. Since we want to return first those documents that the user would find the most interesting and important, this order is called the importance ranking.
There is no single good function for imposing an importance ranking on sets of documents. Usually the ranking is a mixture of different functions, based on various measures, such as:
TF/IDF This measure counts how many times one of the key terms occurs in a document, its Term Frequency (TF). Not all terms appear equally often or are equally indicative; car, for example, is more frequent than bentley. A term's Inverse Document Frequency (IDF) indicates how frequent the term is over the entire body of documents. A small scoring sketch follows this list.
Relative Position When a query contains more than just one search term, the terms' relative position in a page is important. The closer the terms occur to each other in a document, the higher the page should rank.
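The thesis does not spell out a concrete weighting formula, so the following Python sketch combines the two quantities in the classic tf * log(N/df) way; function and parameter names are illustrative only.

import math
from collections import Counter

def tf_idf_score(query_terms, doc_terms, doc_freq, num_docs):
    """Score one document: sum of term frequency times inverse document frequency."""
    counts = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        tf = counts[term]                 # term frequency in this document
        df = doc_freq.get(term, 0)        # number of documents containing the term
        if tf == 0 or df == 0:
            continue
        idf = math.log(num_docs / df)     # rare terms like "bentley" weigh more
        score += tf * idf
    return score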
This is all there is to a basic search engine. A lot of the steps in this description are oversimplified and will usually not be carried out sequentially. The basic idea however should have become clear. Looking at an actually deployed search engine, we find three parts: an interface, a data component, and query processing.
Our geographic search engine will consist of the same three parts, none of which will look the same. We will describe these three components for a geographic search engine as we discuss the construction of our prototype.
The interface part will be looked at twice, once for simple interfaces and later for interactive interfaces with elaborate query control.
The data part will consist of the traditional crawl, indexed and ranked, but will additionally deal with geographic data. For our discussion, we will assume the reader to be familiar with the traditional techniques and focus on the latter.
Query processing will largely depend on the desired speed and behavior of the search engine. We will describe our baseline algorithms and briefly dip into further possibilities for query processing.
Before we describe how we constructed our geographic prototype, we will discuss what makes a good search engine and what process model we used.
First however, we will describe some simple interfaces. They come hand in hand with some simple use cases and will outline the different applications of geographic search.
In all cases, service provider and search engine would need close cooperation to automatically forward position information. Protocols should be fairly straightforward, but might depend on the appliances. The authors of [DCG04] propose an extra field in the HTTP protocol for geographic information. They implemented an early prototype that uses a GPS sensor to track the user's position and inserts this data into the HTTP requests to the server.
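As an illustration only: the exact field name used by [DCG04] is not given above, so the following Python sketch attaches the coordinates under a hypothetical Geo-Position header.

import urllib.request

lat, lon = 50.81, 8.77   # e.g. read from a GPS sensor; values are made up
request = urllib.request.Request(
    "http://www.example.org/search?q=yoga",           # placeholder URL
    headers={"Geo-Position": f"{lat:.2f};{lon:.2f}"}  # hypothetical header name
)
with urllib.request.urlopen(request) as response:
    body = response.read()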
How well they match human reasoning. Helping the user post good queries initially, to a search engine that conceptualizes the world in similar terms as she does.
Instead of deepening these extremely broad definitions, we will go straight to trying to capture or measure them, providing the definitions in their "measurement".
do. They usually give a group of users a set of keywords and have them manually evaluate web pages with regard to the keywords. Next, they evaluate the search engine's results against the "optimal" outcome, as produced by the user group.
This approach sounds straightforward and promises numerical results that allow the comparison of different search engine techniques. However, it has many drawbacks when it comes to putting it into practice:
• User groups are usually small and far from representative. A group of 25 volunteering computer science students will hardly produce any valuable output. Large and representative user studies require substantial financial resources.
• Any reasonable document base is too large for extensive human evaluation. Even moderate-size web crawls, like the one used for this study, contain several tens of millions of documents. Commercial crawls can easily amount to several billion documents. Any such collection of data is too large to be evaluated manually. Reducing it to a meaningful subset however would necessarily introduce bias.
• There are too many unknowns in a search. No user study could independently examine the impact of each variable. The area of application, type of query (navigational, undirected, etc.) and query length are just a few such factors. The fact that we will introduce several new variables over the course of this paper makes user studies particularly challenging.
and one that does not. The document space D consists of d documents. The set of results R contains r documents. The remaining d − r documents are called non-results. The search engine will determine a set of answers A, of which the user will read a subset K: the first k of these documents. The search engine's quality is then described by how K, R and D relate to each other.
The standard measures such as Precision and Recall can be found in textbooks such as [Cha02]. These measures however date back to a time when document spaces used to be rather small. Querying was about finding any relevant documents; that is, A was very small and would usually equal K. After the tremendous growth of the World Wide Web however, A turns out to be extremely large for most queries, usually much larger than K. Search engines differ not only in how well A resembles R, but more importantly in the order they impose on A. The latter determines which answers make it into K. This order however is not measured by precision or recall.
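For reference, the standard textbook definitions relate the answer set A and the set of relevant results R as follows:
• precision = |A ∩ R| / |A|
• recall = |A ∩ R| / |R|
Restricted to the first k answers K, one obtains |K ∩ R| / k, which is the quantity that actually depends on the order imposed on A.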
To conduct a meaningful user study, one has to invest tremendous resources. A large user group with a heterogeneous background is needed. The method of validating results against keywords has to be replaced by search tasks, where the user has to employ the search engine for small assignments, like "find the web site of company xyz". There must be several tasks from several basic categories, like "navigational search". This method of course adds an extra level of complexity. For geographic search engines, these queries would have to be performed for different locations, adding another dimension. This short discussion should illustrate how a meaningful user study of the analytic quality of a final product is already very expensive.
Using user studies for constructional quality during the design process of a search engine is even more expensive. One does not only have to evaluate one single system, but many different small systems, where one or a few parameters differ between set-ups, in order to find out how to set these parameters correctly. Since there is a whole range of different parameters, this method becomes so expensive that it is left only to the largest commercial players.
3.3.3 Hypotheses
The other way of supporting any claim about the quality of a search engine
is to state basic hypotheses and reason about them. Hypotheses are
• short
• simple
• easy to agree upon
assumptions on different subjects, such as:
• The quality and quantity of underlying data, such as the document
space
SHALLOW PENETRATION Any given user will only manage to see the first 10-200 answers. Some commercial search engines, such as Google [Goo], will only return the first, say, 1,000 answers.
The combination of these two axioms is often called the iceberg phenomenon, the notion being that the user only gets to see the tip of an iceberg of results.
UN-DETERMINED The user does not know what she is looking for most of the time. She hammers a two-term query into her keyboard just to get a first idea of what she could possibly find. This behavior contributes to the abundance problem and can be captured by the sentence:
We will keep adding other hypotheses throughout this paper. Also, there
are some that have been widely used already, but have to be adapted to a
geographic environment.
Personalized Search
One can try to overcome ABUNDANCE by imposing additional constraints on the result set, thus making it smaller. Many such proposed constraints are personal, that is, specific to the person of the user. Proposed techniques refer to the user's past preferences or those of her friends or of people with similar personal profiles. The user's geographic location, or a location that she is interested in, is one such personal property. Any of these techniques will help keep the iceberg small.
Interactive Search
The impact of SHALLOW PENETRATION cannot be reduced so easily.
After all, we are humans and cannot browse through thousands of answers.
However, one can try to circumvent this problem by interactive browsing
through results.
We can help the un-determined user re-adjust her search interactively as she
browses the results. She will be allowed to change search parameters after
each batch she receives, without having to restart the search or seeing the
same answer twice. Thus, she will be allowed to change direction while drilling into the iceberg, instead of consuming it in an iterator fashion from the top.
This will also enable her to circumvent web spam. Determined web masters have set up so-called Google traps that produce higher rankings for their website. As mentioned above, some search engines like Google only return the top 1,000 answers. Successful Google-trapping that conquers all possibly returned answers can thereby effectively block competitors from any attention by the user. Interactive browsing may help to circumvent such maliciously boosted answers.
Our geographic search engine is personal by nature. In addition, we will
show how to make it interactive and provide the necessary user interfaces.
• One SUN workstation parses out URLs from retrieved pages, removes
duplicates and adds them to the queue for crawling.
• One SUN workstation performs the DNS lookup and manages the
crawl.
• Seven Linux machines were used for storing the crawled pages.
3.4.2 whois
The whois service is an integral part of the Internet infrastructure. It is a distributed database that provides information for all registered domains. For each entry, it provides information about the registrant, as well as content, server and network related contact addresses. For all generic top-level domains, such as .com, .org, .net and .edu, it is maintained by the (commercial) operators. For all country-code domains, such as .de, .fr or .at, it is maintained by the national domain registration authorities, such as the German DENIC [DEN].
The whois service can be accessed via web front ends, such as UWHOIS [UWH], or via the UNIX command whois. Every whois entry should consist of four pieces of information:
Registrant It should contain the address, phone and email contact of the
(legal) person who has registered the domain. This information is
most commonly used in disputes over domain ownership.
admin-c This section should contain a contact for the person that is responsible for the content of the web site. It is usually identical with the registrant.
tech-c This contact is for the technical administrator of the server behind
the domain. Usually, only large companies and those with their own
IT-department will host their own web servers. The largest portion of
all domains is hosted in remotely run server farms. Hence, this contact
will usually not be identical with the above.
zone-c This section provides a contact for the administrator of the network through which the server is connected to the rest of the Internet. Since almost all domain registrants are connected by a third-party provider, this contact will almost certainly be different from the first two.
Table 3.2 shows the whois entry for die Fahrradschmiede, a small bike store in the center of Germany. Like most other small businesses, the company hosts with one of the two largest German hosting companies. For that reason, tech-c and zone-c point to the same contact.
Ideally, all whois entries should contain the above information in a well-structured text document. However, information is often incomplete and noisy. Before we turn to German domains, we will briefly look at some other whois databases:
.com domains are highly diverse, since they are registered by companies from all over the world. They usually contain some sort of information for most fields. However, they tend to be so unstructured that elaborate parsing techniques need to be deployed to extract a single field of information, such as the registrant's zip code or country of origin.
.co.uk entries are very sparse. They include the name of the registrant and the name of the registrant's agent. Addresses for either are very rare. There are no phone numbers, so no area codes to be extracted.
.at entries are of very good quality, including all four sections. The fields usually contain relevant information. They are highly structured.
.ch entries contain only two sections, a holder of the domain and a technical contact. The first likely corresponds to the registrant and admin-c, the second to tech-c and zone-c. Fields are rather structured.
.nl entries contain several address fields. Their size seems arbitrary. They can consist of an array of technical and administrative contacts, a registrant's and a registrar's contact. The fields are clearly separated, but don't always contain a complete address and phone number. Extraction however should not be too hard.
The whois entries of German .de domains are not only extremely complete, they are extremely structured as well. For most entries, all four distinct sections can be found. They are usually so well structured that parsing for certain fields of data is quite simple. This greatly facilitated the implementation and made .de domains an ideal test bed for our prototype.
domain: fahrradschmiede.de
descr: Die Fahrradschmiede
descr: Ringstr. 10
descr: D-35108 Allendorf (Eder)
descr: Germany
nserver: ns.schlund.de
nserver: ns2.schlund.de
status: connect
changed: 19990311 092200
source: DENIC
[admin-c]
Type: PERSON
Name: Christoph Michel
Address: Die Fahrradschmiede
Address: Ringstr. 10
City: Allendorf (Eder)
Pcode: 35108
Country: DE
Changed: 20000321 172103
Source: DENIC
[tech-c][zone-c]
Type: PERSON
Name: Puretec Hostmaster
Address: 1&1 Puretec GmbH
Address: Erbprinzenstr. 4-12
City: Karlsruhe
Pcode: 76133
Country: DE
Phone: 49 1908 70700+
Fax: 49 1805 001372+
Email: hostmaster@puretec.de
Changed: 20000927 160119
Source: DENIC
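As a rough illustration of how simple the parsing can be for such well-structured entries, the following Python sketch (not the prototype's code) splits a DENIC-style record into sections and key/value fields; the header block before the first section is skipped and combined section labels are kept as one key.

def parse_denic_entry(text):
    contacts = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("[") and line.endswith("]"):
            current = line.strip("[]")      # "[tech-c][zone-c]" stays one combined key
            contacts[current] = {}
        elif ":" in line and current is not None:
            key, value = line.split(":", 1)
            contacts[current][key.strip().lower()] = value.strip()
    return contacts

# For the entry above: contacts["admin-c"]["pcode"] -> "35108",
# contacts["admin-c"]["city"] -> "Allendorf (Eder)"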
The whois service dates back to a time when privacy on the Internet was not an issue and spam was an unheard-of phenomenon. Since then the Internet has changed dramatically, and it no longer seems appropriate to publish the addresses and email contacts of web masters. Spam authors in particular have taken to the whois database as an easy source of email addresses. National privacy laws, such as the German Bundesdatenschutzgesetz, easily collide with the whois service. For this reason, we expect the whois service to change dramatically or vanish over the course of the next years. Even now, the German DENIC refused to share a copy of their otherwise publicly accessible database. We ended up querying their whois server for the 650,000 domains we had touched with our crawl, out of the total of 7,800,000 .de domains. We conducted the retrieval slowly enough not to resemble a denial-of-service attack, and queried for about two weeks.
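Such a throttled retrieval can be sketched as a simple loop around the command-line client; the whois binary is assumed to be installed and the delay value is purely illustrative.

import subprocess
import time

def fetch_whois(domains, delay_seconds=2.0):
    """Query one domain at a time, pausing between requests."""
    for domain in domains:
        result = subprocess.run(["whois", domain], capture_output=True, text=True)
        yield domain, result.stdout
        time.sleep(delay_seconds)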
Area Codes
The first set contained the centroids of the polygons associated with the regions covered by area codes. In addition, it contained the name of the major city associated with each area code.
It is important to point out that the area codes are only loosely connected to Germany's administrative geography. That is, some cities might have more than one area code, while in other regions several cities share one area code. The dataset contained only the most significant city name for an area code.
The data came in a text file with 5,000 entries in the following format:
area code; city name; longitude, latitude
Table 3.3 shows a short extract from the actual data set.
the entry would point to a city. In the second, it would point to a village
and include the name of the city the village is associated with. Table 3.4
shows a short extract from the actual data set.
This data set proved to be extremely problematic and required enormous effort to clean up. It was intended to be used in geographic information systems (GIS). For this reason, the positions were the key attributes of this database; this assumes that there are no two towns on top of each other. The city names were a mere hint to the operator of the GIS as to which city she sees on her display. They were extremely dirty and incoherent, the reason probably being that they originated from several sources, agencies of the different German states, the so-called "Landesämter". For our application however, we needed correct city names, since we were going to parse for them. Since the errors in the data were so inconsistent, we ended up cleaning the data manually.
There were several sometimes closely connected errors in the data. Many
terms were abbreviated, often highly inconsistently. We had to replace all
abbreviations with the full term, and stored the replacement in an extra file.
We assumed that the abbreviations we found in this data were going to be
found again in the web pages and that we should store the mapping now,
so we could reuse it later. In that case, we could translate the abbreviation
to the full term, by looking it up in a table.
We dropped all little terms such as ”in” and ”auf”1 and their abbreviations,
since they would not be of much help. Sometimes, they had been abbreviated
and glued to the preceding term. So the data set would show a Furthi Wald
instead of a correct Furth i. Wald or a complete Furth im Wald. These
cases were particularly nasty to clean up, since not every trailing i is an
abbreviated in or im.
The cleaning of the data was extremely tedious, could not be automated, and took one person about ten days. In very hard cases, we had to look up the correct town name on the Internet. However, in our data-centric approach, good quality data to start from is of crucial importance. If the system had not returned the expected results later, it would have been impossible to tell whether this was due to faulty data or faulty design.
34639,schwarzenborn*knuell lager*schwarzenborn,9.4253880002587316,50.89763299981351
34639,schwarzenborn*knuell,9.4473499995065371,50.911501999164322
35037,marburg,8.7691719990471793,50.790983001039926
35041,marburg dagobertshausen,8.6951250011819887,50.819398000500954
35041,marburg dilschhausen,8.6579209990609325,50.816383000470793
35041,marburg elnhausen,8.6940449992263193,50.809825001040245
35041,marburg haddamshausen,8.7010059981941001,50.782691998169838
35041,marburg hermershausen,8.6908049979446815,50.786987998059928
35041,marburg marbach,8.7367700007645119,50.819699998081383
35041,marburg michelbach,8.7030469979329474,50.844195001501646
35041,marburg wehrda,8.7560919993076762,50.838240000501266
35041,marburg wehrshausen,8.7275279990758623,50.811936998521176
35043,marburg bauerbach,8.8327769986550972,50.818493000550703
35043,marburg bortshausen,8.7806929996489167,50.751035000158161
35043,marburg cappel,8.7613720007169569,50.779828000869543
35043,marburg cyriaxweimar,8.7196080015473143,50.784577001069927
35043,marburg ginseldorf,8.8220960016123797,50.841632001111442
35043,marburg gisselberg,8.7438499993194121,50.774928001878941
35043,marburg moischt,8.8259369984968465,50.774175001389246
35043,marburg ronhausen,8.7569309986701196,50.758347000748749
35043,marburg schroeck,8.8313359994932465,50.786459998689701
35066,frankenberg*eder doernholzhausen,8.8773020019242814,51.056517001379312
35066,frankenberg*eder friedrichshausen,8.8609800016807316,51.048979999669264
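Assuming the cleaned records keep the comma-separated layout shown above (zip code, normalized place name, longitude, latitude), parsing them is straightforward; this Python sketch is illustrative only.

def parse_place_record(line):
    zip_code, name, lon, lat = line.strip().split(",")
    return {"zip": zip_code, "name": name, "lon": float(lon), "lat": float(lat)}

record = parse_place_record(
    "35037,marburg,8.7691719990471793,50.790983001039926")
# -> {'zip': '35037', 'name': 'marburg', 'lon': 8.769..., 'lat': 50.790...}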
3.4.5 Umlauts
The German language makes use of a handful of special characters, the so-called umlauts. For historic reasons, there are various ways to incorporate umlauts in HTML pages. In addition, many keyboards do not have keys for umlauts, so users will type sequences of other characters when they actually mean to type an umlaut. In order to deploy a working search engine, we decided to get rid of umlauts altogether. HTML was originally designed for the 128 characters known as lower ASCII. They are encoded in a single byte, not making use of the topmost bit. Umlauts are not part of lower ASCII. Web page authors replaced them with sequences of lower-ASCII characters, a technique used by every German-speaking writer when she encounters a non-German keyboard. For that reason, several techniques were introduced later that allowed umlauts to be included in HTML documents. There were two underlying strategies:
• Special HTML tags that the browser resolves to the German umlaut. For historic reasons, there are two ways of using tags for umlauts.
– A tag that specifically names the umlaut
– A tag that gives the umlaut's number in Latin-1 encoding, using the upper 128 characters that can be represented in a byte.
• Directly using Latin-1 encoding and telling the browser about it. In
this case, umlauts can be directly typed in the HTML, using the upper
128 characters.
Table 3.5 demonstrates the different umlauts and the several ways of encoding them:
For indexing and recognition, we want to identify two equal words as
equal, no matter if they were written in different encodings. The encodings
in HTML and Latin-1 can easily be converted. The mapping to sequences
of lower-ASCII characters however is not bi-directional. Every ä can be
mapped to an ae, but not every occurrence of an ae was meant to represent
an umlaut. For that reason, we decided to translate every occurrence of
an umlaut to sequences of lower-ASCII characters. We translated the raw
HTML pages from the web crawl, as well as our tables with geographic data
for Germany.
This technique can be highly recommended, even for traditional search engines.
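A minimal sketch of such a normalization step might look as follows; the replacement table is abbreviated and the function name is illustrative.

REPLACEMENTS = {
    # named HTML tags, numeric Latin-1 references, raw Latin-1 characters
    "&auml;": "ae", "&#228;": "ae", "ä": "ae",
    "&ouml;": "oe", "&#246;": "oe", "ö": "oe",
    "&uuml;": "ue", "&#252;": "ue", "ü": "ue",
    "&szlig;": "ss", "&#223;": "ss", "ß": "ss",
    "&Auml;": "Ae", "Ä": "Ae",
    "&Ouml;": "Oe", "Ö": "Oe",
    "&Uuml;": "Ue", "Ü": "Ue",
}

def normalize_umlauts(text):
    """Rewrite every umlaut encoding to its lower-ASCII replacement sequence."""
    for encoded, plain in REPLACEMENTS.items():
        text = text.replace(encoded, plain)
    return text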
Programming Languages
Over the course of the project, we dealt with a wide range of data sets, from 100 KB to over a terabyte. Functions on that data ranged from simple mapping and string processing to complex computations that required more detailed modeling.
The intermediate text formats gave us great flexibility in choosing specific languages for different tasks. Over the course of the project, we used:
C and C++ for fast processing of the largest data sets, when speed matters.
Java for tasks that required detailed modeling, when speed is not a key
issue. Java proves to be much faster to code and debug than C.
Perl and Python for functions on geographic data sets and extraction of
features from larger text collections. Both languages are well suited
for string handling.
3.6 Geocoding
It is the underlying idea behind geographic search engines that web pages
provide information for certain areas. A website for Marburg’s tourist office
for example would contain information on Marburg. Every web page can
contain information for one, none, or even several such areas. This set of areas is called the page's geographic footprint.
The process of discovering and assigning the areas that compose the footprint is called geocoding [McC01]. It starts with geoparsing [MAHM03], the search for hints in web pages that will lead us to the areas. This step is highly dependent on a country's geography and the names of its towns, and can only be generalized to some degree. By geography, we mean how a country is organized into political and administrative geographic regions. Differences can be found in what entities there are, such as cities or states, how they relate to each other, and how they relate to other administrative codes, such as the national postal or phone system. As the authors of [McC01] pointed out:
In the next step, we try to geomap these hints to actual geographic entities, such as cities and villages. The sum of these entities then forms the footprint.
After initial footprints have been computed for all web pages in the crawl, we can further refine and improve them in post-processing steps.
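To make the two steps concrete, the following Python sketch (with hypothetical helper names) shows geoparsing and geomapping feeding a footprint; the gazetteer is assumed to be a simple dictionary from place names to candidate entries.

def geoparse(page_text, known_names):
    """Very crude hint finder: known place names occurring as tokens in the page."""
    tokens = set(page_text.lower().split())
    return [name for name in known_names if name in tokens]

def geomap(hint, gazetteer):
    """Resolve a hint to zero or more (name, lon, lat) entries; ambiguity is kept."""
    return gazetteer.get(hint, [])

def geocode(page_text, gazetteer):
    footprint = set()
    for hint in geoparse(page_text, gazetteer.keys()):
        footprint.update(geomap(hint, gazetteer))
    return footprint          # the page's (possibly empty) geographic footprint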
document, avoiding ambiguity. For this section, we will assume that meta information is provided by the author. Schemes with external collaborative editors, such as Internet directories, will be described in Section 3.6.2.
There are two basic levels of sophistication, one less formal and another very precise. Ever since its creation, HTML has contained a meta tag, where an author can enter a set of keywords that loosely describe the document's content. Table 3.6 contains an HTML header that makes use of meta tags. Later, it became necessary for authors to agree on fixed meanings for certain terms, often highly dependent on the content of the document, such as medicine, trade or product descriptions. In this semantic web, application-dependent ontologies form agreed-upon mappings between terms and meanings. As it turns out, for a geographic Internet, the difference between the two concepts becomes rather slim and can almost be ignored.
Currently, no search engine considers the text in HTML meta tags. As a result, they are almost never used. A smaller study [Mih03] found that 0.005 - 1.9% of all web pages contained any kind of standardized meta information. Of these, 1 - 3% were found to contain spatial information. However, a geographic semantic web as described by [Ege02] could have great potential. One could eliminate geoparsing from geocoding, since all required information would be readily available. There has been quite some work in this direction. All solutions however suffer from two significant problems:
Chicken and Egg After initializing such a service, web masters would have to annotate their web pages or enter geographic information in the directory service.
Web masters however are practical people and will only invest their time if they expect a significant gain in return. The gain would be more hits from users, usually led to the web site by a search engine. They would therefore wait for major search engines to make use of such information.
Search engines in turn wait for the web masters to make the first move. They would only focus on geographic meta data if enough web masters had annotated their pages. Otherwise, the data available to a geographic search engine would be too scarce to be of any use.
This problem could be overcome by either:
• A core search engine that does not rely on geographic meta information. It would be ready to use from the first day, and web masters would directly benefit from entering geographic meta data.
• A commercial search engine of significant size, which would have the leverage to de facto force web masters to implement meta data. Web masters would risk losing an intolerable portion of their traffic if they did not comply.
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<META NAME="geo.country" CONTENT="DE" />
<META NAME="geo.region" CONTENT="DE-HE" />
<META NAME="geo.placename" CONTENT="Marburg * Capelle * Michelbacher Mühle" />
<META NAME="geo.position" CONTENT="50.867;8.700" />
<META NAME="ICBM" CONTENT="50.867, 8.700" />
<META NAME="DC.title" CONTENT="eventax UmkreisFinder Marburg * Capelle * Michelbach
<META NAME="DC.coverage.spatial" SCHEME="DCMIPOINT" CONTENT="east=8.700; north=50.8
<META NAME="DC.coverage.spatial" SCHEME="ISO3166" CONTENT="DE" />
<meta name="keywords" content=" 2004 Umkreis Umgebung [...] Messen Sport" />
<meta name="description" content="Die Suchmaschine und Stadtportal für Marburg
<title>Die Suchmaschine für Marburg und Umgebung</title> <link rel="styleshee
<script language="JavaScript" type="text/Javascript">
[some JavaScript here]
</script>
</head>
• (partially) abbreviated
• highly ambiguous
In the case of telephone numbers, formats might vary and even contain
special characters:
• 06421 2821559
Numeric codes can be used in combination with textual codes or other numeric codes to reduce the risk of mistaking a random number for a numeric geographic code. They are also highly dependent on a country's administrative characteristics. Cell phones in Germany, for example, have an area code that points to a service provider, not an area. In the United States however, cell phone numbers usually contain a local area code.
Once a larger body of documents has been geocoded, one can try to infer geographically relevant terms. The probability of the occurrence of a term in a document might be strongly correlated with the document's geographic footprint. Using techniques such as those proposed in [PF02], it should be possible to detect informal names for landmarks or for events that are held in a specific location. Examples might be:
• Mermaid Parade for Coney Island, Brooklyn, NY, where this event is
held annually.
• whois directories
Business directories have been around in their printed form for a long time. They might cover all businesses, like the yellow pages, or focus on a specific business sector. They all have in common that the initiative to get listed lies with the client. Even if the directory approaches a company about getting listed, it is usually in the company's hands to agree to the listing. There is usually a fee associated with the listing. The fee has a two-sided impact on the content of the directory. On one hand, there is very little spam (misleading and redundant entries). On the other hand, the service is limited to commercial clients. Non-profit and private contacts will usually not be listed. This makes business directories an incomplete source of high quality data. They can very well be integrated into a geographic search engine.
The geographic search engine [inc03], as described in Section 3.9.1, is built around a business directory. As for most commercial search engines, the exact algorithm and data are not disclosed to the public. It is however quite clear that after having issued the query, the user is first confronted with a list of nearby entries from a business directory. The user must next click on an address and will then be presented with answers for that entry. The service thus appears to be more like a search inside the yellow pages.
Manually compiled directories are almost as old as the Internet. Commercial directories like yahoo! [Yah] or non-profit competitors like the Open Directory Project [] have been fierce competition for search engines. The directories are organized in a hierarchical structure, and web pages and sites are entered by hand. It is easy to organize the directory so that it contains a geographic component. Indeed, most directories allow the user to click into sections for regions or cities. The placement of an entry makes a statement about the geographic footprint of the Internet resource.
In contrast to business directories, there is usually no fee required for a placement, and it is the directory that takes the initiative by placing the entry for a web page, not its author. The directory is always far from complete, but does include non-profit and private pages and thus is much larger than business directories.
Manually compiled directories are a good source for generating geographic
footprints. The geographic search engine www.umkreisfinder.de [Eve] as
• Most web sites are about places close to the position of their owner.
• The admin-c section of the whois entry points to the address of the
company that owns the domain.
• Most small businesses don’t run their own servers, but rent remotely
hosted web space. The technical contact section of the whois entry
would point to this remote address.
• Even larger businesses that host their own servers usually do not operate their own network. The zone-c section would point to the address of the networking company, which is usually not close by.
Instead it is compiled by the web host from the (necessarily accurate) billing data provided by the registrant.
The whois service proves to be an open, free and rather reliable source of geographic data.
Post Processing
In the previous section, we have seen that we can find geographic references
in a sufficient fraction of all web pages. We would however like to increase
this fraction. Also, we don’t know if all geographic footprints are sufficiently
filled. They might be nonempty but still miss references to important regions
that a page provides information for. In this section, we demonstrate how
quantity and quality of a geocoded document body can be enhanced by
extending basic IR hypotheses to geographic features.
We have already touched on two important topics in the previous section. We learned that on every German web site there is at least one page that contains information about the legal entity that runs the site, including contact phone and address. Even though this information is concentrated on one single page, it somehow applies to the entire site. Also, we learned that geographic references found in the anchor text of a link might be propagated to the page that the link points to. But the difference between anchor text, the surrounding text and the rest of a document is not so great, so probably they all should somehow be propagated. In this section we will try to formalize and justify this notion by extending three basic assumptions from traditional web information retrieval to geographic properties. We will discuss them one after another, before discussing how we can actually propagate information between web pages.
INTRA-SITE This hypothesis states that two pages p1 and p2 are more likely to be similar if they belong to the same site/domain, p1.domain = p2.domain, than two random web pages p3 and p4 that belong to different domains, p3.domain ≠ p4.domain. The idea is that if one web page in a domain is about, say, "bicycles", any other is more likely to be on the same topic than a random page from the web. This hypothesis can be extended to smaller units than a site, forming an INTRA-SUBDOMAIN or an INTRA-DIRECTORY hypothesis.
RADIUS-ONE This hypothesis states that two pages p1 and p2 are more likely to be similar if they are linked, that is, if there is a link l(p1, p2), than two random pages that are not linked. Note that this hypothesis is symmetric; it does not matter whether the link is l(p1, p2) or l(p2, p1). If both links exist, the correlation between the two pages is expected to be even stronger.
RADIUS-TWO This hypothesis states that two pages p1 and p2 are more
likely to be similar if there is a page px that links to both pages:
∃l1: l1(px, p1) and ∃l2: l2(px, p2). This hypothesis is also known as COCITATION. There is a more detailed version of this hypothesis that argues that the correlation between p1 and p2 is the stronger, the closer l1 and l2 are within px.
We used the term "similar" in the above definitions without giving a detailed description. It could mean, for example, "on the same topic", such as "bicycles". We don't really need to narrow ourselves down to a more precise description, since we are going to replace it at this point by "geographically close". The notion is that if two pages are more likely to be similar, they are also more likely to hold information about the same geographic region. We will rephrase the three hypotheses in this geographic context:
INTRA-SITE Two pages are more likely to be about the same geographic region if they belong to the same site/domain.
RADIUS-ONE Two pages are more likely to be about the same geographic region if they are linked.
RADIUS-TWO Two pages are more likely to be about the same geographic region if there is a page that links to both of them.
• p1.fp := F(p1.fp, p2.fp)
• p2.fp := F(p2.fp, p1.fp)
The most basic function for F is a sum with a dampening factor f ∈ [0, 1]:
• p1.fp := p1.fp + f ∗ p2.fp
• p2.fp := p2.fp + f ∗ p1.fp
This technique works reasonably well. For more flexibility, the dampening factor f can be replaced by a factor f(p1, p2) that depends on p1 and p2. If propagating because of a RADIUS-TWO situation, for example, f(p1, p2) can be the higher, the closer the two links are within px. If propagating because of an INTRA-SITE situation, f(p1, p2) could be the higher, the closer p1 and p2 are. If they are just in the same domain, f(p1, p2) could be relatively low, but if they are also in the same directory, it could be much higher. As we will see in our implementation in Section 3.9, in some cases it makes
sense to not propagate from page to page, but to first build an aggregate
and propagate the aggregate.
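As one possible reading of the formulas above, footprints can be modelled as weighted sets of places; the sketch below applies the dampened sum in place (for a strictly simultaneous update in both directions, the footprints would be copied first).

def propagate(fp_target, fp_source, f=0.5):
    """fp_target := fp_target + f * fp_source, footprints as dicts place -> weight."""
    for place, weight in fp_source.items():
        fp_target[place] = fp_target.get(place, 0.0) + f * weight

# For a link l(p1, p2), propagation runs in both directions:
# propagate(p1_fp, p2_fp); propagate(p2_fp, p1_fp)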
A quite different but very interesting approach to exploiting the RADIUS-ONE hypothesis has been proposed in [DGS00]. The prototype that implements this approach is described in Section 3.9.5. Omitting the geo-extraction discussed in the previous section, they performed an initial mapping solely by analyzing whois entries. They propose an elaborate density-based approach for propagating geographic information. Focusing on the United States, they imposed an administrative hierarchy on the country, constructing a tree made of levels of "states", "counties" and "cities". Web pages are treated one by one. For every web page p they compute p's geographic scope by traversing the tree top down. If a region qualifies to be part of the geographic scope, the traversal will not continue into the region's sub-regions. For a region to qualify for inclusion in p's geographic scope, it has to meet two conditions (a rough sketch follows the list):
• There has to be a minimum number of links from this region, according
to the whois entries.
• The origins of these links have to be spread homogeneously over the region.
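A rough sketch of this top-down traversal, with a placeholder threshold and a placeholder homogeneity test (the original parameters are not given above):

from dataclasses import dataclass, field

@dataclass
class Region:
    name: str
    children: list = field(default_factory=list)

# links_from: region name -> number of whois-derived links originating there
def spread_is_homogeneous(region, links_from):
    # Placeholder test: no single child region may account for all of the links.
    counts = [links_from.get(child.name, 0) for child in region.children]
    return not counts or max(counts) < sum(counts)

def geographic_scope(region, links_from, min_links=10):
    """Return the regions that make up a page's geographic scope."""
    count = links_from.get(region.name, 0)
    if count >= min_links and spread_is_homogeneous(region, links_from):
        return [region.name]            # qualifies: do not descend into sub-regions
    scope = []
    for child in region.children:
        scope.extend(geographic_scope(child, links_from, min_links))
    return scope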
This approach however has one major flaw: the fixed hierarchy imposed on the data. Any hierarchy is rather arbitrary and application-dependent. A hierarchy that fits political and administrative needs, such as the above, might completely fail for other applications, such as ecological observations. Thus, this approach is not as flexible as the simple addition of geographic footprints discussed earlier.
There are many ways to infer a host's actual position from hardware and network properties. These techniques are mostly used to determine the position of a client's host, not a server's. In our scenario, this information is much easier to get; the most basic approach is simply to ask the user to fill out a form. For more information on user tracking, see Section 3.2. Hardware-based location techniques for clients are usually far from precise and not the focus of this discussion. They can however also be used to estimate a server's position and might be of use in combination with other techniques.
There are five basic techniques for determining a host's location.
Whois entries contain addresses for the network and host operator. From these, one can try to infer geographic information for a host, given its URL.
None of these techniques is very precise, and even their combination is not precise enough for a useful geographic search. There are several commercial enterprises, such as Digital Envoy [Dig], Quova [Quo] or Verifia [Ver], that offer IP-to-position services, mainly for determining users' positions.
They often claim precision to the street level, but fail to provide information to back up this doubtful claim. The authors of an academic study that focuses on analyzing router names and network delay encountered errors of twenty to several hundred miles [PS01]. The precision can therefore be expected to be more on a country or state level, which is reflected in the current applications of these technologies:
• Guessing the native language of a user.
1. Furth
2. Sachsenhausen
3. Weimar
4. Neustadt
5. Schwalbach
6. Essen
• There are sometimes predicates the town has earned over time, such
as Bad or Sankt. Predicates are usually common German terms and
precede the main term.
We will need these classifications of terms and towns for the extraction of
terms and their mapping to towns in the next sections.
cases. In the following, we will make the case that it is best to avoid false positives and rather drop a hint than include it in a page's footprint.
Assume a page p contains a doubtful hint h to a geographic entity, such as town t. Let us also assume that a user posts a query q for a position near t and that p could qualify as an answer. We will later see that we can attach certainties to h, but this first basic decision remains, whether to:
If we include the hint in the footprint, we run the risk that p makes it into the top-k answers for q. Again, there are two possibilities:
if h was meant to point to t, then we are fine and made a correct decision. However, we did not gain much, since according to the ABUNDANCE hypothesis, there would have been plenty of other good results.
if h was not meant to point to t, we have a bad answer in the top-k. If this happens for other pages too, we run the risk of cluttering the top-k beyond an acceptable fraction. This would render our search engine useless.
If we ignore the hint and p would have qualified for the top-k, again there are two possible scenarios:
if h was meant to point to t, we lose one result from the top-k. This is no big deal, since according to the ABUNDANCE hypothesis, we have plenty of other good results.
if h was not meant to point to t, we did the right thing.
As we can see, we have little to gain from including a doubtful hint, but a lot to lose. In conclusion, it is usually safer to completely drop an uncertain or ambiguous hint.
We will see later in this chapter that we have to be particularly careful in regions with a low web page density.
This is not the case for geographic search engines. The mapping from hints to positions, in combination with a proximity measure, often makes a re-tracing of errors impossible. Let us look at a simple example to illustrate this problem.
Assume a user lives in the area of Ostelbien, north-east of Leipzig, and queries for a travel agency. If we were not careful in our geo-extraction and geomapping, we might return completely useless results that point to travel agencies at the other end of the country. And the user might never find out why, nor might we.
As it turns out, there is a small village by the name of Last in Ostelbien. If we are not careful, we might assume every travel agency with last minute offers, a common English phrase on German tourism sites, to be locally relevant for Last and therefore for Ostelbien. The user will most likely never have heard of Last, since the town happens to have only 90 inhabitants. She will have no idea why all her answers consist of far-away results. The same applies to the designer of the search engine, who in addition might never find out about the problem, since it occurs only in queries for this particular region. This case makes a good example of discovering problems hidden within the data during the construction of a search engine.
The reason for these debugging problems is to be found in the mapping from terms to positions. One would have to take all tables with mappings from terms to positions and perform reverse lookups for all nearby towns to find the reasons for an erroneous mapping.
As a result, we will have to be especially careful when reasoning about the decisions we make in creating geographic footprints. In addition, we often have to check intermediate data sets for plausibility.
strong terms that are almost uniquely used in town names. They form a seed for a town name. Having found a strong term like Frankfurt on a web page tells us that we have found a town name. However, we might not know which Frankfurt.
weak terms that are common in the German language, but might help us later determine the precise city. So, if we find a Main on the same page on which we found the Frankfurt, we would be certain that this page is about Frankfurt, Main.
• The weak term Main maps to the strong terms Frankfurt and Offenbach, because of the two cities Frankfurt, Main and Offenbach, Main.
• The strong term Frankfurt is mapped to the two weak terms Main and Oder, through the existence of Frankfurt am Main and Frankfurt Oder.
We manually split the original set of terms into 3,000 weak and 55,000 strong terms. In addition, we created tables weak2strong and strong2weak that store the above n : m relationship in both directions. This process took several days.
The idea at this point is to first parse for all strong terms in a page, and in a second step for all weak terms that are associated with these strong terms. This method can reduce noise dramatically.
Weak terms can furthermore be categorized.
Weak terms can furthermore be categorized.
4 Frankfurt near the river Main
5 Frankfurt near the river Oder
• Other weak terms are just common German terms that are likely to appear anywhere in any document, without any geographic meaning. These terms only provide useful information when close to the main term. An oder 6 at the bottom of a page does not help us find out which Frankfurt was talked about much earlier on the page.
For this reason, we attached an integer distance to every weak term. When parsing for weak terms, the distance between a weak term and a matching strong term must not be more than distance for the weak term to qualify. This furthermore helped reduce the output from parsing, while increasing overall quality.
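The two-stage matching with a per-term distance limit might be sketched as follows; the data structures are illustrative, not the prototype's actual tables.

def match_towns(tokens, strong2weak, weak_distance, default_distance=5):
    """Find strong terms, then accept associated weak terms only within their distance."""
    matches = []
    for i, token in enumerate(tokens):
        if token not in strong2weak:
            continue
        weak_hits = []
        for weak in strong2weak[token]:
            limit = weak_distance.get(weak, default_distance)
            window = tokens[max(0, i - limit): i + limit + 1]
            if weak in window:
                weak_hits.append(weak)
        matches.append((token, weak_hits))   # e.g. ("frankfurt", ["main"])
    return matches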
In some rare cases, a town's main terms were so common that they had been sorted into weak. Under these circumstances, such a town name would not have been detected.
We therefore further refined the process by adding two additional tables: validators and killers. They allowed us to move the main terms that had ended up in weak to strong, without producing too many false hits.
validators map certain strong terms to other terms. After a strong term
has been detected, its environment is checked for the presence of one
of its validators. If no validator can be found, the term is discarded.
Most terms do not show up in this mapping, and will qualify for further
processing without the required presence of a validator.
killers are the exact opposite of validators. After a strong term has
been detected, its environment is checked for the presence of one of its
killers. If a killer is present, the term will be discarded.
To further increase flexibility, an integer distance was added to every entry.
It determines how far a validator or killer can be from a term and still affect
it.
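A minimal sketch of the validator and killer check might look as follows; the example table entries and the token-window distance handling are assumptions.

# Validators and killers for a detected strong term. A term listed in
# `validators` must have one of its validators nearby, otherwise it is
# dropped; a nearby killer always drops the term.
validators = {"wesel": ({"stadt", "rhein"}, 15)}   # term -> (validator set, distance)
killers    = {"essen": ({"trinken"}, 10)}          # term -> (killer set, distance)

def keep_term(term, pos, tokens):
    def window(dist):
        return set(tokens[max(0, pos - dist): pos + dist + 1])
    if term in killers:
        kill_set, dist = killers[term]
        if kill_set & window(dist):
            return False                        # a killer is close by: discard
    if term in validators:
        val_set, dist = validators[term]
        return bool(val_set & window(dist))     # require a validator nearby
    return True                                 # most terms need no validator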
Another problem was the common usage of town names as last names.
Many German last names originate in town names. They were usually meant
to describe the fact that a person's family originated from that town. Most
town-based last names like Marburger and Kirchhainer are in genitive form and will not
6 oder is also the German word for or
be detected by our parser, which searches for Marburg and Kirchhain exclusively.
In other cases however, there is no difference between a last name
and a town name.
To avoid falsely detecting town names when a page is actually talking
about a person's last name, we introduced another list of terms, the
general-killers. Any strong term within a distance environment
of a general-killer is immediately discarded, without further processing. We
manually compiled this data set from 3,000 first names and common titles,
like Herr, Frau or Dr.
Numeric Codes
Numeric codes such as area codes and zip codes can be extracted in three
steps:
2. Look them up in a table of all numeric codes and see if they appear.
If they do not, discard them. This step eliminates most false positives.
The five-digit format would allow for 100,000 zip codes. However,
there are only about 8,000 zip codes actually in use; all other five-digit
numbers can safely be eliminated.
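As a sketch of this lookup step, the roughly 8,000 valid codes can be kept in a set, and every five-digit candidate not contained in it is dropped; the file name and its one-code-per-line format are assumptions.

# Step 2 of the numeric-code extraction: keep only candidates that are
# actual zip codes, which removes most false positives.
import re

def load_zip_codes(path="zipcodes.txt"):
    # one valid zip code per line (assumed file format)
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def extract_zip_codes(text, valid_zips):
    candidates = re.findall(r"\b\d{5}\b", text)
    return [c for c in candidates if c in valid_zips]

# usage: extract_zip_codes("Philipps-Universitaet, 35032 Marburg", load_zip_codes())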
Abbreviations
Many terms that occur in compound city names are frequently abbreviated,
and there are often several abbreviations for the same term. The term Sankt
(Saint), for example, can be found abbreviated to St. or Skt. To treat abbreviations
correctly, we kept a table abbreviations that maps every abbreviation
to its full-length term. We then replaced all abbreviations in a page
before parsing it.
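A sketch of this replacement step; the table entries shown (including an English exonym, see the paragraph on Denglisch below) are examples only.

# Expand abbreviations before parsing, so that "St. Goar" and "Sankt Goar"
# are treated identically. The same table can also hold English city names.
import re

abbreviations = {"st.": "sankt", "skt.": "sankt", "munich": "münchen"}

def expand_abbreviations(text):
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(a) for a in abbreviations) + ")",
                         re.IGNORECASE)
    return pattern.sub(lambda m: abbreviations[m.group(0).lower()], text)

print(expand_abbreviations("Hotels in St. Goar"))   # -> "Hotels in sankt Goar"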
Denglisch
Germans freely mix English and German terms in their communications.
By its critics, this phenomenon has been called Denglisch, a mixture of
Deutsch and Englisch, German and English. Some major German
cities, such as München and Köln, have different names in English,
namely Munich and Cologne. For this reason, both languages have to be
treated equally and in parallel when analyzing German web pages. To map
the English terms to the correct city, we simply replace them with their
German counterpart whenever we encounter them. This process basically
came for free, since all we had to do was include the English terms in the
abbreviations table.
The problem is where the author intended a term to start and where to
end. For a human, this problem does not seem too hard, but can we
formalize it so that we can compute it efficiently?
Let us rephrase the problem: given a substring that we think refers to
a town name, how can we find out whether that is what the author meant to
express?
In the first case, the problem was easy: the terms were separated by a
non-letter character. We call this a strong delimiter. Strong delimiters are
all non-letter characters, including numbers, and the end of the string.
The term Frankfurt in the first example has two strong delimiters, a ”-” on
the left and an end-of-string on the right.
In the second example, it has only one strong delimiter, to the right.
The term kulmbach in the third example has two strong end-of-string delimiters,
while ulm has none.
In the fourth example, trier has no strong delimiters.
Detecting a term correctly is easy if it has two strong delimiters. But
what about the second example, where there is only one strong delimiter?
We think we can tell that frankfurt is a correctly recognized term, because
the string to its left contains another (non-geographic) term, fitness. We call
this a weak delimiter. Weak delimiters consist of all other geographic terms,
names of people and products, and common German or English words.
The term ulm in the third example has one weak delimiter, to its right 7.
The term trier also has a weak delimiter, to the right 8.
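The following sketch makes the delimiter rule concrete for a candidate substring: each end must be bounded either by a strong delimiter (a non-letter character or the string boundary) or by a weak delimiter (an adjacent known word). The small dictionary of known words is an assumption.

# Check whether a candidate town name inside a longer string (e.g. a URL
# component like "fitnessfrankfurt") is plausibly a term of its own.
known_words = {"fitness", "kasse", "bach", "frankfurt", "ulm", "trier", "kulmbach"}

def is_strong_delimiter(ch):
    return ch is None or not ch.isalpha()       # string boundary or non-letter

def has_weak_delimiter(s, start, end, left):
    # is the remaining text on that side itself a known word?
    rest = s[:start] if left else s[end:]
    return rest in known_words

def candidate_ok(s, start, end):
    left_ch  = s[start - 1] if start > 0 else None
    right_ch = s[end] if end < len(s) else None
    left_ok  = is_strong_delimiter(left_ch)  or has_weak_delimiter(s, start, end, True)
    right_ok = is_strong_delimiter(right_ch) or has_weak_delimiter(s, start, end, False)
    return left_ok and right_ok

s = "fitnessfrankfurt"
print(candidate_ok(s, s.find("frankfurt"), len(s)))   # True: weak + strong delimiter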
Parsing for geographic terms in URLs works as follows. We split the
URL into six sections:
• The TLD, which we ignore, since we only treat .de domains
Stemming
Stemming or conflation, the reduction of words to their roots, can be a great
help or a great burden, as in any other IR application. On the one hand, one
7 Bach is German for stream
8 Kasse is German for register
Every town t is associated with an ordered set terms of strong and weak
terms. We will use dot-notation to refer to a town's terms, so t.terms is
the set of terms associated with t. If the town is a village (t ∈ Villages),
it is associated with a city, t.city. If the town is a city (t ∈ Cities), it is
associated with a set of villages, t.villages. Terms are categorized as being
strong (∈ strong) or weak (∈ weak), according to Section 3.6.3. We can split
d.T according to this characterization and get d.S and d.W . Documents have
an importance ranking d.ir, similar to the rank in ordinary search engines.
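For reference, this notation can be mirrored in a small data model; the following is a sketch of the structures only, not the prototype's actual classes.

# Data model mirroring the dot-notation used in the text: towns with their
# ordered term lists, villages pointing to their city, cities to their
# villages, and documents with term sets and an importance ranking.
from dataclasses import dataclass, field

@dataclass
class Town:
    name: str
    terms: list             # ordered: main (strong) term first, descriptive later
    city: "Town" = None     # set if this town is a village
    villages: list = field(default_factory=list)   # set if this town is a city

@dataclass
class Document:
    url_id: int
    T: set                  # all geographic terms found on the page
    ir: float = 0.0         # importance ranking

    def split_terms(self, strong):
        S = {t for t in self.T if t in strong}      # d.S
        W = self.T - S                              # d.W
        return S, W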
At first, this problem does not seem so hard: matching a muenchen to the city
of München. The problems arise because the relationship between terms
and towns is n : m, not just n : 1. Strong terms can appear among the main
(early) or descriptive (later) terms of several cities.
There are several measures for how well the name of a town can be
matched from a set of terms. The most basic one simply counts the number of
terms that can be matched. If a town t's name consists of three terms t1, t2
and t3, so that t.terms = {t1, t2, t3}, and S = {t1, t2}, the town would be matched
with two terms, or 66%. This is just one of many measures and we will see
several others later. For now, this method provides a feeling for how to
measure a match.
A first naive algorithm would write out all towns t where t.terms contains
at least one strong term s from S. Since we will be assigning certitudes
to each detected town later, we could simply compute the certitude from the measure
9 marburger stadium is German for Marburg's stadium
for how well a town was matched. Poor matches would not receive much
certitude and would thus not show up prominently in the results.
t6 t6.terms = {waake, goettingen}, another small city with the same
characteristics.
Clearly the author of d did not mean to write about all these Göttingens.
• Most likely, she wanted to write about t1, the large city of Göttingen
in the state of Niedersachsen (Lower Saxony).
• The chance that she meant to write about any of the villages t7 - t9
is very small, since these are all tiny villages.
BB-First
BB-First tries to unite the completeness of a match with the exclusive intention
behind using a town's name. It does so by allowing every strong term
to cause at most one resulting town; hence it is exclusive. Additionally,
the town that a term matches to is determined by the quality of the match. In
addition, we try to follow our intuition and rather match a larger town than
a smaller town.
BB-First requires the towns to be sorted into categories according to their
size. It is actually an interesting question how to differentiate between large
and small towns. There are three basic measurements:
• One can also classify towns by the number of web sites that reside
in them. This measure can compare towns from a heterogeneous background. For
small towns, however, a single user who has registered several hundred
domains can introduce a significant misperception about the size and
importance of a town.
This measurement is easy to compute from whois records.
can easily be adapted to any level of hierarchy, inferred from various sources.
The idea behind BB-First is that, from a given set of strong terms, we
see what towns of the highest (biggest) category we can match. We then
write out the best-matched town and remove all its terms
from the set of strong terms. We run this step for as long as it produces
results. When it stops, we take the remaining strong terms and start the
algorithm over, trying to match towns from the next lower level.
This is what BB-First looks like for our application, with the two levels of
cities and villages:
The algorithm is based on two hypotheses:
BIG-IS-FULL There are more web pages about ”bigger” places. There are
more people, more companies, and more companies on the Internet in
larger cities than in smaller cities.
BIG-IS-DENSE The web pages about ”bigger” places usually rank higher.
Since the Internet is denser in larger places and large online companies
with high-ranking sites reside in larger cities, sites from a larger city
are on average expected to rank higher than those from small cities.
1  C := S                                          (a clone)
2  R := ∅                                          (the result initialisation)
3  E := {t ∈ Cities | t.terms ∩ S ≠ ∅}             (all possible new cities)
4  ∀e ∈ E : e.match(W ∪ C)                         (match them with all terms)
5  I := {i ∈ E | ∀e ∈ E : i.score ≥ e.score}       (those that match best)
6  S := S − ⋃_{i∈I} i.terms                        (decrease S)
7  R := R ∪ I                                      (add to result)
8  if (E ≠ ∅) go to line 3                         (repeat until no new cities)
9  else: proceed at next line
10 E := {t ∈ Villages | t.terms ∩ S ≠ ∅}           (repeat everything for villages)
11 ∀e ∈ E : e.match(W ∪ C)
12 I := {i ∈ E | ∀e ∈ E : i.score ≥ e.score}
13 S := S − ⋃_{i∈I} i.terms
14 R := R ∪ I
15 if (E ≠ ∅) go to line 10
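A compact rendering of these fifteen lines might look as follows; the match and score interface is sketched rather than taken from the prototype.

# BB-First: repeatedly match the best-scoring towns of one level against the
# remaining strong terms, first for cities, then for villages. `score(town,
# terms)` stands for the match-quality measure discussed in the text.
def bb_first(S, W, cities, villages, score):
    S = set(S)                       # work on a copy of the strong terms
    C = set(S)                       # the clone kept for matching
    R = []                           # result: matched towns
    for level in (cities, villages):
        while True:
            E = [t for t in level if set(t.terms) & S]
            if not E:
                break
            best = max(score(e, W | C) for e in E)
            I = [e for e in E if score(e, W | C) == best]
            for town in I:
                S -= set(town.terms)          # each strong term yields one town
            R.extend(I)
    return R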
BB-first to determine which of all possible towns has been matched best. We
will first describe the different measures that indicate the quality of a match,
before we show how to integrate them into a single measure. None of these
measures works on its own; we will discuss in which cases they fail and in
which they succeed.
match-count One method is to simply iterate over the matched entries of a
match and see which are not empty. We thereby count all terms that
have somehow been matched. This measure is rather intuitive, but
fails in certain cases.
If we have two towns a = (t1) and b = (t1, t2) and find that a document
d contains d.T = {t1}, clearly town a would be the better match,
even though a.match-count = b.match-count = 1.
If the towns were g = (t1, t2) and h = (t2, t1), g would be the better
match, even though g.match-count = h.match-count = 1. The
reason is that, most likely, we matched the main term in g and only a
descriptive term in h.
match-fraction This measure works similarly to the previous one, but divides
the result by the number of terms: match-fraction := match-count ÷
town.terms.length. It succeeds for the case of a = (t1) and b =
(t1, t2) if d.T = {t1}, but fails in case d.T = {t1, t2}, since we would
think that b is a longer, hence better, match, while the measure returns
a.match-fraction = b.match-fraction = 1. In the case of g =
(t1, t2) and h = (t2, t1), it also fails to give a preference to g.
match-first This measure is quite different from the above. It simply
iterates over the matched entries and looks for the first non-empty
one. This measure gives preference when a main term is matched, in
contrast to only a descriptive term being matched. Given d.T = {t1},
it will prefer g = (t1, t2) over h = (t2, t1). It will not be able to prefer
a = (t1) over b = (t1, t2), since a.match-first = b.match-first.
match-first-strong This measure is identical to the previous one, with the
only addition that the first matched term also has to be strong.
This measure makes sure that g = (s1, t2) gets preferred over h =
(w1, s2, t1), with s1 being strong and w1 being weak. The reason
behind this measure are predicates. As we said, these are usually
weak, in contrast to the main terms, which are often strong.
zip-match So far we have only looked at matching town names, but we
were also parsing for zip-codes. If we find a zip code for a town, clearly
this town makes a better match than another town with the same name
but a different zip-code.
term-count None of the above measures have taken into account, how
many times a term has been detected on a page. Assume d.T = {t1 :
As we already pointed out, none of these measures works on its own, and
even this list of measures is not complete. The task of determining a better
match, for resolving possible ties in the matching of terms and towns, is not so
hard. The goal is not a perfect match, but a good match that can
be computed reasonably fast and gets at least the most common cases right.
We created a very simple measure, called a matchtown's score, by mapping
each of the above measures to an integer interval 0 - 9 according to
its ”goodness” and multiplying the results. This score function proved
satisfactory.
It had to be ”manually” adapted for the case of Frankfurt Oder and Frankfurt
Main, two cities with the latter being several times more important to
Germany's economy and Internet infrastructure than the former. If in doubt,
finding nothing but a Frankfurt on a page, we wanted to choose the
latter. We therefore added a value of 1 to every score if the city's terms
contained the term Main.
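As a sketch, the combined score can be computed by clamping each individual measure to the range 0 - 9 and multiplying, with the manual bonus added on top; the prior scaling of each measure into that range is assumed to have happened elsewhere.

# Combine the individual match measures into one score: each measure is
# mapped to an integer 0..9 and the results are multiplied. A value of 1
# is added if the town's terms contain "main" (the Frankfurt tie-breaker).
def clamp09(x):
    return max(0, min(9, int(round(x))))

def combined_score(town, measures):
    # `measures` is a dict of measure values already scaled to roughly 0..9,
    # e.g. {"match-count": 2, "match-fraction": 0.66 * 9, "zip-match": 9}
    score = 1
    for value in measures.values():
        score *= clamp09(value)
    if "main" in town.terms:
        score += 1
    return score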
By similarly checking BB-first against pathological but reasonably common
cases, we found we had to modify it further.
BB-First+
The above simple version of BB-first strictly prefers cities over villages, by
matching against all possible cities first and only thereafter matching against
villages. The idea was that, according to BIG-IS-FULL, chances are much
higher that a web page points to a city than to a village. This situation however
changes if we have already found hints to the city that is associated with
the village. The hypothesis behind this observation is
CLUSTERS-LIKELIER This assumption is that if a page points to a town
t and contains another vague hint to either town a or town b, it most
likely points to a if a is significantly closer to t than b.
This observation already contributed to the nearby-town-count measure
and will once again be simplified to the city - villages relation.
The following real-world example illustrates the need for the adaptation
of BB-first that we will introduce below. Let us assume a page d with d.T =
{ frankfurt, sachsenhausen }. There are plenty of possibly matching towns,
but by applying the above score method, we can narrow
them down to the following more interesting ones:
c1 c1.terms = {frankfurt, main}, a large city.
c2 c2.terms = {frankfurt, oder}, a medium-size city.
c3 c3.terms = {sachsenhausen, weimar}, a small city.
v1 v1.terms = {sachsenhausen}, a borough of c1.
Notice the relation between c1 and v1. Everybody familiar with German
geography would, just from looking at { frankfurt, sachsenhausen }, infer that
the author meant to write about v1 and possibly about c1, and certainly
about none of the other towns. BB-first however would behave quite differently:
it would first match c1, which because of the partner-town-count receives
the highest score. Since it removes frankfurt from S, the term sachsenhausen
would be the only term left to match and would naturally match to
the city c3.
Taking a simplified version of CLUSTERS-LIKELIER into account, we
can modify the algorithm so that it produces a much better looking result. After
every extracted city c, we check whether we can find an associated village v from
c.villages for which there is a strong match. If we do, we include it in the
result and remove its terms from S. This algorithm is called BB-first+ and
can be seen in Table 3.9.
So far, our focus has been on matches between textual terms and towns,
but we have overlooked numeric terms, such as area codes or zip codes. Area
codes originate in a different data set and will be treated separately. First, we
will have to see how to adapt BB-first one more time, to allow for matching
that emphasizes zip codes.
BB-First++
As pointed out in Section 3.6.2, zip codes are a particularly strong hint for a
geographic location. In fact, they are so much stronger than textual terms
that we need to treat them separately, before proceeding with the textual
terms. We will adapt BB-first+ so that it first treats the set Z of zip codes found
in a document, in BB-first style, and then treats the textual terms
in BB-first+ style. Again, we have to respect the CLUSTERS-LIKELIER
hypothesis. We have to ensure that an already detected town's partners
receive preferred treatment. Partners of cities and villages are defined as:
• a town that has been extracted from the page’s domain’s whois entry
In the next section, we will show how to combine this information to infer
new information about one page from related pages. The whois entries will
also be integrated at this later stage. For now, we will focus on the first two
records that are stored in this text format:
url-id (city-name longitude latitude certitude)*
We have not described the certitude yet. It is a measure for how strongly
we believe that a page holds information about a town. It is computed as
a rule-based combination of measures similar to those in Section 3.6.6. It
ranges from 1 for the weakest to 255 for the strongest.
In a first step, we want to transform this into our coordinate system. We
also want to get rid of the town names, since they are of no further importance
at this point. As pointed out in Section 3.6.1, we want to use a 1024 × 1024
grid to store the footprint. We will create two initial footprints for each
page, one for the page-based record and one for the URL-based record. So
we transform each of the above entries from our text files into new entries in
another text file in the following format:
url-id (x-id y-id certitude)*
where x and y both range in the interval 0 - 1023.
When transforming a town's coordinates to our coordinate system, it
can happen that two towns map onto the same tile. In this case,
the tile's certitude is the sum over the certitudes of all towns that map to
this tile. We also make a distinction between cities and villages. In the case
of a village, the certitude of the village is only added to the tile (x, y) that
it maps to. In the case of a city, additionally 50% of its certitude is added
to the tiles (x + 1, y), (x − 1, y), (x, y + 1) and (x, y − 1), as well as 25%
to (x + 1, y + 1), (x − 1, y − 1), (x − 1, y + 1) and (x + 1, y − 1). We
make this distinction, as illustrated in Figure 3.1, because cities are usually
geographically more spread out, as well as having a bigger region that they
draw clients from.
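A sketch of this grid construction, including the smearing of a city's certitude into its eight neighboring tiles; the conversion of town coordinates into the 1024 × 1024 grid is not shown.

# Build an initial footprint on the 1024x1024 grid. Villages contribute
# only to their own tile; cities additionally add 50% of their certitude
# to the four edge neighbors and 25% to the four diagonal neighbors.
from collections import defaultdict

GRID = 1024

def add_town(footprint, x, y, certitude, is_city):
    footprint[(x, y)] += certitude
    if is_city:
        for dx, dy, share in [(1, 0, .5), (-1, 0, .5), (0, 1, .5), (0, -1, .5),
                              (1, 1, .25), (-1, -1, .25), (-1, 1, .25), (1, -1, .25)]:
            nx, ny = x + dx, y + dy
            if 0 <= nx < GRID and 0 <= ny < GRID:
                footprint[(nx, ny)] += share * certitude

footprint = defaultdict(float)
add_town(footprint, 512, 400, certitude=200, is_city=True)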
For the next and final step in building this exploratory prototype,
post processing and query processing, we will need a smaller and faster
implementation of a geographic footprint. It will need to support:
• A lot of neighboring tiles will have equal or at least similar values for
the certitude.
int getSize() A function that returns the size of the memory that the
footprint currently occupies. This function is needed for writing the
footprint out to disk and for optimizing memory usage.
1  C := S
2  R := ∅
3  E := {t ∈ Cities | t.terms ∩ S ≠ ∅}
4  ∀e ∈ E : e.match(W ∪ C)
5  I := {i ∈ E | ∀e ∈ E : i.score ≥ e.score} ∪ {v ∈ ⋃_{i∈I} i.villages | v.terms ∩ C ≠ ∅}    (add qualified villages)
6  S := S − ⋃_{i∈I} i.terms
7  R := R ∪ I
8  if (E ≠ ∅) go to line 3
9  else: proceed at next line
10 E := {t ∈ Villages | t.terms ∩ S ≠ ∅}
11 ∀e ∈ E : e.match(W ∪ C)
12 I := {i ∈ E | ∀e ∈ E : i.score ≥ e.score}
13 S := S − ⋃_{i∈I} i.terms
14 R := R ∪ I
15 if (E ≠ ∅) go to line 10
1  C := S
2  R := ∅
3  E := {t ∈ Cities | t.zip-terms ∩ Z ≠ ∅}            (extract cities for zips)
4  ∀e ∈ E : e.match(W ∪ C ∪ Z)
5  I := {i ∈ E | ∀e ∈ E : i.score ≥ e.score}
6  S := S − ⋃_{i∈I} i.terms
6  Z := Z − ⋃_{i∈I} i.zip-terms
7  R := R ∪ I
8  if (E ≠ ∅) go to line 3
9  E := {t ∈ Villages | t.zip-terms ∩ Z ≠ ∅}          (extract villages for zips)
10 ∀e ∈ E : e.match(W ∪ C ∪ Z)
11 I := {i ∈ E | ∀e ∈ E : i.score ≥ e.score}
12 S := S − ⋃_{i∈I} i.terms
13 Z := Z − ⋃_{i∈I} i.zip-terms
14 R := R ∪ I
15 if (E ≠ ∅) go to line 9
16 I := {p ∈ ⋃_{r∈R} r.partner-towns | p.terms ∩ C ≠ ∅}    (handle partners for zips)
17 S := S − ⋃_{i∈I} i.terms
18 R := R ∪ I
19 E := {t ∈ Cities | t.terms ∩ S ≠ ∅}                (extract cities for terms)
20 ∀e ∈ E : e.match(W ∪ C)
21 I := {i ∈ E | ∀e ∈ E : i.score ≥ e.score} ∪ {v ∈ ⋃_{i∈I} i.villages | v.terms ∩ C ≠ ∅}    (add qualified villages)
22 S := S − ⋃_{i∈I} i.terms
23 R := R ∪ I
24 if (E ≠ ∅) go to line 19
25 else: proceed at next line
26 E := {t ∈ Villages | t.terms ∩ S ≠ ∅}              (extract villages from terms)
27 ∀e ∈ E : e.match(W ∪ C)
28 I := {i ∈ E | ∀e ∈ E : i.score ≥ e.score}
29 S := S − ⋃_{i∈I} i.terms
30 R := R ∪ I
31 if (E ≠ ∅) go to line 26
loadFP This function reads a line from a text file and initializes a new bitmap.
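To illustrate what such an interface could look like, here is a minimal sketch of a footprint that stores only non-empty tiles and offers getSize and loadFP; the parsed line format is the one introduced above (url-id followed by x, y, certitude triples), everything else is an assumption.

# Minimal in-memory footprint: only non-empty tiles are stored, exploiting
# the fact that most of the 1024x1024 grid is empty.
import sys

class Footprint:
    def __init__(self):
        self.tiles = {}                 # (x, y) -> certitude

    def add(self, x, y, certitude):
        self.tiles[(x, y)] = self.tiles.get((x, y), 0) + certitude

    def get_size(self):
        # rough size in bytes of the memory currently occupied
        return sys.getsizeof(self.tiles) + 3 * 8 * len(self.tiles)

    @classmethod
    def load_fp(cls, line):
        # line format: "url-id x y certitude x y certitude ..."
        fields = line.split()
        url_id, values = int(fields[0]), fields[1:]
        fp = cls()
        for i in range(0, len(values), 3):
            fp.add(int(values[i]), int(values[i + 1]), int(values[i + 2]))
        return url_id, fp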
unite all these hypotheses and call the sum of the propagation operations
we perform according to these hypotheses Intra. Also, we have not incorporated
the geographic information retrieved from whois entries so far.
The main issues in this section are:
Of course, we could run each step several times and execute them in any
order and still fulfill the two requirements. One has however to be
careful not to propagate information too freely; otherwise pages will inherit
so much geographic information over distant links that every page's footprint
will cover the entire nation.
Another problem with serial execution of propagation is the echo phenomenon.
It can occur across links as well as within sites, but we will look at the case
within sites only. Say a site has 1000 pages and only one of them, p, has
a non-empty geographic footprint from the URL and page analyses. After
the first site propagation, all other pages receive f times p's footprint. After
another site propagation, these footprints echo back, and p receives roughly 1000
times a dampened copy of its own footprint. We can repeat these steps several times
and the values in p's footprint grow indefinitely. There has however been no real reason for p's footprint
to grow much in value, since no new geographic information
has actually been discovered; we are only moving p's initial values around. Echoes
are unavoidable in serial processing of propagation, but should be kept to a
minimum. We therefore stick to the minimum propagation:
Let us next see how to perform Allinks. It is supposed to deal with the propagation
of geographic information along links, according to the RADIUS-ONE
and RADIUS-TWO hypotheses. We could determine the cases where we would
have to propagate according to either of these two hypotheses and perform
them independently. We would split Allinks into two steps: Radius-1
and Radius-2. The RADIUS-ONE hypothesis, as pointed out in Section
3.6.2, is symmetric, and Radius-1 needs to be split into two steps, Link
and Backlink, that propagate information along links and in the opposite
direction. If we wished to perform Radius-2 with a freely chosen dampening factor,
we would have to implement this step explicitly. If however we
are happy as long as a sufficient amount of information is propagated according
to the RADIUS-TWO hypothesis, we have another option. We can observe
that if we first perform Backlink and then Link, with a dampening factor
of f each time, we achieve the same effect as performing a true Radius-2
propagation with a dampening factor of f². Let us look at an example: say
pa links to both pb and pc. The RADIUS-TWO hypothesis tells us that we
would like to exchange geographic information between these two pages. If
we perform Backlink, pb's geographic footprint will find its way to pa. The
next step, Link, will propagate all of pa's footprint to pc, including the traces
of pb's footprint. Equally, after the two steps, pb will have inherited some of
pc's footprint.
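A sketch of this propagation order; the footprint arithmetic is kept abstract, and the dampening factor f is a free parameter.

# Propagation along links: Backlink first, then Link, each with dampening
# factor f. Two pages pb, pc that share a common predecessor pa thereby
# exchange information with an effective factor of f*f, i.e. a Radius-2 effect.
def add_scaled(target_fp, source_fp, f):
    for tile, value in source_fp.items():
        target_fp[tile] = target_fp.get(tile, 0) + f * value

def backlink(pages, links, f):
    # links: list of (source, target) page ids; pages: id -> footprint dict
    for src, dst in links:
        add_scaled(pages[src], pages[dst], f)    # target's footprint flows back

def link(pages, links, f):
    for src, dst in links:
        add_scaled(pages[dst], pages[src], f)    # source's footprint flows forward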
In our implementation, we adapted this procedure and decided on the following
propagation order:
footprints with a dampening factor that depends on how close the pages are.
If they are within the same directory, they inherit more from each other, and
if they are just in the same site, they inherit relatively little. The problem
was that from some domains we had crawled as many as 100,000 pages. If
we performed an O(n²) algorithm on all sites, it would literally have taken
several weeks, if not months. This was clearly too slow, and we were looking
for an algorithm that runs in O(n).
The solution was an aggregation tree over every domain. First, we
rewrote all URLs from the crawl in a hierarchic format, starting with the
largest entity:
domain subdomain host path file
Next, we sorted the URLs according to these columns, so that all pages
within the same domain were next to each other, all pages within the same
subdomain within that domain were next to each other, and so forth. Then
we built a five-level tree over all entries of each domain, in linear time. The
tree was built bottom up with these levels:
1. A leaf level for individual pages.
2. A level of nodes for all directories.
3. A level of nodes for all hosts.
4. A level of nodes for all subdomains.
5. A root node for the entire domain.
Each node holds a footprint that is the sum over all footprints of the pages
at the leaf level rooted at this node. Since the entries are already sorted,
each tree can be built in linear time. First only the bottom-level footprints
are initialized; all others are empty. Next, aggregates are computed and
assigned to the parent node on the level above. Once one
level is complete, we move to the next level up and repeat. Bottom
up, we fill all entries, right to the top of the tree.
In the next step, we propagate information back to the leaves, by pushing
it down in a top-to-bottom manner. Every node's footprint is added to its
children's footprints, always processing all nodes of one level at a time, of
course with some dampening factor f < 1. The footprint from the root, for
example, will find its way f⁴ times into a leaf node, while the footprint of a host-level
node will contribute with a dampening factor of f². This way, pages
that are closer to each other contribute more to each other than pages that
are situated more distantly within the site. After we have done this, the leaf
nodes contain their correct and final values, and we can simply delete the
upper levels of the tree and move on to the next domain.
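A simplified sketch of this aggregation over one domain: footprints are summed bottom-up along the URL hierarchy and then pushed back down with a dampening factor f per level; the footprint type is again a plain tile-to-certitude dictionary, and the four-component path tuple is an assumption.

# Aggregation tree over one domain, levels: root > subdomain > host >
# directory > page. Built over the sorted URL list, then pushed top-down.
from collections import defaultdict

def site_propagation(pages, f=0.3):
    # pages: dict mapping a path tuple (subdomain, host, directory, file)
    # to a footprint dict {tile: certitude}; footprints are updated in place.
    nodes = defaultdict(lambda: defaultdict(float))   # ancestor prefix -> footprint
    children = defaultdict(set)                       # prefix -> child prefixes / pages

    # bottom-up: aggregate each page into all of its ancestor nodes
    for path, fp in pages.items():
        for depth in range(4):
            prefix = path[:depth]
            children[prefix].add(path[:depth + 1])
            for tile, value in fp.items():
                nodes[prefix][tile] += value

    # top-down: push each node's footprint to its children, one level at a time
    frontier = [()]                                   # start at the domain root
    while frontier:
        next_frontier = []
        for prefix in frontier:
            for child in children.get(prefix, ()):
                target = pages[child] if child in pages else nodes[child]
                for tile, value in nodes[prefix].items():
                    target[tile] = target.get(tile, 0) + f * value
                if child in children:
                    next_frontier.append(child)
        frontier = next_frontier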
So far we have yet to bring in the whois information. We could simply
add it to every page, but since we already build the aggregation tree described
above, we can simply add it to the root and then push it down with
Selection step In this step, the answer set is generated, mainly by inter-
secting entries from the inverted index.
Ranking step In this step, the answer set, as created in the previous step,
is ranked.
The first two steps will exclusively rely on information in main memory,
while only the last step may have to retrieve information from secondary
storage.
Query processing tries to return tuples as fast as possible, while having
to deal with a limited main memory. The general guideline is to shed tuples
that will not qualify as answers as early as possible.
Search engine query processing can, in contrast to database query processing,
make use of two search hypotheses, making the process a lot
easier. According to the SHALLOW-PENETRATION hypothesis, the user
will only see the top s answers, at most. The query therefore does not have
to be computed completely. As soon as it becomes obvious that an answer
will not make it into the top s, it can immediately be discarded.
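A sketch of exploiting this hypothesis: candidates are streamed through a bounded min-heap of size s, so anything that cannot enter the top s is dropped immediately; the scoring function and the candidate ids are assumptions.

# Keep only the top-s answers while scanning candidates (e.g. url-ids);
# anything that cannot enter the top s is discarded immediately.
import heapq

def top_s(candidates, s, score):
    heap = []                                    # min-heap of (score, candidate)
    for cand in candidates:
        sc = score(cand)
        if len(heap) < s:
            heapq.heappush(heap, (sc, cand))
        elif sc > heap[0][0]:
            heapq.heapreplace(heap, (sc, cand))  # evict the currently worst
        # otherwise the candidate is discarded right away
    return sorted(heap, reverse=True)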
According to the BATCH hypothesis, the user will first only see the first
b answers, read them and then proceed to the next batch. The speed of
query processing is therefore to be measured by how fast the first batch of
answers is produced. Since human readers are rather slow, there will be
plenty of time to compute the later batches. We tried to make the greatest
use of these hypotheses and designed the following query processing:
For every web page we kept a geographic footprint in main memory.
Since not all footprints would fit in main memory at the same time in their
original size, we called simplify on all geographic footprints until they did.
These footprints are of course only approximations of the actual geographic
footprints, which we keep in secondary storage. We therefore give them a new
name and call them m-prints.
Every query a user posts consists of two parts:
• A geographic position.
Zones The set of answers can be grouped into zones, according to their distance
to the query position. All entries that fall into the specified zone
qualify for output. The user may jump between zones of different sizes
after every batch of answers seen. This approach basically pushes the
steering into the selection step.
An actual balance The balance between the importance ranking and the
distance ranking can be readjusted after every batch the user has seen.
This approach basically leaves the steering in the ranking step.
The user does not actually have to be aware of which technique, or mix of the
two, is used.
There are various possible interfaces for both approaches, as shown in
Figure 3.2, Figure 3.3 and Figure 3.4.
Either way, the user gets to dynamically change the balance between
importance and distance. She can continue a query that returns not so
great results and steer it in a better direction, without having to start the
query over or having to see any answer twice. The iceberg is showing its
first crack. In addition, we expect web spam to become a lot harder, since
there are more dimensions the spammer will have to cover.
3.9.1 Google
The Search by location [inc03] by Google [Goo] is the most prominent prototype
of a geographic search engine. However, it does seem to limit itself to
pages from sites of companies that are listed in a business directory. Thus, it
Figure 3.2: One possible interface for interactive geographic search would
simply show the user two alternatives to the ”next” button: one to shift
the balance towards importance, the other towards distance. This approach
would work for either zones or a balanced ranking.
Figure 3.3: This interface would allow the user to directly change the balance
between importance and distance. It allows faster changes, but cannot be
realized in standard HTML. It can also be used either for zones or a balanced
ranking.
Figure 3.4: Another interface could allow the user to directly skip between
different zones. These could be measured in kilometers or conceptualize
space similarly to humans, like the one in this figure.
resembles a ”search inside the yellow pages” more than an actual geographic
search engine searching an entire web crawl.
After the user has entered a position and the key words, she is presented
with the addresses of several businesses in this area that carry related
information on their web sites. The user can then click through
to the actual website. Screenshots of these three steps and the user interface
are provided in Figure 3.5, Figure 3.6 and Figure 3.7.
It must be noted that Google's ordinary search engine allows narrowing a
search to sites from a particular country, a very broad, but still
geographic, search. It can be observed that the results all come from domains
that are either registered under the country's top-level domain, such as .de,
or contain a reference to the country somewhere in their whois entry.
3.9.3 www.search.ch
This commercial search engine focuses on Switzerland. It allows limiting
the search to one of Switzerland's cantons, which are equivalent to states or
counties. This is rather broad and not very flexible, as is a fixed grid's
nature. A user living on the border between two cantons will still have to
issue several queries. Nonetheless, in contrast to the previous prototypes,
this is a real geographic search engine that treats pages independently of
their web master's association with a business directory. Figure 3.11 shows
the search engine's interface with the option to specify a canton such as
Zürich. [Räb]
3.9.4 www.umkreisfinder.de
This German search engine specializes in geographic search. It does not,
however, extract geographic information, but relies solely on entries from
the Open Directory Service [Opea], where web sites are categorized by geography
as well as by subject. In comparison to the total Internet, the Open
Directory Service is quite sparse, and umkreisfinder is therefore very limited.
It seems to crawl the pages of sites that it knows a position for and
assigns the site's position to all pages that belong to it, implicitly making
use of the INTRA-SITE hypothesis. Umkreisfinder adds geographic markup
to its pages, as discussed in Section 3.6.2 and shown in Table 3.6.
Figure 3.5: Entering a position and some key words into Google’s interface.
Figure 3.6: Google shows relevant addresses from some sort of a business
directory.
Figure 3.7: After a click on a company’s address, Google finally shows rele-
vant web pages.
Figure 3.8: The initial interface of Yahoo!'s local search, where the user enters
an address and some key words.
Figure 3.10: Yahoo! shows the company's address, a map of the surroundings
and a link to the company's web site.
3.9.5 GeoSearch
This academic prototype of a geographic search engine [Gra03] searches
articles from 300 online newspapers such as www.nytimes.com. The user
interface is rather simple, allowing the user to enter the key words and a
zip code. Similar to a geographic footprint, the authors define a so-called
geographic scope of a web page. This concept is described in more detail in
Section 3.6.2. There are web sites of different geographic scopes: some, like the
New York Times 10, cover the entire US, others, like The Digital Missourian 11,
cover only a state, and some even cover only a town. The search engine's answers
to a query contain an icon next to every entry that shows the entry's geographic
scope. Figure 3.14 shows icons for two typical geographic scopes.
For a web page to show up in the answers to a query, it needs to contain
the key term, and its web site's geographic footprint needs to intersect the
position the user entered. Documents from sites with a national geographic
scope always fulfill the latter requirement. The importance ranking on the
final set of answers is then recomputed. There is however a major misconception
regarding the geographic scopes. Let us say a user is searching for pizza for
90210 (Beverly Hills, CA). The top answers all come from www.nytimes.com.
Even though this site's geographic scope covers the entire nation and it is supposed
to be an authority on pizza anywhere in the country, this is probably
not what the user was looking for. Geographic search engines are used to
find locally important documents, not globally important documents. If
the user wanted to find the latter, she would use a traditional search engine.
3.9.6 geotags.com
This prototype of a search engine [Dav99] relies on web pages being augmented
with geographic markup tags from [Dav01], as described in Section
3.6.2. It requires authors of web documents to implement these tags
and register them with the search engine. The engine will then crawl these
pages and index them. Due to the low commercial impact and the need to
register, the index is very sparse, empty for most regions. The interface,
as shown in Figure 3.15, allows the user to zoom to the desired position,
a somewhat time-consuming task. Alternatively, the user can store her
position in a cookie.
3.10 Impact
Geographic search may well prove to be the next generation in search technology
and dramatically reshape the landscape of today's media market. It will
shift traffic and revenue flows of e-commerce and reshape today's e-economy.
Geographic search might prove to be the killer application for broadband
Internet over cellular phones. The main applications to be launched
over broadband cellular phones, which usually integrate a PDA's functionality,
are centered around the user's position and are therefore called location based
services. The user's position can be determined from the cell she roams in,
or from an external GPS device. A typical application, as envisioned today,
would allow a user to search for the nearest car repair shop from a database
backed by the manufacturer. A useful service, especially if there is trouble
with the car in a distant region, but not a service we could not live without. A
geographic search engine however, which allows searching for anything, not
just car dealers, might just be the long-sought killer application of location
based services. It would allow searching for any given key words, including
a nearby ”car repair”, without restricting results to shops that are licensed by
the manufacturer.
Even if just installed on a local PC, geographic search engines' impact
might be dramatic. They will move traffic from popular web sites to smaller
(local) sites as well as reshape the entire advertisement market.
Under current conditions, the Internet poses an extreme ”winner takes all”
situation when it comes to traffic distribution. A few of the largest corporate
web sites, such as www.amazon.com, receive most of the traffic, while many
smaller companies with a more local focus, such as www.fahrradschmiede.de,
are happy about every single user that visits. Underlying this extremely unequal
distribution is a vicious cycle, where popular sites receive more
links, therefore more importance ranking in search engines, therefore more
Figure 3.16: The Spirit Project plans on recognizing a hand-drawn sketch
such as this one, to determine the region the user is interested in.
Other small stores without mail order, especially restaurants and services,
will not advertise in this advertising scheme. No local pizza parlor will ever
pay to have its ad shown every time a user enters pizza. Geographic search
engines would enable ad placement agencies to offer localized ad placement,
where an advertiser does not only specify the key words for which she wants to
advertise, but also a region within which she wants to advertise. Only if a user
types in the key words as well as a position within the specified region
will she be shown the advertiser's ad. This feature will reshape the type of
companies that place ads with the ad placement agency. Now it makes
sense for companies with a local focus to advertise online. The pizza parlor
from above will want to advertise every time a user enters pizza and queries
for a position within a radius of around 500 meters around the pizza shop.
The low cost and efficiency of online advertisement will move large amounts
of advertising budgets from print media, mainly from photocopied leaflets
in the case of our pizza parlor, to online advertisement.
Chapter 4
Geographic Data Mining
One of the initial ideas behind this project was to explore the possibilities
of geographic web mining. For one thing, it turned out that building the
geographic search engine was a larger task than initially thought. For another,
while developing the search engine, we learned a lot about the underlying
problems of geographic web mining, which made the latter look even harder.
Nonetheless, we will provide a brief overview of basic web mining and its
potential for a geographic extension.
Web mining has been around for a while, sometimes under an alias like
”market research”. The basic underlying assumption is that the Internet
always somehow reflects society. Studying the web can therefore to some
degree replace direct social research.
Done correctly, it can be cheaper and faster than traditional methods like
door-to-door or telephone surveys. Web mining allows the collection of information
that would otherwise not be available or would be expensive to create. Once
a web crawl has been collected and processed, results for mining queries can
be returned within hours, in contrast to the weeks a door-to-door survey
would take. In addition, its costs are somewhat constant (for operating the
infrastructure) and not linear in the number of queries, like door-to-door
surveys. For frequent use, a web mining system might therefore even prove
to be cheaper.
There are however some limits to web mining. Its view of society is always
somewhat distorted and subject to a time shift. Tips on Java programming,
for example, are more common on the web than knitting patterns, even
though in real life there might be more people knitting than programming.
Similarly, it may take an event in real life days or weeks to find its way onto
the Internet. Or the Internet news might precede the actual event, such as
when some president publicly ponders how to get people to the planet
Mars. These factors must be taken into account. It must however be noted
that traditional surveys are not without their flaws either, and their results
always undergo heavy interpretation.
The web has also become a part of the real world, with real web phenomena
to be studied and real money to be made. Here, web mining is a prime
source of information.
Web mining is already widely implemented, but not often advertised. It is
already a serious business, while still awaiting its big take-off. Large companies,
for example, use it to detect pressure groups before these make it to the
press, or try to detect new trends in youth culture. IBM, for example, offers
these services to customers as part of their Web Fountain project [IBM].
The extension of web mining to geographic features comes rather naturally.
A piece of German business wisdom states that ”all business is local”.
Geographic components are therefore a big improvement to web mining. The
geographic component can be used in web mining mainly for two reasons:
IBM's Web Fountain is already heading in this direction [LAH+ 04], and companies
like MetaCarta [Met02] offer geographic web mining to commercial
and governmental clients. An early predecessor of geographic web mining
focused on whois entries instead of actual web pages. Papers like
[Zoo01], [Kry00] or [SK01] mainly used this technique to reason about the
productivity of different regions during the boom of the new economy.
At first, geographic web mining seems rather straightforward. Web
mining, like any data mining, is inherently multidimensional, and the addition
of two more dimensions should pose no problem. The adaptation of
other techniques, such as the query techniques of [RGM03], should be straightforward.
The authors of [MAHM03] have shown a first infrastructure, and
many techniques for geographic data mining, such as finding geographic correlations
between incidents [PF02], can be directly applied. There is however
one major flaw, hidden in the data.
The geocoding underlying all geographic web mining is not yet at a level
that produces meaningful results. During geocoding, especially during
geomatching as described in Section 3.6.6, when resolving ambiguities, we
made many decisions based on the assumption that a page mentioning Göttingen
intends to talk about one of the towns with that name, not all of them. We had
argued that in these cases the bigger town should be matched exclusively,
and used the ABUNDANCE hypothesis to justify this decision. This however
was a hypothesis from a search engine background that has no justification
in a web mining environment. The exclusive use however is a phenomenon
that web mining should manage to take into account. Until these basic
decisions in geocoding are solved sufficiently, the data is simply too messy
to perform meaningful geographic web mining on it.
Chapter 5
Addendum
5.1 Thanks
I would like to thank my two advisors, Bernhard Seeger and Torsten Suel, and their
teams, for providing know-how, guidance and friendship throughout this thesis.
Thanks to Thomas Brinkhoff for co-authorship in earlier work about geographic
information retrieval.
I am in great debt to Yen Yu Chen and Xiaohui Long for their great spirit, and
cooperation in implementing our prototype.
Also thanks a lot to Jan Lapp, for support in implementing the earlier prototype
that relied entirely on whois entries.
Thanks a lot to Dimitris Papadias from Hong Kong University of Science and
Technology for inviting me to do my Ph.D. with him and for facilitating the appli-
cation. This certainly helped me ease my mind about the future and focus on this
thesis.
I would like to thank the Deutscher Akademischer Austauschdienst (DAAD) for some
financial support for the first three months of my stay in Brooklyn, and my parents
for generous support, covering whatever was lacking.
Thanks to Utku and Chin Chin in Brooklyn for keeping me smiling, and to
Sven Hahnemann for technical support before moving.
I owe a lot to Julinda Gllavata and Ermir Qeli in Marburg for sending me funnies,
and giving me support whenever it was needed. You are great friends.
Thanks to www.fahrradschmiede.de, Christoph and Mama Michel, for force-feeding
me cake and pasta all these years and to Brian and Tilly for cracking me up every
lunch break.
Thanks to Steffie, Niesch, Lolle and Tommy for being who you are.
Also thanks a lot to Horst Schmalz at www.edersee-tauchen.de, who called me every
single week for four months to check and make sure I was all right.
Thanks to the Krusas in Montauk, NY, for hosting me on my weekends and telling
me stories about getting lost inside sharks and being bitten by a 25 pound lobster.
Thanks to these and all other kind people. You were some great friends.
I explicitly do not want to thank the German DENIC for giving us no support
whatsoever. I truly believe that academic relations should not be handled by the
PR department, ever.
Also, no thanks to Lycos Europe and Web.DE, who did not even bother to reply
to any of our letters asking about possible cooperation. If more people were like
you, stories like ”Google” would never have happened.
Bibliography