
Hypertext-Matching Analysis: Revision Notes

Hypertext-matching analysis is the process of examining a hypertext document on the internet to extract key words and phrases.

This is the tactic that many search engines use when scanning and indexing pages for searching.
Google defines hypertext-matching analysis as “analysing the full content of a page and factoring in fonts, subdivisions, and the precise location of each word”.

The key aspects of hypertext-matching analysis are:

•	Pattern matching in hypertext to analyse content.
•	Analysing page layout to determine key factors.
•	Getting content from different types of web pages.

Pattern Matching
The aim of pattern matching techniques is to quickly and efficiently extract information from hypertext documents.

Hypertext is usually modelled as a graph where each node is a string from a document and an edge between two nodes indicates that the text at the source node can be followed by the text at the target node. Each node can be split into a chain so that each resulting node holds just one character.

This problem was first considered in this way by Amir et al. They proved that, if edit operations can occur in the text, then the problem is NP-Complete. However, if edit operations can only occur in the pattern, then the problem can be solved in O(m(n log m + e)) time and O(mn) space.

Navarro improved both complexities to O(m(n + e)) time and O(n) space.
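
To make the graph model concrete, the sketch below (a minimal illustration, not Amir et al.’s or Navarro’s actual algorithm) performs exact matching over a character-node hypertext graph by propagating matched-prefix states along edges; the node/edge representation and the function name are assumptions made for this example. Approximate matching with edits confined to the pattern extends the same state-propagation idea with dynamic programming over (node, pattern position, errors) states.

    from collections import defaultdict, deque

    def hypertext_exact_match(nodes, edges, pattern):
        """Return the node ids at which an exact occurrence of `pattern` ends.

        nodes: dict mapping node id -> single character (the chain-split graph)
        edges: iterable of (source, target) node id pairs
        """
        if not pattern:
            return set()
        succ = defaultdict(list)
        for u, v in edges:
            succ[u].append(v)

        m = len(pattern)
        matches = set()      # nodes where a full occurrence of the pattern ends
        seen = set()         # (node, matched prefix length) states already explored
        queue = deque()

        # A match may start at any node whose character equals pattern[0].
        for v, ch in nodes.items():
            if ch == pattern[0]:
                seen.add((v, 1))
                queue.append((v, 1))

        # Propagate partial matches along edges until no new states appear.
        while queue:
            v, length = queue.popleft()
            if length == m:
                matches.add(v)
                continue
            for w in succ[v]:
                if nodes[w] == pattern[length] and (w, length + 1) not in seen:
                    seen.add((w, length + 1))
                    queue.append((w, length + 1))
        return matches

    # Tiny graph spelling "cat" along one branch: c -> a -> {t, r}
    nodes = {1: "c", 2: "a", 3: "t", 4: "r"}
    edges = [(1, 2), (2, 3), (2, 4)]
    print(hypertext_exact_match(nodes, edges, "cat"))   # {3}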

Page Layout
To effectively extract the key words and phrases in a document, the page layout has to be analysed to determine the relevance of a page in relation to a query. The metrics that Google and other search engines use to ‘extract importance’ for words and phrases in a document are a closely guarded secret. However, some of the most likely metrics can be inferred from tried and tested methods, Google’s patents and their guidelines.

In terms of keyword prominence there is a hierarchy of places that Google looks at and a weighting that it gives to each of these places, starting with the URL and working down from the title, to headings, to the body text itself. In the case of <title> and heading (<h1>, <h2>, <h3> etc.) tags, Google gives more prominence to words the closer they are to the start of the tag and usually only considers the first ~100 characters, to prevent keyword spamming.
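
As an illustration of how such a hierarchy might be weighted, the sketch below scores a keyword occurrence by field and by its position within the first ~100 characters; the field weights and the position decay are invented for this example and are not Google’s actual values.

    # Illustrative weights only - invented for this sketch, not Google's values.
    FIELD_WEIGHTS = {"url": 4.0, "title": 3.0, "heading": 2.0, "body": 1.0}
    MAX_TAG_CHARS = 100   # only the first ~100 characters of a tag are considered

    def keyword_prominence(keyword, field, text):
        """Score one occurrence of `keyword` in a given page field."""
        snippet = text[:MAX_TAG_CHARS].lower()
        pos = snippet.find(keyword.lower())
        if pos == -1:
            return 0.0
        # Words nearer the start of the tag score higher.
        position_factor = 1.0 - pos / MAX_TAG_CHARS
        return FIELD_WEIGHTS[field] * position_factor

    print(keyword_prominence("python", "title", "Python tutorials for beginners"))   # 3.0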

Within body text, keyword density is important, with 5-20% being a good measure - any more than this and you risk ‘keyword spamming’, something that could count against the page. Emphasis (bold, italics, etc.) is also given prominence, as is the location of a keyword towards the beginning or end of a document.
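
Keyword density is just the proportion of body-text words that are the keyword; a minimal calculation (with deliberately crude tokenisation) might look like this:

    import re

    def keyword_density(keyword, body_text):
        """Fraction of words in `body_text` equal to `keyword` (case-insensitive)."""
        words = re.findall(r"[a-z0-9']+", body_text.lower())
        if not words:
            return 0.0
        hits = sum(1 for w in words if w == keyword.lower())
        return hits / len(words)

    text = "Search engines index pages and search engines also rank pages."
    print(f"{keyword_density('pages', text):.0%}")   # 20% - within the 5-20% range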

Google would like to enforce strict standards compliance on websites and lists this as one of its recommended optimisation techniques. In practice, however, whilst well-formatted pages tend to expose their key content better, there is no actual evidence of Google using this as a metric when ranking pages. For this reason Google’s indexing techniques have to take into consideration font size, placement and page divisions to assert what is a heading, what is body text and what else is important on a page.
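
One plausible heuristic for asserting what is a heading (an assumption for illustration, not a documented Google rule) is to flag any block whose font size is well above the page’s typical body size:

    from statistics import median

    def classify_blocks(blocks, ratio=1.3):
        """Label each (text, font_size) block as 'heading' or 'body'.

        A block counts as a heading when its font size is at least `ratio`
        times the median page font size - a made-up threshold for this sketch.
        """
        body_size = median(size for _, size in blocks)
        return [(text, "heading" if size >= ratio * body_size else "body")
                for text, size in blocks]

    blocks = [("Revision Notes", 24), ("Hypertext is usually...", 11),
              ("Pattern Matching", 18), ("The aim of pattern...", 11),
              ("Navarro improved...", 11)]
    print(classify_blocks(blocks))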

For multiple-word queries, the ordering of a word with respect to the other key words in a document is stored and taken into consideration. A document with the same word ordering as the query is given prominence in the results. Similarly, keyword ‘stemming’ is used when matching with a query - attempting to match with similar words, e.g. stem -> stems, stemmed, stemmer, stemming…
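
A production stemmer (e.g. Porter’s algorithm) is considerably more involved; the toy suffix-stripper below only illustrates the idea of mapping related word forms onto a common stem:

    def crude_stem(word):
        """Toy suffix-stripping stemmer - illustrative only, not Porter's algorithm."""
        word = word.lower()
        for suffix in ("ing", "ed", "er", "s"):
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: -len(suffix)]
                break
        # Undouble a trailing consonant left behind by stripping (stemm -> stem).
        if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
            word = word[:-1]
        return word

    print({crude_stem(w) for w in ["stem", "stems", "stemmed", "stemmer", "stemming"]})
    # {'stem'}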

As well as positive keyword analysis techniques, Google also employs some negative techniques, penalising web pages that use their keywords in certain contexts. This is done both to extract the most important key words and, at the same time, to stop web sites from abusing Google’s policies.

Indexing Pages
When determining what pages to index, the main indexing challenges are:

•	Which hypertext do we use for matching analysis?
•	How do we access the hypertext on dynamic pages?

This second question is becoming more relevant as Web 2.0 continues to evolve, with content becoming more and more hidden behind web forms, JavaScript and so forth. This content is difficult to analyse because the robots processing information on the internet have no conventional way of accessing it. One method of solving this is ‘surfacing’, in which a number of interesting strings are pre-computed and entered into web forms to try to uncover hidden pages in the so-called ‘deep web’.
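
In outline, surfacing means submitting pre-computed candidate terms to a web form and harvesting whatever result links come back. The sketch below assumes a hypothetical form endpoint and parameter name and uses only the Python standard library; a real crawler would parse the HTML properly and respect robots.txt.

    import re
    import urllib.parse
    import urllib.request

    # Hypothetical form endpoint and parameter name - placeholders for this sketch.
    FORM_URL = "https://example.com/search"
    PARAM = "q"

    def surface(candidate_terms):
        """Submit each pre-computed term to the form and collect result links."""
        discovered = set()
        for term in candidate_terms:
            url = FORM_URL + "?" + urllib.parse.urlencode({PARAM: term})
            with urllib.request.urlopen(url) as response:
                html = response.read().decode("utf-8", errors="replace")
            # Crude link extraction - good enough to show the idea.
            discovered.update(re.findall(r'href="(https?://[^"]+)"', html))
        return discovered

    # e.g. surface(["used cars", "flights to Paris", "hotels in London"])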

Websites often tend to be composed of many small pages, for reasons such as readability and aesthetics. Google also analyses the content of neighbouring pages, and the keywords on those pages directly influence the key word analysis of the current page. Google says: “We also analyse the content of neighbouring web-pages to ensure the results returned are the most relevant to a user’s query”. By applying its page indexing techniques across a site, these websites aren’t penalised and key words can be analysed across a whole website. This works closely with the PageRank algorithm.
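
A rough way to picture this (the damping weight and data structures below are assumptions, not Google’s method) is to blend a page’s own keyword counts with a down-weighted sum of its neighbours’ counts:

    from collections import Counter

    def blended_keywords(page, links, keyword_counts, neighbour_weight=0.5):
        """Combine a page's keyword counts with down-weighted neighbour counts.

        page: id of the page being analysed
        links: dict page id -> list of neighbouring page ids
        keyword_counts: dict page id -> Counter of keyword occurrences
        neighbour_weight: made-up damping factor for this sketch
        """
        blended = Counter(keyword_counts[page])
        for neighbour in links.get(page, []):
            for word, count in keyword_counts[neighbour].items():
                blended[word] += neighbour_weight * count
        return blended

    counts = {"a.html": Counter({"hypertext": 3}),
              "b.html": Counter({"hypertext": 2, "matching": 4})}
    links = {"a.html": ["b.html"]}
    print(blended_keywords("a.html", links, counts))
    # Counter({'hypertext': 4.0, 'matching': 2.0})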

Chris Boyes (cb5353) | December 2008
