Sie sind auf Seite 1von 12

The Porter stemming algorithm http://snowball.tartarus.org/algorithms/porter/stemmer.

html

Snowball main page


The stemmer in Snowball
The ANSI C stemmer
— and its header
Sample English vocabulary
Its stemmed equivalent
Vocabulary + stemmed equivalent
Tar-gzipped file of all of the above

The ‘official’ home page of the Porter stemming algorithm

Here is a case study on how to code up a stemming algorithm in Snowball. First,


the definition of the Porter stemmer, as it appeared in Program, Vol 14 no. 3 pp
130-137, July 1980.

A consonant in a word is a letter other than A, E, I, O or U, and other than Y preceded


by a consonant. (The fact that the term ‘consonant’ is defined to some extent in terms of
itself does not make it ambiguous.) So in TOY the consonants are T and Y, and in
SYZYGY they are S, Z and G. If a letter is not a consonant it is a vowel.

A consonant will be denoted by c, a vowel by v. A list ccc... of length greater than 0 will
be denoted by C, and a list vvv... of length greater than 0 will be denoted by V. Any
word, or part of a word, therefore has one of the four forms:

CVCV ... C
CVCV ... V
VCVC ... C
VCVC ... V

These may all be represented by the single form

1 of 12 1/13/2016 5:43 PM
The Porter stemming algorithm http://snowball.tartarus.org/algorithms/porter/stemmer.html

[C]VCVC ... [V]

where the square brackets denote arbitrary presence of their contents. Using (VC)m to
denote VC repeated m times, this may again be written as

[C](VC)m[V].

m will be called the measure of any word or word part when represented in this form.
The case m = 0 covers the null word. Here are some examples:

m=0 TR, EE, TREE, Y, BY.


m=1 TROUBLE, OATS, TREES, IVY.
m=2 TROUBLES, PRIVATE, OATEN, ORRERY.

The rules for removing a suffix will be given in the form

(condition) S1 -> S2

This means that if a word ends with the suffix S1, and the stem before S1 satisfies the
given condition, S1 is replaced by S2. The condition is usually given in terms of m, e.g.

(m > 1) EMENT ->

Here S1 is ‘EMENT’ and S2 is null. This would map REPLACEMENT to REPLAC,


since REPLAC is a word part for which m = 2.

The ‘condition’ part may also contain the following:

*S - the stem ends with S (and similarly for the other letters).
*v* - the stem contains a vowel.
*d - the stem ends with a double consonant (e.g. -TT, -SS).
*o - the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP).

And the condition part may also contain expressions with and, or and not, so that

(m>1 and (*S or *T))

tests for a stem with m>1 ending in S or T, while

(*d and not (*L or *S or *Z))

tests for a stem ending with a double consonant other than L, S or Z. Elaborate
conditions like this are required only rarely.

In a set of rules written beneath each other, only one is obeyed, and this will be the one
with the longest matching S1 for the given word. For example, with

SSES -> SS
IES -> I

2 of 12 1/13/2016 5:43 PM
The Porter stemming algorithm http://snowball.tartarus.org/algorithms/porter/stemmer.html

SS -> SS
S ->

(here the conditions are all null) CARESSES maps to CARESS since SSES is the
longest match for S1. Equally CARESS maps to CARESS (S1=‘SS’) and CARES to
CARE (S1=‘S’).

In the rules below, examples of their application, successful or otherwise, are given on
the right in lower case. The algorithm now follows:

Step 1a

SSES -> SS caresses -> caress


IES -> I ponies -> poni
ties -> ti
SS -> SS caress -> caress
S -> cats -> cat

Step 1b

(m>0) EED -> EE feed -> feed


agreed -> agree
(*v*) ED -> plastered -> plaster
bled -> bled
(*v*) ING -> motoring -> motor
sing -> sing

If the second or third of the rules in Step 1b is successful, the following is done:

AT -> ATE conflat(ed) -> conflate


BL -> BLE troubl(ed) -> trouble
IZ -> IZE siz(ed) -> size
(*d and not (*L or *S or *Z)) -> single letter hopp(ing) -> hop
tann(ed) -> tan
fall(ing) -> fall
hiss(ing) -> hiss
fizz(ed) -> fizz
(m=1 and *o) -> E fail(ing) -> fail
fil(ing) -> file

The rule to map to a single letter causes the removal of one of the double letter pair. The
-E is put back on -AT, -BL and -IZ, so that the suffixes -ATE, -BLE and -IZE can be
recognised later. This E may be removed in step 4.

Step 1c

3 of 12 1/13/2016 5:43 PM
The Porter stemming algorithm http://snowball.tartarus.org/algorithms/porter/stemmer.html

(*v*) Y -> I happy -> happi


sky -> sky

Step 1 deals with plurals and past participles. The subsequent steps are much more
straightforward.

Step 2

(m>0) ATIONAL -> ATE relational -> relate


(m>0) TIONAL -> TION conditional -> condition
rational -> rational
(m>0) ENCI -> ENCE valenci -> valence
(m>0) ANCI -> ANCE hesitanci -> hesitance
(m>0) IZER -> IZE digitizer -> digitize
(m>0) ABLI -> ABLE conformabli -> conformable
(m>0) ALLI -> AL radicalli -> radical
(m>0) ENTLI -> ENT differentli -> different
(m>0) ELI -> E vileli -> vile
(m>0) OUSLI -> OUS analogousli -> analogous
(m>0) IZATION -> IZE vietnamization -> vietnamize
(m>0) ATION -> ATE predication -> predicate
(m>0) ATOR -> ATE operator -> operate
(m>0) ALISM -> AL feudalism -> feudal
(m>0) IVENESS -> IVE decisiveness -> decisive
(m>0) FULNESS -> FUL hopefulness -> hopeful
(m>0) OUSNESS -> OUS callousness -> callous
(m>0) ALITI -> AL formaliti -> formal
(m>0) IVITI -> IVE sensitiviti -> sensitive
(m>0) BILITI -> BLE sensibiliti -> sensible

The test for the string S1 can be made fast by doing a program switch on the penultimate
letter of the word being tested. This gives a fairly even breakdown of the possible values
of the string S1. It will be seen in fact that the S1-strings in step 2 are presented here in
the alphabetical order of their penultimate letter. Similar techniques may be applied in
the other steps.

Step 3

(m>0) ICATE -> IC triplicate -> triplic


(m>0) ATIVE -> formative -> form
(m>0) ALIZE -> AL formalize -> formal
(m>0) ICITI -> IC electriciti -> electric
(m>0) ICAL -> IC electrical -> electric
(m>0) FUL -> hopeful -> hope
(m>0) NESS -> goodness -> good

4 of 12 1/13/2016 5:43 PM
The Porter stemming algorithm http://snowball.tartarus.org/algorithms/porter/stemmer.html

Step 4

(m>1) AL -> revival -> reviv


(m>1) ANCE -> allowance -> allow
(m>1) ENCE -> inference -> infer
(m>1) ER -> airliner -> airlin
(m>1) IC -> gyroscopic -> gyroscop
(m>1) ABLE -> adjustable -> adjust
(m>1) IBLE -> defensible -> defens
(m>1) ANT -> irritant -> irrit
(m>1) EMENT -> replacement -> replac
(m>1) MENT -> adjustment -> adjust
(m>1) ENT -> dependent -> depend
(m>1 and (*S or *T)) ION -> adoption -> adopt
(m>1) OU -> homologou -> homolog
(m>1) ISM -> communism -> commun
(m>1) ATE -> activate -> activ
(m>1) ITI -> angulariti -> angular
(m>1) OUS -> homologous -> homolog
(m>1) IVE -> effective -> effect
(m>1) IZE -> bowdlerize -> bowdler

The suffixes are now removed. All that remains is a little tidying up.

Step 5a

(m>1) E -> probate -> probat


rate -> rate
(m=1 and not *o) E -> cease -> ceas

Step 5b

(m > 1 and *d and *L) -> single letter controll -> control
roll -> roll

Now, turning it into Snowball.

The Porter stemmer makes a use of a measure, m, of the length of a word or word part. If
C is a sequence of one or more consonants, and V a sequence of one or more vowels,
any word part has the form

[C](VC)m[V],

which is to be read as an optional C, followed by m repetitions of VC, followed by an

5 of 12 1/13/2016 5:43 PM
The Porter stemming algorithm http://snowball.tartarus.org/algorithms/porter/stemmer.html

optional V. This defines m. So for crepuscular the measure would be 4.


c r e p u s c u l a r
| | | | |
[C] V C V C V C V C
1 2 3 4

Most of the rules for suffix removal involve leaving behind a stem whose measure
exceeds some value, for example,

(m > 0) eed -> ee

means ‘replace eed with ee if the stem before eed has measure m > 0’. Implementations
of the Porter stemmer usually have a routine that computes m each time there is a
possible candidate for removal.

In fact the only tests on m in the Porter stemmer are m > 0, m > 1, and, at two interesting
points, m = 1. This suggests that there are two critical positions in a word: the point at
which, going from left to right, m > 0 becomes true, and then the point at which m > 1
becomes true. It turns out that m > 0 becomes true at the point after the first consonant
following a vowel, and m > 1 becomes true at the point after the first consonant
following a vowel following a consonant following a vowel. Calling these positions p1
and p2, we can determine them quite simply in Snowball:
define v 'aeiouy'

....

do(
gopast v gopast non-v setmark p1
gopast v gopast non-v setmark p2
)

The region to the right of p1 will be denoted by R1, the region to the right of p2 by R2:
c r e p u s c u l a r
| |
p1 p2
<--- R1 --->
<-- R2 -->

We can test for being in these regions with calls to R1 and R2, defined by,
define R1 as $p1 <= cursor
define R2 as $p2 <= cursor

and using these tests instead of computing m is acceptable, so long as the stemming
process never alters the p1 and p2 positions, which is indeed true in the Porter stemmer.

A particularly interesting feature of the stemmers presented here is the common use they
make of the positions p1 and p2. The details of marking p1 and p2 vary between the
languages because the definitions of vowel and consonant vary. For example, French i
preceded and followed by vowel should be treated as a consonant (inquiétude);
Portuguese (ã and õ should be treated as a vowel-consonant pair (São João). A third
important position is pV, which tries to mark the position of the shortest acceptable verb

6 of 12 1/13/2016 5:43 PM
The Porter stemming algorithm http://snowball.tartarus.org/algorithms/porter/stemmer.html

stem. Its definition varies somewhat between languages. The Porter stemmer does not
use a pV explicitly, but the idea appears when the verb endings ing and ed are removed
only when preceded by a vowel. In English therefore pV would be defined as the
position after the first vowel.

The Porter stemmer is divided into five steps, step 1 is divided further into steps 1a, 1b
and 1c, and step 5 into steps 5a and 5b. Step 1 removes the i-suffixes, and steps 2 to 4
the d-suffixes (*). Composite d-suffixes are reduced to single d-suffixes one at a time.
So for example if a word ends icational, step 2 reduces it to icate and step 3 to ic. Three
steps are sufficient for this process in English. Step 5 does some tidying up.

One can see how easily the stemming rules translate into Snowball by comparing the
definition of Step 1a from the 1980 paper,
Step 1a:
SSES -> SS caresses -> caress
IES -> I ponies -> poni
ties -> ti
SS -> SS caress -> caress
S -> cats -> cat

with its Snowball equivalent,


define Step_1a as (
[substring] among (
'sses' (<-'ss')
'ies' (<-'i')
'ss' ()
's' (delete)
)
)

The word to be stemmed is being scanned right to left from the end. The longest of
'sses', 'ies', 'ss' or 's' is searched for and defined as the slice. (If none are
found, Step_1a signals f.) If 'sses' is found, it is replaced by 'ss', and so on. Of
course, replacing 'ss' by 'ss' is a dummy action, so we can write
'ss' ()

instead of
'ss' (<-'ss')

Remember that delete just means <- ''.

The really tricky part of the whole algorithm is step 1b, which may be worth looking at
in detail. Here it is, without the example words on the far right,
Step 1b:
(m > 0) EED -> EE
(*v*) ED ->
(*v*) ING ->

If the second or third of the rules in Step 1b is successful, the


following is done:

7 of 12 1/13/2016 5:43 PM
The Porter stemming algorithm http://snowball.tartarus.org/algorithms/porter/stemmer.html

AT -> ATE
BL -> BLE
IZ -> IZE
(*d and not (*L or *S or *Z)) -> single letter
(m = 1 and *o) -> E

The first part of the rule means that eed maps to ee if eed is in R1 (which is equivalent to
m > 0), or ed and ing are removed if they are preceded by a vowel. In Snowball this is
simply,
define Step_1b as (
[substring] among (
'eed' (R1 <-'ee')
'ed'
'ing' (test gopast v delete)
)
)

But this must be modified by the second part of the rule. *d indicates a test for double
letter consonant — bb, dd etc. *L, *S, *Z are tests for l, s, z. *o is a short vowel test — it
is matched by consonant-vowel-consonant, where the consonant on the right is not w, x
or y. If the short vowel test is satisfied, m = 1 is equivalent to the cursor being at p1. So
the second part of the rule means, map at, bl, iz to ate, ble, ize; map certain double letters
to single letters; and add e after a short vowel in words of one syllable.

We first need two extra groupings,


define v 'aeiouy'
define v_WXY v + 'wxY' // v with 'w', 'x' and 'y'-consonant
define v_LSZ v + 'lsz' // v with 'l', 's', 'z'

and a test for a short vowel,


define shortv as ( non-v_WXY v non-v )

(The v_WXY test comes first because we are scanning backwards, from right to left.)

The double to single letter map can be done as follows: first define the slice as the next
non-v_LSZ and copy it to a string, ch, as a single character,

strings ( ch )

....

[non-v_LSZ] ->ch

A further test, ch, tests that the next letter of the string is the same as the one in ch, and
if this gives signal t, delete deletes the slice,
[non-v_LSZ] ->ch ch delete

Step_1b can then be written like this,


define Step_1b as (
[substring] among (
'eed' (R1 <-'ee')
'ed'
'ing' (

8 of 12 1/13/2016 5:43 PM
The Porter stemming algorithm http://snowball.tartarus.org/algorithms/porter/stemmer.html

test gopast v delete


(test among('at' 'bl' 'iz') <+ 'e')
or
([non-v_LSZ]->ch ch delete)
or
(atmark p1 test shortv <+ 'e')
)
)
)

But we can improve the appearance, and speed, of this by turning the second part of the
rule into another among command, noting that the only letters that need undoubling are
b, d, f, g, m, n, p, r and t,
define Step_1b as (
[substring] among (
'eed' (R1 <-'ee')
'ed'
'ing' (
test gopast v delete
test substring among(
'at' 'bl' 'iz'
(<+ 'e')
'bb' 'dd' 'ff' 'gg' 'mm' 'nn' 'pp' 'rr' 'tt'
// ignoring double c, h, j, k, q, v, w, and x
([next] delete)
'' (atmark p1 test shortv <+ 'e')
)
)
)
)

Note the null string in the second among, which acts as a default case.

The Porter stemmer in Snowball is given below. This is an exact implementation of the
algorithm described in the 1980 paper, unlike the other implementations distributed by
the author, which have, and have always had, three small points of difference (clearly
indicated) from the original algorithm. Since all other implementations of the algorithm
seen by the author are in some degree inexact, this may well be the first ever correct
implementation.

integers ( p1 p2 )
booleans ( Y_found )

routines (
shortv
R1 R2
Step_1a Step_1b Step_1c Step_2 Step_3 Step_4 Step_5a Step_5b

9 of 12 1/13/2016 5:43 PM
The Porter stemming algorithm http://snowball.tartarus.org/algorithms/porter/stemmer.html

externals ( stem )

groupings ( v v_WXY )

define v 'aeiouy'
define v_WXY v + 'wxY'

backwardmode (

define shortv as ( non-v_WXY v non-v )

define R1 as $p1 <= cursor


define R2 as $p2 <= cursor

define Step_1a as (
[substring] among (
'sses' (<-'ss')
'ies' (<-'i')
'ss' ()
's' (delete)
)
)

define Step_1b as (
[substring] among (
'eed' (R1 <-'ee')
'ed'
'ing' (
test gopast v delete
test substring among(
'at' 'bl' 'iz'
(<+ 'e')
'bb' 'dd' 'ff' 'gg' 'mm' 'nn' 'pp' 'rr' 'tt'
// ignoring double c, h, j, k, q, v, w, and x
([next] delete)
'' (atmark p1 test shortv <+ 'e')
)
)
)
)

define Step_1c as (
['y' or 'Y']
gopast v
<-'i'
)

define Step_2 as (
[substring] R1 among (
'tional' (<-'tion')
'enci' (<-'ence')
'anci' (<-'ance')
'abli' (<-'able')
'entli' (<-'ent')
'eli' (<-'e')
'izer' 'ization'
(<-'ize')
'ational' 'ation' 'ator'
(<-'ate')
'alli' (<-'al')

10 of 12 1/13/2016 5:43 PM
The Porter stemming algorithm http://snowball.tartarus.org/algorithms/porter/stemmer.html

'alism' 'aliti'
(<-'al')
'fulness' (<-'ful')
'ousli' 'ousness'
(<-'ous')
'iveness' 'iviti'
(<-'ive')
'biliti' (<-'ble')
)
)

define Step_3 as (
[substring] R1 among (
'alize' (<-'al')
'icate' 'iciti' 'ical'
(<-'ic')
'ative' 'ful' 'ness'
(delete)
)
)

define Step_4 as (
[substring] R2 among (
'al' 'ance' 'ence' 'er' 'ic' 'able' 'ible' 'ant' 'ement'
'ment' 'ent' 'ou' 'ism' 'ate' 'iti' 'ous' 'ive' 'ize'
(delete)
'ion' ('s' or 't' delete)
)
)

define Step_5a as (
['e']
R2 or (R1 not shortv)
delete
)

define Step_5b as (
['l']
R2 'l'
delete
)
)

define stem as (

unset Y_found
do ( ['y'] <-'Y' set Y_found)
do repeat(goto (v ['y']) <-'Y' set Y_found)

$p1 = limit
$p2 = limit
do(
gopast v gopast non-v setmark p1
gopast v gopast non-v setmark p2
)

backwards (
do Step_1a
do Step_1b
do Step_1c
do Step_2
do Step_3

11 of 12 1/13/2016 5:43 PM
The Porter stemming algorithm http://snowball.tartarus.org/algorithms/porter/stemmer.html

do Step_4
do Step_5a
do Step_5b
)

do(Y_found repeat(goto (['Y']) <-'y'))

12 of 12 1/13/2016 5:43 PM

Das könnte Ihnen auch gefallen