
Tries

R-way tries
Ternary search tries

Reference: Chapter 12, Algorithms in Java, 3rd Edition, Robert Sedgewick.

Princeton University • COS 226 • Algorithms and Data Structures • Spring 2004 • Kevin Wayne • http://www.Princeton.EDU/~cos226

Symbol Table Review

Symbol table: key-value pair abstraction.
- Insert a value with specified key.
- Search for value given key.
- Delete value with given key.

- Balanced trees use log N key comparisons.
- Hashing uses O(1) probes, but probe proportional to key length.

Are key comparisons necessary? No.
Is time proportional to key length required? No.
Best possible. Examine lg N bits.

This lecture: specialized symbol table for string keys.
- Faster than hashing.
- More flexible than BST.

Tries

Tries.
- Store characters in internal nodes, not keys.
- Store records in external nodes.
- Use the characters of the key to guide the search.
- NB: from retrieval, but pronounced "try."
- You can get at anything if it's organized properly in 40 or 100 bits!

Example: sells sea shells by the sea shore

[Figure: trie for the example keys by, sea, sells, shells, shore, the]

Applications

Applications.
- Spell checkers.
- Data compression. (stay tuned)
- Princeton U-CALL.
- Computational biology.
- Routing tables for IP addresses.
- Storing and querying XML documents.
- Associative arrays, associative indexing.

Modern application: inverted index of Web.
- Insert each word of every web page into trie, storing URL list in leaves.
- Find query keywords in trie, and take intersection of URL lists.
- Use Pagerank algorithm to rank resulting web pages.

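The inverted-index idea can be sketched in a few lines. This is only an illustration under stated assumptions: words are ASCII and separated by whitespace, the URL set lives in the node where a word ends, and the class and method names (InvertedIndexSketch, addPage, lookup, query) are hypothetical. The ranking step is left out.

```java
import java.util.Set;
import java.util.TreeSet;

// Sketch of an inverted index: an R-way trie whose word-ending nodes hold URL sets.
// Assumption: every character of every word is ASCII (code below 128).
class InvertedIndexSketch {
    private static final int R = 128;

    private static class Node {
        Node[] next = new Node[R];
        Set<String> urls;                 // non-null only where some word ends
    }

    private final Node root = new Node();

    // Insert each whitespace-separated word of the page text, tagged with the URL.
    public void addPage(String url, String text) {
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) continue;
            Node x = root;
            for (int i = 0; i < word.length(); i++) {
                char c = word.charAt(i);
                if (x.next[c] == null) x.next[c] = new Node();
                x = x.next[c];
            }
            if (x.urls == null) x.urls = new TreeSet<>();
            x.urls.add(url);
        }
    }

    // Copy of the URL set for one word (empty if the word was never inserted).
    public Set<String> lookup(String word) {
        Node x = root;
        for (int i = 0; i < word.length() && x != null; i++)
            x = x.next[word.charAt(i)];
        return (x == null || x.urls == null) ? new TreeSet<>() : new TreeSet<>(x.urls);
    }

    // Intersection of the URL sets of all query keywords.
    public Set<String> query(String... keywords) {
        Set<String> result = null;
        for (String k : keywords) {
            Set<String> urls = lookup(k);
            if (result == null) result = urls;
            else result.retainAll(urls);
        }
        return (result == null) ? new TreeSet<>() : result;
    }
}
```

For example, after addPage("a.html", "she sells sea shells") and addPage("b.html", "by the sea shore"), the call query("sea", "shore") returns only b.html. The page names are made up for the illustration.
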
Existence Symbol Table: Operations

Existence symbol table: set of keys (say, strings over the ASCII alphabet).

Operations.
- st.add(key) inserts a key.
- st.contains(key) checks if the key is in the symbol table.

    ExistenceTable st = new ExistenceTable();
    while (!StdIn.isEmpty()) {
        String key = StdIn.readString();
        if (!st.contains(key)) {
            st.add(key);
            System.out.println(key);
        }
    }

Removes duplicates from input stream

Keys

Key = sequence of "digits."
- DNA: sequence of a, c, g, t.
- Protein: sequence of 20 amino acids A, C, ..., Y.
- IPv6 address: sequence of 128 bits.
- English words: sequence of lowercase letters.
- International words: sequence of UNICODE characters.
- Credit card number: sequence of 16 decimal digits.
- Library call numbers: sequence of letters, numbers, periods.

This lecture: key = string.
- We assume strings over the ASCII alphabet.
- We also assume that character '\0' never appears.

Existence Symbol Table: Implementations Cost Summary

                  |        Typical Case           |     Dedup
Implementation    | Search hit   Insert    Space  | Moby    Actors
Input *           | L            L         L      | 0.26    15.1
Red-Black         | L + log N    log N     C      | 1.40    97.4
Hashing           | L            L         C      | 0.76    40.6

* only reads in data

Actor: 82MB, 11.4M words, 900K distinct.
Moby: 1.2MB, 210K words, 32K distinct.

N = number of strings
L = size of string
C = number of characters in input
R = radix

Challenge: As fast as hashing, as flexible as BST.

R-Way Existence Trie: Example

Assumption: no string is a prefix of another string.

Ex: sells sea shells by the sea shore

[Figure: R-way trie for the example keys, R = 26]

R-Way Existence Trie: Java Implementation

R-way existence trie: a node.

Node: reference to R nodes.

    private static class Node {
        Node[] next = new Node[R];
    }

[Figure: a root node drawn with R = 8 links; the links labeled a, f, h are shown]

R-Way Existence Trie: Implementation

Code is short and sweet.

    public class RwayExistenceTable {
        private static final int R = 128;        // ASCII
        private static final char END = '\0';    // sentinel
        private Node root;

        private static class Node {
            Node[] next = new Node[R];
        }

        public boolean contains(String s) {
            return contains(root, s + END, 0);   // END ensures no string is a prefix of another
        }

        private boolean contains(Node x, String s, int i) {
            char d = s.charAt(i);
            if (x == null) return false;
            if (d == END) return (x.next[END] != null);
            return contains(x.next[d], s, i+1);
        }

        // add() and the closing brace of the class follow on the next slide

R-Way Existence Trie: Implementation

        // class RwayExistenceTable, continued

        public void add(String s) {
            root = add(root, s + END, 0);        // END ensures no string is a prefix of another
        }

        private Node add(Node x, String s, int i) {
            char d = s.charAt(i);
            if (x == null) x = new Node();
            if (d == END && x.next[END] == null)
                x.next[END] = new Node();
            if (d == END) return x;
            x.next[d] = add(x.next[d], s, i+1);  // recursive call to add (the slide's "insert" is not defined)
            return x;
        }
    }

Existence Symbol Table: Implementations Cost Summary

                  |        Typical Case           |            Dedup
Implementation    | Search hit   Insert    Space  | Moby             Actors
Input             | L            L         L      | 0.26             15.1
Red-Black         | L + log N    log N     C      | 1.40             97.4
Hashing           | L            L         C      | 0.76             40.6
R-Way Trie        | L            L         RN+C   | 1.12 (R = 128)   Memory (R = 256)

R-way trie: Faster than hashing for small R, but slow and wastes memory if R is large.

Goal: Use less space.

Existence TST

Ternary search trie. (Bentley-Sedgewick)
- Each node has 3 children:
- Left (smaller), middle (equal), right (larger).

Ex: sells sea shells by the sea shore

Observation: Few wasted links!

Existence TST: Implementation

Existence TST: a node.

Node: four fields:
- Character d.
- Reference to left TST. (smaller)
- Reference to middle TST. (equal)
- Reference to right TST. (larger)

    private class Node {
        char d;
        Node l, m, r;
    }

[Figure: example TST with root h; \0 nodes mark the ends of keys such as ha and hi]

Existence TST: Java Implementation

    private boolean contains(Node x, String s, int i) {
        char d = s.charAt(i);
        if (x == null) return false;
        if (d == END && x.d == END) return true;
        if      (d <  x.d) return contains(x.l, s, i);
        else if (d == x.d) return contains(x.m, s, i+1);
        else               return contains(x.r, s, i);
    }

    private Node add(Node x, String s, int i) {
        char d = s.charAt(i);
        if (x == null) {
            x = new Node();
            x.d = d;
        }
        if (d == END && x.d == END) return x;
        if      (d <  x.d) x.l = add(x.l, s, i);
        else if (d == x.d) x.m = add(x.m, s, i+1);
        else               x.r = add(x.r, s, i);
        return x;
    }

Existence Symbol Table: Implementations Cost Summary

                  |          Typical Case              |     Dedup
Implementation    | Search hit   Insert       Space    | Moby    Actors
Input             | L            L            L        | 0.26    15.1
Red-Black         | L + log N    log N        C        | 1.40    97.4
Hashing           | L            L            C        | 0.76    40.6
R-Way Trie        | L            L            RN+C     | 1.12    Memory
TST               | L + log N    L + log N    C        | 0.72    38.7

TST: no arithmetic.

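The contains and add methods above are private helpers that expect a key already terminated by the sentinel. A minimal sketch of a complete class around them follows, assuming the same '\0' sentinel convention as the R-way trie. The class name ExistenceTST, the public wrappers, and the demo in main are illustrative assumptions, not part of the slides.

```java
// Self-contained existence TST: a set of strings with add and contains.
// Assumption: the sentinel character '\0' never occurs inside a key.
class ExistenceTST {
    private static final char END = '\0';
    private Node root;

    private static class Node {
        char d;          // character stored in this node
        Node l, m, r;    // smaller, equal, larger
    }

    public boolean contains(String s) { return contains(root, s + END, 0); }

    public void add(String s) { root = add(root, s + END, 0); }

    private boolean contains(Node x, String s, int i) {
        if (x == null) return false;
        char d = s.charAt(i);
        if (d == END && x.d == END) return true;
        if      (d <  x.d) return contains(x.l, s, i);
        else if (d == x.d) return contains(x.m, s, i + 1);
        else               return contains(x.r, s, i);
    }

    private Node add(Node x, String s, int i) {
        char d = s.charAt(i);
        if (x == null) { x = new Node(); x.d = d; }
        if (d == END && x.d == END) return x;
        if      (d <  x.d) x.l = add(x.l, s, i);
        else if (d == x.d) x.m = add(x.m, s, i + 1);
        else               x.r = add(x.r, s, i);
        return x;
    }

    // Dedup demo on the lecture's example: prints each distinct word once.
    public static void main(String[] args) {
        ExistenceTST st = new ExistenceTST();
        for (String key : "sells sea shells by the sea shore".split(" ")) {
            if (!st.contains(key)) {
                st.add(key);
                System.out.println(key);
            }
        }
    }
}
```

The index i advances only along a middle link, and only when the current character is not the sentinel, so s.charAt(i) never runs past the end of the terminated key.
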
Existence TST With R^2 Branching At Root

Hybrid of R-way and TST.
- Do R-way or R^2-way branching at root.
- Each of R^2 root nodes points to a TST.

[Figure: array of R^2 roots (aa, ab, ac, ..., zy, zz), each pointing to a TST]

Q. What about one letter words?

Existence Symbol Table: Implementations Cost Summary

                  |          Typical Case              |     Dedup
Implementation    | Search hit   Insert       Space    | Moby    Actors
Input             | L            L            L        | 0.26    15.1
Red-Black         | L + log N    log N        C        | 1.40    97.4
Hashing           | L            L            C        | 0.76    40.6
R-Way Trie        | L            L            RN+C     | 1.12    Memory
TST               | L + log N    L + log N    C        | 0.72    38.7
TST with R^2      | L + log N    L + log N    C        | 0.51    32.7
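One way the hybrid could look is sketched below. It is not the course's implementation. The first two characters select one of R^2 roots, and the rest of the key (plus the '\0' sentinel) goes into that root's TST. The slide leaves the one-letter-word question open; this sketch answers it by one possible choice, namely keeping keys shorter than two characters in a separate ordinary TST. The names HybridTST and shortKeys are assumptions, and so is the restriction to ASCII keys.

```java
// Existence TST with R^2-way branching at the root (illustrative sketch).
// Assumptions: keys are ASCII (codes below 128) and never contain '\0'.
class HybridTST {
    private static final int R = 128;
    private static final char END = '\0';

    private static class Node { char d; Node l, m, r; }

    private final Node[] roots = new Node[R * R];   // one TST per two-character prefix
    private Node shortKeys;                         // ordinary TST for keys of length 0 or 1

    public void add(String s) {
        String t = s + END;
        if (s.length() < 2) { shortKeys = add(shortKeys, t, 0); return; }
        int k = s.charAt(0) * R + s.charAt(1);      // index of the two-character prefix
        roots[k] = add(roots[k], t, 2);             // store the remaining characters only
    }

    public boolean contains(String s) {
        String t = s + END;
        if (s.length() < 2) return contains(shortKeys, t, 0);
        return contains(roots[s.charAt(0) * R + s.charAt(1)], t, 2);
    }

    private boolean contains(Node x, String s, int i) {
        if (x == null) return false;
        char d = s.charAt(i);
        if (d == END && x.d == END) return true;
        if      (d <  x.d) return contains(x.l, s, i);
        else if (d == x.d) return contains(x.m, s, i + 1);
        else               return contains(x.r, s, i);
    }

    private Node add(Node x, String s, int i) {
        char d = s.charAt(i);
        if (x == null) { x = new Node(); x.d = d; }
        if (d == END && x.d == END) return x;
        if      (d <  x.d) x.l = add(x.l, s, i);
        else if (d == x.d) x.m = add(x.m, s, i + 1);
        else               x.r = add(x.r, s, i);
        return x;
    }
}
```

With R = 128 the root array has 16,384 entries, a fixed cost that is paid once rather than per node as in the R-way trie.
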

Existence TST Summary

Advantages.
- Very fast search hits.
- Search misses even faster. (examine only a few digits of the key!)
- Linear space.
- Adapts gracefully to irregularities in keys.
- Supports even more general symbol table ops.

Bottom line: more flexible than BST and can be faster than hashing (especially if lots of search misses).

Existence TST: Other Operations

Conventional BST ops:

Delete. Delete key from the symbol table.
Sort. Examine the keys in ascending order.
Find ith. Find the ith largest key.
Range search. Find all elements between k1 and k2.

Additional ops:

Partial match search.
- Use . to match any character.
- co....er .c...c.

Near neighbor search.
- Find all strings in ST that differ in ≤ P characters from query.
- Application: spell checking for OCR.

Longest prefix match.
- Find string in ST with longest prefix match to query.
- Application: search IP database for longest prefix matching destination IP, and route packets accordingly.

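Longest prefix match can be sketched on the existence TST as follows. The sketch makes several assumptions: keys end in the '\0' sentinel as in the earlier slides, the class and method names (PrefixTST, longestPrefixOf, find) are hypothetical, and text prefixes stand in for the address prefixes a real router would compare. At each character position the nodes linked by l and r form a small search tree of siblings; a sentinel node among them means the characters consumed so far form a complete key.

```java
// Longest prefix match on an existence TST (illustrative sketch).
// Assumption: '\0' never occurs inside a key.
class PrefixTST {
    private static final char END = '\0';

    private static class Node { char d; Node l, m, r; }

    private Node root;

    public void add(String s) { root = add(root, s + END, 0); }

    private Node add(Node x, String s, int i) {
        char d = s.charAt(i);
        if (x == null) { x = new Node(); x.d = d; }
        if (d == END && x.d == END) return x;
        if      (d <  x.d) x.l = add(x.l, s, i);
        else if (d == x.d) x.m = add(x.m, s, i + 1);
        else               x.r = add(x.r, s, i);
        return x;
    }

    // Node holding character c among the siblings of x (linked by l and r), or null.
    private static Node find(Node x, char c) {
        while (x != null && x.d != c) x = (c < x.d) ? x.l : x.r;
        return x;
    }

    // Longest key that is a prefix of query, or null if no key is a prefix.
    public String longestPrefixOf(String query) {
        int best = -1;
        Node x = root;                               // siblings for position 0
        for (int i = 0; x != null; i++) {
            if (find(x, END) != null) best = i;      // query[0..i) is a complete key
            if (i == query.length()) break;
            Node match = find(x, query.charAt(i));
            x = (match == null) ? null : match.m;    // descend to the siblings for position i+1
        }
        return (best < 0) ? null : query.substring(0, best);
    }
}
```

With keys "128", "128.112" and "128.112.136" (made-up values), the query "128.112.100.16" yields "128.112": the walk stops at the first character that has no matching sibling and reports the last position where a sentinel was seen.
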
TST: Partial Matches

Partial match in a TST.
- Search as usual if query character is not a period.
- Go down all three branches if query character is a period.

    private void match(Node x, String s, int i, String prefix) {   // prefix: for printing out matches
        char d = s.charAt(i);
        if (x == null) return;
        if (d == END && x.d == END) { System.out.println(prefix); return; }
        // If d == END but x.d != END, do not stop here: END is the smallest character,
        // so the d < x.d test below continues the search in the left subtree.
        if (d == '.' || d <  x.d) match(x.l, s, i, prefix);
        if (d == '.' || d == x.d) match(x.m, s, i+1, prefix + x.d);   // or use explicit char array for efficiency
        if (d == '.' || d >  x.d) match(x.r, s, i, prefix);
    }

    public void match(String s) {
        match(root, s + END, 0, "");
    }

TST Symbol Table

TST implementation of symbol table ADT.
- Store key-value pairs in leaves of trie.
- Search hit ends at leaf with key-value pair; search miss ends at null or leaf with different key.
- Internal node stores char; external node stores key-value pair.
    - use separate internal and external nodes?
    - collapse (and split) 1-way branches at bottom?

[Figure: TST with key-value pairs in the leaves by, sea, sells, shells, the]
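The near neighbor search listed among the additional operations can be handled with a traversal very similar to the partial-match code. The following is a sketch under assumptions: "differ in ≤ P characters" is read as Hamming distance between strings of equal length, keys end in the '\0' sentinel, and the names NearTST and near are hypothetical. The remaining mismatch budget p plays the role of the wildcard: while p > 0 the search may step to any sibling, and following a middle link costs 1 when the characters differ.

```java
import java.util.ArrayList;
import java.util.List;

// Near neighbor search in an existence TST (illustrative sketch).
// Assumption: '\0' never occurs inside a key or a query.
class NearTST {
    private static final char END = '\0';

    private static class Node { char d; Node l, m, r; }

    private Node root;

    public void add(String s) { root = add(root, s + END, 0); }

    private Node add(Node x, String s, int i) {
        char d = s.charAt(i);
        if (x == null) { x = new Node(); x.d = d; }
        if (d == END && x.d == END) return x;
        if      (d <  x.d) x.l = add(x.l, s, i);
        else if (d == x.d) x.m = add(x.m, s, i + 1);
        else               x.r = add(x.r, s, i);
        return x;
    }

    // All keys with the same length as query that differ from it in at most p positions.
    public List<String> near(String query, int p) {
        List<String> out = new ArrayList<>();
        near(root, query + END, 0, p, "", out);
        return out;
    }

    private void near(Node x, String s, int i, int p, String prefix, List<String> out) {
        if (x == null) return;
        char d = s.charAt(i);
        if (d == END) {                              // query exhausted: only a sentinel sibling matches
            if (x.d == END) out.add(prefix);
            else near(x.l, s, i, p, prefix, out);    // END is the smallest character
            return;
        }
        if (p > 0 || d < x.d) near(x.l, s, i, p, prefix, out);
        if (x.d != END) {                            // a sentinel here would mean a shorter key
            int cost = (d == x.d) ? 0 : 1;
            if (cost <= p) near(x.m, s, i + 1, p - cost, prefix + x.d, out);
        }
        if (p > 0 || d > x.d) near(x.r, s, i, p, prefix, out);
    }
}
```

Because the siblings are visited left to right, the matches come out in ascending order. With p = 0 the traversal degenerates to an ordinary search.
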

TST Symbol Table

[Figure: the TST symbol table for the example, with leaves by, sea, sells, shells, shore, the]

Existence Symbol Table: Implementations Cost Summary

                        |          Typical Case
Implementation          | Search hit   Insert       Space
Input                   | L            L            L
Red-Black               | L + log N    log N        C
Hashing                 | L            L            C
R-Way Trie              | L            L            RN+C
TST                     | L + log N    L + log N    C
TST with R^2            | L + log N    L + log N    C
R-way collapse 1-way    | log_R N      log_R N      RN + C
TST collapse 1-way      | log N        log N        C

Search, insert time is independent of key length!
- Consequence: can use with very long keys.

PATRICIA Tries

Patricia tries. Practical Algorithm to Retrieve Information Coded in Alphanumeric.
- Collapse one-way branches in binary trie.
- Thread trie to eliminate multiple node types.

Applications.
- Database search.
- P2P network search.
- IP routing tables: find longest prefix match.
- Compressed quad-tree for N-body simulation.
- Efficiently storing and querying XML documents.

Suffix Tree

Suffix tree: PATRICIA trie of suffixes of a string.

Applications.
- Longest common substring.
- Longest repeated substring.
- Longest palindromic substring.
- Longest common prefix of two substrings.
- Computational biology databases (BLAST, FASTA).
- Search for music by melody.
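To make the "longest repeated substring" application concrete, here is a small sketch that uses sorted suffixes instead of a suffix tree. This is a deliberately simpler stand-in technique, not the PATRICIA-based structure the slide describes. It relies on the same observation, that any repeated substring is a common prefix of two suffixes, and after sorting it is enough to compare neighbors. The class and method names are assumptions, and building every suffix as a separate String is only reasonable for short inputs.

```java
import java.util.Arrays;

// Longest repeated substring via sorted suffixes (stand-in for a suffix tree).
class SuffixSketch {

    // Length of the longest common prefix of a and b.
    private static int lcp(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++)
            if (a.charAt(i) != b.charAt(i)) return i;
        return n;
    }

    // Longest substring occurring at least twice in s (occurrences may overlap).
    public static String longestRepeatedSubstring(String s) {
        int n = s.length();
        String[] suffixes = new String[n];
        for (int i = 0; i < n; i++) suffixes[i] = s.substring(i);
        Arrays.sort(suffixes);

        String best = "";
        for (int i = 0; i + 1 < n; i++) {
            int len = lcp(suffixes[i], suffixes[i + 1]);
            if (len > best.length()) best = suffixes[i].substring(0, len);
        }
        return best;
    }
}
```

For "banana" the sorted suffixes are a, ana, anana, banana, na, nana; the neighbors ana and anana share the longest common prefix, so the result is "ana".
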

Associative Arrays

Associative array.
- In Java, C, C++, arrays indexed by integers.
- In Perl, csh, PHP, Python: president["Princeton"] = "Tilghman"

    # collect data
    foreach student ($argv)
        foreach input (input100.txt input1000.txt input10000.txt)
            foreach program (worstfit bestfit)
                t[$student][$input][$program] = `time java $program < $input`
            end
        end
    end
    ...
    # compute statistics
    ...

Idealized excerpt from COS 226 timing script

Associative Indexing

Associative index.
- Given list of N strings, associate index 0 to N-1 with each string.
- Recall union-find where we assumed objects were labeled 0 to N-1.

Why useful?
- Using algorithm with strings is more useful.
- Running algorithm with indices (instead of ST lookup) is faster.

With integer labels:

    while (true) {
        int p = StdIn.readInt();
        int q = StdIn.readInt();

        uf.unite(p, q);
        ...
    }

With string labels:

    while (true) {
        String s = StdIn.readString();
        String t = StdIn.readString();
        int p = st.index(s);
        int q = st.index(t);
        . . .
        uf.unite(p, q);
        ...
    }

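The st.index(s) call in the loop above is not defined on the slides. One way it could work is sketched here, under assumptions: a TST stores the assigned index in the sentinel node and hands out 0, 1, 2, ... in order of first appearance. The class names StringIndex and NamedConnectivity, and the deliberately naive quick-union used in place of the course's union-find, are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// TST that maps each distinct string to an index 0, 1, 2, ... (illustrative sketch).
// Assumption: '\0' never occurs inside a key.
class StringIndex {
    private static final char END = '\0';

    private static class Node { char d; Node l, m, r; int index = -1; }

    private Node root;
    private int n = 0;      // number of distinct strings seen so far
    private int found;      // index located by the most recent call to add

    // Index of s; a string not seen before receives the next free index.
    public int index(String s) {
        root = add(root, s + END, 0);
        return found;
    }

    private Node add(Node x, String s, int i) {
        char d = s.charAt(i);
        if (x == null) { x = new Node(); x.d = d; }
        if (d == END && x.d == END) {
            if (x.index < 0) x.index = n++;
            found = x.index;
        }
        else if (d <  x.d) x.l = add(x.l, s, i);
        else if (d == x.d) x.m = add(x.m, s, i + 1);
        else               x.r = add(x.r, s, i);
        return x;
    }
}

// Connectivity over string names: indices from StringIndex feed a naive quick-union.
class NamedConnectivity {
    private final StringIndex st = new StringIndex();
    private final List<Integer> parent = new ArrayList<>();

    private int id(String name) {
        int p = st.index(name);
        if (p == parent.size()) parent.add(p);   // first appearance: new singleton component
        return p;
    }

    private int root(int p) {
        while (parent.get(p) != p) p = parent.get(p);
        return p;
    }

    public void unite(String a, String b) { parent.set(root(id(a)), root(id(b))); }

    public boolean connected(String a, String b) { return root(id(a)) == root(id(b)); }
}
```

After unite("Kevin Bacon", "Kate Hudson"), connected("Kevin Bacon", "Kate Hudson") is true, and the union-find part only ever sees the integers 0 and 1.
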
Associative Indexing: Application

Connectivity problem.
- N objects: 0 to N-1
- Find: is there a connection between A and B?
- Union: add a connection between A and B.

Fun version.
- N objects: "Kevin Bacon", "Kate Hudson", . . .
- Find: is there a chain of movies connecting Kevin to Kate?
- Union: Kevin and Kate appeared in "How To Lose a Guy in 10 Days" together, add connection

Real version.
- N objects: "www.cs.princeton.edu", "www.harvard.edu"
- Any graph processing application.

Symbol Table Summary

Hash tables: separate chaining, linear probing.

Binary search trees: randomized, splay, red-black.

Tries: R-way, TST.

Determine the needed ST ops for your application, and choose the best data structure.
