Text Processing Algorithms ♢ COMP2521 ♢ (19T1)
Text Processing Algorithms ♢ COMP2521 (19T1) ♢ Week 09 ♢ [0/63]
A string is a sequence of characters.
An alphabet Σ is the set of possible characters in strings.
Examples of strings:
- C program
- HTML document
- DNA sequence
- Digitized image
Examples of alphabets:
- ASCII
- Unicode
- {0,1}
- {A,C,G,T}
Notation:
- length(P) … #characters in P
- λ … empty string (length(λ) = 0)
- Σᵐ … set of all strings of length m over alphabet Σ
- Σ* … set of all strings over alphabet Σ
- νω … concatenation of strings ν and ω

Note: length(νω) = length(ν) + length(ω), and λω = ω = ωλ
Notation:
- substring of P … any string Q such that P = νQω, for some ν,ω∈Σ*
- prefix of P … any string Q such that P = Qω, for some ω∈Σ*
- suffix of P … any string Q such that P = ωQ, for some ω∈Σ*
The string "a/a" of length 3 over the ASCII alphabet has
- how many prefixes?
- how many suffixes?
- how many substrings?
- 4 prefixes: "", "a", "a/", "a/a"
- 4 suffixes: "a/a", "/a", "a", ""
- 6 substrings: "", "a", "/", "a/", "/a", "a/a"

Note: "" means the same as λ (= empty string)
ASCII (American Standard Code for Information Interchange)
- Specifies mapping of 128 characters to integers 0..127
- The characters encoded include:
- upper and lower case English letters: A-Z and a-z
- digits: 0-9
- common punctuation symbols
- special non-printing characters: e.g. newline and space
Reminder: In C, a string is an array of chars containing ASCII codes
- these arrays have an extra element containing a 0
- the extra 0 can also be written '\0' (null character or null terminator)
- convenient because we don't have to track the length of the string

Because strings are so common, C provides convenient syntax:

char str[] = "hello";

Note: str[] will have 6 elements
C provides a number of string manipulation functions via #include <string.h>, e.g. strlen(), strncpy(), strncat(), strstr()

Example:

char *strncat(char *dest, const char *src, size_t n)

- appends string src to the end of dest, overwriting the '\0' at the end of dest, and adds a terminating '\0'
- returns start of string dest
- will never append more than n characters of src
- (the result is always null-terminated, so dest must have room for strlen(dest)+n+1 bytes)
Example (pattern checked backwards):
- Text … abacaab
- Pattern … abacab
Given two strings T (text) and P (pattern),
the pattern matching problem consists of finding a substring of T equal to P
Applications:
- Text editors
- Search engines
- Biological research
Brute-force pattern matching algorithm
- checks each possible shift of P relative to T
- until a match is found, or
- all placements of the pattern have been tried
BruteForceMatch(T,P):
|  Input  text T of length n, pattern P of length m
|  Output starting index of a substring of T equal to P
|         -1 if no such substring exists
|
|  for all i=0..n-m do
|  |  j=0
|  |  while j<m ∧ T[i+j]=P[j] do
|  |  |  j=j+1
|  |  end while
|  |  if j=m then
|  |  |  return i
|  |  end if
|  end for
|  return -1
❖ Analysis of Brute-force Pattern Matching
Brute-force pattern matching runs in O(n·m) time

Example of worst case (forward checking):
- T = aaa…ah
- P = aaah
- may occur in DNA sequences
- unlikely in English text
The Boyer-Moore pattern matching algorithm is based on two heuristics:
- Looking-glass heuristic: Compare P with a subsequence of T moving backwards
- Character-jump heuristic: When a mismatch occurs at T[i]=c
- if P contains c ⇒ shift P so as to align the last occurrence of c in P with T[i]
- otherwise ⇒ shift P so as to align P[0] with T[i+1] (a.k.a. "big jump")
❖ ... Boyer-Moore Algorithm

Boyer-Moore algorithm preprocesses pattern P and alphabet Σ to build the last-occurrence function L
- L maps Σ to integers such that L(c) is defined as
- the largest index i such that P[i]=c, or
- -1 if no such index exists

Example: Σ = {a,b,c,d}, P = acab ⇒ L(a)=2, L(b)=3, L(c)=1, L(d)=-1
- L can be represented by an array indexed by the numeric codes of the characters
- L can be computed in O(m+s) time (m … length of pattern, s … size of Σ)
❖ ... Boyer-Moore Algorithm

BoyerMooreMatch(T,P,Σ):
|  Input  text T of length n, pattern P of length m, alphabet Σ
|  Output starting index of a substring of T equal to P
|         -1 if no such substring exists
|
|  L=lastOccurrenceFunction(P,Σ)
|  i=m-1, j=m-1
|  repeat
|  |  if T[i]=P[j] then
|  |  |  if j=0 then
|  |  |  |  return i
|  |  |  else
|  |  |  |  i=i-1, j=j-1
|  |  |  end if
|  |  else
|  |  |  i=i+m-min(j,1+L[T[i]])
|  |  |  j=m-1
|  |  end if
|  until i≥n
|  return -1

- Biggest jump (m characters ahead) occurs when L[T[i]] = -1
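A C sketch of the matcher (character-jump heuristic only, as in the pseudocode; the last-occurrence array is computed inline so the block is self-contained, and the function name is mine):

```c
#include <assert.h>
#include <string.h>

#define ASCII_SIZE 128

// Boyer-Moore with the character-jump heuristic.
// Returns the starting index of p in t, or -1.
int boyerMooreMatch(const char *t, const char *p) {
    int n = strlen(t), m = strlen(p);
    if (m == 0) return 0;
    if (m > n) return -1;

    int L[ASCII_SIZE];                   // last-occurrence function
    for (int c = 0; c < ASCII_SIZE; c++) L[c] = -1;
    for (int k = 0; k < m; k++) L[(unsigned char)p[k]] = k;

    int i = m - 1, j = m - 1;            // compare backwards from pattern end
    while (i < n) {
        if (t[i] == p[j]) {
            if (j == 0) return i;        // whole pattern matched
            i--; j--;
        } else {
            int last = L[(unsigned char)t[i]];
            int mn = j < 1 + last ? j : 1 + last;
            i += m - mn;                 // character jump
            j = m - 1;                   // restart at end of pattern
        }
    }
    return -1;
}
```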
❖ ... Boyer-Moore Algorithm

In the mismatch step i=i+m-min(j,1+L[T[i]]), with c=T[i]:
- Case 1: j ≤ 1+L[c] ⇒ i increases by m-j
- Case 2: 1+L[c] < j ⇒ i increases by m-(1+L[c])
❖ Exercise: Boyer-Moore Algorithm

For the alphabet Σ = {a,b,c,d}
- compute last-occurrence function L for pattern P = abacab
- trace Boyer-Moore on P and text T = abacaabadcabacabaabb
- how many comparisons are needed?
❖ ... Boyer-Moore Algorithm

Analysis of Boyer-Moore algorithm:
- Runs in O(n·m+s) time
- n … length of text, m … length of pattern, s … size of alphabet
- Worst case may occur in images and DNA sequences but is unlikely in English text
⇒ Boyer-Moore significantly faster than brute-force on English text
❖ Knuth-Morris-Pratt Algorithm
The Knuth-Morris-Pratt algorithm …
- compares the pattern to the text left-to-right
- but shifts the pattern more intelligently than the brute-force algorithm
Reminder:
- Q is a prefix of P … P = Qω, for some ω∈Σ*
- Q is a suffix of P … P = ωQ, for some ω∈Σ*
❖ ... Knuth-Morris-Pratt Algorithm
When a mismatch occurs …
- what is the most we can shift the pattern to avoid redundant comparisons?
- Answer: the largest prefix of P[0..j] that is a suffix of P[1..j]
❖ ... Knuth-Morris-Pratt Algorithm

KMP preprocesses the pattern to find matches of its prefixes with itself
- Failure function F(j) defined as the size of the largest prefix of P[0..j] that is also a suffix of P[1..j]
- if a mismatch occurs at P[j] ⇒ advance j to F(j-1)

Example: P = abaaba

j    | 0 | 1 | 2 | 3 | 4 | 5 |
P[j] | a | b | a | a | b | a |
F(j) | 0 | 0 | 1 | 1 | 2 | 3 |
❖ ... Knuth-Morris-Pratt Algorithm

KMPMatch(T,P):
|  Input  text T of length n, pattern P of length m
|  Output starting index of a substring of T equal to P
|         -1 if no such substring exists
|
|  F=failureFunction(P)
|  i=0, j=0
|  while i<n do
|  |  if T[i]=P[j] then
|  |  |  if j=m-1 then
|  |  |  |  return i-j
|  |  |  else
|  |  |  |  i=i+1, j=j+1
|  |  |  end if
|  |  else
|  |  |  if j>0 then
|  |  |  |  j=F[j-1]
|  |  |  else
|  |  |  |  i=i+1
|  |  |  end if
|  |  end if
|  end while
|  return -1
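The pseudocode translates to C as follows; the failure-function construction (shown in pseudocode on a later slide) is included so the block is self-contained, and the function names are mine:

```c
#include <stdlib.h>
#include <string.h>

// F[j] = size of the largest prefix of p[0..j] that is also a suffix of p[1..j]
static void failureFunction(const char *p, int m, int F[]) {
    F[0] = 0;
    int i = 1, j = 0;
    while (i < m) {
        if (p[i] == p[j]) {
            F[i] = j + 1;        // extend the current prefix match
            i++; j++;
        } else if (j > 0) {
            j = F[j - 1];        // fall back to the next shorter border
        } else {
            F[i] = 0;
            i++;
        }
    }
}

// Return the starting index of the first occurrence of p in t, or -1.
// O(n+m) time.
int kmpMatch(const char *t, const char *p) {
    int n = strlen(t), m = strlen(p);
    if (m == 0) return 0;
    int *F = malloc(m * sizeof(int));
    failureFunction(p, m, F);
    int i = 0, j = 0, result = -1;
    while (i < n) {
        if (t[i] == p[j]) {
            if (j == m - 1) { result = i - j; break; }  // full match found
            i++; j++;
        } else if (j > 0) {
            j = F[j - 1];        // skip comparisons using the failure function
        } else {
            i++;
        }
    }
    free(F);
    return result;
}
```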
❖ Exercise: KMP Algorithm

- compute failure function F for pattern P = abacab
- trace Knuth-Morris-Pratt on P and text T = abacaabadcabacabaabb
- how many comparisons are needed?
j    | 0 | 1 | 2 | 3 | 4 | 5 |
P[j] | a | b | a | c | a | b |
F(j) | 0 | 0 | 1 | 0 | 1 | 2 |

19 comparisons in total
❖ ... Knuth-Morris-Pratt Algorithm

Construction of the failure function is similar to the KMP algorithm itself:

failureFunction(P):
|  Input  pattern P of length m
|  Output failure function for P
|
|  F[0]=0
|  i=1, j=0
|  while i<m do
|  |  if P[i]=P[j] then
|  |  |  F[i]=j+1
|  |  |  i=i+1, j=j+1
|  |  else if j>0 then
|  |  |  j=F[j-1]
|  |  else
|  |  |  F[i]=0
|  |  |  i=i+1
|  |  end if
|  end while
|  return F
❖ ... Knuth-Morris-Pratt Algorithm

Analysis of failure function computation:
- At each iteration of the while-loop, either
- i increases by one, or
- the "shift amount" i-j increases by at least one (observe that F(j-1)<j)
- Hence, there are no more than 2·m iterations of the while-loop
⇒ failure function can be computed in O(m) time
❖ ... Knuth-Morris-Pratt Algorithm
Analysis of Knuth-Morris-Pratt algorithm:
- Failure function can be computed in O(m) time
- At each iteration of the while-loop, either
- i increases by one, or
- the "shift amount" i-j increases by at least one (observe that F(j-1)<j)
- Hence, there are no more than 2·n iterations of the while-loop
⇒ KMP's algorithm runs in optimal time O(m+n)
Boyer-Moore algorithm
- decides how far to jump ahead based on the mismatched character in the text
- works best on large alphabets and natural language texts (e.g. English)
Knuth-Morris-Pratt algorithm
- uses information embodied in the pattern to determine where the next match could begin
- works best on small alphabets (e.g. A,C,G,T)
Preprocessing the pattern speeds up pattern matching queries
- After preprocessing P, KMP algorithm performs pattern matching in time proportional to the text length
If the text is large, immutable and searched for often (e.g., works by Shakespeare)
- we can preprocess the text instead of the pattern
❖ ... Preprocessing Strings
A trie …
- is a compact data structure for representing a set of strings
- e.g. all the words in a text, a dictionary etc.
- supports pattern matching queries in time proportional to the pattern size
Note: Trie comes from retrieval, but is pronounced like "try" to distinguish it from "tree"
Tries are trees organised using parts of keys (rather than whole keys)
Each node in a trie …
- contains one part of a key (typically one character)
- may have up to 26 children (one per letter of the alphabet)
- may be tagged as a "finishing" node
- but even "finishing" nodes may have children
Depth d of trie = length of longest key value
Cost of searching O(d) (independent of n)
Possible trie representation:

#define ALPHABET_SIZE 26

typedef struct Node *Trie;

typedef struct Node {
   bool finish;                 // marks the end of a complete key
   Item data;                   // item stored at a finishing node
   Trie child[ALPHABET_SIZE];   // one subtrie per letter
} Node;

typedef char *Key;
Note: Can also use BST-like nodes for more space-efficient implementation of tries
Basic operations on tries:
- search for a key
- insert a key
Traversing a path, using the key char-by-char:

find(trie,key):
|  Input  trie, key
|  Output pointer to element in trie if key found
|         NULL otherwise
|
|  node=trie
|  for each char c in key do
|  |  if node.child[c] exists then
|  |  |  node=node.child[c]
|  |  else
|  |  |  return NULL
|  |  end if
|  end for
|  if node.finish then
|  |  return node
|  else
|  |  return NULL
|  end if
Insertion into trie:

insert(t,item,key):
|  Input  trie t, item with key of length m
|  Output t with item inserted
|
|  if t is empty then
|  |  t=new trie node
|  end if
|  if m=0 then
|  |  t.finish=true, t.data=item
|  else
|  |  t.child[key[0]]=insert(t.child[key[0]],item,key[1..m-1])
|  end if
|  return t
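A simplified C sketch of find and insert for lowercase keys; the Item payload is omitted for brevity, and the function names are mine:

```c
#include <stdbool.h>
#include <stdlib.h>

#define ALPHABET_SIZE 26

typedef struct Node *Trie;
typedef struct Node {
    bool finish;                        // end of a complete key?
    Trie child[ALPHABET_SIZE];
} Node;

// Insert key (lowercase letters only), creating nodes as needed.
Trie trieInsert(Trie t, const char *key) {
    if (t == NULL)
        t = calloc(1, sizeof(Node));    // zeroed: finish=false, children NULL
    if (key[0] == '\0') {
        t->finish = true;               // mark end of key
    } else {
        int c = key[0] - 'a';
        t->child[c] = trieInsert(t->child[c], key + 1);
    }
    return t;
}

// Return true iff key was inserted as a complete key.
bool trieFind(Trie t, const char *key) {
    for (; t != NULL && key[0] != '\0'; key++)
        t = t->child[key[0] - 'a'];
    return t != NULL && t->finish;
}
```

Note how "an" is not found after inserting "ant" and "and": the node for "an" exists but is not a finishing node.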
Analysis of standard tries:
- O(n) space
- insertion and search in time O(d·m)
- n … total size of text (e.g. sum of lengths of all strings in a given dictionary)
- m … size of the string parameter of the operation (the "key")
- d … size of the underlying alphabet (e.g. 26)
❖ Word Matching with Tries
Preprocessing the text:
- Insert all searchable words of a text into a trie
- Each leaf stores the occurrence(s) of the associated word in the text
❖ ... Word Matching with Tries
Example text and corresponding trie of searchable words:
Compressed tries …
- have internal nodes of degree ≥ 2
- are obtained from standard tries by compressing "redundant" chains of nodes
Possible compact representation of a compressed trie to encode an array S of strings:
- nodes store ranges of indices instead of substrings
- use triple (i,j,k) to represent substring S[i][j..k]
- requires O(s) space (s = #strings in array S)
❖ Pattern Matching with Suffix Tries
The suffix trie of a text T is the compressed trie of all the suffixes of T
❖ ... Pattern Matching with Suffix Tries
Input:
- compact suffix trie for text T
- pattern P
Goal:
- find starting index of a substring of T equal to P
❖ ... Pattern Matching with Suffix Tries
Analysis of pattern matching using suffix tries:
Suffix trie for a text of size n …
- can be constructed in O(n) time
- uses O(n) space
- supports pattern matching queries in O(s·m) time
- m … length of the pattern
- s … size of the alphabet
Problem: Efficiently encode a given string X by a smaller string Y
Applications:
- Save memory and/or bandwidth
Huffman's algorithm
- computes frequency f(c) for each character c
- encodes high-frequency characters with short code words
- no code word is a prefix of another code word
- uses optimal encoding tree to determine the code words
Code … mapping of each character to a binary code word
Prefix code … binary code such that no code word is a prefix of another code word
Encoding tree …
- represents a prefix code
- each leaf stores a character
- code word given by the path from the root to the leaf (0 for left child, 1 for right child)
Text compression problem
Given a text T, find a prefix code that yields the shortest encoding of T
- short codewords for frequent characters
- long code words for rare characters
Example: T = abracadabra
- encoding tree T1 requires 29 bits to encode text T; encoding tree T2 requires 24 bits
Huffman's algorithm
- computes frequency f(c) for each character
- successively combines pairs of lowest-frequency characters to build encoding tree "bottom-up"
Example: abracadabra
Huffman's algorithm using a priority queue:

HuffmanCode(T):
|  Input  string T of size n
|  Output optimal encoding tree for T
|
|  compute frequency array
|  Q=new priority queue
|  for all characters c do
|  |  t=new single-node tree storing c
|  |  join(Q,t) with frequency(c) as key
|  end for
|  while |Q|≥2 do
|  |  f1=Q.minKey(), T1=leave(Q)
|  |  f2=Q.minKey(), T2=leave(Q)
|  |  t=new tree node with subtrees T1 and T2
|  |  join(Q,t) with f1+f2 as key
|  end while
|  return leave(Q)
Larger example: a fast runner need never be afraid of the dark
Analysis of Huffman's algorithm:
- O(n+d·log d) time
- n … length of the input text T
- d … number of distinct characters in T
- Alphabets and words
- Pattern matching
- Boyer-Moore, Knuth-Morris-Pratt
- Tries
- Text compression
- Suggested reading:
- Tries … Sedgewick, Ch.15.2
Produced: 15 Mar 2019