Text Processing Algorithms ♢ COMP2521 ♢ (19T1)
Text Processing Algorithms ♢ COMP2521 (19T1) ♢ Week 09 ♢ [0/63]
A string is a sequence of characters.
An alphabet Σ is the set of possible characters in strings.
Examples of strings:
- C program
- HTML document
- DNA sequence
- Digitized image
Examples of alphabets:
- ASCII
- Unicode
- {0,1}
- {A,C,G,T}
Notation:
- length(P) … #characters in P
- λ … empty string (length(λ) = 0)
- Σᵐ … set of all strings of length m over alphabet Σ
- Σ* … set of all strings over alphabet Σ
- νω … concatenation of strings ν and ω

Note: length(νω) = length(ν) + length(ω), and λω = ω = ωλ
Notation:
- substring of P … any string Q such that P = νQω, for some ν,ω∈Σ*
- prefix of P … any string Q such that P = Qω, for some ω∈Σ*
- suffix of P … any string Q such that P = ωQ, for some ω∈Σ*
The string "a/a" of length 3 over the ASCII alphabet has
- how many prefixes?
- how many suffixes?
- how many substrings?
- 4 prefixes: "", "a", "a/", "a/a"
- 4 suffixes: "a/a", "/a", "a", ""
- 6 substrings: "", "a", "/", "a/", "/a", "a/a"

Note: "" means the same as λ (= empty string)
ASCII (American Standard Code for Information Interchange)
- Specifies mapping of 128 characters to integers 0..127
- The characters encoded include:
- upper and lower case English letters: A-Z and a-z
- digits: 0-9
- common punctuation symbols
- special non-printing characters: e.g. newline and space
Reminder: In C, a string is an array of chars containing ASCII codes
- these arrays have an extra element containing a 0
- the extra 0 can also be written '\0' (null character or null terminator)
- convenient because we don't have to track the length of the string

Because strings are so common, C provides convenient syntax:

char str[] = "hello";

Note: str[] will have 6 elements
C provides a number of string manipulation functions via #include <string.h>, e.g. strlen(), strncpy(), strncat(), strstr()

Example:

char *strncat(char *dest, const char *src, size_t n)

- appends string src to the end of dest, overwriting the '\0' at the end of dest, and adds a terminating '\0'
- returns start of string dest
- will never append more than n characters of src
- (the result is always null-terminated, so dest must have room for strlen(dest)+n+1 bytes)
Example (pattern checked backwards):
- Text … abacaab
- Pattern … abacab
Given two strings T (text) and P (pattern),
the pattern matching problem consists of finding a substring of T equal to P
Applications:
- Text editors
- Search engines
- Biological research
Brute-force pattern matching algorithm
- checks each possible shift of P relative to T
- until a match is found, or
- all placements of the pattern have been tried
BruteForceMatch(T,P):
|  Input  text T of length n, pattern P of length m
|  Output starting index of a substring of T equal to P
|         -1 if no such substring exists
|
|  for all i=0..n-m do
|  |  j=0
|  |  while j<m ∧ T[i+j]=P[j] do
|  |  |  j=j+1
|  |  end while
|  |  if j=m then
|  |  |  return i
|  |  end if
|  end for
|  return -1
❖ Analysis of Brute-force Pattern Matching
Brute-force pattern matching runs in O(n·m) time

Example of worst case (forward checking):
- T = aaa…ah
- P = aaah
- may occur in DNA sequences
- unlikely in English text
The Boyer-Moore pattern matching algorithm is based on two heuristics:
- Looking-glass heuristic: Compare P with a subsequence of T moving backwards
- Character-jump heuristic: When a mismatch occurs at T[i]=c
- if P contains c ⇒ shift P so as to align the last occurrence of c in P with T[i]
- otherwise ⇒ shift P so as to align P[0] with T[i+1] (a.k.a. "big jump")
❖ ... Boyer-Moore Algorithm

Boyer-Moore algorithm preprocesses pattern P and alphabet Σ to build the last-occurrence function L
- L maps Σ to integers such that L(c) is defined as
- the largest index i such that P[i]=c, or
- -1 if no such index exists

Example: Σ = {a,b,c,d}, P = acab ⇒ L(a)=2, L(b)=3, L(c)=1, L(d)=-1
- L can be represented by an array indexed by the numeric codes of the characters
- L can be computed in O(m+s) time (m … length of pattern, s … size of Σ)
❖ ... Boyer-Moore Algorithm

BoyerMooreMatch(T,P,Σ):
|  Input  text T of length n, pattern P of length m, alphabet Σ
|  Output starting index of a substring of T equal to P
|         -1 if no such substring exists
|
|  L=lastOccurrenceFunction(P,Σ)
|  i=m-1, j=m-1
|  repeat
|  |  if T[i]=P[j] then
|  |  |  if j=0 then
|  |  |  |  return i
|  |  |  else
|  |  |  |  i=i-1, j=j-1
|  |  |  end if
|  |  else
|  |  |  i=i+m-min(j,1+L[T[i]])
|  |  |  j=m-1
|  |  end if
|  until i≥n
|  return -1

- Biggest jump (m characters ahead) occurs when L[T[i]] = -1
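A C sketch of the matcher (character-jump heuristic only, as in the pseudocode; the last-occurrence array is computed inline so the block is self-contained, and the function name is mine):

```c
#include <assert.h>
#include <string.h>

#define ASCII_SIZE 128

// Boyer-Moore with the character-jump heuristic.
// Returns the starting index of p in t, or -1.
int boyerMooreMatch(const char *t, const char *p) {
    int n = strlen(t), m = strlen(p);
    if (m == 0) return 0;
    if (m > n) return -1;

    int L[ASCII_SIZE];                   // last-occurrence function
    for (int c = 0; c < ASCII_SIZE; c++) L[c] = -1;
    for (int k = 0; k < m; k++) L[(unsigned char)p[k]] = k;

    int i = m - 1, j = m - 1;            // compare backwards from pattern end
    while (i < n) {
        if (t[i] == p[j]) {
            if (j == 0) return i;        // whole pattern matched
            i--; j--;
        } else {
            int last = L[(unsigned char)t[i]];
            int mn = j < 1 + last ? j : 1 + last;
            i += m - mn;                 // character jump
            j = m - 1;                   // restart at end of pattern
        }
    }
    return -1;
}
```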
❖ ... Boyer-Moore Algorithm

In the mismatch step i=i+m-min(j,1+L[T[i]]), with c=T[i]:
- Case 1: j ≤ 1+L[c] ⇒ i increases by m-j
- Case 2: 1+L[c] < j ⇒ i increases by m-(1+L[c])
❖ Exercise: Boyer-Moore Algorithm

For the alphabet Σ = {a,b,c,d}
- compute last-occurrence function L for pattern P = abacab
- trace Boyer-Moore on P and text T = abacaabadcabacabaabb
- how many comparisons are needed?
❖ ... Boyer-Moore Algorithm

Analysis of Boyer-Moore algorithm:
- Runs in O(n·m+s) time
- n … length of text, m … length of pattern, s … size of alphabet
- Worst case may occur in images and DNA sequences but is unlikely in English text
⇒ Boyer-Moore significantly faster than brute-force on English text
❖ Knuth-Morris-Pratt Algorithm
The Knuth-Morris-Pratt algorithm …
- compares the pattern to the text left-to-right
- but shifts the pattern more intelligently than the brute-force algorithm
Reminder:
- Q is a prefix of P … P = Qω, for some ω∈Σ*
- Q is a suffix of P … P = ωQ, for some ω∈Σ*
❖ ... Knuth-Morris-Pratt Algorithm
When a mismatch occurs …
- what is the most we can shift the pattern to avoid redundant comparisons?
- Answer: the largest prefix of P[0..j] that is a suffix of P[1..j]
❖ ... Knuth-Morris-Pratt Algorithm

KMP preprocesses the pattern to find matches of its prefixes with itself
- Failure function F(j) defined as the size of the largest prefix of P[0..j] that is also a suffix of P[1..j]
- if a mismatch occurs at P[j] ⇒ advance j to F(j-1)

Example: P = abaaba

j    | 0 | 1 | 2 | 3 | 4 | 5 |
P[j] | a | b | a | a | b | a |
F(j) | 0 | 0 | 1 | 1 | 2 | 3 |
❖ ... Knuth-Morris-Pratt Algorithm

KMPMatch(T,P):
|  Input  text T of length n, pattern P of length m
|  Output starting index of a substring of T equal to P
|         -1 if no such substring exists
|
|  F=failureFunction(P)
|  i=0, j=0
|  while i<n do
|  |  if T[i]=P[j] then
|  |  |  if j=m-1 then
|  |  |  |  return i-j
|  |  |  else
|  |  |  |  i=i+1, j=j+1
|  |  |  end if
|  |  else
|  |  |  if j>0 then
|  |  |  |  j=F[j-1]
|  |  |  else
|  |  |  |  i=i+1
|  |  |  end if
|  |  end if
|  end while
|  return -1
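The pseudocode translates to C as follows; the failure-function construction (shown in pseudocode on a later slide) is included so the block is self-contained, and the function names are mine:

```c
#include <stdlib.h>
#include <string.h>

// F[j] = size of the largest prefix of p[0..j] that is also a suffix of p[1..j]
static void failureFunction(const char *p, int m, int F[]) {
    F[0] = 0;
    int i = 1, j = 0;
    while (i < m) {
        if (p[i] == p[j]) {
            F[i] = j + 1;        // extend the current prefix match
            i++; j++;
        } else if (j > 0) {
            j = F[j - 1];        // fall back to the next shorter border
        } else {
            F[i] = 0;
            i++;
        }
    }
}

// Return the starting index of the first occurrence of p in t, or -1.
// O(n+m) time.
int kmpMatch(const char *t, const char *p) {
    int n = strlen(t), m = strlen(p);
    if (m == 0) return 0;
    int *F = malloc(m * sizeof(int));
    failureFunction(p, m, F);
    int i = 0, j = 0, result = -1;
    while (i < n) {
        if (t[i] == p[j]) {
            if (j == m - 1) { result = i - j; break; }  // full match found
            i++; j++;
        } else if (j > 0) {
            j = F[j - 1];        // skip comparisons using the failure function
        } else {
            i++;
        }
    }
    free(F);
    return result;
}
```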
❖ Exercise: KMP Algorithm

- compute failure function F for pattern P = abacab
- trace Knuth-Morris-Pratt on P and text T = abacaabadcabacabaabb
- how many comparisons are needed?
j    | 0 | 1 | 2 | 3 | 4 | 5 |
P[j] | a | b | a | c | a | b |
F(j) | 0 | 0 | 1 | 0 | 1 | 2 |

19 comparisons in total
❖ ... Knuth-Morris-Pratt Algorithm

Construction of the failure function is similar to the KMP algorithm itself:

failureFunction(P):
|  Input  pattern P of length m
|  Output failure function for P
|
|  F[0]=0
|  i=1, j=0
|  while i<m do
|  |  if P[i]=P[j] then
|  |  |  F[i]=j+1
|  |  |  i=i+1, j=j+1
|  |  else if j>0 then
|  |  |  j=F[j-1]
|  |  else
|  |  |  F[i]=0
|  |  |  i=i+1
|  |  end if
|  end while
|  return F
❖ ... Knuth-Morris-Pratt Algorithm

Analysis of failure function computation:
- At each iteration of the while-loop, either
- i increases by one, or
- the "shift amount" i-j increases by at least one (observe that F(j-1)<j)
- Hence, there are no more than 2·m iterations of the while-loop
⇒ failure function can be computed in O(m) time
❖ ... Knuth-Morris-Pratt Algorithm
Analysis of Knuth-Morris-Pratt algorithm:
- Failure function can be computed in O(m) time
- At each iteration of the while-loop, either
- i increases by one, or
- the "shift amount" i-j increases by at least one (observe that F(j-1)<j)
- Hence, there are no more than 2·n iterations of the while-loop
⇒ KMP's algorithm runs in optimal time O(m+n)
Boyer-Moore algorithm
- decides how far to jump ahead based on the mismatched character in the text
- works best on large alphabets and natural language texts (e.g. English)
Knuth-Morris-Pratt algorithm
- uses information embodied in the pattern to determine where the next match could begin
- works best on small alphabets (e.g. A,C,G,T)
Preprocessing the pattern speeds up pattern matching queries
- After preprocessing P, KMP algorithm performs pattern matching in time proportional to the text length
If the text is large, immutable and searched for often (e.g., works by Shakespeare)
- we can preprocess the text instead of the pattern
❖ ... Preprocessing Strings
A trie …
- is a compact data structure for representing a set of strings
- e.g. all the words in a text, a dictionary etc.
- supports pattern matching queries in time proportional to the pattern size
Note: Trie comes from retrieval, but is pronounced like "try" to distinguish it from "tree"
Tries are trees organised using parts of keys (rather than whole keys)
Each node in a trie …
- contains one part of a key (typically one character)
- may have up to 26 children (one per letter of the alphabet)
- may be tagged as a "finishing" node
- but even "finishing" nodes may have children
Depth d of trie = length of longest key value
Cost of searching O(d) (independent of n)
Possible trie representation:

#define ALPHABET_SIZE 26

typedef struct Node *Trie;

typedef struct Node {
   bool finish;                 // marks the end of a complete key
   Item data;                   // item stored at a finishing node
   Trie child[ALPHABET_SIZE];   // one subtrie per letter
} Node;

typedef char *Key;
Note: Can also use BST-like nodes for more space-efficient implementation of tries
Basic operations on tries:
- search for a key
- insert a key
Traversing a path, using the key char-by-char:

find(trie,key):
|  Input  trie, key
|  Output pointer to element in trie if key found
|         NULL otherwise
|
|  node=trie
|  for each char c in key do
|  |  if node.child[c] exists then
|  |  |  node=node.child[c]
|  |  else
|  |  |  return NULL
|  |  end if
|  end for
|  if node.finish then
|  |  return node
|  else
|  |  return NULL
|  end if
Insertion into trie:

insert(t,item,key):
|  Input  trie t, item with key of length m
|  Output t with item inserted
|
|  if t is empty then
|  |  t=new trie node
|  end if
|  if m=0 then
|  |  t.finish=true, t.data=item
|  else
|  |  t.child[key[0]]=insert(t.child[key[0]],item,key[1..m-1])
|  end if
|  return t
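A simplified C sketch of find and insert for lowercase keys; the Item payload is omitted for brevity, and the function names are mine:

```c
#include <stdbool.h>
#include <stdlib.h>

#define ALPHABET_SIZE 26

typedef struct Node *Trie;
typedef struct Node {
    bool finish;                        // end of a complete key?
    Trie child[ALPHABET_SIZE];
} Node;

// Insert key (lowercase letters only), creating nodes as needed.
Trie trieInsert(Trie t, const char *key) {
    if (t == NULL)
        t = calloc(1, sizeof(Node));    // zeroed: finish=false, children NULL
    if (key[0] == '\0') {
        t->finish = true;               // mark end of key
    } else {
        int c = key[0] - 'a';
        t->child[c] = trieInsert(t->child[c], key + 1);
    }
    return t;
}

// Return true iff key was inserted as a complete key.
bool trieFind(Trie t, const char *key) {
    for (; t != NULL && key[0] != '\0'; key++)
        t = t->child[key[0] - 'a'];
    return t != NULL && t->finish;
}
```

Note how "an" is not found after inserting "ant" and "and": the node for "an" exists but is not a finishing node.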
Analysis of standard tries:
- O(n) space
- insertion and search in time O(d·m)
- n … total size of text (e.g. sum of lengths of all strings in a given dictionary)
- m … size of the string parameter of the operation (the "key")
- d … size of the underlying alphabet (e.g. 26)
❖ Word Matching with Tries
Preprocessing the text:
- Insert all searchable words of a text into a trie
- Each leaf stores the occurrence(s) of the associated word in the text
❖ ... Word Matching with Tries
Example text and corresponding trie of searchable words:
Compressed tries …
- have internal nodes of degree ≥ 2
- are obtained from standard tries by compressing "redundant" chains of nodes
Possible compact representation of a compressed trie to encode an array S of strings:
- nodes store ranges of indices instead of substrings
- use triple (i,j,k) to represent substring S[i][j..k]
- requires O(s) space (s = #strings in array S)
❖ Pattern Matching with Suffix Tries
The suffix trie of a text T is the compressed trie of all the suffixes of T
❖ ... Pattern Matching with Suffix Tries
Input:
- compact suffix trie for text T
- pattern P
Goal:
- find starting index of a substring of T equal to P
❖ ... Pattern Matching with Suffix Tries
Analysis of pattern matching using suffix tries:
Suffix trie for a text of size n …
- can be constructed in O(n) time
- uses O(n) space
- supports pattern matching queries in O(s·m) time
- m … length of the pattern
- s … size of the alphabet
Problem: Efficiently encode a given string X by a smaller string Y
Applications:
- Save memory and/or bandwidth
Huffman's algorithm
- computes frequency f(c) for each character c
- encodes high-frequency characters with short code words
- no code word is a prefix of another code word
- uses optimal encoding tree to determine the code words
Code … mapping of each character to a binary code word
Prefix code … binary code such that no code word is a prefix of another code word
Encoding tree …
- represents a prefix code
- each leaf stores a character
- code word given by the path from the root to the leaf (0 for left child, 1 for right child)
Text compression problem
Given a text T, find a prefix code that yields the shortest encoding of T
- short codewords for frequent characters
- long code words for rare characters
Example: T = abracadabra
- encoding tree T1 requires 29 bits to encode text T; encoding tree T2 requires 24 bits
Huffman's algorithm
- computes frequency f(c) for each character
- successively combines pairs of lowest-frequency characters to build encoding tree "bottom-up"
Example: abracadabra
Huffman's algorithm using a priority queue:

HuffmanCode(T):
|  Input  string T of size n
|  Output optimal encoding tree for T
|
|  compute frequency array
|  Q=new priority queue
|  for all characters c do
|  |  t=new single-node tree storing c
|  |  join(Q,t) with frequency(c) as key
|  end for
|  while |Q|≥2 do
|  |  f1=Q.minKey(), T1=leave(Q)
|  |  f2=Q.minKey(), T2=leave(Q)
|  |  t=new tree node with subtrees T1 and T2
|  |  join(Q,t) with f1+f2 as key
|  end while
|  return leave(Q)
Larger example: a fast runner need never be afraid of the dark
Analysis of Huffman's algorithm:
- O(n+d·log d) time
- n … length of the input text T
- d … number of distinct characters in T
- Alphabets and words
- Pattern matching
- Boyer-Moore, Knuth-Morris-Pratt
- Tries
- Text compression
- Suggested reading:
- Tries … Sedgewick, Ch.15.2
Produced: 15 Mar 2019