Text Compression

COMP2521 20T2 ♢ Text Compression [0/10]
❖ Text Compression

Problem: Efficiently encode a given string T  by a smaller string E

Applications:

Huffman's algorithm
COMP2521 20T2 ♢ Text Compression [1/10]
❖ ... Text Compression

Code

Prefix code

Encoding tree
COMP2521 20T2 ♢ Text Compression [2/10]
❖ ... Text Compression

Example:

[Diagram:Pics/encoding.png]


T  as ascii chars would require 56 bits; E  needs only 16 bits

COMP2521 20T2 ♢ Text Compression [3/10]
❖ ... Text Compression

Example: T = abracadabra

[Diagram:Pics/two-encodings.png]

Encoding with T1  is  010.11.011.010.00.010.10.010.11.011.010  i.e. 29 bits

Encoding with T2  is  00.10.11.00.010.00.011.00.10.11.00  i.e. 24 bits

The dots are there only to distinguish characters; they are not part of the encoding.

COMP2521 20T2 ♢ Text Compression [4/10]
❖ ... Text Compression


Text compression problem

Given a text T


Some obvious strategies … But how to ensure optimal  encoding?
COMP2521 20T2 ♢ Text Compression [5/10]
❖ ... Text Compression

Huffman's algorithm

Example:   T  =  abracadabra

[Diagram:Pics/building.png]

COMP2521 20T2 ♢ Text Compression [6/10]
❖ Huffman Code

Huffman's algorithm using a priority queue:

HuffmanCode(T):
|  Input  string T of size n
|  Output optimal encoding tree for T
|
|  compute frequency array
|  Q = new priority queue (ordered low key to high key)
|  for all characters c do
|     T = new single-node tree storing c
|     join(Q,T) with frequency(c) as key
|  end for
|  while |Q|≥2 do
|     f1 = Q.minKey(),  T1 = leave(Q)
|     f2 = Q.minKey(),  T2 = leave(Q)
|     T = new tree node with subtrees T1 and T2
|     join(Q,T) with f1+f2 as key 
|  end while
|  return leave(Q)

COMP2521 20T2 ♢ Text Compression [7/10]
❖ ... Huffman Code

Larger example: T  =  a fast runner need never be afraid of the dark

[Diagram:Pics/larger-huffman-tree.png]

COMP2521 20T2 ♢ Text Compression [8/10]
❖ Decompression

Decompression involves repeated traversal of paths in tree …

decompress(B, T):
|  Input  bit-string B, encoding tree T
|  Output original string 
|
|  start from root of Tree
|  for each b in Bits do
|     if b = 1 then
|        go right in Tree
|     else
|        go left in Tree
|     end if
|     if reached leaf then
|         print char in leaf
|         return to root of Tree
|     end if
|  end for

COMP2521 20T2 ♢ Text Compression [9/10]
❖ Analysis of Huffman Encoding


Analysis of Huffman's algorithm:

Gives complexity: O(m + v log v) time


Many variations exist to improve compression (e.g. gzip, bzip2, xz)

COMP2521 20T2 ♢ Text Compression [10/10]


Produced: 9 Aug 2020