![]() | ||
In computer science and information theory, Tunstall coding is a form of entropy coding used for lossless data compression.
Contents
History
Tunstall coding was the subject of Brian Parker Tunstall's PhD thesis in 1967, while at Georgia Institute of Technology. The subject of that thesis was "Synthesis of noiseless compression codes"
Its design is a precursor to Lempel-Ziv.
Properties
Unlike variable-length codes, which include Huffman and Lempel–Ziv coding, Tunstall coding is a code which maps source symbols to a fixed number of bits.
Both Tunstall codes and Lempel-Ziv codes represent variable-length words by fixed-length codes.
Unlike typical set encoding, Tunstall coding parses a stochastic source with codewords of variable length.
It can be shown that, for a large enough dictionary, the number of bits per source letter can be infinitely close to
Algorithm
The algorithm requires as input an input alphabet
Example
Let's imagine that we wish to encode the string "hello, world". Let's further assume (somewhat unrealistically) that the input alphabet
We initialize the tree, starting with a tree of
We then take the leaf of highest probability (here,
We obtain 17 words, which can each be encoded into a fixed-sized output of
Note that we could iterate further, increasing the number of words by
Limitations
Tunstall coding requires the algorithm to know, prior to the parsing operation, what the distribution of probabilities for each letter of the alphabet is. This issue is shared with Huffman coding.
Its requiring a fixed-length block output makes it lesser than Lempel-Ziv, which has a similar dictionary-based design, but with a variable-sized block output.