Home > Algorithms > Nice Compression: Huffman Coding

Nice Compression: Huffman Coding

In this post we are going to see a brief resume of one of the most famous coding algorithms, Huffman coding. It’s purpose is to find a way to encode information in less space than the standard one, that is, to compress information. It was developed by David Huffman and published in 1952.

We are going to see examples using plain text encoded in ASCII. Later on the post we will see how this algorithm can be applied to other sorts of data.

Basic idea

If we have some text, usually we are storing 8 bits of information for each of its characters, which allows us to represent 256 different characters. Even in the subset of those 256 characters actually being used, the frequency at which they appear is (usually) very different. That is the fact that Huffman coding tries to exploit. If we could encode frequent characters with fewer bits than characters less frequent, the total length of the coded text will usually be smaller. (Yes, if you want to encode a text where every character has the same frequency, Huffman coding cannot help you. Sorry about that).

So, we need a non-ambiguous, variable-length way to encode characters, where characters get codes with length inversely proportional to their frequency. By non-ambiguous I mean that we should have a way to know where one character’s encoding ends, and so the next one begins. For example, if ‘A’ was coded as 1 and ‘B’ as 11, and we received 11, we could not know whether it was encoding “AA” or “B”‘.

David Huffman approached this problem using a binary tree, where at each node 0 represents a child and 1 the other one, and the characters we want to encode are represented by the leaves. The most frequent characters would be, in this tree, nearer from the root, and the most rarely found characters would be in deeper levels.

Tree construction

So, how can we construct this kind of tree in such a way that it minimizes the resulting encoding length? First of all, we need to count the frequency of every character. We are going to see the set of characters as elements, each of them with an associated weight, being the weight the number of times that it appears in the text. Now, we just select the two elements with the lowest weight, make that two elements siblings of a new parent node, remove those two elements from the set and insert the new sub-tree constructed to the set, being its weight the sum of its two children. We repeat the process until we have only one element, that is, the Huffman tree, which minimizes the encoding length.

Huffman tree
Huffman tree

Using it

We want to use this tree in two ways: for encoding text and for decoding it:

  • Encoding: We have to find the path, for the character we want to encode at each moment, from the tree’s root to the leave where it is. For every node from the root to the character, insert into the coded stream the bit linked to the decision you must take to reach your target character.
  • Decoding: At first we start from the tree root. For each bit we read, we move to its related node, until we reach a leaf. In that moment we get a decoded character, and start again from the tree root.

A thing to take into account is that the decoder should have access to the Huffman tree, so we must send/store it with the encoded data. There is also the option of agreeing beforehand in some Huffman tree so we don’t have to waste space with it, but this can lead to a worse compression ratio.

We should also bear in mind that depending on the text, specially if it is short or has not much difference in frequency among characters, the fact of storing the tree can be counterproductive, but it is usually not.

Evolution and variations

A number of variations with higher performance have been built over this algorithm, but I consider that the original is much simpler and nicer.

Some of these variations consider not only characters but groups of them, or analyze if there are different frequencies of the characters in different parts of a file and then using different trees for the different parts.

Finally, just say that I have just used text examples on this post, but of course the algorithm works well on any kind of data with different values appearing at different frequencies. The fact of some codecs as MP3 or JPEG, among others, using Huffman coding as part of their encoding/decoding algorithms corroborates that and shows David Huffman’s work usefulness.

Algorithms , ,

  1. May 27th, 2009 at 10:20 | #1

    I blogged on Huffman coding too: http://paddy3118.blogspot.com/2009/03/huffman-encoding-in-python.html

    For example implementations in several languages try: http://www.rosettacode.org/wiki/Huffman_codes

    - Paddy.

  1. June 2nd, 2009 at 22:10 | #1