Archive

Archive for the ‘Algorithms’ Category

Markov chains: PageRank and many others

June 13th, 2009

What do Google’s PageRank, procedural music composition, text generation, speech and image recognition, among many, many others have in common?

Yes, you guessed it (it was easy as the title says it ;) : Markov chains are a basic element in their respective algorithms. (Surely, there are other approaches for some of these problems, but Markov chains are widely used).

Markov chains

So, what are Markov chains? They are a process, a sequence of states, where at a given time the next state is chosen probabilistically based only in the current state. That is, they don’t have any memory longer than in which state they are at the moment.

Let’s see a simple example of a Markov chain using a weather prediction system. To simplify it, let us assume that a day has only 2 possible weather states, sunny and rainy, and they last all day. We know that if today is sunny, tomorrow will be sunny with probability 0.9 and rainy with probability 0.1. If today is rainy, tomorrow can be sunny or rainy, both with probability 0.5.

That can be represented with a directed graph with weights on the arcs:

Graph modeling the weather system

Graph modeling the weather system

Any Markovian system can be also represented by a transition matrix. In this matrix, every entry can be read as the probability of going from the state represented by the row to the state represented by the column. The matrix related to our system is:

Weather system associated matrix

Weather system associated matrix

This matrix has some nice properties. For example:

  • P to the power of n represents the probabilities of going from a given state to another in n steps.
  • If you multiply a vector representing the probabilities of being at a given state in a given moment by the matrix P, you get the probabilities of being at every step one step later. So, if we know that before we start we are on a sunny day, the first day will be sunny with probability 0.9 and rainy with probability 0.1, and if we multiply that by the matrix again, we get that the second day will be sunny with probability 0.86 and rainy with probability 0.14
  • If the system has the following two properties
    • You can go from every state to every other state.
    • The number of steps needed to go from any state to the same state does not have to be necessarily multiple of a number different from 1.

    then, as you advance steps the resulting vector converges towards what is known as steady-state or stationary distribution, which is independent from the start input. That is, there exists a number n of steps after which the probability of being in each step does not depend on the start distribution, and does not vary when advancing more steps. In our example, the probabilities of being in a sunny day in the long term are 0.833, and the probabilities of being in a rainy day are 0.167.

If you want to see more in-depth the mathematic operations to find those values, you can find them here.

Pagerank

So, how does this apply to PageRank algorithm? As you would probably know, PageRank is the algorithm used by Google Search to try to measure the importance of a webpage. As a curiosity, as the algorithm was developed by Larry Page while doing a Ph.D in Stanford University, the process was patented by the university and Google had to pay them an amount of shares sold later for $336 million in order to use the algorithm.

What PageRank (really, really simplified) does is to deal with the web as a directed graph, where each page represents a node and each link represents an edge. From a given page, all its links have the same weight. The solution to the page importance problem is the steady-state of the system. In it, probabilities of being at a given page in the long term are a function of how many pages link there and the probability of being at those pages (their importance). That is, a link to another page is like a vote, and the value of a vote is at the same time the sum of the values of the votes the linking page has.

Other applications

Although PageRank is currently a really important algorithm, it would be unfair to say that all what Markov chains can offer is that. Some other applications are:

  • Text generators: Some text is feed as input, and a graph reflecting the partial order between words is built. Then, a random walk is done while generating a new text. A variation is not to consider the state as the current word only but the last n words. Texts generated this way do not usually make sense, but you have to read a few words to realize that. Another approach is not to use words as states but characters, which produces invented words most of the time. Some examples of Markov text generators are:
    • An online generator
    • Chat bots. Kooky is chat bot learning from many IRC channels and generating replies based on what it have received. It produces great responses, worth a read! Mark V Shaney is another example of Markov based IRC bot.
    • Spam generation: Some spammers use Markov text generators to generate unique emails trying to not be caught by spam filters.
  • Spam filtering: Yes, Markov chains are on both sides of the spam battle. It is not only used by the spammers but also by the filters [PDF].
  • Ray-tracing: Markov chains are used in Metropolis Light Transport algorithm.

Of course, this list is not exhaustive. If you are interested in knowing more applications, Wikipedia holds a larger list.

Algorithms ,

Nice Compression: Huffman Coding

May 27th, 2009

In this post we are going to see a brief resume of one of the most famous coding algorithms, Huffman coding. It’s purpose is to find a way to encode information in less space than the standard one, that is, to compress information. It was developed by David Huffman and published in 1952.

We are going to see examples using plain text encoded in ASCII. Later on the post we will see how this algorithm can be applied to other sorts of data.

Basic idea

If we have some text, usually we are storing 8 bits of information for each of its characters, which allows us to represent 256 different characters. Even in the subset of those 256 characters actually being used, the frequency at which they appear is (usually) very different. That is the fact that Huffman coding tries to exploit. If we could encode frequent characters with fewer bits than characters less frequent, the total length of the coded text will usually be smaller. (Yes, if you want to encode a text where every character has the same frequency, Huffman coding cannot help you. Sorry about that).

So, we need a non-ambiguous, variable-length way to encode characters, where characters get codes with length inversely proportional to their frequency. By non-ambiguous I mean that we should have a way to know where one character’s encoding ends, and so the next one begins. For example, if ‘A’ was coded as 1 and ‘B’ as 11, and we received 11, we could not know whether it was encoding “AA” or “B”‘.

David Huffman approached this problem using a binary tree, where at each node 0 represents a child and 1 the other one, and the characters we want to encode are represented by the leaves. The most frequent characters would be, in this tree, nearer from the root, and the most rarely found characters would be in deeper levels.

Tree construction

So, how can we construct this kind of tree in such a way that it minimizes the resulting encoding length? First of all, we need to count the frequency of every character. We are going to see the set of characters as elements, each of them with an associated weight, being the weight the number of times that it appears in the text. Now, we just select the two elements with the lowest weight, make that two elements siblings of a new parent node, remove those two elements from the set and insert the new sub-tree constructed to the set, being its weight the sum of its two children. We repeat the process until we have only one element, that is, the Huffman tree, which minimizes the encoding length.

Huffman tree
Huffman tree

Using it

We want to use this tree in two ways: for encoding text and for decoding it:

  • Encoding: We have to find the path, for the character we want to encode at each moment, from the tree’s root to the leave where it is. For every node from the root to the character, insert into the coded stream the bit linked to the decision you must take to reach your target character.
  • Decoding: At first we start from the tree root. For each bit we read, we move to its related node, until we reach a leaf. In that moment we get a decoded character, and start again from the tree root.

A thing to take into account is that the decoder should have access to the Huffman tree, so we must send/store it with the encoded data. There is also the option of agreeing beforehand in some Huffman tree so we don’t have to waste space with it, but this can lead to a worse compression ratio.

We should also bear in mind that depending on the text, specially if it is short or has not much difference in frequency among characters, the fact of storing the tree can be counterproductive, but it is usually not.

Evolution and variations

A number of variations with higher performance have been built over this algorithm, but I consider that the original is much simpler and nicer.

Some of these variations consider not only characters but groups of them, or analyze if there are different frequencies of the characters in different parts of a file and then using different trees for the different parts.

Finally, just say that I have just used text examples on this post, but of course the algorithm works well on any kind of data with different values appearing at different frequencies. The fact of some codecs as MP3 or JPEG, among others, using Huffman coding as part of their encoding/decoding algorithms corroborates that and shows David Huffman’s work usefulness.

Algorithms , ,

Sorting faster: Radix sort

April 24th, 2009

It may seem a bit useless talking about some sorting algorithm at this point in computer history, and specially about one that dates back as far as 1887. Some people consider that it is enough to know what the sorting routine in your language/library of choice is, but I consider that:

  • It’s definitely good to have a breadth knowledge about different topics in Computer Science, and sorting is a basic field in algorithmics.
  • Radix sort is not useful sometimes, but it can give you a performance boost in some other circumstances.
  • It may seem that all programmers already know how this works. I am sure that not only most of them don’t know how it works, but that there are many Computer Scientists that don’t even know of its existence. I myself did not realize of its existence until I have spent almost all of my years at university.

So, what is radix sort about? It’s not like the basic and traditional sorting algorithms in the sense that no comparisons are made during its execution. Thanks to that, it can break the theoretical O(n log n) lower limit of a sorting algorithm. It’s time complexity is O(k*N), which makes it lineal. (k, in our implementation, will be the number of bytes of the elements to sort).

Let’s go to the point. Imagine first that we want to sort some bytes, for example:

5, 44, 200, 110 and 90

We have a first buffer of 256 integers that will act as a counter of how many bytes with that value are there to sort. We will call this the distribution buffer, because it stores how the bytes are distributed. In our case, all would be set to 0 except elements 5, 44, 90, 110 and 200 which would be set to 1. So, we traverse all the input elements incrementing the counters:

// source is an array with N elements to be sorted
unsigned int distribution[256];
for(unsigned int i=0; i < 256; i++)
	distribution[i] = 0;

for(unsigned int i=0; i < N; i++)
	distribution[source[i]]++;

Once we know how many elements of each value are there, we use another buffer of 256 entries too, which we will call indices buffer, to accumulate the values from the distribution buffer. The entries at the indices buffer indicate at which position, once sorted, an element with a certain value must go. Here, elements until 5 would be 0, elements from, 6 to 44 would be 1, from 45 to 90 would be 2, and so on.

unsigned int indices[256];
for(unsigned int i=0; i<256; i++)
	indices[i] = 0;

for(unsigned int i=1; i<256; i++)
	indices[i] = indices[i-1]+distribution[i-1];

Finally, we iterate over the input elements again, and as we know at which position do they correspond when sorted, we can put them at their place in a new buffer called destination buffer, and which has the size of the input buffer. Once did this, we increment the index for that value as the next element with the same value (if there is another) must go not to the same entry but to the next one. If we want the sorted values to go in the source list, we copy them.

unsigned char destination[N];
for(unsigned int i=0; i < N; i++)
	destination[ indices[source[i]]++ ] = source[i];

for(unsigned int i=0; i < N; i++)
	source[i] = destination[i];

Well, you may say now, I do not want to sort only bytes, I want to sort, for example, 32 bits integers, and if I have to create a couple of buffers of 232 elements each this algorithm is not going to be very useful. That's right, but you don't have to do that. You can notice that due to the way the elements are iterated again looking for its corresponding sorted index, and then that index being incremented, this sort is stable (In short, a stable sort is one which, for elements with the same sorting value, the output order is the same as the input).

So, as the the algorithm is stable we can sort larger types without the need of larger distribution and indices buffers. All what we are going to do is sort first taking into account only the least significant byte, then sorting by the next more significant byte, until we have sorted the elements by the most significant byte, at which point the elements are completely sorted.

In fact, and by what we have seem, it may seem that only unsigned integers can be sorted by radix. But negative and even floating point numbers can be sorted with radix sort, as exposed by Pierre Terdiman.

If you are interested in the topic, more tricks and optimization techniques can be found in this article by Michael Herf.

Algorithms, Optimization , , ,

Randomized algorithms – An introduction

April 6th, 2009

Randomness is a very useful resource at our disposal we should bear in mind when designing algorithms. It may be used in some different ways:

  • To try to avoid special inputs that do our algorithm specially slow.
  • To design faster algorithms than the deterministic ones, at the cost of:
    • Not always getting the correct output
    • Getting an approximated output

Randomized algorithms is a very wide range, where books and several papers have been written, but now we’ll only have a brief introduction.

Randomization on quicksort

Let’s first recall how quicksort works. It’s a divide and conquer algorithm whose base case is a list of size 0 or 1, which are already sorted. On the general case do:

1. Select a pivot among the list elements
2. Split the elements in two lists, one with the elements smaller than the pivot and one with the greater ones.
3. Return the concatenation of quicksort(elements smaller than the pivot) + pivot + quicksort(elements greater than the pivot)

The algorithm is very simple, and there is also a modification of it that does not need extra space for allocating the sorted sublists, but we’ll not cover it here.

What is the cost of the algorithm? If you choose the median as the pivot, you will split the list in two equal-sized sublists and you will reach the desired O(n*log(n)). But if you choose the smaller or greater element in the list, you’ll only move from a list of size n to a list of size n-1. If you continue doing this bad choice until you finish, you’ll get a performance of Ο(n²), which is clearly bad.

So, what makes a difference about getting good performance or bad performance is a good pivot choice. Some implementations rely on the expected case, and choose always the first element of the list as the pivot, and that, on random inputs, leads to an average of O(n*log(n)). But who said that the input was random? What happens if our input is already sorted or almost sorted, or reverse sorted? We’ll get a Ο(n²) performance. Many times we do not control our input, and many times we want to sort almost sorted lists.

What can we do to avoid Ο(n²) cases? Of course, we could find the median of every list and use it as the pivot. Finding the median, though being linear, is expensive. What can we then do? We can choose a random element as a pivot, and although sometimes it will be the smaller or the greater element, it’s very unlikely to keep doing such choices, and 50% of the time, our choice will produce a sublist with at most 75% of the elements of the original list. So, our expected performance, independently of the input, would be O(n*log(n)).

Algorithms , , ,