Archive

Archive for June, 2009

Markov chains: PageRank and many others

June 13th, 2009

What do Google’s PageRank, procedural music composition, text generation, speech and image recognition, among many, many others have in common?

Yes, you guessed it (it was easy as the title says it ;) : Markov chains are a basic element in their respective algorithms. (Surely, there are other approaches for some of these problems, but Markov chains are widely used).

Markov chains

So, what are Markov chains? They are a process, a sequence of states, where at a given time the next state is chosen probabilistically based only in the current state. That is, they don’t have any memory longer than in which state they are at the moment.

Let’s see a simple example of a Markov chain using a weather prediction system. To simplify it, let us assume that a day has only 2 possible weather states, sunny and rainy, and they last all day. We know that if today is sunny, tomorrow will be sunny with probability 0.9 and rainy with probability 0.1. If today is rainy, tomorrow can be sunny or rainy, both with probability 0.5.

That can be represented with a directed graph with weights on the arcs:

Graph modeling the weather system

Graph modeling the weather system

Any Markovian system can be also represented by a transition matrix. In this matrix, every entry can be read as the probability of going from the state represented by the row to the state represented by the column. The matrix related to our system is:

Weather system associated matrix

Weather system associated matrix

This matrix has some nice properties. For example:

  • P to the power of n represents the probabilities of going from a given state to another in n steps.
  • If you multiply a vector representing the probabilities of being at a given state in a given moment by the matrix P, you get the probabilities of being at every step one step later. So, if we know that before we start we are on a sunny day, the first day will be sunny with probability 0.9 and rainy with probability 0.1, and if we multiply that by the matrix again, we get that the second day will be sunny with probability 0.86 and rainy with probability 0.14
  • If the system has the following two properties
    • You can go from every state to every other state.
    • The number of steps needed to go from any state to the same state does not have to be necessarily multiple of a number different from 1.

    then, as you advance steps the resulting vector converges towards what is known as steady-state or stationary distribution, which is independent from the start input. That is, there exists a number n of steps after which the probability of being in each step does not depend on the start distribution, and does not vary when advancing more steps. In our example, the probabilities of being in a sunny day in the long term are 0.833, and the probabilities of being in a rainy day are 0.167.

If you want to see more in-depth the mathematic operations to find those values, you can find them here.

Pagerank

So, how does this apply to PageRank algorithm? As you would probably know, PageRank is the algorithm used by Google Search to try to measure the importance of a webpage. As a curiosity, as the algorithm was developed by Larry Page while doing a Ph.D in Stanford University, the process was patented by the university and Google had to pay them an amount of shares sold later for $336 million in order to use the algorithm.

What PageRank (really, really simplified) does is to deal with the web as a directed graph, where each page represents a node and each link represents an edge. From a given page, all its links have the same weight. The solution to the page importance problem is the steady-state of the system. In it, probabilities of being at a given page in the long term are a function of how many pages link there and the probability of being at those pages (their importance). That is, a link to another page is like a vote, and the value of a vote is at the same time the sum of the values of the votes the linking page has.

Other applications

Although PageRank is currently a really important algorithm, it would be unfair to say that all what Markov chains can offer is that. Some other applications are:

  • Text generators: Some text is feed as input, and a graph reflecting the partial order between words is built. Then, a random walk is done while generating a new text. A variation is not to consider the state as the current word only but the last n words. Texts generated this way do not usually make sense, but you have to read a few words to realize that. Another approach is not to use words as states but characters, which produces invented words most of the time. Some examples of Markov text generators are:
    • An online generator
    • Chat bots. Kooky is chat bot learning from many IRC channels and generating replies based on what it have received. It produces great responses, worth a read! Mark V Shaney is another example of Markov based IRC bot.
    • Spam generation: Some spammers use Markov text generators to generate unique emails trying to not be caught by spam filters.
  • Spam filtering: Yes, Markov chains are on both sides of the spam battle. It is not only used by the spammers but also by the filters [PDF].
  • Ray-tracing: Markov chains are used in Metropolis Light Transport algorithm.

Of course, this list is not exhaustive. If you are interested in knowing more applications, Wikipedia holds a larger list.

Algorithms ,

Throwing some light on NP-completeness and the P=NP problem

June 3rd, 2009

I want to say first that I am an usual reader of Jeff Atwood’s Coding Horror blog. His posts deal with a wide range of interesting subjects, not delving deep into technical topics. Some time ago, though, he tried to explain NP-Complete problems and the P=NP problem (which are different things, by the way), without good results, because he didn’t understand completely the topic. Now, he tried again, but he hardly had better luck.

It seems that this is a confusing topic, and although every computer scientist, programmer, software engineer or related professional has heard about it, many, if not most of them, can’t provide a definition any better than an intuitive one, which is usually not very accurate.

I hope this post is useful to clarify the doubts some of you can have on this topic. And maybe it helps Jeff to write a third part of the “series”, hopefully better than the previous two, and matching his usual writing quality.

First of all, say that unlike what many people think, P and NP are not the only 2 complexity classes. There are many of them. Hundreds of them are being listed and explained on Stanford University’s Complexity Zoo. It’s an extremely deep topic you could spend your whole life learning about. Of course, you don’t need to know everything of it, but it’s good to know the most common classes.

P and NP

So, what are P and NP?

  • P is the class containing all the problems solvable, by a deterministic Turing machine, in a polynomial amount of time depending on the input size. This seems easy to understand to most of us.
  • NP, on the other hand, does not seem so straightforward to understand to everybody. The definition of NP is the same as given for P, but changing “deterministic Turing machine” for “non-deterministic Turing machine”. Adding these 3 letters makes the machine more powerful, so NP includes many problems which are not in P (if P != NP). You can think of the non-determinism as if, every time the machine has to explore 2 different branches, it can explore both of them at the same time. You can also view it as if, when having to explore 2 different branches, it always chooses a branch that leads to accepting (if such a branch exists). Another interesting feature of NP problems is that a solution can be checked in polynomial time by a deterministic Turing machine. So, checking a solution of a NP problem is a P problem.

The problem with non-deterministic machines is just that we don’t have them. So, to solve a problem in NP and not in P (all problems in P are also in NP, as we can just use the non-deterministic Turing machine in a deterministic way) we need to execute the branches one after the other instead of executing them at the same time, and as there can be an exponential number of them, we need an exponential amount of time, depending on the input size, to solve the problem in a current processor.

NP-complete

Another term widely used is NP-complete. When is a problem in the NP-complete class? It has to have two properties:

  1. It has to be a NP problem.
  2. If it can be solved in polynomial time, then all of NP problems can.

The second is in itself a class: NP-hard. A problem can be NP-hard and not NP-complete, because it can be of more difficult complexity class than NP.

To demonstrate that a problem is NP-complete, we can reduce it to a known NP-complete a known NP-complete problem to it in polynomial time. To reduce a problem to another means transform its input to the input of another problem, with the condition of both problems with their respective inputs generating the same output. This would be of limited usefulness if we did not have an initial NP-complete problem, but we have it thanks to Cook’s theorem demonstrating that the Boolean satisfiability problem, also known as SAT, is NP-complete.

Given that, if an algorithm to solve a NP-complete problem in polynomial time in a deterministic machine is found, then all of NP problems would be in P, and the statement P = NP would be true.

P = NP?

As you will probably know, this is a really important open question, and really hard to solve as can be seen from the fact that a million dollars has been offered since year 2000 to the first who proves either its equivalence or its non equivalence.

But, whether P = NP or not, and until the unlikely fact of that being demonstrated, my advice is to, while bearing in mind that the question is open, act as if it was demonstrated that they were different. Deal with NP-complete problems as difficult ones. If you have a NP-complete problem you have to expect exponentially large times for algorithms solving it. This means that if your input is going to be large, the best you can do is look for approximation, heuristic or other non-exact methods. They will not give you the best solution, but, at least, they will last less than the age of the universe, which is something valuable.

Complexity , , ,