Archive

Archive for the ‘Optimization’ Category

Turning your loops faster, II

May 13th, 2009

This post is a continuation to “Turning your loops faster, I”. If you haven’t read it, you may want to before reading this one. Here we continue reviewing some loop optimizing techniques.

Loop unrolling

We saw on the previous post that there is an overhead we pay on each loop iteration. A simple idea, so, is to do in a single new, bigger iteration, the work that was done in many former iterations. This way, there are less iterations to do and less overhead per element.

Due to the fact that the most used current processors can execute instructions out-of-order, a bigger benefit can be gained from loop unrolling as long as there are no data dependencies among different iterations, as we can also avoid the pipeline from flushing in some points, and so not losing cycles. As many other optimitzations, this can be already performed by your compiler, based on some heuristics, but maybe their results are not the optimal ones. You may want to try this specially on loops with many iterations and few code instructions.

// Original loop
sum = 0;
for(i=0; i < N; i++)
    sum += a[i];

// 2-Unrolled loop
sum = 0;
for(i=0; i < N; i+=2)
{
    sum += a[i];
    sum += a[i+1];
}

You have to be sure that the number of iterations is multiple of your unrolling degree; otherwise you have to do the remaining operations, non unrolled, at the end of the unrolled loop.

This was just a simple example, in this case the compiler would probably be already doing unrolling, and maybe in a greater unrolling degree.

Using accumulators

Even if we have data dependencies, sometimes we can slightly modify the algorithm to use different variables in a loop and so avoiding the processor from having to wait for the previous result when it really doesn't needs it.

// 2-Unrolled loop with accumulator
sum = sum2 = 0;
for(i=0; i < N; i+=2)
{
    sum += a[i];
    sum2 += a[i+1];
}
sum += sum2;

This way we do not have data dependencies inside the loop .Compilers usually do not apply this optimization.

Loop collapsing

Sometimes we have nested loops that can be joined into a single loop, usually with pointer arithmetic. This optimization is usually not done by compilers, and it can also enable other optimizations.

// We have a bidimensional array, and we want to set all of its elements to 0
int a[N][M];

// Usual way of setting all elements to 0
for (i = 0; i < N; i++)
    for (j = 0; j < M; j++)
        a[i][j] = 0;

// Collapsing loops
int *p = &a[0][0];
for (i = 0; i < N*M; i++)
    *p++ = 0;

This way we have fewer loop overhead per element.

Loop fusion

This technique can be applied when we have two loops that iterate over the same set (or, at least, they have a common subset). Knowing that, we can merge both loops and have half of the previous loop overhead. Care must be taken to avoid data dependency problems. This is usually not done by the compiler either.

// We have 2 arrays, a and b, of size N. We want to increment all the elements of both arrays.
for(i=0; i < N; i++)
    a[i]++;
for(i=0; i < N; i++)
    b[i]++;

// Fusing the loops
for(i=0; i < N; i++)
{
    a[i]++;
    b[i]++;
}

Reversing order in nested loops

If we have nested loops, and use them as index to access memory, for example, to access a bidimensional array, we must take advantage of the memory hierarchy. Sometimes, just reversing the nested loops order, a big speedup can be achieved. It's important to know your language, in particular how your language stores multidimensional arrays. C stores them by rows, but there are some languages that store them in the inverse way.

I have done some tests on my machine. Here are the two codes:

// Not cache-friendly
for(int i=0; i < N; i++)
	for(int j=0; j < N; j++)
		a[j][i] = 1;

// Cache friendly
for(int i=0; i < N; i++)
	for(int j=0; j < N; j++)
		a[i][j] = 1;

First I tested with an array of 10000 x 10000 int elements. The speed-up obtained by the cache friendly version was 5.46x.

Then I tested with an array of 2¹³ int elements each dimension. In this case, the obtained speed-up was 15.14x. This is caused because the cache friendly code continues exploiting spatial locality while the cache unaware code not only does not exploit it but it's using only a subset of cache lines, because every iteration in the inner loop is accessing memory addresses with the same lower bits, which are used to know where does that memory go in the cache.

Condition evaluation order

When you have more than one condition ORed or ANDed, the compiler will short-circuit it. That is, if the conditions are ORed and the first one is true, it will not check the others, as the result will be true anyway. If the conditions are ANDed and the first one is false, it won't check the others either, as the result will always be false. So, if you have one condition much slower than the others, put it the last one, as you will avoid evaluating it some times. If you know that a condition is going to cause more short-circuits than the other ones, put it the first one so you will avoid evaluating the other conditions more frequently.

Other techniques

These are the most common basic techniques. You can usually achieve greater speed-ups from here with parallellization, using techniques as vectorization or threading, which we will talk about in future posts.

Architecture, Optimization , , ,

Turning your loops faster, I

May 6th, 2009

It’s very usual in our programs that most of its time is spent in loops. So, after having profiled and measured our application, if it is the case, we may want optimize some of our loops. Maybe they are already optimized, by us or by the compiler, but it’s important to know some techniques that may turn our loops faster.

Branch overhead

This is an important concept in order to understand many of the loop optimizations.

Current processors are pipelined. Not a little bit. A lot. An instruction is at many stages before ending its execution. Most of the instructions can be well adapted to a pipelined design without major efforts, but that’s not the case with branching instructions. Because of their nature, they (can) change the next instruction to be executed, from the implicit (the following) one to an explicit one, often depending on a value produced by an instruction which has not yet finished its execution when the branching instruction is fetched. So we have a problem, we have to wait, flush some (many) stages and lose some cycles, something that we don’t like, because it does our programs slower.

In fact, the reality is not so hard. Processor companies have, of course, seen this bottleneck in their products, and have had lots of engineers trying to solve them. The result is that different branching predictors have been designed, some of them are really effective, and what was a huge performance penalty is now a smaller one. Although it is now a smaller problem than it used to be, we still have to be aware of it, specially in loops, because loops imply, inherently, branching.

Inlining

Inlining means inserting the body of a function where it is called instead of just calling it. This can be done both by the compiler or by the programmer, and it’s not just a loop optimization, it can be applied anywhere else, but it’s in loops where we’ll get the bigger benefits of inlining our smaller functions.

Beware that:

  • Compilers, depending of the set options, already try to guess which functions would be good for inlining. You can want to use some tools to look which functions are actually inlined. Try to know what your compiler’s policy is in this aspect.
  • Function inlining can actually cause you performance loses in some cases, because it can imply a larger number of instructions and a worse explotation of the instruction cache.
  • If the compiler does not inlines some function, you try to manually inline it, and you in fact have a performance gain, think before leaving it there if the speed-up is worth having some code duplicated, which can lead to maintenability problems.

Another good thing about inlining is that it can let, in some situations, your compiler to do other kind of optimizations which it wasn’t able to do without inlining, because of a better knowledge of the code context.

Code Hoisting or Loop-invariant code motion

They key idea here is to move some code out of the loop, in order to avoid calculating it at every iteration, when its result will be the same during all the loop execution. Imagine the following code:

for (i = 0; i < n; i++)
{
  if(x)
    a[i] = y;
  else
    a[i] = y*y;
}

Here, a couple of optimizations can be made:

  • "x" does not change inside the loop, so we don't have to check that condition, and execute branching instructions, at every single iteration.
  • "y*y" does not change inside the loop either. So we can have it precomputed outside the loop and assign it.
if(x)
{
    for (i = 0; i < n; i++)
        a[i] = y;
}
else
{
    y2 = y*y;
    for (i = 0; i < n; i++)
        a[i] = y2;
}

The program's semantic is the same but the performance is greater in the second code. Again, be warned that your compiler might be already doing this optimization for you.

More techniques

There are many other techniques, and we will see some of them in the next post, so stay tuned!

EDIT: Second part already avalaible

Architecture, Optimization , , ,

Sorting faster: Radix sort

April 24th, 2009

It may seem a bit useless talking about some sorting algorithm at this point in computer history, and specially about one that dates back as far as 1887. Some people consider that it is enough to know what the sorting routine in your language/library of choice is, but I consider that:

  • It’s definitely good to have a breadth knowledge about different topics in Computer Science, and sorting is a basic field in algorithmics.
  • Radix sort is not useful sometimes, but it can give you a performance boost in some other circumstances.
  • It may seem that all programmers already know how this works. I am sure that not only most of them don’t know how it works, but that there are many Computer Scientists that don’t even know of its existence. I myself did not realize of its existence until I have spent almost all of my years at university.

So, what is radix sort about? It’s not like the basic and traditional sorting algorithms in the sense that no comparisons are made during its execution. Thanks to that, it can break the theoretical O(n log n) lower limit of a sorting algorithm. It’s time complexity is O(k*N), which makes it lineal. (k, in our implementation, will be the number of bytes of the elements to sort).

Let’s go to the point. Imagine first that we want to sort some bytes, for example:

5, 44, 200, 110 and 90

We have a first buffer of 256 integers that will act as a counter of how many bytes with that value are there to sort. We will call this the distribution buffer, because it stores how the bytes are distributed. In our case, all would be set to 0 except elements 5, 44, 90, 110 and 200 which would be set to 1. So, we traverse all the input elements incrementing the counters:

// source is an array with N elements to be sorted
unsigned int distribution[256];
for(unsigned int i=0; i < 256; i++)
	distribution[i] = 0;

for(unsigned int i=0; i < N; i++)
	distribution[source[i]]++;

Once we know how many elements of each value are there, we use another buffer of 256 entries too, which we will call indices buffer, to accumulate the values from the distribution buffer. The entries at the indices buffer indicate at which position, once sorted, an element with a certain value must go. Here, elements until 5 would be 0, elements from, 6 to 44 would be 1, from 45 to 90 would be 2, and so on.

unsigned int indices[256];
for(unsigned int i=0; i<256; i++)
	indices[i] = 0;

for(unsigned int i=1; i<256; i++)
	indices[i] = indices[i-1]+distribution[i-1];

Finally, we iterate over the input elements again, and as we know at which position do they correspond when sorted, we can put them at their place in a new buffer called destination buffer, and which has the size of the input buffer. Once did this, we increment the index for that value as the next element with the same value (if there is another) must go not to the same entry but to the next one. If we want the sorted values to go in the source list, we copy them.

unsigned char destination[N];
for(unsigned int i=0; i < N; i++)
	destination[ indices[source[i]]++ ] = source[i];

for(unsigned int i=0; i < N; i++)
	source[i] = destination[i];

Well, you may say now, I do not want to sort only bytes, I want to sort, for example, 32 bits integers, and if I have to create a couple of buffers of 232 elements each this algorithm is not going to be very useful. That's right, but you don't have to do that. You can notice that due to the way the elements are iterated again looking for its corresponding sorted index, and then that index being incremented, this sort is stable (In short, a stable sort is one which, for elements with the same sorting value, the output order is the same as the input).

So, as the the algorithm is stable we can sort larger types without the need of larger distribution and indices buffers. All what we are going to do is sort first taking into account only the least significant byte, then sorting by the next more significant byte, until we have sorted the elements by the most significant byte, at which point the elements are completely sorted.

In fact, and by what we have seem, it may seem that only unsigned integers can be sorted by radix. But negative and even floating point numbers can be sorted with radix sort, as exposed by Pierre Terdiman.

If you are interested in the topic, more tricks and optimization techniques can be found in this article by Michael Herf.

Algorithms, Optimization , , ,

Turning it faster – Know

April 10th, 2009

This post extends “Turning it faster – Observe”.

Know your compiler

Every compiler has different optimization options, and even on that options that could seem the same, they are not (at least, not always). Things like inlining or not may change a lot from one compiler to other. So, be sure that you know what your compiler does and doesn’t with every option. Check the documentation, or if needed, generate simple examples and disassemble them.

Know how to link dynamically and statically, as there can be a speed difference between them if the functions in that library are called many times, as there is an overhead for every call to a dynamically linked library function.

Know your architecture

Every architecture is different too. Sometimes you will be targeting not perfectly uniform architectures, as when developing for PC’s, but there use to be a set of common features among the targets. How many cache levels does it have? How size are they? What size do the cache lines have? Which latency and throughput do the most common instruction have? All the knowledge you can get about your architecture will be good.

Know your language

In most languages, things can sometimes be done in a few different ways, with possible different speed. Try to know in-depth your language, and for example in C++, how calls to polymorphic functions are made or how a recently created object is returned. Try also to help your compiler with language-dependant features, as adding the “const” keyword to all parameters and methods where possible in C++.

Know your data structures

There are data structures that are used very much, such as containers, and there are many of them. Different kinds of linked list, vectors, hash tables… all of them have advantages and disadvantages depending on how are they going to be used. Be sure to know them because the performance difference can be asymptotic in some cases, and changing your data structure can give you a very big speedup (or maybe slowdown) sometimes.

Optimization , , , ,

Turning it faster – Observe

April 7th, 2009

So, as it uses to happen, you have finished your program’s functionality and you think that it should be faster. But, where to start?

Where is my program spending its time?

The answer to this question is very difficult to know looking only at the source code, so it’s better to apply the “Stop thinking and look” rule. Use your usual profiler and tools to know which parts in the program is the most time spent in. If you aren’t used to profiling, look for and experiment with them. There are many tools of different kind, from Intel’s VTune and AMD’s CodeAnalyst to gprof, oprofile or pin, among many others.

Amdahl is your friend

Or maybe he’s your enemy, but you must know him anyway. Amdahl’s law says that the benefit from an optimization is limited to the amount of time that optimization is being used. This can be seen in the following image:

Amdahl's law

Amdahl's law

As we can see, a small improvement over a large amount of time gives greater benefits than big benefits over a small amount of time. So, once you know where in your program time is being spent, don’t waste your time optimizing that function where 0.5% of the time is being spent.

What kind of optimization do you need?

So, now that you know what function or code you have to optimize in order to obtain a good speedup, before starting with low level optimizations, think what complexity does that code have, measure it. Is there an algorithm with less complexity, a more efficient one? Are there better data structures for this problem?

To exaggerate a little bit, if you are sorting with bubble sort a large list, you’ll never achieve the efficiency of another sorting algorithm no matters how much do you optimize it. Before starting doing low level techniques, be sure that your code could not be more algorithmically optimized.

Methodology

It’s important to follow a methodology when optimizing a program. Of course everyone can modify it to better fit his needs, but it would be something as an iteration over the following steps:

  1. Detect a bottleneck or somewhere in the program or a place where the program is spending too much time.
  2. Introduce just one optimization.
  3. Check the program correctness (we don’t want a fast program that does not what it’s supposed to do).
  4. Analyze the performance of the new code.
  5. Decide whether the optimization is worth the complexity it introduces to the code.
  6. Measure the obtained speedup, and write it down.

Optimization , , , ,