Turning your loops faster, II
This post is a continuation to “Turning your loops faster, I”. If you haven’t read it, you may want to before reading this one. Here we continue reviewing some loop optimizing techniques.
Loop unrolling
We saw on the previous post that there is an overhead we pay on each loop iteration. A simple idea, so, is to do in a single new, bigger iteration, the work that was done in many former iterations. This way, there are less iterations to do and less overhead per element.
Due to the fact that the most used current processors can execute instructions out-of-order, a bigger benefit can be gained from loop unrolling as long as there are no data dependencies among different iterations, as we can also avoid the pipeline from flushing in some points, and so not losing cycles. As many other optimitzations, this can be already performed by your compiler, based on some heuristics, but maybe their results are not the optimal ones. You may want to try this specially on loops with many iterations and few code instructions.
// Original loop
sum = 0;
for(i=0; i < N; i++)
sum += a[i];
// 2-Unrolled loop
sum = 0;
for(i=0; i < N; i+=2)
{
sum += a[i];
sum += a[i+1];
}
You have to be sure that the number of iterations is multiple of your unrolling degree; otherwise you have to do the remaining operations, non unrolled, at the end of the unrolled loop.
This was just a simple example, in this case the compiler would probably be already doing unrolling, and maybe in a greater unrolling degree.
Using accumulators
Even if we have data dependencies, sometimes we can slightly modify the algorithm to use different variables in a loop and so avoiding the processor from having to wait for the previous result when it really doesn't needs it.
// 2-Unrolled loop with accumulator
sum = sum2 = 0;
for(i=0; i < N; i+=2)
{
sum += a[i];
sum2 += a[i+1];
}
sum += sum2;
This way we do not have data dependencies inside the loop .Compilers usually do not apply this optimization.
Loop collapsing
Sometimes we have nested loops that can be joined into a single loop, usually with pointer arithmetic. This optimization is usually not done by compilers, and it can also enable other optimizations.
// We have a bidimensional array, and we want to set all of its elements to 0
int a[N][M];
// Usual way of setting all elements to 0
for (i = 0; i < N; i++)
for (j = 0; j < M; j++)
a[i][j] = 0;
// Collapsing loops
int *p = &a[0][0];
for (i = 0; i < N*M; i++)
*p++ = 0;
This way we have fewer loop overhead per element.
Loop fusion
This technique can be applied when we have two loops that iterate over the same set (or, at least, they have a common subset). Knowing that, we can merge both loops and have half of the previous loop overhead. Care must be taken to avoid data dependency problems. This is usually not done by the compiler either.
// We have 2 arrays, a and b, of size N. We want to increment all the elements of both arrays.
for(i=0; i < N; i++)
a[i]++;
for(i=0; i < N; i++)
b[i]++;
// Fusing the loops
for(i=0; i < N; i++)
{
a[i]++;
b[i]++;
}
Reversing order in nested loops
If we have nested loops, and use them as index to access memory, for example, to access a bidimensional array, we must take advantage of the memory hierarchy. Sometimes, just reversing the nested loops order, a big speedup can be achieved. It's important to know your language, in particular how your language stores multidimensional arrays. C stores them by rows, but there are some languages that store them in the inverse way.
I have done some tests on my machine. Here are the two codes:
// Not cache-friendly for(int i=0; i < N; i++) for(int j=0; j < N; j++) a[j][i] = 1; // Cache friendly for(int i=0; i < N; i++) for(int j=0; j < N; j++) a[i][j] = 1;
First I tested with an array of 10000 x 10000 int elements. The speed-up obtained by the cache friendly version was 5.46x.
Then I tested with an array of 2¹³ int elements each dimension. In this case, the obtained speed-up was 15.14x. This is caused because the cache friendly code continues exploiting spatial locality while the cache unaware code not only does not exploit it but it's using only a subset of cache lines, because every iteration in the inner loop is accessing memory addresses with the same lower bits, which are used to know where does that memory go in the cache.
Condition evaluation order
When you have more than one condition ORed or ANDed, the compiler will short-circuit it. That is, if the conditions are ORed and the first one is true, it will not check the others, as the result will be true anyway. If the conditions are ANDed and the first one is false, it won't check the others either, as the result will always be false. So, if you have one condition much slower than the others, put it the last one, as you will avoid evaluating it some times. If you know that a condition is going to cause more short-circuits than the other ones, put it the first one so you will avoid evaluating the other conditions more frequently.
Other techniques
These are the most common basic techniques. You can usually achieve greater speed-ups from here with parallellization, using techniques as vectorization or threading, which we will talk about in future posts.
I usually had less perfomance when doing loop fusion that without it, this can happen because of the memory hierarchy.
@Vinicius That can certainly happen, specially if the loop body is big or it calls other functions which fill the instruction cache. Also, according to Amdahl’s law, if the time spent in the “loop control” is small compared to the loop body, you will not get a big benefit from it.
But, as is always difficult to predict the speed-up that will be obtained from an optimization, it’s very important to take good performance mesures before and after the optimization.
Thanks for your comment, Vinicius
If you have no idea how many elements you have in your array, you can still unroll it with Duff’s Device: http://en.wikipedia.org/wiki/Duff’s_device.
int count =(N+7)/8;
int i=0;
switch(N%8){
case 0: do{ sum += a[i++];
case 7: sum += a[i++];
case 6: sum += a[i++];
case 5: sum += a[i++];
case 4: sum += a[i++];
case 3: sum += a[i++];
case 2: sum += a[i++];
case 1: sum += a[i++];
}while(–count>0);
}
this accomplishes the spirit of “you have to do the remaing operations, non unrolled, at the end of the unrolled loop” without much code duplication.
@Jake Thanks for your contribution, I didn’t know about that device, it’s interesting.
While I certainly agree that speed is very important, I think readability and simplicity are also absolutely crucial for future understanding and enhancement of code. For example, your accumulator example:
sum = sum2 = 0;
for(i=0; i < N; i+=2)
{
sum += a[i];
sum2 += a[i+1];
}
sum += sum2;
This works for an even number of elements, but would require more code to handle an odd number of elements. It’s a trivial addition, but it adds even more ugly code to a simple task which is much more readable and less “screw-up-able” the traditional way. In most cases, a simple loop is ridiculously fast, regardless of the number of iterations, and we’re talking about a few milli-seconds difference between iterating n times or n/2 times.
Unless you’re dealing with enormous data sets, or if you truly need every millisecond you can get, my opinion is that code readability/maintainability wins out over a tiny performance gain.
@Joe Enos I totally agree with you, as I tell in previous post http://www.unknownprogrammer.com/2009/turning-it-faster-observe/ . You first need to know if it’s a problem the time being spent in that loop, and once with a possible optimization, “Decide whether the optimization is worth the complexity it introduces to the code.”.
About that only working for an even number of elements, I already comment that “You have to be sure that the number of iterations is multiple of your unrolling degree”.
In almost all of the code readibility is more important than a tiny performance gain, but sometimes that performance gain is not so tiny, or you need as performance as possible in some small fraction of code.
Thanks for your comment, Joe
Theoretically speaking
int t = N*M;
for (i = 0; i < t; i++)
would be faster than
for (i = 0; i < N*M; i++)
But again, i guess the compiler will do it for u anyway.
@Amarghosh: If N and M are constants, and in this example they are, your compiler will calculate the product at compile time. Even if the product was of variables, if they are not modified inside the loop, the compiler should be able of detect that and avoid doing the multiplication at each iteration.
But one must actually be aware that, if instead of a product, the loop condition involves checking against a function call, it may be called at every iteration.
Try, for example, to write some code to convert a string to uppercase. Do it with a loop over every char of the string, and compare, at the loop condition, the string index with the strlen function. Try calling it with large strings and measure its time. You’ll have a quadratic function. If, instead of that, you store the string length before the loop, the algorithm will be linear, and so much faster.
Thanks for your comment
I guess we will have to add code to check for ArrayIndexOutOfBoundsException too, when we use Accumulators or unroll the loops.
@satish: From the post: “You have to be sure that the number of iterations is multiple of your unrolling degree; otherwise you have to do the remaining operations, non unrolled, at the end of the unrolled loop.”
If you do that, you will not go out of bounds, as long as your original code was correct.