Turning your loops faster, II
This post is a continuation to “Turning your loops faster, I”. If you haven’t read it, you may want to before reading this one. Here we continue reviewing some loop optimizing techniques.
Loop unrolling
We saw on the previous post that there is an overhead we pay on each loop iteration. A simple idea, so, is to do in a single new, bigger iteration, the work that was done in many former iterations. This way, there are less iterations to do and less overhead per element.
Due to the fact that the most used current processors can execute instructions out-of-order, a bigger benefit can be gained from loop unrolling as long as there are no data dependencies among different iterations, as we can also avoid the pipeline from flushing in some points, and so not losing cycles. As many other optimitzations, this can be already performed by your compiler, based on some heuristics, but maybe their results are not the optimal ones. You may want to try this specially on loops with many iterations and few code instructions.
// Original loop
sum = 0;
for(i=0; i < N; i++)
sum += a[i];
// 2-Unrolled loop
sum = 0;
for(i=0; i < N; i+=2)
{
sum += a[i];
sum += a[i+1];
}
You have to be sure that the number of iterations is multiple of your unrolling degree; otherwise you have to do the remaining operations, non unrolled, at the end of the unrolled loop.
This was just a simple example, in this case the compiler would probably be already doing unrolling, and maybe in a greater unrolling degree.
Using accumulators
Even if we have data dependencies, sometimes we can slightly modify the algorithm to use different variables in a loop and so avoiding the processor from having to wait for the previous result when it really doesn't needs it.
// 2-Unrolled loop with accumulator
sum = sum2 = 0;
for(i=0; i < N; i+=2)
{
sum += a[i];
sum2 += a[i+1];
}
sum += sum2;
This way we do not have data dependencies inside the loop .Compilers usually do not apply this optimization.
Loop collapsing
Sometimes we have nested loops that can be joined into a single loop, usually with pointer arithmetic. This optimization is usually not done by compilers, and it can also enable other optimizations.
// We have a bidimensional array, and we want to set all of its elements to 0
int a[N][M];
// Usual way of setting all elements to 0
for (i = 0; i < N; i++)
for (j = 0; j < M; j++)
a[i][j] = 0;
// Collapsing loops
int *p = &a[0][0];
for (i = 0; i < N*M; i++)
*p++ = 0;
This way we have fewer loop overhead per element.
Loop fusion
This technique can be applied when we have two loops that iterate over the same set (or, at least, they have a common subset). Knowing that, we can merge both loops and have half of the previous loop overhead. Care must be taken to avoid data dependency problems. This is usually not done by the compiler either.
// We have 2 arrays, a and b, of size N. We want to increment all the elements of both arrays.
for(i=0; i < N; i++)
a[i]++;
for(i=0; i < N; i++)
b[i]++;
// Fusing the loops
for(i=0; i < N; i++)
{
a[i]++;
b[i]++;
}
Reversing order in nested loops
If we have nested loops, and use them as index to access memory, for example, to access a bidimensional array, we must take advantage of the memory hierarchy. Sometimes, just reversing the nested loops order, a big speedup can be achieved. It's important to know your language, in particular how your language stores multidimensional arrays. C stores them by rows, but there are some languages that store them in the inverse way.
I have done some tests on my machine. Here are the two codes:
// Not cache-friendly for(int i=0; i < N; i++) for(int j=0; j < N; j++) a[j][i] = 1; // Cache friendly for(int i=0; i < N; i++) for(int j=0; j < N; j++) a[i][j] = 1;
First I tested with an array of 10000 x 10000 int elements. The speed-up obtained by the cache friendly version was 5.46x.
Then I tested with an array of 2¹³ int elements each dimension. In this case, the obtained speed-up was 15.14x. This is caused because the cache friendly code continues exploiting spatial locality while the cache unaware code not only does not exploit it but it's using only a subset of cache lines, because every iteration in the inner loop is accessing memory addresses with the same lower bits, which are used to know where does that memory go in the cache.
Condition evaluation order
When you have more than one condition ORed or ANDed, the compiler will short-circuit it. That is, if the conditions are ORed and the first one is true, it will not check the others, as the result will be true anyway. If the conditions are ANDed and the first one is false, it won't check the others either, as the result will always be false. So, if you have one condition much slower than the others, put it the last one, as you will avoid evaluating it some times. If you know that a condition is going to cause more short-circuits than the other ones, put it the first one so you will avoid evaluating the other conditions more frequently.
Other techniques
These are the most common basic techniques. You can usually achieve greater speed-ups from here with parallellization, using techniques as vectorization or threading, which we will talk about in future posts.
