Archive

Archive for the ‘Architecture’ Category

Turning your loops faster, II

May 13th, 2009

This post is a continuation to “Turning your loops faster, I”. If you haven’t read it, you may want to before reading this one. Here we continue reviewing some loop optimizing techniques.

Loop unrolling

We saw on the previous post that there is an overhead we pay on each loop iteration. A simple idea, so, is to do in a single new, bigger iteration, the work that was done in many former iterations. This way, there are less iterations to do and less overhead per element.

Due to the fact that the most used current processors can execute instructions out-of-order, a bigger benefit can be gained from loop unrolling as long as there are no data dependencies among different iterations, as we can also avoid the pipeline from flushing in some points, and so not losing cycles. As many other optimitzations, this can be already performed by your compiler, based on some heuristics, but maybe their results are not the optimal ones. You may want to try this specially on loops with many iterations and few code instructions.

// Original loop
sum = 0;
for(i=0; i < N; i++)
    sum += a[i];

// 2-Unrolled loop
sum = 0;
for(i=0; i < N; i+=2)
{
    sum += a[i];
    sum += a[i+1];
}

You have to be sure that the number of iterations is multiple of your unrolling degree; otherwise you have to do the remaining operations, non unrolled, at the end of the unrolled loop.

This was just a simple example, in this case the compiler would probably be already doing unrolling, and maybe in a greater unrolling degree.

Using accumulators

Even if we have data dependencies, sometimes we can slightly modify the algorithm to use different variables in a loop and so avoiding the processor from having to wait for the previous result when it really doesn't needs it.

// 2-Unrolled loop with accumulator
sum = sum2 = 0;
for(i=0; i < N; i+=2)
{
    sum += a[i];
    sum2 += a[i+1];
}
sum += sum2;

This way we do not have data dependencies inside the loop .Compilers usually do not apply this optimization.

Loop collapsing

Sometimes we have nested loops that can be joined into a single loop, usually with pointer arithmetic. This optimization is usually not done by compilers, and it can also enable other optimizations.

// We have a bidimensional array, and we want to set all of its elements to 0
int a[N][M];

// Usual way of setting all elements to 0
for (i = 0; i < N; i++)
    for (j = 0; j < M; j++)
        a[i][j] = 0;

// Collapsing loops
int *p = &a[0][0];
for (i = 0; i < N*M; i++)
    *p++ = 0;

This way we have fewer loop overhead per element.

Loop fusion

This technique can be applied when we have two loops that iterate over the same set (or, at least, they have a common subset). Knowing that, we can merge both loops and have half of the previous loop overhead. Care must be taken to avoid data dependency problems. This is usually not done by the compiler either.

// We have 2 arrays, a and b, of size N. We want to increment all the elements of both arrays.
for(i=0; i < N; i++)
    a[i]++;
for(i=0; i < N; i++)
    b[i]++;

// Fusing the loops
for(i=0; i < N; i++)
{
    a[i]++;
    b[i]++;
}

Reversing order in nested loops

If we have nested loops, and use them as index to access memory, for example, to access a bidimensional array, we must take advantage of the memory hierarchy. Sometimes, just reversing the nested loops order, a big speedup can be achieved. It's important to know your language, in particular how your language stores multidimensional arrays. C stores them by rows, but there are some languages that store them in the inverse way.

I have done some tests on my machine. Here are the two codes:

// Not cache-friendly
for(int i=0; i < N; i++)
	for(int j=0; j < N; j++)
		a[j][i] = 1;

// Cache friendly
for(int i=0; i < N; i++)
	for(int j=0; j < N; j++)
		a[i][j] = 1;

First I tested with an array of 10000 x 10000 int elements. The speed-up obtained by the cache friendly version was 5.46x.

Then I tested with an array of 2¹³ int elements each dimension. In this case, the obtained speed-up was 15.14x. This is caused because the cache friendly code continues exploiting spatial locality while the cache unaware code not only does not exploit it but it's using only a subset of cache lines, because every iteration in the inner loop is accessing memory addresses with the same lower bits, which are used to know where does that memory go in the cache.

Condition evaluation order

When you have more than one condition ORed or ANDed, the compiler will short-circuit it. That is, if the conditions are ORed and the first one is true, it will not check the others, as the result will be true anyway. If the conditions are ANDed and the first one is false, it won't check the others either, as the result will always be false. So, if you have one condition much slower than the others, put it the last one, as you will avoid evaluating it some times. If you know that a condition is going to cause more short-circuits than the other ones, put it the first one so you will avoid evaluating the other conditions more frequently.

Other techniques

These are the most common basic techniques. You can usually achieve greater speed-ups from here with parallellization, using techniques as vectorization or threading, which we will talk about in future posts.

Architecture, Optimization , , ,

Turning your loops faster, I

May 6th, 2009

It’s very usual in our programs that most of its time is spent in loops. So, after having profiled and measured our application, if it is the case, we may want optimize some of our loops. Maybe they are already optimized, by us or by the compiler, but it’s important to know some techniques that may turn our loops faster.

Branch overhead

This is an important concept in order to understand many of the loop optimizations.

Current processors are pipelined. Not a little bit. A lot. An instruction is at many stages before ending its execution. Most of the instructions can be well adapted to a pipelined design without major efforts, but that’s not the case with branching instructions. Because of their nature, they (can) change the next instruction to be executed, from the implicit (the following) one to an explicit one, often depending on a value produced by an instruction which has not yet finished its execution when the branching instruction is fetched. So we have a problem, we have to wait, flush some (many) stages and lose some cycles, something that we don’t like, because it does our programs slower.

In fact, the reality is not so hard. Processor companies have, of course, seen this bottleneck in their products, and have had lots of engineers trying to solve them. The result is that different branching predictors have been designed, some of them are really effective, and what was a huge performance penalty is now a smaller one. Although it is now a smaller problem than it used to be, we still have to be aware of it, specially in loops, because loops imply, inherently, branching.

Inlining

Inlining means inserting the body of a function where it is called instead of just calling it. This can be done both by the compiler or by the programmer, and it’s not just a loop optimization, it can be applied anywhere else, but it’s in loops where we’ll get the bigger benefits of inlining our smaller functions.

Beware that:

  • Compilers, depending of the set options, already try to guess which functions would be good for inlining. You can want to use some tools to look which functions are actually inlined. Try to know what your compiler’s policy is in this aspect.
  • Function inlining can actually cause you performance loses in some cases, because it can imply a larger number of instructions and a worse explotation of the instruction cache.
  • If the compiler does not inlines some function, you try to manually inline it, and you in fact have a performance gain, think before leaving it there if the speed-up is worth having some code duplicated, which can lead to maintenability problems.

Another good thing about inlining is that it can let, in some situations, your compiler to do other kind of optimizations which it wasn’t able to do without inlining, because of a better knowledge of the code context.

Code Hoisting or Loop-invariant code motion

They key idea here is to move some code out of the loop, in order to avoid calculating it at every iteration, when its result will be the same during all the loop execution. Imagine the following code:

for (i = 0; i < n; i++)
{
  if(x)
    a[i] = y;
  else
    a[i] = y*y;
}

Here, a couple of optimizations can be made:

  • "x" does not change inside the loop, so we don't have to check that condition, and execute branching instructions, at every single iteration.
  • "y*y" does not change inside the loop either. So we can have it precomputed outside the loop and assign it.
if(x)
{
    for (i = 0; i < n; i++)
        a[i] = y;
}
else
{
    y2 = y*y;
    for (i = 0; i < n; i++)
        a[i] = y2;
}

The program's semantic is the same but the performance is greater in the second code. Again, be warned that your compiler might be already doing this optimization for you.

More techniques

There are many other techniques, and we will see some of them in the next post, so stay tuned!

EDIT: Second part already avalaible

Architecture, Optimization , , ,

A matter of representation

April 21st, 2009

There’s basically only one way unsigned integers are represented with bits: in binary, namely, a sum where each bit represents a power of 2, from 2^0=1 being the least significant bit (the rightmost bit) to 2^(n-1), the most significant or leftmost bit. The represented number is the addition of those powers whose bit is set to 1.

But things are not as easy as that when trying to represent also negative values.

Sign and magnitude

This is probably the easiest way to understand, for a human, of solving the problem. The approach works basically setting a bit, usually the most significant one, to represent the sign. A 0 on that bit means that it’s a positive number and a 1 means that it’s a negative number. All the other bits represent the magnitude or the absolute value.

Unfortunately, this system has the disadvantage of having 2 ways of representing 0 (100…000 and 000…000) and that hardware built to operate with it requires special logic to handle signs.

Excess-N

This one is another easy approach to the problem. It consists in setting an offset to every number represented by it of the minimum number you want to be representable, usually -(2^(n-1)). Thus, all bits set to 0 is the minimum representable value and 0 would be represented by a 1 in the most significant bit and 0 in all the other bits (100…000).

This is not used currently in signed integer representation but it’s used in the exponent representation of floating point numbers (more on that topic in another post).

One’s complement

Another representation is that known as “one’s complement”. To negate a number, you just have to apply the bitwise operation “not” to it. Positive numbers are those starting by 0 and negative numbers those starting by 1.

With this system we still have 2 ways of representing 0. It helps on addition because you don’t have to worry about the sign of the operands, and subtraction is very simple too as all you have to do is negate the second operand and do an addition. So, addition and subtraction are easier in one’s complement than in sign and magnitude, but multiplication is not easy enough, as you must take care of the signs.

Two’s complement

Two’s complement came to solve the problems with the other representations. There is just one way of representing 0 in it (all bits set to zero) and there aren’t the operating problems that there were with one’s complement.

To negate a number in two’s complement all you have to do is negate it bitwise and add 1. A great thing about two’s complement is that numbers represented in this way can be added, subtracted and multiplied without taking care of the operands’ signs.

It’s important to note that when expanding a number in two’s complement, new bits at the left must be the same as the previous most significant bit in order to preserve the value of the number. This is known as sign extension.

With the advantages we have just seen is no surprise that this is the representation used in most systems currently, so this is the system to know, if you are a programmer you should know how it works perfectly. Not only the basics exposed here but more deeply. Knowing how it works is essential to some things as bit hacks, which will be covered in some post to come.

To learn more: Two’s complement in the Wikipedia

Architecture , , , ,

Wheel of reincarnation

April 15th, 2009

The term “wheel of reincarnation” was first coined in a paper by T.H. Myer and I.E. Sutherland named “On the Design of Display Processors”, and refers to the cycle, completed many times in computer history, of migrating some kind of work to special designed hardware in order to gain speed and liberating the CPU for other tasks, then this hardware becoming of more and more general purpose and powerful, and finally integrating it again into the CPU to simplify the architecture.

Let’s have now a brief look at 3d graphic cards history.

The first 3D graphic cards started in the early and mid-90’s and were only rasterizers, leaving all other tasks in the CPU. Later, in the late-90’s, the Transform and Lighting stage went to the graphic card too, allowing for greater detail scenes and to do more work on the CPU.

In the early-2000’s, they did a great step towards generalization, with the appearance of shaders (small programs executed for every vertex, every fragment or , more recently, for every primitive to be rendered). Shaders had to be very simple at first, with a very limited number of instructions, with lack of branching and other limitations, but over the years those limitations have been relaxing or disappearing.

The graphic cards, at that point already named “GPUs”, were specialized in the processing of floating point numbers, and so programmers solving numeric problems of this kind started to use them, and GPGPU (General Purpose Computing on Graphical Processing Units) was an expanding field.

As more kind of programmers were attracted to use GPUs to solve their problems, so were GPU companies to offer them more attractive technologies, and special languages to allow generic computations on the GPU started to appear, such as NVidia’s CUDA or OpenCL.

Now, with Intel’s GPU project, Larrabee, which is based in x86 processors, the difference between the CPU and hardware processing graphics will be reduced again, and the wheel of reincarnation in the graphic cards field is one step nearer to be completed.

Will it be finally closed? I think so. When? We will probably have to wait many years before getting enough powerful CPU as to host the usual computations of high-demanding applications such as last generation games and at the same time process their graphics rendering, which will be for sure way more advanced and complex than what it is currently.

Architecture, General, Graphics , , , , , , ,