Deficiencies of GCC's optimizer

This page lists places where GCC's code generation is suboptimal. Although the examples are small, the problems are usually quite deep.

Note: unless otherwise specified, all examples have been compiled with the current CVS tree as of the date of the example, on x86, with -O2 -fomit-frame-pointer -fschedule-insns. (The x86 back end disables -fschedule-insns; that decision should be revisited, because in my experience the code is consistently better when I turn it back on.)

Contents:

  1. Inverting conditionals
  2. Failure of common subexpression elimination
  3. Store merging
  4. Global CSE and hard registers
  5. Volatile inhibits too many optimizations
  6. Unnecessary changes of rounding mode
  7. Register shuffling and long long
  8. Moving floating point through integer registers
  9. Failure to hoist loads out of loops
  10. Suboptimal code for complex conditionals
  11. Strange side effects of scheduling

Inverting conditionals

(14 Jan 2000) Frequently GCC produces better code if you write a conditional one way than if you write it the opposite way. Here is a simple example.

static const unsigned char
trigraph_map[] = {
  '|', 0, 0, 0, 0, 0, '^',
  '[', ']', 0, 0, 0, '~',
  0, '\\', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, '{', '#', '}'
};

unsigned char
map1 (c)
     unsigned char c;
{
  if (c >= '!' && c <= '>')
    return trigraph_map[c - '!'];
  return 0;
}

unsigned char
map2 (c)
     unsigned char c;
{
  if (c < '!' || c > '>')
    return 0;
  return trigraph_map[c - '!'];
}

Assembly output for map1 and map2 is, surprisingly, different:

map1:
	movb	4(%esp), %cl
	xorl	%eax, %eax
	movb	%cl, %dl
	subb	$33, %dl
	cmpb	$29, %dl
	ja	.L4
	movzbl	%cl, %eax
	movzbl	trigraph_map-33(%eax), %eax
.L4:
	ret

map2:
	movb	4(%esp), %cl
	xorl	%eax, %eax
	movb	%cl, %dl
	subb	$33, %dl
	cmpb	$29, %dl
	ja	.L7
	movzbl	%cl, %eax
	movzbl	trigraph_map-33(%eax), %eax
	ret
	.p2align 4,,7
.L7:
	ret

Admittedly, the difference is small - a redundant 'ret' instruction and a padding directive, and six bytes wasted in the object file. The problem is worse for larger blocks of conditional code, though.


Failure of common subexpression elimination

(14 Jan 2000) The same code also illustrates a failing in CSE. Once again, the source is

unsigned char
map1 (c)
     unsigned char c;
{
  if (c >= '!' && c <= '>')
    return trigraph_map[c - '!'];
  return 0;
}

and the assembly is

map1:
	movb	4(%esp), %cl
	xorl	%eax, %eax
	movb	%cl, %dl
	subb	$33, %dl
	cmpb	$29, %dl
	ja	.L4
	movzbl	%cl, %eax
	movzbl	trigraph_map-33(%eax), %eax
.L4:
	ret

If we were writing this code by hand, we would do it thus:

map1:
	movb	4(%esp), %cl
	xorl	%eax, %eax
	subb	$33, %cl
	cmpb	$29, %cl
	ja	.L4
	movzbl	%cl, %eax
	movzbl	trigraph_map(%eax), %eax
.L4:
	ret

This does not save a runtime subtract - trigraph_map-33 happens at load time. It does, however, save a register, which would be important if this function were to be inlined. It also puts the 'ret' instruction at the alignment the processor likes for jump targets, which is important because we happen to know that the jump will almost always be taken.

Some marginally more detailed analysis: Local CSE can't help because the two subtracts are in different basic blocks. Global CSE does not merge the subtracts because they appear to occur in different modes. We have RTL like so:

(insn 13 7 14 (parallel[
	    (set (reg:QI 27)
		(plus:QI (reg/v:QI 25)
		    (const_int -33 [0xffffffdf])))
	    (clobber (reg:CC 17 flags))
	] ) 183 {*addqi_1} (nil)
    (nil))

;; ...

(insn 17 44 19 (parallel[
	    (set (reg:SI 29)
		(zero_extend:SI (reg/v:QI 25)))
	    (clobber (reg:CC 17 flags))
	] ) 106 {*zero_extendqisi2_movzbw_and} (nil)
    (nil))

(insn 19 17 21 (parallel[
	    (set (reg:SI 30)
		(plus:SI (reg:SI 29)
		    (const_int -33 [0xffffffdf])))
	    (clobber (reg:CC 17 flags))
	] ) 174 {*addsi_1} (nil)
    (nil))

I suspect this is conservatism on the part of the optimizer. In general, doing the zero_extend first and then the subtract could give a different result than doing them the other way around. However, we know that cannot happen here, because control never reaches insn 17 unless (reg:QI 25) is at least 33, so the QImode subtract cannot wrap around.
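
Here is a source-level sketch of the equivalence GCSE fails to exploit (the function names are mine, for illustration only). Once the range check has passed, widening c and then subtracting gives the same value as subtracting in QImode and widening the result, so the QImode subtract from insn 13 could feed insns 17 and 19 directly:

/* These two functions return the same value whenever c >= 33, which is
   the only region in which insn 17 is reachable.  */
unsigned int
widen_then_sub (unsigned char c)
{
  return (unsigned int) c - 33;
}

unsigned int
sub_then_widen (unsigned char c)
{
  return (unsigned int) (unsigned char) (c - 33);
}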


Store merging

(14 Jan 2000) GCC frequently generates multiple narrow writes to adjacent memory locations. Memory writes are expensive; it would be better if they were combined. For example:

#include <string.h>

struct rtx_def
{
  unsigned short code;
  int mode : 8;
  unsigned int jump : 1;
  unsigned int call : 1;
  unsigned int unchanging : 1;
  unsigned int volatil : 1;
  unsigned int in_struct : 1;
  unsigned int used : 1;
  unsigned integrated : 1;
  unsigned frame_related : 1;
};

void
i1(struct rtx_def *d)
{
  memset((char *)d, 0, sizeof(struct rtx_def));
  d->code = 12;
  d->mode = 23;
}

void
i2(struct rtx_def *d)
{
  d->code = 12;
  d->mode = 23;

  d->jump = d->call = d->unchanging = d->volatil
    = d->in_struct = d->used = d->integrated = d->frame_related = 0;
}

compiles to (I have converted the constants to hexadecimal to make the situation clearer):

i1:
	movl	4(%esp), %eax
	movl	$0x0, (%eax)
	movb	$0x17, 2(%eax)
	movw	$0x0c, (%eax)
	ret

i2:
	movl	4(%esp), %eax
	movb	$0x0, 3(%eax)
	movw	$0x0c, (%eax)
	movb	$0x17, 2(%eax)
	ret

Both versions ought to compile to

i3:
	movl	4(%esp), %eax
	movl	$0x17000c, (%eax)
	ret

Other architectures have to do this optimization, so GCC is capable of it; it simply needs to be taught that merging stores is a win on this architecture too. It might be nice if it did the same thing for a more general function, where the values assigned to 'code' and 'mode' are not constants, but the advantage is less obvious there.
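
For reference, here is how the merged constant in i3 is put together on this little-endian target (a sketch; the helper function is mine):

/* 'code' = 12 occupies the low 16 bits, 'mode' = 23 the next 8 bits,
   and all of the one-bit flags are zero, so a single 32-bit store
   covers the whole structure.  */
unsigned int
merged_word (void)
{
  return (23u << 16) | 12u;	/* == 0x17000c */
}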


Global CSE and hard registers

(16 Jan 2000) Global CSE is not capable of operating on hard registers. This causes it to miss obvious optimizations. For example, consider this C++ fragment:

struct A
{
  A (int);
};

struct B : virtual public A
{
  B ();
};

B::B ()
  : A (3)
{
}

This compiles as follows (exception handling labels edited out for clarity):

__1Bi:
	subl	$24, %esp
	pushl	%ebx
	movl	36(%esp), %edx
	movl	32(%esp), %ebx
	testl	%edx, %edx
	je	.L3
	leal	4(%ebx), %eax
	movl	%eax, (%ebx)
.L3:
	testl	%edx, %edx
	je	.L4
	subl	$8, %esp
	leal	4(%ebx), %eax
	pushl	$3
	pushl	%eax
	call	__1Ai
	addl	$16, %esp
.L4:
	movl	%ebx, %eax
	popl	%ebx
	addl	$24, %esp
	ret

Notice how the test of %edx and the leal that computes %eax both occur twice. We would like code more like this to be generated:

__1Bi:
	subl	$24, %esp
	pushl	%ebx
	movl	36(%esp), %edx
	movl	32(%esp), %ebx
	testl	%edx, %edx
	je	.L4
	leal	4(%ebx), %eax
	movl	%eax, (%ebx)
	subl	$8, %esp
	pushl	$3
	pushl	%eax
	call	__1Ai
	addl	$16, %esp
.L4:
	movl	%ebx, %eax
	popl	%ebx
	addl	$24, %esp
	ret

This is also a decent example of stack space wastage. On x86, GCC tries to keep the stack 16-byte aligned at every call instruction, and it tries to align doubles on the stack as well. However, none of the variables in this function need more than 4-byte alignment, and there is no reason to reserve 24 extra bytes of stack to maintain alignment in the middle of this function. All the same constraints are satisfied by this version:

__1Bi:
	pushl	%ebx
	movl	12(%esp), %edx
	movl	8(%esp), %ebx
	testl	%edx, %edx
	je	.L4
	leal	4(%ebx), %eax
	movl	%eax, (%ebx)
	pushl	$3
	pushl	%eax
	call	__1Ai
	addl	$8, %esp
.L4:
	movl	%ebx, %eax
	popl	%ebx
	ret

Only part of the problem is with alignment. The other part is that stack slots are frequently allocated for variables that wound up in registers.


Volatile inhibits too many optimizations

(17 Jan 2000) gcc refuses to perform in-memory operations on volatile variables, on architectures that have those operations. Compare:

extern int a;
extern volatile int b;

void inca(void) { a++; }

void incb(void) { b++; }

compiles to:

inca:
	incl	a
	ret

incb:
	movl	b, %eax
	incl	%eax
	movl	%eax, b
	ret

Note that this is a policy decision. Changing the behavior is trivial - permit general_operand to accept volatile variables. To date the GCC team has chosen not to do so.

The C standard is maddeningly vague about what constitutes an access to a volatile variable. It happens that on x86 the two functions above have identical semantics: either way, b is read exactly once and written exactly once. On other platforms that have in-memory operations, that may not be the case, and the C standard may take issue with the difference - we aren't sure.


Unnecessary changes of rounding mode

(17 Jan 2000) gcc does not remember the state of the floating point control register, so it changes it more than necessary. Consider the following:

void
d2i2(const double a, const double b, int * const i, int * const j)
{
	*i = a;
	*j = b;
}

This performs two conversions from 'double' to 'int'. The example compiles as follows:

d2i2:
	subl	$24, %esp
	pushl	%ebx
	movl	48(%esp), %edx
	movl	52(%esp), %ecx
	fldl	32(%esp)
	fldl	40(%esp)
	fxch	%st(1)
	fnstcw	12(%esp)
	movl	12(%esp), %ebx
	movb	$12, 13(%esp)
	fldcw	12(%esp)
	movl	%ebx, 12(%esp)
	fistpl	8(%esp)
	fldcw	12(%esp)
	movl	8(%esp), %eax
	movl	%eax, (%edx)
	fnstcw	12(%esp)
	movl	12(%esp), %edx
	movb	$12, 13(%esp)
	fldcw	12(%esp)
	movl	%edx, 12(%esp)
	fistpl	8(%esp)
	fldcw	12(%esp)
	movl	8(%esp), %eax
	movl	%eax, (%ecx)
	popl	%ebx
	addl	$24, %esp
	ret

For those who are unfamiliar with the, um, unique design of the x86 floating point unit, it has an eight-slot stack and each entry holds a value in an extended format. Values can be moved between top-of-stack and memory, but cannot be moved between top-of-stack and the integer registers. The control word, which is a separate value, cannot be moved to or from the integer registers either.

On x86, converting a 'double' to 'int' requires setting the rounding-control bits of the control word to round-toward-zero, which is not their default state, because C demands truncation. In the code above, you can clearly see that the control word is saved, changed, and restored around each individual conversion. It would be perfectly possible to do that only once, thus:

d2i2:
	subl	$24, %esp
	pushl	%ebx
	movl	48(%esp), %edx
	movl	52(%esp), %ecx
	fldl	32(%esp)
	fldl	40(%esp)
	fxch	%st(1)
	fnstcw	12(%esp)
	movl	12(%esp), %ebx
	movb	$12, 13(%esp)
	fldcw	12(%esp)
	movl	%ebx, 12(%esp)
	fistpl	8(%esp)
	movl	8(%esp), %eax
	movl	%eax, (%edx)
	fistpl	8(%esp)
	fldcw	12(%esp)
	movl	8(%esp), %eax
	movl	%eax, (%ecx)
	popl	%ebx
	addl	$24, %esp
	ret

Other obvious improvements in this code include storing directly from the floating-point stack to the target addresses, and reordering the loads to avoid the 'fxch' instruction. You can't reorder the stores in C because 'i' and 'j' might point at the same location.

d2i2:
	subl	$24, %esp
	pushl	%ebx
	movl	48(%esp), %edx
	movl	52(%esp), %ecx
	fldl	40(%esp)
	fldl	32(%esp)
	fnstcw	12(%esp)
	movl	12(%esp), %ebx
	movb	$12, 13(%esp)
	fldcw	12(%esp)
	movl	%ebx, 12(%esp)
	fistpl	(%edx)
	fistpl	(%ecx)
	fldcw	12(%esp)
	popl	%ebx
	addl	$24, %esp
	ret

As usual, we can also reduce the amount of wasted stack space:

d2i2:
	pushl	%ebx
	movl	24(%esp), %edx
	movl	28(%esp), %ecx
	fldl	16(%esp)
	fldl	8(%esp)
	fnstcw	24(%esp)
	movl	24(%esp), %ebx
	movb	$12, 25(%esp)
	fldcw	24(%esp)
	fistpl	(%edx)
	fistpl	(%ecx)
	movl	%ebx, 24(%esp)
	fldcw	24(%esp)
	popl	%ebx
	ret

This version recycles the stack slot of one of the parameters as temporary storage for the control word.

The four versions of this routine occupy respectively 97, 72, 54, and 48 bytes of text. Version 2 will be dramatically faster than version 1; 3 will be somewhat faster than 2, and 4 will be about the same as 3, but will waste less memory.
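
For comparison, a programmer can hoist the mode change by hand at the source level. The following is only a sketch of the idea, using C99's <fenv.h> and lrint (which rounds in the current rounding mode); it is not what GCC should emit, and a strictly conforming program would also need the FENV_ACCESS pragma:

#include <fenv.h>
#include <math.h>

void
d2i2_hoisted (const double a, const double b, int * const i, int * const j)
{
	int old = fegetround ();
	fesetround (FE_TOWARDZERO);	/* one mode change for both conversions */
	*i = (int) lrint (a);		/* truncates, like a C cast */
	*j = (int) lrint (b);
	fesetround (old);		/* one restore */
}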


Register shuffling and long long

(22 Jan 2000) GCC has a number of problems doing 64-bit arithmetic on architectures with 32-bit words. This is one of the more obvious ones.

extern void big(long long u);
void doit(unsigned int a, unsigned int b, char *id)
{
  big(*id);
  big(a);
  big(b);
}

compiles to:

doit:
	subl	$20, %esp
	pushl	%esi
	pushl	%ebx
	movl	40(%esp), %ecx
	subl	$8, %esp
	movl	40(%esp), %ebx
	movl	44(%esp), %esi
	movsbl	(%ecx), %eax
	cltd
*	pushl	%edx
*	pushl	%eax
	call	big
	subl	$8, %esp
	xorl	%edx, %edx
	movl	%ebx, %eax
*	pushl	%edx
*	pushl	%eax
	call	big
	addl	$24, %esp
	xorl	%edx, %edx
	movl	%esi, %eax
*	pushl	%edx
*	pushl	%eax
	call	big
	addl	$16, %esp
	popl	%ebx
	popl	%esi
	addl	$20, %esp
	ret

Notice how the argument to big is always shuffled so that its high word is in %edx and its low word is in %eax, and then pushed from there (the starred instructions). This is because gcc cannot manipulate the two halves of a 'long long' separately. It should be able to generate code like this:

doit:
	subl	$20, %esp
	pushl	%esi
	pushl	%ebx
	movl	40(%esp), %ecx
	subl	$8, %esp
	movl	40(%esp), %ebx
	movl	44(%esp), %esi
	movsbl	(%ecx), %eax
	cltd
	pushl	%edx
	pushl	%eax
	call	big
	subl	$8, %esp
	xorl	%edx, %edx
	pushl	%edx
	pushl	%ebx
	call	big
	addl	$24, %esp
	xorl	%edx, %edx
	pushl	%edx
	pushl	%esi
	call	big
	addl	$16, %esp
	popl	%ebx
	popl	%esi
	addl	$20, %esp
	ret

Also, the choice to fetch all arguments from the stack at the very beginning is questionable. It might be better to use one callee-save register to hold zero and retrieve args from the stack when needed. This, with the usual tweaks to stack adjusts, makes the code much shorter.

doit:
	pushl	%ebx
	xorl	%ebx, %ebx
	movl	8(%esp), %ecx
	movsbl	(%ecx), %eax
	cltd
	pushl	%edx
	pushl	%eax
	call	big
	addl	$8, %esp
	movl	12(%esp), %eax
	pushl	%ebx
	pushl	%eax
	call	big
	addl	$8, %esp
	movl	16(%esp), %eax
	pushl	%ebx
	pushl	%eax
	call	big
	addl	$8, %esp
	popl	%ebx
	ret

Moving floating point through integer registers

(22 Jan 2000) GCC 2.96 on x86 knows how to move float quantities using integer instructions. This is normally a win because floating point moves take more cycles. However, it increases the pressure on the minuscule integer register file and therefore can end up making things worse.

void
fcpy(float *a, float *b, float *aa, float *bb, int n)
{
	int i;
	for(i = 0; i < n; i++) {
		aa[i]=a[i];
		bb[i]=b[i];
	}
}

I've compiled this three times and present the results side by side. Only the inner loop is shown.

  2.95 @ -O2		2.96 @ -O2		    2.96 @ -O2 -fomit-fp
  .L6:			.L6:			    .L6:
			movl  8(%ebp), %ebx
  flds	(%edi,%eax,4)	movl  (%ebx,%edx,4), %eax   movl  (%ebp,%edx,4), %eax
  fstps (%ebx,%eax,4)	movl  %eax, (%esi,%edx,4)   movl  %eax, (%esi,%edx,4)
			movl  20(%ebp), %ebx
  flds	(%esi,%eax,4)	movl  (%edi,%edx,4), %eax   movl  (%edi,%edx,4), %eax
  fstps (%ecx,%eax,4)	movl  %eax, (%ebx,%edx,4)   movl  %eax, (%ebx,%edx,4)
  incl	%eax		incl  %edx		    incl  %edx
  cmpl	%edx,%eax	cmpl  %ecx, %edx	    cmpl  %ecx, %edx
  jl	.L6		jl    .L6		    jl	  .L6

The loop requires seven registers: four base pointers, an index, a limit, and a scratch. All but the scratch must be integer. The x86 has only six integer registers under normal conditions. gcc 2.95 uses a float register for the scratch, so the loop just fits. 2.96 tries to use an integer register, and has to spill two pointers onto the stack to make everything fit. Adding -fomit-frame-pointer makes a seventh integer register available, and the loop fits again.

We see here numerous optimizer idiocies. First, it ought to recognize that a load - even from L1 cache - is more expensive than a floating point move, and go back to the FP registers. Second, instead of spilling the pointers, it should spill the limit register. The limit is only used once and the 'cmpl' instruction can take a memory operand. Third, the loop optimizer has failed to do anything at all. It should rewrite the code thus:

void
fcpy(float *a, float *b, float *aa, float *bb, int n)
{
	int i;
	for(i = n-1; i >= 0; i--) {
		*aa++ = *a++;
		*bb++ = *b++;
	}
}

which compiles to this inner loop:

.L6:
	movl	(%esi), %eax
	addl	$4, %esi
	movl	%eax, (%ecx)
	addl	$4, %ecx
	movl	(%ebx), %eax
	addl	$4, %ebx
	movl	%eax, (%edx)
	addl	$4, %edx
	decl	%edi
	jns	.L6

Yes, more adds are necessary, but this loop is going to be bound by memory bandwidth anyway, and the rewrite gets rid of the limit register. Thus the loop fits in the integer registers again.

Interestingly, GCC does manage to make a transformation like that for the equivalent program in Fortran:

	subroutine fcpy (a, b, aa, bb, n)
	implicit none
	integer n, i
	real a(n), b(n), aa(n), bb(n)

	do i = 1, n
		aa(i) = a(i)
		bb(i) = b(i)
	end do
	end

which compiles to this inner loop:

.L6:
	movl	(%ecx), %eax
	movl	(%esi), %edx
	addl	$4, %ecx
	movl	%eax, (%ebx)
	addl	$4, %esi
	addl	$4, %ebx
	movl	%edx, (%edi)
	addl	$4, %edi
	decl	%ebp
	jns	.L6

That's still not as good as it could get, though. In Fortran (but not in C) the compiler is allowed to assume the arrays don't overlap, so it could treat it as if it had been written thus:

void
fcpy(float *a, float *b, float *aa, float *bb, int n)
{
	int i;
	for(i = n-1; i >= 0; i--) {
		aa[i] = a[i];
		bb[i] = b[i];
	}
}

which compiles to:

.L6:
	movl	(%edi,%edx,4), %eax
	movl	%eax, (%ebx,%edx,4)
	movl	(%esi,%edx,4), %eax
	movl	%eax, (%ecx,%edx,4)
	decl	%edx
	jns	.L6

That transformation is also allowed in C if all four pointers are qualified with restrict.
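
A sketch of what that looks like, using the C99 restrict qualifier to promise the compiler that the four arrays do not overlap (the _r suffix is mine, just to distinguish the function here):

void
fcpy_r (float *restrict a, float *restrict b,
	float *restrict aa, float *restrict bb, int n)
{
	int i;
	for(i = 0; i < n; i++) {
		aa[i] = a[i];
		bb[i] = b[i];
	}
}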

Then there's the question of loop unrolling, loop splitting, and so on, but high-level transformations like those are outside the scope of this document.


Failure to hoist loads out of loops

(13 Feb 2000) If presented with even slightly complicated looping code, GCC may fail to extract all the invariants from the loops, even when there are plenty of registers.

Consider the following code, which is a trimmed down version of a real function that does something sensible.

#include <stdlib.h>
#include <string.h>

unsigned char *
read_and_prescan (ip, len, speccase)
     unsigned char *ip;
     unsigned int len;
     unsigned char *speccase;
{
  unsigned char *buf = malloc (len);
  unsigned char *input_buffer = malloc (4096);
  unsigned char *ibase, *op;
  int deferred_newlines;

  op = buf;
  ibase = input_buffer + 2;
  deferred_newlines = 0;

  for (;;)
    {
      unsigned int span = 0;

      if (deferred_newlines)
	{
	  while (speccase[ip[span]] == 0
		 && ip[span] != '\n'
		 && ip[span] != '\t'
		 && ip[span] != ' ')
	    span++;
	  memcpy (op, ip, span);
	  op += span;
	  ip += span;
	  if (speccase[ip[0]] == 0)
	    while (deferred_newlines)
	      deferred_newlines--, *op++ = '\r';
	  span = 0;
	}

      while (speccase[ip[span]] == 0) span++;
      memcpy (op, ip, span);
      op += span;
      ip += span;
      if (*ip == '\0')
	break;
    }
  return buf;
}

We're going to look exclusively at the code generated for the three innermost loops. This one is the most important:

while (speccase[ip[span]] == 0) span++;

which is compiled to

.L12:
        xorl    %esi, %esi
.L6:
        movzbl  (%esi,%ebx), %eax
        movl    16(%ebp), %edx
        cmpb    $0, (%eax,%edx)
        jne     .L19
        .p2align 4
.L20:
        incl    %esi
*       movl    16(%ebp), %edx
        movzbl  (%esi,%ebx), %eax
        cmpb    $0, (%eax,%edx)
        je      .L20
.L19:

To start, look at the line marked with a star. There is no way to reach label .L20 except by going through the block starting at .L6. Register %edx is not modified inside the loop, and neither is the stack slot it's being loaded from. So why wasn't that load deleted?

Then there's the matter of the entire loop test being duplicated. When the body of the loop is large, that's a good move, but here it doubles the size of the code. The loop optimizer should have the brains to start the counter at -1, and emit instead

.L12:
        movl    $-1, %esi
        movl    16(%ebp), %edx
        .p2align 4
.L20:
        incl    %esi
        movzbl  (%esi,%ebx), %eax
        cmpb    $0, (%eax,%edx)
        je      .L20
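
At the source level, the suggested rewrite corresponds to something like this sketch (the wraparound of the unsigned counter is well defined; the function wrapper is mine):

static unsigned int
scan_span (const unsigned char *ip, const unsigned char *speccase)
{
  unsigned int span = (unsigned int) -1;
  do
    span++;
  while (speccase[ip[span]] == 0);
  return span;
}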

The next loop is

while (deferred_newlines)
      deferred_newlines--, *op++ = '\r';

This compiles to

        movl    -20(%ebp), %ecx
        testl   %ecx, %ecx
        je      .L12
        .p2align 4
.L15:
        movb    $13, (%edi)
        incl    %edi
*       decl    -20(%ebp)
        jne     .L15

This is the same problem, but with a value that is modified inside the loop. GCC loaded -20(%ebp) into a register for the initial test, then forgot about it and did a read-modify-write on the memory location every iteration, which is horrible.

Finally, the topmost loop:

  while (speccase[ip[span]] == 0
	 && ip[span] != '\n'
	 && ip[span] != '\t'
	 && ip[span] != ' ')
    span++;

compiles to

        movl    $0, %esi
        movzbl  (%ebx), %eax
        movl    16(%ebp), %edx
        cmpb    $0, (%eax,%edx)
        jne     .L8
        movb    (%ebx), %al
        jmp     .L22
        .p2align 4,,7
.L9:
        incl    %esi
*       movl    16(%ebp), %edx
        movzbl  (%esi,%ebx), %eax
        cmpb    $0, (%eax,%edx)
        jne     .L8
*       movb    (%esi,%ebx), %al
.L22:
        cmpb    $10, %al
        je      .L8
        cmpb    $9, %al
        je      .L8
        cmpb    $32, %al
        jne     .L9
.L8:

Exact same problem: the pointer is fetched from the stack on every trip through the loop, even though the register it is loaded into is never used for anything else. Also, the value we need in %al for the comparison sequence starting at .L22 is already there, but we fetch it again anyway (the second starred instruction). And we've got an odd split loop test, with half of it duplicated and half not.

If you look at the source code carefully, you might notice another oddity: deferred_newlines is set to zero before the outer loop, and never modified again except inside an if block that will only be executed if it's nonzero. Therefore, that if block is dead code, and should have been deleted.
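
Deleting that block would leave something like this sketch (the function name is mine, and I have also dropped the unused input_buffer and ibase):

unsigned char *
read_and_prescan_reduced (unsigned char *ip, unsigned int len,
			  unsigned char *speccase)
{
  unsigned char *buf = malloc (len);
  unsigned char *op = buf;

  for (;;)
    {
      unsigned int span = 0;
      while (speccase[ip[span]] == 0) span++;
      memcpy (op, ip, span);
      op += span;
      ip += span;
      if (*ip == '\0')
	break;
    }
  return buf;
}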


Suboptimal code for complex conditionals

(26 Feb 2000) gcc is a lot less clever about compound conditionals than it could be.

Consider:

int and(int a, int b) { return (a && b); }
int or (int a, int b) { return (a || b); }

With the usual optimization options, gcc produces this code for these functions:

and:
	movl	4(%esp), %eax
	movl	$0, %edx
	testl	%eax, %eax
	movl	$0, %eax
	je	.L3
	movl	8(%esp), %edx
	testl	%edx, %edx
	setne	%al
	movl	%eax, %edx
.L3:
	movl	%edx, %eax
	ret

or:
	movl	4(%esp), %eax
	testl	%eax, %eax
	movl	$0, %eax
	jne	.L6
	movl	8(%esp), %edx
	testl	%edx, %edx
	je	.L5
.L6:
	movl	$1, %eax
.L5:
	ret

That's not too bad, although we do have some pointless register shuffling in the "and" function. (But note the register clearing with movl. See below for discussion.)

However, it would be really nice if gcc would recognize that conditional branches are more expensive than evaluating both sides of the test. In fact, gcc does know how to do that - but it doesn't, because it thinks branches are dirt cheap. We can correct that with the -mbranch-cost switch. Here's what you get when you compile the above with -mbranch-cost=2 in addition to the normal switches:

and:
	movl	4(%esp), %edx
	movl	$0, %eax
	movl	8(%esp), %ecx
	testl	%edx, %edx
	movl	$0, %edx
	setne	%dl
	testl	%ecx, %ecx
	setne	%al
	andl	%eax, %edx
	movl	%edx, %eax
	ret

or:
	movl	4(%esp), %edx
	movl	$0, %eax
	movl	8(%esp), %ecx
	testl	%edx, %edx
	movl	$0, %edx
	setne	%dl
	testl	%ecx, %ecx
	setne	%al
	orl	%eax, %edx
	movl	%edx, %eax
	ret

Yay - no branches! But this code is decidedly suboptimal. There's far too much register shuffling, for one thing - note the final and/mov and or/mov combinations. For another, in the 'or' case the two operands could be combined and tested with a single instruction. And finally, registers are being cleared with 'mov' and then written with 'set', which can cause partial register stalls.

Better code for this example would look something like:

and:
	movl	4(%esp), %edx
	movl	8(%esp), %ecx
	xorl	%eax, %eax
	testl	%edx, %edx
	setne	%al
	xorl	%edx, %edx
	testl	%ecx, %ecx
	setne	%dl
	andl	%edx, %eax
	ret

or:
	movl	4(%esp), %edx
	movl	8(%esp), %ecx
	xorl	%eax, %eax
	orl	%ecx, %edx
	setne	%al
	ret

The 'and' case genuinely needs two tests, since (a && b) requires each operand to be nonzero individually. The 'or' case uses the rule that (a || b) is true exactly when the bitwise or (a | b) is nonzero, so a single test suffices.

The important scheduling considerations, for PPro/PII anyway, are to separate loads from uses by at least one instruction (assuming top of stack will be in L1 cache - pretty safe) and to use xor/set so as not to provoke partial register stalls.


Strange side effects of scheduling

Remember the register-clearing with 'mov' in the previous example? That's the fault of the scheduler. The scheduler? Yes indeed. Recall what the code looked like compiled with my normal flags:

and:
	movl	4(%esp), %eax
	movl	$0, %edx
	testl	%eax, %eax
	movl	$0, %eax
	je	.L3
	movl	8(%esp), %edx
	testl	%edx, %edx
	setne	%al
	movl	%eax, %edx
.L3:
	movl	%edx, %eax
	ret

or:
	movl	4(%esp), %eax
	testl	%eax, %eax
	movl	$0, %eax
	jne	.L6
	movl	8(%esp), %edx
	testl	%edx, %edx
	je	.L5
.L6:
	movl	$1, %eax
.L5:
	ret

With -O2 -fomit-frame-pointer only, it comes out like this:

and:
	movl	4(%esp), %edx
	xorl	%eax, %eax
	testl	%edx, %edx
	je	.L3
	movl	8(%esp), %edx
	testl	%edx, %edx
	setne	%al
.L3:
	ret

or:
	movl	4(%esp), %edx
	xorl	%eax, %eax
	testl	%edx, %edx
	jne	.L6
	movl	8(%esp), %edx
	testl	%edx, %edx
	je	.L5
.L6:
	movl	$1, %eax
.L5:
	ret

which is obviously better. In fact, it eliminates the pointless register shuffling too.

With -O2 -fomit-frame-pointer -mbranch-cost=2, you get

and:
	movl	4(%esp), %ecx
	xorl	%edx, %edx
	testl	%ecx, %ecx
	movl	8(%esp), %ecx
	setne	%dl
	xorl	%eax, %eax
	testl	%ecx, %ecx
	setne	%al
	andl	%eax, %edx
	movl	%edx, %eax
	ret

or:
	movl	4(%esp), %ecx
	xorl	%edx, %edx
	testl	%ecx, %ecx
	movl	8(%esp), %ecx
	setne	%dl
	xorl	%eax, %eax
	testl	%ecx, %ecx
	setne	%al
	orl	%eax, %edx
	movl	%edx, %eax
	ret

which is better than the scheduled version, but still has the or/mov and and/mov sequences.