This page lists places where GCC's code generation is suboptimal. Although the examples are small, the problems are usually quite deep.
Note: unless otherwise specified, all examples have been compiled with the current CVS tree as of the date of the example, on x86, with -O2 -fomit-frame-pointer -fschedule-insns. (The x86 back end disables -fschedule-insns, which is something that should be revisited, because it always gives better code when I turn it back on.)
(14 Jan 2000) Frequently GCC produces better code if you write a conditional one way than if you write it the opposite way. Here is a simple example.
static const unsigned char trigraph_map[] = {
  '|', 0, 0, 0, 0, 0, '^', '[', ']', 0,
  0, 0, '~', 0, '\\', 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, '{', '#', '}'
};

unsigned char
map1 (c)
     unsigned char c;
{
  if (c >= '!' && c <= '>')
    return trigraph_map[c - '!'];
  return 0;
}

unsigned char
map2 (c)
     unsigned char c;
{
  if (c < '!' || c > '>')
    return 0;
  return trigraph_map[c - '!'];
}
Assembly output for map1 and map2 is, surprisingly, different:
map1:
        movb 4(%esp), %cl
        xorl %eax, %eax
        movb %cl, %dl
        subb $33, %dl
        cmpb $29, %dl
        ja .L4
        movzbl %cl, %eax
        movzbl trigraph_map-33(%eax), %eax
.L4:
        ret

map2:
        movb 4(%esp), %cl
        xorl %eax, %eax
        movb %cl, %dl
        subb $33, %dl
        cmpb $29, %dl
        ja .L7
        movzbl %cl, %eax
        movzbl trigraph_map-33(%eax), %eax
        ret
        .p2align 4,,7
.L7:
        ret
Admittedly, the difference is small - a redundant 'ret' instruction, a padding directive, and six bytes wasted in the object file. The problem is worse for larger blocks of conditional code, though.
(14 Jan 2000) The same code also illustrates a failing in CSE. Once again, the source is
unsigned char
map1 (c)
     unsigned char c;
{
  if (c >= '!' && c <= '>')
    return trigraph_map[c - '!'];
  return 0;
}
and the assembly is
map1:
        movb 4(%esp), %cl
        xorl %eax, %eax
        movb %cl, %dl
        subb $33, %dl
        cmpb $29, %dl
        ja .L4
        movzbl %cl, %eax
        movzbl trigraph_map-33(%eax), %eax
.L4:
        ret
If we were writing this code by hand, we would do it thus:
map1:
        movb 4(%esp), %cl
        xorl %eax, %eax
        subb $33, %cl
        cmpb $29, %cl
        ja .L4
        movzbl %cl, %eax
        movzbl trigraph_map(%eax), %eax
.L4:
        ret
This does not save a runtime subtract - trigraph_map-33
happens at load time. It does, however, save a register, which would
be important if this function were to be inlined. It also puts the
'ret'
instruction at the alignment the processor likes for
jump targets, which is important because we happen to know that the
jump will almost always be taken.
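At the source level, the hand-written version corresponds to something like the following sketch (the name map3 is invented for illustration; the point is that a single subtraction serves both the range check and the table index):

unsigned char
map3 (c)
     unsigned char c;
{
  unsigned char d = c - '!';    /* the one and only subtract */
  if (d <= '>' - '!')           /* same range check, on the adjusted value */
    return trigraph_map[d];     /* the adjusted value also indexes the table */
  return 0;
}

Ideally the compiler would discover this form on its own, starting from either map1 or map2.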
Some marginally more detailed analysis: Local CSE can't help because the two subtracts are in different basic blocks. Global CSE does not merge the subtracts because they appear to occur in different modes. We have RTL like so:
(insn 13 7 14 (parallel[
            (set (reg:QI 27)
                (plus:QI (reg/v:QI 25)
                    (const_int -33 [0xffffffdf])))
            (clobber (reg:CC 17 flags))
        ] ) 183 {*addqi_1} (nil)
    (nil))

;; ...

(insn 17 44 19 (parallel[
            (set (reg:SI 29)
                (zero_extend:SI (reg/v:QI 25)))
            (clobber (reg:CC 17 flags))
        ] ) 106 {*zero_extendqisi2_movzbw_and} (nil)
    (nil))

(insn 19 17 21 (parallel[
            (set (reg:SI 30)
                (plus:SI (reg:SI 29)
                    (const_int -33 [0xffffffdf])))
            (clobber (reg:CC 17 flags))
        ] ) 174 {*addsi_1} (nil)
    (nil))
I suspect that this is conservatism on the part of the optimizer. It might be that doing the zero_extend and then the subtract would have a different result than doing them the other way around. However, we know that this cannot be the case, because control will never reach insn 17 unless (reg:QI 25) is at least 33.
(14 Jan 2000) GCC frequently generates multiple narrow writes to adjacent memory locations. Memory writes are expensive; it would be better if they were combined. For example:
struct rtx_def
{
  unsigned short code;
  int mode : 8;
  unsigned int jump : 1;
  unsigned int call : 1;
  unsigned int unchanging : 1;
  unsigned int volatil : 1;
  unsigned int in_struct : 1;
  unsigned int used : 1;
  unsigned integrated : 1;
  unsigned frame_related : 1;
};

void i1(struct rtx_def *d)
{
  memset((char *)d, 0, sizeof(struct rtx_def));
  d->code = 12;
  d->mode = 23;
}

void i2(struct rtx_def *d)
{
  d->code = 12;
  d->mode = 23;
  d->jump = d->call = d->unchanging = d->volatil
    = d->in_struct = d->used = d->integrated = d->frame_related = 0;
}
compiles to (I have converted the constants to hexadecimal to make the situation clearer):
i1:
        movl 4(%esp), %eax
        movl $0x0, (%eax)
        movb $0x17, 2(%eax)
        movw $0x0c, (%eax)
        ret

i2:
        movl 4(%esp), %eax
        movb $0x0, 3(%eax)
        movw $0x0c, (%eax)
        movb $0x17, 2(%eax)
        ret
Both versions ought to compile to
i3:
        movl 4(%esp), %eax
        movl $0x17000c, (%eax)
        ret
Other architectures have to do this optimization, so GCC is
capable of it. GCC simply needs to be taught that it's a win on this
architecture too. It might be nice if it would do the same thing for
a more general function where the values assigned to
'code'
and 'mode'
were not constant, but the
advantage is less obvious here.
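For reference, the merged store can be written by hand in C. The following sketch (the name i3 matches the hand-written assembly above; it assumes the little-endian x86 layout, in which 'code' occupies the low 16 bits of the four-byte struct and 'mode' the next eight) hands the compiler a single aligned 32-bit write:

#include <string.h>

void i3(struct rtx_def *d)
{
  unsigned int word = 12u | (23u << 16);  /* 0x0017000c: code = 12, mode = 23, all flags 0 */
  memcpy(d, &word, sizeof word);          /* one 32-bit store covers the whole struct */
}

The point of this entry, of course, is that GCC should do this combination itself rather than requiring the source to be contorted.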
(16 Jan 2000) Global CSE is not capable of operating on hard registers. This causes it to miss obvious optimizations. For example, consider this C++ fragment:
struct A { A (int); };
struct B : virtual public A { B (); };

B::B () : A (3) { }
This compiles as follows (exception handling labels edited out for clarity):
__1Bi:
        subl $24, %esp
        pushl %ebx
        movl 36(%esp), %edx
        movl 32(%esp), %ebx
        testl %edx, %edx
        je .L3
        leal 4(%ebx), %eax
        movl %eax, (%ebx)
.L3:
        testl %edx, %edx
        je .L4
        subl $8, %esp
        leal 4(%ebx), %eax
        pushl $3
        pushl %eax
        call __1Ai
        addl $16, %esp
.L4:
        movl %ebx, %eax
        popl %ebx
        addl $24, %esp
        ret
Notice how the test of %edx
and the load of
%eax
both occur twice. We would like code more like this
to be generated:
__1Bi:
        subl $24, %esp
        pushl %ebx
        movl 36(%esp), %edx
        movl 32(%esp), %ebx
        testl %edx, %edx
        je .L4
        leal 4(%ebx), %eax
        movl %eax, (%ebx)
        subl $8, %esp
        pushl $3
        pushl %eax
        call __1Ai
        addl $16, %esp
.L4:
        movl %ebx, %eax
        popl %ebx
        addl $24, %esp
        ret
This is also a decent example of stack space wastage. The i386 architecture wants 16-byte stack alignment right before every call instruction, and we try to align doubles on the stack as well. However, none of the variables in this function need more than 4-byte alignment, and there's no reason to keep the stack pointer aligned in the middle of the function. All the same constraints are satisfied by this version:
__1Bi:
        pushl %ebx
        movl 12(%esp), %edx
        movl 8(%esp), %ebx
        testl %edx, %edx
        je .L4
        leal 4(%ebx), %eax
        movl %eax, (%ebx)
        pushl $3
        pushl %eax
        call __1Ai
        addl $8, %esp
.L4:
        movl %ebx, %eax
        popl %ebx
        ret
Only part of the problem is with alignment. The other part is that stack slots are frequently allocated for variables that wound up in registers.
(17 Jan 2000) gcc refuses to perform in-memory operations on volatile variables, on architectures that have those operations. Compare:
extern int a;
extern volatile int b;

void inca(void) { a++; }
void incb(void) { b++; }
compiles to:
inca:
        incl a
        ret

incb:
        movl b, %eax
        incl %eax
        movl %eax, b
        ret
Note that this is a policy decision. Changing the behavior is
trivial - permit general_operand
to accept volatile
variables. To date the GCC team has chosen not to do so.
The C standard is maddeningly ambiguous about the semantics of volatile variables. It happens that on x86 the two functions above have identical semantics. On other platforms that have in-memory operations, that may not be the case, and the C standard may take issue with the difference - we aren't sure.
(17 Jan 2000) gcc does not remember the state of the floating point control register, so it changes it more than necessary. Consider the following:
void
d2i2(const double a, const double b, int * const i, int * const j)
{
  *i = a;
  *j = b;
}
This performs two conversions from 'double' to 'int'. The example compiles as follows:
d2i2:
        subl $24, %esp
        pushl %ebx
        movl 48(%esp), %edx
        movl 52(%esp), %ecx
        fldl 32(%esp)
        fldl 40(%esp)
        fxch %st(1)
        fnstcw 12(%esp)
        movl 12(%esp), %ebx
        movb $12, 13(%esp)
        fldcw 12(%esp)
        movl %ebx, 12(%esp)
        fistpl 8(%esp)
        fldcw 12(%esp)
        movl 8(%esp), %eax
        movl %eax, (%edx)
        fnstcw 12(%esp)
        movl 12(%esp), %edx
        movb $12, 13(%esp)
        fldcw 12(%esp)
        movl %edx, 12(%esp)
        fistpl 8(%esp)
        fldcw 12(%esp)
        movl 8(%esp), %eax
        movl %eax, (%ecx)
        popl %ebx
        addl $24, %esp
        ret
For those who are unfamiliar with the, um, unique design of the x86 floating point unit, it has an eight-slot stack and each entry holds a value in an extended format. Values can be moved between top-of-stack and memory, but cannot be moved between top-of-stack and the integer registers. The control word, which is a separate value, cannot be moved to or from the integer registers either.
On x86, converting a 'double' to 'int', when 'double' is in 64-bit IEEE format, requires setting the control word to a nonstandard value. In the code above, you can clearly see that the control word is saved, changed, and restored around each individual conversion. It would be perfectly possible to do it only once, thus:
d2i2:
        subl $24, %esp
        pushl %ebx
        movl 48(%esp), %edx
        movl 52(%esp), %ecx
        fldl 32(%esp)
        fldl 40(%esp)
        fxch %st(1)
        fnstcw 12(%esp)
        movl 12(%esp), %ebx
        movb $12, 13(%esp)
        fldcw 12(%esp)
        movl %ebx, 12(%esp)
        fistpl 8(%esp)
        movl 8(%esp), %eax
        movl %eax, (%edx)
        fistpl 8(%esp)
        fldcw 12(%esp)
        movl 8(%esp), %eax
        movl %eax, (%ecx)
        popl %ebx
        addl $24, %esp
        ret
Other obvious improvements in this code include storing directly
from the floating-point stack to the target addresses, and reordering
the loads to avoid the 'fxch'
instruction. You can't
reorder the stores in C because 'i'
and 'j'
might point at the same location.
d2i2:
        subl $24, %esp
        pushl %ebx
        movl 48(%esp), %edx
        movl 52(%esp), %ecx
        fldl 40(%esp)
        fldl 32(%esp)
        fnstcw 12(%esp)
        movl 12(%esp), %ebx
        movb $12, 13(%esp)
        fldcw 12(%esp)
        movl %ebx, 12(%esp)
        fistpl (%edx)
        fistpl (%ecx)
        fldcw 12(%esp)
        popl %ebx
        addl $24, %esp
        ret
As usual, we can also reduce the amount of wasted stack space:
d2i2:
        pushl %ebx
        movl 24(%esp), %edx
        movl 28(%esp), %ecx
        fldl 16(%esp)
        fldl 8(%esp)
        fnstcw 24(%esp)
        movl 24(%esp), %ebx
        movb $12, 25(%esp)
        fldcw 24(%esp)
        fistpl (%edx)
        fistpl (%ecx)
        movl %ebx, 24(%esp)
        fldcw 24(%esp)
        popl %ebx
        ret
This version recycles the stack slot of one of the parameters as temporary storage for the control word.
The four versions of this routine occupy respectively 97, 72, 54, and 48 bytes of text. Version 2 will be dramatically faster than version 1; 3 will be somewhat faster than 2, and 4 will be about the same as 3, but will waste less memory.
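As an aside, the reason the control word has to be touched at all is that ISO C requires double-to-int conversion to truncate toward zero, while the x87's default rounding mode is round-to-nearest. A standalone illustration (not part of the example above):

#include <stdio.h>

int main(void)
{
  volatile double x = 2.7;        /* volatile keeps the conversion at run time */
  printf("%d\n", (int) x);        /* prints 2: the conversion truncates toward zero */
  printf("%.0f\n", x);            /* prints 3: default rounding to nearest */
  return 0;
}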
long long
(22 Jan 2000) GCC has a number of problems doing 64-bit arithmetic on architectures with 32-bit words. This is only one of the most obvious issues.
extern void big(long long u);

void doit(unsigned int a, unsigned int b, char *id)
{
  big(*id);
  big(a);
  big(b);
}
compiles to:
doit:
        subl $20, %esp
        pushl %esi
        pushl %ebx
        movl 40(%esp), %ecx
        subl $8, %esp
        movl 40(%esp), %ebx
        movl 44(%esp), %esi
        movsbl (%ecx), %eax
        cltd
*       pushl %edx
*       pushl %eax
        call big
        subl $8, %esp
        xorl %edx, %edx
        movl %ebx, %eax
*       pushl %edx
*       pushl %eax
        call big
        addl $24, %esp
        xorl %edx, %edx
        movl %esi, %eax
*       pushl %edx
*       pushl %eax
        call big
        addl $16, %esp
        popl %ebx
        popl %esi
        addl $20, %esp
        ret
Notice how the argument to big is invariably shuffled such that its high word is in %edx and its low word is in %eax, and then pushed (the starred lines). This is because gcc is incapable of manipulating the two halves separately. It should be able to generate code like this:
doit:
        subl $20, %esp
        pushl %esi
        pushl %ebx
        movl 40(%esp), %ecx
        subl $8, %esp
        movl 40(%esp), %ebx
        movl 44(%esp), %esi
        movsbl (%ecx), %eax
        cltd
        pushl %edx
        pushl %eax
        call big
        subl $8, %esp
        xorl %edx, %edx
        pushl %edx
        pushl %ebx
        call big
        addl $24, %esp
        xorl %edx, %edx
        pushl %edx
        pushl %esi
        call big
        addl $16, %esp
        popl %ebx
        popl %esi
        addl $20, %esp
        ret
Also, the choice to fetch all arguments from the stack at the very beginning is questionable. It might be better to use one callee-save register to hold zero and retrieve args from the stack when needed. This, with the usual tweaks to stack adjusts, makes the code much shorter.
doit:
        pushl %ebx
        xorl %ebx, %ebx
        movl 8(%esp), %ecx
        movsbl (%ecx), %eax
        cltd
        pushl %edx
        pushl %eax
        call big
        addl $8, %esp
        movl 12(%esp), %eax
        pushl %ebx
        pushl %eax
        call big
        addl $8, %esp
        movl 16(%esp), %eax
        pushl %ebx
        pushl %eax
        call big
        addl $8, %esp
        popl %ebx
        ret
(22 Jan 2000) GCC 2.96 on x86 knows how to move float
quantities using integer instructions. This is normally a win because
floating point moves take more cycles. However, it increases the
pressure on the minuscule integer register file and therefore can end
up making things worse.
void fcpy(float *a, float *b, float *aa, float *bb, int n)
{
  int i;
  for(i = 0; i < n; i++)
    {
      aa[i]=a[i];
      bb[i]=b[i];
    }
}
I've compiled this three times and present the results side by side. Only the inner loop is shown.
        2.95 @ -O2                      2.96 @ -O2                      2.96 @ -O2 -fomit-fp

.L6:                            .L6:                            .L6:
                                        movl 8(%ebp), %ebx
        flds (%edi,%eax,4)              movl (%ebx,%edx,4), %eax        movl (%ebp,%edx,4), %eax
        fstps (%ebx,%eax,4)             movl %eax, (%esi,%edx,4)        movl %eax, (%esi,%edx,4)
                                        movl 20(%ebp), %ebx
        flds (%esi,%eax,4)              movl (%edi,%edx,4), %eax        movl (%edi,%edx,4), %eax
        fstps (%ecx,%eax,4)             movl %eax, (%ebx,%edx,4)        movl %eax, (%ebx,%edx,4)
        incl %eax                       incl %edx                       incl %edx
        cmpl %edx,%eax                  cmpl %ecx, %edx                 cmpl %ecx, %edx
        jl .L6                          jl .L6                          jl .L6
The loop requires seven registers: four base pointers, an index, a
limit, and a scratch. All but the scratch must be integer. The x86
has only six integer registers under normal conditions. gcc 2.95 uses
a float register for the scratch, so the loop just fits. 2.96 tries
to use an integer register, and has to spill two pointers onto the
stack to make everything fit. Adding -fomit-frame-pointer
makes a seventh integer register available, and the loop fits again.
We see here numerous optimizer idiocies. First, it ought to
recognize that a load - even from L1 cache - is more expensive than a
floating point move, and go back to the FP registers. Second, instead
of spilling the pointers, it should spill the limit register. The
limit is only used once and the 'cmpl'
instruction can
take a memory operand. Third, the loop optimizer has failed to do
anything at all. It should rewrite the code thus:
void fcpy(float *a, float *b, float *aa, float *bb, int n)
{
  int i;
  for(i = n-1; i >= 0; i--)
    {
      *aa++ = *a++;
      *bb++ = *b++;
    }
}
which compiles to this inner loop:
.L6:
        movl (%esi), %eax
        addl $4, %esi
        movl %eax, (%ecx)
        addl $4, %ecx
        movl (%ebx), %eax
        addl $4, %ebx
        movl %eax, (%edx)
        addl $4, %edx
        decl %edi
        jns .L6
Yes, more adds are necessary, but this loop is going to be bound by I/O bandwidth anyway, and the rewrite gets rid of the limit register. Thus the loop fits in the integer registers again.
Interestingly, GCC does manage to make a transformation like that for the equivalent program in Fortran:
      subroutine fcpy (a, b, aa, bb, n)
      implicit none
      integer n, i
      real a(n), b(n), aa(n), bb(n)

      do i = 1, n
         aa(i) = a(i)
         bb(i) = b(i)
      end do
      end
which compiles to this inner loop:
.L6:
        movl (%ecx), %eax
        movl (%esi), %edx
        addl $4, %ecx
        movl %eax, (%ebx)
        addl $4, %esi
        addl $4, %ebx
        movl %edx, (%edi)
        addl $4, %edi
        decl %ebp
        jns .L6
That's still not as good as it could get, though. In Fortran (but not in C) the compiler is allowed to assume the arrays don't overlap, so it could treat it as if it had been written thus:
void fcpy(float *a, float *b, float *aa, float *bb, int n)
{
  int i;
  for(i = n-1; i >= 0; i--)
    {
      aa[i] = a[i];
      bb[i] = b[i];
    }
}
which compiles to:
.L6:
        movl (%edi,%edx,4), %eax
        movl %eax, (%ebx,%edx,4)
        movl (%esi,%edx,4), %eax
        movl %eax, (%ecx,%edx,4)
        decl %edx
        jns .L6
That transformation is also allowed in C if all four pointers are qualified with restrict.
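Concretely, that is just the original function with C99 restrict qualifiers added (renamed here only to keep it distinct from the versions above):

void fcpy_r(float *restrict a, float *restrict b,
            float *restrict aa, float *restrict bb, int n)
{
  int i;
  for(i = 0; i < n; i++)
    {
      aa[i] = a[i];
      bb[i] = b[i];
    }
}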
Then there's the question of loop unrolling, loop splitting, etc. but high-level transformations like those are outside the scope of this document.
(13 Feb 2000) If presented with even slightly complicated looping code, GCC may fail to extract all the invariants from the loops, even when there are plenty of registers.
Consider the following code, which is a trimmed down version of a real function that does something sensible.
unsigned char *
read_and_prescan (ip, len, speccase)
     unsigned char *ip;
     unsigned int len;
     unsigned char *speccase;
{
  unsigned char *buf = malloc (len);
  unsigned char *input_buffer = malloc (4096);
  unsigned char *ibase, *op;
  int deferred_newlines;

  op = buf;
  ibase = input_buffer + 2;
  deferred_newlines = 0;

  for (;;)
    {
      unsigned int span = 0;

      if (deferred_newlines)
        {
          while (speccase[ip[span]] == 0
                 && ip[span] != '\n' && ip[span] != '\t' && ip[span] != ' ')
            span++;
          memcpy (op, ip, span);
          op += span;
          ip += span;
          if (speccase[ip[0]] == 0)
            while (deferred_newlines)
              deferred_newlines--, *op++ = '\r';
          span = 0;
        }

      while (speccase[ip[span]] == 0)
        span++;
      memcpy (op, ip, span);
      op += span;
      ip += span;
      if (*ip == '\0')
        break;
    }
  return buf;
}
We're going to look exclusively at the code generated for the innermost three loops. This one is the most important:
while (speccase[ip[span]] == 0) span++;
which is compiled to
.L12:
        xorl %esi, %esi
.L6:
        movzbl (%esi,%ebx), %eax
        movl 16(%ebp), %edx
        cmpb $0, (%eax,%edx)
        jne .L19
        .p2align 4
.L20:
        incl %esi
*       movl 16(%ebp), %edx
        movzbl (%esi,%ebx), %eax
        cmpb $0, (%eax,%edx)
        je .L20
.L19:
To start, look at the line marked with a star. There is no way to
reach label .L20
except by going through the block
starting at .L6
. Register %edx
is not
modified inside the loop, and neither is the stack slot it's being
loaded from. So why wasn't that load deleted?
Then there's the matter of the entire loop test being duplicated. When the body of the loop is large, that's a good move, but here it doubles the size of the code. The loop optimizer should have the brains to start the counter at -1, and emit instead
.L12:
        movl $-1, %esi
        movl 16(%ebp), %edx
        .p2align 4
.L20:
        incl %esi
        movzbl (%esi,%ebx), %eax
        cmpb $0, (%eax,%edx)
        je .L20
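At the source level, the transformation described above corresponds to something like this sketch (the helper name is invented; ip and speccase are the variables from read_and_prescan):

static unsigned int
skip_ordinary (const unsigned char *ip, const unsigned char *speccase)
{
  unsigned int span = (unsigned int) -1;   /* start the counter at -1...        */
  do
    span++;                                /* ...so the increment comes first   */
  while (speccase[ip[span]] == 0);         /* and the test appears exactly once */
  return span;
}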
The next loop is
while (deferred_newlines) deferred_newlines--, *op++ = '\r';
This compiles to
        movl -20(%ebp), %ecx
        testl %ecx, %ecx
        je .L12
        .p2align 4
.L15:
        movb $13, (%edi)
        incl %edi
*       decl -20(%ebp)
        jne .L15
This is the same problem, but with a value that's modified. It loaded
-20(%ebp)
into a register, but then forgot about it and
started doing read-mod-write on a memory location, which is horrible.
Finally, the topmost loop:
while (speccase[ip[span]] == 0
       && ip[span] != '\n' && ip[span] != '\t' && ip[span] != ' ')
  span++;
compiles to
        movl $0, %esi
        movzbl (%ebx), %eax
        movl 16(%ebp), %edx
        cmpb $0, (%eax,%edx)
        jne .L8
        movb (%ebx), %al
        jmp .L22
        .p2align 4,,7
.L9:
        incl %esi
*       movl 16(%ebp), %edx
        movzbl (%esi,%ebx), %eax
        cmpb $0, (%eax,%edx)
        jne .L8
*       movb (%esi,%ebx), %al
.L22:
        cmpb $10, %al
        je .L8
        cmpb $9, %al
        je .L8
        cmpb $32, %al
        jne .L9
.L8:
Exact same problem: a pointer is fetched on every trip through the
loop, despite the fact that that register is never used for anything
else. Also, note that the value we need in %al
for the
comparison sequence starting at .L22
is already there,
but we fetch it again anyway. And we've got an odd split loop test,
with half duplicated and half not.
If you look at the source code carefully, you might notice another
oddity: deferred_newlines
is set to zero before the outer
loop, and never modified again except inside an if block that will
only be executed if it's nonzero. Therefore, that if block is dead
code, and should have been deleted.
(26 Feb 2000) gcc is a lot less clever about compound conditionals than it could be.
Consider:
int and(int a, int b) { return (a && b); }
int or (int a, int b) { return (a || b); }
With the usual optimization options, gcc produces this code for these functions:
and:
        movl 4(%esp), %eax
        movl $0, %edx
        testl %eax, %eax
        movl $0, %eax
        je .L3
        movl 8(%esp), %edx
        testl %edx, %edx
        setne %al
        movl %eax, %edx
.L3:
        movl %edx, %eax
        ret

or:
        movl 4(%esp), %eax
        testl %eax, %eax
        movl $0, %eax
        jne .L6
        movl 8(%esp), %edx
        testl %edx, %edx
        je .L5
.L6:
        movl $1, %eax
.L5:
        ret
That's not too bad, although we do have some pointless register shuffling in the "and" function. (But note the register clearing with movl. See below for discussion.)
However, it would be really nice if gcc would recognize that conditional branches are more expensive than evaluating both sides of the test. In fact, gcc does know how to do that - but it doesn't, because it thinks branches are dirt cheap. We can correct that with the -mbranch-cost switch. Here's what you get when you compile the above with -mbranch-cost=2 in addition to the normal switches:
and:
        movl 4(%esp), %edx
        movl $0, %eax
        movl 8(%esp), %ecx
        testl %edx, %edx
        movl $0, %edx
        setne %dl
        testl %ecx, %ecx
        setne %al
        andl %eax, %edx
        movl %edx, %eax
        ret

or:
        movl 4(%esp), %edx
        movl $0, %eax
        movl 8(%esp), %ecx
        testl %edx, %edx
        movl $0, %edx
        setne %dl
        testl %ecx, %ecx
        setne %al
        orl %eax, %edx
        movl %edx, %eax
        ret
Yay - no branches! But this code is decidedly suboptimal. There's far too much register shuffling, for one thing - note the final and/mov or or/mov combinations. For another, we fail to take advantage of the ability to test two registers at once. And finally, registers are being cleared with 'mov' and then written with 'set', which can cause partial register stalls.
Optimal code for this example would look something like:
and:
        movl 4(%esp), %edx
        movl 8(%esp), %ecx
        xorl %eax, %eax
        testl %ecx, %edx
        setne %al
        ret

or:
        movl 4(%esp), %edx
        xorl %eax, %eax
        movl 8(%esp), %ecx
        notl %edx
        notl %ecx
        testl %ecx, %edx
        sete %al
        ret
The 'or' example uses the rule that (a || b) == !(!a && !b).
The important scheduling considerations, for PPro/PII anyway, are to separate loads from uses by at least one instruction (assuming top of stack will be in L1 cache - pretty safe) and to use xor/set so as not to provoke partial register stalls.
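For what it's worth, the branch-free evaluation can also be expressed directly in the source, by computing each comparison as a 0/1 value and combining the results with bitwise operators (the names are illustrative, not part of the example above):

int and_flat(int a, int b) { return (a != 0) & (b != 0); }
int or_flat (int a, int b) { return (a != 0) | (b != 0); }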
Remember the register-clearing with 'mov' in the previous example? That's the fault of the scheduler. The scheduler? Yes indeed. Recall what the code looked like compiled with my normal flags:
and:
        movl 4(%esp), %eax
        movl $0, %edx
        testl %eax, %eax
        movl $0, %eax
        je .L3
        movl 8(%esp), %edx
        testl %edx, %edx
        setne %al
        movl %eax, %edx
.L3:
        movl %edx, %eax
        ret

or:
        movl 4(%esp), %eax
        testl %eax, %eax
        movl $0, %eax
        jne .L6
        movl 8(%esp), %edx
        testl %edx, %edx
        je .L5
.L6:
        movl $1, %eax
.L5:
        ret
With -O2 -fomit-frame-pointer only, it comes out like this:
and:
        movl 4(%esp), %edx
        xorl %eax, %eax
        testl %edx, %edx
        je .L3
        movl 8(%esp), %edx
        testl %edx, %edx
        setne %al
.L3:
        ret

or:
        movl 4(%esp), %edx
        xorl %eax, %eax
        testl %edx, %edx
        jne .L6
        movl 8(%esp), %edx
        testl %edx, %edx
        je .L5
.L6:
        movl $1, %eax
.L5:
        ret
which is obviously better. In fact, it eliminates the pointless register shuffling too.
With -O2 -fomit-frame-pointer -mbranch-cost=2, you get
and:
        movl 4(%esp), %ecx
        xorl %edx, %edx
        testl %ecx, %ecx
        movl 8(%esp), %ecx
        setne %dl
        xorl %eax, %eax
        testl %ecx, %ecx
        setne %al
        andl %eax, %edx
        movl %edx, %eax
        ret

or:
        movl 4(%esp), %ecx
        xorl %edx, %edx
        testl %ecx, %ecx
        movl 8(%esp), %ecx
        setne %dl
        xorl %eax, %eax
        testl %ecx, %ecx
        setne %al
        orl %eax, %edx
        movl %edx, %eax
        ret
which is better than the scheduled version, but still has the or/mov and and/mov sequences.