Code Gems Part 6
This text comes from IMPHOBIA Issue XII - July 1996
Let's jump right in:
* Filling a register with the Carry Flag *
Sometimes we need to fill a register with the carry flag: put 0 into the
desired register if Carry=0, or 0ffffffffh if Carry=1. Probably the fastest
way to do it (for example, with EDX):

  SBB EDX,EDX

In practice it can be used to prepare EDX before a division (before dividing
a signed number by an unsigned number). By the way, the Carry won't change
after this operation. I mulled it over and came to the conclusion that SBB
EDX,EDX is faster and in many cases more appropriate than CDQ. What's more,
there's an undocumented 1-byte instruction called SETALC which fills AL with
the Carry; its code is 0d6h. (Hi Rod!)
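To show the trick in context, here's a minimal sketch of my own (not from
the original article): a saturating subtraction that clamps the result to
zero instead of wrapping around:

  SUB EAX,EBX     ; Carry set if EAX < EBX (unsigned underflow)
  SBB EDX,EDX     ; EDX = 0ffffffffh on underflow, 0 otherwise
  NOT EDX         ; invert the mask
  AND EAX,EDX     ; underflow: EAX = 0; otherwise EAX keeps the difference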
* BSWAP with a 16-bit register *
Yeah, BSWAP with a word register operand is a valid instruction. At least on
genuine Intel processors. Some specifications recommend against using it for
better compatibility with Intel clones. Anyway, if it works, it byte-swaps
the upper word of the doubleword register down into the lower word without
affecting the upper 16 bits. For example, if EAX=ab120000h then

  BSWAP AX

results in EAX=ab1212abh. Flags won't change, of course.
* Operand sizes *
For instructions where the immediate operand is small (like 'ADD EBX,12'),
the machine code is shorter, since the operand is stored in 8 bits and gets
sign-extended to 32 bits when the instruction actually runs. This means that
all operand values in the [-128..127] range (hexadecimally
[ffffff80h..0000007fh]) save 3 bytes per instruction. There's only one trick
concerning it. When using 'ADD EBX,80h', the operand takes 4 bytes, since
00000080h falls out of the range. But 'SUB EBX,0ffffff80h' does the same as
'ADD EBX,80h', three bytes shorter. The same goes for 'SUB EBX,80h' /
'ADD EBX,0ffffff80h' too. Note: EAX's operand is always 32-bit, it can't be
shrunk. This comes from the 8086 era, when the 16-bit registers ruled. In
16-bit code most instructions which affect AX are shorter anyway, therefore
there was no need for such sign-extended operands for AX; and since the
32-bit encoding comes directly from the 16-bit one, there are no short
sign-extending opcodes for EAX. Except if you use the good old LEA ;-) So
'LEA EAX,[EAX+12]' costs three bytes while 'ADD EAX,12' eats five. Of
course, only in 32-bit code! In real mode 'LEA EAX,[EAX+12]' has TWO prefix
bytes (slowing the 1-clock LEA back to THREE clocks). I've got one more
addressing-mode trick. Instead of 'MOV EAX,[EBX*2]', 'MOV EAX,[EBX+EBX]' is
better: the first one's code is seven bytes, the second's is three ;-)
Generally, every scaled addressing like [EBX*2] has a 32-bit displacement,
so [EBX*2] is in fact [EBX*2+00000000h]. But if another register is present,
the opcode can be shorter; for example, 'MOV EAX,[ESI+EBX*4+10h]' is a
4-byte instruction. Nevertheless, on the 486 every instruction which uses
two registers in addressing, or one register scaled, takes one extra cycle!
So 'MOV EAX,[EBX+EBX]' takes two cycles, and so do 'MOV EAX,[EBX*2]' and
'LEA EAX,[EBP+EDX*8+1234h]'.
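To sum up the sizes, here's a quick reference I put together (encodings for
32-bit code; easy to verify with any assembler's listing output):

  ADD EBX,12          ; 83 C3 0C             - 3 bytes (sign-extended imm8)
  ADD EBX,12345678h   ; 81 C3 78 56 34 12    - 6 bytes (full imm32)
  SUB EBX,0ffffff80h  ; 83 EB 80             - 3 bytes (same as ADD EBX,80h)
  LEA EAX,[EAX+12]    ; 8D 40 0C             - 3 bytes
  MOV EAX,[EBX+EBX]   ; 8B 04 1B             - 3 bytes
  MOV EAX,[EBX*2]     ; 8B 04 5D 00 00 00 00 - 7 bytes (forced disp32)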
* Rounding *
Let's say we want to divide a number by a power of two (2, 4, 8, 16, etc.)
and round the result to the nearest integer: the Carry that SAR shifts out
last is exactly the 'half' bit of the remainder, and ADC adds it back. In
this case

  SAR EAX,4
  ADC EAX,0

will work perfectly. The credit for this one goes to Good Old Dave :-)
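A quick worked example (numbers of my own choosing), dividing by 16:

  MOV EAX,25      ; 25/16 = 1.5625
  SAR EAX,4       ; EAX = 1, Carry = bit 3 (the 'half' bit) = 1
  ADC EAX,0       ; EAX = 2, the rounded quotient

With EAX=23 (10111b) the Carry would be 0 and the result would stay 1.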
* Penalties on the 486 *
In most cases, the 486 is free from flow-dependence penalties, which means
that an instruction using the result of the previous instruction will not
cause a slowdown:

  ADD EAX,EBX
  ADD ECX,EAX

takes two cycles. On a Pentium it takes two cycles as well, but

  ADD EAX,EBX
  ADD ECX,EBX

takes one cycle, because the second instruction doesn't use the result of
the first, so the two can be 'pair'-ed. These situations are quite well
described in the opt32.doc application note released by Intel; I just want
to point out one interesting thing. (By the way, there's a new version out
with Pentium Pro optimization tricks. Check ftp.intel.com!) Generally, the
486 has two types of flow-dependence penalties:
1) immediately using a register after its 8-bit subregister was modified (so
this applies to (e)ax, (e)bx, (e)cx, (e)dx after al, bh, etc. has been
changed);

2) using a register in addressing immediately after it was modified. (This
is valid for all registers, but beware: LEA is an addressing instruction, so
try to avoid using it if its operand was modified by the previous
instruction.)

For example, how many cycles does the following sequence eat (in protected
mode, assuming 100% cache hit)?

  ADD ECX,EBP
  ADC BL,DL
  MOV AL,[EBX]

On a 486 the ADD is one, the ADC is another one, but the MOV takes three
even if the operand is already in the cache! Why? There is a double penalty:
one clock for using a register in addressing immediately after it was
modified (Address Generation Interlock, AGI), and another cycle for using a
register immediately after its subregister was modified (Flow Break). So
this innocent MOV instruction costs THREE cycles... Hey, I'm a smart coder,
I'm gonna put an instruction between the ADC and the MOV, and the problem is
solved! Really? The

  ADD ECX,EBP
  ADC BL,DL
  SUB ESI,EBP
  MOV AL,[EBX]

sequence takes 5 clocks: the ADD, ADC and SUB take three, but the MOV takes
two, because ONE cycle inserted BETWEEN the ADC and the MOV can save only
ONE penalty, not TWO. So for a perfect one-clock-per-instruction ratio at
least TWO instructions have to be inserted. Or one two-cycle instruction
like SHR, or even a prefixed one like ADD AX,BX in 32-bit code.
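So the fully smoothed-out version would look something like this (my own
sketch; the second filler is a made-up instruction, anything that doesn't
touch EBX will do):

  ADD ECX,EBP
  ADC BL,DL
  SUB ESI,EBP     ; filler #1, hides the Flow Break
  XOR EDI,EDI     ; filler #2 (hypothetical), hides the AGI
  MOV AL,[EBX]    ; now runs in a single cycle

Five instructions, five clocks on the 486.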
* Aligning inner loops *
Aligning the first instruction of an 'inner' loop to a 16-byte boundary may
increase performance: the 486's and the Pentium Pro's instruction prefetcher
(or what) just loves aligned addresses. But the straightforward solution

  JMP _INNERLOOP
  ALIGN 16
  _INNERLOOP:

sounds quite dirty from many points of view: it may waste both space and
speed. Certainly a more elegant method is needed, one which won't put a JMP
there when the _INNERLOOP label is already on a paragraph boundary, and
which puts a few NOPs instead of a JMP when only up to three bytes remain
before the next aligned position:

CODEALIGN MACRO
LOCAL _NOW,_THEN
_NOW:
ALIGN 16
_THEN:
IF (_THEN-_NOW)             ; Not aligned yet?
ORG _NOW                    ; Set position back
IF (_THEN-_NOW) LE 3        ; 3 bytes or fewer left?
DB (_THEN-_NOW) DUP (90h)   ; put NOPs
ELSE
JMP _THEN                   ; Jump to the boundary
ALIGN 16                    ; Apply the alignment
ENDIF
ENDIF
ENDM
Simply put 'codealign' before the top of the inner loop:

  (...Inner loop preparation code...)
  codealign
  _INNERLOOP:
  (...Inner loop...)

This one is not fully optimized from the speed's point of view: instead of
two NOPs (when two bytes remain) one 'MOV ESP,ESP' would be faster, and
instead of three NOPs an 'ADD ESP,0'.
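For reference, the sizes of these padding choices in 32-bit code (my own
summary, easy to check in an assembler listing):

  NOP             ; 90       - 1 byte
  MOV ESP,ESP     ; 8B E4    - 2 bytes, one instruction instead of two NOPs
  ADD ESP,0       ; 83 C4 00 - 3 bytes, one instruction instead of three NOPs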
* Aligning a memory pointer *
After allocating a few doublewords it's not certain that the array's
starting address falls on a dword boundary, so we have to adjust it:

  ADD EBX,3
  AND EBX,0fffffffch

With this technique it's possible to align to any power of two. This idea
came from Technomancer's PM kernel.
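The general recipe (my own phrasing): add (alignment-1), then AND with
NOT (alignment-1). For example, aligning EBX up to a 16-byte boundary:

  ADD EBX,15          ; add (alignment - 1)
  AND EBX,0fffffff0h  ; mask = NOT (alignment - 1)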
Another (space-saving) trick applies when flags need to be stored along with
pointers; for example, in 3D code one pointer belongs to every face
(triangle). At runtime we need one bit to indicate whether the face is
visible or not. It would be too expensive to allocate another byte for every
triangle. But where can that one bit be stored? Right in the pointer. This
requires that the face structures always start on even addresses; in that
case the lowest bit of their address is always zero, so the lowest bit of
the pointer is always zero. This lowest bit can be used as a flag: set by
the appropriate routine (e.g. 'int face_visible(face *)') and checked by
other routines. Only take care to clear it before accessing the face's
structure.
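A minimal sketch of the idea (hypothetical register usage, assuming ESI
points at the slot where the face pointer is stored):

  OR DWORD PTR [ESI],1    ; set the visibility flag in the stored pointer
  ...
  TEST DWORD PTR [ESI],1  ; is the face visible?
  JZ skip_face
  MOV EBX,[ESI]
  AND EBX,0fffffffeh      ; clear the flag before dereferencing
  MOV AL,[EBX]            ; EBX is now a clean pointer to the face structure
skip_face: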
* Kicking out conditional jumps, Part II *
When optimizing for the Pentium (or for the Pentium Pro with an eye on
compatibility) it might save some cycles to change conditional jumps to
non-branching code. For example, the next code sequence implements this IF
construction:

  IF (eax = ebx) THEN ecx=x; ELSE ecx=y;

In asm:

  SUB EAX,EBX
  CMP EAX,1
  SBB ECX,ECX
  AND ECX,x-y
  ADD ECX,y
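To see why it works, here's a trace with x=7 and y=3 (numbers of my own
choosing): if EAX=EBX, SUB gives 0, CMP EAX,1 borrows so Carry=1, SBB sets
ECX=0ffffffffh, AND with (7-3) gives 4, and ADD 3 gives 7 = x. If EAX<>EBX,
SUB gives a nonzero value, CMP EAX,1 leaves Carry=0, SBB sets ECX=0, AND
keeps 0, and ADD 3 gives 3 = y.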
It also reveals how to copy the Zero flag to the Carry in one cycle ('CMP
EAX,1' ;-) On the Pentium Pro, however, the next version should be
preferred (note that CMOVcc accepts no immediate operand, so y has to sit
in a register or in memory):

  MOV ECX,x
  CMP EAX,EBX
  CMOVNE ECX,[y]
Yeah, the Pro has been blessed with the conditional move instructions:
CMOVS, CMOVNE, CMOVG, and all the rest. As always, the optimization
techniques of the Pro are completely different from those of the other
members of the Intel family. For example, using a 32-bit register after its
subregister was modified may cause a 6-7 clock penalty! Can you believe it?
Linear texture-mapping inner loops using the classic 8-bit U/V coordinates
suck... Some other notes: aligning loops to a paragraph boundary now boosts
performance again, since the Pro always loads 16 bytes of code in one step.
The formerly banned instructions CDQ and MOVZX are also faster than the
MOV/shift substitutes crafted for the Pentium.
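A short sketch of that subregister penalty (my example; MOVZX is the cure
the paragraph above hints at):

  MOV AL,[ESI]              ; writes the 8-bit subregister...
  MOV EBX,EAX               ; ...then reads full EAX: stall on the Pro

  MOVZX EAX,BYTE PTR [ESI]  ; writes the whole EAX instead
  MOV EBX,EAX               ; no stall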
* Do you have any 'Code Gems'? *
...but you don't feel inspired enough to write a whole article? Do you want
to share your knowledge with Imphobia readers? Right then, please send your
gems to me! I will publish them in the next issue.

Ervin / AbaddoN
Ervin Toth
48/a, Kiss Janos street
1126 Budapest
HUNGARY
+36-1-201-9563
ervin@sch.bme.hu