The following is a list of optimizations that may come in handy. The instructions are listed alphabetically (more or less) in the first column.
The second column lists the CPU or CPUs to which the optimization applies; alternatively, it may note that the optimization applies to 16-bit or 32-bit code.
The third column contains one or more replacement code sequences that are faster or smaller (sometimes both) than the instruction in the first column. For some obscure instructions, the action of the first-column instruction is explained instead.
The fourth column contains a description and/or examples.
                                replacement
instruction           CPUs      or action               description/notes
---------------------------------------------------------------------------------------
aad (imm8)            all       AL = AL+(AH*imm8)       If imm8 is blank uses 10.
                                AH = 0                  AAD is almost always slower,
                                                        but only 2 bytes long.
aam (imm8)            all       AH = AL/imm8            Same as AAD.
                                AL = AL MOD imm8
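
The AAD and AAM actions listed above can be checked with a small C model. This is only a sketch of the results as the table states them; the struct and function names are made up for illustration, and flag effects are not modeled.

```c
#include <stdint.h>

/* Hypothetical register pair used only for this illustration. */
typedef struct {
    uint8_t ah;
    uint8_t al;
} ax_regs;

/* AAD imm8: AL = AL + AH*imm8 (kept to 8 bits), AH = 0. */
static ax_regs model_aad(ax_regs r, uint8_t imm8)
{
    ax_regs out;
    out.al = (uint8_t)(r.al + r.ah * imm8);
    out.ah = 0;
    return out;
}

/* AAM imm8: AH = AL / imm8, AL = AL MOD imm8 (imm8 must be nonzero). */
static ax_regs model_aam(ax_regs r, uint8_t imm8)
{
    ax_regs out;
    out.ah = (uint8_t)(r.al / imm8);
    out.al = (uint8_t)(r.al % imm8);
    return out;
}
```

With AH = 4 and AL = 2, model_aad with imm8 = 10 gives AL = 42 and AH = 0; model_aam on that result gives back AH = 4 and AL = 2.
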
add                   16-bit    lea reg, [reg+reg+disp]
                                                        Use LEA to add
                                                        base + index + displacement
                                                        Also preserves flags;
                                                        for example:
                                                          add bx, 4
                                                        can be replaced by:
                                                          lea bx, [bx+4]
                                                        when the flags must not
                                                        be changed.
add                   32-bit    lea reg, [reg+reg*scale+disp]
                                                        Use LEA to add
                                                        base + scaled index + disp
                                                        Also preserves flags.
                                                        (See previous example).
                                                        The 32-bit form of LEA
                                                        is much more powerful
                                                        than the 16-bit version
                                                        because of the scaling
                                                        and the fact that almost
                                                        all of the 8 General purpose
                                                        registers can be used
                                                        as base and index registers.
and reg, reg          Pent      test reg, reg           Use TEST instead of AND
                                                        on the Pentium because
                                                        fewer register conflicts
                                                        result in better pairing.
bswap                 Pent      ror eax, 16             Pairs in U pipe, BSWAP
                                                        doesn't pair.
                                                        disadvantage: modifies flags
                                                        (Not a direct replacement)
call dest1            286+      push offset dest2       When CALL is followed by
jmp dest2                       jmp dest1               a JMP, change the return
                                                        address to the JMP destination.
call dest1            all       jmp dest1               When a CALL is followed by a
ret                                                     RET, the CALL can be replaced
                                                        by a JMP.
cbw                   386+      mov ah, 0               When you know AL < 128
                                                        use MOV AH, 0 for speed.
                                                        But use CBW for smaller
                                                        code size.
cdq                   486+      xor edx, edx            When you know EAX is positive
                                                        Faster, better pairing.
                                                        disadvantage: modifies flags
                      Pent      mov edx, eax            When EAX value could be
                                sar edx, 31             positive or negative
                                                        because of better pairing
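
The two CDQ replacements above can be compared against CDQ itself with a small C model. This is a sketch under stated assumptions: the function names are hypothetical, only the resulting EDX value is modeled (not flags or timing), and the arithmetic shift is written portably by negating the sign bit.

```c
#include <stdint.h>

/* EDX after CDQ: all ones if EAX is negative, otherwise zero. */
static uint32_t edx_after_cdq(int32_t eax)
{
    return eax < 0 ? 0xFFFFFFFFu : 0u;
}

/* EDX after "mov edx, eax / sar edx, 31".
   SAR by 31 copies the sign bit into every bit position, modeled here
   as 0 - sign (sign 0 gives 0, sign 1 gives 0xFFFFFFFF). */
static uint32_t edx_after_mov_sar(int32_t eax)
{
    uint32_t sign = (uint32_t)eax >> 31;
    return 0u - sign;
}

/* EDX after "xor edx, edx": always zero, so it matches CDQ
   only when EAX is known to be non-negative. */
static uint32_t edx_after_xor(int32_t eax)
{
    (void)eax;
    return 0u;
}
```

The MOV/SAR pair agrees with CDQ for every input, while the XOR form agrees only for non-negative EAX, which is why the table restricts it to that case.
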
cmp mem, reg          286       cmp reg, mem            reg, mem is 1 cycle faster
cmp reg, mem          386       cmp mem, reg            mem, reg is 1 cycle faster
dec reg16                       lea reg16, [reg16 - 1]  Use to preserve flags
                                                        for BX, BP, DI, SI
dec reg32                       lea reg32, [reg32 - 1]  Use to preserve flags
                                                        for EAX, EBX, ECX, EDX
                                                        EDI, ESI, EBP
div <op>              8088      shr accum, 1            When <op> resolves to 2, use
                                                        shift for division.
                                                        (use CL for 4, 8, etc.)
div <op>              186+      shr accum, n            When <op> resolves to a power
                                                        of 2 use shifts for division.
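
The shift-for-divide replacement above can be sketched in C for the unsigned 16-bit case. The function names are hypothetical. Note two assumptions that the table does not spell out: DIV is an unsigned divide, so the equivalence is shown for unsigned values only, and DIV also produces a remainder, which the shift alone does not; the second function shows how a mask would recover it.

```c
#include <stdint.h>

/* Quotient of ax / 2^n, as produced by "shr ax, n" (n in 0..15). */
static uint16_t div_pow2_shr(uint16_t ax, unsigned n)
{
    return (uint16_t)(ax >> n);
}

/* Remainder of ax / 2^n. SHR does not produce this; a separate
   AND with (2^n - 1) would be needed if the remainder is wanted. */
static uint16_t rem_pow2_and(uint16_t ax, unsigned n)
{
    return (uint16_t)(ax & ((1u << n) - 1u));
}
```

For example, div_pow2_shr(1000, 3) equals 1000 / 8 and rem_pow2_and(1000, 3) equals 1000 % 8.
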
enter imm16, 0        286+      push bp                 ENTER is always slower
                                mov bp, sp              and 4 bytes in length
                                sub sp, imm16           if imm16 = 0 then push/mov
                                                        is smaller
                      386+      push ebp
                      32-bit    mov ebp, esp
                                sub esp, imm16
inc reg16                       lea reg16, [reg16 + 1]  Use to preserve flags
                                                        for BX, BP, DI, SI
inc reg32                       lea reg32, [reg32 + 1]  Use to preserve flags
                                                        for EAX, EBX, ECX, EDX
                                                        EDI, ESI, EBP
jcxz <dest>:          486+      test cx, cx             JCXZ is faster and
                                je <dest>:              smaller on 8088-286.
                                                        On the 386 it is
                                                        about the same speed.
                      486+      test ecx, ecx           Never use JCXZ on 486
                                je <dest>:              or Pentium except for
                                                        compactness
lea reg, mem          8088-286  mov reg, OFFSET mem     MOV reg, imm is faster
                                                        on 8088 - 286. 386+
                                                        they are the same.
                      Note: There are many uses for LEA, see: add, inc, dec, mov, mul
leave                 486+      mov sp, bp              LEAVE is only 1 byte
                                pop bp                  long and is faster
                                                        on the 186-386. The
                                mov esp, ebp            MOV/POP is much faster
                                pop ebp                 on 486 and Pentium
lodsb                 486+      mov al, [si]            LODS is only 1 byte long
                                inc si                  and is faster on 8088-386,
                                                        much slower on the 486.
                                                        On the Pentium the MOV/INC
                                                        or MOV/ADD instructions
                                                        pair, taking only 1 cycle.
lodsw                 486+      mov ax, [si]            see lodsb
                                add si, 2
lodsd                 486+      mov eax, [esi]          see lodsb
                                add esi, 4
loop <dest>:          386+      dec cx                  LOOP is faster and
                                jnz <dest>:             smaller on 8088-286.
                                                        On 386+ DEC/JNZ is
loopd <dest>:                   dec ecx                 much faster. On the Pentium
                                jnz <dest>:             the DEC/JNZ instructions
                                                        pair, taking only 1 cycle.
loopXX <dest>:        486+      je $+5                  The 3 replacement instructions
( XX = e,ne,z or nz)            dec cx                  are much faster on the 486+.
                                jnz <dest>:             LOOPxx is smaller and
                                                        faster on 8088-286.
loopdXX <dest>:       486+      je $+5                  The speed is about the
                                dec ecx                 same on the 386.
                                jnz <dest>:
mov reg2, reg1        286+      lea reg2, [reg1+n]      LEA is faster, smaller and
followed by:                                            preserves flags. This is a
inc/dec/add/sub reg2                                    way to do a MOV and ADD/SUB
                                                        of a constant, n.
mov acc, reg          all       xchg acc, reg           Use XCHG for smaller code
                                                        when one register's
                                                        final value can be ignored.
                                                        Note that acc = AL, AX or EAX.
mov mem, 1            Pent      lea bx, mem             Displacement/immediate does
                                mov [bx], 1             not pair. LEA/MOV can be used
                                                        if other code can be placed
                                                        in between to prevent AGI's.
                                mov ax, 1               MOV/MOV may be easier to pair.
                                mov mem, ax
mov [bx+2], 1         Pent      mov ax, 1               Better pairing because
                                mov [bx+2], ax          displacement/immediate
                                                        instructions do not pair.
                                lea bx, [bx+2]
                                mov [bx], 1
movsb                 486+      mov al, [si]            MOVS is faster and
                                inc si                  smaller to move a single
                                mov [di], al            byte, word or dword
                                inc di                  on the 8088-386.
                                                        On the 486+ the MOV/INC
                                                        method is faster.
                                                        NOTE: REP MOVS is always
                                                        faster to move a large block.
movsw                 486+      mov ax, [si]            see MOVSB
                                add si, 2
                                mov [di], ax
                                add di, 2
movsd                 486+      mov eax, [esi]          see MOVSB
                                add esi, 4
                                mov [edi], eax
                                add edi, 4
movzx r16, rm8        486+      xor bx, bx              MOVZX is faster and
                                mov bl, al              smaller on the 386.
                                                        On the 486+ XOR/MOV
movzx r32, rm8        486+      xor ebx, ebx            is faster. Possible
                                mov bl, al              pairing on the Pentium.
                                                        (source can be reg or mem)
movzx r32, rm16       486+      xor ebx, ebx            disadvantage: modifies flags
                                mov bx, ax
mul n                 8088+     shl ax, cl              Use shifts or ADDs instead of
                                                        multiply when n is a power of 2
mul n                 Pent      add ax, ax              ADD is better than single
                                                        shift because it pairs better.
mul                   32-bit    lea                     Use LEA to multiply by
                                                        2, 3, 4, 5, 7, 8, 9
                                lea eax, [eax+eax*4]    (ex: multiply EAX * 5)
                                                        LEA is better than SHL on the
                                                        Pentium because it pairs in
                                                        both pipes, SHL pairs only in
                                                        the U pipe.
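
The LEA multiply trick above rests on the address form base + index*scale. A minimal C sketch of that arithmetic follows; the helper names are hypothetical, and it covers only the factors that a single base + index*scale form reaches directly (2, 3, 4, 5, 8, 9). Results wrap modulo 2^32, as they would in a 32-bit register.

```c
#include <stdint.h>

/* Models "lea dst, [base + index*scale]"; scale is assumed to be
   1, 2, 4 or 8, the only values the addressing mode allows. */
static uint32_t lea(uint32_t base, uint32_t index, uint32_t scale)
{
    return base + index * scale;
}

static uint32_t mul2(uint32_t x) { return lea(x, x, 1); } /* lea eax, [eax+eax]   */
static uint32_t mul3(uint32_t x) { return lea(x, x, 2); } /* lea eax, [eax+eax*2] */
static uint32_t mul4(uint32_t x) { return lea(0, x, 4); } /* lea eax, [eax*4]     */
static uint32_t mul5(uint32_t x) { return lea(x, x, 4); } /* lea eax, [eax+eax*4] */
static uint32_t mul8(uint32_t x) { return lea(0, x, 8); } /* lea eax, [eax*8]     */
static uint32_t mul9(uint32_t x) { return lea(x, x, 8); } /* lea eax, [eax+eax*8] */
```

For example, mul5(7) is 7 + 7*4 = 35, matching the "multiply EAX * 5" entry.
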
or reg, reg           Pent      test reg, reg           Better pairing because
                                                        OR writes to register.
                                                        (This is for src = dest.)
pop mem               486+      pop reg                 Faster on 486+
                                mov mem, reg            Better pairing on Pentium
push mem              486+      mov reg, mem            Faster on 486
                                push reg                Better pairing on Pentium
pushf                 486+      rcr reg, 1              To save only the carry flag
                                                        use a rotate (RCR or RCL)
                                  or                    into a register. RCR and RCL
                                                        are pairable (U pipe only)
                                rcl reg, 1              and take 1 cycle. PUSHF is
                                                        slow and not pairable.
popf                  486+      rcl reg, 1              To restore only the carry flag.
                                                        See PUSHF.
                                  or
                                rcr reg, 1
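
The carry save/restore idea above (RCR to save, RCL to restore) can be modeled in C as a 33-bit rotate through the carry flag. This is a sketch with hypothetical names: the struct stands in for a 32-bit scratch register plus CF, and other flags are ignored. It assumes the scratch register is left untouched between the save and the restore.

```c
#include <stdint.h>

/* Hypothetical machine state: one 32-bit scratch register and CF. */
typedef struct {
    uint32_t reg;
    unsigned cf;
} carry_state;

/* rcr reg, 1: CF moves into bit 31, bit 0 moves into CF.
   The old bit 0 of the register is lost, so reg must be a scratch. */
static carry_state rcr1(carry_state s)
{
    carry_state out;
    out.cf = s.reg & 1u;
    out.reg = (s.reg >> 1) | ((uint32_t)(s.cf & 1u) << 31);
    return out;
}

/* rcl reg, 1: bit 31 moves into CF, CF moves into bit 0. */
static carry_state rcl1(carry_state s)
{
    carry_state out;
    out.cf = (s.reg >> 31) & 1u;
    out.reg = (s.reg << 1) | (s.cf & 1u);
    return out;
}
```

After rcr1 the saved carry sits in bit 31 of the register; code in between may change CF freely, and a later rcl1 brings the saved value back into CF.
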
rep scasb             Pent      loop1:                  REP SCAS is faster and
                                  mov al, [di]          smaller on 8088-486.
                                  inc di                Expanded code is faster
                                  cmp al, reg2          on Pentium due to pairing.
                                  je exit
                                  dec cx
                                  jnz loop1
                                exit:
shl reg, 1            Pent      add reg, reg            ADD pairs better. SHL
                                                        only pairs in the U pipe.
stosb                 486+      mov [di], al            STOS is faster and smaller
                                inc di                  on the 8088-286, and the same
                                                        speed on the 386. On the 486+
stosw                 486+      mov [di], ax            the MOV/INC is slightly
                                add di, 2               faster.
stosd                 486+      mov [edi], eax          REP STOS is faster on 8088-386.
                                add edi, 4              MOV/INC or MOV/ADD is faster
                                                        on the 486+
                                                        Note: use LEA SI, [SI+n]
                                                        to advance SI without
                                                        changing the flags.
xchg                  all                               Use xchg acc, reg to do a
                                                        1 byte MOV when one register
                                                        can be ignored.
xchg reg1, reg2       Pent      push reg1               pushes and pops are 1 cycle
                                push reg2               faster on Pentium due to
                                pop reg1                pairing.
                                pop reg2
                                                        disadvantage: uses stack
                      Pent      mov reg3, reg1          Faster and better pairing
                                mov reg1, reg2          if reg3 is available.
                                mov reg2, reg3
xlatb                 486+      mov bh, 0               XLAT is faster and smaller
                                mov bl, al              on 8088-386. MOV's are faster
                                mov al, [bx]            on 486+. Best to rearrange
                                                        instructions to prevent AGI's
xlatb                 486+      xor ebx, ebx            and get pairing on Pentium.
                                mov bl, al              Force high part of BX/EBX
                                mov al, [ebx]           to zero outside of loop.
                                                        disadvantage: modifies flags