Optimizations for assembly coders (Intel series 80386-Pentium)
--------------------------------------------------------------

By The Cremator / Metal a.k.a. Arnout v.d. Kamp

Who is this document for?
-------------------------
This document is especially meant for demo coders, game programmers, and
compiler programmers (are there any?).
There are a whole bunch of good programmers around the globe, but most of them
are good at mathematics rather than at real coding (I mean assembly).

What are optimizations? When is something optimized?
----------------------------------------------------
A program or a program loop can be optimized when the same work can be done in
fewer clock cycles.
The faster the code executes, the better it is optimized. Optimizations for
inner loops are more important than optimizations for outer loops or startup
code.

Optimizations can be done by replacing instructions with faster ones like:
        Instead of CMP AX,0           TEST AX,AX
        Instead of IMUL EAX,320       MOV EAX,[Mul320+EAX*4]

Or by replacing a slow (complex) instruction with two faster ones:
        Instead of MOVZX ebx,[byte ptr esi]     XOR ebx,ebx
                                                MOV bl,[esi]

Another optimization is removing loops, as is done in compiled sprites.
However, this one costs a lot of code, and with the arrival of the new
processors (Pentium, P6) it no longer wins much time, due to branch
prediction and instruction pipelining.

The last optimization, which most coders do not know or understand, is
instruction pairing. With this one you can remove address generation
interlocks (AGIs) and all other sorts of register dependency (explained later
on).

What do you need to know
------------------------
Well.... a lot....
It is very handy to know ALL the instructions of the 486 / Pentium,
including their clock cycle counts and implicit register usage (stosb, for
example, uses al, es and (e)di).
You also need a good understanding of why one instruction takes more clock
cycles to execute than another.
Finally, you must know about the operand size prefix (66h).

First (before we begin) remember this:

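* Never use instructions that are not necessary.
   This sounds simple-minded, but there are a lot of coders who write
   this:
        SUB ax,cx           (sets flags)
        CMP ax,0            (sets ZF and SF to the same values again!)
        Jxx label
   The CMP is redundant for jumps that only test ZF or SF (JZ, JNZ, JS, JNS).
   It does clear CF and OF, so jumps like JL or JB behave differently
   without it.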

* The shorter the opcode, the faster the instruction
   This is not always true.... but keep it as a guideline...

* The fewer instructions, the fewer clock cycles it takes
   This is not always true.... but keep it as a guideline...

* In a USE16 segment, use as few 32-bit registers as possible.
  In a USE32 segment, use as few 16-bit registers as possible.
   This counts both for operands and for addressing (each such use costs a
   66h or 67h prefix).
    8-bit registers can be used wherever you want (no restrictions).

Already known optimizations
---------------------------
Some obvious ('open door') optimizations:
 - Replacing MUL by a series of SHLs and ADDs (or SUBs)
 - Replacing a MUL by a lookup table:
                        0 * 320
                        1 * 320
                        2 * 320
                        etc

 - Replacing some MULs with LEA;  LEA eax,[eax+eax*4]
   (only useful when the address uses at least a base register)

 - Instead of SHL reg,1   ADD reg,reg
 - Instead of CMP reg,0   TEST reg,reg
 - Instead of MOV reg,0   XOR reg,reg
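
As a rough sketch of the shift and LEA items, here are two ways to compute
y*320 without MUL, using 320 = 256 + 64 = 5 * 64. The register choice (eax
holds y, ebx is a free scratch register) is only an assumption for this
example:

        ; shifts and an add: y*320 = y*256 + y*64
        MOV ebx,eax         ; ebx = y
        SHL eax,8           ; eax = y*256
        SHL ebx,6           ; ebx = y*64
        ADD eax,ebx         ; eax = y*320

        ; LEA and one shift: y*320 = (y*5)*64
        LEA eax,[eax+eax*4] ; eax = y*5
        SHL eax,6           ; eax = y*320

Both versions leave the low 32 bits of the product in eax. Which one is
faster depends on the processor and on the surrounding code.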

Some less known optimizations
-----------------------------

- For some instructions (like OR, ADD etc.) there are separate, shorter
opcodes when using the accumulator (al, ax or eax) with an immediate value.
Keep this in mind.

 add ecx,4000  81 C1 imm32   (6 bytes)
 add eax,4000  05 imm32      (5 bytes)

- INC and DEC of a word/dword register have their own one-byte opcodes; the
byte register forms take two bytes:

 INC eAX  40
 INC AL   FE C0

- Replacing memory-read/modify/write instructions with register read, register
modify and register write instructions can be faster when using instruction
pairing (no register dependency):

    Instead of inc [dword ptr label+ebx]     mov eax,[label+ebx]
                                             inc eax
                                             mov [label+ebx],eax

- On 386 and 486 processors it takes one clock cycle more to generate the
address when using a (scaled) index register. The Pentium is neutral to the
choice of index versus base.

- Since data access has priority over prefetching, try to avoid consecutive
memory instructions, so that the prefetch queue stays filled.

- Try to align your labels on 16-byte boundaries (address MOD 16 = 0). At
least try it for your inner loop labels.

- Align your data; words at word boundaries, dwords at dword boundaries and
qwords at qword boundaries.

- Nearly all prefixed opcodes (including prefix 66h and 67h) take one clock
cycle more to decode! So know your segment use (USE 16 and USE 32)

- Try to avoid the use of complex instructions like LOOP, ENTER and LEAVE.
Replace these instructions by simple ones; for example, for LOOP:
  dec reg
  jnz label
This has two advantages: the first is that the register can be freely
chosen. The second is that the jump can be near or short.
Another note here: replacing the jnz with a jge executes the loop code n+1
times instead of n times. This saves you an INC.
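
A minimal sketch of both loop forms, assuming a USE32 segment and a
hypothetical task of storing eax into 100 dwords starting at [edi]. The
labels and the count are made up for this example, and the two loops are
alternatives, not meant to run one after the other:

        ; n iterations with DEC / JNZ
        MOV ecx,100         ; n = 100
Fill1:
        MOV [edi],eax       ; store one dword
        ADD edi,4           ; advance the pointer
        DEC ecx
        JNZ Fill1           ; repeat until ecx reaches 0

        ; n+1 iterations with DEC / JGE, the counter doubles as index
        MOV ecx,99          ; n = 99, the body runs for ecx = 99 down to 0
Fill2:
        MOV [edi+ecx*4],eax ; store dword number ecx
        DEC ecx
        JGE Fill2           ; repeat until ecx drops below 0

Unlike LOOP, DEC changes the flags, so the replacement is only safe when no
flags have to survive the end of the loop body. The JGE form also assumes
that the start count is a non-negative signed number.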

- The zero extend and sign extend moves (MOVZX, MOVSX) are very slow and
should be avoided as much as possible. For sign extension of the accumulator
use cbw, cwde, cwd and cdq.
For movzx you can use:
       XOR eax,eax
       mov al,[byte ptr esi]

- Pushing memory is very slow. It is better to first load the value into a
register and then push the register.

- When you need some stack space, do not use sub esp,4 but push a register.
For eight bytes two pushes can be used. For more you still have to sub esp.

REGISTER DEPENDENCY (AND AGIs)
------------------------------
How do you write optimized code?
Well, this is very simple.... First imagine this scenario:

(Not optimized code:)
John asks Janny: "What does Robert Jan collect?"
Some minutes later Robert Jan tells Janny: "I collect FLIPPO'S"
Then Janny goes to John and says: "I know what Robert Jan collects; FLIPPO'S"

(optimized code:)
Robert Jan tells Janny: "I collect FLIPPO'S"
John asks Janny: "What does Robert Jan collect?"
Janny answers: "Robert Jan collects FLIPPO'S"

(Robert Jan: Did you really have to do that again? Hehe ;).)

This example is very comparable to optimizing code:

Not Optimized:

    MOV CX,10
    MOV SI,[INDEX]
    ADD SI,SI           ; Clock penalty; SI was loaded just before
    MOV BX,[SI]         ; AGI; SI was modified just before
    XOR AX,AX

Optimized:

    MOV SI,[INDEX]
    MOV CX,10
    ADD SI,SI
    XOR AX,AX
    MOV BX,[SI]

In the first example, SI is modified by the add si,si instruction, so
mov bx,[si] has to wait until si is updated.

Remember that the flags are a register too. So:

        TEST ax,ax
        jnz label

will cost you an extra clock cycle too.

Note: The Pentium processor is very often neutral to register dependency;
it only causes non-pairability (read on and find out).
When an Address Generation Interlock can be avoided at the cost of a
register dependency, then do it! AGIs are boneheads.
The best (of course) is no AGIs and no register dependencies.

Maybe the P6 can do something about register dependency by adapting the code
or something... I don't know... Let's wait and see.
(I mean something like using MOV ax,0 instead of XOR ax,ax, which is faster
when the XOR would have to wait for ax because of a register dependency.)

Some instructions use registers implicitly; PUSH and POP use (e)SP, LODSD
uses (e)SI:

        ADD sp,4
        PUSH ax       <- One clock cycle penalty

Since the designers of the Pentium are not STUPID (although they cannot
divide), a series of pushes will not generate clock penalties:
        PUSH ax
        PUSH bx
        PUSH cx      No penalties

INSTRUCTION PAIRING
-------------------
The Intel Pentium processor has 2 separate execution pipelines (the U pipe
and the V pipe), so it can (theoretically) execute two instructions per clock
cycle.

Not all instructions can be paired; no pairing is done when any of the
following conditions occurs:

1. One of the next two instructions is not pairable. (At the end of the doc
you'll find a pairing table.) In general most arithmetic instructions can be
paired.
2. The next two instructions have some register contention. In other words,
they update/use the same registers (implicitly or explicitly).
3. The two instructions are not both in the instruction cache. An exception
to this is when the first instruction is a one-byte instruction.

Unpairable instructions
1. Shift or rotate instructions with the shift count in CL
2. Long arithmetic instructions for example, MUL, DIV
3. Extended instructions for example, RET, ENTER, PUSHA, MOVS, REP STOS
4. Some floating point instructions for example FSCALE, FLDCW, FST
5. Inter-segment instructions for example, PUSH sreg, CALL far

Pairable instructions issued to U or V pipes (UV)
1. Most 8/32 bit ALU operations for example, ADD, INC, XOR
2. All 8/32 bit compare instructions for example, CMP, TEST
3. All 8/32 bit stack operations using registers: PUSH reg, POP reg

Pairable instructions issued to U pipe (PU)
These instructions must be executed in the U pipe and can be paired with a
suitable instruction in the V pipe.
1. Carry instructions for example, ADC, SBB
2. Prefixed instructions (see later on)
3. Shift with immediate
4. Some floating point instructions for example, FADD, FMUL, FLD

Pairable instructions issued to V pipe (PV)
These instructions can be executed in the U pipe or in the V pipe but they
will only be paired when executed in the V pipe.
1. Simple control transfer instructions for example; CALL near, JMP near, Jcc.
This includes both the Jcc short and Jcc near (0F prefixed)
2. The floating point instruction FXCH

The pairability of an instruction is also affected by its operands. Two
instructions are unpairable due to register usage when:
1. The first instruction updates a register the second instruction reads from
(flow dependence):
        MOV eax,8
        MOV [ebp],eax
2. Both instructions write to the same register (output dependence):
        MOV eax,8
        MOV eax,[ebp]
This limitation does not apply to a pair of instructions which both write to
the flags register (ALU instructions).

Note that two instructions in which the first reads a register the second
writes to are pairable:
        MOV eax,ebx     pairs with:     MOV ebx,[ebp]

For these rules, registers are always treated as their 32-bit variants:
        MOV al,8
        MOV ah,bl       do not pair due to rule 2 (both write to eax)
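
To make these rules a bit more concrete, here is a sketch of how reordering
can help. The code adds ecx and edx to two dwords and stores the results. The
registers and addresses are assumptions for this example, and the pipe
annotations follow the rules in this document rather than measurements:

        ; straightforward order: dependencies block most pairs
        MOV eax,[esi]       ; U  (next instruction reads eax)
        ADD eax,ecx         ; U  (next instruction reads eax)
        MOV [edi],eax       ; U
        MOV ebx,[esi+4]     ; V  pairs with the store above
        ADD ebx,edx         ; U  (next instruction reads ebx)
        MOV [edi+4],ebx     ; U

        ; interleaved order: each pair of instructions is independent
        MOV eax,[esi]       ; U
        MOV ebx,[esi+4]     ; V
        ADD eax,ecx         ; U
        ADD ebx,edx         ; V
        MOV [edi],eax       ; U
        MOV [edi+4],ebx     ; V

By these rules the first version is issued in five steps and the second in
only three. Cache misses and other effects not covered in this document can
still change the real timing.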

Special Pairs
There are some instructions that can be paired although the general rules
prohibit it. These special pairs overcome register dependency. Most of these
exceptions involve implicit reads/writes to the esp register or implicit
writes to the condition codes:

Stack Pointer:
1. PUSH reg/imm         pairs with PUSH reg/imm
2. PUSH reg/imm         pairs with CALL
3. POP reg              pairs with POP reg

Condition codes:
1. CMP                  pairs with Jcc
2. ADD                  pairs with JNE (JNZ)

Last 'LEA' note:
The LEA instruction can be executed in both the U and the V pipe. The SHL
instructions with immediate count can only be executed in the U pipe. So
replace all your SHL reg,1 with ADD reg,reg and your SHL reg,2 or SHL reg,3
with LEA. (Only when you don't get AGIs, of course.)
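
A short sketch of these replacements (the registers are arbitrary choices for
this example). Note that LEA, unlike SHL and ADD, leaves the flags untouched,
and that the forms without a base register need a 32-bit displacement in
their encoding, which makes them longer:

        ADD eax,eax         ; instead of SHL eax,1
        LEA ebx,[ebx*4]     ; instead of SHL ebx,2 (index only, long form)
        LEA ecx,[ecx*8]     ; instead of SHL ecx,3 (index only, long form)
        LEA edx,[edx+edx*2] ; edx = edx*3, base plus scaled index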


And here comes the instruction pairing table:
---------------------------------------------

Pairable in either pipe (UV):
ADD
AND
CMP
DEC
INC
LEA
MOV
NOP
OR
POP reg
PUSH reg
PUSH imm
SUB
TEST reg,reg
TEST mem,reg
TEST acc,imm
XOR

Pairable if issued to U-pipe:
ADC
SBB
RCL, RCR, ROL, ROR, SAR, SHL, SHR
    by 1
    by imm

Pairable if issued to V-pipe:
CALL near direct
Jcc
JMP near short/direct

Not pairable:
AAA
AAD
AAM
AAS
ARPL
BOUND
BSF
BSR
BSWAP
BT
BTC
BTR
BTS
CALL near indirect
CALL far
CBW/CWDE
CLC
CLD
CLI
CLTS
CMC
CMPS
CMPXCHG
CWD/CDQ
DAA
DAS
DIV
ENTER
HLT
IDIV
IMUL
INT n
INTO
INVD
INVLPG
IRET(d)
JeCXZ
JMP near indirect
JMP far
LAHF
LAR
LDS
LEAVE
LES
LFS
LGDT
LGS
LIDT
LLDT
LMSW
LODS
LOOPcc
LSL
LSS
LTR
MOV control
MOV debug
MOV segment
MOVS
MOVSX
MOVZX
MUL
NEG
NOT
POP segment
POPAd
POPFd
PUSH mem
PUSH segment
PUSHAd
PUSHFd
RCL by CL
RCR by CL
RDMSR
REP string
REPE string
REPNE string
RET(F)
ROL by CL
ROR by CL
RSM
SAHF
SAL = SHL by CL
SAR by CL
SCAS
SETcc
SGDT
SHL by CL
SHLD
SHR by CL
SHRD
SIDT
SLDT
SMSW
STC
STD
STI
STOS
STR
TEST reg,imm
TEST mem,imm
VERR
VERW
WAIT
WBINVD
WRMSR
XADD
XCHG
XLAT


Further reading
---------------

This document is based on:
  AP-500 Application Note
  Optimizations for Intel's 32-bit processors
  (C) Intel 1993
  Order number 241799-001
