			  }{ .\\ I) 


			     ---*[ Coding Tips ]*---
			       ---*[ Part 0 ]*---

				       by

			    Black Label / hemOrOids Pc


	NOT a lame ASM course ...
	BEGINNERS !  just skip this textfile ...
	ADVANCED CODERS !  Read this and ACT !!!


Here are some easy tips you can use to get more speed out of some routines :
just replace some instructions with newer ones...


1 - The right use of the "LEA" instruction
    --------------------------------------

	mov eax, ebx
	inc eax
should be :
	lea eax, [ebx+1]

and more generally :

	mov eax, ebx
	add eax, VALUE
should be :
	lea eax, [ebx+VALUE]

"LEA" - Calculating video offsets
---------------------------------

When working in a 320x200 mode, use this to get the offset of the start of
line Y in video memory :

	[ebx = Y coordinate]

	lea bx, [ebx+ebx*4]	; BX = Y*5
	shl bx, 6		; BX = (Y*5)*64 = Y*320

Now you've got BX = Y * 320.
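
For those who want to check the arithmetic without an assembler, here is a
small C model of the two instructions above. It is only a sketch: the name
row_offset_320 is made up for this example, and the 200-line limit is the one
of the 320x200 mode mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* Model of: lea bx,[ebx+ebx*4] / shl bx,6  (16-bit result, like BX). */
uint16_t row_offset_320(uint16_t y)
{
    assert(y < 200);                      /* a 320x200 mode has 200 lines */
    uint16_t y5 = (uint16_t)(y + y * 4u); /* LEA: Y*5 */
    return (uint16_t)(y5 << 6);           /* SHL 6: (Y*5)*64 = Y*320 */
}
```

The last line (Y = 199) gives 63680, which still fits in 16 bits, so BX never
overflows in this mode.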


2 - The right use of the "SHR"/"SHL" instructions
    ---------------------------------------------

Both "SHR" and "SHL" have a special, shorter encoding for a shift by one
single bit...
Ex.:	SHR AX, 3 => C1h, E8h, 03h
	SHR AX, 1 => D1h, E8h

But unfortunately, on the 486 the "SHR AX, 1" form is slower than "SHR AX, 3"
(3 clocks against 2 in the Intel 486 timing tables), and both TASM and MASM
will assemble "SHR AX, 1" as D1h,E8h, the slower form !

	So, assemble the best opcode yourself in the source code :

	shr ax, 1
should be :
	db 0C1h, 0E8h, 1	; SHR AX, 1 in the "immediate count" form

It does the same, but faster !!!
(also true for SAR, ROL, ROR, etc...)
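
If you don't want to look up the bytes by hand each time, the "C1h" form is
easy to build. The C sketch below is only an illustration (the enum and
function names are invented here); it assumes 16bit code, as in the byte
sequences of this text, so no operand-size prefix is emitted.

```c
#include <assert.h>
#include <stdint.h>

/* Opcode extensions of the shift/rotate group (reg field of ModRM). */
enum shift_op { OP_ROL = 0, OP_ROR = 1, OP_RCL = 2, OP_RCR = 3,
                OP_SHL = 4, OP_SHR = 5, OP_SAR = 7 };

/* 16-bit register numbers as encoded in ModRM. */
enum reg16 { R_AX = 0, R_CX = 1, R_DX = 2, R_BX = 3,
             R_SP = 4, R_BP = 5, R_SI = 6, R_DI = 7 };

/* Writes the 3-byte "C1 /r ib" form of <op> reg16, imm8 into out[]. */
void encode_shift_imm8(enum shift_op op, enum reg16 reg,
                       uint8_t count, uint8_t out[3])
{
    assert((unsigned)op <= 7u && (unsigned)reg <= 7u);
    out[0] = 0xC1;                                          /* shift r/m16, imm8 */
    out[1] = (uint8_t)(0xC0u | ((unsigned)op << 3) | (unsigned)reg); /* mod=11 */
    out[2] = count;                                         /* the shift count */
}
```

With OP_SHR and R_AX and a count of 1 it gives C1h, E8h, 01h, the same bytes
as the "db" line above.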

BTW : Here's something I found very often in some sources :

	shl bx, 1
should be :
	add bx, bx	; so simple .... hehe .... but who knows ... ???

3 - The right use of the "ADC" instruction
    --------------------------------------

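The "ADC" instruction is sometimes used in fixed point routines
(the incremental way).

Ex.: when incrementing 2 registers with the fixed point method (linedrawing...)

(SI.BL = SI.BL + BP.BH	; integer part in SI, fraction in BL)
(DI.CL = DI.CL + DX.CH	; integer part in DI, fraction in CL)

	add bl, bh	; add the fractions, Carry set if they overflow
	adc si, bp	; add the integer parts plus that Carry
	add cl, ch
	adc di, dx

A better way to use "ADC" when you've got many simultaneous increments on
16bit registers (16bit fixed point, 16bit value) :

The upper halves of the 32bit registers are otherwise unused, so each one can
hold the 16bit fraction of the NEXT value in the chain.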

	add eax, ebx
	adc ecx, edx
	adc esi, edi
	adc ebp, ...  ; etc ...

The first "ADD" computes both :
	- the integer part of the AX increment
	- the fixed point part of the CX increment
The first "ADC" computes both :
	- the integer part of the CX increment
	- the fixed point part of the SI increment
The second "ADC" computes both :
	- the integer part of the SI increment
	- the fixed point part of the BP increment

		and so on ...

So, you've got 1 "ADD" for the first addition and then only 1 "ADC" for each
further increment, instead of an "ADD"/"ADC" pair.
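
To see why the chain works, here is a small C model of the first two
instructions (add eax,ebx / adc ecx,edx). It is a sketch under my own
assumptions: the struct and function names are invented, and the layout
(fraction of the CX value in the upper half of EAX) is my reading of the
description above. The assert marks one limit of the trick: if the integer
part in AX overflows 16 bits, it spills into the fraction stored above it.

```c
#include <assert.h>
#include <stdint.h>

/* r0 models EAX = [ fraction of B : integer of A ]
   r1 models ECX = [ unused here   : integer of B ] */
struct packed { uint32_t r0, r1; };

/* One step of: add eax,ebx / adc ecx,edx */
void step(struct packed *p, uint32_t inc0, uint32_t inc1)
{
    uint32_t old = p->r0;
    /* A's integer part must not overflow into B's fraction. */
    assert((old & 0xFFFFu) + (inc0 & 0xFFFFu) <= 0xFFFFu);
    p->r0 += inc0;                    /* ADD eax, ebx */
    uint32_t carry = p->r0 < old;     /* carry out of bit 31 */
    p->r1 += inc1 + carry;            /* ADC ecx, edx */
}

/* Runs n steps and returns B's integer part (the low word of ECX). */
uint16_t run_b(uint16_t b_int, uint16_t b_step_int,
               uint16_t b_step_frac, unsigned n)
{
    struct packed p = { 0, b_int };               /* A = 0, B fraction = 0 */
    uint32_t inc0 = (uint32_t)b_step_frac << 16;  /* A's own step is 0 here */
    while (n--)
        step(&p, inc0, b_step_int);
    return (uint16_t)p.r1;
}
```

For example, starting B at 10 with a step of 1.5 (integer 1, fraction 8000h),
four steps bring its integer part to 16, as a normal 16.16 addition would.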


4 - The right use of the "ROR" instruction
    --------------------------------------

Let's say you've got AX as loop counter, and you want to poke all values "word
by word" instead of "byte by byte" (and you're right! :))

	ror ax, 1	; AL = AX / 2, old bit 0 ("Carry") goes to bit 7 of AH
	; ( better if assembled 0C1h, 0C8h, 01h, see the "SHR" section ! )
	; ( needs 2 <= AX < 512 : AL must be 1..255, AH holds only that bit )
	MyLoop:
		"WORK_WITH_WORD"	; It works with 2 pixels at a time

		dec al
	jnz MyLoop
	or ah, ah	; "Carry" set ???  One more pixel left ???
	jz NotLast

		"WORK_WITH_BYTE"
	NotLast:
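
The rotate trick is easier to follow in C. The sketch below uses invented
names (ror16_by1, split_count) and models only the counter handling, not the
pixel work. The 2..511 range in the assert is my own reading of the loop: AL
is an 8bit counter, and "dec al / jnz" would run 256 times if AL started at 0.

```c
#include <assert.h>
#include <stdint.h>

/* Model of "ror ax,1": bit 0 rotates around into bit 15. */
uint16_t ror16_by1(uint16_t v)
{
    return (uint16_t)((v >> 1) | ((uint32_t)v << 15));
}

/* Splits a pixel count into word iterations (AL) and an odd flag (AH). */
void split_count(uint16_t count, uint8_t *words, int *odd)
{
    assert(count >= 2 && count < 512); /* AL in 1..255, AH = flag only */
    uint16_t ax = ror16_by1(count);
    *words = (uint8_t)ax;              /* AL = count / 2 */
    *odd   = (ax >> 8) != 0;           /* "or ah,ah": set when bit 0 was set */
}
```

A count of 7 gives 3 word iterations plus one leftover byte; a count of 8
gives 4 word iterations and no leftover.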

It looks like the "Carry" method used for vectorfilling :
( Never use "JC" or "JNC" unless it's necessary. Condit. Jumps are slow ! )

	[ CX = Nb of pixels ]
	[ AL = AH = Color ]
	[ DI = Video offset ]

	shr cx, 1	; CX / 2 for "word" poking, odd pixel goes to Carry
	; ( assemble it : 0C1h, 0E9h, 1 !!! )
	jz NoRep
	rep stosw	; poke ! (and CX set to 0, flags untouched)
	NoRep:
	adc cl, cl	; CX = 1 if carry was set, else 0.
	rep stosb	; do the last pixel or not.
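
The same fill written as a C sketch, for those who want to trace it step by
step. The function name fill_span is made up; memcpy is used for the 16bit
store only because it is the safe way to write a word at any address in C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fills n pixels with one color: n/2 word stores, then 0 or 1 byte store. */
void fill_span(uint8_t *dst, uint8_t color, size_t n)
{
    assert(dst != NULL || n == 0);
    uint16_t pair = (uint16_t)(color | (color << 8)); /* AL = AH = color */
    size_t words = n >> 1;                  /* SHR CX,1 */
    size_t odd   = n & 1;                   /* the bit SHR moved into Carry */
    for (size_t i = 0; i < words; ++i) {    /* REP STOSW */
        memcpy(dst, &pair, sizeof pair);    /* one 16-bit store */
        dst += 2;
    }
    for (size_t i = 0; i < odd; ++i)        /* ADC CL,CL / REP STOSB */
        *dst++ = color;
}
```

Note that the second loop runs 0 or 1 time and needs no test of its own, just
like the "rep stosb" with CX = 0 or 1.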

5 - About "Word" poking
    -------------------

Never poke your data as words unless you're sure the address is a multiple
of 2. To do this :

	test di, 1	; odd address ?
	jz IsOk
		DO_BYTE	; poke one byte first, DI becomes even
	IsOk:
		DO_WORD

If you don't do this, you'll get a lot of penalties !!!
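
Put together with the odd-pixel handling of the previous section, a run of
pixels splits into three parts. The C sketch below only computes the split;
the names are invented and the tail part is my addition, not something the
snippet above does by itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 1 if a single byte must be written first to make DI even, else 0
   (models "test di,1 / jz IsOk"). */
size_t leading_bytes(uint16_t di, size_t n)
{
    return ((di & 1u) && n > 0) ? 1 : 0;
}

/* Splits n pixels at offset di into head byte, aligned words, tail byte. */
void split_run(uint16_t di, size_t n,
               size_t *head, size_t *words, size_t *tail)
{
    assert(head && words && tail);
    *head  = leading_bytes(di, n);
    *words = (n - *head) >> 1;     /* word stores, all on even offsets */
    *tail  = (n - *head) & 1u;     /* leftover pixel, if any */
}
```

For 6 pixels at offset 101 this gives 1 byte, 2 words and 1 byte; at offset
100 it gives 3 words and nothing else.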

6 - "LOOP" vs "DEC/JNZ"
    -------------------

As anyone should know, the "LOOP" instruction should *NEVER* be used in *ANY*
speed critical loop on 486 and Pentiums. Decrementing CX (or CL) and then
testing the Zero Flag is faster there than the classical LOOP.

So, replace all your "LOOP" with "DEC CX/JNZ toto" (as long as you're sure
CX <> 0 on entry, and that nothing after the loop needs the flags, which
"LOOP" leaves alone but "DEC" changes)

7 - The truth about instructions interlacing
    ----------------------------------------

Some people say that interlacing instructions referring to the same registers,
memory references or simply CPU flags with unrelated ones is a good way to
enhance performance on 486 and Pentiums...

Something like :

	add ax, bx
	mov ds:[Toto], ax
	inc bx
should be :
	add ax, bx
	inc bx
	mov ds:[Toto], ax

But I think that it can't always be true, for some simple reasons :

	- If the CPU doesn't have the AX register correctly updated for the
	next instruction, why would the CPU try to run it ????

	- How can the CPU be sure that all registers and flags are ready ????

	- Why does the CPU never hang when a register hasn't been updated ???

 Anyway, we can consider this *true* for Video poking :
When accessing the video memory, you've got some cycles left between two video
accesses, so you can put some register instructions between them :
(different nb of cycles on different video cards, of course)

	stosb		; First video access
	add si, cx
	mov al, ds:[si]	; RAM access
	mov bl, dl
	adc bl, bh
	stosb		; Second video access

should be :

	stosb		; First video access
	mov bl, dl
	adc bl, bh	; careful : Carry now comes from before "add si, cx" !
	add si, cx
	mov al, ds:[si]	; RAM access
	stosb		; Second video access

 'coz the RAM access is placed after all the register instructions.

	Euh .... What should I say about this ... Euh ...

			Euh ...



Ah , yes ! , one more thing you gotta know :

			" JUST TEST AND YOU'LL SEE !!!  :) "


	Black Label / HEMOROIDS Pc '95 - Main Coder

				"Hi" to all french dudes I've met at TP4 !!!
