Rawhed's Tutorial #3:


		333333 222222
		3   33 22  22
		    33    22
		   333   22    BPP GRAPHICS CODING
		    33  22
		3   33 22 
		333333 222222


		       [Introduction]
		       [32BPP Basics]
		       [Alpha Channel?]
		       [More 32BPP RGB]
		       [MMX Helps out]
		       [Conversion]
		       [Converting to 24BPP]
		       [Converting to 16BPP]
		       [Converting to 8BPP(mode13h)]
		       [Converting to TextMode]
		       [Closing Words]

---==[Introduction]==---------------------------------------------------------

This tutorial is based on how my current vesa/gfx engine works. I'd 
previously been doing just 16bpp graphics, and I had to code all of my 
routines for 16bpp.  When I first started with 16bpp it was quite a novelty, 
but after a while(tut2) I was getting irritated with it and wanted a more 
flexible model.  I saw demos which could run in TONS of modes like 8bpp, 
15bpp, 16bpp, 24bpp, 32bpp and even textmode!  A lot of demos could have 
their mode changed from the commandline, and I realised that this pure 16bpp 
model of mine was not so cool and very inflexible.

What this tutorial covers is a different way of coding gfx engines so that 
they can handle multiple color depths.  Basically what happens is that you 
create all your memory buffers as if they were holding 32bpp graphics, and 
all of your internal graphics code works at the 32bpp level, and then finally
when you want to flip the frame to the screen you just convert to the 
appropriate bpp level.  So you could have conversion functions to convert
between 32bpp-->16bpp  and 32bpp-->8bpp, and then you would flip that into
video memory.
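
To make the idea concrete, here is a minimal C sketch of "draw in 32bpp, convert
when you flip".  The names (convert_fn, convert_to_32, flip) are made up for this
example and are not from my engine; the point is only that the flip takes a
converter, so every other depth plugs in the same way.

```c
#include <stddef.h>
#include <stdint.h>

#define SCREEN_W 320
#define SCREEN_H 200

/* Hypothetical converter signature: reads n 32bpp pixels from src, writes
   them to dst in the target format, returns the number of bytes written. */
typedef size_t (*convert_fn)(const uint32_t *src, uint8_t *dst, size_t n);

/* 32bpp target: no real conversion, just 4 bytes per pixel, low byte first. */
static size_t convert_to_32(const uint32_t *src, uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i * 4 + 0] = (uint8_t)(src[i]);        /* blue  */
        dst[i * 4 + 1] = (uint8_t)(src[i] >> 8);   /* green */
        dst[i * 4 + 2] = (uint8_t)(src[i] >> 16);  /* red   */
        dst[i * 4 + 3] = (uint8_t)(src[i] >> 24);  /* alpha */
    }
    return n * 4;
}

/* Flip: all drawing happened in the 32bpp frame; only this last step
   knows (through conv) what the real video mode is. */
static size_t flip(const uint32_t *frame, uint8_t *vram, convert_fn conv)
{
    return conv(frame, vram, (size_t)SCREEN_W * SCREEN_H);
}
```

A 16bpp or 8bpp mode would pass a different converter to flip() and leave the
rest of the engine untouched.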

I also had a lot of trouble finding out the mode number for 32BPP modes. All 
the vesa docs I read only had up to 24BPP.  Eventually I found(from UNIVBE) 
that 320x200x32bpp is mode 146h.

I can't remember where I heard of this idea, but I do know that it's not
original.  In fact A LOT of demo groups use it.  But since I couldn't find any
tuts on it, and I thought it works very well, I wrote this tut.  So let's go
then.

---==[32BPP Basics]==---------------------------------------------------------

Although 32bpp allows way more colours than the other modes(16bpp etc) it is 
actually the easiest to code for!  15&16bpp modes are cool, but they only 
offer 32768 & 65536 colours, and they are difficult to work with because they
have the RGB values packed into them(see tut1).

The 32BPP format is easy, and of course each pixel takes up 32bits(4 bytes)
of memory.  You have to be careful because a 320x200 surface can take up a 
lot more memory than lesser modes. 

	320x200x32bpp - 256k / layer
	320x200x24bpp - 192k / layer
	320x200x16bpp - 128k / layer
	320x200x15bpp - 128k / layer     ;may as well use 16bpp huh? :)
	320x200x08bpp - 64k  / layer

So only 4 32bpp layers and you are using a MEG of memory!
Here is how the 4 bytes are structured:

	[1 byte] [1 byte] [1 byte] [1 byte]
	AAAAAAAA RRRRRRRR GGGGGGGG BBBBBBBB

8 bits for the alpha channel, 8 for red, 8 for green and 8 for blue.  As you can 
see, you have the same range of RGB colours as you do in 24bpp.  So why use 
32bpp if it's just gonna take up more memory?  Simple.  First of all it's 
faster.  Why would it be faster to read/write 4 bytes as opposed to 3?  
Basically the CPU handles R/W fastest when each access is a whole dword sitting
on a 4-byte boundary, and 3-byte pixels don't line up with that.  Also, you
don't get 24bit registers.  For example:

	;24bpp clear screen
	mov edi,[dest]
	mov eax,[color]   ;24bit color, with upper 8bits=0
	mov ecx,64000	  ;number of pixels for 320x200
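	@slowloop:
	stosw		  ;write low 2 bytes (blue, green)
	ror eax,16	  ;bring the 3rd byte (red) down into AL
	stosb		  ;write 1 byte
	ror eax,16	  ;put eax back for the next pixel
	dec ecx
	jnz @slowloop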

	;32bpp clear screen
	mov edi,[dest]
	mov eax,[color]   ;32bit color, with upper 8bits=0
	mov ecx,64000	  ;number of pixels for 320x200
	rep stosd	  ;loop writing 32bits/time

Because there is no easy way of writing 3 bytes at a time, it's much
easier to write 4 bytes.  Hence 32bpp modes :)
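
Going back to the AAAAAAAA RRRRRRRR GGGGGGGG BBBBBBBB layout above, here is a
small C sketch of packing and unpacking a pixel.  The helper names are mine and
purely illustrative; it assumes the pixel is held in one 32bit value with alpha
in the top byte and blue in the bottom byte, as in the diagram.

```c
#include <stdint.h>

/* Build one 32bpp pixel from its four 8bit parts. */
static uint32_t pack_argb(uint8_t a, uint8_t r, uint8_t g, uint8_t b)
{
    return ((uint32_t)a << 24) | ((uint32_t)r << 16) |
           ((uint32_t)g << 8)  |  (uint32_t)b;
}

/* Pull the individual channels back out again. */
static uint8_t alpha_of(uint32_t p) { return (uint8_t)(p >> 24); }
static uint8_t red_of(uint32_t p)   { return (uint8_t)(p >> 16); }
static uint8_t green_of(uint32_t p) { return (uint8_t)(p >> 8);  }
static uint8_t blue_of(uint32_t p)  { return (uint8_t)p;         }
```

Compare that with 15/16bpp, where every channel has to be shifted and masked at
odd bit positions.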

---==[Alpha Channel?]==-------------------------------------------------------

Well I must confess, at the time that I'm typing this I've never used the 
alpha channel, or really thought about what it could be used for...So I'm 
sort of gonna be making this up as I go along.  But I'm sure you can think of
groovy things to use it for.  Having an extra 8 bits on your layers/surfaces
is very handy indeed.

1]You could use it to define MANY characteristics of the surface pixels.
	Eg.
		A A A A A A A A
		7 6 5 4 3 2 1 0 bits
		| | | | | | | | 
   		| | | | | | | |____Active
		| | | | | | | 
		| | | | |_|_| 
		| | | |   |________Draw style
		| | | |  
		|_|_|_|
                   |_______________Percentage Transparent
 
	!Active(0-1) - Whether the pixel is drawn/not.
		       Useful for images with holes in them.
		       Sort of like a built-in mask.

	!Draw style(0-7) - How to draw the pixel.
			   eg, 0=normal(opaque)
			       1=additive
			       2=subtractive
			       3=multiplication
			       4=difference
			       5=transparent
			       6=?
			       7=?

	!Percentage Transparent(0-15) - How transparent the pixel is.
				      so 15=fully opaque, and 0=invisible.

  This is just an example of one way you could do things.  Although I think a
  simplified version of the above would be better for the realtime demos of 
  today.

2]You could keep things simple and just use the 8 alpha bits for doing your 
  own internal transparency etc.  This is probably what most people use it
  for.  Very handy, but not something I've done myself.
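
As a rough illustration of idea 1], this C sketch pulls the three fields out of
the alpha byte using the bit positions from the diagram (bit 0 = active, bits
1-3 = draw style, bits 4-7 = transparency level).  The struct and function names
are invented for the example, and like the layout itself this is just one
possible way to do it.

```c
#include <stdint.h>

/* The three fields packed into the 8 alpha bits (hypothetical layout). */
typedef struct {
    unsigned active;   /* bit 0:    1 = draw the pixel, 0 = skip it   */
    unsigned style;    /* bits 1-3: draw style 0-7                    */
    unsigned level;    /* bits 4-7: 15 = fully opaque, 0 = invisible  */
} alpha_info;

static alpha_info decode_alpha(uint32_t pixel)
{
    uint8_t a = (uint8_t)(pixel >> 24);   /* alpha is the top byte */
    alpha_info info;
    info.active = a & 1u;
    info.style  = (a >> 1) & 7u;
    info.level  = (a >> 4) & 15u;
    return info;
}
```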

---==[More 32BPP RGB]==-------------------------------------------------------

Ok, so now you know the format etc.  Now to show you some nice things.  Want
to blend 2 RGB pixels together 50/50?  Sure, easy - not like 16bpp.

	;averaging 2 32bit colors: (col1/2)+(col2/2) for every byte
	;the low bit of every byte is masked off first so that the shift
	;can't leak it into the byte below
	mov eax,[col1]
	mov ebx,[col2]
	and eax,11111110_11111110_11111110_11111110b
	and ebx,11111110_11111110_11111110_11111110b
	shr eax,1
	shr ebx,1
	add eax,ebx
	mov [edi],eax
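
The same mask-and-shift trick written as a C sketch (the function name is just
for illustration).  Masking with FEFEFEFEh clears the low bit of each byte, so
halving the whole dword halves all four channels at once without one channel
bleeding into its neighbour.

```c
#include <stdint.h>

/* 50/50 blend of two 32bpp pixels, all four bytes at once.
   Each channel comes out as (c1/2)+(c2/2), rounded down. */
static uint32_t average_argb(uint32_t c1, uint32_t c2)
{
    return ((c1 & 0xFEFEFEFEu) >> 1) + ((c2 & 0xFEFEFEFEu) >> 1);
}
```

Because each half is at most 127, the per-byte sums never exceed 254, so no
carry can spill into the next channel.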

A very nice trick that I found was with MMX instructions.  They have something
which I found perfect for 32BPP functions.  I'm not about to write an MMX
tutorial :) so go and read another doc for that, but I want to introduce one
MMX feature in particular.  Saturated arithmetic.

Lets take a simple additive surface loop.  Here you have 2 320x200x32bpp
surfaces, both with pictures on them and you want to add them together.  
Eg:

	//pseudo code
	long col1,col2,colf;
	col1=memget(blah);    		//32bit
	col2=memget(blah2);   		//32bit
	colf.r=col1.r+col2.r;
	colf.g=col1.g+col2.g;
	colf.b=col1.b+col2.b;
					//but now instead of dividing by 2 as
					//we do for transparency, we clip
					// to 255;
	if (colf.r>255) colf.r=255;
	if (colf.g>255) colf.g=255;
	if (colf.b>255) colf.b=255;
	memput(blah3)=colf;

Doing that for every pixel would be VERY slow yes?  Even doing that in normal
ASM would be slowish.  But MMX can make it easier.  I use NASM, you should too :)

---==[MMX Helps out]==--------------------------------------------------------

Saturated instructions are instructions which clip instead of wrapping around.
Normally if you added 250+20 in a byte value(say AL), at the end AL would = 14
(270-256).  What MMX's saturated adds do is clip the result instead.  So when
you do a saturated MMX add, 250+20 = 255.  Funky eh?  MMX works with 8 mmx
registers(MM0-MM7), each of which is a 64bit register.  So you can store 2
32BPP pixels in each register!  This is VERY cool because it means that using
1 instruction you can additively add 2 pairs of pixels.  

Two MMX instructions which I have found handy are: PADDUSB & PSUBUSB
	PADDUSB - Saturated ADD, unsigned, saturated at the byte level.
	PSUBUSB - Saturated SUB, unsigned, saturated at the byte level.

Here are 2 MMX registers(64bits each) filled with 2 pixels each:

     [-------------------------------64 BITS-------------------------------]
     [-------------32 BITS-------------] [-------------32 BITS-------------]
     [----16 BITS----] [----16 BITS----] [----16 BITS----] [----16 BITS----]
     [8 BITS] [8 BITS] [8 BITS] [8 BITS] [8 BITS] [8 BITS] [8 BITS] [8 BITS]
MM0: AAAAAAAA RRRRRRRR GGGGGGGG BBBBBBBB AAAAAAAA RRRRRRRR GGGGGGGG BBBBBBBB   
MM1: AAAAAAAA RRRRRRRR GGGGGGGG BBBBBBBB AAAAAAAA RRRRRRRR GGGGGGGG BBBBBBBB   

The MMX instruction: PADDUSB MM0,MM1 basically adds each 8bit segment, and
clips the addition to 255.  Same with PSUBUSB MM0,MM1 except that it clips it
to 0. Here is how we could use this in a complete function.  This function
does the same as the above pseudo code, but MUCH quicker.

	;ASM 32bpp MMX adding
	mov edi,[dest]
	mov esi,[src]
        mov ecx,32000
        @MMX_layeraddloop:
        	movq MM0,[edi]		;Move QUAD(64bits)
	        movq MM1,[esi]		;Move QUAD(64bits)
	        paddusb MM0,MM1		;Saturated Add
	        movq [edi],MM0          ;Store the result into dest
	        add esi,8
	        add edi,8
	        dec ecx
        jnz @MMX_layeraddloop
        EMMS				;Must always do this after a block of 
					;MMX instructions

You won't believe how fast this is until you try it.
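
If you want to check your MMX output against something slow but obvious, here
is a plain C sketch that does what PADDUSB does, one byte at a time.  It is a
reference version only (no MMX, nothing like as fast), and the function names
are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Add two 32bpp pixels byte by byte, clipping each byte at 255.
   Note that the alpha byte gets added and clipped too, same as PADDUSB. */
static uint32_t add_saturate_argb(uint32_t c1, uint32_t c2)
{
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t sum = ((c1 >> shift) & 0xFFu) + ((c2 >> shift) & 0xFFu);
        if (sum > 255u)
            sum = 255u;            /* saturate instead of wrapping */
        out |= sum << shift;
    }
    return out;
}

/* dest = dest + src for a whole layer of n pixels. */
static void add_layers(uint32_t *dest, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dest[i] = add_saturate_argb(dest[i], src[i]);
}
```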

---==[Conversion]==-----------------------------------------------------------

Ok, so you've written a groovy internal 32bpp gfx library.  Complete with
texture-mapped four dimensional splines and beautiful particle algorithms.
Now what?  Well you have to copy your buffer into videomemory so that it can
be seen :)  The nice thing is that the viewer doesn't have to have a videocard
that can handle 32BPP.  You can convert the image in the buffer to the 
appropriate format and then flip.  Eg:

	if (vmode==_32bit) FLIPtoSCREEN32_(final.addr);
			   else
 	if (vmode==_text) {
	                   convtxt_(final.addr,buffery.addr);
	                   FLIPtoSCREENtxt_(buffery.addr);
	                  } else
	if (vmode==_8bit) {
	                   conv8_(final.addr,buffery.addr);
	                   FLIPtoSCREEN8_(buffery.addr);
	                  } else
	if (vmode==_16bit) {
	                    conv16_(final.addr,buffery.addr);
	                    FLIPtoSCREEN16_(buffery.addr);
	                   }

A nice feature that I've added to my demo (which I'm busy writing) is that
you can change videomodes while running the demo by pressing F1-F4.  I 
thought this was quite a groovy idea :)

Before I actually sat down to code my 32BPP engine, I thought it would be very
slow to convert all the time.  I mean one fullscreen color conversion MUST be
slow.  But it's not that bad :) Why not? Ok, let's take the videomodes from the
above code:

	1]  32BPP - no conversion needed.
		    Just a 256k flip.
	2]  16BPP - conversion needed.
		    But just then a 128k flip.
	3]  8BPP  - conversion needed.
		    But just then a 64k flip.
	4]  text  - conversion needed.
		    But just then an 8k flip (80x50 cells, 2 bytes each).

As you can see, even though you have to convert, the amount of data you have
to push to the video card becomes less, so it sort of compensates :)  And
besides, the conversion routine ISN'T that costly. I actually love figuring
out new ways(and faster ways) to convert between different pixel formats.  
It's FUN :)  Below are the algorithms that I use.  If you use them please 
credit me and send me a little email ;).  I don't claim that they are the best
or anything, and if you can see kewl ways to improve them please give me a 
shout.

---==[Converting to 24BPP]==--------------------------------------------------

Well, this should be very easy :) Just chop off the ALPHA channel.  So I'll 
leave this one up to you :)
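
If you want a starting point anyway, here is a minimal C sketch.  It assumes the
24bpp mode wants its 3 bytes per pixel in blue, green, red order in memory, which
is an assumption you should check against the mode info of your card; conv24 is
a made-up name in the style of the routines below.

```c
#include <stddef.h>
#include <stdint.h>

/* 32BPP->24BPP: copy the 3 colour bytes of each pixel, drop the alpha byte.
   Assumed output byte order: blue, green, red. */
static void conv24(const uint32_t *src, uint8_t *dest, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t p = src[i];
        dest[i * 3 + 0] = (uint8_t)p;          /* blue  */
        dest[i * 3 + 1] = (uint8_t)(p >> 8);   /* green */
        dest[i * 3 + 2] = (uint8_t)(p >> 16);  /* red   */
    }
}
```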

---==[Converting to 16BPP]==--------------------------------------------------

Have fun trying to come up with your own methods :)  I think PTC has some
nice conversion routines, although I have yet to check them out.

	;32BPP->16BPP conversion(320x200)
	proc conv16_ src,dest:dword
	    pushad
	    push edi
	    push esi
	    mov edi,[dest]
	    mov esi,[src]
	    mov ecx,64000
	    @conv16_loop:
	        mov eax,[esi]
	        and eax,00000000111110001111110011111000b
	        shr ah,2
	        shr ax,3
	        ror eax,8
	        add al,ah
	        rol eax,8
	        stosw
	        add esi,4
	    dec ecx
	    jnz @conv16_loop
	    pop esi
	    pop edi
	    popad
	    ret
	endp    conv16_
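
For comparison, the same per-pixel conversion as a C sketch.  It keeps the top
5 bits of red, 6 of green and 5 of blue and packs them as RRRRRGGG GGGBBBBB,
which is what the asm above ends up with in AX.  The function name is only for
the example.

```c
#include <stdint.h>

/* 32BPP->16BPP (5-6-5): keep the top bits of each channel, drop alpha. */
static uint16_t to_rgb565(uint32_t p)
{
    uint32_t r = (p >> 19) & 0x1Fu;   /* top 5 bits of red   */
    uint32_t g = (p >> 10) & 0x3Fu;   /* top 6 bits of green */
    uint32_t b = (p >> 3)  & 0x1Fu;   /* top 5 bits of blue  */
    return (uint16_t)((r << 11) | (g << 5) | b);
}
```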

---==[Converting to 8BPP(mode13h)]==------------------------------------------

Have fun trying to come up with your own methods :)  I think PTC has some
nice conversion routines, although I have yet to check them out. This 
function doesn't take into account the palette.  In fact, all it does is assume
you've set your palette to go from 0(black) to 255(white), and then finds the
approximate brightness of the RGB values and uses that.  I know it's lame :),
but I've seen other demos doing the same thing.  Oh well, I'm sure I'll write
a color palette version very soon, as I've only adopted this 32BPP internal
mode about 2 weeks ago.

	;32BPP->8BPP conversion(320x200)
	proc conv8_ src,dest:dword
	    pushad
	    push edi
	    push esi
	    mov edi,[dest]
	    mov esi,[src]
	    mov ecx,64000
	    @conv8_loop:
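	        mov ebx,[esi]
	        mov eax,ebx
	        shr ebx,16
	        and ebx,255	;ebx = red
	        and eax,255	;eax = blue
	        add eax,ebx
	        mov ebx,[esi]	;reload, the AND above wiped out green
	        shr ebx,8
	        and ebx,255	;ebx = green
	        add eax,ebx
	        shr eax,2	;al = approx brightness (R+G+B)/4, 0-191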
	        stosb
	        add esi,4
	    dec ecx
	    jnz @conv8_loop
	    pop esi
	    pop edi
	    popad
	    ret
	endp    conv8_
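
The brightness idea boiled down to a C sketch: add up red, green and blue and
divide by 4 (a shift, so it's cheap).  Dividing by 4 instead of 3 means pure
white comes out as 191 rather than 255, so this is an approximation only.  The
function name is just for illustration.

```c
#include <stdint.h>

/* 32BPP->8BPP greyscale index, assuming a black-to-white palette. */
static uint8_t to_grey8(uint32_t p)
{
    uint32_t r = (p >> 16) & 0xFFu;
    uint32_t g = (p >> 8)  & 0xFFu;
    uint32_t b =  p        & 0xFFu;
    return (uint8_t)((r + g + b) >> 2);   /* (R+G+B)/4, range 0-191 */
}
```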

---==[Converting to TextMode]==-----------------------------------------------

Hmmm, this was hard :)  hehe, it's amazing that with these graphics modes, it 
seems to get easier with the more colors you can have.  I mean 32BPP is dead
easy to code, 16BPP is harder, and textmode is quite a mission :)  This is a 
VERY simple hack, and if you can make a better one, please let me know all 
about it.  This one just writes character #176, #177, #178, #219 to the
screen depending on the brightness of the RGB value.  And it also selects the
color(0-15) based on the "brightness" of the RGB value.  So it assumes that
your palette goes from dark-->bright.  Unfortunately I haven't made it do
funky things like realtime change the palette or search for the best color.
I'll probably do this soon.  This is basically just a test:

	;32BPP->Textmode conversion(80x50)
	proc convtxt_ src,dest:dword
	    pushad
	    push edi
	    push esi
	    mov edi,[dest]
	    mov esi,[src]
	    mov edx,50
	    @convtxt_loopy:
	        mov ecx,80
	        @convtxt_loopx:
	        mov ebx,[esi]
	        mov eax,ebx
	        shr ebx,16
	        and ebx,255	;ebx = red
	        and eax,255	;eax = blue
	        add eax,ebx
	        mov ebx,[esi]	;reload, the AND above wiped out green
	        shr ebx,8
	        and ebx,255	;ebx = green
	        add eax,ebx
	        shr eax,2	;al = approx brightness (R+G+B)/4, 0-191
	        mov ah,al

	        ;pick the character from the brightness (unsigned compares,
	        ;because the brightness can go above 127)
	        mov bl,0
	        cmp al,0
	        je @ascout	;pure black -> blank
	        mov bl,176
	        cmp al,48
	        jb @ascout
	        mov bl,177
	        cmp al,96
	        jb @ascout
	        mov bl,178
	        cmp al,144
	        jb @ascout
	        mov bl,219
	        @ascout:

	        shr ah,4	;attribute = brightness/16
	        mov al,bl	;character
	        stosw		;write character+attribute
	        add esi,16	;skip 4 pixels across
	        dec ecx
        	jnz @convtxt_loopx
	    add esi,3840
	    dec edx
	    jnz @convtxt_loopy
	    pop esi
	    pop edi
	    popad
	    ret
	endp    convtxt_
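
Here is the same hack as a C sketch, in case the asm is hard to follow.  It
samples every 4th pixel of every 4th row of a 320x200 buffer, turns it into a
brightness with (R+G+B)/4, and builds one 16bit text cell: character in the low
byte, attribute in the high byte.  The function names are invented for the
example, and the thresholds are the ones used above.

```c
#include <stdint.h>

/* One 32bpp pixel -> one text cell (low byte = char, high byte = attribute). */
static uint16_t to_text_cell(uint32_t p)
{
    unsigned level = (((p >> 16) & 0xFFu) + ((p >> 8) & 0xFFu) + (p & 0xFFu)) >> 2;
    unsigned ch;

    if (level == 0)        ch = 0;     /* pure black: blank   */
    else if (level < 48)   ch = 176;   /* light shade         */
    else if (level < 96)   ch = 177;   /* medium shade        */
    else if (level < 144)  ch = 178;   /* dark shade          */
    else                   ch = 219;   /* solid block         */

    unsigned attr = level >> 4;        /* colour = brightness/16 */
    return (uint16_t)((attr << 8) | ch);
}

/* 320x200x32bpp -> 80x50 text cells: every 4th pixel of every 4th row. */
static void convtxt(const uint32_t *src, uint16_t *dest)
{
    for (int y = 0; y < 50; y++)
        for (int x = 0; x < 80; x++)
            dest[y * 80 + x] = to_text_cell(src[(y * 4) * 320 + x * 4]);
}
```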

---==[Closing Words]==--------------------------------------------------------

Phew :)  I really hope this helps some people out there, in some way or 
another.  Please send me any thoughts/ideas/improvements on this topic, I'd 
really like to hear/see them.  The scene is wonderful, long live the scene. 
When I die I want to go to a scene heaven ;)

-Rawhed/Sensory Overload
-Mailto:andrew@overload.co.za
-Http://www.overload.co.za
-Andrew Griffiths
-South Africa
-05-07-1999