                                          __
                   _ ___         _ ___   |  |__
                  | '   |       | '_  |  |    _|
            npk:  |__|__|ibble  |   __|ac|__|__|er v0.1
                                |__|

Original concept by Tony Haines (a.s.haines@bham.ac.uk)
Implementation by baah/Arm'Z TECh (abrobecker@yahoo.com)


Usage
~~~~~
"npk" is a freeware packer devoted to small sized programs written in ARM
assembly. The programs to compress must be Absolute files, and the packed
program will be untyped. To install npk, copy it into your library directory
(or into the CSD), then type at the CLI:

            *npk MyProg MyProgPk

npk will then print some information about the compression process, and
once it has finished your directory will contain the file "MyProgPk",
which hopefully will be smaller than the original! ;)


Credits & Ratings
~~~~~~~~~~~~~~~~~
SHOCKING NEWS: Pervect/Topix is currently (2000jul09) working on a better
version of CodePressor that is said to beat npk for files >512 bytes.
But ArmOric is trying to reduce the size of npk's depacker, and might
well succeed in getting it down to 80 bytes or so!

The original and cool concept was devised by Tony for his "flu" demo, an 8Kb
entry for the CodeCraft #2 coding contest. Then Eli-Jean Leyssens (Pervect)
made a 2Kb implementation called CodePressor, which competed in the tool
category of CodeCraft #2. Eli-Jean's implementation is quite complex, so I
decided to make a very basic version, and spent some days reducing the size
of the depacking routine. The result is a 96-byte depacker (NINETY SIX
BYTES! =), probably one of the smallest depackers ever, and here are some
compression ratios just to show how it behaves:

           | size |     npk      | cprs | sqsh |  zip
-----------+------+--------------+------+------+-----
CUNT       | 2536 | 1858 (73.2%) | 1696 |      | 1623
WaterFrac  | 1880 | 1347 (71.6%) | 1420 |      | 1392  PT/DFI
plasma     | 1856 | 1370 (73.8%) | 1208 |      |  745  Mr Hill/IceBird
MarsFace   | 1364 |  927 (67.9%) |      |  937 |       eXoTiCorn/IceBird
Rays       | 1168 |  924 (79.1%) |  936 | 1022 |  892  Tom/Kulture
ItsDemoTim | 1134 |  930 (82.0%) |  928 | 1016 |  920  ArmOric/Arm'Z TECh
HappyRGB   | 1116 |  977 (87.5%) |      | 1022 |       Pervect/Topix
CountDown  | 1112 |  946 (85.0%) |      | 1014 |       Pervect/Topix
paranoid   | 1095 |  878 (80.1%) |  904 |  982 |  813
NZCVdemo   | 1094 |  888 (81.1%) |  900 | 1005 |  848
3dStars    | 1024 |  865 (84.4%) |  888 |      |  843
Invaders   | 1024 |  865 (84.4%) |  872 |      |  794  eXoTiCorn/IceBird
Pacman     | 1024 |  871 (85.0%) |  880 |      |  832  eXoTiCorn/IceBird
Doom1k     |  892 |  755 (84.6%) |  796 |      |  720  Mr Hill/IceBird
sqrt2      |  780 |  598 (76.6%) |  628 |      |  532
ShaDots    |  764 |  639 (83.6%) |  676 |      |  554
wimplife   |  732 |  570 (77.8%) |  592 |      |  467
zOArc      |  724 |  613 (84.6%) |  648 |      |  503
SIC        |  504 |  429 (85.1%) |  488 |      |  356
ArmOricTV  |  468 |  421 (89.9%) |  468 |      |  338  ArmOric/Arm'Z TECh

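Sizes are in bytes, and the percentage is the npk packed size relative to
the original size. "cprs" is CodePressor v0.03 and "sqsh" is squasher v0.05.
As you can see, npk gives approximately an 80% compression ratio for
programs around 1Kb. CodePressor v0.03 gives better compression on bigger
files, and both packers seem to give similar results around 1200 bytes. So
for programs of about that size, test both packers and keep the best result.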

Both packers beat squasher v0.05 by Eli-Jean, and unlike squashed programs
the decompression routine will work on RiscOS 2 without any external module
(squasher needs Squash by Acorn Computers to work). Squasher is dead!


The Algorithm
~~~~~~~~~~~~~
We start compressing from the top of the executable down to &8000. For each
long we look at the 15 longs that follow it and pick the one that has the
biggest number of identical nibbles. If the number of identical nibbles is
big enough to allow compression, we save a nibble containing the distance
to the best matching long (so the distance is 1 to 15 longs), a byte
containing flags that tell which nibbles are identical (the nibbles' mask),
and then all the nibbles that were not matched. If the number of identical
nibbles doesn't allow a gain, we save a nibble containing 0 followed by the
8 nibbles of the long.
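
The packing step described above could be sketched in Python as follows.
This is only an illustration, not npk's actual code: the function names,
the "min_match" threshold of 4, the lowest-nibble-first order and the
mask bit order (bit k set means nibble k is matched) are all assumptions.

```python
def nibbles(word):
    """Split a 32-bit long into 8 nibbles, lowest nibble first (assumed order)."""
    return [(word >> (4 * k)) & 0xF for k in range(8)]


def best_match(words, i, max_dist=15):
    """Return (distance, mask, count) for the best match among the longs after words[i]."""
    target = nibbles(words[i])
    best = (0, 0, 0)
    for dist in range(1, max_dist + 1):
        if i + dist >= len(words):
            break  # nothing above the top of the program
        cand = nibbles(words[i + dist])
        mask = sum(1 << k for k in range(8) if cand[k] == target[k])
        count = bin(mask).count("1")
        if count > best[2]:
            best = (dist, mask, count)
    return best


def pack(words, min_match=4):
    """Pack longs from the top down; returns (masks, nibs) as two separate streams."""
    masks, nibs = [], []
    for i in range(len(words) - 1, -1, -1):
        dist, mask, count = best_match(words, i)
        target = nibbles(words[i])
        if count >= min_match:
            nibs.append(dist)      # distance nibble, 1 to 15
            masks.append(mask)     # which nibbles are identical
            nibs.extend(target[k] for k in range(8) if not (mask >> k) & 1)
        else:
            nibs.append(0)         # 0 means "no match"
            nibs.extend(target)    # followed by the 8 raw nibbles
    return masks, nibs
```

For instance, pack([&E3A00001, &E3A01002, &E1A0F00E]) stores the top long
unmatched and matches the two others against the long just above them.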

So if the best matching long has N nibbles identical to the long we are
currently trying to compress, we'll save 4+8+4*(8-N) bits to store a
matched long, or 4+8*4 bits to store an unmatched one. If you try this for
all possible values of N in [0;8] you'll see that we gain nibbles as soon as
we have 4 or more matching nibbles, break even with exactly 3, and lose
nibbles when we have only 0, 1 or 2. Luckily, ARM instructions often have
matching nibbles.
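
If you don't want to do the sums by hand, here is a small Python sketch
that tabulates the cost for every N. The names "packed_bits",
"UNMATCHED_BITS" and "RAW_BITS" are made up for the example; only the two
formulas come from the text above.

```python
RAW_BITS = 32                 # one long, not packed at all
UNMATCHED_BITS = 4 + 8 * 4    # zero nibble followed by the 8 raw nibbles


def packed_bits(n_matching):
    """Bits for a matched long: distance nibble + mask byte + unmatched nibbles."""
    return 4 + 8 + 4 * (8 - n_matching)


if __name__ == "__main__":
    print(" N  bits  nibbles gained")
    for n in range(9):
        gained = (RAW_BITS - packed_bits(n)) // 4
        print(f"{n:2d}  {packed_bits(n):4d}  {gained:+d}")
```

A matched long costs 44-4*N bits, so it takes exactly 32 bits when N=3 and
12 bits when N=8, while an unmatched long always costs 36 bits.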

In this version, the first instructions, if not matched, are simply skipped
by the 'adr r2,#???' of the depacker. Also, the nibbles' masks and the
nibbles are stored separately. All this leads to the following memory
organisation in a packed file:

  &8000                                   ;
                                          ;... Nothing yet ...
                                          ;_
  .UnpackedLongs   dcd long0              ; \
                   ...                    ;  > Unpacked longs
                   dcd longX              ;_/
  .Depacker        adr   r2,UnpackedLongs ; \
                   ...                    ;  > Depacker (23 instructions =)
                   mov   pc,r14           ;_/
  .ANibbles        dcd   Nibbles+(1<<31)  ;_> Address of nibbles
  .NibblesMasks    dcb   mask0            ; \
                   ...                    ;  > Masks for matched nibbles
                   dcb   maskY            ;_/
  .Nibbles         nib   nibble0          ; \
                   ...                    ;  > Distances & unmatched nibbles
                   nib   nibbleZ          ;_/

The maximum compression this algorithm allows is when N=8 all the time,
so each long is packed down to 3 nibbles, i.e. 37.5% of its size. This
doesn't include the depacker and the first unpacked longs, of course.
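
To show how the separated mask and nibble streams are read back, here is
a Python sketch of the depacking loop. It is not a translation of the
96-byte ARM routine: the "count" parameter, the lowest-nibble-first order
and the mask bit order (bit k set means nibble k is copied from the
referenced long) are assumptions made for the example.

```python
def unpack(masks, nibs, count):
    """Rebuild `count` longs, from the top down, out of the two streams."""
    out = [0] * count
    mi = ni = 0                       # read positions in masks and nibs
    for i in range(count - 1, -1, -1):
        dist = nibs[ni]
        ni += 1
        if dist:                      # matched: fetch reference long and mask
            ref = out[i + dist]
            mask = masks[mi]
            mi += 1
        else:                         # unmatched: all 8 nibbles follow
            ref, mask = 0, 0
        word = 0
        for k in range(8):
            if (mask >> k) & 1:
                nib = (ref >> (4 * k)) & 0xF   # copy from the matching long
            else:
                nib = nibs[ni]                 # read from the nibble stream
                ni += 1
            word |= nib << (4 * k)
        out[i] = word
    return out
```

Because the longs are rebuilt from the top down, the long a distance
points to has always been depacked already when it is needed.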


Unfinished
~~~~~~~~~~
Well, before I decided to make a nibble-only packer, I tried different
numbers of bits to code the distance. Here are first versions of the
code snippets that would be used in a generic version. The depacker would
then be 124 bytes long, which sounds fair.

REM!!!Skip unpacked longs.
.DepackHere
  mov       m0,#1<<32-5
  adr       m2,#data
  adr       @dest,#DepackHere
.BitDepackerOneLong
  mov       m3,#N<<32-3
  bl        ExtractBits             ;Extract N bits and m3=0
  andS      m4,m4,#2^N-1            ;m4=dist (without garbage) & set flags
  ldrNE     long,[@dest,m4,lsl #2]  ;If m4<>0 then load long and
  blNE      ExtractBits             ;  get nibbles' mask (8 bits, ie m3=0)
  and       mask,m4,#&ff
  mov       counter,#8
.BitDepackerOneNibble
  movS      mask,mask,lsr #1
  andCS     m4,long,#&f
  movCC     m3,#4<<32-3             ;4 bits, scaled as ExtractBits expects
  blCC      ExtractBits
  mov       new,new,lsr #4
  add       new,new,m4,lsl #28
  mov       long,long,lsr #4
 subS counter,counter,#1:bNE BitDepackerOneNibble
  str       new,[@dest,#-4]!
 cmp @dest,#&8000:bGT BitDepackerOneLong  ;Stop once &8000 is written
   mov      r0,#0
   dcd &ef02006e ;swi "XOS_SynchroniseCodeAreas"
   b &8000

REMIN  m0=BitCounter (initialised at 1<<32-5)
REM    m1=long
REM    m2=@data
REM    m3=nb of bits to extract<<32-3 (0 means 8 bits to extract)
REMOUT m3=0
REM    m4=bits extracted, possible garbage in upper bits
.ExtractBits
;  mov       m4,#0
.ExtractOneBit
  subS      m0,m0,#1<<32-5          ;One bit will be extracted
  ldrEQ     m1,[m2],#4              ;Load long if no bits left
  addS      m1,m1,m1                ;carry=upper bit and lsl
  adc       m4,m4,m4                ;lsl and lower bit=carry
 subS m3,m3,#1<<32-3:bNE ExtractOneBit
  mov pc,r14
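
For those who prefer to read the bit extraction in a high level language,
here is a Python sketch of what ExtractBits does: one long is loaded every
32 bits, and bits come out upper bit first. The class and method names are
invented for the example, and unlike the ARM routine it returns a clean
value with no garbage in the upper bits.

```python
class BitReader:
    """Reads bits from a list of 32-bit longs, upper bit first."""

    def __init__(self, words):
        self.words = list(words)
        self.pos = 0        # index of the next long to load (m2 in the ARM code)
        self.current = 0    # long being shifted out (m1)
        self.left = 0       # bits left in `current` (role of m0)

    def extract(self, nbits):
        value = 0
        for _ in range(nbits):
            if self.left == 0:                  # load a long if no bits left
                self.current = self.words[self.pos]
                self.pos += 1
                self.left = 32
            bit = (self.current >> 31) & 1      # carry = upper bit
            self.current = (self.current << 1) & 0xFFFFFFFF
            self.left -= 1
            value = (value << 1) | bit          # lsl and lower bit = carry
        return value
```

With such a reader, the generic depacker would call extract(N) for the
distance, extract(8) for the nibbles' mask and extract(4) for each
unmatched nibble.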
