libclammspack/doc/szdd_kwaj_format.html
This document describes the SZDD and KWAJ file formats which are implemented in the MS-DOS commands COMPRESS.EXE and EXPAND.EXE.
Both formats compress a single file to another single file, replacing the last character in the filename with an underscore or dollar character, e.g. README.TXT becomes README.TX_ or README.TX$.
An SZDD file begins with this fixed header:
SZDD header format| Offset | Length | Description | | --- | --- | --- | | 0x00 | 8 | "SZDD" signature: 0x53,0x5A,0x44,0x44,0x88,0xF0,0x27,0x33 | | 0x08 | 1 | Compression mode: only "A" (0x41) is valid here | | 0x09 | 1 | The character missing from the end of the filename (0=unknown) | | 0x0A | 4 | The integer length of the file when unpacked |
The header is immediately followed by the compressed data. The following pseudocode explains how to unpack this data; it's a form of the LZSS algorithm.
SZDD decompression pseudocode|
charwindow[4096];intpos=4096-16;memset(window,0x20,4096);/\* window initially full of spaces \*/for(;;){intcontrol=GETBYTE();if(control==EOF)break;/\* exit if no more to read \*/for(intcbit=0x01;cbit&0xFF;cbit\<\<=1){if(control&cbit){/\* literal \*/PUTBYTE(window[pos++]=GETBYTE());}else{/\* match \*/intmatchpos=GETBYTE();intmatchlen=GETBYTE();matchpos|=(matchlen&0xF0)\<\<4;matchlen=(matchlen&0x0F)+3;while(matchlen--){PUTBYTE(window[pos++]=window[matchpos++]);pos&=4095;matchpos&=4095;}}}}
|
There is also a variant SZDD format seen in the installation package for QBasic 4.5, so I call it the QBasic variant. It has a different header and the pos variable in the pseudocode above is set to 4096-18 instead of 4096-16.
QBasic SZDD variant header format| Offset | Length | Description | | --- | --- | --- | | 0x00 | 8 | "SZ" signature: 0x53,0x5A,0x20,0x88,0xF0,0x27,0x33,0xD1 | | 0x08 | 4 | The integer length of the file when unpacked |
A KWAJ file begins with this fixed header:
KWAJ header format| Offset | Length | Description | | --- | --- | --- | | 0x00 | 8 | "KWAJ" signature: 0x4B,0x57,0x41,0x4A,0x88,0xF0,0x27,0xD1 | | 0x08 | 2 | compression method (0-4) | | 0x0A | 2 | file offset of compressed data | | 0x0C | 2 | header flags to mark header extensions |
The "compression method" field indicates the type of data compression used:
Header extensions immediately follow the header.
If you don't care about the header extensions, use the file offset to skip to the compressed data.
The header extensions appear in this order:
When header flags bit 0 is set4 bytes: decompressed length of fileWhen header flags bit 1 is set2 bytes: unknown purposeWhen header flags bit 2 is set2 bytes: length of data, followed by that many bytes of (unknown purpose) dataWhen header flags bit 3 is set1-9 bytes: null-terminated string with max length 8: file nameWhen header flags bit 4 is set1-4 bytes: null-terminated string with max length 3: file extensionWhen header flags bit 5 is set2 bytes: length of data, followed by that many bytes of (arbitrary text) data
Compression method 3 is unique to the KWAJ format. It's an LZ+Huffman algorithm created by Jeff Johnson.
Bits are always read from MSB to LSB, one byte at a time.
There are three parts:
KWAJ uses 5 huffman trees. They always have the same number of symbols in them. They are, in order:
Canonical huffman codes are used, which means you simply need to know how many symbols in each huffman tree (given above), and how long each huffman symbol is
How the symbol lengths are encoded depends on the encoding type, as given by the 6 nybbles at the start of the compressed data.
Symbol lengths are read in ascending order, and the number of symbols to read is implied by which tree you're defining.
Huffman code length list, encoding type 0All symbol have the same length, implied by the number of symbols in the tree:
You don't need to read anything.Huffman code length list, encoding type 1A run-length encoding is used:
Huffman code length list, encoding type 2Another run-length encoding is used:
Huffman code length list, encoding type 3There is no compression. Read 4 bits per symbol (0-15).
At this point, the compressed data begins.
We have a 4096 byte ring buffer, initially filled with byte 0x20 (ASCII space). Unlike the SZDD format, the starting position in the buffer is irrelevant, as match positions are stored relative to the current position in the window, not as absolute positions in the window.
Pseudo-code:
ring buffer position = 4096-17
selected table = MATCHLEN
LOOP:
code = read huffman code using selected table (MATCHLEN or MATCHLEN2)
if EOF reached, exit loop
if code > 0, this is a match:
match length = code + 2
x = read huffman code using OFFSET table
y = read 6 bits
match offset = current ring buffer position - (x<<6 | y)
copy match as output and into the ring buffer
selected table = MATCHLEN
if code == 0, this is a run of literals:
x = read huffman code using LITLEN table
if x != 31, selected table = MATCHLEN2
read {x+1} literals using LITERAL huffman table, copy as output and into the ring buffer
KWAJ type 4 compression is called MS-ZIP, because it is almost identical to the MS-ZIP compression found in Microsoft Cabinet files. Each 32768 bytes of data is compressed independently using Phil Katz's DEFLATE algorithm. However, the history window is shared between blocks, so they must be unpacked in order. The format of each block is as follows:
KWAJ MS-ZIP block format| Offset | Length | Description | | --- | --- | --- | | 0 | 2 | Compressed length of this block (n). Stored in Intel byte order. Doesn't include these two bytes. | | 2 | 2 | "CK" in ASCII (0x43, 0x4B) | | 4 | n-2 | Data compressed in DEFLATE format |
The final block will unpack to 1-32768 bytes. It will be followed by two zero bytes.