third_party/puffin/README.md
[TOC]
Puffin is a deterministic deflate recompressor. It is mainly used for patching deflate compressed images (zip, gzip, etc.) because patches created from deflate files/streams are usually large (deflate has a bit-aligned format, hence, changing one byte in the raw data can cause the entire deflate stream to change drastically.)
Puffin has two tools (operations): puffdiff and puffpatch (shown
here.) The purpose of puffdiff operation is to create a patch
between a source and a target file with one or both of them having some deflate
streams. This patch is used in the puffpatch operation to generate the target
file from the source file deterministically. The patch itself is created by
bsdiff library (but can be replaced by any other diffing mechanism). But,
before it uses bsdiff to create the patch, we have to transform both the
source and target files into a specific format so bsdiff can produce smaller
patches. We define puff operation to perform such a transformation. The
resulting stream is called a puff stream. The reverse transformation (from
puff stream to deflate stream) is called a huff operation. huff is used in
the client to transform the puff stream back into its original deflate stream
deterministically.
For information about deflate format see RFC 1951.
puff is an operation that decompresses only the Huffman part of the deflate
stream and keeps the structure of the LZ77 coding unchanged. This is roughly
equivalent of decompressing ‘half way’.
huff is the exact opposite of puff and it deterministically converts the
puff stream back to its original deflate stream. This deterministic conversion
is based on two facts:
puff stream. This means the deflate stream can be
reconstructed deterministically using the Huffman tables.The inclusion of Huffman tables in the puff stream has minuscule space burden
(on average maximum 320 bytes for each block. There is roughly one block per
32KB of uncompressed data).
bsdiff of two puffed streams has much smaller patch in comparison to their
deflate streams, but it is larger than uncompressed streams.
The major benefits
puff and huff are deterministic operations.huff is in order of 10X faster than full recompression. It is even faster
than Huffman algorithm because it already has the Huffman tables and does
not need to reconstruct it.The drawbacks
A deflate compressed file (gzip, zip, etc.) contains multiple deflate streams at
different locations. When this file is puffed, the resulting puff stream
contains both puffs and the raw data that existed between the deflates in the
compressed file. When performing huff operation, the location of puffs in the
puff stream and deflates in the deflate stream should be passed upon. This is
necessary as huff operation has to know exactly where the locations of both
puffs and deflates are. (See the following image)
Similarly puffpatch requires deflates and puffs locations in both the source
and target streams. These location information is saved using Protobufs in the
patch generated by bsdiff. One requirement for these two operations are that
puffpatch has to be efficient in the client. Client devices have limited
memory and CPU bandwidth and it is necessary that each of these operations are
performed with the most efficiency available. In order to achieve this
efficiency a new operation can be added to bspatch, that reads and writes into
a puff streams using special interfaces for puffing and huffing on the fly.
bsdiff program.Depending on the scheme for storing the Huffman tables, the payload size can
change. We discovered that the following scheme does not produce the smallest
payload, but it is the most deterministic one. In a deflate stream, Huffman
tables for literals/length and distances are themselves Huffman coded. In this
format, we also puff the Huffman tables into the puff stream instead of
completely decompressing them.
There are three tables stored in this structure very similar to the one defined in RFC 1951. A Huffman table can be defined as an array of unsigned integer code length values. Three Puffed Huffman tables appear like the following scheme. The first table (codes) is the Huffman table used to decode the next two Huffman tables. The second Huffman table is used to decode literals and lengths, and the third Huffman table is used to decode distances.
Literals lists are constructed by a “length” value followed by “length” bytes of
literals. The Puffer should try to merge adjacent literals lists as much as
possible into one literals list in the puff stream. This Is a length value
followed by length bytes of literals (Even if there is only one literal.)
This Is a Length value followed by a Distance value.
Currently Puffin is being used in both Android and Chrome OS and is built differently in each of them. There is also a Makefile build, but it is not comprehensive.
Puffin builds an executable puffin which can be used to perform diffing and
patching algorithms using Puffin format. To get the list of options available
run:
puffin --help
It can also be used as a library (currently used by update_engine) that provides different APIs.
To compute the diff between our current state and the original changes:
git diff 4180a65119ef2c333c4d33c9e39869da89a8faea -- .