A Deep Learning Approach to Data Compression
We introduce BitSwap, a scalable and effective lossless data compression
technique based on deep learning. It extends previous work on practical
compression with latent variable models, based on bitsback coding and
asymmetric numeral systems. In our experiments BitSwap is able to beat
benchmark compressors on a highly diverse collection of images. We’re releasing
code for the method and optimized models such that people can explore and
advance this line of modern compression ideas. We also release a demo and
a pretrained model for BitSwap image compression and decompression on your
own image. See the end of the post for a talk that covers how bitsback coding
and BitSwap works.
Lossless compression for highdimensional data
The goal is to design an effective lossless compression scheme that is scalable
to highdimensional data, like images. This is a matter of concurrently solving
two problems:

choosing a statistical model that closely captures the underlying
distribution of the input data and 
developing a scalable compression algorithm that exploits this model’s
theoretical compression potential.
The compression ratio of the resulting compression scheme heavily relies on the
first problem: the model capacity. Recent advances in deep learning allow us to
optimize probabilistic models of complex highdimensional data efficiently.
These developments have opened up many opportunities regarding lossless
compression. A powerful technique is to pair autoregressive models with
entropy coders, like arithmetic coding or asymmetric numeral systems
(ANS), resulting in excellent compression ratios. However, the autoregressive
structure typically makes decompression several orders of magnitude slower than
compression.
Fortunately, ANS is known to be amenable to parallelization. To exploit this
property, we have to narrow our focus to models that encompass fully factorized
distributions. This constraint forces us to be extra innovative and choose our
model and coding scheme accordingly.
Recent work
The recent BitsBack with Asymmetric Numeral Systems (BBANS) method tries to
mitigate this issue by combining latent variable models with ANS. Latent
variable models define unobserved random variables whose values help govern the
distribution of the data. For example, if the observed data consists of images,
the composition of the images may be dependent on the locations of edges and
textures, which are latent variables. (In practice, we typically define an
uninformative prior over the latent variables, like a standard gaussian.) This
type of model may use fully factorized distributions and can be efficiently
optimized using the VAE framework.
The critical component that enables BBANS to compress with latent variable
models is a principle called bitsback coding that turned out to be a natural
fit with ANS. Bitsback coding ensures compression that closely matches the
negative ELBO on average, in addition to an overhead that only occurs at
initialization. This overhead becomes insignificant when compressing long
sequences of datapoints at once.
Our contribution
While latent variable models can be designed to be complex density estimators,
restricting the model to fully factorized distributions, however, can
significantly limit model flexibility. Therefore, we propose employing
hierarchical latent variable models, which typically have greater modelling
capacity than models with a single latent layer. We extend the latent variable
model recursively, by substituting its fully factorized prior distribution by a
second latent variable model, substituting its prior by a third latent variable
model, and so on.
For example, if the observed data are images, the composition of the images may
be dependent on the locations of edges and textures, which may be dependent on
the locations of objects, which may be dependent on the scene composition, etc.
Note that if we let every layer be solely dependent on the layer on top of
that, this model design can be interpreted as multiple nested latent variable
models: the observed data distribution being governed by latent layer 1, latent
layer 1 distribution being governed by latent layer 2, up until the top latent
layer, which has an unconditional prior distribution.
Using that insight, we developed a novel coding technique called recursive
bitsback coding. As the name suggests, we apply bitsback coding on every
layer recursively, processing the nested latent variable models from bottom to
top. We coined the joint composition of recursive bitsback coding and the
specified hierarchical latent variable model BitSwap. The merits of BitSwap
include:

Applying bitsback coding in a recursive manner resulting in an overhead
that is bounded and in practice does not grow with the depth of the model
hierarchy. This stands in contrast with naively applying BBANS on a
hierarchical latent variable model, which would ignore the latent variable
topology and would treat all latent layers as one single vector, resulting in
an overhead that linearly grows with the depth of the hierarchy. The bounded
overhead makes BitSwap particularly interesting if we want to employ a
powerful model with a deep latent hierarchy, while we do not wish to compress
long sequences of datapoints at once. 
BitSwap is still able to compress close to the negative ELBO on average, in
addition to a smaller overhead. 
Nesting the latent variable models through the prior distribution of every
layer enables more complex distributions for every latent layer, except for the
top one. The nested structure prompts a tighter ELBO, which in turn results in
lower compression ratios. 
We maintain fully factorized distributions throughout the model, which makes
the entire coding process is parallelizable. Using a GPU implementation of
ANS, together with modelparallelism, should result in highspeed
compression and decompression. The major bottleneck in our implementation is
the ANS operation, but we are optimistic this can be resolved due to its
inherent parallelizability. We leave speed optimization of BitSwap to future
work.
Results
Compression ratio of 100 images from ImageNet (unscaled & cropped)  

Uncompressed  100.00 % 
GNU Gzip  74.50 % 
bzip2  63.38 % 
LZMA  63.63 % 
PNG  58.88 % 
WebP  45.75 % 
BBANS  45.25 % 
BitSwap  43.88 % 
We trained the hierarchical latent variable model on random 32 by 32 pixel
patches of the training set of ImageNet. For testing, we took 100 images
independently from the test set. The 100 images were cropped to multiples of 32
pixels on each side, such that we could fit a grid 32 by 32 pixel blocks. The
grid is treated as a dataset that has to be processed in sequence in order to
be compressed with BitSwap and BBANS. We then apply BitSwap and BBANS to a
single sequence at the time, corresponding to compressing one image at the
time. We used the same cropped images for the baseline schemes, without first
breaking the images up in 32 x 32 pixel blocks. The results are shown above. We
believe results can be further improved by using bigger pixel patches and more
sophisticated model optimization. All other results can be found in the paper.
Demo
Compress your own image using BitSwap. Clone the GitHub repository on
https://github.com/fhkingma/bitswap and
run the script demo_compress.py
and demo_decompress.py
. The script
demo_compress.py
will compress using BitSwap and compare it against GNU
Gzip, bzip2, LZMA, PNG and WebP compression. The script demo_decompress.py
will decompress a BitSwap compressed file.
Note: if the input file is already compressed (JPEG, PNG etc.), the script
first has to decompress that file export it to RGB pixel data. Thereafter, the
RGB pixel values are the input and that data gets compressed by BitSwap and
the other schemes, resulting in size reduction compared to the RGB pixel data.
One might notice that the raw RGB pixel data contains (much) more information
than the input file. The discrepancy between the file sizes is especially
prominent when converting JPEG files to RGB data. This is largely because JPEG,
a lossy compressor, consists of a quantization step, in which the original
image loses a (large) portion of the information. Quantization creates
predictable patterns, which are in turn compressed using lossless compression
techniques for the purpose of storage. When decompressing the JPEG file and
converting to RGB, however, we store every single pixel value explicitly, thus
disregarding the patterns. This could result in a large increase of
information.
Video
See the video below for a talk about this paper.
You can find the slides here.
This work was done while the author was at UC Berkeley. We refer the reader to
the following paper for details:
 BitSwap: Recursive BitsBack Coding for Lossless Compression with Hierarchical Latent Variables
Friso H. Kingma, Pieter Abbeel, Jonathan Ho
ICML 2019