fossil

A compressor that shows its work.

I built fossil around one idea: a packed file should be able to tell you how it was made. While it compresses, it records what it did, and fossil explain reads it back. The demos below run the real thing in the page, and everything after explains how it works.

Playground

Drop a file and fossil packs it right here. The breakdown below is what fossil explain shows: which model handled each block, and how much it saved. Nothing is uploaded.


Entropy map

Drop in a file to see its entropy. This is the same heatmap fossil map draws and the stats fossil inspect reports. It runs in your browser, so the file never leaves your machine.
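
For a rough idea of what a per-block entropy figure looks like, here is a minimal sketch. It assumes plain order-0 Shannon entropy in bits per byte over 4 KB blocks; the page doesn't say exactly which statistic fossil computes, and the function names are made up for illustration.

```python
import math
from collections import Counter

BLOCK_SIZE = 4096  # 4 KB blocks, as described on this page


def shannon_entropy(block: bytes) -> float:
    """Order-0 Shannon entropy of a block, in bits per byte (0.0 to 8.0)."""
    if not block:
        return 0.0
    counts = Counter(block)
    total = len(block)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def entropy_map(data: bytes, block_size: int = BLOCK_SIZE) -> list[float]:
    """Entropy of each consecutive block; the last block may be shorter."""
    return [
        shannon_entropy(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]
```

A block of one repeated byte scores 0, and a block where every byte value is equally common scores 8. Low-entropy blocks are the ones with the most to gain from compression.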

What it is

Most compressors just hand you a smaller file. fossil does that too, but it also keeps a record of how, so you can go back and see what it decided.

It works in blocks. The file is cut into 4 KB chunks, and each one is handled on its own. For every chunk fossil runs a handful of small models, checks what each one produces, and keeps the smallest. Different parts of a file usually want different methods, so choosing per block beats picking one method for the whole file.
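
The per-block choice can be sketched in a few lines. This is not fossil's code: the three models below (raw, a toy run-length coder, and zlib's DEFLATE) are hypothetical stand-ins for its real model set, and the names are made up for illustration. Only the selection loop is the point.

```python
import zlib

BLOCK_SIZE = 4096  # block size stated above


def rle_encode(block: bytes) -> bytes:
    """Toy run-length coder: (run length 1..255, byte value) pairs."""
    out = bytearray()
    i = 0
    while i < len(block):
        run = 1
        while i + run < len(block) and run < 255 and block[i + run] == block[i]:
            run += 1
        out += bytes((run, block[i]))
        i += run
    return bytes(out)


# Hypothetical stand-ins for fossil's models, for illustration only.
MODELS = {
    "RAW": lambda b: b,
    "RLE": rle_encode,
    "DEFLATE": lambda b: zlib.compress(b, 9),
}


def pack_blocks(data: bytes):
    """Yield (model name, encoded bytes) per block, keeping the smallest output."""
    for start in range(0, len(data), BLOCK_SIZE):
        block = data[start:start + BLOCK_SIZE]
        candidates = {name: encode(block) for name, encode in MODELS.items()}
        best = min(candidates, key=lambda name: len(candidates[name]))
        yield best, candidates[best]
```

Because RAW is always a candidate, no block ever comes out larger than it went in. The model name kept alongside each block is the kind of record an explain-style command can read back later.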

Because that choice is saved, fossil explain can walk back through a packed file and tell you, block by block, what it used and why.

Commands

fossil pack <in> [out]             compress a file or directory into a .fossil (omit the input to pack the clipboard)
fossil lift                        fossilize the clipboard, then copy the .fossil back to it
fossil unpack <file.fossil> [out]  restore the original (checks the CRC first)
fossil inspect <file>              per-block breakdown: entropy, model, savings
fossil map <file>                  entropy heatmap, or block models for a .fossil
fossil explain <file.fossil>       the reconstruction recipe (--block N for one block)

pack --lossy[=bits] drops the low bits of each byte for a smaller file; --best-effort packs already-compressed inputs losslessly instead of refusing, and --images-only limits lossy to raw images. pack --verify round-trips the result before writing it, and unpack --trust skips the CRC check. pack --fast skips the slow models for much faster packing, trading a little ratio.
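
To make "drops the low bits" concrete, here is a small sketch. It assumes the low bits are zeroed in place; the page doesn't say whether fossil zeros them or shifts them out, and `drop_low_bits` is a made-up name.

```python
def drop_low_bits(data: bytes, bits: int) -> bytes:
    """Zero the lowest `bits` bits of every byte (0 <= bits <= 8)."""
    if not 0 <= bits <= 8:
        raise ValueError("bits must be between 0 and 8")
    mask = (0xFF << bits) & 0xFF  # e.g. bits=2 gives 0b11111100
    return bytes(b & mask for b in data)
```

With 2 bits dropped, each byte can take only 64 distinct values instead of 256, which gives the models less to encode. The original low bits can't be recovered.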

How it works

Each model looks for a different kind of structure. Run-length squashes long runs of one byte. Huffman and range coding give common bytes shorter codes. LZ replaces repeated chunks with pointers back to earlier copies, and LZR packs those pointers a bit tighter with some context, like LZMA. BWT reorders the data so similar contexts line up, which helps the stages after it. The generator spots simple patterns like counters and gradients and stores the rule instead of the bytes.

Every block tries all of these and keeps whichever comes out smallest:

Model    Best for
RAW      the fallback, stored as-is
RLE      adjacent repeated bytes
ENTROPY  skewed byte frequencies (canonical Huffman)
LZ       repeated substrings
LZH      LZ, then Huffman
LZR      LZ tokens range-coded with a literal context (LZMA-style)
BWTM     Burrows-Wheeler + move-to-front + range coding
RANGE    adaptive range coding, no stored table
PPM      order-1 context (each byte from the last)
GEN      formulas like constant fills and ramps
DELTA    smooth, slowly-changing data
CSV      tabular data, transposed so columns group together
WORD     text with repeating words, dictionary-coded
SIGNAL   8/16/24-bit audio and sensor data: FLAC-style windowed LPC, mid/side, partitioned Rice
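
As an illustration of the "store the rule instead of the bytes" idea behind a generator-style model, here is a minimal sketch. It handles only constant fills and linear ramps modulo 256. The `(start, step)` rule and both function names are hypothetical, and fossil's real GEN model may recognise more patterns than this.

```python
from typing import Optional, Tuple


def detect_ramp(block: bytes) -> Optional[Tuple[int, int]]:
    """Return (start, step) if block[i] == (start + i * step) % 256 for all i.

    A constant fill is the special case step == 0. Returns None otherwise.
    """
    if not block:
        return None
    start = block[0]
    step = (block[1] - start) % 256 if len(block) > 1 else 0
    for i, value in enumerate(block):
        if value != (start + i * step) % 256:
            return None
    return start, step


def regenerate(start: int, step: int, length: int) -> bytes:
    """Rebuild the block from the stored rule."""
    return bytes((start + i * step) % 256 for i in range(length))
```

When detection succeeds, a whole 4 KB block shrinks to two small numbers plus its length. When it fails, the block falls through to the other models.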

A few things about the format. Tiny or random files are stored as-is, so they never grow. Lengths use varints. Every file gets a CRC32, so corruption turns up on unpack. The blocks are still 4 KB, but the LZ models can look back up to 64 KB into what they've already seen, so a repeat far from its original only costs a pointer instead of a second copy. Raw images (PPM and BMP) get filtered row by row first (PNG-style), so the models see small differences instead of raw pixels. Packing a folder runs one LZ pass over everything, so duplicate files cost almost nothing.
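
Two of the format details above, varint lengths and the CRC32 check, can be sketched as follows. The page only says "varints", so the LEB128-style layout here (7 bits per byte, low group first) is an assumption. So is using the standard zlib/IEEE CRC-32. All function names are made up for illustration.

```python
import zlib


def encode_varint(n: int) -> bytes:
    """LEB128-style: 7 bits per byte, low group first, high bit means 'more follows'."""
    if n < 0:
        raise ValueError("varints here are unsigned")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Return (value, position of the next unread byte)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def checksum_ok(payload: bytes, stored_crc: int) -> bool:
    """Compare a stored CRC32 with one recomputed from the payload."""
    return (zlib.crc32(payload) & 0xFFFFFFFF) == stored_crc
```

Small lengths, which are the common case with 4 KB blocks, cost one or two bytes instead of a fixed four or eight. A failed checksum comparison is the kind of result that lets unpack refuse a corrupted file.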

How it measures up

On the sample files, against gzip -9 and zstd -19 (percent smaller, bigger is better):

file        fossil  gzip -9  zstd -19
mixed.bin   78.2%   74.1%    77.0%
bigmix.bin  78.3%   74.7%    77.3%
cat.ppm     73.1%   51.1%    57.7%
wave.pcm    81.0%   1.7%     2.3%
cat.jpg     ~0%     ~0%      ~0%

These are just the sample files, so other data lands differently. The structured files do well because fossil picks a model per block. The image gets filtered row by row first (PNG-style), so the models see small differences instead of raw pixels. wave.pcm is audio, and fossil fits a predictor to each block the way FLAC does. gzip and zstd don't predict audio, so they barely touch it. The jpg is already compressed, so there's nothing left to take.
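
The row filtering mentioned above can be sketched with PNG's simplest filter, "Sub", which replaces each byte with its difference from the same channel one pixel to the left. Which PNG-style filters fossil actually applies isn't stated here, so treat this as one plausible instance. The function names and the default of 3 bytes per pixel (8-bit RGB) are assumptions.

```python
def sub_filter(row: bytes, bpp: int = 3) -> bytes:
    """PNG 'Sub' filter: each byte minus the byte one pixel to its left, mod 256."""
    return bytes(
        (row[i] - (row[i - bpp] if i >= bpp else 0)) % 256
        for i in range(len(row))
    )


def sub_unfilter(filtered: bytes, bpp: int = 3) -> bytes:
    """Invert sub_filter, rebuilding the row left to right."""
    out = bytearray()
    for i, diff in enumerate(filtered):
        left = out[i - bpp] if i >= bpp else 0
        out.append((diff + left) % 256)
    return bytes(out)
```

On a smooth gradient the filtered row is mostly small repeated values, which suits run-length and entropy models far better than the raw pixels do. The filter is lossless because the arithmetic wraps modulo 256 in both directions.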

Install

fossil builds from source with Rust.

# install Rust (skip if you already have it)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# install fossil
cargo install --git https://github.com/punctuations/fossil

fossil help

# on Windows: install Rust from https://rustup.rs (run rustup-init.exe), then in PowerShell:
cargo install --git https://github.com/punctuations/fossil

fossil help