How does file compression work?

I suppose the question is not about how I call gzip, bzip2, xz or similar tools to compress files, but about the compression algorithms used in such tools.

The main idea is to encode the data differently, more compactly, or to omit irrelevant things.There are many different approaches to this, which differ, e.g. in terms of

  • Speed of compression
  • Speed of decompression
  • Compression ratio
  • Memory
  • Streamability
  • Licensing costs

But with an example, this will probably be more vivid.Let’s say we want the text

aabbbbaacccXXXXaabbbbaacccX

Compress.Instead of repeating the letters, we could specify how often a letter occurs. The result:

2a4b2a3c4X2a4b2a3c1X

As you can see, the text has already become shorter.This method is called run length coding – very simple, very fast, hardly needs memory, compression rate is generally so well.

Mr. Huffman had the idea to make code length dependent on frequency: frequent characters get shorter codes than less common ones.Instead of encoding each character with 8 bits, you could use for very common 2 bits and for rare ones e.g. 14 bits.

In this example, 8 x a, 8 x b, 6 x c and 5 x X occur, i.e. a and b get shorter codes than c or X.

Another idea is to encode recurring symbol sequences with special symbols, e.g.:

1:aa-n2:bb-n3:ccc-n4:12213 = aa bb bb aa ccc-n5:X

With these new symbols, the text (in a kind of pseudonotation) looks like this:

( aa=1 bb=2 2 1 ccc=3 )=4 X=5 5 5 5 4 5

There are some algorithms, such as Lempel-Ziv-Welch, that work according to this scheme.They differ in the way symbol sequences are found and encoded.

The methods described so far generate exactly the data that was originally compressed during decompression.They are therefore called losslessprocedures.If there is data that is irrelevant, so that can be omitted completely, then one speaks of lossy procedures.

Suppose the X are only placeholders that occur as separators in the data to be compressed, then this stream could be transmitted as:

aabbbbaacccXaabbbbaacccX

Unnecessary X slackened.

Lossless methods can now be applied to this stream to achieve an even higher compression rate.

Examples of lossy processes are MP3 or JPG, because the human ear does not hear some sounds in the presence of other sounds, and the eye still recognizes the color even if the color value is not 100% correct.

I hope that these ideas have become a little clearer.Otherwise: Google is your friend, and Wikipedia knows almost everything 🙂

Leave a Reply