DEFLATE is a lossless data compression algorithm that combines the LZ77 algorithm and Huffman coding. It was initially defined by Phil Katz for the second version of his PKZIP software and later standardized by RFC 1951. DEFLATE is used in zip, gzip, and png files.
Introduction
The DEFLATE algorithm consists of an improved LZ77 compression algorithm and Huffman coding algorithm. The improved LZ77 algorithm and Huffman coding algorithm retain the basic principles of the original algorithm, improve the compression speed of the DEFLATE algorithm, increase the algorithm's flexibility, and make it adaptable.
In the DEFLATE algorithm, the original data is first compressed using the improved LZ77 algorithm, and then the LZ77 compressed data is further compressed using the improved Huffman coding. The final compressed data is the LZ77 compressed data after Huffman coding.
Improved LZ77
The LZ77 compression algorithm processes all data into a triplet string, where each triplet requires at least 3 bytes to represent. If a matching string is not found in the dynamic dictionary, 1 byte is output as 3 bytes, resulting in byte waste. Therefore, the LZ77 algorithm used in the DEFLATE algorithm has been improved as follows:
- Strings with a matching length less than 3 are not compressed. Strings with a length less than 3 cannot achieve compression using the triplet representation.
- The matching length of the string is limited to (-128~127), represented by 8 bits, and the sliding window size is 32KB, represented by 15 bits. The compressed data is constructed in the form of identifier + data, as shown in the table. This storage method reduces the storage space of the compressed data.
- Since only matching strings with a length greater than 3 are searched, a hash chain search algorithm is used to improve the algorithm speed. The hash algorithm compresses 3 bytes into 2 bytes. Although the hash chain structure may produce incorrect results, it greatly improves the search speed.
Improved LZ77 Algorithm Compressed Data Storage Format
Original Data Identifier (1-bit) Data (8-bit) Offset (15-bit)
Matching length less than 3 0 Original data None
Matching length greater than 3 1 Matching length Offset distance
Improved Huffman
The compressed data after Huffman coding consists of a code table and encoded data. In the DEFLATE algorithm, the Huffman coding algorithm used can be either static Huffman coding or dynamic Huffman coding, depending on the generation method of the code table. Static Huffman coding uses a predefined code table that remains fixed and does not need to be stored in the compressed data. Dynamic Huffman coding generates a dynamic code table based on the data being processed and needs to be stored in the compressed data. The DEFLATE algorithm mainly improves the dynamic Huffman coding algorithm as follows:
- Fixed Huffman code table generation rule: From the construction of Huffman coding, it is known that the same data can obtain different Huffman code tables due to different binary tree generation methods or orders. By fixing the binary tree generation method and order, the same data can obtain the same Huffman code table.
- Use code length to record the code table, further compressing the storage space occupied by the code table. This way, 1 byte can represent any code word with a code length of 1 to 256.
- The Huffman code table to be stored is further compressed using a second encoding. In the DEFLATE algorithm, three Huffman code tables are finally used, achieving the storage of the Huffman code table.
The improved Huffman coding algorithm effectively reduces the storage space of the compressed data while considering the computational speed.
DEFLATE Processing Flow
DEFLATE treats all data to be compressed as a binary data stream, with bytes as the basic processing elements. The Huffman coding algorithm uses dynamic or static coding, which is automatically selected by the DEFLATE algorithm based on the compression ratio. The specific processing flow of the DEFLATE algorithm using dynamic Huffman coding is shown in Figure 1. First, the original data is compressed using the LZ77 compression algorithm, resulting in the "identifier + data" and "offset" parts of the table. Then, the Huffman coding algorithm is used to compress the two parts of data separately, resulting in "code table 1" and "code table 2", as well as the compressed data encoded using the two code tables, "compressed data 1" and "compressed data 2". Finally, Huffman coding is used again to compress "code table 1" and "code table 2", resulting in "code table 3" and "compressed data 3".
After the above processing, for the DEFLATE algorithm using static Huffman coding, the data that needs to be stored is "compressed data 1" and "compressed data 2". For the DEFLATE algorithm using dynamic Huffman coding, the data that needs to be stored is "compressed data 1", "compressed data 2", "compressed data 3", and "code table 3". Decompression is the inverse operation of the compression process.