banner
Sereinme

Sereinme

无聊中的夏日闲话
github
zhihu
email
jike
steam_profiles

📑LZ77

1696781354749

Introduction#

LZ77 is a classic compression algorithm that uses a dictionary for data compression, proposed by two Israeli engineers, Jacob Ziv and Abraham Lempel, in their 1977 paper "A Universal Algorithm for Sequential Data Compression."

Traditional statistical compression coding, such as Huffman coding, requires prior information—the character frequency of the source—before compression can take place. However, in most cases, this prior information is difficult to obtain in advance. Therefore, designing a more general data compression coding becomes particularly important, and the LZ77 data compression algorithm emerged. Its core idea is to utilize the repeated structural information of data for compression. That is, if information in a string has appeared before, it only needs to indicate the position where it previously appeared, allowing these words to be represented with relative indices.

Principle#

This section discusses only the compression and decompression processes of the LZ77 algorithm. For the mathematical proof of the unique translatable, lossless compression property of the LZ77 algorithm, please refer to the original paper.

Sliding Window#

First, define the length of the string SS as NN, and the substring Sij,1i,jNS_{ij},1\leq i,j \leq N. For the prefix substring S1,jS_{1,j}, let LijL_i^j be the maximum matching length of the substring starting with the first character SiS_i and the substring starting with the first character Sj+1S_{j+1}, that is:

Lij=max{lSi,i+l1=Sj+1,j+l},lNjL_i^j=\max\{l|S_{i,i+l-1}=S_{j+1,j+l}\},\quad l\leq N-j

We say that the string Sj+1,j+lS_{j+1,j+l} matches the string Si,i+l1S_{i,i+l-1}, with a matching length of ll. As shown in the figure, there are two cases.

string_match

Define pjp^j as the starting position ii of the longest match in all cases, that is

pi=argmaxi{Lij},1ijp^i = \mathop{\arg\max}\limits_i\{L_i^j\}, \quad 1\leq i \leq j

For the string S=00101011,j=3S=00101011,j=3, we have:

  • L1j=1Sj+1,j+1=S1,1,Sj+1,j+2S1,2L_1^j=1\rightarrow S_{j+1,j+1}=S_{1,1},S_{j+1,j+2}\neq S_{1,2}
  • L2j=4Sj+1,j+1=S2,2,Sj+1,j+2=S2,3,Sj+1,j+3=S2,4,Sj+1,j+4=S2,5,Sj+1,j+5S2,6L_2^j=4\rightarrow S_{j+1,j+1}=S_{2,2},S_{j+1,j+2}=S_{2,3},S_{j+1,j+3}=S_{2,4},S_{j+1,j+4}=S_{2,5},S_{j+1,j+5}\neq S_{2,6}
  • L3j=0Sj+1,j+1S3,3L_3^j=0\rightarrow S_{j+1,j+1}\neq S_{3,3}

Thus, pj=2p^j=2 and the longest matching length lj=4l^j=4. It can be seen that the substring Sj+1,j+pS_{j+1,j+p} can be generated from S1,jS_{1,j}, hence it is called the Reproducible Extension of S1,jS_{1,j}. The core idea of the LZ77 algorithm comes from this—using strings that have appeared in history as a dictionary to encode future characters for the purpose of data compression. In specific implementations, a sliding window dictionary stores historical characters, and a Lookahead Buffer stores characters to be compressed, with a Cursor acting as a separator between the two, as shown in the figure.

lz77

Moreover, the lengths of the dictionary and Lookahead Buffer are fixed.

Compression#

Using (p,l,c)(p,l,c) to represent the longest match result of the string in the Lookahead Buffer, where

  • pp indicates the starting position of the longest match in the dictionary (relative to the Cursor position)
  • ll is the length of the longest matching string
  • cc refers to the next character after the longest match in the Lookahead Buffer

The compression process involves repeatedly outputting (p,l,c)(p,l,c) and moving the Cursor to l+1l+1, as follows:

Repeat:
	Output (p,l,c),
	Cursor --> l+1
Until to end of string

An example of compression is shown in the figure.

lz77-compression

Decompression#

To ensure correct decoding, the sliding window during decompression must be the same as during compression. When encountering (p,l,c)(p,l,c) during decompression, there are generally three cases:

  • p==0p==0 and l==0l==0, which is the initial case, where cc is decoded directly
  • plp\geq l, decoded as dict[p:p+l+1]
  • p<lp<l, which indicates a loop encoding, requiring left-to-right concatenation, with pseudocode as follows:
for (i = p, k = 0; k < length; i++, k++)
	out[cursor+k] = dict[i%cursor]

For example, if dict=abcd, the encoding is (2,9,e), then decompression results in output=abcdcdcdcdcdce.

Implementation#

A simple LZ77 compression algorithm is implemented in MATLAB. Since my primary goal is to compress arrays, I use arrays instead of strings for compression design and testing. The specific implementation is as follows:

classdef LZ77
% A simple implementation of LZ77 compression algorithm

    properties (SetAccess = private)
        window_size % dictionary
        buffer_size % lookahead buffer
    end

    methods
        function self = LZ77(d, h)
            if nargin == 2
                self.window_size = d;
                self.buffer_size = h;
            elseif nargin == 1
                self.window_size = d;
                self.buffer_size = 4;
            elseif nargin == 0
                self.window_size = 6;
                self.buffer_size = 4;
            end
        end

        function [p, l, c] = longest_match(self, data, cursor)
        % Find the longest match between dictionary and
        % lookahead-buffer.
        % 
        % arguments: (input)
        %   data    - data used to find longest match
        %   cursor  - position of data that separate dictionary and 
        %             lookahead-buffer, from 0 to data length, value of
        %             cursor should be equal to its left item's index
        % arguments: (output)
        %   p - start point of longest match based on cursor
        %   l - length of longest match
        %   c - next character of longest match

            end_buffer = min(cursor + self.buffer_size, length(data)-1);

            p = -1;
            l = -1;
            c = '';

            for i = cursor+1:end_buffer
                start_index = max(1, cursor - self.window_size + 1);
                sub_string = data(cursor+1:i);
                subs_len = length(sub_string);

                for j = start_index:cursor
                    repetition = floor(subs_len / (cursor - j + 1));
                    last = rem(subs_len, (cursor - j + 1));
                    matched_string = [repmat(data(j:cursor), ...
                        [1 repetition]), data(j:j+last-1)];

                    % if match, update p, l, c
                    if isequal(matched_string, sub_string) && subs_len > l
                        p = cursor - j + 1;
                        l = subs_len;
                        c = data(i+1);
                    end
                end
            end

            % if not match, return next character of cursor
            if p == -1 && l == -1
                p = 0;
                l = 0;
                c = data(cursor + 1);
            end
        end

        function cm = compress(self, message)
        % Compress message
        % 
        % arguments: (input)
        %   message - data ready to compress
        % arguments: (output)
        %   cm      - compressed message
            
            i = 0;
            while i < length(message)
                [p, l, c] = self.longest_match(message, i);
                if exist('cm', 'var')
                    cm = [cm;[p,l,c]];
                else
                    cm = [p,l,c];
                end
                i = i + l + 1;
            end
        end

        function dm = decompress(self, cm)
        % Decompress the compressed message
        % 
        % arguments: (input)
        %   cm - compressed message
        % arguments: (output)
        %   dm - decompressed message

            cursor = 0;
            [n, ~] = size(cm);
            dm = [];

            for i = 1:n
                p = cm(i,1);
                l = cm(i,2);
                c = cm(i,3);
                if p == 0 && l == 0
                    dm = [dm, c];
                elseif p >= l
                    dm = [dm, dm(cursor - p + 1: cursor - p + l), c];
                elseif p < l
                    repetition = floor(l / p);
                    last = rem(l, p);
                    dm = [dm, repmat(dm(cursor-p+1:cursor), ...
                        [1 repetition]), dm(cursor-p+1:cursor-p+last), c];
                end
                cursor = cursor + l + 1;
            end
        end
    end
end

The overall encoding and decoding logic is clear, and the loop matching during the decoding process is relatively simple. In finding the maximum matching substring, the prefix substring uses the same generation method as the loop matching; in fact, other methods are similar, and the generation method here is not essential.

It is worth noting that when searching for the maximum matching substring, the Lookahead Buffer cannot access the last character of the data, meaning it needs to be length(data)-1. This is because if all the data after the Cursor matches, the index of the next character after matching would exceed the boundary. By ensuring that the Lookahead Buffer only goes up to the second-to-last character, it guarantees that even after a complete match, there will still be one character left to attach to cc in (p,l,c)(p,l,c), effectively avoiding array out-of-bounds issues, while the compression efficiency remains unaffected.

end_buffer = min(cursor + self.buffer_size, length(data)-1);

Performance#

When the lengths of the Dictionary and Lookahead Buffer are small, the compression speed of LZ77 is very fast, but the compression ratio is low. In cases where repeated data occurs infrequently, the compressed data may even be larger than the original data. When the lengths of the Dictionary and Lookahead Buffer are small, the compression effect is good, but the compression speed is slow; as the lengths increase, the compression effect improves, but the compression speed decreases.

For a Dictionary and Lookahead Buffer length of 80, testing data collected from a small antenna's blue box resulted in a compression ratio of about 1.29:1, with a compression rate of 22%; increasing the window length to 200 with the same data yielded a compression ratio of about 1.49:1, with a compression rate of 33%.

For decompression, since it is independent of the lengths of the Dictionary and Lookahead Buffer, the decompression speed is consistently fast.

Directly using zip to compress 2.29G of raw collected data resulted in a compression ratio of about 3.35:1. Since it is also based on a variant of the LZ77 algorithm, performance is improved compared to direct compression, but the overall large data volume can also lead to an increase in compression rate.

References#

Loading...
Ownership of this post data is guaranteed by blockchain and smart contracts to the creator alone.