📑LZ77 - Sereinme

1696781354749

引言#

LZ77 是一種採用字典進行數據壓縮的經典壓縮算法，由兩位以色列工程師 Jacob Ziv 和 Abraham Lempel 在 1977 年發表的論文 “A Universal Algorithm for Sequential Data Compression” 中提出。

傳統基於統計的壓縮編碼，如 Huffman 編碼，需要得到先驗信息 —— 信源的字符頻率，然後進行壓縮。但是在大多數情況下，這種先驗信息是很難預先獲得的。因此，設計一種更為通用的數據壓縮編碼顯得尤為重要，LZ77 數據壓縮算法應運而生，其核心思想是，利用數據的重複結構信息來進行數據壓縮。即如果字符串中的信息在之前出現過，則只需要指出其之前出現過的位置，便可以用相對索引來表示這些詞。

原理#

這裡僅討論 LZ77 算法的壓縮和解壓縮過程，關於 LZ77 算法的唯一可譯、無損壓縮的性質，其數學證明參看原論文。

滑動窗口#

首先，定義字符串 $S$ 的長度為 $N$ ，字符串 $S$ 的子串 $S_{ij},1\leq i,j \leq N$ 。對於前綴子串 $S_{1,j}$ ，記 $L_i^j$ 為首字符 $S_i$ 的子串與首字符 $S_{j+1}$ 的子串的最大匹配長度，即：

L_i^j=\max\{l|S_{i,i+l-1}=S_{j+1,j+l}\},\quad l\leq N-j

我們稱字符串 $S_{j+1,j+l}$ 匹配了字符串 $S_{i,i+l-1}$ ，且匹配長度為 $l$ 。如圖所示，共有兩種情況

string_match

定義 $p^j$ 為所有情況下的最長匹配的起始位置 $i$ 值，即

p^i = \mathop{\arg\max}\limits_i\{L_i^j\}, \quad 1\leq i \leq j

如對於字符串 $S=00101011,j=3$ 則有

$L_1^j=1\rightarrow S_{j+1,j+1}=S_{1,1},S_{j+1,j+2}\neq S_{1,2}$
$L_2^j=4\rightarrow S_{j+1,j+1}=S_{2,2},S_{j+1,j+2}=S_{2,3},S_{j+1,j+3}=S_{2,4},S_{j+1,j+4}=S_{2,5},S_{j+1,j+5}\neq S_{2,6}$
$L_3^j=0\rightarrow S_{j+1,j+1}\neq S_{3,3}$

因此， $p^j=2$ 且最長匹配的長度 $l^j=4$ 。由此可知，子串 $S_{j+1,j+p}$ 是可以由 $S_{1,j}$ 生成的，因而稱之為 $S_{1,j}$ 的再生擴展（Reproducible Extension）。LZ77 算法的核心思想便來源於此 —— 用歷史中出現過的字符串做字典，編碼未來出現的字符，以達到數據壓縮的目的。在具體實現中，用滑動窗口（Sliding Window）字典存儲歷史字符，Lookahead-Buffer 存儲待壓縮的字符，Cursor 作為兩者之間的分隔，如圖所示

lz77

並且字典與 Lookahead Buffer 的長度是固定的。

壓縮#

用 $(p,l,c)$ 表示 Lookahead Buffer 中字符串的最長匹配結果，其中

$p$ 表示最長匹配時，字典中字符開始時的位置（相對於 Cursor 位置）
$l$ 為最長匹配字符串的長度
$c$ 指 Lookahead Buffer 最長匹配結束時的下一字符

壓縮的過程，就是重複輸出 $(p,l,c)$ ，並將 Cursor 移動至 $l+1$ ，算法如下

Repeat:
	Output (p,l,c),
	Cursor --> l+1
Until to end of string

壓縮示例如圖所示

lz77-compression

解壓縮#

為了能保證正確解碼，解壓縮時的滑動窗口與壓縮時要相同。在解壓縮時，遇到 $(p,l,c)$ 大致分為三類情況：

$p==0$ 且 $l==0$ ，即初始情況，這時直接解碼 $c$
$p\geq l$ ，解碼為字典dict[p:p+l+1]
$p<l$ ，即出現循環編碼，則需要從左至右循環拼接，伪代码如下

for (i = p, k = 0; k < length; i++, k++)
	out[cursor+k] = dict[i%cursor]

比如，dict=abcd，編碼為 (2,9,e) ，則解壓縮為 output=abcdcdcdcdcdce。

實現#

用 matlab 實現一個簡單的 LZ77 壓縮算法，由於我的目的主要是壓縮數組，因此用數組代替字符串進行壓縮設計和測試，具體實現如下

classdef LZ77
% A simple implementation of LZ77 compression algorithm

    properties (SetAccess = private)
        window_size % dictionary
        buffer_size % lookahead buffer
    end

    methods
        function self = LZ77(d, h)
            if nargin == 2
                self.window_size = d;
                self.buffer_size = h;
            elseif nargin == 1
                self.window_size = d;
                self.buffer_size = 4;
            elseif nargin == 0
                self.window_size = 6;
                self.buffer_size = 4;
            end
        end

        function [p, l, c] = longest_match(self, data, cursor)
        % Find the longest match between dictionary and
        % lookahead-buffer.
        % 
        % arguments: (input)
        %   data    - data used to find longest match
        %   cursor  - position of data that seperate dictionary and 
        %             lookahead-buffer, from 0 to data length, value of
        %             cursor should be equal to its left item's index
        % arguments: (output)
        %   p - start point of longest match based on cursor
        %   l - length of longest match
        %   c - next character of longest match

            end_buffer = min(cursor + self.buffer_size, length(data)-1);

            p = -1;
            l = -1;
            c = '';

            for i = cursor+1:end_buffer
                start_index = max(1, cursor - self.window_size + 1);
                sub_string = data(cursor+1:i);
                subs_len = length(sub_string);

                for j = start_index:cursor
                    repetition = floor(subs_len / (cursor - j + 1));
                    last = rem(subs_len, (cursor - j + 1));
                    matched_string = [repmat(data(j:cursor), ...
                        [1 repetition]), data(j:j+last-1)];

                    % if match, update p, l, c
                    if isequal(matched_string, sub_string) && subs_len > l
                        p = cursor - j + 1;
                        l = subs_len;
                        c = data(i+1);
                    end
                end
            end

            % if not match, return next character of cursor
            if p == -1 && l == -1
                p = 0;
                l = 0;
                c = data(cursor + 1);
            end
        end

        function cm = compress(self, message)
        % Compress message
        % 
        % arguments: (input)
        %   message - data ready to compress
        % arguments: (output)
        %   cm      - compressed message
            
            i = 0;
            while i < length(message)
                [p, l, c] = self.longest_match(message, i);
                if exist('cm', 'var')
                    cm = [cm;[p,l,c]];
                else
                    cm = [p,l,c];
                end
                i = i + l + 1;
            end
        end

        function dm = decompress(self, cm)
        % Decompress the compressed message
        % 
        % arguments: (input)
        %   cm - compressed message
        % arguments: (output)
        %   dm - decompressed message

            cursor = 0;
            [n, ~] = size(cm);
            dm = [];

            for i = 1:n
                p = cm(i,1);
                l = cm(i,2);
                c = cm(i,3);
                if p == 0 && l == 0
                    dm = [dm, c];
                elseif p >= l
                    dm = [dm, dm(cursor - p + 1: cursor - p + l), c];
                elseif p < l
                    repetition = floor(l / p);
                    last = rem(l, p);
                    dm = [dm, repmat(dm(cursor-p+1:cursor), ...
                        [1 repetition]), dm(cursor-p+1:cursor-p+last), c];
                end
                cursor = cursor + l + 1;
            end
        end
    end
end

整體的編碼解碼邏輯清楚，解碼過程中的循環匹配也較為簡單。在尋找最大匹配子串中，前置子串採用了和循環匹配一樣的生成方式，其實其他方式也是一樣的，這裡的生成方式不本質。

值得注意的是，在尋找最大匹配子串中時，Lookahead-Buffer 是不能取到數據的最後一位的，即需要 length(data)-1 。這是因為如果 Cursor 後的數據全部匹配的話，這樣匹配後的下一字符索引會超出邊界，而保證 Lookahead-Buffer 最多只到倒數第二個字符的話，能夠保證就算在全部匹配之後還能夠留有一字符附在 $(p,l,c)$ 中的 $c$ 處，從而有效避免了數組越界的情況，同時壓縮效率也沒有下降。

end_buffer = min(cursor + self.buffer_size, length(data)-1);

性能#

在 Dictionary 和 Lookahead-Buffer 長度較小的時候，LZ77 的壓縮速度很快，但是壓縮率很低，對於重複數據出現少的情況，甚至可能出現壓縮後的數據比壓縮之前還要大的情況。對於 Dictionary 和 Lookahead-Buffer 長度較小的情況，壓縮效果較好，但是壓縮速度很慢，且長度越大，壓縮效果越好，壓縮速度越慢。

對於 Dictionary 和 Lookahead-Buffer 長度均為 80 時，測試小天線的藍盒子採集數據，壓縮比約為 1.29:1 ，壓縮率為 22% ；擴大窗長度到 200 時，同樣數據，壓縮比約為 1.49:1，壓縮率為 33%。

對於解壓縮，由於和 Dictionary 和 Lookahead-Buffer 長度無關，因此解壓縮速度均很快。

直接使用 zip 對 2.29G 的原始採集數據進行壓縮，壓縮比在 3.35:1 左右，由於也是基於 LZ77 算法的變種，性能相較直接的壓縮有提升，但是整體數據量大也會導致壓縮率的提升。

參考#

Ziv, Jacob, and Abraham Lempel. "A universal algorithm for sequential data compression." IEEE Transactions on information theory 23.3 (1977): 337-343.
LZ77 算法原理及實現