Matches overlapping lookahead on LZ77/LZSS with suffix trees


As I understand it, given a suffix tree, we are interested (roughly) in computing, for each suffix S, which longer suffix has the longest common prefix with S.

Add a reference from each tree node to the descendant leaf with the longest suffix (linear time with DFS). From each leaf, walk rootward, examining the new references, stopping if a longer suffix is found. The running time of the latter step is linear, because each tree edge is examined exactly once.

Life with a bounded window is unfortunately more difficult. Instead of propagating one reference, we propagate several. To compute the set of suffixes referenced from a node, we merge them in sorted order by length. Then whenever we have suffixes of lengths x > y > z, if x - z < 8192 (eg), then we can drop y, because all three are equally good for all suffixes with which the current node is the leafmost common ancestor, and if y is in window, then either x or z is. Since the window is a large fraction of the total file, each node will have at most a handful of references.

If you want to look back the shortest distance possible, then there's an O(n log^2 n)-time algorithm (probably improvable to O(n log n) with various hard-to-implement magic). In the course of the algorithm, we construct, for each node, a binary search tree of the descendant suffixes by length, then do next-longest lookups. To construct the node's tree from its childrens', start with the largest child tree and insert the elements from the others. By a heavy path argument, each length is inserted O(log n) times.

链接地址: http://www.djcxy.com/p/27488.html

上一篇: 通过jQuery增加

下一篇: 在LZ77 / LZSS上与后缀树匹配重叠前瞻