Find repeating substring of Length N
I have to make a Java program which finds all repeating sub-strings of length n in a given String. The input is string is extremely long and a brute-force approach takes too much time.
 I alread tried:  
 Presently I am finding each sub-string separately and checking for repetitions of that sub-string using the KMP alogrithm.  This too is taking too much time.  
What is a more efficient approach for this problem?
1) You should look at using a suffix tree data structure.
Suffix Tree
 This data structure can be built in O(N * log N) time  
 (I think even in O(N) time using Ukkonen's algorithm)  
 where N is the size/length of the input string.  
 It then allows for solving many (otherwise) difficult  
 tasks in O(M) time where M is the size/length of the pattern.  
 So even though I didn't try your particular problem, I am pretty sure that  
 if you use a suffix tree and a smart formulation of your problem, then the  
 problem can be solved by using a suffix tree (in reasonable O time).  
2) A very good book on these (and related) subjects is this one:
Algorithms on Strings, Trees and Sequences
 It's not really easy to read though unless you're well-trained in algorithms.  
 But OK, reading such things is the only way to get well-trained ;)  
3) I suggest you have a quick look at this algorithm too.
Aho-Corasick Algorithm
 Even though, I am not sure but... this one might be somewhat  
 off-topic with respect to your particular problem.  
I am going to take @peter.petrov's suggestion and enhance it by explaining how can one actually use a suffix tree to solve the problem:
 1. Create a suffix tree from the string, let it be `T`.
 2. Find all nodes of depth `n` in the tree, let that set of nodes be `S`. This can be done using DFS, for example.
 3. For each node `n` in `S`, do the following:
     3.1. Do a DFS, and count the number of terminals `n` leads to. Let this number be `count`
     3.2. If `count>1`, yield the substring that is related to `n` (the path from root to `n`), and `count`
 Note that this algorithm treats any substring of length n and add it to the set S , and from there it search for how many times this was actually a substring by counting the number of terminals this substring leads to.  
 This means that the complexity of the problem is O(Creation + Traversal) - meaning, you first create the tree and then you traverse it (easy to see you don't traverse in steps 2-3 each node in the tree more than once).  Since the traversal is obviously "faster" than creation of the tree - it leaves you with O(Creation) , which as was pointed by @perer.petrov is O(|S|) or O(|S|log|S|) depending on your algorithm of choice.  
上一篇: 断开图中的所有顶点
下一篇: 查找长度N的重复子字符串
