Possessive generic quantifier {m,n}+ not implemented in Ruby 1.9.3?

Possessive quantifiers are greedy and refuse backtrack. A regex /.{1,3}+b/ should mean: Match any character except line breaks, 1 to 3 times, as many as possible and don't backtrack. Tthen match the character b .

In this example:

'ab'.sub /.{1,3}+b/, 'c'    #=> "c"

no substitution should take place, contrary to fact.

The result in these two examples differs:

'aab'.sub /.{0,1}+b/, 'c'   #=> "c"
'aab'.sub /.?+b/, 'c'       #=> "ac"

Compare this with Scala, where they give the same answer:

scala> ".{0,1}+b".r.replaceAllIn("aab", "c")
res1: String = ac
scala> ".?+b".r.replaceAllIn("aab", "c")
res2: String = ac

Is this a Ruby bug, or is it possible to motivate this behavior? Perhaps, Oniguruma for some reason implemented possessive with all quantifiers ? , * , + except the generic quantifier {m,n} ? If that's the case, why?


What really happens

It seems that + followed by range quantifier doesn't offer the possessive property to the range quantifier. Rather, it is treated as whatever in front repeated once or more. Using .{1,3}+b as an example, it will be equivalent to (?:.{1,3})+b .

Work-around

You can work-around this with the more general construct non-backtracking group (or atomic grouping) (?>pattern) . Let us use the general case pattern{n,m}+ as an example to construct the equivalent regex with non-backtracking group (equivalent to Java's behavior when matching with pattern{n,m}+ ):

(?>(?>pattern){n,m})

Why 2 levels of non-backtracking groups? 2 are necessary because:

  • When a match is found for pattern (one instance of repetition), backtracking within pattern is disallowed. (Note that as long as an instance has not been found, backtracking within pattern is allowed). This is emulated with the inner non-backtrakcing group.
  • When no more instances of pattern can be found, backtracking to remove any of the instances is disallowed. This is emulated with the outer non-backtracking group.
  • I am not sure if there is any caveat here. Please ping me with comment if you found any case not emulated with this method.

    Testing

    Test 1

    At first, I tested this regex:

    (.{1,3}+)b
    

    Initially, I tested without the capturing group, but the result was so surprising that I need the capturing group to confirm what is happening.

    On this input:

    2343333ab
    

    The result is that the whole string matches , and the capturing group captured 2343333a (without the b at the end). This shows that the upper limit has somehow been broken.

    DEMO at rubular

    Test 2

    This second test reveals how range quantifiers {n} 's behavior cannot be modified to be possessive, and it is likely that this also applies for other range quantifiers {n,} and {n,m} . Instead, the following + will only exhibit repetition of 1 or more time behavior.

    (My initial conclusion is that + overwrites the upper limit, but it turns out to be wrong).

    Testing regex:

    (.{3}+)b
    

    Input string:

    23d4344333ab
    234344333ab
    23434433ab
    

    The matches captured in capturing group 1 are all multiples of 3. From top to bottom, the regex skips 2, 1, 0 characters respectively for the input strings.

    Input string with annotation ( [] denotes the match for the whole regex, () denotes the text captured by capturing group 1):

    23[(d4344333a)b]
    2[(34344333a)b]
    [(23434433a)b]
    

    DEMO at rubular

    Testing code for work around

    This is the testing code in Java to show that both outer and inner non-backtracking groups are necessary. ideone

    class TestPossessive {
      public static void main(String args[]) {
        String inputText = "123456789012";
        System.out.println("Input string: " + inputText);
        System.out.println("Expected: " + inputText.replaceFirst("(?:d{3,4}(?![89])){2,}+", ">$0<"));
        System.out.println("Outer possessive group: " + inputText.replaceFirst("(?>(?:d{3,4}(?![89])){2,})", ">$0<"));
        System.out.println("Inner possessive group: " + inputText.replaceFirst("(?>d{3,4}(?![89])){2,}", ">$0<"));
        System.out.println("Both: " + inputText.replaceFirst("(?>(?>d{3,4}(?![89])){2,})", ">$0<"));
    
        System.out.println();
    
        inputText = "aab";
        System.out.println("Input string: " + inputText);
        System.out.println("Expected: " + inputText.replaceFirst(".{1,3}+b", ">$0<"));
        System.out.println("Outer possessive group: " + inputText.replaceFirst("(?>.{1,3})b", ">$0<"));
        System.out.println("Inner possessive group: " + inputText.replaceFirst("(?>.){1,3}b", ">$0<"));
        System.out.println("Both: " + inputText.replaceFirst("(?>(?>.){1,3})b", ">$0<"));
      }
    }
    

    It seems like this is intended in Oniguruma. Documentation says {n,m}+, {n,}+, {n}+ are possessive op. in ONIG_SYNTAX_JAVA only {n,m}+, {n,}+, {n}+ are possessive op. in ONIG_SYNTAX_JAVA only . I guess this is because of backward compatibility reasons, or?

    链接地址: http://www.djcxy.com/p/76984.html

    上一篇: 惰性量词在这个特定的正则表达式中有什么不同?

    下一篇: 在Ruby 1.9.3中没有实现的所有通用量词{m,n} +?