Greedy Quantifiers

I was reading K.Sierra and found the following sentance:

The greedy quantifier does in fact read the entire source data, and then it works backward (from the right) until it finds the rightmost match. At that point, it includes everything from earlier in the source data up to and including the data that is part of the rightmost match.

Now, Suppose we have a source as follows:

"proj3.txt,proj1sched.pdf,proj1,proj2,proj1.java"

and pattern: proj1([^,])*

why doesn't it match the whole text? Being greedy it should have match the rightmost "proj1.java" and the returned match should have been the entire source before the right most match? Instead it returns:

proj1sched.pdf
proj1
proj1.java

why doesn't it match the whole text?

Because you stated it must start with proj1

Being greedy it should have match the rightmost "proj1.java"

correct.

and the returned match should have been the entire source before the right most match?

no idea why you would think that, or why that would be useful. You can just do .*proj1.* if that is what you want.


why doesn't it match the whole text?

Oh, it tried. But it found the sequence p , r , o , j , 1 in one spot, then went on to find zero or more characters not being a comma, therefore matching . , p , d , f . And it stopped there, since the next character, being a comma, did not match [^,] .

Note that the next matching attempt will start at the next character, ie r , and so on until it finds a p ; when it finds one, it will try r , etc etc.

The regex being satisfied in its entirety, the engine decided that it was a success and didn't try any further, even though there are matches beyond that point.

The text matched is therefore proj1.pdf . And not the entire input. Regular expressions are lazy, they only match what they need to match and never go any further.

BUT. And this is where it gets interesting. Some engines don't work this way.

Consider the regex cat(|flap) and the input text catflap . POSIX has had a go at regex engines, and dictated that a regex engine should match the leftmost, longest match.

So, if a regex engine obeys POSIX, it should match catflap . But most regex engines in existence, here, will only match cat : the empty alternation matches first, the regex is satisfied, end of story!

Now, to the core of the question: quantifiers are of three types, greedy, lazy and possessive:

  • greedy: the default, * ;
  • lazy, aka the overused: *? ;
  • possessive: *+ .
  • A greedy quantifier will try and match as much text as it can, and give back only when it has to; a lazy quantifier will try and match as little text as it can; a possessive quantifier will try and match as much text as it can, and it will not give text back.

    Illustration: here is the input text:

    The answer to everything is 42, says the mouse
    

    Here are three regexes to match this text, with a capturing group:

  • .*(d+) (greedy);
  • .*?(d+) (lazy);
  • .*+(d+) (possessive).
  • Question: what will the group capture in each of these expressions? Answer:

  • the first: 2 ;
  • the second: 42 ;
  • the third: will not even match the text! .*+ will have swallowed everything but will not give back, d+ is therefore left with nothing which can match, regex failure.

  • We have proj1([^,])* in which -

    ([^,])* means it will concatenate sub-string of any combination of characters (occurred zero or more times), which does not consists of char ',' with string "proj1" like: "sched.pdf" or " " or ".java" all three does not include a ',' character. Hence the result.

    链接地址: http://www.djcxy.com/p/76932.html

    上一篇: java正则表达式给予贪婪量词的勉强量词

    下一篇: 贪婪的量词