How to parse a div using regular expressions in Java?

Possible Duplicate:
RegEx match open tags except XHTML self-contained tags

I am having trouble parsing a tag using Java.

Goal:

My goal is to extract a complete div tag with all of its contents, even if it contains nested tags.

For example, from this HTML:

<h2>some random text</h2>
<div id="outerDiv">
  some text
      <div>
          some more text
      </div>
  last text
</div>
<div> some random div <b>bold</b></div>

I want to extract the outer div with all of its inner contents, up to its closing tag, that is:

<div id="outerDiv">
      some text
          <div>
              some more text
          </div>
      last text
    </div>

But what I currently get is either in this form or some other random format (depending on the changes I try in my expression :) ).

Please help me improve my regex so that it matches a div with a specific id, along with all of its contents.

Here is my expression (a lot of brackets, just to be on the safe side :) ):

((<div.*(class="afs")(.)*?>)((.)*?)(((<div(.)*?>)((.)*?)((</div>){1}))*?)((</div>){1}))

Here is my Java code:

package rexp;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Rexp {

    public static void main(String[] args) {

        // Double quotes inside a Java string literal must be escaped as \"
        CharSequence inputStr = "asdasd<div class=\"af\">sasa<div><div><div class=\"afs\">as</div>qwessa</div></div></div>asd";

        Pattern pattern = Pattern.compile("((<div.*(class=\"afs\")(.)*?>)((.)*?)(((<div(.)*?>)((.)*?)((</div>){1}))*?)((</div>){1}))");
        Matcher matcher = pattern.matcher(inputStr);

        if (matcher.find()) {
            System.out.println("Matched "+matcher.group(1));
        } else {
            System.out.println("Not Matched");
        }
    }
}

I think a regex is the wrong tool for this. I would consider using a lexer/parser library, or just using a third-party HTML parsing library. A quick Google search turns up several.


Regular expressions are not suitable for HTML parsing, since HTML is not a regular language. You would be better off using a proper HTML parser library, such as jsoup or JTidy.

See also this question for more Java HTML parser references.
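
To illustrate why a single regex falls short here: matching a div together with its nested divs requires counting open and close tags, which a regular expression alone cannot do. The following is only a minimal sketch using the standard library, not a replacement for a real parser such as jsoup. The class name `DivExtractor` and the method `extractDiv` are hypothetical names made up for this example. It assumes reasonably well-formed markup and ignores comments, scripts, self-closing divs, and attribute values that contain `>`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DivExtractor {

    // Matches either an opening <div ...> tag or a closing </div> tag.
    private static final Pattern DIV_TAG =
            Pattern.compile("<div\\b[^>]*>|</div\\s*>", Pattern.CASE_INSENSITIVE);

    /**
     * Returns the first div whose opening tag contains attrFragment,
     * including everything up to its matching closing tag,
     * or null if no such div (or no matching close) is found.
     */
    public static String extractDiv(String html, String attrFragment) {
        Matcher m = DIV_TAG.matcher(html);
        int start = -1; // index of the target opening tag, -1 until found
        int depth = 0;  // number of divs currently open, counted from the target
        while (m.find()) {
            String tag = m.group();
            boolean closing = tag.startsWith("</");
            if (start < 0) {
                // Still searching for the opening tag we care about.
                if (!closing && tag.contains(attrFragment)) {
                    start = m.start();
                    depth = 1;
                }
            } else if (closing) {
                depth--;
                if (depth == 0) {
                    // This closing tag balances the target opening tag.
                    return html.substring(start, m.end());
                }
            } else {
                depth++; // a nested div opened inside the target
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String html = "asdasd<div class=\"af\">sasa<div><div><div class=\"afs\">as</div>"
                + "qwessa</div></div></div>asd";
        System.out.println(extractDiv(html, "class=\"af\""));
    }
}
```

The regex is used only to find individual tags; the nesting is tracked by the `depth` counter in plain Java. For real-world HTML, a dedicated parser library remains the safer choice.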
