Java regex to remove an HTML tag and its content

Possible Duplicate:
How to remove HTML tag in Java
RegEx match open tags except XHTML self-contained tags

I want to remove a specific HTML tag together with its content.

For example, given this HTML:

<span style='font-family:Verdana;mso-bidi-font-family:
"Times New Roman";display:none;mso-hide:all'>contents</span>

If the tag contains "mso-*", the whole element must be removed (opening tag, closing tag and content).


As Dave Newton pointed out in his comment, an HTML parser is the way to go here. If you really want to do it the hard way, here's a regex that works:

    String html = "FOO<span style='font-family:Verdana;mso-bidi-font-family:"
        + "\"Times New Roman\";display:none;mso-hide:all'>contents</span>BAR";
    // regex matches every opening tag that contains 'mso-' in an attribute name
    // or value, the contents and the corresponding closing tag;
    // (\\S+) captures the tag name and \\1 refers back to it in the closing tag
    String regex = "<(\\S+)[^>]+?mso-[^>]*>.*?</\\1>";
    String replacement = "";
    System.out.println(html.replaceAll(regex, replacement)); // prints FOOBAR
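
If the element's content can span several lines, or the tag names are written in upper case, the same regex can be precompiled with flags. The following is only a minimal sketch under those assumptions: the class name `MsoTagStripper` and the method `strip` are hypothetical names chosen for illustration, and the `DOTALL` and `CASE_INSENSITIVE` flags are additions that are not part of the regex above.

```java
import java.util.regex.Pattern;

public class MsoTagStripper {
    // Hypothetical helper, not part of the original answer.
    // DOTALL lets '.' cross line breaks inside the element's content.
    // CASE_INSENSITIVE makes "mso-" and the \1 backreference ignore case.
    private static final Pattern MSO_ELEMENT = Pattern.compile(
            "<(\\S+)[^>]+?mso-[^>]*>.*?</\\1>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Removes every element whose opening tag contains "mso-".
    public static String strip(String html) {
        return MSO_ELEMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "FOO<span style='display:none;\nmso-hide:all'>"
                + "line one\nline two</span>BAR";
        System.out.println(strip(html)); // prints FOOBAR
    }
}
```

The match still stops at the first closing tag with the same name, so an element that contains a nested element of the same type is only partially removed. That is one of the reasons a real HTML parser is the safer choice.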