Can't access nodes in XHTML document with multiple namespaces through XPath
Okay, so I'm trying to parse an XHTML site with curl and XPath.
The site has multiple namespaces:
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:addthis="http://www.addthis.com/help/api-spec" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
I am trying to get all the urls from the site's pagination like this:
$url = [site_im_parsing];
$dom = new DOMDocument();
@$dom->loadHTML($url);  
$xpath = new DOMXPath($dom);
$xpath->registerNamespace("x", "http://www.w3.org/1999/xhtml");
$pages = $xpath->query('//x:div[2]/x:table/x:tbody/x:tr/x:td[1]/x:a');
for ($i = 0; $i < $pages->length; $i++) {
    echo $pages->item($i)->getAttribute('href');
}
This doesn't work. (The XPath to the pagination, without the x prefix, should be right.) Should I register all the namespaces and somehow use them all in the XPath query?
Best regards, AB
// question update //
This is the part of the page I'm trying to parse (I want the href values):
<div class="pager">
    <table style="width:100%" border="0" cellspacing="0" cellpadding="0">
        <tbody>
            <tr>
                <td>
                    <span class="current">1</span>  | 
                    <a href="http://www.somewebsite.com/catalog?on_offer=1&commodity_offset=1">2</a> | 
                    <a href="http://www.somewebsite.com/catalog?on_offer=1&commodity_offset=2">3</a> | 
                    <a href="http://www.somewebsite.com/catalog?on_offer=1&commodity_offset=3">4</a> | 
                    <a href="http://www.somewebsite.com/catalog?on_offer=1&commodity_offset=4">5</a> | 
                    <a href="http://www.somewebsite.com/catalog?on_offer=1&commodity_offset=5">6</a> | 
                    <a href="http://www.somewebsite.com/catalog?on_offer=1&commodity_offset=6">7</a>        
                </td>
                <td style="text-align:right">
                    <a href="http://www.somewebsite.com/catalog?on_offer=1&commodity_offset=1">Next</a>
                </td>
            </tr>
        </tbody>
    </table>
</div>
The doctype is:
html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
if that makes any difference...
With loadHTML I would expect any namespaces to be ignored, so try not calling registerNamespace at all and query without prefixes: $xpath->query('//div[2]/table/tbody/tr/td[1]/a'); As an alternative, parse the document as XML; then using namespaces makes sense.
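To illustrate the two options above, here is a minimal sketch. The $markup string is a made-up stand-in for the real page (it is not the actual site), and the simplified paths //div/a and //x:div/x:a are only for demonstration. It assumes the standard DOMDocument behaviour where the HTML parser leaves elements in no namespace, while the XML parser honours the default XHTML namespace.

```php
<?php
// Hypothetical sample markup standing in for the real page.
$markup = '<html xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://ogp.me/ns#">'
        . '<head><title>t</title></head>'
        . '<body><div class="pager"><a href="/catalog?page=2">2</a></div></body></html>';

// Option 1: HTML parser. Namespace declarations are not applied,
// so unprefixed element names match.
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$htmlDoc = new DOMDocument();
$htmlDoc->loadHTML($markup);
libxml_clear_errors();

$htmlXpath = new DOMXPath($htmlDoc);
$htmlCount = $htmlXpath->query('//div/a')->length;

// Option 2: XML parser. Elements live in the XHTML namespace,
// so a registered prefix is required in the query.
$xmlDoc = new DOMDocument();
$xmlDoc->loadXML($markup);

$xmlXpath = new DOMXPath($xmlDoc);
$xmlXpath->registerNamespace('x', 'http://www.w3.org/1999/xhtml');
$prefixedCount   = $xmlXpath->query('//x:div/x:a')->length;
$unprefixedCount = $xmlXpath->query('//div/a')->length;

echo "HTML parser, no prefix: $htmlCount\n";
echo "XML parser, x: prefix:  $prefixedCount\n";
echo "XML parser, no prefix:  $unprefixedCount\n";
```

Only the XHTML namespace needs a prefix here, because the other declared namespaces (og, fb, addthis) are not used by the elements in the path. Note that loadXML only works if the page is well-formed XML, which real-world XHTML often is not.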
Okay, I figured it out...
The browser inserts an implicit <tbody> tag when it is not present in the document. XPath only sees the tags present in the raw HTML string, so I just left the tbody step out of the query.
Old xpath query:
$xpath->query('//div[2]/table/tbody/tr/td[1]/a');
New:
$xpath->query('//div[2]/table/tr/td[1]/a');
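If you would rather not depend on whether tbody is in the source at all, one option is to use the descendant axis between table and tr. Below is a minimal sketch; pagerHrefs() is a hypothetical helper name, the markup is a shortened copy of the pager from the question, and selecting the div by its class instead of by position is my own assumption about what is stable on the page.

```php
<?php
// Hypothetical helper: returns the pager link targets from an HTML string.
function pagerHrefs(string $html): array
{
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $hrefs = [];
    // "table//tr" matches a tr directly under table or nested inside tbody.
    foreach ($xpath->query('//div[@class="pager"]/table//tr/td[1]/a') as $a) {
        $hrefs[] = $a->getAttribute('href');
    }
    return $hrefs;
}

// Shortened sample of the pager as it appears in the raw source (no tbody).
$raw = '<html><body><div class="pager"><table><tr>'
     . '<td><span class="current">1</span> | '
     . '<a href="http://www.somewebsite.com/catalog?on_offer=1&amp;commodity_offset=1">2</a> | '
     . '<a href="http://www.somewebsite.com/catalog?on_offer=1&amp;commodity_offset=2">3</a></td>'
     . '<td><a href="http://www.somewebsite.com/catalog?on_offer=1&amp;commodity_offset=1">Next</a></td>'
     . '</tr></table></div></body></html>';

// The same markup as a browser inspector would show it (with tbody).
$withTbody = str_replace(
    ['<table>', '</table>'],
    ['<table><tbody>', '</tbody></table>'],
    $raw
);

$fromRaw   = pagerHrefs($raw);
$fromTbody = pagerHrefs($withTbody);
print_r($fromRaw);
```

Both variants should give the same list of links, so the query keeps working whether the XPath was written against the raw source or copied from the browser's element inspector.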
