Finding direct child of an element

2018-06-22 09:32:05

I'm writing a solution to test this phenomenon in Python. I have most of the logic done, but there are many edge cases that arise when following links in Wikipedia articles.

The problem I'm running into arises for a page like this where the first <p> has multiple levels of child elements and the first <a> tag after the first set of parentheses needs to be extracted. In this case, (to extract this link), you have to skip over the parentheses, and then get to the very next anchor tag/href. In most articles, my algorithm can skip over the parentheses, but with the way that it looks for links in front of parentheses (or if they don't exist), it is finding the anchor tag in the wrong place. Specifically, here: <span style="font-size: small;"><span id="coordinates"><a href="/wiki/Geographic_coordinate_system" title="Geographic coordinate system">Coordinates</a>

The algorithm works by iterating through the elements in the first paragraph tag (in the main body of the article), stringifying each element iteratively, and first checking to see if it contains either an '(' or an '

Is there any straight forward way to avoid embedded anchor tags and only take the first link that is a direct child of the first <p> ?

Below is the function with this code for reference:

**def getValidLink(self, currResponse):
        currRoot = BeautifulSoup(currResponse.text,"lxml")
        temp = currRoot.body.findAll('p')[0]
        parenOpened = False
        parenCompleted = False
        openCount = 0
        foundParen = False
        while temp.next:
            temp = temp.next
            curr = str(temp)
            if '(' in curr and str(type(temp)) == "<class 'bs4.element.NavigableString'>":
                foundParen = True
                break
            if '<a' in curr and str(type(temp)) == "<class 'bs4.element.Tag'>":
                link = temp
                break

        temp = currRoot.body.findAll('p')[0]
        if foundParen:
            while temp.next and not parenCompleted:
                temp = temp.next
                curr = str(temp)
                if '(' in curr:
                    openCount += 1
                    if parenOpened is False:
                        parenOpened = True
                if ')' in curr and parenOpened and openCount > 1:
                    openCount -= 1
                elif ')' in curr and parenOpened and openCount == 1:
                    parenCompleted = True
            try:
                return temp.findNext('a').attrs['href']
            except KeyError:
                print "nReached article with no main body!n"
                return None
        try:
            return str(link.attrs['href'])
        except KeyError:
            print "nReached article with no main bodyn"
            return None**

I think you are seriously overcomplicating the problem.

There are multiple ways to use the direct parent-child relationship between the elements in BeautifulSoup . One way is the > CSS selector:

In [1]: import requests  

In [2]: from bs4 import BeautifulSoup   

In [3]: url = "https://en.wikipedia.org/wiki/Sierra_Leone"    

In [4]: response = requests.get(url)    

In [5]: soup = BeautifulSoup(response.content, "html.parser")

In [6]: [a.get_text() for a in soup.select("#mw-content-text > p > a")]
Out[6]: 
['West Africa',
 'Guinea',
 'Liberia',
 ...
 'Allen Iverson',
 'Magic Johnson',
 'Victor Oladipo',
 'Frances Tiafoe']

Here we've found a elements that are located directly under the p elements directly under the element with id="mw-content-text" - from what I understand this is where the main Wikipedia article is located in.

If you need a single element, use select_one() instead of select() .

Also, if you want to solve it via find*() , pass the recursive=False argument.

链接地址: http://www.djcxy.com/p/62854.html

上一篇: 在iPhone锁定屏幕上显示CFUserNotificationDisplayAlert

下一篇: 找到元素的直接子元素