Web scraping a Wikipedia page

2018-06-22 09:31:03

In some Wikipedia articles, the bold title is followed by text in parentheses that explains the pronunciation and phonetics of the words in the title. For example, on the Diglossia page, the bold title "diglossia" in the first <p> is followed by an open parenthesis. Finding the corresponding close parenthesis means iterating through the text nodes one by one, which is simple. What I'm trying to do is find the very next link (href) after that close parenthesis and store it.

The issue is that, AFAIK, there is no way to uniquely identify the text node containing the close parenthesis and then get the href that follows it. Is there a straightforward (not convoluted) way to get the first link outside the initial parentheses?

EDIT

For the page linked above, the href to store should be https://en.wikipedia.org/wiki/Dialects, since that is the first link outside the parentheses.

Is this what you want?

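import requests
# BeautifulSoup 3 import; with the newer bs4 package use
# "from bs4 import BeautifulSoup" instead.
from BeautifulSoup import BeautifulSoup

# Note: verify=False disables TLS certificate verification.
rs = requests.get('https://en.wikipedia.org/wiki/Diglossia', verify=False)
parsed_html = BeautifulSoup(rs.text)
# First <a> tag inside the first <p> of the page body
print(parsed_html.body.findAll('p')[0].findAll('a')[0])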

This gives:

<a href="/wiki/Linguistics" title="Linguistics">linguistics</a>

If you want to extract only the href, look the attribute up by name:

parsed_html.body.findAll('p')[0].findAll('a')[0]['href']
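The href shown in the output above is relative (/wiki/Linguistics), while the question expects a full URL. A minimal sketch of resolving it against the page address, assuming Python 3 (where `urljoin` lives in `urllib.parse`; in Python 2 the same function is in the `urlparse` module). The variable names are only illustrative:

```python
from urllib.parse import urljoin

# Address of the page that was scraped
page_url = 'https://en.wikipedia.org/wiki/Diglossia'
# Relative href, copied from the <a> tag shown in the output above
relative_href = '/wiki/Linguistics'

# Resolve the relative href against the page URL
absolute_url = urljoin(page_url, relative_href)
print(absolute_url)  # https://en.wikipedia.org/wiki/Linguistics
```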

UPDATE: It seems you want the href after the parentheses, not the one before them. I have written a script for that. Try this:

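import requests
# BeautifulSoup 3 import; with the newer bs4 package use
# "from bs4 import BeautifulSoup" instead.
from BeautifulSoup import BeautifulSoup

# Note: verify=False disables TLS certificate verification.
rs = requests.get('https://en.wikipedia.org/wiki/Diglossia', verify=False)
parsed_html = BeautifulSoup(rs.text)

# Start at the first paragraph and walk the parse tree in document order.
temp = parsed_html.body.findAll('p')[0]

start_count = 0   # number of currently open parentheses
started = False   # True once the first '(' has been seen
found = False     # True once the matching ')' has been reached

while temp.next and not found:
    temp = temp.next
    if '(' in temp:
        start_count += 1
        started = True
    if ')' in temp and started and start_count > 1:
        start_count -= 1
    elif ')' in temp and started and start_count == 1:
        found = True

# temp is now the node that closes the initial parentheses;
# print the href of the first link after it.
print(temp.findNext('a')['href'])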
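The same parenthesis-counting idea can also be sketched with only the Python 3 standard library, which makes it easy to try on a small HTML string without fetching the page. This is an illustrative variant, not the script above: the class `FirstLinkAfterParens`, the helper `first_link_after_parens`, and the sample markup are all hypothetical, and it counts every parenthesis character instead of counting once per text node.

```python
from html.parser import HTMLParser


class FirstLinkAfterParens(HTMLParser):
    """Hypothetical helper: record the href of the first <a> that appears
    after the first balanced parenthesised span following the first <p>."""

    def __init__(self):
        super().__init__()
        self.started = False  # the first <p> has been opened
        self.depth = 0        # current parenthesis nesting depth
        self.closed = False   # the initial parentheses have been closed
        self.href = None      # result, or None if nothing was found

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.started = True
        elif tag == 'a' and self.closed and self.href is None:
            # attrs is a list of (name, value) pairs
            self.href = dict(attrs).get('href')

    def handle_data(self, data):
        if not self.started or self.closed:
            return
        for ch in data:
            if ch == '(':
                self.depth += 1
            elif ch == ')' and self.depth > 0:
                self.depth -= 1
                if self.depth == 0:
                    # The outermost parenthesis has just been closed.
                    self.closed = True
                    break


def first_link_after_parens(html):
    parser = FirstLinkAfterParens()
    parser.feed(html)
    return parser.href


# Made-up markup that imitates the shape of the article's first paragraph.
sample = ('<p><b>Diglossia</b> (<a href="/wiki/IPA">/x/</a>; from '
          '<a href="/wiki/Greek">Greek</a> (di)) is a situation in which two '
          '<a href="/wiki/Dialects">dialects</a> are used.</p>')
print(first_link_after_parens(sample))  # /wiki/Dialects
```

Links inside the parentheses (including nested ones) are skipped because the depth has not yet returned to zero when they are seen. If the text never contains a balanced pair of parentheses, the helper returns None.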
