Getting a particular image from Wikipedia with BeautifulSoup

I'm trying to get a particular image from some Wikipedia pages using BeautifulSoup 4 with lxml as the parser. For example, I'm trying to get the album cover on the right of this Wikipedia page: http://en.wikipedia.org/wiki/Animal_House_(UDO_album)

The function that does the scraping is this:

import requests
from bs4 import BeautifulSoup

def get_cover_from_wikipedia(url):
    r = requests.get(url)
    if r.status_code == 200:
        soup = BeautifulSoup(r.content, 'lxml')
        # Every <a class="image"> on the page, not just the cover
        elements = soup.find_all('a', class_='image')
        for element in elements:
            print('%s\n\n' % element.prettify())

    return False

The output of the print is as follows:

<a class="image" href="/wiki/File:Question_book-new.svg">
 <img alt="" data-file-height="204" data-file-width="262" height="39" src="//upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/50px-Question_book-new.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/75px-Question_book-new.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/100px-Question_book-new.svg.png 2x" width="50"/>
</a>

<a class="image" href="/wiki/File:UDO_animal_house.jpg">
 <img alt="" data-file-height="302" data-file-width="300" height="221" src="//upload.wikimedia.org/wikipedia/en/thumb/4/4e/UDO_animal_house.jpg/220px-UDO_animal_house.jpg" srcset="//upload.wikimedia.org/wikipedia/en/4/4e/UDO_animal_house.jpg 1.5x, //upload.wikimedia.org/wikipedia/en/4/4e/UDO_animal_house.jpg 2x" width="220"/>
</a>

The image I want to pull out is the one in the second block that starts with <a class... , not the book image in the first block.

What I want to accomplish here is:

  • I only want to get the links specified with src, not everything that comes with the class.

  • I want to be able to distinguish between the book image and the image I want to pull out. The book image is there because, if you check the Wikipedia page, it says the article needs citations, and a book image accompanies that notice. It matches my search for tag a and class image, but it may or may not be there depending on the article in question.

  • What's the best way to get only the image I'm interested in, which is the image on the right side of the article?


    Your search is not specific enough. The book image is nested in a metadata table:

    <table class="metadata plainlinks ambox ambox-content ambox-Refimprove" role="presentation">
    

    while the album cover is nested inside another:

    <table class="infobox vevent haudio" style="width:22em">
    

    Use that to your advantage.

    Using the CSS selector support makes this trivial:

    covers = soup.select('table.infobox a.image img[src]')
    for cover in covers:
        print(cover['src'])
    

    The CSS selector asks for <img> tags with a src attribute, provided they are nested in a <a class="image"> element, inside a <table class="infobox"> element. There is but one such image:

    >>> from bs4 import BeautifulSoup
    >>> import requests
    >>> r = requests.get('http://en.wikipedia.org/wiki/Animal_House_(U.D.O._album)')
    >>> soup = BeautifulSoup(r.content)
    >>> covers = soup.select('table.infobox a.image img[src]')
    >>> for cover in covers:
    ...     print(cover['src'])
    ... 
    //upload.wikimedia.org/wikipedia/en/thumb/4/4e/UDO_animal_house.jpg/220px-UDO_animal_house.jpg
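
For reuse, the selector above could be wrapped in a small helper that returns the cover URL instead of printing it. This is only a sketch: `extract_cover_url` and `SAMPLE_HTML` are hypothetical names, the markup is a trimmed stand-in for the page quoted in the question so the example runs offline, and `html.parser` is used instead of lxml purely to avoid an extra dependency. Because the `src` values are protocol-relative, `urljoin` is used to turn them into absolute URLs.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Trimmed-down stand-in for the article markup quoted in the question (assumption).
SAMPLE_HTML = """
<table class="metadata plainlinks ambox ambox-content ambox-Refimprove">
 <tr><td><a class="image" href="/wiki/File:Question_book-new.svg">
  <img src="//upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/50px-Question_book-new.svg.png"/>
 </a></td></tr>
</table>
<table class="infobox vevent haudio">
 <tr><td><a class="image" href="/wiki/File:UDO_animal_house.jpg">
  <img src="//upload.wikimedia.org/wikipedia/en/thumb/4/4e/UDO_animal_house.jpg/220px-UDO_animal_house.jpg"/>
 </a></td></tr>
</table>
"""


def extract_cover_url(html, page_url):
    """Return the absolute URL of the first infobox image, or None if there is none."""
    soup = BeautifulSoup(html, 'html.parser')
    # Same selector as in the answer; select_one returns the first match or None.
    img = soup.select_one('table.infobox a.image img[src]')
    if img is None:
        return None
    # The src is protocol-relative ("//upload..."); urljoin takes the scheme from page_url.
    return urljoin(page_url, img['src'])


cover_url = extract_cover_url(
    SAMPLE_HTML, 'http://en.wikipedia.org/wiki/Animal_House_(U.D.O._album)')
print(cover_url)
```

Returning None when no infobox image is found lets the caller distinguish "no cover" from a real URL, which matters since not every article has an infobox.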
    

    Well, you've already got 99% of what you want, so that's the main thing. My first thought is to tighten your filter a little. If this is a one-off case and you don't need the program to work across many pages, the 'text' argument of BeautifulSoup.find_all() may help you:

    if r.status_code == 200:
        soup = BeautifulSoup(r.content, 'lxml')
        elements = soup.find_all('a', text='.jpg', class_='image')
        for element in elements:
            print('%s\n\n' % element.prettify())

    return False
    

    As your target image is the only .jpg file on the page, this should narrow things down. You've probably already looked, but the documentation may be useful if you get stuck: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all
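
One caveat worth checking: the text argument matches a tag's string content, and the <a class="image"> tags shown in the question contain only a nested <img>, so the filter above may return nothing. A variant in the same spirit is to match the href attribute against a regular expression instead. The sketch below is an assumption-laden illustration, not the answerer's code: `find_jpg_image_srcs` and `SAMPLE_HTML` are hypothetical names, and the markup is cut down from the output in the question so it runs without a network call.

```python
import re

from bs4 import BeautifulSoup

# Cut-down copy of the two blocks printed in the question (assumption).
SAMPLE_HTML = """
<a class="image" href="/wiki/File:Question_book-new.svg">
 <img src="//upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/50px-Question_book-new.svg.png"/>
</a>
<a class="image" href="/wiki/File:UDO_animal_house.jpg">
 <img src="//upload.wikimedia.org/wikipedia/en/thumb/4/4e/UDO_animal_house.jpg/220px-UDO_animal_house.jpg"/>
</a>
"""


def find_jpg_image_srcs(html):
    """Return the src of every image whose file-page link ends in .jpg or .jpeg."""
    soup = BeautifulSoup(html, 'html.parser')
    # Filter on the href attribute rather than on the tag's (empty) text.
    links = soup.find_all('a', class_='image',
                          href=re.compile(r'\.jpe?g$', re.IGNORECASE))
    return [link.img['src'] for link in links
            if link.img is not None and link.img.has_attr('src')]


jpg_srcs = find_jpg_image_srcs(SAMPLE_HTML)
for src in jpg_srcs:
    print(src)
```

This still relies on the cover being the only JPEG linked on the page, so it is more fragile than scoping the search to the infobox table.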
