Web scraping a Wikipedia page

On some Wikipedia pages, after the title of the article (appearing in bold), there is some text inside parentheses used to explain the pronunciation and phonetics of the words in the title. For example, on this page, after the bold title diglossia in the <p>, there is an open parenthesis. To find the corresponding close parenthesis, you would have to iterate through the text nodes one by one until you hit it, which is simple enough. What I want to do is find the next href link after it and store it. The problem is that (AFAIK) there is no way to uniquely identify the text node containing the open parenthesis and then get the following href. Is there any direct (as opposed to convoluted) way to get…
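One workaround is to track parenthesis depth across the text nodes and capture the first href that appears once the depth returns to zero. A minimal sketch using only the standard library's HTMLParser (rather than any particular scraping library); the sample HTML snippet below is invented to stand in for the real article markup:

```python
from html.parser import HTMLParser

class LinkAfterParens(HTMLParser):
    """Find the first href that appears after the parenthesised
    pronunciation block that follows the bold article title."""
    def __init__(self):
        super().__init__()
        self.depth = 0          # current parenthesis nesting depth
        self.seen_open = False  # entered the parenthesised block yet?
        self.result = None      # first href found after the block closes

    def handle_data(self, data):
        if self.result is not None:
            return
        for ch in data:
            if ch == "(":
                self.depth += 1
                self.seen_open = True
            elif ch == ")" and self.depth > 0:
                self.depth -= 1

    def handle_starttag(self, tag, attrs):
        # Once the parentheses are balanced again, grab the next <a href=...>
        if (self.result is None and self.seen_open
                and self.depth == 0 and tag == "a"):
            self.result = dict(attrs).get("href")

html = ('<p><b>Diglossia</b> (from <a href="/wiki/Greek">Greek</a> '
        'διγλωσσία) refers to <a href="/wiki/Sociolinguistics">two</a> '
        'varieties.</p>')
parser = LinkAfterParens()
parser.feed(html)
print(parser.result)  # /wiki/Sociolinguistics
```

Note that the link inside the parentheses (the Greek one) is correctly skipped, because the depth is still nonzero when its start tag is seen.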

Getting all Wikipedia articles with coordinates inside London

Generally I want to get the links (and titles) of all Wikipedia articles with coordinates inside London. I tried using Google, but unfortunately didn't come up with proper search terms. Any hints? This is really just a collection of ideas that was too big for a comment. Your best bet is probably DBpedia. It's a semantic mirror of Wikipedia, with much more sophisticated query possibilities than Wikipedia's own API. As you can see in this paper, it can handle fairly complex spatial queries, but you will need to get into SPARQL. Here is a figure from that paper: That said, Wiki…
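As a sketch of what such a SPARQL query might look like, the function below builds one against the W3C Basic Geo predicates (geo:lat, geo:long) that DBpedia exposes, using a crude bounding box; the bounds for Greater London are an assumption, and a proper radius or polygon query would be more accurate:

```python
def london_coords_query(lat_min=51.28, lat_max=51.70,
                        lon_min=-0.51, lon_max=0.33, limit=100):
    """Build a SPARQL query for DBpedia resources whose geo
    coordinates fall inside a bounding box (roughly Greater London;
    adjust the bounds as needed)."""
    return f"""
    PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?article ?title ?lat ?long WHERE {{
        ?article geo:lat ?lat ; geo:long ?long ; rdfs:label ?title .
        FILTER (?lat  > {lat_min} && ?lat  < {lat_max} &&
                ?long > {lon_min} && ?long < {lon_max} &&
                lang(?title) = "en")
    }} LIMIT {limit}
    """

query = london_coords_query()
# The string can then be POSTed to the http://dbpedia.org/sparql
# endpoint, e.g. with urllib.request or the SPARQLWrapper package.
print(query)
```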

Getting a particular image from Wikipedia with BeautifulSoup

I'm trying to get a particular image from some Wikipedia pages by using BeautifulSoup 4 with lxml as the parser. For example, I'm trying to get the album cover on the right from this Wikipedia page: http://en.wikipedia.org/wiki/Animal_House_(UDO_album) The function that does the scraping is this: def get_cover_from_wikipedia(url): r = requests.get(url) if r.status_code == 200: soup = BeautifulSoup(r.content, 'lxml') elements = soup.find_all('a', class_='ima…
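Since the cover art lives inside the article's infobox table, one approach is to target the first <img> inside a table whose class contains "infobox" (in BeautifulSoup terms, roughly soup.find('table', class_='infobox').find('img')['src']). A stdlib-only sketch of the same idea, so it runs without bs4; the HTML snippet is an invented stand-in for the real page:

```python
from html.parser import HTMLParser

class InfoboxImage(HTMLParser):
    """Grab the src of the first <img> inside a table whose class
    contains 'infobox' -- where Wikipedia keeps the cover art."""
    def __init__(self):
        super().__init__()
        self.in_infobox = 0   # nesting counter for infobox tables
        self.src = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table" and "infobox" in attrs.get("class", ""):
            self.in_infobox += 1
        elif tag == "img" and self.in_infobox and self.src is None:
            self.src = attrs.get("src")

    def handle_endtag(self, tag):
        if tag == "table" and self.in_infobox:
            self.in_infobox -= 1

html = ('<table class="infobox vevent"><tr><td>'
        '<img src="//upload.wikimedia.org/cover.jpg"></td></tr></table>')
parser = InfoboxImage()
parser.feed(html)
print(parser.src)  # //upload.wikimedia.org/cover.jpg
```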

Wikipedia philosophy game diagram in Python and R

So I am relatively new to Python, and in order to learn, I have started writing a program that goes to Wikipedia, finds the first link in the overview section of a random article, follows that link, and keeps going until it either enters a loop or finds the Philosophy page (as detailed here), then repeats this process for a new random article a specified number of times. I then want to collect the results in some form of useful data structure, so that I can pass the data to R using the Rpy library and draw some kind of network diagram (R is very good at drawing things like that), with each node in the diagram representing a visited page, and from the starting…
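Leaving the scraping aside, the chain-following and loop-detection part can be sketched as a pure function over a mapping from page title to first linked title; the toy graph and its titles below are invented, and in the real program the mapping would be filled in by the scraper (the resulting path edges are what you would hand to R for plotting):

```python
def follow_first_links(start, first_link, target="Philosophy", max_steps=100):
    """Follow the 'first link' chain from start until we reach target,
    revisit a page (a loop), hit a dead end, or give up.
    Returns the path walked as a list of titles."""
    path, seen = [start], {start}
    while path[-1] != target and len(path) <= max_steps:
        nxt = first_link.get(path[-1])
        if nxt is None or nxt in seen:   # dead end or loop detected
            break
        seen.add(nxt)
        path.append(nxt)
    return path

# Toy graph standing in for scraped first-link data:
graph = {"Penguin": "Bird", "Bird": "Animal", "Animal": "Philosophy"}
print(follow_first_links("Penguin", graph))
# ['Penguin', 'Bird', 'Animal', 'Philosophy']
```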

Extract the first paragraph from a Wikipedia article (Python)

How can I extract the first paragraph from a Wikipedia article, using Python? For example, for Albert Einstein, that would be: Albert Einstein (pronounced /ˈælbərt ˈaɪnstaɪn/; German: [ˈalbɐt ˈaɪnʃtaɪn] (listen); 14 March 1879 – 18 April 1955) was a theoretical physicist, philosopher and author who is widely regarded as one of the most influential and iconic scientists and intellectuals of all time. A German-Swiss Nobel laureate, Einstein is often regarded as the father of modern physics.[2] He received the 1921 Nobel Prize in Physics "for his services to theoretical…
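One clean way is to ask the MediaWiki API for just the lead section via the TextExtracts extension (the exintro and explaintext parameters), rather than scraping the rendered page. A sketch that only builds the request URL; the actual fetch needs network access and is shown commented out:

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def intro_url(title):
    """Build a MediaWiki API request for the lead section of an
    article, as plain text (uses the TextExtracts extension)."""
    params = {
        "action": "query",
        "prop": "extracts",
        "exintro": 1,        # only the text before the first heading
        "explaintext": 1,    # strip HTML, return plain text
        "titles": title,
        "format": "json",
    }
    return API + "?" + urlencode(params)

url = intro_url("Albert Einstein")
print(url)
# Fetching it (network required) would look like:
#   import json, urllib.request
#   data = json.load(urllib.request.urlopen(url))
#   page = next(iter(data["query"]["pages"].values()))
#   print(page["extract"])
```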

Fetch a Wikipedia article with Python

I try to fetch a Wikipedia article with Python's urllib: f = urllib.urlopen("http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes") s = f.read() f.close() However, instead of the HTML page I get the following response: Error - Wikimedia Foundation: Request: GET http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes, from 192.35.17.11 via knsq1.knams.wikimedia.org (squid/2.6.STABLE21)
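This error page typically means Wikimedia's cache layer rejected the request because of urllib's default "Python-urllib/x.y" User-Agent; sending a descriptive agent string usually fixes it. A sketch using the Python 3 urllib.request API (the agent name and contact address are placeholders you should replace with your own):

```python
import urllib.request

url = ("https://en.wikipedia.org/w/index.php"
       "?title=Albert_Einstein&printable=yes")

# Wikimedia rejects the default Python-urllib agent, so send a
# descriptive one identifying the client (placeholder values below):
req = urllib.request.Request(
    url, headers={"User-Agent": "MyWikiFetcher/0.1 (me@example.com)"})
print(req.get_header("User-agent"))  # MyWikiFetcher/0.1 (me@example.com)

# html = urllib.request.urlopen(req).read()   # the actual network call
```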

Reverse tree building (with an odd number of children)

I just found out about the AWS Glacier service and wanted to write a small Python application to upload files via the REST API. I took a look at the required headers and stumbled upon x-amz-sha256-tree-hash. I need to calculate the SHA-256 hash of the entire file, as well as the hash of the parent of all hashes of each 1 MB chunk. This leads to the following tree: (Image taken from here) I have already created a function that reads 1 MB chunks and a class that computes the hashes on the fly, and then I got completely stuck: In my application I created a class called chunk which takes the data and in __i…
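The tree is built bottom-up: hash each 1 MB chunk, then repeatedly concatenate and hash pairs of digests; when a level has an odd number of nodes, the trailing digest is promoted to the next level unchanged. A minimal sketch of that fold (following Glacier's documented tree-hash algorithm; for a single chunk the tree hash equals the plain SHA-256 of the data):

```python
import hashlib

def tree_hash(chunk_hashes):
    """Fold per-chunk SHA-256 digests into a Glacier-style tree hash:
    pair up digests level by level; an odd trailing digest is carried
    up to the next level as-is."""
    level = list(chunk_hashes)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(hashlib.sha256(level[i] + level[i + 1]).digest())
        if len(level) % 2:          # odd child: promote unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]

def file_tree_hash(data, chunk_size=1024 * 1024):
    """Tree hash over fixed-size chunks of an in-memory byte string."""
    chunks = [data[i:i + chunk_size]
              for i in range(0, len(data), chunk_size)] or [b""]
    return tree_hash([hashlib.sha256(c).digest() for c in chunks])

# One chunk: tree hash == plain SHA-256 of the data.
print(file_tree_hash(b"hello").hex() == hashlib.sha256(b"hello").hexdigest())  # True
```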

Python Selenium getting started error

I have updated to Selenium 3. I have set geckodriver in PATH but continue to get the error. Firefox starts up, but then there is no action for a few moments and Firefox closes (I assume a timeout). Any insight would be greatly appreciated! Traceback (most recent call last): File "C:\Users\Paul\Documents\python selenium\python_org_search.py", line 4, in driver = webdriver.Firefox() File "C:\Python27\lib\selenium\webdriver\firefox\webdriver.py", line 78, in __init__ self.binary,…
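Since Selenium 3 launches Firefox through the geckodriver executable, a quick sanity check is whether Python can actually resolve it on the PATH the script runs with (which may differ from the shell where you set it). A small stdlib sketch of that check:

```python
import shutil

def driver_on_path(name="geckodriver"):
    """Return the resolved path of the driver executable, or None if
    it is not on the PATH visible to this Python process."""
    return shutil.which(name)

path = driver_on_path()
if path is None:
    print("geckodriver not found on PATH - webdriver.Firefox() will fail")
else:
    print("using driver at", path)
```

If the driver is found but Firefox still times out, a Firefox/geckodriver version mismatch is the next thing to rule out.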

Selenium script working from console, not working in CRON

I have a Selenium script running from an SH file. It works perfectly fine when I run the sh file from the console, but the same file run from a cron job fails. SH file: #!/bin/sh export DISPLAY=:10 cd /home/user python3 selenium.py > /home/user/selenium.log 2>&1 The error I am getting is well known: Traceback (most recent call last): File "/usr/local/lib/python3.5/dist-packages/selenium/webdriver/common/service.p…, in start stdout=self.log_file…
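A frequent cause is that cron runs jobs with a minimal environment: a stripped-down PATH (so the webdriver binary found from the console is no longer found) and no guarantee that anything is serving the display the script exports. A sketch of a crontab plus wrapper script that addresses both; the paths, schedule, and display number are examples to adapt:

```sh
# crontab -e -- give cron the PATH the console session had, so the
# webdriver binary (geckodriver/chromedriver) resolves:
PATH=/usr/local/bin:/usr/bin:/bin
*/10 * * * * /home/user/run_selenium.sh

# /home/user/run_selenium.sh -- make sure an X server actually serves
# :10 before the script starts (Xvfb provides a headless display):
#!/bin/sh
export DISPLAY=:10
Xvfb :10 -screen 0 1280x1024x24 &   # skip if a real display exists
cd /home/user
python3 selenium.py > /home/user/selenium.log 2>&1
```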

Firefox blank web browser with Selenium

When I call a Firefox web browser with the Python Firefox webdriver, Firefox opens with a blank page (nothing in the navigation bar), waits, and closes. The Python console gives me this error: Traceback (most recent call last): File "firefox_selenium2.py", line 4, in driver = webdriver.Firefox() File "/usr/local/lib/python3.5/dist-packages/selenium/webdriver/firefox/webdriver.py", line 80, in __init__ self.binary, timeout) File "/usr/local/l…
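Firefox opening blank and then closing is commonly a Firefox/geckodriver version mismatch, so a useful first step is to print both versions and compare them against the compatibility table. A small sketch that degrades gracefully when a binary is missing:

```python
import subprocess

def binary_version(cmd):
    """Return the first line of `cmd --version` output, or None if the
    binary is missing or prints nothing."""
    try:
        out = subprocess.run([cmd, "--version"], capture_output=True,
                             text=True, check=False)
    except FileNotFoundError:
        return None
    return out.stdout.splitlines()[0] if out.stdout else None

print("firefox:    ", binary_version("firefox"))
print("geckodriver:", binary_version("geckodriver"))
```

If either line prints None, the binary is not reachable from this environment; if both print, check that the geckodriver release supports your Firefox version.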