Apache Nutch: Get outlink URL's text context

Does anyone know an efficient way to extract the text context that surrounds an outlink URL? For example, given this sample text containing an outlink:

Nutch can run on a single machine, but gains a lot of its strength from running in a Hadoop cluster. You can download Nutch here. For more information about Apache Nutch, please see the Nutch wiki.

In this example, I would like to get the sentence containing the link, plus the sentence before it and the sentence after it. Is there a way to do this efficiently? Are there any methods I can invoke to get something like the position of the link within the fetched content? Or is there a part of the Nutch code I can modify to do this? Thanks!


What you want to do is web scraping. Both Python and Hadoop offer tools for that. To achieve it, you can use selectors.
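
To make the idea concrete, here is a minimal sketch of the "sentence containing the link, plus one before and one after" extraction from the question. It is not Nutch code and it does not use Scrapy selectors: it swaps in Python's standard-library `html.parser` so that it runs without dependencies. The names `LinkContextParser` and `link_contexts`, the `https://example.org/download` URL, and the naive sentence splitter are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser


class LinkContextParser(HTMLParser):
    """Collect visible text and record the character span of each <a href> anchor."""

    def __init__(self):
        super().__init__()
        self.parts = []    # text chunks in document order
        self.length = 0    # number of characters collected so far
        self.links = []    # (href, start, end) offsets into the joined text
        self._open = None  # (href, start) of the anchor currently open

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open = (href, self.length)

    def handle_data(self, data):
        self.parts.append(data)
        self.length += len(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            href, start = self._open
            self.links.append((href, start, self.length))
            self._open = None


def link_contexts(html, window=1):
    """Map each href to its sentence plus `window` sentences on each side."""
    parser = LinkContextParser()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)

    # Naive splitter (assumption): a sentence ends at ., ! or ? followed by whitespace.
    spans, pos = [], 0
    for match in re.finditer(r"[.!?]+\s+", text):
        spans.append((pos, match.end()))
        pos = match.end()
    if pos < len(text):
        spans.append((pos, len(text)))

    contexts = {}
    for href, start, end in parser.links:
        # Indices of the sentences that overlap the anchor text.
        hits = [i for i, (s, e) in enumerate(spans) if s < end and e > start]
        if not hits:
            continue
        first = max(hits[0] - window, 0)
        last = min(hits[-1] + window, len(spans) - 1)
        contexts[href] = text[spans[first][0]:spans[last][1]].strip()
    return contexts


if __name__ == "__main__":
    # Sample text from the question; the URL is a hypothetical placeholder.
    sample = (
        "<p>Nutch can run on a single machine, but gains a lot of its strength "
        "from running in a Hadoop cluster. You can download Nutch "
        '<a href="https://example.org/download">here</a>. For more information '
        "about Apache Nutch, please see the Nutch wiki.</p>"
    )
    for href, context in link_contexts(sample).items():
        print(href, "->", context)
```

The parser records where each anchor's text starts and ends in the flattened page text, which gives you the "position of the link" the question asks about. The sentence splitter is deliberately simple and will break on abbreviations such as "e.g.", so for real pages you would probably want a proper sentence tokenizer.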

Here are some examples of how to use selectors with Python Scrapy:

  • Selectors
  • Scrapy Tutorial

On Hadoop, the best way to go is to implement a crawler that uses selectors:

  • Web crawl with Hadoop
  • HiveQL

Cascading can be used to address the URL you specify:

  • Hadoop and Cascading

Once you have the data, you can also use R to optimize the analysis:

  • R and Hadoop
  • Enabling R on Hadoop

If you haven't done anything with Hadoop yet, here is a good starting point. You may also want to have a look at HUE Beeswax, an interactive tool that is very useful for data analysis.
