How to extract keywords from a block of text in Haskell

I know this is a fairly large topic, but I need to accept a chunk of text and extract the most interesting keywords from it. The text comes from TV captions, so the subject can range from news to sports to pop culture references. If it helps, I can also provide the type of show the text came from.

One idea I have is to match the text against a dictionary of terms I already know to be interesting.

Which libraries for Haskell can help me with this?

Assuming I do have a dictionary of interesting terms, and a database to store them in, is there a particular approach you'd recommend to matching keywords within the text?

Is there an obvious approach I'm not thinking of?


I'd stem the words in the chunks and then search for all the terms in the dictionary. Here are two libraries, picked more or less at random:

stem http://hackage.haskell.org/packages/archive/stemmer/0.2/doc/html/NLP-Stemmer-C.html

search http://hackage.haskell.org/packages/archive/sphinx/0.2.1/doc/html/Text-Search-Sphinx.html
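
As a rough illustration of the stem-then-match idea, here is a minimal sketch that uses only base and containers. Note that crudeStem is a hypothetical stand-in that just drops a trailing "s"; it is not the API of the stemmer package linked above, and a real Snowball stemmer should replace it. The names tokenize, mkDict and keywordCounts are also made up for this example, and the dictionary is assumed to hold single-word terms only.

```haskell
import Data.Char (isAlpha, toLower)
import Data.List (sortOn, stripPrefix)
import Data.Ord (Down (..))
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

-- | Split text into lowercase alphabetic tokens; everything else is a separator.
tokenize :: String -> [String]
tokenize = words . map (\c -> if isAlpha c then toLower c else ' ')

-- | Hypothetical placeholder for a real stemmer: drops one trailing "s"
-- when at least three characters remain. Swap in a proper stemmer here.
crudeStem :: String -> String
crudeStem w =
  case stripPrefix "s" (reverse w) of
    Just rest | length rest >= 3 -> reverse rest
    _                            -> w

-- | Build the lookup set from raw single-word terms, stemmed the same way
-- as the text so that both sides are comparable.
mkDict :: [String] -> Set.Set String
mkDict = Set.fromList . map (crudeStem . map toLower)

-- | Count how often each dictionary term occurs in the text (after
-- stemming), most frequent first, ties broken alphabetically.
keywordCounts :: Set.Set String -> String -> [(String, Int)]
keywordCounts dict text =
  sortOn (\(w, n) -> (Down n, w)) (Map.toList counts)
  where
    stems  = map crudeStem (tokenize text)
    hits   = filter (`Set.member` dict) stems
    counts = Map.fromListWith (+) [(w, 1 :: Int) | w <- hits]
```

Loaded into GHCi, `keywordCounts (mkDict ["election", "senate", "playoffs"]) "The Senate votes today. Elections loom; the senate is split."` gives `[("senate",2),("election",1)]`. Stemming both the dictionary and the text with the same function is what lets "Elections" match "election"; multi-word terms would need a phrase matcher on top of this.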


To expand on bpgergo's answer (though I don't have any Haskell-specific information): it's fairly straightforward to load documents into a relational database and index them with Solr/Lucene or Sphinx, either of which should include a stemmer in its default or suggested configuration. You can then search for the documents that contain pairs, triples, and so on of terms from your list of "interesting terms".
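
To make the "pairs, triples, etc." query concrete, here is a small in-memory sketch in Haskell. It is only a stand-in for what a Solr or Sphinx query would do, not their API: documents are assumed to be (id, text) pairs, matching is on lowercased whole words with no stemming, and the names termSet and docsWithAtLeast are invented for the example.

```haskell
import Data.Char (isAlpha, toLower)
import qualified Data.Set as Set

-- | The distinct lowercase alphabetic tokens of a document.
termSet :: String -> Set.Set String
termSet = Set.fromList . words . map (\c -> if isAlpha c then toLower c else ' ')

-- | Documents containing at least k distinct interesting terms, each paired
-- with the terms it matched (in ascending order). k = 2 finds pairs,
-- k = 3 finds triples, and so on.
docsWithAtLeast :: Int -> Set.Set String -> [(Int, String)] -> [(Int, [String])]
docsWithAtLeast k interesting docs =
  [ (docId, Set.toAscList matched)
  | (docId, body) <- docs
  , let matched = Set.intersection interesting (termSet body)
  , Set.size matched >= k
  ]
```

For example, with the interesting terms "senate", "budget", "election" and "playoffs", a document reading "Budget talks stall in the Senate before the election" matches three of them and is returned for both k = 2 and k = 3, while "Playoffs start tonight" matches only one and is dropped. A real index does this lookup without scanning every document, which is the main reason to use one.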

You might also look into named entity recognition, statistically unusual phrase detection, automatic tag generation, and similar topics. LingPipe is a good place to start, along with these books:

http://alias-i.com/lingpipe/demos/tutorial/read-me.html

http://www.manning.com/marmanis/excerpt_contents.html

http://www.manning.com/alag/excerpt_contents.html
