How to extract numbers (along with comparison adjectives or ranges)

I am working on two NLP-projects in Python, and both have similar task to extract values and comparison operators from sentences like:

"... greater than $10 ... ",
"... weight not more than 200lbs ...",
"... height in 5-7 feets ...",
"... faster than 30 seconds ... "

I saw two different ways to solve this problem, one using very complex regular expressions, and one using NER (and some regexes, too).

How can I parse values out of such sentences? I assume this is a common task in NLP.


The desired output would be something like:

Input:

"greater than $10"

Output:

{'value': 10, 'unit': 'dollar', 'relation': 'gt', 'position': 3}

I would probably approach this as a chunking task and use nltk 's part of speech tagger combined with its regular expression chunker. This will allow you to define a regular expression based on the part of speech of the words in your sentences instead of on the words themselves. For a given sentence, you can do the following:

import nltk

# example sentence
sent = 'send me a table with a price greater than $100'

The first thing I would do is to modify your sentences slightly so that you don't confuse the part of speech tagger too much. Here are some examples of changes that you can make (with very simple regular expressions) but you can experiment and see if there are others:

$10 -> 10 dollars
200lbs -> 200 lbs
5-7 -> 5 - 7 OR 5 to 7

so we get:

sent = 'send me a table with a price greater than 100 dollars'

now you can get the parts of speech from your sentence:

sent_pos = nltk.pos_tag(sent.split())
print(sent_pos)

[('send', 'VB'), ('me', 'PRP'), ('a', 'DT'), ('table', 'NN'), ('with', 'IN'), ('a', 'DT'), ('price', 'NN'), ('greater', 'JJR'), ('than', 'IN'), ('100', 'CD'), ('dollars', 'NNS')]

We can now create a chunker which will chunk your POS tagged text according to a (relatively) simple regular expression:

grammar = 'NumericalPhrase: {<NN|NNS>?<RB>?<JJR><IN><CD><NN|NNS>?}'
parser = nltk.RegexpParser(grammar)

This defines a parser with a grammar that chunks numerical phrases (what we'll call your phrase type). It defines your numerical phrase as: an optional noun, followed by an optional adverb, followed by a comparative adjective, a preposition, a number, and an optional noun. This is just a suggestion for how you may want to define your phrases, but I think that this will be much simpler than using a regular expression on the words themselves.

To get your phrases you can do:

print(parser.parse(sent_pos))
(S
  send/VB
  me/PRP
  a/DT
  table/NN
  with/IN
  a/DT
  (NumericalPhrase price/NN greater/JJR than/IN 100/CD dollars/NNS))  

Or to get only your phrases you can do:

print([tree.leaves() for tree in parser.parse(sent_pos).subtrees() if tree.label() == 'NumericalPhrase'])

[[('price', 'NN'),
  ('greater', 'JJR'),
  ('than', 'IN'),
  ('100', 'CD'),
  ('dollars', 'NNS')]]
链接地址: http://www.djcxy.com/p/91726.html

上一篇: 斯坦福大学帕尔斯和NLTK

下一篇: 如何提取数字(连同比较形容词或范围)