Extracting relationships using NLTK

2018-06-23 05:40:42

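This is a follow-up to my earlier question. I am using nltk to parse out people, organizations, and their relationships. Following this example, I was able to create chunks of persons and organizations; however, the nltk.sem.extract_rels call raises an error: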

AttributeError: 'Tree' object has no attribute 'text'

Here is the complete code:

import nltk
import re
#billgatesbio from http://www.reuters.com/finance/stocks/officerProfile?symbol=MSFT.O&officerId=28066
with open('billgatesbio.txt', 'r') as f:
    sample = f.read()

sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.batch_ne_chunk(tagged_sentences)

# tried plain ne_chunk instead of batch_ne_chunk as given in the book
#chunked_sentences = [nltk.ne_chunk(sentence) for sentence in tagged_sentences]

# pattern to find <person> served as <title> in <org>
IN = re.compile(r'.+\s+as\s+')
for doc in chunked_sentences:
    for rel in nltk.sem.extract_rels('ORG', 'PERSON', doc,corpus='ieer', pattern=IN):
        print nltk.sem.show_raw_rtuple(rel)

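This example is very similar to the one given in the book, but the book's example uses a prepared 'parsed doc', which appears out of nowhere, and I don't know where to find its object type. I have searched the git repositories as well. Any help is appreciated.

My ultimate goal is to extract persons, organizations, and titles (with dates) for some companies, and then create network maps of the persons and organizations.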

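It looks like, to be a "parsed doc", an object needs a headline member and a text member, both of which are lists of tokens, where some of the tokens are marked up as trees. For example, this (hacky) example works: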

import nltk
import re

IN = re.compile(r'.*\bin\b(?!\b.+ing)')

class doc():
  pass

doc.headline=['foo']
doc.text=[nltk.Tree('ORGANIZATION', ['WHYY']), 'in', nltk.Tree('LOCATION',['Philadelphia']), '.', 'Ms.', nltk.Tree('PERSON', ['Gross']), ',']

for rel in  nltk.sem.extract_rels('ORG','LOC',doc,corpus='ieer',pattern=IN):
   print nltk.sem.relextract.show_raw_rtuple(rel)

When run, this gives the output:

[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']

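Obviously you wouldn't write real code like this, but it is a working example of the data format extract_rels expects with corpus='ieer'. You just need to work out the preprocessing steps that get your data into that format.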

Here is the source code of the nltk.sem.extract_rels function:

def extract_rels(subjclass, objclass, doc, corpus='ace', pattern=None, window=10):
    """
    Filter the output of ``semi_rel2reldict`` according to specified NE classes and a filler pattern.

    The parameters ``subjclass`` and ``objclass`` can be used to restrict the
    Named Entities to particular types (any of 'LOCATION', 'ORGANIZATION',
    'PERSON', 'DURATION', 'DATE', 'CARDINAL', 'PERCENT', 'MONEY', 'MEASURE').

    :param subjclass: the class of the subject Named Entity.
    :type subjclass: str
    :param objclass: the class of the object Named Entity.
    :type objclass: str
    :param doc: input document
    :type doc: ieer document or a list of chunk trees
    :param corpus: name of the corpus to take as input; possible values are
        'ieer' and 'conll2002'
    :type corpus: str
    :param pattern: a regular expression for filtering the fillers of
        retrieved triples.
    :type pattern: SRE_Pattern
    :param window: filters out fillers which exceed this threshold
    :type window: int
    :return: see ``mk_reldicts``
    :rtype: list(defaultdict)
    """
....
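
As a rough illustration of the filtering the docstring describes, the sketch below keeps only candidates whose entity classes match, whose filler is short enough, and whose filler matches the pattern. It is a simplified stand-in, not NLTK's implementation: filter_rels and the hand-written candidate dictionaries are hypothetical, and measuring the window in whitespace-separated filler tokens is an assumption.

```python
import re

def filter_rels(candidates, subjclass, objclass, pattern, window=10):
    """Hypothetical helper: keep candidates whose classes match, whose
    filler has at most `window` tokens, and whose filler matches `pattern`
    from its start."""
    kept = []
    for cand in candidates:
        if cand['subjclass'] != subjclass or cand['objclass'] != objclass:
            continue
        # Assumption: the window counts whitespace-separated filler tokens.
        if len(cand['filler'].split()) > window:
            continue
        if pattern.match(cand['filler']):
            kept.append(cand)
    return kept

# Hand-written candidates, for illustration only.
candidates = [
    {'subjclass': 'ORGANIZATION', 'filler': 'in', 'objclass': 'LOCATION'},
    {'subjclass': 'ORGANIZATION', 'filler': ', based near', 'objclass': 'LOCATION'},
    {'subjclass': 'PERSON', 'filler': 'in', 'objclass': 'LOCATION'},
]

IN = re.compile(r'.*\bin\b')
kept = filter_rels(candidates, 'ORGANIZATION', 'LOCATION', IN)
print(kept)
```

Only the first candidate survives: the second has no "in" in its filler, and the third has the wrong subject class.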

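So if you pass the corpus parameter as 'ieer', nltk.sem.extract_rels expects the doc parameter to be an IEERDocument object. You should pass corpus as 'ace', or just not pass it at all (the default is 'ace'). In that case it expects a list of chunk trees, which is what you want. I modified the code as follows: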

import nltk
import re
from nltk.sem import extract_rels,rtuple

#billgatesbio from http://www.reuters.com/finance/stocks/officerProfile?symbol=MSFT.O&officerId=28066
with open('billgatesbio.txt', 'r') as f:
    sample = f.read().decode('utf-8')

sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]

# here i changed reg ex and below i exchanged subj and obj classes' places
OF = re.compile(r'.*\bof\b.*')

for i, sent in enumerate(tagged_sentences):
    sent = nltk.ne_chunk(sent) # ne_chunk method expects one tagged sentence
    rels = extract_rels('PER', 'ORG', sent, corpus='ace', pattern=OF, window=7) # extract_rels method expects one chunked sentence
    for rel in rels:
        print('{0:<5}{1}'.format(i, rtuple(rel)))

The result is as follows:

[PER: u'Chairman/NNP'] u'and/CC Chief/NNP Executive/NNP Officer/NNP of/IN the/DT' [ORG: u'Company/NNP']
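
Note that the filler in this result keeps its POS tags ('of/IN'), so a pattern has to tolerate the '/TAG' suffix on each token. The check below uses only the re module and the filler string copied from the output above. OF_SPACES is a hypothetical variant added for contrast; it requires whitespace on both sides of the word and so fails on tagged text.

```python
import re

# Filler copied from the result above; tokens carry their POS tags.
filler = 'and/CC Chief/NNP Executive/NNP Officer/NNP of/IN the/DT'

# Word boundaries accept "of/IN", because "/" is a non-word character.
OF = re.compile(r'.*\bof\b.*')

# Hypothetical variant for contrast: whitespace is required after "of",
# but the tagged token continues with "/IN".
OF_SPACES = re.compile(r'.*\sof\s.*')

matches_boundary = OF.match(filler) is not None
matches_spaces = OF_SPACES.match(filler) is not None
print(matches_boundary, matches_spaces)
```

This may also be why a whitespace-based pattern such as the question's r'.+\s+as\s+' would find nothing on tagged chunks, although that was not tested here.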

This is an nltk version issue. Your code should work in nltk 2.x, but for nltk 3 you should write the code like this:

IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.relextract.extract_rels('ORG', 'LOC', doc,corpus='ieer', pattern = IN):
         print (nltk.sem.relextract.rtuple(rel))
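
To see what the filler pattern r'.*\bin\b(?!\b.+ing)' accepts without loading any corpus, it can be tried on a few filler strings with the re module alone. The fillers below are made up for illustration. The negative lookahead is meant to reject a filler in which "in" is followed later by a word ending in "ing".

```python
import re

IN = re.compile(r'.*\bin\b(?!\b.+ing)')

# Illustrative fillers, not taken from the IEER corpus.
fillers = [
    'in',
    ', based in',
    'success in supervising the transition of',
]

# extract_rels is assumed to test the pattern with match(), i.e. from the
# start of the filler, which is why the pattern begins with '.*'.
results = {f: IN.match(f) is not None for f in fillers}
for filler, ok in results.items():
    print(repr(filler), ok)
```

The first two fillers match; the third is rejected because "supervising" follows "in".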
