How to output NLTK chunks to a file?

I have this Python script in which I use the nltk library to parse, tokenize, tag and chunk some, let's say, random text from the web.

I need to format and write to a file the output of chunked1, chunked2 and chunked3. These are of type `class 'nltk.tree.Tree'`.

More specifically, I only need to write the lines that match the regular expressions chunkGram1, chunkGram2 and chunkGram3.

How can I do that?

#! /usr/bin/python2.7

import nltk
import re
import codecs

xstring = ["An electronic library (also referred to as digital library or digital repository) is a focused collection of digital objects that can include text, visual material, audio material, video material, stored as electronic media formats (as opposed to print, micro form, or other media), along with means for organizing, storing, and retrieving the files and media contained in the library collection. Digital libraries can vary immensely in size and scope, and can be maintained by individuals, organizations, or affiliated with established physical library buildings or institutions, or with academic institutions.[1] The electronic content may be stored locally, or accessed remotely via computer networks. An electronic library is a type of information retrieval system."]


def processLanguage():
    for item in xstring:
        tokenized = nltk.word_tokenize(item)
        tagged = nltk.pos_tag(tokenized)
        #print tokenized
        #print tagged

        chunkGram1 = r"""Chunk: {<JJ\w?>*<NN>}"""
        chunkGram2 = r"""Chunk: {<JJ\w?>*<NNS>}"""
        chunkGram3 = r"""Chunk: {<NNP\w?>*<NNS>}"""

        chunkParser1 = nltk.RegexpParser(chunkGram1)
        chunked1 = chunkParser1.parse(tagged)

        chunkParser2 = nltk.RegexpParser(chunkGram2)
        chunked2 = chunkParser2.parse(tagged)

        chunkParser3 = nltk.RegexpParser(chunkGram3)
        chunked3 = chunkParser3.parse(tagged)

        #print chunked1
        #print chunked2
        #print chunked3

        # with codecs.open('path\to\file\output.txt', 'w', encoding='utf8') as outfile:

            # for i,line in enumerate(chunked1):
                # if "JJ" in line:
                    # outfile.write(line)
                # elif "NNP" in line:
                    # outfile.write(line)



processLanguage()

For now, when I try to run it, I get this error:

Traceback (most recent call last):
  File "sentdex.py", line 47, in <module>
    processLanguage()
  File "sentdex.py", line 40, in processLanguage
    outfile.write(line)
  File "C:\Python27\lib\codecs.py", line 688, in write
    return self.writer.write(data)
  File "C:\Python27\lib\codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
TypeError: coercing to Unicode: need string or buffer, tuple found

Edit: after @alvas's answer, I managed to do what I wanted. However, now I would like to know how to strip all the non-ASCII characters from a text corpus. Example:

#store cleaned file into variable
with open('path\to\file.txt', 'r') as infile:
    xstring = infile.readlines()
infile.close

    def remove_non_ascii(line):
        return ''.join([i if ord(i) < 128 else ' ' for i in line])

    for i, line in enumerate(xstring):
        line = remove_non_ascii(line)

#tokenize and tag text
def processLanguage():
    for item in xstring:
        tokenized = nltk.word_tokenize(item)
        tagged = nltk.pos_tag(tokenized)
        print tokenized
        print tagged
processLanguage()

The above comes from another answer here on S/O. It doesn't seem to work, though. What could be wrong? The error I get is:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position
…: ordinal not in range(128)

Your code has several problems, but the main culprit is that your for loop does not modify the contents of xstring.

I will address all the issues in your code here:

You cannot write paths with single backslashes like that: `\t` would be interpreted as a tab character and `\f` as a form feed. You have to double them. I know it was just an example, but this kind of confusion comes up often:

with open('pathtofile.txt', 'r') as infile:
    xstring = infile.readlines()
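A quick way to see the escape problem (a small sketch; the path is hypothetical):

```python
# Inside an ordinary string literal, '\t' is a tab character, not backslash-t
broken = 'path\to\file.txt'
assert '\t' in broken          # the \t silently became a tab

# Either double the backslashes or use a raw string
fixed_doubled = 'path\\to\\file.txt'
fixed_raw = r'path\to\file.txt'
assert fixed_doubled == fixed_raw
```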

The `infile.close` line that follows is wrong. It does not call the close method; it actually does nothing at all. Furthermore, your file was already closed by the `with` clause. If you see this line in any answer anywhere, please comment directly that `file.close` is wrong and should be `file.close()`.
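To illustrate both points (a minimal, self-contained sketch, not part of the original code; the scratch file is hypothetical): a bare `infile.close` is only an attribute lookup, and the `with` block is what actually closes the file on exit.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this demonstration
fd, path = tempfile.mkstemp()
os.close(fd)

with open(path, 'w') as infile:
    infile.write('demo')
    infile.close                # no parentheses: merely looks up the method
    assert not infile.closed    # the file is still open at this point

assert infile.closed            # the with-statement closed it on exit
os.remove(path)
```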

The following should work, but you need to be aware that it replaces every non-ASCII character with `' '`, which will break words such as naïve and café:

def remove_non_ascii(line):
    return ''.join([i if ord(i) < 128 else ' ' for i in line])
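If that breakage is a concern, one gentler alternative (my suggestion, not part of the answer above) is to decompose accented characters with the standard-library `unicodedata` module and drop only the combining marks, so accented letters fall back to their base letters instead of a space:

```python
import unicodedata

def strip_accents(line):
    # NFKD splits 'e\u0301'-style characters into base letter + combining mark;
    # keeping only the non-combining characters turns 'café' into 'cafe'.
    decomposed = unicodedata.normalize('NFKD', line)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents(u'na\u00efve caf\u00e9'))  # naive cafe
```

Note that characters with no decomposition (for example `ß`) still pass through unchanged, so this is a complement to, not a replacement for, a strict ASCII filter.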

But here is why your code fails with the unicode exception: you are not modifying the elements of xstring at all. That is, yes, you are computing the line with the non-ASCII characters removed, but that is a new value, which is never stored back into the list:

for i, line in enumerate(xstring):
    line = remove_non_ascii(line)

Instead, it should be:

for i, line in enumerate(xstring):
    xstring[i] = remove_non_ascii(line)

Or, my preferred and very pythonic:

xstring = [remove_non_ascii(line) for line in xstring]

That said, these Unicode errors occur mostly because you are using Python 2.7 to handle pure Unicode text, something recent Python 3 versions are far better at, so if you are still at the beginning of your task I would recommend that you upgrade to Python 3.4+ soon.


First, please watch this video: https://www.youtube.com/watch?v=0Ef9GudbxXY


Now for the actual answer:

import re
import io

from nltk import pos_tag, word_tokenize, sent_tokenize, RegexpParser


xstring = u"An electronic library (also referred to as digital library or digital repository) is a focused collection of digital objects that can include text, visual material, audio material, video material, stored as electronic media formats (as opposed to print, micro form, or other media), along with means for organizing, storing, and retrieving the files and media contained in the library collection. Digital libraries can vary immensely in size and scope, and can be maintained by individuals, organizations, or affiliated with established physical library buildings or institutions, or with academic institutions.[1] The electronic content may be stored locally, or accessed remotely via computer networks. An electronic library is a type of information retrieval system."


chunkGram1 = r"""Chunk: {<JJ\w?>*<NN>}"""
chunkParser1 = RegexpParser(chunkGram1)

chunked = [chunkParser1.parse(pos_tag(word_tokenize(sent))) 
            for sent in sent_tokenize(xstring)]

with io.open('outfile', 'w', encoding='utf8') as fout:
    for chunk in chunked:
        fout.write(str(chunk)+'\n\n')

[OUT]:

alvas@ubi:~$ python test2.py
Traceback (most recent call last):
  File "test2.py", line 18, in <module>
    fout.write(str(chunk)+'\n\n')
TypeError: must be unicode, not str
alvas@ubi:~$ python3 test2.py
alvas@ubi:~$ head outfile
(S
  An/DT
  (Chunk electronic/JJ library/NN)
  (/:
  also/RB
  referred/VBD
  to/TO
  as/IN
  (Chunk digital/JJ library/NN)
  or/CC

If you have to stick with python2.7:

with io.open('outfile', 'w', encoding='utf8') as fout:
    for chunk in chunked:
        fout.write(unicode(chunk)+'\n\n')

[OUT]:

alvas@ubi:~$ python test2.py
alvas@ubi:~$ head outfile
(S
  An/DT
  (Chunk electronic/JJ library/NN)
  (/:
  also/RB
  referred/VBD
  to/TO
  as/IN
  (Chunk digital/JJ library/NN)
  or/CC
alvas@ubi:~$ python3 test2.py
Traceback (most recent call last):
  File "test2.py", line 18, in <module>
    fout.write(unicode(chunk)+'\n\n')
NameError: name 'unicode' is not defined

And this is strongly recommended if you have to stick with py2.7:

from six import text_type
with io.open('outfile', 'w', encoding='utf8') as fout:
    for chunk in chunked:
        fout.write(text_type(chunk)+'\n\n')

[OUT]:

alvas@ubi:~$ python test2.py
alvas@ubi:~$ head outfile 
(S
  An/DT
  (Chunk electronic/JJ library/NN)
  (/:
  also/RB
  referred/VBD
  to/TO
  as/IN
  (Chunk digital/JJ library/NN)
  or/CC
alvas@ubi:~$ python3 test2.py
alvas@ubi:~$ head outfile 
(S
  An/DT
  (Chunk electronic/JJ library/NN)
  (/:
  also/RB
  referred/VBD
  to/TO
  as/IN
  (Chunk digital/JJ library/NN)
  or/CC
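Finally, since the question asked to write only the lines matched by the chunk grammars, you can also keep just the `Chunk` subtrees instead of the full trees. A sketch: the tree below is hand-built with `Tree.fromstring` (standing in for `chunkParser1.parse(...)` output, so the snippet runs without the tagger models), and the output filename is made up.

```python
import io
from nltk.tree import Tree

# Stand-in for a parsed, chunked sentence; leaves are word/tag tokens
tree = Tree.fromstring(
    "(S An/DT (Chunk electronic/JJ library/NN) is/VBZ "
    "a/DT (Chunk digital/JJ repository/NN))")

with io.open('chunks_only.txt', 'w', encoding='utf8') as fout:
    # keep only the subtrees produced by the Chunk rule
    for subtree in tree.subtrees(filter=lambda t: t.label() == 'Chunk'):
        phrase = ' '.join(leaf.split('/')[0] for leaf in subtree.leaves())
        fout.write(phrase + u'\n')
```

This writes one matched phrase per line (here, `electronic library` and `digital repository`) rather than the whole pretty-printed tree.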