How can I tag and chunk French text using NLTK and Python?

I have 30,000+ French-language articles in a JSON file. I would like to perform some text analysis on both individual articles and on the set as a whole. Before I go further, I'm starting with simple goals:

  • Identify important entities (people, places, concepts)
  • Find significant changes in the importance (~=frequency) of those entities over time (using the article sequence number as a proxy for time)
The steps I've taken so far:

  • Imported the data into a python list:

    import json
    json_articles = open('articlefile.json')
    articlelist = json.load(json_articles)
    
  • Selected a single article to test, and concatenated the body text into a single string:

    txt = ' '.join(articlelist[10000]['body'])
    
  • Loaded a French sentence tokenizer and split the string into a list of sentences:

    import nltk
    french_tokenizer = nltk.data.load('tokenizers/punkt/french.pickle')
    sentences = french_tokenizer.tokenize(txt)
    
  • Attempted to split the sentences into words using the WhitespaceTokenizer:

    from nltk.tokenize import WhitespaceTokenizer
    wst = WhitespaceTokenizer()
    tokens = [wst.tokenize(s) for s in sentences]
    
This is where I'm stuck, for the following reasons:

  • NLTK doesn't have a built-in tokenizer that can split French into words. Whitespace alone doesn't work well, particularly because it won't separate correctly on apostrophes.
  • Even if I were to use regular expressions to split the text into individual words, there's no French PoS (part-of-speech) tagger that I can use to tag those words, and no way to chunk them into logical units of meaning.
  • For English, I could tag and chunk the text like so:

        tagged = [nltk.pos_tag(token) for token in tokens]
        chunks = nltk.batch_ne_chunk(tagged)
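
A regular-expression tokenizer can go some way toward the apostrophe problem mentioned above. Here is a minimal sketch using only the stdlib `re` module; the clitic handling is illustrative, not a complete French tokenizer:

```python
import re

# A minimal sketch of a regex word tokenizer for French that keeps
# elided clitics (l', d', qu', c', ...) as their own tokens instead of
# splitting on whitespace alone. The pattern is illustrative only:
# a word ending in an apostrophe, or a plain word, or one punctuation mark.
TOKEN_RE = re.compile(r"\w+['\u2019]|\w+|[^\w\s]")

def french_words(sentence):
    return TOKEN_RE.findall(sentence)

print(french_words("Qu'est-ce que c'est ?"))
# ["Qu'", 'est', '-', 'ce', 'que', "c'", 'est', '?']
```

This keeps the apostrophe attached to the clitic (`Qu'`, `c'`), which is usually what a downstream French tagger expects.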
    

    My main options (in order of current preference) seem to be:

  • Use nltk-trainer to train my own tagger and chunker.
  • Use the python wrapper for TreeTagger for just this part, as TreeTagger can already tag French, and someone has written a wrapper which calls the TreeTagger binary and parses the results.
  • Use a different tool altogether.
  • If I were to do (1), I imagine I would need to create my own tagged corpus. Is this correct, or would it be possible (and permitted) to use the French Treebank?

    If the French Treebank corpus format (example here) is not suitable for use with nltk-trainer, is it feasible to convert it into such a format?

    What approaches have French-speaking users of NLTK taken to PoS tag and chunk text?


    As of version 3.1.0 (January 2012), the Stanford PoS tagger supports French.

    It should be possible to use this French tagger in NLTK, using Nitin Madnani's Interface to the Stanford POS-tagger.

    I haven't tried this yet, but it sounds easier than the other approaches I've considered, and I should be able to control the entire pipeline from within a Python script. I'll comment on this post when I have an outcome to share.


    There is also TreeTagger (which supports French) with a Python wrapper. This is the solution I am currently using, and it works quite well.


    Here are some suggestions:

  • WhitespaceTokenizer is doing what it's meant to. If you want to split on apostrophes, try WordPunctTokenizer, check out the other available tokenizers, or roll your own with RegexpTokenizer or directly with the re module.
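
For reference, WordPunctTokenizer is just a RegexpTokenizer with the pattern `\w+|[^\w\s]+`, so you can preview its behaviour on elided French with the stdlib `re` module alone:

```python
import re

# WordPunctTokenizer is a RegexpTokenizer built on this pattern;
# previewing its output on French elision with plain re:
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

print(WORD_PUNCT.findall("Aujourd'hui, l'économie"))
# ['Aujourd', "'", 'hui', ',', 'l', "'", 'économie']
```

Note it splits the apostrophe off as its own token; if you'd rather keep the apostrophe attached to the clitic (`l'`), a custom RegexpTokenizer pattern is the way to go.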

  • Make sure you've resolved text encoding issues (Unicode or Latin-1), otherwise the tokenization will still go wrong.
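
The classic failure mode is UTF-8 bytes decoded as Latin-1, which turns every accented character into mojibake and then breaks tokenization downstream (shown here in Python 3 terms; the same mismatch applies to Python 2's unicode/str handling):

```python
# "déjà vu" encoded as UTF-8 bytes: each accented character becomes
# two bytes, which Latin-1 then decodes as two separate characters.
raw = "déjà vu".encode("utf-8")

print(raw.decode("latin-1"))  # mojibake: 'dÃ©jÃ\xa0 vu' (wrong)
print(raw.decode("utf-8"))    # correct:  'déjà vu'
```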

  • NLTK only comes with an English tagger, as you discovered. It sounds like using TreeTagger would be the least work, since it's (almost) ready to use.

  • Training your own is also a practical option. But you definitely shouldn't create your own training corpus! Use an existing tagged corpus of French. You'll get the best results if the genre of the training text matches your domain (articles). Also, you can use nltk-trainer, but you could also use NLTK's training features directly.

  • You can use the French Treebank corpus for training, but I don't know if there's a reader that knows its exact format. If not, you can start with XMLCorpusReader and subclass it to provide a tagged_sents() method.
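
The conversion itself is mostly XML plumbing. A sketch with a hypothetical treebank-style markup, using only the stdlib ElementTree (the real French Treebank schema uses different element and attribute names, so adjust the tag names accordingly):

```python
import xml.etree.ElementTree as ET

# Hypothetical treebank-style markup; the actual French Treebank
# schema differs, so treat the element/attribute names as placeholders.
SAMPLE = '<SENT><w cat="D">le</w><w cat="N">chat</w><w cat="V">dort</w></SENT>'

def tagged_sent(xml_string):
    """Return one sentence as a list of (word, tag) pairs."""
    root = ET.fromstring(xml_string)
    return [(w.text, w.get("cat")) for w in root.iter("w")]

print(tagged_sent(SAMPLE))
# [('le', 'D'), ('chat', 'N'), ('dort', 'V')]
```

A subclassed XMLCorpusReader would do essentially this per sentence and yield the pairs from tagged_sents().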

  • If you're not already on the nltk-users mailing list, I think you'll want to get on it.
