Latent Semantic Analysis concepts

I've read about using Singular Value Decomposition (SVD) to do Latent Semantic Analysis (LSA) on a corpus of texts. I understand how to do it, and I also understand the mathematical concepts behind SVD.

But I don't understand why it works when applied to corpora of texts (I believe there must be a linguistic explanation). Could anybody explain this from a linguistic point of view?

Thanks


There is no linguistic interpretation: there is no syntax involved, no handling of equivalence classes, synonyms, homonyms, stemming, etc. Nor are any semantics involved; it is just words-occurring-together. Consider a "document" as a shopping cart: it contains a combination of words (purchases), and words tend to occur together with "related" words.

For instance: the word "drug" can occur together with any of {love, doctor, medicine, sports, crime}; each combination points you in a different direction. But combined with the many other words in the document, your query will probably find documents from a similar field.
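The "shopping cart" view above boils down to counting which words land in the same document. A minimal sketch (the toy corpus and word choices are my own, purely illustrative):

```python
from itertools import combinations
from collections import Counter

# Toy corpus: each "document" is a bag of words, like a shopping cart
# of purchases. The word "drug" appears in two very different carts.
docs = [
    {"drug", "doctor", "medicine", "patient"},
    {"drug", "crime", "police"},
    {"doctor", "medicine", "hospital", "patient"},
]

# Count how often each unordered pair of words occurs in the same document.
cooc = Counter()
for doc in docs:
    for pair in combinations(sorted(doc), 2):
        cooc[pair] += 1

# "drug" co-occurs with both medical and crime vocabulary; the other
# words in a document are what disambiguate which field is in play.
print(cooc[("doctor", "drug")])
print(cooc[("crime", "drug")])
print(cooc[("doctor", "medicine")])
```

Nothing here knows what any word *means*; the counts alone are the raw material that LSA later compresses with SVD.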


Words occurring together (i.e. nearby or in the same document in a corpus) contribute to context. Latent Semantic Analysis essentially groups the documents in a corpus based on how similar they are to each other in terms of that context.

I think the example and the word-document plot on this page will help in understanding.


Suppose we have the following set of five documents:

  • d1 : Romeo and Juliet.
  • d2 : Juliet: O happy dagger!
  • d3 : Romeo died by dagger.
  • d4 : “Live free or die”, that's New-Hampshire's motto.
  • d5 : Did you know New-Hampshire is in New-England?
  • and a search query: dies, dagger.

    Clearly, d3 should be ranked at the top of the list, since it contains both dies and dagger. Then d2 and d4 should follow, each containing one word of the query. However, what about d1 and d5? Should they be returned as possibly interesting results to this query? As humans we know that d1 is quite related to the query, whereas d5 is not so much. Thus, we would like d1 returned but not d5, or, said differently, we want d1 to be ranked higher than d5.

    The question is: can the machine deduce this? The answer is yes; LSI does exactly that. In this example, LSI will be able to see that the term dagger is related to d1 because it occurs together with d1's terms Romeo and Juliet, in d3 and d2 respectively. Also, the term dies is related to d1 and d5 because it occurs together with d1's term Romeo and d5's term New-Hampshire, in d3 and d4 respectively. LSI will also weigh the discovered connections properly: d1 is more related to the query than d5, since d1 is “doubly” connected to dagger through Romeo and Juliet and also connected to die through Romeo, whereas d5 has only a single connection to the query, through New-Hampshire.

    Reference: Latent Semantic Analysis (Alex Thomo)
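    The connections described above can be checked numerically. Below is a minimal sketch (assuming NumPy is available) that builds the term-document matrix for d1–d5, takes a rank-2 truncated SVD, folds the query into the same concept space, and ranks the documents by cosine similarity. Dropping stop words ("and", "did", "you", etc.) and collapsing die/died/dies into one term are assumptions made to match the example.

    ```python
    import numpy as np

    # Term-document matrix for the five documents (rows = terms,
    # columns = d1..d5). Stop words are dropped; "die"/"died"/"dies"
    # are treated as one term, as in the example.
    terms = ["romeo", "juliet", "happy", "dagger",
             "live", "die", "free", "new-hampshire"]
    A = np.array([
        # d1 d2 d3 d4 d5
        [1, 0, 1, 0, 0],  # romeo
        [1, 1, 0, 0, 0],  # juliet
        [0, 1, 0, 0, 0],  # happy
        [0, 1, 1, 0, 0],  # dagger
        [0, 0, 0, 1, 0],  # live
        [0, 0, 1, 1, 0],  # die
        [0, 0, 0, 1, 0],  # free
        [0, 0, 0, 1, 1],  # new-hampshire
    ], dtype=float)

    # Rank-2 truncated SVD: A ~ U2 @ diag(s2) @ V2t
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2
    U2, s2, V2t = U[:, :k], s[:k], Vt[:k, :]

    # Fold the query "die, dagger" into the 2-d concept space; each
    # document j already lives there as column j of V2t.
    q = np.zeros(len(terms))
    q[terms.index("die")] = 1.0
    q[terms.index("dagger")] = 1.0
    q2 = np.diag(1.0 / s2) @ U2.T @ q

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    sims = [cosine(q2, V2t[:, j]) for j in range(5)]
    for name, sim in zip(["d1", "d2", "d3", "d4", "d5"], sims):
        print(name, round(float(sim), 3))
    ```

    If the reasoning above is right, d3 should come out on top and d1 should score higher than d5, even though d1 shares no term with the query.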
