How do we decide the number of dimensions for Latent Semantic Analysis?

I have been working on latent semantic analysis lately. I have implemented it in Java using the Jama package.

Here is the code:

    // Uses the Jama library (Jama.Matrix, Jama.SingularValueDecomposition).
    Matrix a = new Matrix(termdoc);                 // term-document matrix
    termdoc = a.getArray();
    a = a.transpose();
    SingularValueDecomposition sv = new SingularValueDecomposition(a);
    Matrix u = sv.getU();                           // left singular vectors
    Matrix v = sv.getV();                           // right singular vectors
    Matrix s = sv.getS();                           // diagonal matrix of singular values
    Matrix vtranspose = v.transpose();              // V^T, obtained as a result of the SVD

    double[][] uarray = u.getArray();
    double[][] sarray = s.getArray();
    double[][] varray = vtranspose.getArray();
    Matrix result;
    if (semantics.maketerms.nodoc > 50)
    {
        // More than 50 documents: keep only the first 50 dimensions (rank-50 approximation).
        // move(src, rows, cols, dest) copies the top-left rows x cols block of src into dest.
        double[][] sarray_mod = new double[50][50];
        double[][] uarray_mod = new double[uarray.length][50];
        double[][] varray_mod = new double[50][varray.length];
        move(sarray, 50, 50, sarray_mod);
        move(uarray, uarray.length, 50, uarray_mod);
        move(varray, 50, varray.length, varray_mod);
        Matrix e = new Matrix(uarray_mod);
        Matrix f = new Matrix(sarray_mod);
        Matrix g = new Matrix(varray_mod);
        Matrix temp = e.times(f);
        result = temp.times(g);                     // U_k * S_k * V_k^T
    }
    else
    {
        // 50 documents or fewer: keep the full decomposition.
        Matrix temp = u.times(s);
        result = temp.times(vtranspose);
    }
    result = result.transpose();
    double[][] results = result.getArray();

    return results;

But how do we determine the number of dimensions? Is there a principled way to choose how many dimensions to keep so that the reduced representation gives the best results? And what other parameters should be considered for LSA to perform well?


Regarding the choice of the number of dimensions:

1) http://en.wikipedia.org/wiki/Latent_semantic_indexing:

Another challenge to LSI has been the alleged difficulty in determining the optimal number of dimensions to use for performing the SVD. As a general rule, fewer dimensions allow for broader comparisons of the concepts contained in a collection of text, while a higher number of dimensions enable more specific (or more relevant) comparisons of concepts. The actual number of dimensions that can be used is limited by the number of documents in the collection. Research has demonstrated that around 300 dimensions will usually provide the best results with moderate-sized document collections (hundreds of thousands of documents) and perhaps 400 dimensions for larger document collections (millions of documents). However, recent studies indicate that 50-1000 dimensions are suitable depending on the size and nature of the document collection.

Checking the amount of variance in the data after computing the SVD can be used to determine the optimal number of dimensions to retain. The variance contained in the data can be viewed by plotting the singular values (S) in a scree plot. Some LSI practitioners select the dimensionality associated with the knee of the curve as the cut-off point for the number of dimensions to retain. Others argue that some quantity of the variance must be retained, and the amount of variance in the data should dictate the proper dimensionality to retain. Seventy percent is often mentioned as the amount of variance in the data that should be used to select the optimal dimensionality for recomputing the SVD.
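
Applied to the Jama code from the question, that variance rule could look roughly like the sketch below; it reuses the `sv` decomposition from the question's code and treats the quoted 70% figure as an adjustable threshold rather than a fixed constant:

    // Sketch: choose k so the first k squared singular values cover ~70% of the variance.
    // `sv` is the Jama SingularValueDecomposition already computed in the question's code.
    double[] singularValues = sv.getSingularValues();
    double totalVariance = 0.0;
    for (double sigma : singularValues) {
        totalVariance += sigma * sigma;          // each squared singular value = variance captured
    }
    double retained = 0.0;
    int k = singularValues.length;               // fall back to keeping everything
    for (int i = 0; i < singularValues.length; i++) {
        retained += singularValues[i] * singularValues[i];
        if (retained / totalVariance >= 0.70) {  // the 70% threshold mentioned above
            k = i + 1;
            break;
        }
    }
    // k could then replace the hard-coded 50 when truncating U, S and V^T.

With Jama, the truncated factors can also be taken directly with Matrix.getMatrix(0, rows - 1, 0, k - 1) instead of copying the arrays by hand.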



2) http://www.puffinwarellc.com/index.php/news-and-articles/articles/33-latent-semantic-analysis-tutorial.html?showall=1:

The trick in using SVD is in figuring out how many dimensions or "concepts" to use when approximating the matrix. Too few dimensions and important patterns are left out, too many and noise caused by random word choices will creep back in. The SVD algorithm is a little involved, but fortunately Python has a library function that makes it simple to use. By adding the one line method below to our LSA class, we can factor our matrix into 3 other matrices. The U matrix gives us the coordinates of each word on our “concept” space, the Vt matrix gives us the coordinates of each document in our “concept” space, and the S matrix of singular values gives us a clue as to how many dimensions or “concepts” we need to include.

    def calc(self): self.U, self.S, self.Vt = svd(self.A)

In order to choose the right number of dimensions to use, we can make a histogram of the square of the singular values. This graphs the importance each singular value contributes to approximating our matrix. Here is the histogram in our example.

[image: histogram of the squared singular values]

For large collections of documents, the number of dimensions used is in the 100 to 500 range. In our little example, since we want to graph it, we'll use 3 dimensions, throw out the first dimension, and graph the second and third dimensions.

The reason we throw out the first dimension is interesting. For documents, the first dimension correlates with the length of the document. For words, it correlates with the number of times that word has been used in all documents. If we had centered our matrix by subtracting the average column value from each column, then we would use the first dimension. As an analogy, consider golf scores. We don't want to know the actual score; we want to know the score after subtracting it from par. That tells us whether the player made a birdie, bogey, etc.
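
If you want to experiment with the centering step described above before taking the SVD, a minimal sketch (assuming `termdoc` is the raw `double[][]` term-document array from the question, with one column per document) might be:

    // Sketch: subtract each column's mean from that column so the first SVD dimension
    // no longer mostly reflects document length / raw term frequency.
    for (int col = 0; col < termdoc[0].length; col++) {
        double mean = 0.0;
        for (int row = 0; row < termdoc.length; row++) {
            mean += termdoc[row][col];
        }
        mean /= termdoc.length;
        for (int row = 0; row < termdoc.length; row++) {
            termdoc[row][col] -= mean;
        }
    }
    Matrix centered = new Matrix(termdoc);       // decompose this instead of the raw counts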



3) Landauer, T. K., Foltz, P. W., & Laham, D. (1998), 'An Introduction to Latent Semantic Analysis', Discourse Processes, 25, 259-284.

