SQL Most effective way to store every word in a document separately

Here's my situation (or see the TL;DR at the bottom): I'm trying to build a system that searches user-entered words across several documents and returns the documents that contain those words. Users will be searching through thousands of documents, each 10 to 100+ pages long, stored on a web server.

The solution I have right now is to store each unique word in a table with an ID (there are only maybe 120,000 relevant words in the English language), and then in a separate table store the word ID, the document it appears in, and the number of times it appears in that document.

E.g., document foo's text is

abc abc def

and document bar's text is

abc def ghi

Documents table will have

id | name

1  | 'foo'
2  | 'bar'

Words table:

id | word

1  | 'abc'
2  | 'def'
3  | 'ghi'

Word Document table:

word id | doc id | occurrences

1       | 1      | 2
1       | 2      | 1
2       | 1      | 1
2       | 2      | 1
3       | 2      | 1

As you can see, when you have thousands of documents and each contains thousands of unique words, the Word Document table blows up very quickly and takes far too long to search through.
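For reference, the three-table design above can be sketched in a few lines. This is only an illustration of the schema in the question, using SQLite for brevity; the table and column names are made up to match the example:

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
CREATE TABLE word_document (
    word_id INTEGER REFERENCES words(id),
    doc_id  INTEGER REFERENCES documents(id),
    occurrences INTEGER,
    PRIMARY KEY (word_id, doc_id)
);
""")

def index_document(name, text):
    """Insert a document and one word_document row per unique word."""
    doc_id = conn.execute(
        "INSERT INTO documents (name) VALUES (?)", (name,)
    ).lastrowid
    for word, count in Counter(text.lower().split()).items():
        conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
        word_id = conn.execute(
            "SELECT id FROM words WHERE word = ?", (word,)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO word_document VALUES (?, ?, ?)", (word_id, doc_id, count)
        )

index_document("foo", "abc abc def")
index_document("bar", "abc def ghi")

# Documents containing 'abc', ranked by occurrence count.
rows = conn.execute("""
    SELECT d.name, wd.occurrences
    FROM word_document wd
    JOIN words w ON w.id = wd.word_id
    JOIN documents d ON d.id = wd.doc_id
    WHERE w.word = 'abc'
    ORDER BY wd.occurrences DESC
""").fetchall()
print(rows)  # [('foo', 2), ('bar', 1)]
```

With a composite primary key on (word_id, doc_id), the lookup by word is an index seek rather than a scan, but the row count still grows as (documents × unique words per document), which is the scaling problem described above.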

TL;DR My question is this:

How can I store searchable data from large documents in an SQL database while retaining the ability to use my own search algorithm, based on custom factors such as occurrence counts (I am aware SQL Server has a built-in one for .doc and PDF files), without ending up with an outright massive table linking every word to each document and its properties in that document?

Sorry for the long read and thanks for any help!


Rather than building your own search engine on SQL Server, have you considered using a C#/.NET implementation of the Lucene search APIs? Have a look at https://github.com/apache/lucene.net


Good question. I would piggyback on the existing solution in SQL Server (full-text indexing). It has an integrated indexing engine that will likely optimise considerably better than your own code could (unless the developers at Microsoft were lazy, or they only got a dime to build it :-)).

Please see the SQL Server full-text indexing documentation for background. You can query views such as sys.fulltext_index_fragments or use the associated stored procedures.

Of course, piggybacking on an existing solution has some drawbacks:

  • You need a license for the solution.
  • When it can no longer serve your needs, you will have to program everything yourself.
  • On the other hand, if you let SQL Server do the indexing, you can build your own solution more easily and in less time.
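To show the "let the engine do the indexing" idea concretely without a SQL Server instance: SQL Server's own syntax uses CONTAINS/FREETEXT predicates over a full-text index, but the same division of labour can be sketched with SQLite's built-in FTS5 module (an analogous full-text engine, swapped in here purely for illustration; it requires an SQLite build with FTS5 enabled):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The virtual table builds and maintains the inverted index for us;
# we never manage a word/word_document table ourselves.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(name, body)")
conn.executemany(
    "INSERT INTO docs (name, body) VALUES (?, ?)",
    [("foo", "abc abc def"), ("bar", "abc def ghi")],
)
# MATCH performs the indexed full-text search; bm25() provides relevance
# ranking (lower values rank better in FTS5).
rows = conn.execute(
    "SELECT name FROM docs WHERE docs MATCH 'abc' ORDER BY bm25(docs)"
).fetchall()
print([r[0] for r in rows])
```

The trade-off the answer describes applies here too: the engine's ranking function (bm25 in this case) replaces your custom scoring, so this only works as long as the built-in ranking, or its tuning knobs, cover your "custom factors".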


    Your question strikes me as naive. In the first place, you are begging the question: you are giving a flawed solution to your own problem and then explaining why it can't work. Your question would be much better if you simply described what your objective is... and then got out of the way so that people smarter than you could tell you HOW to accomplish that objective.

    Just offhand... the database sounds like a really dumb idea to me. People have been grepping text with command-line tools in UNIX-like environments for a long time. Either something already exists that will solve your problem, or a decent Perl script will "fake" it for you — depending on your real-world constraints, of course.

    Depending on what your problem actually is, I suspect that this could get into some really interesting computer science questions-- indexing, Bayesian filtering, and who knows what else. I suspect, however, that you're making a very basic task more complicated than it needs to be.

    TL;DR My answer is this:

    Why wouldn't you just write a script to go through a directory... and then use regexes to count the occurrences of the word in each file that is found there?
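    That suggestion is only a few lines in practice. A minimal sketch (the function name and the whole-word matching rule are my own choices, not something from the answer):

```python
import os
import re

def count_occurrences(directory, word):
    """Count whole-word, case-insensitive matches of `word`
    in each file under `directory`."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    counts = {}
    for root, _dirs, files in os.walk(directory):
        for filename in files:
            path = os.path.join(root, filename)
            try:
                with open(path, encoding="utf-8", errors="ignore") as f:
                    counts[path] = len(pattern.findall(f.read()))
            except OSError:
                continue  # skip unreadable files
    return counts
```

    Note this re-reads every file on every search, which is exactly what an index avoids; for thousands of 100-page documents that is the real-world constraint the answer hand-waves over.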
