Saturday, February 25, 2012

Query the Full-Text Words List?

Hi Guys. I’m doing searches in a DB and am using the Full-Text Indexing.

CONTAINS() will only process strings (words!) three letters or more in length. Even though it can do substring searching, it will only do this if it recognises the parameter as a word.

So take this code for example:

Dim mySQLStatement As String = "SELECT TOP 100 Description, Price, Stock FROM Products WHERE "

x = Split(strString, " ")

For i = 0 To x.GetUpperBound(0)

If x(i).Length > 2 Then

mySQLStatement &= " CONTAINS(Description, '*" & x(i) & "*') AND "

ElseIf x(i).Length = 1 Or x(i).Length = 2 Then

mySQLStatement &= " Description LIKE '%" & x(i) & "%' AND "

End If

Next

If mySQLStatement.EndsWith(" AND ") Then mySQLStatement = Left(mySQLStatement, Len(mySQLStatement) - 5)

mySQLStatement &= " ORDER BY PRICE DESC"

dolog(mySQLStatement)

mySQLCommand = New SqlCommand(mySQLStatement, mySQLConnection)

mySQLAdapter = New SqlDataAdapter(mySQLCommand)

myDataSet = New DataSet : mySQLAdapter.Fill(myDataSet)

That is code I’m using in a small proof-of-concept application I’m writing – so I’m aware I can use StringBuilders and should be using SPs and all that jazz.

It will process the queries people type in, so if the person were to type in “sql server”, it would perform all of the following queries: (this is by design by the way)

SELECT TOP 100 Description, Price, Stock FROM Products WHERE Description LIKE '%s%' ORDER BY PRICE DESC

SELECT TOP 100 Description, Price, Stock FROM Products WHERE Description LIKE '%sq%' ORDER BY PRICE DESC

SELECT TOP 100 Description, Price, Stock FROM Products WHERE CONTAINS(Description, '*sql*') ORDER BY PRICE DESC

SELECT TOP 100 Description, Price, Stock FROM Products WHERE CONTAINS(Description, '*sql*') ORDER BY PRICE DESC

SELECT TOP 100 Description, Price, Stock FROM Products WHERE CONTAINS(Description, '*sql*') AND Description LIKE '%s%' ORDER BY PRICE DESC

SELECT TOP 100 Description, Price, Stock FROM Products WHERE CONTAINS(Description, '*sql*') AND Description LIKE '%se%' ORDER BY PRICE DESC

SELECT TOP 100 Description, Price, Stock FROM Products WHERE CONTAINS(Description, '*sql*') AND CONTAINS(Description, '*ser*') ORDER BY PRICE DESC

SELECT TOP 100 Description, Price, Stock FROM Products WHERE CONTAINS(Description, '*sql*') AND CONTAINS(Description, '*serv*') ORDER BY PRICE DESC

SELECT TOP 100 Description, Price, Stock FROM Products WHERE CONTAINS(Description, '*sql*') AND CONTAINS(Description, '*serve*') ORDER BY PRICE DESC

SELECT TOP 100 Description, Price, Stock FROM Products WHERE CONTAINS(Description, '*sql*') AND CONTAINS(Description, '*server*') ORDER BY PRICE DESC

I figured that if CONTAINS() ignores anything under 3 letters, then I’ll use LIKE for things under 3 letters. Problem is though, only the lines in bold work.

This is because CONTAINS() does not bother searching (as far as I can tell) on the parameters: '*ser*','*serv*' and '*serve*'.

So. My question is. How can I find out what strings CONTAINS() does and does not consider searchable words. Because I’d like to check if the query should be executed via LIKE or CONTAINS and build the SQL Statement accordingly. Having said that, using LIKE on its own is running quickly enough, but I would really like to use CONTAINS() instead.

I may of course be making so big logical error here or not understand something in particular, so any help would be appreciated…

Jamie, I need some more information to help answer your question:

What exactly do you mean when you say only the queries in bold "work"? Do you get an error or no results?
What is your sample data like?
What is your intention behind putting '*' before and after a string?

--
Sara Tahir
Program Manager
Microsoft SQL Server

|||

plenderj wrote:

CONTAINS() will only process strings (words!) three letters or more in length. Even though it can do substring searching, it will only do this if it recognises the parameter as a word.
...
I figured that if CONTAINS() ignores anything under 3 letters, then I’ll use LIKE for things under 3 letters. Problem is though, only the lines in bold work.

This is because CONTAINS() does not bother searching (as far as I can tell) on the parameters: '*ser*','*serv*' and '*serve*'.

My question is. How can I find out what strings CONTAINS() does and does not consider searchable words.

Full-Text Search results for your queries depend on multiple things including the wordbreaker, noise word list and whether the intention is to prefix, etc. Without that information, it’s hard to recommend which approach is better.

Here is some information that may help:

Full-Text is token (word) based search and not a substring search. Therefore it will not find arbitrary 3 char string patterns in middle of strings.

Tokenizing is dependent on the wordbreaker and that depends on the language of the column. Some tokenizers could consider the * as punctuation and strip it out - others might leave it intact. In any case Full-Text will be consistent in query and indexing, if the intent was to search the exact string with * and so will be able to match.

Full-Text does do prefix search - meaning if the user provides the leading part of a token, it can find all the tokens that have that prefix. The syntax for such a query is ' "foo*" ' where foo is the prefix and both the prefix and the * are enclosed in double-quotes. So if the intent below of trailing * was to do prefix match, you would need to put it in double-quotes.

There is no support for equivalent suffix match. So if the intent of leading * was that, then Full-Text does not support it.

Full-Text does not have any restrictions on the size of tokes - except 64 characters is max and by default single character token are considered noise.

The only things Full-Text does not consider searchable are the noise words – you can find that list in the noise word file for the specific language.

|||

Hello, everyone! I started to use full-text search and found out this not nice limitation in length of words and noise words. Firstly, I'm not sure that this limit exist, because I can search for two letters long words. If we look in noise file we can find all letters typed in, so one letter long words are noise words and this is not lenght limitation (must try to remove those letters and try search for them).
Again, if I have noise words in query I get error message "Server: Msg 7619, Level 16, State 1, Line 1" which doesn't tell me anything about noise word, so I can't determine it and do proper search using "like" for noise words and "contains" for all other.
Finaly, question. What is a best practice to go around this problem. Import all noise words into application or remove all lines from noise words file.

|||Strange that contains doesn't allow one letter words (just letter) but in noise word files I can see letters. If I remove them, doesn't help.|||Erik,
Firstly, The above reference limit of 3 characters does not exist (otherwise, what would of been the point in adding FTS to SQL 7.0, 2000 and 2005 in the first place?)

Secondly, what exactly was the noise word file you modified? Was it noise.enu (US English) or some other noise.* file under \FTDATA\SQLServer\Config where you have SQL Server 2000 installed? Assuming that you modified the correct file for the "Language for Word Breaker" for your FT-enabled table, did you run a Full Population? If not, then re-run a Full Population as this is required after making changes to the noise word files.

Thanks,
John
SQL Full Text Search Blog
http://spaces.msn.com/members/jtkane/
|||I removed data from all of noise word files, but limit for one letter is stil exist. I also done full population.
It's not realy a problem now, problem is that if I want to install my application I also need different instance of SQL server, right? Also, I can run application on same instance with others but then I need to determine those noise words, but I'm not sure that there is a safe way to do it.|||Hi Erik,
So, you left only one letter? What really needs to be done is to leave a single space character in your language specific noise.* file. If you're using US English, then the noise word file is noise.enu. You can determine this by running the following code in your FT-enabled database:

sp_help_fulltext_columns
-- FULLTEXT_LANGUAGE value of 1033 = US English

There was no need to remove all text from all noise word files. If you leave a single space in the nosie word file (noise.enu) and then run a Full Population it will work for you.

No, you don't need to install a different instance (unless you want to), as you can use the default install of SQL Server 2000. Note, I'm assuming that you're using SQL Server 2000 (let me know if you're using SQL Server 2005), there is one MSSearch service for SQL 2000, but the noise word files are installed into each named instance folder, specificly under:

\MSSQL<$Instance_Name>\FTDATA\SQLServer<$Instance_Name>\Config

for example: for Instance name "SQL80":
\MSSQL$SQL2K\FTData\SQLServer$SQL2K\Config\noise.enu

On safe way to determine where are the instance noise word files, is to store the above path in a table along with the instance name. The path is also recorded in the registry under an instance specific registry key:

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\ContentIndexCommon\LanguageResources\Override\SQLServer$SQL2K\English (United States)
Locale value= 1033
NoiseFile value= d:\mssql80\MSSQL$SQL2K\FTData\SQLServer$SQL2K\Config\noise.enu

Hope that helps!
John
SQL Full Text Search Blog
http://spaces.msn.com/members/jtkane/

|||Hi Jamie,
Drop me a line re: this an S205 noise lists and how to develop your own/circumvent.
Do remember they are localised.
fb|||I've since resolved/worked-around by using LIKE:

Shared Function GetMatchingDSForProducts(ByVal strString As String, ByRef mySQLStatement As String) As DataSet
Dim mySQLConnection As New SqlConnection(ConfigurationManager.ConnectionStrings("SQLConn").ToString)
mySQLStatement = "SELECT TOP 100 Description, Price, Stock, [SAP Code] FROM Products WHERE "
mySQLConnection.Open()
Dim x() As String, i As Integer
x = Split(strString, " ")
For i = 0 To x.GetUpperBound(0)
If Not x(i) = "instock" Then mySQLStatement &= " Description LIKE '%" & x(i) & "%' AND "
Next
If Not InStr(strString, "instock") = 0 Then mySQLStatement &= " STOCK > 0 "
If mySQLStatement.EndsWith(" AND ") Then mySQLStatement = Left(mySQLStatement, Len(mySQLStatement) - 4)
mySQLStatement &= " ORDER BY PRICE DESC"
Dim mySQLCommand As New SqlCommand(mySQLStatement, mySQLConnection)
Dim mySQLAdapter As New SqlDataAdapter(mySQLCommand)
mySQLAdapter.SelectCommand = mySQLCommand
Dim myDataSet As New DataSet : mySQLAdapter.Fill(myDataSet)
mySQLConnection.Close()
Return myDataSet
End Function

No comments:

Post a Comment