/ WRITING SPEAKING
← back

Today I learned: Hyphen is a delimiter

3 min read ·

It’s going to be a short and brief note about one nonobvious (at least for me) full-text search related thing in MySQL. So, starting with some prerequirements, imagine that you have a database containing some users data. It will at least contain a name and a surname. It’s MySQL database and you have fulltext index on the surname column. It’s of course filled with tons of users. In many cases, these users have a double-barrelled surname. Now let’s say that you have to find all the people with the surname Jones-Smith. You want to use Natural Language Mode (the default one) since in Boolean Mode + and - are special characters that tell if the word has to be present or absent in the text.

sql
SELECT
*
FROM
users
WHERE
MATCH(surename) AGAINST ("Jones-Smith");
sql
SELECT
*
FROM
users
WHERE
MATCH(surename) AGAINST ("Jones-Smith");

And.. you got all your people. Great. But you also got hundreds of rows with people with surnames like Jones, Smith, Williams-Smith, Smith-Brown, Wilson-Jones, Jones-Jones, Smith-Doe, etc. Kind of awkward, isn’t it?

When MySQL creates an index for some column it uses a very simple parser to split the text into words. And hyphen is one of the delimiters. That’s a bit of an oversimplification, but it works somehow like this: check if Jones or Smith appear in the surname column. Notice the or operator. You can think of this indexed surname column as of rows of arrays filled with words extracted from the original surname. So for sure, the result will be positive for each row containing either Jones or Smith. Or both.

What is also important is the fact that by default the minimum word length, that is extracted from text and is matched in the query is 3. It is specified by ft_min_word_len in mysqld. So if you have a surname like Jo-Do it won’t be matched with neither Jo-Do, Jo nor Do.

Workarounds

Let’s say that you really need to extract all the Jones-Smiths from your database and you cannot have anyone else in your result set. It would be a good thing to be able to treat hyphen as a word character. And it can be achieved in the following ways:

Recompiling? I don’t like it.

Kind of better.

Sounds much better. But there comes my favorite h̶a̶c̶k̶ workaround:

Bug or feature?

I cannot say if it’s a bug or a feature. In the example above this MySQL behavior is obviously a bug. And in many other cases, it’s also going to look like a bug. But there’re often hyphens between words in many texts — I personally tend to write this way, without any additional delimiters — so I guess I kind of see the use case in treating hyphen as a delimiter.