How to ignore hyphens in Elasticsearch

Martin Beranek
3 min read · May 2, 2023
DALL-E's idea of a searcher trying to search a library while ignoring hyphens

I recently stumbled upon a problem: I wanted to index data containing words divided by hyphens. The trouble is that once the data were ingested, a simple `match` query did not treat the hyphenated word the same as its unhyphenated form, e.g. black-cat is not blackcat. I want Elasticsearch to ignore the hyphens altogether.

Testing data

Let’s create an ad hoc index and push documents that should be treated as the same during search. Once the data are inserted, let’s query them and see the result.

POST test/_doc
{
  "name": "blackcat"
}

POST test/_doc
{
  "name": "black-cat"
}

GET test/_search
{
  "query": {
    "match": {
      "name": "blackcat"
    }
  }
}

The results are, unsurprisingly:

"hits": [
{
...
"_source": {
"name": "blackcat"
}
}
]

If we succeed, the search should return both cats and give them the same score.

Pick a different tokenizer

The standard tokenizer used by Elasticsearch splits black-cat into the separate tokens black and cat, so it never produces the single token blackcat, and the two spellings never match each other. That is usually fine, which is why there is hardly ever a need to change it. In our case, let’s pick a tokenizer that helps us ignore the hyphens and makes the analyzer more forgiving. Let’s use n-grams.
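To see why the two spellings never find each other, we can run both values through the standard tokenizer with the _analyze API (a quick check in the dev console, separate from the index setup):

GET _analyze
{
  "tokenizer": "standard",
  "text": ["black-cat", "blackcat"]
}

The first value comes back as the tokens black and cat, the second as the single token blackcat, so neither document contains the other’s tokens. Now let’s define the n-gram index: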

PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ngrams_analyzer": {
          "tokenizer": "ngrams"
        }
      },
      "tokenizer": {
        "ngrams": {
          "type": "ngram",
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "ngrams_analyzer"
      }
    }
  }
}
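One practical note: if you followed along, the test index from the previous section already exists, so the PUT above will be rejected because the index is already there. Delete the old index first, re-run the PUT, and then re-ingest the two documents:

DELETE test

# re-run the PUT above, then:

POST test/_doc
{
  "name": "blackcat"
}

POST test/_doc
{
  "name": "black-cat"
}

The same applies before the second mapping change later in this post.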

We intentionally used only letters and digits as token characters, so hyphens never become part of any gram. Now the search works as it should. Let’s do a test:

GET test/_search
{
  "query": {
    "match": {
      "name": "black-cat"
    }
  }
}

And the data:

...
"hits": [
  {
    ...
    "_score": 2.861892,
    "_source": {
      "name": "black-cat"
    }
  },
  {
    ...
    "_score": 2.790991,
    "_source": {
      "name": "blackcat"
    }
  }
]
...

The output shows both cats, but with different scores. Since we are only interested in single words, we might have been using the wrong tool all along. N-grams can be helpful for data like ours, but they are generally overkill. As the documentation says, they are meant for languages that don’t use spaces or that have long compound words. That’s not the case for our black cat.
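To see where the score difference comes from, compare the grams the two values produce (using the ngrams_analyzer defined above; with the default ngram settings it emits 1- and 2-character grams):

GET test/_analyze
{
  "analyzer": "ngrams_analyzer",
  "text": ["black-cat", "blackcat"]
}

Because the hyphen is not a token character, black-cat is split into black and cat before the grams are built, so it never yields the kc gram that blackcat does. The two documents end up with slightly different token sets and lengths, which is enough for BM25 to score them differently.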

Replacing the characters

What if we simply removed the offending characters instead of generating n-grams? Also, since we are only interested in single words, how about using the keyword tokenizer instead of the standard one? As long as the value is a single word, there is hardly any harm done to our storage. For further explanation, check the Stack Overflow question.

PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "clean_special": {
          "type": "pattern_replace",
          "pattern": "-",
          "replacement": ""
        }
      },
      "analyzer": {
        "clean_special": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "clean_special"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "clean_special"
      }
    }
  }
}
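Before searching, a quick sanity check with _analyze (using the clean_special analyzer we just defined) shows what the two values turn into:

GET test/_analyze
{
  "analyzer": "clean_special",
  "text": ["black-cat", "blackcat"]
}

Both should come back as the single token blackcat, so the match query treats them identically: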

GET test/_search
{
  "query": {
    "match": {
      "name": "black-cat"
    }
  }
}

{
  ...
  "hits": {
    ...
    "hits": [
      {
        ...
        "_score": 0.18232156,
        "_source": {
          "name": "blackcat"
        }
      },
      {
        ...
        "_score": 0.18232156,
        "_source": {
          "name": "black-cat"
        }
      }
    ]
  }
}

Summary

The second solution, using a custom filter, is better here simply because both results get the same score, which is not true for n-grams. On the other hand, n-grams are better suited to longer inputs. As always, everything depends on the input data.

Anyway, that’s it for today, like & subscribe.

