Advanced string indexes

Let us see how to apply all the concepts we saw earlier to build a small full-text search engine with sheraf.

Full-text search takes a query from a user and tries to match documents containing the words of that query. Generally we expect it to tolerate typos, and to understand plural forms, conjugations and variants of the same word (for instance work and working should be considered very close). We also expect the results to be ordered by pertinence: for instance, a document containing a search word several times should rank higher than a document where it appears only once.

General concepts

Those expectations lead to several processing steps on the text we index and on the search queries. Here are some of the steps that can be involved:

Alphabet reduction

To deal with a smaller set of characters, you can map your alphabet to a subset. For instance you can ignore the case during indexation and querying, and consider everything is lowercase.

Another technique is accent folding. It consists in considering accented letters (for instance e, é and è) as equivalent, and ignoring accents during indexation and querying. This can be done for instance with unidecode.unidecode().
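
For instance, a minimal reduction step combining case folding and accent folding with unidecode could look like this (the function name is ours, not part of any library):

>>> import unidecode
>>> def reduce_alphabet(string):
...     # Case folding then accent folding: "É" and "e" become the same letter.
...     return unidecode.unidecode(string.lower())
...
>>> reduce_alphabet("Café Été")
'cafe ete'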

Stemming or lemmatization

Stemming consists in deleting prefixes and suffixes, and reducing a word to another word that represents its sense. For instance translations and translated can both be reduced to translat. Both words refer to the same concept we call here translat, even if it is not a proper English word. Thus, a query containing translations can return results containing translated. With the proper tooling, this step is done automatically, but it depends on the language: different languages have different prefixes and suffixes. Stemming can be done with libraries like pystemmer or nltk.

Lemmatization is a variant of stemming that only reduces words to existing words. For instance translations and translated could both be reduced to translate. This is slower than stemming, but more accurate. You can lemmatize with libraries like nltk.
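
As a sketch, assuming nltk is installed and the wordnet corpus has been downloaded, stemming and lemmatization could look like this (the exact outputs may vary with the stemmer and library version):

>>> from nltk.stem.snowball import SnowballStemmer
>>> from nltk.stem import WordNetLemmatizer
>>> stemmer = SnowballStemmer("english")
>>> stemmer.stem("translations")
'translat'
>>> stemmer.stem("translated")
'translat'
>>> lemmatizer = WordNetLemmatizer()  # requires nltk.download("wordnet")
>>> lemmatizer.lemmatize("translations")
'translation'
>>> lemmatizer.lemmatize("translated", pos="v")
'translate'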

Typo correction

Errare humanum est: this adage remains true for search queries. Typo correction is about allowing users to make small mistakes in the text they index or query, and still match relevant results. This is sometimes called fuzzy searching.

Levenshtein and variants

Some algorithms like the Levenshtein distance or the Damerau-Levenshtein distance estimate the similarity between two words. The differing letters and their number are taken into account: the distance is the minimum number of operations needed to transform one word into the other (insertion, deletion and substitution for Levenshtein; insertion, deletion, substitution and transposition for Damerau-Levenshtein). Computing the Levenshtein distance between one query word and a set of reference words is tedious, since you have to compute the distance between your query word and each word in the set. Some data structures like BK-trees let you organize your reference set so that you only need to compute the Levenshtein distance against a subset of words before finding your matches.

The Levenshtein distance is implemented in libraries like fuzzywuzzy or python-levenshtein. difflib also brings string similarity methods.
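
As an illustration, here is a naive pure-Python implementation of the Levenshtein distance (the libraries mentioned above are much faster), together with the similarity ratio from difflib:

>>> import difflib
>>> def levenshtein(a, b):
...     # Dynamic programming over the rows of the edit distance matrix.
...     previous = list(range(len(b) + 1))
...     for i, char_a in enumerate(a, 1):
...         current = [i]
...         for j, char_b in enumerate(b, 1):
...             current.append(min(
...                 previous[j] + 1,                       # deletion
...                 current[j - 1] + 1,                    # insertion
...                 previous[j - 1] + (char_a != char_b),  # substitution
...             ))
...         previous = current
...     return previous[-1]
...
>>> levenshtein("abitbol", "abitbool")
1
>>> levenshtein("george", "geroge")  # a transposition counts as two substitutions
2
>>> round(difflib.SequenceMatcher(None, "abitbol", "abitbool").ratio(), 2)
0.93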

SymSpell

Other algorithms like Wolf Garbe’s LinSpell or SymSpell take a different approach, and seem to be faster than BK-trees with Damerau-Levenshtein. SymSpell indexes each dictionary word along with all the variants obtained by deleting characters from it, so that at query time only the delete variants of the query need to be generated and looked up. Thus SymSpell essentially trades memory usage and pre-computation time for query speed.
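
Here is a very reduced sketch of that symmetric delete idea, limited to an edit distance of 1; real implementations like SymSpell handle larger distances, word frequencies and much bigger dictionaries:

>>> def deletes(word):
...     # All the variants of the word with exactly one character removed.
...     return {word[:i] + word[i + 1:] for i in range(len(word))}
...
>>> def build_index(words):
...     # Map every word and every delete variant to the original words.
...     index = {}
...     for word in words:
...         for variant in {word} | deletes(word):
...             index.setdefault(variant, set()).add(word)
...     return index
...
>>> def lookup(index, query):
...     # Only the delete variants of the query are computed at query time.
...     candidates = {query} | deletes(query)
...     return set().union(*(index.get(candidate, set()) for candidate in candidates))
...
>>> index = build_index({"george", "abitbol"})
>>> lookup(index, "georje")
{'george'}
>>> lookup(index, "abitbo")
{'abitbol'}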

Another complementary approach is to check words against dictionaries and then correct them. This is what pyenchant does for instance, based on tools like aspell. Note that pyenchant allows you to define your own dictionary.
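
For instance, assuming an English dictionary is available on the system, a pyenchant session could look like this:

>>> import enchant
>>> dictionary = enchant.Dict("en_US")
>>> dictionary.check("translation")
True
>>> dictionary.check("transaltion")
False
>>> suggestions = dictionary.suggest("transaltion")  # likely contains "translation"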

Restricting user input

Sometimes the best solution to a problem is not having to deal with it at all. When possible, restricting user input to a valid set of data saves you from dealing with imprecise data. This is generally implemented with user interface tricks, such as suggestions as you type. However, that may be difficult to set up when the valid data set is very large or heterogeneous.

Pertinence matching

Pertinence is about knowing whether a word takes an important place in a document. For instance a document where a searched word appears several times is more relevant than a document where it appears only once. A document where a searched word appears in the title is more pertinent than a document where it appears in the footer. You can do this with algorithm families like BM25. BM25 is implemented for instance in rank_bm25.
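
For instance, here is a quick sketch with rank_bm25, with a deliberately tiny corpus and naive whitespace tokenization:

>>> from rank_bm25 import BM25Okapi
>>> corpus = [
...     "george rides a horse",
...     "george talks to george",
...     "peter rides a horse",
...     "peter talks to steven",
...     "the sheriff rides away",
... ]
>>> bm25 = BM25Okapi([document.split(" ") for document in corpus])
>>> # The document where "george" appears twice ranks first.
>>> bm25.get_top_n("george".split(" "), corpus, n=1)
['george talks to george']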

Depending on the number of documents, their size, and the nature of your data (natural language, small sets of terms you chose, etc.), you might want to use and tune one technique or another. There is no magic formula that gives perfect results. You can also define strategies where you use some of those techniques only when exact matches do not return good enough results. Some libraries like Whoosh implement almost all of the previous concepts, and also manage the storage of the indexes.
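
For instance, here is roughly what a tiny Whoosh index could look like; the schema and field names are only illustrative:

>>> import tempfile
>>> from whoosh import index
>>> from whoosh.fields import Schema, TEXT
>>> from whoosh.qparser import QueryParser
>>> schema = Schema(name=TEXT(stored=True))
>>> ix = index.create_in(tempfile.mkdtemp(), schema)
>>> writer = ix.writer()
>>> writer.add_document(name="George Abitbol")
>>> writer.commit()
>>> with ix.searcher() as searcher:
...     query = QueryParser("name", ix.schema).parse("george")
...     results = searcher.search(query)
...     assert results[0]["name"] == "George Abitbol"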

Use cases

Let us see what we can do with those concepts in sheraf. We do not pretend sheraf can replace tools like Whoosh, but we want to experiment with the flexibility sheraf offers.

Name completion

Imagine we have a cowboy database, and we would like to find cowboys by their names. We have an input field, and we would like to suggest valid cowboy names as the user is typing. Then we can require the user to choose one of the valid names for the search. The idea is to find cowboy names that are prefixed by the user input. For instance, if we have a cowboy named George Abitbol, then he should appear in the suggestion box as soon as we type Geor or abit.

Note

This is just an example. In a real situation, querying the database each time a user presses a key does not seem to be a good idea. Periodically generating and caching the valid data you want to suggest sounds like a better way to achieve this.

  • We do not have to understand a whole natural language like English, because proper nouns do not appear in a dictionary. Also, each name stands for a unique person, and names have no synonyms. In that case it seems useless to deal with stemming or lemmatization.

  • We can consider that each search word will be indexed at most once per cowboy. Thus, we can avoid using pertinence algorithms.

  • We just want to find approximate matches, so case and accents won’t matter. Thus, we can use alphabet reduction techniques.

  • We can provide useful data to users before they can make a typo, so typo correction algorithms are not needed here.

>>> import unidecode
>>> import itertools
>>> def cowboy_indexation(string):
...     # Reduce the alphabet: lowercase and strip the accents.
...     lowercase = string.lower()
...     unaccented = unidecode.unidecode(lowercase)
...     names = unaccented.split(" ")
...     # Every ordering of the name words, e.g. "george abitbol" and "abitbol george".
...     permutations = {
...         " ".join(perm)
...         for perm in itertools.permutations(names)
...     }
...     # Every non-empty prefix of every permutation, including the full string.
...     return {
...         permutation[:x]
...         for permutation in permutations
...         for x in range(1, len(permutation) + 1)
...     }
...
>>> def cowboy_query(string):
...     # Apply the same alphabet reduction to the query string.
...     lowercase = string.lower()
...     return {unidecode.unidecode(lowercase)}
...
>>> class Cowboy(sheraf.Model):
...     table = "cowboys_prefixes"
...     name = sheraf.StringAttribute().index(
...         values=cowboy_indexation,
...         search=cowboy_query,
...     )

The indexation method lowercases the name, removes the accents, builds all the possible permutations of the words in the name (because we want the user to be able to type George Abitbol as well as Abitbol George), and then builds all the possible prefixes of those permutations.
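
For instance, here are a few of the keys generated for George Abitbol:

>>> entries = cowboy_indexation("George Abitbol")
>>> "geor" in entries
True
>>> "abitbol geor" in entries
True
>>> "george abitbol" in entries
True
>>> "eorge" in entries
False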

>>> with sheraf.connection(commit=True):
...    george = Cowboy.create(name="George Abitbol")
...
...    assert [george] == Cowboy.search(name="George")
...    assert [george] == Cowboy.search(name="gEoRgE")
...    assert [george] == Cowboy.search(name="Abitbol")
...    assert [george] == Cowboy.search(name="geo")
...    assert [george] == Cowboy.search(name="abi")
...    assert [george] == Cowboy.search(name="George Abi")
...    assert [george] == Cowboy.search(name="Abitbol Geo")
...
...    assert [] == Cowboy.search(name="Peter")
...    assert [] == Cowboy.search(name="eorge")

We can see that any prefix of the name words, in any order, is enough to find the cowboy back.

Todo

Order the results by the number of occurrences.