How I built my blog: Search

How is it possible to have full-text search on a static site? Read on.

What is searching all about?

Search is the technique of finding the documents that best match the given search criteria.

Sounds simple, right? Unfortunately it’s anything but.

Let’s take an uncomplicated example: I want to find every document that contains the word cat or cats.

What this usually means is that I want a result that:

  • has all documents that contain the word cat or cats
  • does not include documents that merely contain caterpillar or catastrophe, just cat
  • is ordered by how densely those terms occur in each document, with the most frequent at the top
  • ranks documents with cat in the title higher than documents with cat only in the text. A ridiculous example:
    • title: Cat rescues human, text: all about cat rescuing human
    • title: Baking cookies for dummies, text: my cat just jumped on a cookie tray.
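
To make these requirements concrete, here is a minimal sketch in plain Python. It is only an illustration and not how a real search engine scores documents: the count_term and score helpers, the title_boost value of 3.0, and the third sample document are all assumptions made up for this example.

```python
import re


def count_term(term, text):
    """Count whole-word, case-insensitive matches of term, allowing a plural 's'."""
    pattern = r"\b" + re.escape(term) + r"s?\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def score(term, doc, title_boost=3.0):
    """Term density in the text plus a boosted term density in the title."""

    def density(text):
        words = text.split()
        return count_term(term, text) / len(words) if words else 0.0

    return title_boost * density(doc["title"]) + density(doc["text"])


docs = [
    {"title": "Baking cookies for dummies", "text": "my cat just jumped on a cookie tray."},
    {"title": "Cat rescues human", "text": "all about cat rescuing human"},
    # Hypothetical document that must NOT match a search for "cat".
    {"title": "Caterpillar facts", "text": "a catastrophe for the garden"},
]

# Keep only matching documents and put the best score first.
ranked = sorted(
    (d for d in docs if score("cat", d) > 0),
    key=lambda d: score("cat", d),
    reverse=True,
)
for d in ranked:
    print(d["title"])
```

The word boundary (\b) is what keeps caterpillar and catastrophe out of the results, and the title boost is what pushes the cat rescue story above the cookie recipe.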

That small requirement - wanting to find what you are searching for - means that advanced tools, search indexers and engines, are needed to return relevant results.

Those tools first sanitize the documents and prepare an index to search against. Thanks to that index, they return relevant results very quickly.

The following processes / tools are used:

Stemmer

Stemming is the process of reducing inflected or derived words to their base or stem form:

  • templating becomes templat
  • template becomes templat
  • templates becomes templat

When the same stemmer is applied to both the documents and the search query, a search also matches inflected and derived forms of a word.
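
A toy suffix-stripping stemmer shows the idea. This is a deliberately naive sketch, not the Porter stemmer or whatever algorithm a real engine ships with. The suffix list and the minimum stem length of three characters are assumptions chosen so that the three examples above come out as templat.

```python
# Suffixes are tried in this order, so "es" is checked before "e" and "s".
SUFFIXES = ("ing", "es", "e", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ("templating", "template", "templates"):
    print(w, "->", toy_stem(w))
```

Because both the indexed words and the query go through the same function, a query for templates finds a document that only mentions templating.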

Tokenizer

No search engine operates on the raw text. First it processes the document and extracts the individual words from it, using a component called a tokenizer. The simplest example is a whitespace tokenizer, which just splits the text on spaces.
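
As a rough sketch, a whitespace tokenizer is a one-liner in Python. The second function, word_tokenize, is an assumed and slightly stricter variant added here only to show why plain whitespace splitting is often not enough.

```python
import re


def whitespace_tokenize(text):
    """Split on runs of whitespace; punctuation stays attached to the words."""
    return text.split()


def word_tokenize(text):
    """Lowercase the text and keep only runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


sentence = "My cat just jumped on a cookie tray."
print(whitespace_tokenize(sentence))
print(word_tokenize(sentence))
```

Note that the whitespace version keeps "tray." with its trailing dot, so it would not match a query for tray unless a later filter cleans it up.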

Text processors and filters

There are many of those; the most commonly used are:

  • blacklisted words filter
  • stop words filter - stop words like a, an, the are not relevant for search results and can be safely removed
  • special characters remover / sanitizer
  • short words filter - very short words (let’s say 1 or 2 characters) are usually not relevant and can be removed
  • synonyms filter, abbreviations filter
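
Chained together, such filters form a small pipeline. The sketch below is an assumption-laden illustration: the word lists, the minimum length of 3 and the kitty-to-cat synonym are invented for the example and are not taken from any particular engine.

```python
STOP_WORDS = {"a", "an", "the"}  # tiny illustrative subset
BLACKLIST = {"lorem"}            # hypothetical blacklisted word
SYNONYMS = {"kitty": "cat"}      # hypothetical synonym mapping
MIN_LENGTH = 3                   # drops words of 1 or 2 characters


def strip_special(token):
    """Remove every character that is not a letter or a digit."""
    return "".join(ch for ch in token if ch.isalnum())


def apply_filters(tokens):
    """Run each token through the sanitizer and the filters, in order."""
    result = []
    for token in tokens:
        token = strip_special(token.lower())
        if len(token) < MIN_LENGTH:
            continue  # short words filter
        if token in STOP_WORDS or token in BLACKLIST:
            continue  # stop words and blacklisted words filters
        result.append(SYNONYMS.get(token, token))  # synonyms filter
    return result


print(apply_filters("The kitty sat on a tray, lorem!".split()))
```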

What engine to use

Obviously, writing these kinds of tools yourself is time-consuming and unnecessary - excellent libraries are ready and free to use. A few good examples are:

  • Apache Solr - indexer based on the Lucene engine, open source and one of the best
  • Whoosh - Python-based indexer, much simpler to use than Solr but much worse at complex tasks
  • Lunr.js - Solr-like engine written in JavaScript

I am using Lunr, but since it runs in the browser I am not building the index on the fly - that would take ages.

Instead, I used the Python version of Lunr to create a serialized index during page generation, which I then load and search against using a simple component.

All that was left was to pass the query to Lunr and display the results.
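
The build-once, load-later idea can be sketched with the standard library alone. To be clear, this is not Lunr's API or its index format: it is a hand-rolled inverted index standing in for it, and the build_index and search helpers as well as the page URLs are hypothetical.

```python
import json
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the sorted list of page URLs that contain it."""
    index = defaultdict(set)
    for page in pages:
        content = (page["title"] + " " + page["text"]).lower()
        for word in re.findall(r"[a-z0-9]+", content):
            index[word].add(page["url"])
    return {word: sorted(urls) for word, urls in index.items()}


def search(index, query):
    """Return the URLs of pages that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], []))
    for word in words[1:]:
        hits &= set(index.get(word, []))
    return sorted(hits)


pages = [
    {"url": "/cat-rescue/", "title": "Cat rescues human", "text": "all about cat rescuing human"},
    {"url": "/cookies/", "title": "Baking cookies for dummies", "text": "my cat just jumped on a cookie tray."},
]

# At page generation time: build the index once and serialize it to JSON.
serialized = json.dumps(build_index(pages))

# At search time: load the ready-made index instead of rebuilding it.
loaded = json.loads(serialized)
print(search(loaded, "cat"))
print(search(loaded, "cat cookie"))
```

The expensive part (reading and processing every page) happens once at build time, so the search side only has to parse a JSON file and do dictionary lookups.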

What can this search do? Let’s see:

Quite nice for a ready-to-install package with just a little tooling.

This is the last article in this series.
