How is it possible to have full-text search on a static site? Read on.
What is searching all about?
Search is the technique of finding the documents that best match given search criteria.
Sounds simple, right? Unfortunately, it’s anything but.
Let’s take an uncomplicated example: I want to find every document that contains the word cat or cats.
What that usually means is that I want results that:
- contain the words cat or cats
- do not include caterpillar or catastrophe, just cat
- are ordered by how frequently those terms occur, with the most frequent at the top
- rank cat in the title higher than cat in the text; for a ridiculous example:
- title: Cat rescues human, text: all about cat rescuing human
- title: Baking cookies for dummies, text: my cat just jumped on a cookie tray
That small requirement - wanting to find what you are actually searching for - means advanced tools have to be used to return relevant results: search indexers and engines.
These tools work by first sanitizing the documents and preparing an index to search against; thanks to that, they return relevant results very quickly.
The following processes / tools are used:
Stemming is the process of reducing inflected or derived words to their base or stem form:
- templating becomes templat
- template becomes templat
- templates becomes templat
When the same stemmer is used on the documents and on the search query, it enables searching for inflected and derived words.
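To illustrate, here is a toy suffix-stripping stemmer. It is only a sketch: real engines use a full algorithm like the Porter stemmer, and the suffix list here is made up, but it reproduces the templating/template/templates example above.

```javascript
// Toy suffix-stripping stemmer (a sketch, NOT the real Porter algorithm).
// Strips a few common suffixes, keeping a minimum stem length.
function stem(word) {
  for (const suffix of ["ing", "es", "e", "s"]) {
    if (word.length > suffix.length + 2 && word.endsWith(suffix)) {
      return word.slice(0, -suffix.length);
    }
  }
  return word;
}

console.log(stem("templating")); // "templat"
console.log(stem("templates"));  // "templat"
console.log(stem("template"));   // "templat"
```

Because all three inflections collapse to the same stem, a query for any one of them matches documents containing the others.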
No search engine operates on the raw text; it first processes each document and extracts the individual words from it, using a component called a tokenizer. The simplest example is a whitespace tokenizer, which just splits the text on spaces.
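A whitespace tokenizer can be sketched in a couple of lines (lowercasing is added here so that matching is case-insensitive):

```javascript
// Minimal whitespace tokenizer: lowercase the text, split on runs of
// whitespace, and drop any empty strings left over.
function tokenize(text) {
  return text.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
}

console.log(tokenize("My cat just\njumped"));
// → ["my", "cat", "just", "jumped"]
```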
Text processors and filters
There are many of those; the most commonly used are:
- blacklisted words filter
- stop words filter - stop words like a, an, and the are not relevant for search results and can be safely removed
- special characters remover / sanitizer
- short words filter - usually words that are short (let’s say 1 or 2 characters) are not relevant and can be removed.
- synonyms filter, abbreviations filter
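These filters are typically chained into a pipeline that each token passes through. Below is a minimal sketch combining several of the filters above; the stop-word list, synonym map, and length cutoff are made-up examples, not taken from any real engine.

```javascript
// A sketch of a token-filter pipeline. The word lists and the 2-character
// cutoff are illustrative assumptions only.
const STOP_WORDS = new Set(["a", "an", "the"]);
const SYNONYMS = new Map([["kitty", "cat"]]);

function filterTokens(tokens) {
  return tokens
    .map((t) => t.replace(/[^a-z0-9]/g, "")) // special-character sanitizer
    .filter((t) => t.length > 2)             // short-words filter
    .filter((t) => !STOP_WORDS.has(t))       // stop-words filter
    .map((t) => SYNONYMS.get(t) ?? t);       // synonyms normalizer
}

console.log(filterTokens(["the", "kitty!", "sat", "on", "a", "mat"]));
// → ["cat", "sat", "mat"]
```

The order matters: sanitizing first means the stop-word and length checks see clean tokens.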
What engine to use
Obviously, writing these kinds of tools is time-consuming and unnecessary: excellent libraries are ready and free to use. A few good examples are:
- Apache Solr - a Lucene-based indexer, open source and one of the best
- Whoosh - a Python-based indexer, much simpler to use than Solr but much worse at complex tasks
I am using lunr, but since I am using it in a browser environment I am not creating the index on the fly - that would take ages - so the index is built ahead of time.
All that was left was to pass the query to lunr and display the results.
What can this search do? Let’s see:
- Search words by stem
- Match words by wildcard at the front, the back, or even both
- Search just the title or the full text
- Boost terms in the title over those in the text
- Fuzzy-search when unsure of a word’s exact spelling
- Search with required or excluded terms.
Quite nice for a ready-to-install package with just a bit of extra tooling.
This is the last article in this series.