The ultimate goal of a search engine is to provide fast, reliable, easy-to-use, scalable search features to an application.
But before diving into complex technical considerations, we should ask ourselves why we should bother with a search engine at all.
1 – Why should my team bother with a new complex component?
Adding a new component requires new dev/integration/ops skills. The learning curve might be steep. The configuration in “testing mode” can be a real nightmare to set up. So why introduce such risk and complexity into a project?
Sooner or later, anyone familiar with databases has added some “contains” semantics to a search feature. You end up writing queries like this:
select * from table0 t0
left join table1 t1 on t0.fk = t1.id
left join table2 t2 on t1.fk = t2.id
left join table3 t3 on t2.fk = t3.id
where (t0.title like '%term%' or t1.description like '%term%')
and t3.created > '2011-01-12'
and t3.status not in ('archived', 'suspended', 'canceled')
The above query will perform slower and slower as your amount of data grows, because the time-consuming parts of the query (the leading-wildcard matches) cannot use the optimized path: indices. Moreover, as the requirements evolve, building such queries becomes more of a nightmare than a pleasure.
As a rule of thumb, whenever the time spent waiting for the results of a complex search is no longer acceptable, you are left with two choices:
– optimize your query: make sure it uses the most optimized access path,
– use a search engine: it is highly optimized for reading and searching because it indexes almost everything (not true for a database, which emphasizes relations and structure).
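To make the difference concrete, here is a toy sketch in Python (illustrative only, nothing like Lucene's actual implementation): a leading-wildcard scan must visit every row, while an inverted index maps each token straight to the documents that contain it.

```python
# Toy comparison: linear "like '%term%'" scan vs an inverted index lookup.
docs = {
    1: "search engines index documents",
    2: "databases emphasize relations and structure",
    3: "an index maps tokens to documents",
}

# The database way with a leading wildcard: scan every row.
def like_scan(term):
    return [doc_id for doc_id, text in docs.items() if term in text]

# The search-engine way: build an inverted index once...
inverted = {}
for doc_id, text in docs.items():
    for token in text.split():
        inverted.setdefault(token, set()).add(doc_id)

# ...then each search is a direct dictionary lookup.
def index_search(term):
    return sorted(inverted.get(term, set()))

print(like_scan("index"))     # visits all 3 documents
print(index_search("index"))  # jumps straight to the postings list
```

The scan costs a full pass over the data on every query; the index pays that cost once, at indexing time.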
2 – How does it work?
The principle of a full-text search engine is based on indexing documents: first index the documents, then search within them.
A document is a succession of words stored in sections/paragraphs. A database analogy could be: a table for a document, a field for a section. Words are called tokens.
Indexing is the process of analysing a document and storing the result of that analysis for later retrieval.
Analysing is the process of extracting tokens from a field, counting their occurrences (which are valuable for relevance), and associating them with a unique path in the document.
Not all tokens are relevant to search; some are so common that they are ignored. Indexers use analyzers that can skip such tokens (often called stop words).
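As a rough sketch of what an analyzer does (real analyzers also handle stemming, punctuation, language-specific rules, and so on), the idea fits in a few lines of Python; the stop-word list below is a minimal, made-up one:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "to", "and"}  # made-up minimal list

def analyse(text):
    """Lowercase, split into tokens, drop stop words, count occurrences."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return Counter(tokens)

print(analyse("The indexing of a document and the analysis of a document"))
```

The occurrence counts are exactly the kind of per-field statistics a real engine stores to rank results by relevance.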
Not all fields are analyzed. For instance, a unique reference like an ISBN should not be analyzed.
All these settings can be configured in a mapping.
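With Elasticsearch, for instance, such a mapping is a JSON document. The type and field names below are invented for illustration; the syntax matches the string-mapping style of the Elasticsearch versions of that era, where a field could be declared not_analyzed so its value is indexed verbatim:

```json
{
  "book": {
    "properties": {
      "title":       { "type": "string" },
      "description": { "type": "string" },
      "isbn":        { "type": "string", "index": "not_analyzed" }
    }
  }
}
```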
You can search within a single type of document or across all types of documents. The latter use case, though less intuitive, can be a great time saver when it comes to building cross-cutting information like statistics.
Keep in mind: first write a document definition, set up your engine with that definition, index documents (tokenize, store), then search.
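That whole workflow can be sketched end to end with a toy engine (illustrative only; the mini engine, field names and ISBNs are invented for the example):

```python
# End-to-end toy of the workflow: definition -> setup -> index -> search.
class ToyEngine:
    def __init__(self, mapping):
        self.mapping = mapping  # the "document definition": which fields get tokenized
        self.index = {}         # token -> set of document ids

    def add(self, doc_id, doc):
        for field, value in doc.items():
            if self.mapping.get(field) == "analyzed":
                tokens = value.lower().split()
            else:                # e.g. an ISBN: store the raw value verbatim
                tokens = [value]
            for token in tokens:
                self.index.setdefault(token, set()).add(doc_id)

    def search(self, term):
        return sorted(self.index.get(term, set()))

# 1. write a document definition, 2. set up the engine with it
engine = ToyEngine({"title": "analyzed", "isbn": "not_analyzed"})
# 3. index documents
engine.add(1, {"title": "Taming Search Engines", "isbn": "978-0-00-000000-1"})
engine.add(2, {"title": "Search Made Simple",   "isbn": "978-0-00-000000-2"})
# 4. search
print(engine.search("search"))             # token match in an analyzed field
print(engine.search("978-0-00-000000-2"))  # exact match on the raw ISBN
```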
3 – Which tool does the job?
So far so good: I understand the concepts, but I don't know which tool does the job.
Before choosing a tool, and to avoid getting lost in a world we are not familiar with, let's write down the requirements:
– the tool should integrate seamlessly with either Java or HTTP (because HTTP is a great interface);
– the tool should be easy to install: a Debian package would be awesome;
– the tool should be easy to configure: declarative settings would be much appreciated;
– the tool should provide comprehensive documentation that allows one to get familiar with the concepts first, then the practice;
– the tool should provide a comprehensive integration/acceptance test suite that can serve as a learning tool;
– the obvious ones: fast at runtime, with the lowest possible memory footprint.
While Python has Whoosh and PHP has Zend Lucene, Java offers Solr, Elasticsearch and Hibernate Search.
The three Java tools all rely on Lucene, are written in Java, and two of them (Elasticsearch and Solr) offer an HTTP interface for indexing and searching. Lucene is a very advanced and mature project, and the amount of work around it is huge. But Lucene mainly focuses on the very technical details of parsing and analysing text: it provides a fast searcher, a reliable indexer, and low-level features like custom analysers and synonyms, i.e. all the plumbing/noise that keeps one from focusing on the business search requirements. The other projects take advantage of that core and offer higher-level features around it, such as remoting (REST/HTTP), declarative configuration and scaling (clusters, etc.).
I went for Elasticsearch because it offers in-memory nodes, which are valuable when testing in embedded mode.
In addition, REST is the preferred way to instrument Elasticsearch. I really like that idea because I believe HTTP is a hell of an interface. Moreover, the REST API is really simple to work with.
I cannot do a full comparison; I can only explain why I was attracted to Elasticsearch.
I think we're good on the concepts. This article is the first in a series of four: “Add search features to your application, try Elasticsearch”.
This is the full program:
part 1: the concepts
part 2: start small
part 3: attaching indexing to events
part 4: search (define a query grammar, parse the query, build the Elasticsearch query, search, build the response)