Add search features to your application, try Elasticsearch part 2 : start small

Hi,

This article is part of a series which aims to describe how one could integrate Elasticsearch. The previous post discussed the concepts: why use a search engine?

Not everyone learns the same way. I usually need to understand the theory, and I need to start small.
So I usually start with books. If there is no book, I look for a blog that explains the philosophy. Finally, I look for pieces of code (official documentation, blogs, tutorials, GitHub, etc.). Starting small makes me confident and allows me to increase complexity gradually.

Depending on what you want to know about Elasticsearch, you should read different sections of the guide :

SETUP section : describes how to install and run Elasticsearch (run as a service).
API : this is the REST API, which seems more complete than the others. Describes how to inter-operate with nodes : search, index, check cluster status.
Query DSL : the query API is quite rich. You get explanations about the syntax and semantics of queries and filters.
Mapping : the mapping configures Elasticsearch for indexing/searching on a particular type of document. Mapping is an important part which deserves special care.
Modules : presents the technical architecture with low-level services like discovery or http.
Index modules : low-level configuration of indices, like sharding and logging.
River : the river concept is the ability to feed your index from another datasource (pull data every X ms).
Java and Groovy API : if your software already runs in a JVM, you can benefit from that and control Elasticsearch via this API.

To avoid getting lost in the documentation, let’s focus on simple goals. We’ll implement them progressively:
– create node/client in test environment
– create node/client in non test environment
– integrate with Spring
– create/delete/check the existence of an index on a node
– wait until the cluster status is OK before operating on it
– create/update/delete/find data on an index
– create a mapping

1 – Admin operations

Operations on indices are admin operations. You can find them in the API section under Indices.

* Create node/client in test environment

A node is a process (a member) belonging to a cluster (a group). A builder is responsible for joining, detaching and configuring the node. When you create a node, sensible defaults are already configured. I didn't dive into the discovery process and I won't. Creating a node will automatically create its encapsulating cluster. Creating a node is as simple as :

import org.elasticsearch.common.network.NetworkUtils;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.node.Node;
import org.elasticsearch.node.NodeBuilder;

// settings
private Settings defaultSettings = ImmutableSettings.settingsBuilder().put("cluster.name", "test-cluster-" + NetworkUtils.getLocalAddress().getHostName()).build();
// create node
final Node node = NodeBuilder.nodeBuilder().local(true).data(true).settings(defaultSettings).build();
// start node
node.start();

The above code will create a node instance. The node doesn't use the transport layer (TCP) : no RMI, no HTTP, no network services. Everything happens inside the JVM.

To operate on a node you must acquire a client from it. Every single operation depends on it :

Client client = node.client();

An invaluable resource on how to set up nodes and clients in a test environment is the AbstractNodesTests class.

* Create node in non test environment

In a non-test environment, just install Elasticsearch as described in the SETUP section of the documentation. This installation uses the transport layer (TCP).

There isn't an official Debian package yet, but Nicolas Huray and Damien Hardy contributed one to the project, and it will be integrated into the 0.19 branch. This branch moves from a Gradle build system to Maven. It will use the jdeb-maven-plugin to build the Debian package, which will then be available for download on the Elasticsearch site.

Once installed you should have an Elasticsearch instance up and running, with a discovery (multicast) service listening on port 54328, an HTTP service listening on port 9200 and an inter-node communication service on port 9300.
The default cluster name is “elasticsearch”, but we do not use it, to make sure tests run in isolation.
For more on nodes and clusters configuration feel free to read this page in the official documentation.
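To talk to such an instance from Java without embedding a node, you can use a TransportClient. Below, a minimal sketch, assuming a default installation reachable on localhost (0.19-era API) :

import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

// connect through the transport layer (inter-node port 9300 by default)
final Settings settings = ImmutableSettings.settingsBuilder()
        .put("cluster.name", "elasticsearch").build();
final Client client = new TransportClient(settings)
        .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));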

* Integrate with Spring

You can integrate with Spring by creating a FactoryBean which is responsible for the Node/Client construction. Don't forget to destroy them, as they really are memory-consuming (beware of PermGen space …).
This post, though a bit complex for my needs, was helpful.
If you're interested in that specific part you can take a look at LocalNodeClientFactoryBean.
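For illustration, a hypothetical sketch of such a factory bean (assuming Spring 3's generic FactoryBean; the class name and details are illustrative, not the actual LocalNodeClientFactoryBean) :

import org.elasticsearch.client.Client;
import org.elasticsearch.node.Node;
import org.elasticsearch.node.NodeBuilder;
import org.springframework.beans.factory.DisposableBean;
import org.springframework.beans.factory.FactoryBean;

public class NodeClientFactoryBean implements FactoryBean<Client>, DisposableBean {

	private Node node;

	@Override
	public Client getObject() {
		if (this.node == null) {
			// build and start a local node
			this.node = NodeBuilder.nodeBuilder().local(true).data(true).node();
		}
		return this.node.client();
	}

	@Override
	public Class<Client> getObjectType() {
		return Client.class;
	}

	@Override
	public boolean isSingleton() {
		return true;
	}

	// release the node's resources when the Spring context shuts down
	@Override
	public void destroy() {
		if (this.node != null) this.node.close();
	}
}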

* Create an index on a node

Once your node is up you can create indices. The main property of an index is its name, which acts like an id : a node can't have 2 indices with the same name. The index name can't contain special chars like ‘.’, ‘/’, etc. Keep it simple.

Client client = node.client();

Then create an index with the “adverts” id, intended to store adverts :

client.admin().indices().prepareCreate("adverts")
      .execute().actionGet();

Depending on your organization you can choose to create one index per application, one index per stored type, or whatever layout suits you. You just have to maintain the index names.

* Remove an index from a node

As soon as you have the name, it is straightforward. You can test for existence before removing :

if (client.admin().indices().prepareExists("adverts")
     .execute().actionGet().exists()) {
        client.admin().indices().prepareDelete("adverts")
            .execute().actionGet();
}
* Wait for cluster health

After creating an index, wait until the cluster reaches at least yellow status before operating on it :
        client.admin().cluster()
                .prepareHealth("adverts").setWaitForYellowStatus()
                .execute().actionGet();

2 – Data operations

* Create / Update

Indexing a document looks like this :
        client.prepareIndex("adverts", "advert", "1286743")//
                .setRefresh(true) //
                .setSource(advertToJsonByteArrayConverter.convert(advert)) //
                .execute().actionGet();

The above code will index the data (the source) whose type is “advert” under the “adverts” index. It will also commit (refresh) the index modifications immediately.
The source can take many forms, ranging from a fieldName/value map to a byte array. The byte array is the preferred way, so I created converters between byte[] and Advert.
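For illustration, the fieldName/value map variant could look like this (a minimal sketch; the id and field values are made up) :

import java.util.HashMap;
import java.util.Map;

final Map<String, Object> source = new HashMap<String, Object>();
source.put("name", "Mountain bike");
source.put("description", "Mountain bike in excellent condition");
client.prepareIndex("adverts", "advert", "1286744")
        .setSource(source)
        .setRefresh(true)
        .execute().actionGet();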

AdvertToJsonByteArrayConverter (relies on Spring's Converter interface) :

...
	/**
	 * @see org.springframework.core.convert.converter.Converter#convert(java.lang.Object)
	 */
	@Override
	public byte[] convert(final Advert source) {

		if (source == null) return null;

		// ignore empty beans instead of failing (Jackson 1.x)
		this.jsonMapper.configure(SerializationConfig.Feature.FAIL_ON_EMPTY_BEANS, false);

		try {
			final String string = this.jsonMapper.writeValueAsString(source);
			return string.getBytes("UTF-8");
		} catch (final Exception e) {
			throw new IllegalArgumentException(e);
		}
	}
...
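For completeness, the reverse converter could look like this (a sketch assuming Jackson 1.x; the real converter in the project may differ) :

import org.codehaus.jackson.map.ObjectMapper;
import org.springframework.core.convert.converter.Converter;

public class JsonByteArrayToAdvertConverter implements Converter<byte[], Advert> {

	private final ObjectMapper jsonMapper = new ObjectMapper();

	@Override
	public Advert convert(final byte[] source) {

		if (source == null) return null;

		try {
			// deserialize the JSON bytes back into an Advert
			return this.jsonMapper.readValue(source, 0, source.length, Advert.class);
		} catch (final Exception e) {
			throw new IllegalArgumentException(e);
		}
	}
}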

Updating means re-indexing, so it's the exact same operation.

* Delete data

When we're done with an object and don't want it to appear in search results any more, we can delete it from the index :

        client.prepareDelete("adverts", "advert", "4586321")
                .setRefresh(true).execute().actionGet();

That deletes the document, then refreshes the index immediately after.

* Find data

The search API is very rich, so you have to understand search semantics. If you're familiar with Lucene then everything will seem obvious to you. If you're not, you'll have to get familiar with the basics.
There are 2 main types of search : exact match and full text.

Exact match operates on a field's value as a whole. The value is considered a single term (even if it contains spaces). It is not analyzed, so querying “field=condition” will return nothing if the field equals “excellent condition”. Exact match suits certain fields very well (id, reference, date, status, etc.) but not all. Exact-match fields can be sorted.

Full text operates on tokens. The analyzer removes stop words, splits the field into tokens and groups them. The most relevant result is the one that contains the highest number of term occurrences (roughly).
You obviously can't apply a lexical sort on those fields. They are sorted by score.

Below, an exact match example (will match adverts with provided id):

    private SearchResponse findById(final Long id) {
        return client.prepareSearch("adverts").setTypes("advert")
                .setQuery(QueryBuilders.boolQuery()
                .must(QueryBuilders.termQuery("_id", id))).execute().actionGet();
    }

Below, a full-text search on a single field (will match adverts whose “description” field contains the term “condition” at least once) :

        client.prepareSearch("adverts").setTypes("advert")
                .setQuery(QueryBuilders.boolQuery()
                        .must(QueryBuilders.queryString("condition")
                                .defaultField("description")))
                .execute().actionGet();
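To exploit the results, you iterate over the hits and convert the sources back. A sketch, reusing the hypothetical reverse converter above :

        final SearchResponse response = client.prepareSearch("adverts").setTypes("advert")
                .setQuery(QueryBuilders.queryString("condition").defaultField("description"))
                .execute().actionGet();
        for (final SearchHit hit : response.getHits()) {
            // hit.source() returns the raw JSON bytes that were indexed
            final Advert advert = jsonByteArrayToAdvertConverter.convert(hit.source());
        }
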
* Create a mapping

The searchable nature of a field is an important design decision and can be configured in the mapping. The mapping defines, for an indexed type, the indexed fields, and for each field some interesting properties like its analyzed nature (analyzed|not_analyzed), its type (long, string, date), etc. Elasticsearch provides a default mapping : string fields are analyzed, other ones are not.
I really recommend you spend some time on that section. You don't necessarily have to design the perfect mapping the first time (it requires some experience), but the decisions taken in that part will impact the search results.

Below, an example of mapping :

{
    "advert" : {
        "properties" : {
            "id" : {
                "type" : "long",
                "index" : "not_analyzed"
            },
            "name" : {
                "type" : "string"
            },
            "description" : {
                "type" : "string"
            },
            "email" : {
                "type" : "string",
                "index" : "not_analyzed"
            },
            "phoneNumber" : {
                "type" : "string",
                "index" : "not_analyzed"
            },
            "reference" : {
                "type" : "string",
                "index" : "not_analyzed"
            },
            "address" : {
                "dynamic" : "true",
                "properties" : {
                    "streetAddress" : {
                        "type" : "string"
                    },
                    "postalCode" : {
                        "type" : "string"
                    },
                    "city" : {
                        "type" : "string"
                    },
                    "countryCode" : {
                        "type" : "string"
                    }
                }
            }
        }
    }
}
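To apply it, you can use the put mapping admin API. A minimal sketch, assuming the JSON above is available in a String named advertMapping :

client.admin().indices().preparePutMapping("adverts")
        .setType("advert")
        .setSource(advertMapping)
        .execute().actionGet();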

I gathered those CRUD operations in 2 integration tests : ElasticSearchDataOperationsTestIT and ElasticSearchAdminOperationsTestIT.

Now that we're familiar with Elasticsearch's basic operations, we can move on and consider improving the code.
You'll agree that handling the indexing task manually is an option, but not the most elegant or reliable one.
In the next post we'll discuss the different solutions to automatically index our data.
