Spice up your application: add elasticsearch geo features

Lately I’ve been busy working on elasticsearch features for my company.
In the process I came across the shiny “geo search” feature. While I’m not that sensitive to shiny new technologies (don’t get me wrong, I don’t like dusty ones either), I still wanted to test elasticsearch’s geo capabilities with a view to adopting them.
The reference documentation on the subject is a post by Shay Banon on the elasticsearch website. Geo search is made possible by indexing coordinates that conform to the “geo_point” structure, then sorting by “_geo_distance” or filtering by “geo_distance”.
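To make those building blocks concrete, here is a minimal sketch in Java of the JSON involved: a mapping that declares a geo_point field, a search body sorted by “_geo_distance”, and a “geo_distance” filter fragment. The class GeoJsonBodies and its method names are made up for illustration, the “station” type and “location” field are the names used later in this post, and the syntax assumed is the one from the elasticsearch versions of that era, so check it against the version you run.

```java
/** Hypothetical helper: builds the JSON bodies discussed above as plain strings. */
final class GeoJsonBodies {

    /** Mapping that declares a "location" field of type geo_point on a "station" type. */
    static String stationMapping() {
        return "{\"station\":{\"properties\":{\"location\":{\"type\":\"geo_point\"}}}}";
    }

    /** Search body that sorts all documents by distance (in km) from the given point. */
    static String sortByDistance(double lat, double lon) {
        return String.format(
            "{\"query\":{\"match_all\":{}},"
          + "\"sort\":[{\"_geo_distance\":{\"location\":{\"lat\":%s,\"lon\":%s},"
          + "\"order\":\"asc\",\"unit\":\"km\"}}]}",
            lat, lon);
    }

    /** Filter fragment that keeps only documents within the given radius of the point. */
    static String withinDistance(double lat, double lon, int km) {
        return String.format(
            "{\"geo_distance\":{\"distance\":\"%dkm\",\"location\":{\"lat\":%s,\"lon\":%s}}}",
            km, lat, lon);
    }

    public static void main(String[] args) {
        // 48.8584, 2.2945 is a point close to the Eiffel Tower
        System.out.println(stationMapping());
        System.out.println(sortByDistance(48.8584, 2.2945));
        System.out.println(withinDistance(48.8584, 2.2945, 2));
    }
}
```

The sort body only orders the hits, while the filter fragment discards everything outside the radius; the two can be combined in one request.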
The other very useful resource was a post on Gauthier Lemoine’s blog. The author’s app finds the stations nearest to the Eiffel Tower.
The example is clear and includes feeding the index from the RATP’s open data, then a search based on the Eiffel Tower coordinates. It is written in Python.

I added a geocoding capability that lets a user provide any location, which is a little more convenient. I also used maven’s exec plugin to set up and feed the index from the file available on data.ratp.fr, which contains the complete list of Paris stations: trains, subways and buses.
The maven build drops and recreates the index, generates the stations data and bulk-inserts it, starts the webapp, then searches against a provided location like “35 avenue daumesnil, 75012 paris”.

Let’s take a look at the relevant parts of the solution.

1- Drop/create index

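For the drop/create/put-mapping sequence I wrote a shell script and invoked it from maven exec. The ${project.build.outputDirectory} placeholders are meant to be resolved by maven resource filtering before the script runs.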

#! /bin/bash
# drop index
curl -XDELETE 'http://localhost:9200/stations'
# create index with settings
SETTINGS_LOCATION=${project.build.outputDirectory}/elasticsearch/stations/_settings.json
curl -XPOST 'http://localhost:9200/stations' -d@$SETTINGS_LOCATION
# put mapping
MAPPING_LOCATION=${project.build.outputDirectory}/elasticsearch/stations/station.json
curl -XPUT 'http://localhost:9200/stations/station/_mapping' -d@$MAPPING_LOCATION
...
<execution>
    <id>create-index</id>
    <configuration>
        <executable>${project.build.outputDirectory}/create-index.sh</executable>
    </configuration>
    <phase>pre-integration-test</phase>
    <goals>
        <goal>exec</goal>
    </goals>
</execution>
...

2- Generate stations then bulk-index

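Next, generate a file of bulk inserts (the generator is invoked from maven, right after the previous execution). Be aware that in the sample below the value stored under “lat” is actually the longitude of the Abbesses station and the one under “lon” its latitude; the sort clause in step 4 passes its point in the same swapped order, so indexing and searching stay consistent with each other. Each station takes two lines that follow this syntax: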

{ "index" : { "_index" : "stations", "_type" : "station", "_id" : "1975" } }
{"id": 1975, "name": "Abbesses", "township": "PARIS-18EME", "type": "metro", "location": {"lat": "2.33871281165883", "lon": "48.8844176451841"}}

The script is in groovy because I find it well suited to scripting tasks:

    def run() {
        // the bulk file lands in target/classes, where the bulk-index script picks it up
        new File("target/classes/insert-stations").newOutputStream().withWriter("UTF-8") { writer ->
            final File ratpStationsFile = new File(getClass().getResource("ratp-stations.csv").getFile())
            // one station per line, fields separated by '#'
            ratpStationsFile.splitEachLine("#") { fields ->
                def id = fields[0]
                def lat = fields[1]
                def lng = fields[2]
                // explicit charset: the one-argument URLEncoder.encode is deprecated
                def name = URLEncoder.encode(fields[3], "UTF-8")
                def township = URLEncoder.encode(fields[4], "UTF-8")
                def type = fields[5]
                // action line first, then the source line, as the bulk API expects
                def metadata = "{ \"index\" : { \"_index\" : \"stations\", \"_type\" : \"station\", \"_id\" : \"$id\" } }\n"
                writer.write metadata
                def content = "{\"id\": $id, \"name\": \"$name\", \"township\": \"$township\", \"type\": \"$type\", \"location\": {\"lat\": \"$lat\", \"lon\": \"$lng\"}}\n"
                writer.write content
            }
        }
    }
    static main(args) {
        new GenerateStations().run()
    }
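The script above URL-encodes the free-text fields before writing them into the JSON. An alternative worth considering is to escape them as JSON strings, so that accents and apostrophes are stored as they are. Here is a minimal sketch in Java under that assumption; the class BulkLines and its methods are hypothetical names, and the coordinates are passed through untouched, in the same positions as in the script.

```java
/** Hypothetical helper: writes one bulk entry (action line + source line) with JSON escaping. */
final class BulkLines {

    /** Escapes the characters that may not appear raw inside a JSON string. */
    static String escapeJson(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    // remaining control characters become a 4-digit hexadecimal escape
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Returns the two newline-terminated lines that index one station. */
    static String entry(String id, String name, String township, String type, String lat, String lon) {
        String action = "{ \"index\" : { \"_index\" : \"stations\", \"_type\" : \"station\", \"_id\" : \""
                + escapeJson(id) + "\" } }\n";
        String source = "{\"id\": " + id
                + ", \"name\": \"" + escapeJson(name)
                + "\", \"township\": \"" + escapeJson(township)
                + "\", \"type\": \"" + escapeJson(type)
                + "\", \"location\": {\"lat\": \"" + escapeJson(lat)
                + "\", \"lon\": \"" + escapeJson(lon) + "\"}}\n";
        return action + source;
    }

    public static void main(String[] args) {
        // a name containing a double quote stays valid JSON once escaped
        System.out.print(entry("1", "Station \"test\"", "PARIS", "metro", "0", "0"));
    }
}
```

With this approach the names come back from a search exactly as they appear in the source file, with no decoding step on the reading side.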

Then bulk-insert the stations (with another, similar maven invocation):

#! /bin/bash
# bulk index
DATA_LOCATION=${project.build.outputDirectory}/insert-stations
curl -s -XPOST 'http://localhost:9200/stations/_bulk' --data-binary @$DATA_LOCATION;

Note that my first try generated one curl call per station. It worked but performed poorly. For large data sets I’d advise you to favor the bulk API.
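The script above sends the whole file in one bulk request. For much larger files it can help to cut the bulk file into several requests. Here is a minimal sketch in Java, assuming the file alternates action and source lines as shown earlier; BulkBatcher and docsPerBatch are hypothetical names, and posting each body to the _bulk endpoint is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: cuts a bulk file into smaller request bodies. */
final class BulkBatcher {

    /**
     * Splits bulk lines (one action line plus one source line per document)
     * into bodies holding at most docsPerBatch documents each.
     */
    static List<String> batches(List<String> lines, int docsPerBatch) {
        if (docsPerBatch < 1) {
            throw new IllegalArgumentException("docsPerBatch must be at least 1");
        }
        if (lines.size() % 2 != 0) {
            throw new IllegalArgumentException("expected an even number of lines");
        }
        List<String> bodies = new ArrayList<>();
        int linesPerBatch = docsPerBatch * 2;
        for (int start = 0; start < lines.size(); start += linesPerBatch) {
            int end = Math.min(start + linesPerBatch, lines.size());
            StringBuilder body = new StringBuilder();
            for (String line : lines.subList(start, end)) {
                // every line of a bulk body, the last one included, ends with a newline
                body.append(line).append('\n');
            }
            bodies.add(body.toString());
        }
        return bodies;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("action-1", "source-1", "action-2", "source-2", "action-3", "source-3");
        System.out.println(batches(lines, 2).size() + " bodies");
    }
}
```

Cutting on document boundaries matters: an action line separated from its source line would make the request invalid.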

3- Write a little scenario

Scenario: search stations by location, ordered by distance
When I search the closest stations to "10 rue La Fayette 75009, Paris"
Then I should get the following stations:
| id   | name                          | type  |
| 1957 | Chaussée d'Antin (La Fayette) | metro |
| 1638 | Trinité-d'Estienne d'Orves    | metro |
| 1771 | Opéra                         | metro |
| 1744 | Quatre Septembre              | metro |
| 1990 | Auber                         | rer   |
| 1665 | Richelieu-Drouot              | metro |
| 1767 | Notre-Dame de Lorette         | metro |
| 1795 | Le Peletier                   | metro |
| 1852 | Havre-Caumartin               | metro |
| 1686 | Saint-Georges                 | metro |

4- Build the query and get your results

First include the geocode api in your maven build:

<dependency>
    <groupId>com.google.code.geocoder-java</groupId>
    <artifactId>geocoder-java</artifactId>
    <version>${geocoder-java.version}</version>
    <exclusions>
        <exclusion>
            <groupId>commons-logging</groupId>
            <artifactId>commons-logging</artifactId>
        </exclusion>
    </exclusions>
</dependency>

Then geocode the location to get its coordinates. I chose to throw an exception when the service returns no result or more than one.

    private GeocoderResult geocodeProvidedAddress(String address) {
        GeocoderRequest geocoderRequest = new GeocoderRequestBuilder().setAddress(address).setLanguage("en").getGeocoderRequest();
        GeocodeResponse geocoderResponse = googleGeocoder.geocode(geocoderRequest);
        List<GeocoderResult> results = geocoderResponse.getResults();
        if (CollectionUtils.isEmpty(results)) {
            String message = "The geocoding service found no match for address [{}]";
            LOGGER.error(message,  address);
            throw new RuntimeException("geocode.no.results");
        }
        int countResults = results.size();
        if (countResults > 1) {
            String message = "The geocoding service found {} matches for addresses [{}]";
            LOGGER.error(message, countResults, address);
            throw new RuntimeException("geocode.too.many.results");
        }
        return results.iterator().next();
    }

Build the query (exclude the bus stations for example):

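private QueryBuilder queryBuilder() {
    // keep every station except the bus ones
    BoolFilterBuilder filterBuilder = FilterBuilders.
            boolFilter().
            mustNot(FilterBuilders.termFilter("type", StationType.bus.toString()));
    // assumption: the filter is wrapped around a match_all query
    QueryBuilder filteredQueryBuilder = QueryBuilders.filteredQuery(QueryBuilders.matchAllQuery(), filterBuilder);
    LOGGER.info(filteredQueryBuilder.toString());
    return filteredQueryBuilder;
}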

Build the sort clause which includes the distance specification:

    private GeoDistanceSortBuilder sortBuilder(String address) {
        GeocoderResult result = geocodeProvidedAddress(address);
        LatLng location = result.getGeometry().getLocation();
        BigDecimal lat = location.getLat();
        BigDecimal lng = location.getLng();
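        // note: point() expects (lat, lon); the values go in as (lng, lat) to match the indexed
        // documents, which carry the longitude under "lat" and the latitude under "lon"
        GeoDistanceSortBuilder sortBuilder = SortBuilders
                .geoDistanceSort("location")
                .point(lng.doubleValue(), lat.doubleValue())
                .unit(DistanceUnit.KILOMETERS)
                .order(SortOrder.ASC);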
        return sortBuilder;
    }
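Elasticsearch computes the distances on the server, so none of this has to be written by hand. Still, to get a feel for what a distance sort ranks on, here is a minimal sketch of a great-circle distance in Java using the haversine formula. It is an illustration only, not necessarily the calculation elasticsearch performs; the class Haversine and the 6371 km mean Earth radius are choices made for this example.

```java
/** Hypothetical helper: great-circle distance between two points given in degrees. */
final class Haversine {

    /** Mean Earth radius in kilometers (an approximation, the Earth is not a perfect sphere). */
    static final double EARTH_RADIUS_KM = 6371.0;

    /** Distance in kilometers between (lat1, lon1) and (lat2, lon2), using the haversine formula. */
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    public static void main(String[] args) {
        // one degree of latitude along a meridian
        System.out.println(distanceKm(0.0, 0.0, 1.0, 0.0) + " km");
    }
}
```

One degree of latitude comes out at roughly 111 km with this radius, which is a handy sanity check when the arguments might be in the wrong order.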

You’re set: if you invoke the service you will get the first 10 stations, closest first.

Have fun with elasticsearch geo!
