Elasticsearch part 3: The implementation

In my last blog post on Elasticsearch I covered the majority of the theory and commands that I learned at the Elasticsearch for Developers course I completed at work. Now I want to have a play around with Elasticsearch and with Kibana, the tool whose Dev Tools console you use to interact with Elasticsearch's RESTful API.

I decided to install both of these systems onto my NMIT Windows laptop. What was a seamless installation process last time, on my other laptop, turned into a more complicated troubleshooting exercise this time.

After starting both Elasticsearch (accessible at http://localhost:9200) and Kibana (accessible at http://localhost:5601) I saw that Kibana was throwing an error (see below) because it required a default index pattern.

error message in kibana

What is a default index pattern? It is an index (which, remember, is like the equivalent of a relational database) or a collection of indexes that you want to interact with through Kibana. You can specify several at once using the wildcard symbol (*) in the index pattern input box.

So the first thing I had to do was create an index. I used the example index from the Elasticsearch documentation (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html#indices-create-index), which is called 'twitter'.

creating an index.PNG
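The create-index request looks something like this sketch, along the lines of the documentation's example (the shard and replica counts are assumptions):

PUT twitter
{
  "settings": {
    "index": {
      "number_of_shards": 3,
      "number_of_replicas": 2
    }
  }
}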

Then, after indexing a document (the equivalent of inserting a row into a table in a relational database), I set twitter* as the default index pattern, thereby removing the error I was getting.
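The indexing step itself was something like this sketch (document body assumed, loosely following the documentation's tweet example):

PUT twitter/tweet/1
{
  "user": "kimchy",
  "message": "trying out Elasticsearch"
}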

An important point for beginners with Elasticsearch is that when you interact with Elasticsearch through Kibana you will be writing in Sense syntax rather than curl, which is for use in a terminal. However, the Kibana Dev Tools area, which is where you write Sense syntax, is fantastic because it automatically converts curl commands into Sense syntax. For example, I copied and pasted the curl command

curl -XGET 'http://localhost:9200/_nodes'

And Kibana converted it to:

GET /_nodes

 

Insert data

Now for some fun with Elasticsearch…

I have created an index named cats.

created cat index.PNG

Then I created a mapping type (the equivalent of a relational database's table) automatically while indexing a document (creating a data record). How so? Well, I used the PUT API (in Elasticsearch jargon, an API is a command).

PUT cats/domestic/1

What this means is: use the cats index, create a mapping type named 'domestic', and create a document with the ID of 1 in this mapping type.

Note that the ID number is optional; if you leave it out (using POST rather than PUT), Elasticsearch auto-generates an ID for the document.

entered cat
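The request behind the screenshot was along these lines (a sketch; the field values are assumed for illustration):

PUT cats/domestic/1
{
  "name": "Tilly",
  "age": "3",
  "colour": "tabby"
}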

What is happening when I use the PUT API to index a document? Well, Kibana sends an index request to a node in the Elasticsearch cluster (a cluster being a collection of nodes, i.e. instances of Elasticsearch). The ID value (manually set or auto-generated) is hashed and used to find the matching shard on which to execute the index operation.

What is a shard? It is a conceptual object holding a collection of documents, which allows Elasticsearch to be distributed and extendable.

Once the matching primary shard has indexed the document, the operation is replicated to the replica shard (the backup shard).
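The routing rule is essentially the following (the routing value defaults to the document ID):

shard_number = hash(document_id) % number_of_primary_shards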

Note: As you can see above you do not need to specify data types when creating or modifying Elasticsearch indexes.

Retrieve data

Now to retrieve the document I just indexed I need to use the GET API:

GET cats/domestic/1
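If the document exists, the response comes back in this shape (a sketch, with the source fields assumed from the example above):

{
  "_index": "cats",
  "_type": "domestic",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "name": "Tilly",
    "age": "3",
    "colour": "tabby"
  }
}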

What's happening in the background when you send a GET request? The ID in the request is hashed, and when the request arrives at a node in the Elasticsearch cluster, the hashed ID is used to route the request to the shard that holds the matching document.

How to check if a document exists

To check if a document exists in the index you can use the HEAD API:

HEAD cats/domestic/1

This should return an HTTP status code of 200 if the document exists and 404 if it doesn't. Except when I ran it in Kibana I got a fatal error.

 

error.PNG

Now, it seems several other people have had issues running the Exists API in Kibana, as these forum posts show, none of which were satisfactorily answered:

https://unix.stackexchange.com/questions/253414/elasticsearch-error-on-head-command

https://github.com/elastic/elasticsearch-php/issues/391

However, this GitHub source (https://github.com/elastic/kibana/pull/10611) suggests that the syntax for the Exists API is deprecated and I need to write:

HEAD cats/_mapping/domestic

However, this produced the same error. I could not find any other useful suggestions online, so I will move on and ask Frederik, the trainer of the course, later.

Delete data

DELETE <index>/<mapping type>/<id>
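For example, deleting the cat document indexed earlier would look like:

DELETE cats/domestic/1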

delete.PNG

The background process when the DELETE API is run is as usual: the ID in the request is hashed and used to route the request to the primary shard that the document lives in; after the document is deleted there, the primary shard updates its replica shards.

Point of interest: Write consistency

Now, all documents are written on a primary shard, and this write can be (but doesn't have to be) replicated to several replica shards.

If you have set up replica shards when you created the index, then you need to make sure a certain number of these shards are available when writing to Elasticsearch.

You need to have:

(primary+replicas)/2 + 1 shards available to be written to
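For example, with 1 primary shard and 2 replicas, (1 + 2) / 2 + 1 = 2 of the 3 shard copies must be available for the write to go ahead.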

 

Update data

I indexed another document

PUT cats/domestic/1
{
  "name": "Kelly",
  "age" : "1",
  "colour": "black and white"
}

Then, to update this document so the age becomes 2, I wrote:

update.PNG
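The request in the screenshot boils down to re-sending the whole document with the new age, something like this sketch (the _update API with a partial "doc" body is an alternative):

PUT cats/domestic/1
{
  "name": "Kelly",
  "age" : "2",
  "colour": "black and white"
}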

As I understand it, the background process of this command is that all fields, including ones not being updated, are replaced; fields not being updated are simply rewritten with their existing values. This is again performed on the primary shard first, and then replicated to the replica shards if applicable.

 

Get multiple documents simultaneously

I created another index named ‘president’, with the mapping type ‘individual’ and id ‘1’ for a document on George Washington.

Then, to get the documents with id '1' in cats/domestic and president/individual, I performed a Multi Get API request:

multi get.PNG
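The Multi Get request looks roughly like this (a sketch of the _mget body):

GET /_mget
{
  "docs": [
    { "_index": "cats", "_type": "domestic", "_id": "1" },
    { "_index": "president", "_type": "individual", "_id": "1" }
  ]
}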

 

Perform multiple different APIs simultaneously

To perform multiple different commands in a single request in Kibana you can use the Bulk API. You can think of this as the equivalent of being able to perform SELECT, DELETE, UPDATE, and INSERT SQL queries against multiple tables in a single command.

When I first tried this command I wrote PUT _bulk in the HTTP header; this resulted in an error:

bulk error.PNG

After some troubleshooting I found this was being caused by the literal \n characters, which needed to be removed; then it worked, like so:

worked.PNG
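For reference, a working bulk request in the Console looks roughly like this (documents assumed); each action line and its source line must sit on its own line:

POST _bulk
{ "index": { "_index": "cats", "_type": "domestic", "_id": "2" } }
{ "name": "Tom", "age": "3", "colour": "grey" }
{ "delete": { "_index": "cats", "_type": "domestic", "_id": "2" } }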

 

Text analysis

Elasticsearch is very useful for searching text, because it can store the words from a text such as a book in the inverted index in much the same way a book index holds keywords for readers to find easily.

The way we split text up so it can be stored in the inverted index for searching is by using the Analyze API.

I started using this by specifying the HTTP header GET _analyze. I specified the tokenizer "keyword", which stores the supplied string as one keyword combination rather than splitting it, and the filter "lowercase", which lowercases my supplied text.

As you can see below ‘New South Wales’ has been transformed into ‘new south wales’

lowercase.PNG
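The request was along these lines (a sketch using the 5.x _analyze body syntax):

GET _analyze
{
  "tokenizer": "keyword",
  "filter": ["lowercase"],
  "text": "New South Wales"
}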

Often for long text (like sentences) it is best to split the words up so they can be searched individually. You can do this by specifying the tokenizer "whitespace". So, using the Shakespearean sentence “You shall find of the king a husband, madam; you, sir, a father:”, I used the whitespace tokenizer to split it up:

splits.PNG
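As a sketch, the request was:

GET _analyze
{
  "tokenizer": "whitespace",
  "text": "You shall find of the king a husband, madam; you, sir, a father:"
}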

If you want to learn more about what the analyzer is doing you can include the "explain": true attribute.

Now, the analyze commands I have performed to date use the default analyzer on supplied text, but what if I wanted all the data in a document I index to be analyzed, and thereby made searchable?

Well, you can configure an analyzer in an index when creating the index.

analyzer in.PNG
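A sketch of such an index-level analyzer configuration (the index and analyzer names here are assumptions):

PUT tvshows
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  }
}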

To make the job of the tokenizer easier you can implement character filters; for example, you can filter out HTML. This is very important for making the system more secure.

char filter.PNG
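For example, the built-in html_strip character filter can be tried directly in the Analyze API (sample text assumed):

GET _analyze
{
  "tokenizer": "standard",
  "char_filter": ["html_strip"],
  "text": "<p>I <b>love</b> cats</p>"
}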

It is interesting how the different analyzers work; the English one does not just split the words up, it also removes stop words (common words that add no value to a search query) and stems the remaining words. Below I wrote in the sentence from the course exercise, “It is unlikely that I’m especially good at analysis yet”, in which words like ‘unlikely’ are stored and indexed in stemmed form as ‘unlik’.

english.PNG
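The request, roughly:

GET _analyze
{
  "analyzer": "english",
  "text": "It is unlikely that I'm especially good at analysis yet"
}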

Whereas all words are stored and indexed in their original form when using the standard analyzer.

standard analyzer.PNG

 

Mappings

Instead of letting Elasticsearch decide the data type of fields in an index, you can specify it manually in the mappings. Like I said previously, the mapping type (the equivalent of a table in a relational database) is just a name; in my previous examples I had cats/domestic/1, which meant I had a mapping type named ‘domestic’. However, there are many attributes in an index that you can customise to make it match the business problem domain more closely.

Mappings are useful because they give Elasticsearch some idea of how the data is structured, even though indexes don't enforce a rigid schema the way relational tables do.

I created an index named programminglanguage, with a mapping type of ‘OO’. I set the data type of the “name” field (which is circled below) to a string.

field altering.PNG
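A sketch of the create-index request with that mapping (note: 5.x deprecates the older string type in favour of text/keyword, so I use text here):

PUT programminglanguage
{
  "mappings": {
    "OO": {
      "properties": {
        "name": { "type": "text" }
      }
    }
  }
}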

You can also update the mapping attributes in an index; however, you need to keep in mind that you cannot remove a mapping type field.

To retrieve your mapping values for an index simply write in GET <indexname>/_mappings

Like so:

retrieve mappings.PNG

Now you can create objects in Elasticsearch. For example, by default the comments in my below tvseries index will be stored as an inner object.

That means ‘comments’ is of data type ‘object’. (Note: for the true nested-document behaviour used by the nested queries discussed later, the field must be explicitly mapped with the ‘nested’ data type instead.)

nested objects.PNG
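The relevant part of the mapping looks something like this fragment (field names assumed from the screenshot):

"comments": {
  "properties": {
    "author": { "type": "text" },
    "body": { "type": "text" }
  }
}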

If I want to reference a field in the comments nested object I have to write: comments.<fieldname>

How do you set a field to be searchable?

You use the ‘index’ attribute in the mappings. You set it to ‘analyzed’ if you want it searchable and it goes through the analyzer.

You set it to not_analyzed if you want it searchable but don’t want it to go through the analyzer.

You set it to ‘no’ if you don’t want it searchable.
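As a sketch, on the older string type this looks like the line below (with the 5.x text/keyword types, the index option becomes a simple true/false instead):

"name": { "type": "string", "index": "not_analyzed" }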

 

Index templates

An index template is a good way to create an index quickly, without having to write it out manually every time. So once you have the mappings customised to your business problem domain you can apply them to multiple similar indexes using a template. I like to think of this like inheritance hierarchies in Object Oriented programming: you place all the common features in the superclass and all subclasses inherit them, thereby only having to write them once.

To create a template you use a PUT HTTP header:

PUT _template/tv_template

This is creating a template in the _template area named tv_template

template 1.PNG
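The template body looked roughly like this (the tv* pattern and the mapping details are assumptions):

PUT _template/tv_template
{
  "template": "tv*",
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "tvseries": {
      "properties": {
        "name": { "type": "text" }
      }
    }
  }
}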

As with indices, you can delete, retrieve, and retrieve all templates using similar commands. As I have not covered how to retrieve all of them, I will do so now; it is very simple:

GET /_template

Searching

Elasticsearch can perform two kinds of searches on the searchable values (setting values to searchable is described further above):

  • Structured query (checking for an exact, boolean match on the keywords the user entered. Equivalent to a SELECT SQL query with a WHERE clause. Either there is a match for the WHERE clause or there isn’t)
  • Unstructured query (not just looking for exact matches but ranking the query output, so this is a continuum rather than boolean answer. Also known as a full text search).

Elasticsearch uses a query language called the Query DSL, a JSON-based language for defining queries. Be careful when Googling this: there is also a separate Java project called Querydsl, an “extensive Java framework for the generation of type-safe queries in a syntax similar to SQL” (Chapman, 2014), which is not the same thing as Elasticsearch’s own Query DSL.

Now, search uses the GET HTTP header; to set up a structured query you use the ‘filter’ attribute, and to set up an unstructured query you use the ‘query’ attribute, which gives every result a score.

Using QueryDSL we can write a single query (known as a leaf query clause) or multiple queries in a single statement (known as compound query clauses).

Where I want to query (retrieve) all documents in an index I can use the match_all attribute:

GET programminglanguage/OO/_search
{
  "query": {
    "match_all" : {}
  }
}

This is the equivalent of a SELECT * query in SQL and it is perfectly acceptable to use.

Note: the above match_all query is an unstructured query, because it uses the ‘query’ attribute.

If you want to limit the number of ranked results displayed in an unstructured query then you can specify the number of results you want with the ‘size’ attribute.
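For example, a sketch that caps the output at 3 ranked results:

GET programminglanguage/OO/_search
{
  "size": 3,
  "query": {
    "match_all": {}
  }
}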

 

How to use your search query

Now, if you use the ‘match’ attribute in your query, the search term goes through the analysis chain to tidy it up and is then used for an unstructured query.

Whereas if you use the ‘term’ attribute, whatever the user wrote is compared exactly against what is in the inverted index, and a structured query is performed.

So in the below example I am performing an unstructured query.

match.PNG
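As a sketch, the request in the screenshot is along these lines, with the ‘term’ version below it as the structured counterpart (note the lowercase ‘kelly’ in the term query: the analyzer lowercased the token in the inverted index, and a term query does no analysis of its own):

GET cats/domestic/_search
{
  "query": {
    "match": { "name": "Kelly" }
  }
}

GET cats/domestic/_search
{
  "query": {
    "term": { "name": "kelly" }
  }
}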

 

To make your unstructured query more fine-grained, there are three types of match query for you to choose from.

Boolean – This is effectively a structured query: whatever you enter as a search term is used to find an exact match in the inverted index, otherwise no hits are found.

I have a document with the name “Kelly” in the cats/domestic index and mapping type, so when I tried the bool query searching for the name “K” I got no results, because there is no document with the name “K” in cats/domestic.

book.PNG

Whereas when I performed this bool query using the name “Kelly” I got 1 hit, because there is exactly 1 document with the name “Kelly”.

bool 2.PNG

 

 

Phrase – This treats the entered search term as a phrase.

match_phrase_prefix – This query splits up the values in the string the user entered as a search term. If the user has not completed a word, as in the below example where I just used “K”, Elasticsearch looks at a dictionary of sorted words and puts the first 50 matches into the query one at a time.

match 2.PNG
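The request behind the screenshot was along these lines (a sketch):

GET cats/domestic/_search
{
  "query": {
    "match_phrase_prefix": { "name": "K" }
  }
}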

 

The query_string query is interesting: it is built rather like a SQL query, in that you can use OR and AND. So in the below example I am searching cats/domestic with the phrase “(blue OR black) AND (white OR red)” without specifying the field name, and I am getting the correct result.

query string.PNG
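As a sketch (no field specified, so in 5.x it runs against the _all field):

GET cats/domestic/_search
{
  "query": {
    "query_string": {
      "query": "(blue OR black) AND (white OR red)"
    }
  }
}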

Suggesters

Suggesters are faster than search queries, and a suggester can also be attached to a search query.

What a suggester does is suggest values similar to the user’s search term. So, for example, if the user misspelt the name and wrote “Kellu”, the suggester could suggest the similar term “Kelly”.
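A term suggester for that example might look like this sketch (the suggestion name is an assumption):

GET cats/_search
{
  "suggest": {
    "name-suggestion": {
      "text": "Kellu",
      "term": {
        "field": "name"
      }
    }
  }
}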

How Elasticsearch works

Search queries in Elasticsearch go through 3 stages; here is a summary of what I understand them to be:

  1. Pre-query – This is checking the number of times a word exists in a particular document. This is only possible where Elasticsearch has a small data set.
  2. Query – This is checking through the inverted indexes for a matching value. It is achieved by running the search query on all shards holding the index we are asking for, until a matching value is found that points the search query to the document ID holding this value. This is a useful resource for finding out more about the query phase: https://www.elastic.co/guide/en/elasticsearch/guide/current/distributed-search.html
  3. Fetch – This returns the documents (whose document IDs were listed alongside the search term in the inverted index) to the user.

Deep pagination – What this concept means is that Elasticsearch will look through every document in the cluster even if you just want 10 results; as you can imagine, this is very inefficient on a large data set. It is best avoided.
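For example, a request like this sketch forces every shard to sort and return its top 10,010 results, most of which are then thrown away:

GET cats/_search
{
  "from": 10000,
  "size": 10,
  "query": { "match_all": {} }
}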

Elasticsearch is known for its speed, and a contributing factor is the request cache. As indexes are spread across multiple shards, when a search query is run on an index it is executed individually on all the shards, and the partial results are then combined to form the total result. However, each shard keeps a copy of its own results, meaning that if someone queries the same value again it will already exist in a shard’s request cache and can be returned much faster, without having to search the shard again.

 

Aggregations

Aggregations is a framework that helps the user learn more about the search results they have found with a search query (“Aggregations | Elasticsearch Reference [5.3] | Elastic,” n.d.)

There’s three main types:

Bucket aggregations – Groups of documents that meet a specific criterion. You can think of this as finding common features across a whole lot of different documents and grouping the documents into ‘buckets’ based on those common features.

Metric aggregations – Calculate statistical data over a collection of documents (for example, over the documents in a bucket).

Pipeline aggregations – Combine the insights generated from other aggregations. This is an aggregation on an aggregation.

Kibana can use visualization tools to create graphs and maps using aggregations.

You implement aggregations using the “aggregations” attribute in the search query.
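For example, a terms aggregation over the cats index might look like this sketch (the ‘colour.keyword’ sub-field assumes the 5.x default dynamic mapping, which adds a keyword sub-field to text fields; "size": 0 suppresses the normal hits):

GET cats/_search
{
  "size": 0,
  "aggregations": {
    "colours": {
      "terms": { "field": "colour.keyword" }
    }
  }
}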

I am unable to perform many of the aggregation commands due to having a small data set; however, here is a summary of the aggregation commands available:

Sum aggregation – This adds together values of the same field in multiple documents

Min/max aggregation – Display the highest or lowest value of a field in all documents in an aggregation.

Multiple metrics aggregation – Display both the highest and lowest values for a field in all documents in an aggregation.

Terms aggregation – This returns the most frequent values for a particular field across all documents in an aggregation (the top 10 by default).

Missing aggregation – Finds documents in an aggregation that do not have a specified field.

Filter aggregation – This creates a bucket of all the documents that match a specified filter.

Significant terms aggregation – This finds strangely common values, by comparing how common a value is within the bucket against how common it is in the overall data source the bucket was drawn from.

It is important not to nest too many aggregations in a single command, because they are very resource hungry and you can end up crashing your system; this occurrence is called combinatorial explosion.

 

Data Modelling

If you choose to use Elasticsearch as a data store, either alongside or replacing a relational database management system, then you will need to perform data modelling to transform your existing data into something useful for Elasticsearch.

There are several paradigm shifts you will have to make for this process to be possible. Firstly, you need to understand that duplicate data is fine in Elasticsearch, as it makes searching faster; this goes against what we are taught in relational database design, so it is not initially intuitive.

Now, to take the data stored in relational tables, with relationships between one another, into Elasticsearch, we can do one of three things:

Denormalise the data into a single document: This flattens the data out, so if you had 2 tables in a direct relationship you can place all their columns and data into a single Elasticsearch mapping type. This makes the data structure flat, so it is searchable.

Nested objects: Nested objects are one way to store the relationship between two relational tables in some form. For example, in a relational database you may have the two tables ‘tvseries’ and ‘comments’. These tables have the relationship that a tvseries has one or many comments, and a comment belongs to one tvseries.

To transfer this relationship to Elasticsearch which does not use foreign keys to map relationships we can place ‘comment’ as a nested object in the ‘tvseries’.

nested objects.PNG

This means the foreign key relationship is lost, but we have been able to retain some of the relationship between the two logical groups of data by nesting one inside the other.

Now, in order to still be able to query the nested object separately from the root object it is stored in, we use nested queries.
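A sketch of such a nested query (this assumes ‘comments’ is mapped with the ‘nested’ data type, and the field names are assumptions):

GET tvseries/_search
{
  "query": {
    "nested": {
      "path": "comments",
      "query": {
        "match": { "comments.author": "Bob" }
      }
    }
  }
}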

Parent/child objects: Another way to map the relationship between logical groups of data is parent/child objects. I did this using the pet owning example I have used previously:

The parent object will be the owner, and the child object will be the cat. Here are the steps I went through to create this parent/child object combination.

  1. Create an index named “petowning”

setting it up.PNG

2. Create the parent object which is the owner

parent.PNG

 

3. Create the child object which is the cat

child.PNG
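Putting the three steps together, the requests look roughly like this (mapping and field values assumed from the screenshots):

PUT petowning
{
  "mappings": {
    "owner": {},
    "cat": {
      "_parent": { "type": "owner" }
    }
  }
}

PUT petowning/owner/1
{
  "name": "Sam"
}

PUT petowning/cat/1?parent=1
{
  "name": "Kelly"
}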

 

Now, each of these three methods has advantages and disadvantages, which need to be considered against your system requirements when you are performing data modelling:

Flattened data: Uses less storage than nested objects and parent/child objects, but the data relationships are totally lost.

Nested objects: Faster but less flexible (because the root object that the nested object is held in must be re-indexed whenever the nested object is updated).

Parent/child objects: Slower but more flexible.

 

Relevancy

Elasticsearch by default uses the TF/IDF (Term Frequency / Inverse Document Frequency) algorithm to determine how relevant a document is to a query.

This algorithm weighs how often a term appears in a document against how rare the term is across all documents; what this means is that a rarer, more specific search term contributes more to a document’s rank.
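As a rough sketch of the classic Lucene formulation (simplified; the real scoring also applies normalisation and boost factors):

tf(t, d) = sqrt(number of times term t appears in document d)
idf(t) = 1 + ln(totalDocs / (docsContaining(t) + 1))
weight(t, d) ≈ tf(t, d) × idf(t)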

 

Percolator

Instead of saying which documents match a query, the percolator does the opposite: it outputs the queries that match a given document.

To be able to output the search queries that match a document, we have to store the search queries as JSON documents. This is no problem, because search queries are written in the Query DSL (as I have previously discussed), which is itself JSON.

Below I am storing a search query to find a cat with the name “Violet”

storing query.PNG
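In 5.x this works by mapping a field with the percolator data type, indexing the query into it, and then matching documents against it with a percolate query; a sketch (index, type, and field names assumed):

PUT queries
{
  "mappings": {
    "stored_queries": {
      "properties": {
        "query": { "type": "percolator" },
        "name": { "type": "text" }
      }
    }
  }
}

PUT queries/stored_queries/1
{
  "query": {
    "match": { "name": "Violet" }
  }
}

GET queries/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "document_type": "stored_queries",
      "document": {
        "name": "Violet"
      }
    }
  }
}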

 

So there we have it: explanations and tested examples (run by me on Elasticsearch 5.3 and Kibana 5.3, and shown as screenshots) of the many different functions that Elasticsearch and Kibana provide. I hope this has been interesting and useful for you; I personally have found it fascinating to go through almost all of the commands I learned on my course in more depth and understand them better.

 

Bibliography

Chapman, B. (2014, June 11). What Can Querydsl Do for Me Part 1: How to Enhance and Simplify Existing Spring Data JPA Repositories. Credera. Retrieved April 16, 2017, from https://www.credera.com/blog/technology-insights/java/can-querydsl-part-1-enhance-simplify-existing-spring-data-jpa-repositories/

Aggregations | Elasticsearch Reference [5.3] | Elastic. (n.d.). Retrieved April 16, 2017, from https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html

 
