Elasticsearch Part 2

On Friday I attended an excellent Elasticsearch basics for developers course at work and I would like to discuss what I learned and how it has changed my  view of Elasticsearch since I wrote about it a couple of weeks back.

What is Elasticsearch?

Elasticsearch is usually thought of as a search engine but it is more than that, Elasticsearch can also be considered a:

  • Data store, meaning in addition to using it as a search engine for your app/system you could also use it as an alternative to a Relational Database Management System (RDBMS). Elasticsearch stores data in documents, mapping types and indexes, which are the equivalent of a Relational databases rows, tables, and database respectively.

 

  • Reporting tool – Elasticsearch can be used to store system logs. The RESTful API Kibana that you can use to interact with Elasticsearch can generate visual charts from Elasticsearch search query results. For example the below visual charts of system log information. These charts present the data in a far more useful format than a written system log.

kibana visualisation

(Christopher, 2015)

Something that particularly interested me about the course was that  the presenter Frederik said that Elasticsearch is very flexible and extremely useful as long as your prepared to spend time configuring it.

A lot of people implement Elasticsearch (which is actually pretty easy as I found last week) and expect it to be the equivalent of Google for their organizations data, however if you don’t configure it to match your business problem domain then it would not reach its full potential.

What is the internal structure of Elasticsearch?

Elasticsearch is built on top of Lucene, which is a search library. In the documentation it is very hard to determine where one ends and the other begins however I believe having done the course and read through the first answer in this very interesting StackOverFlow page (http://stackoverflow.com/questions/15025876/what-is-an-index-in-elasticsearch) that I have a good understand of this now, so lets test it out.

I look at Elasticsearch and Lucene as a 2 layered cake ( to see this graphically look at the below diagram where we have Elasticsearch as the top layer and Lucene as the bottom layer), the top layer (Elasticsearch) is the one that the user interacts with.  When you first install Elasticsearch a cluster is created ( a cluster is a collection of 1 or more nodes (instances of Elasticsearch)).

Inside this cluster by default you have 1 node (a single instance of Elasticsearch). This node contains indexes. Now an index is like a database instance, drilling down further we have mapping types (which are the equivalent of tables for example you could create a mapping type of student), inside a mapping type there are documents (which are a single data record making them the equivalent of a row in a database), and inside each indexed document there are properties which are the individual data values so for example 22 years old is a property for age).

To put the document into perspective it is just a JSON data structure.

So we have established that Elasticseach stores the data in indexes, with each data record known as a document.

But how does Elasticsearch actually find specific data when someone writes in a HTTP GET request into Kibana? Well that’s where Lucene comes in, Lucene is the bottom of the two layers in my cake simile. Lucene contains its own index, which is a inverted index: instead of storing data it points to which indexed documents in Elasticsearch index that the data value is stored in, in much the same way a book index points to the page number where a particular word exists.

Anther good analogy of what a inverted index is that it is quite similar to containers such as arrays and dictionaries which point to a specific location in memory where a particular value is stored rather than storing the value itself in their data structure.

Having done the course I now believe I understand how the index in Elasticsearch and the index in Lucene relate.

lucene and es.png

(Principe, 2013)

Now as I said by default your Elasticsearch cluster has one node, however Elasticsearch is extendable meaning you can add more nodes to your cluster.

Within each node there are 5 shards. What is a shard? “A shard is a single Lucene instance. It is a low-level “worker” unit which is managed automatically by elasticsearch. An index is a logical namespace which points to primary and replica shards.” (“Glossary of terms | Elasticsearch Reference [5.3] | Elastic,” n.d.). In other words each node contains the Elasticsearch indexes outside the shards and the Lucene Inverted indexes inside the shard.

Each shard has a backup in the form of a replica shard which is stored on a different node, this provides data redundancy and speeds up the search times because it is more likely the HTTP GET request is sent to a shard containing an inverted index with the search term in it.

What is the process that happens when the client RESTful API sends a request to the cluster?

In Elasticsearch the commands are called APIs so for example a delete command is a called delete API.

Now like I previously stated Elasticsearch is structured as a collection of nodes in a cluster (think of it like how there are multiple servers in the concept of the cloud).

The nodes store different information (in the form of Lucene inverted indexes and Elasticsearch indexes) so the request needs to go to a particular node to access particular data. However all the nodes store information about the topology of the cluster, so they know what node contains the data the API command seeks/wants to modify.

When you write a HTTP GET request in Kibana the ID specified in the GET request is hashed and then the hashed id is sent to a node in the cluster, it doesn’t matter which node that request is sent to as it will redirect the request to the appropriate node (if it doesn’t store the matching hashed id).

However to make sure that the same node is not always queried the destination node of each search query is different based on a round robin distribution.

How Elasticsearch data storage violates normalisation and refactoring

Elasticsearch is all about fast search times, to achieve this having duplicated data in multiple indexes is considered acceptable.

This is in complete contrast to the database concept of normalization and the programming concept of refactoring both of which stress the need to remove duplicate data/code.

What are the differences between Elasticsearch and a relational database

Although Elasticsearch can be used as a data store meaning you could implement it as an alternative to a relational database the differences are:

  • Elasticsearch does not use foreign keys to create relationships between indexes
  • Data can be duplicated to speed up the query time
  • Query joins (querying two or more indexes in a single query) are not available in any effective way from Elasticsearch, meaning although rudimentary joins can be implemented they are not very effective

So when should you replace your RDBMS with Elasticsearch? Well it depends on the sorts of queries you have performed/want to perform on your primary data store. If you are performing complex transaction queries (2 or more queries concurrently) then Elasticsearch is not ideal and you would be better off using a RDBMS such as MySQL and just use Elasticsearch as a search engine.

However if you don’t want complex transactional queries then Elasticsearch is a good alternative to a RDBMS.

What are the cons of Elasticsearch?

Elasticsearch is not ideal from a security point of view this is because it does not provide data or transport encryption.

It is near realtime – This means there is a slight latency after indexing a document before you can search for the data it holds.

What are the benefits of Elasticsearch?

The two main benefits of Elasticsearch are:

  • It is fast – Due to the data being duplicated in multiple shards it means it is faster to access data in either the primary or replica shards
  • It is distributed – Meaning it is easy to extend by creating another node in your cluster
  • High availability – By having a primary and replica shard to hold each inverted index twice this means the indexes are more easily available

 

Starting Elasticsearch on Linux

Last week I installed and used Elasticsearch on a Windows machine, now I want to cover how to use Elasticsearch on a Linux machine:

  1. Download both Elasticsearch and Kibana (the versions used in my course were Elasticsearch 5.1.1 and Kibana 5.1.1 however there are more recent versions available of both systems and so there may be version conflicts which are visible once you visit Kibana in your browser. If there are version issues simply install the version of either Kibana or Elasticsearch specified on the Kibana interface).
  2. Start two terminal windows. In one terminal navigate to the Elasticsearch directory and start Elasticsearch by writing in:

./elasticsearch-5.1.1/bin/elasticsearch

3. In the other terminal navigate to the Kibana directory and write in:
./kibana-5.1.1-linux-x86_64/bin/kibana

3. Now in browser visit Elasticsearch by writing in the URL:
http://localhst:9200

4. Also in your web browser visit Kibana by writing in the URL:

http://localhost:5601/app/kibana

Because you now interact with Elasticsearch through Kibana in the web browser everything is the same from this stage no matter what OS you are using.

 

Examples of Elasticsearch API commands

Create Index API – This creates a index which we can then use to index documents (create data records)

PUT student

{
“settings” : {…},

“mappings:: {…}

}

In this create index API command you can specify the number of shards and replicas you want the index to span. e.g.

PUT  “student”

{

“settings” : {

“number_of_shards” : 1,

“number_of_replicas” : 1

}

}

Index API – Here I am specifying I want to use the ‘student’ index, creating a ‘degree’ document type and specifying the ID of the document I am indexing is 1. Then I am indexing the document itself. By performing the index API I am automatically creating a document type of degree.

Note: Specifying the id value in the PUT command is optional.

PUT student/degree/1

{

“name” : “Alexander Buckley”,

“alt_names” : [ “Alex” ],

“origin” : “Nelson”

}

 

If there was no student index before I wanted to index this document I would need to write

PUT  student/degree/1/_create

{

“name” : “Alexander Buckley”,

“alt_names” : [ “Alex” ],

“origin” : “Nelson”

}

GET API – This retrieves the data for a specific document.

e.g.

GET student/degree/1

 

Exists API – This checks if there is a document with a particular id in a index.

e.g.

HEAD student/degree/1

This is checking if there is a document with the id of 1 in the student index and degree mapping type.

 

Delete API – This deletes a document from an index by specifying the id of the document you want deleted in the Delete API command

DELETE student/degree/1

Write consistency – Before any of these API commands is performed more than 1/2 of the shards in the cluster need to be available because it is dangerous to write to a single shard.

Versioning – Elasticsearch uses versioning to keep track of every write operation to a document. The version is assigned to a document when it is indexed and is automatically incremented with every write operation.

Update API – This allows you to update parts of a document. To do this you write the changes to specific properties of the document in the ‘docs’ . So in the below example I am updating the name property of the document with the id of 1. This will be merged with the existing document.

POST student/degree/1/_update

{

“docs” : {

“name” : “Alex Buckley”

}

}

 

Multi Get API – This is where you can request multiple documents from a specific index and mapping type. Whats returned is a doc array.

GET _mget

{

“docs” : [

{

“_index” : “student”,

“_type”   : “degree”,

“_id” : 1

},

{

“_index” : “student”,

“_type” : “degree”,

“_id” : 2,

“_source” : [“origin”]

}

]}

 

Bulk API – To perform multiple different API commands simultaneously. Elasticsearch splits the command up and sends them off to the appropriate node. If two requests are requesting/manipulating the same node then they are sent together.

PUT _bulk

{ “delete” : { “index” : “student”, “_type” : “degree”, “_id” : 2 } }\n

{ ” index” : { “_index” : “student” , “_type” : “degree”, “_id” : “3}}\n

{ “name” : “Jan Smith”, “alt_names” : [“Janet Smith”], “origin” : “Wellington” }\n

In this example I am deleting a document from the fruit index and indexing (adding) another document all with a single Bulk API. The benefit of this is once this API request has been redirected to the node containing the student index multiple API commands can be performed which is obviously more efficient.

Search – Text analysis

Unlike the range for commands for the CRUD actions in the above section for search we would use the GET API. This is sent from the Kibana client to the Elasticsearch cluster, and redirected from the node that received the command to the node containing the inverted index with a matching search value (known as a token).

This Lucene inverted index contains 3 columns , first column is the search token,  second column is the number of documents it exists in and third column is the id of the documents it exists in.

e.g.

token        docfreq.        postings (doc ids)

Janet          2                  3, 5

If we were using a GET API to find all instances of the word ‘Janet’ we would be returned with the documents 3 and 5.

When indexing a document you can use the index attribute to specify what fields you want to be searchable. This attribute can have one of three values:

  • analyzed: Make the field searchable and put it through the analyzer chain
  • not_analyzed: Make the field searchable, and don’t put it through the analyzer chain
  • no: Don’t make the field searchable

But what is the analyzer chain?

OK so the values from the index documents are placed in the Lucene inverted index and that is what is queried when using Elasticsearch as a search engine. If we have a string we want to be searchable then we often have to tidy it up a bit to make it more easily searchable, that’s where the analyzer chain comes in, it performs the following actions:

  1. The first step is the char filter, this removes any HTML syntax. e.g. the string “<h1> This is a heading </h1>” would become “This is a heading”.

2. The second step is the tokeniser, which usually does (but you can specify the steps you want the tokeniser to do):

  • Splitting the words in the string apart
  • Removes stop words like ‘a’
  • Make all letters in each word lower case
  • Replace similar words with their stem word. In other words the two words “run”, and “running”  are similar and so instead of writing them all to the inverted index we replace them with a singular word “run”.  Replacing similar words with stem words is automated by performing a stem algorithm.

3. Token filter

Interestingly all user query terms go through the same analyzer chain before they are compared against the inverted index if the user uses the ‘match’ attribute in their search query (which will be discussed below)

Search

Elasticsearch can perform two types of search:

  • Structured query – This is a boolean query in the sense that either a match for the search query is found or it isn’t. It is used for keyword searches.
  • Unstructured query – This can be used for searching for phrases and it ranks the matches on how relevant they are. It can also be called a fuzzy search, because it does not treat the results in a boolean way saying their either a match or not as the structured query does but it returns results that exist on a continuum of relevancy.

 

Search queries:

match_all: This is the equivalent of SELECT * in SQL queries. The below example will return all documents in the fruit index and berries mapping type.

GET student/degree/_search

{

“queries” : {

“match_all”: {}

}

}

Note: The top 10 results are returned, and so even if you perform the match_all query you will still only get 10 results back by default. But this can be customized.

 

If you want to search terms that have not been analyzed (i.e. haven’t gone through the analyzer chain when the document was indexed)  then you want to use the ‘term’ attribute

However if you want to query a field that has been analyzed (i.e. it has gone through the analyzer chain) then you will use the match attribute.

e.g.

GET student/degree/_search

{

“query” : {

“match” : { “name” : “Jan Smith” }

}

}

This means the term Jan Smith will go through the analyzer chain before it is compared against values in the inverted index.

The multi_match attribute can be used to find a match in multiple fields, i.e. it will put multiple search values through the analyzer chain in order to find a matching value in the inverted index.

GET student/degree/_search

{

“query” : {

“multi_match” : {

“fields” : [“name”, “origin” ],

“query”: “Jan Smith Wellington”

}

}

I will discuss the other search queries in my next Elasticsearch blog to prevent this one getting too long.

 

Mappings

When we index a document we specify a mapping type which is kind of like the table in a relational database, or a class in the OO paradigm because it has particular properties which all documents of that mapping type have values for.

The benefits of mapping types are that they make your indexes match the problem domain more closely. For example by making a mapping type of degree I am making the documents that I index in the degree mapping type a more specific type of student.

To save time when creating indexes with the same mapping type we can place the mapping type in a template and just apply the template to the index.

e.g.

PUT _template/book_template

{

     “template” : “book*”,

     “settings” : {

          “number_of_shard” : 1

       },

       “mappings” : {

                 “_default_” : {

                          “_all” : {

                                    “enabled” : false

                            }

                      }

              }

}

To apply this mapping type template to a index I would have to write:

PUT _template/book_wildcard

{

    “template” : “book*”,

    “mappings” : {

          “question”: {

               “dynamic_templates” : [

                 {

                            “integers” : {

                                     “match”: “*_i”,

                                      “mapping” : { “type”: “integer”}

                              }

                       }

                   ]

}}}

Note: It is recommended that you only assign a single mapping type to a index.

Conclusion

I have learned a lot from the Elasticsearch course and will continue to discuss what I learned in the next Elasticsearch blog.

 

Bibliography:

Christopher. (2015, April 16). Visualizing data with Elasticsearch, Logstash and Kibana. Retrieved March 26, 2017, from http://blog.webkid.io/visualize-datasets-with-elk/

Principe,  f. (2013, August 13). ELASTICSEARCH what is | Portale di Francesco Principe. Retrieved March 26, 2017, from http://fprincipe.altervista.org/portale/?q=en/node/81

Glossary of terms | Elasticsearch Reference [5.3] | Elastic. (n.d.). Retrieved March 30, 2017, from https://www.elastic.co/guide/en/elasticsearch/reference/current/glossary.html

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s