Elasticsearch Part 2

On Friday I attended an excellent Elasticsearch basics for developers course at work, and I would like to discuss what I learned and how it has changed my view of Elasticsearch since I wrote about it a couple of weeks back.

What is Elasticsearch?

Elasticsearch is usually thought of as a search engine, but it is more than that. Elasticsearch can also be considered a:

  • Data store, meaning in addition to using it as a search engine for your app/system you could also use it as an alternative to a Relational Database Management System (RDBMS). Elasticsearch stores data in documents, mapping types, and indexes, which are roughly equivalent to a relational database's rows, tables, and databases respectively.

 

  • Reporting tool – Elasticsearch can be used to store system logs. Kibana, a web-based interface for interacting with Elasticsearch, can generate visual charts from Elasticsearch query results, such as the below charts of system log information. These charts present the data in a far more useful format than a raw written system log.

kibana visualisation

(Christopher, 2015)

Something that particularly interested me about the course was that the presenter, Frederik, said that Elasticsearch is very flexible and extremely useful as long as you're prepared to spend time configuring it.

A lot of people implement Elasticsearch (which is actually pretty easy, as I found last week) and expect it to be the equivalent of Google for their organization's data; however, if you don't configure it to match your business problem domain, it will not reach its full potential.

What is the internal structure of Elasticsearch?

Elasticsearch is built on top of Lucene, which is a search library. In the documentation it is very hard to determine where one ends and the other begins; however, having done the course and read through the first answer on this very interesting Stack Overflow page (http://stackoverflow.com/questions/15025876/what-is-an-index-in-elasticsearch), I believe I have a good understanding of this now, so let's test it out.

I look at Elasticsearch and Lucene as a two-layered cake (to see this graphically, look at the below diagram, which has Elasticsearch as the top layer and Lucene as the bottom layer); the top layer (Elasticsearch) is the one the user interacts with. When you first install Elasticsearch a cluster is created (a cluster is a collection of one or more nodes, i.e. instances of Elasticsearch).

Inside this cluster you have, by default, one node (a single instance of Elasticsearch). This node contains indexes. An index is like a database instance. Drilling down further, we have mapping types (the equivalent of tables; for example, you could create a mapping type of student). Inside a mapping type there are documents (a single data record, making them the equivalent of a row in a database), and inside each indexed document there are properties, which are the individual data values (so, for example, 22 is the value of an age property).

To put the document into perspective, it is just a JSON data structure.

So we have established that Elasticsearch stores the data in indexes, with each data record known as a document.

But how does Elasticsearch actually find specific data when someone writes an HTTP GET request in Kibana? Well, that's where Lucene comes in. Lucene is the bottom of the two layers in my cake analogy. Lucene contains its own index, which is an inverted index: instead of storing the data itself, it points to the indexed documents in the Elasticsearch index where each data value is stored, in much the same way a book index points to the page numbers where a particular word appears.

Another good analogy for an inverted index is that it is quite similar to how containers such as arrays and dictionaries point to a specific location in memory where a particular value is stored, rather than the index storing the value itself.
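To make the idea concrete, here is a minimal sketch of building an inverted index from a few documents (Python pseudocode for the concept only; this is not how Lucene is implemented internally):

```python
# Build a toy inverted index: each term maps to the set of document
# IDs that contain it, just like a book index maps words to pages.
def build_inverted_index(docs):
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index

docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "the quick dog",
}
index = build_inverted_index(docs)
print(sorted(index["quick"]))  # documents containing "quick" -> [1, 3]
print(sorted(index["dog"]))    # -> [2, 3]
```

Searching is then just a dictionary lookup on the term, which is why it is so much faster than scanning every document.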

Having done the course I now believe I understand how the index in Elasticsearch and the index in Lucene relate.

lucene and es.png

(Principe, 2013)

Now, as I said, by default your Elasticsearch cluster has one node; however, Elasticsearch is extendable, meaning you can add more nodes to your cluster.

By default each index is split into 5 shards. What is a shard? "A shard is a single Lucene instance. It is a low-level "worker" unit which is managed automatically by elasticsearch. An index is a logical namespace which points to primary and replica shards." ("Glossary of terms | Elasticsearch Reference [5.3] | Elastic," n.d.). In other words, the Elasticsearch index is a logical grouping over the shards, while the Lucene inverted indexes live inside the shards.

Each shard has a backup in the form of a replica shard, which is stored on a different node. This provides data redundancy, and it also speeds up search times because a request can be served by either the primary or the replica copy of a shard.

What is the process that happens when the client RESTful API sends a request to the cluster?

In Elasticsearch the commands are called APIs, so for example a delete command is called the delete API.

Now, as I previously stated, Elasticsearch is structured as a collection of nodes in a cluster (think of it like the multiple servers that make up the cloud).

The nodes store different information (in the form of Lucene inverted indexes inside the shards), so a request needs to go to a particular node to access particular data. However, every node stores information about the topology of the cluster, so each node knows which node contains the data an API command seeks or wants to modify.

When you write an HTTP GET request in Kibana, the ID specified in the GET request is hashed, and the hash determines which shard (and therefore which node) holds the document. It doesn't matter which node first receives the request: if it doesn't hold the document with the matching hashed ID, it will redirect the request to the appropriate node.

However, to make sure the same copy of a shard is not always queried, search requests are spread across the primary and replica shards in a round-robin fashion.
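A rough sketch of both ideas. The real routing formula Elasticsearch uses is `shard = hash(routing) % number_of_primary_shards` with a Murmur3 hash; the code below is a simplified illustration with a CRC32 hash, not the actual implementation:

```python
import itertools
import zlib

NUM_PRIMARY_SHARDS = 5  # the default number of primary shards per index

def route_to_shard(doc_id, num_shards=NUM_PRIMARY_SHARDS):
    # The document ID is hashed, and the hash modulo the number of
    # primary shards picks the shard that stores (and serves) it.
    return zlib.crc32(str(doc_id).encode()) % num_shards

# Round-robin over the copies of a shard (primary + replica) so the
# same copy is not queried every time.
copies = itertools.cycle(["primary", "replica"])
print(route_to_shard(1) == route_to_shard(1))  # same ID -> same shard: True
print([next(copies) for _ in range(4)])  # alternates primary/replica
```

This is also why the number of primary shards cannot be changed after index creation: the modulo in the routing formula would send existing IDs to the wrong shard.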

How Elasticsearch data storage violates normalisation and refactoring

Elasticsearch is all about fast search times; to achieve this, having duplicated data in multiple indexes is considered acceptable.

This is in complete contrast to the database concept of normalization and the programming concept of refactoring both of which stress the need to remove duplicate data/code.

What are the differences between Elasticsearch and a relational database

Although Elasticsearch can be used as a data store meaning you could implement it as an alternative to a relational database the differences are:

  • Elasticsearch does not use foreign keys to create relationships between indexes
  • Data can be duplicated to speed up the query time
  • Query joins (querying two or more indexes in a single query) are not available in any efficient way in Elasticsearch; rudimentary joins can be implemented, but they are not very efficient

So when should you replace your RDBMS with Elasticsearch? Well, it depends on the sorts of queries you perform, or want to perform, on your primary data store. If you need complex transactional queries (multiple statements executed as a single atomic unit) then Elasticsearch is not ideal, and you would be better off using an RDBMS such as MySQL and just using Elasticsearch as a search engine.

However, if you don't need complex transactional queries, then Elasticsearch is a good alternative to an RDBMS.

What are the cons of Elasticsearch?

Elasticsearch is not ideal from a security point of view, because it does not provide data or transport encryption out of the box.

It is near real-time – this means there is a slight latency after indexing a document before you can search for the data it holds.

What are the benefits of Elasticsearch?

The main benefits of Elasticsearch are:

  • It is fast – due to the data being duplicated across primary and replica shards, data can be read from whichever copy is available
  • It is distributed – meaning it is easy to extend by creating another node in your cluster
  • High availability – holding each inverted index in both a primary and a replica shard means the indexes remain available even if a node fails

 

Starting Elasticsearch on Linux

Last week I installed and used Elasticsearch on a Windows machine; now I want to cover how to use Elasticsearch on a Linux machine:

  1. Download both Elasticsearch and Kibana (the versions used in my course were Elasticsearch 5.1.1 and Kibana 5.1.1; there are more recent versions of both available, so there may be version conflicts, which become visible once you visit Kibana in your browser. If there are version issues, simply install the version of either Kibana or Elasticsearch specified on the Kibana interface).
  2. Start two terminal windows. In one terminal navigate to the Elasticsearch directory and start Elasticsearch by writing in:

./elasticsearch-5.1.1/bin/elasticsearch

  3. In the other terminal navigate to the Kibana directory and write in:

./kibana-5.1.1-linux-x86_64/bin/kibana

  4. Now in your browser visit Elasticsearch at the URL:

http://localhost:9200

  5. Also in your web browser visit Kibana at the URL:

http://localhost:5601/app/kibana

Because you now interact with Elasticsearch through Kibana in the web browser everything is the same from this stage no matter what OS you are using.

 

Examples of Elasticsearch API commands

Create Index API – This creates an index which we can then use to index documents (create data records).

PUT student

{
    "settings" : {...},
    "mappings" : {...}
}

In this create index API command you can specify the number of shards and replicas you want the index to have, e.g.

PUT student

{
    "settings" : {
        "number_of_shards" : 1,
        "number_of_replicas" : 1
    }
}

Index API – Here I am specifying that I want to use the 'student' index, creating a 'degree' mapping type, and specifying that the ID of the document I am indexing is 1. Then I index the document itself. By performing the index API command I am automatically creating a mapping type of degree.

Note: Specifying the ID value is optional; if you omit it you use POST (rather than PUT) and Elasticsearch auto-generates an ID.

PUT student/degree/1

{
    "name" : "Alexander Buckley",
    "alt_names" : [ "Alex" ],
    "origin" : "Nelson"
}

 

If I wanted to make sure the document is only created when no document with that ID already exists (the request fails if one does), I would write:

PUT student/degree/1/_create

{
    "name" : "Alexander Buckley",
    "alt_names" : [ "Alex" ],
    "origin" : "Nelson"
}

GET API – This retrieves the data for a specific document.

e.g.

GET student/degree/1

 

Exists API – This checks if there is a document with a particular ID in an index.

e.g.

HEAD student/degree/1

This is checking if there is a document with the id of 1 in the student index and degree mapping type.

 

Delete API – This deletes a document from an index; you specify the ID of the document you want deleted in the delete API command.

DELETE student/degree/1

Write consistency – Before any of these write API commands is performed, more than half of the shard copies in the cluster need to be available, because it is dangerous to write to a single shard.

Versioning – Elasticsearch uses versioning to keep track of every write operation to a document. The version is assigned to a document when it is indexed and is automatically incremented with every write operation.
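A toy illustration of this versioning behaviour (hypothetical code, not Elasticsearch's implementation): the first index operation assigns version 1, and every later write increments it.

```python
class TinyDocStore:
    """Toy document store that mimics Elasticsearch-style versioning."""

    def __init__(self):
        self._docs = {}

    def index(self, doc_id, body):
        # First index -> version 1; every later write increments it.
        version = self._docs.get(doc_id, (None, 0))[1] + 1
        self._docs[doc_id] = (body, version)
        return version

    def get(self, doc_id):
        body, version = self._docs[doc_id]
        return {"_id": doc_id, "_version": version, "_source": body}

store = TinyDocStore()
store.index(1, {"name": "Alexander Buckley"})
store.index(1, {"name": "Alex Buckley"})  # a second write to the same document
print(store.get(1)["_version"])           # -> 2
```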

Update API – This allows you to update parts of a document. To do this you write the changes to specific properties of the document inside the 'doc' object. So in the below example I am updating the name property of the document with the ID of 1. This will be merged with the existing document.

POST student/degree/1/_update

{
    "doc" : {
        "name" : "Alex Buckley"
    }
}

 

Multi Get API – This is where you can request multiple documents, from a specific index and mapping type, in one request. What's returned is a docs array.

GET _mget

{
    "docs" : [
        {
            "_index" : "student",
            "_type" : "degree",
            "_id" : 1
        },
        {
            "_index" : "student",
            "_type" : "degree",
            "_id" : 2,
            "_source" : ["origin"]
        }
    ]
}

 

Bulk API – This performs multiple API commands in a single request. Elasticsearch splits the commands up and sends them off to the appropriate nodes; if two commands target the same node they are sent together.

POST _bulk

{ "delete" : { "_index" : "student", "_type" : "degree", "_id" : 2 } }
{ "index" : { "_index" : "student", "_type" : "degree", "_id" : 3 } }
{ "name" : "Jan Smith", "alt_names" : ["Janet Smith"], "origin" : "Wellington" }

In this example I am deleting one document from the student index and indexing (adding) another, all with a single bulk API request. The benefit of this is that once the request has been redirected to the node containing the student index, multiple API commands can be performed there, which is obviously more efficient.
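The bulk body is newline-delimited JSON (NDJSON): one line of action metadata, then, for index operations, one line of document source, with each line terminated by a newline. A small sketch of assembling such a body (the helper name is my own, not part of any Elasticsearch client):

```python
import json

def build_bulk_body(actions):
    # Each action is a tuple: (operation, metadata, optional source document).
    lines = []
    for op, meta, source in actions:
        lines.append(json.dumps({op: meta}))
        if source is not None:           # delete actions carry no source line
            lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"       # bulk bodies must end with a newline

body = build_bulk_body([
    ("delete", {"_index": "student", "_type": "degree", "_id": 2}, None),
    ("index",  {"_index": "student", "_type": "degree", "_id": 3},
     {"name": "Jan Smith", "alt_names": ["Janet Smith"], "origin": "Wellington"}),
])
print(body)
```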

Search – Text analysis

Unlike the range of commands for the CRUD actions in the above section, for search we use the GET API with the _search endpoint. The request is sent from the Kibana client to the Elasticsearch cluster, and redirected from the node that received the command to the node containing the inverted index with a matching search value (known as a token).

This Lucene inverted index contains three columns: the first column is the search token, the second is the number of documents it exists in (the document frequency), and the third is the IDs of the documents it exists in (the postings).

e.g.

token        docfreq.        postings (doc ids)

Janet          2                  3, 5

If we were using a GET API to find all instances of the word ‘Janet’ we would be returned with the documents 3 and 5.

When indexing a document you can use the index attribute to specify what fields you want to be searchable. This attribute can have one of three values:

  • analyzed: Make the field searchable and put it through the analyzer chain
  • not_analyzed: Make the field searchable, and don’t put it through the analyzer chain
  • no: Don’t make the field searchable

But what is the analyzer chain?

OK, so the values from the indexed documents are placed in the Lucene inverted index, and that is what is queried when using Elasticsearch as a search engine. If we have a string we want to be searchable, then we often have to tidy it up a bit to make it more easily searchable; that's where the analyzer chain comes in. It performs the following steps:

  1. The first step is the char filter; this can strip out markup, e.g. the string "<h1> This is a heading </h1>" would become "This is a heading".

  2. The second step is the tokeniser, which splits the string into individual tokens (words).

  3. The third step is the token filters, which transform those tokens (you can specify which filters you want to apply). Typical token filters:

  • Make all letters in each word lower case
  • Remove stop words like 'a'
  • Replace similar words with their stem word. In other words, the two words "run" and "running" are similar, so instead of writing them both to the inverted index we replace them with the single stem "run". Replacing similar words with stem words is automated by a stemming algorithm.

Interestingly, if the user uses the 'match' query in their search (which will be discussed below), all user query terms go through the same analyzer chain before they are compared against the inverted index.
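The whole chain can be sketched in a few lines (a toy illustration only; real analyzers use proper tokenizers and stemmers such as the Porter algorithm, and the `ning`-stripping rule below is a deliberately crude stand-in for stemming):

```python
import re

STOP_WORDS = {"a", "an", "the", "is"}

def analyze(text):
    # 1. Char filter: strip HTML tags.
    text = re.sub(r"<[^>]+>", "", text)
    # 2. Tokenizer: split the text into individual tokens.
    tokens = re.findall(r"\w+", text)
    # 3. Token filters: lowercase, drop stop words, crude "stemming".
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [t[:-4] if t.endswith("ning") else t for t in tokens]  # "running" -> "run"
    return tokens

print(analyze("<h1>The runner is running</h1>"))  # -> ['runner', 'run']
```

Because the same function is applied at index time and (for match queries) at search time, the query "Running" will find a document containing "runs through the run" even though the surface forms differ.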

Search

Elasticsearch can perform two types of search:

  • Structured query – This is a boolean query in the sense that either a match for the search query is found or it isn’t. It is used for keyword searches.
  • Unstructured query – This can be used for searching for phrases, and it ranks the matches by how relevant they are. It can also be called a fuzzy search, because rather than treating results in a boolean way (either a match or not) as the structured query does, it returns results that exist on a continuum of relevancy.

 

Search queries:

match_all: This is the equivalent of SELECT * in SQL queries. The below example will return all documents in the student index and degree mapping type.

GET student/degree/_search

{
    "query" : {
        "match_all" : {}
    }
}

Note: Only the top 10 results are returned by default, so even if you perform the match_all query you will still only get 10 results back. But this can be customized.
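The customization is done with the `from` and `size` search parameters (defaults: `from` = 0, `size` = 10). Conceptually this is just a slice over the relevance-ranked result list; a sketch of the idea, not Elasticsearch code:

```python
def paginate(ranked_hits, from_=0, size=10):
    # Elasticsearch's `from`/`size` parameters behave like a slice
    # over the relevance-ranked result list.
    return ranked_hits[from_:from_ + size]

hits = [f"doc{i}" for i in range(25)]
print(len(paginate(hits)))               # default page -> 10 results
print(paginate(hits, from_=10, size=5))  # the third "page" of 5 results
```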

 

If you want to search fields that have not been analyzed (i.e. they haven't gone through the analyzer chain when the document was indexed), then you want to use the 'term' query.

However, if you want to query a field that has been analyzed (i.e. it has gone through the analyzer chain), then you will use the match query.

e.g.

GET student/degree/_search

{
    "query" : {
        "match" : { "name" : "Jan Smith" }
    }
}

This means the term Jan Smith will go through the analyzer chain before it is compared against values in the inverted index.

The multi_match attribute can be used to find a match in multiple fields, i.e. it will put multiple search values through the analyzer chain in order to find a matching value in the inverted index.

GET student/degree/_search

{
    "query" : {
        "multi_match" : {
            "fields" : [ "name", "origin" ],
            "query" : "Jan Smith Wellington"
        }
    }
}

I will discuss the other search queries in my next Elasticsearch blog to prevent this one getting too long.

 

Mappings

When we index a document we specify a mapping type, which is kind of like a table in a relational database, or a class in the OO paradigm, because it has particular properties which all documents of that mapping type have values for.

The benefit of mapping types is that they make your indexes match the problem domain more closely. For example, by making a mapping type of degree, I am making the documents that I index in the degree mapping type a more specific type of student.

To save time when creating indexes with the same mapping type we can place the mapping type in a template and just apply the template to the index.

e.g.

PUT _template/book_template

{
    "template" : "book*",
    "settings" : {
        "number_of_shards" : 1
    },
    "mappings" : {
        "_default_" : {
            "_all" : {
                "enabled" : false
            }
        }
    }
}

A template is applied automatically to any new index whose name matches the template's pattern ("book*" above), rather than being applied by hand. Templates can also contain dynamic mappings; for example, the below template maps any field whose name ends in _i to the integer type:

PUT _template/book_wildcard

{
    "template" : "book*",
    "mappings" : {
        "question" : {
            "dynamic_templates" : [
                {
                    "integers" : {
                        "match" : "*_i",
                        "mapping" : { "type" : "integer" }
                    }
                }
            ]
        }
    }
}

Note: It is recommended that you only assign a single mapping type to an index.

Conclusion

I have learned a lot from the Elasticsearch course and will continue to discuss what I learned in the next Elasticsearch blog.

 

Bibliography:

Christopher. (2015, April 16). Visualizing data with Elasticsearch, Logstash and Kibana. Retrieved March 26, 2017, from http://blog.webkid.io/visualize-datasets-with-elk/

Principe, F. (2013, August 13). ELASTICSEARCH what is | Portale di Francesco Principe. Retrieved March 26, 2017, from http://fprincipe.altervista.org/portale/?q=en/node/81

Glossary of terms | Elasticsearch Reference [5.3] | Elastic. (n.d.). Retrieved March 30, 2017, from https://www.elastic.co/guide/en/elasticsearch/reference/current/glossary.html

What is Elasticsearch?

I thought I would write a research journal entry about Elasticsearch, because I will be attending a development course on this technology on the 24th of March for work, and so I think it will be advantageous to try to understand what Elasticsearch is and how it works.

The reason I am going to be doing the development course on Elasticsearch is that the Koha Integrated Library Management System (ILMS) that I work on is currently implementing Elasticsearch.

For a start, I want to mention that trying to find understandable information about Elasticsearch has been a challenge, to say the least. I am a big-picture person, and so when researching a technology like Elasticsearch I like to know how it fits in with the other tiers of a software architecture. This information was lacking in the Elasticsearch and Wikipedia pages; however, I was able to find the following links, which gave me enough knowledge to understand and explain what Elasticsearch is:

https://www.sitepoint.com/building-recipe-search-site-angular-elasticsearch/

http://www.elasticsearchtutorial.com/basic-elasticsearch-concepts.html

elasticsearch.png

What is Elasticsearch?

Elasticsearch is an open source search engine server. It is used to provide search functionality for web apps by indexing data records (known as 'documents', which are written in JSON). Effectively, Elasticsearch works just like a book index, where authors place specific words they think readers will want to find alongside the locations where these words appear throughout the book. Developers can implement Elasticsearch in their web apps if search functionality is required.

With the proliferation of big data (large amounts of unstructured data, for example data in Word documents), Elasticsearch implements an effective indexing system allowing users to generate information (a refined form of data) from the data it points to.

It is distributed, which means the data that it stores is kept on multiple different instances. Now, I originally thought that this meant Elasticsearch must be run on multiple physical machines; however, you can create multiple instances of Elasticsearch on a single machine, and this will make a cluster of multiple nodes, much the same as a cluster of RAID disks.

The distributed nature of Elasticsearch means as data volumes increase all you have to do is create a new node in the cluster  to store additional data indexes.

Elasticsearch is built using Java, so having Java installed on the machines Elasticsearch will be run on is required.

Examples of companies/web applications that use Elasticsearch are GitHub, Facebook, CERN, and Netflix, so with these big players using it, it's obviously a popular technology.

While a specific example of Elasticsearch in use is: "Stack Overflow combines full-text search with geolocation queries and uses more-like-this to find related questions and answers." (Vanderzyden, 2015)

Elasticsearch is built on a search library called Lucene. Elasticsearch is basically the easy-to-use RESTful API front end of Lucene (a RESTful API is an application program interface on which you can use the HTTP commands GET, PUT, POST, and DELETE to perform CRUD (Create, Retrieve, Update, Delete) actions).

To explain Lucene simply, it is a search library that can act as an alternative to using a relational database to store data.

How do Elasticsearch and Lucene actually work?

Elasticsearch is built on Lucene, so there is a lot of overlap in how the literature I read describes them; here is my understanding of how they work together.

Elasticsearch is a distributed search engine as I previously explained. Every Elasticsearch instance is called a node. Each node stores multiple shards which in turn store inverted indexes.

Now, the inverted indexes that Elasticsearch uses are the same as Lucene's, with the big difference between the two technologies being that Elasticsearch is distributed and provides an easy-to-use RESTful API for users to perform their searches.

What is an inverted index? It is a structure that matches a search term (the index value) with the frequency and locations at which it exists in the data (the search term and frequency form the part of the index called the 'mapping' or 'dictionary', while the location values form the part called the 'posting'). As the below image shows, the search (index) term 'blue' exists in the locations: documents 1, 2 and 3.

index.png

(“Elasticsearch Storage Architecture using Inverted Indexes | Java J2EEBrain,” n.d.)

Now Elasticsearch indexes data in inverted indexes (each data record in an index is known as a document), rather than in tables and rows as relational databases do.

The index terms are stored in something called the dictionary, an alphabetically ordered list in much the same way a printed dictionary is ordered, whilst the location pointer values are stored in a data structure called a posting.

When a person searches using Elasticsearch, the dictionary is checked to find an index term matching the user's query term; once a matching value has been found, the corresponding location values are retrieved from the postings list, so the locations of the search term are then known.
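The lookup described above can be sketched as a sorted dictionary consulted with binary search, with each matched term yielding its postings list (a simplified, single-term illustration; the term and posting values are made up for the example):

```python
import bisect

# Sorted term dictionary and, in parallel, the postings for each term.
terms    = ["apple", "blue", "cherry"]   # alphabetically ordered dictionary
postings = [[4], [1, 2, 3], [2, 5]]      # document IDs per term

def lookup(query_term):
    # Binary search the sorted dictionary, then return that term's postings.
    i = bisect.bisect_left(terms, query_term)
    if i < len(terms) and terms[i] == query_term:
        return postings[i]
    return []

print(lookup("blue"))   # -> [1, 2, 3]
print(lookup("grape"))  # -> []
```

Keeping the dictionary sorted is what makes the lookup fast: finding a term costs a binary search rather than a scan of every indexed word.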

NOTE: This is how I understand Elasticsearch works at this stage, my understanding may change as I do further reading and use of this technology for a future research journal entry.

Demo of me installing and using Elasticsearch

Now that I have learnt a bit about how Elasticsearch works, I want to have a go. I found this online tutorial very useful to follow.

Here is the process I went through of installing and using Elasticsearch:

  1. Download the zip file (from https://www.elastic.co/downloads/elasticsearch) and extract it to your documents folder.
  2. Run the elasticsearch.bat file (circled below) to start Elasticsearch

bat file to run.PNG

Note: This will run a batch file which must be run in the background whenever you want to use Elasticsearch

running bat file.PNG

3. In my browser I went to the URL: localhost:9200

This shows me my default cluster. Again the cluster is the collection of nodes (Elasticsearch instances) at the moment I should only have one node in my cluster.

elasticsearch runnign

4. That's all very nice, but how do I actually interact with Elasticsearch? Well, I can interact with it using one of two methods: either write curl commands into the command prompt/terminal, or use a web-based interface called Kibana. I chose the latter, which lets you write HTTP commands to create, delete, and retrieve information from Elasticsearch.

To install Kibana I went to https://www.elastic.co/downloads/kibana.

Issue: I had an issue installing Kibana because I had originally installed Elasticsearch 5.0.0, but Kibana would not work successfully unless Elasticsearch 5.2.2 was installed, as the screenshot of the error I got from Kibana shows:

error in kibana.PNG

I tried to install and run the batch file of Elasticsearch 5.2.2 however I could not work out how to uninstall Elasticsearch 5.0.0 (because Elasticsearch does not appear in ‘programs and apps’ to uninstall) and so Kibana kept throwing an error because Elasticsearch 5.0.0 was still running.

So I solved this issue by restarting my laptop, deleting the Elasticsearch 5.0.0 files and running the batch file for Elasticsearch 5.2.2 again.

After solving the Elasticsearch version error, I ran the Kibana batch file and in my browser I went to the URL: localhost:5601

Kibana loaded successfully, and so now I have an interface to use to work with Elasticsearch.

5. Like I mentioned earlier, Elasticsearch search functionality works by checking indexes to find matching data. Now, the actual use of Elasticsearch made it look more like a typical DBMS than the theory suggested, so I think I will need to do a bit more reading on how Elasticsearch uses indexes.

In the meantime, to create a simple index (with no fields in it) named ecommerce, I went to the Dev Tools interface and wrote in the command:

PUT /ecommerce

{
}

Which was successfully created as I got the response “acknowledged”: true, (as the below screenshot shows).

devtool.PNG

6. I then deleted this index by writing in: DELETE /ecommerce

7. I made an index named ecommerce again, this time with a mapping type.

What is a mapping type? ElasticSeach says “Each index has one or more mapping types, which are used to divide the documents in an index into logical groups”(“Mapping | Elasticsearch Reference [5.2] | Elastic,” n.d.).

I understand this to mean a mapping type is an organizational structure for placing information from the document (a data record, e.g. all the data describing a single item for sale) into the index. As this index is going to store documents about products an eCommerce company sells, the mapping type is named products.

Within the mapping type there are fields which will contain data values. For example there is a name field of type string. When I add a document (data record) to Elasticsearch I will have to give the name field a value so the document’s name value is searchable.

creating a mapping.PNG

8. Now, to add a data record to Elasticsearch so that it is searchable, I must use the HTTP PUT command and write in a value for every field in the mapping type for the ecommerce index.

added document to index no searchable.PNG

9. Now, going back to the home page of Kibana, I can create an index pattern also named ecommerce, and clicking on the Discover link I write in the title of the course "zend framework" and press enter to search my index for the words "zend framework". As you can see, every instance of the words "zend" and "framework" is highlighted.

search index.PNG

This shows that Elasticsearch looks through all the data values in the mapping type of each index to find a match for the search value. Remembering the knowledge I gained from my earlier readings, each index has two parts: a mapping (which I have created) and a posting (a location pointer defining which documents the search term exists in).

Now, at this stage I cannot align the theory of index mapping with the mapping type I have just made, because according to the theory the mapping is just a search term and a frequency counter, whereas in the mapping I made for the ecommerce index there were fields with data types, and into these fields I indexed a document (wrote a data value for each field). So the purpose seems quite different to me; further research is required for a future research journal entry.

What are the benefits of Elasticsearch?

  • Fast lookup time – According to the Elasticsearch website the lookup time is near real-time
  • Textual content in documents such as Word documents is searchable, because Elasticsearch provides full-text search functionality
  • Extendable – Due to the distributed nature of Elasticsearch it is easy to extend the technology as your data volumes increase, by setting up a new node to store additional inverted indexes

Conclusion

Overall it has been very interesting reading up on Elasticsearch and I feel I have a basic understanding of what it is now, so I can attend the development course with a bit of background in the subject.

Bibliography:

Elasticsearch Storage Architecture using Inverted Indexes | Java J2EEBrain. (n.d.). Retrieved March 11, 2017, from http://www.j2eebrain.com/java-J2ee-elasticsearch-storage-architecture-using-inverted-indexes.html

Vanderzyden, J. (2015, September 1). What is Elasticsearch, and How Can I Use It? Retrieved March 11, 2017, from https://qbox.io/blog/what-is-elasticsearch

Download Elasticsearch Free • Get Started Now | Elastic. (2017). Retrieved March 11, 2017, from https://www.elastic.co/downloads/elasticsearch

Mapping | Elasticsearch Reference [5.2] | Elastic. (n.d.). Retrieved March 11, 2017, from https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html