I thought I would write a research journal entry about Elasticsearch, because I will be attending a development course on this technology on the 24th of March for my work and so I think it will be advantageous to try to understand what Elasticsearch is, and how it works.
The reason I am going to be doing the development course on Elasticsearch is that the Koha Integrated Library Management System (ILMS) that I work on is currently implementing Elasticsearch.
For a start I want to mention trying to find an understandable information about Elasticsearch has been a challenge to say the least. I am a big picture person and so when researching a technology like Elasticsearch I like to know how it fits in with the other tiers of a software architecture. This information was lacking in the ElasticSearch and Wikipedia pages, however I was able to find the following links which gave me enough knowledge to understand and explain what Elasticsearch is:
What is Elasticsearch?
Elasticsearch is an open source, search engine server. It is used to provide search functionality for web apps by indexing data records (known as ‘documents’. These documents are written in JSON. Effectively Elasticsearch works just like a book index where authors place specific words they think readers will want to find and the location where these words appear throughout the book). Developers can implement ElasticSearch in their web apps if search functionality is required.
With the proliferation of big data (large amounts of unstructured data for example data in Word documents) Elasticsearch implements a effective indexing system allowing users to be able to generate information (a refined form of data) from the data it points to.
It is distributed which means the data that it stores is kept on multiple different instances. Now I originally thought that this meant that Elasticsearch must be run on multiple different physical machines however, you can create multiple instances of Elasticsearch on a single machine and this will make a cluster of multiple nodes. Much the same as a cluster of RAID disks.
The distributed nature of Elasticsearch means as data volumes increase all you have to do is create a new node in the cluster to store additional data indexes.
Elasticsearch is built using Java, so having Java installed on the machines Elasticsearch will be run on is required.
Examples of companies/web applications that use Elasticsearch are GitHub, Facebook, CERN, and Netflix, so with these big players using it, its obviously a popular technology.
Whist a specific examples of Elasticsearch use when implemented is: “Stack Overflow combines full-text search with geolocation queries and uses more-like-this to find related questions and answers.”(Vanderzyden, 2015)
Elasticsearch is built on a storage system called Lucene. ElasticSearch is basically the easy to use RESTful API (this is a application program interface that you can write HTTP commands GET, PUT, POST, and DELETE on to perform CRUD( Create, Retrieve, Update, Delete) actions) front end of Lucene.
To explain Lucene simply it is basically an alternative to using a relational database to store data.
How do Elasticsearch and Lucene actually work?
Elasticsearch is built on Lucene so there is a lot of overlap describing them in the literature I read, here is my understanding of how they work together.
Elasticsearch is a distributed search engine as I previously explained. Every Elasticsearch instance is called a node. Each node stores multiple shards which in turn store inverted indexes.
Now the inverted indexes that Elasticsearch uses are the same as Lucene, with the big difference between these two technologies being Elasticsearch is distributed and it provides an easy to use RESTful API for users to use to perform their search.
What is an inverted index? It is a file that matches a search term (the index value) with the frequency (the search term and frequency form part of the index called ‘mapping’ or ‘dictionary’) and location (the location value form part of the index called ‘posting’) that it exists in the data. As the below image shows the search (index) term ‘blue’ exists in the locations: documents 1,2 and 3.
(“Elasticsearch Storage Architecture using Inverted Indexes | Java J2EEBrain,” n.d.)
Now Elasticsearch indexes data in inverted indexes (each data record in an index is known as a document), rather than in tables and rows as relational databases do.
The index terms are stored in something called dictionary, this is an alphabetically ordered list in much the same way a human dictionary is ordered, whilst the location pointer values are stored in a data structure called a posting.
When a person searches using Elasticsearch the dictionary is checked to find a index term matching the users query term once a matching value has been found a union query is performed to retrieve the corresponding location value in the postings list in order. Now that the location of the search term is known.
NOTE: This is how I understand Elasticsearch works at this stage, my understanding may change as I do further reading and use of this technology for a future research journal entry.
Demo of me installing and using Elasticsearch
Now that I have learnt a bit about how Elasticsearch works I want to have a go, I found this online tutorial very useful to follow:
Here is the process I went through of installing and using Elasticsearch:
- Download the zip file (from https://www.elastic.co/downloads/elasticsearch) and extract it to your documents folder.
- Run the elasticsearch.bat file (circled below) to start Elasticsearch
Note: This will run a batch file which must be run in the background whenever you want to use Elasticsearch
3. In my browser I went to the URL: localhost:9200
This shows me my default cluster. Again the cluster is the collection of nodes (Elasticsearch instances) at the moment I should only have one node in my cluster.
4. That’s all very nice but how do I actually interact with Elasticsearch? Well I can interact with it using one of two methods: Either write curl commands into the command prompt/terminal or you can use a RESTful API called Kibana. I chose to use the latter which is just a web based interface to write HTTP commands to create, delete, and retrieve information from Elasticsearch.
To install Kibana I went to https://www.elastic.co/downloads/kibana.
Issue: I had an issue installing Kibana because I had originally installed Elasticsearch.5.0.0 but Kibana would not working successfully unless Elasticsearch 5.2.2 was installed. As the screenshot of the error I got from Kibana shows:
I tried to install and run the batch file of Elasticsearch 5.2.2 however I could not work out how to uninstall Elasticsearch 5.0.0 (because Elasticsearch does not appear in ‘programs and apps’ to uninstall) and so Kibana kept throwing an error because Elasticsearch 5.0.0 was still running.
So I solved this issue by restarting my laptop, deleting the Elasticsearch 5.0.0 file and running the batch file for Elaticsearch 5.2.2 again.
After solving the Elasticsearch version error, I ran the Kibana batch file and in my browser I went to the URL: localhost:5601
Kibana loaded successfully and so now I have a RESTful API to use to work with Elasticsearch
5. Like I mentioned earlier Elasticsearch search functionality works by checking indexes to find matching data.Now the actual use of Elasticsearch made it look more like a typical DBMS than the theory suggested so I think I will need to do a bit more reading up on how Elasticsearch uses indexes.
In the meantime to create an simple index (with no fields in it) named ecommerce I went to the devtools interface and wrote in the command:
Which was successfully created as I got the response “acknowledged”: true, (as the below screenshot shows).
6. I then deleted this index by writing in: DELETE /ecommerce
7. I made a index named ecommerce again, this time with mapping type.
What is a mapping type? ElasticSeach says “Each index has one or more mapping types, which are used to divide the documents in an index into logical groups”(“Mapping | Elasticsearch Reference [5.2] | Elastic,” n.d.).
I understand this to mean a mapping type is a organizational structure for placing information from the document (data record e.g. all the data describing a single item for sale) into the index. As this index is going to store documents (these are data records) about products a eCommerce company sells, then the mapping type is named products.
Within the mapping type there are fields which will contain data values. For example there is a name field of type string. When I add a document (data record) to Elasticsearch I will have to give the name field a value so the document’s name value is searchable.
8. Now to add a data record to Elasticsearch so that it is searchable I must use the HTTP PUT command and write in a value for every field in the mapping type for the ecommerce index.
9. Now going back to the home page of Kibana I can create a index also named ecommerce and clicking on the discover link I write in the title of the course “zend framework” and click enter to search my index for the words “zend framework”. As you can see every instance of the words “zend” and “framework” is highlighted.
This shows that Elasticsearch looks through all the data values in the mapping type of each index to find a match for the search value. Remembering the knowledge I learned from my readings earlier I remembered that each index has two parts a mapping (which I have created) and a posting (which is a location pointer defining in which documents the search term exists in).
Now at this stage I cannot align the theory of index mapping with the mapping type I have just made, because according to the theory the mapping is just a search term and a frequency counter, well in the mapping I made for the ecommerce index there were fields with data types, and into these fields I indexed a document (wrote a data value for each field). So the purpose seems quite different to me, further research is required for a future research journal entry.
What are the benefits of Elasticsearch?
- Fast lookup time – According to the Elasticsearch website the lookup time is near real-time
- Textual content in documents such as WordDocs is searchable because Elasticseach provides fulltext search functionality
- Extendable – Due to the distributed nature of Elasticsearch it is easy to extend the technology as your data volumes increase, by setting up a new node to store additional inverted indexes
Overall it has been very interesting reading up on Elasticsearch and I feel I have a basic understanding of what it is now, so I can attend the development course with a bit of background in the subject.
Elasticsearch Storage Architecture using Inverted Indexes | Java J2EEBrain. (n.d.). Retrieved March 11, 2017, from http://www.j2eebrain.com/java-J2ee-elasticsearch-storage-architecture-using-inverted-indexes.html
Vanderzyden, J. (2015, September 1). What is Elasticsearch, and How Can I Use It? Retrieved March 11, 2017, from https://qbox.io/blog/what-is-elasticsearch
Download Elasticsearch Free • Get Started Now | Elastic. (2017). Retrieved March 11, 2017, from https://www.elastic.co/downloads/elasticsearch
Mapping | Elasticsearch Reference [5.2] | Elastic. (n.d.). Retrieved March 11, 2017, from https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html