Elasticsearch part 3: The implementation

In my last blog post on Elasticsearch I covered the majority of the theory and commands that I learned at the Elasticsearch for developers course I completed at work. Now I want to have a play around with Elasticsearch and Kibana, the web interface you use to send RESTful API commands to Elasticsearch.

I decided to install both of these systems onto my NMIT Windows laptop. What was a seamless installation process last time on my other laptop turned into a more complicated troubleshooting exercise this time.

After starting both Elasticsearch (accessible at http://localhost:9200) and Kibana (accessible at http://localhost:5601) I saw that Kibana was throwing an error (see below) because it required a default index pattern.

error message in kibana

What is a default index pattern? It is an index (which, remember, is the equivalent of a relational database) or collection of indexes you want to interact with using Kibana. You can specify several indexes using the wildcard symbol in the index pattern input box.

So the first thing I had to do was to create an index. I used the example index from the Elasticsearch documentation (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html#indices-create-index), which is called ‘twitter’.

creating an index.PNG

Then after indexing a document (the equivalent of inserting a row into a table in a relational database), I set twitter* as the default index pattern, thereby removing the error I was getting.

An important point beginners with Elasticsearch need to know is that when you interact with Elasticsearch through Kibana you will be writing in sense syntax rather than curl, which is for use in a terminal. However the Kibana Dev Tools area, which is where you write sense syntax, is fantastic because it automatically converts curl commands into sense syntax. For example I copied and pasted the curl command

curl -XGET 'http://localhost:9200/_nodes'

And Kibana converted it to:

GET /_nodes

 

Insert data

Now for some fun with Elasticsearch…

I have created an index named cats

created cat index.PNG

Then I create a mapping type (the equivalent of a relational database’s table) automatically when indexing a document (creating a data record). How so? Well I use the PUT API (in Elasticsearch jargon an API is a command).

PUT cats/domestic/1

What this means is use the cats index, create a mapping type named ‘domestic’ and create a document with the ID of 1 in this mapping type.

Note that the ID number is optional when indexing a document: if you leave it out Elasticsearch auto-generates an ID, but you then use POST rather than PUT.
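For example, if I wanted Elasticsearch to generate the ID for me, a request along these lines should work (a sketch; the field value is invented for illustration):

POST cats/domestic
{
  "name": "Tom"
}

The response includes the auto-generated _id, which you would then use in later GET, UPDATE, and DELETE requests.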

entered cat

What is happening when I use the PUT API to index a document? Well, Kibana sends an index request to a node in the Elasticsearch cluster (a cluster being a collection of nodes, i.e. instances of Elasticsearch). The ID value (manually set or auto-generated) is hashed and used to find a matching shard to execute the index request on.

What is a shard? It is a conceptual object holding a collection of documents, which is what allows Elasticsearch to be distributed and extendable.

Once the matching shard has indexed the document, it is replicated to the replica shard (the backup shard).

Note: As you can see above you do not need to specify data types when creating or modifying Elasticsearch indexes.

Retrieve data

Now to retrieve the document I just indexed I need to use the GET API:

GET cats/domestic/1

What’s happening in the background when you send a GET API? Well, the ID in the request is hashed, and when the request arrives at a node in the Elasticsearch cluster the hashed ID is used to route the request to a shard holding a document with a matching hashed ID value.

How to check if a document exists

To check if a document exists in the index you can use the HEAD API

HEAD cats/domestic/1

Now this should return an HTTP status code of 200 if the document exists and 404 if it doesn’t exist. Except when I ran it in Kibana I got a fatal error.

 

error.PNG

Now it seems several other people have had issues running the Exists API in Kibana, as these forum posts show, none of which were satisfactorily answered.

https://unix.stackexchange.com/questions/253414/elasticsearch-error-on-head-command

https://github.com/elastic/elasticsearch-php/issues/391

However this GitHub source (https://github.com/elastic/kibana/pull/10611) suggests that the syntax for the Exists API is deprecated and I need to write:

HEAD cats/_mapping/domestic

However this produced the same error. I could not find any other useful suggestions online, so I will move on and ask Frederik, the trainer of the course, later.

Delete data

DELETE <index>/<mapping type>/<id>

delete.PNG

The background process when the DELETE API is run is the usual one: the ID in the request is hashed and used to route the request to the primary shard that the document lives in. After the document is deleted there, the primary shard updates its replica shards.

Point of interest: Write consistency

Now, all documents are written on a primary shard first, and this write can be (but doesn’t have to be) replicated to several replica shards.

If you have set up replica shards when you created the index, then you need to make sure a certain number of these shard copies are available when writing to Elasticsearch.

You need to have:

(primary + replicas) / 2 + 1 shards available to be written to

For example, with 1 primary shard and 2 replicas, (1 + 2) / 2 + 1 rounds down to 2, so at least 2 of the 3 shard copies must be available for the write to proceed.

 

Update data

I indexed another document

PUT cats/domestic/1
{
  "name": "Kelly",
  "age" : "1",
  "colour": "black and white"
}

Then to update this making the age 2 I wrote:

update.PNG
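In sense syntax the request in the screenshot looks roughly like this, using the Update API’s partial-document form (a sketch, not an exact copy of my command):

POST cats/domestic/1/_update
{
  "doc": {
    "age": "2"
  }
}

Only the fields inside "doc" need to be supplied; they are merged with the existing document.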

As I understand it, the background process for this command is that all fields, including ones not being updated, are replaced; fields not being updated are simply replaced with the same value. This is again performed on the primary shard first, and then replicated to the replica shards if applicable.

 

Get multiple documents simultaneously

I created another index named ‘president’, with the mapping type ‘individual’ and id ‘1’ for a document on George Washington.

Then to get the documents with id ‘1’ in cats/domestic and president/individual I perform a Multi Get API

multi get.PNG
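The Multi Get request in the screenshot looks roughly like this (a sketch):

GET _mget
{
  "docs": [
    { "_index": "cats", "_type": "domestic", "_id": "1" },
    { "_index": "president", "_type": "individual", "_id": "1" }
  ]
}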

 

Perform multiple different APIs simultaneously

To perform multiple different commands using Kibana you can use a Bulk API command. You can think of this like the equivalent of being able to perform a select, delete, update, and insert SQL query into multiple tables in a relational database in a single command.

When I first tried this command I wrote in the HTTP header PUT _bulk, and this resulted in an error:

bulk error.PNG

After some troubleshooting I found this was being caused by the \n characters, which need to be removed, and then it worked, like so:

worked.PNG
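For reference, a working bulk request in Kibana looks something like this, with each action and source document on its own line and no literal \n characters (a sketch; the document values are invented for illustration):

PUT _bulk
{ "index" : { "_index" : "cats", "_type" : "domestic", "_id" : "2" } }
{ "name" : "Tom", "age" : "3", "colour" : "grey" }
{ "delete" : { "_index" : "president", "_type" : "individual", "_id" : "2" } }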

 

Text analysis

Elasticsearch is very useful for searching text, because it can store the words from a text such as a book in the inverted index, in much the same way a book index holds keywords for readers to find easily.

The way we split text up so it can be stored in the inverted index for searching is using the Analyze API.

I started using this by specifying the HTTP header GET _analyze. I specified the tokenizer “keyword”, which stores the supplied string as one keyword combination rather than splitting it, and the filter “lowercase”, which lowercases my supplied text.

As you can see below ‘New South Wales’ has been transformed into ‘new south wales’

lowercase.PNG
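The Analyze API request for this looks roughly like so (a sketch of the Elasticsearch 5.x syntax):

GET _analyze
{
  "tokenizer": "keyword",
  "filter": ["lowercase"],
  "text": "New South Wales"
}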

Often for long text (like sentences) it is best to split the words up so they can be searched individually. You can do this by specifying the tokenizer “whitespace”. So using the Shakespearean sentence “You shall find of the king a husband, madam; you, sir, a father:” I used the whitespace tokenizer to split it up:

splits.PNG

If you want to learn more about what the analyzer is doing you can add the “explain”: true attribute.

Now the analyzer commands I have performed so far use the default analyzer on supplied text, but what if I wanted all data in a document I index to be analyzed and thereby made searchable?

Well, you can configure an analyzer in an index when creating the index.

analyzer in.PNG
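A minimal sketch of what this looks like, assuming a custom analyzer built from the standard tokenizer and the lowercase filter (the index, analyzer, and field names here are my own placeholders):

PUT myindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "mytype": {
      "properties": {
        "description": { "type": "text", "analyzer": "my_analyzer" }
      }
    }
  }
}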

To make the job of the tokenizer easier you can implement character filters, for example a filter that strips out HTML. This is also important for making the system more secure.

char filter.PNG
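For example, the built-in html_strip character filter can be tested with the Analyze API (a sketch):

GET _analyze
{
  "tokenizer": "standard",
  "char_filter": ["html_strip"],
  "text": "<h1>New South Wales</h1>"
}

The HTML tags are stripped out before the tokenizer ever sees the text.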

It is interesting how the different analyzers work; the English one does not just split the words up, it removes stop words (common words that add no value to a search query) and stems the remaining words. Below I wrote in the sentence from the course exercise: “It is unlikely that I’m especially good at analysis yet”, in which words like ‘unlikely’ are stored and indexed as their stem ‘unlik’.

english.PNG

Whereas all words are stored and indexed in their original form when using the standard analyzer.

standard analyzer.PNG

 

Mappings

Instead of letting Elasticsearch decide the data type of fields in an index you can specify it manually in the mappings. Like I said previously, the mapping type (the equivalent of a table in a relational database) is just a name: in my previous examples of cats/domestic/1 the mapping type name was ‘domestic’. However there are many attributes in an index that you can customize to make it match the business problem domain more closely.

Mappings are useful because they give Elasticsearch some idea of how the data is structured, even though there is no rigid schema.

I created an index named programminglanguage, with a mapping type of ‘OO’. I set the data type of the “name” field (which is circled below) to a string.

field altering.PNG
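The request behind the screenshot would look something like this (a sketch; note that in Elasticsearch 5.x the ‘string’ type is deprecated in favour of ‘text’ and ‘keyword’, which is why I use ‘text’ here):

PUT programminglanguage
{
  "mappings": {
    "OO": {
      "properties": {
        "name": { "type": "text" }
      }
    }
  }
}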

You can also update the mapping attributes in an index, however you need to keep in mind that you cannot remove a mapping type field.

To retrieve your mapping values for an index simply write in GET <indexname>/_mappings

Like so:

retrieve mappings.PNG

Now you can create objects inside documents in Elasticsearch. For example, by default the comments in my below tvseries index will be an inner object.

That means ‘comments’ is of data type ‘object’. (Elasticsearch also has a stricter ‘nested’ data type, which comes up in the data modelling section below.)

nested objects.PNG

If I want to reference a field in the comments nested object I have to write: comments.<fieldname>

How do you set a field to be searchable?

You use the ‘index’ attribute in the mappings:

  • Set it to ‘analyzed’ if you want the field searchable and passed through the analyzer.
  • Set it to ‘not_analyzed’ if you want the field searchable but not passed through the analyzer.
  • Set it to ‘no’ if you don’t want the field searchable.

 

Index templates

An index template is a good way to create an index fast, without having to write it out manually. Once you have the mappings customized to your business problem domain you can apply them to multiple similar indexes using a template. I like to think of this like inheritance hierarchies in Object Oriented programming: you place all the common features in the superclass and all subclasses inherit them, thereby only having to write them once.

To create a template you use a PUT request:

PUT _template/tv_template

This is creating a template in the _template area named tv_template

template 1.PNG
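The body of the template looks something like this (a sketch; the “tv*” pattern and settings are my own illustrative values):

PUT _template/tv_template
{
  "template": "tv*",
  "settings": {
    "number_of_shards": 1
  }
}

Any new index whose name matches tv* will then pick up these settings automatically.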

Like with indices you can delete, retrieve and retrieve all templates using similar commands. As I have not covered how to retrieve all I will do so now, it is very simple:

GET /_template

Searching

Elasticsearch can perform two kinds of searches on the searchable values (setting values to searchable is described further above):

  • Structured query (checking for an exact, boolean match on the keywords the user entered. Equivalent to a SELECT SQL query with a WHERE clause. Either there is a match for the WHERE clause or there isn’t)
  • Unstructured query (not just looking for exact matches but ranking the query output, so this is a continuum rather than boolean answer. Also known as a full text search).

Elasticsearch uses a JSON-based query language called the Query DSL. (A quick Google of this mainly turns up Querydsl, an “extensive Java framework for the generation of type-safe queries in a syntax similar to SQL” (Chapman, 2014), but that is a separate Java project, not Elasticsearch’s Query DSL.)

Now search uses the GET HTTP header. To set up a structured query you use the ‘filter’ attribute, and to set up an unstructured query you use the ‘query’ attribute, which gives all results a score.

Using QueryDSL we can write a single query (known as a leaf query clause) or multiple queries in a single statement (known as compound query clauses).

Where I want to query (retrieve) all documents in an index I can use the match_all attribute:

GET programminglanguage/OO/_search
{
  "query": {
    "match_all" : {}
  }
}

This is the equivalent of a SELECT * query in SQL and it is perfectly acceptable to use.

Note: The above match_all query is an unstructured query because it uses the ‘query’ attribute.

If you want to limit the number of ranked results displayed in an unstructured query then you can specify the number of results you want with the ‘size’ attribute.
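For example (a sketch):

GET programminglanguage/OO/_search
{
  "query": {
    "match_all": {}
  },
  "size": 5
}

This returns at most 5 ranked results instead of the default 10.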

 

How to use your search query

Now if you use the ‘match’ attribute in your query then the search term goes through the analysis chain to tidy it up, and is then used for an unstructured query.

Whereas if you use the ‘term’ attribute then whatever the user wrote in is compared exactly to what is in the inverted index and a structured query is performed.

So in the below example I am performing an unstructured query.

match.PNG
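The match query in the screenshot looks roughly like this (a sketch):

GET cats/domestic/_search
{
  "query": {
    "match": { "name": "Kelly" }
  }
}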

 

To make your unstructured query more fine-grained there are 3 types of unstructured queries for you to choose to implement.

Boolean – This is effectively a structured query: whatever you enter as a search term is used to find an exact match in the inverted index, otherwise no hits are found.

I have a document with the name “Kelly” in the cats/domestic index and mapping type, so trying the bool query searching for the name “K” I got no results, because I have no document with the name “K” in cats/domestic.

book.PNG

Whereas when I perform this bool query using the name “Kelly” I get 1 hit, because there is exactly 1 document with the name “Kelly”.

bool 2.PNG

 

 

Phrase – This treats the entered search term as a phrase

match_phrase_prefix – This query splits up the values in the string the user entered as a search term, and allows the last word to be incomplete. In the below example I just used “K”, and so Elasticsearch looks at a dictionary of sorted words and puts the first 50 into the query one at a time.

match 2.PNG
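A sketch of the request in the screenshot:

GET cats/domestic/_search
{
  "query": {
    "match_phrase_prefix": { "name": "K" }
  }
}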

 

The query_string query is interesting: it is built rather like a SQL query in that you can use OR and AND. So in the below example I am searching cats/domestic with the phrase “(blue OR black) AND (white OR red)” without specifying the field name, and I am getting the correct result.

query string.PNG
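A sketch of this query_string search (when no field is given it searches the _all field):

GET cats/domestic/_search
{
  "query": {
    "query_string": {
      "query": "(blue OR black) AND (white OR red)"
    }
  }
}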

Suggesters

Suggesters are faster than search queries, although a suggester can also be implemented on top of a search query.

What a suggester allows the query to do is suggest values similar to the user’s search term. So for example if the user misspelt the name as “Kellu” then the suggester could suggest “Kelly”, which is a similar term.
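A sketch of a term suggester that would catch that misspelling (in Elasticsearch 5.x suggesters are sent in the _search body; the suggestion name is my own placeholder):

GET cats/_search
{
  "suggest": {
    "name-suggestion": {
      "text": "Kellu",
      "term": { "field": "name" }
    }
  }
}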

How Elasticsearch works

Search queries in Elasticsearch go through 3 stages, here is a summary on what I understand them to be:

  1. Pre-query – This is checking the number of times a word exists across the documents (gathering term frequencies before the query proper). This is only feasible where Elasticsearch has a small data set.
  2. Query – This is checking through the inverted indexes for a matching value. It is achieved by running the search query on all shards holding the index we are asking for, until a matching value is found which points the search query to the document ID that holds this value. This is a useful resource for finding out more about the Query phase: https://www.elastic.co/guide/en/elasticsearch/guide/current/distributed-search.html
  3. Fetch – This returns the documents (whose document IDs were listed alongside the search term in the inverted index) to the user.

Deep pagination – What this concept means is that to return a deep page of results, every shard has to produce and sort all of the results up to that page, even if you just want 10 documents. As you can imagine this is very inefficient on a large data set. It is best avoided.

Elasticsearch is known for its speed, and a contributing factor is the request cache. As indexes are sharded across multiple shards, when a search query is run on an index it is run individually on all the shards and the resulting shard results are combined to form the total result. However each shard keeps a copy of its own results, meaning that if someone queries the same value again it will exist in a shard’s request cache and can be returned much faster than having to search the shard again.

 

Aggregations

Aggregations are a framework that helps the user learn more about the search results they have found with a search query (“Aggregations | Elasticsearch Reference [5.3] | Elastic,” n.d.).

There are three main types:

Bucket aggregations – Group of documents that meet a specific factor. You can think of this like finding common features in a whole lot of different documents and grouping them in ‘buckets’ based on these common features.

Metric aggregations – Calculate statistical data about a collection of documents or buckets.

Pipeline aggregations – Combine the insights generated from other aggregations. This is an aggregation on an aggregation.

Kibana can use visualization tools to create graphs and maps using aggregations.

You implement aggregations using the “aggregations” attribute in the search query.
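For example, a minimal metric aggregation over my little cats index might look like this (a sketch, and it assumes age had been mapped as a numeric field rather than the string I actually used):

GET cats/_search
{
  "size": 0,
  "aggregations": {
    "oldest": {
      "max": { "field": "age" }
    }
  }
}

"size": 0 suppresses the search hits so only the aggregation result is returned.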

I am unable to perform many of the aggregation commands due to having a small data set, however a summary of the aggregation commands available is:

Sum aggregation – This adds together values of the same field in multiple documents

Min/max aggregation – Display the highest or lowest value of a field in all documents in an aggregation.

Multiple metrics aggregation – Display both the highest and lowest values for a field in all documents in an aggregation.

Terms aggregation – This returns the most common values of a particular field in all documents in an aggregation (the top 10 by default).

Missing aggregation – Find documents in an aggregation that do not have a specified value.

Filter aggregation – This is what is used to create bucket aggregations.

Significant terms aggregation – This finds strangely common values, by comparing how common a value is within the bucket against how common it is in the total data source the bucket was collected from.

It is important not to nest too many aggregations in a single command, because they are very resource hungry and you can end up crashing your system; this occurrence is called combinatorial explosion.

 

Data Modelling

If you choose to use Elasticsearch as a data store in addition to or replacing a relational database management system then you will need to perform data modelling to transform your existing data into something useful for Elasticsearch.

There are several paradigm shifts you will have to make for this process to be possible. Firstly you need to understand that duplicate data is fine in Elasticsearch, as it makes searching faster. This goes against what we are taught for relational database design and so it is not initially intuitive.

Now, to take data stored in relational tables with relationships between one another into Elasticsearch, we can do one of three things:

Denormalise the data into a single document: This flattens the data out, so if you had 2 tables in a direct relationship then you can place all columns and data into a single Elasticsearch mapping type. Making the data structure flat is what makes it searchable.

Nested objects: Nested objects are one way to store the relationship between two relational tables in some form. For example in a relational database you may have two tables, ‘tvseries’ and ‘comments’. These tables have the relationship that a tv series has one or many comments, and a comment belongs to one tv series.

To transfer this relationship to Elasticsearch which does not use foreign keys to map relationships we can place ‘comment’ as a nested object in the ‘tvseries’.

nested objects.PNG

This means the foreign key relationship is lost but we have been able to retain some of the relationship between the two logical groups of data by nesting one inside the other.

Now, in order to still be able to query the nested object separately from the root object it is stored in, we use nested queries, as in the sketch below.
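A sketch of a nested query, assuming ‘comments’ had been mapped with the ‘nested’ data type and has an ‘author’ field (both assumptions of mine):

GET tvseries/_search
{
  "query": {
    "nested": {
      "path": "comments",
      "query": {
        "match": { "comments.author": "Jan" }
      }
    }
  }
}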

Parent/child objects: Another way to map the relationship between logical groups of data is parent/child objects. I did this using the pet owning example I have used previously:

The parent object will be the owner, and the child object will be the cat. Here’s the steps I went through to create this parent/child object combination.

  1. Create an index named “petowning”

setting it up.PNG

2. Create the parent object which is the owner

parent.PNG

 

3. Create the child object which is the cat

child.PNG
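Putting the three steps together, a sketch of the whole parent/child setup in Elasticsearch 5.x (the field values are invented for illustration):

PUT petowning
{
  "mappings": {
    "owner": {},
    "cat": {
      "_parent": { "type": "owner" }
    }
  }
}

PUT petowning/owner/1
{ "name": "Alex" }

PUT petowning/cat/1?parent=1
{ "name": "Kelly" }

The ?parent=1 routing parameter is what ties the cat document to owner 1.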

 

Now each of these three methods has advantages and disadvantages, which need to be considered against your system requirements when you are performing data modelling:

Flattened data: Uses less data than nested objects and parent/child objects, but the data relationships are totally lost

Nested objects: Faster, but less flexible (because the root object that the nested object is held in must be re-indexed whenever the nested object is updated)

Parent/child objects: Less fast but more flexible

 

Relevancy

Elasticsearch by default uses the TF/IDF (Term Frequency/Inverse Document Frequency) family of algorithms to determine how relevant a document is to a query (strictly speaking, as of Elasticsearch 5.0 the default similarity is BM25, a refinement of TF/IDF).

This algorithm works by weighing how often a term appears in a document (term frequency) against how often it appears across all documents (inverse document frequency). What this means is that a rarer, more specific search term contributes more to a document’s rank.
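To make the intuition concrete, here is a simplified sketch of the classic Lucene TF/IDF scoring (the real formula adds normalization and boosting factors):

tf(term, doc) = sqrt(number of times term appears in doc)
idf(term) = 1 + ln(total number of docs / (number of docs containing term + 1))
score(term, doc) ∝ tf(term, doc) × idf(term)

So a term that appears often in one document but rarely in the rest of the data set scores that document highly.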

 

Percolator

Instead of saying which documents match a query, the percolator does the opposite: it outputs the queries that match a document.

To be able to output the search queries that match a document we have to store the search queries as JSON documents. However this is no problem, because search queries are written in the Query DSL (as I have previously discussed) and that is just JSON.

Below I am storing a search query to find a cat with the name “Violet”

storing query.PNG
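A sketch of how this works in Elasticsearch 5.x, where stored queries live in a field of type ‘percolator’ (the index, type, and field names here are my own placeholders):

PUT saved_searches
{
  "mappings": {
    "saved": {
      "properties": {
        "query": { "type": "percolator" },
        "name": { "type": "text" }
      }
    }
  }
}

PUT saved_searches/saved/1
{
  "query": {
    "match": { "name": "Violet" }
  }
}

GET saved_searches/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "document_type": "saved",
      "document": { "name": "Violet" }
    }
  }
}

The final search returns the stored queries that match the supplied document.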

 

So there we have it: explanations and tested examples (run by me on Elasticsearch 5.3 and Kibana 5.3 and shown as screenshots) of the many different functions that Elasticsearch and Kibana provide. I hope this has been interesting and useful for you. I personally have found it fascinating to go through almost all of the commands I learned on my course in more depth and understand them better.

 

Bibliography

Chapman, B. (2014, June 11). What Can Querydsl Do for Me Part 1: How to Enhance and Simplify Existing Spring Data JPA Repositories. Retrieved April 16, 2017, from https://www.credera.com/blog/technology-insights/java/can-querydsl-part-1-enhance-simplify-existing-spring-data-jpa-repositories/

Aggregations | Elasticsearch Reference [5.3] | Elastic. (n.d.). Retrieved April 16, 2017, from https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html

 

Elasticsearch Part 2

On Friday I attended an excellent Elasticsearch basics for developers course at work, and I would like to discuss what I learned and how it has changed my view of Elasticsearch since I wrote about it a couple of weeks back.

What is Elasticsearch?

Elasticsearch is usually thought of as a search engine but it is more than that, Elasticsearch can also be considered a:

  • Data store, meaning in addition to using it as a search engine for your app/system you could also use it as an alternative to a Relational Database Management System (RDBMS). Elasticsearch stores data in documents, mapping types and indexes, which are the equivalent of a relational database’s rows, tables, and databases respectively.

 

  • Reporting tool – Elasticsearch can be used to store system logs. Kibana, the web interface you use to interact with Elasticsearch, can generate visual charts from Elasticsearch search query results, for example the below visual charts of system log information. These charts present the data in a far more useful format than a written system log.

kibana visualisation

(Christopher, 2015)

Something that particularly interested me about the course was that the presenter Frederik said that Elasticsearch is very flexible and extremely useful, as long as you’re prepared to spend time configuring it.

A lot of people implement Elasticsearch (which is actually pretty easy, as I found last week) and expect it to be the equivalent of Google for their organization’s data; however if you don’t configure it to match your business problem domain then it will not reach its full potential.

What is the internal structure of Elasticsearch?

Elasticsearch is built on top of Lucene, which is a search library. In the documentation it is very hard to determine where one ends and the other begins, however I believe, having done the course and read through the first answer in this very interesting StackOverflow page (http://stackoverflow.com/questions/15025876/what-is-an-index-in-elasticsearch), that I have a good understanding of this now, so let’s test it out.

I look at Elasticsearch and Lucene as a 2-layered cake (to see this graphically look at the below diagram, where we have Elasticsearch as the top layer and Lucene as the bottom layer); the top layer (Elasticsearch) is the one that the user interacts with. When you first install Elasticsearch a cluster is created (a cluster is a collection of 1 or more nodes, i.e. instances of Elasticsearch).

Inside this cluster by default you have 1 node (a single instance of Elasticsearch). This node contains indexes. Now, an index is like a database instance. Drilling down further we have mapping types (which are the equivalent of tables; for example you could create a mapping type of student). Inside a mapping type there are documents (a document is a single data record, making it the equivalent of a row in a database), and inside each indexed document there are properties, which are the individual data values (so for example 22 years old would be a property for age).

To put the document into perspective it is just a JSON data structure.

So we have established that Elasticsearch stores the data in indexes, with each data record known as a document.

But how does Elasticsearch actually find specific data when someone writes an HTTP GET request into Kibana? Well, that’s where Lucene comes in. Lucene is the bottom of the two layers in my cake simile. Lucene contains its own index, which is an inverted index: instead of storing the data it points to the indexed documents in the Elasticsearch index where each data value is stored, in much the same way a book index points to the page numbers where a particular word exists.

Another good analogy is that an inverted index is quite similar to containers such as arrays and dictionaries, which point to a specific location in memory where a particular value is stored, rather than storing the value itself in their data structure.

Having done the course I now believe I understand how the index in Elasticsearch and the index in Lucene relate.

lucene and es.png

(Principe, 2013)

Now as I said by default your Elasticsearch cluster has one node, however Elasticsearch is extendable meaning you can add more nodes to your cluster.

Within each node there are shards (by default an index is divided into 5 of them). What is a shard? “A shard is a single Lucene instance. It is a low-level “worker” unit which is managed automatically by elasticsearch. An index is a logical namespace which points to primary and replica shards.” (“Glossary of terms | Elasticsearch Reference [5.3] | Elastic,” n.d.). In other words each node holds the Elasticsearch indexes outside the shards and the Lucene inverted indexes inside the shards.

Each shard has a backup in the form of a replica shard, which is stored on a different node. This provides data redundancy and speeds up search times, because it is more likely that an HTTP GET request is sent to a shard containing an inverted index with the search term in it.

What is the process that happens when the client RESTful API sends a request to the cluster?

In Elasticsearch the commands are called APIs, so for example a delete command is called the Delete API.

Now like I previously stated Elasticsearch is structured as a collection of nodes in a cluster (think of it like how there are multiple servers in the concept of the cloud).

The nodes store different information (in the form of Lucene inverted indexes and Elasticsearch indexes) so the request needs to go to a particular node to access particular data. However all the nodes store information about the topology of the cluster, so they know what node contains the data the API command seeks/wants to modify.

When you write an HTTP GET request in Kibana, the ID specified in the GET request is hashed and then the hashed ID is sent to a node in the cluster. It doesn’t matter which node the request is sent to, as it will redirect the request to the appropriate node if it doesn’t itself store the matching hashed ID.

However, to make sure that the same node is not always queried, the destination node of each search query is chosen on a round-robin basis.

How Elasticsearch data storage violates normalisation and refactoring

Elasticsearch is all about fast search times; to achieve this, having duplicated data in multiple indexes is considered acceptable.

This is in complete contrast to the database concept of normalization and the programming concept of refactoring both of which stress the need to remove duplicate data/code.

What are the differences between Elasticsearch and a relational database

Although Elasticsearch can be used as a data store, meaning you could implement it as an alternative to a relational database, the differences are:

  • Elasticsearch does not use foreign keys to create relationships between indexes
  • Data can be duplicated to speed up the query time
  • Query joins (querying two or more indexes in a single query) are not available in any effective way; rudimentary joins can be implemented, but they are not very efficient

So when should you replace your RDBMS with Elasticsearch? Well, it depends on the sorts of queries you have performed/want to perform on your primary data store. If you need complex transactional queries (multiple statements executed as a single atomic unit) then Elasticsearch is not ideal, and you would be better off using a RDBMS such as MySQL and just using Elasticsearch as a search engine.

However if you don’t need complex transactional queries then Elasticsearch is a good alternative to a RDBMS.

What are the cons of Elasticsearch?

Elasticsearch is not ideal from a security point of view, because out of the box it does not provide data or transport encryption.

It is near realtime – This means there is a slight latency after indexing a document before you can search for the data it holds.

What are the benefits of Elasticsearch?

The main benefits of Elasticsearch are:

  • It is fast – Due to the data being duplicated in multiple shards, it is faster to access data in either the primary or replica shards
  • It is distributed – Meaning it is easy to extend by creating another node in your cluster
  • High availability – By holding each inverted index twice, in a primary and a replica shard, the indexes are more readily available

 

Starting Elasticsearch on Linux

Last week I installed and used Elasticsearch on a Windows machine; now I want to cover how to use Elasticsearch on a Linux machine:

  1. Download both Elasticsearch and Kibana (the versions used in my course were Elasticsearch 5.1.1 and Kibana 5.1.1; there are more recent versions of both systems available, so there may be version conflicts, which become visible once you visit Kibana in your browser. If there are version issues, simply install the version of Elasticsearch or Kibana specified on the Kibana interface).
  2. Start two terminal windows. In one terminal navigate to the Elasticsearch directory and start Elasticsearch by writing in:

./elasticsearch-5.1.1/bin/elasticsearch

3. In the other terminal navigate to the Kibana directory and write in:
./kibana-5.1.1-linux-x86_64/bin/kibana

4. Now in a browser visit Elasticsearch by writing in the URL:
http://localhost:9200

5. Also in your web browser visit Kibana by writing in the URL:

http://localhost:5601/app/kibana

Because you now interact with Elasticsearch through Kibana in the web browser everything is the same from this stage no matter what OS you are using.

 

Examples of Elasticsearch API commands

Create Index API – This creates an index which we can then use to index documents (create data records)

PUT student
{
  "settings" : { … },
  "mappings" : { … }
}

In this Create Index API command you can specify the number of shards and replicas you want the index to have, e.g.

PUT student
{
  "settings" : {
    "number_of_shards" : 1,
    "number_of_replicas" : 1
  }
}

Index API – Here I am specifying that I want to use the ‘student’ index, creating a ‘degree’ mapping type, and specifying that the ID of the document I am indexing is 1. Then I index the document itself. By performing the Index API I am automatically creating a mapping type of degree.

Note: Specifying the ID value is optional; if you omit it you use POST instead of PUT, and Elasticsearch generates an ID for you.

PUT student/degree/1
{
  "name" : "Alexander Buckley",
  "alt_names" : [ "Alex" ],
  "origin" : "Nelson"
}

 

If I wanted the request to fail when a document with ID 1 already exists (rather than overwriting it) I would need to write

PUT student/degree/1/_create
{
  "name" : "Alexander Buckley",
  "alt_names" : [ "Alex" ],
  "origin" : "Nelson"
}

GET API – This retrieves the data for a specific document.

e.g.

GET student/degree/1

 

Exists API – This checks if there is a document with a particular id in a index.

e.g.

HEAD student/degree/1

This is checking if there is a document with the id of 1 in the student index and degree mapping type.

 

Delete API – This deletes a document from an index by specifying the id of the document you want deleted in the Delete API command

DELETE student/degree/1

Write consistency – Before any of these write API commands is performed, more than half of the shard copies (primary plus replicas) need to be available, because it is dangerous to write to a single shard.

Versioning – Elasticsearch uses versioning to keep track of every write operation to a document. The version is assigned to a document when it is indexed and is automatically incremented with every write operation.
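For example (a sketch): indexing the same document twice shows the version incrementing in the response.

PUT student/degree/1
{ "name" : "Alexander Buckley" }

The first response contains "_version" : 1; repeating the same PUT returns "_version" : 2.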

Update API – This allows you to update parts of a document. To do this you write the changes to specific properties of the document in the ‘doc’ attribute. So in the below example I am updating the name property of the document with the ID of 1. This will be merged with the existing document.

POST student/degree/1/_update
{
  "doc" : {
    "name" : "Alex Buckley"
  }
}

 

Multi Get API – This is where you can request multiple documents from specific indexes and mapping types. What’s returned is a docs array.

GET _mget
{
  "docs" : [
    {
      "_index" : "student",
      "_type" : "degree",
      "_id" : 1
    },
    {
      "_index" : "student",
      "_type" : "degree",
      "_id" : 2,
      "_source" : [ "origin" ]
    }
  ]
}

 

Bulk API – To perform multiple different API commands simultaneously. Elasticsearch splits the command up and sends the parts off to the appropriate nodes. If two requests are reading/manipulating data on the same node then they are sent together.

PUT _bulk
{ "delete" : { "_index" : "student", "_type" : "degree", "_id" : 2 } }
{ "index" : { "_index" : "student", "_type" : "degree", "_id" : 3 } }
{ "name" : "Jan Smith", "alt_names" : [ "Janet Smith" ], "origin" : "Wellington" }

(Note: in Kibana each action and document goes on its own line; you do not type the \n characters shown in curl examples.)

In this example I am deleting a document from the student index and indexing (adding) another document, all with a single Bulk API call. The benefit of this is that once the API request has been redirected to the node containing the student index, multiple API commands can be performed there, which is obviously more efficient.

Search – Text analysis

Unlike the range of commands for the CRUD actions in the above section, for search we use the GET API. The request is sent from the Kibana client to the Elasticsearch cluster, and redirected from the node that received the command to a node containing the inverted index with a matching search value (known as a token).

This Lucene inverted index contains 3 columns: the first column is the search token, the second column is the number of documents it exists in, and the third column is the IDs of the documents it exists in.

e.g.

token        docfreq.        postings (doc ids)

Janet          2                  3, 5

If we were using a GET API to find all instances of the word ‘Janet’ we would be returned with the documents 3 and 5.

When indexing a document you can use the index attribute to specify what fields you want to be searchable. This attribute can have one of three values:

  • analyzed: Make the field searchable and put it through the analyzer chain
  • not_analyzed: Make the field searchable, and don’t put it through the analyzer chain
  • no: Don’t make the field searchable

But what is the analyzer chain?

OK, so the values from the indexed documents are placed in the Lucene inverted index, and that is what is queried when using Elasticsearch as a search engine. If we have a string we want to be searchable then we often have to tidy it up a bit to make it more easily searchable; that’s where the analyzer chain comes in. It performs the following actions:

  1. The first step is the char filter, this removes any HTML syntax. e.g. the string “<h1> This is a heading </h1>” would become “This is a heading”.

2. The second step is the tokeniser, which splits the string up into individual tokens (words).

3. The third step is the token filter chain, which usually does the following (though you can specify the filters you want):

  • Removes stop words like ‘a’
  • Makes all letters in each word lower case
  • Replaces similar words with their stem word. In other words the two words “run” and “running” are similar, and so instead of writing them both to the inverted index we replace them with the single stem word “run”. Replacing similar words with stem words is automated by a stemming algorithm.

Interestingly, all user query terms go through the same analyzer chain before they are compared against the inverted index, provided the user uses the ‘match’ attribute in their search query (which will be discussed below).

Search

Elasticsearch can perform two types of search:

  • Structured query – This is a boolean query in the sense that either a match for the search query is found or it isn’t. It is used for keyword searches.
  • Unstructured query – This can be used for searching for phrases, and it ranks the matches by how relevant they are. It can also be called a fuzzy search, because rather than treating the results in a boolean way as the structured query does, saying they are either a match or not, it returns results that exist on a continuum of relevancy.

 

Search queries:

match_all: This is the equivalent of SELECT * in SQL queries. The below example will return all documents in the student index and degree mapping type.

GET student/degree/_search
{
  "query" : {
    "match_all" : {}
  }
}

Note: Only the top 10 results are returned by default, so even if you perform the match_all query you will still only get 10 results back. But this can be customized.

 

If you want to search fields that have not been analyzed (i.e. that haven’t gone through the analyzer chain when the document was indexed) then you want to use the ‘term’ attribute.

However if you want to query a field that has been analyzed (i.e. it has gone through the analyzer chain) then you will use the match attribute.

e.g.

GET student/degree/_search
{
  "query" : {
    "match" : { "name" : "Jan Smith" }
  }
}

This means the term Jan Smith will go through the analyzer chain before it is compared against values in the inverted index.

The multi_match attribute can be used to find a match in multiple fields, i.e. it will put multiple search values through the analyzer chain in order to find a matching value in the inverted index.

GET student/degree/_search
{
  "query" : {
    "multi_match" : {
      "fields" : [ "name", "origin" ],
      "query" : "Jan Smith Wellington"
    }
  }
}

I will discuss the other search queries in my next Elasticsearch blog to prevent this one getting too long.

 

Mappings

When we index a document we specify a mapping type, which is kind of like a table in a relational database, or a class in the OO paradigm, because it has particular properties which all documents of that mapping type have values for.

The benefit of mapping types is that they make your indexes match the problem domain more closely. For example by making a mapping type of degree I am making the documents that I index in the degree mapping type a more specific type of student.

To save time when creating indexes with the same mapping type we can place the mapping type in a template and just apply the template to the index.

e.g.

PUT _template/book_template
{
  "template" : "book*",
  "settings" : {
    "number_of_shards" : 1
  },
  "mappings" : {
    "_default_" : {
      "_all" : {
        "enabled" : false
      }
    }
  }
}

This template is applied automatically to any new index whose name matches the “book*” pattern. You can also use dynamic templates to map fields based on a naming pattern, for example:

PUT _template/book_wildcard
{
  "template" : "book*",
  "mappings" : {
    "question" : {
      "dynamic_templates" : [
        {
          "integers" : {
            "match" : "*_i",
            "mapping" : { "type" : "integer" }
          }
        }
      ]
    }
  }
}

Note: It is recommended that you only assign a single mapping type to an index.

Conclusion

I have learned a lot from the Elasticsearch course and will continue to discuss what I learned in the next Elasticsearch blog.

 

Bibliography:

Christopher. (2015, April 16). Visualizing data with Elasticsearch, Logstash and Kibana. Retrieved March 26, 2017, from http://blog.webkid.io/visualize-datasets-with-elk/

Principe, F. (2013, August 13). ELASTICSEARCH what is | Portale di Francesco Principe. Retrieved March 26, 2017, from http://fprincipe.altervista.org/portale/?q=en/node/81

Glossary of terms | Elasticsearch Reference [5.3] | Elastic. (n.d.). Retrieved March 30, 2017, from https://www.elastic.co/guide/en/elasticsearch/reference/current/glossary.html

What is Elasticsearch?

I thought I would write a research journal entry about Elasticsearch, because I will be attending a development course on this technology on the 24th of March for my work and so I think it will be advantageous to try to understand what Elasticsearch is, and how it works.

The reason I am going to be doing the development course on Elasticsearch is that the Koha Integrated Library Management System (ILMS) that I work on is currently implementing Elasticsearch.

For a start I want to mention that trying to find understandable information about Elasticsearch has been a challenge, to say the least. I am a big picture person, and so when researching a technology like Elasticsearch I like to know how it fits in with the other tiers of a software architecture. This information was lacking in the Elasticsearch and Wikipedia pages, however I was able to find the following links which gave me enough knowledge to understand and explain what Elasticsearch is:

https://www.sitepoint.com/building-recipe-search-site-angular-elasticsearch/

http://www.elasticsearchtutorial.com/basic-elasticsearch-concepts.html

elasticsearch.png

What is Elasticsearch?

Elasticsearch is an open source search engine server. It is used to provide search functionality for web apps by indexing data records (known as ‘documents’), which are written in JSON. Effectively Elasticsearch works just like a book index, where authors place specific words they think readers will want to find along with the locations where those words appear throughout the book. Developers can implement Elasticsearch in their web apps if search functionality is required.

With the proliferation of big data (large amounts of unstructured data, for example data in Word documents) Elasticsearch implements an effective indexing system allowing users to generate information (a refined form of data) from the data it points to.

It is distributed, which means the data that it stores is kept on multiple different instances. Now, I originally thought that this meant Elasticsearch must be run on multiple different physical machines; however you can create multiple instances of Elasticsearch on a single machine, and this will make a cluster of multiple nodes, much the same as a cluster of RAID disks.

The distributed nature of Elasticsearch means that as data volumes increase all you have to do is create a new node in the cluster to store additional data indexes.

Elasticsearch is built using Java, so having Java installed on the machines Elasticsearch will be run on is required.

Examples of companies/web applications that use Elasticsearch are GitHub, Facebook, CERN, and Netflix; with these big players using it, it’s obviously a popular technology.

A specific example of Elasticsearch in use is: “Stack Overflow combines full-text search with geolocation queries and uses more-like-this to find related questions and answers.” (Vanderzyden, 2015)

Elasticsearch is built on a search library called Lucene. Elasticsearch is basically the easy-to-use RESTful API front end of Lucene (a RESTful API being an application program interface on which you write the HTTP commands GET, PUT, POST, and DELETE to perform CRUD (Create, Retrieve, Update, Delete) actions).

To explain Lucene simply, it is basically an alternative to using a relational database to store data.

How do Elasticsearch and Lucene actually work?

Elasticsearch is built on Lucene, so there is a lot of overlap in how they are described in the literature I read; here is my understanding of how they work together.

Elasticsearch is a distributed search engine as I previously explained. Every Elasticsearch instance is called a node. Each node stores multiple shards which in turn store inverted indexes.

Now the inverted indexes that Elasticsearch uses are the same as Lucene’s, with the big differences between these two technologies being that Elasticsearch is distributed and that it provides an easy-to-use RESTful API for users to perform their searches.

What is an inverted index? It is a file that matches a search term (the index value) with the frequency with which it occurs (the search term and frequency form the part of the index called the ‘mapping’ or ‘dictionary’) and the locations where it exists in the data (the location values form the part of the index called the ‘postings’). As the below image shows, the search (index) term ‘blue’ exists in documents 1, 2 and 3.

index.png

(“Elasticsearch Storage Architecture using Inverted Indexes | Java J2EEBrain,” n.d.)

Now Elasticsearch indexes data in inverted indexes (each data record in an index is known as a document), rather than in tables and rows as relational databases do.

The index terms are stored in something called a dictionary, which is an alphabetically ordered list in much the same way a human dictionary is ordered, whilst the location pointer values are stored in a data structure called a postings list.

When a person searches using Elasticsearch, the dictionary is checked to find an index term matching the user’s query term. Once a matching value has been found, the corresponding location values are retrieved from the postings list in order, and the location of the search term is then known.

NOTE: This is how I understand Elasticsearch works at this stage, my understanding may change as I do further reading and use of this technology for a future research journal entry.

Demo of me installing and using Elasticsearch

Now that I have learnt a bit about how Elasticsearch works I want to have a go, I found this online tutorial very useful to follow:

Here is the process I went through of installing and using Elasticsearch:

  1. Download the zip file (from https://www.elastic.co/downloads/elasticsearch) and extract it to your documents folder.
  2. Run the elasticsearch.bat file (circled below) to start Elasticsearch

bat file to run.PNG

Note: This will run a batch file which must be run in the background whenever you want to use Elasticsearch

running bat file.PNG

3. In my browser I went to the URL: localhost:9200

This shows me my default cluster. Again the cluster is the collection of nodes (Elasticsearch instances) at the moment I should only have one node in my cluster.

elasticsearch runnign

4. That’s all very nice, but how do I actually interact with Elasticsearch? Well, I can interact with it using one of two methods: either write curl commands into the command prompt/terminal, or use Kibana, a web-based interface for writing HTTP commands to create, delete, and retrieve information from Elasticsearch. I chose the latter.

To install Kibana I went to https://www.elastic.co/downloads/kibana.

Issue: I had an issue installing Kibana, because I had originally installed Elasticsearch 5.0.0 but Kibana would not work successfully unless Elasticsearch 5.2.2 was installed, as the screenshot of the error I got from Kibana shows:

error in kibana.PNG

I tried to install and run the batch file of Elasticsearch 5.2.2, however I could not work out how to uninstall Elasticsearch 5.0.0 (because Elasticsearch does not appear in ‘programs and apps’ to uninstall), and so Kibana kept throwing an error because Elasticsearch 5.0.0 was still running.

So I solved this issue by restarting my laptop, deleting the Elasticsearch 5.0.0 folder and running the batch file for Elasticsearch 5.2.2 again.

After solving the Elasticsearch version error, I ran the Kibana batch file and in my browser I went to the URL: localhost:5601

Kibana loaded successfully, and so now I have a web interface to use to work with Elasticsearch.

5. Like I mentioned earlier, Elasticsearch search functionality works by checking indexes to find matching data. Now, the actual use of Elasticsearch made it look more like a typical DBMS than the theory suggested, so I think I will need to do a bit more reading on how Elasticsearch uses indexes.

In the meantime, to create a simple index (with no fields in it) named ecommerce I went to the Dev Tools interface and wrote in the command:

PUT /ecommerce

{
}

Which was successfully created as I got the response “acknowledged”: true, (as the below screenshot shows).

devtool.PNG

6. I then deleted this index by writing in: DELETE /ecommerce

7. I made an index named ecommerce again, this time with a mapping type.

What is a mapping type? ElasticSeach says “Each index has one or more mapping types, which are used to divide the documents in an index into logical groups”(“Mapping | Elasticsearch Reference [5.2] | Elastic,” n.d.).

I understand this to mean a mapping type is an organizational structure for placing information from the document (a data record, e.g. all the data describing a single item for sale) into the index. As this index is going to store documents about products an eCommerce company sells, the mapping type is named products.

Within the mapping type there are fields which will contain data values. For example there is a name field of type string. When I add a document (data record) to Elasticsearch I will have to give the name field a value so the document’s name value is searchable.

creating a mapping.PNG
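The command in the screenshot was along these lines (a sketch rather than an exact copy; note that recent Elasticsearch versions prefer ‘text’ over the deprecated ‘string’ type):

PUT /ecommerce
{
  "mappings": {
    "products": {
      "properties": {
        "name": { "type": "text" }
      }
    }
  }
}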

8. Now to add a data record to Elasticsearch so that it is searchable, I must use the HTTP PUT command and write in a value for every field in the mapping type for the ecommerce index.

added document to index no searchable.PNG

9. Now going back to the home page of Kibana I can create an index pattern, also named ecommerce, and clicking on the Discover link I write in the title of the course “zend framework” and press enter to search my index for the words “zend framework”. As you can see every instance of the words “zend” and “framework” is highlighted.

search index.PNG

This shows that Elasticsearch looks through all the data values in the mapping type of each index to find a match for the search value. Recalling my readings from earlier, each index has two parts: a mapping (which I have created) and a posting (a location pointer defining which documents the search term exists in).

Now at this stage I cannot align the theory of index mapping with the mapping type I have just made, because according to the theory the mapping is just a search term and a frequency counter, whereas in the mapping I made for the ecommerce index there were fields with data types, and into these fields I indexed a document (wrote a data value for each field). So the purpose seems quite different to me; further research is required for a future research journal entry.

What are the benefits of Elasticsearch?

  • Fast lookup time – According to the Elasticsearch website the lookup time is near real-time
  • Textual content in documents such as Word documents is searchable, because Elasticsearch provides full-text search functionality
  • Extendable – Due to the distributed nature of Elasticsearch it is easy to extend the technology as your data volumes increase, by setting up a new node to store additional inverted indexes

Conclusion

Overall it has been very interesting reading up on Elasticsearch and I feel I have a basic understanding of what it is now, so I can attend the development course with a bit of background in the subject.

Bibliography:

Elasticsearch Storage Architecture using Inverted Indexes | Java J2EEBrain. (n.d.). Retrieved March 11, 2017, from http://www.j2eebrain.com/java-J2ee-elasticsearch-storage-architecture-using-inverted-indexes.html

Vanderzyden, J. (2015, September 1). What is Elasticsearch, and How Can I Use It? Retrieved March 11, 2017, from https://qbox.io/blog/what-is-elasticsearch

Download Elasticsearch Free • Get Started Now | Elastic. (2017). Retrieved March 11, 2017, from https://www.elastic.co/downloads/elasticsearch

Mapping | Elasticsearch Reference [5.2] | Elastic. (n.d.). Retrieved March 11, 2017, from https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html