Elasticsearch part 3: The implementation

In my last blog post on Elasticsearch I covered the majority of the theory and commands that I learned at the Elasticsearch for Developers course I completed at work. Now I want to have a play around with Elasticsearch and Kibana, the front end whose Dev Tools console you use to interact with Elasticsearch's RESTful API.

I decided to install both of these systems onto my NMIT Windows laptop. What was a seamless installation process last time on my other laptop turned into a more complicated troubleshooting exercise this time.

After starting both Elasticsearch (accessible at http://localhost:9200) and Kibana (accessible at http://localhost:5601) I saw that Kibana was throwing an error (see below) because it required a default index pattern.

error message in kibana

What is a default index pattern? It is an index (which, remember, is roughly the equivalent of a relational database) or collection of indexes that you want to interact with using Kibana. You can specify several indexes at once using the wildcard symbol in the index pattern input box.

So the first thing I had to do was to create an index. I used the example index from the Elasticsearch documentation (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html#indices-create-index), which is called 'twitter'.

creating an index.PNG
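
For reference, the create index request from that documentation page looks roughly like this (the shard and replica counts are simply the documentation's example values):

PUT twitter
{
  "settings" : {
    "index" : {
      "number_of_shards" : 3,
      "number_of_replicas" : 2
    }
  }
}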

Then, after indexing a document (the equivalent of inserting a row into a table in a relational database), I set twitter* as the default index pattern, thereby removing the error I was getting.

An important point Elasticsearch beginners need to know is that when you interact with Elasticsearch through Kibana you will be writing Sense syntax rather than curl, which is for use in a terminal. However the Kibana Dev Tools area, which is where you write Sense syntax, is fantastic because it automatically converts curl commands into Sense syntax. For example I copied and pasted the curl command

curl -XGET 'http://localhost:9200/_nodes'

And Kibana converted it to:

GET /_nodes

 

Insert data

Now for some fun with Elasticsearch…

I have created an index named cats

created cat index.PNG

Then I created a mapping type (the equivalent of a relational database's table) automatically when indexing a document (creating a data record). How so? Well, I used the PUT API (in Elasticsearch jargon an API is a command).

PUT cats/domestic/1

What this means is use the cats index, create a mapping type named ‘domestic’ and create a document with the ID of 1 in this mapping type.

Note that the ID number is optional: if you leave it off and use POST instead of PUT, Elasticsearch will auto-generate an ID for you.

entered cat
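
The full request behind that screenshot was along these lines; the field names and values here are the ones I use for this cat throughout the post, so treat it as a sketch rather than an exact copy of the screenshot:

PUT cats/domestic/1
{
  "name": "Kelly",
  "age": "1",
  "colour": "black and white"
}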

What is happening when I use the PUT API to index a document? Well, Kibana sends an index request to a node in the Elasticsearch cluster (a collection of nodes, i.e. instances of Elasticsearch). The ID value (manually set or auto-generated) is hashed and used to find the matching shard on which to execute the index operation.

What is a shard? It is a conceptual object holding a collection of documents, which allows Elasticsearch to be distributed and extendable.

Once the matching primary shard has indexed the document, the operation is replicated to the replica shard (the backup shard).

Note: As you can see above you do not need to specify data types when creating or modifying Elasticsearch indexes.

Retrieve data

Now to retrieve the document I just indexed I need to use the GET API:

GET cats/domestic/1

What's happening in the background when you send a GET request? Well, the ID in the request is hashed, and when the request arrives at a node in the Elasticsearch cluster the hashed ID is used to route the request to a shard holding a copy of that document.

How to check if a document exists

To check whether a document exists in the index you can use the HEAD API

HEAD cats/domestic/1

Now this should return an HTTP status code of 200 if the document exists and 404 if it doesn't exist. Except when I ran it in Kibana I got a fatal error.

 

error.PNG

Now it seems several other people have had issues running the Exists API in Kibana as these forum question posts show, none of which were satisfactorily answered.

https://unix.stackexchange.com/questions/253414/elasticsearch-error-on-head-command

https://github.com/elastic/elasticsearch-php/issues/391

However this github source (https://github.com/elastic/kibana/pull/10611) suggests that the syntax for the Exists API is deprecated and I need to write:

HEAD cats/_mapping/domestic

However this produced the same error. I could not find any other useful suggestions online and so I will move on, and ask the trainer of the course Frederik later.

Delete data

DELETE index/mapping type/id

delete.PNG
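
The command in the screenshot is simply the DELETE verb plus the document's coordinates, for example the cat document indexed earlier:

DELETE cats/domestic/1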

The background process when the DELETE API is run is, as usual, that the ID in the request is hashed and used to route the request to the primary shard that this document lives in; after the document is deleted there, the primary shard updates its replica shards.

Point of interest: Write consistency

Now, all documents are written on a primary shard, and this write can be (but doesn't have to be) replicated to several replica shards.

If you have set up replica shards when you created the index, then you need to make sure a certain number of these shards are available when writing to Elasticsearch.

You need to have:

(primary + replicas) / 2 + 1 shards available to be written to
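
For example, if an index has 1 primary shard and 2 replica shards, then (1 + 2) / 2 + 1 = 2 of those shards must be available for the write to go ahead.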

 

Update data

I indexed another document

PUT cats/domestic/1
{
  "name": "Kelly",
  "age" : "1",
  "colour": "black and white"
}

Then to update this making the age 2 I wrote:

update.PNG
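
The screenshot is not reproduced here, but a partial update in Sense syntax would look roughly like this, using the standard _update endpoint with a "doc" wrapper (the field value comes from my earlier document):

POST cats/domestic/1/_update
{
  "doc": {
    "age": "2"
  }
}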

As I understand it, the background process for this command is that the whole document is re-indexed: all fields, including the ones not being updated, are replaced, so fields that are not changing are simply replaced with the same value. This is again performed on the primary shard first, and then replicated to the replica shards if applicable.

 

Get multiple documents simultaneously

I created another index named ‘president’, with the mapping type ‘individual’ and id ‘1’ for a document on George Washington.

Then to get the documents with id ‘1’ in cats/domestic and president/individual I perform a Multi Get API

multi get.PNG
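
In Sense syntax the Multi Get request behind that screenshot looks roughly like this, using the two sets of document coordinates described above:

GET _mget
{
  "docs": [
    { "_index": "cats", "_type": "domestic", "_id": "1" },
    { "_index": "president", "_type": "individual", "_id": "1" }
  ]
}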

 

Perform multiple different APIs simultaneously

To perform multiple different commands using Kibana you can use a Bulk API command. You can think of this as the equivalent of being able to perform insert, update and delete SQL queries against multiple tables in a relational database in a single command.

When I first tried this command I wrote the HTTP header PUT _bulk, and this resulted in an error:

bulk error.PNG

After some troubleshooting I found this was being caused by the literal \n characters, which need to be removed (in the Kibana console each action and each document simply goes on its own line), and then it worked, like so:

worked.PNG
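
For anyone trying this themselves, a working bulk request in the Kibana console looks roughly like the sketch below: each action line and each document line sits on its own line, with no literal \n characters (the second cat here is made up purely for illustration):

POST _bulk
{ "index": { "_index": "cats", "_type": "domestic", "_id": "2" } }
{ "name": "Tom", "age": "3", "colour": "grey" }
{ "delete": { "_index": "cats", "_type": "domestic", "_id": "2" } }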

 

Text analysis

Elasticsearch is very useful for searching text, because it can store the words from a text such as a book in the inverted index in much the same way a book index holds keywords for readers to find easily.

The way we split text up so it can be stored in the inverted index for searching is analysis, and you can experiment with it using the Analyze API.

I started using this by specifying the HTTP header GET _analyze. I specified the tokenizer "keyword", which stores the supplied string as a single token rather than splitting it, and the filter "lowercase", which lowercases my supplied text.

As you can see below ‘New South Wales’ has been transformed into ‘new south wales’

lowercase.PNG
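
The request in the screenshot is along these lines, a sketch of the keyword tokenizer plus lowercase filter combination described above:

GET _analyze
{
  "tokenizer": "keyword",
  "filter": ["lowercase"],
  "text": "New South Wales"
}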

Often for long text (like sentences) it is best to split the words up so they can be searched individually. You can do this by specifying the tokenizer "whitespace". So, using the Shakespearean sentence "You shall find of the king a husband, madam; you, sir, a father:", I used the whitespace tokenizer to split it up:

splits.PNG
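
The whitespace version is the same kind of request with a different tokenizer (again a sketch rather than the exact screenshot):

GET _analyze
{
  "tokenizer": "whitespace",
  "text": "You shall find of the king a husband, madam; you, sir, a father:"
}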

If you want to learn more about what the analyser is doing you can implement the “explain”: true attribute.

Now the analyze commands I have performed so far run the analysis chain on text supplied in the request, but what if I wanted all the data in a document I index to be analyzed and thereby made searchable?

Well, you can configure an analyzer in an index when creating the index.

analyzer in.PNG
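
A minimal sketch of doing this, using a made-up index name and analyzer name rather than the exact contents of the screenshot, would be:

PUT myindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  }
}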

To make the job of the tokenizer easier you can implement character filters; for example you can filter out HTML. This would be very important for making the system more secure.

char filter.PNG
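
For example, testing the built-in html_strip character filter through the Analyze API looks roughly like this (the sample text is made up):

GET _analyze
{
  "tokenizer": "standard",
  "char_filter": ["html_strip"],
  "text": "<b>I love</b> cats"
}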

It is interesting how the different analyzers work; the English one does not just split the words up, it also removes stop words (common words that add no value to a search query) and stems what is left. Below I wrote in the sentence from the course exercise, "It is unlikely that I'm especially good at analysis yet", which has words like 'unlikely' stored and indexed as the stem 'unlik'.

english.PNG

Whereas the standard analyzer keeps every word: it lowercases them but does not stem them or remove stop words.

standard analyzer.PNG
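
The two requests behind those screenshots were along these lines; only the analyzer name changes:

GET _analyze
{
  "analyzer": "english",
  "text": "It is unlikely that I'm especially good at analysis yet"
}

GET _analyze
{
  "analyzer": "standard",
  "text": "It is unlikely that I'm especially good at analysis yet"
}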

 

Mappings

Instead of letting Elasticsearch decide the data type of fields in an index you can specify it manually in the mappings. Like I said previously, the mapping type (which is the equivalent of a table in a relational database) is just a name; in my previous examples such as cats/domestic/1 the mapping type name was 'domestic'. However there are many attributes in a mapping that you can customise to make the index match the business problem domain more closely.

Mappings are useful because they give you some idea of how the data is structured, even though Elasticsearch does not enforce a rigid schema.

I created an index named programminglanguage, with a mapping type of 'OO'. I set the data type of the "name" field (which is circled below) to a string.

field altering.PNG
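
A minimal sketch of that mapping, bearing in mind that on Elasticsearch 5.x a string field is declared as either text (analysed) or keyword (not analysed), would be:

PUT programminglanguage
{
  "mappings": {
    "OO": {
      "properties": {
        "name": { "type": "text" }
      }
    }
  }
}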

You can also update the mapping attributes of an index; however, you need to keep in mind that you cannot remove an existing field from a mapping type.

To retrieve your mapping values for an index simply write in GET <indexname>/_mappings

Like so:

retrieve mappings.PNG

Now you can create objects inside documents in Elasticsearch; for example, the comments field in my tvseries index below holds an inner object.

By default an inner object like 'comments' is given the data type 'object' (the stricter 'nested' type has to be declared explicitly in the mappings).

nested objects.PNG

If I want to reference a field in the comments nested object I have to write: comments.<fieldname>
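
If you want the stricter nested behaviour you declare it explicitly in the mappings; a sketch with a made-up mapping type name and comment fields would be:

PUT tvseries
{
  "mappings": {
    "series": {
      "properties": {
        "title": { "type": "text" },
        "comments": {
          "type": "nested",
          "properties": {
            "author": { "type": "keyword" },
            "body": { "type": "text" }
          }
        }
      }
    }
  }
}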

How do you set a field to be searchable?

You use the ‘index’ attribute in the mappings. You set it to ‘analyzed’ if you want it searchable and it goes through the analyzer.

You set it to not_analyzed if you want it searchable but don’t want it to go through the analyzer.

You set it to ‘no’ if you don’t want it searchable.

 

Index templates

An index template is a good way to create an index quickly, without having to write everything out manually. So once you have the mappings customised to your business problem domain you can apply them to multiple similar indexes using a template. I like to think of this like inheritance hierarchies in Object Oriented programming: you place all the common features in the superclass and all subclasses inherit them, thereby only having to write them once.

To create a template you use a PUT HTTP header:

PUT _template/tv_template

This is creating a template in the _template area named tv_template

template 1.PNG
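
A sketch of such a template, assuming it should apply to every index whose name starts with tv (the settings and mapping shown here are made up):

PUT _template/tv_template
{
  "template": "tv*",
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "series": {
      "properties": {
        "title": { "type": "text" }
      }
    }
  }
}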

As with indices, you can delete templates, retrieve a single template and retrieve all templates using similar commands. As I have not covered how to retrieve all templates I will do so now; it is very simple:

GET /_template

Searching

Elasticsearch can perform two kinds of searches on the searchable values (setting values to searchable is described further above):

  • Structured query (checking for an exact, boolean match on the keywords the user entered. Equivalent to a SELECT SQL query with a WHERE clause. Either there is a match for the WHERE clause or there isn’t)
  • Unstructured query (not just looking for exact matches but ranking the query output, so this is a continuum rather than boolean answer. Also known as a full text search).

Elasticsearch uses a query language called QueryDSL. A quick Google of this found that it is an "extensive Java framework for the generation of type-safe queries in a syntax similar to SQL" (Chapman, 2014).

Now search uses the GET HTTP header; to set up a structured query you use the 'filter' attribute, and to set up an unstructured query you use the 'query' attribute, which gives all results a score.

Using QueryDSL we can write a single query (known as a leaf query clause) or multiple queries in a single statement (known as compound query clauses).

Where I want to query (retrieve) all documents in an index I can use the match_all attribute:

GET programminglanguage/OO/_search
{
  "query": {
    "match_all" : {}
  }
}

This is the equivalent of a SELECT * query in SQL and it is perfectly acceptable to use.

Note: the above match_all query is an unstructured query because it uses the 'query' attribute.

If you want to limit the number of ranked results displayed in an unstructured query then you can specify the number of results you want with the ‘size’ attribute.
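
For example, limiting the match_all query above to the top 5 ranked results (a sketch):

GET programminglanguage/OO/_search
{
  "size": 5,
  "query": {
    "match_all": {}
  }
}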

 

How to use your search query

Now if you use the 'match' attribute in your query then the search term goes through the analysis chain to tidy it up, and is then used for an unstructured query.

Whereas if you use the 'term' attribute then whatever the user wrote is compared exactly to what is in the inverted index, and a structured query is performed.

So in the below example I am performing an unstructured query.

match.PNG
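
As a sketch of the difference, the first query below analyses the search term before matching, while the second compares it verbatim against the inverted index (so against an analysed name field, which stores the lowercased token "kelly", the capital-K term query may well return nothing):

GET cats/domestic/_search
{
  "query": {
    "match": { "name": "Kelly" }
  }
}

GET cats/domestic/_search
{
  "query": {
    "term": { "name": "Kelly" }
  }
}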

 

To make your unstructured query more fine-grained there are 3 types of unstructured queries for you to choose to implement.

Boolean – This is effectively a structured query: whatever you enter as a search term is used to find an exact match in the inverted index, otherwise no hits are found.

I have a document with the name "Kelly" in the cats/domestic index and mapping type, so when I tried the bool query searching for the name "K" I got no results, because I have no document with the name "K" in cats/domestic.

book.PNG

Whereas when I perform this bool query using the name "Kelly" I get 1 hit; this is because there is exactly 1 document with the name "Kelly".

bool 2.PNG

 

 

Phrase – This treats the entered search term as a phrase

match_phrase_prefix – This query splits up the string the user entered as a search term and treats the final word as a prefix. If the user has not completed a word, like in the below example where I just used "K", then Elasticsearch looks at a dictionary of sorted terms, expands the prefix to the first 50 matching terms, and puts those into the query.

match 2.PNG
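
The request behind that screenshot was roughly:

GET cats/domestic/_search
{
  "query": {
    "match_phrase_prefix": { "name": "K" }
  }
}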

 

The query_string query is interesting: it is built rather like a SQL query in that you can use OR and AND. So in the below example I am searching cats/domestic with the query string "(blue OR black) AND (white OR red)" without specifying a field name, and I am getting the correct result.

query string.PNG
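
A sketch of that query_string request (when no field is specified it runs against the _all field):

GET cats/domestic/_search
{
  "query": {
    "query_string": {
      "query": "(blue OR black) AND (white OR red)"
    }
  }
}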

Suggesters

Suggesters are faster than full search queries, although a suggester can be attached to a search query as well.

What a suggester allows the query to do is to suggest values similar to the user's search term. So for example if the user misspelt the name and wrote in "Kellu", then the suggester could suggest "Kelly", which is a similar term.
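
A sketch of a term suggester doing exactly that, assuming the name field from my cats index:

GET cats/_search
{
  "suggest": {
    "name-suggestion": {
      "text": "Kellu",
      "term": {
        "field": "name"
      }
    }
  }
}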

How Elasticsearch works

Search queries in Elasticsearch go through 3 stages; here is a summary of what I understand them to be:

  1. Pre-query – This is checking the number of times a word exists in a particular document. This is only possible where Elasticsearch has a small data set.
  2. Query – This is checking through the inverted indexes for a matching value; this is achieved by running the search query on all shards holding the index we are asking for, until a matching value is found which points the search query to the document ID that holds this value. This is a useful resource for finding out more about the Query phase: https://www.elastic.co/guide/en/elasticsearch/guide/current/distributed-search.html
  3. Fetch – This returns the documents (whose document ID was listed alongside the search term in the inverted index) to the user.

Deep pagination – What this concept means is that if you page a long way into a result set, every shard still has to find and sort all of the documents leading up to the page you asked for, even if you only want 10 of them; as you can imagine this is very inefficient on a large data set. It is best avoided.

Elasticsearch is known for its speed, and a contributing factor is the request cache. As indexes are spread across multiple shards, when a search query is run on an index it is run individually on all the shards and then the per-shard results are combined to form the total result. However each shard keeps a copy of its own results, meaning that if someone queries the same value again it will already exist in a shard's request cache and can be returned much faster than having to search the shard again.

 

Aggregations

Aggregations are a framework that helps the user learn more about the search results they have found with a search query ("Aggregations | Elasticsearch Reference [5.3] | Elastic," n.d.).

There are three main types:

Bucket aggregations – Group of documents that meet a specific factor. You can think of this like finding common features in a whole lot of different documents and grouping them in ‘buckets’ based on these common features.

Metric aggregations – Calculate statistical data about a collection of buckets.

Pipeline aggregations – Combine the insights generated from other aggregations. This is an aggregation on an aggregation.

Kibana can use visualization tools to create graphs and maps using aggregations.

You implement aggregations using the “aggregations” attribute in the search query.
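
For example, a terms aggregation that buckets my cats by colour would look roughly like this (this assumes the colour field was dynamically mapped as text, which on 5.x gives it a .keyword sub-field that can be aggregated on):

GET cats/_search
{
  "size": 0,
  "aggregations": {
    "colours": {
      "terms": {
        "field": "colour.keyword"
      }
    }
  }
}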

I am unable to perform many of the aggregation commands due to having a small data set, however a summary of the aggregation commands available is:

Sum aggregation – This adds together values of the same field in multiple documents

Min/max aggregation – Display the highest or lowest value of a field in all documents in an aggregation.

Multiple metrics aggregation – Display both the highest and lowest values for a field in all documents in an aggregation.

Terms aggregation – This returns the most frequently occurring values for a particular field across all documents in an aggregation (the top 10 by default).

Missing aggregation – Find documents in an aggregation that do not have a specified value.

Filter aggregation – This restricts the aggregation to the documents matching a filter, and is one of the main ways of creating bucket aggregations.

Significant terms aggregation – This finds 'strangely common' values, by comparing how common a value is within the bucket against how common it is in the total data source the bucket aggregation was collected from.

It is important not to nest too many aggregations in a single command because they are very resource hungry and you can end up crashing your system; this occurrence is called combinatorial explosion.

 

Data Modelling

If you choose to use Elasticsearch as a data store in addition to or replacing a relational database management system then you will need to perform data modelling to transform your existing data into something useful for Elasticsearch.

There are several paradigm shifts you will have to make for this process to be possible. Firstly you need to understand that duplicate data is fine in Elasticsearch as it makes searching faster; this goes against what we are taught for relational database design and so it is not initially intuitive.

Now, to take data stored in relational tables with relationships between one another into Elasticsearch, we can do one of three things:

Denormalise the data into a single document: This flattens the data out so if you had 2 tables in a direct relationship then you can place all columns and data into a single Elasticsearch mapping type. This is making the data structure flat so it is searchable.

Nested objects: Nested objects are one way to store the relationship between two relational tables in some form. For example in a relational database you may have two tables 'tvseries' and 'comments'. These tables have the relationship that a tvseries has one or many comments, and a comment belongs to one tvseries.

To transfer this relationship to Elasticsearch which does not use foreign keys to map relationships we can place ‘comment’ as a nested object in the ‘tvseries’.

nested objects.PNG

This means the foreign key relationship is lost but we have been able to retain some of the relationship between the two logical groups of data by nesting one inside the other.

Now in order to still be able to use the nested object and the root object it is stored in we need to be able to query them separately so we use nested queries.
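
A sketch of such a nested query, assuming a comments field of type nested with a body sub-field as in the earlier mapping sketch:

GET tvseries/_search
{
  "query": {
    "nested": {
      "path": "comments",
      "query": {
        "match": { "comments.body": "brilliant" }
      }
    }
  }
}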

Parent/child objects: Another way to map the relationship between logical groups of data is parent/child objects. I did this using the pet owning example I have used previously:

The parent object will be the owner, and the child object will be the cat. Here are the steps I went through to create this parent/child object combination.

  1. Create an index named "petowning"

setting it up.PNG

2. Create the parent object which is the owner

parent.PNG

 

3. Create the child object which is the cat

child.PNG
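
Put together, the three steps above look roughly like this in Sense syntax (the field names are a sketch rather than an exact copy of the screenshots); the child is indexed with a parent parameter so that it is routed to the same shard as its parent:

PUT petowning
{
  "mappings": {
    "owner": {
      "properties": {
        "name": { "type": "text" }
      }
    },
    "cat": {
      "_parent": { "type": "owner" },
      "properties": {
        "name": { "type": "text" }
      }
    }
  }
}

PUT petowning/owner/1
{
  "name": "Alex"
}

PUT petowning/cat/1?parent=1
{
  "name": "Kelly"
}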

 

Now each of these three methods has advantages and disadvantages which need to be considered against your system requirements when you are performing data modelling:

Flattened data: Uses less data than nested objects and parent/child objects, but the data relationships are totally lost

Nested objects: Faster but less flexible (because the root object that the nested object is held in must be re-indexed whenever the nested object is updated)

Parent/child objects: Slower but more flexible

 

Relevancy

Elasticsearch by default uses the TF/IDF (Term Frequency/ Inverse Document Frequency) algorithm to determine how relevant a document is to a query.

This algorithm scores a document higher the more often the search term appears in it (term frequency), and weights a term more heavily the rarer it is across all documents (inverse document frequency). A field-length norm also means that a match in a shorter, more specific field ranks higher than one in a long field.

 

Percolator

Instead of saying which documents match a query, the percolator does the opposite: it outputs the queries that match a document.

To be able to output the search queries that match a document we have to store the search queries as JSON documents; however this is no problem because the search queries are written in the Query DSL (as I have previously discussed), which is very similar to JSON.

Below I am storing a search query to find a cat with the name “Violet”

storing query.PNG
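
On Elasticsearch 5.x the way this works is a percolator field: you map a field of type percolator, index your stored queries into it, and then run a percolate query with a candidate document to see which stored queries match. A sketch with made-up index and type names rather than an exact copy of the screenshot:

PUT catqueries
{
  "mappings": {
    "stored_queries": {
      "properties": {
        "query": { "type": "percolator" }
      }
    },
    "domestic": {
      "properties": {
        "name": { "type": "text" }
      }
    }
  }
}

PUT catqueries/stored_queries/1
{
  "query": {
    "match": { "name": "Violet" }
  }
}

GET catqueries/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "document_type": "domestic",
      "document": {
        "name": "Violet"
      }
    }
  }
}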

 

So there we have it: explanations and tested examples (run by me on Elasticsearch 5.3 and Kibana 5.3, shown as screenshots) of the many different functions that Elasticsearch and Kibana provide. I hope this has been interesting and useful for you; I personally have found it fascinating to go through almost all of the commands I learned on my course in more depth and to understand them better.

 

Bibliography

Chapman, B. (2014, June 11). What Can Querydsl Do for Me Part 1: How to Enhance and Simplify Existing Spring Data JPA Repositories – http://www.credera.com. Retrieved April 16, 2017, from https://www.credera.com/blog/technology-insights/java/can-querydsl-part-1-enhance-simplify-existing-spring-data-jpa-repositories/

Aggregations | Elasticsearch Reference [5.3] | Elastic. (n.d.). Retrieved April 16, 2017, from https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html

 


Searching for an academic paper

For this week's blog post Clare asked us to find 2 academic papers on IT and to explain why they have the features which identify an academic paper. These features, defined in this week's notes, are:

  • the title
  • the authors (usually with an email address and affiliation)
  • the abstract
  • the introduction
  • a review of other papers relevant to the topic ( a literature review)
  • a description of what the research was and what the researchers did
  • the results of what they did
  • a discussion about what the results mean
  • a conclusion
  • a list of references

Thinking about the sources I found for last week's research journal entry I came to the conclusion that there is one that could potentially be considered an academic paper, so I will start this entry by analyzing this publication to see if it is an academic paper.

System Virtualization  tools for Software Development

Title and author(s) of the article: 'System Virtualization tools for Software Development' by Juan C. Duenas (jcduenas@dit.upm.es), Jose L. Ruiz (jlrreveulta@indra.es, who works as a Senior Consultant), Felix Cuadrado (fcuadrado@dit.upm.es), Boni Garcia (bgarcia@dit.upm.es) and Hugo A. Parada (hparada@dit.upm.es). Note: all these authors except Jose L. Ruiz were specified as affiliated with the Department of Telematics Engineering at Universidad Politecnica de Madrid

APA reference: Duenas, J. C., Ruiz, J. L., Cuadrado, F., Garcia, B., & Hugo, P. G.,. (2009). System virtualization tools for software development. IEEE Internet Computing, 13(5), 52-59. doi:http://dx.doi.org/10.1109/MIC.2009.115

how you found the article and what keywords you used: I used the NMIT provided ProQuest research database and searched with the keyword ‘Virtualization’

What kind of article it is: Journal paper in the IEEE Internet Computing journal in September 2009.

All the reasons that you think it is an academic article:

  • This is a peer reviewed text that was printed in the IEEE Internet Computing journal which is a scientific journal.
  • It is 9 pages long (so it is an appropriate academic paper length)
  • It is cited by two other sources.
  • It has the academic journal structure
  • All of the authors are credible as they are either senior technology consultants or affiliated with the Department of Telematics Engineering at Universidad Politenica de Madrid.

How well it fits the ‘structure of an academic article’:

 

  • the title: System Virtualization tools for Software Development
  • the authors (usually with an email address and affiliation): Juan C. Duenas (jcduenas@dit.upm.es), Jose L. Ruiz (jlrreveulta@indra.es and works as Senior Consultant), Feliz Cuadrado (fcuadrado@dit.upm.es), Boni Garcia (bgarcia@dit.upm.es), Hugo A. Parada (hparada@dit.upm.es). Note: All these authors except Jose L. Ruiz were specified as affiliated with the Department of Telematics Engineering at Universidad Politenica de Madrid
  • the abstract – Yes
  • the introduction – No
  • a review of other papers relevant to the topic ( a literature review) – There is no literature review section. However it is specified that the Virtualization system design they built was based on the model and technologies in two other sources.
  • a description of what the research was and what the researchers did -Yes. The use of use-case analysis was specified.
  • the results of what they did –  Yes the use cases identified with use case analysis were specified.
  • a discussion about what the results mean – Yes this was included, in addition to further work planned to continue the research
  • a conclusion – No
  • a list of references – Yes, cites 12 other sources

 

How many references it has: 12

How many citations it has: 2 other sources cite it

url of the article: http://llcp.nmit.ac.nz:2345/docview/197325712/abstract/49593612B92B4719PQ/1?accountid=40261

Are you interested in properly reading the article or not: Yes I would be interested in reading this paper in depth because it is tying into the concept of DevOps which I have researched for several previous research journal entries.

How does it tie in with DevOps? Well, this research paper's goal was to design a generic virtualization framework solution to help make the environment that developers work in as similar as possible to the production environment, reducing time wasted trying to get software ready for deployment. This is also what DevOps aims to achieve: to reduce the differences between the development and production environments.

Now for the two academic papers we need to find for this week:

Early user involvement in the development of information technology-related products for older people

Title and author(s) of the article:
Early user involvement in the development of information technology-related products for older people by R Eisma, A Dickinson, J Goodman, A Syme and L Tiwari

 

APA reference:
Eisma, R., Dickinson, A., Goodman, J., Syme, A., Tiwari, L., & Newell, A. F. (2004). Early user involvement in the development of information technology-related products for older people. Universal Access in the Information Society, 3(2), 131. doi:http://dx.doi.org/10.1007/s10209-004-0092-z

 

How you found the article and what keywords you used: I found it on the ProQuest research database by searching with the keywords ‘user experience for older people’

 

What kind of article it is: It is a peer reviewed academic paper in the quarterly ‘Universal Access in the Information Society’ journal.

 

All the reasons that you think it is an academic article:

  • Peer reviewed which means at least two other academics have considered it to be useful as it adds credible information to the knowledge pool on user experience with older people
  • 11 pages so it is an appropriate length to be considered an academic paper and not a book
  • Contains references and abstract
  • Structured in typical academic paper style (see below)

 

how well it fits the ‘structure of an academic article’:

  • the title: Early user involvement in the development of information technology-related products for older people
  • the authors (usually with an email address and affiliation): R Eisma, A Dickinson, J Goodman, A Syme, and L Tiwari
  • the abstract – Yes
  • the introduction – Yes
  • a review of other papers relevant to the topic ( a literature review) – There is no literature review section. However other studies are cited, specifically this paper states its findings match those of another study “Older citizens (50+) and European markets for ICT products and Services”
  • a description of what the research was and what the researchers did – Yes. The qualitative research methods of questionnaires, focus groups, workshops, and interviews
  • the results of what they did – Yes; statistics quantifying the qualitative results were only used in 3 places, but examples and general summing-up statements about the findings (of the interviews, for example) were used frequently.
  • a discussion about what the results mean – Yes.
  • a conclusion – Yes
  • a list of references – Yes, cites 39 other sources

 

How many references it has: 39 references

 

How many citations it has: It is cited by 66 other sources

 

url of the article: http://llcp.nmit.ac.nz:2345/docview/201544055/B5F4F4DECC9C4410PQ/5?accountid=40261

 

Are you interested in properly reading the article or not: Yes I am. The topic of making technology more accessible to older people is of particular interest to me (which is why the third research journal entry I wrote was on user experience). The reason for this is that I have older parents and I often end up helping them to understand un-intuitive applications, which has made it very clear to me how poorly most websites and applications are designed for people who do not use technology on a frequent basis.

 

The Next Generation Library Catalog: A Comparative Study of the OPACs of Koha, Evergreen, and Voyager

Title and author(s) of the article: 'The Next Generation Library Catalog: A Comparative Study of OPACs of Koha, Evergreen and Voyager' by Sharon Q. Yang and Melissa A. Hofmann

APA reference:  Yang, S. Q., & Hofmann, M. A. (2010). The next generation library catalog: A comparative study of the OPACs of koha, evergreen, and voyager. Information Technology and Libraries, 29(3), 141-150. Retrieved from https://search.proquest.com/docview/746170319?accountid=40261

how you found the article and what keywords you used: I used ProQuest research database searching with the search term ‘Koha open source’

What kind of article it is: Journal paper in  the Information Technology and Libraries quarterly journal

All the reasons that you think it is an academic article:

  • Peer reviewed article
  • Structured as an academic paper with abstract, introduction, literature, references

how well it fits the ‘structure of an academic article’:

  • the title: The Next Generation Library Catalog: A Comparative Study of OPACs of Koha, Evergreen and Voyager
  • the authors (usually with an email address and affiliation): Sharon Q. Yang and Melissa A. Hofmann (no email addresses specified)
  • the abstract: yes
  • the introduction: yes (not titled)
  • a review of other papers relevant to the topic: Literature review
  • a description of what the research was and what the researchers did: Yes. The features being compared in a qualitative comparative study of three Integrated Library Management Systems (ILMS) which are Koha, Evergreen, and Voyager, are discussed.
  • the results of what they did: The comparing of the three ILMS’s is included in the paper, along with screenshots.
  • a discussion about what the results mean: Yes, including the statement that the Koha ILMS meets the highest number of criteria in the study.
  • a conclusion: Yes
  • a list of references: Yes this contained 23 references

how many references it has: 23

how many citations it has: Cited by 14 other sources

url of the article: http://llcp.nmit.ac.nz:2345/docview/746170319/A0E73CDCDB4D4F90PQ/23?accountid=40261

Are you interested in properly reading the article or not (and give some reasons!): This paper outlines the vision for a ‘next generation library catalog’ and compares three major extant ILMS’s against this ideal to see how far we are away from it. This library catalog is the OPAC interface, which is the interface that users interact with in physical libraries or the catalog area of library websites.

The reason I am interested to read this paper is that I work on Koha and so I am very interested to see how Koha compares to another open-source ILMS (Evergreen) and a proprietary competitor (Voyager). The downside to this article is that, because of its age, a lot of the findings are out of date; it investigates Koha 3.0, and Koha 17.05 is about to be released next month.

Searching for credible evidence

In class this week Clare asked us to research 3 sources for each of 2 topics, digital citizenship and virtualization technology, and answer 7 questions on each. My mother has been in and out of hospital for the last week and I have been helping to look after her when she was at home, so please forgive my brevity in this research journal entry.

Digital citizenship

  1. Core Education Digital Citizenship web-page article

URL: http://core-ed.org/legacy/thought-leadership/ten-trends/ten-trends-2013/digital-citizenship

Search terms: “Digital citizenship”

How you found it: I found it by Google searching and it was near the bottom of the first page of results.

Who wrote/created it: No author's name is mentioned even though the article is written in the first person, and none of the other pages contained an author's name.

However Derek Wenmoth (Director of e-Learning at Core Education) coordinated the writing of the article series.

When was it written/created/recorded/published?: It was written in 2013 as part of a series of articles on educational trends in 2013 published on the Core Education website. Core Education is a professional development website for teachers and educators ("Home » CORE Education," n.d.).

What kind of publication is it: Educational article (part of a series on education trends in 2013) it also includes a video that covers the same content as the written article.

How credible (believable) do you think it is: I view this article as credible because it is relatively recent, having been written in 2013, and it was written for the website of an educational training organization which advises the government on educational practices, so it is likely to be credible in its discussion of how digital citizenship is taught in schools.

Additionally the coordinator of the article series, Derek Wenmoth (http://www.core-ed.org/about-core/our-team/senior-leadership-team/derek-wenmoth?url=/about/meet-our-team/derek-wenmoth), has two diplomas in teaching (so he has appropriate qualifications), and is viewed as an expert on educational policy, shown by the fact that he has consulted for the government ("Derek Wenmoth » CORE Education," n.d.).

In 2008 (before this article was written) he was named one of the Global Six, six educators recognized globally as innovative by the George Lucas Educational Foundation ("Derek Wenmoth » CORE Education," n.d.).

So he has the qualifications and experience, and is considered an expert in the use of technology in education, and since he coordinated the writing of this article I believe it is likely to be credible.

2. Digital Citizenship: The Internet, society and Participation (MIT Press) – Book

URL: https://books.google.co.nz/books?hl=en&lr=&id=LgJw8U9Z0w0C&oi=fnd&pg=PR7&dq=what+is+digital+citizenship&ots=DWYxBRhGYn&sig=zTp9Q_c4j87A5CbEDpxogQb_uIA#v=onepage&q=what%20is%20digital%20citizenship&f=false

Search terms: “What is digital citizenship?”

How you found it: I used Google Scholar and it was on the first page of results

Who wrote/created it?:  Karen Mossberger (Associate Professor in Public Administration at the University of Illinois) , Caroline J. Tolbert (Associate professor at University of Iowa in the Political Science department) and Ramona S. McNeal (Visiting Assistant Professor at University of Illinois in the Political Science department) (“Digital Citizenship | The MIT Press,” n.d.)

When was it written/created/recorded/published?: October 2007

What kind of publication is it: It is a book produced by MIT press.

How credible (believable) do you think it is: I believe this is not a particularly credible and valid source because it is dated to 2007. Since then the internet and the use of technology have come to affect our lives significantly more, especially with the proliferation of social media.

However the authors and publisher of this source are credible, because all the authors are associate professors in this field and it is produced by MIT Press, a respected academic publisher.

Therefore, although the authors and publishing organization are credible, the age of this source means it is no longer as credible.

3. Digital Citizenship – Wikipedia article

URL: https://en.wikipedia.org/wiki/Digital_citizen

Search terms: “Digital citizenship” in Google search

How you found it: I found it by Google searching and it was near the bottom of the first page of results.

Who wrote/created it?: Being a Wikipedia article it can be edited by anyone.

When was it written/created/recorded/published? It was originally written in December 2008 and was most recently edited on the 4th of April.

What kind of publication is it: A Wikipedia article

How credible(believable) do you think it is: Although more up to date than the second source, this is not a credible source because it can be modified by anyone, including people who do not necessarily have any knowledge of the concept of digital citizenship.

When people write incorrect information on Wikipedia it may be quite some time before it is corrected (if at all). So I do not view this to be a credible source.

Virtualization technology

  1. Virtualization vs Cloud Computing article in  Business News Daily

URL: http://www.businessnewsdaily.com/5791-virtualization-vs-cloud-computing.html

Search terms: “Virtualization technology”

How you found it: Google search

Who wrote/created it?: Sara Angeles who writes about technology for Business News Daily. She has written tech blogs for IT companies such as IdeatoAppster.com (app development company) and Izea ( a content marketing company) (Angeles, n.d.).

When was it written/created/recorded/published? 20 Jan 2014

What kind of publication is it: An article in Business News Daily, which is a business advice publication.

How credible(believable) do you think it is: I think this article is relatively credible, because it’s in a business publication and so an editor would have checked it before it was published. As opposed to a blog post where no-one else needs to check it before it is published.

The author of the article used several quotes from people working in high power positions in large IT companies: VMware, InfraNet, Weidenhammer. This adds to the credibility of the article because it shows the author has done her research. Additionally she included links for readers to learn more about cloud computing which the article was comparing to Virtualization.

The article is dated 2014 which is recent enough to make the concepts it discusses still relevant today.

The article is written with few statistics or quantifiable facts; this is likely because it is a high-level article discussing the general technologies of virtualization and cloud computing and when they can be helpful, rather than specific details.

So overall, due to where it is written, the use of quotes from industry experts, and its date of publication, I believe this is relatively credible in what it talks about for a business audience, but it would not be useful for a technical audience because they would want to know more specifics and perhaps see some quantifiable facts.

2. Introduction to Virtualization – Eli the Computer Guy  YouTube video

URL: https://www.youtube.com/watch?v=zLJbP6vBk2M

Search terms: I used the search terms “What is virtualization” on YouTube

How you found it: YouTube

Who wrote/created it?: Eli Etherton (known as Eli the Computer Guy on his Youtube channel and website). He has an IT background and works as a consultant, in addition to having a highly successful YouTube channel providing instructional tech videos.

In terms of popularity Etherton’s videos are  “now among the top 1 percent of people listed in the Google preferred lineup of technology-focused YouTube channels”(“Eli the Computer Guy’s videos among top 1% of tech-focused YouTube channels – Technical.ly Baltimore,” 2014)

When was it written/created/recorded/published?  3 Feb 2012

What kind of publication is it: YouTube video

How credible (believable) do you think it is: I believe that this is a credible source because Etherton has a technology background and works as an IT consultant, which means he has to know his subject well. By taking the skills he has gained through this work experience to YouTube he is providing technology videos which are highly likely to be valid and credible.

Additionally if his facts were consistently incorrect it is unlikely he would have one of the most successful tech channels on YouTube.

3. System Virtualization tools for Software Development – Peer reviewed journal article

URL: http://llcp.nmit.ac.nz:2345/docview/197325712/E188A46A05164EB3PQ/4?accountid=40261

Search terms: “Virtualization technology”

How you found it: ProQuest research database

Who wrote/created it?
Juan C. Duenas, Jose L. Ruiz, Felix Cuadrado, Boni Garcia, Hugo A Parada G

When was it written/created/recorded/published? September 2009

What kind of publication is it: Article in the IEEE Computer Society periodical journal

How credible (believable) do you think it is: Apart from the age of this source it is very credible, because it is written in a peer-reviewed scholarly journal. The journal article will have had to be read and approved by 2 or more other people with knowledge of virtualization before it was permitted to be published.

Having been peer reviewed means any incorrect information in the article is highly likely to have been identified before publication. Therefore I would say that when this article was first published it would have had one of the highest levels of credibility available; however, due to its age, it is no longer as credible now.

Bibliography:

Eli the Computer Guy’s videos among top 1% of tech-focused YouTube channels – Technical.ly Baltimore. (2014, May 5). Retrieved April 6, 2017, from https://technical.ly/baltimore/2014/05/05/eli-the-computer-guy-youtube/

Digital Citizenship | The MIT Press. (n.d.). Retrieved April 7, 2017, from https://mitpress.mit.edu/books/digital-citizenship

Home » CORE Education. (n.d.). Retrieved April 7, 2017, from http://www.core-ed.org/

Derek Wenmoth » CORE Education. (n.d.). Retrieved April 7, 2017, from http://www.core-ed.org/about-core/our-team/senior-leadership-team/derek-wenmoth?url=/about/meet-our-team/derek-wenmoth

Angeles, S. (n.d.). Sara Angeles | LinkedIn. Retrieved April 7, 2017, from https://www.linkedin.com/in/saraangeles/

Research Approaches part 2 and what is credible research?

Today we went through the remainder of the research methods, which I will cover in this research journal entry, then we started discussing what sources we can consider credible for research.

 

Research methods

Experimental research:

Proving or disproving a hypothesis over a series of tests on various groups; tests consist of manipulating/controlling variables in a controlled environment. This method is part of the scientific paradigm.

It is testing an idea (this idea is a hypothesis which is testable in a controlled and repeatable way), which is a concept from the scientific paradigm.

Example: Measuring effects of soft drinks/measuring the effects of mobile devices on eyesight over a period of time.

Strengths of this approach are: Repeatable, generalisable, easier to see if it is valid

Weaknesses of this approach are: Control can be difficult, confounding variables, and statistics can still be interpreted in a biased way.

 

Social scientific paradigm research methods:

Exploratory research:

Qualitative research that lays the foundation for further work – getting to understand/know the subject/focuses on ideas on how to do research in the area -Preliminary work

When you find a subject has not been researched before, or if there are a few papers that exist but you are not happy with how they approached the research then you investigate the subject in a very broad way and then you can find a potential for future research.

e.g. Investigation of ambiguous questions, such as deconstructing a question like the 'best laptop' question, or investigating a brand new technology.

Another example is exploring people's perceptions of the dangers of true A.I., whether they see it as scary or exciting; this would be the beginning of much further work into that area.

Strengths of this approach: Very helpful to investigate new areas, provide road-maps for where we might go, or what might be useful research. Can clarify what the biases are which can make eventual results more reliable.

Even no results are useful as they can help to identify further useful work.

Weaknesses of this approach: Less status, expensive (time and money), not conclusive and shouldn't be considered as such. Always needs further clarification and can be biased.

 

 

Discourse analysis

Analysis of spoken or written words to discover the meaning behind them.

For example: Different words are used for the same concept in an organization. So the CEO might refer to students as customers whilst the teachers might refer to students as students. The different words used to identify students convey different meanings.

This research method is analyzing the interpretation of language.

Discourse analysis can be used in courts as judges have to interpret the wording of laws.

Strengths of this approach: It's a familiar strategy that we all use informally (we all know how to do it), and it is very powerful in extracting 'real' meaning.

Disadvantages of this approach: Very hard to do, especially with non-expert language users, and time consuming; subject to personal bias.

 

Action research

Has to be a flexible, ongoing process. You must keep an eye on the process and change when needed – the researcher is part of the research (instead of trying to distance themselves from the research to try to improve its objectivity).

Taking an agile focused approach to your research by investigating something and changing it and then testing it again. Also everyone is involved in the research including the researcher. So it is almost the iterative, user centric spiral model of Agile methods.

This research method can resolve some problems which traditional ways of research cannot – and/or create something new before something fails. It works well alongside the people who will use the end result of your research.

 

Focus groups

Getting a small, diverse group together asking questions on a specific topic. Guided and open discussion – The researcher guides the discussion at the start and then hopes the conversation opens up as members of the group chime in their opinion and then the focus group takes it from there.

The researcher is active at the start of the focus group meeting, and as the focus group goes on they become less active. They don’t just want a reply to one question they want others to add to the answer and make for a richer research method than an interview.

It is a form of exploratory research because you can analyze what the members say, and discover new areas to investigate.

Used for market research (business area) to test out people's reactions to products and services.

e.g. Feedback on an app (in development), Sandra research – focus groups of high school students talking about IT careers.

Strengths of this approach: People are more open in a group discussion than in a one on one interview, and they feed off what others say

Weaknesses of this approach: One person can dominate the discussion and influence others towards 'group think'.

 

Design science research

Where you are creating something, e.g. a new app or a new OS; it's a way of actually building a useful artifact, testing it, and evaluating it.

Used mainly in construction, IT and education.

Practical in nature in that you are actually doing something. The intention of this research is to help  people.

 

Argumentation research

Posing two different viewpoints and supporting them with logical and emotional evidence – one is a thesis and the other point of view (pov) is the antithesis. (Social science paradigm)

Also known as the argumentation theory.

e.g. Two designers disagree over the design of an app – after comparing they came to a synthesis of the two povs.

Helps to resolve disagreements, allows the questioning/challenging of accepted wisdom.

The argument is very dependent on how well someone can present it, it can be biased, and a synthesis may not be reached

 

 

Credible research

Sources for research are (with credible sources highlighted):

Library

  • Journals – Probably out of date
  • Newspapers/magazine
  • Textbooks – Tend to be out of date, and accepted wisdom rather than anything new, exciting or challenging
  • General
  • Previous projects
  • Library database – ProQuest, Eric –  Anything you find through library databases are usually peer reviewed, and so likely to be the most credible source in the Library.

Internet

  • White papers – Research at IBM puts out  good information. It is specifically funded by a company or government
  • Google Scholar – Search engine where you find evidence. Alternatives are ResearchGate and academia.edu. Very good as it filters results down to academic journals and academic books.
  • Journals – Slightly higher value than printed journals as they will likely be more up to date.
  • newspaper/magazines
  • e-books
  • websites
  • magazines
  • videos
  • online tutorials
  • images, graphics etc
  • code repository
  • databases of information
  • data sets – For example the census results, which are very highly credible
  • public records
  • Technical papers – Very high credibility. Usually not particularly biased, instead it is a description/discussion of a technical subject like new OS.
  • self publishing
  • vanity publishing
  • MOOCs – Massive Open Online Course – University courses made available to the public by universities such as Harvard. High credibility.

Other things

  • Other people
  • Your own experience
  • TV
  • experts – You have to be careful to determine if they are experts in the area they are talking about
  • classrooms
  • data sets
  • feedback
  • observation
  • your own research

 

It's not just knowing the source of the research that is important; we need to know other things as well. For example, if research came from a blog: some blogs are very credible, whilst others will not be at all credible, so you have to look at the credibility of the author in that case. So it depends.

 

 

How can Agile methods and DevOps improve IT project success rates?

The statistics about IT project success rates are shocking; to me the most interesting statistic in this article https://projectjournal.co.uk/2016/03/16/15-shocking-project-management-statistics/ is:

“On average, large IT projects run 45 percent over budget and 7 percent over time, while delivering 56 percent less value than predicted” (Bloch, Blumberg, & Laartz, 2012).

In this research journal entry I want to investigate what are the potential causes of these high failure rates and if the implementation of the principles and values in Agile methods and DevOps at the deployment end of software development projects can help in reducing these failure rates.

What are the issues causing the high IT project failure rates?

As we discussed in SYD701 a few weeks ago, the likely cause of the high IT project failure rate is that projects are so complex nowadays that we do not have the methodologies or thinking paradigms to successfully build the systems on time, on budget and within or above expectations.

What do I mean by the projects being complex? Well, the needs IT systems are attempting to solve nowadays are not easily definable, there are many ways the system can be developed and many different solutions that could meet the requirements, whilst the end result is not always agreeable to all stakeholders. In other words the need that the IT system being developed is trying to solve is a mess, meaning it is not easily definable, and there are multiple processes and outcomes that could be produced.

The linear systems thinking paradigm that spawned the Structured Systems Analysis and Design Method (SSADM) is not designed to help development teams design and develop systems from messes, because the mindset of this train of thinking was that you could understand a need necessitating a system by dividing it down to its smallest parts. However, how can you divide down a problem/need when you do not understand it?

You can't, and that's part of the reason why SSADM is not suitable for the development of the modern, complex systems we build nowadays.

Let's see some hard facts about IT project failures from a study performed by the consulting company McKinsey & Company and Oxford University:

IT issues.png

(Bloch, Blumberg, & Laartz, 2012)

Something important to note in all these 'failed' systems is the common characteristic that they all carried high risks in financial terms, in schedule terms and in terms of expected benefits (Benni, 2012). Where there are high risks, and particularly where the risks change frequently, a non-iterative systems development methodology is not ideal: it is built to address the risks identified at the start of the SDLC and will not identify and address risks that appear later on in the SDLC. That, I believe, is the reason the study found that "Every additional year (in a projects development) increases the expected cost overrun by 16.8% and schedule overrun by 4.8%" (Benni, 2012).

This McKinsey and Oxford study performed qualitative research on IT executives to identify what they believed the solution to these IT project issues was, and they came to the following conclusions about what is required for a successful IT project:

  • “focusing on managing strategy and stakeholders instead of exclusively concentrating on budget and scheduling
  • mastering technology and project content by securing critical internal and external talent
  • building effective teams by aligning their incentives with the overall goals of projects
  • excelling at core project-management practices, such as short delivery cycles and rigorous quality checks”  (Bloch, Blumberg, & Laartz, 2012)

 

Does the SSADM align with these 4 fixes?

Clearly the SSADM does not meet the first and last of those fixes; after the initial elicitation of client requirements in the systems analysis stage of the SDLC there is little client requirement elicitation throughout the rest of the project. It is hard for the SSADM to manage stakeholders to ensure the product meets their requirements due to its non-iterative nature, meaning that if the requirements change then the system itself cannot change to meet them.

Additionally SSADM does not have short delivery cycles; instead it releases the product in large chunks, and this can create a headache for the operations team that has to make sure the product developed in the development environment actually works in the production environment before it is released to the users. I believe this mammoth release of software at the end of the SSADM was part of the reason that the concept of DevOps was first conceived: to create a change of mindset towards the paradigm of small, frequent releases of software to the operations team.

So, due to the non-iterative nature of the Waterfall model which the SSADM follows, each project is effectively 'stuck' with the requirements identified in the systems analysis stage at the start of the SDLC. That makes this methodology unhelpful in environments where user and business requirements change, or where there is a high risk of change.

And as we have seen in the research journal entry about DevOps, that deployment model works by deploying software from the development to the production environment in small features, frequently, which are easier for the operations team to test, troubleshoot and deploy.

Can Agile and DevOps help?

Now, in my opinion at this stage of my knowledge about Agile methods, implementing Agile methods would help enormously in fixing most of the issues identified in the study because:

  1. Agile methods are timeboxed, meaning a project is split up into blocks of time known as sprints. By focusing on time rather than features, and by producing prototypes that can theoretically be ready for deployment at the end of each sprint thanks to its iterative nature, the Project Manager can ensure there will be a deployable product at the end of the project.

This product will likely have the most important features in it, because Agile systems development methodologies such as Scrum generally place the most important features (known as user stories) in the earlier sprints to ensure they will be completed.

2. Now, going 45% over budget can be partially attributed to going 7% over time; however, the difference in the percentages means there are obviously other factors involved, and I have to be honest at this stage and say I am not sure how the implementation of Agile methods could help with this.

3. However, Agile methods are very likely to improve the missing-focus statistics, because there is constant communication between the development team and the clients throughout an Agile project.

However, in our SYD701 class last week we looked at the history of computer systems, and this quote from the paper we were looking at, A Short History of Systems Development, interested me:

“The problems with systems today are no different than fifty years ago:

  • End-user information requirements are not satisfied.
  • Systems lack documentation, making maintenance and upgrades difficult.
  • Systems lack integration.
  • Data redundancy plagues corporate data bases.
  • Projects are rarely delivered on time and within budget.
  • Quality suffers.
  • Development personnel are constantly fighting fires.
  • The backlog of improvements never seems to diminish, but rather increases.”(Bryce, 2006)

This is interesting because it shows that implementing Agile methods will not solve all of these problems. The second issue in the list, a lack of documentation with modern systems, could easily be perpetuated by Agile methods, which value working software over documentation.

As I am working on the Koha Library Management System at work I understand how important it is to have good documentation to understand how a system you didn’t build works as a whole (we are lucky with the Koha project because being an open source project there is a lot of documentation to make it easier for new developers to contribute and so we have a wiki for Koha development).

This is an example of how Agile methods and DevOps are not silver bullets; they do not solve all of the problems facing modern systems development.

 

Interesting resources on IT project failures:

http://www.geneca.com/blog/software-project-failure-business-development

http://www.pmi.org/-/media/pmi/documents/public/pdf/learning/thought-leadership/pulse/pulse-of-the-profession-2015.pdf

http://calleam.com/WTPF/?page_id=1445

http://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/delivering-large-scale-it-projects-on-time-on-budget-and-on-value

 

Bibliography:

Bloch, M., Blumberg, S., & Laartz, J. (2012, October). Delivering large-scale IT projects on time, on budget, and on value | McKinsey & Company. Retrieved March 24, 2017, from http://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/delivering-large-scale-it-projects-on-time-on-budget-and-on-value

Benni, E. (2012, April). Transforming the company: Avoiding the Black Swans – Success factors and core beliefs in value assurance. Mobily CIO Summit, Istanbul. Retrieved from http://mobilyciossummit.com/presentations/02_Mickensey_and_Co.pdf

Bryce, T. (2006, March 14). A Short History of Systems Development. Retrieved March 31, 2017, from http://it.toolbox.com/blogs/irm-blog/a-short-history-of-systems-development-8066

Elasticsearch Part 2

On Friday I attended an excellent Elasticsearch basics for developers course at work, and I would like to discuss what I learned and how it has changed my view of Elasticsearch since I wrote about it a couple of weeks back.

What is Elasticsearch?

Elasticsearch is usually thought of as a search engine, but it is more than that. Elasticsearch can also be considered a:

  • Data store, meaning in addition to using it as a search engine for your app/system you could also use it as an alternative to a Relational Database Management System (RDBMS). Elasticsearch stores data in documents, mapping types and indexes, which are the equivalent of a relational database's rows, tables, and databases respectively.

 

  • Reporting tool – Elasticsearch can be used to store system logs. Kibana, the RESTful API you use to interact with Elasticsearch, can generate visual charts from Elasticsearch search query results, for example the charts of system log information below. These charts present the data in a far more useful format than a written system log.

kibana visualisation

(Christopher, 2015)

Something that particularly interested me about the course was that the presenter, Frederik, said that Elasticsearch is very flexible and extremely useful as long as you're prepared to spend time configuring it.

A lot of people implement Elasticsearch (which is actually pretty easy, as I found last week) and expect it to be the equivalent of Google for their organization's data; however, if you don't configure it to match your business problem domain then it will not reach its full potential.

What is the internal structure of Elasticsearch?

Elasticsearch is built on top of Lucene, which is a search library. In the documentation it is very hard to determine where one ends and the other begins; however, having done the course and read through the first answer on this very interesting StackOverflow page (http://stackoverflow.com/questions/15025876/what-is-an-index-in-elasticsearch) I believe I have a good understanding of this now, so let's test it out.

I look at Elasticsearch and Lucene as a two-layered cake (to see this graphically look at the diagram below, where Elasticsearch is the top layer and Lucene the bottom layer); the top layer (Elasticsearch) is the one the user interacts with. When you first install Elasticsearch a cluster is created (a cluster is a collection of one or more nodes, i.e. instances of Elasticsearch).

Inside this cluster, by default, you have one node (a single instance of Elasticsearch). This node contains indexes. An index is like a database instance. Drilling down further we have mapping types (the equivalent of tables; for example you could create a mapping type of student). Inside a mapping type there are documents (a single data record, making them the equivalent of a row in a database), and inside each indexed document there are properties, which are the individual data values (so, for example, 22 years old is a value for the property age).

To put the document into perspective it is just a JSON data structure.
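For example, a document for the student index I use later on might look something like this (the exact fields here are just for illustration):

{
  "name" : "Alexander Buckley",
  "age" : 22,
  "origin" : "Nelson"
}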

So we have established that Elasticsearch stores the data in indexes, with each data record known as a document.

But how does Elasticsearch actually find specific data when someone writes an HTTP GET request into Kibana? Well, that's where Lucene comes in. Lucene is the bottom of the two layers in my cake simile. Lucene contains its own index, which is an inverted index: instead of storing data it points to the indexed documents in the Elasticsearch index where a data value is stored, in much the same way a book index points to the page number where a particular word appears.

Another good analogy for an inverted index is that it is quite similar to containers such as arrays and dictionaries, which point to a specific location in memory where a particular value is stored rather than storing the value itself in their data structure.

Having done the course I now believe I understand how the index in Elasticsearch and the index in Lucene relate.

lucene and es.png

(Principe, 2013)

Now, as I said, by default your Elasticsearch cluster has one node; however, Elasticsearch is extendable, meaning you can add more nodes to your cluster.

By default each index is split into 5 shards, which may be spread across the nodes. What is a shard? "A shard is a single Lucene instance. It is a low-level "worker" unit which is managed automatically by elasticsearch. An index is a logical namespace which points to primary and replica shards." ("Glossary of terms | Elasticsearch Reference [5.3] | Elastic," n.d.). In other words, the Elasticsearch index is the logical layer that sits over the shards, and each shard holds a Lucene inverted index inside it.

Each shard has a backup in the form of a replica shard which is stored on a different node. This provides data redundancy and speeds up search times, because it is more likely that an HTTP GET request is sent to a shard containing an inverted index with the search term in it.

What is the process that happens when the client RESTful API sends a request to the cluster?

In Elasticsearch the commands are called APIs, so for example a delete command is called the delete API.

Now like I previously stated Elasticsearch is structured as a collection of nodes in a cluster (think of it like how there are multiple servers in the concept of the cloud).

The nodes store different information (in the form of Lucene inverted indexes and Elasticsearch indexes), so the request needs to go to a particular node to access particular data. However, all the nodes store information about the topology of the cluster, so they know which node contains the data the API command seeks or wants to modify.

When you write an HTTP GET request in Kibana, the ID specified in the GET request is hashed and the request is sent to a node in the cluster. It doesn't matter which node the request is sent to: if that node doesn't hold the matching shard, it will redirect the request to the appropriate node.
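If I have understood the course material correctly, the routing rule Elasticsearch uses boils down to:

shard = hash(_routing) % number_of_primary_shards

where _routing defaults to the document's ID, which is why the ID is hashed as described above.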

However, to make sure that the same node is not always queried, the destination node of each search query is varied using a round-robin distribution.

How Elasticsearch data storage violates normalisation and refactoring

Elasticsearch is all about fast search times, to achieve this having duplicated data in multiple indexes is considered acceptable.

This is in complete contrast to the database concept of normalization and the programming concept of refactoring both of which stress the need to remove duplicate data/code.

What are the differences between Elasticsearch and a relational database

Although Elasticsearch can be used as a data store meaning you could implement it as an alternative to a relational database the differences are:

  • Elasticsearch does not use foreign keys to create relationships between indexes
  • Data can be duplicated to speed up the query time
  • Query joins (querying two or more indexes in a single query) are not available in any effective way in Elasticsearch; rudimentary joins can be implemented but they are not very efficient

So when should you replace your RDBMS with Elasticsearch? Well, it depends on the sorts of queries you perform or want to perform on your primary data store. If you need complex transactional queries (multiple operations that must succeed or fail together) then Elasticsearch is not ideal and you would be better off using an RDBMS such as MySQL and just using Elasticsearch as a search engine.

However, if you don't need complex transactional queries then Elasticsearch is a good alternative to an RDBMS.

What are the cons of Elasticsearch?

Elasticsearch is not ideal from a security point of view because, out of the box, it does not provide data or transport encryption.

It is near realtime – This means there is a slight latency after indexing a document before you can search for the data it holds.
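If you need a document to be searchable immediately (for example in a test script), my understanding is you can trigger a refresh manually, at some performance cost, e.g. for the student index used later in this post:

POST student/_refresh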

What are the benefits of Elasticsearch?

The main benefits of Elasticsearch are:

  • It is fast – because the data is duplicated across primary and replica shards, it is faster to access data from either one
  • It is distributed – meaning it is easy to extend by adding another node to your cluster
  • High availability – by having each inverted index held in both a primary and a replica shard, the indexes are more readily available

 

Starting Elasticsearch on Linux

Last week I installed and used Elasticsearch on a Windows machine, now I want to cover how to use Elasticsearch on a Linux machine:

  1. Download both Elasticsearch and Kibana (the versions used in my course were Elasticsearch 5.1.1 and Kibana 5.1.1; there are more recent versions of both available, so there may be version conflicts, which will be visible once you visit Kibana in your browser. If there are version issues, simply install the version of Kibana or Elasticsearch specified on the Kibana interface).
  2. Start two terminal windows. In one terminal navigate to the Elasticsearch directory and start Elasticsearch by writing in:

./elasticsearch-5.1.1/bin/elasticsearch

3. In the other terminal navigate to the Kibana directory and write in:
./kibana-5.1.1-linux-x86_64/bin/kibana

4. Now in your browser visit Elasticsearch by going to the URL:
http://localhost:9200

5. Also in your web browser visit Kibana by going to the URL:

http://localhost:5601/app/kibana
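To check Elasticsearch itself is up before opening Kibana you can also hit it from a terminal; it should respond with a small JSON blob containing the node name, cluster name and version:

curl http://localhost:9200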

Because you now interact with Elasticsearch through Kibana in the web browser everything is the same from this stage no matter what OS you are using.

 

Examples of Elasticsearch API commands

Create Index API – This creates an index which we can then use to index documents (create data records)

PUT student
{
  "settings" : {…},
  "mappings" : {…}
}

In this create index API command you can specify the number of shards and replicas you want the index to span. e.g.

PUT student
{
  "settings" : {
    "number_of_shards" : 1,
    "number_of_replicas" : 1
  }
}

Index API – Here I am specifying that I want to use the 'student' index, creating a 'degree' mapping type and specifying that the ID of the document I am indexing is 1. Then I index the document itself. By performing the index API I am automatically creating a mapping type of degree.

Note: Specifying the id value in the PUT command is optional.

PUT student/degree/1
{
  "name" : "Alexander Buckley",
  "alt_names" : [ "Alex" ],
  "origin" : "Nelson"
}

 

If I wanted to make sure I was only ever creating a new document (and not accidentally overwriting an existing document with the same ID), I could use the _create endpoint, which fails if a document with that ID already exists:

PUT student/degree/1/_create
{
  "name" : "Alexander Buckley",
  "alt_names" : [ "Alex" ],
  "origin" : "Nelson"
}

GET API – This retrieves the data for a specific document.

e.g.

GET student/degree/1

 

Exists API – This checks if there is a document with a particular ID in an index.

e.g.

HEAD student/degree/1

This is checking if there is a document with the id of 1 in the student index and degree mapping type.

 

Delete API – This deletes a document from an index by specifying the id of the document you want deleted in the Delete API command

DELETE student/degree/1

Write consistency – Before any of these write API commands is performed, more than half of the shard copies (the primary shard plus its replicas) need to be available, because it is dangerous to write when most copies are missing.

Versioning – Elasticsearch uses versioning to keep track of every write operation to a document. The version is assigned to a document when it is indexed and is automatically incremented with every write operation.
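One practical use of the version number is optimistic concurrency control. As a sketch (the version value of 2 here is just assumed), the following reindex will only succeed if the document's current version is still 2, otherwise Elasticsearch returns a version conflict error:

PUT student/degree/1?version=2
{
  "name" : "Alexander Buckley",
  "origin" : "Nelson"
}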

Update API – This allows you to update parts of a document. To do this you write the changes to specific properties of the document under the 'doc' key. So in the below example I am updating the name property of the document with the ID of 1. This will be merged with the existing document.

POST student/degree/1/_update
{
  "doc" : {
    "name" : "Alex Buckley"
  }
}

 

Multi Get API – This is where you can request multiple documents from a specific index and mapping type. What's returned is a docs array.

GET _mget
{
  "docs" : [
    {
      "_index" : "student",
      "_type" : "degree",
      "_id" : 1
    },
    {
      "_index" : "student",
      "_type" : "degree",
      "_id" : 2,
      "_source" : ["origin"]
    }
  ]
}

 

Bulk API – To perform multiple different API commands in one request. Elasticsearch splits the commands up and sends them off to the appropriate nodes. If two requests are requesting/manipulating the same node then they are sent together.

PUT _bulk
{ "delete" : { "_index" : "student", "_type" : "degree", "_id" : 2 } }
{ "index" : { "_index" : "student", "_type" : "degree", "_id" : 3 } }
{ "name" : "Jan Smith", "alt_names" : ["Janet Smith"], "origin" : "Wellington" }

In this example I am deleting a document from the student index and indexing (adding) another document, all with a single bulk API call. The benefit is that once the request has been routed to the nodes holding the student index's shards, multiple API commands can be performed together, which is obviously more efficient.

Search – Text analysis

Unlike the range of commands for the CRUD actions in the section above, for search we use the GET API with the _search endpoint. The request is sent from the Kibana client to the Elasticsearch cluster; the node that receives it forwards the search to the shards of the index, whose inverted indexes are checked for a matching search value (known as a token).

This Lucene inverted index contains 3 columns: the first column is the search token, the second is the number of documents it exists in (the document frequency), and the third is the IDs of the documents it exists in (the postings).

e.g.

token        docfreq.        postings (doc ids)

Janet          2                  3, 5

If we were using a GET API to find all instances of the word ‘Janet’ we would be returned with the documents 3 and 5.

When indexing a document you can use the index attribute to specify which fields you want to be searchable (a mapping sketch showing these values follows the list below). This attribute can have one of three values:

  • analyzed: Make the field searchable and put it through the analyzer chain
  • not_analyzed: Make the field searchable, and don’t put it through the analyzer chain
  • no: Don’t make the field searchable
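As a sketch of how these values appear in a mapping, here is a new index (assuming student does not already exist) using the pre-5.0 string field syntax that these three values come from; note that in Elasticsearch 5.x the text and keyword field types replace this:

PUT student
{
  "mappings" : {
    "degree" : {
      "properties" : {
        "name" : { "type" : "string", "index" : "analyzed" },
        "origin" : { "type" : "string", "index" : "not_analyzed" }
      }
    }
  }
}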

But what is the analyzer chain?

OK, so the values from the indexed documents are placed in the Lucene inverted index, and that is what is queried when using Elasticsearch as a search engine. If we have a string we want to be searchable then we often have to tidy it up a bit to make it more easily searchable; that's where the analyzer chain comes in. It performs the following steps:

  1. The first step is the char filter; this can, for example, strip out HTML syntax, e.g. the string "<h1> This is a heading </h1>" would become "This is a heading".

2. The second step is the tokeniser, which splits the string up into individual tokens (usually by breaking it on whitespace and punctuation).

3. The third step is the token filters, which transform those tokens before they are written to the inverted index. Depending on how the analyzer is configured they typically:

  • Remove stop words like 'a'
  • Make all letters in each word lower case
  • Replace similar words with their stem word. In other words the two words "run" and "running" are similar, so instead of writing them both to the inverted index we replace them with the single word "run". Replacing similar words with stem words is automated by a stemming algorithm.

Interestingly, all user query terms go through the same analyzer chain before they are compared against the inverted index, provided the user uses the 'match' attribute in their search query (which will be discussed below).
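A handy way to see the analyzer chain in action is the _analyze API. For example, using the built-in english analyzer (which lowercases, removes stop words and stems):

GET _analyze
{
  "analyzer" : "english",
  "text" : "The students were running"
}

This should return stemmed, lower-cased tokens such as "student" and "run", with stop words like "the" dropped.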

Search

Elasticsearch can perform two types of search:

  • Structured query – This is a boolean query in the sense that either a match for the search query is found or it isn’t. It is used for keyword searches.
  • Unstructured query – This can be used for searching for phrases and it ranks the matches by how relevant they are. It can also be called a fuzzy search, because it does not treat results in a boolean way (saying they're either a match or not, as the structured query does) but returns results that sit on a continuum of relevancy.

 

Search queries:

match_all: This is the equivalent of SELECT * in SQL queries. The below example will return all documents in the student index and degree mapping type.

GET student/degree/_search
{
  "query" : {
    "match_all" : {}
  }
}

Note: The top 10 results are returned by default, so even if you perform the match_all query you will still only get 10 results back. But this can be customized.
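For example, to get more hits back than the default 10 you can set the size parameter (and from for paging):

GET student/degree/_search
{
  "query" : { "match_all" : {} },
  "size" : 50
}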

 

If you want to search fields that have not been analyzed (i.e. haven't gone through the analyzer chain when the document was indexed) then you want to use the 'term' attribute.
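A minimal sketch of a term query, assuming the origin field was indexed as not_analyzed so the exact value is sitting in the inverted index:

GET student/degree/_search
{
  "query" : {
    "term" : { "origin" : "Nelson" }
  }
}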

However if you want to query a field that has been analyzed (i.e. it has gone through the analyzer chain) then you will use the match attribute.

e.g.

GET student/degree/_search
{
  "query" : {
    "match" : { "name" : "Jan Smith" }
  }
}

This means the term Jan Smith will go through the analyzer chain before it is compared against values in the inverted index.

The multi_match attribute can be used to find a match in multiple fields, i.e. it will put multiple search values through the analyzer chain in order to find a matching value in the inverted index.

GET student/degree/_search
{
  "query" : {
    "multi_match" : {
      "fields" : ["name", "origin"],
      "query" : "Jan Smith Wellington"
    }
  }
}

I will discuss the other search queries in my next Elasticsearch blog to prevent this one getting too long.

 

Mappings

When we index a document we specify a mapping type, which is kind of like a table in a relational database, or a class in the OO paradigm, because it has particular properties which all documents of that mapping type have values for.

The benefit of mapping types is that they make your indexes match the problem domain more closely. For example, by creating a mapping type of degree I am saying that the documents I index under it represent a specific kind of student.

To save time when creating indexes with the same mapping type we can place the mapping type in a template and just apply the template to the index.

e.g.

PUT _template/book_template
{
  "template" : "book*",
  "settings" : {
    "number_of_shards" : 1
  },
  "mappings" : {
    "_default_" : {
      "_all" : {
        "enabled" : false
      }
    }
  }
}

This template is applied automatically to any new index whose name matches the book* pattern, so I don't have to repeat the settings and mappings each time. Here is another template example, this time using dynamic templates to map any field ending in _i as an integer:

PUT _template/book_wildcard
{
  "template" : "book*",
  "mappings" : {
    "question" : {
      "dynamic_templates" : [
        {
          "integers" : {
            "match" : "*_i",
            "mapping" : { "type" : "integer" }
          }
        }
      ]
    }
  }
}

Note: It is recommended that you only assign a single mapping type to an index.

Conclusion

I have learned a lot from the Elasticsearch course and will continue to discuss what I learned in the next Elasticsearch blog.

 

Bibliography:

Christopher. (2015, April 16). Visualizing data with Elasticsearch, Logstash and Kibana. Retrieved March 26, 2017, from http://blog.webkid.io/visualize-datasets-with-elk/

Principe, F. (2013, August 13). ELASTICSEARCH what is | Portale di Francesco Principe. Retrieved March 26, 2017, from http://fprincipe.altervista.org/portale/?q=en/node/81

Glossary of terms | Elasticsearch Reference [5.3] | Elastic. (n.d.). Retrieved March 30, 2017, from https://www.elastic.co/guide/en/elasticsearch/reference/current/glossary.html

Research approaches

Today in class we all talked about the research method we each researched and wrote about in our research journals last week. Dejan and I talked about the argumentative research method, which I had found is also called the dialectic method, whilst he found it called argumentation theory; however, both the dialectic method and argumentation theory share the same fundamental idea of posing two ideas against each other to identify a truth and resolve the disagreement.

Here are the research methods that were covered in today's class:

Secondary research – Making use of existing information. It is about going out there and finding all the articles you can about the subject to identify the main themes (e.g. 50% of the papers discovered x); it is trying to integrate a whole collection of others' research and produce a new idea.

It is the foundation of just about everything you do in a research context. Secondary research can be useful for anything, for example integrating new data sets to find new knowledge.

The strengths of secondary research are that it is easy to access and, thanks to the internet, cheap.

The weaknesses of secondary research are that the information may not be exactly what you want, or relevant, and there is a lot of it.

Meta-analysis – This is secondary research that is interested in quantitative data only, specifically statistical data sets, i.e. papers that have statistical information in them.

It looks for exactly the same question asked of more than one sample, and it is widely used in pharmaceutical studies. It allows you to complete your own primary data collection on a smallish sample, making the primary research affordable, whilst using the findings of other studies investigating the same thing and treating them all as one study, which gives a more credible result.

Evidence based medicine – This uses meta-analysis to perform systematic reviews. A group of researchers (academic researchers and GPs) would get together and research the usefulness of a drug; they would gather primary research from all around the world, in English and other languages. They would then analyze the research, filtering out the non-credible and invalid primary research papers, and once they have identified the valid and credible studies they would perform statistical analysis on that good research.

They would then make this available to doctors so they didn’t have to make a judgement on individual studies proving the performance of the drug.

An example in IT is computer assisted instruction

The benefits of this approach are high credibility, better overall knowledge, and good quality of knowledge.

The weaknesses of this approach are that bias can still come through (it can compound bias already present in the initial studies) and bias can come from the people doing the meta-analysis.

It also costs a lot in terms of money and time, and there can be difficulty in finding negative studies.

Randomized control trials (RCT) – These are effectively randomized drug trials. They use blind (the testers don't know which treatment is which) and double blind randomized trials testing one variable. They are used by big pharmaceutical companies, but also for other kinds of medical interventions, to provide an unbiased study.

What it is trying to do is provide an unbiased study. The reason for wanting to reduce bias is that human bias can affect studies in many ways you can't imagine.

The person who knows which phial is the drug and which is the placebo is well up the ladder in the research team, because you don't want the doctors giving away which is which through their body language.

The weaknesses of this approach are that they are very expensive and time consuming.

The strengths of this approach are that the trials tend to give pretty clear results and remove most bias, providing the best evidence we can get.

Case study research – Focuses on a case (a person/group/company/event) and looks at one or more components of the case. The case can be a variety of things, e.g. a single company or a collection of companies.

e.g. a collection of companies' use of YouTube for marketing. You are trying to gain knowledge from a collection of cases for a specific variable or process.

Examples of case studies are:

  • Exploratory (pilots) find questions/measurements to be used on a larger scale
  • Illustrative/descriptive – To make the unfamiliar familiar by using common language (metaphor)
  • Explanatory – cause/effect relationships, how/why things happen. e.g. why did TradeMe succeed and Weedle fail
  • Cumulative – secondary case study – summaries on the case- integrating other information and refine them down to the single case you are studying

Examples in IT – Widely used to explore before doing a bigger study, to describe, and to gain greater insights into the impact of IT.

Case studies are useful in answering why and how questions; you won't come up with a definitive, final answer but you can learn a lot on the journey. You can look at the use of a technology in an existing company to see what changes were required by the company.

Strengths of this approach are that it is based on real life examples, so it is very practical and deals with real life issues. It provides in-depth analysis.

Bias – there is bias in you as the researcher, and bias in the information. A technique to reduce bias is triangulation, which requires you to find data from 3 very different sources. If all 3 points agree then the data is more likely to be credible; even if two sources agree and one strongly disagrees, it still gives you a somewhat objective viewpoint on the data.

Weaknesses are that it is non-generalisable (you can only draw common sense generalisations, as case studies are specific to a particular situation; e.g. if you apply a strategy to a different company it may not work because of different factors such as work culture) and hard to repeat. The case also loses its context (and therefore may not be useful).

You might do a single instrumental case study where you learn about a single case, such as a single company, and then take the insights about that company and possibly apply them to other companies. So although case studies are not generalisable, you may be able to apply the insights to other companies.

Within a case study there may be:

Observational research – Observing behavior of something in its natural context (not just people).

It is very much an initial approach for when you're not quite sure what you're looking at, e.g. observing people's behavior when they are faced with new software. You could look at where they click, etc. You could then, for example, work out what a good website design for people with eye defects would be by observing them.

Examples in IT are in the use of software, user experience research, and the interface between people and systems.

Strengths of this approach are that it gives a close view of what's happening (with people vs the interface). Additionally it gives a viewpoint to people who can't give an opinion in any other way.

Weaknesses of this approach are that it is subjective (based on the researcher's interpretation), time-consuming, and there can be ethical concerns (particularly with covert observation).

Types of observation – Naturalistic – Observing behaviors in a natural setting, but you have no input

Participant – Observing behaviors in a natural setting, but you have active participation.

Laboratory – Observing behaviors in a lab environment which can or cannot have researcher participation.

Observational research, particularly when it is about people, is open to the researcher's interpretation.

Interviews – A conversation between two people to extract specific information. The interview can be structured (with set questions), free flowing (unstructured), or semi-structured.

They can be used as a follow up to other research, e.g. after a survey, or as a prelude to a survey.

You could use a structured interview if you know what kinds of answers you want and/or what questions you want to ask, possibly to encourage deeper/critical thinking. Structured interviews allow for comparison once the interviews of multiple people are complete; they don't give such a rich picture of what one person thinks, but they allow you to say, for example, that two out of the ten people agreed on a particular point.

If you want to get credibility for a thought or compare answers then a structured interview is advantageous.

But if you want to investigate something in depth then unstructured or semi-structured interviews are best; e.g. a PhD interview which starts with structured questions to compare the student's responses to other students' responses, followed by an unstructured interview where responses the student gives are picked up by the interviewer and a follow up question is asked for clarification and/or expansion.

There are differences between direct and indirect questions. Silence is an important tool.

Examples in IT are where you are trying to find out a person's opinions; interviews can also be used to investigate a person's knowledge of a topic in a richer sense. By interviewing people in different job roles you are getting viewpoints on, for example, a software system's implementation in an information-rich way.

Strengths of this approach are that it is flexible, and useful for getting good detail and clarification.

Weaknesses of this approach are that leading questions from the interviewer can influence the response of the interviewee. It is also prone to bias, expensive, and time consuming; you may not interview the right people; people are not always truthful; and people will tell you what they think you want to hear.

Next week Dejan and I will be talking about the argumentative method.

Argumentative research method

In our Wednesday class we were given a number and asked to research and answer 5 questions about the research method listed on the Google Doc with the corresponding number; Dejan and I got the argumentative research method.

So lets get into this:

  • What is it ? (Short description of how it works)

Arguments are used to persuade someone to do something or to agree with an aspect of your worldview. This is achieved by two or more different viewpoints being promoted by two or more parties; each viewpoint is backed up by logical reasons (e.g. objective facts) and/or emotional reasons (e.g. subjective truths) to encourage the other party to agree with your viewpoint.

That is how we describe the concept of arguments in a general sense; relating this back to research, the argumentative method is known as the dialectic method. It is a way to look at secondary research and debate opposing viewpoints to elicit the truth.

Dialectic Method is usually split up into 2 sub-methods: the Socratic method and Hegelian dialectic.

Dialectic research is a form of qualitative research because it uses words and it seeks a subjective truth that both parties can agree on. In other words it deals with people and opinions rather than scientific measurements (quantitative data).

Dialectical research works by comparing a thesis (an idea declared as the truth by one party known as the claimant), with an opposing idea (known as an antithesis). Both the thesis and the antithesis are backed up with reasoning based on facts elicited through others primary or secondary research and the end point of the dialectic method is a synthesis of ideas.

Sources:

https://en.wikipedia.org/wiki/Dialectic

https://en.wikipedia.org/wiki/Dialectical_research

http://www.dcs.gla.ac.uk/~johnson/teaching/research_skills/research.html

  • What kinds of questions/problems might it be useful for?

The dialectical method is used in research (specifically secondary research) to compare two or more opposing viewpoints which are backed-up with objective and subjective facts to determine which viewpoint is a subjective truth that all parties can agree on. In other words it is the identification of a truth based on opposing primary and secondary research of others.

Therefore Dialectic research would be useful in secondary research to compare the conflicting findings of multiple authors and argue for which is the likely truth.

Being a form of qualitative research, dialectic research fits into the social scientific paradigm, which has a constructivist ontology and an interpretivist epistemology. This paradigm contains other qualitative research methods such as interviews and surveys; however, dialectic research takes opposing ideas found through primary research methods such as interviews and surveys and determines which one everyone agrees on.

So dialectic research is useful for evaluating primary research and solving problems that have had primary research performed, but no agreement of different parties.

Source:

http://www.mbaskool.com/business-concepts/marketing-and-strategy-terms/2232-dialectical-inquiry.html

  • How could it be used in IT research  (try to think of an example)?

Dialectic research can be used in IT research on topics where primary research has been performed and there is still no agreement between different stakeholders, this is because by comparing the ideas and reaching a synthesis the disagreement is solved and action can be taken. Examples are:

  • In a UX team one designer believes that an app they are working on should have a onboarding tool feature to show the user how to use the app, however another designer believes that the app is intuitive enough that this feature is not necessary. After qualitative research, such as multiple surveys by both parties has been conducted, both the designers perform dialectic research to compare the thesis (that the onboarding tool is necessary) and antithesis (that the tool isn’t necessary) and they reach the synthesis that they will recommend the development of an onboarding tool to the client.
  • The decision making process for legislators to decide what limits and constrictions they should place on A.I. research and development

Source:

http://www.dcs.gla.ac.uk/~johnson/teaching/research_skills/research.html

  • What are the strengths of the approach?

Dialectic research is useful in resolving disagreements of opinion, by outlining all opinions and reaching a synthesis of opinion. Therefore it is useful in the analysis of primary research, i.e. it is useful in the decision making process and in performing secondary research.

Dialectic research is very useful in reaching a consensus between all stakeholders allowing a process, implementation or inquiry to move forwards.

Regarding what makes an effective argument in this approach, certain attributes should be met:

  • The thesis and antithesis need to be worded clearly so you know what each is arguing for.
  • Both sides of the argument need to be based on objective facts, because this increases the credibility of the thesis and antithesis by showing they actually have a logical and objective basis for what they are arguing.
  • The argument is believable; in other words it is similar enough to the worldview of the other party (which is based on their personal biases) to seem credible.
  • The facts used to back up the thesis and antithesis are valid and credible (which can be determined using the 4 indicators I outlined in the 'What is credibility and validity' research journal entry), and are actually relevant to the argument.

Source:
http://sites.stedwards.edu/marier-engw1307/files/2013/01/Judging_Stren_Weak-1ra34md.pdf

  • What are the weaknesses of the approach?

The facts that back up the thesis and antithesis in dialectic research come from primary research, so this approach is not useful for gathering quantitative or qualitative facts as primary research.

The synthesis outcome of dialectic research is very much based on how well both the thesis and antithesis were articulated, so even if there is a greater quantity of valid and credible objective facts to back up one side of the argument it can still be discounted by the other party if it was not articulated effectively.

A good example of this is the dialectic method used in the decision making process of politics: if the Labour party opposes a policy proposed by the National party with an argument based on a lot of credible and valid objective facts, but does not articulate this objection clearly enough, then the public are unlikely to agree with the Labour party.

Meanwhile, when evaluating primary or secondary research this approach is not useful if: the claim being made is unclear, so the reader does not know what viewpoint is being argued for; the argument is based on emotions rather than objective facts; the claim being made is not believable; or the objective facts used to back up the argument come from biased sources, are not valid and credible, or are not relevant to the claim being made.

Source:

http://sites.stedwards.edu/marier-engw1307/files/2013/01/Judging_Stren_Weak-1ra34md.pdf

What is DevOps? and how does it relate to Agile methods?

Along with myself there were several other full-time summer IT interns at work over the summer, including one who was in the DevOps team. When I heard he was working in that team I wondered: what is DevOps?

Like I said in my previous research journal I am going to investigate what DevOps is, but I also want to see how (if at all) it relates to Agile methods.

What is DevOps?

devops.jpg

(Vashishtha, n.d.)

The above image illustrates the key idea of DevOps beautifully. Put simply DevOps is the increase in communication, collaboration and automation in and between the development and production environments.

What is the current software deployment model?

To understand DevOps I wanted to understand where we are with traditional systems deployment and what the issues are that DevOps attempts to solve.

Software development companies are generally structured in teams (I know from my experience this is an important concept in modern development organizations).

The work of two teams in particular affects the profitability of the companies software, these teams are:

  • Development team – They actually build the software in the development environment
  • Operations team – They deploy the software to the production environment which they maintain.

Now, in traditional software development companies (I am referring to companies that have not implemented DevOps) there is often a level of mistrust between these teams, due to one major issue:

  • The development environment (that developers work in) and the production environment (that operations maintain) are configured differently, meaning that when code is deployed into the production environment it takes time for the operations team to get it working successfully, slowing down the whole software deployment process

Now, I would have thought it was common sense for the development and production environments to be as identical as possible, so systems/features built in the development environment could be seamlessly deployed to the production environment, but this has not been the case.

The sort of problem dissimilar environments cause is that the production environment is less forgiving of software exceptions than the development environment, so an exception that causes no observable error or warning in the development environment can crash the system in the production environment. Not good when trying to deploy software on a tight deadline.

It is the operations team that has to fix up the code for the production environment before it can be released to the customer, and because this just adds another job to their task list, a level of mistrust develops between the development and operations teams.

The development team, meanwhile, gets annoyed at the operations team because the time it takes to deploy the code they write holds the development team up from releasing new systems/features.

This gridlock slows down the whole software deployment process, which has a business cost, because remember IT is just there to help businesses and organizations. The cost is that the competitive advantage of meeting a customer's needs or filling a business niche may be taken by a faster-deploying competitor.

How can DevOps help?

I look at DevOps as a metaphorical combination of a communication course and the equivalent of an industrial revolution in software deployment.

What? Let me explain with several points:

  1. DevOps attempts to increase the collaboration of the development and operations teams, thereby speeding up the time it takes to deploy software to the customer. This collaboration is like a communication course of sorts, as it makes the two teams communicate more so their systems can become more alike.

2. DevOps attempts to free up more time for both teams by automating the software deployment process as much as possible. This means automating the testing, deploying, and monitoring of software in both the development and production environments using a set of tools.

Therefore I view DevOps as the industrial revolution of IT systems development, because like with the Industrial Revolution of the 18th and 19th centuries DevOps tries to automate as many tasks as possible allowing the workers to work on what can’t be automated.

Another change DevOps makes is that it attempts to change the mindset of both teams: instead of working on big new features for existing systems, it promotes the development of small code releases that can be quickly tested, deployed and monitored in the production environment by automated tools.

The benefit of getting small chunks of software out to the customer quickly, rather than big chunks of software more slowly is that the company can gain the competitive advantage by filling a business niche with its quickly evolving system as opposed to missing out to faster competitors.

What are the tools that DevOps uses to implement these points?

To be able to build small chunks of code and automate the testing of them the organization will need to implement a tool like Jenkins (https://jenkins.io/) (Rackspace, 2013).

They will also need a source control tool such as Git (Rackspace, 2013).

Tools that allow them to configure their environments and automate the deployment of code to servers in the production environment are tools like Puppet (https://puppet.com/) (Rackspace, 2013).

The tools they use for application monitoring work by monitoring the system logs; these will be tools like New Relic. The benefit of this kind of tool is that it can monitor the system logs of thousands of servers and inform both teams of any issues with the new code in the production environment (Rackspace, 2013).

Basically tools like New Relic make sense of vast quantities of data in much the same way (obviously on a much smaller scale and without the machine learning aspect) as systems like IBM Watson which trawl through vast quantities of data finding patterns and presenting insights (Rackspace, 2013).

How do the principles and values of DevOps and the Agile methods work together?

So Agile methods, as I discussed in a previous research journal entry, are a set of values and principles to help development teams make decisions in the development of a product for a user.

This interesting YouTube video describes that the relationship between Agile and DevOps is that an Agile mindset exists from the:

  • User to the development team
  • and DevOps is from the Development team to the Operations team.

In other words they do not exist at the same time; this view is further backed up by this article in Information Week (http://www.informationweek.com/devops/agile-vs-devops-10-ways-theyre-different/d/d-id/1326121?image_number=11).

Now, having looked at these resources, my own opinion is that there are minor differences in the way these two concepts are implemented. For example, documentation is not valued as highly as working software in Agile methods, whereas in DevOps documentation is required because the development team is handing the product to a new team (the operations team) to deploy; the operations team has not worked on the product and so they require documentation to understand the system, something anyone who has worked on an existing system they didn't build will understand.

However, despite these minor differences, I am amazed at how similar DevOps is to Agile methods in many ways. DevOps changes the mindset of a software development organization so that it deploys software faster, which allows the development and production environments to take a more Agile approach to developing and deploying small releases, which happen more frequently than before DevOps was implemented.

So I believe that yes, Agile and DevOps cover different parts of the systems development life cycle, with Agile methods covering the initial development of the product whilst DevOps covers the deployment of the product; however, the common fundamental concepts of smaller, more frequent releases/deployments of software over one huge release, and increased communication between and within teams, link these concepts together.

Interesting DevOps resources:

http://www.agilebuddha.com/agile/x-htm/

Bibliography:

Vashishtha, S. (n.d.). Demystifying DevOps : Difference between Agile and DevOps. Retrieved March 21, 2017, from http://www.agilebuddha.com/agile/x-htm/

Rackspace. (2013, December 12). What is DevOps? – In Simple English – YouTube. Retrieved March 24, 2017, from https://www.youtube.com/watch?v=_I94-tJlovg

Good Software Design: What is a design pattern?

Like I said in my previous Good Software Design research journal entry I am learning about design patterns in SDV701 at the moment and so to reinforce my knowledge I have decided to define and refine what I think design patterns are and how they can be implemented.

What are design patterns?

A design pattern is effectively a solution to a problem (which could have many possible solutions) that has been discovered and implemented successfully by other developers.

These design patterns are not language specific; instead they are general solutions you can adapt to your programming language and specific application.

For example, when developing a website providing access to sensitive data, I as the developer would naturally implement an authentication system such as a login system. However, this is actually a design pattern: although I see authentication systems on many websites every day, the concept must have been conceived by one developer/development team originally, and others chose to implement it because it looked to be an advantageous way to solve the problem of how to protect sensitive data from unauthorized users.

We don’t think of an authentication system as a design pattern, but instead we think of it as the natural consequence of deciding the website will hold sensitive data.

Why is this? Well I believe this is because as developers the use of authentication systems has become part of our socially constructed reality, rather than something we consciously regard as a design pattern solution.

By recognising that an obvious (or not so obvious) solution is a design pattern we are actually giving the pattern a name; this is the benefit of learning design patterns, as you can then identify them by name.

Below is the range of pattern categories that Matthias discussed with us in class today:

design pattern.PNG

(Otto, 2017)

Pattern categories:

Analysis patterns – An analysis pattern is identified when you first go into a business and spot that they want to build a system that has similar needs to other systems you have built or learnt about, so you can take the design patterns used in those similar systems (like the authentication system I was talking about before) and implement them in this new organization's system, thereby fulfilling a similar business requirement.

Another good example that Matthias came up with was that if you were developing an e-Commerce website you would likely implement a shopping cart concept. Again this is because it has been used in other similar systems to meet a similar business need.

Architectural patterns– These are high level programming concepts for example MVC (Model View Controller).

Another example Matthias mentioned that I was not aware of was ORM (Object Relational Mapping), which goes from the OO world to the relational world of databases. It maps objects to the relational world of rows and columns in database tables.

Other examples are:

Client-server

Peer to peer

Design pattern categories

These are between the analysis and architectural patterns; design patterns are the building blocks.

Design – > General -> Fundamental design patterns:

e.g. Separation of concerns. You separate out the form stuff from the business stuff.

Expert – Give the responsibility of something such as enrollment to a class that is an ‘expert’ in it such as clsStudent.

High cohesion – How well do things belong together.

Low coupling – If you have two separate classes make sure they do not depend on each other too much, because a high coupling can mean changes to one class inadvertently alter the behavior of other classes.

Polymorphism

Design – > General ->Grasp patterns: 

These are the 10 commandments of good Object Oriented programming (which can be found here: https://dzone.com/articles/ten-commandments-of-object-oriented-design)

For example my favorite is the single responsibility principle which states that every class and method must only be responsible for one aspect of the problem domain.

Design – > Gang of Four (GoF): The GoF catalogue contains 23 design patterns that are either structural, creational or behavioral.

What are the benefits of patterns?

  • Allow re-usability of proven principles
  • Improve developer communication, as they give you words to describe good practices

What are the cons of patterns?

  • You can reuse the idea but you can’t copy and paste the code from one project to another, as design patterns differ slightly in their implementation depending on what system and programming language they are used in
  • It is labour intensive, as you have to go through your code and replace code multiple times and in multiple places; this isn't yet something that can be fully automated.

What are the criticisms of design patterns?

Design patterns can contain duplicate code thereby violating the once and only once principle.

An example of one of the GoF design patterns:

Singleton: You use this pattern where you only want one instance of a object in your system.

e.g. you only want one instance of:

  • Clock
  • Current printer
  • Database connection – A single database can be accessed by multiple connections simultaneously, but there should only be one pool, e.g. several processes can request access, hold a short connection, and then the connection is freed up again.
  • Printer queue – All jobs for printer, but only one printer queue. Because there would be a risk for conflict if a single system had two printer queues.

In systems where you only want one instance of an object existing at a time, one class must be given the responsibility of ensuring there is only one instance. Now, the expert fundamental design pattern makes us give the responsibility for something in our system to the class which is the 'expert' in it.

Well in the case of singleton design pattern that means giving the responsibility of ensuring there is only one instance of the class we are applying the singleton design pattern to, to the class itself.

In other words singleton pattern moves the responsibility for making sure there is only a single instance object to the class itself.

Using the clock example again, the clock class itself must be responsible for making sure there is only one instance of clock, as the clock class is the expert in clocks.

How to implement singleton?

  1. Make the constructor of the object private. This means other classes can no longer instantiate it directly
  2. Give other classes controlled access to the singleton object via a factory method in the singleton class (a method that returns an instance of its own type)

The factory method checks whether there is an extant singleton object: if the instance is null, it creates one.

The singleton contains a pointer to an instance of itself.

A singleton is lazy, meaning it is created on demand only: it is only created when the factory method is first called.

Now we have transferred the responsibility for managing the number of singleton objects existing in the system to the singleton class we need to consider the multi-user and multitasking ability of the system.

What do I mean by this? Well, think about many people accessing the Facebook database at the same time (which is extremely likely given the number of users that could be connecting to a single database server at any point in time, just based on the number of worldwide users).

For the database to support both queries simultaneously, it does what an OS does when running multiple processes on a single processor: it multi-threads, meaning the processor time is split up into timeslices and each slice is handed to a process.

However, this can cause an issue with singleton classes: the singleton object can be created more than once, i.e. you can end up with more than one singleton object at a time. The singleton factory method (the method called by other classes when they want an instance of the singleton class) checks whether the singleton object is null (i.e. there is no singleton object in the system at the moment) and instantiates the singleton object only if that condition is met.

And so, if the factory method is called from two threads at once in a multi-threading environment, we could end up with two singleton objects at the same time, which could crash the system.

What can we do to fix this problem? We need to make  the singleton object thread safe.

3. To make the singleton object thread safe, Matthias showed us that the most elegantly simple way was the following changes:

  • You make the singleton class 'sealed', which means you cannot extend it (create a subclass of the singleton class).
  • You also make the singleton initialisation atomic, meaning it cannot be interrupted half way through if the processor's timeslice for that thread runs out, so you will not end up with two singleton object instances.

e.g.

using System.Collections.Generic;

// sealed means you can't extend (set up inheritance from) the singleton class
sealed class clsNameComparer : IComparer<clsWork>
{
    // private constructor: no other class can instantiate clsNameComparer directly
    private clsNameComparer() {}

    // the single shared instance; the runtime guarantees this static readonly
    // initialisation runs only once, which makes it thread safe
    public static readonly clsNameComparer Instance = new clsNameComparer();

    // the comparer's actual job: order clsWork objects by name
    public int Compare(clsWork x, clsWork y)
    {
        string lcNameX = x.Name;
        string lcNameY = y.Name;
        return lcNameX.CompareTo(lcNameY);
    }
}

4. Perform a secondary refactoring in the client (the caller of the singleton object)

mySingleton Instance = mySingleton.Instance;

Conclusion

I have found writing this up a good way to think about design patterns differently and to come up with examples not mentioned in class, so it has been a valuable exercise. I hope it has been interesting for you; essentially a design pattern is just taking a model answer from someone else and adapting it for your own system.

Bibliography:

Otto, M. (2017, Semester). SV701 – Advanced Software Development. NMIT. Retrieved from https://livenmitac-my.sharepoint.com/personal/matthias_otto_nmit_ac_nz/_layouts/15/WopiFrame.aspx?sourcedoc=%7Bd180f53c-726e-4cc7-ae73-661e96d42d2a%7D&action=default