Final thoughts

For my final research journal entry I would like to wrap up the learning I have incidentally done whilst researching automated testing tools in preparation for my PRJ701 course. This learning is about Test-Driven Development (TDD), a practice which comes under the Extreme Programming (XP) Agile framework.

As I have discussed in multiple previous blog posts (‘How can Agile methods and DevOps improve IT project success rates?’, ‘What is DevOps? and how does it relate to Agile methods?’, and ‘What is Agile methods, Scrum, and Kanban?’), Agile is a mindset (you could also call it a way of thinking) based on four values and 12 principles outlined in the Agile Manifesto, which was written in 2001 (see https://www.smartsheet.com/comprehensive-guide-values-principles-agile-manifesto for the principles and values).

The methodologies that follow the Agile mindset are called frameworks, and XP is one of them.

XP is very much a technology-focused framework due to its use of Pair Programming, Test-Driven Development, Refactoring and Continuous Integration; all of these practices are software specific, as opposed to Scrum, which can also be applied to non-IT projects.

Test-Driven Development is where the developer writes unit tests (these test a small chunk of the code, for example a single function or small piece of functionality) before actually writing the code that will be tested. When I say the developer writes the test, this is done in an automated testing tool such as Selenium, and these are automated test plans. At this stage the test will fail because there is no code written yet for it to test.

The first benefit of writing the test before starting development is that it forms executable documentation, i.e. it documents the user requirements, because each unit test is based on the user requirements broken down to their smallest components. So by writing the unit tests you are also documenting the user requirements at the same time.

Secondly, you are ensuring that all the software will be tested, because the test is written before the feature itself, meaning the developer can test as they go; they do not have to find time to write a test after the feature has been written.

And thirdly, you ensure that the user requirements documented in the automated unit tests are met and that no extra, unnecessary code or features are added. This benefit of TDD basically keeps development time and costs down by keeping the development team focused.

Another thing to keep in mind with TDD is that refactoring takes place after the developer has written the code and the unit test passes.
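To make the red-green-refactor cycle concrete, here is a minimal sketch in PHP with PHPUnit (my own illustrative example rather than anything from the sources; the Basket class and its methods are hypothetical). The test is written first and fails, then just enough code is written to make it pass, and refactoring happens afterwards with the test as a safety net:

<?php
use PHPUnit\Framework\TestCase;

// Step 1 (red): write the unit test first. Running it now fails
// because the Basket class does not exist yet.
// (In a real project the test and the class live in separate files.)
class BasketTest extends TestCase
{
    public function testTotalSumsItemPrices()
    {
        $basket = new Basket();
        $basket->addItem(2.5);
        $basket->addItem(1.0);

        $this->assertSame(3.5, $basket->total());
    }
}

// Step 2 (green): write just enough production code to make the test pass.
// Step 3 (refactor): tidy the code up, re-running the test after each change.
class Basket
{
    private $prices = [];

    public function addItem(float $price): void
    {
        $this->prices[] = $price;
    }

    public function total(): float
    {
        return array_sum($this->prices);
    }
}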

BDD, by contrast (as I discussed in my blog ‘What is the Behat Framework and the Repository design pattern?’), concentrates more heavily on the client rather than on the development team and its agility.

BDD does this by attempting to make the ongoing intra-team and inter-team communication (i.e. within the development team and between the development team and the stakeholders, specifically the client) easier through the development of a common glossary of terms known as the Ubiquitous Language.

These terms are already familiar to the client (because they are based on the problem domain the system is being built for), but it is useful for the development team to use them too, so that when developers with different levels of knowledge and experience in the problem domain collaborate or communicate they have a common set of terminology they all understand. I would say this is especially useful in a remote team. I know from my own remote work that having terms documented in the Koha Wiki (I work in the Koha team at Catalyst in Wellington) is incredibly helpful, because I can understand what people are saying without having to send a clarifying message or email, which takes longer to get a reply to than face-to-face communication.

In BDD the developer writes behaviour tests before coding begins. These tests cover a larger piece of functionality than the unit tests written in TDD, and the trouble with them is that they are like a black box: you supply a specific input and expect a particular output, and if you don’t get that output the test fails, but you don’t know which function failed.

TDD unit tests, by contrast, are more useful for troubleshooting because they are more fine-grained, so it is easier to identify what is causing an error to be thrown. That said, unit tests are not the only automated tests written in TDD; integration tests are also written to check whether large parts of the system work together as intended. The combination of unit and integration tests in TDD is a great advantage from a troubleshooting point of view because you have both the fine-grained and the big-picture tests.

So, in conclusion, in my opinion TDD is more development focused, making the troubleshooting of failed tests easier through the combination of unit and integration tests, whilst BDD is more client focused, and troubleshooting a failed behaviour test is slightly harder to perform.

Thanks very much for reading, and thanks to Clare (hope you feel better soon) and Belma for teaching RES701. I have really enjoyed the thought-provoking topics and I am really glad I chose to take this paper 🙂

 

Sources:

Farcic, V. (2013, December 20). Test Driven Development (TDD): Example Walkthrough | Technology Conversations. Retrieved June 8, 2017, from https://technologyconversations.com/2013/12/20/test-driven-development-tdd-example-walkthrough/

Kumar, M. (2012, November 5). What is TDD, BDD & ATDD ? – Assert Selenium. Retrieved June 8, 2017, from http://www.assertselenium.com/atdd/difference-between-tdd-bdd-atdd/

Test-driven development – Wikipedia. (2017, April 21). Retrieved May 29, 2017, from https://en.wikipedia.org/wiki/Test-driven_development

 

 


What is the Behat Framework and the Repository design pattern?

In this research journal entry I want to cover two separate topics: The Behat automated testing framework and the repository design pattern.

Behat

[Image: behat.png (Pirrotta, 2015)]

Over the last two weeks I have posted research journal entries about automated testing tools, both open source and proprietary, in preparation for the research and experimentation subproject I will perform for my PRJ701 project to identify a suitable automated testing tool for the Koha project. One automated testing tool which was not outlined in the articles I discussed in those entries was Behat.

This article sums up the benefits of Behat nicely, comparing it against another automated testing tool, Selenium: https://webservices.uchicago.edu/learningcenter/article/a_n_introduction_to_automated_testing_with_behat/

I first heard about Behat at Catalyst IT (my work in Wellington) because it was being used in the Mahara team (Mahara is an e-Learning platform where users can create a portfolio outlining their learning and achievements) for user interface testing.

A developer in the Mahara team named Rebecca kindly showed me the high-level test plans for Behat, and they looked like the simple English I first learnt about in SDV501; the test plans are actually written in a syntax called Gherkin, which is easy to write and understand even for people with limited programming experience.

Here’s an example of a high-level Behat test plan (also known as a Gherkin spec), written in Gherkin, from the Behat website. It isn’t what is actually executed when the Behat automated testing tool performs the test; instead it can be likened to a use case of a requested feature, in this case the terminal command ‘ls’:

Feature: ls
  In order to see the directory structure
  As a UNIX user
  I need to be able to list the current directory's contents

  Scenario: List 2 files in a directory
    Given I am in a directory "test"
    And I have a file named "foo"
    And I have a file named "bar"
    When I run "ls"
    Then I should get:
      """
      bar
      foo
      """

(“Quick Intro to Behat — Behat 2.5.3 documentation,” n.d.).

As you can see it is outlining:

  • The title, which in this case is the ‘ls’ feature.
  • The story – This outlines the reason for having the feature and needs to be written in a very particular way: In order… As a… I need to…. The functionality for the ‘I need to’, in this case listing the contents of the current directory, is what this feature provides.
  • Scenario – This describes a successful outcome of the feature, in this case listing 2 files in a directory.
  • Steps – These are written in the first person and follow a particular pattern:
  1. Given… (the background actions taken to set up the scenario, which in this case is creating and navigating into a directory named ‘test’ containing only two files named foo and bar)
  2. When… (where the feature is run, so in this case running the command ‘ls’)
  3. Then… (the expected outcome of the test in order for it to be considered successful, in this case displaying ‘bar’ and ‘foo’)

Now, something I found interesting in the aforementioned article is that it states the benefit of Behat is that it facilitates the Behavior-Driven Development methodology (which I first covered in the research journal entry named ‘What open source automated testing tools are available’ last week). This is because you can take the domain-specific terminology of the client’s problem domain, and that will form the Ubiquitous Language (U.L.) of the project, which makes communication between the client and the developers easier and more consistent.

After writing up the Gherkin specs (Gherkin is flexible enough that it can contain the UL), the next thing to do is write step definitions, which are what the automated testing actually runs. They contain calls to the functions in the system’s code, handing in parameters and testing the returned values (see the example below; note it is not for the ls example above).

    /**
     * @Given the recommendation phase is open
     */
    public function theRecommendationPhaseIsOpen()
    {
        $open = new DateTime("-1 week");
        $close = new DateTime("+2 months");
        $this->setPhase("RecommendationActual", $open, $close);

        expect($this->app->isOpen('RecommendationActual'))->toBe(true);
    }

    /**
     * @When I (try to) access the recommendation form
     */
    public function iAccessTheRecommendationForm()
    {
        $this->visitPath(sprintf('/recommender/index.php?key=%s', $this->accessKey));
    }

    /**
     * @Then I should see the recommendation form
     */
    public function iShouldSeeTheRecommendationForm()
    {
        $this->assertSession()->elementExists('css', "div.content h3:contains('Recommendation Form')");
        $this->assertSession()->elementExists('xpath', '//form[starts-with(@action, "recommendation.php")]');
    }

(McElwain, 2015)

Even the step definitions can contain the UL, by making the function names reflect the UL terms. For example, in the step definition example above (taken from the previously mentioned article) the function name is iShouldSeeTheRecommendationForm() (McElwain, 2015).

This clearly reflects the problem domain and is not an arbitrary function name like CheckForm().

A very interesting comment in this article was that using UL in the step definitions, rather than technical terms for U.I. (User Interface) elements, makes the automated tests more stable. With Selenium you refer to elements by technical details, such as a button in a specified location, which can move around as other developers make U.I. changes, causing Selenium to fail because it cannot find the button where it expects to. With Behat you run the test against generic UL terms, which makes the tests more stable and flexible when there are U.I. changes (McElwain, 2015).

I definitely believe this, because it is what I heard from my team’s technical lead Chris, who said they had tried using Selenium in Koha before but found it very unstable, partially because any U.I. change can cause the tests to fail. So I will be interested to experiment with Behat on some of the Koha patches to see how it is more flexible by integrating UL terms into the tests.

 

Repository Design Pattern

The second topic I want to cover is the repository design pattern. In a previous research journal entry (‘Good software design: What is a design pattern’) I covered the design patterns we needed to learn about for our SDV701 exam last term. Since then I have learned about another design pattern, the Repository design pattern, whilst doing my WEB701 project 2 assignment (the development of a Laravel app with search functionality integrated using Elasticsearch).

Let me start by defining what Laravel is: it is a PHP MVC framework. MVC stands for Model-View-Controller, and it is an architectural pattern you follow to separate out the concerns of the U.I., the system logic and the data objects. When using Laravel to develop web apps you can reduce the coupling between the controller and the model by introducing a repository, as you can see in the diagram below.

[Diagram: repository_pattern.png (Pasic, 2015)]

With a repository in place, the controllers (shown as business logic in the diagram above) get data from the model (shown as the data source in the diagram) via the repository, which acts as an intermediary.

This article (https://bosnadev.com/2015/03/07/using-repository-pattern-in-laravel-5/) outlines the benefits of the repository design pattern, which in essence are: by separating the controller and the model you gain a greater ability to test and maintain each of them, because you have lowered the coupling, and low coupling with high cohesion is always important to achieve in programming (Pasic, 2015).
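As a rough sketch of the idea (my own illustration rather than code from the article; the Cat Eloquent model and the class names are hypothetical), the controller depends on a small repository interface instead of querying the model directly:

<?php

// The contract the controller depends on.
interface CatRepositoryInterface
{
    public function all(): iterable;
    public function find(int $id): ?Cat;
}

// An Eloquent-backed implementation; the data-access details live here.
// Assumes a Cat Eloquent model (class Cat extends Model) exists elsewhere.
class EloquentCatRepository implements CatRepositoryInterface
{
    public function all(): iterable
    {
        return Cat::all();
    }

    public function find(int $id): ?Cat
    {
        return Cat::find($id);
    }
}

// The controller only knows about the interface, so it can be tested with an
// in-memory fake repository, and the data source can be swapped (for example
// for Elasticsearch) without touching the controller.
class CatController
{
    private $cats;

    public function __construct(CatRepositoryInterface $cats)
    {
        $this->cats = $cats;
    }

    public function show(int $id)
    {
        return $this->cats->find($id);
    }
}

In Laravel you would then bind CatRepositoryInterface to EloquentCatRepository in a service provider so the framework injects the concrete repository for you; that binding is the only place that needs to change if you swap the data source.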

So in this week’s research journal entry I have covered what Behat is, including its point of difference, the Gherkin syntax, which lets users integrate the UL of the problem domain into the Gherkin specs and step definitions, as well as the repository design pattern, which was unknown to me until a few days ago.

Bibliography:

Quick Intro to Behat — Behat 2.5.3 documentation. (n.d.). Retrieved June 5, 2017, from http://docs.behat.org/en/v2.5/quick_intro.html

McElwain, G. (2015, October 19). An Introduction to Automated Testing with Behat | Web Services. Retrieved June 5, 2017, from https://webservices.uchicago.edu/learningcenter/article/a_n_introduction_to_automated_testing_with_behat/

Pasic, M. (2015, March 7). Using Repository Pattern in Laravel 5 – Bosnadev – Code Factory. Retrieved June 2, 2017, from https://bosnadev.com/2015/03/07/using-repository-pattern-in-laravel-5/

Pirrotta, G. (2015, February 17). Behat-Gherkin/Mink: One Translator to Rule Them All – Giovanni Pirrotta. Retrieved June 5, 2017, from http://giovanni.pirrotta.it/blog/2015/02/17/behat-gherkin-mink-one-translator-to-rule-them-all/

What open source automated testing tools are available?

In this research journal entry I want to investigate what open source automated testing tools are available and what their points of difference are.

This article introduces six such tools: https://techbeacon.com/6-top-open-source-testing-automation-frameworks-how-choose

Something I have learned from reading through this article is the pair of concepts Test-Driven Development and Behaviour-Driven Development. Basically these are methodologies that come under the Agile umbrella in much the same way as Scrum, Kanban and DevOps (the latter two of which I have discussed in previous research journal entries).

In Test-Driven Development test plans are formulated and written before the development begins, as the below diagram shows.

[Diagram: test driven.jpg (“Test-driven development – Wikipedia,” 2017)]

One of the benefits of writing the test plan first is that the code that is written is adequately documented, because it is deliberately written to reflect the test plan. A criticism made of Test-Driven Development is that it relies on unit testing, which tests part of an application (such as a single module) to see if it works; however, it does not check whether that part of the application works from the user’s perspective, instead concentrating on whether it works from the implementation’s perspective, meaning that as the implementation changes so does the unit test (Vasilev, 2011)

This slideshow (https://www.slideshare.net/shadrik/bdd-with-java-8323915) outlines three types of testing. As you can see, the open source automated testing tool Selenium is used for testing the user interface, whilst unit testing checks whether the code written meets the programmer’s expectations. BDD (Behaviour-Driven Development), however, concentrates on the functionality available to the end user whilst also testing the backend code of the application.

[Diagram: testing types.PNG (Vasilev, 2011)]

The concept of Behaviour-Driven Development (BDD) is to take the idea of writing the test plans before starting software development and extend it. As we know, a large number of IT projects fail due to not meeting the needs of users by the time the software system is deployed. BDD uses three concepts to ensure the project meets the users’ needs:

  1. Domain Driven Design (DDD) – This is a way of mapping actions, events and data values in the problem domain to methods and variables in the system software domain, thereby ensuring the system reflects the problem domain. This is a really important process because teams building a system often have little or no experience of the problem domain they are building a solution for, so it is easy for them to design a solution that does not reflect it (Chambers, 2016)
  2. Ubiquitous Language (UL) – This is the use of common terminology between the users and the development team to describe the problem domain and the system being developed. This helps make communication (which is hugely important in the Agile methods paradigm) with the users easier (Vasilev, 2011)
  3. Executable Documentation – This is another way to improve communication with non-developer stakeholders in a project, in addition to keeping the documentation up to date with the system. The idea is that you write tests covering the functionality of a user story, which is a user requirement of the system. So, for example, if the user wants to be able to search the library catalog and then check out an item, that is a single user story and you would write the test for it before starting to code it. The tests themselves form the documentation, and because the code is written to reflect the tests, the documentation does not get out of date with the code (Vasilev, 2011)

Reading about Test-Driven Development and Behaviour-Driven Development has been very interesting, as it has extended my knowledge of Agile methodologies, and I can clearly see the advantages: it puts the developers and users on the same page as far as communication goes, it makes the system domain reflect the problem domain (particularly useful when the development team is new to the problem domain), and it keeps the documentation constantly up to date, because the whole system is built according to the tests, which form the documentation.

Now back to the automated testing tools:

Serenity – This used to be called Thucydides, and its point of difference appears to be that it works with two other automated testing tools, jBehave and Cucumber JVM.

This automated testing tool embodies the BDD methodology; Cucumber JVM provides a platform for writing Executable Documentation (discussed above) (Colantonio, n.d.)

Robot Framework – This testing tool requires testers to write test plans in Python, Java or .NET. Its point of difference is that it uses a keyword approach. When the article says keyword, what it means is that commonly used words describing actions, such as ‘submit form’, have a function attached to them which is run when the keyword is used. The benefit of keywords is that the tester can write the test plan faster, because writing a keyword saves them having to write out a whole lot of test code. It is basically a shortcut (Hallik, 2016).

RedwoodHQ – This testing tool lets users write test plans in Python, C#, or Java. Its point of difference is that it provides a platform for collaborative testing where multiple testers simultaneously test on a single RedwoodHQ interface. Additionally as with Robot Framework this tool allows the use of keywords in tests (Vasilev, 2011)

Sahi – This testing tool is used for testing web applications, and its point of difference is that it provides the record-and-capture style of test plan development that was available in a lot of the proprietary automated testing tools I learnt about for last week’s research journal entry. The article warned that record-and-capture test plans are not as stable as coded test plans; however, record and capture is obviously faster for developers, and more useful if they do not know Java, Python, or .NET, which most of the other open source testing tools require for writing test plans (Vasilev, 2011)

Galen Framework – This testing tool concentrates on the User Interface (U.I.) and in particular the User Experience. Its point of difference is it provides test plan syntax for checking the layout of the U.I. and it can produce HTML reports describing the findings of the test. This is clearly a very specialized testing tool but it could be useful for design patches in the Koha project where some text, colour or styling has been changed by a developer (Vasilev, 2011)

Gauge – This testing tool’s point of difference is that, as with Serenity, it provides a platform for BDD, in particular Executable Documentation. Test plans in Gauge can be written in C#, Ruby or Java.

I have learned a lot from reading through this article and following up with further reading on topics such as Test-Driven and Behaviour-Driven Development. It has been interesting to identify the points of difference of each of the open source automated testing tools in the article, and I have found that there are plenty of useful open source automated testing tools I can investigate in more detail and test in my end of year project.

For next week’s research journal entry I want to learn more about the open source tool Behat; it was not discussed in this article, but I know it is used in industry at Catalyst.

Bibliography:

Colantonio, J. (n.d.). 6 top open-source test automation frameworks: How to choose. Retrieved May 29, 2017, from https://techbeacon.com/6-top-open-source-testing-automation-frameworks-how-choose

Test-driven development – Wikipedia. (2017, April 21). Retrieved May 29, 2017, from https://en.wikipedia.org/wiki/Test-driven_development

Vasilev, N. (2011, June). BDD with JBehave and Selenium. Retrieved from https://www.slideshare.net/shadrik/bdd-with-java-8323915

Chambers, R. (2016, February 14). What is Domain Driven Design? – Stack Overflow. Retrieved May 29, 2017, from https://stackoverflow.com/questions/5325836/what-is-domain-driven-design

Hallik, M. (2016, February 3). Robot Framework and the keyword-driven approach to test automation – Part 2 of 3 — Xebia Blog. Retrieved May 29, 2017, from http://blog.xebia.com/robot-framework-and-the-keyword-driven-approach-to-test-automation-part-2-of-3/

What automated testing tools are available?

As part of my PRJ701 project I plan to research automated testing tools which can be used for regression and functionality testing in the Koha project. This research will consist of both primary and secondary research.

Secondary research, because I want to find out what automated testing tools are available, followed by primary research to test how well each of the automated testing tools works with Koha, by writing test cases for a range of patches on the Koha bug tracker Bugzilla and seeing if the tools can follow the test cases successfully.

Due to the time constraints I am currently under, with 6 assignments to work on at the moment, I am writing up a brief research journal entry about a variety of automated testing tools I have found to date from this article: https://dzone.com/articles/top-10-automated-software-testing-tools

Selenium, the best-known automated testing tool, works with web applications. The tester has to write what’s known as a test case, which is a series of steps the browser must perform to interact with the website; if all the steps in the test case are completed without errors then the test case is a success. Selenium allows users to write test cases in a variety of languages such as PHP, Perl and Python. I think that when I test Selenium I will write test cases in Perl, as that is what Koha is written in and so it will be a language familiar to other Koha developers (Dave, 2016)

TestingWhiz is an automated testing tool that I hadn’t heard of before, and it has a very useful feature that Selenium doesn’t have: it does not require the tester/developer to write code-based test cases. Instead, the tester records themselves performing the steps of a test case and TestingWhiz will automatically generate the test case from that, as the below video shows:

(TestingWhiz, 2016)

The only issue with TestingWhiz is that it is not open source and so users would have to pay. In an open source project this is an issue as it could narrow the number of testers available down to just staff of software support vendors, leaving out a large proportion of the Koha community.

HPE Unified Functional Testing – This testing tool uses VBScript for writing test cases. Again, this tool is not open source, so testers would have to pay around $2,500 for a year’s subscription, which is completely unrealistic for an open source project.

(“What is QTP / UFT ? (HP Unified Functional Testing),” 2017)

TestComplete – This is another paid testing tool, which can be used for running functionality tests against mobile and web applications. It requires users to write test cases in one of the following: “JavaScript, Python, VBScript, JScript, DelphiScript, C++Script & C# Script” (Dave, 2016)

Ranorex – A paid testing tool which, like TestingWhiz, does not require users to code test cases. It markets itself as a cross-platform, cross-device test automation tool which is flexible enough that it will still run even when there have been changes to the UI of a web application since the test case was created (Ranorex, 2013)


Sahi – This testing tool is much the same as Ranorex and TestingWhiz; it’s proprietary and it allows users to record test cases. (Dave, 2016)

Watir – This is an open source testing tool written in Ruby. It seems to be mainly used for testing web forms.

Tosca TestSuite – This testing tool uses what’s called “model-based test automation to automate software testing” (Dave, 2016). What this means is that the testing tool scans the web application, which allows it to identify test cases, known as models.

These models can have a variety of tests performed against them, including automated testing, manual testing and image testing (which is where a screenshot of a UI previously taken is compared against what exists in the web app now and Tosca TestSuite will identify any changes to the UI).

The variety of testing types that Tosca TestSuite offers would be advantageous for Koha because, having done a fair amount of patch testing and signoffs myself, I know there are a lot of patches which are simple text/design changes that this type of automated testing would be very useful for. Unfortunately, again, this is a paid testing tool.

(“Model-Based Test Automation Technical Description,” n.d.)

Telerik TestStudio – This is a paid testing tool which allows you to write test cases in Visual Studio (which Telerik TestStudio integrates with) or record test cases.

This tool is useful because it can be used by DevOps teams to perform automated tests on deployed software out in the production environment. It does this by integrating with Jenkins, which is an automation server. I know that Jenkins is used in the Koha project, so this could be a useful primary research test for me to consider for my project (“Continuous Integration (CI) with Test Studio – Jenkins, TFS, Bamboo, TeamCity and More,” 2017)

WatiN – This is an open source tool that is designed for testing HTML and AJAX websites. It works on Internet Explorer and Mozilla Firefox and it also provides the image testing that Tosca TestSuite does.

Something I have realized from reading about these 10 automated testing tools is that a lot of them offer the same functionality; for example, TestingWhiz, Ranorex and Sahi all allow users to create test cases by recording themselves interacting with an application’s U.I. Therefore in my project I am going to have to determine what is different about each tool to help me identify a single automated testing tool to recommend. Obviously the primary research I will do testing out these tools on Koha bugs will help in this process, but it is useful to know the points of difference before starting the primary research.

Another observation I have made is that out of these 10 automated testing tools, 7 are proprietary and so a licensing fee would have to be paid. In an open source project like Koha this is not ideal, because not all testers/developers have money to spend on contributing to Koha.

Therefore for my next research journal entry I want to concentrate on finding out about other open source automated testing tools, like Behat and Robot Framework which are in use at Catalyst.

In preparation for my project I have also joined an automated testing meetup in Wellington (https://www.meetup.com/WeTest-Workshops/), which should be a useful source of information.

 

Bibliography:

Dave, P. (2016, October 18). Top 10 Automated Software Testing Tools – DZone DevOps. Retrieved May 22, 2017, from https://dzone.com/articles/top-10-automated-software-testing-tools

TestingWhiz. (2016, January 1). (151) Web Test Automation with Record/Playback Feature – Part 1 – YouTube. Retrieved May 22, 2017, from https://www.youtube.com/watch?v=m_226mOeiHA

What is QTP / UFT ? (HP Unified Functional Testing). (2017, April 3). Retrieved May 22, 2017, from http://www.learnqtp.com/what-is-qtp/

Ranorex. (2013, April 13). (151) Ranorex Automated Testing Tools for Desktop, Web and Mobile – YouTube. Retrieved May 23, 2017, from https://www.youtube.com/watch?v=qsh4zWa6bE8

Model-Based Test Automation Technical Description. (n.d.). Retrieved May 23, 2017, from https://www.tricentis.com/tricentis-tosca-testsuite/model-based-test-automation/detail/

Continuous Integration (CI) with Test Studio – Jenkins, TFS, Bamboo, TeamCity and More. (2017). Retrieved May 23, 2017, from http://www.telerik.com/teststudio/continuous-integration

 

 


Does Open source software meet the user needs as well as proprietary software?

This debate usually brings out strong opinions among technologists; it feeds into that very opinionated overarching question: which is better, open source or proprietary software?

I have a strongly held opinion that open source software is best, as I am working for an open source software development company and have seen first hand the pros and cons of open source software development. However, I have not had any experience working for a proprietary software development company, so I have a definite bias, which I would like to disclose now.

However, as we are discussing software development methodologies in SYD701 at the moment, and as I have been fortunate enough to attend several IT events and meet other developers in industry over the last few months, I would like to share my views on the argument that open source software doesn’t meet user needs as well as proprietary software.

A while back I attended an ITP (Institute of IT Professionals) discussion at the Nelson Hospital conference center where they debated open source and proprietary software; one of the arguments that the proprietary side made was that open source software does not meet the requirements of users because it does not rigidly follow a software development methodology and so it does not effectively capture client requirements. This argument is based on the assumption that open source software development is very haphazard, i.e.  a developer somewhere in the world comes up with an idea and develops a feature without ever talking to users.

Well, I can say the vast majority of open source projects have a definite workflow you need to follow to develop and push out an enhancement for an open source product.

For a start you have to write up what’s known as an RFC (Request For Comments). The open source product Koha ILMS (Integrated Library Management System), which I work on, has a specific template that must be followed and can be seen here: https://wiki.koha-community.org/wiki/Category:RFCs

As of the 16th of May there are 88 RFCs requesting feedback for proposed Koha enhancements (as you can see in the screenshot below), which means you really have to make an effort to get your RFC to stand out in order to get valuable feedback. How do you do this? Well, you can use the #koha IRC channel (which is where many developers and, most importantly, librarians chat about Koha; I’ll come to the importance of librarians to this project a bit later), the Koha Twitter page, the Koha developer meetings (which are held once a month on the #koha IRC channel), and emailing the release manager of the next Koha release.

[Screenshot: Koha RFCs wiki page]

It is important to add new information to the RFC as you make changes based on feedback, so as to keep interest in the enhancement up; if other developers are interested in what you are working on then they are more likely to test your patches, thereby helping you get your enhancement through the QA system faster.

After getting feedback on the RFC you can start designing the enhancement in more detail. Koha, like many open source projects, uses an agile methods approach, so rather than spending a large amount of time on up-front documentation you perform more iterative, test-based development: you take the initial user requirements captured in response to your RFC, develop a first prototype, attach it to a bug report in the Koha Bugzilla bug tracker (https://bugs.koha-community.org/bugzilla3/) and then request feedback on that. The users you hopefully got interested in response to your RFC will now be a great help, because they are likely to want to test your patches. To help them test your patch(es) you have to write a test plan, which is a numbered plan describing how to perform an action in Koha before and after the patch is applied so that the changes can be easily observed. This test plan must be followable by people new to Koha, increasing the testing audience available to you as well as making testing more inclusive for newbies.

If you are starting out developing for an open source project you will have a lot to learn, so the feedback and improvements you receive from testers are amazingly useful. Using that feedback you iterate over the enhancement again, and attach it to the bug report for testing again. This process continues over and over until the enhancement is signed off by another developer.

Does this mean it will get into the final product? No it doesn’t; mature open source software projects like Koha want to enforce the quality of their code, so one signoff isn’t enough. In general at least 2 signoffs are required: one by another developer and one by a member of the QA team.

There are several types of tests that are performed on your patch(es) to ensure they meet coding guidelines and work as intended.

Firstly, the code is eyeballed to determine if there are any obvious mistakes. This is easy because Bugzilla highlights the changes made in the patch (in much the same way as Git does), so you don’t have to work out what has changed. Then the tester follows your test plan to functionally test whether the patch(es) work as expected. The functionality testing is what I plan to work on for my work placement PRJ701 project, so as to speed it up through the use of automated testing tools like Behat.

Finally a QA test is run, which is another check that the code meets the coding guidelines.

If all three of these testing types prove that your patch(es) are up to standard then you will be signed off by another developer and a member of the QA team, and your patch will be set to ‘Passed QA’. It is then very likely to be pushed to the Koha master branch; the only reasons it would not be are if the release manager finds an issue which no other tester has identified (which is highly unlikely) or if your patches conflict with another enhancement that has been pushed to the master branch.

So you can see the testing process for open source software, in this case the Koha ILMS, is very thorough, so as to make the product as stable, maintainable and well developed as possible.

Now, why is having librarians involved in Koha useful? Well, Koha is an Integrated Library Management System which is used by thousands of librarians around the world on a daily basis, so having librarians involved in giving feedback on RFCs, testing patches, and in several cases writing their own patches means that Koha is definitely hearing and meeting client requirements. There cannot be many software development projects in the world where end users not only specify their requirements, but also test patches and help develop the product they use!

This is not just specific to Koha; many open source software development projects try to make themselves as inclusive as possible so as to get a variety of people contributing. For example, the LibreOffice project is constantly trying to gain new developers by promoting the founding of LibreOffice meetups and providing easier patches for beginners to test and sign off, and through their website (https://www.libreoffice.org/community/get-involved/) they promote different ways people can contribute based on their experience; for example, if you don’t know how to code you might like to contribute to the design of the LibreOffice products.

So, in conclusion, open source software development very much meets the requirements of end users: by making sure new enhancements have RFCs so feedback is gained before designing and coding begin, by adopting the agile methods paradigm to develop enhancements iteratively, and by increasing the inclusiveness of the developer and tester base.

It is this inclusiveness in many of the mature open source projects (of course not all open source projects are inclusive) which means there is greater end user participation in product development. It also means that open source software is not as affected by the lack of women and developers of colour in the industry as proprietary software development is, because people who don’t work in software development can still help contribute to the product, making for a better performing end product. So I would argue that in many cases open source products more closely meet client requirements than proprietary products.

Building the network in IT

In this research journal entry I would like to cover a technology-related event I attended whilst in Wellington last week. The event was very motivating for me and I believe there are several lessons that can be learnt from it.

During the term break I headed up to Wellington to work for a week and on the Saturday I was up there I attended Code Camp Wellington. I assumed this to be a coding workshop full of students and new grads learning about and experimenting with the latest technologies from technology educators.

What I found was quite different: a very interesting and sociable technology conference, full of developers from all levels of the development hierarchy (senior, intermediate and junior devs) as well as mature students trying to get into the IT profession. The surprising thing was that there were very few IT students attending, even though the event was free and held on a weekend in the middle of the term break.

The conference was held in the offices of TradeMe and Xero which are neighbors in Market Lane in Wellington.

The conference started with a superb keynote speech by Marie-Clare Andrews, a Wellington IT startup entrepreneur who founded ShowGizmo, the most successful events app in Australasia. In her speech she argued that Wellington is the best place in the world to be for tech at the moment. Why is this? Firstly, Wellington has a diverse culture resulting from wide-scale immigration bringing in people of different cultures, and as we all know, the more diverse a development team (in terms of gender, race, culture and socioeconomic background) the better the product tends to be, because the team has a better understanding of and empathy for all its users.

The second reason she argued for Wellington was that it has a small CBD, being crammed in between the hills and the sea, making it easier for people to meet up, collaborate and swap ideas, as opposed to Auckland, which is far more spread out.

The key takeaway message from her speech was that the event we were all attending was a networking opportunity. Time and again throughout the day I heard that networking is vitally important to your tech career: I was told that 80% of IT jobs are not advertised but are filled through referrals from existing staff, so the more tech people you know the better your chances of getting a job through them.

The more you build your network, the greater your chance of landing a great IT job. Where and when can you network? By attending events like this one, joining meetup groups (https://www.meetup.com/), attending events like Startup Weekend Wellington, hackfests, and so on. In a city such as Wellington there are many opportunities for technologists of all skill sets and ages to meet other developers, learn from them, and gain knowledge and employment.

Personally, being of a shy disposition and only just starting in this industry, I was initially nervous about walking up to complete strangers, who could be very senior developers, without being introduced, and making small talk. However, what I found is it really isn’t all that hard: all you do is walk up to them, smile, introduce yourself, and ask them about themselves. After hearing them describe themselves you get a pretty good idea of what they are interested in, or of any commonalities between the two of you, and the conversation is easy from there.

Not only is networking good from a career point of view, but you can also learn a huge amount at these events. For example, after this very inspirational talk from Marie-Clare I chose 6 other presentations to attend out of a potential 24. I would like to cover what I learnt in two of my favorite presentations:

Machine learning – Brad Stillwell, a senior developer at Xero, gave a fascinating talk about machine learning in general and supervised machine learning in particular. Supervised machine learning is effectively where you give a machine learning algorithm a set of inputs and their associated outputs as training data, and the algorithm builds an association between the inputs and outputs so that if you give it a new input it can predict the output. It works in much the same way as our brain creates a network between neurons when we learn the equivalent of an English word in another language: we must build a connection between our native language and the new one.

Another example is supplying a supervised machine learning algorithm with a collection of input data, for example fruit descriptions, and output data, the fruit names:

Apple – Crunchy and green

Carrot – Crunchy and orange

Tomato – Red and round

You can train it to predict which fruit name is associated with a particular description. For example, if I hand the machine learning algorithm the input ‘crunchy’ it will predict that the fruit associated with that description is an apple. Notice that ‘crunchy’ is associated with both apple and carrot, so why did it predict apple rather than carrot? This can come down to the frequency with which ‘crunchy’ is associated with apples rather than carrots: if our training data contains 5 apple instances and only 2 carrot instances, then the machine learning algorithm has ‘learnt’ that crunchy is more likely to be associated with apples than carrots.
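To make that frequency idea concrete, here is a toy PHP sketch of my own (nothing like the real algorithms Brad described, just the counting intuition): tally how often each description word appears with each label in the training data, then predict the label with the highest count for the word you are given.

<?php

// Toy training data (made up for this example): label plus description words.
$trainingData = [
    ['label' => 'apple',  'description' => 'crunchy green'],
    ['label' => 'apple',  'description' => 'crunchy green'],
    ['label' => 'apple',  'description' => 'crunchy red'],
    ['label' => 'carrot', 'description' => 'crunchy orange'],
    ['label' => 'tomato', 'description' => 'red round'],
];

// Count how often each word co-occurs with each label.
$counts = [];
foreach ($trainingData as $example) {
    foreach (explode(' ', $example['description']) as $word) {
        $counts[$word][$example['label']] = ($counts[$word][$example['label']] ?? 0) + 1;
    }
}

// Predict the label seen most often with the given word.
function predict(array $counts, string $word): ?string
{
    if (!isset($counts[$word])) {
        return null; // word never seen during training
    }
    arsort($counts[$word]);     // highest count first
    return key($counts[$word]); // the label with that highest count
}

echo predict($counts, 'crunchy'); // "apple" – it co-occurs with 'crunchy' 3 times vs 1 for carrot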

Career surgery – This was a sit-down discussion rather than a formal presentation. Members of the audience asked an IT recruiter questions about how to gain employment in the tech industry in Wellington. It was particularly interesting hearing from people moving into the tech industry from other fields and the trouble they had found trying to gain employment.

There were people moving from marine biology into software development and data analysis, and they were asking how their skills from a scientific background could be marketed in a CV to gain employment in tech. The recruiter advised them to highlight the analytical skills, attention to detail, and problem-solving skills gained from their scientific background.

So in essence this talk taught me that whatever your background, there are always skills you can take and use to market yourself for a tech role.

So, all in all, the message I took away from attending this event is that it is vital to network, attend events such as this, and put yourself outside your comfort zone. To this end I have decided to attend Startup Weekend Wellington (http://communities.techstars.com/new-zealand/wellington/startup-weekend/10344), a 54-hour event where you propose an idea, get into teams, found a startup company, build a product, and present the idea to tech leaders. An intense, competitive, and collaborative weekend getting hands-on experience at starting a tech startup.

Elasticsearch part 3: The implementation

In my last blog post on Elasticsearch I covered the majority of the theory and commands that I learned at the Elasticsearch for developers course I completed at work. Now I want to have a play around with Elasticsearch and Kibana, the web front end whose Dev Tools console you use to interact with Elasticsearch's RESTful API.

I decided to install both of these systems onto my NMIT Windows laptop. What was a seamless installation process last time on my other laptop turned into a more complicated troubleshooting exercise this time.

After starting both Elasticsearch (accessible at http://localhost:9200) and Kibana (accessible at http://localhost:5601) I saw that Kibana was throwing an error (see below) because it required a default index pattern.

[Screenshot: error message in Kibana]

What is a default index pattern? It is an index (which remember is like the equivalent of a relational database) or collection of indexes you want to interact with using Kibana. You can specify several using the wildcard symbol in the index pattern input box.

So the first thing I had to do was create an index. I used the example index from the Elasticsearch documentation (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html#indices-create-index), which is called ‘twitter’.

[Screenshot: creating an index.PNG]

Then, after indexing a document (the equivalent of inserting a row into a table in a relational database), I set twitter* as the default index pattern, thereby removing the error I was getting.

An important point for Elasticsearch beginners is that when you interact with Elasticsearch through Kibana you write in Sense syntax rather than curl, which is for use in a terminal. However, the Kibana Dev Tools area, which is where you write Sense syntax, is fantastic because it automatically converts curl commands into Sense syntax. For example, I copied and pasted the curl command

curl -XGET 'http://localhost:9200/_nodes'

And Kibana converted it to:

GET /_nodes

 

Insert data

Now for some fun with Elasticsearch…

I have created an index named cats

[Screenshot: created cat index.PNG]

Then I create a mapping type (the equivalent of a table in a relational database) automatically when indexing a document (creating a data record). How so? Well, I use the PUT API (in Elasticsearch jargon an API is a command).

PUT cats/domestic/1

What this means is use the cats index, create a mapping type named ‘domestic’ and create a document with the ID of 1 in this mapping type.

Note that if you leave the ID number out (and use POST rather than PUT) Elasticsearch will auto-generate an ID for the document.

[Screenshot: entered cat]

What is happening when I use the PUT API to index a document? Kibana sends an index request to a node in the Elasticsearch cluster (a collection of nodes, i.e. instances of Elasticsearch). The ID value (manually set or auto-generated) is hashed and used to find the shard the document should be written to.
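As far as I understand it, the routing rule behind this step is essentially:

shard number = hash(_routing) % number_of_primary_shards

where _routing defaults to the document's _id, which is why the same ID is always routed to the same primary shard.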

What is a shard? It is a conceptual object holding a collection of documents, allowing Elasticsearch to be distributed and scaled out across nodes.

Once the matching primary shard has indexed the document, the change is replicated to the replica shard (the backup shard).

Note: As you can see above you do not need to specify data types when creating or modifying Elasticsearch indexes.

Retrieve data

Now to retrieve the document I just indexed I need to use the GET API:

GET cats/domestic/1

What's happening in the background when you send a GET request? The ID in the request is hashed, and when the request arrives at a node in the Elasticsearch cluster the hashed ID is used to route the request to a shard holding a document with that ID.

How to check if a document exists

To check if a document exists in the index you can use the HEAD API:

HEAD cats/domestic/1

This should return an HTTP status code of 200 if the document exists and 404 if it doesn't. Except when I ran it in Kibana I got a fatal error.

 

[Screenshot: error.PNG]

It seems several other people have had issues running the Exists API in Kibana, as these forum posts show, none of which were satisfactorily answered:

https://unix.stackexchange.com/questions/253414/elasticsearch-error-on-head-command

https://github.com/elastic/elasticsearch-php/issues/391

However this github source (https://github.com/elastic/kibana/pull/10611) suggests that the syntax for the Exists API is deprecated and I need to write:

HEAD cats/_mapping/domestic

However this produced the same error. I could not find any other useful suggestions online and so I will move on, and ask the trainer of the course Frederik later.

Delete data

DELETE index/mapping type/id

[Screenshot: delete.PNG]
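The request in the screenshot was presumably along these lines, deleting the cat document indexed earlier:

DELETE cats/domestic/1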

The background process when the DELETE API is run is, as usual, that the ID in the request is hashed and used to route the request to the primary shard the document lives in; after the document is deleted there, the primary shard updates the replica shards.

Point of interest: Write consistency

Every document is written to a primary shard, and can be (but doesn't have to be) replicated to several replica shards.

If you set up replica shards when you created the index, then you need to make sure a certain number of these shards are available when writing to Elasticsearch.

You need to have:

(primary + replicas) / 2 + 1 shards available to be written to

 

Update data

I indexed another document

PUT cats/domestic/1
{
  "name": "Kelly",
  "age" : "1",
  "colour": "black and white"
}

Then, to update this document so the age is 2, I wrote:

[Screenshot: update.PNG]
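I haven't reproduced the screenshot, but in Kibana a partial update looks something like this (using the Update API; re-indexing the whole document with a plain PUT achieves the same end result):

POST cats/domestic/1/_update
{
  "doc": {
    "age": "2"
  }
}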

As I understand it, the background process for this command is that all fields, including the ones not being updated, are re-indexed, so fields that are not changing are simply written back with the same value. This is again performed on the primary shard first, and then replicated to the replica shards if applicable.

 

Get multiple documents simultaneously

I created another index named ‘president’, with the mapping type ‘individual’ and id ‘1’ for a document on George Washington.

Then, to get the documents with id ‘1’ in cats/domestic and president/individual, I perform a Multi Get request:

[Screenshot: multi get.PNG]
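The Multi Get request body simply lists the documents to fetch; it looks something like this:

GET /_mget
{
  "docs": [
    { "_index": "cats", "_type": "domestic", "_id": "1" },
    { "_index": "president", "_type": "individual", "_id": "1" }
  ]
}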

 

Perform multiple different APIs simultaneously

To perform multiple different commands in one request you can use the Bulk API. You can think of this as the equivalent of being able to perform delete, update, and insert SQL queries against multiple tables in a relational database in a single command.

When I first tried this command I wrote the HTTP header PUT _bulk; this resulted in an error:

[Screenshot: bulk error.PNG]

After some troubleshooting I found this was caused by the \n escape sequences, which needed to be removed; after that it worked, like so:

[Screenshot: worked.PNG]
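For reference, a working bulk request is newline-delimited: each action goes on its own line, with the document source (for index actions) on the line directly after it. A minimal example (the second cat and its ID are made up for illustration):

POST _bulk
{ "index": { "_index": "cats", "_type": "domestic", "_id": "2" } }
{ "name": "Tom", "age": "3", "colour": "ginger" }
{ "delete": { "_index": "president", "_type": "individual", "_id": "1" } }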

 

Text analysis

Elasticsearch is very useful for searching text, because it can store the words from a text such as a book in the inverted index in much the same way a book index holds keywords for readers to find easily.

The way we split text up so it can be stored in the inverted index for searching is called analysis, and you can experiment with it using the Analyze API.

I started by specifying the HTTP header GET _analyze. I specified the tokenizer "keyword", which stores the supplied string as a single token rather than splitting it, and the filter "lowercase", which lowercases the supplied text.

As you can see below ‘New South Wales’ has been transformed into ‘new south wales’

[Screenshot: lowercase.PNG]
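The request behind that screenshot was along these lines (a sketch assuming the Elasticsearch 5.x form of the Analyze API, where the tokenizer, filters and text all go in the request body):

GET _analyze
{
  "tokenizer": "keyword",
  "filter": ["lowercase"],
  "text": "New South Wales"
}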

Often for long text (like sentences) it is best to split the words up so they can be searched individually. You can do this by specifying the tokenizer “whitespace”. So using the Shakespearean sentence “You shall find of the king a husband, madam; you,sir, a father:” I used the whitespace tokenizer to split it up:

[Screenshot: splits.PNG]

If you want to learn more about what the analyser is doing you can implement the “explain”: true attribute.

Now the analyzer commands I have performed to date are using the default _analyzer on supplied text, but what if I wanted all data in a document I index to be analyzed and thereby made searchable?

Well, you can configure an analyzer in an index when creating the index.

[Screenshot: analyzer in.PNG]

To make the job of the tokenizer easier you can also apply character filters; for example you can filter out HTML. This would be very important for making the system more secure.

[Screenshot: char filter.PNG]
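Configured when creating an index, this looks roughly like the following sketch (the index and analyzer names here are just made up for illustration):

PUT blogposts
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_html_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}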

It is interesting how the different analyzers work; the English one does not just split the words up, it removes stop words (common words that add no value to a search query) and stems what is left. Below I wrote in the sentence from the course exercise, “It is unlikely that I’m especially good at analysis yet”, which has words like ‘unlikely’ stored and indexed as the stem ‘unlik’.

[Screenshot: english.PNG]

By contrast, all words are stored and indexed in their original form when using the standard analyzer.

[Screenshot: standard analyzer.PNG]

 

Mappings

Instead of letting Elasticsearch decide the data types of the fields in an index, you can specify them manually in the mappings. As I said previously, the mapping type (the equivalent of a table in a relational database) is just a name, so in my previous examples of cats/domestic/1 the mapping type name was ‘domestic’. However, there are many attributes in an index that you can customize to make it match the business problem domain more closely.

Mappings are useful because they give Elasticsearch some idea of how the data is structured, even though there is no rigid schema.

I created an index named programminglanguage, with a mapping type of ‘OO’. I set the data type of the “name” field (which is circled in the screenshot) to a string.

[Screenshot: field altering.PNG]
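The request behind that screenshot was something like this (this is the Elasticsearch 2.x-era syntax used on the course; newer versions replace the ‘string’ type with ‘text’ and ‘keyword’):

PUT programminglanguage
{
  "mappings": {
    "OO": {
      "properties": {
        "name": { "type": "string" }
      }
    }
  }
}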

You can also update the mapping attributes of an index; however, you need to keep in mind that you cannot remove an existing mapping type field.

To retrieve your mapping values for an index simply write in GET <indexname>/_mappings

Like so:

[Screenshot: retrieve mappings.PNG]

You can also create objects in Elasticsearch; for example, by default the comments field in my tvseries index below is stored as an inner object.

That means ‘comments’ is of data type ‘object’.

[Screenshot: nested objects.PNG]

If I want to reference a field in the comments nested object I have to write: comments.<fieldname>

How do you set a field to be searchable?

You use the ‘index’ attribute in the mappings. You set it to ‘analyzed’ if you want it searchable and it goes through the analyzer.

You set it to not_analyzed if you want it searchable but don’t want it to go through the analyzer.

You set it to ‘no’ if you don’t want it searchable.

 

Index templates

An index template is a good way to create an index quickly, without having to write the mappings out manually every time. So once you have the mappings customized to your business problem domain you can then apply them to multiple similar indexes using a template. I like to think of this like inheritance hierarchies in Object Oriented programming: you place all the common features in the superclass and all subclasses inherit them, thereby only having to write them once.

To create a template you use a PUT request:

PUT _template/tv_template

This is creating a template in the _template area named tv_template

template 1.PNG
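
A sketch of what a template like this could contain (the index pattern, shard count, and mapping here are my own illustrative choices rather than exactly what is in the screenshot); any new index whose name matches the "template" pattern picks these settings up automatically:

PUT _template/tv_template
{
  "template" : "tv*",
  "settings" : {
    "number_of_shards" : 1
  },
  "mappings" : {
    "tvseries" : {
      "properties" : {
        "title" : { "type" : "text" }
      }
    }
  }
}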

As with indices you can delete, retrieve, and retrieve all templates using similar commands. As I have not covered how to retrieve all of them I will do so now; it is very simple:

GET /_template

Searching

Elasticsearch can perform two kinds of searches on the searchable values (setting values to searchable is described further above):

  • Structured query (checking for an exact, boolean match on the keywords the user entered. Equivalent to a SELECT SQL query with a WHERE clause. Either there is a match for the WHERE clause or there isn’t)
  • Unstructured query (not just looking for exact matches but ranking the query output, so this is a continuum rather than boolean answer. Also known as a full text search).

Elasticsearch uses a query language called QueryDSL. A quick Google of this and I found it described as an “extensive Java framework for the generation of type-safe queries in a syntax similar to SQL” (Chapman, 2014).

Now search uses a GET request; to set up a structured query you use the ‘filter’ attribute, and to set up an unstructured query you use the ‘query’ attribute, which gives all results a score.

Using QueryDSL we can write a single query (known as a leaf query clause) or multiple queries in a single statement (known as compound query clauses).

Where I want to query (retrieve) all documents in an index I can use the match_all attribute:

GET programminglanguage/OO/_search
{
  "query": {
    "match_all" : {}
  }
}

This is the equivalent of a SELECT * query in SQL and it is perfectly acceptable to use.

Note: The above match_all query is an unstructured query because it uses the ‘query’ attribute.

If you want to limit the number of ranked results displayed in an unstructured query then you can specify the number of results you want with the ‘size’ attribute.
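
For example, something like this (a sketch reusing my programminglanguage index) should return only the top 3 ranked results:

GET programminglanguage/OO/_search
{
  "query" : {
    "match_all" : {}
  },
  "size" : 3
}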

 

How to use your search query

Now if you use the ‘match’ attribute in your query then the search term goes through the analysis chain to tidy it up and is then used for an unstructured query.

Whereas if you use the ‘term’ attribute then whatever the user wrote in is compared exactly to what is in the inverted index and a structured query is performed.

So in the below example I am performing an unstructured query.

match.PNG
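
A sketch of the two variants side by side, using the name field from my cats/domestic example (the exact values in the screenshot may differ):

GET cats/domestic/_search
{
  "query" : {
    "match" : { "name" : "Kelly" }
  }
}

GET cats/domestic/_search
{
  "query" : {
    "term" : { "name" : "kelly" }
  }
}

The match query analyzes "Kelly" before looking it up, whereas the term query compares the supplied value exactly as written against the inverted index (which is why I have lowercased it here, since the analyzed token stored in the index will be lowercase).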

 

To make your unstructured query more fine-grained there are 3 types of unstructured queries for you to choose to implement.

Boolean –  This is effectively a structured query, whatever you enter as a search term is used to find an exact match in the inverted index otherwise no hits are found.

I have a document with the name “Kelly” in the cats/domestic index and mapping type, so trying the bool query searching for the name “K” I got no results, because I have no document with the name “K” in cats/domestic.

book.PNG

Whereas when I perform this bool query using the name “Kelly” I get 1 hit; this is because there is exactly 1 document with the name “Kelly”.

bool 2.PNG

 

 

Phrase – This treats the entered search term as a phrase

match_phrase_prefix – This query splits up the values in the string the user entered as a search term and treats the last one as an incomplete prefix. In the below example I just used “K”, so Elasticsearch looks at a dictionary of sorted words, takes the first 50 words starting with that prefix, and puts them into the query one at a time.

match 2.PNG
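
A sketch of that query, again using the name field from my cats example:

GET cats/domestic/_search
{
  "query" : {
    "match_phrase_prefix" : { "name" : "K" }
  }
}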

 

The query_string query is interesting; it is built rather like a SQL query in that you can use OR and AND. So in the below example I am searching cats/domestic with the phrase “(blue OR black) AND (white OR red)” without specifying the field name, and I am getting the correct result.

query string.PNG
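
Roughly, the request looks like this (a sketch; when no field is specified the query runs against the index's default field):

GET cats/domestic/_search
{
  "query" : {
    "query_string" : {
      "query" : "(blue OR black) AND (white OR red)"
    }
  }
}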

Suggesters

Suggesters are faster than search queries, although the suggester can be implemented on a search query as well.

What a suggester does is suggest values similar to the user’s search term. So for example if the user misspelt the name as “Kellu” then the suggester could suggest “Kelly”, which is a similar term.
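
A sketch of a term suggester for that misspelling case (the index and field reuse my cats example, and the suggestion name "name-suggest" is just a label I have made up):

GET cats/domestic/_search
{
  "suggest" : {
    "name-suggest" : {
      "text" : "Kellu",
      "term" : { "field" : "name" }
    }
  }
}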

How Elasticsearch works

Search queries in Elasticsearch go through 3 stages, here is a summary on what I understand them to be:

  1. Pre-query – This is checking the number of times a word exists in a particular document. This is only possible where Elasticsearch has a small data set.
  2. Query – This is checking through the inverted indexes for a matching value; this is achieved by running the search query on all shards holding the index we are asking for, until a matching value is found which can point the search query to the document ID that holds this value. This is a useful resource for finding out more about the Query phase: https://www.elastic.co/guide/en/elasticsearch/guide/current/distributed-search.html
  3. Fetch – This returns the documents (whose document ID was listed alongside the search term in the inverted index) to the user.

Deep pagination – What this concept means is that to return results from deep in the result set, Elasticsearch has to look through and sort a huge number of documents across the cluster even though you only want a handful of them; as you can imagine this is very inefficient on a large data set. It is best avoided.

Elasticsearch is known for its speed and a contributing factor is the request cache. As indexes are spread across multiple shards, when a search query is run on an index it is run individually on each shard and the shard results are then combined to form the total result. However each shard keeps a copy of its own results, meaning that if someone runs the same query again it can be served from the shard’s request cache, which is much faster than having to search the shard again.

 

Aggregations

Aggregations is a framework that helps the user learn more about the search results they have found with a search query (“Aggregations | Elasticsearch Reference [5.3] | Elastic,” n.d.)

There are three main types:

Bucket aggregations – Group of documents that meet a specific factor. You can think of this like finding common features in a whole lot of different documents and grouping them in ‘buckets’ based on these common features.

Metric aggregations – Calculate statistical data about a collection of buckets.

Pipeline aggregations – Combine the insights generated from other aggregations. This is an aggregation on an aggregation.

Kibana can use visualization tools to create graphs and maps using aggregations.

You implement aggregations using the “aggregations” attribute in the search query.
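
As a sketch, a terms aggregation on my cats index might look like this (I am inventing a 'colour' keyword field purely for illustration; "size": 0 suppresses the normal search hits so only the aggregation results come back):

GET cats/domestic/_search
{
  "size" : 0,
  "aggregations" : {
    "colours" : {
      "terms" : { "field" : "colour" }
    }
  }
}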

I am unable to perform many of the aggregation commands due to having a small data set, however a summary of the aggregation commands available is:

Sum aggregation – This adds together values of the same field in multiple documents

Min/max aggregation – Display the highest or lowest value of a field in all documents in an aggregation.

Multiple metrics aggregation – Display both the highest and lowest values for a field in all documents in an aggregation.

Terms aggregation – This returns the most common values for a particular field across all documents in an aggregation (the top 10 by default, configurable with the ‘size’ parameter).

Missing aggregation – Find documents in an aggregation that do not have a specified value.

Filter aggregation – This is what is used to create bucket aggregations.

Significant terms aggregation – This finds unusually common values, by comparing how common a value is within the bucket against how common it is in the total data source the bucket aggregation was collected from.

It is important not to nest too many aggregations in a single command because they are very resource hungry and you can end up crashing your system; this occurrence is called combinatorial explosion.

 

Data Modelling

If you choose to use Elasticsearch as a data store in addition to or replacing a relational database management system then you will need to perform data modelling to transform your existing data into something useful for Elasticsearch.

There are several paradigm shifts you will have to make for this process to be possible. Firstly you need to understand that duplicate data is fine in Elasticsearch as it makes searching faster; this goes against what we are taught for relational database design and so it is not initially intuitive.

Now to take data stored in relational tables, with relationships between one another, into Elasticsearch, we can do one of three things:

Denormalise the data into a single document: This flattens the data out so if you had 2 tables in a direct relationship then you can place all columns and data into a single Elasticsearch mapping type. This is making the data structure flat so it is searchable.

Nested objects: Nested objects are one way to retain the relationship between two relational tables in some form. For example in a relational database you may have two tables ‘tvseries’ and ‘comments’. These tables have the relationship that a tvseries has one or many comments, and a comment belongs to one tvseries.

To transfer this relationship to Elasticsearch which does not use foreign keys to map relationships we can place ‘comment’ as a nested object in the ‘tvseries’.

nested objects.PNG

This means the foreign key relationship is lost but we have been able to retain some of the relationship between the two logical groups of data by nesting one inside the other.

Now in order to still be able to query the nested object separately from the root object it is stored in, we use nested queries.
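
A sketch of a nested query, assuming the comments field has been explicitly mapped with the 'nested' data type (the index, mapping type and field names are the illustrative ones from earlier):

GET tvseries/series/_search
{
  "query" : {
    "nested" : {
      "path" : "comments",
      "query" : {
        "match" : { "comments.author" : "Jan" }
      }
    }
  }
}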

Parent/child objects: Another way to map the relationship between logical groups of data is parent/child objects. I did this using the pet owning example I have used previously:

The parent object will be the owner, and the child object will be the cat. Here are the steps I went through to create this parent/child object combination.

  1. Create an index named “petowning”

setting it up.PNG

2. Create the parent object which is the owner

parent.PNG

 

3. Create the child object which is the cat

child.PNG
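
Putting the three steps together, here is a sketch of how parent/child mapping types are declared and used in Elasticsearch 5.x (the index, type and field names follow my pet owning example, but the exact bodies in my screenshots may differ):

PUT petowning
{
  "mappings" : {
    "owner" : {},
    "cat" : {
      "_parent" : { "type" : "owner" }
    }
  }
}

PUT petowning/owner/1
{
  "name" : "Alex"
}

PUT petowning/cat/1?parent=1
{
  "name" : "Kelly"
}

The ?parent=1 parameter on the child document is what ties the cat to its owner.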

 

Now each of these three methods has advantages and disadvantages which need to be considered against your system requirements when you are performing data modelling:

Flattened data: Uses less data than nested objects and parent/child objects, but the data relationships are totally lost

Nested objects: Faster but less flexible (because the root object that the nest object is held in must be re-indexed whenever the nested object is updated)

Parent/child objects: Less fast but more flexible

 

Relevancy

Elasticsearch by default uses the TF/IDF (Term Frequency/ Inverse Document Frequency) algorithm to determine how relevant a document is to a query.

This algorithm works by looking at how frequently the term appears in a document (term frequency), weighted by how rare the term is across all documents (inverse document frequency). What this means is that a rarer, more specific search term contributes more to the relevance ranking than a common one.

 

Percolator

Instead of saying which documents match a query, the percolator does the opposite: it outputs the queries that match a document.

To be able to output the search queries that match a document we have to store the search queries as JSON documents; however this is no problem because the search queries are written in QueryDSL (as I have previously discussed), which is itself JSON.

Below I am storing a search query to find a cat with the name “Violet”

storing query.PNG
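
A sketch of how this looks in Elasticsearch 5.x, where stored queries live in a field of the special 'percolator' type (the index name 'cat_queries' and the mappings here are my own illustration rather than exactly what my screenshot used):

PUT cat_queries
{
  "mappings" : {
    "queries" : {
      "properties" : {
        "query" : { "type" : "percolator" }
      }
    },
    "domestic" : {
      "properties" : {
        "name" : { "type" : "text" }
      }
    }
  }
}

PUT cat_queries/queries/1
{
  "query" : {
    "match" : { "name" : "Violet" }
  }
}

GET cat_queries/_search
{
  "query" : {
    "percolate" : {
      "field" : "query",
      "document_type" : "domestic",
      "document" : { "name" : "Violet" }
    }
  }
}

The final search returns the stored query with ID 1, because the supplied document matches it.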

 

So there we have it: explanations and tested examples (run by me on Elasticsearch 5.3 and Kibana 5.3, shown as screenshots) of the many different functions that Elasticsearch and Kibana provide. I hope this has been interesting and useful for you; I personally have found it fascinating to go through almost all of the commands I learned on my course in more depth and understand them better.

 

Bibliography

Chapman, B. (2014, June 11). What Can Querydsl Do for Me Part 1: How to Enhance and Simplify Existing Spring Data JPA Repositories. Retrieved April 16, 2017, from https://www.credera.com/blog/technology-insights/java/can-querydsl-part-1-enhance-simplify-existing-spring-data-jpa-repositories/

Aggregations | Elasticsearch Reference [5.3] | Elastic. (n.d.). Retrieved April 16, 2017, from https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html

 

How can Agile methods and DevOps improve IT project success rates?

The statistics about IT project success rates are shocking; to me the most interesting statistic in this article https://projectjournal.co.uk/2016/03/16/15-shocking-project-management-statistics/ is:

“On average, large IT projects run 45 percent over budget and 7 percent over time, while delivering 56 percent less value than predicted” (Bloch, Blumberg, & Laartz, 2012).

In this research journal entry I want to investigate the potential causes of these high failure rates, and whether implementing the principles and values of Agile methods and DevOps at the deployment end of software development projects can help in reducing them.

What are the issues causing the high IT project failure rates?

As we discussed in SYD701 a few weeks ago the likely cause of why there is such a high IT project failure rate is that projects are so complex nowadays that we do not have the methodologies or thinking paradigms to successfully build the systems on time, on budget and within or above expectations.

What do I mean by the projects being complex? Well, the needs IT systems are attempting to solve nowadays are not easily definable, there are many ways the system can be developed and many different solutions that could meet the requirements, and the end result is not always agreeable to all stakeholders. In other words the need that the IT system is trying to solve is a mess: it is not easily definable, and there are multiple processes and outcomes that could be produced.

The linear systems thinking paradigm that spawned the Structured Systems Analysis and Design Method (SSADM) is not designed to help development teams design and develop systems from messes, because the mindset of this train of thinking was that you could understand a need necessitating a system by dividing it down into its smallest parts. But how can you divide down a problem/need when you do not understand it?

You can’t and that’s part of the reason why the SSADM is not suitable for systems development of the modern, complex systems we build nowadays.

Let’s see some hard facts about IT project failures from a study performed by the consulting company McKinsey & Company and Oxford University:

IT issues.png

(Bloch, Blumberg, & Laartz, 2012)

Something important to note in all these ‘failed’ systems is the common characteristic that they all carried high risks in terms of finances, time, and benefits (Benni, 2012). Where there are high risks, particularly where the risks change frequently, a non-iterative systems development methodology is not ideal: it is designed to address the risks identified at the start of the SDLC and will not identify and address risks that appear later on. I believe that is the reason the study found “Every additional year (in a projects development) increases the expected cost overrun by 16.8% and schedule overrun by 4.8%” (Benni, 2012).

This McKinsey and Oxford study performed qualitative research with IT executives to identify what they believed the solution to these IT project issues was, and they came to the following conclusions about what is required for a successful IT project:

  • “focusing on managing strategy and stakeholders instead of exclusively concentrating on budget and scheduling
  • mastering technology and project content by securing critical internal and external talent
  • building effective teams by aligning their incentives with the overall goals of projects
  • excelling at core project-management practices, such as short delivery cycles and rigorous quality checks”  (Bloch, Blumberg, & Laartz, 2012)

 

Does the SSADM align with these 4 fixes?

Clearly the SSADM does not meet the first and last of those fixes; after the initial elicitation of client requirements in the systems analysis stage of the SDLC there is little client requirement elicitation throughout the rest of the project. It is hard for the SSADM to manage stakeholders to ensure the product meets their requirements due to its non-iterative nature, meaning if the requirements change then the system itself cannot change to meet them.

Additionally the SSADM does not have short delivery cycles; instead it releases the product in large chunks, which can create a headache for the operations team that has to make sure the product developed in the development environment actually works in the production environment before it is released to the users. I believe this mammoth release of software at the end of the SSADM was part of the reason that the concept of DevOps was first conceived: to create a change of mindset towards small, frequent releases of software to the operations team.

So due to the non-iterative nature of the Waterfall model which the SSADM follows, each project is effectively ‘stuck’ with the requirements identified in the systems analysis stage at the start of the SDLC, making this methodology unsuitable in environments where user and business requirements change or where there is a high risk of change.

And as we have seen in the research journal entry about DevOps, this deployment model works through the frequent deployment of small features from the development to the production environment, which are easier for the operations team to test, troubleshoot and deploy.

Can Agile and DevOps help?

Now, in my opinion at this stage of my knowledge about Agile methods, the implementation of Agile methods would help enormously in fixing most of the issues identified above because:

  1. Agile methods are timeboxed, meaning a project is split up based on blocks of time known as sprints. By focusing on time rather than features, and due to the iterative nature of producing prototypes that can theoretically be ready for deployment at the end of each sprint, the Project Manager can ensure that there will be a deployable product at the end of the project.

This product will likely have the most important features in it, because Agile systems development methodologies such as Scrum generally place the most important features (known as user stories) in the earlier sprints to ensure they will be completed.

2. Now going 45% over budget can be partially attributed to going 7% over time, however the difference in the percentages means there are obviously other factors involved, and I have to be honest and say at this stage I am not sure how the implementation of Agile methods could help with this.

3. However Agile methods are very likely able to improve the ‘missing focus’ statistics, because there is constant communication between the development team and the clients throughout an agile project.

However in our SYD701 class last week we looked at the history of computer systems, and a quote from the paper we were looking at, ‘A Short History of Systems Development’, interested me:

“The problems with systems today are no different than fifty years ago:

  • End-user information requirements are not satisfied.
  • Systems lack documentation, making maintenance and upgrades difficult.
  • Systems lack integration.
  • Data redundancy plaques corporate data bases.
  • Projects are rarely delivered on time and within budget.
  • Quality suffers.
  • Development personnel are constantly fighting fires.
  • The backlog of improvements never seems to diminish, but rather increases.”(Bryce, 2006)

This is interesting because it shows that the implementation of Agile methods will not solve all of the problems. The second issue in the list, a lack of documentation in modern systems, could easily be perpetuated by Agile methods, which value working software over documentation.

As I am working on the Koha Library Management System at work I understand how important it is to have good documentation to understand how a system you didn’t build works as a whole (we are lucky with the Koha project because being an open source project there is a lot of documentation to make it easier for new developers to contribute and so we have a wiki for Koha development).

This is an example of how Agile methods and DevOps are not a silver bullet; they do not solve all of the problems facing modern systems development.

 

Interesting resources on IT project failures:

http://www.geneca.com/blog/software-project-failure-business-development

http://www.pmi.org/-/media/pmi/documents/public/pdf/learning/thought-leadership/pulse/pulse-of-the-profession-2015.pdf

http://calleam.com/WTPF/?page_id=1445

http://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/delivering-large-scale-it-projects-on-time-on-budget-and-on-value

 

Bibliography:

Bloch, M., Blumberg, S., & Laartz, J. (2012, October). Delivering large-scale IT projects on time, on budget, and on value | McKinsey & Company. Retrieved March 24, 2017, from http://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/delivering-large-scale-it-projects-on-time-on-budget-and-on-value

Benni, E. (2012, April). Transforming the company: Avoiding the Black Swans – Success factors and core beliefs in Value Assurance. Mobily CIO Summit, Istanbul. Retrieved from http://mobilyciossummit.com/presentations/02_Mickensey_and_Co.pdf

Bryce, T. (2006, March 14). A Short History of Systems Development. Retrieved March 31, 2017, from http://it.toolbox.com/blogs/irm-blog/a-short-history-of-systems-development-8066

Elasticsearch Part 2

On Friday I attended an excellent Elasticsearch basics for developers course at work and I would like to discuss what I learned and how it has changed my  view of Elasticsearch since I wrote about it a couple of weeks back.

What is Elasticsearch?

Elasticsearch is usually thought of as a search engine but it is more than that, Elasticsearch can also be considered a:

  • Data store, meaning in addition to using it as a search engine for your app/system you could also use it as an alternative to a Relational Database Management System (RDBMS). Elasticsearch stores data in documents, mapping types and indexes, which are the equivalent of a relational database’s rows, tables, and databases respectively.

 

  • Reporting tool – Elasticsearch can be used to store system logs. Kibana, the browser-based front end you can use to interact with Elasticsearch, can generate visual charts from Elasticsearch search query results, for example the below visual charts of system log information. These charts present the data in a far more useful format than a written system log.

kibana visualisation

(Christopher, 2015)

Something that particularly interested me about the course was that the presenter Frederik said Elasticsearch is very flexible and extremely useful as long as you’re prepared to spend time configuring it.

A lot of people implement Elasticsearch (which is actually pretty easy, as I found last week) and expect it to be the equivalent of Google for their organization’s data; however if you don’t configure it to match your business problem domain then it will not reach its full potential.

What is the internal structure of Elasticsearch?

Elasticsearch is built on top of Lucene, which is a search library. In the documentation it is very hard to determine where one ends and the other begins, however I believe that having done the course and read through the first answer on this very interesting StackOverflow page (http://stackoverflow.com/questions/15025876/what-is-an-index-in-elasticsearch) I have a good understanding of this now, so let’s test it out.

I look at Elasticsearch and Lucene as a 2-layered cake (to see this graphically look at the below diagram, where we have Elasticsearch as the top layer and Lucene as the bottom layer); the top layer (Elasticsearch) is the one that the user interacts with. When you first install Elasticsearch a cluster is created (a cluster is a collection of 1 or more nodes, i.e. instances of Elasticsearch).

Inside this cluster by default you have 1 node (a single instance of Elasticsearch). This node contains indexes. Now an index is like a database instance. Drilling down further we have mapping types (the equivalent of tables, for example you could create a mapping type of student); inside a mapping type there are documents (a single data record, making them the equivalent of a row in a database); and inside each indexed document there are properties, which are the individual data values (so for example 22 years old is a property for age).

To put the document into perspective it is just a JSON data structure.

So we have established that Elasticsearch stores the data in indexes, with each data record known as a document.

But how does Elasticsearch actually find specific data when someone writes an HTTP GET request into Kibana? Well that’s where Lucene comes in; Lucene is the bottom of the two layers in my cake simile. Lucene contains its own index, which is an inverted index: instead of storing the data it points to which indexed documents in the Elasticsearch index a data value is stored in, in much the same way a book index points to the page number where a particular word exists.

Another good analogy for an inverted index is that it is quite similar to containers such as arrays and dictionaries, which point to a specific location in memory where a particular value is stored rather than storing the value itself in their data structure.

Having done the course I now believe I understand how the index in Elasticsearch and the index in Lucene relate.

lucene and es.png

(Principe, 2013)

Now as I said by default your Elasticsearch cluster has one node, however Elasticsearch is extendable meaning you can add more nodes to your cluster.

By default each index is divided into 5 shards, spread across the nodes. What is a shard? “A shard is a single Lucene instance. It is a low-level “worker” unit which is managed automatically by elasticsearch. An index is a logical namespace which points to primary and replica shards.” (“Glossary of terms | Elasticsearch Reference [5.3] | Elastic,” n.d.). In other words the Elasticsearch index is a logical grouping, while the Lucene inverted indexes live inside the shards.

Each shard has a backup in the form of a replica shard which is stored on a different node, this provides data redundancy and speeds up the search times because it is more likely the HTTP GET request is sent to a shard containing an inverted index with the search term in it.

What is the process that happens when the client RESTful API sends a request to the cluster?

In Elasticsearch the commands are called APIs, so for example a delete command is called the delete API.

Now like I previously stated Elasticsearch is structured as a collection of nodes in a cluster (think of it like how there are multiple servers in the concept of the cloud).

The nodes store different information (in the form of Lucene inverted indexes and Elasticsearch indexes), so the request needs to go to a particular node to access particular data. However all the nodes store information about the topology of the cluster, so they know which node contains the data the API command seeks/wants to modify.

When you write an HTTP GET request in Kibana, the ID specified in the request is hashed, and the hashed ID determines which shard (and therefore which node) the request belongs to. It doesn’t matter which node the request is initially sent to, as that node will redirect the request to the appropriate node if it doesn’t hold the matching shard.
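
As I understand it, the routing formula Elasticsearch uses for this is the documented default (where the routing value is the document ID unless you override it):

shard = hash(routing) % number_of_primary_shards

This is also why the number of primary shards cannot be changed after an index has been created.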

However to make sure that the same node is not always queried the destination node of each search query is different based on a round robin distribution.

How Elasticsearch data storage violates normalisation and refactoring

Elasticsearch is all about fast search times; to achieve this, having duplicated data in multiple indexes is considered acceptable.

This is in complete contrast to the database concept of normalization and the programming concept of refactoring both of which stress the need to remove duplicate data/code.

What are the differences between Elasticsearch and a relational database

Although Elasticsearch can be used as a data store meaning you could implement it as an alternative to a relational database the differences are:

  • Elasticsearch does not use foreign keys to create relationships between indexes
  • Data can be duplicated to speed up the query time
  • Query joins (querying two or more indexes in a single query) are not available in any effective way from Elasticsearch, meaning although rudimentary joins can be implemented they are not very effective

So when should you replace your RDBMS with Elasticsearch? Well it depends on the sorts of queries you have performed/want to perform on your primary data store. If you are performing complex transactional queries (2 or more queries concurrently) then Elasticsearch is not ideal and you would be better off using an RDBMS such as MySQL and just using Elasticsearch as a search engine.

However if you don’t need complex transactional queries then Elasticsearch is a good alternative to an RDBMS.

What are the cons of Elasticsearch?

Elasticsearch is not ideal from a security point of view; this is because it does not provide data or transport encryption.

It is near realtime – This means there is a slight latency after indexing a document before you can search for the data it holds.

What are the benefits of Elasticsearch?

The main benefits of Elasticsearch are:

  • It is fast – Due to the data being duplicated in multiple shards it means it is faster to access data in either the primary or replica shards
  • It is distributed – Meaning it is easy to extend by creating another node in your cluster
  • High availability – Each inverted index is held in both a primary and a replica shard, meaning the indexes are more easily available

 

Starting Elasticsearch on Linux

Last week I installed and used Elasticsearch on a Windows machine, now I want to cover how to use Elasticsearch on a Linux machine:

  1. Download both Elasticsearch and Kibana (the versions used in my course were Elasticsearch 5.1.1 and Kibana 5.1.1 however there are more recent versions available of both systems and so there may be version conflicts which are visible once you visit Kibana in your browser. If there are version issues simply install the version of either Kibana or Elasticsearch specified on the Kibana interface).
  2. Start two terminal windows. In one terminal navigate to the Elasticsearch directory and start Elasticsearch by writing in:

./elasticsearch-5.1.1/bin/elasticsearch

3. In the other terminal navigate to the Kibana directory and write in:
./kibana-5.1.1-linux-x86_64/bin/kibana

4. Now in your browser visit Elasticsearch by writing in the URL:
http://localhost:9200

5. Also in your web browser visit Kibana by writing in the URL:

http://localhost:5601/app/kibana

Because you now interact with Elasticsearch through Kibana in the web browser everything is the same from this stage no matter what OS you are using.

 

Examples of Elasticsearch API commands

Create Index API – This creates an index which we can then use to index documents (create data records)

PUT student

{
  "settings" : { ... },

  "mappings" : { ... }

}

In this create index API command you can specify the number of shards and replicas you want the index to span. e.g.

PUT student

{

  "settings" : {

    "number_of_shards" : 1,

    "number_of_replicas" : 1

  }

}

Index API – Here I am specifying that I want to use the ‘student’ index, creating a ‘degree’ mapping type, and specifying that the ID of the document I am indexing is 1. Then I index the document itself. By performing the index API I am automatically creating a mapping type of degree.

Note: Specifying the id value in the PUT command is optional.

PUT student/degree/1

{

  "name" : "Alexander Buckley",

  "alt_names" : [ "Alex" ],

  "origin" : "Nelson"

}

 

If I want to guarantee I am creating a brand new document (i.e. the request should fail if a document with that ID already exists) I can use the _create endpoint:

PUT student/degree/1/_create

{

  "name" : "Alexander Buckley",

  "alt_names" : [ "Alex" ],

  "origin" : "Nelson"

}

GET API – This retrieves the data for a specific document.

e.g.

GET student/degree/1

 

Exists API – This checks if there is a document with a particular id in an index.

e.g.

HEAD student/degree/1

This is checking if there is a document with the id of 1 in the student index and degree mapping type.

 

Delete API – This deletes a document from an index by specifying the id of the document you want deleted in the Delete API command

DELETE student/degree/1

Write consistency – Before any of these write API commands is performed, more than half of the shard copies in the cluster need to be available, because it is dangerous to write to a single shard.

Versioning – Elasticsearch uses versioning to keep track of every write operation to a document. The version is assigned to a document when it is indexed and is automatically incremented with every write operation.
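
For example (a sketch using my student index; the responses are abbreviated to the relevant field):

PUT student/degree/1
{ "name" : "Alexander Buckley" }

PUT student/degree/1
{ "name" : "Alex Buckley" }

The first PUT returns "_version": 1 in its response; repeating the PUT on the same document returns "_version": 2.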

Update API – This allows you to update parts of a document. To do this you write the changes to specific properties of the document inside the ‘doc’ attribute. So in the below example I am updating the name property of the document with the ID of 1. This will be merged with the existing document.

POST student/degree/1/_update

{

  "doc" : {

    "name" : "Alex Buckley"

  }

}

 

Multi Get API – This is where you can request multiple documents from specific indexes and mapping types. What’s returned is a docs array.

GET _mget

{

  "docs" : [

    {

      "_index" : "student",

      "_type" : "degree",

      "_id" : 1

    },

    {

      "_index" : "student",

      "_type" : "degree",

      "_id" : 2,

      "_source" : ["origin"]

    }

  ]
}

 

Bulk API – To perform multiple different API commands in a single request. Elasticsearch splits the command up and sends the parts off to the appropriate nodes. If two requests are requesting/manipulating data on the same node then they are sent together.

PUT _bulk

{ "delete" : { "_index" : "student", "_type" : "degree", "_id" : 2 } }\n

{ "index" : { "_index" : "student", "_type" : "degree", "_id" : 3 } }\n

{ "name" : "Jan Smith", "alt_names" : ["Janet Smith"], "origin" : "Wellington" }\n

In this example I am deleting a document from the student index and indexing (adding) another document, all with a single Bulk API call. The benefit of this is that once the API request has been redirected to the node containing the student index, multiple API commands can be performed, which is obviously more efficient.

Search – Text analysis

Unlike the range of commands for the CRUD actions in the above section, for search we use the GET API. This is sent from the Kibana client to the Elasticsearch cluster, and redirected from the node that received the command to the node containing the inverted index with a matching search value (known as a token).

This Lucene inverted index contains 3 columns: the first column is the search token, the second column is the number of documents it exists in, and the third column is the IDs of the documents it exists in.

e.g.

token        docfreq.        postings (doc ids)

Janet          2                  3, 5

If we were using a GET API to find all instances of the word ‘Janet’ we would be returned with the documents 3 and 5.

When indexing a document you can use the index attribute to specify what fields you want to be searchable. This attribute can have one of three values:

  • analyzed: Make the field searchable and put it through the analyzer chain
  • not_analyzed: Make the field searchable, and don’t put it through the analyzer chain
  • no: Don’t make the field searchable

But what is the analyzer chain?

OK, so the values from the indexed documents are placed in the Lucene inverted index and that is what is queried when using Elasticsearch as a search engine. If we have a string we want to be searchable then we often have to tidy it up a bit to make it more easily searchable; that’s where the analyzer chain comes in. It performs the following actions:

  1. The first step is the char filter, this removes any HTML syntax. e.g. the string “<h1> This is a heading </h1>” would become “This is a heading”.

2. The second step is the tokeniser, which splits the string up into individual tokens (words).

3. The third step is the token filter; there can be several of these chained together and they usually do things like:

  • Removing stop words like ‘a’
  • Making all letters in each word lower case
  • Replacing similar words with their stem word. In other words the two words “run” and “running” are similar, so instead of writing them both to the inverted index we replace them with the single stem “run”. Replacing similar words with stem words is automated by a stemming algorithm.

Interestingly, all user query terms go through the same analyzer chain before they are compared against the inverted index, if the user uses the ‘match’ attribute in their search query (which will be discussed below).

Search

Elasticsearch can perform two types of search:

  • Structured query – This is a boolean query in the sense that either a match for the search query is found or it isn’t. It is used for keyword searches.
  • Unstructured query – This can be used for searching for phrases and it ranks the matches on how relevant they are. It can also be called a fuzzy search, because it does not treat the results in a boolean way (saying they’re either a match or not, as the structured query does) but instead returns results that exist on a continuum of relevancy.

 

Search queries:

match_all: This is the equivalent of SELECT * in SQL queries. The below example will return all documents in the student index and degree mapping type.

GET student/degree/_search

{

  "query" : {

    "match_all" : {}

  }

}

Note: The top 10 results are returned, and so even if you perform the match_all query you will still only get 10 results back by default. But this can be customized.

 

If you want to search fields that have not been analyzed (i.e. they did not go through the analyzer chain when the document was indexed) then you want to use the ‘term’ attribute.

However if you want to query a field that has been analyzed (i.e. it has gone through the analyzer chain) then you will use the match attribute.

e.g.

GET student/degree/_search

{

  "query" : {

    "match" : { "name" : "Jan Smith" }

  }

}

This means the term Jan Smith will go through the analyzer chain before it is compared against values in the inverted index.

The multi_match attribute can be used to find a match in multiple fields, i.e. it will put multiple search values through the analyzer chain in order to find a matching value in the inverted index.

GET student/degree/_search

{

  "query" : {

    "multi_match" : {

      "fields" : ["name", "origin"],

      "query" : "Jan Smith Wellington"

    }

  }

}

I will discuss the other search queries in my next Elasticsearch blog to prevent this one getting too long.

 

Mappings

When we index a document we specify a mapping type which is kind of like the table in a relational database, or a class in the OO paradigm because it has particular properties which all documents of that mapping type have values for.

The benefits of mapping types are that they make your indexes match the problem domain more closely. For example by making a mapping type of degree I am making the documents that I index in the degree mapping type a more specific type of student.

To save time when creating indexes with the same mapping type we can place the mapping type in a template and just apply the template to the index.

e.g.

PUT _template/book_template

{

  "template" : "book*",

  "settings" : {

    "number_of_shards" : 1

  },

  "mappings" : {

    "_default_" : {

      "_all" : {

        "enabled" : false

      }

    }

  }

}

A template like this is applied automatically to any new index whose name matches the “template” pattern (book* in this case). You can also use dynamic templates inside a template, for example to map any field whose name ends in _i as an integer:

PUT _template/book_wildcard

{

  "template" : "book*",

  "mappings" : {

    "question" : {

      "dynamic_templates" : [

        {

          "integers" : {

            "match" : "*_i",

            "mapping" : { "type" : "integer" }

          }

        }

      ]

    }
  }
}

Note: It is recommended that you only assign a single mapping type to an index.

Conclusion

I have learned a lot from the Elasticsearch course and will continue to discuss what I learned in the next Elasticsearch blog.

 

Bibliography:

Christopher. (2015, April 16). Visualizing data with Elasticsearch, Logstash and Kibana. Retrieved March 26, 2017, from http://blog.webkid.io/visualize-datasets-with-elk/

Principe, F. (2013, August 13). ELASTICSEARCH what is | Portale di Francesco Principe. Retrieved March 26, 2017, from http://fprincipe.altervista.org/portale/?q=en/node/81

Glossary of terms | Elasticsearch Reference [5.3] | Elastic. (n.d.). Retrieved March 30, 2017, from https://www.elastic.co/guide/en/elasticsearch/reference/current/glossary.html

What is DevOps? and how does it relate to Agile methods?

Along with myself there were several other full-time summer IT interns at work over the summer, including one who was in the DevOps team. When I heard he was working in that team I wondered: what is DevOps?

Like I said in my previous research journal I am going to investigate what DevOps is, but I also want to see how (if at all) it relates to Agile methods.

What is DevOps?

devops.jpg

(Vashishtha, n.d.)

The above image illustrates the key idea of DevOps beautifully. Put simply DevOps is the increase in communication, collaboration and automation in and between the development and production environments.

What is the current software deployment model?

To understand DevOps I wanted to understand where we are with traditional systems deployment and what the issues are that DevOps attempts to solve.

Software development companies are generally structured in teams (I know from my experience this is an important concept in modern development organizations).

The work of two teams in particular affects the profitability of the companies software, these teams are:

  • Development team – They actually build the software in the development environment
  • Operations team – They deploy the software to the production environment which they maintain.

Now in traditional software development companies (I am referring to companies that have not implemented DevOps) there is often a level of mistrust between these teams, due to one major issue:

  • The development environment (that developers work in) and the production environment (that operations maintain) are configured differently, meaning when code is deployed into the production environment it takes time for the operations team to get it working successfully, slowing down the whole software deployment process

Now I would have thought it was common sense for the development and production environments to be as identical as possible so systems/features built in the development environment could be seamlessly deployed to the production environment but this has not been the case.

The sort of problem dissimilar environments cause is that the production environment is less forgiving of software exceptions than the development environment, so an exception that causes no observable error or warning in the development environment can crash the system in the production environment. Not good when trying to deploy software on a tight deadline.

It is the operations team that has to fix up the code for the production environment before it can be released to the customer, and because this just adds another job to their task list, this is where the level of mistrust between the development and operations teams comes from.

The development team, meanwhile, gets annoyed at the operations team because the time it takes to deploy the code they write holds them back from delivering new systems/features.

This gridlock slows down the whole software deployment process, which has a business cost, because remember IT is just there to help businesses and organizations. The detrimental business cost is that the competitive advantage of meeting a customer’s needs or filling a business niche may be taken by a faster-deploying competitor.

How can DevOps help?

I look at DevOps as a metaphorical combination of a communication course and the equivalent of an industrial revolution in software deployment.

What? Let me explain with several points:

  1. DevOps attempts to increase the collaboration of the development and operations teams  thereby speeding up the time it takes to deploy software to the customer. This collaboration is like a communication course of sorts as it is making the two teams communicate more so their systems can become more alike.

2. DevOps attempts to free up more time for both teams by automating the software deployment process as much as possible. This means automating the testing, deploying, and monitoring of software in both the development and production environments using a set of tools.

Therefore I view DevOps as the industrial revolution of IT systems development, because like with the Industrial Revolution of the 18th and 19th centuries DevOps tries to automate as many tasks as possible allowing the workers to work on what can’t be automated.

Another change that DevOps does is it attempts to change the mindset of both teams because instead of working on big new features for existing systems, it promotes the development of small code releases that can be quickly tested, deployed and monitored in the production environment by automated tools.

The benefit of getting small chunks of software out to the customer quickly, rather than big chunks of software more slowly is that the company can gain the competitive advantage by filling a business niche with its quickly evolving system as opposed to missing out to faster competitors.

What are the tools that DevOps uses to implement these changes?

To be able to build small chunks of code and automate the testing of them the organization will need to implement a tool like Jenkins (https://jenkins.io/) (Rackspace, 2013).

They will also need a source control tool such as Git (Rackspace, 2013).

Tools that allow them to configure their environments, and automate the deployment of code to servers in the production environments  will be tools like Puppet (https://puppet.com/) (Rackspace, 2013).

The tools they use for application monitoring will be monitoring the system logs; these tools will be things like New Relic. The benefit of this tool is that it can monitor the system logs of thousands of servers and inform both teams of any issues with the new code in the production environment (Rackspace, 2013).

Basically tools like New Relic make sense of vast quantities of data in much the same way (obviously on a much smaller scale and without the machine learning aspect) as systems like IBM Watson which trawl through vast quantities of data finding patterns and presenting insights (Rackspace, 2013).

How do the principles and values of DevOps and the Agile methods work together?

So Agile methods, as I discussed in a previous research journal entry, are a set of values and principles to help development teams make decisions in the development of a product for a user.

This interesting YouTube video describes that the relationship between Agile and DevOps is that an Agile mindset exists from the:

  • User to the development team
  • and DevOps is from the Development team to the Operations team.

In other words they do not exist at the same time; this view is further backed up by this article in InformationWeek (http://www.informationweek.com/devops/agile-vs-devops-10-ways-theyre-different/d/d-id/1326121?image_number=11).

Now having looked at these resources, my own opinion is that there are minor differences in the way these two concepts are implemented. For example, documentation is not viewed as highly as working software in Agile methods, whereas in DevOps documentation is required because the development team is handing the product to a new team (the operations team) to deploy; the operations team has not worked on the product and so they require documentation to understand the system, something anyone who has worked on an existing system they didn’t build will understand.

However, despite these minor differences I am amazed at the similarity in many ways of DevOps to Agile methods. DevOps changes the mindset of a software development organization so that it deploys software faster, which allows the development and production environments to use a more Agile methods approach to developing and deploying the small releases which happen more frequently than before DevOps was implemented.

So I believe that yes, Agile and DevOps cover different parts of the systems development life cycle, with Agile methods covering the initial development of the product whilst DevOps covers the deployment of the product; however the common fundamental concepts of smaller, more frequent releases of software over one huge release, and increased communication between and within teams, link these concepts together.

Interesting DevOps resources:

http://www.agilebuddha.com/agile/x-htm/

Bibliography:

Vashishtha, S. (n.d.). Demystifying DevOps : Difference between Agile and DevOps. Retrieved March 21, 2017, from http://www.agilebuddha.com/agile/x-htm/

Rackspace. (2013, December 12). What is DevOps? – In Simple English – YouTube. Retrieved March 24, 2017, from https://www.youtube.com/watch?v=_I94-tJlovg