Monday, July 15, 2013

Big Data, what it can and can't do

With all the hype around big data, I happened to attend the Fifth Elephant 2013 conference to understand the playing field better. The speaker list was impressive and included industry bigwigs like Dr. Edouard Servan-Schreiber, Director of Solution Architecture at 10gen, the MongoDB company; Dr. Shailesh Kumar, Member of Technical Staff at Google, Hyderabad; and Andreas Kollegger, Experience Architect at Neo4j, to mention a few.

The experience was thoroughly fulfilling, and it was nice to rub shoulders with the local tech community and connect on such a scale. It is fascinating to see the amount of data that some companies generate, capture and operate upon on an everyday basis.

This post contains my take on the technology's applications and limitations. Thoughts may vary, and that's why we have a comments section.

First, let me discuss where we cannot apply bigdata/NoSQL paradigms:

  • They cannot be used for applications and systems with a high volume of long or complex transactions, or where the system requires multi-join queries. That is something no NoSQL implementation guarantees so far. It may be on the cards, but it seems unlikely, as it would push NoSQL toward an RDBMS-style implementation.
  • They cannot easily be applied to legacy systems that are tightly coupled with their database. For example, in one of my previous projects an application was very DB-heavy: it had a lot of functions and stored procedures that carried the application logic. So even though the app had a huge amount of data, this coupling makes it difficult to move to a NoSQL implementation.
  • They are not the right choice for applications that deal with a small amount of unstructured data. Honestly, we cannot use an elephant to scare a mouse; a cat would do just fine.
  • They essentially cannot be used for anything that operates in hard real time, e.g. capturing data from an F1 car to run real-time diagnostics and see where a problem might come from (or maybe they can, if a little bit of latency is not a problem).
NoSQL/bigdata have given us the power to operate in near real time on very large data sets, but of course the speed of the operation depends on the implementation of the crunching logic. So, in order to have a fast operation (read: low latency and high throughput), we need both a NoSQL DB and near-real-time processing/crunching capabilities.
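To make the idea of "crunching logic" concrete, here is a tiny, purely illustrative Java sketch (the class and event names are my own, not from any bigdata product) of the kind of running aggregation a near-real-time processor performs over incoming telemetry:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: an in-memory "crunching" step that keeps a running
// count per event type, the sort of cheap per-event work that keeps
// latency low in a near-real-time pipeline.
public class RunningCounter {
    private final Map<String, Long> counts = new HashMap<>();

    // Called once per incoming event.
    public void record(String eventType) {
        counts.merge(eventType, 1L, Long::sum);
    }

    public long countOf(String eventType) {
        return counts.getOrDefault(eventType, 0L);
    }

    public static void main(String[] args) {
        RunningCounter counts = new RunningCounter();
        counts.record("engine.temp");
        counts.record("engine.temp");
        counts.record("tyre.pressure");
        System.out.println(counts.countOf("engine.temp")); // prints 2
    }
}
```

In a real deployment this logic would run inside a stream processor against a NoSQL store rather than a HashMap, but the shape of the work is the same.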

Now let's touch upon some areas where bigdata/NoSQL can have a big impact:
  • e-Learning is one of the classic examples. I was working on an application with a lot of custom courses, exams and associated media for students registering to take a course. It was designed with the rigidity of an RDBMS, but in retrospect I feel it is a good candidate for a NoSQL implementation.
  • Banks and commercial institutions are already applying big data in a lot of ways, and fraud-monitoring agencies and companies rely on the processing capabilities of the bigdata stack to do transaction analysis in near real time. The transaction data still goes to an RDBMS, but a lot of other data is now being recorded onto NoSQL databases for trend analysis and, simply put, faster access/look-ups.
  • Content Delivery Networks are also using the bigdata stack to optimize web app performance. Citibank has such an implementation, where the application renders out of a content cache that uses MongoDB as storage. A custom cache controller can be written over the DB to achieve something like this.
  • Bioinformatics and cheminformatics systems can also leverage NoSQL databases for faster responses. I happened to work with the industry leader Accelrys Inc. in chem- and bioinformatics, and there were a few applications I saw that could definitely benefit from the bigdata stack. Some of their products could also use graph databases, especially with the development of the Accelrys Enterprise Platform (AEP).
  • Large-scale analytic processes and applications are the classic use case for a bigdata/NoSQL stack. Meteorological systems, trade analysis systems and logistics systems are places where the bigdata stack can be used, and I am sure it already is in some places. These systems need near-real-time analytics and also require data trends and reports over large data sets and long periods of time, and that is where the bigdata stack can help.
Lastly, I would like to close the post with a discussion that I had with a peer about the power of having a huge amount of data. Remember Garry Kasparov, who defeated Deep Blue in 1996 and was defeated a year later by its upgraded successor. The reason, we concluded, that the later Deep Blue won was not just that it was faster and better, but that it had a bigger data set and more crunching ability than its predecessor.

So, over a period of time, it is the high volume of data that will win over a well-written, crafty algorithm.

Wednesday, June 12, 2013

BDD using Cucumber-JVM

I personally had a tough time finding the right list of resources to get started with Cucumber-JVM. That is the sole purpose of this post: to be one more resource on the internet that helps anyone who wants to implement Behavior Driven Development and radically improve the stability of their application.
A little brief about behavior-driven development: it is not exactly the latest buzzword in town, but since I have used it in a lot of my previous projects, I can say it is a very easy way to ensure that your workflows are not broken by subsequent releases. I have learnt to rely upon these tests (provided they are well written and foolproof), and so have a lot of organizations. What I have realized over time is: automate everything that you can, especially the tests, and you will reach a level of continuous delivery with your product/application. Now is that not worth the effort? So here is the hands-on part.

One of the first things you will need is, of course, an IDE. I prefer Eclipse only because I have been using it for a long time. You can also write the entire set of tests using command-line tools like Vim; the only thing that matters is that we configure and define the paths properly to execute the tests.

The next thing you will need is the supporting jars. I assume you have a JDK installed and on your system/user classpath/path in order to start Eclipse, so the rest of the jars that are needed are as follows.

Here is the basic list of jars you will need to set up the project: cucumber-core, cucumber-java, cucumber-junit, gherkin, junit, and the Selenium WebDriver (selenium-java) jars. You can pick your specific versions from the Maven repository as well.
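If you use Maven, the fragment below is a sketch of what such a project's dependencies looked like in the Cucumber-JVM 1.x era; treat the version numbers as placeholders and pick whatever is current:

```xml
<dependencies>
  <dependency>
    <groupId>info.cukes</groupId>
    <artifactId>cucumber-java</artifactId>
    <version>1.1.3</version>
  </dependency>
  <dependency>
    <groupId>info.cukes</groupId>
    <artifactId>cucumber-junit</artifactId>
    <version>1.1.3</version>
  </dependency>
  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.11</version>
  </dependency>
  <dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>2.33.0</version>
  </dependency>
</dependencies>
```

The cucumber-junit artifact pulls in cucumber-core and gherkin transitively, so you do not need to list them separately.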

With all the stuff in place you can start by creating a Java Project in Eclipse. Once the project is created, import all the downloaded jars into the project's build path and classpath (just copy all the jars into a lib directory at the project root, creating one if it is not there, and modify the .classpath file of the project).

Start by creating the runtime entry point for the tests by defining a class. You can name it whatever you want; for the sake of simplicity we will call it "RunCukesTest" and put it in a test.java package in the source folder. We mark it with @RunWith(Cucumber.class), a JUnit annotation that tells JUnit to use the Cucumber runner when this class is run as a test. Here is what the class looks like.

package test.java;

import org.junit.runner.RunWith;

import cucumber.api.junit.Cucumber;

@RunWith(Cucumber.class)
public class RunCukesTest {
}


There are options that you can define using the @Cucumber.Options annotation at the class level, e.g.

@Cucumber.Options(format = "html:target/report", features = "src/test/resources/", tags = "@myTag")

  • format specifies the format of the report (html here) and the path, relative to the project root, where it should be saved.
  • features defines the path to the feature files; you can also give the full relative path of a single feature file if you only want to run that one.
  • tags defines the specific scenarios to execute, so only the scenarios marked with @myTag will be run in the current test run.
There are more options that can be configured, but these are good to begin with; you can find the full details in the Cucumber-JVM documentation.

You can right-click and run this file as a JUnit test. Right now it will result in an error, because no features are defined and the required package is not declared yet.

The next step is to define a package under the source folder; call it test.resources. This will contain all your feature files. Feature files are a cool way of writing tests in plain English: the idea is that anyone can write a feature, though of course it is up to the developer to write the corresponding implementations.

The feature that you will write is a simple Gmail login that opens the mailbox.
Here is the feature file:-

Feature: Gmail

Scenario: open a mail to read in gmail
Given I Login to gmail with "myuser@gmail.com" and "mypassword"
When I click on the sign in button
Then I can see my mailbox

Save this file as GmailLogin.feature under the package test.resources. Note that the quoted values are passed verbatim to the step definition, so use real credentials here.

Now you will write the step definitions that implement the feature. A step definition file contains methods whose annotations carry regular expressions that are matched against the steps in the feature file. For the above feature file, here is the step definition:-

package test.java;

import org.junit.Assert;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;

import cucumber.api.java.After;
import cucumber.api.java.Before;
import cucumber.api.java.en.Given;
import cucumber.api.java.en.Then;
import cucumber.api.java.en.When;

public class GmailStepDefs {
    WebDriver driver;

    @Before
    public void start() {
        driver = DriverHelper.getDriverInstance(DriverHelper.FIREFOX_DRIVER);
    }

    @Given("^I Login to gmail with \"([^\"]*)\" and \"([^\"]*)\"$")
    public void I_login_to_gmail(String username, String password)
            throws Exception {
        driver.get("http://mail.google.com");

        WebElement userid = driver.findElement(By.id("Email"));
        WebElement passwd = driver.findElement(By.id("Passwd"));

        userid.sendKeys(username);
        passwd.sendKeys(password);
    }

    @When("^I click on the sign in button$")
    public void I_click_on_signIn() throws Exception {
        WebElement signIn = driver.findElement(By.id("signIn"));
        signIn.click();
    }

    @Then("^I can see my mailbox$")
    public void I_can_see_my_mailBox() throws Exception {
        // Crude wait for the inbox to load; a WebDriverWait would be cleaner.
        Thread.sleep(10000);
        WebElement composeButton = driver.findElement(By
                .xpath("//div[contains(text(), 'COMPOSE')]"));
        Assert.assertNotNull(composeButton);
    }

    @After
    public void end() {
        driver.quit();
    }
} 

Save this file under the test.java package. DriverHelper is a factory class that I use to create driver instances, only because the IE driver is a little problematic to work with and we have to ensure, for the stability of the tests, that we are always working with a single instance. Here is the code for it:-

package test.java;

import java.util.concurrent.TimeUnit;

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.ie.InternetExplorerDriver;
import org.openqa.selenium.remote.DesiredCapabilities;

public class DriverHelper {
    
    public static final String INTERNET_EXPLORER_DRIVER = "IE";
    public static final String FIREFOX_DRIVER = "FF";
    public static final String CHROME_DRIVER = "GC";
    public static WebDriver driver;
    
    public static synchronized WebDriver getDriverInstance(String driverType) {
        if (driver == null){
            if (driverType.equalsIgnoreCase(INTERNET_EXPLORER_DRIVER)) {
                System.setProperty("webdriver.ie.driver",
                        "E:\\workspace\\TestCucumber\\lib\\IEDriverServer-32bit.exe");
                DesiredCapabilities capabilities = DesiredCapabilities
                        .internetExplorer();
                capabilities
                        .setCapability(
                                InternetExplorerDriver.INTRODUCE_FLAKINESS_BY_IGNORING_SECURITY_DOMAINS,
                                true);
                capabilities.setCapability("ignoreProtectedModeSettings", true);
                driver = new InternetExplorerDriver(capabilities);
                driver.manage().timeouts()
                        .pageLoadTimeout(180, TimeUnit.SECONDS);
                return driver;
            } else if (driverType.equalsIgnoreCase(FIREFOX_DRIVER)) {
                // Assign to the shared field so later calls reuse this instance.
                driver = new FirefoxDriver();
            } else if (driverType.equalsIgnoreCase(CHROME_DRIVER)) {
                driver = new ChromeDriver();
            }
        }
        return driver;
    }

}

As you can see, there is a lot of code specifically for the creation of the IE driver instance. Now go back to the RunCukesTest class, right-click, select Run As > JUnit Test and enjoy the show. You will need the 32-bit or 64-bit IEDriverServer, depending on your OS, to be present on the path; it can be downloaded from the Selenium downloads page.

Please feel free to contact me if you need more help or sample code for this post.

Tuesday, June 11, 2013

Scalability and Availability using Proxies

The first thing that comes to mind upon hearing the word proxy is some sort of distribution facade. That is more or less what I will be writing about in this post, with a bit more here and there.

When I was in college, the word proxy meant calling out someone else's attendance: if student A is not present in class, student B replies "Present" when A's roll number is called. This is the essence of proxies. They are a single interface for all incoming requests in a cluster/multi-server environment. They give the appearance that you are always requesting a single server URL like www.xyz.com, while it is actually the proxy that you are contacting. The proxy then, based on the distribution algorithm it is programmed with, pushes the request to one of the servers behind it. The distribution mechanism itself is out of scope for this post, and I will hopefully cover it in a future one.

Proxies provide a very basic level of scalability and availability. You can have a proxy up front with two servers running in a cluster; if you feel the need to scale, add more servers to the system to keep up with growing traffic demands.
The proxy also makes the system more available: if one of your servers goes down (read: crashes), the proxy automatically directs all incoming traffic to the other available servers in the cluster. So, essentially, your application/site stays up and you do not lose business.

There are two ways in which proxies appear in a system: a forward proxy (or just "a proxy") and a reverse proxy.
A forward proxy is generally used to segregate an internal network from the internet. You can control/filter content going in and out of the network, so in this case the application is more about securing the internal network against unintended use, attack or access, while the internal users are protected against malware or problem-causing content online.

The reverse proxy is what is generally used as an availability and scalability medium. The entire set of servers sits behind the proxy, and requests are directed to each of them depending on the traffic they are currently handling. It is a technique for scaling out more than scaling up, but it has the desired effect nevertheless.
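As a sketch of the simplest such distribution algorithm, round robin, here is a minimal Java illustration (the class and server names are hypothetical; real proxies such as HAProxy or nginx do this at the network level):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical round-robin balancer: each call to next() returns the
// following server in the list, wrapping around at the end, so load is
// spread evenly across the cluster.
public class RoundRobinProxy {
    private final List<String> servers;
    private final AtomicInteger cursor = new AtomicInteger(0);

    public RoundRobinProxy(List<String> servers) {
        this.servers = servers;
    }

    // Pick the backend that should receive the next request.
    public String next() {
        int i = Math.floorMod(cursor.getAndIncrement(), servers.size());
        return servers.get(i);
    }

    public static void main(String[] args) {
        RoundRobinProxy proxy = new RoundRobinProxy(
                List.of("app-1", "app-2", "app-3"));
        for (int n = 0; n < 4; n++) {
            System.out.println(proxy.next()); // app-1, app-2, app-3, app-1
        }
    }
}
```

Availability follows naturally: a real balancer would also health-check the backends and skip any server that has stopped responding.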

Using this mechanism you can achieve more than just fail-over handling or load distribution. Proxies can also act as a caching layer for the non-varying data being presented to requesting clients, which, when done properly on a global scale, can really boost application performance, or at least give the illusion of faster response times. Proxies can also be used to handle regionalization, which is something most content delivery network products do, but on a much larger scale, using HTTP caching mechanisms. Again, that is something out of the scope of this article.
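Here is an in-memory sketch of that caching idea (hypothetical names; real proxies add expiry and honour HTTP cache headers, which this sketch does not):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative caching proxy: a response for a previously seen path is
// served from an in-memory map instead of hitting the backend again.
public class CachingProxy {
    private final Function<String, String> backend;
    private final Map<String, String> cache = new HashMap<>();
    private int backendHits = 0;

    public CachingProxy(Function<String, String> backend) {
        this.backend = backend;
    }

    // Serve from cache when possible; otherwise fetch and remember.
    public String get(String path) {
        return cache.computeIfAbsent(path, p -> {
            backendHits++;
            return backend.apply(p);
        });
    }

    public int backendHits() {
        return backendHits;
    }

    public static void main(String[] args) {
        CachingProxy proxy = new CachingProxy(path -> "content of " + path);
        proxy.get("/home");
        proxy.get("/home");
        System.out.println(proxy.backendHits()); // prints 1: second request came from cache
    }
}
```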

This proxy mechanism (read: load balancing) can be set up at the database level too, which, done properly, helps in achieving both availability and scalability. You can set up the database in master-slave, multi-master, multi-master-multi-slave or buddy replication modes to get the maximum throughput with the least load.
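A minimal sketch, assuming a single master and a list of slave replicas, of how such database-level routing might look (all names are hypothetical):

```java
import java.util.List;

// Illustrative master-slave router: writes go to the master, reads are
// spread across the slave replicas round-robin. A real setup would also
// handle replication lag and replica failure.
public class ReadWriteRouter {
    private final String master;
    private final List<String> slaves;
    private int next = 0;

    public ReadWriteRouter(String master, List<String> slaves) {
        this.master = master;
        this.slaves = slaves;
    }

    // All writes must go to the single source of truth.
    public String routeWrite() {
        return master;
    }

    // Reads rotate through the replicas to spread the load.
    public String routeRead() {
        String s = slaves.get(next % slaves.size());
        next++;
        return s;
    }

    public static void main(String[] args) {
        ReadWriteRouter router = new ReadWriteRouter(
                "master-db", List.of("slave-1", "slave-2"));
        System.out.println(router.routeWrite()); // prints master-db
        System.out.println(router.routeRead());  // prints slave-1
        System.out.println(router.routeRead());  // prints slave-2
    }
}
```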