How to Monitor VMware Infrastructure for Free

If your IaaS backbone is formed by bare-metal machines with the industry leading VMware hypervisor, you’ve probably made this choice for a good reason – efficiency. When talking scale with massive I/O, CPU and/or memory loads, the public Cloud/SaaS offerings such as AWS Lambda are still far away from the classic bare-metal model, which yields much higher performance in a controlled and predictable environment. ESXi (product name – VMware vSphere Hypervisor), VMware’s bare-metal hypervisor, is the industry’s standard hypervisor with roughly 70%+ of the market share in this segment because of its robustness and performance. With a footprint of just 150MB it can support up to 128 vCPUs, 6TB or RAM and all kinds of OSs on top of it.

VMware vSphere is the enterprise product that adds a Virtual Center that automates the management of ESXi hypervisors and adds enterprise class features for mission-critical applications, such as the famous vMotion (ability to move a VM between ESXi hosts without downtime), the high-availability, proactive monitoring or disaster recovery features. These features, however, come at a certain price (currently at 6.5k USD for a single vSphere license, according to VMware’s website). On the other hand the ESXi hypervisor is offered for free. In case your software architecture employs stateless design, e.g. through Dockers/Kubernetes, then you can leverage the agility and robustness of stateless architectures and cut down some of the costly vSphere high availability features, such as shadow VM, vMotion, etc…. For example, in case your monitoring system shows that a container is bogged down, just kill it and start another node. From IaaS perspective this greatly simplifies your IaaS and reduces costs by just employing a number of good old and free ESXi-s…given that you have a good monitoring layer laid out.

In this post, we will review how to easily add free, yet enterprise-class monitoring layer for your light-weight VMware virtualisation tier.

What do We Mean by Syslog?

Let’s first get some terminology in place. The standard way to monitor UX systems is syslog. Syslog is an overloaded with meaning term. When people talk syslog they could mean the syslog daemon, called syslogd, that all UX systems have to collect and dump logs (usually in the /var/log directory). They could also mean the standard kernel functions for logging, called syslog. Less frequently they mean the syslog protocol (e.g. RFC-5424) for relaying logs over the network. There are also log relayers, which purpose is to receive, process and retransmit log messages; and there analysers that understand the syslog protocol, can collect large amounts of logs, extract, systemise and analyse them, even with AI predictive analytics which can help you discover failures before they occur.

Here is a simplistic representation of this ecosystem:

Exactly this kind of chain is used in the enterprise VMware vSphere monitoring feature. So our goal is to achieve a similar stack with free tools.

Choosing Your Syslog

Regardless of the transportation mean (UX domain socket for local or TCP/UDP socket for remote), the RFC-5424 syslog protocol should be observed or major loss of information could happen. Sadly in the pleiad of syslog client and server libraries, a very small number are strictly RFC-5424-compliant by default. If you don’t pick the right combination there is significant configuration and matching hustle associated.

Another important aspect is the performance – or how efficiently the local or remote syslogd can collect and relay messages. Among all contenders, syslog-ng seems to be a popular choice. However, when I led the vSphere monitoring team, we evaluated a number of syslogd libraries carefully and the open-source small and light-weight rsyslog showed much greater (40%+) efficiency, when compared to all others.

Albeit not that rich and popular as syslog-ng, Rsyslog also has all the needed features for enterprise class logging, such as throttling, flood protection, queue de/serialisation configuration for rapid-reliable log queueing, custom extensible log rotation, easily configurable and extensible log processing, etc. As a matter of fact, we were so impressed by its overall scoring that we selected it as the VMware vSphere syslogd, playing a key part in the enterprise VMware vSphere Monitoring layer. All that we need to do in order to have that “enterprise” syslogd is install a Linux flavour OS. If by some chance rsyslog is not present, install the rsyslog package as well, using your favourite package-installation tool. Here is how to do it for Ubuntu and Redhat.

Setup and Configuration

The next step is configuring rsyslogd to collect, process and store messages. This could be an altogether separate article, but luckily rsyslog has good documentation, along with example configurations for the most common case. Log filters and actions are the basic terminology. A filter specifies what messages to get while the action – what to do with them, (e.g. store them in a file). Here is a basic configuration example to get started with the log-storage and retention which, although utterly important, are not the focus of this post.

Now that we have set up a robust syslogd for collecting and relaying our logs, let’s configure the ESXi to send their logs to our remote syslogd (rsyslog). Here is the VMware article on the matter.

Finally, you may want to also relay all collected logs to some log-analysis tool or database that has a decent way to visualise the health state of your entire IaaS thanks to the logs you fed in and even predict failures based on integrated ML algorithms. Such tools are VMware’s LogInsight, the free plan of (limitations apply) Splunk, etc. In order to do that first you’ll need to add a line in the rsyslog configuration instructing it to not only store the received logs but also relay them to the remote log analyser. Here is an example syslog.conf that relays all messages to a remote syslogd or log analyser.

This is it. Now you have added a very capable, yet low-cost, monitoring layer to your infrastructure.

— About the author: Doichin Yordanov is R&D engineer with years experience in scalable enterprise infrastructure software. He has been technical lead of the VMware vSphere Log Collection and Monitoring.

Curation Service Build Path

Curation Project Summary

A leading financial publisher needed a way to measure the performance of an AI-driven semantic document classification and information extraction pipeline from large-volume news feed.

The service was developed using agile methodology over a period of one year and six months. It was delivered using a cutting edge technology stack: Java/Spring, PostgreSQL, MongoDB, ElasticSearch, REST API for the back-end and Angular4 as a web app for front-end.

The engineering infrastructure was built with Jenkins and Dockers, and deployed in the AWS Cloud.

The development was successful delivered on time and on budget.

How We Delivered the Service

Here we will list key points of building the service, what went very well, what went wrong and the lessons learned.

We developed 2 versions of the service. The first version was completed in six months, while the second one took us ten months.

Bellow you can see a brief list of the high-level requirements we followed for the development for Version 1:

  • The service provides storage of documents in JSON-LD format along with their annotations;
  • 3 user roles are supported:
    • Administrators – manage the system;
    • Curators – permission to curate documents;
    • Supervisors – permission to resolve conflicts in curated documents.
  • The following document annotation modifications are allowed:
    • Users should be able to accept or reject existing general annotations for documents;
    • Users should be able to modify the properties of annotations of relational type;
    • Through the use of Search API, Users should be able to add new annotations by searching for new concepts mentioned in an external GraphDB.
  • The service provides several statistics and reports on how well the Curators and Supervisors did their job over a set period of time;
  • The service automatically exports the curated documents for processing to other external systems.

Additionally, our client wanted to reuse some of the secure JWT libraries they already used in their infrastructure, which were built on plain old spring. This meant we were asked not to use Spring Boot for the new development.

To start with, we used two databases to store the necessary data; PostgreSQL where we stored all business logic relations, users, statistics and history logs and MongoDB where we stored the JSON documents that we needed to process.

According to the requirements, all annotations that came from an external system needed to be stored for each document and only a small part of them had to be available for curation. The format of the document and its annotations was specified as JSON-LD and consequently we decided to store that data into MongoDB.
For the business logic, after carefully analysing all the requirements, the natural choice of storage was a relational DB – PostgreSQL.

The initial annotated documents had to come into the Curation service via a queue of messages and once processed, with altered annotations, they had to move to a second, output queue. To achieve this, we chose to use RabbitMQ.

Challenge – Request Sample Data

At the start of the project, the client couldn’t provide us with sample documents and annotations for us to start working on. This was due to their desire to use a completely new format for the service – JSON-LD. The thousands of documents and millions of annotations available, were in a custom JSON format called Generic JSON.

Additionally, some of the older annotated documents were in a format called Gate, left behind by a very old and slow-functioning tool that used to have the same purpose as the new Curation tool we were tasked to build.
In the end, to start the working, the client specified how the documents and annotations had to look like in the new JSON-LD format and gave us handwritten sample data we could use.

Lessons learned:

  • If you require sample data to begin working, make sure you request that the client provide it on time.
  • Validate that the data you receive matches the original requirements and, if the format is new to the client, request they provide proper, detailed specifications.

Challenge – Wrap JSON-LD data

For the first version of the service we decided to store the JSON-LD in MongoDB in its row JSON format. This ended up being a mistake, as, in our MongoDB, we were unable to add custom information to each of the JDON-LD records, related only to the curation business logic.

In the second version of the service we needed to wrap the JDONL-LD in another JSON. This meant we could add as many properties as we wanted, without changing the client’s JSON-LD format.

Another mistake was the decision to use an ID out of the JSON-LD data as a primary key for MongoDB. Later on, when we changed how we extracted the String ID from the JDON-LD data we ended up needing to migrate all existing records.

Lessons learned:

  • Always use the MongoDB default primary key _id and add your data primary ID as unique, indexed constraints in the code.
  • Never use raw data which you cannot control as a main JSON entity in the DB; Create your own and put the raw one as a value property instead.

Challenge – Detach the code from the data format

In the beginning we decided that we should get some constants from the JSON-LD context part and use them in the backend to make our life easier. While rather obvious, this really helped us build a more compact documents and short and ready to use in URLs annotation IDs.

Unfortunately, later on, the client chose to change the JSON-LD context. This meant we had to change how we built URL friendly IDs, which required a really big refactoring and writing migration code for the existing 5k documents with over 600k annotations. The migration itself was a challenge as Mongo spring hibernate does not implement the repository Iterable as we expected – see below.

Lessons learned:

  • Carefully analyse the data a service works on and verify the parts that could and are allowed to be changed by specification at a later stage.

Challenge – Use paging to iterate the MongoDB repository

In Spring hibernate you can make the following simple interface and use it for iteration over all of the DB records:

public interface DocumentRepository extends MongoRepository<ProjectJsonldDocumentModel, String> {

Iterable findAll(Sort var1);


The problem is that Mongo implementation is trying to load all data on the memory and then return “Iterable” over it. Which, for a big data collection, always leads straight to the end: Out Of Memory exception 🙂
Instead, you should use pagination queries like this one:

Page findAll(Pageable var1);

You can easily wrap it in an Iterable implementation that executes internally, paging the next call.

Challenge – Use a DB Schema Migration Tool

In the beginning of the project we decided to use the Flyway schema migration Java tool to help us organise the different relational DB schema changes and how to migrate between them.

The tool is very helpful and allowed us to support both minor and major schema changes very easily. This lead to the overall improvement of the code architecture and helped us correct some bad schema decisions made in the early stages of the development of the service.

Here is a list of rules for building a schema and the reasons it’s important to follow them.

DB Rule 1: Always set names for the tables and columns to avoid ones auto-generated by hibernate – this will help keep the SQL queries clear to read and match your domain repository data models and naming convention.


@Table(name = "data_migration_versions")

public class DataMigrationVersion {

@Column(name = "created_on", nullable = false)
private Date createdOn;

DB Rule 2: Always use string type for enumerating columns – without this annotation the numeric value for the enum is stored by default. This is an absolute disaster when you need to read your tables directly or do an enum migration.



@Column(name = "send_status", nullable = false)
private SendStatusEnum sendStatus;

DB Rule 3: Always name your FK or unique constraints – that helps to easily find what you need to migrate or update in the schema migration scripts.


@Table(name = "document_handles",

uniqueConstraints = @UniqueConstraint(
name = "document_handles_unique_document_id_project_id",
columnNames = {"document_id", "project_id"}))
public class DocumentHandle {
@OneToOne(fetch = FetchType.EAGER)
@JoinColumn(name = "supervisor_document_handle_id",
foreignKey = @ForeignKey(name = "document_handles_fk_supervisor_document_handle_id"))
private SupervisorDocumentHandle supervisorHandle;

DB Rule 4: Always use UUID v2 as primary key – that helps manage several DB instances without the need to worry about primary key collisions.



@GeneratedValue(generator = "id")
@GenericGenerator(name = "id", strategy = "uuid2")
@Column(name = "id", unique = true, length = UUID_SIZE)
@Size(min = UUID_SIZE, max = UUID_SIZE)
private String id;

How to Configure Applications Running on Virtual Machines



Using docker containers is now a standard for deploying and managing applications. With them, transfering a particular configuration or startup parameters to an application is practically trivial. But what if your app is running on a Virtual Machine? How do you pass arguments to it, if it is running on a VM’s guest OS? There are still many apps out there that either haven’t been containerised or simply require some special permissions, etc, yet are still running on Virtual Machines. Passing startup parameters to such apps may be hard because it requires a tool with OS level access. Of course, Puppet, Chef or Ansible will handle the matter with ease, but those demand the application of specific knowledge.

In this post we outline a much simpler solution using a specific example to ensure clarity. The described solution is for AWS cloud and VMware vSphere based infrastructure.

The main steps are as follows:

  1. Provision a VM that contains your application and assign metadata (Tags) to it;
  2. Read the metadata from within the VM with a simple script;
  3. Start the app using the metadata as startup parameters/arguments.

And below you can read through a more detailed description of the proposed solution.

We’re working with the assumption that we have an application that requires a few parameters when starting up. This app has to be deployed for each of our clients and one of the startup parameters is the clientID, unique for each and every client. We have a VM template (AMI in Amazon terms) and we create a new VM (EC2) instance for each client. The question is, how to pass the startup parameter (in our case the clientID) to the app?

In AWS we are going to use Amazon Metadata Service with a combination of AWS Tags API.

Amazon Metadata Service is injected in each AWS EC2 instance. It contains diverse data, which you can query, using tools like Curl on Linux. Some of the information you can get includes the EC2 instanceID, the region where the instance is running and so on. For the full list of properties see the AWS docs here.

Tags in AWS are key/value pairs and you can map them directly to the application startup parameters. Basically, you need to create as many tags as your application startup parameters are.

We said the app is inspecting a parameter called “clientID”, so you can attach a tag with key=clientID and value= when a new EC2 instance is created. Assigning the tags is the first part of the solution, because now you have the application startup parameters attached to your instances. Next, we need to read these tags and pass them to our application as arguments.

For the second part of our solution we are going to create a simple script, which will read the instance tags and then start the application, passing them as arguments.

In the example below we’re using Bash, which depends on AWS CLI and jq library. AWS CLI requires AWS credentials in order to operate. For maximum security, the AWS credentials should only be allowed to read tags.

The script retrieves the EC2 instanceID and EC region from the metadata service like this:

Reading the region:

region=$(curl --silent | jq -r .region)

Reading the instanceID:

instanceId=$(ec2metadata --instance-id)

Reading a tag:

tagValue=$(aws ec2 describe-tags --filters "Name=resource-id,Values=${instanceId}" --region ${region} | jq -r ".Tags | .[] | select(.Key ==\”clientId\”) | .Value")

Last step is to start your app and pass the actual parameters:

/pathToMyApp –clientID ${tagValue}

This approach is also applicable for Virtual Machines running on VMware vSphere infrastructure. It is possible to provide information to the running guest OS in the form of key/value pairs using the official VMware vSphere API. One can make such API calls both against vSphere vCenter Server and VMware ESXi. In addition, bindings for this API are supported and provided by VMware for a number of different programing languages.

The examples in this post use the VMware vSphere API Python Bindings available on GitHub.

The basic assumption is that you know how to get a reference to the VM on which your app is running. You have to modify the ‘extraConfig’ property of the VM object. The property is a member of the VM’s ConfigSpec. It is of type OptionValue[]. For more information have a look at the vSphere API Reference.

myKey = 'guestinfo.clientId'
myValue = 'client identifier goes here'
cspec = pyVmomi.vim.vm.ConfigSpec()
cspec.extraConfig = [pyVmomi.vim.option.OptionValue(key=myKey, value=myValue)]

Next call ReconfigVM_Task on the VM MORef like this:


Note the key prefix ‘guestinfo.’! It is important to add this to the key, otherwise it can’t be read by the guest OS.

If this is executed on a powered off Virtual Machine, the key/value pair is saved in the VM’s .vmx file and survives restarts. If the info is added while the VM is running then the information will not be available after reboot.

So, how to read the property inside the Virtual Machine? We need ‘VMware Tools’ provided by VMware or, in the case of a Linux guest OS, the command ‘open-vm-tools’, which reads the info:

vmware-rpctool "info-get guestinfo.clientId"

All of these operations are already conveniently scripted on GitHub. Note that this script assumes ESXi/vCenterServer running on ‘localhost’ and does not use SSL verification, but is still a great starting point.

Now you have everything to extract the application parameters from a script running inside the VM. The final step is to start your app and pass the parameters.

This is how you can fully automate the deployment of applications packed as VMs.

VMware Cloud on AWS – A Bridge Between Private and Public Cloud

What are Hybrid Clouds?

In an IT world realigning around the Cloud, hybrid infrastructures or “Hybrid Clouds” are a way for companies to tap into the potential of both private and public infrastructures and get the best of both worlds – security, scalability, flexibility, and cost-effectiveness. They allow companies to share the computing workload of their data centres with public Clouds, run by a handful of big infrastructure corporations, such as Amazon (Amazon Web Services), Microsoft (Azure), Google (Google Cloud Platform), IBM (IBM Cloud) and others. Hybrid infrastructures are a cost-effective, highly available way for organisations to extend their data centres’ capacity, migrate data and applications to the Cloud and closer to customers, make use of new Cloud-native capabilities and create backups and disaster recovery solutions. A simple way to view hybrid computing is as having your company’s data reside both in the Cloud and on premise.

Moving workloads between different clouds

Effectively moving workloads between different clouds is a notoriously tedious and slow process. It involves accounting for virtual machines’ networking and storage configurations with the associated security policies, while converting them from one format to another. And this is just one of many challenges. Moving workloads from public back to the private Cloud is just as difficult considering management dependencies and proprietary APIs.

VMware and AWS

VMware software is entrenched in the data centers of enterprise customers (government and big companies) around the world. Enterprises that build and operate private clouds popularly use VMware’s cloud infrastructure software suite, with its server virtualization software practically ubiquitous. But with companies wanting to leverage the scale and capabilities of the Amazon public cloud, VMware realised the benefit of building a bridge to AWS. After the announcement of a partnership in 2016, with the VMware and AWS architectures being as different from each other as they are, it took more than a year to launch a solution and another six months (until VMworld 2018) for it to be fully globalised.

To launch VMware on Amazon’s cloud infrastructure AWS engineers had to change how they actually architect their data center – a massive effort, necessary to make sure that the Amazon Cloud infrastructure will be able to support the arguably best-in-class hypervisor, the VMware ESX hypervisor. To virtualise their infrastructure, Amazon’s traditionally used the open source Xen hypervisor, incompatible with VMware’s. Over the last year and a half, the company has been transitioning to a new custom distribution of KVM – a different open source hypervisor, with which compatibility won’t be an issue. In addition, practically everything in the AWS architecture has had to be modified as well – network, physical-server provisioning, security, etc.

VMware-AWS Hybrid Cloud

The basic premise is that the VMware-operated AWS-based service allows organisations who have on-site vSphere private infrastructures to migrate and extend them to the AWS Cloud (running on Amazon EC2 bare metal infrastructure), using the same software and methods to manage them. VMware-based workloads can now successfully be run on the AWS Cloud with applications deployed and managed across on-premise and public environments with guaranteed scalability, security and cost-effectiveness. Companies can take a hybrid approach to cloud adoption, consolidating and extending the capacities of their data centers and additionally modernising, simplifying and optimising their disaster recovery solutions.

AWS infrastructure and platform capabilities (Amazon S3, AWS Lambda, Elastic Load Balancing, Amazon Kinesis, Amazon RDS, etc.) can be natively integrated, which will allow organisations to quickly and easily innovate their enterprise applications. What they need to be mindful of, however, when selecting which capabilities to use, is that not all of them are available on the VMware stack. This could become an issue, should they ever decide to migrate their workloads from public cloud back to private.

Organisations can also simplify their Hybrid IT operations with VMware Cloud on AWS and leverage the power, speed and scale of the AWS cloud infrastructure to enhance their competitiveness and pace of innovation. They can use the same VMware Cloud Foundation technologies (NSX, vSAN, vSphere, vCenter Server) on the AWS Cloud without any purchase of new custom hardware, modification of operating models or applications being necessary. Workload portability and VM compatibility is automatically provisioned.

All AWS’ services, including databases, analytics, IoT, compute, security, deployments, mobile, application services, etc. can be leveraged with VMware Cloud on AWS with a promise of secure, predictable, low-latency connectivity.

Spring Transactional Proxies

Here we will discuss the Spring wonderland of proxies and database transactions.

Often we need to create Spring services that work with database transactions. First we define an interface, for example let’s take a simple UsersService like this:

public interface UsersService {

  void cleanupUsers();


Then implement with the known @Transactional annotation:

public class UsersServiceImpl implements UsersService {

  public void cleanupUsers() {


The @Transactional annotation results in the above relational DB operations done in one single transaction. If even a single operation fails, all will be reverted.

Now imagine that for some reason, which happens very often, we need to split the DB operations in separate transactions. One such reason could be that in our code, parts 1 to 3, cleanup a lot of depended tables that our user participated in, while in part 4 we need to delete some non-critical stats.

Imagine that our DB operations become so intensive that our beloved relational DB server decides to punish us and starts locking too many rows and eventually fails the whole transaction. We could separate all the code in 2 transactions – critical and non-critical. We separate all sensitive tables cleanup operations from parts 1 to 3 in one transaction A and all stats cleanup operations in another transaction B.

The relational DB monster should be happy as it does not need to lock so much rows with implementation refined like that:

public class UsersServiceImpl implements UsersService {

  public void cleanupUsers() {

  private void cleanupUsersMainTables() {

  private void cleanupUsersStatsTables() {


What is worth noting in the code above is that our beloved IDE warns us that Methods annotated with ‘@Transactional’ must be overridable. To get rid of the warning we make the methods protected.

Running our beloved tests, we hit an Exception:

org.springframework.dao.InvalidDataAccessApiUsageException: No EntityManager with actual transaction available for current thread - cannot reliably process 'remove' call
    at org.springframework.orm.jpa.EntityManagerFactoryUtils.convertJpaAccessExceptionIfPossible(
    at org.test.UsersServiceImpl.cleanupUsersMainTables(
    at org.test.UsersServiceImpl.cleanupUsers(
    at com.sun.proxy.$Proxy95.cleanupUsers(Unknown Source)
    at org.test.UsersServiceImplTest.testCleanupUsers(

Meaning our tested cleanupUsers() method is not running in a transaction at all. What the hell is going on here??

The Spring transnational behavior is well documented. Our methods cleanupUsersMainTables() and cleanupUsersStatsTables() are actually running without transactions. The main reason for this is the com.sun.proxy.$Proxy95 instances that are messing with our call stacktraces.

The wisdom: In order Spring to start the transaction process it must be able to alter our methods and inject code prior and post method call. This is achieved by creating a Proxy instance class that wraps our public interface methods. However, Spring can force us to use that proxy implementation only if we use the instance that he prepared for us, which can be done only by this:


  UsersService usersService;


Then comes the question why do we have the transitional magic when the @Transnational is used over interface implemented methods, but is not honored when we call our internal protected methods?

The reason is that when we call the internal methods we are calling the implementation instance this. In order to fix our messy code we need to somehow create a Spring proxy for our internal methods. We should start using it and Spring will take care of the transaction. Or will it?

Let’s give it a try:

interface UsersServiceInternals {
  void cleanupUsersMainTables();

  void cleanupUsersStatsTables();

And now let’s force our UserServiceImpl to implement it:

public class UsersServiceImpl implements UsersService, UsersServiceInternals {

  public void cleanupUsers() {

  public void cleanupUsersMainTables() {

  public void cleanupUsersStatsTables() {


That should’ve solved the problem and we run the tests again but aaaaaaaaaaaa… same error occurred. What is going on here this time!!!!??!!!???

We found out that Spring has created a proxy class for our internal interface methods but this proxy is still not in use. Ergo, we must obtain an instance of it in order to use it, instead of using the this:

public class UsersServiceImpl implements UsersService, UsersServiceInternals {

  private ApplicationContext applicationContext;


  public void cleanupUsers() {

  public void cleanupUsersMainTables() {

  public void cleanupUsersStatsTables() {

  private UsersServiceInternals getSpringProxy() {
    return applicationContext.getBean(UsersServiceInternals.class);


Now we can start the tests again and this time voila they all pass!

This is a story how knowing the basics of the Spring proxification helped us navigate in the Spring wonderland.

Data Analytics with Zero Latency and High Precision?

Everyone is “doing analytics” these days

Data Analytics is an IT buzzword. Hundreds of paradigms and solutions: change-data-capture, ETL, ELT, ingestion, staging, OLAP, data streaming, map/reduce, stream processing, data mining,… Amazon Redshift and Lambda; Apache Kafka, Storm, Spark and Hadoop Map/Reduce; Oracle GoldenGate; VMware Continuent, … gazillion of offers. All this hype makes it easy to loose track.

The Problem

What is the problem that all these solutions aim to solve anyway?

The business needs precise and rapid answers to simple yet critical questions. That’s all to it. How it can be achieved is a longer story.

Lets draw a real live analogy: Imagine that you are a couch and your business is to train a player for incoming competition. Unfortunately your top player starts to feel sick. You immediately grab him and go to see a doctor. At the doctor’s office you shoot with concise critical question: “Will my player be fit for the competition?”. Doctor’s answer is not that short: “Well, for an accurate assessment, I will have to run several urine and blood samples, do an EKG, ultrasound, chest X-rays and maybe an MRI. Your player needs to stay at the hospital for a couple of days, and avoid exercises as they wildly vary test results rendering analysis hard. We will then correlate all the data and get back to you in a few more days.”. You stare in disbelieve: first this doc is offering me to suspend training, right before the event; secondly, answers will come too late. No way!

Classical ETL

As ridiculous as it seems, data engineers often treat customers with offers full of latencies, lack of consistency or, worse, consistency on the price of downtime.

Various tools from Pentaho PDI to Scoop + Hadoop M/R… more or less classical extract-transform-load (ETL):

  • Proprietary scripts to export operational data into a set of CSV files (Hopefully the engineer knows how to encode incremental exports).
  • Logic to import the CSVss into the ETL engine with all imposed disk IO.
  • More logic to apply the actual analytical functions.
  • More scripts to load the results again into the analytical/reporting data store.
  • The result is a complex multi-step process spanning over vast volumes. This yields latency. The moment a change in the operational store is propagated to the reports, it may be already too late.

There are more hidden perils:

On each export, the conveniently available database integrity and type checks are lost. The developer needs to manually encode them. E.g. explicitly set data types for all CSV properties; encode checks for invalid value ranges, etc. Otherwise there is a great risk of data quality issues in the reports.
Since typical ETL tools apply transformations in-memory, costly disk swapping is involved for larger data sets.
Even a single in-flight problem causes restart of the entire lengthy job.
As latent, complex and error-prone as it is, the classical ETL process often lacks consistency. If correlated data is modified concurrently during export, the CSV files may contain inconsistent “relations”, e.g. employees without department, as the missing department was added in the database after the export had finished with the “department” table, but before the “employee” table was exported. Of course you can employ consistent native database tools such as Oracle’s data-pump or redo-log mining, but to integrate that tooling in the general data-flow increases the effort and complexity.

Stream Analytics

With all the data pooring in operational stores from IoT and the 24/7 global Cloud exposure, there is increasingly vaster data volumes that are screaming to be analyzed. The industry is responding with an approach that better suits enterprise scale – data stream analytics.

In summary, changes are captured as they occur and streamed to a scalable parallel processing engine. Incoming changes are analyzed immediately through delta aware functions and stream transformations. Results are merged (delta-aggregated) into the reporting store. There are a number of stream-aware frameworks that facilitate such process: Apache Kafka, Spark and Storm for example. Combination of such tools provides low latency, yet does not guarantee consistency for high precision decisions.

Developing robust and efficient stream analytics could a be very challenging task. One needs to integrate or even implement from scratch an efficient change-data-capture solution. Postgres for example has immature log mining technology. Captured changes need to be correlated to ensure referential integrity and transactional consistency. Choose scalable yet resilient computing framework able to overcome failures during stream analysis. Glue all systems together into coherent easy to use package and figure-out how to delta-merge the results into the analytical store.

“Hibernate” for Data Analytics

Remember Hibernate? The tool that revolutionazied the engineering of persistence layers – easy to learn, massive savings on boiler plate, yet error-prone persistence code. Well, for the sake of objectivity, it also brought sometimes lack of fine-grained SQL control.

We at DataStork believe it is about time data analytics to benefit such automation… yet keep fine-grained data crunching control when needed.

Meet the DataStork way, “Hibernate” for data analytics:

Data analysts encode questions by using plain old SQL (Geeks can still use various languages to encode complex analytical functions).

  • We analyze encoded queries and deploy agents at the relevant data sources to capture the data changes.
  • Captured changes are streamed in efficient compressed form.
  • Defined questions/transformations are applied over the stream by using highly scalable and robust parallel computing framework.
  • Data operations are kept as close to the database data as possible, to avoid unnecessary disk IO and leverage existing type-info and constraints.
  • Entity relations are regarded both on the operational and analytical databases to ensure fully consistent results both transactionaly and in terms of referential integrity.
  • Data analysts can inspect and adjust each of the generated SQL scripts for fine-grained control.

DataStork automates all aspects of modern data analytics through combination of innovative EL-T and stream analytics, ensuring 0 latency and high consistency. This approach also works with legacy relational databases.

Now you would know right-away whether you are fit to win the next major competition… because at the time you walk in the “doctor’s” office all the information has already been analyzed and the needed answers are waiting for you.

You are welcome to get in touch for more details.

Technical Matrix

Experienced engineers are those that do not head-jump on the solution with hipster next-great-thing technology approach, rather carefully evaluate the unique customer needs and apply the right technology combination for optimal results. Like an experienced craftsman, each DataStork engineer masters vast palette of tools and knows which ones to pick for the job.

Get a summary of the technologies that DataStork masters: DataStork-Expertise