Curation Service Build Path

Curation Project Summary

A leading financial publisher needed a way to measure the performance of an AI-driven semantic document classification and information extraction pipeline fed by a large-volume news feed.

The service was developed using an agile methodology over a period of one year and six months. It was delivered on a cutting-edge technology stack: Java/Spring, PostgreSQL, MongoDB, Elasticsearch and a REST API for the back-end, and an Angular 4 web app for the front-end.

The engineering infrastructure was built with Jenkins and Docker, and deployed in the AWS Cloud.

The service was successfully delivered on time and on budget.

How We Delivered the Service

Here we list the key points of building the service: what went very well, what went wrong and the lessons learned.

We developed 2 versions of the service. The first version was completed in six months, while the second one took us ten months.

Below is a brief list of the high-level requirements we followed for the development of Version 1:

  • The service provides storage of documents in JSON-LD format along with their annotations;
  • 3 user roles are supported:
    • Administrators – manage the system;
    • Curators – permission to curate documents;
    • Supervisors – permission to resolve conflicts in curated documents.
  • The following document annotation modifications are allowed:
    • Users should be able to accept or reject existing general annotations for documents;
    • Users should be able to modify the properties of annotations of relational type;
    • Through the use of a Search API, Users should be able to add new annotations by searching for new concepts mentioned in an external GraphDB.
  • The service provides several statistics and reports on how well the Curators and Supervisors did their job over a set period of time;
  • The service automatically exports the curated documents for processing to other external systems.

Additionally, our client wanted to reuse some of the secure JWT libraries they already used in their infrastructure, which were built on plain old Spring. This meant we were asked not to use Spring Boot for the new development.

To start with, we used two databases to store the necessary data: PostgreSQL, where we stored all business-logic relations, users, statistics and history logs, and MongoDB, where we stored the JSON documents that we needed to process.

According to the requirements, all annotations that came from an external system needed to be stored for each document and only a small part of them had to be available for curation. The format of the documents and their annotations was specified as JSON-LD, so we decided to store that data in MongoDB.
For the business logic, after carefully analysing all the requirements, the natural choice of storage was a relational DB – PostgreSQL.

The initial annotated documents had to come into the Curation service via a queue of messages and once processed, with altered annotations, they had to move to a second, output queue. To achieve this, we chose to use RabbitMQ.
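Below is a minimal sketch of how such a consume-and-publish flow can be wired with Spring AMQP. The queue names and the DocumentStore collaborator are illustrative assumptions, not the client’s actual configuration.


import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

@Component
public class CurationQueueGateway {

    private final RabbitTemplate rabbitTemplate;
    private final DocumentStore documentStore; // hypothetical facade over the MongoDB document storage

    @Autowired
    public CurationQueueGateway(RabbitTemplate rabbitTemplate, DocumentStore documentStore) {
        this.rabbitTemplate = rabbitTemplate;
        this.documentStore = documentStore;
    }

    // Annotated documents arrive on the input queue and are stored, awaiting curation.
    @RabbitListener(queues = "curation.documents.in")
    public void onAnnotatedDocument(String jsonLdDocument) {
        documentStore.saveForCuration(jsonLdDocument);
    }

    // Once curation is complete, the document with its altered annotations is pushed to the output queue.
    public void exportCurated(String curatedJsonLdDocument) {
        rabbitTemplate.convertAndSend("curation.documents.out", curatedJsonLdDocument);
    }
}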

Challenge – Request Sample Data

At the start of the project, the client couldn’t provide us with sample documents and annotations for us to start working on. This was due to their desire to use a completely new format for the service – JSON-LD. The thousands of documents and millions of annotations available were in a custom JSON format called Generic JSON.

Additionally, some of the older annotated documents were in a format called Gate, left behind by a very old and slow-functioning tool that used to have the same purpose as the new Curation tool we were tasked to build.
In the end, to get the work started, the client specified how the documents and annotations should look in the new JSON-LD format and gave us handwritten sample data we could use.

Lessons learned:

  • If you require sample data to begin working, make sure you request that the client provide it on time.
  • Validate that the data you receive matches the original requirements and, if the format is new to the client, request they provide proper, detailed specifications.

Challenge – Wrap JSON-LD data

For the first version of the service we decided to store the JSON-LD in MongoDB in its raw JSON form. This ended up being a mistake, as we were unable to add custom information, related only to the curation business logic, to each of the JSON-LD records in MongoDB.

In the second version of the service we needed to wrap the JSON-LD in another JSON document. This meant we could add as many properties as we wanted, without changing the client’s JSON-LD format.

Another mistake was the decision to use an ID taken from the JSON-LD data as the primary key for MongoDB. Later on, when we changed how we extracted the String ID from the JSON-LD data, we ended up needing to migrate all existing records.

Lessons learned:

  • Always use MongoDB’s default primary key _id and add your data’s business ID as a unique, indexed field in the code.
  • Never use raw data which you cannot control as the main JSON entity in the DB; create your own wrapper and store the raw data as a value property instead – see the sketch below.
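Here is a minimal sketch of what such a wrapper could look like with Spring Data MongoDB; the class, field names and curation metadata are illustrative, not the actual production model.


import java.util.Date;
import org.springframework.data.annotation.Id;
import org.springframework.data.mongodb.core.index.Indexed;
import org.springframework.data.mongodb.core.mapping.Document;

@Document(collection = "curation_documents")
public class CurationDocumentWrapper {

    @Id
    private String id;                    // MongoDB's default _id, never derived from the payload

    @Indexed(unique = true)
    private String jsonLdId;              // business ID extracted from the JSON-LD, unique and indexed

    private String curationStatus;        // curation-only metadata, invisible to the client's JSON-LD format
    private Date lastCuratedOn;

    private org.bson.Document rawJsonLd;  // the untouched JSON-LD payload kept as a plain value property
}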

Challenge – Detach the code from the data format

In the beginning we decided to take some constants from the JSON-LD context and use them in the back-end to make our life easier. While rather obvious, this really helped us build more compact documents and short, URL-ready annotation IDs.

Unfortunately, later on, the client chose to change the JSON-LD context. This meant we had to change how we built the URL-friendly IDs, which required a really big refactoring and migration code for the existing 5k documents with over 600k annotations. The migration itself was a challenge, as Spring Data MongoDB does not implement the repository Iterable the way we expected – see below.

Lessons learned:

  • Carefully analyse the data a service works on and identify the parts that could be, and are allowed by the specification to be, changed at a later stage.

Challenge – Use paging to iterate the MongoDB repository

With Spring Data you can declare the following simple repository interface and use it to iterate over all of the DB records:


public interface DocumentRepository extends MongoRepository<ProjectJsonldDocumentModel, String> {

    ...
    // Inherited sorted find-all – returns every document in the collection.
    Iterable<ProjectJsonldDocumentModel> findAll(Sort sort);
}

The problem is that the Mongo implementation tries to load all of the data into memory and then return an Iterable over it, which, for a big data collection, always leads straight to the same end: an OutOfMemoryError 🙂
Instead, you should use pagination queries like this one:


Page<ProjectJsonldDocumentModel> findAll(Pageable pageable);

You can easily wrap it in an Iterable implementation that internally executes the paged queries as the iteration advances.
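Here is a minimal sketch of such a wrapper (illustrative, not the exact code from the project): it fetches one page at a time and only requests the next page when the current one is exhausted.


import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Function;
import org.springframework.data.domain.Page;
import org.springframework.data.domain.PageRequest;
import org.springframework.data.domain.Pageable;

public class PagedIterable<T> implements Iterable<T> {

    private final Function<Pageable, Page<T>> pageQuery;
    private final int pageSize;

    public PagedIterable(Function<Pageable, Page<T>> pageQuery, int pageSize) {
        this.pageQuery = pageQuery;
        this.pageSize = pageSize;
    }

    @Override
    public Iterator<T> iterator() {
        return new Iterator<T>() {
            // Older Spring Data versions use "new PageRequest(0, pageSize)" instead of PageRequest.of.
            private Page<T> page = pageQuery.apply(PageRequest.of(0, pageSize));
            private Iterator<T> current = page.iterator();

            @Override
            public boolean hasNext() {
                if (current.hasNext()) {
                    return true;
                }
                if (page.hasNext()) {
                    // Load the next page only when the current one is fully consumed.
                    page = pageQuery.apply(page.nextPageable());
                    current = page.iterator();
                    return current.hasNext();
                }
                return false;
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return current.next();
            }
        };
    }
}

Used, for example, as: for (ProjectJsonldDocumentModel doc : new PagedIterable<>(documentRepository::findAll, 500)) { ... } – memory usage stays bounded by the page size instead of the collection size.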

Challenge – Use a DB Schema Migration Tool

At the beginning of the project we decided to use Flyway, a Java schema migration tool, to help us organise the different relational DB schema changes and the migrations between them.

The tool is very helpful and allowed us to support both minor and major schema changes very easily. This led to an overall improvement of the code architecture and helped us correct some bad schema decisions made in the early stages of the development of the service.
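As an illustration (the migration name and wiring below are hypothetical, not one of the project’s actual scripts), Flyway picks up versioned SQL scripts such as V7__add_send_status_to_documents.sql from the classpath and applies the pending ones in order; running it from plain Spring code takes only a few lines.


import javax.sql.DataSource;
import org.flywaydb.core.Flyway;

public class SchemaMigrator {

    // Applies all pending migrations found under db/migration, e.g. V7__add_send_status_to_documents.sql.
    public void migrate(DataSource dataSource) {
        Flyway flyway = new Flyway();      // newer Flyway versions use Flyway.configure()...load() instead
        flyway.setDataSource(dataSource);
        flyway.migrate();
    }
}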

Here is a list of rules for building a schema and the reasons it’s important to follow them.

DB Rule 1: Always set names for tables and columns to avoid the ones auto-generated by Hibernate – this helps keep the SQL queries clear to read and matching your domain repository data models and naming convention.

Example:


@Entity
@Table(name = "data_migration_versions")
public class DataMigrationVersion {

    @Column(name = "created_on", nullable = false)
    private Date createdOn;

DB Rule 2: Always use the string type for enum columns – without this annotation the enum’s ordinal (numeric) value is stored by default. This is an absolute disaster when you need to read your tables directly or do an enum migration.

Example:


@Enumerated(EnumType.STRING)
@Column(name = "send_status", nullable = false)
private SendStatusEnum sendStatus;

DB Rule 3: Always name your FK and unique constraints – that makes it easy to find what you need to migrate or update in the schema migration scripts.

Example:


@Entity
@Table(name = "document_handles",
       uniqueConstraints = @UniqueConstraint(
           name = "document_handles_unique_document_id_project_id",
           columnNames = {"document_id", "project_id"}))
public class DocumentHandle {
    ...
    @OneToOne(fetch = FetchType.EAGER)
    @JoinColumn(name = "supervisor_document_handle_id",
                foreignKey = @ForeignKey(name = "document_handles_fk_supervisor_document_handle_id"))
    private SupervisorDocumentHandle supervisorHandle;

DB Rule 4: Always use a UUID as primary key (Hibernate’s uuid2 generator) – that helps manage several DB instances without the need to worry about primary key collisions.

Example:


@Id
@GeneratedValue(generator = "id")
@GenericGenerator(name = "id", strategy = "uuid2")
@Column(name = "id", unique = true, length = UUID_SIZE)
@Size(min = UUID_SIZE, max = UUID_SIZE)
private String id;

DataStork to Partner with HedgeServ in November

We are happy to announce that this month we’re adding yet another partner to the list of companies who have trusted our engineering expertise. HedgeServ is a top-ranked global, independent fund administrator, servicing more than $350 billion of assets across 200 client relationships. They have 14 offices in 8 different countries and have received #1 rankings in Fund Accounting, Reporting & Reporting Technology, Client Service, Investor Services, Hedge Fund Expertise and Regulatory Expertise.

DataStork specialists went in as software development consultants back in September, seeking to aid HedgeServ engineers with upgrading one of the company’s core products – software used to manage investor activity for alternative investment funds. Our team reviewed appropriate technologies and practices with the HedgeServ engineering team and validated the new software architecture and design to deliver a state-of-the-art application over a CI/CD pipeline.

Having found the input invaluable, HedgeServ sought to build a more permanent partnership. We are happy to have been given the opportunity to provide additional consultancy, tech partnership and mentoring with the challenge at hand: ensuring HedgeServ’s software is enhanced in the most effective way, ensuring its future scalability.


Another Exciting New Partnership from October

We have the pleasure to announce that from the 1st of October DataStork added Uber Technologies Inc. to its ever-growing list of partners. Uber, a global peer-to-peer ridesharing, taxi cab, food delivery, bicycle-sharing, and transportation network company (TNC) headquartered in San Francisco, California, has chosen DataStork as the first Bulgarian company to provide external expertise for its R&D Software Development office in Sofia, Bulgaria. We are proud to take on this responsibility and the thrilling challenges that come with it.

The partnership will see DataStork engineers working on a number of exciting core projects alongside Uber staff, solving complex software problems and developing the company’s backbone systems on an ongoing, long-term basis. Millions of customers turn to Uber’s services daily, and with the new Uber Eats and Uber Freight offerings the number is expected to cross the billion mark. This means that the foundation of its IT offering needs to scale to meet the demand and will require some truly innovative and inspired engineering. This is where DataStork’s specialists, with their knowledge of scalable software, come in. Some of our most senior staff members are already eagerly sinking their teeth in, getting acquainted with the client’s domain and specifics.

We’re looking forward to a productive and exciting partnership ahead!


A New DataStork Partner

We are proud to announce a new and exciting partnership! Bosch Software Innovations, a subsidiary of Robert Bosch GmbH, the world leading, multinational engineering and electronics company with German origins, will be working closely with teams of DataStork engineers. This new partnership will see us working with the existing Bosch IoT systems to build highly-customised software solutions, meeting the needs of some of Bosch’s largest and most important enterprise clients. We’re happy to have earned our new partner’s trust and look forward to a productive and long-term relationship.

The first project is expected to start in the next one to two months with some of our senior staff members already discussing requirements and opportunities. Stay tuned for exciting news on this front.


VMware Cloud on AWS – A Bridge Between Private and Public Cloud

What are Hybrid Clouds?

In an IT world realigning around the Cloud, hybrid infrastructures or “Hybrid Clouds” are a way for companies to tap into the potential of both private and public infrastructures and get the best of both worlds – security, scalability, flexibility, and cost-effectiveness. They allow companies to share the computing workload of their data centres with public Clouds, run by a handful of big infrastructure corporations, such as Amazon (Amazon Web Services), Microsoft (Azure), Google (Google Cloud Platform), IBM (IBM Cloud) and others. Hybrid infrastructures are a cost-effective, highly available way for organisations to extend their data centres’ capacity, migrate data and applications to the Cloud and closer to customers, make use of new Cloud-native capabilities and create backups and disaster recovery solutions. A simple way to view hybrid computing is as having your company’s data reside both in the Cloud and on premise.

Moving workloads between different clouds

Effectively moving workloads between different clouds is a notoriously tedious and slow process. It involves accounting for virtual machines’ networking and storage configurations with the associated security policies, while converting them from one format to another. And this is just one of many challenges. Moving workloads from public back to the private Cloud is just as difficult considering management dependencies and proprietary APIs.

VMware and AWS

VMware software is entrenched in the data centers of enterprise customers (government and big companies) around the world. Enterprises that build and operate private clouds popularly use VMware’s cloud infrastructure software suite, with its server virtualization software practically ubiquitous. But with companies wanting to leverage the scale and capabilities of the Amazon public cloud, VMware realised the benefit of building a bridge to AWS. After the announcement of a partnership in 2016, with the VMware and AWS architectures being as different from each other as they are, it took more than a year to launch a solution and another six months (until VMworld 2018) for it to be fully globalised.

To launch VMware on Amazon’s cloud infrastructure, AWS engineers had to change how they actually architect their data center – a massive effort, necessary to make sure that the Amazon Cloud infrastructure would be able to support the arguably best-in-class VMware ESX hypervisor. To virtualise its infrastructure, Amazon traditionally used the open source Xen hypervisor, which is incompatible with VMware’s. Over the last year and a half, the company has been transitioning to a new custom distribution of KVM – a different open source hypervisor with which compatibility won’t be an issue. In addition, practically everything in the AWS architecture has had to be modified as well – network, physical-server provisioning, security, etc.

VMware-AWS Hybrid Cloud

The basic premise is that the VMware-operated AWS-based service allows organisations who have on-site vSphere private infrastructures to migrate and extend them to the AWS Cloud (running on Amazon EC2 bare metal infrastructure), using the same software and methods to manage them. VMware-based workloads can now successfully be run on the AWS Cloud with applications deployed and managed across on-premise and public environments with guaranteed scalability, security and cost-effectiveness. Companies can take a hybrid approach to cloud adoption, consolidating and extending the capacities of their data centers and additionally modernising, simplifying and optimising their disaster recovery solutions.

AWS infrastructure and platform capabilities (Amazon S3, AWS Lambda, Elastic Load Balancing, Amazon Kinesis, Amazon RDS, etc.) can be natively integrated, which will allow organisations to quickly and easily innovate their enterprise applications. What they need to be mindful of, however, when selecting which capabilities to use, is that not all of them are available on the VMware stack. This could become an issue, should they ever decide to migrate their workloads from public cloud back to private.

Organisations can also simplify their Hybrid IT operations with VMware Cloud on AWS and leverage the power, speed and scale of the AWS cloud infrastructure to enhance their competitiveness and pace of innovation. They can use the same VMware Cloud Foundation technologies (NSX, vSAN, vSphere, vCenter Server) on the AWS Cloud without any purchase of new custom hardware, modification of operating models or applications being necessary. Workload portability and VM compatibility is automatically provisioned.

All AWS’ services, including databases, analytics, IoT, compute, security, deployments, mobile, application services, etc. can be leveraged with VMware Cloud on AWS with a promise of secure, predictable, low-latency connectivity.


Dear entrepreneurs, don’t outsource, Up-source!

So you’re hit by a great idea and the entrepreneur in you awakes. You circulate it around: your friends think it’s magnificent, you are encouraged further by a few VCs that find it intriguing, and an angel is even willing to minimally seed you for a minimum viable product (MVP). The plan looks good, but a few months later you are still nowhere with the implementation. There turns out to be a substantial obstacle ahead: you just can’t hire a critical mass of developers capable of bootstrapping it. How come?

The software industry is in a constant boom. Moore’s law is a good proof of this trend, with the computational resources available to software doubling year in and year out. New software paradigms arrive each year and the entire economy rushes to implement them: Virtualization >> Cloud >> BigData >> Fog and IoT >> AI… each a rich field and a universe of its own. The everything-as-a-service paradigm and commodity Cloud computing resources open infinite opportunities, and a pleiad of software startups and ideas strives to explore them. There is one thing slowing this boom down, though – the well-educated engineering force needed to support it. Given the ever-growing software production needs, we still have roughly the same number of technical universities and computer science graduates as 2, 3, even 5 years ago, not to mention experienced engineers.

How, then, is your young startup supposed to find engineers capable of turning your sophisticated innovation into state-of-the-art software? In this market you are competing with a universe of startups and tens of thousands of well-established, well-funded companies that lure engineers with social benefits, flexible working time, unlimited vacations, luxurious offices and what not.

On the other hand, as the author of the idea, you probably don’t want to just outsource your baby to a bunch of random engineers with doubtful credibility and capabilities – strangers from the other end of the world, whose culture you can hardly comprehend, with unclear processes and a lack of guarantees contrasting with the bold claims they make out of the blue.

The model of partnering with an experienced engineering company that specializes in boosting startups has its advantages, even over internal development: you get quality engineering processes, people, product and project management, with less risk in case of financial difficulties. Ultimately you get the efficiency of engineers who have years and years of experience on such projects.

Whether you have the finances but can’t assemble your team, or are not awash with surplus capital, don’t panic! There are professional engineers on the market – cohesive groups of freelancers, small or even medium software shops – that enjoy collaborating with fresh startups. Why? We can’t speak for all of them, but we at DataStork do it because it’s challenging, hence fun. Novel ideas, noble and bold, are full of technical challenges; often vague requirements, a limited budget and rapid time-to-market leave no space for trial and error, and this is where professionals shine. You just need to understand how to recognize shops capable of delivering your idea.

Ask yourself: what are the most important qualities a team should possess to implement my idea – relevant domain and technical expertise, proven proficiency and efficiency, excitement about the idea? What you probably need to avoid is: a remote-only team with a distant culture, cocky statements, cost vagueness hidden behind agile-methodology terms, and a lack of transparency about the exact team members and processes. In two words, don’t search to outsource, as that is what you will get; rather, look to “up” your startup with pros that serially boost startups, i.e. up-source.

Here are some key characteristics of up-sourcing teams.

An up-source team would be willing to understand the idea well, in a series of live meetings and Q&A sessions, before jumping on the offer and implementation. The team would then provide you with a release plan of clearly defined milestones and a firm total estimate that guarantees your precious resources are spent for maximal market impact. The up-source team would give you a timeline with frequent incremental releases (e.g. every two weeks). With such a release cycle you are in control: the risk of mis-implementation is mitigated, since the team is legally bound to supply you with fully functional, production-ready code every two weeks.

If the team is confident in its processes and quality, it won’t have a problem offering you a concise legal contract that crisply explains the processes and gives you quality and cost guarantees. Look for agreements that do not charge for work associated with bug fixing. Everybody claims high quality, so look for a measurable assessment of quality, e.g. 4, 5 or even 6 on the Six Sigma scale, depending on your budget and practical needs.

Of course, it is not always possible to articulate every functionality your software needs to have; at this early stage, you might not even have a clue about them. Hence the total estimate and cost guarantees should be taken as an understanding of the current vision, and the vision will often change with the evolution of your idea and startup. If your engineering partner is experienced, it will guide you through the inception phase while the vision is forming and compile for you a software specification with clearly defined goals and functional and non-functional requirements, described along with estimates and priorities and laid down in a work-breakdown structure (WBS) with logical milestones. The specification should be clearly separated into a functional and a technical part, where the former describes what your system needs to accomplish and the latter describes by what technical means it will be accomplished. Lack of separation is a sign of lack of maturity.

Should the vision change, follow the concisely described change-management procedure in your legal contract and work with your engineering partner to update the estimates, the specification and the release plan. Well-written requirements are like a state-of-the-art legal contract: easy to read, concise and sharp on the success/failure criteria. Such a specification, with an estimated and prioritized list of functionalities, lets you fit your effort to the available budget and pick the right balance between product richness and time-to-market. An experienced team would advise you, for an optimal release path, to choose only the critical features – just as you would do if your funds were about to deplete – while still keeping an eye on good user experience and the wow effect. As the saying goes, “if you did it perfectly, then you are late”.

Watch out in the estimation phase. A serious partner would not throw out estimates out of the blue, but will analyze and consider your idea from all angles. Well-grounded, professional estimates are made after the problem is well understood and the team speaks “your language”. Estimates certainly carry uncertainty, but for experienced architects who know how to systematize requirements and choose a suitable architecture, the uncertainty should fall within a fixed error margin not exceeding 20%, e.g. three man-months plus/minus two weeks.

An experienced engineering team will be able to provide you with references from similar projects so that you can cross-check its ability to deliver. Experience varies across domains, but generally, if your idea requires state-of-the-art engineering and technologies, look for an average team-member experience of at least 10 years. For example, BigData engineering requires 5 years of hands-on work with classical databases and 5 years with NoSQL to start with; less than that and you will get a sub-optimal or over-designed solution.

An up-source team would specifically turn your attention to the post-MVP period and take care of how you transition to your own internal development. It would offer a smooth transfer-of-knowledge plan with enough education and on-site mentoring, and even help in forming your internal team when you are financially ready. Such a team would explain that writing software in a state-of-the-art modular fashion, following best practices for clean, self-documented code, and providing frequent incremental releases with production-ready end-to-end functionality is the key to successfully boosting your idea into orbit, and that it eventually ensures an easy transition to the next phase of internal team development. Up-sourcing works the same way a rocket booster smoothly hands power over to the internal engine once low orbit is reached.

An experienced engineering partner would also take care of the technical side of your intellectual property in a systematic way. It will turn your attention to reasonably priced patent options, ordered into a long-term IP protection strategy that evolves both in strength and in cost as your company progresses.

If such teams are experienced in bootstrapping startups, they should be efficient enough that their service comes at a reasonable price for a fresh startup like yours. Avoid cosmic rates, as they signal a lack of expertise and efficiency. Avoid ridiculously low rates as well, as they usually signal a straightforward scam, junk software or at least compromises in key aspects of it.

We touched on some problems of startup sourcing and on the model of startup up-sourcing/boosting, and went through a handful of key characteristics of good startup up-sourcing shops. It takes years and years of experience, though, to tell ripe apples from rotten ones in the software services industry. Hopefully you now have a few more hints for your checklist.

To sum it up: an experienced team of engineers keen to boost your idea is out there – take your time to discover it. When you find it, you will feel it. It should feel like a long-term partnership that speaks your “language”, is passionate and cares about your idea, yet acts in a cool-headed, systematic way and backs its claims with guarantees. Stay away from murky waters you can’t visit in person at least twice a month, and especially stay away from hot-shots throwing around promises of one-shot software miracles.

Don’t outsource, up-source! Good luck!


Powerful avoid-that-employer sign?

There are so many formal characteristics to consider when choosing a new employer: the development opportunities, the team, the boss, the processes, the location, remote work, growth plans and stability… oh yes, and the financial part of course, but this write-up is not for those who are ready to sacrifice their dream job for a larger check.

There are also soft characteristics to watch for: are the interviewers smiling, are they relaxed or stressed, is the interview flow more like a Q&A or a friendly discussion, are you judged by a computer or by a human, are you presented to your potential team or are there layers of people between the interviewers and your actual team?

These are all important characteristics, but this write-up will focus on one relatively new sign which, although subtle, is a composite indicator quite strongly related to almost all company characteristics: the advance notice period. Yes, that’s right – the period stating how many months in advance you are obliged to notify your company of your departure. This little clause, occupying an almost unnoticeable place in your draft contract, can tell you a great deal about the company.

We are living in a time of a scarce, educated workforce that is almost always in short supply. That is why, employers explain, it is necessary to raise the norm from a one-month notice period to two or more months, so that the employer can find a suitable replacement for the departing person. Sounds logical? At first glance, yes – but let’s argue about it.

Imagine a company that knows how to build teams and cares about its people. The company knows exactly what John’s role and contribution to the overall success are: his professional qualities, his strong and weak sides, where he fits best, how he feels today and what his professional desires are.

Such a company would definitely notice when John starts to feel unhappy and act long before he starts thinking of leaving, e.g. try to offer another opportunity that better matches John’s expectations.

Such a company would structure its teams so that there are no single points of failure. If tomorrow John decides to take parental leave – no problem. So what’s the point of forcing a person to literally wait for two or three months before he is able to pursue his new dream?

Such a company, be it even a large enterprise, would not measure its success by the headcount of this or that office. Companies that do usually argue that if John leaves, his headcount would be closed and the office would be forced to report a lower headcount compared to other company offices. Do you want to work in the luxurious office of a prestigious company if you are treated as a headcount – if the office’s success is measured in headcounts, not in useful, innovative products?

Lastly, when you see a longer advance notice period, check the attrition rates. There is a chance that the company is just foolishly trying to lower its abnormal attrition. Question any number above 5% for large companies and 20% for startups.

I don’t understand the rising trend of longer advance notice periods. To me, this does not fix the core issues or address why people are leaving; it just tries to cover it all up in a naive way. On the other hand, there are companies that care about their employees and treat them with the utmost respect, as people who are an integral part of the success. Such companies usually won’t hold you even for a day if you feel unhappy.

So next time you search for your next great adventure, you may also want to consider the notice period.


Data Analytics with Zero Latency and High Precision?

Everyone is “doing analytics” these days

Data Analytics is an IT buzzword. Hundreds of paradigms and solutions: change-data-capture, ETL, ELT, ingestion, staging, OLAP, data streaming, map/reduce, stream processing, data mining… Amazon Redshift and Lambda; Apache Kafka, Storm, Spark and Hadoop Map/Reduce; Oracle GoldenGate; VMware Continuent… a gazillion offers. All this hype makes it easy to lose track.

The Problem

What is the problem that all these solutions aim to solve anyway?

The business needs precise and rapid answers to simple yet critical questions. That’s all there is to it. How it can be achieved is a longer story.

Let’s draw a real-life analogy: imagine that you are a coach and your business is to train a player for an upcoming competition. Unfortunately, your top player starts to feel sick. You immediately grab him and go to see a doctor. At the doctor’s office you fire off a concise, critical question: “Will my player be fit for the competition?”. The doctor’s answer is not that short: “Well, for an accurate assessment I will have to run several urine and blood tests, do an EKG, an ultrasound, chest X-rays and maybe an MRI. Your player needs to stay at the hospital for a couple of days and avoid exercise, as it wildly varies the test results, making the analysis hard. We will then correlate all the data and get back to you in a few more days.”. You stare in disbelief: first, this doc is proposing to suspend training right before the event; second, the answers will come too late. No way!

Classical ETL

As ridiculous as it seems, data engineers often treat customers to offers full of latency, lack of consistency or, worse, consistency at the price of downtime.

Various tools, from Pentaho PDI to Sqoop + Hadoop M/R, implement a more or less classical extract-transform-load (ETL) flow:

  • Proprietary scripts to export operational data into a set of CSV files (hopefully the engineer knows how to encode incremental exports).
  • Logic to import the CSVs into the ETL engine, with all the imposed disk IO.
  • More logic to apply the actual analytical functions.
  • More scripts to load the results into the analytical/reporting data store.
  • The result is a complex multi-step process spanning vast volumes. This yields latency: by the time a change in the operational store is propagated to the reports, it may already be too late.

There are more hidden perils:

  • On each export, the conveniently available database integrity and type checks are lost. The developer needs to encode them manually, e.g. explicitly set data types for all CSV properties, encode checks for invalid value ranges, etc. Otherwise there is a great risk of data quality issues in the reports.
  • Since typical ETL tools apply transformations in-memory, costly disk swapping is involved for larger data sets.
  • Even a single in-flight problem causes a restart of the entire lengthy job.
  • As latent, complex and error-prone as it is, the classical ETL process often also lacks consistency. If correlated data is modified concurrently during export, the CSV files may contain inconsistent “relations”, e.g. employees without a department, because the missing department was added to the database after the export had finished with the “department” table but before the “employee” table was exported. Of course, you can employ consistent native database tools such as Oracle’s Data Pump or redo-log mining, but integrating that tooling into the general data flow increases the effort and complexity.

Stream Analytics

With all the data pouring into operational stores from IoT and 24/7 global Cloud exposure, ever-vaster data volumes are screaming to be analyzed. The industry is responding with an approach that better suits enterprise scale – data stream analytics.

In summary, changes are captured as they occur and streamed to a scalable parallel processing engine. Incoming changes are analyzed immediately through delta-aware functions and stream transformations, and the results are merged (delta-aggregated) into the reporting store. There are a number of stream-aware frameworks that facilitate such a process – Apache Kafka, Spark and Storm, for example. A combination of such tools provides low latency, yet does not guarantee the consistency needed for high-precision decisions.
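As a minimal illustration of the idea (the topic names, keys and aggregation below are assumptions, not part of any particular customer pipeline), a small Kafka Streams application can consume captured changes, delta-aggregate them and continuously publish the running results to a reporting topic:


import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KTable;

public class OrderTotalsStream {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-totals");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.Long().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Captured changes: key = customer ID, value = order amount delta.
        KTable<String, Long> totals = builder.<String, Long>stream("order-changes")
                .groupByKey()
                .reduce(Long::sum);                 // delta-aggregate each change as it arrives

        // Merge the continuously updated totals into the reporting topic.
        totals.toStream().to("order-totals-report");

        new KafkaStreams(builder.build(), props).start();
    }
}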

Developing robust and efficient stream analytics can be a very challenging task. One needs to integrate, or even implement from scratch, an efficient change-data-capture solution (Postgres, for example, has immature log-mining technology). Captured changes need to be correlated to ensure referential integrity and transactional consistency. You have to choose a scalable yet resilient computing framework able to overcome failures during stream analysis, glue all the systems together into a coherent, easy-to-use package, and figure out how to delta-merge the results into the analytical store.

“Hibernate” for Data Analytics

Remember Hibernate? The tool that revolutionized the engineering of persistence layers – easy to learn, with massive savings on boilerplate, error-prone persistence code. Well, for the sake of objectivity, it also sometimes brought a lack of fine-grained SQL control.

We at DataStork believe it is about time data analytics benefited from such automation… while keeping fine-grained data-crunching control when needed.

Meet the DataStork way, “Hibernate” for data analytics:

Data analysts encode questions by using plain old SQL (Geeks can still use various languages to encode complex analytical functions).

  • We analyze encoded queries and deploy agents at the relevant data sources to capture the data changes.
  • Captured changes are streamed in efficient compressed form.
  • Defined questions/transformations are applied over the stream by using highly scalable and robust parallel computing framework.
  • Data operations are kept as close to the database data as possible, to avoid unnecessary disk IO and leverage existing type-info and constraints.
  • Entity relations are respected on both the operational and the analytical databases to ensure fully consistent results, both transactionally and in terms of referential integrity.
  • Data analysts can inspect and adjust each of the generated SQL scripts for fine-grained control.

DataStork automates all aspects of modern data analytics through a combination of innovative EL-T and stream analytics, ensuring zero latency and high consistency. This approach also works with legacy relational databases.

Now you would know right away whether your player is fit to win the next major competition… because by the time you walk into the “doctor’s” office, all the information has already been analyzed and the needed answers are waiting for you.

You are welcome to get in touch for more details.


Technical Matrix

Experienced engineers are those who do not jump head-first at a solution with a hipster, next-great-thing technology approach, but rather carefully evaluate the customer’s unique needs and apply the right combination of technologies for optimal results. Like an experienced craftsman, each DataStork engineer masters a vast palette of tools and knows which ones to pick for the job.

Get a summary of the technologies that DataStork masters: DataStork-Expertise