Curation Service Build Path

Curation Project Summary

A leading financial publisher needed a way to measure the performance of an AI-driven pipeline for semantic document classification and information extraction from a large-volume news feed.

The service was developed using an agile methodology over a period of one year and six months. It was delivered using a cutting-edge technology stack: Java/Spring, PostgreSQL, MongoDB, ElasticSearch and a REST API for the back-end, and an Angular 4 web app for the front-end.

The engineering infrastructure was built with Jenkins and Docker, and deployed in the AWS Cloud.

The service was successfully delivered on time and on budget.

How We Delivered the Service

Here we list the key points of building the service: what went well, what went wrong and the lessons learned.

We developed two versions of the service. The first version was completed in six months, while the second one took us ten months.

Below is a brief list of the high-level requirements we followed during the development of Version 1:

  • The service provides storage of documents in JSON-LD format along with their annotations;
  • 3 user roles are supported:
    • Administrators – manage the system;
    • Curators – permission to curate documents;
    • Supervisors – permission to resolve conflicts in curated documents.
  • The following document annotation modifications are allowed:
    • Users should be able to accept or reject existing general annotations for documents;
    • Users should be able to modify the properties of annotations of relational type;
    • Through the use of Search API, Users should be able to add new annotations by searching for new concepts mentioned in an external GraphDB.
  • The service provides several statistics and reports on how well the Curators and Supervisors did their job over a set period of time;
  • The service automatically exports the curated documents for processing to other external systems.

Additionally, our client wanted to reuse some of the secure JWT libraries they already used in their infrastructure, which were built on plain Spring. This meant we were asked not to use Spring Boot for the new development.

To start with, we used two databases to store the necessary data: PostgreSQL, where we stored all business logic relations, users, statistics and history logs, and MongoDB, where we stored the JSON documents that we needed to process.

According to the requirements, all annotations that came from an external system needed to be stored for each document, and only a small part of them had to be available for curation. The format of the document and its annotations was specified as JSON-LD, and consequently we decided to store that data in MongoDB.

For the business logic, after carefully analysing all the requirements, the natural choice of storage was a relational DB: PostgreSQL.

The initial annotated documents had to come into the Curation service via a queue of messages and once processed, with altered annotations, they had to move to a second, output queue. To achieve this, we chose to use RabbitMQ.

Challenge – Request Sample Data

At the start of the project, the client couldn’t provide us with sample documents and annotations to start working on. This was due to their desire to use a completely new format for the service: JSON-LD. The thousands of documents and millions of annotations available were in a custom JSON format called Generic JSON.

Additionally, some of the older annotated documents were in a format called Gate, left behind by a very old and slow tool that used to serve the same purpose as the new Curation tool we were tasked to build.
In the end, so that we could start working, the client specified what the documents and annotations had to look like in the new JSON-LD format and gave us handwritten sample data we could use.

Lessons learned:

  • If you require sample data to begin working, make sure you request that the client provide it on time.
  • Validate that the data you receive matches the original requirements and, if the format is new to the client, request they provide proper, detailed specifications.

Challenge – Wrap JSON-LD data

For the first version of the service we decided to store the JSON-LD in MongoDB in its raw JSON format. This ended up being a mistake: we were unable to add custom information, related only to the curation business logic, to each of the JSON-LD records in our MongoDB.

In the second version of the service we wrapped the JSON-LD in another JSON document. This meant we could add as many properties as we wanted without changing the client’s JSON-LD format.

Another mistake was the decision to use an ID taken from the JSON-LD data as the primary key in MongoDB. Later on, when we changed how we extracted the String ID from the JSON-LD data, we ended up needing to migrate all existing records.

Lessons learned:

  • Always use the MongoDB default primary key _id and keep your data's own ID as a separate field with a unique index declared in the code.
  • Never use raw data which you cannot control as the main JSON entity in the DB; create your own entity and put the raw one in it as a value property instead.
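
The two lessons above can be sketched as a small wrapper class. This is only an illustration under assumptions, not the model used in the service: the class name, field names and the "status" property are hypothetical. With Spring Data MongoDB the id field would typically carry @Id and the sourceDocumentId field @Indexed(unique = true); both annotations are left out here so the sketch has no dependencies.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical envelope stored in MongoDB instead of the raw JSON-LD.
 * The client's JSON-LD stays untouched inside rawJsonLd, while everything
 * the curation logic needs lives next to it.
 */
final class CurationDocumentEnvelope {
    // Storage key; left null here because MongoDB assigns _id on insert.
    private String id;
    // ID extracted from the JSON-LD; meant to get a unique index, not to be the key.
    private final String sourceDocumentId;
    // The client's JSON-LD, kept verbatim.
    private final String rawJsonLd;
    // Curation-only properties, free to evolve without touching the JSON-LD.
    private final Map<String, Object> curationProperties = new HashMap<>();

    CurationDocumentEnvelope(String sourceDocumentId, String rawJsonLd) {
        this.sourceDocumentId = sourceDocumentId;
        this.rawJsonLd = rawJsonLd;
    }

    String getId() { return id; }
    String getSourceDocumentId() { return sourceDocumentId; }
    String getRawJsonLd() { return rawJsonLd; }
    Map<String, Object> getCurationProperties() { return curationProperties; }
}
```

If the way the source ID is extracted changes later, only the sourceDocumentId values need to be recomputed; the storage key of every record stays the same.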

Challenge – Detach the code from the data format

At the beginning we decided to take some constants from the context part of the JSON-LD and use them in the back-end to make our life easier. While rather obvious, this really helped us build more compact documents and short annotation IDs that were ready to use in URLs.

Unfortunately, later on, the client chose to change the JSON-LD context. This meant we had to change how we built URL-friendly IDs, which required a really big refactoring and writing migration code for the existing 5k documents with over 600k annotations. The migration itself was a challenge, as Spring Data MongoDB does not implement the repository Iterable the way we expected (see the paging challenge below).

Lessons learned:

  • Carefully analyse the data a service works on and identify which parts the specification allows to change at a later stage.
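
One way to limit the impact of such a change is to keep the derivation of URL-friendly IDs behind a single interface, so that a change in the data format touches one class only. The sketch below is not the approach used in the service. It swaps in a different technique, hashing the full annotation identifier, and it assumes that this identifier stays stable when the JSON-LD context changes. The names AnnotationIdCodec and HashingAnnotationIdCodec are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

/** The single place where annotation identifiers become URL-friendly IDs. */
interface AnnotationIdCodec {
    String toUrlId(String annotationIri);
}

/**
 * Hypothetical codec that hashes the full annotation identifier, so the
 * result does not depend on prefixes declared in the JSON-LD context.
 */
final class HashingAnnotationIdCodec implements AnnotationIdCodec {
    @Override
    public String toUrlId(String annotationIri) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha256.digest(annotationIri.getBytes(StandardCharsets.UTF_8));
            // URL-safe Base64 without padding uses only A-Z, a-z, 0-9, '-' and '_'.
            return Base64.getUrlEncoder().withoutPadding().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is mandatory on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }
}
```

The rest of the back-end would depend only on AnnotationIdCodec, which keeps a later format change local to one implementation plus one data migration.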

Challenge – Use paging to iterate the MongoDB repository

With Spring Data you can define the following simple repository interface and use it to iterate over all of the DB records:


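public interface DocumentRepository extends MongoRepository<ProjectJsonldDocumentModel, String> {
    ...
    // MongoRepository declares this method as returning a List, which is also an Iterable.
    List<ProjectJsonldDocumentModel> findAll(Sort sort);
}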

The problem is that the MongoDB implementation loads all the data into memory and only then returns an Iterable over it. For a big data collection this always leads straight to the same end: an OutOfMemoryError 🙂
Instead, you should use pagination queries like this one:


Page<ProjectJsonldDocumentModel> findAll(Pageable pageable);

You can easily wrap it in an Iterable implementation that internally requests the next page whenever the current one is exhausted.
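
A minimal sketch of such a wrapper follows. To keep it dependency-free, it takes a generic page fetcher instead of a Spring repository. The class name PagedIterable and the fetcher signature are assumptions made for this illustration; with Spring Data 2.x the fetcher could be something like (page, size) -> repository.findAll(PageRequest.of(page, size)).getContent().

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.BiFunction;

/** Iterates over a large collection one page at a time. */
final class PagedIterable<T> implements Iterable<T> {
    // (pageIndex, pageSize) -> content of that page; hypothetical stand-in for a repository call.
    private final BiFunction<Integer, Integer, List<T>> pageFetcher;
    private final int pageSize;

    PagedIterable(BiFunction<Integer, Integer, List<T>> pageFetcher, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        this.pageFetcher = pageFetcher;
        this.pageSize = pageSize;
    }

    @Override
    public Iterator<T> iterator() {
        return new Iterator<T>() {
            private int nextPage = 0;
            private Iterator<T> current = Collections.emptyIterator();
            private boolean exhausted = false;

            @Override
            public boolean hasNext() {
                // Fetch the next page only when the current one has been consumed.
                while (!current.hasNext() && !exhausted) {
                    List<T> page = pageFetcher.apply(nextPage++, pageSize);
                    // A short or empty page means there is nothing left to fetch.
                    exhausted = page.size() < pageSize;
                    current = page.iterator();
                }
                return current.hasNext();
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return current.next();
            }
        };
    }
}
```

Only one page is held in memory at a time. The pages should be requested in a stable sort order, and a migration that changes the sort key while iterating may skip or repeat records.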

Challenge – Use a DB Schema Migration Tool

At the beginning of the project we decided to use Flyway, a Java schema migration tool, to help us organise the different relational DB schema changes and the migrations between them.

The tool is very helpful and allowed us to support both minor and major schema changes very easily. This led to an overall improvement of the code architecture and helped us correct some bad schema decisions made in the early stages of the development of the service.

Here is a list of rules for building a schema and the reasons it’s important to follow them.

DB Rule 1: Always set names for the tables and columns explicitly to avoid the ones auto-generated by Hibernate – this keeps the SQL queries clear to read and consistent with your domain repository data models and naming convention.

Example:


@Table(name = "data_migration_versions")
public class DataMigrationVersion {
    ...
    @Column(name = "created_on", nullable = false)
    private Date createdOn;

DB Rule 2: Always store enum columns as strings – without the @Enumerated(EnumType.STRING) annotation, the numeric ordinal of the enum is stored by default. This is an absolute disaster when you need to read your tables directly or migrate an enum.

Example:


@Enumerated(EnumType.STRING)
@Column(name = "send_status", nullable = false)
private SendStatusEnum sendStatus;
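
The following plain Java sketch shows why ordinals are fragile. It involves no Hibernate, and the enum constants are hypothetical, since the actual values of SendStatusEnum are not shown here. A value written by ordinal with the first version of the enum is read back incorrectly once a constant is inserted in the middle, while the name still resolves correctly.

```java
/** First version of a hypothetical status enum. */
enum SendStatusV1 { PENDING, SENT }

/** Second version: FAILED was inserted before SENT, shifting its ordinal. */
enum SendStatusV2 { PENDING, FAILED, SENT }

final class EnumStorageDemo {
    /** Reads a value that was stored as an ordinal, using the new enum. */
    static SendStatusV2 readOrdinal(int storedOrdinal) {
        return SendStatusV2.values()[storedOrdinal];
    }

    /** Reads a value that was stored as a string, using the new enum. */
    static SendStatusV2 readName(String storedName) {
        return SendStatusV2.valueOf(storedName);
    }
}
```

Here SendStatusV1.SENT has ordinal 1, so readOrdinal(1) yields FAILED after the change, whereas readName("SENT") still yields SENT.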

DB Rule 3: Always name your FK and unique constraints – that makes it easy to find what you need to migrate or update in the schema migration scripts.

Example:


@Table(name = "document_handles",
    uniqueConstraints = @UniqueConstraint(
        name = "document_handles_unique_document_id_project_id",
        columnNames = {"document_id", "project_id"}))
public class DocumentHandle {
    ...
    @OneToOne(fetch = FetchType.EAGER)
    @JoinColumn(name = "supervisor_document_handle_id",
        foreignKey = @ForeignKey(name = "document_handles_fk_supervisor_document_handle_id"))
    private SupervisorDocumentHandle supervisorHandle;

DB Rule 4: Always use a UUID as the primary key, generated here with Hibernate's uuid2 strategy (the name of the generator, not UUID version 2) – that helps manage several DB instances without the need to worry about primary key collisions.

Example:


@Id
@GeneratedValue(generator = "id")
@GenericGenerator(name = "id", strategy = "uuid2")
@Column(name = "id", unique = true, length = UUID_SIZE)
@Size(min = UUID_SIZE, max = UUID_SIZE)
private String id;
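
The UUID_SIZE constant is not defined in the snippet above. A reasonable assumption is 36, the length of the canonical textual form of a UUID. The sketch below uses java.util.UUID as a stand-in for the Hibernate generator to show the kind of key that ends up in the column; the class and method names are hypothetical.

```java
import java.util.UUID;

final class UuidKeyDemo {
    // Assumed value: 32 hexadecimal digits plus 4 hyphens in the canonical form.
    static final int UUID_SIZE = 36;

    /** Generates a random UUID in its canonical string form. */
    static String newPrimaryKey() {
        return UUID.randomUUID().toString();
    }
}
```

Because the keys are generated without any coordination with the database, records created on different DB instances can later be merged without renumbering.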
