Spring Transactional Proxies

Here we will discuss the Spring wonderland of proxies and database transactions.

Often we need to create Spring services that work with database transactions. First we define an interface; for example, let's take a simple UsersService like this:


public interface UsersService {
    ...
    void cleanupUsers();
    ...
}

Then we implement it with the well-known @Transactional annotation:


@Service
public class UsersServiceImpl implements UsersService {
    ...
    @Transactional
    @Override
    public void cleanupUsers() {
        callDbOperationsPart1();
        callDbOperationsPart2();
        callDbOperationsPart3();
        callDbOperationsPart4();
    }
    ...
}

The @Transactional annotation makes the above relational DB operations run in a single transaction. If even one operation fails, all of them are rolled back.

Now imagine that for some reason, which happens very often, we need to split the DB operations into separate transactions. One such reason could be that parts 1 to 3 clean up a lot of dependent tables that our user participated in, while part 4 only deletes some non-critical stats.

Imagine that our DB operations become so intensive that our beloved relational DB server decides to punish us: it starts locking too many rows and eventually fails the whole transaction. We could split the code into two transactions, critical and non-critical. The cleanup of the sensitive tables from parts 1 to 3 goes into transaction A, and the stats cleanup goes into transaction B.

The relational DB monster should be happy, as it no longer needs to lock so many rows, with the implementation refined like this:


@Service
public class UsersServiceImpl implements UsersService {
    ...
    @Override
    public void cleanupUsers() {
        cleanupUsersMainTables();
        cleanupUsersStatsTables();
    }

    @Transactional
    private void cleanupUsersMainTables() {
        callDbOperationsPart1();
        callDbOperationsPart2();
        callDbOperationsPart3();
    }

    @Transactional
    private void cleanupUsersStatsTables() {
        callDbOperationsPart4();
    }
    ...
}

What is worth noting about the code above is that our beloved IDE warns us that “Methods annotated with ‘@Transactional’ must be overridable”. To get rid of the warning we make the methods protected.
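
For completeness, the relaxed version looks roughly like this (same class as above, only the visibility modifier changes):


    @Transactional
    protected void cleanupUsersMainTables() {
        callDbOperationsPart1();
        callDbOperationsPart2();
        callDbOperationsPart3();
    }

    @Transactional
    protected void cleanupUsersStatsTables() {
        callDbOperationsPart4();
    }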

Running our beloved tests, we hit an Exception:


org.springframework.dao.InvalidDataAccessApiUsageException: No EntityManager with actual transaction available for current thread - cannot reliably process 'remove' call
at org.springframework.orm.jpa.EntityManagerFactoryUtils.convertJpaAccessExceptionIfPossible(EntityManagerFactoryUtils.java:413)
...
at org.test.UsersServiceImpl.cleanupUsersMainTables(UsersServiceImpl.java:239)
at org.test.UsersServiceImpl.cleanupUsers(UsersServiceImpl.java:226)
...
at com.sun.proxy.$Proxy95.cleanupUsers(Unknown Source)
at org.test.UsersServiceImplTest.testCleanupUsers(UsersServiceImplTest.java:237)
...

This means our tested cleanupUsers() method is not running in a transaction at all. What the hell is going on here??

The Spring transactional behavior is well documented. Our methods cleanupUsersMainTables() and cleanupUsersStatsTables() are indeed running without transactions. The main reason for this is the com.sun.proxy.$Proxy95 instance that shows up in our call stack trace.

The wisdom: in order for Spring to start a transaction, it must be able to intercept our methods and inject code before and after each call. It achieves this by creating a proxy class that wraps our public interface methods. However, Spring can make us go through that proxy only if we use the instance it prepared for us, which is obtained like this:


...
@Autowired
UsersService usersService;
...
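
To make the mechanism less magical, here is a rough conceptual sketch of what such a proxy does. This is not Spring's actual implementation; the TxManager interface and the wrap() helper are invented purely for illustration:


import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

public class TransactionalProxyDemo {

    // Stands in for Spring's transaction manager; purely illustrative.
    interface TxManager {
        void begin();
        void commit();
        void rollback();
    }

    // Wraps any interface implementation so that every call runs between begin() and commit().
    @SuppressWarnings("unchecked")
    static <T> T wrap(T target, Class<T> iface, TxManager tx) {
        InvocationHandler handler = (Object proxy, Method method, Object[] args) -> {
            tx.begin();                          // code injected before our method
            try {
                Object result = method.invoke(target, args);
                tx.commit();                     // code injected after our method
                return result;
            } catch (Exception e) {
                tx.rollback();                   // revert everything if the method fails
                throw e;
            }
        };
        return (T) Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[] { iface }, handler);
    }
}


Very loosely, something like wrap(new UsersServiceImpl(), UsersService.class, txManager) is what the container hands us when it populates the @Autowired field above, which is why calls that go through that reference get transactional behavior.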

Then comes the question: why does the transactional magic work when @Transactional is placed on interface-implemented methods, but is not honored when we call our internal protected methods?

The reason is that when we call the internal methods, we are calling them directly on this, the implementation instance, bypassing the proxy. To fix our messy code we need to somehow create a Spring proxy for our internal methods and start using it, so that Spring can take care of the transactions. Or will it?

Let’s give it a try:


interface UsersServiceInternals {
    void cleanupUsersMainTables();
    void cleanupUsersStatsTables();
}

And now let's make our UsersServiceImpl implement it:


@Service
public class UsersServiceImpl implements UsersService, UsersServiceInternals {
    ...
    @Override
    public void cleanupUsers() {
        cleanupUsersMainTables();
        cleanupUsersStatsTables();
    }

    @Transactional
    @Override
    public void cleanupUsersMainTables() {
        callDbOperationsPart1();
        callDbOperationsPart2();
        callDbOperationsPart3();
    }

    @Transactional
    @Override
    public void cleanupUsersStatsTables() {
        callDbOperationsPart4();
    }
    ...
}

That should have solved the problem, so we run the tests again, but aaaaaaaaaaaa… the same error occurs. What is going on here this time?!

It turns out Spring has created a proxy class for our internal interface methods, but that proxy is still not in use. Ergo, we must obtain an instance of it and call it instead of this:


@Service
public class UsersServiceImpl implements UsersService, UsersServiceInternals {
    ...
    @Autowired
    private ApplicationContext applicationContext;
    ...
    @Override
    public void cleanupUsers() {
        getSpringProxy().cleanupUsersMainTables();
        getSpringProxy().cleanupUsersStatsTables();
    }

    @Transactional
    @Override
    public void cleanupUsersMainTables() {
        callDbOperationsPart1();
        callDbOperationsPart2();
        callDbOperationsPart3();
    }

    @Transactional
    @Override
    public void cleanupUsersStatsTables() {
        callDbOperationsPart4();
    }

    private UsersServiceInternals getSpringProxy() {
        return applicationContext.getBean(UsersServiceInternals.class);
    }
    ...
}

Now we can run the tests again and this time, voilà, they all pass! This is a story of how knowing the basics of Spring proxification helped us navigate the Spring wonderland.
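
As a side note, a slightly tidier variant of the same trick is to let Spring inject the proxy into the bean itself instead of pulling it from the ApplicationContext. This is only a sketch and assumes a Spring version that supports self-injection (4.3 or newer):


@Service
public class UsersServiceImpl implements UsersService, UsersServiceInternals {
    ...
    @Autowired
    private UsersServiceInternals self; // Spring injects the proxy here, not 'this'

    @Override
    public void cleanupUsers() {
        self.cleanupUsersMainTables();
        self.cleanupUsersStatsTables();
    }
    ...
}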


Data Analytics with Zero Latency and High Precision?

Everyone is “doing analytics” these days

Data Analytics is an IT buzzword. There are hundreds of paradigms and solutions: change-data-capture, ETL, ELT, ingestion, staging, OLAP, data streaming, map/reduce, stream processing, data mining… Amazon Redshift and Lambda; Apache Kafka, Storm, Spark and Hadoop Map/Reduce; Oracle GoldenGate; VMware Continuent… a gazillion offers. All this hype makes it easy to lose track.

The Problem

What is the problem that all these solutions aim to solve anyway?

The business needs precise and rapid answers to simple yet critical questions. That's all there is to it. How that can be achieved is a longer story.

Let's draw a real-life analogy: imagine that you are a coach and your business is to train a player for an upcoming competition. Unfortunately, your top player starts to feel sick. You immediately grab him and go to see a doctor. At the doctor's office you shoot your concise, critical question: “Will my player be fit for the competition?”. The doctor's answer is not that short: “Well, for an accurate assessment, I will have to run several urine and blood tests, do an EKG, an ultrasound, chest X-rays and maybe an MRI. Your player needs to stay at the hospital for a couple of days and avoid exercise, as it wildly skews the test results and makes the analysis hard. We will then correlate all the data and get back to you in a few more days.”. You stare in disbelief: first, this doc is proposing to suspend training right before the event; second, the answers will come too late. No way!

Classical ETL

As ridiculous as it seems, data engineers often treat customers to offers full of latency, lacking consistency or, worse, achieving consistency at the price of downtime.

Various tools, from Pentaho PDI to Sqoop + Hadoop M/R, implement a more or less classical extract-transform-load (ETL) process:

  • Proprietary scripts export the operational data into a set of CSV files (hopefully the engineer knows how to encode incremental exports; see the sketch below).
  • Logic imports the CSV files into the ETL engine, with all the disk IO that imposes.
  • More logic applies the actual analytical functions.
  • More scripts load the results into the analytical/reporting data store.

The result is a complex multi-step process spanning vast data volumes. This yields latency: by the moment a change in the operational store is propagated to the reports, it may already be too late.
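
To make the first step concrete, here is a purely illustrative sketch of an “incremental” export to CSV over JDBC; the table, columns, connection URL and file name are made-up placeholders, not taken from any specific tool:


import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;

public class IncrementalCsvExport {

    // Exports only the rows changed since the last run (the "high-water mark").
    public static void export(Timestamp lastExportedAt) throws SQLException, IOException {
        String sql = "SELECT id, name, email FROM users WHERE updated_at > ?";
        try (Connection con = DriverManager.getConnection("jdbc:postgresql://localhost/ops", "etl", "secret");
             PreparedStatement ps = con.prepareStatement(sql);
             PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get("users_delta.csv")))) {
            ps.setTimestamp(1, lastExportedAt);
            try (ResultSet rs = ps.executeQuery()) {
                out.println("id,name,email"); // header row; database types and constraints are lost from here on
                while (rs.next()) {
                    out.printf("%d,%s,%s%n", rs.getLong("id"), rs.getString("name"), rs.getString("email"));
                }
            }
        }
        // The caller must persist the new high-water mark for the next incremental run.
    }
}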

There are more hidden perils:

  • On each export, the conveniently available database integrity and type checks are lost. The developer needs to encode them manually, e.g. explicitly set data types for all CSV columns, add checks for invalid value ranges, etc. Otherwise there is a great risk of data quality issues in the reports.
  • Since typical ETL tools apply transformations in memory, costly disk swapping is involved for larger data sets.
  • Even a single in-flight problem causes a restart of the entire lengthy job.
  • As latent, complex and error-prone as it is, the classical ETL process often also lacks consistency. If correlated data is modified concurrently during export, the CSV files may contain inconsistent “relations”, e.g. employees without a department, because the missing department was added to the database after the export had finished with the “department” table but before the “employee” table was exported. Of course you can employ consistent native database tools such as Oracle's Data Pump or redo-log mining, but integrating that tooling into the general data flow increases the effort and complexity.

Stream Analytics

With all the data pouring into operational stores from IoT and 24/7 global Cloud exposure, there are ever larger data volumes screaming to be analyzed. The industry is responding with an approach that better suits enterprise scale: data stream analytics.

In summary, changes are captured as they occur and streamed to a scalable parallel processing engine. Incoming changes are analyzed immediately through delta-aware functions and stream transformations, and the results are merged (delta-aggregated) into the reporting store. A number of stream-aware frameworks facilitate such a process, for example Apache Kafka, Spark and Storm. A combination of such tools provides low latency, yet it does not guarantee the consistency needed for high-precision decisions.
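
As a minimal sketch of the capture-then-aggregate idea (not a production pipeline), the fragment below consumes a hypothetical “user-changes” topic with the plain Kafka Java client and keeps a running per-user change count, the kind of delta that would then be merged into the reporting store. The topic name, broker address and group id are assumptions:


import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ChangeStreamAggregator {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "analytics-sketch");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Map<String, Long> changesPerUser = new HashMap<>();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("user-changes")); // hypothetical CDC topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Each record is one captured change; the key identifies the affected user.
                    changesPerUser.merge(record.key(), 1L, Long::sum);
                }
                // Here the accumulated deltas would be merged into the reporting store.
            }
        }
    }
}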

Developing robust and efficient stream analytics can be a very challenging task. One needs to integrate, or even implement from scratch, an efficient change-data-capture solution (Postgres, for example, has immature log mining technology). Captured changes need to be correlated to ensure referential integrity and transactional consistency. You have to choose a scalable yet resilient computing framework able to overcome failures during stream analysis, glue all the systems together into a coherent, easy-to-use package, and figure out how to delta-merge the results into the analytical store.

“Hibernate” for Data Analytics

Remember Hibernate? The tool that revolutionized the engineering of persistence layers: easy to learn, with massive savings on boilerplate and error-prone persistence code. Well, for the sake of objectivity, it also sometimes brought a lack of fine-grained SQL control.

We at DataStork believe it is about time data analytics benefited from such automation… while keeping fine-grained data-crunching control when needed.

Meet the DataStork way, “Hibernate” for data analytics:

Data analysts encode questions by using plain old SQL (Geeks can still use various languages to encode complex analytical functions).

  • We analyze the encoded queries and deploy agents at the relevant data sources to capture data changes.
  • Captured changes are streamed in an efficient, compressed form.
  • The defined questions/transformations are applied over the stream using a highly scalable and robust parallel computing framework.
  • Data operations are kept as close to the database data as possible, to avoid unnecessary disk IO and leverage existing type info and constraints.
  • Entity relations are respected on both the operational and the analytical databases to ensure results that are fully consistent, both transactionally and in terms of referential integrity.
  • Data analysts can inspect and adjust each of the generated SQL scripts for fine-grained control.

DataStork automates all aspects of modern data analytics through a combination of innovative EL-T and stream analytics, ensuring zero latency and high consistency. This approach also works with legacy relational databases.

Now you would know right away whether you are fit to win the next major competition… because by the time you walk into the “doctor's” office, all the information has already been analyzed and the needed answers are waiting for you.

You are welcome to get in touch for more details.