Monday, October 17, 2016

Achieving Consistency in a Microservices Architecture

Microservices are loosely coupled independently deployable services. Although a well designed service will not directly operate on shared data it may still need to ensure that that data will ultimately remain consistent. For example, the requirement to debit an account to pay for an on-line purchase creates a dependency between the customer and supplier account balances and the stock database. Historically a distributed transaction has been used to maintain this consistency which in turn will employ some flavour of distributed locking during the data update phase. This dependency introduces tight coupling, higher latencies and greater lock contention, especially when failures occur where the locks cannot be released until all services involved in the transaction become available again. Whilst the user of the system may be satisfied with this state of affairs it should not be the only possible interaction pattern. A more common approach will employ the notion of eventual consistency where the data may sometimes be in an inconsistent state but will eventually come back into the desired state: in our example the stock level will be reduced, the payment has been processed and the item delivered.

I have, from time to time, seen blogs and articles that recognise this problem and suggest solutions but they seem to mandate that either service calls naturally map on to a single data update or that the service writer picks one of the services to do the coordination taking on the responsibility of ensuring that all services involved in the interaction will eventually reach their target consistent state (see for example a quote from the article Distributed Transactions: The Icebergs of Microservices: "you have to pick one of the services to be the primary handler for the event. It will handle the original event with a single commit, and then take responsibility for asynchronously communicating the secondary effects to other services"). This sounds feasible but now you have to start thinking about how to provide the reliability guarantees in the presence of failures, how to orchestrate services, storing extra state with every persistent update so that the activity coordinator can continue the interaction after failures have been resolved. In other words, whilst this is a workable approach it hides much of the complexity involved in reliably recovering from system and network failures which at scale will surely happen. A more robust design for microservice architectures is to delegate the coordination component of the workflow to a specialised service explicitly designed for this kind of task.

We have been working in this area for many years and one set of ideas and protocols that we believe are particularly suited to microservices architectures is the use of compensatable units of work to achieve eventual consistency guarantees in this kind of loosely coupled service based environment. I produced a write up of the approach and accompanying protocol for use in REST based systems back in 2009 (Compensating RESTful Transactions) based on earlier work done by Mark Little et al. Mark also wrote some interesting blogs in 2011 (When ACID is too strong and Slightly alkaline transactions if you please ...) about alternatives to ACID when various constraints are loosened and his summary is relevant to the problems facing microservice architects.

The use of compensations, coordinated by a dedicated service, will give all the benefits suggested in Graham Lea's article referred to earlier, but with the additional guarantees of consistency, reliability, manageability, reduced complexity etc in the presence of failures. The essence of the idea is that the prepare step is skipped and instead the services involved in the interaction register compensation actions with a dedicated coordinator:

  1. The client creates a coordination resource (identified via a resource url)
  2. The client makes service invocations passing the coordinator url by some (unspecified) mechanism
  3. The service registers its compensate logic with the coordinator and performs the service request as normal
  4. When the client is done it tells the coordinator to complete or cancel the interaction
    • in the complete case the coordinator has nothing to do (except clean up actions)
    • in the cancel case the coordinator initiates the undo logic. Services are not allowed to fail this step. If they are not available or cannot compensate for the activity immediately the coordinator will keep on trying until all services have compensated (and only then will it clean up)

We do not have an implementation of this (JDI) protocol but we do have an implementation of an ACID variant of it (called RTS) which has had extensive exposure in the field (and this can/will serve as the basis for the implementation of the JDI protocol). The documentation for RTS is available at our project web site. The nice thing about this work is that it can integrate seamlessly into Java EE environments and additionally is available as a WildFly subsystem. This latter feature means that it can be packaged as a WildFly Swarm microservice using the WildFly Swarm Project Generator. In this way if your microservices are using REST for API calls then they can make immediate use of this feature.

We also have a working prototype framework for how to do compensations in a Java SE environment. The API is available at github where we also provide a number of quickstarts showing how to use it.

Finally, we have a solution where we allow the compensation data to be stored at the same time as the data updates in a single (one phase) transaction thus ensuring that the coordinator will have access to the compensation data. This technique works particularly well with document oriented databases such as MongoDB