Wednesday, May 29, 2013

Compensating Transactions: When ACID is too much (Part 2: Non-Transactional Resources)

Introduction


In part one of this series I explained why ACID transactions are not always appropriate. I also introduced compensation-based transactions as a possible alternative to ACID transactions. In this post I'll focus on situations where the application needs to coordinate multiple non-transactional resources and show how a compensation-based transaction could be used to solve this problem.

For the sake of this discussion, I'm defining a transactional resource as one that can participate in a two-phase protocol and can thus be prepared and later committed or rolled back. For example, XA-capable databases or message queues would be considered transactional resources. In contrast, a non-transactional resource is one that does not offer this facility. For example, the sending of an email or the printing of a cheque cannot easily participate in this two-phase protocol. Third-party services can also be hard to coordinate in an ACID transaction. Even though these services might be implemented with ACID transactions, they may not allow participation in any existing transaction.
A compensation-based transaction could be a good fit for these situations. The non-transactional work can be carried out in the scope of the compensation-based transaction. Provided that a compensation handler is registered, the work can later be undone, should the compensation-based transaction need to be aborted. For example, the compensation handler for sending an email could be to send a second email asking the recipient to disregard the first email. The printing of a cheque could be compensated by canceling the cheque and notifying the recipient of the cancellation.
It's also possible to coordinate transactional and non-transactional resources in a compensation-based transaction. Here the application just needs to create compensation handlers for the non-transactional resources. You could still use an ACID transaction with the last resource commit optimization (LRCO) if you only have one non-transactional resource, but this approach is not recommended if you have multiple non-transactional resources.
In a nutshell: If you find yourself needing to coordinate multiple non-transactional resources, you should consider using compensations.

Code Example

In this code example, we have a simple service that is used by an e-commerce application to sell books. As well as making updates to transactional resources, such as a database, it also needs to send an email notifying the customer that the order was made.

public class BookService {
 
    @Inject
    EmailSender emailSender;
 
    @Compensatable
    public void buyBook(String item, String emailAddress) {
 
        emailSender.notifyCustomerOfPurchase(item, emailAddress);
        //Carry out other activities, such as updating inventory and charging the customer
    }
}

The above class represents the BookService. The 'buyBook' method coordinates updates to the database and notifies the customer via an email. The 'buyBook' method is annotated with '@Compensatable'. Processing of this annotation ensures that a compensation-based transaction is running when the method is invoked. This annotation is processed similarly to the @Transactional annotation (new to JTA 1.2). The key difference is that it works with a compensation-based transaction, rather than a JTA (ACID) transaction. An uncaught RuntimeException (or a subclass of it) will cause the transaction to be canceled, and any completed work to be compensated. Again, this behavior is based on the transaction handling behavior of @Transactional in JTA 1.2.
For the sake of brevity, I have excluded the calls to update the other transactional resources. Part 3 of this series will show interoperation with JTA ACID transactions.

public class EmailSender {
 
    @Inject
    OrderData orderData;
  
    @CompensateWith(NotifyCustomerOfCancellation.class)
    public void notifyCustomerOfPurchase(String item, String emailAddress) {
 
        orderData.setEmailAddress(emailAddress);
        orderData.setItem(item);
        //send email here...
    }
}

This class carries out the work required to notify the customer. In this case it simulates the sending of an email. The method 'notifyCustomerOfPurchase' can later be compensated, should the transaction fail. This is configured through the use of the '@CompensateWith' annotation, which specifies which class to use to compensate the work done within the method. For this compensation to be possible, the handler needs access to key information about the work completed: in this case, the item ordered and the address of the customer. This data is stored in a CDI managed bean, 'orderData', which, as we will see later, is also injected into the compensation handler.

@CompensationScoped
public class OrderData {
 
    private String item;
    private String emailAddress;
    ...
}

This managed bean represents the state required by the compensation handler to undo the work. The key thing to notice here is that the bean is annotated with @CompensationScoped. This scope is very similar to the @TransactionScoped annotation (new in JTA 1.2). It ensures that the lifecycle of the bean is tied to the currently running transaction: here, the compensation-based transaction, whereas a @TransactionScoped bean is tied to the lifecycle of the JTA transaction. The @CompensationScoped bean will also be serialized to the transaction log, so that it is available should the compensation handler need to be invoked at recovery time.

public class NotifyCustomerOfCancellation implements CompensationHandler {
 
    @Inject
    OrderData orderData;
 
    @Override
    public void compensate() {
        String emailMessage = "Sorry, your order for " + orderData.getItem() + " has been cancelled";
        //send email here...
    }
}

This class implements the compensation handler. For our example it simply takes the details of the order from the injected OrderData bean and then sends an email to the customer informing them that the order failed.
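
To make the control flow concrete without a container, here is a minimal plain-Java sketch of what happens around 'buyBook'. It does not use the Narayana API: the 'Tx' class, the 'Handler' interface, the 'paymentFails' flag and the in-memory 'outbox' are all hypothetical stand-ins, and running handlers in reverse order of registration is a choice made for this sketch rather than documented framework behaviour.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for the framework; NOT the Narayana API.
class CompensationSketch {

    /** Minimal handler contract, mirroring the compensate() method shown above. */
    interface Handler {
        void compensate();
    }

    /** A toy compensation-based transaction that only tracks registered handlers. */
    static final class Tx {
        private final Deque<Handler> handlers = new ArrayDeque<>();

        void register(Handler handler) {
            handlers.push(handler); // most recently registered handler runs first
        }

        void cancel() {
            while (!handlers.isEmpty()) {
                handlers.pop().compensate();
            }
        }

        void close() {
            handlers.clear(); // the work stands, so the handlers are no longer needed
        }
    }

    /** Returns the emails "sent" while processing the order. */
    static List<String> buyBook(String item, String emailAddress, boolean paymentFails) {
        List<String> outbox = new ArrayList<>();
        Tx tx = new Tx();
        try {
            // The work happens immediately; the handler records how to undo it.
            outbox.add("To " + emailAddress + ": order placed for " + item);
            tx.register(() -> outbox.add(
                    "To " + emailAddress + ": sorry, your order for " + item + " has been cancelled"));

            if (paymentFails) {
                throw new IllegalStateException("payment declined");
            }
            tx.close();
        } catch (RuntimeException e) {
            tx.cancel(); // an uncaught RuntimeException cancels and compensates
        }
        return outbox;
    }

    public static void main(String[] args) {
        System.out.println(buyBook("Dune", "reader@example.com", false));
        System.out.println(buyBook("Dune", "reader@example.com", true));
    }
}
```

In the failing case the outbox contains both the purchase email and the cancellation email, which is the non-ACID outcome that compensation gives you: the first email is never "unsent", it is followed by a corrective action.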

Summary

In this blog post I explained why it's difficult to coordinate non-transactional resources in an ACID transaction and showed how a compensation-based transaction can be used to solve this problem.
Part 3 of this series will look at cross-domain distributed transactions. There I'll show that ACID transactions are not always a good choice for scenarios where the transaction is distributed and potentially crosses multiple business domains, and how a compensation-based transaction could be used to provide a better solution.

Saturday, May 25, 2013

2PC or 3PC

"2PC or not 2PC, that is the question". Perhaps if Shakespeare were alive today Hamlet would be asking one of the most popular questions in distributed systems for at least the past three decades. I'm not going to revisit the reasons why we use a consensus protocol in ACID transactions, because you can find enough articles on the subject if you read this blog. However, it seems that the distinction between different consensus protocols, specifically two-phase commit (2PC) and three-phase commit (3PC), is not as widely understood, nor is the reason why vendor transaction managers use 2PC. Yes, there are other consensus protocols, such as Paxos, but let's just focus on 2PC or 3PC today. Before we start looking at why 2PC is more popular, if you haven't read about the Fischer-Lynch-Paterson (FLP) result, it's worth doing so: it shows that you can't reach agreement in an asynchronous environment if even one failure is allowed, without augmenting the system with, say, failure detection mechanisms.

Strict 2PC is a blocking protocol: for instance, if the coordinator crashes during the second (commit) phase, then the participants must remain in their indeterminate state until it recovers, which could be forever if recovery never happens. Now of course using a strict 2PC protocol in the real world would immediately present some problems: as long as participants remain blocked, the resources that they represent and hold, e.g., locks, are maintained, thus possibly preventing further work on behalf of others. If failures never happen then it's not an issue, but unfortunately the 2nd law of thermodynamics will always get you in the end: entropy increases and failures do happen, no matter how improbable.

This is why heuristics were introduced as a way of allowing 2PC to operate in a more pragmatic manner: any participant that has got past the first (prepare) phase and does not receive a response from the coordinator in a timely manner can unilaterally decide to commit or abort its portion of the transaction. If it gets the decision right (same as the one the coordinator made) then everything is fine. If it gets the decision wrong, then we have a non-atomic outcome, or a heuristic outcome. Resolving this typically requires outside help, e.g., a system administrator. However, with the assumption that failures are relatively rare coupled with the fact that a failure would have to happen after the prepare phase to potentially cause a heuristic, this is not such a bad compromise. Yes, heuristics happen, but they will typically be rare. And of course let's not forget about those other optimisations we typically use with 2PC, such as presumed abort, which help to reduce performance overhead as well as allowing safe autonomous choices to be made which do not result in heuristics.
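
A small sketch may help to show where a heuristic outcome comes from. The 'Participant' class and its method names below are hypothetical and not taken from any transaction manager; the timeout is modelled as an explicit call rather than a real timer.

```java
// Illustrative only: a toy participant, not a real transaction manager API.
class HeuristicSketch {

    enum Decision { COMMIT, ROLLBACK }

    enum State { ACTIVE, PREPARED, COMMITTED, ROLLED_BACK }

    static final class Participant {
        private State state = State.ACTIVE;

        /** Phase one: the participant promises it can go either way. */
        void prepare() {
            state = State.PREPARED;
        }

        /** The coordinator has been silent for too long: decide unilaterally. */
        void onTimeout(Decision guess) {
            if (state == State.PREPARED) {
                apply(guess);
            }
        }

        /**
         * The coordinator's decision finally arrives.
         * Returns true if the outcome is atomic, false if this participant
         * already guessed differently (a heuristic outcome).
         */
        boolean onCoordinatorDecision(Decision decision) {
            if (state == State.PREPARED) {
                apply(decision);
            }
            return state == finalStateFor(decision);
        }

        State state() {
            return state;
        }

        private void apply(Decision decision) {
            state = finalStateFor(decision);
        }

        private static State finalStateFor(Decision decision) {
            return decision == Decision.COMMIT ? State.COMMITTED : State.ROLLED_BACK;
        }
    }

    public static void main(String[] args) {
        Participant lucky = new Participant();
        lucky.prepare();
        lucky.onTimeout(Decision.COMMIT);
        System.out.println("atomic: " + lucky.onCoordinatorDecision(Decision.COMMIT));

        Participant unlucky = new Participant();
        unlucky.prepare();
        unlucky.onTimeout(Decision.ROLLBACK);
        System.out.println("atomic: " + unlucky.onCoordinatorDecision(Decision.COMMIT));
    }
}
```

The second participant rolled back on its own while the coordinator decided to commit, so the mismatch can only be reported, not repaired automatically.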

But what about 3PC? You can find good descriptions of the protocol in the literature and elsewhere, so I won't go into details here. Suffice it to say that 3PC removes the blocking nature by disseminating the decision from the coordinator about whether to commit or abort the transaction amongst all of the participants. This means that if the coordinator does crash, the participants can still move forward to complete the transaction. In theory this is a good thing. However, an additional phase, which includes information about the transaction outcome and participants, introduces an overhead in every transaction and this overhead is only useful if the coordinator fails.
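
As a rough way to quantify that overhead, the sketch below counts messages under a simple textbook accounting. It assumes every phase costs one request and one reply per participant, with no optimisations such as presumed abort or read-only votes. The class and method names are mine, and real implementations will differ.

```java
// Back-of-the-envelope message counts; assumes one request and one reply
// per participant per phase, with no optimisations applied.
class CommitCostSketch {

    /** 2PC: prepare + vote, then decision + acknowledgement. */
    static int messagesFor2pc(int participants) {
        return 2 * 2 * participants;
    }

    /** 3PC: the same two rounds plus an extra pre-commit round in between. */
    static int messagesFor3pc(int participants) {
        return 3 * 2 * participants;
    }

    public static void main(String[] args) {
        for (int n : new int[] {2, 5, 10}) {
            System.out.println(n + " participants: 2PC=" + messagesFor2pc(n)
                    + " 3PC=" + messagesFor3pc(n));
        }
    }
}
```

Under these assumptions the extra round adds half as many messages again, and one more network round trip of latency, to every transaction, including those in which the coordinator never fails.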

As such, 3PC is good for environments where failures are common (more probable) and where heuristics aren't allowed. The majority of environments, users, use cases etc. are not prepared to accept this additional overhead and prefer to use 2PC, along with the other optimisations and trade-offs I mentioned earlier. We optimise for the failure-free environment, which is still the most probable situation for the majority of transaction use cases. That's why all major transaction implementations today use 2PC and not 3PC. Now that doesn't mean it will continue to be the case: if environments or scenarios change in such a way that blocking or heuristics are no longer acceptable (maybe mobile) then we may see a change in direction.

Friday, May 17, 2013

Compensating Transactions: When ACID is too much (Part 1: Introduction)

ACID transactions are a useful tool for application developers and can provide very strong guarantees, even in the presence of failures. However, ACID transactions are not always appropriate for every situation. In this series of blog posts, I'll present several such scenarios and show how an alternative non-ACID transaction model can be used.

The isolation property of an ACID transaction is typically achieved through optimistic or pessimistic concurrency control. Both approaches can have a negative impact on certain classes of applications if the duration of the transaction exceeds a few seconds (see here for a good explanation). This can frequently be the case for transactions involving slow participants (humans, for example) or those distributed over high latency networks (such as the Internet). Also, some actions cannot simply be rolled back, such as the sending of an email or the invocation of some third-party service.

A common strategy for applications that cannot use ACID is to throw out transactions altogether. However, with this approach you are missing out on many of the benefits that transactions can provide. There are many alternative transaction models that relax some of the ACID properties, while still retaining many of the strong guarantees essential for building robust enterprise applications. These models are often referred to as "Extended Transaction models" and should be considered before deciding not to use transactions at all.

In the Narayana project, we have support for three Extended Transaction models: Nested Top Level Transactions, Nested Transactions and a compensation-based model based on Sagas. In this series of blog posts I'll be focusing on the compensation-based approach.


What is a ‘Compensation-based transaction’?

Transaction systems typically use a two-phase protocol to achieve atomicity between participants. This is the case for both ACID transactions and our compensation-based transactions model. In the first phase, each individual participant of an ACID transaction will make durable any state changes that were made during the scope of the transaction. These state changes can either be rolled back or committed later once the outcome of the transaction has been determined. However, participants in a compensation-based transaction behave slightly differently. Here any state changes made in the scope of the transaction are committed during (or prior to) the first phase. In order to make "rollback" possible, a compensation handler is logged during the first phase. This allows the state changes to be 'undone' if the transaction later fails.

What Effect Does this Have on the Isolation Property of the Transaction?

The isolation property of a transaction dictates what, if any, changes are visible outside of the transaction, prior to its completion. For ACID transactions, the isolation property is usually pretty strong, with database vendors offering some degree of relaxation via the isolation level configuration. However, in a compensation-based transaction the isolation level is totally relaxed, allowing units of work to be completed and visible to other transactions as the current compensation-based transaction progresses. The benefit of this model is that database resources are not held for prolonged periods of time. However, the downside is that this model is only applicable for applications that can tolerate this reduced level of isolation.

The following two diagrams show an example where a client is coordinating invocations to multiple services that each make updates to a database. The diagrams are simplified in order to focus on the different isolation levels offered by ACID and compensation-based transactions. The example also assumes a database is used by the service, but it could equally apply to other resources.


The diagram above shows a simplified sequence diagram of the interactions that occur in an ACID transaction. After the client begins the (ACID) transaction it invokes the first service. This service makes a change to a database and at this point database resources are held by the transaction. This example uses pessimistic locking. Had optimistic locking been used, the holding of database resources could have been delayed until the prepare phase, but this could result in more failures to prepare. The client then invokes the other services, which may in turn hold resources on other transactional resources. Depending on the latency of the network and the nature of the work carried out by the services, this could take some time to complete. All the while, the DB resources are still held by the transaction. If all goes well, the client then requests that the transaction manager commit the transaction. The transaction manager invokes the two-phase commit protocol, by first preparing all the participants and then, if all goes well, committing all the participants. It's not until the database participant is told to commit that these database resources are released.

From the diagram, you can see how, in an ACID transaction, DB resources could be held for a relatively long period of time. Also, assuming the service does not wish to make a heuristic decision, this duration is beyond the control of the service. It must wait to be informed of the outcome of the protocol, which is subject to any delays introduced by the other participants.




The diagram above shows a simplified sequence diagram of the interactions that occur in a compensation-based transaction. The client begins a new (compensation-based) transaction and then invokes the first service. The service then sends an update to the database, which is committed immediately, in a relatively short, separate ACID transaction. At this point (not shown in the diagram) the service informs the transaction manager that it has completed its work, which causes the transaction manager to record the outcome of this participant's work to durable storage along with the details of the compensation handler and any state required to carry out the compensation. It's possible to delay the commit of the ACID transaction until after the compensation handler has been logged (see here); this removes a failure window in which a non-atomic outcome could occur.

The client now invokes the other services which, in this example, behave similarly to the first service. Finally, the client can request that the transaction manager close (commit) or cancel (rollback) the compensation-based transaction. In the case of cancel, the transaction manager calls the compensating action associated with each participant that previously completed some work. In this example, the compensating action makes an update to the database in a new, relatively short ACID transaction. The service can also be notified if/when the compensation-based transaction closes. We'll cover situations when this is useful later in this series. The notification (close/compensate) is retried until it is acknowledged by the service. Although this is good for reliability, it does require that the logic of the handlers be idempotent.
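
To illustrate why idempotency matters when notifications are retried, here is a minimal sketch. The 'UndoStockUpdate' handler, the stock counter and the 'lostAcks' parameter are all invented for this example. The "already compensated" flag is kept in memory only to keep the sketch short; a real handler would need to record it durably, ideally in the same ACID transaction as the compensating update.

```java
// Illustrative only: shows an idempotent handler surviving redelivery.
class IdempotentHandlerSketch {

    /** Undoes a sale by putting one item back into stock, at most once. */
    static final class UndoStockUpdate {
        private int stock;
        private boolean compensated = false; // in-memory for brevity; should be durable

        UndoStockUpdate(int stockAfterSale) {
            this.stock = stockAfterSale;
        }

        /** Safe to call repeatedly: only the first call changes the stock. */
        void compensate() {
            if (compensated) {
                return;
            }
            stock++;
            compensated = true;
        }

        int stock() {
            return stock;
        }
    }

    /**
     * Simulates a coordinator that redelivers the compensate notification
     * until it sees an acknowledgement. The first 'lostAcks' acknowledgements
     * are treated as lost. Returns the number of delivery attempts.
     */
    static int deliverUntilAcknowledged(UndoStockUpdate handler, int lostAcks) {
        int attempts = 0;
        boolean acknowledged = false;
        while (!acknowledged) {
            handler.compensate();
            attempts++;
            acknowledged = attempts > lostAcks;
        }
        return attempts;
    }

    public static void main(String[] args) {
        UndoStockUpdate handler = new UndoStockUpdate(4);
        int attempts = deliverUntilAcknowledged(handler, 2);
        System.out.println("attempts=" + attempts + ", stock=" + handler.stock());
    }
}
```

Without the guard in 'compensate', each redelivery would add another item back into stock, so a lost acknowledgement would silently corrupt the data.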

From the diagram, you can see that the duration for which DB resources are held is greatly reduced. This comes at a cost of relaxed isolation (see the 'changes visible' marker). However, in scenarios where compensation is rare, the relaxed isolation could be of little concern as the visible changes are usually valid.

It is also possible to mitigate this loss of isolation by marking the change as tentative in the first phase and then marking the change as confirmed/cancelled in the second phase. For example, the initial change could mark a seat on a plane as reserved; the seat could later be released or marked as booked, depending on the outcome of the transaction. Here we have traded the holding of database-level resources for the holding of application-level resources (in this case the seat). This approach is covered in more detail later in the series.
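
The seat example can be sketched as a tiny state machine. The 'Seat' class and its method names are hypothetical, and in a real service each transition would be a short ACID update to the database.

```java
// Illustrative only: tentative/confirm/cancel modelled as a tiny state machine.
class SeatReservationSketch {

    enum SeatState { FREE, RESERVED, BOOKED }

    static final class Seat {
        private SeatState state = SeatState.FREE;

        /** First phase: a tentative change that is visible to other transactions. */
        boolean reserve() {
            if (state != SeatState.FREE) {
                return false; // someone else holds or owns the seat
            }
            state = SeatState.RESERVED;
            return true;
        }

        /** Second phase, transaction closed: the reservation becomes a booking. */
        void confirm() {
            if (state == SeatState.RESERVED) {
                state = SeatState.BOOKED;
            }
        }

        /** Second phase, transaction cancelled: the seat is released. */
        void cancel() {
            if (state == SeatState.RESERVED) {
                state = SeatState.FREE;
            }
        }

        SeatState state() {
            return state;
        }
    }

    public static void main(String[] args) {
        Seat seat = new Seat();
        System.out.println("reserved: " + seat.reserve());
        System.out.println("second attempt: " + seat.reserve());
        seat.cancel();
        System.out.println("after cancel: " + seat.state());
    }
}
```

Other transactions can see that the seat is reserved, but they never see it as booked unless the transaction closes. Because 'confirm' and 'cancel' only act on a reserved seat, repeating either call is harmless, which also fits the retry behaviour of the notifications.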


What's Coming up in the Next Posts?

The following three posts will each focus on a particular set of use cases where compensation-based transactions could prove to be a better fit than ACID transactions. In each part, I'll provide a code example, using the latest iteration of our new API for compensation-based transactions (first introduced here).


  • Part 2: Non-transactional Work. This part will cover situations where you need to coordinate multiple non-transactional resources, such as sending an email or invoking a third party service.
  • Part 3: Cross-domain Distributed Transactions. This part covers a scenario where the transaction is distributed, and potentially crosses multiple business domains.
  • Part 4: Long-lived Transactions. This part covers transactions that span long periods of time and shows how it's possible to continue the transaction even if some work fails.


In part five, I'll cover the status of our support for compensation-based transactions and present a roadmap for our future work.

Monday, May 6, 2013

When you need ACID and can't get it ...

What happens when you need traditional ACID transactions across your resource managers and they won't behave? By that I mean they're not two-phase aware so can't be driven by a coordinator to achieve all of the necessary properties. Of course if you've just one such resource manager (let's call it one-phase aware for now) then you can probably make use of the LRCO optimisation.
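
For readers who haven't met LRCO before, the ordering it relies on can be sketched as follows. This is a simplified illustration, not JBossTS code: the interfaces, the 'Recording' helper and the method names are invented here, and crash recovery, which is where the real difficulty lies, is left out entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: the commit ordering behind LRCO, without recovery.
class LrcoSketch {

    interface TwoPhaseResource {
        boolean prepare();
        void commit();
        void rollback();
    }

    interface OnePhaseResource {
        boolean commitOnePhase(); // true if the commit succeeded
        void rollback();
    }

    /** Prepare the two-phase resources first, then let the one-phase resource decide. */
    static boolean complete(List<? extends TwoPhaseResource> twoPhase, OnePhaseResource last) {
        for (TwoPhaseResource resource : twoPhase) {
            if (!resource.prepare()) {
                // A "no" vote: undo everyone else; the last resource never committed.
                for (TwoPhaseResource other : twoPhase) {
                    if (other != resource) {
                        other.rollback();
                    }
                }
                last.rollback();
                return false;
            }
        }
        // The outcome of the single one-phase commit becomes the transaction outcome.
        boolean committed = last.commitOnePhase();
        for (TwoPhaseResource resource : twoPhase) {
            if (committed) {
                resource.commit();
            } else {
                resource.rollback();
            }
        }
        return committed;
    }

    /** Test double that records the calls it receives. */
    static final class Recording implements TwoPhaseResource, OnePhaseResource {
        private final String name;
        private final boolean succeeds;
        private final List<String> log;

        Recording(String name, boolean succeeds, List<String> log) {
            this.name = name;
            this.succeeds = succeeds;
            this.log = log;
        }

        public boolean prepare() { log.add(name + ":prepare"); return succeeds; }
        public void commit() { log.add(name + ":commit"); }
        public boolean commitOnePhase() { log.add(name + ":commitOnePhase"); return succeeds; }
        public void rollback() { log.add(name + ":rollback"); }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        boolean ok = complete(Arrays.asList(new Recording("db", true, log)),
                new Recording("mail", true, log));
        System.out.println(ok + " " + log);
    }
}
```

The ordering only works for a single one-phase resource. With two of them, the first could commit and the second then fail, with no way of undoing the first.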

However, what if you've got more than one such resource manager? Well if you've checked out JBossTS in the past then you'll know that we allow you to enlist multiple one-phase resource managers, but there's a very big caveat. This really isn't an option that anyone should choose. Fortunately I blogged about a better option over 6 years ago (!): compensation transactions. There were a few follow up entries, but the original one is the place to start.

Now what got me revisiting this was the article I wrote earlier on banks, ACID and BASE. It wasn't so much the notion that banks don't use ACID as the fact that we still see a lot of popular (NoSQL) databases that don't support transactions. Of course some use transactions internally (local transactions), but what I'm talking about is what is often referred to as global transactions: those transactions that span multiple resource managers. When you want to send a message, update a traditional database and update a NoSQL instance all within the scope of the same transaction, in most cases you're out of luck. And this is something that many enterprise customers want to do or will want to do soon.

That is unless you can use LRCO, which may not be possible if the resource manager(s) don't support the necessary recovery semantics.

Therefore, you're left in a situation that has very few options. One would be to ignore the problem and assume that because failures are rare (they are rare, right?) it's unlikely to ever be a problem. Personally I wouldn't recommend this.

Another option would be to resolve any problems manually. Again, since failures are rare (we're sure, right?) the chances of having to do this are slim and if you do have to resolve then it's a good enough trade-off. Of course you've got to hope that you've got enough information to detect and handle the recovery. Again, this isn't something I'd recommend unless you're a company that can afford to employ people who do nothing each day other than resolve these problems. (Yes, they do exist.)

My recommended option is to use compensation transactions, as I outlined many years ago. They will automate the recovery (and detection) as well as allow you to seamlessly integrate with a range of resource managers which are "well behaved". I'm hoping that we'll get a chance to try these out soon with some enterprise applications that use NoSQL implementations that don't support global transactions. Once we've done this then it'll be a good reason for one of the team to come back and write something here.

Friday, May 3, 2013

Cross posting on banks, ACID and BASE

Just a cross post that I thought people might be interested in. It was prompted by a recent article that seemed to be interpreted as saying that banks don't use ACID transactions!