Transaction stores
Some time ago I prototyped a Redis based implementation of the SlotStore
backend. It is suitable for installations where the nodes hosting the storage
can come and go, which makes it well suited to cloud based deployments of the
Recovery Manager.
In the context of the CAP theorem of
distributed computing, the recovery store needs to behave as a CP
system, i.e. it needs to tolerate network partitions and yet
continue to provide strong consistency. Redis can provide the strong
consistency guarantee if the RedisRaft
module is used with Redis running as a cluster. RedisRaft achieves
consistency and partition tolerance by ensuring that:
- acknowledged writes are guaranteed to be committed and never
lost,
- reads will always return the most up-to-date committed write,
- the cluster is sized correctly: a RedisRaft cluster of 3 nodes can
tolerate a single node failure, a cluster of 5 can tolerate 2 node
failures, and so on; i.e. if the cluster is to tolerate losing N nodes then the
cluster size must be at least 2*N+1, so the minimum cluster size is 3.
The reason for having an odd number of nodes in the cluster is to
avoid “split brain” scenarios during network partitions; an odd number
guarantees that one side of the split will hold a majority.
During a network split the cluster will become unavailable for a while: the
cluster is designed to survive the failure of a few of its nodes,
but it is not a suitable solution for applications that require
availability in the event of large net splits. Transaction
systems, however, favour consistency over availability.
A key motivator for this new SlotStore backend is to address a common
problem with using the Narayana transaction stores on cloud platforms:
scaling down a node that has in-doubt transactions can leave those
transactions unmanaged. Most cloud platforms can detect crashed nodes and
restart them, but this must be carefully managed to ensure that the
restarted node is identically configured (same node identifier, same
transaction store and same resource adapters). The current cloud
solution, when running on OpenShift, is to use a ReplicaSet and to veto
scale down until all transactions are completed, which can take an
indeterminate amount of time. If, instead, we can ask another member of the
deployment to finish these in-doubt transactions, then all but the last
node can be safely shut down even with in-doubt transactions outstanding. The
resulting increase in availability in the presence of node or network
failures is a significant benefit for transactional applications which,
after all, is a key reason why businesses are embracing cloud based
deployments.
Remark: Redis is offered as a managed service on a majority of cloud
platforms, which can help customers get started with this solution.
Note, however, that standard Redis excludes the RedisRaft module, which is a
requirement for use as a transaction store.
A Redis backed store
Redis is a key value store. Keys are stored in hash slots and hash
slots are shared evenly amongst the shards (keys -> hash slots ->
shards); the redis cluster
specification contains the details. Re-sharding involves moving hash
slots to other nodes, which impacts performance. Thus, if we can control
which hash slots the [slot store] keys map onto then we can improve
performance under both normal and failure conditions. This periodic
rebalancing of the cluster can be optimised if keys belonging to the
same recovery manager are stored in the same hash slot. Additionally,
having the keys for a particular recovery node colocated on a single
cluster node is good for the general performance of the transaction
store.
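To illustrate the keys -> hash slots mapping, here is a small sketch (not part of the slot store code; the key names are invented) using the Jedis JedisClusterCRC16 utility, which implements the standard Redis Cluster slot calculation. Note how wrapping a shared prefix in a hash tag ({...}) forces keys into the same slot:

```java
import redis.clients.jedis.util.JedisClusterCRC16;

public class SlotMapping {
    public static void main(String[] args) {
        // Without a hash tag the whole key is hashed, so these two keys
        // will usually land in different slots (out of the 16384 available).
        System.out.println(JedisClusterCRC16.getSlot("node1:tx:0001"));
        System.out.println(JedisClusterCRC16.getSlot("node2:tx:0001"));

        // With a hash tag only the text between '{' and '}' is hashed,
        // so every key that shares the tag maps to the same slot.
        System.out.println(JedisClusterCRC16.getSlot("{groupA}:node1:tx:0001"));
        System.out.println(JedisClusterCRC16.getSlot("{groupA}:node2:tx:0001"));
    }
}
```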
Also noteworthy is that the keys mapped to a particular hash slot can
be operated upon transactionally,
which is not the case for keys in different slots, meaning that no
inter-node hand-shaking is required. This feature perhaps opens up the
possibility of allowing concurrent access to recovery logs by
different recovery managers, but that’s something for a future
iteration of the design. If the logs in a store are shared, be
aware that some Narayana recovery modules cache records, so those
implementations would need to be re-evaluated. Note in particular that
Redis has support for optimistic concurrency using the watch API, which
clients can use to observe updates to key values made by other recovery
managers.
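A minimal sketch of that pattern, assuming a plain Jedis connection and a hypothetical log record key (this is not the slot store implementation): the queued update only commits if no other client modified the watched key in the meantime.

```java
import java.util.List;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Transaction;

public class OptimisticLogUpdate {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            String key = "{groupA}:node1:tx:0001"; // hypothetical log record key
            jedis.set(key, "prepared");

            // WATCH the key: if another client modifies it before EXEC,
            // the transaction is aborted.
            jedis.watch(key);
            String current = jedis.get(key);

            Transaction tx = jedis.multi();
            tx.set(key, current + ",committed"); // queued, not yet executed
            List<Object> replies = tx.exec();    // null when the watched key changed

            if (replies == null) {
                System.out.println("conflicting update detected, retry");
            } else {
                System.out.println("update committed atomically");
            }
        }
    }
}
```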
Key space design
A recovery manager has a unique node identifier. We’d like to be able
to form “recovery groups” such that any recovery manager in the group
can manage transactions created by the others, although not at the same time.
To this end we assign a “failoverGroupId” to each recovery manager and
use that as the Redis key prefix. This will force all keys created by
members of the failover group into the same hash slot. A cloud example
of this idea is that the pods in a deployment would all share the same
failoverGroupId, so any pod in the deployment can take over when the
deployment is scaled down.
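As a hedged sketch of how such keys might look (the key layout and names below are assumptions for illustration, not necessarily the actual slot store format), wrapping the failoverGroupId in a Redis hash tag means only the group id is hashed when the slot is chosen, while the owning node id is still embedded in the key:

```java
public class RecoveryKeys {
    /**
     * Build a key for a slot store entry. The failover group id is wrapped
     * in a hash tag so that all keys written by members of the group map to
     * the same hash slot; the node id identifies the recovery manager that
     * currently owns the entry. This layout is illustrative only.
     */
    static String key(String failoverGroupId, String nodeId, int slot) {
        return "{" + failoverGroupId + "}:" + nodeId + ":" + slot;
    }

    public static void main(String[] args) {
        // Pods in the same deployment share the failoverGroupId, so keys
        // owned by node1 and node2 stay colocated in one hash slot.
        System.out.println(key("lra-deployment", "node1", 0)); // {lra-deployment}:node1:0
        System.out.println(key("lra-deployment", "node2", 0)); // {lra-deployment}:node2:0
    }
}
```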
Failover
Failover involves detecting when a member of the “recovery group” is
removed from the cluster and then migrating its keys to another member
of the group. I added an example to the LRA
recovery coordinator and used the rename command of the jedis
redis API to “migrate” the keys, which is an atomic
operation.
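A minimal sketch of that idea, assuming the Jedis 4.x API and the illustrative key layout from the previous section (the actual coordinator code may differ): scan for the failed node's keys and rename each one so it is owned by the surviving node. Because the old and new keys share the group's hash tag they stay in the same slot, and each rename is atomic.

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.ScanParams;
import redis.clients.jedis.resps.ScanResult;

public class MigrateLogs {
    public static void main(String[] args) {
        String group = "lra-deployment";        // hypothetical failover group id
        String from = "node1", to = "node2";    // failed and surviving nodes

        try (Jedis jedis = new Jedis("localhost", 6379)) {
            String cursor = ScanParams.SCAN_POINTER_START;
            ScanParams params = new ScanParams().match("{" + group + "}:" + from + ":*");
            do {
                // SCAN pages through the keys owned by the failed node
                ScanResult<String> page = jedis.scan(cursor, params);
                for (String oldKey : page.getResult()) {
                    // rewrite the node id portion of the key; RENAME is atomic
                    String newKey = oldKey.replaceFirst(":" + from + ":", ":" + to + ":");
                    jedis.rename(oldKey, newKey);
                }
                cursor = page.getCursor();
            } while (!cursor.equals(ScanParams.SCAN_POINTER_START));
        }
    }
}
```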
Issues
The performance of RedisRaft in this implementation of the SlotStore
backend is poor (more than 4 times slower than the default store). I
have not invested any effort in improving it but may follow up with
another post to discuss throughput, since it is a general issue that
needs to be solved for any cloud based object store. Examples of topics
to investigate include pipelining redis
commands (similar to how we batch writes to the Journal Store),
using Virtual Threads,
etc.
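For instance, a pipelining sketch (illustrative only, using Jedis against a hypothetical key layout) that queues a batch of writes locally and flushes them in one go rather than paying a network round trip per command:

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;

public class PipelinedWrites {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            Pipeline pipeline = jedis.pipelined();
            // Queue a batch of writes; nothing is sent to the server yet.
            for (int slot = 0; slot < 100; slot++) {
                pipeline.set("{lra-deployment}:node1:" + slot, "record-" + slot);
            }
            // Flush the whole batch and read all the replies in one pass.
            pipeline.sync();
        }
    }
}
```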
Because of this performance gap we are also investigating other alternatives,
including an Infinispan based slot store backend: Infinispan now supports
partition tolerance, so it has become a suitable candidate for a transaction
store. Such a store would provide many of the benefits of a Redis store,
although its performance will be a key implementation constraint.
This design for the key space may not be suitable for transactions
with subordinates or nested transactions, or for ones that require
participant logs to be distinct from the transaction log, such as JTS or
XTS. I say may because a modification to the design should accommodate
these models.
Assumptions
The cloud platform administrator is responsible for:
- detecting and restarting failed nodes;
- issuing the migrate command on one of the remaining nodes;
- detecting when the deployment is scaled down to zero with
pending transactions (including orphans) and emitting a warning
accordingly.
Example of how to migrate logs
The demo presents the use case of migrating logs between LRA
coordinators. To run the demo you will need to:
- Clone and build a narayana git branch with support for a Redis
backed store.
- Start a 3-node cluster of Redis nodes running RedisRaft.
- Build and start two LRA coordinators with distinct node ids, node1
and node2.
- Start an LRA on the first coordinator and then halt it to simulate a
failure.
- View the redis keys using the Redis CLI, noticing that the keys embed
the node id of the owning coordinator.
- Ask the second coordinator to migrate the keys from node1 to
node2.
- Coordinators maintain a cache of LRAs, but since this is just a PoC
I haven’t implemented refreshing internal caches so you will need to
simulate that by restarting the second coordinator.
- When the first periodic recovery cycle runs (the default is every 2
minutes) the migrated LRAs will be detected, which you can verify by running:
curl http://localhost:50001/lra-coordinator/ | jq
Please refer to the demonstrator
instructions for the full details.
Notes
The JIRA issue and branch are JBTM-3762.
- the implementation is in the slot store directory
- the tests can be run using
mvn test -Predis-store -Dtest=RedisStoreTest#test1 -f ArjunaCore/arjuna/pom.xml
and assume that a redis cluster is running on the test machine
- the
demo is work in progress and is strictly a PoC
- RedisRaft implementation: clone it with
git clone https://github.com/RedisLabs/redisraft.git
and build it with cmake and make