Friday, July 8, 2011

rethinking log storage

Transaction coordinators like JBossTS must write log files to fault tolerant storage in order to be able to guarantee correct completion of transactions in the event of a crash. Historically this has meant using files hosted on RAIDed hard disks.

The workload is characterised by near-100% writes (reads occur only during recovery), a large number of small (sub-1 KB) operations and, critically, the need to guarantee that the data has been forced through assorted cache layers to non-volatile media before continuing.

Unfortunately this combination of requirements has a tendency to negate many of the performance optimizations provided in storage systems, making log writing a common bottleneck for high performance transaction systems. Addressing this can often force users into unwelcome architectural decisions, particularly in cloud environments.

So naturally transaction system developers like the JBossTS team spend a lot of time thinking about how to manage log I/O better. Those who have read or watched my JUDCon presentation will already have seen some hints of the ideas we're currently looking at, including using cluster-based in-memory replication instead of disk storage.

More recently I've been playing with SSD-based storage again, following up on earlier work to better understand some of the issues that arise as users transition to a new generation of disk technology with new performance characteristics.

As you'll recall, we recently took the high speed journalling code from HornetQ and used it as the basis for a transaction log. Relatively speaking, the results were a pretty spectacular improvement over our classic logging code. In absolute terms, however, we still weren't getting the best out of the hardware.

A few weeks back I got my grubby paws on one of the latest SSDs. On paper its next-generation controller and faster interface should have provided a substantial improvement over its predecessor. Not so for my transaction logging tests though - in keeping with tradition it utterly failed to outperform the older generation of technology. On further investigation the reasons for this became clear.

Traditional journalling solutions are based on a) aggregating a large number of small I/Os into a much smaller number of larger I/Os so that the drive can keep up with the load and b) serializing these writes to a single append-only file in order to avoid expensive head seeks.
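
To make that concrete, here's a rough sketch in Java of the traditional shape - a single append-only file, small records aggregated into one buffer and forced with a single fsync, and transaction threads parked on a latch until their batch is on disk. The class and method names are made up for illustration; this isn't the actual HornetQ journal code.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

// Illustrative sketch only: a traditional group-commit journal that
// aggregates many small records into one large sequential write to a
// single append-only file, with one force() per batch.
public class BatchingJournal {
    private final FileChannel channel;
    private final List<byte[]> pending = new ArrayList<byte[]>();
    private CountDownLatch flushed = new CountDownLatch(1);

    public BatchingJournal(Path logFile) throws IOException {
        channel = FileChannel.open(logFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.APPEND);
    }

    // Called by each transaction thread; the returned latch is released
    // once the record is known to be on stable storage.
    public synchronized CountDownLatch append(byte[] record) {
        pending.add(record);
        return flushed;
    }

    // Called by a flusher thread when the batch is full or a timeout expires.
    public void flush() throws IOException {
        List<byte[]> batch;
        CountDownLatch latch;
        synchronized (this) {
            if (pending.isEmpty()) return;
            batch = new ArrayList<byte[]>(pending);
            pending.clear();
            latch = flushed;
            flushed = new CountDownLatch(1);
        }
        int size = 0;
        for (byte[] r : batch) size += r.length;
        ByteBuffer buf = ByteBuffer.allocate(size);
        for (byte[] r : batch) buf.put(r);
        buf.flip();
        while (buf.hasRemaining()) channel.write(buf); // one large sequential write
        channel.force(false);                          // one fsync for the whole batch
        latch.countDown();                             // wake every waiting transaction
    }
}
```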

With SSDs the first of those ideas is still sound, but the number of I/O events the drive can deal with is substantially higher, which requires re-tuning the journal parameters. In some cases it even becomes undesirable to batch the I/Os at all - until the drive is saturated with requests, batching is just pointless overhead that delays threads unnecessarily. As those threads (transactions) may hold locks on other data, any additional latency is undesirable. Also, unlike some other systems, a transaction manager has one thread per transaction, since that thread is the one executing the business logic. It can't proceed until the log write completes, so write batching solutions involve significant thread management and scheduling overhead and often leave a large number of threads parked waiting on I/O.
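
The unbatched path that suits a lightly loaded SSD is, by contrast, almost trivial - the transaction thread just writes and forces its own record and carries on. Again, a purely illustrative sketch with made-up names, not real JBossTS code:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative sketch only: no batcher, no parked threads - each
// transaction thread performs its own write and fsync.
public class DirectLogWriter {
    private final FileChannel channel;

    public DirectLogWriter(Path logFile) throws IOException {
        channel = FileChannel.open(logFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.APPEND);
    }

    // The transaction thread itself runs this and only returns once the
    // record is on stable storage; no hand-off to other threads.
    public void writeRecord(byte[] record) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(record);
        synchronized (this) {               // writes to the one file still serialize here
            while (buf.hasRemaining()) {
                channel.write(buf);
            }
            channel.force(false);           // one fsync per transaction
        }
    }
}
```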

There is another snag though: even where the journalling code can use async I/O to dispatch multiple writes to the O/S in parallel, the filesystem still runs them sequentially as they contend on the inode semaphore for the log file. Thus writes for unrelated transactions must wait unnecessarily, much like the situation that arises in business applications which use too coarse-grained locking. Also, the drive's NCQ (native command queuing) remains largely unused, limiting the ability of the drive controller to execute writes in parallel.
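
For illustration, this is roughly what the async dispatch looks like using java.nio's AsynchronousFileChannel - the Java-level calls return immediately, but because every write targets the same file the filesystem still executes them one after another. The names are made up:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only: async I/O lets transactions dispatch log writes
// without blocking, but with a single target file the writes still contend
// on that file's inode lock inside the filesystem, so the drive's command
// queue stays mostly empty.
public class AsyncSingleFileLog {
    private final AsynchronousFileChannel channel;
    private final AtomicLong position = new AtomicLong(0);

    public AsyncSingleFileLog(Path logFile) throws IOException {
        channel = AsynchronousFileChannel.open(logFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
    }

    // Each transaction claims its own offset and dispatches its record;
    // the dispatches overlap at the Java level but not below it.
    public Future<Integer> append(byte[] record) {
        long offset = position.getAndAdd(record.length);
        return channel.write(ByteBuffer.wrap(record), offset);
    }

    // An explicit force is still needed before telling the transaction
    // its record is durable.
    public void sync() throws IOException {
        channel.force(false);
    }
}
```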

The serialization of writes to the hardware, whilst useful for head positioning on HDDs, is a painful mistake for SSDs. These devices, like a modern CPU, require high concurrency in the workload to perform at their best. So, just as we go looking for parallelization opportunities in our apps, we must also look for them in the design of the log I/O.

The most obvious solution is to load balance the logging over several journal files when running on an SSD. It's not quite that simple though - care must be taken to avoid a filesystem journal write as that will trigger a buffer flush for the entire filesystem, not just the individual file. Not to mention contention on the filesystem journal. For optimal performance it may pay to put each log file on its own small filesystem. But I'm getting ahead of myself. First there is the minor matter of actually writing a prototype load balancer to test the ideas. Any volunteers?
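
For the curious, here's the kind of shape such a prototype might take - a purely hypothetical sketch of the round-robin idea with made-up names, not the eventual implementation:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only: spread unrelated transactions' log writes
// round-robin over several independent journal files so the SSD sees
// concurrent work instead of one serialized stream.
public class RoundRobinLogStore {
    private final FileChannel[] journals;
    private final AtomicInteger next = new AtomicInteger(0);

    public RoundRobinLogStore(Path[] files) throws IOException {
        journals = new FileChannel[files.length];
        for (int i = 0; i < files.length; i++) {
            // ideally each file would sit on its own small filesystem, to
            // avoid contention on a shared filesystem journal
            journals[i] = FileChannel.open(files[i],
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                    StandardOpenOption.APPEND);
        }
    }

    // Each transaction picks the next journal in round-robin order, so
    // unrelated transactions write and force different files in parallel.
    public void writeRecord(byte[] record) throws IOException {
        int idx = next.getAndIncrement() % journals.length;
        if (idx < 0) idx += journals.length;        // guard against counter wrap-around
        FileChannel journal = journals[idx];
        ByteBuffer buf = ByteBuffer.wrap(record);
        synchronized (journal) {                    // only writers to the same file serialize
            while (buf.hasRemaining()) {
                journal.write(buf);
            }
            journal.force(false);
        }
    }
}
```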
