Thursday, March 10, 2011

SQL != NoSQL

The new menagerie of NoSQL databases does not have the same characteristics as traditional SQL databases, much less as each other. Which is kind of the point of using them in the first place.

So why are so many otherwise sane and smart people doing their best to ignore this fact? And why is the middleware community pandering to their delusions?

There is a dangerously seductive attraction in the idea that you can improve an existing system just by swapping out one component implementation for another using the same interface. A java.util.Map may abstract a Hashtable, HashMap, ConcurrentHashMap or a distributed, replicated in-memory data grid like Infinispan. Choose wisely and your system works better. Choose poorly and things unravel pretty fast. It is necessary to look beyond the interface to the underlying implementation and understand the details in order to know how to drive the interface in an optimal, or even correct, manner.
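The contract differences can be surprisingly basic. A quick illustration of my own (nothing here beyond the JDK) of how three Map implementations disagree on something as simple as null keys:

```java
import java.util.HashMap;
import java.util.Hashtable;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MapContracts {
    // Returns true if the given Map implementation accepts a null key.
    static boolean acceptsNullKey(Map<String, String> map) {
        try {
            map.put(null, "value");
            return true;
        } catch (NullPointerException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Same interface, three implementations, three different answers
        // to "can I store a null key?" - and that's before we even get
        // to thread safety or distribution semantics.
        System.out.println(acceptsNullKey(new HashMap<>()));           // true
        System.out.println(acceptsNullKey(new Hashtable<>()));         // false
        System.out.println(acceptsNullKey(new ConcurrentHashMap<>())); // false
    }
}
```

Code written against the interface compiles the same either way; whether it works is another matter entirely.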

The JPA may abstract not only one of several ORMs, but one of an even larger number of relational database engines behind them. It may even be implemented using an OGM backed by a key-value store such as Infinispan or Cassandra. The ability to reuse existing JPA code or programming skills when migrating from relational to non-relational storage is attractive to both developers and management. The middleware community responds to this user demand with solutions like Kundera and Hibernate OGM, which developers lap up in ever-increasing numbers. Unfortunately they often do so with an inadequate understanding of the underlying implementation details.

As middleware developers we are guilty of doing too good a job of abstracting away the underlying implementation detail. Many users willingly buy into this delusion, being all too keen to believe we can magically shield them from having to understand those details.

There are two approaches to dealing with this problem: improve the abstraction so that it becomes less important to understand the implementation details, and provide material to help users understand those details in the cases where they must.

These tasks are going to occupy a big chunk of my time in the future, as I shift attention towards providing transaction management capability for the new generation of cloud environments, where data is managed and manipulated in both SQL and NoSQL stores. Interesting times ahead I think.

Wednesday, March 9, 2011

Is it turned On?

Question #0 on any tech troubleshooting checklist is, as you well know, 'Is it plugged in?'. This is followed closely by question #1, 'Is it turned On?'. Sometimes I have to relearn this the hard way.

Those joining the party late will have missed yesterday's exciting episode, at the end of which the intrepid hero is left scratching his head at the failure of his Shiny New SSD to outperform his clunky HDD. Let's move the incredibly riveting plot forward a bit with some hot command line action scenes:

$ dd if=/dev/zero of=/tmp/ssd/foo bs=4k count=1000000 oflag=direct

32.9 MB/s

This sucks. Let's swap in the noop scheduler, downgrade the journalled ext3 to ext2 and change the mount to noatime:

$ dd if=/dev/zero of=/tmp/ssd/foo bs=4k count=1000000 oflag=direct

35.0 MB/s

Better, but still sucking.

$ dd if=/dev/zero of=/tmp/ssd/foo bs=8k count=500000 oflag=direct

58.2 MB/s

$ dd if=/dev/zero of=/tmp/ssd/foo bs=16k count=250000 oflag=direct

87.6 MB/s

ok, so the performance is a direct function of the block size. Ramp the block size up high enough and we saturate the drive at over 200MB/s. But what is limiting the number of blocks we can throw at the device? The hardware spec rates it at 50000 x 4k IOPS, which would be 195MB/s. Let's throw a few more processor cores at the problem just for the hell of it:
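A quick sanity check of that arithmetic (my own illustration; the 50000 figure is the drive's rated 4k write IOPS quoted above):

```java
public class IopsMath {
    // Throughput for a device limited to a fixed number of I/O operations
    // per second is simply IOPS multiplied by the block size.
    static double mbPerSec(long iops, long blockBytes) {
        return iops * blockBytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long rated = 50_000; // rated 4k write IOPS
        System.out.printf("4k blocks at rated IOPS: %.0f MB/s%n", mbPerSec(rated, 4 * 1024));
        // The measured 32.9 MB/s at bs=4k implies far fewer operations
        // are actually reaching the drive:
        double observedIops = 32.9 * 1024 * 1024 / (4 * 1024);
        System.out.printf("observed: ~%.0f IOPS%n", observedIops); // ~8400, nowhere near 50000
    }
}
```

So the drive isn't the limit at 4k blocks; something in the submission path is.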

$ dd if=/dev/zero of=/tmp/ssd/foo1 bs=4k count=1000000 oflag=direct &

$ dd if=/dev/zero of=/tmp/ssd/foo2 bs=4k count=1000000 oflag=direct &

$ dd if=/dev/zero of=/tmp/ssd/foo3 bs=4k count=1000000 oflag=direct &

$ dd if=/dev/zero of=/tmp/ssd/foo4 bs=4k count=1000000 oflag=direct &

10.5 MB/s

10.5 MB/s

10.5 MB/s

10.5 MB/s
Well, 42 > 35, but nowhere near a linear speedup. Something is fishy here. SATA should do NCQ, which would allow all four of those processes (actually up to 32) to have outstanding requests, so we should be soaking up a lot more of that lovely bandwidth.

Unless...

$ lsmod | grep libata
libata 209361 1 piix


umm, oops.

The Intel ICH10R on our P6X58D-E is running in legacy IDE mode, because someone didn't check the BIOS settings carefully enough when building the machine. Not that I have any clue who that may have been. No, Sir, not at all.

Ahem. Let's reboot shall we...

$ lsmod | grep libata
libata 209361 1 ahci


Right, that's better. Off we go again:

$ dd if=/dev/zero of=/tmp/ssd/foo bs=4k count=1000000 oflag=direct
76.8 MB/s


Double the speed. Not too bad for five minutes work, even if it did require walking all the way down the hall to the machine room.

$ dd if=/dev/zero of=/tmp/ssd/foo1 bs=4k count=1000000 oflag=direct &

$ dd if=/dev/zero of=/tmp/ssd/foo2 bs=4k count=1000000 oflag=direct &

34.5 MB/s

34.5 MB/s

Huh?

With libata now correctly driving the SSD with all its features supported, those concurrent processes should be getting 70+MB/s each, not sharing it. Grrr.

Oh well, let's see how the transaction system is doing, shall we? It's writing a single file at a time anyhow. Since we were already running at over 36k tx/s against a theoretical max of 55k we can't expect the 2x speedup the raw dd numbers would suggest, but we should see some improvement...

30116 tx/second.

Some days it's just not worth getting out of bed.

Change request to Clebert: make the block size in the Journal a config option.

Tuesday, March 8, 2011

hardware p0rn

Like most geeks I adore shiny new tech toys. I recently built two machines to host additional virtualized slave nodes for our hudson cluster. Thanks to the drop in RAM prices I had enough budget to stuff them full (i7 based workstations, so triple channel RAM on a 6 slot board topping out at 24GB). There was even enough left over for an SSD. yay. So today I fired one up and dumped my infamous XA transaction microbenchmark onto it.

The SSD totally failed to outperform the HDD.

So I checked the disk mount points, ran some diagnostics with dd and was still left scratching my head. The test code is basically write only, that being the nature of the beast when working with transaction recovery logs. Fortunately I'd opted for an SSD with the Sandforce controller, which provides for much more symmetric read/write performance compared to some of the competition. The manufacturer specs are 285MB/s read and 275MB/s write. Let's give it a try:

$ dd if=/dev/zero of=/tmp/ssd/foo bs=1M count=4000 oflag=direct
4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB) copied, 19.0998 seconds, 220 MB/s

ok, close enough. The HDD does not do so well:

$ dd if=/dev/zero of=/tmp/hdd/foo bs=1M count=4000 oflag=direct
4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB) copied, 30.5214 seconds, 137 MB/s

So if I can push around 2x the data onto it, I should be able to log 2x the number of transactions per second, right? cool.

Umm, No.

25106 tx/second to HDD log.

24996 tx/second to SSD log.

hmmm.

A little bit of fiddling around with the ObjectStore provides another useful data point: a single transaction log record for this test case is a little over 600 bytes - basically two Xids (one for each XAResource) plus transaction id and overhead.
So, 25k tx/s * 600 bytes = 15MB/s give or take. Less than 1/10th of what the SSD should handle.
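Spelled out, using the figures above (my own back-of-envelope sketch):

```java
public class LogBandwidth {
    // MB/s of log data generated at a given transaction rate and record size.
    static double logMbPerSec(long txPerSec, long bytesPerRecord) {
        return txPerSec * bytesPerRecord / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // 25k tx/s, each writing a ~600 byte record (two Xids plus overhead)
        double mb = logMbPerSec(25_000, 600);
        System.out.printf("log bandwidth: %.1f MB/s%n", mb);              // ~14.3 MB/s
        // against the drive's rated 275 MB/s write speed:
        System.out.printf("of rated 275 MB/s: %.1f%%%n", 100 * mb / 275); // ~5.2%
    }
}
```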

We know the benchmark will run at over 50k tx/s with an in-memory store, so we're definitely bottlenecked on the I/O somewhere, but it's not write bandwidth. With a conventional HDD I'd put my money on the number of physical writes (syncs/forces) a drive head can handle. High performance log stores are designed to do contiguous append to a single file to eliminate head seek latency, so it's something to do with the number of events the device can handle in series. Let's use the filesystem's native block size:

$ dd if=/dev/zero of=/tmp/hdd/foo bs=4k count=1000000 oflag=direct
1000000+0 records in
1000000+0 records out
4096000000 bytes (4.1 GB) copied, 139.289 seconds, 29.4 MB/s

$ dd if=/dev/zero of=/tmp/ssd/foo bs=4k count=1000000 oflag=direct
1000000+0 records in
1000000+0 records out
4096000000 bytes (4.1 GB) copied, 158.286 seconds, 25.9 MB/s

and there we have it - the block management is killing us.

Looks like I need to spend some time down in the native code and kernel to understand the I/O paths at work here. My feeling is that SSD technology just invalidated a lot of design assumptions that were probably a bit outdated already. SSDs don't do head seeks. The unit of parallelism is the memory bank, not the drive head. A log store designed for SSDs should probably stripe writes over multiple files for lock concurrency and not care about contiguous blocks. Unless the SSD drive firmware is written to assume and optimise for legacy HDD usage patterns. Not that a modern HDD actually does so much in the way of truly contiguous writes anyhow - there is an awful lot of abstraction between the filesystem blocks and the physical layout these days.
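To make the striping idea concrete, here is a minimal sketch of my own (purely illustrative, not JBossTS code) of an append-only log spread over several files, trading the single-file contiguity an HDD wants for the lock concurrency an SSD can exploit:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.atomic.AtomicLong;

// Striped append-only log: writes go round-robin over several files, so
// concurrent writers contend on different locks instead of one.
public class StripedLog implements AutoCloseable {
    private final FileChannel[] stripes;
    private final Object[] locks;
    private final AtomicLong counter = new AtomicLong();

    public StripedLog(Path dir, int stripeCount) throws IOException {
        stripes = new FileChannel[stripeCount];
        locks = new Object[stripeCount];
        for (int i = 0; i < stripeCount; i++) {
            stripes[i] = FileChannel.open(dir.resolve("log-" + i),
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE);
            locks[i] = new Object();
        }
    }

    // Each write contends only for its stripe's lock, not a global one.
    public void append(byte[] record) throws IOException {
        int i = (int) (counter.getAndIncrement() % stripes.length);
        synchronized (locks[i]) {
            stripes[i].write(ByteBuffer.wrap(record));
            stripes[i].force(false); // still forced: this is a recovery log
        }
    }

    @Override
    public void close() throws IOException {
        for (FileChannel c : stripes) c.close();
    }
}
```

Whether this actually wins on a given drive depends entirely on how the firmware maps those files onto memory banks, which is rather the point about invalidated assumptions.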

Even setting aside the redesign for SSD optimization, it's clear there is still room for improvement here. On the HDD we should be able to trade transaction latency against batch size to reduce the number of write events and drive up the overall throughput until we saturate the drive bandwidth. On the SSD I'm not so sure we can actually saturate a properly tuned drive on this machine - the CPU can run 50k tx/s but that's only ~30MB/s of log data. Naturally a server class machine is going to have more CPU power without necessarily increasing the number of drives, but if we can retain a ratio of say one SSD per 8-16 cores then we should be firmly in CPU bound territory. Welcome to the new era. Your design assumptions are now invalid and your transaction manager needs redesigning to be more CPU efficient. Have a nice day.

More Speed!

"You don't have to outrun the bear, you just have to outrun the guy next to you."

Nevertheless it's always a good idea to run as fast as possible - there is no telling when the guy next to you may put on a sudden burst of acceleration.

So we did some more performance improvements on JBossTS.

For transactions that involve multiple XAResources, the protocol overhead is divided between network round trips to communicate between the TM and RMs, and disk writes to create a recovery log. There are a limited number of tricks that can be done with the network I/O (issuing the prepares in parallel rather than in series springs to mind), but the disk I/O is another matter.

JBossTS writes its log through an abstract storage API called the ObjectStore. There are several ObjectStore implementations, including one that uses a relational database. Most however are filesystem based. The default store is an old, reliable piece of code that has worked well for years, but recent hardware evolution has led us to question some of the design decisions.

For years now, multicore chips have meant the number of concurrent threads in a typical production app server deployment rising, with processor capability in general outstripping disk I/O capability. Relatively recently SSDs have rebalanced things a bit, but we still see a large number of threads (transactions) contending for relatively limited I/O capability.

Time to break out the profiler again...

For total reliability a transaction log write must be forced to disk before the transaction can proceed. No in-memory fs block buffering by the O/S, thank you. This limits the ability of the O/S and disk to batch and reorder writes for efficiency. Essentially there is a fixed number of syncs a drive can perform per second, and for small data like tx logs it's that sync count, not the I/O bandwidth, that is the bottleneck.
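In JDK terms the forced write looks something like this (a minimal NIO sketch of my own; illustrative, not the actual ObjectStore code):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ForcedLogWrite {
    // Append a log record and force it to stable storage before returning.
    // The force() call is what triggers the sync; without it the record may
    // sit in OS buffers and be lost on power failure.
    static void writeRecord(FileChannel channel, byte[] record) throws IOException {
        channel.write(ByteBuffer.wrap(record));
        channel.force(false); // false: file data only, no metadata flush required
    }

    public static void main(String[] args) throws IOException {
        Path log = Files.createTempFile("tx", ".log");
        try (FileChannel ch = FileChannel.open(log, StandardOpenOption.WRITE)) {
            writeRecord(ch, "tx-0001 committed".getBytes());
        }
        System.out.println(Files.size(log)); // 17
    }
}
```

It's that force() per transaction, not the handful of bytes written, that eats the drive's budget.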

The design of the current default store is such that it syncs once for each transaction. This becomes a problem when the number of transactions your CPUs and RMs can handle exceeds the number of syncs your disk array can handle. Which is pretty much what's happened. So, back to the drawing board.

Clearly what we need is some way for multiple transaction log writes to be batched into a single sync. Which in turn means putting the log records in a single file, with all the packing and garbage collection problems that entails. Then you have the thread management problems, making sure individual writers block until the shared sync is done. That's a lot of fairly complex code and testing. Nightmare.
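For the curious, the batching idea can be sketched roughly as follows (my own illustrative code, ignoring the packing and garbage collection problems entirely - this is not what we shipped):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;

// Group commit: many writer threads append records, but the expensive
// force() is shared. Each writer blocks until a sync covering its record
// has completed; one thread at a time volunteers to do the actual I/O.
public class GroupCommitLog {
    private final FileChannel channel;
    private List<byte[]> pending = new ArrayList<>();
    private long epoch = 0;          // completed batch syncs
    private boolean syncing = false; // true while a syncer thread holds the channel

    public GroupCommitLog(FileChannel channel) { this.channel = channel; }

    public void write(byte[] record) throws IOException, InterruptedException {
        long target;
        synchronized (this) {
            pending.add(record);
            // If a sync is in flight our record missed it; we need the one after.
            target = epoch + (syncing ? 2 : 1);
        }
        while (true) {
            List<byte[]> batch;
            synchronized (this) {
                if (epoch >= target) return;       // a sync covering us completed
                if (syncing || pending.isEmpty()) {
                    wait();                        // someone else is doing the work
                    continue;
                }
                syncing = true;                    // become the syncer
                batch = pending;
                pending = new ArrayList<>();
            }
            // I/O done outside the lock so other writers can queue up behind us.
            for (byte[] r : batch) channel.write(ByteBuffer.wrap(r));
            channel.force(false);                  // one sync for the whole batch
            synchronized (this) {
                epoch++;
                syncing = false;
                notifyAll();
            }
        }
    }

    public synchronized long syncCount() { return epoch; }
}
```

Under load, records pile up in `pending` while a sync is in flight, so the sync count grows far more slowly than the record count - which is the whole trick.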

So we found someone who has already done it and borrowed their implementation. Got to love this open source thing :-)

The new ObjectStore implementation is based on Clebert's brilliant Journal code from HornetQ. The same bit of code that makes persistent messaging in HornetQ so staggeringly quick.

The Journal API is not a perfect fit for what we want, but nothing that can't be fixed with another layer of abstraction. One adaptor layer later and we're ready to run another microbenchmark.

First let us get a baseline by using an in-memory ObjectStore (basically a ConcurrentHashMap). Useless for production use, but helpful to establish the runtime needed to execute a basic transaction with two dummy resources and all the log serialization overhead but no actual disk write.

53843 tx/second using 400 threads.

ok, that will do for starters - thanks to lock reduction and other performance optimizations made earlier we're pretty much saturating the quad core i7 CPU. We could probably improve matters a bit by tweaking the ConcurrentHashMap lock striping, but let's move on and swap in the default ObjectStore:

1650 tx/second using 100 threads.

iowait through the roof, CPU mostly idle. dear oh dear. Look on the bright side: that means we've got a lot of scope for improvement.

Adding more threads just causes more scheduling and locking overhead - we're bottlenecked on the number of disk syncs the 2x HDD RAID-1 can handle.

Let's wheel out the secret weapon and plug it in...

20306 tx/second at 400 threads.

yup, you read that right. Don't get too excited though - you won't see that kind of performance improvement in production. We're running an empty transaction with dummy resources - no business logic and no RM communication overhead. Still, pretty sweet huh? Better buy Clebert a beer next time you see him. And one for me too of course.

And if you thought that was good...

Linux has a nifty little library for doing efficient asynchronous I/O operations. If you are running the One True Operating System and you don't mind polluting your Pure Java with a little bit of native code you can drop a .so file into LD_LIBRARY_PATH and leave the other guy to be eaten by that bear:

36491 tx/second at 400 threads.

Coming Soon to a JBossTS release near you. Enjoy.

Must Go Faster

Transaction management systems, like much other middleware, embody a very simple tradeoff: they make life easier for developers at the cost of a runtime overhead. Maximising the ease of use and minimising the overhead are things we spend a fair bit of time thinking about.

Conventional wisdom in the transaction community is that the runtime cost of a classic ACID transaction is dominated by the disk writing needed to create a log for crash recovery purposes. As with much folklore there is some truth in this, but it's not the whole story. For starters, there are some use cases where we don't need to write a log at all.

In transactions that have only one resource it's easy to optimise the 2PC protocol to a 1PC and avoid much of the overhead. But here is the snag: the design of the transaction API in Java EE does not allow developers to communicate metadata about the number of resource managers expected in the transaction ahead of time. In some cases it's not actually known, as it may depend on runtime data values. However, it's disappointing that you still need some parts of the XA protocol overhead even where the transaction is known at design time to be local (native) to a single RM.

xaResource.start(xid, XAResource.TMNOFLAGS);

xaResource.end(xid, XAResource.TMSUCCESS);

xaResource.commit(xid, true); // one-phase commit


is still three network round trips and one disk sync. That's a huge improvement on the eight trips and three syncs needed for a 2PC, but it's nevertheless more than the single trip and sync in the native case with

connection.commit();

Fortunately there is a workaround: define both XA and non-XA ('local-tx' in JBossAS -ds.xml terminology) datasources in the app server and use the local one wherever you know the transaction won't involve other RMs.

Perhaps one day we'll have a less clunky solution - maybe being able to define a single datasource as supporting both XA and non-XA cases, then annotating transactional methods with e.g. @OneResource or @MultiResource to tell the JCA and TM how to manage the connection. Or even being able to escalate an RM local tx to an XA one on demand rather than having to choose in advance, although that would need RM support as well as changes to the XA protocol and Java APIs. Dream On.

Even where it's running with 1PC optimization for a single RM, the transaction manager still provides benefits over the native connection based transaction management. The most critical is the ability to handle certain lifecycle events, in particular beforeCompletion(), a notification phase that allows in-memory caches such as an ORM session to be flushed to stable store and its companion afterCompletion() which allows for resource cleanup. The TM's ability to manage transaction timeouts is also important to prevent poorly written operations from locking up resources for a protracted period.

As with writing logs for 2PC recovery, the management of timeouts is one of those activities we have to do every time, even though it turns out to be required in only a tiny minority of cases. Efficiently managing the rollback of transactions that have exceeded their allotted lifetime is a seemingly trivial overhead compared to the log write and as a result the code for it received little attention until fairly recently. This is where the folklore came to bite us: conventional wisdom dictated that we should focus the performance tuning effort on the I/O paths and not worry too much about functions that just operated on in-memory structures.

WRONG.

For the reasons outlined above, in a typical app server workload there are an awful lot of transactions containing just a single resource and hence not doing a log write. D'Oh. For those use cases the overhead of the in-memory activity in the TM is actually significant. So we sat down, wrote a highly concurrent 1PC microbenchmark test scenario, put it through a profiler, shuddered and went down the pub.

When we'd recovered a bit we tuned the transaction reaper, a background process responsible for timing out transactions. By deferring much of the work as long as possible it turns out to be possible to skip it entirely for many transactions. A short lived tx does not always need to be inserted into the time ordered reaper queue - it may terminate normally long before the reaper needs to take any action. By being lazy we saved a lot of list sorting and the associated locking overhead.
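A rough sketch of the lazy-insertion idea (illustrative only; the names and structure are mine, not the actual reaper code):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.PriorityBlockingQueue;

// Lazy reaper insertion: a transaction is parked in a cheap unordered map
// at begin(), and only promoted into the time-ordered queue if it is still
// running when the reaper next wakes up. Short-lived transactions complete
// first and never pay the sorted-insert and locking cost.
public class LazyReaper {
    static final class Tx {
        final long deadlineMillis;
        volatile boolean completed;
        Tx(long timeoutMillis) { deadlineMillis = System.currentTimeMillis() + timeoutMillis; }
    }

    private final ConcurrentHashMap<Tx, Boolean> holding = new ConcurrentHashMap<>();
    private final PriorityBlockingQueue<Tx> ordered =
            new PriorityBlockingQueue<>(64, (a, b) -> Long.compare(a.deadlineMillis, b.deadlineMillis));

    // begin: O(1), no sorting, no contention on the ordered queue.
    public Tx begin(long timeoutMillis) {
        Tx tx = new Tx(timeoutMillis);
        holding.put(tx, Boolean.TRUE);
        return tx;
    }

    // commit/rollback: O(1) removal from the unordered map.
    public void complete(Tx tx) {
        tx.completed = true;
        holding.remove(tx);
    }

    // Called periodically by the reaper thread: promote survivors into the
    // ordered queue, then deal with anything past its deadline.
    public int reap() {
        for (Tx tx : holding.keySet()) {
            if (holding.remove(tx) != null && !tx.completed) ordered.add(tx);
        }
        int rolledBack = 0;
        Tx head;
        while ((head = ordered.peek()) != null && head.deadlineMillis <= System.currentTimeMillis()) {
            ordered.poll();
            if (!head.completed) rolledBack++; // would trigger rollback here
        }
        return rolledBack;
    }

    public int orderedQueueSize() { return ordered.size(); }
}
```

The common case - a transaction that commits well inside its timeout - touches only the cheap map and never the sorted queue.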

As a result the recent JBossTS releases substantially outperform their predecessors in the single resource case, particularly when scaling to a large number of threads. Upgrade and Enjoy.

Next time: More Speed! Very fast I/O for 2PC logging.

Hello World

So apparently there is this thing called 'blogging', which seems to consist of equal parts narcissism and self-publicity. As far as I can tell the aim of the geek's version of this game is to make yourself insanely attractive to hiring managers who need to fill obscure technical vacancies. Oddly enough this activity not only counts as work, but is actively encouraged by management. Sign me up!

What's that? You want actual content? Oh, umm, right then. Well let's start with something simple shall we? Maybe I should introduce myself. I'm Jonathan Halliday, dev team lead for JBossTS. I spend my days working on transactions code, trying to remove more bugs than I add. After ten years or so I think I'm gradually getting the hang of it, which probably means it's time to find a new challenge. Meanwhile I'll try to spend a little more time communicating about the cool stuff I'm doing instead of just getting on with it. Starting, in the next exciting episode, with some impressive performance improvements. Because hey, who doesn't like faster code, right?

Thursday, February 17, 2011

Jonathan on Memory Resident Transactional Objects

I came across this gem from the JBossTS project lead. Well worth a read. It also reminds me that I need to write up some of the stuff I did over Christmas. Now where did I put that spare hour or so?