Narayana team blog: hardware p0rn

Like most geeks I adore shiny new tech toys. I recently built two machines to host additional virtualized slave nodes for our Hudson cluster. Thanks to the drop in RAM prices I had enough budget to stuff them full (i7 based workstations, so triple channel RAM on a 6 slot board topping out at 24GB). There was even enough left over for an SSD. yay. So today I fired one up and dumped my infamous XA transaction microbenchmark onto it.

The SSD totally failed to outperform the HDD.

So I checked the disk mount points, ran some diagnostics with dd and was still left scratching my head. The test code is basically write only, that being the nature of the beast when working with transaction recovery logs. Fortunately I'd opted for an SSD with the Sandforce controller, which provides for much more symmetric read/write performance compared to some of the competition. The manufacturer specs are 285MB/s read and 275MB/s write. Let's give it a try:

$ dd if=/dev/zero of=/tmp/ssd/foo bs=1M count=4000 oflag=direct
4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB) copied, 19.0998 seconds, 220 MB/s

ok, close enough. The HDD does not do so well:

$ dd if=/dev/zero of=/tmp/hdd/foo bs=1M count=4000 oflag=direct
4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB) copied, 30.5214 seconds, 137 MB/s

So if I can push roughly 1.6x the data onto it, I should be able to log 1.6x as many transactions per second, right? cool.

Umm, No.

25106 tx/second to HDD log.

24996 tx/second to SSD log.

hmmm.

A little bit of fiddling around with the ObjectStore provides another useful data point: a single transaction log record for this test case is a little over 600 bytes - basically two Xids (one for each XAResource) plus transaction id and overhead.
So, 25k tx/s * 600 bytes = 15MB/s give or take. Less than 1/10th of what the SSD should handle.
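
The same back-of-envelope sum, written out as a quick sketch. The inputs are the rounded figures quoted in this post (25k tx/s, 600 bytes per record, the manufacturer's 275MB/s write spec), not fresh measurements.

```python
# Rough log bandwidth from the rounded figures quoted in the post.
TX_PER_SEC = 25_000      # approximate throughput seen against either disk
RECORD_BYTES = 600       # approximate size of one transaction log record
SSD_WRITE_MB_S = 275     # manufacturer's quoted write rate

log_mb_s = TX_PER_SEC * RECORD_BYTES / 1_000_000
fraction_of_ssd = log_mb_s / SSD_WRITE_MB_S

print(f"log traffic: {log_mb_s:.0f} MB/s")
print(f"share of SSD write spec: {fraction_of_ssd:.1%}")
```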

We know the benchmark will run at over 50k tx/s with an in-memory store, so we're definitely bottlenecked on the I/O somewhere, but it's not write bandwidth. With a conventional HDD I'd put my money on the number of physical writes (syncs/forces) a drive head can handle. High performance log stores are designed to do contiguous append to a single file to eliminate head seek latency, so it's something to do with the number of events the device can handle in series. Let's use the filesystem's native block size:
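
To look at sync rate rather than bandwidth, a minimal sketch along these lines appends small records and forces each one to disk. It is not the benchmark used for the numbers in this post, and the function name, record size and count are made up for illustration. It assumes a POSIX-like system and that the path you pass in lives on the drive you care about (a temp directory may well be RAM backed).

```python
import os
import tempfile
import time


def synced_appends_per_sec(path, record_size=600, count=200):
    """Append `count` records to `path`, fsync after each, return appends/second."""
    record = b"\0" * record_size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, record)
            os.fsync(fd)  # ask the OS to push the record to the device
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return count / elapsed


if __name__ == "__main__":
    # A temp directory is used here only so the sketch runs anywhere;
    # substitute a path on the HDD or SSD mount to compare devices.
    with tempfile.TemporaryDirectory() as d:
        rate = synced_appends_per_sec(os.path.join(d, "log"))
        print(f"{rate:.0f} synced appends/s")
```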

$ dd if=/dev/zero of=/tmp/hdd/foo bs=4k count=1000000 oflag=direct
1000000+0 records in
1000000+0 records out
4096000000 bytes (4.1 GB) copied, 139.289 seconds, 29.4 MB/s

$ dd if=/dev/zero of=/tmp/ssd/foo bs=4k count=1000000 oflag=direct
1000000+0 records in
1000000+0 records out
4096000000 bytes (4.1 GB) copied, 158.286 seconds, 25.9 MB/s

and there we have it - the block management is killing us.
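
Turning the two 4k runs into write events per second makes the point a bit more directly. This is just arithmetic on the dd output above, nothing newly measured.

```python
# Figures copied from the two 4k dd runs.
WRITES = 1_000_000
BLOCK_BYTES = 4096
HDD_SECONDS = 139.289
SSD_SECONDS = 158.286


def writes_per_sec(seconds):
    """Number of 4k direct writes completed per second."""
    return WRITES / seconds


def micros_per_write(seconds):
    """Average time per 4k direct write, in microseconds."""
    return seconds / WRITES * 1e6


for name, secs in (("HDD", HDD_SECONDS), ("SSD", SSD_SECONDS)):
    rate = writes_per_sec(secs)
    print(f"{name}: {rate:.0f} writes/s, {micros_per_write(secs):.0f} us each, "
          f"{rate * BLOCK_BYTES / 1e6:.1f} MB/s")
```

Both drives land somewhere around six to seven thousand 4k writes a second, which is the number that matters for a log of small records.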

Looks like I need to spend some time down in the native code and kernel to understand the I/O paths at work here. My feeling is that SSD technology just invalidated a lot of design assumptions that were probably a bit outdated already. SSDs don't do head seeks. The unit of parallelism is the memory bank, not the drive head. A log store designed for SSDs should probably stripe writes over multiple files for lock concurrency and not care about contiguous blocks. Unless the SSD drive firmware is written to assume and optimise for legacy HDD usage patterns. Not that a modern HDD actually does so much in the way of truly contiguous writes anyhow - there is an awful lot of abstraction between the filesystem blocks and the physical layout these days.
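
As a thought experiment only, striping over multiple files might look something like the sketch below. `StripedLog` is a hypothetical class invented for this example; it is not how the ObjectStore works today, and it ignores recovery, record framing and everything else a real log store needs. The idea it shows is simply one lock per file, so writers only contend when they land on the same stripe.

```python
import os
import tempfile
import threading


class StripedLog:
    """Hypothetical log that spreads appends over several files, one lock per file."""

    def __init__(self, directory, stripes=4):
        self._locks = [threading.Lock() for _ in range(stripes)]
        self._fds = [
            os.open(os.path.join(directory, f"stripe-{i}.log"),
                    os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
            for i in range(stripes)
        ]

    def append(self, tx_id, record):
        """Write and sync one record; returns the index of the stripe used."""
        i = tx_id % len(self._fds)
        with self._locks[i]:
            os.write(self._fds[i], record)
            os.fsync(self._fds[i])
        return i

    def close(self):
        for fd in self._fds:
            os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        log = StripedLog(d, stripes=4)
        workers = [threading.Thread(target=log.append, args=(n, b"x" * 600))
                   for n in range(16)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        log.close()
        print(sorted(os.listdir(d)))
```

Whether this actually helps depends on how the drive firmware and filesystem handle concurrent small syncs, which is exactly the part that needs investigating.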

Even setting aside the redesign for SSD optimization, it's clear there is still room for improvement here. On the HDD we should be able to trade transaction latency against batch size to reduce the number of write events and drive up the overall throughput until we saturate the drive bandwidth. On the SSD I'm not so sure we can actually saturate a properly tuned drive on this machine - the CPU can run 50k tx/s but that's only ~30MB/s of log data. Naturally a server class machine is going to have more CPU power without necessarily increasing the number of drives, but if we can retain a ratio of, say, one SSD to every 8 to 16 cores then we should be firmly in CPU bound territory. Welcome to the new era. Your design assumptions are now invalid and your transaction manager needs redesigning to be more CPU efficient. Have a nice day.
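
The batching trade-off can be put into a toy model. It assumes, rather crudely, that the drive completes a fixed number of write events per second whatever their size, until bandwidth runs out. The function and its parameters are made up for illustration; the 7,000 events/s input is an assumption loosely based on the 4k dd run, 137MB/s is the HDD's large-block dd figure, and the 50k tx/s cap is the in-memory store result.

```python
def estimated_tx_per_sec(batch_size, write_events_per_sec, bandwidth_bytes_per_sec,
                         record_bytes=600, cpu_cap_tx_per_sec=50_000):
    """Toy estimate: throughput is capped by write events, drive bandwidth, or CPU."""
    by_events = batch_size * write_events_per_sec
    by_bandwidth = bandwidth_bytes_per_sec / record_bytes
    return min(by_events, by_bandwidth, cpu_cap_tx_per_sec)


if __name__ == "__main__":
    # Illustrative inputs only; see the assumptions stated above.
    for batch in (1, 2, 4, 8, 16):
        print(batch, estimated_tx_per_sec(batch, 7_000, 137_000_000))
```

Under those assumptions the CPU cap is reached at a batch size of around 8, long before the bandwidth limit (about 228k tx/s at 600 bytes a record) comes into play. The model says nothing about the extra latency each transaction pays while waiting for its batch to fill.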

1 comment:

Andrig T Miller said...: One thing to consider with the SSD is the I/O scheduler. For my setup (two 80GB SSDs in RAID-0 configuration), I changed the I/O scheduler to be the noop scheduler. It doesn't make sense to have lots of fancy algorithms to optimize head seeks where there are no head seeks. So, you can boot the kernel with elevator=noop. You should also change the caching in the BIOS to write-back caching, if you can. Finally, you can mount the file system with noatime. These things should help the throughput some, but SSDs tend to do much better than HDDs when the I/O size is much larger than what you have.; March 8, 2011 at 8:50 PM

Tuesday, March 8, 2011
