Monthly Archives: September 2010

The ReadWrite Store: Locality at Scale

The advantage of the standard WORM append-only store over the RWStore is that nothing gets thrown away. This is also its major failing.

The advantage the RWStore has over the append-only store is that it is able to recycle its allocations. But this creates its own problems.

For the WORM store, append-only allocation leads to ever larger data files as the amount of data to be stored increases. But it retains one massive advantage – write locality.

The RWStore, by recycling allocations, does not suffer from disproportionately increased size as the amount of data to be stored increases, but the higher data churn as storage is reallocated leads to the “double whammy” of increased IO and reduced locality.

This combination of increased IO and reduced locality can bring a SATA-based system to its knees under a large load – SCSI does a whole lot better at managing random writes.

A separate but related issue is the potential for matching Solid State Disks with the RWStore. The random access writes and smaller storage requirements appear to be a good fit. When maximising SSD performance, however, there is an issue with “read-back”. This occurs when a write does not fill a sector: the existing content must first be read back, since the write-erase cycle of the SSD requires full sector data. There is therefore an advantage to writing on sector boundaries where possible.
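The alignment point can be sketched in a few lines. This is only an illustration: the 4096-byte `SECTOR_SIZE` and the helper names `align_up`, `needs_read_back` and `pad_to_boundary` are assumptions made for the example, not part of the RWStore.

```python
SECTOR_SIZE = 4096  # assumed sector size in bytes; real devices vary


def align_up(n: int, boundary: int = SECTOR_SIZE) -> int:
    """Round n up to the next multiple of boundary."""
    return -(-n // boundary) * boundary


def needs_read_back(offset: int, length: int, boundary: int = SECTOR_SIZE) -> bool:
    """True if the write starts or ends part-way through a sector."""
    return offset % boundary != 0 or (offset + length) % boundary != 0


def pad_to_boundary(data: bytes, boundary: int = SECTOR_SIZE) -> bytes:
    """Zero-pad data so that its length is a whole number of sectors."""
    return data + b"\x00" * (align_up(len(data), boundary) - len(data))
```

A 100-byte write at offset 0 only partly covers its sector, so `needs_read_back(0, 100)` is true; padding the same data with `pad_to_boundary` gives a full 4096-byte write for which it is false.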

Using a 100M triple BSBM load, we monitored the allocation efficiency of the RWStore. This generated a store size of 20GB and reported an allocation efficiency of close to 99%, meaning that 99% of the allocation “slots” were filled with data. However, it was evident that we reached this high efficiency by eagerly recycling released allocations. As the store got bigger and the number of allocation blocks (each with a minimum of 1024 allocation slots) grew, the locality of a group of, say, 100 allocations was degraded.

The solution was to be a little less eager in our recycling. The number of free slots for each FixedAllocator is maintained. Previously, as soon as this number increased from zero, the FixedAllocator was returned to a free list to allow recycling of the released allocations. By introducing a threshold greater than 1, we could ensure that at least this many allocations would be available when the FixedAllocator was returned to the free list, and that these would have good locality. With a large write cache, a relatively large number of allocations will be made from the same FixedAllocator, so that flushing a single write cache results in groups of writes with good locality.
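The effect of the threshold can be seen in a toy model. This is a minimal sketch rather than the RWStore code: the `FixedAllocator` class here is a simplified stand-in, and `AllocatorPool`, `recycle_threshold` and the method names are hypothetical names chosen for the example.

```python
import bisect


class FixedAllocator:
    """Simplified stand-in for a block of fixed-size allocation slots."""

    def __init__(self, ident, num_slots):
        self.ident = ident
        self.free_slots = list(range(num_slots))  # kept sorted
        self.on_free_list = False


class AllocatorPool:
    """Hypothetical pool that recycles an allocator only once enough slots are free."""

    def __init__(self, slots_per_allocator=1024, recycle_threshold=1):
        self.slots_per_allocator = slots_per_allocator
        self.recycle_threshold = recycle_threshold
        self.allocators = []
        self.free_list = []

    def _new_allocator(self):
        # No recyclable allocator is available, so the store grows.
        allocator = FixedAllocator(len(self.allocators), self.slots_per_allocator)
        self.allocators.append(allocator)
        self.free_list.append(allocator)
        allocator.on_free_list = True
        return allocator

    def allocate(self):
        allocator = self.free_list[0] if self.free_list else self._new_allocator()
        slot = allocator.free_slots.pop(0)  # lowest free slot first
        if not allocator.free_slots:
            self.free_list.remove(allocator)
            allocator.on_free_list = False
        return allocator.ident, slot

    def release(self, ident, slot):
        allocator = self.allocators[ident]
        bisect.insort(allocator.free_slots, slot)
        # Return to the free list only when the threshold is reached.
        if (not allocator.on_free_list
                and len(allocator.free_slots) >= self.recycle_threshold):
            self.free_list.append(allocator)
            allocator.on_free_list = True
```

With `recycle_threshold=1` a single released slot is reused by the very next allocation, wherever it sits in the file. With a larger threshold the pool prefers to extend the store until a full allocator has accumulated a worthwhile run of free slots, trading some file size for locality.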

In addition to the increased locality we have added double buffering to the write cache. This serves two purposes:

1) It pads the output data to the full size of the allocation slot, guaranteeing 4K alignment for larger allocations.

2) It enables write IO elision by merging contiguous data blocks.
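The two purposes listed above can be sketched together. This is an illustrative sketch only: `pad_to_slot` and `merge_contiguous` are hypothetical helper names, writes are modelled as `(offset, bytes)` pairs, and overlapping writes are assumed not to occur.

```python
def pad_to_slot(data: bytes, slot_size: int) -> bytes:
    """Zero-pad data to the full size of its allocation slot."""
    if len(data) > slot_size:
        raise ValueError("data does not fit in the slot")
    return data.ljust(slot_size, b"\x00")


def merge_contiguous(writes):
    """Coalesce (offset, bytes) writes that are adjacent on disk.

    Returns a list of (offset, bytes) pairs sorted by offset, in which
    any run of back-to-back writes has become a single larger write.
    """
    merged = []
    for offset, data in sorted(writes, key=lambda w: w[0]):
        if merged and merged[-1][0] + len(merged[-1][1]) == offset:
            merged[-1][1].extend(data)  # extends the previous write
        else:
            merged.append([offset, bytearray(data)])
    return [(offset, bytes(buf)) for offset, buf in merged]
```

Padding matters for merging as well as for alignment: two neighbouring slots only become one contiguous write if the first is written out to its full slot size. For example, three padded 4K slots at offsets 0, 4096 and 8192 collapse into a single 12K write, while a slot at offset 20480 remains a separate request.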

The combination of increased locality with write elision leads to a significant reduction of IO requests – to less than 20% of the previous approach.

The 100M load on the SATA drive improved from 129 minutes 0 secs to 79 minutes 29 secs, a reduction in load time of about 38% (equivalently, a throughput increase of roughly 62%). The resulting store file grew from 20GB to 24GB.
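The arithmetic behind these figures is easy to check from the two load times and the two store sizes quoted above. The variable names are just for illustration.

```python
before_s = 129 * 60       # 129 minutes 0 secs, in seconds
after_s = 79 * 60 + 29    # 79 minutes 29 secs, in seconds

time_saved = 1 - after_s / before_s   # fraction of the load time removed
speedup = before_s / after_s          # ratio of new throughput to old
size_growth = 24 / 20 - 1             # growth of the store file

print(f"load time reduced by {time_saved:.1%}")
print(f"throughput ratio {speedup:.2f}x")
print(f"store file grew by {size_growth:.0%}")
```

The load runs a little over 1.6 times faster, at the cost of a store file about a fifth larger.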

Initial runs also indicate a 10-15% performance improvement with an SSD.

Other load tests confirm that the SCSI-based systems benefited from the improved locality, but not to the same extent – they were already doing better.

The scalability of this approach was confirmed by a 2 billion triple load of the Uniprot data on a SCSI-based system, at a sustained rate of 30,171 triples per second, producing a store of 170GB in 18 hrs 39 mins.

Clearly there is a lot more that can be done to determine optimal configurations for different systems and to find the sweet spot for different levels of caching and locality.
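As a rough sanity check, the quoted rate, duration and triple count can be compared with one another. The per-triple figure assumes the store size is in decimal gigabytes and uses the rounded count of 2 billion, so treat it as an estimate only.

```python
elapsed_s = 18 * 3600 + 39 * 60   # 18 hrs 39 mins, in seconds
reported_rate = 30_171            # triples per second, as reported
nominal_triples = 2_000_000_000   # the rounded "2 billion"

nominal_rate = nominal_triples / elapsed_s    # rate implied by the rounded count
implied_triples = reported_rate * elapsed_s   # count implied by the reported rate
bytes_per_triple = 170e9 / nominal_triples    # assumes 170 decimal gigabytes

print(f"{nominal_rate:,.0f} triples/s from the rounded count")
print(f"{implied_triples:,} triples implied by the reported rate")
print(f"about {bytes_per_triple:.0f} bytes of store per triple")
```

The rounded count gives a rate slightly below the reported one, which would fit a load of a little over 2 billion triples, although the exact count is not stated here.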

At present it is apparent that there is a significant difference between SATA and SCSI, one that can be greatly mitigated by the locality and buffering approaches we have taken. However, it still appears that for very large stores SCSI controllers do a much better job of maintaining adequate random write throughput.