Monthly Archives: November 2008

new algorithm for scale-out joins

The new join strategy (pipeline joins) is now running. Jini join performance 5x better than it used to be (with nested subquery). This puts it at 4x slower than the nested subquery joins for local triple store. Let me put that in perspective — scale-out distributed joins running at only 1/4th the performance of optimized local joins as measured WITHOUT any additional hardware and on a resource constrained platform (my laptop). I fully expect the pipeline joins to be competetive with local join performance in a scale-out environment.

The “pipeline” join always reads from a local index partition and “joins” intermediate binding sets propagated from the prior join dimension. This means that each join dimension can run in parallel, can optimize reads by re-ordering the binding sets in order to have good locality on the index partition, eliminates access path tests for binding sets that produce the same bindings, etc. When the index is partitioned, there will be one join task per index partition touched by the join and each join task can be running on a different machine. While the original join strategy (nested subquery) is slightly better for the local triple store, but the pipeline join is the hands down winner for scale-out. It completely avoids many of the bottlenecks for jini.

In reality, the numbers may be much better since this is measured on a laptop, which is resource constrained – especially for jini, and the data size used for correctness testing is small so the potential benefits of the new join algorithm are more limited. Also, the new “pipeline” joins are fully distributed while the old join algorithm only distributed the data, but not the computation of the JOIN itself. This means that we can really leverage all the hardware. We already see linear scale-out for data load, and now we hope to see it for RDFS closure and high-level query as well.

The next step is to measure performance on some modest to large data sets and establish a baseline for comparison of scale-out join performance. If the scale-out join performance is good, and it should, then we will go back and test with dynamic index partitioning enabled.

Also, bloom filters are enabled for scale-out and now used by the joins. This is a great win for complex joins, such as LUBM Q9. The bloom filter is an in memory data structure that can very rapidly determine whether a fully bound point test is NOT an index hit. When the bloom filter reports “no”, you are done and you do not touch the index. When the bloom filter reports “yes”, you have to read the index to verify that there really is a hit.

Bloom filters are a stochastic data structure, require about 1 byte per index entry, and must be provisioned up front for an expected number of index entries. So if you expect 10M triples, that is a 10MB data structure. Since bloom filters do not scale-up, they are automatically disabled once the #of index entries in the mutable B+Tree exceeds about 2M tuples. BUT, they are great for scale-out since the data on the mutable B+Tree is migrated into perfect index segments, and we generate perfect fit bloom filters for those index segments. Every time we overflow a journal, we wind up with a new (empty) B+Tree to absorb writes, so the bloom filter is automatically re-enabled. Further, when index partition splits are also enabled, the #of entries in an index partition should be such that the bloom filters are always on. This should be a drammatic boost for the scale-out system over an equivalent scale-up system and is yet another way in which scale-out can leverage more resources.

Things have been quite, but that’s because we’ve been working quitely! With scale-out join performance nailed, we will be ready for an initial release.