Monthly Archives: August 2011

Understanding query performance

For up to date information, see the QueryOptimization page on the wiki.

Bigdata on a Journal (single machine) is great at fast evaluation of selective queries. A selective query is one where an index can be used to rapidly recover exactly the data on which the query needs to operate. In contrast, an unselective queries has to read a lot of data from under constrained access paths. On the Journal, that translates into random IOs that will eventually saturate your disk. If that is your workload, consider more spindles, SSD, or a cluster. A bigdata federation (cluster) organizes the data differently on the disk (most of the data is physically organized on the disk in key order) so we can do sustained sequential IOs for unselective queries. This means that unselective queries on a cluster can run with much less IO wait than on a Journal. However, a Journal will have less latency for selective queries since it has less coordination overhead. A classic throughput versus latency tradeoff.

Bigdata is designed to pipeline query evaluation very efficiently. There is no practical limit on the #of results which bigdata can return when it is pipelining a query, but unselective queries may saturate the disk with IO Wait. If you are trying to materialize too many results at once in the JVM, then you can run into problems. For example, this can happen if your query has a large result set and uses ORDER BY or DISTINCT. Both ORDER BY and DISTINCT force total materialization of the query result set even if you use OFFSET/LIMIT. Materialization of large result sets is expensive (today) because it puts a lot of heap pressure on the JVM — an issue which is addressed by our forthcoming analytic query milestone release — see below.

During pipelined query evaluation, intermediate results flow from one query operator to the next in chunks. Different operators in the query plan run simultaneously as they consume their inputs. If one operator is producing solutions faster than another operator can consume them, then multiple invocations of the slower operator can run simultaneously. The total CPU and RAM burden during pipelined evaluation is managed by limiting the size of the chunks flowing through the query plan, the #of concurrent instances of an operator which may be executing simultaneously for the same query, and the capacity of the input queue for each operator in the query plan. Low level annotations are available to control all of these parameters, but that is generally below the level at which users will do performance tuning. The other critical parameter is the #of queries running concurrently on the QueryEngine (discussed below).

You can do quite a bit to make queries fast and a highly tuned application will really sing, but first you need to understand why your query is slow. Read on.

There are several ways in which things can go wrong for query performance:
1. Bad query (the query does not do what you intended).
2. Bad join ordering (bigdata reorders joins, but sometimes it gets this wrong).
3. Bad query plan (sometimes an alternative plan can be more efficient, e.g., subqueries with hash joins).
4. Too much GC pressure (see below).
5. Too much IO Wait.
6. Too many concurrent queries.

We are going to look at three main indicators: IO Wait, CPU, and GC time. These will be your guideposts in understanding what is going on inside your server. However, before you get started, you should review the configuration of your server. Make sure that you are using a server mode JVM (-server), that you have given your JVM enough memory, but not ALL your memory since the OS needs to use memory to buffer the disk as well. A good rule of thumb is to give the JVM 1/2 of the memory on the server and let the OS use the rest to buffer the disk. And never, ever let the JVM swap. Java does not handle swapping gracefully since it needs to scan memory when it does a GC pass, and performance will tank if GC is hitting the disk. You also need to make sure you are using read-only connection when you run your queries since you will have much higher potentially concurrency that way (the B+Tree is single threaded for mutation, but fully concurrent for readers). Then there are a slew of other performance tuning parameters to consider, from the backing Journal mode (RWStore scales better than the WORM), to the branching factors, the B+Tree “write retention queue” capacity, etc. See performance optimizations on the wiki for some tips.

The first thing to do is look at the DISK activity and see what is happening there. If the disk is saturated, then you are IO bound and you need to think about whether your query is doing what you want it to do, whether you want something you can have (realistic expectations), and whether the join ordering is bad (this can lead to excess IO). vmstat is a great tool for this under un*x. It reports blocks in, blocks out, CPU utilization, and the all important IO Wait. Depending on your requirements, you may be able to “fix” an IO Wait problem by adding spindles or using SSD, but first make sure that the query plan is good.

Next, take a look at the CPU. If you are CPU bound with little disk activity and low heap pressure (see below), then you might be looking at a bad join ordering problem with a tight loop in memory. Review the join plan carefully. It is also possible to hit concurrency bottlenecks. If you are running on a server with a LOT of cores (16, 24, or more) then you should look at the context switching (vmstat again). You can get into CAS spins and lock hot spots on a server with a lot of cores which do not show up on a 8 core machine. If you believe that this is what is going on, you can put a profiler on the JVM and see where it is spending its time. You can get a lot of red herrings this way since the instrumentation of the JVM by the profiler can cause some methods which do basically nothing but are called frequently to be reported as “false” hot spots. Signatures of real hot spots are locks, or lock pools (stripped locks) which are hot or CAS contention (AtomicLong and friends). We put a lot of effort into eliminating this sort of hot spot, but sometimes a new one can show up.

Finally, you also need to pay attention to GC. If there is too much pressure on the JVM heap, then application throughput falls through the floor. Bigdata generally controls this pretty well, but you can get pathological cases. Depending on how you have the JVM setup, this might appear as one core being at 100% activity while the others are idle or all cores might be at 100% (parallel GC). Note that there will always be very low IO wait if you are GC bound. If you have significant IO Wait, then you do not need to look at the heap pressure.

You can use jstat to report on GC at the command line or jvisualvm to see this in a UI. You can also use any number of excellent Java profilers, such as YourKit. If you are seeing 20% GC time, then you are getting into a high heap pressure scenario and your throughput is dropping. If it is about 50%, then 1/2 of the CPU time is going to garbage collection and your throughput has fallen by AT LEAST 50%. Heap pressure is driven by the object creation and retention rate. Make sure that you are not running too many queries in parallel for the server resources you have available. The NanoSparqlServer uses a thread pool to control the #of application level query threads. If you are using the Sesame API, then you have to control the query parallelism yourself. If you have a query which is creating a lot of heap pressure, then see if you can make your query more selective, change it so it does not materialize as much data on the JVM heap (ORDER BY forces everything to be materialized — even if you use it with OFFSET and LIMIT), or explore some of the low level parameters for tuning the QueryEngine (experts only). We are also working on an analytic package for SPARQL which will address the heap pressure issue directly — see below.

Now that you have an idea of where the bottleneck lies, it is time to look at your query. Bigdata reorders joins. This is done based on the fast range counts of the different triple/quad patterns in the query. The algorithm basically starts with the most selective access path and the best index for that and propagates variable bindings in order to estimate the as-bound cardinality and hence the order in which the subsequent joins should be executed. This works great much of the time, but it can get the join order wrong (cardinality estimation error can increase exponentially, so bad join orders do happen). If you have a bad join order you can reorder the joins by hand, but first you need to disable the join optimizer. See com.bigdata.rdf.sail.QueryHints for more about this and other interesting bits query hints, but basically you need to add this to your query:

Use the explain option to the NanoSparqlServer to see the gory details of the query evaluation — just add &explain to the query URL. It will paint an HTML page with the original SPARQL query, the Abstract Syntax Tree (AST), the bigdata query plan (made up of bigdata operators, aka “bops”), and a table full of interesting statistics which you can paste into a worksheet. Since the response is HTML not SPARQL or RDF, you need to do this in a web browser or save the response off to a file and look at it in a browser. You can also get at this by turning on logging for some different classes, but the nicest way to get all this stuff is using explain with the NanoSparqlServer. Some of the data reported by explain is relatively self-explanatory, things like the join-hit ratios, the #of solutions flowing into and out of each bigdata operator, and the join order. There is a lot of more arcane information in there as well.

So, look at your queries, look at the explanation of the query, and look at the CPU, IO Wait, and GC time. Those are the main guideposts for understanding why a given query might not have the performance you expect. You also need to think carefully about your data, your ontology / schema and your query and make sure that you understand the interactions which are really going on.

Analytic query

The next milestone release in our roadmap is analytic query support. This includes support for aggregation (GROUP BY and select expressions), subquery with temporary solution sets, and new bigdata operators for DISTINCT and ORDER BY.

The analytic query package does NOT suffer from JVM heap pressure. What we’ve done is develop a memory manager for the Java native process heap, NOT the JVM object heap. The MemoryManager is pretty much the RWStore for main memory, even terabytes of main memory. It is 100% Java and works with NIO buffers to manage native memory. We use the MemoryManager to put very large working sets onto the native process heap, rather than the JVM object heap. The analytic query package also includes a new hash tree index structure, the HTree, which we use for DISTINCT and GROUP BY on large result sets. We run the HTree against the MemoryManager so the data stays in RAM, but stays off the JVM object heap. Finally, we will be introducing a new Runtime Query Optimizer (RTO). The RTO is very slick and gives excellent performance on unselective queries by doing runtime optimization of the query plan and paying attention to the actual correlations which exist in the data for the query. It will be fantastic on a cluster doing heaving analytic queries.


Bigdata 1.0.1 Release

This is a bigdata (R) release. This release is capable of loading 1B triples in
under one hour on a 15 node cluster. JDK 1.6 is required.

Bigdata(R) is a horizontally scaled open source architecture for indexed data
with an emphasis on semantic web data architectures. Bigdata operates in both
a single machine mode (Journal) and a cluster mode (Federation). The Journal
provides fast scalable ACID indexed storage for very large data sets. The
federation provides fast scalable shard-wise parallel indexed storage using
dynamic sharding and shard-wise ACID updates. Both platforms support fully
concurrent readers with snapshot isolation.

Distributed processing offers greater throughput but does not reduce query or
update latency. Choose the Journal when the anticipated scale and throughput
requirements permit. Choose the Federation when the administrative and machine
overhead associated with operating a cluster is an acceptable tradeoff to have
essentially unlimited data scaling and throughput.

See [1,2,8] for instructions on installing bigdata(R), [4] for the javadoc, and
[3,5,6] for news, questions, and the latest developments. For more information
about SYSTAP, LLC and bigdata, see [7].

Starting with this release, we offer a WAR artifact [8] for easy installation of
the Journal mode database. For custom development and cluster installations we
recommend checking out the code from SVN using the tag for this release. The
code will build automatically under eclipse. You can also build the code using
the ant script. The cluster installer requires the use of the ant script.

You can checkout this release from the following URL:

Bug fixes:

– (Unicode clean schema
names in the sparse row store).

– (TermIdEncoder should
use more bits for scale-out).

– (OSX requires specialized
performance counter collection classes).

– (BigdataValueFactory.asValue()
must return new instance when DummyIV is used).

– (TermIdEncoder limits
Journal to 2B distinct RDF Values per triple/quad store instance).

– (SPO not Serializable
exception in SIDS mode (scale-out)).

– (ClassCastException when
querying with binding-values that are not known to the database).

– (UnsupportedOperatorException
for some SPARQL queries).

– (Query failure when
comparing with non materialized value).

– (RWStore reports
“FixedAllocator returning null address, with freeBits”.)

– (NamedGraph pattern
fails to bind graph variable if only one binding exists.)

– (log4j – slf4j bridge.)

Note: Some of these bug fixes require data migration. For details, see

New features:

– Single machine data storage to ~50B triples/quads (RWStore);
– Simple embedded and/or webapp deployment (NanoSparqlServer);
– 100% native SPARQL 1.0 evaluation with lots of query optimizations;

Feature summary:

– Triples, quads, or triples with provenance (SIDs);
– Fast RDFS+ inference and truth maintenance;
– Clustered data storage is essentially unlimited;
– Fast statement level provenance mode (SIDs).

The road map [3] for the next releases includes:

– High-volume analytic query and SPARQL 1.1 query, including aggregations;
– Simplified deployment, configuration, and administration for clusters; and
– High availability for the journal and the cluster.

For more information, please see the following links:


About bigdata: