Understanding query performance

For up to date information, see the QueryOptimization page on the wiki.

Bigdata on a Journal (single machine) is great at fast evaluation of selective queries. A selective query is one where an index can be used to rapidly recover exactly the data on which the query needs to operate. In contrast, an unselective queries has to read a lot of data from under constrained access paths. On the Journal, that translates into random IOs that will eventually saturate your disk. If that is your workload, consider more spindles, SSD, or a cluster. A bigdata federation (cluster) organizes the data differently on the disk (most of the data is physically organized on the disk in key order) so we can do sustained sequential IOs for unselective queries. This means that unselective queries on a cluster can run with much less IO wait than on a Journal. However, a Journal will have less latency for selective queries since it has less coordination overhead. A classic throughput versus latency tradeoff.

Bigdata is designed to pipeline query evaluation very efficiently. There is no practical limit on the #of results which bigdata can return when it is pipelining a query, but unselective queries may saturate the disk with IO Wait. If you are trying to materialize too many results at once in the JVM, then you can run into problems. For example, this can happen if your query has a large result set and uses ORDER BY or DISTINCT. Both ORDER BY and DISTINCT force total materialization of the query result set even if you use OFFSET/LIMIT. Materialization of large result sets is expensive (today) because it puts a lot of heap pressure on the JVM — an issue which is addressed by our forthcoming analytic query milestone release — see below.

During pipelined query evaluation, intermediate results flow from one query operator to the next in chunks. Different operators in the query plan run simultaneously as they consume their inputs. If one operator is producing solutions faster than another operator can consume them, then multiple invocations of the slower operator can run simultaneously. The total CPU and RAM burden during pipelined evaluation is managed by limiting the size of the chunks flowing through the query plan, the #of concurrent instances of an operator which may be executing simultaneously for the same query, and the capacity of the input queue for each operator in the query plan. Low level annotations are available to control all of these parameters, but that is generally below the level at which users will do performance tuning. The other critical parameter is the #of queries running concurrently on the QueryEngine (discussed below).

You can do quite a bit to make queries fast and a highly tuned application will really sing, but first you need to understand why your query is slow. Read on.

There are several ways in which things can go wrong for query performance:
1. Bad query (the query does not do what you intended).
2. Bad join ordering (bigdata reorders joins, but sometimes it gets this wrong).
3. Bad query plan (sometimes an alternative plan can be more efficient, e.g., subqueries with hash joins).
4. Too much GC pressure (see below).
5. Too much IO Wait.
6. Too many concurrent queries.

We are going to look at three main indicators: IO Wait, CPU, and GC time. These will be your guideposts in understanding what is going on inside your server. However, before you get started, you should review the configuration of your server. Make sure that you are using a server mode JVM (-server), that you have given your JVM enough memory, but not ALL your memory since the OS needs to use memory to buffer the disk as well. A good rule of thumb is to give the JVM 1/2 of the memory on the server and let the OS use the rest to buffer the disk. And never, ever let the JVM swap. Java does not handle swapping gracefully since it needs to scan memory when it does a GC pass, and performance will tank if GC is hitting the disk. You also need to make sure you are using read-only connection when you run your queries since you will have much higher potentially concurrency that way (the B+Tree is single threaded for mutation, but fully concurrent for readers). Then there are a slew of other performance tuning parameters to consider, from the backing Journal mode (RWStore scales better than the WORM), to the branching factors, the B+Tree “write retention queue” capacity, etc. See performance optimizations on the wiki for some tips.

The first thing to do is look at the DISK activity and see what is happening there. If the disk is saturated, then you are IO bound and you need to think about whether your query is doing what you want it to do, whether you want something you can have (realistic expectations), and whether the join ordering is bad (this can lead to excess IO). vmstat is a great tool for this under un*x. It reports blocks in, blocks out, CPU utilization, and the all important IO Wait. Depending on your requirements, you may be able to “fix” an IO Wait problem by adding spindles or using SSD, but first make sure that the query plan is good.

Next, take a look at the CPU. If you are CPU bound with little disk activity and low heap pressure (see below), then you might be looking at a bad join ordering problem with a tight loop in memory. Review the join plan carefully. It is also possible to hit concurrency bottlenecks. If you are running on a server with a LOT of cores (16, 24, or more) then you should look at the context switching (vmstat again). You can get into CAS spins and lock hot spots on a server with a lot of cores which do not show up on a 8 core machine. If you believe that this is what is going on, you can put a profiler on the JVM and see where it is spending its time. You can get a lot of red herrings this way since the instrumentation of the JVM by the profiler can cause some methods which do basically nothing but are called frequently to be reported as “false” hot spots. Signatures of real hot spots are locks, or lock pools (stripped locks) which are hot or CAS contention (AtomicLong and friends). We put a lot of effort into eliminating this sort of hot spot, but sometimes a new one can show up.

Finally, you also need to pay attention to GC. If there is too much pressure on the JVM heap, then application throughput falls through the floor. Bigdata generally controls this pretty well, but you can get pathological cases. Depending on how you have the JVM setup, this might appear as one core being at 100% activity while the others are idle or all cores might be at 100% (parallel GC). Note that there will always be very low IO wait if you are GC bound. If you have significant IO Wait, then you do not need to look at the heap pressure.

You can use jstat to report on GC at the command line or jvisualvm to see this in a UI. You can also use any number of excellent Java profilers, such as YourKit. If you are seeing 20% GC time, then you are getting into a high heap pressure scenario and your throughput is dropping. If it is about 50%, then 1/2 of the CPU time is going to garbage collection and your throughput has fallen by AT LEAST 50%. Heap pressure is driven by the object creation and retention rate. Make sure that you are not running too many queries in parallel for the server resources you have available. The NanoSparqlServer uses a thread pool to control the #of application level query threads. If you are using the Sesame API, then you have to control the query parallelism yourself. If you have a query which is creating a lot of heap pressure, then see if you can make your query more selective, change it so it does not materialize as much data on the JVM heap (ORDER BY forces everything to be materialized — even if you use it with OFFSET and LIMIT), or explore some of the low level parameters for tuning the QueryEngine (experts only). We are also working on an analytic package for SPARQL which will address the heap pressure issue directly — see below.

Now that you have an idea of where the bottleneck lies, it is time to look at your query. Bigdata reorders joins. This is done based on the fast range counts of the different triple/quad patterns in the query. The algorithm basically starts with the most selective access path and the best index for that and propagates variable bindings in order to estimate the as-bound cardinality and hence the order in which the subsequent joins should be executed. This works great much of the time, but it can get the join order wrong (cardinality estimation error can increase exponentially, so bad join orders do happen). If you have a bad join order you can reorder the joins by hand, but first you need to disable the join optimizer. See com.bigdata.rdf.sail.QueryHints for more about this and other interesting bits query hints, but basically you need to add this to your query:
PREFIX BIGDATA_QUERY_HINTS: <http://www.bigdata.com/queryHints#com.bigdata.rdf.sail.QueryHints.optimizer=None>

Use the explain option to the NanoSparqlServer to see the gory details of the query evaluation — just add &explain to the query URL. It will paint an HTML page with the original SPARQL query, the Abstract Syntax Tree (AST), the bigdata query plan (made up of bigdata operators, aka “bops”), and a table full of interesting statistics which you can paste into a worksheet. Since the response is HTML not SPARQL or RDF, you need to do this in a web browser or save the response off to a file and look at it in a browser. You can also get at this by turning on logging for some different classes, but the nicest way to get all this stuff is using explain with the NanoSparqlServer. Some of the data reported by explain is relatively self-explanatory, things like the join-hit ratios, the #of solutions flowing into and out of each bigdata operator, and the join order. There is a lot of more arcane information in there as well.

So, look at your queries, look at the explanation of the query, and look at the CPU, IO Wait, and GC time. Those are the main guideposts for understanding why a given query might not have the performance you expect. You also need to think carefully about your data, your ontology / schema and your query and make sure that you understand the interactions which are really going on.

Analytic query

The next milestone release in our roadmap is analytic query support. This includes support for aggregation (GROUP BY and select expressions), subquery with temporary solution sets, and new bigdata operators for DISTINCT and ORDER BY.

The analytic query package does NOT suffer from JVM heap pressure. What we’ve done is develop a memory manager for the Java native process heap, NOT the JVM object heap. The MemoryManager is pretty much the RWStore for main memory, even terabytes of main memory. It is 100% Java and works with NIO buffers to manage native memory. We use the MemoryManager to put very large working sets onto the native process heap, rather than the JVM object heap. The analytic query package also includes a new hash tree index structure, the HTree, which we use for DISTINCT and GROUP BY on large result sets. We run the HTree against the MemoryManager so the data stays in RAM, but stays off the JVM object heap. Finally, we will be introducing a new Runtime Query Optimizer (RTO). The RTO is very slick and gives excellent performance on unselective queries by doing runtime optimization of the query plan and paying attention to the actual correlations which exist in the data for the query. It will be fantastic on a cluster doing heaving analytic queries.

Leave a Reply