Monthly Archives: March 2012


We’ve just added support for SPARQL 1.1 UPDATE. This is available from r6172 in SVN and will be part of our next milestone release. You can use it through the Sesame API and the NanoSparqlServer.

Check it out and let us know what you think.
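
If you want to try an update over HTTP, the sketch below builds a request in the shape defined by the SPARQL 1.1 Protocol: a POST carrying a form-encoded update parameter. It assumes the server accepts that standard form. The endpoint URL and the example triple are placeholders, not values from this post, and the request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint URL; substitute the address of your own server.
ENDPOINT = "http://localhost:8080/sparql"


def build_update_request(endpoint: str, update: str) -> Request:
    """Build (but do not send) a SPARQL 1.1 Protocol update request."""
    body = urlencode({"update": update}).encode("utf-8")
    return Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Placeholder update; the IRIs are examples only.
update = 'INSERT DATA { <http://example.org/s> <http://example.org/p> "o" }'
request = build_update_request(ENDPOINT, update)
# To execute it against a running server: urllib.request.urlopen(request)
```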

Custom Functions

We’ve added a new page to the wiki which documents how to write your own custom functions. The wiki page includes some examples and links you to heavily documented source code in SVN.

Bigdata uses a vectored query engine. Chunks of solutions flow through the query plan operators. There is parallelism across queries, across operators within a query, and within an operator (multiple instances of the same operator can be evaluated in parallel). Operators broadly break down into those which operate on solutions and those which operate on value expressions. The former are vectored, operate on chunks of solutions at a time, and have access to the indices. The latter are not vectored, operate on a single solution at a time, and do not have access to the indices.
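
To make the distinction concrete, here is a toy model of the idea in Python. It is not bigdata code and every name in it is made up for illustration: the operator consumes and emits whole chunks of solutions, while the value expression it applies only ever sees one solution at a time.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Solution = Dict[str, int]   # one set of variable bindings
Chunk = List[Solution]      # a batch of solutions


def chunked(solutions: Iterable[Solution], size: int) -> Iterator[Chunk]:
    """Group a stream of solutions into chunks of at most `size`."""
    chunk: Chunk = []
    for solution in solutions:
        chunk.append(solution)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def filter_operator(
    chunks: Iterable[Chunk], value_expr: Callable[[Solution], bool]
) -> Iterator[Chunk]:
    """A vectored operator: takes chunks in and passes chunks on."""
    for chunk in chunks:
        # The value expression is evaluated once per solution.
        kept = [solution for solution in chunk if value_expr(solution)]
        if kept:
            yield kept


def is_even(solution: Solution) -> bool:
    """A value expression: a single solution in, a single result out."""
    return solution["x"] % 2 == 0


solutions = [{"x": i} for i in range(10)]
result = list(filter_operator(chunked(solutions, 4), is_even))
```

With a chunk size of 4, the ten input solutions arrive as three chunks and the filter passes three smaller chunks downstream.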

People who write custom functions need to be aware of IVs, the “Internal Value” objects used to represent RDF Values inside bigdata. There are many different kinds of IVs, including those which are fully inline (supporting xsd datatypes, etc.) and those which are assigned by an index (TERM2ID or BLOBS, depending on the size of the Value). IVs are used directly in the statement indices and in query processing.

Solutions flowing through a bigdata query are modeled using IVs. RDF Values in the query are batch resolved to IVs when the query is compiled and then “cached” on the IV. This “IVCache” is the critical bit of glue which lets you access the materialized RDF Value in a custom function. There are methods which encapsulate the work required to turn an IV into a Value and a Value into an IV. You can use those methods and ignore the IV interface for the most part, but if you put in a little more effort you can often dramatically improve the performance of your custom function.

Bigdata tries to avoid RDF Value materialization whenever possible. IVs are more compact, are faster to process, and do not require lookups against the lexicon indices. If the query engine decides that it needs to materialize some variable before evaluating a filter or a projection, then it will do that automatically. Custom functions which can process IVs natively are significantly faster than those which rely on materialized RDF Values. These functions have a “NEVER” materialization requirement. Many functions rely on materialized Values, but can use a “fast path” to quickly drop arguments which are not valid for that function. For example, functions which require literals as arguments can test on IV.isLiteral() and throw a SparqlTypeErrorException if the argument is not a literal. These functions have a “SOMETIMES” materialization requirement. Then there are functions which “ALWAYS” need materialized Values. Often you can convert an ALWAYS function into a SOMETIMES function with a little bit more work and get a big performance boost for your efforts.
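
The fast path can be pictured with a small toy model. This is a sketch in Python, not bigdata's Java API: the IV, Lexicon and SparqlTypeError classes below are simplified stand-ins invented for illustration, and the toy function triggers materialization itself, whereas in bigdata the query engine does that for you. The point is only that the invalid argument is rejected from the IV alone, before any lexicon lookup happens.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IV:
    """Hypothetical stand-in for an internal value (not the real interface)."""
    is_literal: bool
    cached_value: Optional[str] = None  # plays the role of the IVCache


class SparqlTypeError(Exception):
    """Stand-in for the error raised when an argument has the wrong type."""


class Lexicon:
    """Simulated lexicon that counts how many lookups were needed."""

    def __init__(self) -> None:
        self.lookups = 0

    def materialize(self, iv: IV) -> str:
        """Return the cached value, doing a simulated lookup if it is missing."""
        if iv.cached_value is None:
            self.lookups += 1
            iv.cached_value = "resolved"  # placeholder for the real RDF Value
        return iv.cached_value


def str_length(iv: IV, lexicon: Lexicon) -> int:
    """A SOMETIMES-style function: inspect the IV first, materialize second."""
    if not iv.is_literal:
        # Fast path: reject the argument without touching the lexicon.
        raise SparqlTypeError("argument is not a literal")
    return len(lexicon.materialize(iv))


lexicon = Lexicon()
try:
    str_length(IV(is_literal=False), lexicon)
    rejected = False
except SparqlTypeError:
    rejected = True
length = str_length(IV(is_literal=True), lexicon)
```

Only the literal argument costs a lookup. An ALWAYS-style version of the same function would have paid for two.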

Have fun!

Updated BSBM v3.1 Results (53712 QMpH)

Someone was asking for BSBM v3.1 results. Here are some from the current revision in SVN against an Apple Mac Mini. Try it out on your server. You can follow the benchmarking guide on our wiki.

Scale factor:           284826
Number of warmup runs:  50
Number of clients:      16
Seed:                   1075
Number of query mix runs (without warmups): 500 times
min/max Querymix runtime: 0.7289s / 1.6698s
Total runtime (sum):    525.696 seconds
Total actual runtime:   33.512 seconds
QMpH:                   53712.44 query mixes per hour
CQET:                   1.05139 seconds average runtime of query mix
CQET (geom.):           1.04659 seconds geometric mean runtime of query mix

Metrics for Query:      1
Count:                  500 times executed in whole run
AQET:                   0.039063 seconds (arithmetic mean)
AQET(geom.):            0.036439 seconds (geometric mean)
QPS:                    401.58 Queries per second
minQET/maxQET:          0.00889232s / 0.11675030s
Average result count:   7.98
min/max result count:   0 / 10
Number of timeouts:     0

Metrics for Query:      2
Count:                  3000 times executed in whole run
AQET:                   0.040905 seconds (arithmetic mean)
AQET(geom.):            0.038344 seconds (geometric mean)
QPS:                    383.49 Queries per second
minQET/maxQET:          0.00988646s / 0.20486457s
Average result count:   19.48
min/max result count:   6 / 36
Number of timeouts:     0

Metrics for Query:      3
Count:                  500 times executed in whole run
AQET:                   0.049103 seconds (arithmetic mean)
AQET(geom.):            0.046191 seconds (geometric mean)
QPS:                    319.47 Queries per second
minQET/maxQET:          0.01107620s / 0.23461456s
Average result count:   5.47
min/max result count:   0 / 10
Number of timeouts:     0

Metrics for Query:      4
Count:                  500 times executed in whole run
AQET:                   0.048209 seconds (arithmetic mean)
AQET(geom.):            0.045754 seconds (geometric mean)
QPS:                    325.39 Queries per second
minQET/maxQET:          0.01487138s / 0.12486670s
Average result count:   7.56
min/max result count:   0 / 10
Number of timeouts:     0

Metrics for Query:      5
Count:                  0 times executed in whole run
AQET:                   0.000000 seconds (arithmetic mean)
AQET(geom.):            NaN seconds (geometric mean)
QPS:                    Infinity Queries per second
minQET/maxQET:          179769313486231570000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.00000000s / 0.00000000s
Average result count:   0.00
min/max result count:   2147483647 / -2147483648
Number of timeouts:     0

Metrics for Query:      7
Count:                  2000 times executed in whole run
AQET:                   0.080021 seconds (arithmetic mean)
AQET(geom.):            0.076796 seconds (geometric mean)
QPS:                    196.04 Queries per second
minQET/maxQET:          0.02779225s / 0.33339837s
Average result count:   11.97
min/max result count:   1 / 100
Number of timeouts:     0

Metrics for Query:      8
Count:                  1000 times executed in whole run
AQET:                   0.043752 seconds (arithmetic mean)
AQET(geom.):            0.040962 seconds (geometric mean)
QPS:                    358.54 Queries per second
minQET/maxQET:          0.01055718s / 0.22980238s
Average result count:   4.85
min/max result count:   0 / 19
Number of timeouts:     0

Metrics for Query:      9
Count:                  2000 times executed in whole run
AQET:                   0.030722 seconds (arithmetic mean)
AQET(geom.):            0.028557 seconds (geometric mean)
QPS:                    510.61 Queries per second
minQET/maxQET:          0.00424333s / 0.11850076s
Average result (Bytes): 6861.40
min/max result (Bytes): 1519 / 13057
Number of timeouts:     0

Metrics for Query:      10
Count:                  1000 times executed in whole run
AQET:                   0.038099 seconds (arithmetic mean)
AQET(geom.):            0.035781 seconds (geometric mean)
QPS:                    411.74 Queries per second
minQET/maxQET:          0.00881888s / 0.17458824s
Average result count:   1.78
min/max result count:   0 / 9
Number of timeouts:     0

Metrics for Query:      11
Count:                  500 times executed in whole run
AQET:                   0.030195 seconds (arithmetic mean)
AQET(geom.):            0.027771 seconds (geometric mean)
QPS:                    519.51 Queries per second
minQET/maxQET:          0.00423775s / 0.09756225s
Average result count:   10.00
min/max result count:   10 / 10
Number of timeouts:     0

Metrics for Query:      12
Count:                  500 times executed in whole run
AQET:                   0.032718 seconds (arithmetic mean)
AQET(geom.):            0.030581 seconds (geometric mean)
QPS:                    479.46 Queries per second
minQET/maxQET:          0.00585602s / 0.10520701s
Average result (Bytes): 1476.21
min/max result (Bytes): 1446 / 1509
Number of timeouts:     0

This result is quoted for 16 concurrent clients, 50 warmup trials and 500 presentations of the query mixes. The database was the bigdata RWStore running on a single machine. These results were obtained against branches/BIGDATA_RELEASE_1_1_0 from SVN r6122. The machine is a dual core i7 (four cores total) with 4MB shared cache @ 2.7GHz running Ubuntu 11 (Natty) with 16GB of DDR3 1333MHz RAM and a single SATA3 256GB SSD drive (a 2011 Apple Mac Mini). IO utilization was approximately 0%. CPU utilization was 65% during the run. The JVM was Oracle Java 1.6.0_27 using
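
The headline figures can be cross-checked from the summary block. The sketch below assumes that QMpH is the number of query mixes divided by the actual (wall clock) runtime in hours, and that CQET is the summed runtime divided by the number of mixes. That reading is inferred from the numbers above rather than quoted from the BSBM documentation.

```python
# Values copied from the summary block of the run.
query_mixes = 500            # "Number of query mix runs (without warmups)"
total_runtime_sum = 525.696  # seconds, "Total runtime (sum)" over all clients
actual_runtime = 33.512      # seconds, "Total actual runtime" (wall clock)

# Assumed definitions (see the paragraph above this block).
qmph = query_mixes * 3600 / actual_runtime
cqet = total_runtime_sum / query_mixes

print(f"QMpH: {qmph:.2f}")
print(f"CQET: {cqet:.5f}")
```

Both values agree with the reported 53712.44 and 1.05139 to within the precision of the printed inputs. The small gap in QMpH is presumably because the actual runtime is shown rounded to three decimals.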

SPARQL 1.1 Basic Federated Query

We’ve added support for SPARQL 1.1 Basic Federated Query. We plan to add support for SPARQL 1.1 Update next, following up with a new release shortly.

SPARQL 1.1 Basic Federated Query lets you write queries against multiple SPARQL end points. Each end point is denoted in the SPARQL query using the SERVICE keyword. For example, the following query joins local data matching ?s ?p1 ?o1 with remote data matching ?s ?p2 ?o2. You can write queries which mix local data freely with remote data from one or more end points.

SELECT ?s ?o1 ?o2
WHERE {
  ?s ?p1 ?o1 .
  SERVICE <> {
    ?s ?p2 ?o2
  }
}

Bigdata vectors solutions flowing into and out of both SPARQL 1.0 and SPARQL 1.1 remote end points and lets you control the evaluation order in detail using query hints. You can configure the level of SPARQL support for the end point using the ServiceRegistry.
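
One way to picture vectoring into a remote end point is to ship a whole chunk of bindings for the join variable in a single request. The Python sketch below does this with the VALUES clause from the final SPARQL 1.1 Query specification. It illustrates the general technique under that assumption and is not a description of how bigdata builds its remote requests. The function names and example IRIs are made up.

```python
from typing import Dict, List

Solution = Dict[str, str]  # variable name -> already serialized RDF term


def values_clause(chunk: List[Solution], var: str) -> str:
    """Render the distinct bindings of `var` in a chunk as a VALUES clause."""
    seen: List[str] = []
    for solution in chunk:
        term = solution[var]
        if term not in seen:  # keep first occurrence, preserve order
            seen.append(term)
    return "VALUES ?%s { %s }" % (var, " ".join(seen))


def remote_query(chunk: List[Solution], var: str = "s") -> str:
    """Build one remote query that covers every solution in the chunk."""
    return "SELECT ?s ?p2 ?o2 WHERE { %s ?s ?p2 ?o2 }" % values_clause(chunk, var)


# A chunk of local solutions for ?s (example IRIs only).
chunk = [
    {"s": "<http://example.org/a>"},
    {"s": "<http://example.org/b>"},
    {"s": "<http://example.org/a>"},
]
query = remote_query(chunk)
```

Three local solutions become one remote request with two distinct bindings, instead of three separate round trips.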

You can also use the SERVICE keyword for internal services. For example, our own full text search engine is implemented as a SERVICE and Open Sahara has integrations for their text and geospatial indexing extensions which plug into bigdata using an internal SERVICE. Internal SERVICEs look just like remote SPARQL end points in the query, but they live in the same JVM and can be much faster. This opens up bigdata to a host of interesting integrations. Imagine a bridge to an embedded Prolog reasoner….

We have put together a wiki page which explains how to use Federated Query in depth and offers tips and tricks for controlling the evaluation order using query hints and named subqueries.

You can try it out now by checking out bigdata from the 1.1.x maintenance branch in SVN.

Virtual Graphs

We’ve added support for virtual graphs to bigdata. This was done at the suggestion of David Booth who outlined this concept in a recent presentation (see page 21). With virtual graphs you can dynamically combine large numbers of named graphs into the same “virtual” graph. This achieves exactly the same purpose as specifying a large number of FROM or FROM NAMED clauses in your SPARQL query, but the definition of what is in each graph is encapsulated in the quad store itself.
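
To see what a virtual graph saves you, the Python sketch below spells out the FROM clauses that a single virtual graph would otherwise stand in for. The membership table and the IRIs are invented for illustration. In bigdata the membership lives in the quad store itself, and this sketch does not show bigdata's own syntax for declaring or querying a virtual graph.

```python
from typing import Dict, List

# Hypothetical membership table: virtual graph IRI -> member named graphs.
VIRTUAL_GRAPHS: Dict[str, List[str]] = {
    "http://example.org/vg": [
        "http://example.org/g1",
        "http://example.org/g2",
        "http://example.org/g3",
    ],
}


def from_clauses(virtual_graph: str) -> str:
    """Return one FROM clause per named graph in the virtual graph."""
    members = VIRTUAL_GRAPHS[virtual_graph]
    return "\n".join("FROM <%s>" % graph for graph in members)


# The query you would have to write by hand without a virtual graph.
query = "SELECT ?s ?p ?o\n%s\nWHERE { ?s ?p ?o }" % from_clauses(
    "http://example.org/vg"
)
```

With three member graphs the difference is small, but the same expansion for hundreds of named graphs is what the virtual graph keeps out of your query text.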

Virtual graphs are a quads mode feature and are available from SVN as of r6059 (this revision also uses Sesame 2.6.3, but we are not quite finished with the SPARQL Federation support). There is a wiki page which documents the virtual graphs feature.

Feedback is welcome.