Monthly Archives: May 2011

Using Statement Identifiers to Manage Provenance

Sometimes it is nice to be able to say things about statements, such as where they came from and who asserted them. The RDF data model does not provide a convenient mechanism for assigning identity to particular statements or for making statements about statements. RDF reification is cumbersome, results in a huge expansion in the number of triples in the database, and is incompatible with most inference and rule engines.
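To see where the expansion comes from, here is a minimal sketch of standard RDF reification, with triples modeled as plain tuples (the names `reify`, `ex:mike`, and `ex:someDocument` are illustrative only, not part of any bigdata API):

```python
# The RDF reification vocabulary: describing one triple requires four
# additional triples (rdf:type, rdf:subject, rdf:predicate, rdf:object).
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def reify(s, p, o, stmt_node):
    """Return the four reification triples that describe (s, p, o)."""
    return [
        (stmt_node, RDF + "type",      RDF + "Statement"),
        (stmt_node, RDF + "subject",   s),
        (stmt_node, RDF + "predicate", p),
        (stmt_node, RDF + "object",    o),
    ]

triple = ("ex:mike", "ex:likes", "ex:RDF")
reified = reify(*triple, "_:stmt1")
meta = ("_:stmt1", "dc:source", "ex:someDocument")

# One asserted triple plus one provenance claim costs six triples in total.
print(len([triple] + reified + [meta]))  # 6
```

The four-triple overhead applies to every statement you want to annotate, which is the "huge expansion" referred to above.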

Named graphs (quads) are one way to approach provenance. By grouping triples into named graphs and assigning a URI as the graph identifier, you can then make statements about the named graph to identify the provenance of the group of triples (a group could even contain a single triple). Unfortunately this approach has a few drawbacks as well. Partitioning the knowledge base into groups creates challenges for inference and rule engines, and full named graph support in bigdata requires twice as many statement indices as triples.
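The named-graph approach can be sketched in the same tuple notation (again illustrative names, not bigdata's API): provenance attaches to the graph URI and implicitly covers every triple grouped under it.

```python
# Quads: (subject, predicate, object, graph). Both triples below belong
# to the named graph ex:graph1.
quads = [
    ("ex:mike", "ex:likes",    "ex:RDF",    "ex:graph1"),
    ("ex:mike", "ex:memberOf", "ex:SYSTAP", "ex:graph1"),
]

# A single statement about the graph URI covers the whole group.
provenance = ("ex:graph1", "dc:source", "ex:someDocument")

def triples_in(graph, quads):
    """All (s, p, o) asserted within the given named graph."""
    return [(s, p, o) for (s, p, o, g) in quads if g == graph]

print(len(triples_in("ex:graph1", quads)))  # 2
```

The trade-off described above follows directly from this shape: provenance is cheap per group, but the graph component partitions the data, and indexing quads costs more than indexing triples.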

If all you need is an unpartitioned, inference-capable knowledge base with the ability to make assertions about statements, bigdata provides you with a third alternative to simple triples or fully indexed quads: statement identifiers (SIDs). With SIDs, the database acts as if it is in triples mode, but each triple is assigned a statement identifier (on demand) that can be used in additional statements (meta-statements):

(s, p, o, c)
1. (<mike>, <likes>, <RDF>, :sid1)
2. (:sid1, <source>, <>)

Statement 1 asserts that <mike> <likes> <RDF>; its statement identifier :sid1 then serves as the subject of statement 2, which records the source of that assertion.
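The idea behind on-demand statement identifiers can be sketched as follows (a hypothetical illustration of the concept, not bigdata's implementation; `ex:source` and `ex:someDocument` are stand-in names):

```python
# Map each triple to a statement identifier, minted only when first
# requested ("on demand").
sids = {}

def sid(triple):
    """Return the statement identifier for a triple, assigning one on demand."""
    if triple not in sids:
        sids[triple] = ":sid%d" % (len(sids) + 1)
    return sids[triple]

t1 = ("ex:mike", "ex:likes", "ex:RDF")

# The identifier can then appear as the subject of a meta-statement.
meta = (sid(t1), "ex:source", "ex:someDocument")

print(meta[0])  # :sid1
print(sid(t1))  # :sid1  (same triple, same identifier)
```

Because the identifier is deterministic per triple, meta-statements always refer back to exactly one asserted statement, without partitioning the store or reifying anything.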


Bigdata 0.84.0 release

This is a bigdata (R) release. This release is capable of loading 1B triples in under one hour on a 15 node cluster. JDK 1.6 is required.

See [1,2] for instructions on installing bigdata(R), [4] for the javadoc, and [3,5,6] for news, questions, and the latest developments. For more information about SYSTAP, LLC and bigdata, see [7].

Please note that we recommend checking out the code from SVN using the tag for this release. The code will build automatically under Eclipse. You can also build the code using the ant script. The cluster installer requires the use of the ant script. You can check out this release from the following URL:

New features:

– Inlining provenance metadata into the statement indices and fast reverse lookup of provenance metadata using statement identifiers (SIDs).

Significant bug fixes:

– The journal size could double in some cases following a restart due to a typo in the WORMStrategy constructor.
– Fixed a concurrency hole in the commit protocol for the Journal which could result in a concurrent modification to the B+Tree during the commit protocol.

– Fixed a problem in the abort protocol for the BigdataSail.

– Fixed a problem where the BigdataSail would permit the same thread to obtain more than one UNISOLATED connection.

The road map [3] for the next releases includes:

– Single-machine data storage to 10B+ triples;
– Simple embedded and/or webapp deployment;
– 100% native SPARQL evaluation with lots of query optimizations;
– High-volume analytic query and SPARQL 1.1 query, including aggregations;
– Simplified deployment, configuration, and administration for clusters;
– High availability for the journal and the cluster.

For more information, please see the following links:


About bigdata: