Monthly Archives: April 2012

Client-Server API

Did you know that bigdata has a built-in REST API for client-server access to the RDF database? We call this interface the “NanoSparqlServer”, and its API is outlined in detail on the wiki:

What’s new with the NSS is that we’ve recently added a Java API around it so that you can write client code without having to understand the HTTP API or make HTTP calls directly. This is why there is suddenly a new dependency on Apache’s HTTP Components in the codebase. The Java wrapper is called “RemoteRepository”. If you’re comfortable writing application code against the Sesame SAIL/Repository API you should feel pretty at home with the RemoteRepository class. The two APIs are not exactly the same, but they are very similar.

The class itself is pretty self-explanatory but if you like examples, there is a test case for every API call in RemoteRepository in the class TestNanoSparqlClient. (That test case also conveniently demonstrates how to launch a NanoSparqlServer wrapping a bigdata journal using Jetty, which it does at the beginning of every test.)
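
To give a feel for what a client wrapper spares you, here is a minimal sketch of the kind of HTTP request a SPARQL client has to build by hand. It is not the RemoteRepository API and it is not taken from the bigdata codebase. It only assumes the standard SPARQL 1.1 Protocol convention of sending the query URL-encoded in a "query" parameter. The class name, the method name, and the endpoint URL are all hypothetical.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

/** Sketch only: builds a SPARQL Protocol query request by hand (requires Java 11+). */
final class SparqlHttpSketch {

    /** Builds a GET request that carries the query URL-encoded in the "query" parameter. */
    static HttpRequest buildQueryRequest(String endpoint, String sparql) {
        String encoded = URLEncoder.encode(sparql, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(endpoint + "?query=" + encoded))
                .header("Accept", "application/sparql-results+xml")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical endpoint; substitute the URL of your own server.
        HttpRequest req = buildQueryRequest(
                "http://localhost:8080/sparql", "SELECT * { ?s ?p ?o } LIMIT 1");
        System.out.println(req.method() + " " + req.uri());
    }
}
```

Sending the request and parsing the response are left out. Those are the details a Java wrapper is meant to take off your hands.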


Custom SPARQL Functions

I put together a more useful example of how to write a custom SPARQL function with bigdata. It’s up on the wiki here:

The example details a common use case: filtering out solutions based on security credentials for a particular user. For example, if you wanted to return a list of documents visible to the user “John”, you could do it with a custom SPARQL function:

PREFIX ex: <>
SELECT ?doc {
  ?doc rdf:type ex:Document .
  filter(ex:validate(?doc, ?user)) .
}
BINDINGS ?user {
  ("John")
}

The function is called by referencing its unique URI, in this case ex:validate. This URI must be registered with bigdata’s FunctionRegistry along with an appropriate factory and operator. The wiki details how to do that. In the query above, the function is called with two arguments: the document to be validated and the user to validate against. The user in this simple example is a constant included in the BINDINGS clause. Always remember that bigdata custom functions are executed one solution at a time; they do not yet benefit from vectored execution and thus are not suitable for reading data from the indices. (Each call to the function must complete without reading from the index.) A custom service (distinct from a custom function) is a more appropriate choice when execution requires touching indices. This is how we implement SPARQL 1.1 Federation.
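
The constraint described above (one call per solution, no index reads) suggests keeping whatever the function needs in memory before the query runs. The following is a minimal sketch of that idea in plain Java. It deliberately does not use bigdata’s FunctionRegistry, factory, or operator interfaces, which are documented on the wiki. The class name, the in-memory access list, and the string-based arguments are hypothetical stand-ins for illustration only.

```java
import java.util.Map;
import java.util.Set;

/** Sketch only: a per-solution access check that consults in-memory state and nothing else. */
final class ValidateSketch {

    // Hypothetical access list, loaded once up front: user name -> documents visible to that user.
    private final Map<String, Set<String>> visibleDocs;

    ValidateSketch(Map<String, Set<String>> visibleDocs) {
        this.visibleDocs = visibleDocs;
    }

    /** Called once per solution with the bound ?doc and ?user; performs no index reads. */
    boolean validate(String doc, String user) {
        return visibleDocs.getOrDefault(user, Set.of()).contains(doc);
    }

    public static void main(String[] args) {
        ValidateSketch f = new ValidateSketch(Map.of("John", Set.of("ex:doc1", "ex:doc2")));
        System.out.println(f.validate("ex:doc1", "John")); // true
        System.out.println(f.validate("ex:doc3", "John")); // false
    }
}
```

If the check could not be answered from memory and had to consult the indices for each document, that would be the point at which a custom service becomes the better fit.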


Graph Data Management 2012

The Graph Data Management 2012 workshop was last week in Washington, DC. The workshop brought together an interesting mixture of people from several different backgrounds. There were people focused on data mining and prediction, people focused on graph algorithms (iterative algorithms over materialized graphs), and several presentations on “graph databases” (3 on RDF databases and one on HyperGraphDB). Many thanks to the workshop organizers for pulling together such an interesting event!

It is clear that the “graph database” space is currently handicapped by a lack of standards. SPARQL can certainly solve many of the problems there, but it lacks a standardized way of dealing with provenance (aka link attributes). We have efficient extensions for this and it sounds like at least Virtuoso will be picking them up as well, so maybe we can drive standardization that way. SPARQL has support for property paths, but it lacks a means to express iterative refinement algorithms so they could be executed efficiently within the database. It is possible to use SPARQL update commands to operate iteratively on data sets on the server without round-tripping large graphs to the client, but it is not yet possible to specify control logic for such updates in a standardized manner. Without extensions which clarify which graphs or solutions should be durable and which should be wired into main memory, it is also difficult to use SPARQL update for iterative algorithms which assemble an annotated graph. Equally worrisome, it appears that it is not yet possible to create good benchmarks for graph databases because the low level APIs wipe out the tremendous advantage which you gain from vectored evaluation in a database.

We will be announcing some new features over the next few weeks and the coming months designed to address some of these issues. The first feature will extend SPARQL 1.1 UPDATE to let you provision and manage solution sets. A preview of this SPARQL UPDATE extension is published on the bigdata wiki. The extension adds just a little bit of syntax, but a whole lot of power. It was originally envisioned to give people the ability to page through large result sets without re-evaluating complex joins, a use case which is illustrated on the wiki. However, we see lots of opportunities beyond an application-aware SPARQL cache.
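
The paging use case can be pictured with a small in-memory analogy: evaluate the expensive query once, keep its solutions under a name, and serve each page from the saved copy. This sketch is not the syntax or the implementation of the SPARQL UPDATE extension, which is described on the wiki. Every name in it is hypothetical, and a solution is modelled simply as a map from variable name to value.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Sketch only: evaluate an expensive query once, then page through the saved solutions. */
final class SolutionSetCacheSketch {

    // Named solution sets; each solution maps a variable name to its bound value.
    private final Map<String, List<Map<String, String>>> named = new HashMap<>();

    /** Runs the (expensive) query once and keeps its solutions under the given name. */
    void create(String name, Supplier<List<Map<String, String>>> query) {
        named.put(name, new ArrayList<>(query.get()));
    }

    /** Returns one page of a saved solution set without re-running the query. */
    List<Map<String, String>> page(String name, int offset, int limit) {
        List<Map<String, String>> all = named.getOrDefault(name, List.of());
        int from = Math.min(Math.max(offset, 0), all.size());
        // Compute the end index in long arithmetic so a huge limit cannot overflow.
        int to = (int) Math.min((long) from + Math.max(limit, 0), all.size());
        return all.subList(from, to);
    }

    public static void main(String[] args) {
        SolutionSetCacheSketch cache = new SolutionSetCacheSketch();
        cache.create("docs", () -> List.of(
                Map.of("doc", "d1"), Map.of("doc", "d2"), Map.of("doc", "d3")));
        System.out.println(cache.page("docs", 0, 2)); // first page
        System.out.println(cache.page("docs", 2, 2)); // second page
    }
}
```

The point of the analogy is that the joins are paid for once, in create, and each later page is a cheap slice of the stored result.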

Another feature which will come out later this year is a distributed client/server graph protocol. This is designed to address the tight coupling of applications with graph databases, provide a fast, scalable object-level cache for graph data, and provide both fast in-memory traversal on the client and efficient subgraph matching on the server. Clients will also be able to create “graph transactions” and post updates back to the server and write-through cache fabric. We plan to have multiple client language bindings for this, providing graph database access within the browser, in Java, etc. We are even looking at a GPU binding for pure computational speed. The language bindings will be generated based on metadata describing the object models.


Bigdata 1.2.0 release (SPARQL UPDATE, Federated Query, Service Description and more)

This is a major version release of bigdata(R). Bigdata is a horizontally-scaled, open-source architecture for indexed data with an emphasis on RDF, capable of loading 1B triples in under one hour on a 15 node cluster. Bigdata operates in both a single machine mode (Journal) and a cluster mode (Federation). The Journal provides fast scalable ACID indexed storage for very large data sets, up to 50 billion triples / quads. The Federation provides fast scalable shard-wise parallel indexed storage using dynamic sharding, shard-wise ACID updates, and incremental cluster size growth. Both platforms support fully concurrent readers with snapshot isolation.

Distributed processing offers greater throughput but does not reduce query or update latency. Choose the Journal when the anticipated scale and throughput requirements permit. Choose the Federation when the administrative and machine overhead associated with operating a cluster is an acceptable tradeoff for essentially unlimited data scaling and throughput.

See [1,2,8] for instructions on installing bigdata(R), [4] for the javadoc, and [3,5,6] for news, questions, and the latest developments. For more information about SYSTAP, LLC and bigdata, see [7].

Starting with the 1.0.0 release, we offer a WAR artifact [8] for easy installation of the single machine RDF database. For custom development and cluster installations we recommend checking out the code from SVN using the tag for this release. The code will build automatically under eclipse. You can also build the code using the ant script. The cluster installer requires the use of the ant script.

You can download the WAR from:

You can checkout this release from:

New features:

– SPARQL 1.1 Service Description
– SPARQL 1.1 Basic Federated Query
– New integration point for custom services (ServiceRegistry).
– Remote Java client for NanoSparqlServer
– Sesame 2.6.3
– Ganglia integration (cluster)
– Performance improvements (cluster)

Feature summary:

– Single machine data storage to ~50B triples/quads (RWStore);
– Clustered data storage is essentially unlimited;
– Simple embedded and/or webapp deployment (NanoSparqlServer);
– Triples, quads, or triples with provenance (SIDs);
– Fast RDFS+ inference and truth maintenance;
– Fast 100% native SPARQL 1.1 evaluation;
– Integrated “analytic” query package;
– 100% Java memory manager leverages the JVM native heap (no GC);

Road map [3]:

– SPARQL 1.1 property paths (last missing feature for SPARQL 1.1);
– Runtime Query Optimizer for Analytic Query mode;
– Simplified deployment, configuration, and administration for clusters; and
– High availability for the journal and the cluster.

Change log:

Note: Versions with (*) MAY require data migration. For details, see [9].

1.2.0: (*)

– (Monitoring webapp)
– (Support evaluation of 3rd party operators)
– (Compact and efficient movement of binding sets between nodes.)
– (Cluster leaks threads under read-only index operations: DGC thread leak)
– (Thread-local cache combined with unbounded thread pools causes effective memory leak: termCache memory leak & thread-local buffers)
– (KeyBeforePartitionException on cluster)
– (Class loader problem)
– (Ganglia integration)
– (Logger for RWStore transaction service and recycler)
– (SPARQL query can fail to notice when IRunningQuery.isDone() on cluster)
– (RWStore does not track tx release correctly)
– (HTTP Repository broken with bigdata 1.1.0)
– (SPARQL 1.1 Federation extension)
– (Serialization error in SIDs mode on cluster)
– (Global Row Store Read on Cluster uses Tx)
– (IExtension implementations do point lookups on lexicon)
– (“No such index” on cluster under concurrent query workload)
– (Java level deadlock in DS)
– (Uncaught interrupt resolving RDF terms)
– (KeyAfterPartitionException / KeyBeforePartitionException on cluster)
– (NoSuchVocabularyItem with LUBMVocabulary for DerivedNumericsExtension)
– (Query statistics do not update correctly on cluster)
– (Too many GRS reads on cluster)
– (Sail does not flush assertion buffers before query)
– (acceptTaskService pool size on cluster)
– (Optimize serialization for query messages on cluster)
– (Test suite for writeCheckpoint() and recycling for BTree/HTree)
– (Cluster does not map input solution(s) across shards)
– (Error releasing deferred frees using 1.0.6 against a 1.0.4 journal)
– (PhysicalAddressResolutionException against 1.0.6)
– (RWStore reset() should be thread-safe for concurrent readers)
– (Java API for NanoSparqlServer REST API)
– (AbstractTripleStore.destroy() does not clear the locator cache)
– (Empty chunk in ThickChunkMessage (cluster))
– (Virtual Graphs)
– (Sesame 2.6.3)
– (Bring bigdata RDF/XML parser up to openrdf 2.6.3.)
– (SPARQL 1.1 Service Description)
– (Aggregation with an solution set as input should produce an empty solution as output)
– (Incorrect error handling for SPARQL aggregation; fix in 2.6.1)
– (Order the same Blank Nodes together in ORDER BY)
– (SPARQL 1.1 BINDINGS are ignored)
– (Bigdata2Sesame2BindingSetIterator throws QueryEvaluationException where it should throw NoSuchElementException)
– (UNION with Empty Group Pattern)
– (Exception when using SPARQL sort & statement identifiers)
– (Load, closure and query performance in 1.1.x versus 1.0.x)
– (LIMIT causes hash join utility to log errors)
– (Expose the LexiconConfiguration to Function BOPs)
– (Query with two “FILTER NOT EXISTS” expressions returns no results)
– (REGEXBOp should cache the Pattern when it is a constant)
– (Java 7 Compiler Compatibility)
– (Review function bop subclass hierarchy, optimize datatype bop, etc.)
– (CONSTRUCT WHERE shortcut)
– (Incremental materialization of Tuple and Graph query results)
– (Modify the IChangeLog interface to support multiple agents)
– (Expose timestamp of LexiconRelation to function bops)
– (ClassCastException during hash join (can not be cast to TermId))
– (Review materialization for inline IVs)
– (BSBM BI Q5 error using MERGE JOIN)

1.1.0 (*)

– (Lexicon joins)
– (Store large literals as “blobs”)
– (Scale-out LUBM “how to” in wiki and build.xml are out of date.)
– (Implement a persistence capable hash table to support analytic query)
– (AccessPath should visit binding sets rather than elements for high level query.)
– (SliceOp appears to be necessary when operator plan should suffice without)
– (Bottom-up evaluation semantics).
– (Derived xsd numeric data types must be inlined as extension types.)
– (Revisit pruning of intermediate variable bindings during query execution)
– (Lift conditions out of subqueries.)
– (Native ORDER BY)
– (Inline predeclared URIs and namespaces in 2-3 bytes)
– (NanoSparqlServer does not locate “html” resources when run from jar)
– (Support inlining of unicode data in the statement indices.)
– (Scalable default graph evaluation)
– (Prune variable bindings during query evaluation)
– (Direct translation of openrdf AST to bigdata AST)
– (Fix StrBOp and other IValueExpressions)
– (Optimize OPTIONALs with multiple statement patterns.)
– (Native SPARQL evaluation on cluster)
– (Cluster does not compute closure)
– (HTree hash join performance)
– (inline xsd:unsigned datatypes)
– (xsd:string cast fails for non-numeric data)
– (New query hints model.)
– (Use of read-only tx per query defeats cache on cluster)


– (BTreeCounters does not track bytes released)
– (Refactor performance counters using accessor interface)
– (B+Tree should delete bloom filter when it is disabled.)
– (RWStore does not prune the CommitRecordIndex)
– (Persistent memory leaks (RWStore/DISK))
– (FastRDFValueCoder2: ArrayIndexOutOfBoundsException)
– (Release age advanced on WORM mode journal)
– (Add a DELETE by access path method to the NanoSparqlServer)
– (Add “context-uri” request parameter to specify the default context for INSERT in the REST API)
– (log4j configuration error message in WAR deployment)
– (Add a fast range count method to the REST API)
– (Support temp triple store wrapped by a BigdataSail)
– (NQuads support for NanoSparqlServer)
– (Bug fix to DEFAULT_RDF_FORMAT for bulk data loader in scale-out)
– (Support either lockfile (procmail) and dotlockfile (liblockfile1) in scale-out)
– (BigdataSail#getReadOnlyConnection() race condition with concurrent commit)
– (Address is 0L)
– (TestMROWTransactions failure in CI)


– (Query time expansion of (foo rdf:type rdfs:Resource) drags in SPORelation for scale-out.)
– (Scale-out LUBM “how to” in wiki and build.xml are out of date.)
– (Query not terminated by error.)
– (NamedGraph pattern fails to bind graph variable if only one binding exists.)
– (IRunningQuery not closed promptly.)
– (DataLoader fails to load resources available from the classpath.)
– (Support for the streaming of bigdata IBindingSets into a sparql query.)
– (ClosedByInterruptException during heavy query mix.)
– (NotSerializableException for SPOAccessPath.)
– (Change dependencies to Apache River 2.2.0)

1.0.1 (*)

– (Unicode clean schema names in the sparse row store).
– (TermIdEncoder should use more bits for scale-out).
– (OSX requires specialized performance counter collection classes).
– (BigdataValueFactory.asValue() must return new instance when DummyIV is used).
– (TermIdEncoder limits Journal to 2B distinct RDF Values per triple/quad store instance).
– (SPO not Serializable exception in SIDS mode (scale-out)).
– (ClassCastException when querying with binding-values that are not known to the database).
– (UnsupportedOperatorException for some SPARQL queries).
– (Query failure when comparing with non materialized value).
– (RWStore reports “FixedAllocator returning null address, with freeBits”.)
– (NamedGraph pattern fails to bind graph variable if only one binding exists.)
– (log4j – slf4j bridge.)

For more information about bigdata(R), please see the following links:


About bigdata: