Author Archives: Mike Personick

Inline URIs

There is a commit working its way through CI right now in the 1.3 maintenance branch. This commit puts in place a mechanism to inline many different types of URIs directly into the statement indices, including UUIDs. The new mechanism and how to use it are described in the ticket:

This change will be backward compatible with old journals, but to take advantage of it you will need to reload your data so that URIs can be inlined. It should be possible to inline all sorts of URIs with this new mechanism. Inlining terms is a great way to save space in the indices and improve performance, as it eliminates the need to round-trip terms through the dictionary indices.
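To make the trade-off concrete, here is a toy Python sketch of term inlining. The key layout and type markers are purely illustrative (not Bigdata's actual encoding): an inlinable URI (here, a UUID URI) is packed directly into the index key, while any other URI must round-trip through a dictionary index.

```python
import uuid

# Toy sketch of term inlining (illustrative only, not Bigdata's key format).
# A UUID URI is packed directly into the statement-index key; other URIs
# are assigned a dictionary id that must be resolved again on read.

dictionary = {}   # term -> id (the "dictionary index")
reverse = {}      # id -> term

def encode_term(term: str) -> bytes:
    prefix = "urn:uuid:"
    if term.startswith(prefix):
        # Inline: the UUID's 16 raw bytes go straight into the key.
        return b"\x01" + uuid.UUID(term[len(prefix):]).bytes
    # Not inlinable: round-trip through the dictionary.
    tid = dictionary.setdefault(term, len(dictionary) + 1)
    reverse[tid] = term
    return b"\x00" + tid.to_bytes(8, "big")

def decode_term(key: bytes) -> str:
    if key[0] == 1:
        # Decoded without touching the dictionary at all.
        return "urn:uuid:" + str(uuid.UUID(bytes=key[1:]))
    return reverse[int.from_bytes(key[1:], "big")]

u = "urn:uuid:" + str(uuid.uuid4())
assert decode_term(encode_term(u)) == u
```

The point of the sketch: the inlined term is self-describing inside the key, so both the dictionary entry and the read-time lookup disappear.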

Bigdata and Blueprints

Here is a post I just did on the gremlin-users Google Group. Thought it might be of interest to a wider audience as well.


On Thursday, May 29, 2014 12:30:20 PM UTC-6, Jack wrote:

Would you and Marko care to explain the big differentiators between Titan and BigData?



I have never done a comprehensive side-by-side with Titan/Cassandra but I can tell you a little about what Bigdata offers. The genesis for Bigdata was the BigTable paper Google did back in 2006. At that time we were very interested in using graphs and particularly the semantic web to facilitate dynamic federation and semantic alignment of heterogeneous data within a schema flexible framework, all at scale. The landscape for graph databases was quite different back then, and within the semantic web community the real focus was semantics, not scale. Using the principles Google outlined for BigTable, we designed Bigdata from the ground up to be a massively scalable distributed database specifically for graphs.

Bigdata at its core is really a KV store – all data is ultimately persisted inside unsigned byte[] key -> unsigned byte[] value BTree indices. These indices can be key-range sharded and dynamically distributed and load balanced across a cluster. On top of the KV layer, Bigdata has a graph database layer that supports the RDF data model. Graph data is triple indexed to achieve perfect access paths for all eight possible graph access patterns. This concept of covering indices was developed by Andreas Harth and Stefan Decker back in 2005 as part of their YARS database. Using these indices, we can support arbitrary graph pattern joins at the core of our query engine, which supports the high-level query language SPARQL, the only open standard we have as a community for graph query. In the scale-out mode of the database, the query engine pushes intermediate solutions out across the cluster and executes joins at the data. I think this is probably the key differentiating feature over other graph databases written on top of existing KV stores like Cassandra and Accumulo – these implementations tend to use a client-based query controller (if they provide a query controller at all) that pulls data across the network and does joins at the controller, instead of pushing the computation out to the data. This client-controller strategy can result in a huge amount of wasted IO and network traffic.
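To illustrate the covering-index idea, here is a toy Python sketch (a simplification; Bigdata's actual indices are sharded B+Trees over unsigned byte[] keys). Three key orders – SPO, POS, OSP – are enough to turn every one of the eight (s, p, o) access patterns into a prefix scan on some index.

```python
# Toy sketch of covering triple indices (illustrative, not Bigdata's layout).
# Three key orders turn all eight (s, p, o) access patterns into prefix scans.
TRIPLES = [
    ("mike", "likes", "rdf"),
    ("mike", "knows", "marko"),
    ("marko", "likes", "gremlin"),
]

# Each index stores triples sorted by a different permutation of (s, p, o).
ORDERS = {"spo": (0, 1, 2), "pos": (1, 2, 0), "osp": (2, 0, 1)}

def lookup(s=None, p=None, o=None):
    """Pick the index whose key order puts all bound terms in a key prefix."""
    q = (s, p, o)
    for name, perm in ORDERS.items():
        bound = [q[i] is not None for i in perm]
        if bound == sorted(bound, reverse=True):  # bound terms form a prefix
            matches = [t for t in TRIPLES
                       if all(q[i] is None or t[i] == q[i] for i in range(3))]
            return matches, name
    raise AssertionError("unreachable: the three orders cover all 8 patterns")

matches, index = lookup(p="likes")  # pattern (?, likes, ?) -> POS index
```

Try each of the eight bound/unbound combinations: every one resolves to exactly one of the three orders, which is the "perfect access path" property the YARS design exploits.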

Bigdata was designed from the ground up to perform well as both a single-server and a distributed database. As a single-server database, Bigdata supports up to 50 billion RDF statements (each statement translates roughly to one vertex, edge, or property). Single-server operation comes in several deployment flavors as well – you can stand Bigdata up as an actual server and access it via its REST API; the BigdataGraphRemote Blueprints implementation wraps this mode. You can also use Bigdata embedded inside your application’s JVM; the BigdataGraphEmbedded Blueprints implementation wraps this mode.

I suppose the key difference between Titan and Bigdata might be Bigdata’s query optimizer. With a query optimizer you can give the database an operator tree in the form of a declarative query, and the optimizer can do things like use the cardinalities of the different operators to determine an optimal execution strategy. Without this, all a database can do is exactly what you tell it to do in exactly the order you tell it to do it, which oftentimes is not the best order at all. This leads to “hand-optimization” to get queries to run quickly (I want to combine these five different predicates to select vertices or edges, but I have to re-order them myself to get the best performance). We’ve done a little bit of work exposing the query optimizer in the BigdataGraph implementation – we provide a custom GraphQuery implementation that will let you specify predicates to be executed by the query engine with the help of the query optimizer. We would like to expose other aspects of Bigdata’s query engine/optimizer through Blueprints as well.
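The core of what a cardinality-based optimizer buys you can be sketched in a few lines of Python (a deliberately naive model – real optimizers estimate join cardinalities, not just per-predicate counts; the predicate names are made up):

```python
# Naive sketch of cardinality-based predicate reordering. A real optimizer
# estimates join cardinalities; here we simply run the most selective
# (lowest-cardinality) predicate first to shrink intermediate results.
def optimize(predicates, cardinality):
    return sorted(predicates, key=cardinality.get)  # cheapest first

# Hypothetical predicate cardinalities for a vertex-selection query:
card = {"type=Person": 1_000_000, "city=Oslo": 50_000, "ssn=123-45-6789": 1}
plan = optimize(list(card), card)
assert plan[0] == "ssn=123-45-6789"  # most selective predicate runs first
```

Executing the predicates in hand-written order could scan a million vertices before filtering; the reordered plan touches one.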

I am also not aware of an HA architecture for Titan, but this might be my own ignorance. Bigdata has a high-availability architecture based on a quorum model. In this mode all data is kept on all nodes so queries are answered locally and can be load-balanced to achieve perfect linear scaling for read. Data is replicated using a low-level write replication pipeline with a 2-phase commit. The cluster requires a simple majority quorum to be available. Nodes can go offline temporarily and come back, or new nodes can join in their place. Re-joining or new nodes will catch up to the current state using a playback of the missing commits from the current quorum leader. If the quorum leader goes down this role will failover automatically to another node.
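The availability rule described above is simple enough to state as code (a sketch of the rule only – the real quorum machinery also handles leader election, write pipelining, and resynchronization):

```python
# Sketch of the simple-majority quorum rule: with a replication factor k,
# the service remains available only while more than k // 2 nodes are joined.
def quorum_met(replication_factor: int, joined: int) -> bool:
    return joined > replication_factor // 2

assert quorum_met(3, 2)       # one of three nodes down: still available
assert not quorum_met(3, 1)   # two of three nodes down: quorum broken
```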

One other interesting feature of Bigdata is an integrated graph analytics engine that supports the Gather-Apply-Scatter (GAS) API. GAS lets you do interesting traversal based analysis – generally operations written on top of BFS or multiple BFS passes. The canonical examples are Page Rank and Shortest Path (SSSP). The GAS engine is integrated directly into Bigdata’s query engine, although this functionality is not yet exposed through the Blueprints API. I’ve spoken with Marko about how this might get exposed in TinkerPop3. In the meantime you can always work with your graph data inside Bigdata through any of the other APIs as well (you are not limited to only Blueprints or only Sesame or only the Bigdata Workbench for a particular database instance).
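To give a feel for the vertex-centric style, here is a toy Gather-Apply-Scatter BFS in Python (illustrative only – Bigdata's GAS engine has a richer API and runs inside the query engine):

```python
# Toy Gather-Apply-Scatter BFS. Each round: frontier vertices send depth+1
# to unvisited neighbors (gather), those neighbors accept the smallest
# offered depth (apply), and the updated vertices form the next frontier
# (scatter). Analytics like SSSP follow the same round structure.
def gas_bfs(edges, start):
    neighbors = {}
    for u, v in edges:  # undirected for simplicity
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    depth = {start: 0}
    frontier = {start}
    while frontier:
        gathered = {}                         # gather: proposed depths
        for u in frontier:
            for v in neighbors.get(u, ()):
                if v not in depth:
                    gathered[v] = min(gathered.get(v, depth[u] + 1), depth[u] + 1)
        depth.update(gathered)                # apply: accept new depths
        frontier = set(gathered)              # scatter: activate updated vertices
    return depth

d = gas_bfs([("a", "b"), ("b", "c"), ("a", "d")], "a")
assert d == {"a": 0, "b": 1, "d": 1, "c": 2}
```

Because each round only looks at the frontier and its neighbors, the per-round work is exactly the data-parallel pattern that GPUs (and MapGraph, below) exploit.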

We also did a literature review of the state of the art in graph database technology about a year ago that might further address your question. There is an entire section on KV stores and Map/Reduce systems.

The Blueprints implementation over Bigdata is very new and we are very excited to get it out into the community and get feedback on it. We welcome comments and suggestions and please do let us know if you find any problems. The best thing to do with an issue is to post it to our trac system, preferably with a small self-contained test case that demonstrates the issue.


Mike Personick
Core Bigdata Development Team

Bigdata and Blueprints

I’ve committed a Blueprints/Gremlin integration. Here is how to get started with the Blueprints/Gremlin APIs in a few simple steps:

1. Go get bigdata and start the server:

> svn co svn:// bigdata
> cd bigdata
> ant start-bigdata

2. Go get the Tinkerpop Property Graph (sample GraphML data):

3. Run an ant task to download, unpack, and configure the Gremlin console to work with bigdata:

From the bigdata directory:
> ant gremlin

4. Start Gremlin:

From the bigdata directory:
> ./ant-build/gremlin-groovy-2.5.0/bin/

5. From Gremlin (or Blueprints code), you can connect to the bigdata server, or create a local instance (either in-memory or persistent):

gremlin> import com.bigdata.blueprints.*
gremlin> remoteGraph = BigdataGraphFactory.connect("http://localhost:9999")
gremlin> inMemGraph = BigdataGraphFactory.create()
gremlin> persistentGraph = BigdataGraphFactory.create("/tmp/bigdata.jnl")

6. Load your sample GraphML data into the graph:

gremlin> persistentGraph.loadGraphML("graph-example-1.xml")

7. You can then exit Gremlin and re-open your persistent graph later:

gremlin> persistentGraph = BigdataGraphFactory.open("/tmp/bigdata.jnl")

MPGraph is now MapGraph

MapGraph is a Massively Parallel Graph Processing API (previously known as “MPGraph”) that lets you express graph analytics (e.g. BFS, Shortest Path) in a vertex-centric programming abstraction known as GAS (Gather-Apply-Scatter). The API is based on the same Gather-Apply-Scatter model used in GraphLab. MapGraph comes in two flavors – a CPU version that is currently integrated into bigdata and a standalone GPU version that delivers up to 3 billion traversed edges per second on a single GPU. MapGraph on the GPU is up to two orders of magnitude faster than parallel CPU implementations on up to 24 CPU cores and has performance comparable to a state-of-the-art manually optimized GPU implementation of the same analytic. MapGraph’s easy-to-use GAS API allows new algorithms to be implemented in a few hours that can then fully exploit the data-level parallelism of the GPU.

The CPU version of MapGraph operates over graph data inside bigdata and is exposed via a SPARQL 1.1 Service Call:

PREFIX gas: <>
SELECT ?s ?p ?o ?depth {
  # run the Shortest Path algorithm
  SERVICE gas:service {
    gas:program gas:gasClass "" .
    gas:program gas:in <:> . # starting point
    gas:program gas:target <:> . # target vertices
    gas:program gas:out ?s . # bound to the visited vertices.
    gas:program gas:out1 ?depth . # bound to the depth of the visited vertices.
  }
  # join with the statement indices
  ?s ?p ?o . # extract all links and attributes for the visited vertices.
}

This query combines a shortest path operation to find all vertices on the shortest path between two nodes, then does a join against the statement indices to fill in all the edges along that shortest path. The example above shows how you might create a connected graph between IP addresses using traceroute data.

SPARQL 1.1 Property Paths

We’ve completed an implementation of SPARQL 1.1 Property Paths to be included in an upcoming bigdata minor release. For an early preview of this feature, grab the latest code from the bigdata 1.2 maintenance branch in SVN. Feedback is greatly appreciated!


Client-Server API

Did you know that bigdata has a built-in REST API for client-server access to the RDF database? We call this interface the “NanoSparqlServer”, and its API is outlined in detail on the wiki:

What’s new with the NSS is that we’ve recently added a Java API around it so that you can write client code without having to understand the HTTP API or make HTTP calls directly. This is why there is suddenly a new dependency on Apache’s HTTP Components in the codebase. The Java wrapper is called “RemoteRepository”. If you’re comfortable writing application code against the Sesame SAIL/Repository API you should feel pretty at home with the RemoteRepository class. Not exactly the same, but very very similar.

The class itself is pretty self-explanatory but if you like examples, there is a test case for every API call in RemoteRepository in the class TestNanoSparqlClient. (That test case also conveniently demonstrates how to launch a NanoSparqlServer wrapping a bigdata journal using Jetty, which it does at the beginning of every test.)
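Under the hood, a client like RemoteRepository is just issuing HTTP requests against the SPARQL endpoint. Here is a sketch in Python of what one such request looks like; the endpoint URL and path are assumptions (check the NanoSparqlServer wiki page for the actual deployment path):

```python
from urllib.parse import urlencode
from urllib.request import Request

# Sketch of the kind of HTTP request a client wrapper prepares: a SPARQL
# query sent as a form-encoded POST, asking for JSON results. The endpoint
# URL below is an assumption, not necessarily the server's real path.
def sparql_request(endpoint: str, query: str) -> Request:
    body = urlencode({"query": query}).encode("utf-8")
    return Request(
        endpoint,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )

req = sparql_request("http://localhost:9999/sparql",
                     "SELECT * { ?s ?p ?o } LIMIT 10")
assert req.get_method() == "POST"
assert b"SELECT" in req.data
```

The value of the Java wrapper is precisely that you never have to assemble requests like this by hand.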

Custom SPARQL Functions

I put together a more useful example of how to write a custom SPARQL function with bigdata. It’s up on the wiki here:

The example details a common use case – filtering out solutions based on security credentials for a particular user. For example, if you wanted to return a list of documents visible to the user “John”, you could do it with a custom SPARQL function:

PREFIX ex: <>
SELECT ?doc {
  ?doc rdf:type ex:Document .
  filter(ex:validate(?doc, ?user)) .
}
BINDINGS ?user {
  ("John")
}
The function is called by referencing its unique URI, in this case ex:validate. This URI must be registered with bigdata’s FunctionRegistry along with an appropriate factory and operator; the wiki details how to do that. In the query above, the function is called with two arguments: the document to be validated and the user to validate against. The user in this simple example is a constant supplied in the BINDINGS clause. Always remember that bigdata custom functions are executed one solution at a time – they do not yet benefit from vectored execution and must operate without reading from the indices on each call, so they are not suitable for reading data from the database. A custom service (distinct from a custom function) is the more appropriate choice when execution requires touching the indices; this is how we implement SPARQL 1.1 Federation.
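The execution model is easy to see in miniature. Below is a Python sketch of the contract a custom function must honor (the names validate and ACL are hypothetical): it is invoked once per solution with already-bound values, and it decides purely from its arguments, without index access.

```python
# Sketch of the one-solution-at-a-time contract for a custom filter function.
# ACL and validate are hypothetical names for this illustration.
ACL = {"John": {"doc1", "doc3"}}

def validate(doc, user):
    # Pure check over the two bound arguments; no index reads allowed here.
    return doc in ACL.get(user, set())

# The engine calls the function once per candidate solution:
solutions = [{"doc": d, "user": "John"} for d in ("doc1", "doc2", "doc3")]
kept = [s for s in solutions if validate(s["doc"], s["user"])]
assert [s["doc"] for s in kept] == ["doc1", "doc3"]
```

If the check itself needed to consult the database (say, an ACL stored as triples), that per-call index read is exactly what makes a custom service the better fit.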

Using Statement Identifiers to Manage Provenance

Sometimes it is nice to be able to say things about statements, such as where they came from and who asserted them. The RDF data model does not provide a convenient mechanism for assigning identity to particular statements or for making statements about statements. RDF reification is cumbersome, results in a huge expansion in number of triples in the database, and is incompatible with most inference and rule engines.
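To put a rough number on that expansion, here is a back-of-the-envelope Python sketch, assuming standard RDF reification (four extra triples – rdf:type rdf:Statement, rdf:subject, rdf:predicate, rdf:object – per reified statement, on top of the original assertion):

```python
# Back-of-the-envelope comparison of triple counts: standard reification
# versus statement identifiers (SIDs). With SIDs the identifier lives in
# the index itself, so only the meta-statements add triples.
def reification_triples(statements, meta_per_statement):
    # original triple + 4 reification triples + the meta-statements
    return statements + statements * 4 + statements * meta_per_statement

def sid_triples(statements, meta_per_statement):
    # original triple + the meta-statements; no reification quads needed
    return statements + statements * meta_per_statement

assert reification_triples(1_000_000, 1) == 6_000_000
assert sid_triples(1_000_000, 1) == 2_000_000
```

One meta-statement per statement triples the reification overhead relative to SIDs; the gap widens as the database grows.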

Named graphs (quads) are one way to approach provenance. By grouping triples into named graphs and assigning a URI as the graph identifier, you can then make statements about the named graph to identify the provenance of the group of triples (the group could even theoretically be of size one). Unfortunately this approach has a few drawbacks as well. Partitioning the knowledge base into groups creates challenges for inference and rule engines, and full named graph support in bigdata requires twice as many statement indices as triples.

If all you need is an unpartitioned, inference-capable knowledge base with the ability to make assertions about statements, bigdata provides you with a third alternative to simple triples or fully indexed quads: statement identifiers (SIDs). With SIDs, the database acts as if it is in triples mode, but each triple is assigned a statement identifier (on demand) that can be used in additional statements (meta-statements):

(s, p, o, c)
1. (<mike>, <likes>, <RDF>, :sid1)
2. (:sid1, <source>, <>)

Statement 1 asserts that