I had an interesting conversation with one of the Blueprints developers (Joshua Shinavier) about graph databases during the CSHALS 2012 conference.
Blueprints appears to follow an object model very similar to the one that Martyn Cutcher developed in the Generic Persistent Object (GPO) / Generic Object Model (GOM). GOM allows schema-flexible objects, object link sets, and link properties via link “reification”. In fact, we have had most of a GPO/GOM implementation for bigdata since 2006. The main hangup has been finding the free time to support a horizontally scaled GPO/GOM model (in particular, horizontal scaling for GPO link sets). A very similar technology was used in the core of the K42 engine by STEP UK (K42 was a high-performance object database engine underlying an XML Topic Maps engine back in 2000).
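To make that model concrete, here is a purely illustrative, in-memory sketch of what schema-flexible objects with named link sets might look like. The class and field names are my own for this post, not the actual GPO/GOM API, and the real implementation is persistent and index-backed:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * In-memory illustration only: schema-flexible property maps plus named
 * link sets. All names here are hypothetical, not the GPO/GOM API.
 */
class GenericObject {
    final long id;                        // object identifier
    final Map<String, Object> properties = new HashMap<String, Object>();
    final Map<String, Set<GenericObject>> linkSets = new HashMap<String, Set<GenericObject>>();

    GenericObject(long id) { this.id = id; }

    /** Add a link. Link properties would reify the link as an object itself. */
    void link(String name, GenericObject target) {
        Set<GenericObject> targets = linkSets.get(name);
        if (targets == null) {
            targets = new HashSet<GenericObject>();
            linkSets.put(name, targets);
        }
        targets.add(target);
    }
}
```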
Bigdata also has a native provenance mode for the SAIL interface featuring Statement Identifiers (SIDs). This mode was developed to support the intelligence and topic maps communities and allows statements about statements. We’ve blogged about this in the past. The SIDs mode lets you attach attributes to “links” efficiently, and even lets you attach links to links (statement identifiers can appear in any position of a statement), which is more general than the Blueprints API.
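As a conceptual sketch of a statement about a statement, here is what the pattern looks like through the generic Sesame repository API. In SIDs mode the statement identifier is assigned by the database; the blank node below is only a stand-in for that identifier, and the URIs are made up:

```java
import org.openrdf.model.Resource;
import org.openrdf.model.URI;
import org.openrdf.model.ValueFactory;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.RepositoryException;

/**
 * Conceptual sketch: annotating a "link" (statement) with an attribute.
 * The blank node stands in for the database-assigned statement identifier.
 */
public class SidSketch {
    public static void annotateLink(RepositoryConnection cxn) throws RepositoryException {
        ValueFactory vf = cxn.getValueFactory();
        URI alice = vf.createURI("http://example.org/alice");
        URI knows = vf.createURI("http://example.org/knows");
        URI bob   = vf.createURI("http://example.org/bob");
        URI since = vf.createURI("http://example.org/since");

        // The base statement: a "link" (edge) between two "vertices".
        cxn.add(alice, knows, bob);

        // A statement describing that link. Since the SID may appear in
        // any position of a statement, links about links work the same way.
        Resource sid = vf.createBNode(); // stand-in for the statement identifier
        cxn.add(sid, since, vf.createLiteral(2006));
    }
}
```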
The SIDs mode is extremely efficient. A statement is represented as just its “{s,p,o}” tuple using the internal values (IVs) for that statement. This means that there is no indirection through indices when traversing in reverse from a statement about a statement to the statement being described.
However, all scalable (persistence-class) graph databases use indices. Even if the object identifier is represented as an integer, that integer is still indirected through an object index (and through the file system) in order to resolve the object. The GPO model caches all weakly referenced objects in RAM, so once an object has been retrieved, traversal is O(1); but access to disk is never less than O(log n), since an updatable data model always implies indices.
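For illustration, here is a minimal sketch of a weak-value cache along those lines. All names are hypothetical, and the real GPO cache is certainly more involved; the point is just where the O(1) and O(log n) costs fall:

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

/**
 * Simplified sketch of a weak-value object cache. On a hit the object is
 * returned in O(1); on a miss it must be read back through the object
 * index, which is O(log n) on disk. The Loader interface is hypothetical.
 */
class GpoCache<T> {
    interface Loader<T> { T load(long oid); } // stands in for the index read

    private final Map<Long, WeakReference<T>> cache = new HashMap<Long, WeakReference<T>>();

    T get(long oid, Loader<T> loader) {
        WeakReference<T> ref = cache.get(oid);
        T obj = (ref == null) ? null : ref.get();
        if (obj == null) {
            obj = loader.load(oid); // O(log n): indirection through the object index
            cache.put(oid, new WeakReference<T>(obj));
        }
        return obj; // subsequent traversals to this object are O(1)
    }
}
```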
There is a long history (read: war) between the network (CODASYL) and relational database camps. To my mind, both groups had useful things to say. The main argument of the relational group was that independence between the logical representation and the physical data model was necessary and allowed for declarative query semantics, which in turn allowed for sophisticated query optimization. Query optimizers can generally produce query plans that do as well as all but the very best hand-coded queries. The network database camp pointed out the flexibility of their data model and eventually showed that it was possible to produce declarative query languages for network databases. The issue was eventually settled in the marketplace, with the relational model taking the lead for several decades. See “What Goes Around Comes Around” by Stonebraker and Hellerstein for a somewhat slanted take on all of this.
Object databases and graph databases are very closely related to the earlier network databases. The same benefits (schema flexibility) and cautions (lack of a declarative query model) apply. APIs such as Blueprints can provide great convenience, but they force all query optimization onto the application writer.
Bigdata puts a lot of effort into query optimization. The most obvious place is simply join ordering. If you want to traverse from some vertex through some edges to some set of vertices, bigdata will take fast range counts on the access paths and decide on a join order for that computation which can be several orders of magnitude faster than naive traversal. Bigdata can also use hash joins and variable pruning to tremendously speed up queries that visit intermediate vertex sets which are not required in the final solution set. This is possible through a combination of a high-level declarative query language (SPARQL) and dedicated query optimization code. When using bigdata in its SIDs (aka provenance, aka graph database) mode, you get all of the benefit of that performance for “path traversal” in a “graph”. And you can have 50 billion edges in a graph on a single machine and efficiently scale that graph out across a cluster of machines. All in open source.
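As a simple example (the URIs are made up), here is a two-hop “friend of a friend” traversal expressed declaratively through the standard Sesame repository API. Because the query is declarative, the database, not the application, picks the join order, e.g. by taking fast range counts on the two triple patterns and starting with the more selective one:

```java
import org.openrdf.query.QueryLanguage;
import org.openrdf.query.TupleQuery;
import org.openrdf.query.TupleQueryResult;
import org.openrdf.repository.RepositoryConnection;

/**
 * Illustrative two-hop traversal as a declarative SPARQL query. The join
 * order across the two triple patterns is left to the query optimizer.
 */
public class JoinOrderSketch {
    public static void friendsOfFriends(RepositoryConnection cxn) throws Exception {
        String query =
            "SELECT ?fof WHERE {\n" +
            "  <http://example.org/alice> <http://example.org/knows> ?f .\n" +
            "  ?f <http://example.org/knows> ?fof .\n" +
            "}";
        TupleQuery q = cxn.prepareTupleQuery(QueryLanguage.SPARQL, query);
        TupleQueryResult result = q.evaluate();
        try {
            while (result.hasNext()) {
                System.out.println(result.next().getValue("fof"));
            }
        } finally {
            result.close();
        }
    }
}
```

Note that the intermediate variable ?f is not in the projection, which is exactly the situation where variable pruning pays off.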
There are graph traversal patterns that do not fit neatly into a high-level query language without loop constructs. SPARQL actually does provide for some of these via property paths, which we are in the process of building into bigdata. However, you can also drop down to the SAIL API with bigdata and run access paths based on triple patterns which correspond more or less directly to the vertex-edge traversal patterns of Blueprints, and which support not only link attributes but also links for links.
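Here is a sketch of that style of traversal against the generic SailConnection API (the vertex and edge-label URIs would be your own). The triple pattern (vertex, edgeLabel, null) enumerates the out-edges of a vertex, much like a Blueprints getOutEdges() call:

```java
import info.aduna.iteration.CloseableIteration;
import org.openrdf.model.Resource;
import org.openrdf.model.Statement;
import org.openrdf.model.URI;
import org.openrdf.sail.SailConnection;
import org.openrdf.sail.SailException;

/**
 * Sketch of vertex-edge traversal using the generic SailConnection API:
 * a triple pattern enumerates the out-edges of a vertex.
 */
public class TraversalSketch {
    public static void visitOutEdges(SailConnection cxn, Resource vertex, URI edgeLabel)
            throws SailException {
        CloseableIteration<? extends Statement, SailException> itr =
            cxn.getStatements(vertex, edgeLabel, null, false /* includeInferred */);
        try {
            while (itr.hasNext()) {
                Statement edge = itr.next();
                System.out.println(vertex + " --" + edgeLabel + "--> " + edge.getObject());
            }
        } finally {
            itr.close();
        }
    }
}
```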
There is no native Blueprints implementation for bigdata. You can certainly try the Blueprints-to-SAIL integration against the BigdataSail, but I would also encourage people to try running bigdata in its SIDs mode and enjoy the performance you can get from optimized high-level query against a high-performance graph database. If you need vertex/edge traversal, you can get that from the SAIL, but you will have much higher performance if you avoid the RDF Value materialization step and stay within the bigdata native API.