Monthly Archives: January 2012

Bigdata 1.0.4 (maintenance release)

This is a 1.0.x maintenance release of bigdata(R). New users are encouraged to go directly to the 1.1.0 release. Bigdata is a horizontally-scaled, open-source architecture for indexed data with an emphasis on RDF capable of loading 1B triples in under one hour on a 15 node cluster. Bigdata operates in both a single machine mode (Journal) and a cluster mode (Federation). The Journal provides fast scalable ACID indexed storage for very large data sets, up to 50 billion triples / quads. The federation provides fast scalable shard-wise parallel indexed storage using dynamic sharding and shard-wise ACID updates and incremental cluster size growth. Both platforms support fully concurrent readers with snapshot isolation.

Distributed processing offers greater throughput but does not reduce query or update latency. Choose the Journal when the anticipated scale and throughput requirements permit. Choose the Federation when the administrative and machine overhead associated with operating a cluster is an acceptable tradeoff to have essentially unlimited data scaling and throughput.

See [1,2,8] for instructions on installing bigdata(R), [4] for the javadoc, and [3,5,6] for news, questions, and the latest developments. For more information about SYSTAP, LLC and bigdata, see [7].

Starting with the 1.0.0 release, we offer a WAR artifact [8] for easy installation of the single machine RDF database. For custom development and cluster installations we recommend checking out the code from SVN using the tag for this release. The code will build automatically under eclipse. You can also build the code using the ant script. The cluster installer requires the use of the ant script.

You can download the WAR from:

You can checkout this release from:

Feature summary:

– Single machine data storage to ~50B triples/quads (RWStore);
– Clustered data storage is essentially unlimited;
– Simple embedded and/or webapp deployment (NanoSparqlServer);
– Triples, quads, or triples with provenance (SIDs);
– 100% native SPARQL 1.0 evaluation with lots of query optimizations;
– Fast RDFS+ inference and truth maintenance;
– Fast statement level provenance mode (SIDs).

Road map [3]:

– High-volume analytic query and SPARQL 1.1 query, including aggregations;
– SPARQL 1.1 Update, Property Paths, and Federation support;
– Simplified deployment, configuration, and administration for clusters; and
– High availability for the journal and the cluster.

Change log:

Note: Versions with (*) require data migration. For details, see [9].


– (Logger for RWStore transaction service and recycler)
– (RWStore does not track tx release correctly)
– (Thread-local cache combined with unbounded thread pools causes effective memory leak: termCache memory leak & thread-local buffers)


– (BTreeCounters does not track bytes released)
– (Refactor performance counters using accessor interface)
– (B+Tree should delete bloom filter when it is disabled.)
– (RWStore does not prune the CommitRecordIndex)
– (Persistent memory leaks (RWStore/DISK))
– (FastRDFValueCoder2: ArrayIndexOutOfBoundsException)
– (Release age advanced on WORM mode journal)
– (Add a DELETE by access path method to the NanoSparqlServer)
– (Add “context-uri” request parameter to specify the default context for INSERT in the REST API)
– (log4j configuration error message in WAR deployment)
– (Add a fast range count method to the REST API)
– (Support temp triple store wrapped by a BigdataSail)
– (NQuads support for NanoSparqlServer)
– (Bug fix to DEFAULT_RDF_FORMAT for bulk data loader in scale-out)
– (Support either lockfile (procmail) and dotlockfile (liblockfile1) in scale-out)
– (BigdataSail#getReadOnlyConnection() race condition with concurrent commit)
– (Address is 0L)
– (TestMROWTransactions failure in CI)


– (Query time expansion of (foo rdf:type rdfs:Resource) drags in SPORelation for scale-out.)
– (Scale-out LUBM “how to” in wiki and build.xml are out of date.)
– (Query not terminated by error.)
– (NamedGraph pattern fails to bind graph variable if only one binding exists.)
– (IRunningQuery not closed promptly.)
– (DataLoader fails to load resources available from the classpath.)
– (Support for the streaming of bigdata IBindingSets into a sparql query.)
– (ClosedByInterruptException during heavy query mix.)
– (NotSerializableException for SPOAccessPath.)
– (Change dependencies to Apache River 2.2.0)

1.0.1 (*)

– (Unicode clean schema names in the sparse row store).
– (TermIdEncoder should use more bits for scale-out).
– (OSX requires specialized performance counter collection classes).
– (BigdataValueFactory.asValue() must return new instance when DummyIV is used).
– (TermIdEncoder limits Journal to 2B distinct RDF Values per triple/quad store instance).
– (SPO not Serializable exception in SIDS mode (scale-out)).
– (ClassCastException when querying with binding-values that are not known to the database).
– (UnsupportedOperatorException for some SPARQL queries).
– (Query failure when comparing with non materialized value).
– (RWStore reports “FixedAllocator returning null address, with freeBits”.)
– (NamedGraph pattern fails to bind graph variable if only one binding exists.)
– (log4j – slf4j bridge.)

For more information about bigdata, please see the following links:


About bigdata:



Ganglia integration

Ganglia is one of the first and most scalable performance metric platforms (known scaling up to 2000 nodes) and also handles aggregations of clusters (grids). There has also been a lot of work recently on improving the web facing UIs for graphing performance metrics. The data are collected on each host by gmond. By default, each gmond instance discovers and listens to all other hosts on the network using UDP multicast and builds up a soft state model of the metrics for the entire cluster. A gmetad instance running on one (or more) nodes joins the ganglia network, aggregates the metrics, and writes them onto RRDtool files. A web UI may then be used to graph the performance metrics. In the new web UI, this is done using both RRDtool and flot (flot is used for interactive graphics). In addition to performance metric collection, there are a variety of hooks which can be used to integrate ganglia with nagios so you can integrate performance metrics into your monitoring solution as well.

Bigdata has long had an internal hierarchical metrics collection mechanism. Metrics are reported out using XML on both an embedded HTTP server (per service) and aggregated and reported out by the load balancer service (LBS). The aggregated metrics are available for plotting and sample Excel worksheets are available which show you how to do exactly that. You can also run scripts over the historical metrics data to obtain detailed histories of performance metrics. There are pluses and minuses to this approach. On the plus side, bigdata retains detailed performance metrics, collects a LOT of data on disk, ram, cpu and application metrics, and provides some sophisticated table oriented views of those metrics for plotting. Bigdata also collects “events”, which can be graphed by the LBS using flot. On the down side, the Excel “integration” is a clunky so we have been looking around for a way to leverage work that other people have done on metrics collection and reporting.

We’ve recently put together an embedded GangliaService for Java under the Apache 2.0 license. Unlike the other efforts we could find (hadoop-commons, embedded-ganglia), the GangliaService is a full ganglia peer. It both listens to the ganglia 3.1 protocol and reports out metrics from the application and/or host using the same protocol. It could even be used to run ganglia on operating systems where there is no existing port (for example, on Windows where we have a typeperf integration which collects platform level statistics). Since the GangliaService is a ganglia peer, it maintains the soft state of the metrics for the entire cluster and can produce load balanced report for the cluster just as if you were using gstat. However, the load balanced reports produced by the GangliaService can be customized to use different metrics and scoring routines and can be directly accessed and leveraged by the Java application.

The main entry point is GangliaService. It is trivial to setup with defaults and you can easily register your own metrics collection classes to report out on your application.

GangliaServer service = new GangliaService("MyService");
// Register to collect metrics.
service.addMetricCollector(new MyMetricsCollector());
// Join the ganglia network; Start collecting and reporting metrics.;

The following will return the default load balanced report, which contains exactly the same information that you would get from gstat -a. You can also use an alternative method signature to get a report based on your own list of metrics and/or have the report sorted by the metric (or even a synthetic metric) of your choice.

IHostReport[] hostReport = service.getHostReport();

The bigdata-ganglia module is available in the 1.1.0 branch in SVN. The bigdata-ganglia JAR is included in the bigdata 1.1.1 release. You can also get the bigdata-ganglia source and JAR here.

This version does not yet include the typeperf, vmstat, pidstat and similar host and service level collection modules. We plan to refactor those out of the core bigdata code base soon, but they currently collect and report using bigdata’s internal hierarchical counter set data model rather than the flatter ganglia metrics data model.

The GangliaService is not yet integrated in bigdata 1.1.1, but we will be taking that step shortly as part of our effort to simplify the deployment and management of a bigdata federation. First, we will use the GangliaService to replace the centralized LBS with a ganglia network. We have a bit of work to do still since bigdata also reports events to the LBS and uses a custom load balanced metric. However, ganglia is beginning to offer support for JSON based event descriptions which can then be overlaid on the generated graphs. Beyond replacing the LBS, we plan to leverage this effort again when we decompose the MetadataService (aka MDS aka shard locator) into a P2P service running on the DataService (DS) nodes. This will offer significantly better scaling and will remove one of the main barriers to petabyte scale for bigdata.