In this post, I will go into why there is a “big graph anti-pattern”, the fundamentally different kinds of graph processing, how to match the technology to the problem, and some successful patterns for scalable graph processing.
The big graph anti-pattern is “Throw everything into a big graph and then use the same tools that gave us horizontal scaling for other problems: map/reduce and key-value stores.”
There are several fallacies here. First, there are many types of graph processing and you cannot use the same architecture to scale all of them. This is the main focus of this post and I will go into detail on why this does not work below. Second, the advantage of throwing the data together is that you can move on to finding information immediately. However, throwing the data together does not eliminate the schema alignment problem. It just lets you choose when you are going to deal with it and how much effort you will put into it. You still need to understand the data and the analytics to interpret either one. Lastly, if you are using statistical models, then you can run into problems if there are too many variables and your model winds up lacking predictive power. You are better off focusing on the information that you need to make a specific decision rather than allowing statistical algorithms to go off on a fishing expedition.
There are some fundamental architectural differences in systems for high performance graph traversal and graph analytics, systems for high performance graph pattern matching, map/reduce platforms, and key-value stores. If you only test at a small data scale, the scaling properties of these different architectures are not as evident and your benchmarking will fail to predict the actual performance characteristics of these technologies on larger graphs. As the data scale increases, the differences become significant and determine what does and does not scale.
The only way to get scaling and high throughput for graph traversal and graph mining is to get the architecture, the software, and the hardware right. If you make the wrong choices, the communications costs change from O(N) (using 2D partitioning) to O(N*N) (the best case using any other approach). These problems are bandwidth limited, so you can’t just throw CPU cores and main memory at them. Efficient parallel graph algorithms are hard and good implementations will bottleneck at the CPU memory bus – beyond that you get negative scaling – your system just slows down as you add more CPU cores. Another rarely appreciated bottleneck is the high cost of sequential code. Any sequential code in your algorithm, including between iterations, has a huge negative impact on throughput since all the other cores are sitting idle. This is especially true for graphs with a high diameter, such as bitcoin transaction data or road maps.
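The cost of sequential code follows directly from Amdahl's law. As a rough sketch (the 5% sequential fraction below is purely illustrative, not a figure from any benchmark):

```python
def amdahl_speedup(seq_fraction, cores):
    """Amdahl's law: speedup = 1 / (s + (1 - s) / p).

    Even a small sequential fraction s caps the speedup available from
    p parallel cores, which is why sequential code between iterations
    hurts so badly while the other cores sit idle."""
    return 1.0 / (seq_fraction + (1.0 - seq_fraction) / cores)

perfect = amdahl_speedup(0.0, 64)   # 64x: no sequential code at all
capped = amdahl_speedup(0.05, 64)   # ~15x: 5% sequential code (illustrative)
```

With many iterations on a high diameter graph, even that 5% compounds, since the sequential step runs between every pair of parallel phases.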
Kinds of graph processing. There are several very different types of “graph” operations:
- Gathered reads for property and link set retrieval. This is a parallel workload for the random retrieval of attribute sets from known vertices and edges of interest. This is a good fit for a key-value store.
- Graph traversal algorithms, such as BFS (breadth first search) and SSSP (shortest paths). The workload for these algorithms requires visiting each vertex in the graph at least once through a series of iterative 1-hop expansions over the graph. The size of the frontier (the vertices to be visited in the next round) often grows exponentially before eventually decaying, so the solution must be able to handle both small and large frontiers efficiently. The number of iterations depends strongly on the depth of the graph. Social networks may have a depth of 6. Road networks or bitcoin transaction data can have a depth of 10,000. This class of algorithms often does very little work per edge and vertex visited, so it places an extreme burden on the memory bus. This extreme workload is why BFS is used for the Graph 500. Disk is not a good fit here.
- Graph analytic algorithms, such as page rank, k-means, etc. These algorithms all have a workload that requires multiple full visitations of the entire graph. The frontier starts out with all vertices and then slowly drops towards zero as the algorithm converges. You have to read all the data, multiple times. Again, disk is not a good fit here. You need memory bandwidth. Lots of memory bandwidth. More than you can get from a CPU.
- Graph query (aka graph pattern matching). This involves matching specific patterns among the vertices, edges, and their labels in the graph using a high level query language such as SPARQL or Cypher (neo4j’s graph query language). Performance here depends on (a) a good query optimizer to reorder the joins in order to minimize the required effort; (b) propagating constraints from one join to the next in order to read less data and solve the query more rapidly; and (c) a good choice of indices (nice graph databases do not make you guess at what indices you need, they handle this automatically). This is one case where disk is a good fit. Index data structures can be used to rapidly identify a subset of the data to be read, but memory bandwidth is still a bottleneck for databases – this is the whole reason behind the emergence of the column-wise storage (something that we will introduce into bigdata this year).
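To make the traversal workload concrete, here is a minimal level-synchronous BFS (a sketch in Python, not code from any of the systems discussed; the adjacency list and toy graph are invented for illustration). Note how the frontier grows and then collapses:

```python
def bfs_levels(adj, source):
    """Level-synchronous BFS: one 1-hop expansion per iteration.

    Returns each vertex's depth and the frontier size per level; the
    frontier typically grows sharply before it decays."""
    depth = {source: 0}
    frontier = [source]
    frontier_sizes = []
    level = 0
    while frontier:
        frontier_sizes.append(len(frontier))
        next_frontier = []
        for u in frontier:
            for v in adj.get(u, ()):
                if v not in depth:          # visit each vertex only once
                    depth[v] = level + 1
                    next_frontier.append(v)
        frontier = next_frontier
        level += 1
    return depth, frontier_sizes

# Toy graph (invented): a fan-out that re-converges on vertex 7.
adj = {0: [1, 2], 1: [3, 4], 2: [5, 6], 3: [7], 4: [7], 5: [7], 6: [7]}
depth, sizes = bfs_levels(adj, 0)           # sizes shows growth, then decay
```

The number of `while` iterations is exactly the traversal depth, which is why an 8,000-deep graph is so punishing on platforms with high per-iteration overhead.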
Graph databases (or at least high level query against a graph database) and graph mining systems have fundamentally different workloads and require different techniques. When it comes to gathered property set retrieval, Accumulo, Cassandra, and related key-value stores are all very similar technologies. Property set lookups can be parallelized against any of them once the desired vertex set is known, so choose whatever works for you. However, key-value stores are not able to provide efficient graph query or efficient graph mining/traversal. I will try to explain why below. First, I will focus on why they can not be used to create scalable graph traversal and graph analytic solutions.
Graph traversal. For graph traversal, you need forward and reverse indices to follow links. A graph database normally carries at least a forward and reverse index and can therefore be used for graph traversal, but this is not efficient and it is not scalable.
There are a few problems. First, graph traversal algorithms typically need to visit large parts of the graph in multiple iterations. This is not efficient against disk if there is any random access. (GraphChi is an example that uses an IO efficient solution against disk. However, it must read the entire graph in each iteration. While it scales well on a single machine, it cannot be used for low latency operations, is a poor choice for algorithms such as BFS or SSSP, and it does not have the throughput of a main memory or GPU based solution.) Second, the approach to horizontal scaling for graph traversal must be based on what is variously called vertex cuts (GraphLab uses this nomenclature) or a 2D decomposition (this is the terminology in the HPC space, which has been doing this for years for sparse matrix vector operations, which are very similar to graph operations).
In a 2D decomposition, the access to the links of the graph is decomposed against a 2-dimensional compute grid over virtual nodes. The vertices are organized into the rows and columns of the compute topology by using the same partitions for both dimensions. The rows provide access to the out-edges of the graph. The columns provide access to the in-edges of the graph. The diagonal consists of those edges whose source and target vertices fall into the same partition. Using a naïve partitioning strategy, the vertices are assigned to partitions by dividing the vertex identifier by the number of rows/columns in the compute grid. Graph aware partitioning can be used to increase the local density and interconnectedness of those partitions by what amounts to relabeling the vertex identifiers.
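A minimal sketch of the naïve layout just described (illustrative Python only; real systems store the edge blocks far more compactly):

```python
import math

def make_2d_layout(edges, num_vertices, p):
    """Sketch of a 2D (vertex-cut) layout over a p x p compute grid.

    Vertices are split into p contiguous blocks (the naive strategy);
    edge (src, dst) lands on grid block (partition(src), partition(dst)).
    Row i then holds all out-edges of partition i, column j holds all
    in-edges of partition j, and the diagonal holds the edges whose
    endpoints fall into the same partition."""
    block_size = math.ceil(num_vertices / p)
    part = lambda v: v // block_size
    grid = {(i, j): [] for i in range(p) for j in range(p)}
    for src, dst in edges:
        grid[(part(src), part(dst))].append((src, dst))
    return grid

# Toy graph on 8 vertices over a 2x2 grid: partitions {0..3} and {4..7}.
edges = [(0, 1), (0, 5), (2, 6), (4, 3), (7, 7)]
grid = make_2d_layout(edges, num_vertices=8, p=2)
```

With this layout, gathering the in-edges of one partition touches only the p blocks in that partition's column, which is what keeps the communication pattern regular.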
To gather the in edges for all vertices in the frontier whose vertex identifier falls into a given partition, the operation is decomposed into a local operation on each compute node in the corresponding column of the 2D compute topology. The intermediate per-compute node results are then aggregated back to the compute node processing that partition. This parallelizes the effort across the column. The scatter over the out edges is pretty much the same, but the operation is parallelized over a row of the 2D compute topology. GraphLab and HPC sparse matrix vector multiplication systems all use this approach. So does the system that took first place in the Graph 500.
A 2D decomposition addresses two basic problems. The first is co-location of the link weights in the forward and reverse traversal directions. This is vital for algorithms that modify the link weights during traversal. Without a 2D decomposition, link updates can be 1:1 with only one of the indices; on the other index they have a random access pattern. This causes a severe bottleneck. The second is communication volume: a 2D decomposition minimizes it by decomposing the gather and scatter phases over a row or column of the 2D compute grid. Compute nodes outside of the row (scatter) or column (gather) do not participate in that operation. This means that the communication pattern is both efficiently parallelized and regular.
Graph databases and Blueprints cannot scale well for graph traversal or graph analytics. First, graph databases need to minimize the data read for efficient high level query, so they use a very different data partitioning scheme (not 2D). While they can be used for graph traversal algorithms on small data sets, the approach is simply not scalable for the reasons outlined above, e.g., O(N^2) communications plus the high latency associated with the disk. Client-based graph traversal APIs, such as Blueprints, face another problem as well – the client is doing round trips with the server/cluster. This is not only a throughput bottleneck, but it also limits the size of the frontier and the computation state to the memory of the client.
You might ask, can’t I get away with using a graph database and Blueprints? Not if you want your solution to scale. The promise of simple scaling that we have for key-value stores and map/reduce simply does not hold for graphs. You must be using the right technology to get beyond toy problems. How can you tell if you are using the right technology? Look at the data layout – if it is a key-value store, it will not scale for graph traversal or graph analytics. Look at the graph computation – if it is guided by a client API such as Blueprints, it will not scale. Look at the platform – if it has huge latencies for each iteration, such as Hadoop, it will not scale to graphs with large diameters. (With one minute of overhead per map/reduce job and 8,000 iterations for BFS on bitcoin, you would have to wait 8,000 minutes just for the job scheduling overhead on Hadoop – MPGraph does the entire computation in 300ms.)
Graph query. There are a few reasons why the graph traversal and graph analytics architectures do not perform well for graph query. The main reason is that each join needs to be distributed across a row or column of the 2D compute topology. While this is efficient for graph traversal, it is not efficient for low latency graph pattern matching queries.
The way to make graph pattern matching queries fast is to optimize the join ordering (the most selective access path is run first), read as little data as possible, and then feed constraints from that join into the remaining joins. This can be done either by passing along intermediate solutions containing variable bindings discovered in the data and using nested indexed joins (bigdata does this) or through sideways information passing that lets you skip parts of an access path that are provably unable to join (RDF3X does this). Either way, you execute the joins in order of their selectivity and pass along constraints that allow you to avoid reading most of the data. This is how to achieve low latency for high level declarative graph query languages. This is also why people have such difficulties building scalable high level query solutions over existing key-value stores. These architectures do not provide ways to constrain the access paths based on the data already read in previous joins. As a result, they wind up sending all the data from each access path back to the client, which is then forced to do the join locally in memory on the client. This approach reads way too much data, slams the network, and slams the client. For example, this is why Rya cannot scale for graph query. Accumulo does not let Rya flow the query over the data.
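The join-ordering and binding-propagation idea can be sketched as a toy triple matcher (this is not bigdata's or RDF3X's implementation; real engines use indices and cost-based optimizers rather than the linear scans below):

```python
def query(triples, patterns):
    """Toy selectivity-ordered nested joins over (s, p, o) triples.

    Strings starting with '?' are variables. Each join consumes the
    bindings found by the previous one, so later access paths only
    accept rows compatible with earlier results."""
    def matches(triple, pattern, binding):
        b = dict(binding)
        for t, q in zip(triple, pattern):
            q = b.get(q, q)          # substitute already-bound variables
            if q.startswith('?'):
                b[q] = t             # bind a free variable
            elif q != t:
                return None          # constant mismatch: no join
        return b

    # Crude selectivity estimate: run the most constrained pattern first.
    def cardinality(pattern):
        return sum(matches(t, pattern, {}) is not None for t in triples)

    solutions = [{}]
    for pattern in sorted(patterns, key=cardinality):
        solutions = [b2 for b in solutions for t in triples
                     if (b2 := matches(t, pattern, b)) is not None]
    return solutions

triples = [("alice", "knows", "bob"), ("bob", "knows", "carol"),
           ("carol", "type", "Admin")]
# Who knows an Admin? The 'type' pattern is more selective, so it runs
# first; the 'knows' join then only accepts rows where ?y is already bound.
result = query(triples, [("?x", "knows", "?y"), ("?y", "type", "Admin")])
```

In a real engine the inner loop would be an index probe keyed by the propagated bindings, which is exactly what lets it avoid reading most of the data.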
The 2D approach is not well suited to graph query because you have to read on all compute nodes in a row/column of the 2D compute topology to access the link set for a vertex. In contrast, that data is co-located in the forward and reverse indices of a graph database. This allows less inter-node communication for graph query access paths.
What if we had a hybrid system that maintained both the forward and reverse indices and the 2D layout so it could answer low latency graph queries and also provide efficient and scalable graph traversal? So far I have not seen any architectures that maintain both kinds of indices, but this could be interesting. There might be high data volume queries (queries where we intrinsically need to read a lot of data, such as rollups over the entire graph) where we could accelerate the query using the 2D partitions.
When should you scale-out a graph database? A graph database with a decent query optimizer should be able to handle upwards of 10-50 billion edges on a single machine and provide low latency query. The main enabling points are a good query optimizer and fast disk (SSD or PCIe flash) since graph query will result in random read IO patterns on the disk. The random IO pattern occurs because the index pages are not laid out in key order on the disk. Bigdata also provides a high 9s open source deployment with linear scaling in query throughput as a function of the size of the replication cluster.
Horizontally scaled graph databases have inter-node communication overhead and are slower for most low-latency queries. The main reason to scale-out a graph database is because you need throughput for data load (billions of edges per hour) and you need to run queries that do rollups over all that data. If you can partition your graph along lines that make sense for your business such that most queries run inside of a single partition, you can often get much higher performance from a pool of graph databases each servicing a different partition of the data. When necessary, you can use federated query to read across those partitions (bigdata builds in support for federated query).
We develop two complementary kinds of open source graph technologies that target fundamentally different kinds of graph problems. One project (MPGraph) provides graph traversal and graph analytics on GPUs. The other (bigdata) is a graph database that supports graph pattern matching using a high level query language. The GPU approach represents the best known technique for high performance graph traversal and graph analytics and outperforms main-memory CPU solutions on machines with up to 24 cores by 5x to 500x. The GPU approach currently requires a compile time schema for the property set and link set. Our road map for that technology includes adding topology compression (up to 1 billion edges on a single card), column-wise compression of schema flexible property and link sets, and 2D decomposition onto multiple GPUs for graphs with more than 1B edges.
The bigdata graph database uses multiple indices to avoid bias in the access paths, an efficient representation of property sets, link sets, and link attributes, and a high-level query language paired with a query optimizer for efficient low-latency high level query. There is also a vertex-centric API for the bigdata graph database. Throughput of the vertex-centric API against the graph database over SSD is less than 1M traversed edges per second and has the same in-memory limits on the problem size identified above (compare this to 3 billion traversed edges per second on a GPU and you can see why we have two different approaches to graphs!). We are working to improve graph traversal throughput on the graph database using column-wise indexing to reduce IO and the CPU overhead associated with materializing edges from the index, but there is still a performance gap of several orders of magnitude when comparing a disk/index based graph mining API and a graph mining API running on a GPU. This performance gap is intrinsic. It is the difference in the bandwidth of disk, main memory, and the memory bandwidth of the GPU, which is 10x greater than the memory bandwidth of the CPU.
Match the technology to the problem:
- Key-value stores are good at property and link set retrieval since they can perform the gathered reads efficiently and in parallel.
- Memory bandwidth is the bottleneck for graph traversal and graph analytic algorithms. MPGraph excels here since the DRAM on the GPU is 10x faster than the CPU RAM. Touching the disk is a huge penalty for graph traversal, and even IO efficient approaches are bandwidth limited by the sequential transfer rate of the disk rather than the bandwidth of main memory or GPU device memory.
- High performance for graph query requires good query optimizers and flowing the query across the cluster (for scale-out). Bigdata has two different query optimizers, one which emphasizes low latency query, where the overhead of the query optimizer itself can cause low throughput, and one which uses deep sampling of the query against the data to find the minimum cost query plan for long running, data intensive queries. Constraints are propagated from join to join by flowing the intermediate solutions to each join in turn. Bigdata reads less data from the disk to answer a query because variable bindings discovered in earlier joins restrict the access paths for later joins. Due to non-locality, graph query typically turns into random IOs against the disk, so you always want to deploy the graph database over SSD or PCIe flash for fast high level query. Don’t scale-out a graph database unless you need it. Single machine or replication cluster deployments handle large graphs (50B edges) and can deliver low-latency query. Scale-out deployments build in more coordination overhead and should be undertaken only after careful examination of your requirements. Graphs do not enjoy the same simple scaling model as key-value stores and map/reduce. You have to use the right technology for your problem.
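To make the “multiple full visitations” workload of graph analytics concrete, here is a toy PageRank loop (illustrative Python only; it ignores dangling-vertex mass and convergence testing, and the 3-cycle graph is invented for the example):

```python
def pagerank(adj, n, damping=0.85, iters=20):
    """Toy PageRank: every iteration streams the whole edge set, so the
    workload is bounded by memory bandwidth, not by compute."""
    rank = [1.0 / n] * n
    out_degree = [len(adj.get(v, ())) for v in range(n)]
    for _ in range(iters):
        incoming = [0.0] * n
        for u in range(n):                  # full visitation, every pass
            for v in adj.get(u, ()):
                incoming[v] += rank[u] / out_degree[u]
        rank = [(1 - damping) / n + damping * r for r in incoming]
    return rank

# A 3-cycle: by symmetry every vertex converges to rank 1/3.
ranks = pagerank({0: [1], 1: [2], 2: [0]}, 3)
```

The inner loops do almost no arithmetic per edge touched, which is why the sustained memory bandwidth of the device, not its core count, determines throughput.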
What about YarcData? The Cray XMT and XMT2 architecture marketed and sold by YarcData as the “Urika” appliance is worth discussing in some depth (Cray and YarcData are trademarks of Cray, Inc.). The basis of the XMT architecture is a large number of relatively slow “stream” processors with zero cost switching between hardware threads. The XMT architecture scales by keeping a lot of memory transactions in flight and then switching between those memory transactions when they arrive in the queue for a stream processor with zero latency. The concept behind the XMT is to not worry about where the data is stored, but just make sure that you can keep those slow stream processors busy by having enough parallel work and moving the data around over the fast interconnect. For those who want to lay down the cash on a YarcData appliance (rumored to be well north of $1M), this is a good and scalable solution. However, it is instructive to compare the YarcData solution with a GPU. MPGraph running on a GPU at 3 billion edges per second has much more throughput than the YarcData appliance. In fact, the challenge with GPUs is that they are so fast that it is difficult to keep them busy – this is the opposite of the YarcData problem. There are two ways to make this work out in favor of the GPU, and we are pursuing both of them. First, if we apply topology compression to the data, we can get nearly ten times the number of edges into a single GPU. That means instead of 100M edges at the speed of light, we will have nearly 1B edges at the speed of light on a single card. This is more than enough for most problems for under $7k in hardware (if you buy the expensive K20 cards rather than the gamer cards), or you can rent it on EC2 for ~$400/month. Second, if we keep all of the edges of the graph in DRAM, then we can put nearly 60 billion edges (with topology compression) into a GPU compute cluster on EC2.
Rather than sending patches of the graph to the GPU, we can keep it all in DRAM and just send vertex state and frontier updates over the PCIe bus and the cluster interconnect. To my mind, the ORNL Titan supercomputer (also built by Cray) with 18,000+ NVIDIA Kepler GPUs is the ultimate graph processing machine and uses commodity hardware and GPUs.
Conclusion: Always keep in mind the bottleneck, which is either disk (graph query) or RAM (graph traversal and graph analytics). While these graph problems may look similar from the outside, they have completely different computational workloads and scaling requirements and must be addressed using different kinds of technologies. Avoid the “one big graph” anti-pattern and deploy a mixture of technologies that address the different workloads and computations that you need for your application. You may want to put everything into a key-value store for fast gathered property set retrieval, but you can’t query or traverse it efficiently there. Fast query requires a graph database with a high performance query optimizer and query engine. Fast graph traversal and graph analytics require fast memory and efficient parallelism. Think about how to partition the workload and the data to provide fast and scalable solutions. For example, could you put a social network topology entirely onto a GPU, run your analytics there, and then do gathered reads against a key-value store or graph database? This could give you full depth graph analytics over your social network in a fraction of a second, and then you can materialize the data you need from that key-value store you already have.