
Blazegraph 2.1.4 Release

We’re very pleased to announce the release of Blazegraph 2.1.4.  This is a maintenance release of Blazegraph.  See the full details here.

2.1.4 Release Fixes

  • [BLZG-533] – Vector the query engine on the native heap
  • [BLZG-2041] – BigdataSail should not locate the AbstractTripleStore until a connection is requested
  • [BLZG-2053] – Blazegraph Security Reporting Instructions
  • [BLZG-2050] – Fork Colt Libraries to remove hep.aida
  • [BLZG-2065] – Remove Autojar and Unused Ant Scripts

Download it, clone it, have it sent via carrier pigeon (transportation charges may apply).  Find a bug?  Hit JIRA.  Have a question?  Try the mailing list or contact us.

Security Reporting Procedure
Blazegraph has an updated security reporting procedure. Please see the guide for reporting security-related issues. This process is monitored daily, and all security reports are acknowledged within 24 hours. Mitigations for reported security issues are made in a reasonable timeframe, which may be as quickly as 24 hours for high-severity issues.

Github

git clone -b BLAZEGRAPH_RELEASE_2_1_4 --single-branch https://github.com/blazegraph/database.git BLAZEGRAPH_RELEASE_2_1_4
cd BLAZEGRAPH_RELEASE_2_1_4
./scripts/mavenInstall.sh
./scripts/startBlazegraph.sh

Maven Central

Blazegraph 2.1.4 is now on Maven Central.    You can also get the Tinkerpop3 API and the new BlazegraphTPFServer.

    
    <dependency>
        <groupId>com.blazegraph</groupId>
        <artifactId>bigdata-core</artifactId>
        <version>2.1.4</version>
    </dependency>
    <!-- Use if Tinkerpop 2.5 support is needed ; See also Tinkerpop3 below. -->
    <dependency>
        <groupId>com.blazegraph</groupId>
        <artifactId>bigdata-blueprints</artifactId>
        <version>2.1.4</version>
    </dependency>

Tinkerpop3

Blazegraph Tinkerpop3

Tinkerpop3 is here!  Get it from Maven Central.

   <dependency>
      <groupId>com.blazegraph</groupId>
      <artifactId>blazegraph-gremlin</artifactId>
      <version>1.0.0</version>
   </dependency>

Mavenization

Everybody loves (and hates) Maven.  Starting with the 2.0.0 release, Blazegraph has been broken into a collection of Maven artifacts.  This has enabled us to work on new features like TinkerPop3, which requires Java 8, while keeping the core platform at Java 7 to support users who are still there.  Check out the Maven Notes on the wiki for full details on the architecture, getting started with development, building snapshots, etc.  If you have a 1.5.3 version checked out in Eclipse, you will want to pay attention to Getting Started with Eclipse and allocate a little extra time for the transition.

Deployers

2.1.4 provides updates for the Debian Deployer, an RPM Deployer, and a Tarball, in addition to the blazegraph.war and blazegraph.jar archives.

GPU Acceleration

Are you interested in trying out GPU Acceleration for your Blazegraph instance? Contact us for a free trial!

Stay in touch, we’d love to hear from you.


Blazegraph 2.1.2 Release

We’re very pleased to announce the release of Blazegraph 2.1.2.  This is a maintenance release of Blazegraph.  See the full details here.

2.1.2 Release Fixes

  • [BLZG-1911] – Blazegraph 2.1 version does not work on Windows (async IO causes file lock errors)
  • [BLZG-1954] – Potential Race Condition in Latched Executor
  • [BLZG-1957] – PipelinedHashJoin defect in combination with VALUES clause

Download it, clone it, have it sent via carrier pigeon (transportation charges may apply).  Find a bug?  Hit JIRA.  Have a question?  Try the mailing list or contact us.

Github

git clone -b BLAZEGRAPH_RELEASE_2_1_2 --single-branch https://github.com/blazegraph/database.git BLAZEGRAPH_RELEASE_2_1_2
cd BLAZEGRAPH_RELEASE_2_1_2
./scripts/mavenInstall.sh
./scripts/startBlazegraph.sh

Maven Central

Blazegraph 2.1.2 is now on Maven Central.    You can also get the Tinkerpop3 API and the new BlazegraphTPFServer.

    
    <dependency>
        <groupId>com.blazegraph</groupId>
        <artifactId>bigdata-core</artifactId>
        <version>2.1.2</version>
    </dependency>
    <!-- Use if Tinkerpop 2.5 support is needed ; See also Tinkerpop3 below. -->
    <dependency>
        <groupId>com.blazegraph</groupId>
        <artifactId>bigdata-blueprints</artifactId>
        <version>2.1.2</version>
    </dependency>

Tinkerpop3

Blazegraph Tinkerpop3

Tinkerpop3 is here!  Get it from Maven Central.

   <dependency>
      <groupId>com.blazegraph</groupId>
      <artifactId>blazegraph-gremlin</artifactId>
      <version>1.0.0</version>
   </dependency>

Blazegraph-Based TPF Server
The Blazegraph-Based TPF Server is a Linked Data Fragment (LDF) server that provides a Triple Pattern Fragment (TPF) interface using the Blazegraph graph database as the backend.  It was originally developed by Olaf Hartig and is being released via Blazegraph under the Apache 2 license.   See here to get started.

Mavenization

Everybody loves (and hates) Maven.  Starting with the 2.0.0 release, Blazegraph has been broken into a collection of Maven artifacts.  This has enabled us to work on new features like TinkerPop3, which requires Java 8, while keeping the core platform at Java 7 to support users who are still there.  Check out the Maven Notes on the wiki for full details on the architecture, getting started with development, building snapshots, etc.  If you have a 1.5.3 version checked out in Eclipse, you will want to pay attention to Getting Started with Eclipse and allocate a little extra time for the transition.

Deployers

2.1.2 provides updates for the Debian Deployer, an RPM Deployer, and a Tarball, in addition to the blazegraph.war and blazegraph.jar archives.

GPU Acceleration

Are you interested in trying out GPU Acceleration for your Blazegraph instance? Contact us for a free trial!

Much, much, more….

There are also many other features, including improved data loading and support for building custom vocabularies. Check it out.

Stay in touch, we’d love to hear from you.


Blazegraph 2.1.1 Release!

We’re very pleased to announce the release of Blazegraph 2.1.1.  This is a maintenance release of Blazegraph.  See the full details here.

Download it, clone it, have it sent via carrier pigeon (transportation charges may apply).  Find a bug?  Hit JIRA.  Have a question?  Try the mailing list or contact us.

Github

git clone -b BLAZEGRAPH_RELEASE_2_1_1 --single-branch https://github.com/blazegraph/database.git BLAZEGRAPH_RELEASE_2_1_1
cd BLAZEGRAPH_RELEASE_2_1_1
./scripts/mavenInstall.sh
./scripts/startBlazegraph.sh

Bug Fixes for the Lucene 5.5.0 Update
Blazegraph 2.1.1 provides fixes for Lucene 5.5.0 support. There’s a guide to reindexing with the updated Lucene tokenizers here. It also now includes support for adding a text index to an existing namespace without reloading.

GeoSpatial Searching

Did you try the GeoSpatial features in 2.1.0? 2.1.1 has some important updates and fixes. The full details are on the Wiki. As a quick start, you can configure your namespace to enable geo-spatial:

com.bigdata.rdf.store.AbstractTripleStore.geoSpatial=true

Add some data with geospatial information:

@prefix geoliteral: <http://www.bigdata.com/rdf/geospatial/literals/v1#> .
@prefix example: <http://www.example.com/> .

example:Oktoberfest-2013
    rdf:type example:Fair ;
    rdfs:label "Oktoberfest 2013" ;
    example:happened "48.13188#11.54965#1379714400"^^geoliteral:lat-lon-time ;
                example:city example:Munich .

example:RAR-2013
    rdf:type example:Festival ;
    rdfs:label "Rock am Ring 2013" ;
    example:happened "50.33406#6.94259#1370556000"^^geoliteral:lat-lon-time ;
                example:city example:Nuerburg .

Then issue some queries to get started!

PREFIX geoliteral: <http://www.bigdata.com/rdf/geospatial/literals/v1#>
PREFIX geo: <http://www.bigdata.com/rdf/geospatial#>
PREFIX example: <http://www.example.com/>

SELECT * WHERE {
  SERVICE geo:search {
    ?event geo:search "inCircle" .
    ?event geo:searchDatatype geoliteral:lat-lon-time .
    ?event geo:predicate example:happened .
    ?event geo:spatialCircleCenter "48.13743#11.57549" .
    ?event geo:spatialCircleRadius "100" . # default unit: Kilometers
    ?event geo:timeStart "1356994800" . # 01.01.2013, 00:00:00
    ?event geo:timeEnd "1388530799" .   # 31.12.2013, 23:59:59
  }
}

Maven Central

Blazegraph 2.1.1 is now on Maven Central.    You can also get the Tinkerpop3 API and the new BlazegraphTPFServer.

    
    <dependency>
        <groupId>com.blazegraph</groupId>
        <artifactId>bigdata-core</artifactId>
        <version>2.1.1</version>
    </dependency>
    <!-- Use if Tinkerpop 2.5 support is needed ; See also Tinkerpop3 below. -->
    <dependency>
        <groupId>com.blazegraph</groupId>
        <artifactId>bigdata-blueprints</artifactId>
        <version>2.1.1</version>
    </dependency>

Tinkerpop3

Blazegraph Tinkerpop3

Tinkerpop3 is here!  Get it from Maven Central.

   <dependency>
      <groupId>com.blazegraph</groupId>
      <artifactId>blazegraph-gremlin</artifactId>
      <version>1.0.0</version>
   </dependency>

Blazegraph-Based TPF Server
The Blazegraph-Based TPF Server is a Linked Data Fragment (LDF) server that provides a Triple Pattern Fragment (TPF) interface using the Blazegraph graph database as the backend.  It was originally developed by Olaf Hartig and is being released via Blazegraph under the Apache 2 license.   See here to get started.

Mavenization

Everybody loves (and hates) Maven.  Starting with the 2.0.0 release, Blazegraph has been broken into a collection of Maven artifacts.  This has enabled us to work on new features like TinkerPop3, which requires Java 8, while keeping the core platform at Java 7 to support users who are still there.  Check out the Maven Notes on the wiki for full details on the architecture, getting started with development, building snapshots, etc.  If you have a 1.5.3 version checked out in Eclipse, you will want to pay attention to Getting Started with Eclipse and allocate a little extra time for the transition.

Deployers

2.1.1 provides updates for the Debian Deployer, an RPM Deployer, and a Tarball, in addition to the blazegraph.war and blazegraph.jar archives.

GPU Acceleration

Are you interested in trying out GPU Acceleration for your Blazegraph instance? Contact us for a free trial!

Much, much, more….

There are also many other features, including improved data loading and support for building custom vocabularies. Check it out.

Stay in touch, we’d love to hear from you.


Blazegraph 2.1.0 Release!

We’re very pleased to announce the release of Blazegraph 2.1.0, a major release.  It brings some very exciting changes: GeoSpatial searching, an update to Lucene 5.5.0, support for online backup via the REST API, JSON-LD support, and much more.

Download it, clone it, have it sent via carrier pigeon (transportation charges may apply).  Find a bug?  Hit JIRA.  Have a question?  Try the mailing list or contact us.

Github

We love Sourceforge, but today it’s important to be on GitHub as well.  This release will be on both GitHub and Sourceforge.  We’ll continue to use Sourceforge for distribution, mailing lists, etc., but will migrate toward GitHub for the open source releases.

git clone -b BLAZEGRAPH_RELEASE_2_1_0 --single-branch https://github.com/blazegraph/database.git BLAZEGRAPH_RELEASE_2_1_0
cd BLAZEGRAPH_RELEASE_2_1_0
./scripts/mavenInstall.sh
./scripts/startBlazegraph.sh

Update to Lucene 5.5.0 Version
Blazegraph 2.1.0 supports Lucene 5.5.0. There’s a guide to reindexing with the updated Lucene tokenizers here. It also now includes support for adding a text index to an existing namespace without reloading.

Using Pubchem with Blazegraph
Blazegraph 2.1.0 now includes a custom vocabulary for use with Pubchem. You can check out a quick guide at Blazegraph Pubchem. We’ll be publishing some feature blog posts on using and tuning with the Pubchem data. If you’re interested in applying Blazegraph to this data or have any questions, we’d love to hear from you.

GeoSpatial Searching

Check out our new features for GeoSpatial searching. The full details are on the Wiki. As a quick start, you can configure your namespace to enable geo-spatial:

com.bigdata.rdf.store.AbstractTripleStore.geoSpatial=true

Add some data with geospatial information:

@prefix geoliteral: <http://www.bigdata.com/rdf/geospatial/literals/v1#> .
@prefix example: <http://www.example.com/> .

example:Oktoberfest-2013
    rdf:type example:Fair ;
    rdfs:label "Oktoberfest 2013" ;
    example:happened "48.13188#11.54965#1379714400"^^geoliteral:lat-lon-time ;
                example:city example:Munich .

example:RAR-2013
    rdf:type example:Festival ;
    rdfs:label "Rock am Ring 2013" ;
    example:happened "50.33406#6.94259#1370556000"^^geoliteral:lat-lon-time ;
                example:city example:Nuerburg .

Then issue some queries to get started!

PREFIX geoliteral: <http://www.bigdata.com/rdf/geospatial/literals/v1#>
PREFIX geo: <http://www.bigdata.com/rdf/geospatial#>
PREFIX example: <http://www.example.com/>

SELECT * WHERE {
  SERVICE geo:search {
    ?event geo:search "inCircle" .
    ?event geo:searchDatatype geoliteral:lat-lon-time .
    ?event geo:predicate example:happened .
    ?event geo:spatialCircleCenter "48.13743#11.57549" .
    ?event geo:spatialCircleRadius "100" . # default unit: Kilometers
    ?event geo:timeStart "1356994800" . # 01.01.2013, 00:00:00
    ?event geo:timeEnd "1388530799" .   # 31.12.2013, 23:59:59
  }
}

Maven Central

Blazegraph 2.1.0 is now on Maven Central.    You can also get the Tinkerpop3 API and the new BlazegraphTPFServer.

    
    <dependency>
        <groupId>com.blazegraph</groupId>
        <artifactId>bigdata-core</artifactId>
        <version>2.1.0</version>
    </dependency>
    <!-- Use if Tinkerpop 2.5 support is needed ; See also Tinkerpop3 below. -->
    <dependency>
        <groupId>com.blazegraph</groupId>
        <artifactId>bigdata-blueprints</artifactId>
        <version>2.1.0</version>
    </dependency>

Tinkerpop3

Blazegraph Tinkerpop3

Tinkerpop3 is here!  Get it from Maven Central.

   <dependency>
      <groupId>com.blazegraph</groupId>
      <artifactId>blazegraph-gremlin</artifactId>
      <version>1.0.0</version>
   </dependency>

Blazegraph-Based TPF Server
The Blazegraph-Based TPF Server is a Linked Data Fragment (LDF) server that provides a Triple Pattern Fragment (TPF) interface using the Blazegraph graph database as the backend.  It was originally developed by Olaf Hartig and is being released via Blazegraph under the Apache 2 license.   See here to get started.

Mavenization

Everybody loves (and hates) Maven.  Starting with the 2.0.0 release, Blazegraph has been broken into a collection of Maven artifacts.  This has enabled us to work on new features like TinkerPop3, which requires Java 8, while keeping the core platform at Java 7 to support users who are still there.  Check out the Maven Notes on the wiki for full details on the architecture, getting started with development, building snapshots, etc.  If you have a 1.5.3 version checked out in Eclipse, you will want to pay attention to Getting Started with Eclipse and allocate a little extra time for the transition.

Deployers

2.1.0 provides a Debian Deployer, an RPM Deployer, and a Tarball, in addition to the blazegraph.war and blazegraph.jar archives.

Blazegraph Benchmarking

We will be releasing published benchmarks for LUBM and BSBM for the 2.1.0 release.

Enterprise Features (HA and Scale-out)

Starting in release 2.0.0, the Scale-out and HA capabilities moved to Enterprise features. These are available to users with a support and/or license subscription. If you are an existing GPLv2 user of these features, we have some easy ways to migrate. Contact us for more information. We’d like to make it as easy as possible.

Much, much, more….

There are also many other features, including improved data loading and support for building custom vocabularies, that will be covered in a series of blog posts over the next month or so.  Please check back.

Stay in touch, we’d love to hear from you.


Understanding SPARQL’s Bottom-up Semantics

Preface: In the 1.5.2 release we’ve implemented a couple of fixes for issues related to SPARQL’s bottom-up evaluation approach and the associated variable-scoping problems. If you encounter regressions in some of your queries after upgrading to 1.5.2, this blog post may help you identify ill-designed queries that are not in line with the official SPARQL 1.1 semantics. Please consult our Wiki for a more comprehensive discussion of SPARQL’s bottom-up semantics.

In one of our recent posts titled In SPARQL, Order Matters we discussed implications of SPARQL’s “as written” evaluation semantics. Another tricky aspect of SPARQL is its bottom-up evaluation semantics.  Informally speaking, bottom-up evaluation means that subqueries and nested groups are (logically) evaluated first. As a consequence, the evaluation order actually chosen by a SPARQL engine must yield the same results as bottom-up evaluation in order to be valid.

Blazegraph does not normally use bottom-up evaluation.  Instead, it prefers to reorder joins and join groups to reduce the amount of data read and the size of the intermediate solutions flowing through the query, using what is known as left-deep evaluation. However, some queries can only be evaluated with bottom-up plans, which are often much less efficient: when some variables are not visible in intermediate scopes, the more efficient left-deep plans cannot be used.

This guide will help you understand why, how to recognize when bottom-up evaluation semantics are being used, and what you can do to avoid the problem and get more performance out of your SPARQL queries. It also sketches Blazegraph’s built-in optimization techniques for efficiently dealing with issues induced by SPARQL’s bottom-up semantics.

Illustrating the Problem by Example

Let’s start out with a very simple dataset that we will use throughout the upcoming examples:

<http://example.com/Alice>   a <http://example.com/Person> .
<http://example.com/Flipper> a <http://example.com/Animal> .

That is, we have two triples: one stating that Alice is a Person and the other stating that Flipper is an Animal. Let’s start out with a simple query to illustrate what bottom-up evaluation actually means and which problems it can generate:

SELECT ?s ?personType WHERE {
  BIND(<http://example.com/Person> AS ?personType)
  {
    ?s a ?o
    FILTER(?o=?personType)
  }
}

The query aims to extract all instances of type <http://example.com/Person>. To this end, the variable ?personType is bound to the URI <http://example.com/Person> using the BIND keyword in the first line, the triple pattern “?s a ?o” matches all typed instances, and the filter retains those whose type coincides with the actual binding of ?personType. Looks reasonable, doesn’t it?  But there is a “gotcha”.  The ?personType variable will not be bound when the inner group graph pattern is evaluated!  We will explain why, and what to do about it, below.

Let’s see what happens according to SPARQL’s bottom-up semantics. Generally speaking, bottom-up evaluation means that, from a logical point of view, we start by evaluating the “leaf nodes” of the query tree first, using those results to iteratively evaluate composed nodes at higher levels. Considering our query above, one node in the query tree is the ({}-enclosed) subgroup consisting of the triple pattern and the filter. Here’s what happens when evaluating this node:

1. We evaluate the triple pattern “?s a ?o” against the data, which gives us two intermediate result mappings, namely

{ ?s -> <http://example.com/Alice>,   ?o -> <http://example.com/Person> },
{ ?s -> <http://example.com/Flipper>, ?o -> <http://example.com/Animal> }

2. Next, we apply the FILTER (?o=?personType) over the two mappings from step 1.

And here’s the problem: our two intermediate result mappings do not contain a binding for the variable ?personType (because the latter has not yet been bound when evaluating bottom-up). In such cases, the SPARQL semantics define that the FILTER condition evaluates to an “error”, in which case the FILTER rejects all mappings. We thus end up with an empty result for the subgroup, and it is easy to see that, as a consequence, the whole query returns the empty result.
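The rejection behavior can be sketched with plain Java. This is an illustrative model of the semantics, not Blazegraph code; the class name, the map-based binding sets, and the variable names are invented for the example:

```java
import java.util.List;
import java.util.Map;

public class BottomUpFilterDemo {

    // FILTER(?o = ?personType): if either variable is unbound in a mapping,
    // the comparison evaluates to "error" and the mapping is rejected.
    static long survivors(List<Map<String, String>> mappings) {
        return mappings.stream()
                .filter(m -> m.containsKey("o") && m.containsKey("personType")
                          && m.get("o").equals(m.get("personType")))
                .count();
    }

    public static void main(String[] args) {
        // The two intermediate result mappings from evaluating "?s a ?o".
        // ?personType was bound outside the subgroup, so it is absent here.
        List<Map<String, String>> mappings = List.of(
            Map.of("s", "http://example.com/Alice",   "o", "http://example.com/Person"),
            Map.of("s", "http://example.com/Flipper", "o", "http://example.com/Animal"));

        System.out.println(survivors(mappings)); // prints 0: every mapping is rejected
    }
}
```

Because neither mapping carries a binding for ?personType, the filter condition errors out for both, and the subgroup contributes nothing to the overall result.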

SPARQL has a lot of great features, but this is not one of them.  To help make people’s lives easier, we are extending Blazegraph’s EXPLAIN feature to also report problems such as this in your queries (see Figure below).  This will be part of the next Blazegraph release.

[Figure: EXPLAIN output reporting a bottom-up scoping problem in the query]

Other Problematic Cases

Well, you may now argue that the query above is unnecessarily complex, that it is somewhat silly to put the triple pattern in a dedicated subgroup, and that no one would ever come up with such a query. And (with the exception of the blog author) you’re probably right, but look at the following one:

SELECT ?person ?nonPerson ?type WHERE {
  BIND(<http://example.com/Person> AS ?type)
  {
    ?person a ?o
    FILTER(?o=?type)
  }
  UNION
  {
    ?nonPerson a ?o
    FILTER(?o!=?type)
  }
}

In this UNION query, the BIND expression is used to introduce a variable binding that is intended to be reused in the two parts of the UNION: the idea is to extract all instances of type <http://example.com/Person> in the first part of the UNION, and all the others in the second part of the UNION.

But again, this query does not work as desired: the two blocks of the UNION open up new scopes in which the ?type variable is not known. For the same reasons as in the example before, both FILTER expressions evaluate to an error, and we end up with the empty result. One way to fix this is by (redundantly) pushing the BIND into the two parts of the UNION:

SELECT ?person ?nonPerson ?type WHERE {
  {
    BIND(<http://example.com/Person> AS ?type)
    ?person a ?o
    FILTER(?o=?type)
  }
  UNION
  {
    BIND(<http://example.com/Person> AS ?type)
    ?nonPerson a ?o
    FILTER(?o!=?type)
  }
}

The latter query will give us the desired results without any performance penalty, namely:

?person                    | ?nonPerson                   | ?type
---------------------------------------------------------------------------------------
<http://example.com/Alice> |                              | <http://example.com/Person>
                           | <http://example.com/Flipper> | <http://example.com/Animal>

Other Problematic Cases: BIND and VALUES

The problem sketched above is not restricted to the use of variables in FILTER expressions. Similar issues may arise whenever we use variables in nodes that “consume” these variables without matching them against the dataset. More concretely: using a triple pattern with a variable in an inner scope is not a problem, because the variables are matched against the dataset independently from the outside and will be joined with the outside part at a later point in time. But when using SPARQL 1.1 constructs such as BIND or VALUES clauses, you may run into the same problems as sketched before for the FILTER expression. Look at the following query, which aims to extract all persons (first part of the UNION) and all instances that are not persons (second part of the UNION), including the respective types:

SELECT ?s ?type WHERE {
  BIND("http://example.com/" AS ?typeBase)
  {
    BIND(URI(CONCAT(?typeBase,"Person")) AS ?type)
    ?s a ?o
    FILTER(?o=?type)
  }
  UNION
  {
    BIND(URI(CONCAT(?typeBase,"Animal")) AS ?type)
    ?s a ?o
    FILTER(?o=?type)
  }
}

The problem is essentially the same: we bind variable ?typeBase outside. In the inner UNION blocks, we try to bind variable ?type based on ?typeBase – but the latter is not in scope here. Hence, the query returns the empty result.

Optimizations in Blazegraph

Strictly following bottom-up semantics when implementing a query engine is typically not a good choice when it comes to evaluation performance. Top-down evaluation, where we inject mappings from previous evaluation steps into subsequent subgroups, can lead to significant speedups. The good news is that, for a broad range of SPARQL queries, bottom-up and top-down evaluation coincide. This holds, for instance, for the complete class of SPARQL queries built only from triple patterns connected through “.” (so-called conjunctive queries).

When it comes to Blazegraph, the basic evaluation approach is top-down. For queries where top-down and bottom-up evaluation differ, Blazegraph rewrites queries so that their top-down evaluation result coincides with the bottom-up result where possible (wherever this is not possible, it essentially switches to bottom-up evaluation for that part of the query). Various techniques and tricks are implemented in Blazegraph’s optimizer for this purpose: in many cases, subgroups can simply be flattened out without changing the semantics, variables in subgroups that are known to be unbound can be renamed to avoid clashes, etc. With the fixes in 1.5.2 we’ve ruled out various inconsistencies between Blazegraph and the official W3C spec. If you plan to migrate from a previous version to 1.5.2 or later, we recommend reading our extended discussion of bottom-up semantics in our Wiki.

Summary

Although some of the examples above were somewhat artificially designed to illustrate the issues arising in the context of SPARQL’s bottom-up semantics by means of simplistic examples, we have observed ill-designed queries of this style in practice, in both our own applications and customer applications.  We hope that the examples and guidelines in this post help our readers and users avoid the common pitfalls and write better, standards-compliant SPARQL in the future.

We’d love to hear from you.

Have you had similar experiences with SPARQL’s semantics? Do you have a cool new application using Blazegraph, or are you interested in understanding how to make Blazegraph work best for your application?   Get in touch or send us an email at blazegraph at blazegraph.com.


SPARQL UPDATE performance gain. An easy win with the right data structure.

We had reports of a performance slowdown for SPARQL UPDATE INSERT/DELETE WHERE queries:

DELETE {...}
INSERT {...}
WHERE {...}

For example, you can observe this using the following SPARQL UPDATE request against a data set with 100k or more instances of rdf:label.

DELETE { ?s rdf:label ?o } INSERT {?s rdf:comment ?o } WHERE { ?s rdf:label ?o }

Looking into the timing, we found that the time to insert or remove each statement was growing in proportion to the number of statements already added or removed in the connection. The actual timings are below.  In the first log output, it took 3 seconds to process 10,000 statements.  The performance is fairly flat for the next 20,000 statements. However, the latency of the operation then starts to grow very rapidly.  By the last log output it was taking 97 seconds to process 10,000 statements!

     [java] Added 10000 stmts for removal in 3 seconds (now 10000 stmts in total)
     [java] Added 10000 stmts for removal in 3 seconds (now 20000 stmts in total)
     [java] Added 10000 stmts for removal in 4 seconds (now 30000 stmts in total)
     [java] Added 10000 stmts for removal in 9 seconds (now 40000 stmts in total)
     [java] Added 10000 stmts for removal in 9 seconds (now 50000 stmts in total)
     [java] Added 10000 stmts for removal in 12 seconds (now 60000 stmts in total)
     [java] Added 10000 stmts for removal in 20 seconds (now 70000 stmts in total)
     [java] Added 10000 stmts for removal in 26 seconds (now 80000 stmts in total)
     [java] Added 10000 stmts for removal in 32 seconds (now 90000 stmts in total)
     [java] Added 10000 stmts for removal in 39 seconds (now 100000 stmts in total)
     [java] Added 10000 stmts for removal in 46 seconds (now 110000 stmts in total)
     [java] Added 10000 stmts for removal in 53 seconds (now 120000 stmts in total)
     [java] Added 10000 stmts for removal in 65 seconds (now 130000 stmts in total)
     [java] Added 10000 stmts for removal in 74 seconds (now 140000 stmts in total)
     [java] Added 10000 stmts for removal in 83 seconds (now 150000 stmts in total)
     [java] Added 10000 stmts for removal in 90 seconds (now 160000 stmts in total)
     [java] Added 10000 stmts for removal in 97 seconds (now 170000 stmts in total)
     ...

What’s going on here!?!  The underlying BigdataSailConnection batches everything into a StatementBuffer object. The StatementBuffer is then incrementally flushed onto the backing indices every 10,000 statements (by default – see BigdataSail.Options.BUFFER_CAPACITY to change this parameter). The remove-statements code path is nearly identical. Both add statements and remove statements on the BigdataSailConnection are known to scale into billions of statements added or removed in a single operation.  Scaling should not be worse than log-linear, since it depends on the depth of the B+Tree indices (the cost of an index probe is proportional to the height of the B+Tree, which grows logarithmically with the number of entries).
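The incremental-flush pattern itself keeps memory bounded no matter how many statements flow through. A minimal generic sketch of that pattern (illustrative only: the StatementBatcher class is invented here, and a Consumer stands in for the backing indices):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative sketch of the incremental-flush pattern used by the
// StatementBuffer: statements accumulate in memory and are evicted to a
// flush target every `capacity` additions, so memory stays bounded.
public class StatementBatcher<T> {
    private final int capacity;
    private final List<T> buffer = new ArrayList<>();
    private final Consumer<List<T>> flushTarget; // stands in for the backing indices

    public StatementBatcher(int capacity, Consumer<List<T>> flushTarget) {
        this.capacity = capacity;
        this.flushTarget = flushTarget;
    }

    public void add(T stmt) {
        buffer.add(stmt);
        if (buffer.size() >= capacity) flush();
    }

    public void flush() {
        if (!buffer.isEmpty()) {
            flushTarget.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    public static void main(String[] args) {
        StatementBatcher<String> batcher = new StatementBatcher<>(10_000,
            batch -> System.out.println("flushed " + batch.size() + " stmts"));
        for (int i = 0; i < 25_000; i++) batcher.add("stmt-" + i);
        batcher.flush(); // evict the final partial batch
    }
}
```

Since each batch is handed off and the buffer cleared, the per-statement cost of batching is constant; the observed degradation therefore had to come from somewhere else.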

Looking at the scaling performance, it was immediately suggestive of a poor data structure choice: something where the cost of a scan is proportional to the number of items scanned.

Looking in our code, the relevant lines are:

// Evaluate the WHERE clause.
final MutableTupleQueryResult result = new MutableTupleQueryResult(
  ASTEvalHelper.evaluateTupleQuery(
    context.conn.getTripleStore(), astContainer,
    context.getQueryBindingSet()/* bindingSets */));

// Evaluate the DELETE clause (evaluation of the INSERT clause is similar).
final ASTConstructIterator itr = new ASTConstructIterator(
    context, context.conn.getTripleStore(), template,
    op.getWhereClause(), null, result);

while (itr.hasNext()) {
  final BigdataStatement stmt = itr.next();
  addOrRemoveStatement(context.conn.getSailConnection(), stmt, false/* insert */);
}

// Batch up individual add or remove statement calls.
void addOrRemoveStatement(final BigdataSailConnection conn, final BigdataStatement spo, final boolean insert) throws SailException {
  // ...
  if (insert) {
    conn.addStatement(s, p, o, contexts);
  } else {
    conn.removeStatements(s, p, o, contexts);
  }
}

Looking into our code (above), we found an openrdf MutableTupleQueryResult object. This object is used to buffer the statements identified by the WHERE clause of the query. That buffer is then rewound and replayed twice: once for the DELETE clause and once more for the INSERT clause. So my suspicions naturally fell on the otherwise innocuous line:

final BigdataStatement stmt = itr.next();

That line is our ASTConstructIterator, which is known to be scalable. But in this case it is backed by the MutableTupleQueryResult class. So, looking into MutableTupleQueryResult, I see:

private List<BindingSet> bindingSets = new LinkedList<BindingSet>();

Bingo!

MutableTupleQueryResult.next() is calling through to List.get(index):

public BindingSet next() {
  if (hasNext()) {
    BindingSet result = get(currentIndex);
    lastReturned = currentIndex;
    currentIndex++;
    return result;
  }
  throw new NoSuchElementException();
}

public BindingSet get(int index) {
  return bindingSets.get(index);
}

List.get(index) against a LinkedList does a linear scan. That explains the performance degradation we were seeing. A one-line change fixes this.

private List<BindingSet> bindingSets = new ArrayList<BindingSet>();

Performance for the same SPARQL UPDATE request is now perfectly flat. The main cost was the MutableTupleQueryResult scan of the LinkedList. Blazegraph automatically and scalably buffers the data and incrementally evicts the mutations to the write cache and from there to the disk.

[java] Added 10000 stmts for removal in 3 seconds (now 10000 stmts in total)
[java] Added 10000 stmts for removal in 3 seconds (now 20000 stmts in total)
[java] Added 10000 stmts for removal in 1 seconds (now 30000 stmts in total)
[java] Added 10000 stmts for removal in 1 seconds (now 40000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 50000 stmts in total)
[java] Added 10000 stmts for removal in 1 seconds (now 60000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 70000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 80000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 90000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 100000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 110000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 120000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 130000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 140000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 150000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 160000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 170000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 180000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 190000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 200000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 210000 stmts in total)
[java] Added 10000 stmts for removal in 1 seconds (now 220000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 230000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 240000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 250000 stmts in total)
[java] Added 10000 stmts for removal in 1 seconds (now 260000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 270000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 280000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 290000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 300000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 310000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 320000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 330000 stmts in total)
[java] Added 10000 stmts for removal in 1 seconds (now 340000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 350000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 360000 stmts in total)
[java] Added 10000 stmts for removal in 1 seconds (now 370000 stmts in total)
[java] Added 10000 stmts for removal in 1 seconds (now 380000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 390000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 400000 stmts in total)
[java] Added 10000 stmts for removal in 1 seconds (now 410000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 420000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 430000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 440000 stmts in total)
[java] Added 10000 stmts for removal in 1 seconds (now 450000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 460000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 470000 stmts in total)
[java] Added 10000 stmts for removal in 0 seconds (now 480000 stmts in total)
[java] Added 10000 stmts for removal in 1 seconds (now 490000 stmts in total)
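The underlying List behavior is easy to reproduce outside Blazegraph. This standalone sketch (class and method names are ours, purely illustrative) iterates both list types by positional get(i), the same access pattern MutableTupleQueryResult uses:

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class IndexedGetDemo {

    // Sum the list by positional get(i): O(1) per call for ArrayList,
    // O(i) per call for LinkedList, so quadratic overall for the latter.
    static long sumByIndex(List<Integer> list) {
        long sum = 0;
        for (int i = 0; i < list.size(); i++) {
            sum += list.get(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        final int n = 50_000;
        List<Integer> linked = new LinkedList<>();
        List<Integer> array = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            linked.add(i);
            array.add(i);
        }
        long t0 = System.nanoTime();
        long s1 = sumByIndex(array);
        long tArray = System.nanoTime() - t0;
        t0 = System.nanoTime();
        long s2 = sumByIndex(linked);
        long tLinked = System.nanoTime() - t0;
        System.out.println("same sum: " + (s1 == s2));
        System.out.println("ArrayList ns:  " + tArray);
        System.out.println("LinkedList ns: " + tLinked); // typically far slower
    }
}
```

Iterating the LinkedList with an Iterator would also be linear; it is only the indexed get(i) access pattern that degrades.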

See https://jira.blazegraph.com/browse/BLZG-1404 for the workaround and a link to the corresponding ticket that we have raised with the openrdf project. This fix will be in our next major release, and is available in hot fix releases to customers with support subscriptions.

Thanks,
Bryan


Migration of issue tracker (trac => jira)

All, we have finally completed the migration from trac to jira. Please visit and use the new jira instance for Blazegraph.

Many thanks to Brad for making this happen!

Note: trac will remain online in a read-only mode. A cross-walk of trac to jira tickets is available.

All trac tickets have been updated with a comment containing the link to the corresponding JIRA ticket.

Thanks,
Bryan


CPU, Disk, Main Memory for Graphs

Bryan and I were chatting about CPU memory bandwidth on the train back from NYC. Here’s a quick write-up of the discussion.

Fifteen years ago, database researchers recognized that CPU memory bandwidth was the limiting factor for relational database performance. This observation was made in the context of relatively wide tables in RDBMS platforms that were heavily oriented to key-range scans on a primary key. The resulting architectures are similar to the structure-of-arrays (SoA) pattern used by the high performance computing community and within our Mapgraph platform.

Graphs are commonly modeled as 3-column tables. These tables intrinsically have a very narrow stride, similar to column stores for relational data. Whether organized for main memory or disk, the goal of the query planner is to generate an execution strategy that is the most efficient for a given query.
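As a rough illustration (our own toy encoding, not Blazegraph's internal layout), a 3-column triple table with a narrow stride can be laid out as three parallel ID arrays, the structure-of-arrays pattern mentioned above:

```java
// Toy structure-of-arrays layout for a 3-column triple table.
// The long IDs stand in for dictionary-encoded RDF terms; all names
// here are illustrative, not Blazegraph APIs.
public class TripleTableSoA {

    final long[] s, p, o; // parallel columns: subject, predicate, object
    int size = 0;

    TripleTableSoA(int capacity) {
        s = new long[capacity];
        p = new long[capacity];
        o = new long[capacity];
    }

    void add(long subj, long pred, long obj) {
        s[size] = subj;
        p[size] = pred;
        o[size] = obj;
        size++;
    }

    // A scan that filters on one column touches only one array, so the
    // stride is a single long: good cache and memory-bandwidth behavior.
    int countWithPredicate(long pred) {
        int count = 0;
        for (int i = 0; i < size; i++) {
            if (p[i] == pred) count++;
        }
        return count;
    }
}
```

A row-oriented (array-of-structs) layout would pull all three columns through the cache even when only the predicate is inspected; the SoA layout avoids that waste.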

Graph stores that are organized for disk maintain multiple indices with logarithmic access cost (e.g., a B+Tree) that allow them to jump to the page on the disk that has the relevant tuples for any given access path. Due to the relative cost of memory and disk, main memory systems (such as SPARQLcity) sometimes choose a single index and resort to full table scans when the default index is not useful for a query or access pattern. In making such decisions, main memory databases are trading off memory for selectivity. These designs can consume fewer resources, but it can be much more efficient to have a better index for a complex query plan. Thus, a disk-based database can often do as well as or better than a memory-based database if the disk-based system has a more appropriate index or family of indices. In fact, 90%+ of the performance of a database platform comes from the query optimizer. The difference in performance between a good query plan and a bad query plan for the same database and hardware can easily be 10x, 100x, or 10000x depending on the query. A main memory system with a bad query plan can easily be beaten by a disk-based system with a good query plan. This is why we like to work closely with our customers. For example, one of our long term customers recently upgraded from 1.2.x (under a long term support contract) to 1.5.x and obtained a 100% performance improvement without changing a single line of their code.

Main memory systems become critical when queries must touch a substantial portion of the data. This is true for most graph algorithms that are not hop constrained. For example, an unconstrained breadth first search on a scale free graph will tend to visit all vertices in the graph during the traversal. A page rank or connected components computation will tend to visit all vertices on each iteration and may require up to 50 iterations to converge page rank to a satisfactory epsilon. In such cases, CPU memory architectures will spend most of the wall clock time blocked on memory fetches due to the inherent non-local access patterns during graph traversal. Architectures such as the XMT/XMT-2 (the Urika appliance) handle this problem by using very slow cores, zero latency thread switching, a fast interconnect, and hash partitioned memory allocations. The bet of the XMT architecture is that non-locality dominates, so you might as well spread all data out everywhere and hide the latency by having a large number of memory transactions in flight. We take a different approach with GPUs and achieve a 10x price/performance benefit over the XMT-2 and a 3x cost savings. This savings will increase substantially when the Pascal GPU is released in Q1 2016 due to an additional 4x gain in memory bandwidth driven by the breadth of the commodity market for GPUs. We obtain this dramatic price/performance and actual performance advantage using zero overhead context switching, fast memory, 1000s of threads to get a large number of in-flight memory transactions, and paying attention to locality. The XMT-2 is a beautiful architecture, but locality *always* matters at every level of the memory hierarchy.


Blazegraph 1.5.1 Released!

Blazegraph 1.5.1 is released! This is a major release of Blazegraph™. The official release is made into the Sourceforge Git repository. Releases after 1.4.0 will no longer be made into SVN.

The full feature matrix is here.

blazegraph_wide_85px_height

You can download the WAR (standalone), JAR (executable), or HA artifacts from sourceforge.

You can checkout this release from:

git clone -b BLAZEGRAPH_RELEASE_1_5_1 --single-branch git://git.code.sf.net/p/bigdata/git BLAZEGRAPH_RELEASE_1_5_1

Feature summary:

– Highly Available Replication Clusters (HAJournalServer [10])
– Single machine data storage to ~50B triples/quads (RWStore);
– Clustered data storage is essentially unlimited (BigdataFederation);
– Simple embedded and/or webapp deployment (NanoSparqlServer);
– Triples, quads, or triples with provenance (RDR/SIDs);
– Fast RDFS+ inference and truth maintenance;
– Fast 100% native SPARQL 1.1 evaluation;
– Integrated “analytic” query package;
– 100% Java memory manager leverages the JVM native heap (no GC);
– RDF Graph Mining Service (GASService) [12].
– Reification Done Right (RDR) support [11].
– RDF/SPARQL workbench.
– Blueprints API.

Road map [3]:

– Column-wise indexing;
– Runtime Query Optimizer for quads;
– New scale-out platform based on MapGraph (100x => 10000x faster)

Change log:

Note: Versions with (*) MAY require data migration. For details, see [9].

New features:
– BigdataSailFactory moved to client package (http://trac.bigdata.com/ticket/1152)
– This release includes significant performance gains for property paths.
– Both correctness and performance gains for complex join group and optional patterns.
– Support for concurrent writers and group commit. This is a beta feature in 1.5.1 and must be explicitly enabled for the database. Group commit for HA is also working in master, but was not ready for the 1.5.1 QA and hence is not in the 1.5.1 release branch.

1.5.1:

– http://trac.blazegraph.com/ticket/566 Concurrent unisolated operations against multiple KBs on the same Journal
– http://trac.blazegraph.com/ticket/801 Adding Optional removes solutions
– http://trac.blazegraph.com/ticket/835 Query solutions are duplicated and increase by adding graph patterns
– http://trac.blazegraph.com/ticket/1003 Property path operator should output solutions incrementally
– http://trac.blazegraph.com/ticket/1007 Using a bound variable to refer to a graph
– http://trac.blazegraph.com/ticket/1033 NPE if remote http server fails to provide a Content-Type header
– http://trac.blazegraph.com/ticket/1071 problems with UNIONs + complex OPTIONAL groups
– http://trac.blazegraph.com/ticket/1103 Executable Jar should bundle the BuildInfo class
– http://trac.blazegraph.com/ticket/1105 SPARQL UPDATE should have nice error messages when namespace does not support named graphs
– http://trac.blazegraph.com/ticket/1108 NSS startup error: java.lang.IllegalArgumentException: URI is not hierarchical
– http://trac.blazegraph.com/ticket/1110 Data race in BackgroundGraphResult.run()/close()
– http://trac.blazegraph.com/ticket/1112 GPLv2 license header update with new contact information
– http://trac.blazegraph.com/ticket/1113 Add hook to override the DefaultOptimizerList
– http://trac.blazegraph.com/ticket/1114 startHAServices no longer respects environment variables
– http://trac.blazegraph.com/ticket/1115 Build version in SF GIT master is wrong
– http://trac.blazegraph.com/ticket/1116 README.md needs updating for Blazegraph transition
– http://trac.blazegraph.com/ticket/1118 Optimized variable projection into subqueries/subgroups
– http://trac.blazegraph.com/ticket/1125 OSX vm_stat output has changed
– http://trac.blazegraph.com/ticket/1129 Concurrent modification problem with group commit
– http://trac.blazegraph.com/ticket/1130 ClocksNotSynchronizedException (HA, GROUP_COMMIT)
– http://trac.blazegraph.com/ticket/1131 DELETE-WITH-QUERY and UPDATE-WITH-QUERY (GROUP COMMIT)
– http://trac.blazegraph.com/ticket/1132 GlobalRowStoreHelper can hold hard reference to GSR index (GROUP COMMIT)
– http://trac.blazegraph.com/ticket/1137 Code review on “instanceof Journal”
– http://trac.blazegraph.com/ticket/1139 BigdataSailFactory.connect()
– http://trac.blazegraph.com/ticket/1142 Isolation broken in NSS when groupCommit disabled
– http://trac.blazegraph.com/ticket/1143 GROUP_COMMIT environment variable
– http://trac.blazegraph.com/ticket/1146 SPARQL Federated Query uses too many HttpClient objects
– http://trac.blazegraph.com/ticket/1147 DELETE DATA must not allow blank nodes
– http://trac.blazegraph.com/ticket/1152 BigdataSailFactory must be moved to the client package

Full release notes are here.

[1] http://wiki.blazegraph.com/wiki/index.php/Main_Page
[2] http://wiki.blazegraph.com/wiki/index.php/GettingStarted
[3] http://wiki.blazegraph.com/wiki/index.php/Roadmap
[4] http://www.bigdata.com/bigdata/docs/api/
[5] http://sourceforge.net/projects/bigdata/
[6] http://www.bigdata.com/blog
[7] http://www.systap.com/bigdata.htm
[8] http://sourceforge.net/projects/bigdata/files/bigdata/
[9] http://wiki.blazegraph.com/wiki/index.php/DataMigration
[10] http://wiki.blazegraph.com/wiki/index.php/HAJournalServer
[11] http://www.bigdata.com/whitepapers/reifSPARQL.pdf
[12] http://wiki.blazegraph.com/wiki/index.php/RDF_GAS_API
[13] http://wiki.blazegraph.com/wiki/index.php/NanoSparqlServer#Downloading_the_Executable_Jar
[14] https://blog.bigdata.com/?p=811


Blazegraph 1.5.1 Feature Preview

Starting with 1.5.1, Blazegraph supports task-oriented concurrent writers. This support is based on the pre-existing support for task-based concurrency control in Blazegraph. Those mechanisms were previously used only in the scale-out architecture. They are now incorporated into the REST API and can even be used by embedded applications that are aware of them.

This is a beta feature — make backups!

There are two primary benefits from group commit.

First, you can have multiple tenants in the same database instance and the updates for one tenant will no longer block the updates for the other tenants. Thus, one tenant can be safely running a long running update and other tenants can still enjoy low latency updates.

Second, group commit automatically combines a sequence of updates on one (or more) tenant(s) into a single commit point on the disk. This provides higher potential throughput. It also means that it is no longer as important for applications to batch their updates since group commit will automatically perform some batching.
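The batching idea can be sketched in a few lines. This is our own minimal model (not Blazegraph code): writers enqueue mutations, and a committer drains whatever is pending and makes one durable commit for the whole group, so N small updates cost far fewer than N disk commits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal group-commit sketch (illustrative only). Writers call submit();
// commitGroup() melds everything queued since the last commit into a
// single commit point.
public class GroupCommitSketch {

    private final BlockingQueue<Runnable> pending = new LinkedBlockingQueue<>();
    int commits = 0; // durable commit points performed
    int applied = 0; // mutations melded into those commits

    public void submit(Runnable mutation) {
        pending.add(mutation);
    }

    // One pass of the committer: apply everything queued, then commit once.
    public void commitGroup() {
        List<Runnable> batch = new ArrayList<>();
        pending.drainTo(batch);
        if (batch.isEmpty()) return;
        for (Runnable m : batch) {
            m.run();
            applied++;
        }
        commits++; // one commit covers the whole batch
    }
}
```

In the real system the commit also handshakes with per-task locking and checkpointing; the sketch only shows why the commit count drops relative to the mutation count.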

Early adopters are encouraged to enable this using the following configuration property. While the Journal has always supported group commit at the AbstractTask layer, we have added support for hierarchical locking and modified the REST API to use group commit when this feature is enabled. Therefore this feature is a "beta" in 1.5.1 while we work out any new kinks.

# Note: Default is false.
com.bigdata.journal.Journal.groupCommit=true

If you are using the REST API, then that is all you need to do. Group commit will be automatically enabled. This can even be done with an existing Journal since there are no differences in the manner in which the data are stored on the disk.

Embedded Applications and Group Commit

If you are using the internal APIs (Sail, AbstractTripleStore, stored queries, etc.) then you need to understand what is happening when group commit is enabled and make a slight change to your code.

  • When you set this property to true, you are asserting that your application will submit all tasks for evaluation to the IConcurrencyManager associated with the Journal and you are agreeing to let the database decide when it will perform a commit.
  • When you set this property to false (the default), you are asserting that your application will control when the database performs a commit. This is how embedded applications have historically been written.
  • Any mutation operations must use the following incantation. This incantation will submit a task that obtains the necessary locks and the task will then run. If the task exits normally (versus by throwing an exception) then it will join the next commit group. The Future.get() call will return either when the task fails or when its write set has been melded into a commit point.

    AbstractApiTask.submitApiTask(IIndexManager indexManager, IApiTask task).get();

    There are a few “gotchas” with the group commit support. This is because commits are decided by IApiTask completion and tasks are scheduled by the concurrency manager, lock manager, and write executor service.

  • Mutation tasks that do not complete normally MUST throw an exception!
  • Applications MUST NOT call Journal.commit(). Instead, they submit an IApiTask using AbstractApiTask.submitApiTask(). The database will meld the write set of the task into a group commit sometime after the task completes successfully.
  • Servlets exposing mutation methods MUST NOT flush the response inside of their AbstractRestApiTask. This is because ServletOutputStream.flush() is interpreted as committing the HTTP response to the client. As soon as this is done, the client is unblocked and may issue new operations under the assumption that the data has been committed. However, the ACID commit point for the task is *after* it terminates normally. Thus the servlet must flush the response only after the task is done executing and NOT within the task body. The BigdataServlet.submitApiTask() method handles this for you, so your code looks like this:

  • // Example of task execution from within a BigdataServlet
    try {
      submitApiTask(new MyTask(req, resp, namespace, timestamp, ...)).get();
    } catch (Throwable t) {
      launderThrowable(t, resp, ...);
    }

  • BigdataSailConnection.commit() no longer causes the database to go through a commit point, but you MUST still call conn.commit(). It will still flush the assertion buffers (for asserted and retracted statements) to the indices, which is necessary for your writes to become visible. When your task ends and the indices go through a checkpoint, that does not actually trigger a commit. Thus, in order to use group commit, you must obtain your connection from within an IApiTask, invoke conn.commit() if things are successful, and otherwise throw an exception. The following template shows what this looks like.
  • // Example of a concurrent writer task using group commit APIs.
    public class MyWriteTask extends AbstractApiTask {
      public Void call() throws Exception {
        BigdataSailRepositoryConnection conn = null;
        boolean success = false;
        try {
          conn = getUnisolatedConnection();
          // WRITE ON THE CONNECTION
          conn.commit(); // Commit the mutation.
          success = true;
          return (Void) null;
        } finally {
          if (conn != null) {
            if (!success)
              conn.rollback();
            conn.close();
          }
        }
      }
    }

How it works.

The group commit mechanisms are based on hierarchical locking and pre-declared locks. Tasks pre-declare their locks. The lock manager orders the lock requests to avoid deadlocks. Once a task owns its locks, it is executed by the WriteExecutorService. AbstractTask is responsible for isolating its index views, checkpointing the modified indices after the task has finished its work, and handshaking with the WriteExecutorService around group commits.

Most tasks just need to declare the namespace on which they want to operate. This will automatically obtain a lock for all indices in that namespace. Some special kinds of tasks (those that create and destroy namespaces) must also obtain a lock on the global row store (aka the GRS). This is an internal key-value store where Blazegraph stores the namespace declarations.
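The deadlock-avoidance idea here is a classic one: if every task declares all of its locks up front and the lock manager acquires them in one global order, no cycle of waiters can form. A minimal sketch of that idea (our own illustration, not the Blazegraph lock manager):

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Minimal pre-declared lock manager: a task names every namespace it will
// touch before it runs, and locks are always taken in sorted order, so a
// deadlock cycle between tasks is impossible.
public class OrderedLockManager {

    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    private ReentrantLock lockFor(String name) {
        return locks.computeIfAbsent(name, k -> new ReentrantLock());
    }

    public void runWithLocks(Runnable task, String... namespaces) {
        String[] ordered = namespaces.clone();
        Arrays.sort(ordered); // the single global acquisition order
        for (String ns : ordered) {
            lockFor(ns).lock();
        }
        try {
            task.run();
        } finally {
            // Release in reverse order of acquisition.
            for (int i = ordered.length - 1; i >= 0; i--) {
                lockFor(ordered[i]).unlock();
            }
        }
    }
}
```

In these terms, an ordinary mutation task would declare just its namespace, while a namespace create/destroy task would additionally declare the GRS lock.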
