Author Archives: Michael Schmidt


Understanding SPARQL’s Bottom-up Semantics

Preface: In the 1.5.2 release we’ve implemented a couple of fixes regarding issues related to SPARQL’s bottom-up evaluation approach and associated variable scoping problems. If you encounter regressions with some of your queries after upgrading to 1.5.2, this blog post may help you identify ill-designed queries that are not in line with the official SPARQL 1.1 semantics. Please consult our Wiki for a more comprehensive discussion of SPARQL’s bottom-up semantics.

In one of our recent posts titled In SPARQL, Order Matters we discussed implications of SPARQL’s “as written” evaluation semantics. Another tricky aspect of SPARQL is its bottom-up evaluation semantics. Informally speaking, bottom-up evaluation means that subqueries and nested groups are (logically) evaluated first. As a consequence, whatever evaluation order a SPARQL engine actually chooses, it must yield the same results as the bottom-up evaluation order in order to be valid.

Blazegraph does not normally use bottom-up evaluation. Instead, it prefers to reorder joins and join groups so as to reduce the amount of data read and the size of the intermediate solutions flowing through the query, using what is known as left-deep evaluation. Bottom-up plans are often much less efficient, but they are sometimes required: when some variables are not visible in an intermediate scope, the more efficient left-deep plans cannot be used.

This guide will help you understand why this happens, how to recognize when bottom-up evaluation semantics come into play, and what you can do to avoid the problem and get more performance out of your SPARQL queries. It also sketches Blazegraph’s built-in optimization techniques for efficiently dealing with issues induced by SPARQL’s bottom-up semantics.

Illustrating the Problem by Example

Let’s start out with a very simple dataset that we will use throughout the upcoming examples:

<http://example.com/Alice>   a <http://example.com/Person> .
<http://example.com/Flipper> a <http://example.com/Animal> .

That is, we have two triples, one stating that Alice is a person and the other stating that Flipper is an animal. Let’s start with a simple query to illustrate what bottom-up evaluation actually means and which problems it can cause:

SELECT ?s ?personType WHERE {
  BIND(<http://example.com/Person> AS ?personType)
  {
    ?s a ?o
    FILTER(?o=?personType)
  }
}

The query aims to extract all instances of type <http://example.com/Person>. To this end, variable ?personType is bound to the URI <http://example.com/Person> using the BIND keyword in the first line, the triple pattern “?s a ?o” searches for all typed instances, and the filter retains those that coincide with the actual binding of ?personType. Looks reasonable, doesn’t it? But there is a “gotcha”: the ?personType variable will not be bound when the inner group graph pattern is evaluated! We will explain why, and what to do about it, below.

Let’s see what happens according to SPARQL’s bottom-up semantics. Generally speaking, bottom-up evaluation means that, from a logical point of view, we start by evaluating the “leaf nodes” of the query tree first, using these results to iteratively evaluate composed nodes at higher levels. In our query above, one node in the query tree is the ({}-enclosed) subgroup consisting of the statement pattern and the filter. Here’s what happens when this node is evaluated:

1. We evaluate the triple pattern “?s a ?o” against the data, which gives us two intermediate result mappings, namely

{ ?s -> <http://example.com/Alice>,   ?o -> <http://example.com/Person> },
{ ?s -> <http://example.com/Flipper>, ?o -> <http://example.com/Animal> }

2. Next, we apply the FILTER (?o=?personType) over the two mappings from step 1.

And here’s the problem: our two intermediate result mappings do not contain a binding for variable ?personType (because it has not yet been bound when evaluation starts bottom-up). In such cases, the SPARQL semantics defines that the FILTER condition evaluates to an “error”, in which case the FILTER rejects all mappings. We thus end up with an empty result for the subgroup and, as a consequence, an empty result for the whole query.
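One possible repair, consistent with the scoping rules just described, is to remove the inner group so that the BIND, the triple pattern, and the FILTER share a single scope (a sketch using the sample data from above):

```sparql
SELECT ?s ?personType WHERE {
  BIND(<http://example.com/Person> AS ?personType)
  ?s a ?o
  FILTER(?o=?personType)
}
```

Here the FILTER is evaluated over the join of the BIND result and the triple pattern, so ?personType is bound and the query returns the expected single row for Alice.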

SPARQL has a lot of great features, but this is not one of them. To make people’s lives easier, we are extending Blazegraph’s EXPLAIN feature to also report problems such as this in your queries (see the figure below). This will be part of the next Blazegraph release.

[Figure: Blazegraph EXPLAIN output flagging the problem (explain-hint2)]

Other Problematic Cases

Well, you may now argue that the query above is unnecessarily complex, that it is somewhat silly to put the triple patterns in a dedicated subgroup, and that no one would ever come up with such a query. And (with the exception of the blog author) you’re probably right, but look at the following one:

SELECT ?person ?nonPerson ?type WHERE {
  BIND(<http://example.com/Person> AS ?type)
  {
    ?person a ?o
    FILTER(?o=?type)
  }
  UNION
  {
    ?nonPerson a ?o
    FILTER(?o!=?type)
  }
}

In this UNION query, the BIND expression is used to introduce a variable binding that is intended to be reused in the two parts of the UNION: the idea is to extract all instances of type <http://example.com/Person> in the first part of the UNION, and all the others in the second part of the UNION.

But again, this query does not work as desired: the two branches of the UNION open up new scopes in which the ?type variable is not visible. For the same reasons as in the example before, both FILTER expressions evaluate to an error and we end up with an empty result. One way to fix this is to (redundantly) push the BIND into the two parts of the UNION:

SELECT ?person ?nonPerson ?type WHERE {
  {
    BIND(<http://example.com/Person> AS ?type)
    ?person a ?o
    FILTER(?o=?type)
  }
  UNION
  {
    BIND(<http://example.com/Person> AS ?type)
    ?nonPerson a ?o
    FILTER(?o!=?type)
  }
}

The latter query will give us the desired results without any performance penalty, namely:

?person                    | ?nonPerson                   | ?type
---------------------------------------------------------------------------------------
<http://example.com/Alice> |                              | <http://example.com/Person>
                           | <http://example.com/Flipper> | <http://example.com/Animal>

Other Problematic Cases: BIND and VALUES

The problem sketched above is not restricted to the use of variables in FILTER expressions. Similar issues may arise whenever we use variables in nodes that “consume” these variables without matching them against the dataset. More concretely: using a triple pattern with a variable in an inner scope is not a problem, because the variable is matched against the dataset independently of the outside and will be joined with the outside part later. But when using SPARQL 1.1 constructs such as BIND or VALUES clauses, you may run into the same problems as sketched before for the FILTER expression. Look at the following query, which aims at extracting all persons (first part of the UNION) and all animals (second part of the UNION), including the respective types:

SELECT ?s ?type WHERE {
  BIND("http://example.com/" AS ?typeBase)
  {
    BIND(URI(CONCAT(?typeBase,"Person")) AS ?type)
    ?s a ?o
    FILTER(?o=?type)
  }
  UNION
  {
    BIND(URI(CONCAT(?typeBase,"Animal")) AS ?type)
    ?s a ?o
    FILTER(?o=?type)
  }
}

The problem is essentially the same: we bind variable ?typeBase in the outer group. In the inner UNION branches, we try to bind variable ?type based on ?typeBase, but the latter is not in scope there. Hence, the query returns an empty result.
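As with the earlier UNION example, one way to repair this query is to (redundantly) push the outer BIND down into both branches, so that ?typeBase is in scope where it is consumed (a sketch):

```sparql
SELECT ?s ?type WHERE {
  {
    BIND("http://example.com/" AS ?typeBase)
    BIND(URI(CONCAT(?typeBase,"Person")) AS ?type)
    ?s a ?o
    FILTER(?o=?type)
  }
  UNION
  {
    BIND("http://example.com/" AS ?typeBase)
    BIND(URI(CONCAT(?typeBase,"Animal")) AS ?type)
    ?s a ?o
    FILTER(?o=?type)
  }
}
```

Within each branch, the second BIND can see the ?typeBase binding introduced just before it, so both filters now behave as intended.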

Optimizations in Blazegraph

Strictly following bottom-up semantics when implementing a query engine is typically not a good choice when it comes to evaluation performance. Top-down evaluation, where we inject mappings from previous evaluation steps into subsequent subgroups, can lead to significant speedups. The good news is that, for a broad range of SPARQL queries, bottom-up and top-down evaluation coincide. This holds, for instance, for the complete class of SPARQL queries built only from triple patterns connected through “.” (so-called conjunctive queries).
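For illustration, the following query consists only of triple patterns connected through “.” (reusing the vocabulary of our sample data); for such conjunctive queries, any evaluation order, bottom-up or top-down, yields the same bindings:

```sparql
SELECT ?person ?image WHERE {
  ?person a <http://example.com/Person> .
  ?person <http://example.com/image> ?image .
}
```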

Blazegraph’s basic evaluation approach is top-down. For queries where top-down and bottom-up evaluation differ, Blazegraph rewrites the query so that its top-down evaluation result coincides with the bottom-up result where possible (wherever this is not possible, it essentially switches to bottom-up evaluation for that part of the query). Blazegraph’s optimizer implements various techniques for this purpose: in many cases, subgroups can simply be flattened out without changing semantics, variables in subgroups that are known to be unbound can be renamed to avoid clashes, and so on. With the fixes in 1.5.2 we have ruled out various inconsistencies between Blazegraph and the official W3C specification. If you plan to migrate from a previous version to 1.5.2 or later, we recommend reading the extended discussion of bottom-up semantics in our Wiki.
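To sketch one such rewriting: a subgroup that consists only of triple patterns can be flattened into the enclosing group without changing the query’s meaning, since triple patterns are matched against the dataset regardless of scope. For instance:

```sparql
# The subgroup in ...
SELECT ?s ?label WHERE {
  { ?s a <http://example.com/Person> }
  ?s rdfs:label ?label .
}
# ... can be flattened to the equivalent
SELECT ?s ?label WHERE {
  ?s a <http://example.com/Person> .
  ?s rdfs:label ?label .
}
```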

Summary

Although some of the examples above were deliberately simplistic in order to illustrate the issues arising from SPARQL’s bottom-up semantics, we have observed ill-designed queries of this style in practice, both in our own applications and in customer applications. We hope that the examples and guidelines in this post help our readers and users avoid the common pitfalls and write better, standards-compliant SPARQL in the future.

We’d love to hear from you.

Have you had similar experiences with SPARQL’s semantics? Or do you have a cool new application using Blazegraph, or are you interested in understanding how to make Blazegraph work best for your application? Get in touch or send us an email at blazegraph at blazegraph.com.


In SPARQL, Order Matters

One common mistake when writing SPARQL queries is the implicit assumption that, within SPARQL join groups, the order in which the components of the group are specified does not matter. By definition, SPARQL follows a sequential semantics when evaluating join groups and, although in many cases the order of evaluating patterns does not change the outcome of the query, there are situations where the order is crucial to getting the expected query result.

In Blazegraph 1.5.2 we’ve fixed some internal issues regarding the evaluation order within join groups, making Blazegraph’s join group evaluation strategy compliant with the official SPARQL W3C semantics. As a consequence, for some query patterns, the behavior of Blazegraph changed and if you’re upgrading your Blazegraph installation it might make sense to review your queries for such patterns, in order to avoid regressions.

A Simple Example

To illustrate why order in SPARQL indeed matters, assume we want to find all person URIs and, if present, their associated images. Instantiating this scenario, let us consider the query

### Q1a:
SELECT * WHERE {
  ?person rdf:type <http://example.com/Person>
  OPTIONAL { ?person <http://example.com/image> ?image }
}

over a database containing the following triples:

<http://example.com/Alice> rdf:type                    <http://example.com/Person> .
<http://example.com/Alice> <http://example.com/image>  "Alice.jpg" .
<http://example.com/Bob>   rdf:type                    <http://example.com/Person> .

The result of this query is quite obvious and in line with our expectations: We get one result row for Alice, including the existing image, plus one result row for Bob, without an image:

?person                    | ?image
---------------------------|------------
<http://example.com/Alice> | "Alice.jpg"
<http://example.com/Bob>   |

Now let’s see what happens when swapping the order of the triple pattern and the OPTIONAL block in the query:

### Q1b:
SELECT * WHERE {
  OPTIONAL { ?person <http://example.com/image> ?image } .
  ?person rdf:type <http://example.com/Person>
}

The result of Q1b may come as a surprise:

?person                    | ?image
---------------------------|------------
<http://example.com/Alice> | "Alice.jpg"

Where’s Bob gone?

As mentioned before, SPARQL evaluates the query sequentially. That is, in the first step it evaluates the OPTIONAL { ?person <http://example.com/image> ?image } block. As a result of this first step, we obtain:

?person                    | ?image
---------------------------|------------
<http://example.com/Alice> | "Alice.jpg"

No Bob in sight.

In a second step, this partial result is joined with the (non-optional) triple pattern ?person rdf:type <http://example.com/Person>. Intuitively speaking, the join with the triple pattern now serves as an assertion, retaining those result rows for which the value of variable ?person can be shown to be of rdf:type <http://example.com/Person>. This assertion holds for our (single) result, the URI of Alice, so the previous intermediate result is left unmodified.

So what we get in this case is the “set of all persons that have an image associated, including the respective images”. But does this informal description really capture what Q1b does? Consider the evaluation of Q1b over a database in which none of the persons has an associated image, say:

<http://example.com/Alice> rdf:type <http://example.com/Person> .
<http://example.com/Bob>   rdf:type <http://example.com/Person> .

Here’s the result:

?person                    | ?image
---------------------------|------------------------------------
<http://example.com/Alice> | 
<http://example.com/Bob>   | 

Surprised again?

Well, here’s the explanation: SPARQL evaluation starts out from the so-called “empty mapping” (also called the “universal mapping”). The OPTIONAL block does not retrieve any results and, given the semantics of OPTIONAL, this step simply retains the empty mapping. The empty mapping is then joined with the non-optional triple pattern, giving us the two bindings for ?person as the result, with ?image unbound in both.

Putting our observations together, the semantics of Q1b could be phrased as: “If there is at least one triple with predicate <http://example.com/image> in the database, return all persons that have an image (including the image); otherwise, return all persons.”

That’s quite different from our intended search request, isn’t it?

Practical Implications

The example above illustrates an interesting edge case which, given that the semantics sketched informally above is somewhat odd, does not frequently show up in practice. For join groups composed only of (non-optional) statement patterns, for instance, the order in which you write the triple patterns does not matter at all. But with the operators OPTIONAL and MINUS we may run into such ordering problems.
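A MINUS variant of the same pitfall, sketched against the Alice/Bob data from the beginning of this post: the intended query returns only the persons without an image, but moving the MINUS to the front silently turns it into a no-op.

```sparql
# Returns only Bob -- Alice is removed because she has an image:
SELECT ?person WHERE {
  ?person rdf:type <http://example.com/Person> .
  MINUS { ?person <http://example.com/image> ?image }
}

# Returns Alice AND Bob: the leading MINUS is applied to the empty
# mapping, which shares no variables with the inner pattern and is
# therefore never removed.
SELECT ?person WHERE {
  MINUS { ?person <http://example.com/image> ?image }
  ?person rdf:type <http://example.com/Person> .
}
```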

Interestingly, even for patterns involving OPTIONAL (and MINUS), in many cases different orders lead to the same result (independently of the data in the underlying database). Let’s look at the following variant of our initial query Q1a which, in addition, extracts the person’s label (assume our sample database also contains labels for the persons):

### Q2a
SELECT * WHERE {
  ?person rdf:type <http://example.com/Person> .
  ?person rdfs:label ?label .
  OPTIONAL { ?person <http://example.com/image> ?image } .
}

Following the same argument as in our example before, the OPTIONAL must not be moved to the beginning, but the following query

### Q2b
SELECT * WHERE {
  ?person rdf:type <http://example.com/Person> .
  OPTIONAL { ?person <http://example.com/image> ?image } .
  ?person rdfs:label ?label .
}

can be shown to be semantically equivalent. Without going into details, the key difference from our example at the beginning is that variable ?person has definitely been bound (in both Q2a and Q2b) when the OPTIONAL is evaluated; this rules out the strange effects discussed above.

If you’re confused now about where to place your OPTIONALs, here are some rules of thumb:

  1. Be aware that order matters.
  2. Specify the non-optional parts of the query first (unless there is a good reason not to).
  3. Then specify your OPTIONAL clauses to include the optional patterns of your graph.

In short, unless you want to encode somewhat odd patterns in SPARQL, it usually makes sense to put OPTIONAL patterns at the end of your join groups. If you’re still unsure about some of your queries, have a look at the semantics section of the official W3C specification.

Optimizations within Blazegraph 1.5.2

While the official SPARQL semantics uses a sequential approach, one key optimization of the Blazegraph query optimizer is to identify a semantics-preserving reordering of join groups that minimizes the amount of work spent evaluating the query. Among other heuristics (such as cardinality estimation for the individual triple patterns), optimizers typically try to evaluate non-optional patterns first (particularly in the presence of more complex OPTIONAL expressions).

As a consequence, queries like those sketched above challenge the Blazegraph query optimizer: reordering must be done carefully and on sound theoretical foundations in order to retain the semantics of the original query through the rewriting process. In Blazegraph 1.5.2, we have reworked Blazegraph’s internal join group reordering strategy.

For instance, Blazegraph 1.5.2 is able to detect equivalence transformations (as sketched for Q2a/Q2b) when evaluating queries and uses this knowledge to find more efficient query execution plans: if you run query Q2b above in Blazegraph with the EXPLAIN mode turned on, you’ll see that the optimizer moves the OPTIONAL clause to the end in the optimized abstract syntax tree.

We’d love to hear from you.

Do you have a cool new application using Blazegraph, or are you interested in understanding how to make Blazegraph work best for your application? Get in touch or send us an email at blazegraph at blazegraph.com.


Blazegraph 1.5.2 to Support Hybrid Search using External Solr Indices

While graph databases are a perfect fit for storing and querying structured data, they are not primarily designed to deal efficiently with unstructured data and keyword queries. Therefore, such unstructured data is often kept in dedicated systems that are laid out to tackle the specific challenges for evaluating keyword queries in an efficient way — including advanced techniques such as stemming, TF-IDF indexing, support for complex keyword search requests, scoring, etc.

Graph databases, on the other hand, are about connecting things, so in many scenarios we want to combine the capabilities of structured queries with those of queries against a fulltext index. To give just one simple example, assume we have a structured graph database with data about historical characters and a complementary keyword index over a corpus of historical texts (which may or may not be under our control). Assume we now want to combine structured queries  — asking, e.g., for persons that fall into certain categories such as epochs or countries they lived in  — with historical texts from the index that prominently feature these persons.

The upcoming Blazegraph 1.5.2 release will support such hybrid queries against external Solr fulltext search indices. The fulltext search feature has been implemented as a Blazegraph custom service: using a standards-compliant SPARQL SERVICE call with a reserved service URI, you can now easily combine structured search capabilities over the graph database with information held in an external Solr index.


Blazegraph’s hybrid search capabilities are currently used by the British Museum in the ResearchSpace project, which aims at building a collaborative environment for humanities and cultural heritage research using knowledge representation and Semantic Web technologies. In this context, Blazegraph’s hybrid search feature supports users in expressing complex search requests for cultural heritage objects. Hybrid SPARQL queries utilizing a Solr index are used to support semantic autocompletion: as the user types a keyword, hybrid queries are issued in real time to match keywords against entities in a cultural heritage knowledge graph. Depending on the current context of the search, persons, objects, or places are suggested, providing a user-friendly means to disambiguate terms as the user types. If you’re going to be in San Jose for the Smart Data conference, we’re giving a tutorial on the approach.

To illustrate the new hybrid search feature by example, a single SPARQL query like

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex: <http://www.example.com/>
PREFIX fts: <http://www.bigdata.com/rdf/fts#>
SELECT ?person ?kwDoc ?snippet WHERE {
        ?person rdf:type ex:Artist .
        ?person ex:associatedWith ex:TheBlueRider .
        ?person ex:bornIn ex:Germany .
        ?person rdfs:label ?label .
  SERVICE <http://www.bigdata.com/rdf/fts#search> {
        ?kwDoc fts:search ?label .
        ?kwDoc fts:endpoint "http://my.solr.server/solr/select" .
        ?kwDoc fts:params "fl=id,score,snippet" .
        ?kwDoc fts:scoreField "score" .
        ?kwDoc fts:score ?score .
        ?kwDoc fts:snippetField "snippet" .
        ?kwDoc fts:snippet ?snippet .
  }
} ORDER BY ?person ?score

would

  • first extract all persons associated with the group “The Blue Rider” that were born in Germany, then
  • take the label of these persons as search string and send a request against a Solr server, in order to extract a ranked list of articles for the respective persons (including text snippets where these persons are mentioned), next
  • order the results by person and relevance as requested by the ORDER BY, and finally
  • return the identified person URIs (variable ?person, from the graph database), the ID of the keyword index document (variable ?kwDoc, from the fulltext index), and the associated text snippet provided by the keyword index (variable ?snippet).

As the example illustrates, the keyword index is parameterized via a reserved “magic vocabulary”: for instance, within the SERVICE block, the object linked through fts:search identifies the search string to be submitted to the keyword index, while fts:endpoint refers to the address of the Solr server.

Of course, the hybrid search feature is not domain dependent: no matter what data has been loaded into your database and no matter what the keyword index looks like, you can now pose hybrid search queries against your data and the external index. The implementation even allows you to query multiple keyword indices within one query and, by use of SPARQL 1.1 federation, combine this with requests against multiple SPARQL endpoints at a time. The search string can be dynamically extracted from the database (as in the example above, where we bind variable ?label through a structured query and use it as a search string) or can be a static search string. Moreover, nothing prevents you from using more complex Solr keyword search strings with boolean connectives such as AND, OR, or negation: in SPARQL, such complex search strings can easily be composed by the use of BIND in combination with string concatenation. For instance, we may modify the first part of our example as

...
        ?person ex:associatedWith ex:TheBlueRider .
        ?person ex:bornIn ex:Germany .
        ?person rdfs:label ?label .
        BIND(CONCAT("\"", ?label, "\" AND -\"expressionism\"") AS ?search)
        SERVICE <http://www.bigdata.com/rdf/fts#search> {
                ?kwDoc fts:search ?search .
                ...
        }
...

in order to search for keyword index documents mentioning these persons without explicitly mentioning “expressionism” (the “-” operator in Solr expresses negation).

If you want to learn more about Blazegraph’s upcoming Solr index support, please check out the documentation in our Wiki.

We’d love to hear from you.

Do you have a cool new application using Blazegraph, or are you interested in understanding how to make Blazegraph work best for your application? Get in touch or send us an email at blazegraph at blazegraph.com.
