Custom Functions

We’ve added a new page to the wiki which documents how to write your own custom functions. The wiki page includes some examples and links you to heavily documented source code in SVN.

Bigdata uses a vectored query engine. Chunks of solutions flow through the query plan operators. There is parallelism across queries, across operators within a query, and within an operator (multiple instances of the same operator can be evaluated in parallel). Operators broadly break down into those which operate on solutions and those which operate on value expressions. The former are vectored, operate on chunks of solutions at a time, and have access to the indices. The latter are not vectored, operate on a single solution at a time, and do not have access to the indices.
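To make the distinction concrete, here is a minimal sketch of the two operator styles. It is not the bigdata API: `ChunkSketch`, `ValueExpression`, and `filterChunk` are hypothetical names invented for illustration, and a solution is modeled as a plain `Map` rather than a real binding set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ChunkSketch {

    /** Hypothetical value expression: sees exactly one solution, no index access. */
    interface ValueExpression<T> {
        T get(Map<String, Object> solution);
    }

    /**
     * Hypothetical solution operator: receives a whole chunk of solutions
     * and applies the value expression to each solution in turn.
     */
    static List<Map<String, Object>> filterChunk(
            List<Map<String, Object>> chunk,
            ValueExpression<Boolean> condition) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> solution : chunk) {
            // Keep the solution only when the condition evaluates to true.
            if (Boolean.TRUE.equals(condition.get(solution))) {
                out.add(solution);
            }
        }
        return out;
    }
}
```

A custom function plays the role of the value expression here: it is handed one solution at a time, while the operator around it handles the chunking.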

People who write custom functions need to be aware of IVs, which are the “Internal Value” objects used to represent RDF Values inside bigdata. There are many different kinds of IVs, including those which are fully inline (supporting xsd datatypes, etc.) and those which are assigned by an index (TERM2ID or BLOBS, depending on the size of the Value). IVs are used directly in the statement indices and in query processing.
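The difference between the two families of IVs can be sketched roughly as follows. The classes below are hypothetical stand-ins, not the real bigdata classes: an inline IV carries the value itself, while an index-assigned IV carries only an identifier that must be looked up in the lexicon to recover the Value.

```java
import java.util.OptionalInt;

public class IVKindsSketch {

    /** Hypothetical, much simplified IV interface. */
    interface SketchIV {
        boolean isInline();
    }

    /** Inline IV: the xsd:int value is stored in the IV itself. */
    static final class InlineIntIV implements SketchIV {
        final int value;
        InlineIntIV(int value) { this.value = value; }
        public boolean isInline() { return true; }
    }

    /** Index-assigned IV: only an identifier, as if handed out by TERM2ID. */
    static final class TermIdIV implements SketchIV {
        final long termId;
        TermIdIV(long termId) { this.termId = termId; }
        public boolean isInline() { return false; }
    }

    /** Returns the int when it can be read from the IV alone, otherwise empty. */
    static OptionalInt inlineIntValue(SketchIV iv) {
        if (iv instanceof InlineIntIV) {
            return OptionalInt.of(((InlineIntIV) iv).value);
        }
        return OptionalInt.empty();
    }
}
```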

Solutions flowing through a bigdata query are modeled using IVs. RDF Values in the query are batch resolved to IVs when the query is compiled and then “cached” on the IV. This “IVCache” is the critical bit of glue which lets you access the materialized RDF Value in a custom function. There are methods which encapsulate the work required to turn an IV into a Value and a Value into an IV. You can use those methods and ignore the IV interface for the most part, but if you put in a little more effort you can often dramatically improve the performance of your custom function.
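The caching idea can be illustrated with a small sketch. Everything here is an assumption made for illustration: `CachingIV`, `resolve`, and the `Map` standing in for the TERM2ID index are hypothetical, and RDF Values are reduced to plain strings. The real interfaces are richer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class IVCacheSketch {

    /** Hypothetical IV that can carry a cached copy of its materialized value. */
    static final class CachingIV {
        final long termId;
        private String cached; // null until a value is attached

        CachingIV(long termId) { this.termId = termId; }

        boolean hasValue() { return cached != null; }

        String getValue() {
            if (cached == null) {
                throw new IllegalStateException("value not materialized");
            }
            return cached;
        }

        void setValue(String value) { this.cached = value; }
    }

    /**
     * Resolves a batch of values to IVs and caches each value on its IV.
     * The map is a stand-in for a lexicon index; unknown values are rejected.
     */
    static List<CachingIV> resolve(List<String> values, Map<String, Long> term2id) {
        List<CachingIV> out = new ArrayList<>();
        for (String value : values) {
            Long id = term2id.get(value);
            if (id == null) {
                throw new IllegalArgumentException("unknown term: " + value);
            }
            CachingIV iv = new CachingIV(id);
            iv.setValue(value); // attach the value so later code needs no lookup
            out.add(iv);
        }
        return out;
    }
}
```

In this sketch an IV produced by `resolve` answers `getValue()` immediately, while an IV created from a bare identifier does not until something materializes it.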

Bigdata tries to avoid RDF Value materialization whenever possible. IVs are more compact, are faster to process, and do not require lookups against the lexicon indices. If the query engine decides that it needs to materialize some variable before evaluating a filter or a projection, then it will do that automatically. Custom functions which can process IVs natively are significantly faster than those which rely on materialized RDF Values. These functions have “NEVER” materialization requirements. Many functions rely on materialized Values, but can use a “fast path” to quickly reject arguments which are not valid for that function. For example, functions which require literals as arguments can test IV.isLiteral() and throw a SparqlTypeErrorException if the argument is not a literal. These functions have “SOMETIMES” materialization requirements. Finally, there are functions which “ALWAYS” need materialized Values. Often you can convert an ALWAYS function into a SOMETIMES function with a little more work and get a big performance boost for your efforts.
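The “fast path” for a SOMETIMES function might look roughly like the sketch below. It does not use the real bigdata classes: `SketchIV`, `TypeError` (standing in for SparqlTypeErrorException), the `Requirement` enum, and `strLen` are all hypothetical names chosen for illustration.

```java
public class MaterializationSketch {

    /** The three requirement levels named in the text. */
    enum Requirement { NEVER, SOMETIMES, ALWAYS }

    /** Hypothetical stand-in for the type error raised on an invalid argument. */
    static final class TypeError extends RuntimeException {
        TypeError(String message) { super(message); }
    }

    /** Hypothetical IV: a literal flag plus an optional cached label. */
    static final class SketchIV {
        private final boolean literal;
        private final String cachedLabel; // null when not materialized

        SketchIV(boolean literal, String cachedLabel) {
            this.literal = literal;
            this.cachedLabel = cachedLabel;
        }

        boolean isLiteral() { return literal; }
        String cachedLabel() { return cachedLabel; }
    }

    /** This function only needs a materialized value for literal arguments. */
    static final Requirement STRLEN_REQUIREMENT = Requirement.SOMETIMES;

    /** String length of a literal argument. */
    static int strLen(SketchIV arg) {
        // Fast path: reject non-literals from the IV alone, with no lookup.
        if (!arg.isLiteral()) {
            throw new TypeError("argument is not a literal");
        }
        // Slow path: the label is available only after materialization.
        String label = arg.cachedLabel();
        if (label == null) {
            throw new IllegalStateException("value not materialized");
        }
        return label.length();
    }

    public static void main(String[] args) {
        System.out.println(strLen(new SketchIV(true, "bigdata")));
    }
}
```

The point of the ordering is that the cheap check on the IV runs first, so invalid arguments are rejected before any materialized Value is touched.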

Have fun!
