Over the years I have come to realise that few ideas are successful for the reasons they were originally developed.
Sometimes you just have to ask the question “How did we get here?”.
What makes XML a success? Is it down to a unique ability to represent structured data? I think not. It is more likely because it is a lot like HTML, and because a broad range of tools has grown up to parse and process the format. Why is HTML a success? Partly because of its de-facto use in markup and the magical result of browsers rendering the documents, but also because it is really easy to do something really simple and then not much harder to do something a little more complicated.
So what about RDF? RDF takes a ride on XML acceptance and introduces a triple-based data model. Where XML provides hierarchical data modelling, RDF provides a simple list of triples (at least theoretically). With this simplicity comes the problem… interpretation. XML naturally encapsulates data, whereas RDF allows data to be extended by asserting disconnected triples. We must follow a set of rules to interpret the triples and form coherent data structures, and these rules must be understood whenever data is created or accessed.
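To make the contrast concrete, here is a minimal sketch (plain Python; the record, predicate names, and `describe` helper are all illustrative, not part of any RDF library) of the same data held as a nested, XML-like structure and as a flat list of triples. Rebuilding the structure from the triples needs an interpretation rule: nothing in the triples themselves says which objects should be expanded inline.

```python
# The same customer record, first as a nested (XML-like) structure...
customer_nested = {
    "customer": {
        "name": "Alice",
        "order": {"product": "Widget", "quantity": 2},
    }
}

# ...and as a flat, unordered list of (subject, predicate, object) triples.
triples = [
    ("cust1", "name", "Alice"),
    ("cust1", "hasOrder", "order1"),
    ("order1", "product", "Widget"),
    ("order1", "quantity", 2),
]

def describe(subject, triples):
    """Reassemble a nested view of `subject` by following predicates.

    This is the interpretation rule the text refers to: we decide that
    any object which is itself the subject of other triples should be
    expanded inline. The triples alone do not tell us this.
    """
    result = {}
    for s, p, o in triples:
        if s != subject:
            continue
        if any(t[0] == o for t in triples):
            result[p] = describe(o, triples)  # expand linked resource
        else:
            result[p] = o  # plain literal value
    return result

print(describe("cust1", triples))
# {'name': 'Alice', 'hasOrder': {'product': 'Widget', 'quantity': 2}}
```

The nested form carries its structure implicitly; the triple form must have structure imposed on it, which is exactly where disagreements over interpretation arise.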
With BigData we want to be able to store really large amounts of information with flexible representation. A triple store provides for the flexibility, but we must be careful not to create complex problems of interpretation.
My strong feeling is that too much debate has been focused on RDF syntax and its interpretation. Instead we should understand that a triple store can flexibly represent data structures and then discover how we are able to represent the data we want to use.
So, I am arguing for a general re-appraisal of the approach to building RDF-based applications: one that shifts the focus away from RDF itself but can still leverage the underlying representation when appropriate.
Is it really the case that we want to make SPARQL queries against RDF data? Of course, if we do have an underlying RDF representation then such queries could be made, but is this really the goal? When a developer is tasked with displaying a list of products purchased by a customer, do they want to write a SPARQL query, or do they just want the list of products?
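The contrast can be sketched as follows. Both halves are hypothetical: the SPARQL fragment assumes made-up `ex:` predicates, and `CustomerModel` with its `products_purchased` method is an invented domain interface, not an API of any existing store. The point is only what the developer has to write.

```python
# What the developer is asked for: the products purchased by a customer.

# Option 1: query the raw RDF directly (illustrative SPARQL, with
# hypothetical ex: predicates the developer must know about).
SPARQL = """
SELECT ?product WHERE {
  ?order ex:customer ex:cust1 .
  ?order ex:product  ?product .
}
"""

# Option 2: a domain-specific interface that hides the triples entirely.
class CustomerModel:
    """Hypothetical domain model backed by a list of triples."""

    def __init__(self, triples):
        self._triples = triples

    def products_purchased(self, customer):
        # Find the customer's orders, then the product of each order.
        orders = {s for s, p, o in self._triples
                  if p == "customer" and o == customer}
        return [o for s, p, o in self._triples
                if p == "product" and s in orders]

triples = [
    ("order1", "customer", "cust1"),
    ("order1", "product", "Widget"),
    ("order2", "customer", "cust1"),
    ("order2", "product", "Gadget"),
    ("order3", "customer", "cust2"),
    ("order3", "product", "Sprocket"),
]

model = CustomerModel(triples)
print(model.products_purchased("cust1"))  # ['Widget', 'Gadget']
```

Internally the model may well issue triple-pattern queries just like the SPARQL version; the difference is that the interpretation rules live in one place rather than in every caller.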
Many years ago someone coined the term “impedance mismatch” to describe the problem of transforming data between the format used by a programming language and that used to represent it externally, for example in a database. A figure that I heard repeated was that 90% of computer code was involved with data transformation. I suspect this figure is now much higher since we no longer have to worry only about data storage transformations but also other representations. Discussing this recently with respect to RDF the phrase “mother of all impedance mismatches” was coined. We’re going to call this MAIM!
So what does this mean for BigData? Well, we are aiming to solve MAIM by providing a toolset (interfaces and metadata) that enables the easy creation of domain specific models. We will use an underlying triple representation augmented with indices to support the efficient access to domain data. The flexibility of the “schema-free” triple-based representation will enable data sharing between different models/ontologies, while the domain specific metadata will resolve issues of interpretation in defined contexts.
A key advantage of this approach is that the triple representation and custom indices are driven by the requirement to support access patterns for the domain model. So, along with solving MAIM, we also hope to save on time spent discussing which RDF representation should be used.
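A minimal sketch of what "indices driven by access patterns" can mean, assuming the simplest possible in-memory triple list (the class name and shape are illustrative, not the BigData design): if the domain model is known to look up subjects by predicate and object, a purpose-built index turns a full scan of the triples into a dictionary lookup.

```python
from collections import defaultdict

class PredicateObjectIndex:
    """Illustrative index for one access pattern: subjects by (p, o).

    Built once over the triple list, so domain lookups that match the
    pattern no longer scan every triple.
    """

    def __init__(self, triples):
        self._index = defaultdict(list)
        for s, p, o in triples:
            self._index[(p, o)].append(s)

    def subjects(self, predicate, obj):
        return self._index.get((predicate, obj), [])

triples = [
    ("order1", "customer", "cust1"),
    ("order2", "customer", "cust1"),
    ("order3", "customer", "cust2"),
]

idx = PredicateObjectIndex(triples)
print(idx.subjects("customer", "cust1"))  # ['order1', 'order2']
```

Which indices to build falls straight out of the domain model's access patterns, which is the advantage claimed above: the representation serves the application, rather than the application contorting itself around a fixed RDF layout.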