Last month I had the opportunity to participate in the Big Graph Data panel at ISWC 2012 with Tim Berners-Lee, Mike Stonebraker, and John Giannandrea, with Frank van Harmelen as the moderator. If you missed the panel, you can catch it online at videolectures.net.
A few interesting positions emerged during the panel:
– Relational databases : Stonebraker points out that the big relational database platforms are old technology. This is not a critique of the relational model, but of the implementations. He points to new main-memory platforms with very high transaction rates. Well and good, but not really about graphs. My own spin on relational databases is that they *could* be doing better. The bigdata(R) platform itself is based on the same kinds of technology used by relational databases – B+Trees, hash indices, multi-version concurrency control (MVCC), etc. There are some critical challenges for relational databases serving as high-performance graph databases, including: (a) SPARQL queries with 50-100 joins are common, but cost-based query optimizers blow up on the state space of possible query plans; (b) schema fluidity (any property may take ANY value); and (c) property cardinality (multiple values for the same cell). Beyond that, the Big Problem is getting the physical schema for the database right.
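To see why the optimizer's search space explodes at 50-100 joins, it is enough to count left-deep join orderings alone, which grow as n! in the number of relations (bushy plans are worse still). A toy calculation, purely illustrative:

```python
import math

def left_deep_plans(n):
    """Number of left-deep join orderings of n relations: n!."""
    return math.factorial(n)

for n in (5, 10, 50):
    print(f"{n} relations -> {left_deep_plans(n):.3e} left-deep plans")
```

At 50 relations the count is already around 3e64, which is why optimizers fall back on heuristics or simply give up enumerating plans at this scale.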
– Big graphs : There is a consensus that this is a basic research problem. Google’s Knowledge Graph (John Giannandrea) is actually “little data”. The Knowledge Graph is a relatively small set of highly curated relationships. It fits nicely in main memory, and that is how it is served out. There are hard issues with large graphs that the research community is still addressing. Key problems are how to organize the information for locality, how to partition the information among the nodes of a cluster, and how to support both structured graph query (SPARQL) and efficient graph mining algorithms.
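The simplest partitioning scheme, sketched below, hashes on the subject vertex so that all of a vertex's out-edges land on one node. This is a strawman, not how any particular system does it: hashing gives load balance but destroys locality between neighboring vertices, which is exactly the tension the research community is still working on.

```python
import hashlib
from collections import defaultdict

def partition_of(vertex_id, num_nodes):
    """Deterministic hash partition keyed on the subject vertex."""
    digest = hashlib.md5(vertex_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes

def partition_edges(edges, num_nodes):
    """Co-locate every out-edge (s, p, o) of a vertex on one cluster node."""
    shards = defaultdict(list)
    for s, p, o in edges:
        shards[partition_of(s, num_nodes)].append((s, p, o))
    return shards
```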
– Schema Alignment : This is not just a hard problem; it is an intractable one. Even within a single data set there is always a schema alignment problem. I like to point to the UMLS (Unified Medical Language System) as applied by the National Library of Medicine. This gold standard for taxonomies has an inter-rater reliability of only 50%, which means that different trained reviewers are likely to assign different tags to the same article. We need to begin looking at schema alignment as a problem of noisy data and apply techniques that are robust to noise. The data mining community is doing this right now, but without an explicit focus on schema alignment.
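Treating alignment as a noisy-labels problem starts with measuring the noise. A chance-corrected agreement statistic such as Cohen's kappa is the usual tool; a minimal sketch, not tied to any particular library:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items.
    Assumes at least some disagreement is possible (expected < 1)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each rater's marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

A kappa near zero means the raters agree no more than chance would predict, which is the regime robust-to-noise methods have to assume.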
– Map/Reduce and Key-Value Stores : These are the wrong technologies for graph processing. A lot of noise and effort is going into trying to cram large graphs into these platforms. Map/Reduce has tremendous latencies, and it is painful watching people try to architect around the latency of running a map/reduce job. Distributed computing can be fast, scalable, and low latency – people have been doing this for years. Key-value stores are great for storing property sets, but they cannot handle the cardinality associated with the link sets of high-cardinality vertices. If the logical row model could be extended to support paged access to link sets, then this technology could be applied to a highly scalable linked data cache (à la the diplodocus sub-graph clusters), but you will never be able to implement an efficient graph query language against a key-value store alone.
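The paged extension argued for above might look something like this over a plain key-value interface. A dict stands in for the store, and the composite key layout is hypothetical; the point is that no single row ever has to hold the whole link set of a high-cardinality vertex.

```python
PAGE_SIZE = 1024  # links per page; a real system would tune this

def store_link_set(kv, vertex, links, page_size=PAGE_SIZE):
    """Split a vertex's link set into fixed-size pages under composite keys."""
    pages = [links[i:i + page_size] for i in range(0, len(links), page_size)]
    kv[f"{vertex}:out:npages"] = len(pages)
    for i, page in enumerate(pages):
        kv[f"{vertex}:out:{i}"] = page

def iter_link_set(kv, vertex):
    """Stream a link set page by page instead of materializing one huge row."""
    for i in range(kv[f"{vertex}:out:npages"]):
        yield from kv[f"{vertex}:out:{i}"]
```

Even with paging, this gives you GET-style access to one vertex at a time; join processing across vertices still has to live somewhere else.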
– There is no “graph database community” – rather we have distinct communities for graph theory, graph mining (Apache Giraph, Spark’s Bagel, Signal/Collect, GraphLab), object-based graph databases (Blueprints et al.), and semantic web graph databases. We need to start talking. Object-based graph databases are optimized for property set retrieval – which is basically the same goal as a key-value store. That’s why things like Titan can be layered over key-value stores. However, they are NOT optimized for queries that cut across graphs – which is the sweet spot for semantic web graph databases. Graph databases should optimize for both complex SPARQL query processing and the lookup of the property set + link set for a vertex (after all, this is the Linked Data “GET URI” method). The LDBC has members from at least three of these communities (graph mining, object-based graph databases, and semantic web graph databases). Hopefully it will drive standards and benchmarks for linked data access, declarative query, and high-level APIs for graph mining.
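That per-vertex lookup is cheap in a semantic web store because a sorted SPO (subject, predicate, object) index answers it with one range scan: the property set and the link set of a vertex come back together. A minimal sketch over an in-memory sorted triple list, standing in for a B+Tree index:

```python
from bisect import bisect_left

def describe(spo_index, s):
    """Return every triple with subject s (the vertex's property set plus
    link set) via one range scan on a sorted SPO index."""
    out = []
    i = bisect_left(spo_index, (s,))  # (s,) sorts before any (s, p, o)
    while i < len(spo_index) and spo_index[i][0] == s:
        out.append(spo_index[i])
        i += 1
    return out
```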
– High level APIs : I see the success of Blueprints as a disaster for everyone. Blueprints reduces to pointer chasing against the disk (if the data are durable) and locks in the messaging model for graph processing (making efficient distributed graph mining impossible). APIs that force programmers to write procedural code for each data access or query are a great way to keep programmers employed, but a miserable way to develop applications. High-level declarative query lets the database query planner optimize the access and hides the physical schema, permitting the database to evolve over time. Yes, we need APIs for writing graph-mining kernels, but they need to allow the underlying implementation to maximize the possible parallelism and abstract away from the physical schema. Rather than hard-coding to APIs, think standard abstractions, LLVM, and cross-compilation.
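The difference is easy to see in miniature. With a declarative pattern list, the engine, not the programmer, owns the join order: a planner can reorder the patterns by selectivity before handing them to a matcher like the toy one below (variables are '?'-prefixed strings; everything here is illustrative, not any real API).

```python
def eval_patterns(triples, patterns, binding=None):
    """Evaluate triple patterns by nested loops, threading variable bindings.
    Because the caller states *what* to match rather than *how*, a planner
    is free to reorder `patterns` before this function ever runs."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    pattern, rest = patterns[0], patterns[1:]
    for triple in triples:
        b = dict(binding)
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if b.setdefault(term, value) != value:
                    break  # variable already bound to a different value
            elif term != value:
                break  # constant term does not match
        else:
            yield from eval_patterns(triples, rest, b)
```

A procedural pointer-chasing API bakes the traversal order into application code; here, swapping the order of the two patterns below changes nothing the caller can observe.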
SYSTAP is leading a DARPA XDATA team for accelerated graph processing on GPU clusters where we are looking at some of these issues – I will post on this soon.