In this post, I will go into why there is a “big graph anti-pattern”, the fundamentally different kinds of graph processing, how to match the technology to the problem, and some successful patterns for scalable graph processing.
The big graph anti-pattern is “Throw everything into a big graph and then use the same tools that gave us horizontal scaling for other problems: map/reduce and key-value stores.”
There are several fallacies here. First, there are many types of graph processing and you cannot use the same architecture to scale all of them. This is the main focus of this post and I will go into detail on why this does not work below. Second, the advantage of throwing the data together is that you can move on to finding information immediately. However, throwing the data together does not eliminate the schema alignment problem. It just lets you choose when you are going to deal with it and how much effort you will put into it. You still need to understand the data and the analytics to interpret either one. Lastly, if you are using statistical models, then you can run into problems if there are too many variables and your model winds up lacking predictive power. You are better off focusing on the information that you need to make a specific decision rather than allowing statistical algorithms to go off on a fishing expedition.
There are some fundamental architectural differences in systems for high performance graph traversal and graph analytics, systems for high performance graph pattern matching, map/reduce platforms, and key-value stores. If you only test at a small data scale, the scaling properties of these different architectures are not as evident and your benchmarking will fail to predict the actual performance characteristics of these technologies on larger graphs. As the data scale increases, the differences become significant and determine what does and does not scale.
The only way to get scaling and high throughput for graph traversal and graph mining is to get the architecture, the software, and the hardware right. If you make the wrong choices, the communications costs change from O(N) (using 2D partitioning) to O(N*N) (the best case using any other approach). These problems are bandwidth limited, so you can’t just throw CPU cores and main memory at them. Efficient parallel graph algorithms are hard and good implementations will bottleneck at the CPU memory bus – beyond that you get negative scaling – your system just slows down as you add more CPU cores. Another rarely appreciated bottleneck is the high cost of sequential code. Any sequential code in your algorithm, including between iterations, has a huge negative impact on throughput since all the other cores are sitting idle. This is especially true for graphs with a high diameter, such as bitcoin transaction data or road maps.
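The cost of sequential code follows directly from Amdahl's law. As a rough sketch (the 5% sequential fraction below is purely illustrative, not a figure from any benchmark):

```python
def amdahl_speedup(seq_fraction, cores):
    """Amdahl's law: speedup = 1 / (s + (1 - s) / p).

    Even a small sequential fraction s caps the speedup available from
    p parallel cores, which is why sequential code between iterations
    hurts so badly while the other cores sit idle."""
    return 1.0 / (seq_fraction + (1.0 - seq_fraction) / cores)

perfect = amdahl_speedup(0.0, 64)   # 64x: no sequential code at all
capped = amdahl_speedup(0.05, 64)   # ~15x: 5% sequential code (illustrative)
```

With many iterations on a high diameter graph, even that 5% compounds, since the sequential step runs between every pair of parallel phases.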
Kinds of graph processing. There are several very different types of “graph” operations:
- Gathered reads for property and link set retrieval. This is a parallel workload for the random retrieval of attribute sets from known vertices and edges of interest. This is a good fit for a key-value store.
- Graph traversal algorithms, such as BFS (breadth first search) and SSSP (shortest paths). The workload for these algorithms requires visiting each vertex in the graph at least once through a series of iterative 1-hop expansions over the graph. The size of the frontier (the vertices to be visited in the next round) often grows exponentially before eventually decaying, so the solution must be able to handle both small and large frontiers efficiently. The number of iterations depends strongly on the depth of the graph. Social networks may have a depth of 6. Road networks or bitcoin transaction data can have a depth of 10,000. This class of algorithms often does very little work per edge and vertex visited, so it places an extreme burden on the memory bus. This extreme workload is why BFS is used for the Graph 500. Disk is not a good fit here.
- Graph analytic algorithms, such as page rank, k-means, etc. These algorithms all have a workload that requires multiple full visitations of the entire graph. The frontier starts out with all vertices and then slowly drops towards zero as the algorithm converges. You have to read all the data, multiple times. Again, disk is not a good fit here. You need memory bandwidth. Lots of memory bandwidth. More than you can get from a CPU.
- Graph query (aka graph pattern matching). This involves matching specific patterns among the vertices, edges, and their labels in the graph using a high level query language such as SPARQL or Cypher (neo4j’s graph query language). Performance here depends on (a) a good query optimizer to reorder the joins in order to minimize the required effort; (b) propagating constraints from one join to the next in order to read less data and solve the query more rapidly; and (c) a good choice of indices (nice graph databases do not make you guess at what indices you need, they handle this automatically). This is one case where disk is a good fit. Index data structures can be used to rapidly identify a subset of the data to be read, but memory bandwidth is still a bottleneck for databases – this is the whole reason behind the emergence of the column-wise storage (something that we will introduce into bigdata this year).
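To make the traversal workload concrete, here is a minimal level-synchronous BFS (a sketch in Python, not code from any of the systems discussed; the adjacency list and toy graph are invented for illustration). Note how the frontier grows and then collapses:

```python
def bfs_levels(adj, source):
    """Level-synchronous BFS: one 1-hop expansion per iteration.

    Returns each vertex's depth and the frontier size per level; the
    frontier typically grows sharply before it decays."""
    depth = {source: 0}
    frontier = [source]
    frontier_sizes = []
    level = 0
    while frontier:
        frontier_sizes.append(len(frontier))
        next_frontier = []
        for u in frontier:
            for v in adj.get(u, ()):
                if v not in depth:          # visit each vertex only once
                    depth[v] = level + 1
                    next_frontier.append(v)
        frontier = next_frontier
        level += 1
    return depth, frontier_sizes

# Toy graph (invented): a fan-out that re-converges on vertex 7.
adj = {0: [1, 2], 1: [3, 4], 2: [5, 6], 3: [7], 4: [7], 5: [7], 6: [7]}
depth, sizes = bfs_levels(adj, 0)           # sizes shows growth, then decay
```

The number of `while` iterations is exactly the traversal depth, which is why an 8,000-deep graph is so punishing on platforms with high per-iteration overhead.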
Graph databases (or at least high level query against a graph database) and graph mining systems have fundamentally different workloads and require different techniques. When it comes to gathered property set retrieval, Accumulo, Cassandra, and related key-value stores are all very similar technologies. Property set lookups can be parallelized against any of them once the desired vertex set is known, so choose whatever works for you. However, key-value stores are not able to provide efficient graph query or efficient graph mining/traversal. I will try to explain why below. First, I will focus on why they can not be used to create scalable graph traversal and graph analytic solutions.
Graph traversal. For graph traversal, you need forward and reverse indices to follow links. A graph database normally carries at least a forward and reverse index and can therefore be used for graph traversal, but this is not efficient and it is not scalable.
There are a few problems. First, graph traversal algorithms typically need to visit large parts of the graph in multiple iterations. This is not efficient against disk if there is any random access. (GraphChi is an example that uses an IO efficient solution against disk. However, it must read the entire graph in each iteration. While it scales well on a single machine, it cannot be used for low latency operations, is a poor choice for algorithms such as BFS or SSSP, and it does not have the throughput of a main memory or GPU based solution.) Second, the approach to horizontal scaling for graph traversal must be based on what is variously called vertex cuts (GraphLab uses this nomenclature) or a 2D decomposition (this is the terminology in the HPC space, which has been doing this for years for sparse matrix vector operations, which are very similar to graph operations).
In a 2D decomposition, the access to the links of the graph is decomposed against a 2-dimensional compute grid over virtual nodes. The vertices are organized into the rows and columns of the compute topology by using the same partitions for both dimensions. The rows provide access to the out-edges of the graph. The columns provide access to the in-edges of the graph. The diagonal consists of those edges whose source and target vertices fall into the same partition. Using a naïve partitioning strategy, the vertices are assigned to partitions by dividing the vertex identifier by the number of rows/columns in the compute grid. Graph aware partitioning can be used to increase the local density and interconnectedness of those partitions by what amounts to relabeling the vertex identifiers.
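A minimal sketch of the naïve layout just described (illustrative Python only; real systems store the edge blocks far more compactly):

```python
import math

def make_2d_layout(edges, num_vertices, p):
    """Sketch of a 2D (vertex-cut) layout over a p x p compute grid.

    Vertices are split into p contiguous blocks (the naive strategy);
    edge (src, dst) lands on grid block (partition(src), partition(dst)).
    Row i then holds all out-edges of partition i, column j holds all
    in-edges of partition j, and the diagonal holds the edges whose
    endpoints fall into the same partition."""
    block_size = math.ceil(num_vertices / p)
    part = lambda v: v // block_size
    grid = {(i, j): [] for i in range(p) for j in range(p)}
    for src, dst in edges:
        grid[(part(src), part(dst))].append((src, dst))
    return grid

# Toy graph on 8 vertices over a 2x2 grid: partitions {0..3} and {4..7}.
edges = [(0, 1), (0, 5), (2, 6), (4, 3), (7, 7)]
grid = make_2d_layout(edges, num_vertices=8, p=2)
```

With this layout, gathering the in-edges of one partition touches only the p blocks in that partition's column, which is what keeps the communication pattern regular.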
To gather the in edges for all vertices in the frontier whose vertex identifier falls into a given partition, the operation is decomposed into a local operation on each compute node in the corresponding column of the 2D compute topology. The intermediate per-compute node results are then aggregated back to the compute node processing that partition. This parallelizes the effort across the column. The scatter over the out edges is pretty much the same, but the operation is parallelized over a row of the 2D compute topology. GraphLab and HPC sparse matrix vector multiplication systems all use this approach. So does the system that took first place in the Graph 500.
A 2D decomposition addresses two basic problems. The first is co-location of the link weights in the forward and reverse traversal directions. This is vital for algorithms that modify the link weights during traversal. Without a 2D decomposition, link updates can be 1:1 with only one of the indices; on the other index they have a random access pattern. This causes a severe bottleneck. The second is communication volume: a 2D decomposition minimizes it by decomposing the gather and scatter phases over a row or column of the 2D compute grid. Compute nodes outside of the row (scatter) or column (gather) do not participate in that operation. This means that the communication pattern is both efficiently parallelized and regular.
Graph databases and Blueprints cannot scale well for graph traversal or graph analytics. First, graph databases need to minimize the data read for efficient high level query, so they use a very different data partitioning scheme (not 2D). While they can be used for graph traversal algorithms on small data sets, the approach is simply not scalable for the reasons outlined above, e.g., O(N^2) communications plus the high latency associated with the disk. Client-based graph traversal APIs, such as Blueprints, face another problem as well – the client is doing round trips with the server/cluster. This is not only a throughput bottleneck, but it also limits the size of the frontier and the computation state to the memory of the client.
You might ask, can’t I get away with using a graph database and Blueprints? Not if you want your solution to scale. The promise of simple scaling that we have for key-value stores and map/reduce simply does not hold for graphs. You must be using the right technology to get beyond toy problems. How can you tell if you are using the right technology? Look at the data layout – if it is a key-value store, it will not scale for graph traversal or graph analytics. Look at the graph computation – if it is guided by a client API such as Blueprints, it will not scale. Look at the platform – if it has huge latencies for each iteration, such as Hadoop, it will not scale to graphs with large diameters. (With one minute of overhead per map/reduce job and 8,000 iterations for BFS on bitcoin, you would have to wait 8,000 minutes just for the job scheduling overhead on Hadoop – MPGraph does the entire computation in 300ms.)
Graph query. There are a few reasons why the graph traversal and graph analytics architectures do not perform well for graph query. The main reason is that each join needs to be distributed across a row or column of the 2D compute topology. While this is efficient for graph traversal, it is not efficient for low latency graph pattern matching queries.
The way to make graph pattern matching queries fast is to optimize the join ordering (the most selective access path is run first), read as little data as possible, and then feed constraints from that join into the remaining joins. This can be done either by passing along intermediate solutions containing variable bindings discovered in the data and using nested indexed joins (bigdata does this) or through sideways information passing that lets you skip parts of an access path that are provably unable to join (RDF3X does this). Either way, you execute the joins in order of their selectivity and pass along constraints that allow you to avoid reading most of the data. This is how to achieve low latency for high level declarative graph query languages. This is also why people have such difficulties building scalable high level query solutions over existing key-value stores. These architectures do not provide ways to constrain the access paths based on the data already read in previous joins. As a result, they wind up sending all the data from each access path back to the client, which is then forced to do the join locally in memory on the client. This approach reads way too much data, slams the network, and slams the client. For example, this is why Rya cannot scale for graph query. Accumulo does not let Rya flow the query over the data.
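The join-ordering and binding-propagation idea can be sketched as a toy triple matcher (this is not bigdata's or RDF3X's implementation; real engines use indices and cost-based optimizers rather than the linear scans below):

```python
def query(triples, patterns):
    """Toy selectivity-ordered nested joins over (s, p, o) triples.

    Strings starting with '?' are variables. Each join consumes the
    bindings found by the previous one, so later access paths only
    accept rows compatible with earlier results."""
    def matches(triple, pattern, binding):
        b = dict(binding)
        for t, q in zip(triple, pattern):
            q = b.get(q, q)          # substitute already-bound variables
            if q.startswith('?'):
                b[q] = t             # bind a free variable
            elif q != t:
                return None          # constant mismatch: no join
        return b

    # Crude selectivity estimate: run the most constrained pattern first.
    def cardinality(pattern):
        return sum(matches(t, pattern, {}) is not None for t in triples)

    solutions = [{}]
    for pattern in sorted(patterns, key=cardinality):
        solutions = [b2 for b in solutions for t in triples
                     if (b2 := matches(t, pattern, b)) is not None]
    return solutions

triples = [("alice", "knows", "bob"), ("bob", "knows", "carol"),
           ("carol", "type", "Admin")]
# Who knows an Admin? The 'type' pattern is more selective, so it runs
# first; the 'knows' join then only accepts rows where ?y is already bound.
result = query(triples, [("?x", "knows", "?y"), ("?y", "type", "Admin")])
```

In a real engine the inner loop would be an index probe keyed by the propagated bindings, which is exactly what lets it avoid reading most of the data.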
The 2D approach is not well suited to graph query because you have to read on all compute nodes in a row/column of the 2D compute topology to access the link set for a vertex. In contrast, that data is co-located in the forward and reverse indices of a graph database. This allows less inter-node communication for graph query access paths.
What if we had a hybrid system that maintained both the forward and reverse indices and the 2D layout so it could answer low latency graph queries and also provide efficient and scalable graph traversal? So far I have not seen any architectures that maintain both kinds of indices, but this could be interesting. There might be high data volume queries (queries where we intrinsically need to read a lot of data, such as rollups over the entire graph) where we could accelerate the query using the 2D partitions.
When should you scale-out a graph database? A graph database with a decent query optimizer should be able to handle upwards of 10-50 billion edges on a single machine and provide low latency query. The main enabling points are a good query optimizer and fast disk (SSD or PCIe flash) since graph query will result in random read IO patterns on the disk. The random IO pattern occurs because the index pages are not laid out in key order on the disk. Bigdata also provides a high 9s open source deployment with linear scaling in query throughput as a function of the size of the replication cluster.
Horizontally scaled graph databases have inter-node communication overhead and are slower for most low-latency queries. The main reason to scale-out a graph database is because you need throughput for data load (billions of edges per hour) and you need to run queries that do rollups over all that data. If you can partition your graph along lines that make sense for your business such that most queries run inside of a single partition, you can often get much higher performance from a pool of graph databases each servicing a different partition of the data. When necessary, you can use federated query to read across those partitions (bigdata builds in support for federated query).
We develop two complementary kinds of open source graph technologies that target fundamentally different kinds of graph problems. One project (MPGraph) provides graph traversal and graph analytics on GPUs. The other (bigdata) is a graph database that supports graph pattern matching using a high level query language. The GPU approach represents the best known technique for high performance graph traversal and graph analytics and outperforms main-memory CPU solutions on machines with up to 24 cores by 5x to 500x. The GPU approach currently requires a compile time schema for the property set and link set. Our road map for that technology includes adding topology compression (up to 1 billion edges on a single card), column-wise compression of schema flexible property and link sets, and 2D decomposition onto multiple GPUs for graphs with more than 1B edges.
The bigdata graph database uses multiple indices to avoid bias in the access paths, an efficient representation of property sets, link sets, and link attributes, and a high-level query language paired with a query optimizer for efficient low-latency high level query. There is also a vertex-centric API for the bigdata graph database. Throughput of the vertex-centric API against the graph database over SSD is less than 1M traversed edges per second and has the same in-memory limits on the problem size identified above (compare this to 3 billion traversed edges per second on a GPU and you can see why we have two different approaches to graphs!). We are working to improve graph traversal throughput on the graph database using column-wise indexing to reduce IO and the CPU overhead associated with materializing edges from the index, but there is still a performance gap of several orders of magnitude when comparing a disk/index based graph mining API and a graph mining API running on a GPU. This performance gap is intrinsic. It is the difference in the bandwidth of disk, main memory, and the memory bandwidth of the GPU, which is 10x greater than the memory bandwidth of the CPU.
Match the technology to the problem:
- Key-value stores are good at property and link set retrieval since they can perform the gathered reads efficiently and in parallel.
- Memory bandwidth is the bottleneck for graph traversal and graph analytic algorithms. MPGraph excels here since the DRAM on the GPU is 10x faster than the CPU RAM. Touching the disk is a huge penalty for graph traversal, and even IO efficient approaches are bandwidth limited by the sequential transfer rate of the disk rather than the bandwidth of main memory or GPU device memory.
- High performance for graph query requires good query optimizers and flowing the query across the cluster (for scale-out). Bigdata has two different query optimizers, one which emphasizes low latency query, where the overhead of the query optimizer itself can cause low throughput, and one which uses deep sampling of the query against the data to find the minimum cost query plan for long running, data intensive queries. Constraints are propagated from join to join by flowing the intermediate solutions to each join in turn. Bigdata reads less data from the disk to answer a query because variable bindings discovered in earlier joins restrict the access paths for later joins. Due to non-locality, graph query typically turns into random IOs against the disk, so you always want to deploy the graph database over SSD or PCIe flash for fast high level query. Don’t scale-out a graph database unless you need it. Single machine or replication cluster deployments handle large graphs (50B edges) and can deliver low-latency query. Scale-out deployments build in more coordination overhead and should be undertaken only after careful examination of your requirements. Graphs do not enjoy the same simple scaling model as key-value stores and map/reduce. You have to use the right technology for your problem.
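To make the “multiple full visitations” workload of graph analytics concrete, here is a toy PageRank loop (illustrative Python only; it ignores dangling-vertex mass and convergence testing, and the 3-cycle graph is invented for the example):

```python
def pagerank(adj, n, damping=0.85, iters=20):
    """Toy PageRank: every iteration streams the whole edge set, so the
    workload is bounded by memory bandwidth, not by compute."""
    rank = [1.0 / n] * n
    out_degree = [len(adj.get(v, ())) for v in range(n)]
    for _ in range(iters):
        incoming = [0.0] * n
        for u in range(n):                  # full visitation, every pass
            for v in adj.get(u, ()):
                incoming[v] += rank[u] / out_degree[u]
        rank = [(1 - damping) / n + damping * r for r in incoming]
    return rank

# A 3-cycle: by symmetry every vertex converges to rank 1/3.
ranks = pagerank({0: [1], 1: [2], 2: [0]}, 3)
```

The inner loops do almost no arithmetic per edge touched, which is why the sustained memory bandwidth of the device, not its core count, determines throughput.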
What about YarcData? The Cray XMT and XMT2 architecture marketed and sold by YarcData as the “Urika” appliance is worth discussing in some depth (Cray and YarcData are trademarks of Cray, Inc.). The basis of the XMT architecture is a large number of relatively slow “stream” processors with zero cost switching between hardware threads. The XMT architecture scales by keeping a lot of memory transactions in flight and then switching between those memory transactions when they arrive in the queue for a stream processor with zero latency. The concept behind the XMT is to not worry about where the data is stored, but just make sure that you can keep those slow stream processors busy by having enough parallel work and moving the data around over the fast interconnect. For those who want to lay down the cash on a YarcData appliance (rumored to be well north of $1M), this is a good and scalable solution. However, it is instructive to compare the YarcData solution with a GPU. MPGraph running on a GPU at 3 billion edges per second has much more throughput than the YarcData appliance. In fact, the challenge with GPUs is that they are so fast that it is difficult to keep them busy – this is the opposite of the YarcData problem. There are two ways to make this work out in favor of the GPU, and we are pursuing both of them. First, if we apply topology compression to the data, we can get nearly ten times the number of edges into a single GPU. That means instead of 100M edges at the speed of light, we will have nearly 1B edges at the speed of light on a single card. This is more than enough for most problems for under $7k in hardware (if you buy the expensive K20 cards rather than the gamer cards), or you can rent it on EC2 for ~$400/month. Second, if we keep all of the edges of the graph in DRAM, then we can put nearly 60 billion edges (with topology compression) into a GPU compute cluster on EC2.
Rather than sending patches of the graph to the GPU, we can keep it all in DRAM and just send vertex state and frontier updates over the PCIe bus and the cluster interconnect. To my mind, the ORNL Titan supercomputer (also built by Cray) with 18,000+ NVIDIA Kepler GPUs is the ultimate graph processing machine and uses commodity hardware and GPUs.
Conclusion: Always keep in mind the bottleneck, which is either disk (graph query) or RAM (graph traversal and graph analytics). While these graph problems may look similar from the outside, they have completely different computational workloads and scaling requirements and must be addressed using different kinds of technologies. Avoid the “one big graph” anti-pattern and deploy a mixture of technologies that address the different workloads and computations that you need for your application. You may want to put everything into a key-value store for fast gathered property set retrieval, but you can’t query or traverse it efficiently there. Fast query requires a graph database with a high performance query optimizer and query engine. Fast graph traversal and graph analytics require fast memory and efficient parallelism. Think about how to partition the workload and the data to provide fast and scalable solutions. For example, could you put a social network topology entirely onto a GPU, run your analytics there, and then do gathered reads against a key-value store or graph database? This could give you full depth graph analytics over your social network in a fraction of a second, and then you can materialize the data you need from that key-value store you already have.