Bryan and I were chatting on the train back from NYC about CPU memory bandwidth. Here’s a quick write-up of the discussion.
15 years ago database researchers recognized that CPU memory bandwidth was the limiting factor for relational database performance. This observation was made in the context of relatively wide tables in RDBMS platforms that were heavily oriented toward key-range scans on a primary key. The resulting architectures are similar to the structure-of-arrays (SoA) pattern used by the high performance computing community and within our Mapgraph platform.
Graphs are commonly modeled as 3-column tables. These tables intrinsically have a very narrow stride, similar to column stores for relational data. Whether organized for main memory or disk, the goal of the query planner is to generate an execution strategy that is the most efficient for a given query.
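The narrow-stride point can be made concrete with a minimal structure-of-arrays sketch of a 3-column triple table (the data and names here are illustrative, not any particular engine's layout): each column lives in its own contiguous array, so a scan over one column moves through memory with a unit stride rather than jumping across wide rows.

```python
# Structure-of-arrays (SoA) layout for a 3-column graph table:
# one contiguous array per column, integer-encoded term IDs.
subjects   = [0, 0, 1, 2]
predicates = [7, 8, 7, 7]
objects    = [1, 3, 2, 0]

def scan_predicate(pred):
    """Scan a single column: sequential, narrow-stride memory access.
    Returns the row indices whose predicate matches."""
    return [i for i, p in enumerate(predicates) if p == pred]

print(scan_predicate(7))  # -> [0, 2, 3]
```

In a row-oriented (array-of-structures) layout, the same scan would drag the subject and object fields through the cache as well, wasting the bandwidth the column never needed.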
Graph stores that are organized for disk maintain multiple indices with logarithmic access cost (e.g., a B+Tree) that allow them to jump directly to the page on disk holding the relevant tuples for any given access path. Because of the relative cost of memory and disk, main memory systems (such as SPARQLcity) sometimes choose a single index and resort to full table scans when the default index is not useful for a query or access pattern. In making such decisions, main memory databases are trading index selectivity for a smaller memory footprint. These designs can consume fewer resources, but it can be much more efficient to have a better index for a complex query plan. Thus, a disk-based database can often do as well as or better than a memory-based database if the disk-based system has a more appropriate index or family of indices. In fact, 90%+ of the performance of a database platform comes from the query optimizer. The difference in performance between a good query plan and a bad query plan for the same database and hardware can easily be 10x, 100x, or 10,000x depending on the query. A main memory system with a bad query plan can easily be beaten by a disk-based system with a good query plan. This is why we like to work closely with our customers. For example, one of our long-term customers recently upgraded from 1.2.x (under a long-term support contract) to 1.5.x and obtained a 100% performance improvement without changing a single line of their code.
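The index-versus-scan trade-off can be sketched in a few lines (a toy model, not any vendor's implementation): a store that keeps a second sort order of the triples can answer a predicate/object lookup with a logarithmic seek plus a short run, while a single-index store must fall back to a full scan for the same access path.

```python
import bisect

# Toy triple table: (subject, predicate, object) as integer IDs.
triples = [(0, 7, 1), (1, 7, 2), (0, 8, 3), (2, 7, 0)]

# A multi-index store materializes extra sort orders, e.g. (P, O, S),
# so a (predicate, object) access path can seek directly to its tuples.
pos = sorted((p, o, s) for (s, p, o) in triples)

def lookup_po(p, o):
    """Range scan on the POS index: O(log n) seek, then a short run."""
    lo = bisect.bisect_left(pos, (p, o))
    hi = bisect.bisect_left(pos, (p, o + 1))
    return [s for (_, _, s) in pos[lo:hi]]

def scan_po(p, o):
    """What a single-index store must do when its sort order doesn't help:
    read every tuple."""
    return [s for (s, pp, oo) in triples if pp == p and oo == o]

print(lookup_po(7, 2))  # -> [1]
```

Both functions return the same answer; the difference is how much of the table each one has to touch, which is exactly the selectivity that extra indices buy at the price of memory.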
Main memory systems become critical when queries must touch a substantial portion of the data. This is true for most graph algorithms that are not hop constrained. For example, an unconstrained breadth-first search on a scale-free graph will tend to visit all vertices in the graph during the traversal. A PageRank or connected components computation will tend to visit all vertices on each iteration and may require up to 50 iterations to converge PageRank to a satisfactory epsilon. In such cases, CPU memory architectures will spend most of the wall clock time blocked on memory fetches due to the inherently non-local access patterns of graph traversal. Architectures such as the XMT/XMT-2 (the Urika appliance) handle this problem by using very slow cores, zero-latency thread switching, a fast interconnect, and hash-partitioned memory allocations. The bet of the XMT architecture is that non-locality dominates, so you might as well spread all data out everywhere and hide the latency by keeping a large number of memory transactions in flight. We take a different approach with GPUs and achieve a 10x price/performance benefit over the XMT-2 and a 3x cost savings. This savings will increase substantially when the Pascal GPU is released in Q1 2016, due to an additional 4x gain in memory bandwidth driven by the breadth of the commodity market for GPUs. We obtain this dramatic price/performance and absolute performance advantage using zero-overhead context switching, fast memory, thousands of threads to keep a large number of memory transactions in flight, and attention to locality. The XMT-2 is a beautiful architecture, but locality *always* matters at every level of the memory hierarchy.
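To see why such computations are bandwidth-bound, here is a toy PageRank sketch (the graph and parameter values are made up for illustration): every iteration sweeps the full edge list and gathers rank from scattered source vertices, so at scale each gather is effectively a random memory fetch, and the loop repeats until the change falls below epsilon.

```python
# Toy PageRank: each iteration touches every vertex and every edge.
# The gather rank[s] jumps around memory with the graph's structure --
# the non-local access pattern that stalls CPUs on memory fetches.
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]  # tiny example graph
n, d, eps = 4, 0.85, 1e-6

out_deg = [0] * n
for s, _ in edges:
    out_deg[s] += 1

rank = [1.0 / n] * n
for it in range(100):                 # bounded iteration count
    nxt = [(1 - d) / n] * n
    for s, t in edges:                # full edge sweep, every iteration
        nxt[t] += d * rank[s] / out_deg[s]
    delta = sum(abs(a - b) for a, b in zip(nxt, rank))
    rank = nxt
    if delta < eps:                   # converged to a satisfactory epsilon
        break
```

Nothing in the inner loop is compute-heavy; the cost is moving `rank` and the edge list through the memory system over and over, which is why raw memory bandwidth, not FLOPS, sets the speed limit.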