Some people have been asking about scaling limits for bigdata. We have run out to 10B triples in a single semantic web database instance, and we are producer bound for most of that. By the time we are nearing 10B, the producers have been at maximum CPU utilization for hours while the database servers are at perhaps 35-40% utilization.
So, what is limiting us right now is the single machine capacity for the producer. As the number of index partitions grows over time, the producer needs to allocate the data to be written onto the index into more and more buffers (one per index partition). As we get into 100s of index partitions, the RDF/XML parsers are running full blast into those buffers without putting an appreciable load onto the database.
To get around the single machine limit for the producers, we are going to refactor the clients so that they have an aggregation stage, similar to, but somewhat different from, a reduce phase. That will allow us to run enough RDF/XML parser clients to feed the system and sustain high throughput well past 10B triples.
Since we can scale by adding hardware, even after a bigdata federation has been deployed, the practical scaling limit for bigdata is going to be at least another order of magnitude (100B).
Update: We have since resolved the bottleneck mentioned in the original post without the introduction of an aggregator phase. The problem was traced to some POS index queues in the clients which were being filled with small chunks due to a systematic presentation of specific predicates once per document. Those chunks are now automatically combined on insertion into the queue, which solved the problem — at least at this scale!
The gods have smiled upon us and given us a bit more time on this cluster.
It seems that Bryan has the RAM problem on the clients solved – they are no longer swapping. This has let us run out to a different problem – RAM demands on the data services. 🙂
Good progress though, last night’s run yielded 5 billion triples loaded in just under 10 hours for an average throughput of 135k triples per second. Max throughput was just above 210k triples per second. 1 billion triples was reached in an astonishing 78 minutes.
Configuration was 20 data services (10 blades with 2 data services each), 8 client services (4 blades with 2 each), and one blade for centralized services.
Note that these times are for simple RDF load – no closure. Previous tests have demonstrated that closure takes about as long as simple load – so double the time and halve the throughput to get our numbers with closure. Still quite impressive.
At this point, we have run out to 3B triples on a cluster with a net throughput of more than 100,000 triples per second for the cluster. The per-machine throughput is now ~ 10,000 triples per second. We have also addressed a high memory demand issue in the data services which was leading to premature RAM exhaustion.
We are currently looking into memory demand for the clients, which increases in proportion to the #of index partitions. I think that we will solve this by adding compressing to the RDF Values in the ID2TERM index, leading to fewering splits of that index and hence less RAM demand on the clients.
While the throughput is now reasonable at 10,000 triples-per-second/host, I am hopeful that we can improve on this substantially by introducing asynchronous writes for the TERM2ID index.