Some people have been asking about scaling limits for bigdata. We have run out to 10B triples in a single semantic web database instance, and we are producer bound for most of that run. By the time we near 10B, the producers have been at maximum CPU utilization for hours while the database servers sit at perhaps 35-40% utilization.
So what is limiting us right now is the single-machine capacity of the producer. As the number of index partitions grows over time, the producer must spread the data to be written onto the index across more and more buffers (one per index partition). Once we get into 100s of index partitions, the RDF/XML parsers are running full blast into those buffers without putting an appreciable load onto the database.
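The client-side buffering described above can be pictured roughly like this sketch. The class and method names here are mine for illustration, not the bigdata API; the point is just that each statement is routed to the buffer for its target index partition and the buffer is flushed to the remote data service when it fills:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a client-side writer that keeps one buffer per
// index partition and flushes a buffer when it reaches capacity.
class PartitionedWriter {
    private final int capacity;
    private final Map<Integer, List<String>> buffers = new HashMap<>();
    int flushes = 0; // number of remote writes issued so far

    PartitionedWriter(int capacity) {
        this.capacity = capacity;
    }

    // Route a statement to the buffer for the partition owning its key.
    void write(int partitionId, String statement) {
        List<String> buf =
                buffers.computeIfAbsent(partitionId, k -> new ArrayList<>());
        buf.add(statement);
        if (buf.size() >= capacity) {
            flush(buf);
        }
    }

    private void flush(List<String> buf) {
        // In the real system this would be a write to the data service
        // hosting the partition; here we just drain the buffer.
        buf.clear();
        flushes++;
    }
}
```

With hundreds of partitions, the parser fans its output across hundreds of such buffers, which is why the parser's CPU, not the database, becomes the limit.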
To get around the single-machine limit for the producers, we are going to refactor the clients to add an aggregation stage, similar to, but somewhat different from, a reduce phase. That will let us run enough RDF/XML parser clients to feed the system and sustain high throughput well past 10B triples.
Since we can scale by adding hardware, even after a bigdata federation has been deployed, the practical scaling limit for bigdata is going to be at least another order of magnitude (100B).
Update: We have since resolved the bottleneck described above without introducing an aggregation stage. The problem was traced to the POS index queues in the clients, which were being filled with small chunks because certain predicates appear systematically once per document. Those chunks are now combined automatically on insertion into the queue, which solved the problem, at least at this scale!
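The fix can be sketched as a queue that merges an incoming small chunk into the chunk already at its tail instead of enqueueing it separately, so downstream consumers see fewer, larger chunks. This is an illustrative sketch under that assumption; the names and the size threshold are mine, not the actual bigdata implementation:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: combine small chunks on insertion so that
// per-document trickles of statements do not flood the queue with
// tiny writes.
class CombiningQueue<T> {
    private final int targetChunkSize;
    private final Deque<List<T>> queue = new ArrayDeque<>();

    CombiningQueue(int targetChunkSize) {
        this.targetChunkSize = targetChunkSize;
    }

    void offer(List<T> chunk) {
        List<T> tail = queue.peekLast();
        if (tail != null && tail.size() + chunk.size() <= targetChunkSize) {
            tail.addAll(chunk); // merge the small chunk into the tail
        } else {
            queue.addLast(new ArrayList<>(chunk)); // start a new chunk
        }
    }

    List<T> poll() {
        return queue.pollFirst();
    }

    int depth() {
        return queue.size();
    }
}
```

Offering many one-statement chunks to such a queue yields a single combined chunk rather than one queue entry per statement, which is the behavior the fix needed.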