Message Queue Evaluation Notes – Second Life Wiki
Summary and Overview
One of the infrastructure tools that we’ve identified for the future internal architecture of Second Life is messaging. Message queuing systems allow systems that send messages to not have to worry about how they will be delivered, and allow consumers of messages to gather whichever ones interest them, at their own pace.
Ideally we’d have a completely scaleable system that clients could treat as singular black box. It would act as a well-known cluster to which senders or receivers of messages could connect, and be able to communicate asynchronously to or from anywhere else on the grid. Unfortunately it seems as though this dream, like so many others, is unattainable by any currently available software. We investigated around 15 open source systems that were explicitly designed for message queueing and found that none of them achieved this ideal.
Our use cases mostly involve very large numbers of queues; the smallest number we’re even considering is double the number of concurrent users. Our largest use case would be of the same order of magnitude as registered users. This means that we have to plan for millions, and probably tens or hundreds of millions, of message queues, since we want whatever system we choose to last us until then. Pretty much all of the message queue systems we investigated are intended to maximize message throughput rather than number of consumers. The clustering that they implement mainly serves as additional horsepower to deliver more message throughput. In particular the clustering we’ve seen replicates all state to every machine in the cluster, meaning that the cluster cannot add queues beyond the capacity of an individual node. In order to have a solution that scales in terms of number of queues, each node will have to be able to contain a subset of the global state — it actually seems as though such a message system would want to be coupled with a distributed storage system.
In any case, given that we expect that we’d have to develop our own queue scaling solution that involves partitioning, which may or may not be a task we are interested in taking on, the strongest candidates are RabbitMQ and Apache QPID. Both are mature products that support AMQP (though they support different versions). Both have strong vendor support, and both have good single-host performance numbers. There is some more investigation to be done — we’d like to know their true maximum capacity when clustered, to evaluate whether it’s worth clustering at all, given that we have to implement partitioning ourselves anyway.
We’re unfortunately pretty far from having closed the case on which technology to choose, or even if we can use any of these at all. Investigation will continue!