You may have read, or even noticed firsthand, that Foursquare experienced a serious outage this week. It turns out that, in order to maintain performance, they keep all of their check-in data in RAM, and they ran out of memory on one of the machines hosting a database shard.
Eliot Horowitz from 10gen posted a great behind-the-scenes writeup of the outage over on the mongodb-user Google Group:
As many of you are aware, Foursquare had a significant outage this
week. The outage was caused by capacity problems on one of the
machines hosting the MongoDB database used for check-ins. This is an
account of what happened, why it happened, how it can be prevented,
and how 10gen is working to improve MongoDB in light of this outage.
The post and the subsequent thread contain lots of nice technical detail, and they are worth reading and thinking about if you work with MongoDB in particular or distributed/NoSQL systems in general.