Read that, perhaps while watching/listening to Kevin Kelly’s riveting talk (from the TED EG conference) on the next 5,000 days of the web (that’s right, we’re only 5,000 days old) and the future of the Semantic Web. Kelly says we’re building not a series of small machines on the Internet, but one gigantic thinking machine, approaching the connectivity level of the human mind.
Last Tuesday, I was part of the Velocity 2009 keynote, where I gave a talk entitled “Fixing Twitter”. I covered the last year or so of work improving Twitter to handle massive traffic and user loads, and how we use metrics to destroy the fail whale.
I posted this in response to a post on GigaOM, but the comment ran so long that I felt it was worthy of a post on its own.
The workloads of social networking sites fall mostly into the ‘read lots, write once’ class (most of the web exists within this paradigm). Regardless of which company’s database software you run, the main idea in scaling this read-heavy workload is to remove the burden from the database and move it to distributed memory stores.
As an engineer, you want applications to pull from the same cache pool to reduce I/O pressure. To ensure that every machine isn’t replicating data in individual caches, you have to go distributed. That’s the win with memcached.
Putting a distributed cache between the application and the database increases performance and shares data across your application servers, something that the database cannot do on its own. The database has on-disk and in-memory caching, but eventually you’ll run out of memory on a single host if your working set exceeds the host’s memory.
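The read path described above is the classic cache-aside pattern. Here’s a minimal Python sketch of it; a plain dict stands in for the shared memcached pool (a real deployment would use a memcached client library), and `load_from_db` is a hypothetical stand-in for a SQL query:

```python
CACHE = {}  # pretend this dict is the shared memcached pool


def load_from_db(user_id):
    # Hypothetical stand-in for a SELECT against the database.
    return {"id": user_id, "name": "user-%d" % user_id}


def get_user(user_id):
    key = "user:%d" % user_id
    hit = CACHE.get(key)
    if hit is not None:
        return hit              # served from cache, no database I/O
    value = load_from_db(user_id)
    CACHE[key] = value          # populate so the next read is cheap
    return value
```

Because every application server checks the same pool before touching the database, a hot key costs one query no matter how many servers sit behind the load balancer.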
Memcached also covers up replication lag (MySQL is terrible at replication; Oracle, not so much) in large environments by putting data into the distributed cache (write-through caching) before the slave database has finished its write. Data is available to clients immediately, before replication has completed.
It will also provide large savings when you’re constantly executing that O(n × m) query to find out who is friends with whom on your social networking site.
This comes at a cost, though. Relational database features, like joins across large data sets and atomic operations, become very difficult to execute. Memcached becomes the central store, and there is always a fear that an important key will drop out of cache because of a random eviction.
It’s not without risk, either. Dependence on the cache can hurt you severely if lots of memcached servers fail (and they do fail), leaving you in a ‘cold cache’ situation where it can take hours to repopulate your working set back into the cache pool.
Don’t question MySQL’s performance here — relational databases are great, but they are not the only solution to storage problems. The two problems being solved here are highly orthogonal.
I’d also like to state that the majority of alternate key-value store databases listed in Richard Jones’ article and in Leonard Lin’s blog are really not ready for high production loads (with the possible exception of Tokyo Cabinet, HDFS, and Cassandra). There is still a ton of ‘secret sauce’ the large sites are keeping quiet about in order to make these into effective data stores.
Lin states this in his review as well: “Your comfort-level running in prod may vary, but for most sane people, I doubt you’d want to.”
I’m announcing the release of mod_memcache_block, a distributed IP blocking system for Apache, with rate limiting based on HTTP request code.
For many years I’ve had a need for a module like this — a distributed blocking system that can operate across large web-serving clusters and register hits in a central store. With rate limiting, incrementing counters on a single host is fairly useless when you have hundreds of servers behind a load balancer.
An attacker could hit many machines within the limit period before being detected, because there would be no central count. By keeping the counts in a memcache pool, all servers share the same data.
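The shared-counter idea can be sketched in a few lines of Python. A dict stands in for the memcache pool, and `incr` mimics memcached’s atomic increment, which is what makes the counts safe when hundreds of servers update them concurrently; the key, limit, and window values here are illustrative, not the module’s actual defaults:

```python
import time

CACHE = {}  # stands in for the shared memcached pool


def incr(key):
    # Sketch of memcached's atomic increment; real memcached
    # guarantees this is atomic across all clients.
    CACHE[key] = CACHE.get(key, 0) + 1
    return CACHE[key]


def over_limit(ip, limit=10, window=60, now=None):
    # Bucket hits into fixed time windows so that every web server
    # increments the same counter for a given source IP.
    now = time.time() if now is None else now
    bucket = int(now // window)
    key = "rl:%s:%d" % (ip, bucket)
    return incr(key) > limit
```

When the counter for the current window crosses the limit, every server in the cluster sees the same count and can reject the request, no matter which machine the attacker hit.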
It won’t defend against attacks coming from random proxy addresses (say, Tor), and might unfairly count hundreds of users who live behind a single proxy (like corporate NAT), but it offers some protection against attacks coming from a single source IP.
The software is released under the Apache 2.0 Open Source License.
From the docs:
mod_memcache_block is an Apache module that allows you to block access to your servers using a block list stored in memcache. It also offers distributed rate limiting based on HTTP response code.
Distributed white- and blacklisting of IPs, ranges, and CIDR blocks
Configurable timeouts and memcache server listings
Support for consistent hashing using libmemcached’s Ketama
Windowed rate limiting based on response code (to block brute-force dictionary attacks against .htpasswd, for example)
libmemcached-0.25 or later
Apache 2.x (tested with 2.2.11)
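To give a feel for how a module like this slots into an Apache config, here is a sketch. Note that the directive names and arguments below are hypothetical illustrations, not taken from mod_memcache_block’s actual documentation — consult the bundled docs for the real syntax:

```apache
# Hypothetical configuration sketch -- directive names are illustrative only.
LoadModule memcache_block_module modules/mod_memcache_block.so

<IfModule memcache_block_module>
    # Shared memcache pool consulted by every server in the cluster
    MemcacheBlockServers   mc1.example.com:11211 mc2.example.com:11211
    # Block an IP after 10 HTTP 401 responses within a 60-second window
    MemcacheBlockRateLimit 401 10 60
</IfModule>
```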
There’s a small interview with me in today’s O’Reilly Radar, where I talk about some of the things I’ll be presenting as part of my Velocity 2009 talk. You can listen to it, and read the transcript, here:
A bit of last-minute news, but I’ll be on a panel at SXSW Interactive: “Using GPS & Location to Enhance Social Networking”.
First there were social networks, and then there were location-based social networks, and now GPS and navigation-enhanced mobile social networks. This panel will explore how these emerging platforms integrate with existing social networks (Facebook, Twitter, etc.), leverage GPS navigation functionality, and take location-aware social networking to the next level.
Since people started using name-based virtual hosts with Apache HTTPD and other web servers, it has become very difficult to figure out which virtual hosts live on a single IP if all you have is the IP address.
Have a look at the Robtex Internet Swiss Army Knife. It solves this problem, and far more, including AS# lookups, BGP dereferencing, and DNS checks. There’s a Firefox search toolbar available for the site (very useful!) and RBL (blacklist) check tools right on the main page of the site.