[Openstack-operators] memcached redundancy

Clint Byrum clint at fewbar.com
Thu Aug 21 18:19:59 UTC 2014

Excerpts from Joe Topjian's message of 2014-08-14 09:09:59 -0700:
> Hello,
> I have an OpenStack cloud with two HA cloud controllers. Each controller
> runs the standard controller components: glance, keystone, nova minus
> compute and network, cinder, horizon, mysql, rabbitmq, and memcached.
> Everything except memcached is accessed through haproxy and everything is
> working great (well, rabbit can be finicky ... I might post about that if
> it continues).
> The problem I currently have is how to effectively work with memcached in
> this environment. Since all components are load balanced, they need access
> to the same memcached servers. That's solved by the ability to specify
> multiple memcached servers in the various openstack config files.
> But if I take a server down for maintenance, I notice a 2-3 second delay in
> all requests. I've confirmed it's memcached by editing the list of
> memcached servers in the config files and the delay goes away.

I've seen a few responses to this that show a _massive_ misunderstanding
of how memcached is intended to work.

Memcached should never need to be load balanced at the connection
level. Its clients spread keys across the server list with a consistent
hash ring, which handles both load balancing and failover. If you have
2 servers and 1 is gone, the failover should happen automatically. This
becomes important when you have, say, 5 memcached servers, as it means
that given 1 failed server, you retain the RAM of the other n-1 servers
(here 4 of 5) for caching.
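
To illustrate the failover idea, here is a minimal sketch of a consistent
hash ring in Python. It is a toy model, not the code of any particular
memcached client; the HashRing class, the server names, and the
points_per_server value are all made up for the example.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent hash ring: each server owns many points on a circle."""

    def __init__(self, servers, points_per_server=100):
        self._ring = []  # sorted list of (hash, server) points
        for server in servers:
            for i in range(points_per_server):
                self._ring.append((self._hash(f"{server}#{i}"), server))
        self._ring.sort()

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def server_for(self, key, dead=()):
        """Walk clockwise from the key's position to the first live server."""
        start = bisect.bisect_left(self._ring, (self._hash(key), ""))
        for offset in range(len(self._ring)):
            _, server = self._ring[(start + offset) % len(self._ring)]
            if server not in dead:
                return server
        return None  # every server is dead


# Hypothetical five-node pool; one node is then taken down.
servers = [f"cache{i}:11211" for i in range(1, 6)]
ring = HashRing(servers)
keys = [f"token-{i}" for i in range(1000)]

before = {k: ring.server_for(k) for k in keys}
after = {k: ring.server_for(k, dead={"cache3:11211"}) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Only the keys that lived on the failed server end up in moved; everything
cached on the other four servers stays where it was.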

What I suspect is happening is that we're not doing that right by
either not keeping persistent connections, or retrying dead servers
too aggressively.

In fact, it looks like the default client used by oslo-incubator's
'memorycache', the 'memcache' driver, will by default retry dead servers
every 30 seconds and wait 3 seconds for each attempt to time out, which
probably matches the behavior you see. None of the places I looked in Nova
seem to allow passing in a different dead_retry or timeout. In my
experience, you probably want something like dead_retry == 600, so there is
only one slow operation every 10 minutes per process (so if you have 10
nova-api processes running, that's 10 slow requests every 10 minutes).
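
The arithmetic can be checked with a small simulation. This is a
simplified model, not the driver's actual code: DeadRetryServer and
slow_requests are hypothetical names, socket_timeout is an assumed name
for the 3-second timeout, each process is assumed to send one request
per second, and the time spent blocked on the timeout is ignored.

```python
class DeadRetryServer:
    """Models one unreachable server as seen by one client process."""

    def __init__(self, dead_retry=30, socket_timeout=3):
        self.dead_retry = dead_retry
        self.socket_timeout = socket_timeout
        self.dead_until = 0  # the server is skipped before this time

    def request(self, now):
        """Return the seconds lost to the dead server for a request at `now`."""
        if now < self.dead_until:
            return 0  # still marked dead: skipped without waiting
        # Retry window expired: attempt a connection, time out, mark dead again.
        self.dead_until = now + self.dead_retry
        return self.socket_timeout


def slow_requests(dead_retry, duration=600, processes=1):
    """Count requests that hit the timeout over `duration` seconds."""
    total = 0
    for _ in range(processes):
        server = DeadRetryServer(dead_retry=dead_retry)
        total += sum(1 for t in range(duration) if server.request(t) > 0)
    return total


default_slow = slow_requests(dead_retry=30)               # one process
tuned_slow = slow_requests(dead_retry=600, processes=10)  # ten processes
```

Under these assumptions a single process with the default of 30 seconds
hits the timeout 20 times in 10 minutes, while ten processes with
dead_retry at 600 hit it 10 times in total.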

It is also possible that some of these client objects are being re-created
on every request, as is common if caching is implemented too deep inside
"middleware" and not at the edges of a solution. I haven't dug in deeply
enough to confirm that, but suffice it to say, replicating and load
balancing may be cheaper than auditing the code and fixing it at this point.
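
Why re-creation would hurt can be sketched with a stand-in client. This
is purely illustrative and I have not confirmed that any OpenStack code
does this; FakeClient and both handler functions are hypothetical.

```python
class FakeClient:
    """Stand-in for a memcached client that remembers a dead server."""

    connect_attempts = 0  # class-wide count of slow connection attempts

    def __init__(self):
        self.marked_dead = False  # per-object memory of the failure

    def get(self, key):
        if not self.marked_dead:
            # A real client would block here until the socket timeout.
            FakeClient.connect_attempts += 1
            self.marked_dead = True
        return None  # cache miss: the only server is down


def handle_request_fresh_client(key):
    # A new client per request forgets that the server is dead.
    return FakeClient().get(key)


_shared_client = FakeClient()


def handle_request_shared_client(key):
    # One long-lived client per process pays the timeout only once.
    return _shared_client.get(key)


for i in range(100):
    handle_request_fresh_client(f"key-{i}")
fresh_attempts = FakeClient.connect_attempts

FakeClient.connect_attempts = 0
for i in range(100):
    handle_request_shared_client(f"key-{i}")
shared_attempts = FakeClient.connect_attempts
```

With a fresh client per request, all 100 requests pay the timeout; with
the shared client only the first one does, since the dead marking lives
on the client object.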
