[Openstack-operators] memcached redundancy
morgan.fainberg at gmail.com
Fri Aug 22 19:25:24 UTC 2014
For Keystone, there will be a MongoDB backend in Juno that uses the dogpile-based key-value storage. The dogpile storage of tokens (available in Icehouse) only requires a simple backend that implements the basic interface (get, set, delete, get_multi, set_multi, delete_multi, etc.) and can communicate with whatever storage/cache system you want to use. Obviously it is optimized for caching (the library is named dogpile.cache), but it works well as a key-value-store implementation too.
We are also making significant strides towards supporting non-persistent tokens (when using PKI tokens), though some roadblocks remain before we can stop requiring a token persistence backend.
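For illustration only (this is not the actual Keystone or MongoDB driver code), a minimal dogpile.cache backend really just has to fill in that interface; something along these lines, assuming dogpile.cache is installed:

    from dogpile.cache.api import CacheBackend, NO_VALUE

    class DictBackend(CacheBackend):
        """Toy key-value backend; a real one would talk to MongoDB,
        memcached, Redis, or whatever store you want."""

        def __init__(self, arguments):
            self._store = {}

        def get(self, key):
            return self._store.get(key, NO_VALUE)

        def get_multi(self, keys):
            return [self._store.get(key, NO_VALUE) for key in keys]

        def set(self, key, value):
            self._store[key] = value

        def set_multi(self, mapping):
            self._store.update(mapping)

        def delete(self, key):
            self._store.pop(key, None)

        def delete_multi(self, keys):
            for key in keys:
                self._store.pop(key, None)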
With that said… Back on the original topic.
We are definitely using memcached incorrectly (as a persistent store), but at the time we needed to provide an alternative to alleviate the issues you are highlighting with storing tokens in SQL (there are ways to make the SQL backend better as well). This incorrect use of memcached is what drives the desire for connection *AND* storage-level redundancy.
With regard to the MemoryCache oslo-incubator library (and the basic oslo.cache library), a spec has been proposed to move to dogpile.cache and really focus on using caching backends (such as memcached) correctly across OpenStack. This opens the door to more control over how we work with memcached (or any other backend) that we use for caching. This change is tentatively targeted at Kilo and subsequent cycles.
Now, with all of that in mind, some of the issue comes from the basic python-memcache library and how it handles dead servers (socket timeouts, marking them dead, etc.), and probably from how we are setting those timeouts/limits.
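As a rough sketch of what "using caching backends correctly" looks like with dogpile.cache (the option names and values here are illustrative, not the proposed oslo.cache configuration):

    from dogpile.cache import make_region

    region = make_region().configure(
        'dogpile.cache.memcached',
        expiration_time=600,   # entries are a cache, not a system of record
        arguments={'url': ['192.168.0.1:11211', '192.168.0.2:11211']},
    )

    @region.cache_on_arguments()
    def get_project(project_id):
        # _load_project_from_sql is a hypothetical expensive lookup; the
        # decorator caches its result keyed on project_id and recomputes it
        # after expiration or eviction -- losing the cache only costs speed.
        return _load_project_from_sql(project_id)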
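For example, the python-memcache client defaults to retrying a dead server after about 30 seconds with a roughly 3-second socket timeout, which lines up with the delay described below. Something like this (the values are examples, not a tuned recommendation) keeps a dead server out of rotation longer and fails faster when it is retried:

    import memcache

    # Dead servers are retried every dead_retry seconds; until then the
    # client hashes around them. A shorter socket timeout bounds the stall
    # when a retry does hit a server that is still down.
    client = memcache.Client(
        ['192.168.0.1:11211', '192.168.0.2:11211'],
        dead_retry=600,     # wait 10 minutes before retrying a dead server
        socket_timeout=1,   # give up on a connection after 1 second
    )

    client.set('some-key', 'some-value', time=300)
    client.get('some-key')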
There is a lot of room for improvement in how we cache; just remember caching is one of the hardest things to do right. Doing caching wrong opens up the potential for a lot of bugs.
From: Joe Topjian <joe at topjian.net>
Reply: Joe Topjian <joe at topjian.net>
Date: August 22, 2014 at 12:03:56
To: Morgan Fainberg <morgan.fainberg at gmail.com>
Cc: openstack-operators <openstack-operators at lists.openstack.org>
Subject: Re: [Openstack-operators] memcached redundancy
> It sounds like there are two incorrect uses of memcached: the actual
> communication of the OpenStack components with memcached, and using memcached
> itself as a persistent token store. Though from what it sounds like, if the
> former were done better, the latter wouldn't be too much of an issue?
> I do agree that using something like memcached, which explicitly advertises
> itself as a bad solution for persistent storage, can ultimately be asking
> for trouble.
> With that said, though, it looks like there are currently two choices for a
> keystone token backend: memcached and SQL. Both have obvious downsides.
> Personally I'd rather deal with my current memcached issues than go back to
> storing tokens in SQL.
> ... unless I'm missing something? Is there more to the current state of
> Keystone token backends than the memcached and SQL backends that have been
> around for the past few years?
> On Fri, Aug 22, 2014 at 12:39 PM, Morgan Fainberg wrote:
> > While keystone uses memcache as a possible token storage backend, we are
> > working towards eliminating the design that makes memcache a desirable
> > token backend.
> > Using memcache for the token backend is not the best approach, as the token
> > backend (up through Icehouse, and in some cases this will hold true for Juno)
> > assumes stable storage for at least the life of the token.
> > I agree with Josh, we are likely using memcached incorrectly in a number
> > of cases.
> > --Morgan
> > On Thursday, August 21, 2014, Joshua Harlow wrote:
> >> +1 for this, remember the 'cache' in memcache *strongly* indicates what
> >> it should be used for.
> >> A useful link to read over @
> >> http://joped.com/2009/03/a-rant-about-proper-memcache-usage/
> >> -Josh
> >> On Aug 21, 2014, at 11:19 AM, Clint Byrum wrote:
> >> > Excerpts from Joe Topjian's message of 2014-08-14 09:09:59 -0700:
> >> >> Hello,
> >> >>
> >> >> I have an OpenStack cloud with two HA cloud controllers. Each controller
> >> >> runs the standard controller components: glance, keystone, nova minus
> >> >> compute and network, cinder, horizon, mysql, rabbitmq, and memcached.
> >> >>
> >> >> Everything except memcached is accessed through haproxy and everything is
> >> >> working great (well, rabbit can be finicky ... I might post about that if
> >> >> it continues).
> >> >>
> >> >> The problem I currently have is how to effectively work with memcached in
> >> >> this environment. Since all components are load balanced, they need access
> >> >> to the same memcached servers. That's solved by the ability to specify
> >> >> multiple memcached servers in the various openstack config files.
> >> >>
> >> >> But if I take a server down for maintenance, I notice a 2-3 second delay in
> >> >> all requests. I've confirmed it's memcached by editing the list of
> >> >> memcached servers in the config files and the delay goes away.
> >> >
> >> > I've seen a few responses to this that show a _massive_ misunderstanding
> >> > of how memcached is intended to work.
> >> >
> >> > Memcached should never need to be load balanced at the connection
> >> > level. It has a consistent hash ring based on the keys to handle
> >> > load balancing and failover. If you have 2 servers, and 1 is gone,
> >> > the failover should happen automatically. This gets important when you
> >> > have, say, 5 memcached servers as it means that given 1 failed server,
> >> > you retain n-1 RAM for caching.
> >> >
> >> > What I suspect is happening is that we're not doing that right by
> >> > either not keeping persistent connections, or retrying dead servers
> >> > too aggressively.
> >> >
> >> > In fact, it looks like the default one used in oslo-incubator's
> >> > 'memorycache', the 'memcache' driver, will by default retry dead servers
> >> > every 30 seconds, and wait 3 seconds for a timeout, which probably
> >> > matches the behavior you see. None of the places I looked in Nova seem
> >> > to allow passing in a different dead_retry or timeout. In my experience,
> >> > you probably want something like dead_retry == 600, so only one slow
> >> > operation every 10 minutes per process (so if you have 10 nova-api's
> >> > running, that's 10 requests every 10 minutes).
> >> >
> >> > It is also possible that some of these objects are being re-created on
> >> > every request, as is common if caching is implemented too deep inside
> >> > "middleware" and not at the edges of a solution. I haven't dug deep
> >> > enough in, but suffice to say, replicating and load balancing may be the
> >> > cheaper solution to auditing the code and fixing it at this point.
> >> >