[openstack-dev] [all][oslo] Dealing with database connection sharing issues
Mike Bayer
mbayer at redhat.com
Fri Feb 20 02:50:13 UTC 2015
Joshua Harlow <harlowja at outlook.com> wrote:
> Doug Hellmann wrote:
>> On Thu, Feb 19, 2015, at 01:09 PM, Ben Nemec wrote:
>>> Hi,
>>>
>>> Mike Bayer recently tracked down an issue with database errors in Cinder
>>> to a single database connection being shared over multiple processes.
>>> This is not something that should happen, and it turns out to cause
>>> intermittent failures in the Cinder volume service. Full details can be
>>> found in the bug here: https://bugs.launchpad.net/cinder/+bug/1417018
>>> and his mailing list thread here:
>>> http://lists.openstack.org/pipermail/openstack-dev/2015-February/057184.html
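For illustration, a minimal sketch of how this kind of sharing can arise (the connection URL is a placeholder and this is not Cinder's actual startup code):

    import os

    from sqlalchemy import create_engine, text

    # The engine (and its connection pool) is created in the parent process.
    engine = create_engine("mysql+pymysql://user:pass@dbhost/cinder")

    # Touching the database before forking opens a real DBAPI connection
    # and leaves it sitting in the pool when it is "closed".
    with engine.connect() as conn:
        conn.execute(text("SELECT 1"))

    pid = os.fork()

    # Parent and child now hold the *same* pooled TCP socket.  As soon as
    # both processes check it out and talk to the database, their traffic
    # interleaves on one stream and intermittent failures like the ones in
    # the bug above appear.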
>>>
>>> The question we're facing is what to do about it. There's quite a lot
>>> of discussion on https://review.openstack.org/#/c/156725 and in
>>> http://eavesdrop.openstack.org/irclogs/%23openstack-oslo/%23openstack-oslo.2015-02-18.log
>>> starting at 2015-02-18T21:38:12 but I'll try to summarize it here.
>>>
>>> On the plus side, we have a way to detect this sort of thing in oslo.db.
>>> That's what Mike's change 156725 is about. On the minus side,
>>> recovering from this in oslo.db is papering over a legitimate problem in
>>> the calling service, and a lot of the discussion has been around how to
>>> communicate that to the calling service. A few options that have been
>>> mentioned:
>>>
>>> 1) Leave the linked change as-is, with a warning logged that will
>>> hopefully be seen and prompt a fix in the service.
>>>
>>> The concern raised with this is that the warning log level is very
>>> operator-visible, and there's nothing an operator can do to fix
>>> this other than pester the developers. Also, since developers tend
>>> to ignore logs, it's unlikely they'll pick up on it themselves.
>>>
>>> Note that while the errors resulting from this situation are
>>> intermittent, the actual situation happens on every start up of
>>> cinder-volume, so these messages will always be logged as it stands
>>> today.
>>>
>>> 2) Change the log message to debug.
>>>
>>> This is the developer-focused log level, but as noted above developers
>>> tend to ignore logs and it will be very easy for the message to get lost
>>> in the debug noise. This option would likely require someone to go
>>> specifically looking for the error to find it.
>>>
>>> 3) Make the error a hard failure.
>>>
>>> Rather than hide the error by recovering, fail immediately when it's
>>> detected. This has the problem of making all the existing Cinder code
>>> in the wild (and any other services with the same problem) incompatible
>>> with any new release of oslo.db, but it's about the only way to make
>>> sure the error gets addressed, both now and in any future occurrence.
>>>
>>> 4) Leave the bug alone for now and just log a message so we can find out
>>> how widespread this problem actually is.
>>>
>>> At the moment we only know it exists in Cinder, but due to the way the
>>> service code works it's quite possible other projects have the same
>>> problem and don't know it yet.
>>>
>>> 5) Allow this sort of connection sharing to continue for a deprecation
>>> period with appropriate logging, then make it a hard failure.
>>>
>>> This would provide services time to find and fix any sharing problems
>>> they might have, but would delay the timeframe for a final fix.
>>>
>>> 6-ish) Fix oslo-incubator service.py to close all file descriptors after
>>> forking.
>>>
>>> This is a best practice anyway, so it's something we intend to pursue,
>>> but it's probably more of a long-term fix because it will take some work
>>> to implement and to make sure it doesn't break existing services. It
>>> also papers over the problem, and according to Mike it is basically a
>>> slower, messier alternative to his current proposed change, so it's
>>> better seen as a tangential change to avoid this in the future than as
>>> a solution.
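As a rough sketch of what "close all file descriptors after forking" amounts to (illustrative only, run in the child right after the fork; this is not the actual oslo-incubator service.py change):

    import os
    import resource

    def close_inherited_fds(keep=(0, 1, 2)):
        # Close every descriptor the freshly forked child inherited from
        # its parent, except stdin/stdout/stderr (or whatever is in `keep`),
        # so no database socket stays shared between processes.
        maxfd = resource.getrlimit(resource.RLIMIT_NOFILE)[1]
        if maxfd == resource.RLIM_INFINITY:
            maxfd = 4096
        for fd in range(maxfd):
            if fd in keep:
                continue
            try:
                os.close(fd)
            except OSError:
                pass  # descriptor wasn't open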
>>>
>>> If you've made it this far, thank you and please provide thoughts on the
>>> options presented above. :-)
>>
>> I'm not sure why 6 is "slower", can someone elaborate on that?
>
> Whether it's slower or not, I put up:
>
> https://review.openstack.org/#/c/157608
>
> It's still not fully functional (something is not quite right with it still...) but it will close any file descriptors that were potentially left open.
I think that should be in place as well. oslo.db’s check and this activity are not mutually exclusive.
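For reference, the kind of per-process guard being discussed looks roughly like the multiprocessing recipe in SQLAlchemy's documentation: record the pid that created each DBAPI connection and refuse to hand it out in a different pid. A simplified sketch (placeholder URL, not the actual 156725 change):

    import os

    from sqlalchemy import create_engine, event, exc

    engine = create_engine("mysql+pymysql://user:pass@dbhost/cinder")

    @event.listens_for(engine, "connect")
    def connect(dbapi_connection, connection_record):
        # Remember which process opened this DBAPI connection.
        connection_record.info['pid'] = os.getpid()

    @event.listens_for(engine, "checkout")
    def checkout(dbapi_connection, connection_record, connection_proxy):
        # If another process (i.e. a fork) tries to check the connection out
        # of the pool, invalidate it so the pool opens a fresh one instead of
        # silently sharing the parent's socket.
        pid = os.getpid()
        if connection_record.info['pid'] != pid:
            connection_record.connection = connection_proxy.connection = None
            raise exc.DisconnectionError(
                "Connection record belongs to pid %s, attempting to check "
                "out in pid %s" % (connection_record.info['pid'], pid))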
My concern is that we put mechanisms in place which make everything work out, but do it all in a slow and foolish way. The way OpenStack apps try to recover from everything tends to lead to this: things are configured wrong, nobody notices, and everything just runs terribly.

For example: we set HAProxy to time out connections in 90 seconds, even though the connections are pooled and are expected to live idle for as long as 60 minutes. The application logs hundreds of “connection closed” errors, all of which are recovered from by detecting them with our “SELECT 1” handler and reconnecting. The app runs like crap and nobody notices for months, because nobody reads the logs, or if they do, they assume the errors are something else.
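To make the HAProxy example concrete, the mismatch could also be fixed on the application side by recycling pooled connections before the proxy's idle timeout instead of leaning on the ping-and-reconnect handler. A sketch with hypothetical values and a placeholder URL:

    from sqlalchemy import create_engine

    # If the proxy in front of the database drops idle connections after 90
    # seconds, recycle pooled connections before that threshold rather than
    # relying on the "SELECT 1" handler to recover from the disconnects.
    engine = create_engine(
        "mysql+pymysql://user:pass@haproxy-vip/cinder",
        pool_recycle=60,   # seconds; below the 90-second proxy idle timeout
    )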