[openstack-dev] [all][oslo] Dealing with database connection sharing issues
Ben Nemec
openstack at nemebean.com
Thu Feb 19 18:09:55 UTC 2015
Hi,
Mike Bayer recently tracked down an issue with database errors in Cinder
to a single database connection being shared over multiple processes.
This is not something that should happen, and it turns out to cause
intermittent failures in the Cinder volume service. Full details can be
found in the bug here: https://bugs.launchpad.net/cinder/+bug/1417018
and his mailing list thread here:
http://lists.openstack.org/pipermail/openstack-dev/2015-February/057184.html
The question we're facing is what to do about it. There's quite a lot
of discussion on https://review.openstack.org/#/c/156725 and in
http://eavesdrop.openstack.org/irclogs/%23openstack-oslo/%23openstack-oslo.2015-02-18.log
starting at 2015-02-18T21:38:12 but I'll try to summarize it here.
On the plus side, we have a way to detect this sort of thing in oslo.db.
That's what Mike's change 156725 is about. On the minus side,
recovering from this in oslo.db is papering over a legitimate problem in
the calling service, and a lot of the discussion has been around how to
communicate that to the calling service. A few options that have been
mentioned:
1) Leave the linked change as-is, with a warning logged that will
hopefully be seen and prompt a fix in the service.
The concern raised with this is that the warning log level is a very
operator-visible thing and there's nothing an operator can do to fix
this other than pester the developers. Also, it seems developers tend
to ignore logs, so it's unlikely they'll pick up on it themselves.
Note that while the errors resulting from this situation are
intermittent, the actual situation happens on every startup of
cinder-volume, so these messages will always be logged as it stands today.
2) Change the log message to debug.
This is the developer-focused log level, but as noted above developers
tend to ignore logs and it will be very easy for the message to get lost
in the debug noise. This option would likely require someone to go
specifically looking for the error to find it.
3) Make the error a hard failure.
Rather than hide the error by recovering, fail immediately when it's
detected. This has the problem of making all the existing Cinder code
(and any other services with the same problem) in the wild incompatible
with any new releases of oslo.db, but it's about the only way to make
sure the error will be addressed now and in any future occurrences.
4) Leave the bug alone for now and just log a message so we can find out
how widespread this problem actually is.
At the moment we only know it exists in Cinder, but due to the way the
service code works it's quite possible other projects have the same
problem and don't know it yet.
5) Allow this sort of connection sharing to continue for a deprecation
period with appropriate logging, then make it a hard failure.
This would provide services time to find and fix any sharing problems
they might have, but would delay the timeframe for a final fix.
6-ish) Fix oslo-incubator service.py to close all file descriptors after
forking.
This is a best practice anyway so it's something we intend to pursue,
but it's probably more of a long-term fix because it will take some work
to implement and make sure it doesn't break existing services. It also
papers over the problem and, according to Mike, is basically a slower and
messier alternative to his current proposed change, so it's probably a
tangential change to avoid this in the future rather than a solution.
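For anyone who hasn't read the review, here is a rough sketch of the kind
of check the options above are arguing about. It is not the code in
156725; it just assumes the detection works by remembering which process
opened a connection and comparing that against the current process at
checkout. PidStampedConnection, ConnectionSharedAcrossProcesses, the
"policy" argument, and the current_pid override are all made-up names for
illustration. The "warn" policy corresponds roughly to option 1 (log and
recover) and "error" to option 3 (hard failure).

```python
import logging
import os

LOG = logging.getLogger(__name__)


class ConnectionSharedAcrossProcesses(RuntimeError):
    """Hypothetical error: a connection was used outside the process that opened it."""


class PidStampedConnection:
    """Hypothetical wrapper that remembers which process opened the connection."""

    def __init__(self, connect, policy="warn"):
        if policy not in ("warn", "error"):
            raise ValueError("policy must be 'warn' or 'error'")
        self._connect = connect          # callable returning a new raw connection
        self._policy = policy
        self.raw = self._connect()
        self.owner_pid = os.getpid()     # stamp the connection with its creator

    def checkout(self, current_pid=None):
        # current_pid exists only so the check can be exercised without forking.
        pid = os.getpid() if current_pid is None else current_pid
        if pid != self.owner_pid:
            msg = ("connection opened in pid %s is being used in pid %s"
                   % (self.owner_pid, pid))
            if self._policy == "error":
                # Option 3: refuse to continue so the caller has to be fixed.
                raise ConnectionSharedAcrossProcesses(msg)
            # Option 1: warn and recover. The inherited connection is dropped
            # without being closed, since the parent process still owns it.
            LOG.warning("%s; opening a new connection", msg)
            self.raw = self._connect()
            self.owner_pid = pid
        return self.raw
```

Option 5 would amount to shipping this with policy="warn" for a
deprecation period and then switching the default to "error".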
If you've made it this far, thank you and please provide thoughts on the
options presented above. :-)
-Ben