[Openstack-operators] [nova] Automatically disabling compute service on RBD EMFILE failures
mriedem at linux.vnet.ibm.com
Sat Jan 7 18:04:25 UTC 2017
A few weeks ago someone in the operators channel was talking about
ceph-backed nova-compute hitting OSErrors for too many open files
(EMFILE).
We have a bug report that sounds very similar:
During the periodic update_available_resource audit, the call to RBD to
get disk usage fails with the EMFILE OSError. Since this happens in a
periodic task, it doesn't cause any direct operations to fail. It does
cause scheduling problems, though: the host is effectively down, but
nothing marks the service as down (disabled).
I had proposed a solution in the bug report that we could automatically
disable the service for that host when this happens, and then
automatically enable the service again if/when the next periodic task
run is successful. Disabling the service would take that host out of
contention for scheduling and may also trigger an alarm for the operator
to investigate the failure (although if there are EMFILE errors from the
ceph cluster I'm guessing alarms should already be going off).
Anyway, I wanted to see how hacky of an idea this is. We already
automatically enable/disable the service from the libvirt driver, via
an event callback, when the connection to libvirt itself drops. This
would be similar, albeit less sophisticated: since it wouldn't use an
event listening mechanism, we'd have to maintain some local state in
memory to know whether we need to enable/disable the service again. It
also seems pretty hacky/one-offish to handle this just for the RBD
failure, but maybe we could handle it generically for any EMFILE error
when collecting disk usage in the resource audit?
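
To make the idea concrete, here is a minimal sketch in Python of the
disable-on-failure / re-enable-on-success logic. It is not nova code:
the class name AuditServiceToggle, the run_audit method, and the
set_service_disabled callback are all hypothetical stand-ins for
whatever the resource audit and service API actually look like. It
only illustrates the local in-memory state described above.

```python
import errno


class AuditServiceToggle:
    """Hypothetical helper that remembers whether we disabled the service."""

    def __init__(self, set_service_disabled):
        # set_service_disabled(disabled, reason) is an assumed callback
        # standing in for the real call that updates the service record.
        self._set_service_disabled = set_service_disabled
        # Local in-memory state: True only if *this* code disabled the service.
        self._auto_disabled = False

    def run_audit(self, get_disk_usage):
        """Run one periodic audit pass; return disk usage or None on EMFILE."""
        try:
            usage = get_disk_usage()
        except OSError as exc:
            if exc.errno != errno.EMFILE:
                raise  # only "too many open files" is handled here
            if not self._auto_disabled:
                self._set_service_disabled(True, "EMFILE during resource audit")
                self._auto_disabled = True
            return None
        # The audit succeeded: re-enable only if we were the ones who disabled.
        if self._auto_disabled:
            self._set_service_disabled(False, None)
            self._auto_disabled = False
        return usage
```

Tracking the flag locally means a service that an operator disabled by
hand would not be re-enabled by a later successful audit, which is one
reason to keep the state rather than blindly enabling on success.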