Open Stack

Sat Jan 7 18:04:25 UTC 2017

A few weeks ago someone in the operators channel was talking about
ceph-backed nova-compute hitting OSErrors for too many open files.

We have a bug reported that sounds very similar:

https://bugs.launchpad.net/nova/+bug/1651526

During the periodic update_available_resource audit, the call to RBD to
get disk usage fails with the EMFILE OSError. Since this is in a
periodic task it doesn't cause any direct operations to fail, but it
will cause issues with scheduling: that host is effectively down, yet
nothing sets the service to down (disabled).

I had proposed a solution in the bug report that we could automatically 
disable the service for that host when this happens, and then 
automatically enable the service again if/when the next periodic task 
run is successful. Disabling the service would take that host out of 
contention for scheduling and may also trigger an alarm for the operator 
to investigate the failure (although if there are EMFILE errors from the 
ceph cluster I'm guessing alarms should already be going off).

Anyway, I wanted to see how hacky of an idea this is. We already
automatically enable/disable the service from the libvirt driver, via an
event callback, when the connection to libvirt itself drops. This would
be similar, albeit less sophisticated since it's not using an event
listening mechanism; we'd have to maintain some local state in memory to
know if we need to enable/disable the service again. It also seems
pretty hacky/one-offish to handle this just for the RBD failure, but
maybe we could generically handle it for any EMFILE error when
collecting disk usage in the resource audit?
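
To make the idea a bit more concrete, here is a rough standalone sketch
of the disable/re-enable logic. It is not nova code: ServiceToggle,
ResourceAudit, run_once and the get_disk_usage callable are all
hypothetical names made up for illustration, and the real change would
have to hook into the resource tracker and the service API instead. It
only assumes the failure surfaces as an OSError with errno set to
EMFILE, as described in the bug.

```python
import errno


class ServiceToggle:
    """Hypothetical stand-in for whatever enables/disables the service."""

    def __init__(self):
        self.disabled = False

    def set_disabled(self, disabled):
        self.disabled = disabled


class ResourceAudit:
    """Hypothetical periodic audit that toggles the service on EMFILE."""

    def __init__(self, get_disk_usage, service):
        # get_disk_usage: any callable that may raise OSError (assumption).
        self._get_disk_usage = get_disk_usage
        self._service = service
        # Local in-memory state: True only if this audit disabled the
        # service, so an operator's manual disable is never undone here.
        self._auto_disabled = False

    def run_once(self):
        try:
            usage = self._get_disk_usage()
        except OSError as exc:
            if exc.errno != errno.EMFILE:
                raise  # only "too many open files" is handled here
            if not self._auto_disabled:
                self._service.set_disabled(True)
                self._auto_disabled = True
            return None
        # The audit succeeded: re-enable only if we disabled it earlier.
        if self._auto_disabled:
            self._service.set_disabled(False)
            self._auto_disabled = False
        return usage
```

The _auto_disabled flag is the "local state in memory" part: without
it, a successful periodic run could re-enable a service that an operator
had disabled on purpose.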

-- 

Thanks,

Matt Riedemann

[Openstack-operators] [nova] Automatically disabling compute service on RBD EMFILE failures