[Openstack-operators] A couple of recent bugs that hit us in regions with cells and moderate (to heavy) build/delete activity

Matt Van Winkle mvanwink at rackspace.com
Thu Feb 12 19:14:15 UTC 2015

Hey folks,
Apologies if any of this has been discussed on the list already.  I've tried to check everything ahead of time.

We recently had two bugs combine to hit us in some of our regions as we rolled out some new code.  The result was RabbitMQ servers refusing connections and/or crashing with OOM errors.  I wanted to pass them along because, as I know from the Large Deployments Team, more and more folks are using cells to manage larger regions.  Here are the specific bugs:

Cells doesn't properly track RabbitMQ connection pools:

Oslo messaging bug in version 1.5.1 that leaks channels:
Upstream bug: https://bugs.launchpad.net/oslo.messaging/+bug/1406629
Upstream fix: https://review.openstack.org/#/c/145232/9/oslo_messaging/_drivers/impl_rabbit.py

We are deploying patches for both in our problem areas now and to the rest of the fleet in the immediate future, but this gave us quite a run for our money last week.  I wanted to share in case anyone else is chasing these issues now, or might be after an upcoming code update.

