[Openstack-operators] Base images removed in upgrade essex -> folsom and other stories

Sam Morrison sorrison at gmail.com
Wed Nov 14 05:03:05 UTC 2012

We upgraded our cloud from Essex -> Folsom yesterday and had a major loss of data that I thought I'd share.

In Essex the flag remove_unused_base_images defaulted to False; in Folsom the default was changed to True. We hadn't set it explicitly, so we got whatever the default was.

After the upgrade, which went relatively smoothly (a lot easier than Diablo -> Essex), almost all our base images were deleted by the image cache cleanup.
I can't explain how this happened. We lost a total of about 70 images, which affected ~200 running instances.

We have since disabled this flag until we can find out what went wrong. I can see from the logs that, if the flag were enabled, it would still delete a lot of in-use base files.
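Since the problem came from relying on a default that changed between releases, it can help to check before an upgrade whether an option is set explicitly. The following is a minimal sketch, assuming an ini-style nova.conf with a [DEFAULT] section; the helper name explicitly_set and the sample config text are made up for illustration.

```python
import configparser


def explicitly_set(conf_text, option):
    """Return True if `option` appears in the [DEFAULT] section of conf_text."""
    # interpolation=None so '%' in values is not treated specially;
    # strict=False so a repeated option does not raise an error.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    return option in parser.defaults()


# Hypothetical config contents, for illustration only.
SAMPLE = """
[DEFAULT]
instances_path = /var/lib/nova/instances
remove_unused_base_images = False
"""

if __name__ == "__main__":
    print(explicitly_set(SAMPLE, "remove_unused_base_images"))
```

Pinning the value in the config file means a changed default in a later release does not silently change behaviour.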

We have an NFS-mounted /var/lib/nova/instances directory, which is where the _base dir is located, so I'm wondering if this had something to do with it.
Is the image cache cleanup meant to work in a shared instance storage environment?
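
To illustrate why shared storage could matter here, the sketch below models a cleanup that only knows about instances on the local host. This is a hypothetical illustration of the question, not Nova's actual cleanup code; the host names, image names, and the function unused_from_host_view are all invented.

```python
def unused_from_host_view(all_base_files, local_backing_files):
    """Base files that look unused when only local instances are considered."""
    return set(all_base_files) - set(local_backing_files)


# Hypothetical shared _base directory seen by every compute node.
base_dir = {"img-a", "img-b", "img-c"}

# Hypothetical backing files used by the instances on each host.
backing = {
    "host1": {"img-a"},
    "host2": {"img-b"},
}

# host1 only sees its own instances, so img-b looks unused to it
# even though an instance on host2 still depends on it.
host1_view = unused_from_host_view(base_dir, backing["host1"])

# Only files used by no host at all are actually safe to remove.
safe_view = base_dir - set().union(*backing.values())

if __name__ == "__main__":
    print(sorted(host1_view))
    print(sorted(safe_view))
```

If cleanup works from a per-host view like this, every host sharing the directory would consider the other hosts' base files removable.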

We also came across an issue where some compute nodes were reporting bogus resource stats, e.g.:

2012-11-13 05:04:38 INFO nova.compute.manager [-] Updating host status
2012-11-13 05:06:14 AUDIT nova.compute.resource_tracker [-] Free ram (MB): -739665
2012-11-13 05:06:14 AUDIT nova.compute.resource_tracker [-] Free disk (GB): 12654
2012-11-13 05:06:14 AUDIT nova.compute.resource_tracker [-] Free VCPUS: -188
2012-11-13 05:06:14 INFO nova.compute.resource_tracker [-] Compute_service record updated for np-rcc6

This turned out to be addressed by the following bug; the DB filter does a regex match.

So a compute node named np-rcc5 would also pull in np-rcc50, np-rcc51, and so on.
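
The effect is easy to reproduce outside Nova. Here is a minimal sketch comparing an unanchored regex match with an exact match; the host list and the two filter functions are illustrative only and are not taken from the Nova code.

```python
import re

# Illustrative host names following the naming pattern in the post.
hosts = ["np-rcc5", "np-rcc50", "np-rcc51", "np-rcc6"]


def regex_filter(pattern, names):
    """Keep every name in which the pattern matches anywhere."""
    return [n for n in names if re.search(pattern, n)]


def exact_filter(wanted, names):
    """Keep only names equal to the wanted host name."""
    return [n for n in names if n == wanted]


if __name__ == "__main__":
    print(regex_filter("np-rcc5", hosts))  # ['np-rcc5', 'np-rcc50', 'np-rcc51']
    print(exact_filter("np-rcc5", hosts))  # ['np-rcc5']
```

With the regex variant, the instances of every matching host would be counted against the one node, which is consistent with the negative free RAM and VCPU figures in the log above.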

All in all, apart from our huge data loss, the upgrade went pretty well.

The main issues we have now are usability issues with the dashboard:
- Pagination doesn't work.
- The green notification boxes that appear at the top right get in the way of the links behind them.
- The new containers view is confusing, and you can no longer see how much data is in a specific container like you used to.
- The launch instance box sometimes gets its bottom cut off, making it useless.
- The launch instance box has the same problem if you have lots of security groups.

I should also add that we have moved to using nova cells. This went pretty smoothly, and we're eagerly waiting for the cells code to hit trunk so we can contribute our enhancements to cells.

