[Openstack-operators] /var/lib/nova/instances fs filled up corrupting my Linux instances

Joe Topjian joe.topjian at cybera.ca
Thu Mar 14 14:23:39 UTC 2013


On Wed, Mar 13, 2013 at 5:29 PM, Michael Still <mikal at stillhq.com> wrote:

> On Wed, Mar 13, 2013 at 5:23 PM, Joe Topjian <joe.topjian at cybera.ca>
> wrote:
> > On Wed, Mar 13, 2013 at 5:12 PM, Michael Still <mikal at stillhq.com>
> > wrote:
> >> On Wed, Mar 13, 2013 at 4:42 PM, Joe Topjian <joe.topjian at cybera.ca>
> >> wrote:
> >> > It would, yes, but I think your caveat trumps that idea. Having x
> >> > nodes be able to work with a shared _base directory is great for
> >> > saving space and centrally using images. As an example, one of my
> >> > OpenStack clouds' _base directories is 650 GB in size. It's
> >> > currently shared via NFS. If it were not shared, or used a
> >> > _base_$host scheme, that would be 650 GB per compute node. 10 nodes
> >> > and you're already at 6.5 TB.
> >>
> >> Is that _base directory so large because it's never been cleaned up,
> >> though? What sort of maintenance are you performing on it?
> >
> > It's true that I haven't done any maintenance on _base. From my
> > estimations, a cleanup wouldn't reclaim enough space to warrant
> > actually doing one (basically, "benefit of disk space reclaimed" is
> > not yet greater than "risk of accidentally corrupting x users'
> > instances").
>
> What release of OpenStack are you running? I think you might get
> significant benefits from turning cleanup on, so long as you're using
> Grizzly [1]. I'd be very, very interested in the results of a lab test.
>

I am using Folsom and do plan on testing Grizzly when it's released.
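
For anyone following along, "turning cleanup on" amounts to setting the
image cache manager options in nova.conf. A minimal sketch, based on my
reading of the Folsom/Grizzly options (names and values here are from
memory, so verify against your release before relying on them):

[DEFAULT]
# How often the image cache manager periodic task runs.
image_cache_manager_interval = 2400
# Allow removal of base files that no instance appears to be using.
remove_unused_base_images = True
# Keep an unused original base file around for at least this long
# (in seconds) before removing it.
remove_unused_original_minimum_age_seconds = 86400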


>
> Michael
>
> 1: yes, I know it's not released yet, but if you found a bug now we
> could fix it before it hurts everyone else...
>

The scenario that I ran into had these conditions (a rough sketch of the
race in code follows the list):

1. Using shared storage
2. All instances (one or more) of a certain image or snapshot are running
on one compute node
3. That compute node is taken down for 10 minutes or so (reboot,
maintenance, etc.)
4. That compute node is unable to mark its _base files as being in use
since it's offline
5. Other compute nodes see that those _base files are not in use and
delete them
6. The compute node comes back online and the image/snapshot in question
is now broken
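
To make the race concrete, here's a minimal sketch in Python. The names
(mark_in_use, cleanup, MAX_UNUSED_AGE) are mine, not nova's, and the
mtime-based liveness check is a simplification of what the image cache
manager actually does:

import os
import time

MAX_UNUSED_AGE = 24 * 3600  # assumed threshold before a base file is "unused"

def mark_in_use(base_file):
    # Each running compute node periodically refreshes the base files
    # its instances are backed by. An offline node can't do this.
    os.utime(base_file, None)

def cleanup(base_dir):
    # Every node sharing base_dir removes files that look unused. A base
    # file backing instances only on a downed node looks unused to all
    # of its peers, so one of them deletes it.
    now = time.time()
    for name in os.listdir(base_dir):
        path = os.path.join(base_dir, name)
        if now - os.path.getmtime(path) > MAX_UNUSED_AGE:
            os.unlink(path)  # the downed node's instances lose their backing file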

Has that scenario been accounted for or fixed?

The probability of this happening is low, but it actually happened to me
twice before I figured out what was going on. I had two users who created
snapshots and then launched a single instance from each snapshot. One
compute node went offline due to a hardware failure, and the _base image
of that user's snapshot was removed while the node was down.

The second compute node was fat-finger-rebooted while someone was looking
into the first node's failure, and the same thing happened to another
snapshot's _base file.

Fortunately, only two instances were lost. Still, I don't like having to
email users saying "oops - your instance is gone" when it's something I
could have prevented.

If the compute nodes had been shut down proactively, I could have
live-migrated everything off of them first, but since these were hardware
failures, I had no time to react.
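
For planned downtime the migration itself is cheap -- something like the
following per instance, using the nova CLI of this era (check the exact
arguments on your release):

    nova live-migration <instance-uuid> <target-host>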

Joe

-- 
Joe Topjian
Systems Administrator
Cybera Inc.

www.cybera.ca

Cybera is a not-for-profit organization that works to spur and support
innovation, for the economic benefit of Alberta, through the use
of cyberinfrastructure.