[Openstack-operators] /var/lib/nova/instances fs filled up corrupting my Linux instances

Joe Topjian joe.topjian at cybera.ca
Thu Mar 14 18:20:10 UTC 2013


On Thu, Mar 14, 2013 at 10:00 AM, Michael Still <mikal at stillhq.com> wrote:

> On Thu, Mar 14, 2013 at 10:51 AM, Joe Topjian <joe.topjian at cybera.ca>
> wrote:
> > https://bugs.launchpad.net/nova/+bug/1126375
> >
> > Admittedly, the bug report does not explain the scenario in detail, but I
> > noted "No matter how many precautions are taken, some scenarios will
> > still slip by," which I still firmly believe. My intention was to push
> > for the cleanup to be turned off by default before discussing the
> > possible ways it wouldn't work as expected. I felt that if I simply
> > described the scenario, that single scenario would be accounted for, but
> > no thought would go into any other ways it could happen (I feel this is
> > what happened with the NeCTAR incident).
> >
> > I fully admit to being difficult with this, but it's something I believe
> > strongly in. I have never run into another service or package that has a
> > task enabled by default which deletes (rather than archives or recycles)
> > data. I am all for these types of cleanup tasks, but feel they must be
> > opt-in.
>
> I vetoed that review and I'd do it again. Nothing you have said has
> convinced me that the cleaner should be turned off by default. What
> went wrong with Nectar is that they deployed code without testing it
> in their environment. We've already discussed my feelings about that.
>

I could have tested for weeks in a lab and never come across the scenario
that I described where I lost images.

Operators do test things, but unfortunately production environments tend
to bring out the most random events you'd never think of.

It's not always possible to create an exact lab replica of a production
environment. If I'm running a 10,000-node cloud, do I need a 10,000-node
lab to be sure of my tests? What about a 1,000-node cloud, or 500 nodes, or
5? How can I be sure that my tests are one-to-one transferable to my
production environment? Similarly, how can I simulate production workloads
in a lab? How can I simulate end-user decisions that sometimes seem totally
off the wall or random?


>
> Frankly, I think it's much worse to disable it and have compute nodes
> fill their disks, than to have automated cleanup. The whole point of
> cloud infrastructure is to manage machines so you don't have to.
> Performing a manual cleanup on 10,000 compute nodes is not something
> we should force operators to do.
>

I think it should be the operator who ultimately makes the choice on how
they manage their disks. I think it's great that a cleanup process exists,
but I strongly feel it should be opt-in. No offence, but you don't
understand my environment. Why should you make the decision on what should
be running by default?

If it is off by default, then the operator has not lost anything. If it's
on by default, there is the possibility that the operator will lose
something that they did not want lost.

I agree that if it is off by default, the operator does risk running out
of disk space. However, that problem and its resolution are much more
familiar to an operator than a missing _base image and the resulting broken
instances.
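
For reference, the knob in question is the image cache manager's periodic
cleanup. A minimal sketch of the relevant nova.conf settings on a compute
node (assuming the Grizzly-era option names and the DEFAULT group; verify
against your release's nova.conf.sample, as names and groups have moved
between releases):

    [DEFAULT]
    # Assumed option names; check your release's nova.conf.sample.
    # Turn off automatic removal of unused _base images entirely.
    remove_unused_base_images = false
    # If the cleanup is left enabled, these ages (in seconds) control how
    # long an unused base image must sit idle before it becomes eligible
    # for removal.
    remove_unused_original_minimum_age_seconds = 86400
    remove_unused_resized_minimum_age_seconds = 3600

The opt-in versus opt-out question is exactly about which value of that
first option ships as the default.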



> A simple example of something which cleans caches would be Squid. I'm
> sure I can find other examples trivially. We can't ship software with
> an unbounded disk cache -- it's guaranteed to hurt users.
>

If Squid behaved the same way as the _base cleanup, then entire websites
(or whatever else was being cached) would break when the cache was pruned.

I tend to think of this cleanup process as more of a mailbox issue. It's
the difference between getting a call from a user saying "my mailbox is
full" versus "20 important emails just disappeared".

The potential end-user visibility and downtime caused by the cleanup
process are very high.


>
> If you find bugs, report the actual bug. The devs aren't clairvoyant,
> and we can only fix things we're told about. I've now spent a year
> talking to ops people and trying to get them to help us help them
> (tagging bugs with the ops tag for example). I've spent the majority
> of my development time trying to make things easier for ops folks. I
> am very offended that you think we're deliberately trying to make your
> life harder, and to be honest it makes me wonder why I bother spending
> time trying to help.
>

No one is clairvoyant - not devs, operators, or end users. It's for that
reason that potentially dangerous processes need to be handled with
caution. I think it's a good indication that a certain feature or process
needs to be handled cautiously when you're unable to foresee what kinds of
damage it can do. That's all I'm trying to say.

I am not saying you're deliberately making life harder. I understand that
this discussion of a single topic is getting more attention than the many
other great contributions you've made to OpenStack, and that's unfortunate.



>
> Michael
>



-- 
Joe Topjian
Systems Administrator
Cybera Inc.

www.cybera.ca

Cybera is a not-for-profit organization that works to spur and support
innovation, for the economic benefit of Alberta, through the use
of cyberinfrastructure.

