[openstack-dev] [TripleO] os-refresh-config run frequency
Sullivan, Jon Paul
JonPaul.Sullivan at hp.com
Mon Jul 21 02:51:51 UTC 2014
> -----Original Message-----
> From: Clint Byrum [mailto:clint at fewbar.com]
> Sent: 21 July 2014 01:58
> To: openstack-dev
> Subject: Re: [openstack-dev] [TripleO] os-refresh-config run frequency
>
> Excerpts from Dan Prince's message of 2014-07-20 11:51:27 -0700:
> > On Thu, 2014-07-17 at 15:54 +0100, Michael Kerrin wrote:
> > > On Thursday 26 June 2014 12:20:30 Clint Byrum wrote:
> > >
> > > > Excerpts from Macdonald-Wallace, Matthew's message of 2014-06-26 04:13:31 -0700:
> > > > > Hi all,
> > > > >
> > > > > I've been working more and more with TripleO recently and whilst
> > > > > it does seem to solve a number of problems well, I have found a
> > > > > couple of idiosyncrasies that I feel would be easy to address.
> > > > >
> > > > > My primary concern lies in the fact that os-refresh-config does
> > > > > not run on every boot/reboot of a system. Surely a reboot *is* a
> > > > > configuration change and therefore we should ensure that the box
> > > > > has come up in the expected state with the correct config?
> > > > >
> > > > > This is easily fixed through the addition of an "@reboot" entry
> > > > > in /etc/crontab to run o-r-c or (less easily) by re-designing
> > > > > o-r-c to run as a service.
> > > > >
> > > > > My secondary concern is that through not running
> > > > > os-refresh-config on a regular basis by default (i.e. every 15
> > > > > minutes or something in the same style as chef/cfengine/puppet),
> > > > > we leave ourselves exposed to someone trying to make a "quick
> > > > > fix" to a production node and taking that node offline the next
> > > > > time it reboots because the config was still left as broken
> > > > > owing to a lack of updates to HEAT (I'm thinking a "quick change"
> > > > > to allow root access via SSH during a major incident that is
> > > > > then left unchanged for months because no-one updated HEAT).
> > > > >
> > > > > There are a number of options to fix this including Modifying
> > > > > os-collect-config to auto-run os-refresh-config on a regular
> > > > > basis or setting os-refresh-config to be its own service running
> > > > > via upstart or similar that triggers every 15 minutes
> > > > >
> > > > > I'm sure there are other solutions to these problems, however I
> > > > > know from experience that claiming this is solved through
> > > > > "education of users" or (more severely!) via HR is not a sensible
> > > > > approach to take as by the time you realise that your
> > > > > configuration has been changed for the last 24 hours it's often
> > > > > too late!
> > >
> > > > So I see two problems highlighted above.
> > > >
> > > > 1) We don't re-assert ephemeral state set by o-r-c scripts. You're
> > > > right, and we've been talking about it for a while. The right thing
> > > > to do is have os-collect-config re-run its command on boot. I don't
> > > > think a cron job is the right way to go, we should just have a file
> > > > in /var/run that is placed there only on a successful run of the
> > > > command. If that file does not exist, then we run the command.
> > > >
> > > > I've just opened this bug in response:
> > > >
> > > > https://bugs.launchpad.net/os-collect-config/+bug/1334804
> > > >
> > >
> > > I have been looking into bug #1334804 and I have a review up to
> > > resolve it. I want to highlight something.
> > >
> > > Currently on a reboot we start all services via upstart (on debian
> > > anyways) and there have been quite a lot of issues around this -
> > > missing upstart scripts and timing issues. I don't know the issues
> > > on fedora.
> > >
> > > So with a fix to #1334804, on a reboot upstart will start all the
> > > services first (with potentially out-of-date configuration), then
> > > o-c-c will start o-r-c and will now configure all services and
> > > restart them or start them if upstart isn't configured properly.
> > >
> > > I would like to turn off all boot scripts for services we configure
> > > and leave all this to o-r-c. I think this will simplify things and
> > > put us in control of starting services. I believe that it will also
> > > narrow the gap between fedora and debian or debian and debian so
> > > what works on one should work on the other and make it easier for
> > > developers.
> >
> > I'm not sold on this approach. At the very least I think we want to
> > make this optional because not all deployments may want to have o-r-c
> > be the central service starting agent. So I'm opposed to this being
> > our (only!) default...
> >
>
> I felt this way too. However, I'm open to it because I am worried that
> it is a bit idealistic without much justification for being so.
>
> We know o-r-c will be there, and really must be there. We're already
> saying it needs to run to assert ephemeral state, and one thing
> ephemeral is "things started".
>
> Now, we can, and maybe even should, take a hard line long term that
> o-r-c does not do this. That it stores everything in system level configs
> that are started in the normal system boot. I _want_ this to be the
> case. But thus far, we've failed to assert that and things have
> occasionally been very broken on reboot. Short of forcing a reboot in
> every CI run, we're going to have trouble detecting this.
>
> So, I think we have two options:
>
> 1) O-r-c doing the asserting, with which we can more or less predict
> that subsequent boots will work in the same manner as the first boot.
>
> 2) Reboot in CI.
>
> I would vote for 2, as it probably won't add much time and will test
> system start up.
I like the start of this: "We know o-r-c will be there, and really must be there."
To me, this means that we can trust o-r-c to exist and to start our services, so the only service we have to worry about starting on reboot is o-r-c itself.
This is the best solution, imho, because when a node is booting we have no idea how long it has been down. A reboot is just the simplest case of "boot after previous configuration", where the window in which the node has been down is minimal.
Those of us running a public cloud know that nodes fail, and can often be down for longer periods. This introduces a window in which configuration can change. On a node in maintenance, that changed configuration will not be written to the ephemeral state, so a reboot would start misconfigured services.
By asserting that o-r-c will always start the services, you are also asserting that the service configurations are always up-to-date upon starting, and that seems like a win to me.
In short, I think there are three scenarios that must be catered for: first boot, reboot and delayed boot. I think that delegating all config creation and service starting to o-r-c makes all three scenarios predictable and repeatable.
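For illustration, the /var/run marker-file check that Clint describes further up the thread could look roughly like the sketch below. This is only an illustration and not the os-collect-config implementation: the function name run_once_per_boot and the marker path are made up, and it assumes the marker directory is cleared on every boot, so a missing marker means "not yet run successfully since boot".

```python
import os
import subprocess


def run_once_per_boot(command, marker_path):
    """Run `command` unless `marker_path` already exists.

    Hypothetical helper, not part of os-collect-config. Returns True if
    the command ran and succeeded, False if it was skipped. If the
    command fails, subprocess.CalledProcessError propagates and no
    marker is written, so the next invocation tries again.
    """
    if os.path.exists(marker_path):
        # A successful run has already happened since the marker
        # directory was last cleared (assumed: at boot).
        return False

    # check_call raises CalledProcessError on a non-zero exit status.
    subprocess.check_call(command)

    # Only reached after success: record it.
    with open(marker_path, "w") as marker:
        marker.write("ok\n")
    return True
```

Under that assumption the marker is absent on first boot, reboot and delayed boot alike, so the command runs in all three cases, and a failed run leaves no marker behind.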
>
> > The job of o-r-c in this regard is to assert state... which to me
> > means making sure that a service is configured correctly (config
> > files, set to start on boot, and initially started). Requiring o-r-c
> > to be the service starting agent (always) is beyond the scope of the
> > o-r-c tool.
"Set to start on boot" is a fallacy in clustered operations, where you can lose a node for a long outage window :(
> >
> > If people want to use it in that mode I think having an *option* to do
> > this is fine. I don't think it should be required though. Furthermore
> > I don't think we should get into the habit of writing our elements in
> > such a manner that things no longer start on boot without o-r-c in the
> > mix.
> >
>
> I don't think we need an option. Options are for real incompatible
> differences of operation, like "I want to run in ultra-secure mode and
> that breaks stuff that I don't care about so I turn those things off"
> or "I want to use packages because my business and support model is
> built around it." Those are real, legitimate differences which we _do_
> need options for.
>
> We need to clearly state a design principle, and we need to ensure that
> our CI tests the mechanism by which we do these things.
>
> > I do think we can solve these problems. But taking a hardwired
> > prescriptive approach is not good here...
> >
>
> It's just one option, and not the best one. I am quite confident that
> you all will figure out how to test reboots and do that. And then we'll
> all feel better about trusting the system to start services on a cold
> boot.
I don't think we can ever trust the system to start up correctly on a cold boot. By settling on a single startup mechanism we give a clear message about what is expected of elements, and we get the most compatible set of elements, ones that work consistently together across first boot, reboot, and delayed boot.
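To make the "single startup mechanism" a little more concrete, here is a minimal sketch of one possible ordering: fetch the current metadata, write every config, and only then start services. All the names (assert_state and the three callables passed in) are hypothetical stand-ins chosen for the example, not o-r-c APIs.

```python
def assert_state(fetch_metadata, write_config, start_service, services):
    """Write fresh config for every service, then start them in order.

    All four parameters are hypothetical stand-ins, injected so the
    ordering can be shown without touching a real system.
    """
    # Always fetch current metadata: the node may have been down for a
    # long time, so whatever is already on disk cannot be trusted.
    metadata = fetch_metadata()

    # Phase 1: bring every config up to date before anything starts.
    for name in services:
        write_config(name, metadata)

    # Phase 2: start services only once all configs are current.
    for name in services:
        start_service(name)

    return list(services)
```

Because the same path runs on first boot, reboot and delayed boot, no service in this sketch ever starts against configuration older than the metadata fetched at boot time.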
Thanks,
Jon-Paul Sullivan ☺ Cloud Services - @hpcloud