[openstack-dev] [All] Maintenance mode in OpenStack during patching/upgrades

Christopher Aedo doc at aedo.net
Tue Oct 21 02:44:35 UTC 2014


I'm glad to see there's more than one interested person here too :)

Regarding the Xen-specific host maintenance mode, if it gets dropped I
would not complain since it's useful only to those running Xen at the
moment.  My bigger concern is the set of issues around when it works
and when it doesn't, since similar limitations exist in the migrate
code today.  Those aren't Xen-specific, but they seem to account for
only a few deployment scenarios (and don't seem to work if you're
using ceph-backed storage, for instance).

As Joe pointed out, there's definitely a need for maintenance mode.
Having a reliable method to pull a compute node out of a cluster would
be incredibly valuable.  This will certainly be a required component
of any full-environment upgrade path.

The scenario Joe outlined is the only working approach I'm aware of
right now, but I'm not a fan of disabling the compute service.  For
one thing, hopefully it will raise an alarm with your monitoring
system.  It also has the potential of interfering with other
operations that are ongoing (and with nova compute disabled, will you
still/always be able to reliably migrate a VM off the host?)

Also, I would like to see "maintenance mode" for Nova be limited just
to stopping any further VMs being sent there, and the node reporting
that it's in maintenance mode.  I think proactive workload migration
should be handled independently, as I can imagine scenarios where
maintenance mode might be desired without coupling migration to it.
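
To make that split concrete, here is a rough Python sketch.  Nothing in
it is Nova code: HostRegistry, schedulable_hosts, drain and migrate_one
are all made-up names, and the state is kept in memory purely for
illustration.  The first half only flags a host and filters it out of
scheduling; the second half is an optional, separate step that migrates
VMs with a cap on how many run in parallel.

```python
import enum
from concurrent.futures import ThreadPoolExecutor


class MaintState(enum.Enum):
    ACTIVE = "active"
    MAINTENANCE = "maintenance"


class HostRegistry:
    """Hypothetical in-memory record of per-host maintenance state."""

    def __init__(self):
        self._state = {}

    def set_maintenance(self, host, enabled):
        # Only flips the flag; VMs already on the host are not touched.
        self._state[host] = (
            MaintState.MAINTENANCE if enabled else MaintState.ACTIVE
        )

    def state(self, host):
        # Hosts never seen before are treated as active.
        return self._state.get(host, MaintState.ACTIVE)


def schedulable_hosts(registry, hosts):
    """Filter step: drop hosts that report maintenance mode."""
    return [h for h in hosts if registry.state(h) is MaintState.ACTIVE]


def drain(vms, migrate_one, max_parallel=2):
    """Optional, separate step: migrate VMs with bounded concurrency.

    migrate_one is a caller-supplied callable (an assumption of this
    sketch).  Returns a dict mapping each VM to its result, or to the
    exception raised while migrating it.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {vm: pool.submit(migrate_one, vm) for vm in vms}
        for vm, future in futures.items():
            try:
                results[vm] = future.result()
            except Exception as exc:
                results[vm] = exc
    return results
```

With this shape, an operator could call set_maintenance() on its own
for a quick patch and only call drain() when the workloads actually
need to move.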

I would love to keep discussing this further - a small session in
Paris would be great.  But it seems like there's never enough time at
the summits, so I don't have high hopes for making much progress on
this specific topic there.  Just the same, if anything gets pulled
together, I'll be keeping an eye out for it.

-Christopher

On Fri, Oct 17, 2014 at 9:21 PM, Joe Cropper <cropper.joe at gmail.com> wrote:
> I’m glad to see this topic getting some focus once again.  :-)
>
> From several of the administrators I talk with, when they think of putting a host into maintenance mode, the common requests I hear are:
>
> 1. Don’t schedule more VMs to the host
> 2. Provide an optional way to automatically migrate all (usually active) VMs off the host so that users’ workloads remain “unaffected” by the maintenance operation
>
> #1 can easily be achieved, as has been mentioned several times, by simply disabling the compute service.  However, #2 involves a little more work, although certainly possible using all the operations provided by nova today (e.g., live migration, etc.).  I believe these types of discussions have come up several times over the past several OpenStack releases—certainly since Grizzly (i.e., when I started watching this space).
>
> It seems that the general direction is to have the type of workflow needed for #2 outside of nova (which is certainly a valid stance).  To that end, it would be fairly straightforward to build some code that logically sits on top of nova, that when entering maintenance:
>
> 1. Prevents VMs from being scheduled to the host;
> 2. Maintains state about the maintenance operation (e.g., not in maintenance, migrations in progress, in maintenance, or error);
> 3. Provides mechanisms to dictate, upon entering maintenance, which VMs (active, all, none) to migrate, and provides some throttling capabilities to prevent hundreds of parallel migrations on densely packed hosts (all done via a REST API).
>
> If anyone has additional questions, comments, or would like to discuss some options, please let me know.  If interested, upon request, I could even share a video of how such cases might work.  :-)  My colleagues and I have given these use cases a lot of thought and consideration and I’d love to talk more about them (perhaps a small session in Paris would be possible).
>
> - Joe
>
> On Oct 17, 2014, at 4:18 AM, John Garbutt <john at johngarbutt.com> wrote:
>
>> On 17 October 2014 02:28, Matt Riedemann <mriedem at linux.vnet.ibm.com> wrote:
>>>
>>>
>>> On 10/16/2014 7:26 PM, Christopher Aedo wrote:
>>>>
>>>> On Tue, Sep 9, 2014 at 2:19 PM, Mike Scherbakov
>>>> <mscherbakov at mirantis.com> wrote:
>>>>>>
>>>>>> On Tue, Sep 9, 2014 at 6:02 PM, Clint Byrum <clint at fewbar.com> wrote:
>>>>>
>>>>> The idea is not to simply deny or hang requests from clients, but to
>>>>> tell them
>>>>> "we are in maintenance mode, retry in X seconds"
>>>>>
>>>>>> You probably would want 'nova host-servers-migrate <host>'
>>>>>
>>>>> yeah for migrations - but as far as I understand, it doesn't help with
>>>>> disabling this host in the scheduler - there is still a chance that some
>>>>> workloads will be scheduled to the host.
>>>>
>>>>
>>>> Regarding putting a compute host in maintenance mode using "nova
>>>> host-update --maintenance enable", it looks like the blueprint and
>>>> associated commits were abandoned a year and a half ago:
>>>> https://blueprints.launchpad.net/nova/+spec/host-maintenance
>>>>
>>>> It seems that "nova service-disable <host> nova-compute" effectively
>>>> prevents the scheduler from trying to send new work there.  Is this
>>>> the best approach to use right now if you want to pull a compute host
>>>> out of an environment before migrating VMs off?
>>>>
>>>> I agree with Tim and Mike that having something respond "down for
>>>> maintenance" rather than ignore or hang would be really valuable.  But
>>>> it also looks like that hasn't gotten much traction in the past -
>>>> anyone feel like they'd be in support of reviving the notion of
>>>> "maintenance mode"?
>>>>
>>>> -Christopher
>>>>
>>>> _______________________________________________
>>>> OpenStack-dev mailing list
>>>> OpenStack-dev at lists.openstack.org
>>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>>>
>>>
>>> host-maintenance-mode is definitely a thing in nova compute via the os-hosts
>>> API extension and the --maintenance parameter, the compute manager code is
>>> here [1].  The thing is the only in-tree virt driver that implements it is
>>> xenapi, and I believe when you put the host in maintenance mode it's
>>> supposed to automatically evacuate the instances to some other host, but you
>>> can't target the other host or tell the driver, from the API, which
>>> instances you want to evacuate, e.g. all, none, running only, etc.
>>>
>>> [1]
>>> http://git.openstack.org/cgit/openstack/nova/tree/nova/compute/manager.py?id=2014.2#n3990
>>
>> We should certainly make that more generic. It doesn't update the VM
>> state, so it's really only admin focused in its current form.
>>
>> The XenAPI logic only works when using XenServer pools with shared NFS
>> storage, if my memory serves me correctly. Honestly, it's a bit of code
>> I have planned on removing, along with the rest of the pool support.
>>
>> In terms of requiring DB downtime in Nova, the current efforts are
>> focusing on avoiding downtime altogether, via expand/contract style
>> migrations, with a little help from objects to avoid data migrations.
>>
>> That doesn't mean maintenance mode is not useful for other things,
>> like an emergency patching of the hypervisor.
>>
>> John


