[Openstack-operators] Maintenance

Joseph Bajin josephbajin at gmail.com
Fri Apr 22 22:55:11 UTC 2016


Rob/Jay,

The use of the OSOps Working Group and its repos is a great way to address
this. If any of you are coming to the Summit, please take a look at the
Etherpad that we have created [1]. This could be a great discussion topic
for the working sessions, and we can brainstorm how we could help with this.


Joe

[1] https://etherpad.openstack.org/p/AUS-ops-OSOps

On Fri, Apr 22, 2016 at 4:02 PM, Robert Starmer <robert at kumul.us> wrote:

> Maybe a result of the discussion can be a set of models (let's not go so
> far as to call them best practices yet :) for how maintenance can be done
> at scale, perhaps solidifying the descriptions Jay has above with the user
> stories Tomi described in his initial note.  This seems like an achievable
> outcome from a working session, and the output even has a target: either
> scriptable workflows that could end up in the OSOps repository, or
> user stories that can be mapped to the PM working group.
>
> R
>
> On Fri, Apr 22, 2016 at 12:47 PM, Jay Pipes <jaypipes at gmail.com> wrote:
>
>> On 04/14/2016 05:14 AM, Juvonen, Tomi (Nokia - FI/Espoo) wrote:
>> <snip>
>>
>>> As admin I want to know when a host is ready for the admin's maintenance
>>> actions, meaning its physical resources have been emptied.
>>>
>>
>> You are equating "host maintenance mode" with the end result of a call to
>> `nova host-evacuate-live`. The two are not the same.
>>
>> "host maintenance mode" typically just refers to taking a Nova compute
>> node out of consideration for placing new workloads on that compute node.
>> Putting a Nova compute node into host maintenance mode is as simple as
>> calling `nova service-disable $hostname nova-compute`.
>>
>> Depending on what you need to perform on the compute node that is in host
>> maintenance mode, you *may* want to migrate the workloads from that compute
>> node to some other compute node that isn't in host maintenance mode. The
>> `nova host-evacuate $hostname` and `nova host-evacuate-live $hostname`
>> commands in the Nova CLI [1] can be used to migrate or live-migrate all
>> workloads off the target compute node.
>>
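
For what it's worth, here is a minimal sketch of the two steps described
above: disable the compute service, then optionally move the workloads. It is
an illustration only, not an agreed OSOps workflow. The function names, the
`compute-01` hostname and the dry-run default are assumptions; the only Nova
CLI calls used are the ones named in this thread.

```python
import shlex
import subprocess
from typing import List, Optional


def maintenance_commands(hostname: str, move: Optional[str] = "live") -> List[List[str]]:
    """Build the Nova CLI calls named in the thread, in order.

    move="live"     -> nova host-evacuate-live $hostname
    move="evacuate" -> nova host-evacuate $hostname
    move=None       -> only disable the service, leave workloads in place
    """
    # Step 1: take the node out of consideration for new workloads.
    commands = [["nova", "service-disable", hostname, "nova-compute"]]

    # Step 2 (optional): move existing workloads to another compute node.
    if move == "live":
        commands.append(["nova", "host-evacuate-live", hostname])
    elif move == "evacuate":
        commands.append(["nova", "host-evacuate", hostname])
    elif move is not None:
        raise ValueError("move must be 'live', 'evacuate' or None")
    return commands


def run(commands: List[List[str]], dry_run: bool = True) -> List[str]:
    """Return the rendered command lines; execute them only if dry_run is False."""
    rendered = [" ".join(shlex.quote(arg) for arg in cmd) for cmd in commands]
    if not dry_run:
        for cmd in commands:
            # check=True stops at the first failing nova call.
            subprocess.run(cmd, check=True)
    return rendered


if __name__ == "__main__":
    # Hypothetical hostname; dry run, so nothing is executed.
    for line in run(maintenance_commands("compute-01")):
        print(line)
```

Keeping this in a script outside Nova matches the point made in footnote [1]
below: the evacuate commands exist only in python-novaclient, not in the
Compute API.
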
>> Live migration will reduce the disruption that tenant workloads (data
>> plane) experience during the workload migration. However, research at
>> Mirantis has shown that libvirt/KVM/QEMU live migration performed against
>> workloads with even a medium rate of memory page dirtying can easily never
>> complete. Solutions like auto-converge and xbzrle compression have minimal
>> effect on this, unfortunately. Pausing a workload manually is typically
>> what is done to force the live migration to complete.
>>
>> [1] Note that these are commands in the Nova CLI tool
>> (python-novaclient). Neither a host-evacuate nor a host-evacuate-live REST
>> API call exists in the Compute API. This fact alone should suggest to folks
>> that the appropriate place to put logic associated with performing host
>> maintenance tasks should be *outside* of Nova entirely...
>>
>>> As owner of a server I want to prepare for maintenance to minimize
>>> downtime, keep capacity on needed level and switch HA service to server
>>> not affected by maintenance.
>>>
>>
>> This isn't an appropriate use case, IMHO. HA control planes should, by
>> their very nature, be established across various failure domains. The whole
>> *point* of having an HA service is so that you don't need to "prepare" for
>> some maintenance event (planned or unplanned).
>>
>> All HA control planes worth their salt will be able to notify some
>> external listener of a partition in the cluster. This HA control plane is
>> the responsibility of the tenant, not the infrastructure (i.e. Nova). I
>> really do not want to add coupling between infrastructure control plane
>> services and tenant control plane services.
>>
>>> As owner of a server I want to know when my servers will be down because
>>> of host maintenance as it might be servers are not moved to another host.
>>>
>>
>> See above. As an owner of a server involved in an HA cluster, it is *the
>> server owner's* responsibility to set things up so that the cluster
>> rebalances, handles redirected load, or does the custom thing that they
>> want. This isn't, IMHO, the domain of the NFVI but rather a much
>> higher-level NFVO orchestration layer.
>>
>>> As owner of a server I want to know if host is to be totally removed, so
>>> instead of keeping my servers on host during maintenance, I want to move
>>> them to somewhere else.
>>>
>>
>> This isn't something the owner of a server even knows about in a cloud
>> environment. Owners of a server don't (and shouldn't) know which compute
>> node they are on, nor should they know that a host is having a planned or
>> unplanned host maintenance event.
>>
>> The infrastructure owner (cloud deployer/operator) is responsible for
>> doing the needful and performing a [live] migration of workloads off of a
>> failing host or a host that is undergoing a cold upgrade. The tenant
>> doesn't know anything about these things, and shouldn't.
>>
>>> As owner of a server I want to send acknowledgement to be ready for host
>>> maintenance and I want to state if servers are to be moved or kept on
>>> host.
>>>
>>
>> This is describing some virtual inventory management or CMDB
>> functionality that isn't in scope for infrastructure services like Nova.
>> Perhaps it's worth looking into how something like Remedy can manage your
>> virtual inventory in this manner, but I don't see this being in the
>> OpenStack realm really...
>>
>> FWIW, this is the same objection I had to Tacker joining the OpenStack
>> Big Tent. It is essentially a monolithic, purpose-built-for-Telco
>> application that orchestrates VNFs at layers way above the OpenStack
>> deployment.
>>
>> Best,
>> -jay
>>
>>> Removal and creating of server is in owner's control already. Optionally
>>> server configuration data could hold information about automatic actions
>>> to be done when host is going down unexpectedly or in controlled manner.
>>> Also actions at the same if down permanently or only temporarily. Still
>>> this needs acknowledgement from server owner as he needs time for
>>> application level controlled HA service switchover.
>>> Br,
>>> Tomi
>>>
>>>
>>> _______________________________________________
>>> OpenStack-operators mailing list
>>> OpenStack-operators at lists.openstack.org
>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>>>
>>>