[openstack-dev] [nova] periodic task
Matt Riedemann
mriedem at linux.vnet.ibm.com
Tue Aug 25 16:10:51 UTC 2015
On 8/25/2015 10:03 AM, Gary Kotton wrote:
>
>
> On 8/25/15, 7:04 AM, "Matt Riedemann" <mriedem at linux.vnet.ibm.com> wrote:
>
>>
>>
>> On 8/24/2015 9:32 PM, Gary Kotton wrote:
>>> In item #2 below the reboot is done via the guest and not the nova
>>> APIs :)
>>>
>>> From: Gary Kotton <gkotton at vmware.com <mailto:gkotton at vmware.com>>
>>> Reply-To: OpenStack List <openstack-dev at lists.openstack.org
>>> <mailto:openstack-dev at lists.openstack.org>>
>>> Date: Monday, August 24, 2015 at 7:18 PM
>>> To: OpenStack List <openstack-dev at lists.openstack.org
>>> <mailto:openstack-dev at lists.openstack.org>>
>>> Subject: [openstack-dev] [nova] periodic task
>>>
>>> Hi,
>>> A couple of months ago I posted a patch for bug
>>> https://launchpad.net/bugs/1463688. The issue is as follows: the
>>> periodic task detects that the instance state does not match the state
>>> on the hypervisor and it shuts down the running VM. There are a number
>>> of ways that this may happen and I will try and explain:
>>>
>>> 1. VMware driver example: a host where the instances are running goes
>>> down. This could be a power outage, host failure, etc. The first
>>> iteration of the periodic task will determine that the actual
>>> instance is down. This will update the state of the instance to
>>> DOWN. The VC has the ability to do HA and it will start the instance
>>> up and running again. The next iteration of the periodic task will
>>> determine that the instance is up and the compute manager will stop
>>> the instance.
>>> 2. All drivers. The tenant decides to do a reboot of the instance and
>>> that coincides with the periodic task state validation. At this
>>> point in time the instance will not be up and the compute node will
>>> update the state of the instance as DOWN. Next iteration the states
>>> will differ and the instance will be shut down.
>>>
>>> Basically the issue hit us with our CI and there was no CI running for a
>>> couple of hours due to the fact that the compute node decided to
>>> shut down the running instances. The hypervisor should be the source of
>>> truth and it should not be the compute node that decides to shut down
>>> instances. I posted a patch to deal with this,
>>> https://review.openstack.org/#/c/190047/, which is the reason for this
>>> mail. The patch is backwards compatible: existing deployments keep
>>> today's behavior, random shutdowns included, and the admin now has the
>>> option to just log when there is an inconsistency.
>>>
>>> We do not want to disable the periodic task, as knowing the current state
>>> of the instance is very important and has a ton of value. We just do not
>>> want the periodic task to shut down a running instance.
>>>
>>> Thanks
>>> Gary
>>>
>>>
>>>
>>> __________________________________________________________________________
>>> OpenStack Development Mailing List (not for usage questions)
>>> Unsubscribe:
>>> OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>>
>>
>> In #2 the guest shouldn't be rebooted by the user (tenant) outside of
>> the nova-api. I'm not sure if it's actually formally documented in the
>> nova documentation, but from what I've always heard/known, nova is the
>> control plane and you should be doing everything with your instances via
>> the nova-api. If the user rebooted via nova-api, the task_state would
>> be set and the periodic task would ignore the instance.
>
> Matt, this is one case that I showed where the problem occurs. There are
> others and I can invest time to see them. The fact that the periodic task
> is there is important. What I don't understand is why having the option of
> a log indication for an admin is not considered useful, and instead we
> are going with having the compute node shut down instances when this should
> not happen. Our infrastructure is behaving like cattle. That should not be
> the case and the hypervisor should be the source of truth.
>
> This is a serious issue and instances in production can and will go down.
>
>>
>> --
>>
>> Thanks,
>>
>> Matt Riedemann
>>
>>
>
>
>
For the HA case #1, the periodic task checks whether instance.host
matches the compute service host [1] and skips the instance if it doesn't.
Shouldn't your HA scenario be updating which host the instance is
running on? Or is this a vCenter-ism?
[1]
http://git.openstack.org/cgit/openstack/nova/tree/nova/compute/manager.py#n5871
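
To make the two skip conditions discussed in this thread concrete (a host
mismatch, and a task_state set by an operation started through nova-api),
here is a minimal Python sketch. It is not the code at [1]; the Instance
class, the sync_decision function and the returned strings are hypothetical
names made up for illustration, and power states are plain strings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instance:
    """Hypothetical stand-in for the database view of an instance."""
    uuid: str
    host: str                  # compute host that nova believes owns the instance
    task_state: Optional[str]  # set while an API-driven operation is in flight
    power_state: str           # power state last recorded in the database


def sync_decision(instance: Instance, service_host: str,
                  hypervisor_power_state: str) -> str:
    """Decide what a power-state sync pass would do with one instance.

    Illustrative only: returns a label instead of acting on the instance.
    """
    # Skip instances that this compute service does not own.
    if instance.host != service_host:
        return "skip: host mismatch"
    # Skip instances with a pending task (e.g. a reboot issued via nova-api).
    if instance.task_state is not None:
        return "skip: task in progress"
    # Only here is the recorded state compared with what the hypervisor reports.
    if instance.power_state != hypervisor_power_state:
        return "mismatch"
    return "in sync"
```

In this sketch a reboot issued from inside the guest leaves task_state unset,
so the instance falls through to the power-state comparison, which is the
situation described in item #2.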
--
Thanks,
Matt Riedemann