[openstack-dev] [vmware][nova] Prevent HA configuration with different hostnames

Matthew Booth mbooth at redhat.com
Wed Feb 11 17:08:22 UTC 2015


On 11/02/15 16:40, Gary Kotton wrote:
> 
> 
> On 2/11/15, 6:35 PM, "Sylvain Bauza" <sbauza at redhat.com> wrote:
> 
>>
>> On 11/02/2015 17:04, Gary Kotton wrote:
>>> I posted a fix that does not break things and supports HA.
>>> https://review.openstack.org/154029
>>
>>
>> Just to be clear: HA is *not* supported by Nova right now.
> 
> That is not correct. It is actually supported if the host_ip is the same.
> If the host_ip is not the same, then there is an issue when one of the
> compute nodes restarts: it will delete all instances that do not have its
> host_ip.

FWIW, I suspect you're correct. My patch enforces that.
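The shape of the check is roughly this (an illustrative sketch only, not
the actual patch; the method name is made up):

    def _check_host_consistency(self, context):
        # Illustrative sketch, not the actual patch: refuse to start if
        # instances already on this hypervisor were created under a
        # different service host. 'objects' and 'exception' are the
        # usual nova.objects and nova.exception modules.
        for uuid in self.driver.list_instance_uuids():
            instance = objects.Instance.get_by_uuid(context, uuid)
            if instance.host != self.host:
                raise exception.NovaException(
                    'Instance %s was created by host %s but this service '
                    'is configured as %s; refusing to start' %
                    (uuid, instance.host, self.host))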

So, check out the init code from compute manager:

    def init_host(self):
        """Initialization for a standalone compute service."""
        self.driver.init_host(host=self.host)

The driver is configured with the compute service's 'host' config variable
(CONF.host, which defaults to the system hostname). Let's scan through and
see what else uses it:


    def _update_resource_tracker(self, context, instance):
        """Let the resource tracker know that an instance has changed
state."""

        if (instance['host'] == self.host and
                self.driver.node_is_available(instance['node'])):

Instances with the old host value will be ignored by
_update_resource_tracker. That doesn't sound good.
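To spell out the failure mode:

    # The host recorded on the instance at creation no longer matches
    # the restarted service's configured host, so the guard above is
    # never true and the tracker silently skips the instance.
    recorded_host = 'compute-a'    # instance['host'] at creation time
    configured_host = 'compute-b'  # self.host after the rename/failover
    recorded_host == configured_host  # False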

There's _destroy_evacuated_instances(), which you already found :)
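For anyone else following along, it boils down to roughly this (paraphrased
and simplified, not the exact code):

    def _destroy_evacuated_instances(self, context):
        # Simplified paraphrase: any instance the hypervisor reports
        # that is no longer assigned to self.host is assumed to have
        # been evacuated elsewhere, and is destroyed locally. With two
        # HA nodes using different 'host' values, a restart destroys
        # the other node's perfectly healthy instances.
        filters = {'deleted': False}
        for instance in self._get_instances_on_driver(context, filters):
            if instance.host != self.host:
                network_info = self._get_instance_nw_info(context, instance)
                bdi = self._get_instance_block_device_info(context, instance)
                self.driver.destroy(context, instance, network_info, bdi)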

There's this in _validate_instance_group_policy:

            group_hosts = group.get_hosts(context, exclude=[instance.uuid])
            if self.host in group_hosts:
                msg = _("Anti-affinity instance group policy was violated.")

You've changed group affinity.
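To illustrate:

    # Members of an anti-affinity group booted under the old host name
    # mean the new name never appears in group_hosts, so the violation
    # check passes when arguably it should fail.
    group_hosts = ['compute-a']   # hosts of the existing group members
    self_host = 'compute-b'       # the renamed/restarted service
    self_host in group_hosts      # False: violation goes undetected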

There's _check_instance_build_time:

        filters = {'vm_state': vm_states.BUILDING,
                   'host': self.host}

Nova won't check instances created by the old host to see if they're stuck.

There's this in _heal_instance_info_cache:

            db_instances = objects.InstanceList.get_by_host(
                context, self.host, expected_attrs=[], use_slave=True)

This periodic task won't find instances created by the old host, so their
network info caches will never be healed.

There's this in _poll_rebooting_instances:

            filters = {'task_state':
                       [task_states.REBOOTING,
                        task_states.REBOOT_STARTED,
                        task_states.REBOOT_PENDING],
                       'host': self.host}

Nova won't poll instances created by the old host.

This is just a cursory flick through. I'm fairly sure this is going to
be a lot of work to fix. My patch just ensures that Nova refuses to
start instead of letting bad things happen. If you ensure that
'self.host' in the above code is the same for all HA nodes, I don't see
why it shouldn't work, though. My patch won't prevent that.
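Concretely, that means sharing one value in nova.conf on every node (the
value below is only an example):

    [DEFAULT]
    # Must be identical on every compute service managing the same
    # cluster.
    host = ha-cluster-1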

Matt

> 
>>
>> The main reason is that compute *nodes* are considered to be provided by
>> the hypervisor (i.e. the virt driver run by the compute manager worker),
>> so if 2 or more hypervisors on two distinct machines report the same
>> list of nodes, then you would have duplicates.
> 
> No. There are no duplicates.
> 
>>
>> -Sylvain
>>
>>> On 2/11/15, 5:55 PM, "Matthew Booth" <mbooth at redhat.com> wrote:
>>>
>>>> On 11/02/15 15:49, Gary Kotton wrote:
>>>>> Hi,
>>>>> I do not think that is a healthy solution. That would effectively
>>>>> render a cluster down if the compute node goes down, which would be a
>>>>> real disaster. The ugly workaround is setting the host names to the
>>>>> same value.
>>>> I don't think that's an ugly workaround. I think that's the only
>>>> currently viable solution.
>>>>
>>>>> This is something that we should discuss at the next summit, and I
>>>>> hope to propose a topic for it.
>>>> Sounds like a good plan. However, given that the bug is marked
>>>> Critical, I was assuming we wanted a more expedient fix, which is what
>>>> I've proposed.
>>>>
>>>> Matt
>>>>
>>>>> Thanks
>>>>> Gary
>>>>>
>>>>> On 2/11/15, 5:31 PM, "Matthew Booth" <mbooth at redhat.com> wrote:
>>>>>
>>>>>> I just posted this:
>>>>>>
>>>>>> https://review.openstack.org/#/c/154907/
>>>>>>
>>>>>> as an alternative fix for critical bug:
>>>>>>
>>>>>> https://bugs.launchpad.net/nova/+bug/1419785
>>>>>>
>>>>>> I've just knocked this up quickly for illustration: it obviously
>>>>>> needs plenty of cleanup. I have confirmed that it works, though.
>>>>>>
>>>>>> Before I take it any further, though, I'd like to get some feedback
>>>>>> on the approach. I prefer this to the alternative, because the
>>>>>> underlying problem is deeper than supporting evacuate. I'd prefer to
>>>>>> be honest with the user and just say it ain't gonna work. The
>>>>>> alternative would leave Nova running in a broken state, leaving
>>>>>> inconsistent state in its wake as it runs.
>>>>>>
>>>>>> Matt


-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490


