[openstack-dev] [Fuel] HA cluster disk monitoring, failover and recovery

Alex Schultz aschultz at mirantis.com
Tue Nov 17 17:52:04 UTC 2015


On Tue, Nov 17, 2015 at 11:12 AM, Vladimir Kuklin <vkuklin at mirantis.com> wrote:
> Bogdan
>
> I think we should first check whether deleting the attribute leads to the
> node starting its services or not. From what I read in the official Pacemaker
> documentation, it should work out of the box without the need to restart the
> node.

It does start up the services when the attribute is cleared. QA has a
test to validate this as part of this change.

> And by the way, the quote above says 'use ONE of the following methods',
> meaning that we could actually use attribute deletion. The 2nd and the 3rd
> options do the same thing: they clear the short-lived node attribute. So we
> need to figure out why the OCF script does not update the corresponding
> attribute by itself.
>

https://github.com/ClusterLabs/pacemaker/blob/master/extra/resources/SysInfo#L215-L227

The script has nothing that sets the attribute back to green because,
when this condition hits, the sysinfo service is stopped as well. It has
no way of knowing when the condition has cleared, because all the
resources are stopped and no service is left running to reset the
attribute. We would need something outside of Pacemaker to mark it OK,
or perhaps a custom health strategy[0][1] that does not stop the
sysinfo task, plus an update to the OCF script so that it sets the
status back to green once all disks are OK.
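
To illustrate the "something outside of Pacemaker" option, here is a
minimal sketch rather than a tested fix. It assumes a small external
script (run from cron, for example) that checks free disk space and, if
every monitored path is back above the 512 MB limit discussed in this
thread, clears the #health_disk attribute with the crm command quoted
from the SUSE docs further down. The function names, the list of paths
and the use of ">=" for the limit are my own assumptions; none of this
exists in fuel-library.

```python
import shutil
import subprocess

# 512 MB limit discussed in this thread (assumed to be in bytes here).
MIN_FREE_BYTES = 512 * 1024 * 1024


def disks_ok(paths, min_free=MIN_FREE_BYTES, usage=shutil.disk_usage):
    """Return True if every path has at least min_free bytes free."""
    return all(usage(path).free >= min_free for path in paths)


def clear_command(node, attribute="#health_disk"):
    """Build the crm command quoted from the SUSE docs in this thread."""
    return ["crm", "node", "status-attr", node, "delete", attribute]


def recover_if_healthy(node, paths, run=subprocess.check_call,
                       usage=shutil.disk_usage):
    """Clear the health attribute only when all paths have enough space.

    Returns True if the clear command was issued, False otherwise.
    """
    if not disks_ok(paths, usage=usage):
        return False
    # Passing a list avoids shell quoting problems with the '#' character.
    run(clear_command(node))
    return True
```

On a controller this would be called as something like
recover_if_healthy("node-1", ["/", "/var/log"]), where both the node name
and the paths are placeholders. The "run" and "usage" parameters exist
only so the logic can be exercised without a cluster.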

-Alex

[0] https://github.com/openstack/fuel-library/blob/master/deployment/puppet/cluster/manifests/sysinfo.pp#L50-L55
[1] http://clusterlabs.org/wiki/SystemHealth

>
>
> On Tue, Nov 17, 2015 at 7:03 PM, Bogdan Dobrelya <bdobrelia at mirantis.com>
> wrote:
>>
>> On 17.11.2015 15:28, Kyrylo Galanov wrote:
>> > Hi Team,
>>
>> Hello
>>
>> >
>> > I have been testing failover after free disk space drops below 512 MB.
>> > (https://review.openstack.org/#/c/240951/)
>> > Affected node is stopped correctly and services migrate to a healthy
>> > node.
>> >
>> > However, after free disk space rises above 512 MB again, the node does
>> > not recover its operating state. Moreover, starting the resources
>> > manually tends to fail. In a nutshell, the pacemaker service / node
>> > needs to be restarted. Detailed information is available
>> > here:
>> > https://www.suse.com/documentation/sle_ha/book_sleha/data/sec_ha_configuration_basics_monitor_health.html
>> >
>> > How do we address this issue?
>>
>> According to the docs you provided,
>> " After a node's health status has turned to red, solve the issue that
>> led to the problem. Then clear the red status to make the node eligible
>> again for running resources. Log in to the cluster node and use one of
>> the following methods:
>>
>>     Execute the following command:
>>
>>     crm node status-attr NODE delete #health_disk
>>
>>     Restart OpenAIS on that node.
>>
>>     Reboot the node.
>>
>> The node will be returned to service and can run resources again. "
>>
>> So this looks like expected behaviour!
>>
>> What else could be done:
>> - We should check whether this nuance is documented and, if it is not,
>> submit a bug to the fuel-docs team.
>> - Submitting a bug and inspecting logs would be nice to do as well.
>> I believe some optimizations may be done, bearing in mind this pacemaker
>> cluster-recheck-interval and failure-timeout story [0].
>>
>> [0]
>> http://blog.kennyrasschaert.be/blog/2013/12/18/pacemaker-high-failability/
>>
>> >
>> >
>> > Best regards,
>> > Kyrylo
>> >
>> >
>> >
>> > __________________________________________________________________________
>> > OpenStack Development Mailing List (not for usage questions)
>> > Unsubscribe:
>> > OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
>> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>> >
>>
>>
>> --
>> Best regards,
>> Bogdan Dobrelya,
>> Irc #bogdando
>>
>
>
>
>
> --
> Yours Faithfully,
> Vladimir Kuklin,
> Fuel Library Tech Lead,
> Mirantis, Inc.
> +7 (495) 640-49-04
> +7 (926) 702-39-68
> Skype kuklinvv
> 35bk3, Vorontsovskaya Str.
> Moscow, Russia,
> www.mirantis.com
> www.mirantis.ru
> vkuklin at mirantis.com
>
>


