[openstack-dev] [Fuel] HA cluster disk monitoring, failover and recovery

Vladimir Kuklin vkuklin at mirantis.com
Tue Nov 17 17:12:56 UTC 2015


Bogdan

I think we should first check whether deleting the attribute makes the node
start its services again. From what I read in the official Pacemaker
documentation, it should work out of the box without the need to restart
the node.
By the way, the quote above says 'use ONE of the following methods',
meaning that attribute deletion alone should be sufficient. The 2nd and 3rd
options achieve the same thing - they clear the transient (short-lived)
node attribute. So we need to figure out why the OCF script does not update
the corresponding attribute by itself.
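
To test this quickly on a controller, the transient attribute can be
cleared by hand. A sketch only - 'node-1' is a placeholder for the
affected node's name, and which form works depends on the crmsh version:

```shell
# Delete the transient health attribute set by the health monitoring agent.
# 'node-1' is a placeholder - substitute the affected node's name.
crm node status-attr node-1 delete "#health_disk"

# Equivalent low-level form: --lifetime reboot targets the transient
# attribute set (the node's status section), not the persistent one.
crm_attribute --node node-1 --name "#health_disk" --delete --lifetime reboot
```

Then watch crm_mon to see whether resources start on the node again
without any restart.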



On Tue, Nov 17, 2015 at 7:03 PM, Bogdan Dobrelya <bdobrelia at mirantis.com>
wrote:

> On 17.11.2015 15:28, Kyrylo Galanov wrote:
> > Hi Team,
>
> Hello
>
> >
> > I have been testing failover after free disk space drops below 512 MB
> > (https://review.openstack.org/#/c/240951/).
> > The affected node is stopped correctly and services migrate to a
> > healthy node.
> >
> > However, after free disk space rises above 512 MB again, the node does
> > not recover to an operating state. Moreover, starting the resources
> > manually also fails. In a nutshell, the Pacemaker service / node has to
> > be restarted. Detailed information is available here:
> https://www.suse.com/documentation/sle_ha/book_sleha/data/sec_ha_configuration_basics_monitor_health.html
> >
> > How do we address this issue?
>
> According to the docs you provided,
> " After a node's health status has turned to red, solve the issue that
> led to the problem. Then clear the red status to make the node eligible
> again for running resources. Log in to the cluster node and use one of
> the following methods:
>
>     Execute the following command:
>
>     crm node status-attr NODE delete #health_disk
>
>     Restart OpenAIS on that node.
>
>     Reboot the node.
>
> The node will be returned to service and can run resources again. "
>
> So this looks like expected behaviour!
>
> What else could be done:
> - We should check whether this nuance is documented and, if not, submit
> a bug to the fuel-docs team.
> - Submitting a bug and inspecting the logs would be worthwhile as well.
> I believe some optimizations are possible, bearing in mind the Pacemaker
> cluster-recheck-interval and failure-timeout story [0].
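>
> For reference, the knobs discussed in [0] would look roughly like this
> in crmsh. A sketch only - the interval values and the 'p_foo' resource
> name are illustrative placeholders, not a tuning recommendation:
>
> ```shell
> # Re-run the policy engine periodically so cleared failures/attributes
> # are noticed even without an explicit cluster event:
> crm configure property cluster-recheck-interval="190s"
>
> # Per-resource: expire a recorded failure automatically after some time,
> # making the resource eligible to run on the node again:
> crm configure primitive p_foo ocf:heartbeat:Dummy \
>   meta failure-timeout="120s" \
>   op monitor interval="30s"
> ```
>
> Note that failure-timeout is only evaluated when the policy engine runs,
> so it interacts with cluster-recheck-interval as described in [0].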
>
> [0]
> http://blog.kennyrasschaert.be/blog/2013/12/18/pacemaker-high-failability/
>
> >
> >
> > Best regards,
> > Kyrylo
> >
> >
> >
> __________________________________________________________________________
> > OpenStack Development Mailing List (not for usage questions)
> > Unsubscribe:
> OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> >
>
>
> --
> Best regards,
> Bogdan Dobrelya,
> Irc #bogdando
>



-- 
Yours Faithfully,
Vladimir Kuklin,
Fuel Library Tech Lead,
Mirantis, Inc.
+7 (495) 640-49-04
+7 (926) 702-39-68
Skype kuklinvv
35bk3, Vorontsovskaya Str.
Moscow, Russia,
www.mirantis.com
www.mirantis.ru
vkuklin at mirantis.com

