[openstack-dev] [Fuel] HA cluster disk monitoring, failover and recovery

Alex Schultz aschultz at mirantis.com
Tue Nov 17 15:18:37 UTC 2015


On Tue, Nov 17, 2015 at 9:01 AM, Vladimir Kuklin <vkuklin at mirantis.com> wrote:
> Folks
>
> Isn't it possible for an OCF script to clear this attribute after a
> sufficient period of successful monitoring of node health? It could be a
> better approach in this case than restarting the node.
>

So this change relies on the pacemaker-provided sysinfo and on core
pacemaker/corosync functionality.  We'd have to look into the core of
that to see whether it can automatically mark the node as cleared once
the space is freed up (essentially clearing the "#health_disk" status
attribute).  You do not have to reboot/restart the node. You simply
clear the alarm and the services automatically start back up.  As I
have mentioned previously about this change, it is not a replacement
for proper system monitoring and ensuring node health. This is simply
a minor improvement to ensure that the cluster doesn't crash itself
when it runs out of space. The goal of the change was to ensure that
rabbitmq/mysql/etc are cleanly shut down before disk space becomes
critically low, which can lead to the systems melting down.
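
For illustration, here is a minimal sketch of how the clearing step could
be scripted once space has been freed, assuming the 512 MB threshold
mentioned in this thread. The function names and the THRESHOLD_MB,
CHECK_PATH and DRY_RUN variables are made up for this example; only the
crm command itself comes from the original change. It is not something
the change does on its own.

```shell
#!/bin/sh
# Sketch only: clear the pacemaker disk health alarm after space recovers.
# THRESHOLD_MB, CHECK_PATH and DRY_RUN are hypothetical knobs for this example.
THRESHOLD_MB="${THRESHOLD_MB:-512}"
CHECK_PATH="${CHECK_PATH:-/}"
DRY_RUN="${DRY_RUN:-1}"

# Print the free space, in whole megabytes, of the filesystem holding $1.
free_mb() {
    df -Pk "$1" | awk 'NR==2 { print int($4 / 1024) }'
}

# Succeed when $1 (free MB) is strictly above $2 (threshold MB).
space_recovered() {
    [ "$1" -gt "$2" ]
}

# Print the command from this thread that clears the alarm for node $1.
clear_cmd() {
    printf 'crm node status-attr %s delete "#health_disk"\n' "$1"
}

# Clear the alarm for node $1 only if free space is back above the threshold.
maybe_clear() {
    node="$1"
    free="$(free_mb "$CHECK_PATH")"
    if space_recovered "$free" "$THRESHOLD_MB"; then
        if [ "$DRY_RUN" = "1" ]; then
            # Dry run: only show what would be executed.
            clear_cmd "$node"
        else
            crm node status-attr "$node" delete "#health_disk"
        fi
    else
        echo "only ${free} MB free (need more than ${THRESHOLD_MB} MB); alarm left in place" >&2
        return 1
    fi
}
```

With these definitions loaded, running maybe_clear with the node name as
pacemaker knows it prints the command by default; setting DRY_RUN=0 would
actually run it. Defaulting to a dry run is deliberate, since clearing
the alarm while the disk is still filling up just restarts the problem.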

Thanks,
-Alex

> On Tue, Nov 17, 2015 at 5:41 PM, Alex Schultz <aschultz at mirantis.com> wrote:
>>
>> Hey Kyrylo,
>>
>>
>> On Tue, Nov 17, 2015 at 8:28 AM, Kyrylo Galanov <kgalanov at mirantis.com>
>> wrote:
>> > Hi Team,
>> >
>> > I have been testing fail-over when free disk space drops below 512 MB.
>> > (https://review.openstack.org/#/c/240951/)
>> > Affected node is stopped correctly and services migrate to a healthy
>> > node.
>> >
>> > However, after free disk space rises above 512 MB again, the node does
>> > not return to an operating state. Moreover, starting the resources
>> > manually fails as well. In a nutshell, the pacemaker service / node
>> > needs to be restarted. Detailed information is available here:
>> >
>> > https://www.suse.com/documentation/sle_ha/book_sleha/data/sec_ha_configuration_basics_monitor_health.html
>> >
>> > How do we address this issue?
>> >
>>
>> So the original change for this was
>> https://review.openstack.org/#/c/226062/. As indicated by the commit
>> message, the only way pacemaker will recover is for the operator to
>> run a pacemaker command to clear the disk alert:
>>
>> crm node status-attr <hostname> delete "#health_disk"
>>
>> Once the operator has cleared up the disk space issue and run the above
>> command, pacemaker will rejoin the cluster and start services again.
>> The documentation bug for this is
>> https://bugs.launchpad.net/fuel/+bug/1500422.
>>
>> Thanks,
>> -Alex
>>
>> >
>> > Best regards,
>> > Kyrylo
>> >
>> >
>> > __________________________________________________________________________
>> > OpenStack Development Mailing List (not for usage questions)
>> > Unsubscribe:
>> > OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
>> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>> >
>>
>
>
>
>
> --
> Yours Faithfully,
> Vladimir Kuklin,
> Fuel Library Tech Lead,
> Mirantis, Inc.
> +7 (495) 640-49-04
> +7 (926) 702-39-68
> Skype kuklinvv
> 35bk3, Vorontsovskaya Str.
> Moscow, Russia,
> www.mirantis.com
> www.mirantis.ru
> vkuklin at mirantis.com
>
>