[openstack-dev] [fuel] [HA] How long we need to wait for cloud recovery after some destructive scenarios?

Timur Nurlygayanov tnurlygayanov at mirantis.com
Wed Jun 3 09:50:01 UTC 2015


Hi team,

I'm working on automated HA / destructive / recovery tests [1] for
OpenStack clouds, and I'd like to hear what users, operators and
developers expect regarding how quickly OpenStack recovers after
destructive actions.
For example, how long should a cluster be unavailable if one of three
controllers is destroyed? I think the right answer is '0 seconds,
no downtime' - users shouldn't notice anything unusual when we lose one
controller in our cloud (if it is a 'true' HA configuration).
In the real world, I see that such destructive scenarios require some
time for the cloud to recover (1-15 minutes, depending on the case), so I
would like to hear your expectations or requirements.

How fast can / should we fully recover the cloud in the following cases:
1. Restart RabbitMQ services
2. Restart MySQL / Galera services
3. Restart Neutron services (like L3 agents)
4. Hard shutdown of any OpenStack controller
5. Shutdown of the Ethernet interfaces of the management / data networks

Of course, it depends on the configuration, but we can describe some
common, 'expected' acceptance values (SLA) for downtime in different
destructive cases and use them to verify clouds today and in the future.
We will use these values in the HAOS project [1], which will make it
possible to validate any cloud with the same scenarios and the same SLA
for recovery time.
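
To make the idea of checking a recovery-time SLA concrete, here is a
minimal Python sketch of how downtime could be measured after a destructive
action. It is not code from HAOS: the names measure_downtime and within_sla,
the probe callable and the polling interval are all hypothetical. The
default 900-second observation window only mirrors the 15-minute upper
bound mentioned above; it is not a proposed SLA value.

```python
import time
from typing import Callable, Optional


def measure_downtime(
    probe: Callable[[], bool],
    interval: float = 1.0,
    window: float = 900.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll probe() and report how long the service was observed as down.

    probe is a hypothetical health check supplied by the caller (for
    example, "can I list instances through the API?") returning True
    when the cloud answers correctly.

    Returns the seconds between the first failed probe and the first
    successful probe after it, 0.0 if no failure was seen during the
    observation window, or None if the service was still down when the
    window ended.
    """
    start = clock()
    down_since = None
    while clock() - start < window:
        now = clock()
        ok = probe()
        if not ok and down_since is None:
            down_since = now            # first failure observed
        elif ok and down_since is not None:
            return now - down_since     # service is back
        sleep(interval)
    return 0.0 if down_since is None else None


def within_sla(downtime: Optional[float], limit: float) -> bool:
    """True if the service recovered and did so within limit seconds."""
    return downtime is not None and downtime <= limit
```

A test would trigger one of the destructive scenarios, call
measure_downtime() with a suitable probe, and then compare the result with
the limit agreed for that scenario via within_sla(). The resolution of the
measurement is bounded by the polling interval, so an outage shorter than
one interval may go unnoticed.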

Any comments are welcome :)
Thank you!

-- 

Timur,
Senior QA Engineer
OpenStack Projects
Mirantis Inc