Data Center Survival in case of Disaster / HW Failure in DC

Erik McCormick emccormick at
Thu May 5 17:48:46 UTC 2022

This sounds like a great topic we could discuss at the Ops Meetup in Berlin
Friday after the Summit. I'm going to plop it in the planning etherpad [1]
and we can brainstorm face to face.

I encourage any operator of OpenStack who's around for the Summit to stick
around for an extra day and join us. *nudge Tim* Registration [2] is open.
Visit the planning etherpad [1] to propose topics and +1 the ones you are
interested in.




On Thu, May 5, 2022, 10:48 AM Tim Bell <tim.bell at> wrote:

> Interesting - we’re starting work on exactly the same analysis at the
> moment.
> We’re looking at a separate region for the recovery site, this guarantees
> no dependencies in the control plane.
> Ideally, we’d be running active/active for the most critical applications
> (following AWS recommendations), but there are some issues we’re working
> through (such as how to replicate block/object stores between regions).
> Keeping images/projects in sync between regions also does not seem simple,
> especially where you want different quotas (e.g. you can have 100 cores in
> the production site but only 10 by default in the recovery site).
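The per-region quota difference Tim describes can be set with the standard client, since each region has its own nova and enforces quotas independently. A minimal sketch, assuming hypothetical region and project names:

```shell
# Set different core quotas for the same project in each region
# (region names "prod"/"recovery" and project "myproject" are placeholders).
openstack --os-region-name prod quota set --cores 100 myproject
openstack --os-region-name recovery quota set --cores 10 myproject

# Verify per region:
openstack --os-region-name recovery quota show myproject
```

Keeping these in sync over time is the harder part; the commands above only set a point-in-time value per region.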
> As in any DR plan, testing is key - we’ve started to have a look at
> security groups to do a simulated disconnect test and see what’s not yet in
> the recovery site.
> Does anyone have best-practice recommendations or tools for OpenStack
> disaster recovery?
> Cheers
> Tim
> On 5 May 2022, at 14:37, Eugen Block <eblock at> wrote:
> Hi,
> first, I wouldn't run VMs on control nodes: that way you mix roles
> (control and compute), and if one control node fails its VMs become
> unavailable. That would not be the case if the control node is only a
> control node and is part of a highly available control plane (like yours
> appears to be). Depending on how your control plane is defined, the
> failure of one control node should be tolerable.
> There has been some work on making compute nodes highly available, but I
> don't know its current status. If a compute node fails while nova is still
> responsive, a live or cold migration may still be possible to evacuate
> that host. If a compute node fails and is unresponsive, you'll probably
> need some DB tweaking to revive its VMs on a different node. Either way,
> you should have some spare resources to be able to recover from a compute
> node failure.
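The evacuation paths Eugen describes map onto the standard nova commands. A minimal sketch, with a placeholder hostname; the exact flags available depend on your client and release:

```shell
# Case 1: the compute node is still responsive -> live-migrate all its
# VMs to other hosts (requires shared or Ceph-backed storage for no-copy
# migration).
nova host-evacuate-live failed-node-01

# Case 2: the node is down and confirmed dead (fence it first!) ->
# disable its service and rebuild its instances on other hosts.
openstack compute service set --disable failed-node-01 nova-compute
nova host-evacuate failed-node-01
```

Running evacuate against a host that later comes back with its old VMs still running can corrupt disks, which is why fencing matters before case 2.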
> As for Ceph, it should be configured to sustain the loss of a defined
> number of nodes or disks; I don't know your requirements. If your current
> cluster has "only" 3 nodes, you probably run replicated pools with size 3
> (I hope), min_size 2 and failure domain "host". You could sustain one node
> failure without clients noticing it; a second failed node would cause the
> cluster to pause I/O. You also can't recover from a node failure until the
> node is back up, since the degraded PGs can't be recreated on a different
> node. So this also depends on your actual resiliency requirements. If you
> have a second site, you could use rbd mirroring [1] to sync all rbd images
> between sites. In case the primary site goes down entirely, you could fail
> over by promoting the rbd images on the secondary site.
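The mirroring setup Eugen points at can be sketched with the `rbd mirror` commands. Pool, image and site names below are placeholders, and both clusters need an `rbd-mirror` daemon running:

```shell
# Enable per-image journal-based mirroring on the pool (mode "image"
# lets you choose which images to mirror; "volumes" is a placeholder).
rbd mirror pool enable volumes image

# Exchange a bootstrap token between the two clusters.
rbd mirror pool peer bootstrap create --site-name site-a volumes > token  # on site-a
rbd mirror pool peer bootstrap import --site-name site-b volumes token    # on site-b

# Enable mirroring for an image.
rbd mirror image enable volumes/myimage journal

# During failover, promote the image on the surviving (secondary) site;
# --force is needed when the primary is unreachable.
rbd mirror image promote --force volumes/myimage
```

After the old primary comes back, its stale copies have to be demoted and resynced before normal operation resumes.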
> So you see there is plenty of information to cover and careful planning is
> required.
> Regards,
> Eugen
> [1]
> Zitat von KK CHN < at>:
> List,
> We have an old cloud setup with OpenStack Ussuri on Debian (QEMU/KVM).
> I know it's very old and we can't upgrade to newer versions right now.
> The deployment is as follows:
> A.  3 controller nodes (cum compute nodes; VMs are running on the
> controllers too) in HA mode.
> B.  6 separate compute nodes
> C.  3 separate storage nodes with Ceph RBD
> The questions are:
> 1.  In case of a sudden hardware failure of one or more controller,
> compute or storage nodes, what immediate redundant recovery setup needs
> to be employed?
> 2.  In case of a H/W failure, recovery needs to happen as soon as
> possible, for example less than 30 minutes after the first failure.
> 3.  Are there setup options like a hot standby or similar that we need
> to employ?
> 4.  How do we meet the RTO (< 30 minutes downtime) and RPO (from the
> exact point of crash, all applications and data must be consistent)?
> 5.  Please share your thoughts on reliable crash/fault-resistant
> configuration options in the DC.
> We have a remote DR setup in a remote location right now. I would also
> like to know if there is a recommended way to bring the remote DR site
> up and running automatically, or how to automate failover to the DR
> site to meet the exact RTO and RPO.
> Any thoughts are most welcome.
> Regards,
> Krish
