Data Center Survival in case of Disaster / HW Failure in DC

Tim Bell tim.bell at cern.ch
Thu May 5 14:46:25 UTC 2022


Interesting - we’re starting work on exactly the same analysis at the moment.

We’re looking at a separate region for the recovery site, which guarantees there are no shared dependencies in the control plane.

Ideally, we’d be running active/active for the most critical applications (following the AWS recommendations at https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-options-in-the-cloud.html), but there are some issues we’re still working through, such as how to replicate block/object stores between regions.

Keeping images/projects in sync between regions also does not seem simple, especially where you want different quotas (e.g. 100 cores in the production site but only 10 by default in the recovery site).
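
To make the sync problem concrete, a cross-region check could be sketched roughly like this. This is illustrative Python only, not a real tool: the dicts stand in for what you would fetch from the Glance and quota APIs in each region, and all the names are invented for the example.

```python
# Illustrative sketch only: diff image names and quotas between a production
# and a recovery region. The dicts stand in for API results; names are invented.
def region_diff(prod: dict, recovery: dict) -> dict:
    """Report what the recovery region is missing or has set differently."""
    missing_images = sorted(set(prod["images"]) - set(recovery["images"]))
    quota_drift = {
        name: (value, recovery["quotas"].get(name))
        for name, value in prod["quotas"].items()
        if recovery["quotas"].get(name) != value
    }
    return {"missing_images": missing_images, "quota_drift": quota_drift}

prod = {"images": ["cc7-base", "alma9-base"], "quotas": {"cores": 100}}
recovery = {"images": ["cc7-base"], "quotas": {"cores": 10}}
diff = region_diff(prod, recovery)
# diff["missing_images"] == ["alma9-base"]
# diff["quota_drift"] == {"cores": (100, 10)}
```

A real report would additionally have to separate intentional differences (like the deliberately lower recovery-site quota) from accidental drift.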

As in any DR plan, testing is key: we’ve started looking at security groups as a way to run a simulated disconnect test and see what’s not yet in place at the recovery site.

Does anyone have best-practice recommendations or tools for OpenStack disaster recovery?

Cheers
Tim

> On 5 May 2022, at 14:37, Eugen Block <eblock at nde.ag> wrote:
> 
> Hi,
> 
> first, I wouldn't run VMs on control nodes: that mixes the control and compute roles, and if one control node fails, its VMs become unavailable. That would not be the case if the control node is purely a control node and part of a highly available control plane (as yours appears to be). Depending on how your control plane is defined, the failure of one control node should be tolerable.
> There has been some work on making compute nodes highly available, but I don't know its current status. If a compute node fails while nova is still responsive, a live or cold migration may still be possible to evacuate that host. If a compute node fails and is unresponsive, you'll probably need some DB tweaking to revive its VMs on a different node.
> So you should keep some spare resources to be able to recover from a compute node failure.
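
On the spare-resources point: a rough way to reason about whether the remaining hosts can absorb a failed node is a toy first-fit check. This is illustrative Python, not nova code; the host names and vCPU figures are invented.

```python
# Toy capacity check (not nova code): can the VMs of a failed compute node
# (given as a list of vCPU counts) be rehomed onto the spare vCPU capacity
# of the surviving hosts? Greedy first-fit, largest VMs first.
def can_evacuate(failed_host_vms: list[int], spare_by_host: dict[str, int]) -> bool:
    spare = dict(spare_by_host)  # copy so the caller's dict is untouched
    for vcpus in sorted(failed_host_vms, reverse=True):
        target = next((h for h, free in spare.items() if free >= vcpus), None)
        if target is None:
            return False  # no surviving host has room for this VM
        spare[target] -= vcpus
    return True

# Example: a failed host ran VMs of 4, 2 and 2 vCPUs; two hosts have 4 spare each.
assert can_evacuate([4, 2, 2], {"compute2": 4, "compute3": 4})
# An 8-vCPU VM cannot fit on any single surviving host.
assert not can_evacuate([8], {"compute2": 4, "compute3": 4})
```

In practice RAM, disk and anti-affinity constraints matter too, but the same headroom reasoning applies.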
> As for Ceph, it should be configured to sustain the loss of a defined number of nodes or disks; I don't know your requirements. If your current cluster has "only" 3 nodes, you probably run replicated pools with size 3 (I hope), min_size 2 and failure-domain host. You could sustain one node failure without clients noticing it; a second failed node would cause the cluster to pause. You also have no way to recover from a node failure until that node is back up, meaning the degraded PGs can't be rebuilt on a different node, so this again depends on your actual resiliency requirements. If you have a second site you could use rbd mirroring [1] to sync all rbd images between sites. In case the primary site goes down entirely, you could fail over by promoting the rbd images on the secondary site.
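
As an aside, the size/min_size arithmetic described above can be captured in a toy model (illustrative Python, not Ceph code; the state names only approximate real PG states):

```python
# Toy model (not Ceph code): rough availability of a replicated PG after
# whole-host failures, assuming failure-domain = host and one replica per host.
def pg_state(size: int, min_size: int, failed_hosts: int) -> str:
    surviving = size - failed_hosts
    if surviving >= size:
        return "active+clean"
    if surviving >= min_size:
        return "active+degraded"   # I/O continues with reduced redundancy
    if surviving > 0:
        return "inactive"          # below min_size: Ceph pauses client I/O
    return "lost"

# With size=3 / min_size=2 on a 3-node cluster:
assert pg_state(3, 2, 0) == "active+clean"
assert pg_state(3, 2, 1) == "active+degraded"  # clients keep working
assert pg_state(3, 2, 2) == "inactive"         # cluster pauses
```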
> So you see there is plenty of information to cover and careful planning is required.
> 
> Regards,
> Eugen
> 
> [1] https://docs.ceph.com/en/latest/rbd/rbd-mirroring/
> 
> Zitat von KK CHN <kkchn.in at gmail.com>:
> 
>> List,
>> 
>> We are running an old cloud setup with OpenStack Ussuri on Debian
>> (Qemu/KVM). I know it's very old and we can't upgrade to newer versions
>> right now.
>> 
>> The  Deployment is as follows.
>> 
>> A.    3 controller nodes in HA mode (also acting as compute nodes; VMs run
>> on the controllers too)
>> 
>> B.   6 separate Compute nodes
>> 
>> C.    3 separate storage nodes with Ceph RBD
>> 
>> The questions are:
>> 
>> 1.  In case of a sudden hardware failure of one or more controller,
>> compute or storage nodes, what immediate redundant recovery setup needs
>> to be employed?
>> 
>> 2.  In case of H/W failure, our recovery needs to happen as soon as
>> possible, for example less than 30 minutes after the first failure occurs.
>> 
>> 3.  Are there setup options like a hot standby or similar, or what do we
>> need to employ?
>> 
>> 4.  How do we meet the RTO (< 30 minutes downtime) and RPO (from the
>> exact point of the crash, all applications and data must be consistent)?
>> 
>> 5.  Please share your thoughts on reliable crash/fault-resistant
>> configuration options in the DC.
>> 
>> 
>> We currently have a DR setup in a remote location. I would also like to
>> know if there is a recommended way to bring the remote DR site up and
>> running automatically, or how to automate failover to the DR site so as
>> to meet the exact RTO and RPO.
>> 
>> Any thoughts are most welcome.
>> 
>> Regards,
>> Krish


