Data Center Survival in case of Disaster / HW Failure in DC

KK CHN kkchn.in at gmail.com
Fri May 6 09:07:44 UTC 2022


Thanks, Eugen.
I fully agree with not running VMs on control nodes. When we rolled out the
controllers we couldn't afford to dedicate those machines to the controller
role only, because their resources would have been underutilized, so we
decided to use them as compute nodes as well.

On Thu, May 5, 2022 at 6:09 PM Eugen Block <eblock at nde.ag> wrote:

> Hi,
>
> first, I wouldn't run VMs on control nodes, that way you mix roles
> (control and compute) and in case that one control node fails the VMs
> are not available. That would not be the case if the control node is
> only a control node and is also part of a highly available control
> plane (like yours appears to be). Depending on how your control plane
> is defined, the failure of one control node should be tolerable.
> There has been some work on making compute nodes highly available but
> I don't know the current status.


Could you point me to the links/docs I can refer to for a proper setup?


> But in case a compute node fails but
> nova is still responsive a live or cold migration could still be
> possible to evacuate that host.
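
Just to check my understanding of the planned-evacuation case: while nova is
still responsive, would the procedure roughly be the following? (The host name
is only an example.)

    # stop scheduling new VMs to the affected host
    openstack compute service set --disable compute-03 nova-compute
    # live-migrate all instances off that host
    nova host-evacuate-live compute-03
    # or per instance; cold migration if live migration is not possible
    nova live-migration <instance-uuid>
    nova migrate <instance-uuid>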



> If a compute node fails and is
> unresponsive you'll probably need some DB tweaking to revive VMs on a
> different node.
>
I don't know much about this; any reference is welcome.
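
From what I can gather from the docs (untested on our side), for an
unresponsive compute node with Ceph-backed (shared storage) instances it
would be roughly:

    # mark the dead host's compute service as disabled
    openstack compute service set --disable compute-03 nova-compute
    # rebuild a single instance on another host from its shared-storage disk
    nova evacuate <instance-uuid>
    # or evacuate every instance that was running on the dead host
    nova host-evacuate compute-03

Please correct me if the DB tweaking you mentioned goes beyond this.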


> So you should have some spare resources to be able to recover from a
> compute node failure.
> As for ceph it should be configured properly to sustain the loss of a
> defined number of nodes or disks, I don't know your requirements. If
> your current cluster has "only" 3 nodes you probably run replicated
> pools with size 3 (I hope) with min_size 2 and failure-domain host.

Do you mean 3 OSDs in a single compute node? I can follow this approach; is
it the best way to do so?

>
>
If there is any reference for a Ceph deployment model that gives the best
fault tolerance, kindly share it.
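
If I understood you correctly, I could verify our current pool settings with
something like the following (the pool name is only an example, please
correct me if I'm off):

    # current replication settings of a pool
    ceph osd pool get vms size
    ceph osd pool get vms min_size
    # set replicated size 3 with min_size 2
    ceph osd pool set vms size 3
    ceph osd pool set vms min_size 2
    # confirm the CRUSH rule places replicas on different hosts, not just OSDs
    ceph osd crush rule dump replicated_rule
    ceph osd tree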


> You could sustain one node failure without clients noticing it, a
> second node would cause the cluster to pause. Also you don't have the
> possibility to recover from a node failure until it is up again,
> meaning the degraded PGs can't be recovered on a different node. So
> this also depends on your actual resiliency requirements. If you have
> a second site you could use rbd mirroring [1] to sync all rbd images
> between sites.


We have a connectivity link of only 1 Gbps between DC and DR, and the DR site
is 300 miles away from the DC. How can we achieve the syncing enhancement?
Our HDD write speeds are quite limited, maybe 80 Mbps to 100 Mbps, and SSDs
are not available for all compute host machines.
Each VM has an 800 GB to 1 TB disk.

Is there any best practice or mechanism to improve syncing for HDD-backed
hosts in the DC (with 1 Gbps connectivity between the DC and DR sites, both
with HDD hosts)?
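
For the rbd mirroring you mention, my rough understanding (pool and site
names below are only examples, and this assumes our Ceph release supports the
bootstrap-token and snapshot-based workflow) is:

    # on both clusters: enable per-image mirroring for the pool
    rbd mirror pool enable vms image
    # on the primary (DC) cluster: create a bootstrap token
    rbd mirror pool peer bootstrap create --site-name dc vms > token
    # on the secondary (DR) cluster: import it (rbd-mirror daemon runs here)
    rbd mirror pool peer bootstrap import --site-name dr vms token
    # enable mirroring per image; snapshot mode syncs periodically,
    # which might suit our 1 Gbps link better than journal mode
    rbd mirror image enable vms/<image-name> snapshot

Does that look right, and would snapshot-based mirroring be the better fit
for our bandwidth?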


> In case the primary site goes down entirely you could
> switch to the primary site by promoting the rbd images.
> So you see there is plenty of information to cover and careful
> planning is required.
>
> Regards,
> Eugen
>

Thanks again for sharing your thoughts.
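
Just to make sure I understand the failover step: once the primary site is
confirmed down, would the promotion on the DR cluster be roughly this (pool
name again only an example)?

    # on the DR cluster, force-promote the mirrored images
    rbd mirror pool promote --force vms
    # or per image
    rbd mirror image promote --force vms/<image-name>
    # later, when the old primary is back: demote it there and resync
    rbd mirror image demote vms/<image-name>
    rbd mirror image resync vms/<image-name>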

>
> [1] https://docs.ceph.com/en/latest/rbd/rbd-mirroring/
>
> Zitat von KK CHN <kkchn.in at gmail.com>:
>
> > List,
> >
> > We are running an old cloud setup with OpenStack Ussuri on Debian OS
> > (QEMU/KVM). I know it's very old and we can't upgrade to new versions
> > right now.
> >
> > The  Deployment is as follows.
> >
> > A.    3 controller nodes (cum compute nodes; VMs are running on the
> > controllers too) in HA mode.
> >
> > B.   6 separate Compute nodes
> >
> > C.    3 separate Storage nodes with Ceph RBD
> >
> > Question is
> >
> > 1.  In case of a sudden hardware failure of one or more controller nodes,
> > compute nodes, or storage nodes, what immediate redundant recovery setup
> > needs to be employed?
> >
> > 2.  In case of H/W failure, our recovery needs to happen as soon as
> > possible, for example less than 30 minutes after the first failure occurs.
> >
> > 3.  Are there setup options like a hot standby or similar setups, or what
> > do we need to employ?
> >
> > 4.  How do we meet both the RTO (< 30 minutes downtime) and the RPO (from
> > the exact point of crash, all applications and data must be consistent)?
> >
> > 5.  Please share your thoughts on reliable crash/fault-resistant
> > configuration options in the DC.
> >
> >
> > We have a DR setup right now in a remote location. I would also like to
> > know if there is a recommended way to bring the remote DR site up and
> > running automatically, or how to automate failover to the DR site to meet
> > the exact RTO and RPO.
> >
> > Any thoughts are most welcome.
> >
> > Regards,
> > Krish
>
>
>
>
>