Data Center Survival in case of Disaster / HW Failure in DC
KK CHN
kkchn.in at gmail.com
Fri May 6 09:07:44 UTC 2022
Thanks Eugen.
I fully agree with not running VMs on control nodes. When we rolled out
the controllers we could not afford to dedicate those machines to the
control role alone, because the controller hosts' resources would have been
underutilized, so we decided to use them as compute nodes as well.
On Thu, May 5, 2022 at 6:09 PM Eugen Block <eblock at nde.ag> wrote:
> Hi,
>
> first, I wouldn't run VMs on control nodes, that way you mix roles
> (control and compute) and in case that one control node fails the VMs
> are not available. That would not be the case if the control node is
> only a control node and is also part of a highly available control
> plane (like yours appears to be). Depending on how your control plane
> is defined, the failure of one control node should be tolerable.
> There has been some work on making compute nodes highly available but
> I don't know the current status.
Could you point out the links/docs that I can refer to for a proper setup?
> But in case a compute node fails but
> nova is still responsive a live or cold migration could still be
> possible to evacuate that host.
> If a compute node fails and is
> unresponsive you'll probably need some DB tweaking to revive VMs on a
> different node.
>
I don't know much about this; any reference is welcome.
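Just to confirm my understanding, is something like this the evacuation
workflow you mean? (The host names and VM UUIDs below are only placeholders
on my side.)

    # stop the scheduler from placing new VMs on the failing host
    openstack compute service set --disable <failing-host> nova-compute

    # host still responsive: live-migrate the VMs away (shared storage)
    nova host-evacuate-live <failing-host>
    # or per VM:
    nova live-migration <vm-uuid>

    # host already down and unreachable: rebuild the VMs elsewhere
    nova evacuate <vm-uuid>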
> So you should have some spare resources to be able to recover from a
> compute node failure.
> As for ceph it should be configured properly to sustain the loss of a
> defined number of nodes or disks, I don't know your requirements. If
> your current cluster has "only" 3 nodes you probably run replicated
> pools with size 3 (I hope) with min_size 2 and failure-domain host.
Do you mean 3 OSDs in a single compute node? If I follow this approach, is
it the best way to do so?
>
>
If there is any reference for the Ceph deployment model that provides the
best fault tolerance, kindly share it.
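To make sure I understand the replication settings, is something like this
what you mean for a 3-node cluster? ("volumes" is just a placeholder pool
name on my side.)

    # check the current replication settings of the pool
    ceph osd pool get volumes size        # replica count, should be 3
    ceph osd pool get volumes min_size    # should be 2

    # set them if needed
    ceph osd pool set volumes size 3
    ceph osd pool set volumes min_size 2

    # verify the failure domain of the CRUSH rule is "host", i.e. replicas
    # land on different nodes, not just different disks of the same node
    ceph osd crush rule dump
    ceph osd tree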
> You could sustain one node failure without clients noticing it, a
> second node would cause the cluster to pause. Also you don't have the
> possibility to recover from a node failure until it is up again,
> meaning the degraded PGs can't be recovered on a different node. So
> this also depends on your actual resiliency requirements. If you have
> a second site you could use rbd mirroring [1] to sync all rbd images
> between sites.
We have a connectivity link of only 1 Gbps between the DC and the DR site,
and the DR site is 300 miles away from the DC. How can we improve the
syncing? Our HDD write speeds are quite limited, maybe 80 Mbps to 100 Mbps,
and SSDs are not available for all compute host machines.
Each VM has an 800 GB to 1 TB disk.
Is there any best practice for improving the syncing mechanism for
HDD-backed hosts in the DC (with 1 Gbps connectivity to the DR site, which
also uses HDD hosts)?
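For reference, is this roughly the journal-based mirroring setup from [1]
that you have in mind? (Pool, image and peer names here are placeholders on
my side; I understand snapshot-based mirroring needs a newer Ceph release.)

    # on both clusters: enable per-image mirroring on the pool backing the VMs
    rbd mirror pool enable volumes image

    # peer the two clusters (older syntax; newer releases also offer
    # "rbd mirror pool peer bootstrap")
    rbd mirror pool peer add volumes client.mirror@<remote-cluster>

    # per image: journaling must be enabled, then mirroring
    rbd feature enable volumes/<image> journaling
    rbd mirror image enable volumes/<image>

    # run the rbd-mirror daemon at the DR site and watch the replication state
    rbd mirror pool status volumes --verbose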
> In case the primary site goes down entirely you could
> fail over by promoting the rbd images at the secondary site.
> So you see there is plenty of information to cover and careful
> planning is required.
>
> Regards,
> Eugen
>
Thanks again for sharing your thoughts.
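One more question on the failover you describe: would promoting the images
at the DR site look roughly like this? (Names are placeholders on my side.)

    # planned switchover: demote at the primary first, then promote at DR
    rbd mirror image demote volumes/<image>     # on the primary cluster
    rbd mirror image promote volumes/<image>    # on the DR cluster

    # unplanned failover, primary unreachable: force-promote at the DR site
    rbd mirror image promote --force volumes/<image>

    # check the replication state / lag per image
    rbd mirror image status volumes/<image>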
>
> [1] https://docs.ceph.com/en/latest/rbd/rbd-mirroring/
>
> Zitat von KK CHN <kkchn.in at gmail.com>:
>
> > List,
> >
> > We are having an old cloud setup with OpenStack Ussuri using Debian OS
> > (Qemu/KVM). I know it's very old and we can't upgrade to new versions
> > right now.
> >
> > The Deployment is as follows.
> >
> > A. 3 controller nodes (cum compute nodes; VMs are running on the
> > controllers too) in HA mode.
> >
> > B. 6 separate Compute nodes
> >
> > C. 3 separate storage nodes with Ceph RBD
> >
> > The questions are:
> >
> > 1. In case of a sudden hardware failure of one or more controller,
> > compute, or storage nodes, what immediate redundant recovery setup
> > needs to be employed?
> >
> > 2. In case of H/W failure our recovery needs to happen as soon as
> > possible, for example less than 30 minutes after the first failure
> > occurs.
> >
> > 3. Are there setup options like a hot standby or similar setups, or
> > what do we need to employ?
> >
> > 4. How do we meet the RTO (< 30 minutes of downtime) and RPO (from the
> > exact point of the crash, all applications and data must be
> > consistent)?
> >
> > 5. Please share your thoughts on reliable crash/fault-resistant
> > configuration options in the DC.
> >
> >
> > We have a DR setup right now in a remote location. I would also like
> > to know whether there is a recommended way to bring the remote DR site
> > up and running automatically, or how to automate failover of services
> > to the DR site to meet the exact RTO and RPO.
> >
> > Any thoughts are most welcome.
> >
> > Regards,
> > Krish