<div dir="ltr"><div dir="ltr"><br></div><div>Thanks Eugen .<br></div><div>I fully agree with not running VMs on Control nodes. When we rolled out the Controller resources we couldn't spare out only as a controller, Becoz of the utilization of the controller host machines resources, so we decided to use them as compute nodes also. <br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, May 5, 2022 at 6:09 PM Eugen Block <<a href="mailto:eblock@nde.ag">eblock@nde.ag</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi,<br>
<br>
> first, I wouldn't run VMs on control nodes; that way you mix roles
> (control and compute), and if one control node fails, the VMs on it
> are not available. That would not be the case if the control node is
> only a control node and is part of a highly available control
> plane (like yours appears to be). Depending on how your control plane
> is defined, the failure of one control node should be tolerable.
> There has been some work on making compute nodes highly available, but
> I don't know the current status.

Could you point out links/docs that I can refer to for a proper setup?

> But in case a compute node fails but
> nova is still responsive, a live or cold migration could still be
> possible to evacuate that host.
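For reference, evacuating a still-responsive host could look roughly like this ("myvm" is a placeholder server name, and exact flags vary between client versions, so treat it as a sketch):

  # live migration: the VM keeps running and the scheduler picks a target host
  openstack server migrate --live-migration myvm

  # cold migration: the VM is stopped, moved, and must then be confirmed
  openstack server migrate myvm
  openstack server resize confirm myvm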
> If a compute node fails and is
> unresponsive, you'll probably need some DB tweaking to revive VMs on a
> different node.

Don't know much about this; any reference is welcome.
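One sequence that avoids editing the database by hand (sketched with placeholder names, assuming the instance disks live in Ceph rather than on the dead hypervisor, and a compute API recent enough to support --down) is to force the dead service down and rebuild the instance elsewhere:

  # mark the nova-compute service on the dead host as disabled and down
  openstack compute service set --disable --down failed-host nova-compute

  # rebuild the instance on another host; its rbd-backed disk survives
  nova evacuate myvm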
> So you should have some spare resources to be able to recover from a
> compute node failure.
> As for Ceph, it should be configured properly to sustain the loss of a
> defined number of nodes or disks; I don't know your requirements. If
> your current cluster has "only" 3 nodes, you probably run replicated
> pools with size 3 (I hope), with min_size 2 and failure domain "host".

Do you mean 3 OSDs in a single compute node? I can follow that approach; is it the best way to do so?

Any reference to the Ceph deployment model that gives the best fault tolerance would be appreciated; kindly share.
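To see what the cluster is actually configured with, something like the following should work ("volumes" is a placeholder for the RBD pool name):

  # replication settings of the pool
  ceph osd pool get volumes size
  ceph osd pool get volumes min_size

  # confirm the CRUSH rule spreads replicas across distinct hosts
  ceph osd crush rule dump replicated_rule

  # OSD/host layout at a glance
  ceph osd tree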
> You could sustain one node failure without clients noticing it; a
> second failed node would cause the cluster to pause. Also, you don't
> have the possibility to recover from a node failure until the node is
> up again, meaning the degraded PGs can't be recovered on a different
> node. So this also depends on your actual resiliency requirements. If
> you have a second site you could use rbd mirroring [1] to sync all
> rbd images between sites.

We have a connectivity link of only 1 Gbps between the DC and DR, and the DR site is 300 miles away from the DC. How can we achieve that syncing enhancement? Our HDD write speeds are quite limited, maybe 80 Mbps to 100 Mbps, and SSDs are not available for all compute host machines. Each VM has an 800 GB to 1 TB disk.

Is there any best practice for sync enhancement mechanisms for HDD-backed DC hosts with 1 Gbps connectivity to the DR site?
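(For a rough sense of scale: even at the full 1 Gbps, about 125 MB/s, a single 800 GB image takes roughly 800,000 MB / 125 MB/s ≈ 6,400 s, close to two hours, to copy in full; continuous incremental mirroring rather than repeated full copies is therefore the realistic option.) Assuming a Ceph release new enough for the peer bootstrap workflow (Octopus or later) and an rbd-mirror daemon running at the DR site, enabling journal-based mirroring for a whole pool looks roughly like this (pool and site names are placeholders):

  # on the primary cluster: mirror every image in the pool
  # (journal-based mirroring also needs the exclusive-lock and
  # journaling image features enabled on the images)
  rbd mirror pool enable volumes pool

  # exchange peer credentials: create the token on the primary,
  # import it on the DR cluster
  rbd mirror pool peer bootstrap create --site-name dc volumes > token
  rbd mirror pool peer bootstrap import --site-name dr volumes token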
> In case the primary site goes down entirely, you could
> switch to the secondary site by promoting the rbd images there.
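The failover step itself would then be something like this (placeholder pool name again; --force is needed because the dead primary cannot demote its images first):

  # on the DR cluster, promote all mirrored images in the pool
  rbd mirror pool promote --force volumes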
> So you see, there is plenty of information to cover, and careful
> planning is required.
>
> Regards,
> Eugen

Thanks again for sharing your thoughts.

> [1] https://docs.ceph.com/en/latest/rbd/rbd-mirroring/
>
> Quoting KK CHN <kkchn.in@gmail.com>:
>
> > List,
> >
> > We are having an old cloud setup with OpenStack Ussuri using Debian OS
> > (QEMU/KVM). I know it's very old and we can't upgrade to new versions
> > right now.
> >
> > The deployment is as follows:
> >
> > A. 3 controller nodes (cum compute nodes; VMs are running on the
> > controllers too) in HA mode.
> >
> > B. 6 separate compute nodes.
> >
> > C. 3 separate storage nodes with Ceph RBD.
> >
> > The questions are:
> >
> > 1. In case of a sudden hardware failure of one or more controller,
> > compute, or storage nodes, what immediate redundant recovery setup
> > needs to be employed?
> >
> > 2. In case of H/W failure, our recovery needs to be as fast as
> > possible, for example less than 30 minutes after the first failure
> > occurs.
> >
> > 3. Are there setup options like a hot standby or similar, or what do
> > we need to employ?
> >
> > 4. How do we meet our RTO (< 30 minutes downtime) and RPO (from the
> > exact point of the crash, all applications and data must be
> > consistent)?
> >
> > 5. Please share your thoughts on reliable crash/fault-resistant
> > configuration options in the DC.
> >
> > We have a remote DR setup right now in a remote location. I would
> > also like to know if there is a recommended way to bring the remote
> > DR site up and running automatically, or how to automate the failover
> > to the DR site to meet the exact RTO and RPO.
> >
> > Any thoughts are most welcome.
> >
> > Regards,
> > Krish