<div dir="ltr"><div dir="ltr"><br></div><div>Thanks, Eugen.<br></div><div>I fully agree with not running VMs on control nodes. When we rolled out the controllers, we couldn't spare those hosts to act only as controllers because we wanted to make use of their resources, so we decided to use them as compute nodes as well.<br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, May 5, 2022 at 6:09 PM Eugen Block <<a href="mailto:eblock@nde.ag">eblock@nde.ag</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi,<br>

<br>

first, I wouldn't run VMs on control nodes, that way you mix roles  <br>

(control and compute) and in case that one control node fails the VMs  <br>

are not available. That would not be the case if the control node is  <br>

only a control node and is also part of a highly available control  <br>

plane (like yours appears to be). Depending on how your control plane  <br>

is defined, the failure of one control node should be tolerable.<br>

There has been some work on making compute nodes highly available but  <br>

I don't know the current status.</blockquote><div> </div><div>Could you point me to links or docs I can refer to for a proper setup?<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> But in case a compute node fails but  <br>

nova is still responsive a live or cold migration could still be  <br>

possible to evacuate that host. </blockquote><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">If a compute node fails and is  <br>

unresponsive you'll probably need some DB tweaking to revive VMs on a  <br>

different node.<br></blockquote><div>I don't know much about this; any reference is welcome.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

So you should have some spare resources to be able to recover from a  <br>

compute node failure.<br>

As for ceph it should be configured properly to sustain the loss of a  <br>

defined number of nodes or disks, I don't know your requirements. If  <br>

your current cluster has "only" 3 nodes you probably run replicated  <br>

pools with size 3 (I hope) with min_size 2 and failure-domain host.</blockquote><div>Do you mean 3 OSDs in a single compute node? I can do it that way, but is it the best approach?<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">  <br></blockquote><div>If you have a reference for the Ceph deployment model that gives the best fault tolerance, kindly share it.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

You could sustain one node failure without clients noticing it, a  <br>

second node would cause the cluster to pause. Also you don't have the  <br>

possibility to recover from a node failure until it is up again,  <br>

meaning the degraded PGs can't be recovered on a different node. So  <br>

this also depends on your actual resiliency requirements. If you have  <br>

a second site you could use rbd mirroring [1] to sync all rbd images  <br>

between sites. </blockquote><br></div><div class="gmail_quote">We have only a 1 Gbps link between DC and DR, and the DR site is 300 miles away from the DC. How can we improve syncing over it? Our HDD write speeds are limited, maybe 80 Mbps to 100 Mbps, and SSDs are not available for all compute hosts.</div><div class="gmail_quote">Each VM has a disk of 800 GB to 1 TB.<br></div><div class="gmail_quote"><div><br></div><div>Is there any best practice for improving sync performance for HDD-backed hosts in the DC, given a 1 Gbps link to a DR site that also uses HDD hosts?<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">In case the primary site goes down entirely you could  <br>

switch to the secondary site by promoting the rbd images.<br>

So you see there is plenty of information to cover and careful  <br>

planning is required.<br>

<br>

Regards,<br>

Eugen<br></blockquote><div> </div><div>Thanks again for sharing your thoughts.  <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

[1] <a href="https://docs.ceph.com/en/latest/rbd/rbd-mirroring/" rel="noreferrer" target="_blank">https://docs.ceph.com/en/latest/rbd/rbd-mirroring/</a><br>

<br>

Zitat von KK CHN <<a href="mailto:kkchn.in@gmail.com" target="_blank">kkchn.in@gmail.com</a>>:<br>

<br>

> List,<br>

><br>

> We have an old cloud setup running OpenStack Ussuri on Debian<br>

> (QEMU/KVM). I know it's very old, but we can't upgrade to newer versions<br>

> right now.<br>

><br>

> The deployment is as follows.<br>

><br>

> A.    3 Controller nodes (cum compute nodes; VMs are running on the controllers<br>

> too) in HA mode.<br>

><br>

> B.   6 separate Compute nodes<br>

><br>

> C.    3 separate Storage nodes with Ceph RBD<br>

><br>

> Questions:<br>

><br>

> 1.  In case of a sudden hardware failure of one or more controller,<br>

> compute, or storage nodes, what redundant setup should be employed<br>

> for immediate recovery?<br>

><br>

> 2.  In case of a H/W failure, our recovery needs to be as fast as possible, for<br>

> example less than 30 minutes after the first failure occurs.<br>

><br>

> 3.  Are there setup options like a hot standby or similar, or what else do we<br>

> need to employ?<br>

><br>

> 4. All of this must meet the RTO (< 30 minutes of downtime) and RPO (from the exact point<br>

> of the crash, all applications and data must be consistent).<br>

><br>

> 5. Please share your thoughts on reliable crash/fault-resistant<br>

> configuration options in the DC.<br>

><br>

><br>

> We have a DR setup right now in a remote location. I would also<br>

> like to know if there is a recommended way to bring the remote DR site<br>

> up and running automatically, or how to automate the service from the DR site<br>

> to meet the exact RTO and RPO.<br>

><br>

> Any thoughts are most welcome.<br>

><br>

> Regards,<br>

> Krish<br>

<br>

<br>

<br>

<br>

</blockquote></div></div>
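<div>For the DC-to-DR sync question above, a rough back-of-the-envelope estimate may help frame what a 1 Gbps link and HDD-backed hosts can do for a first full copy of one VM image. This is only a sketch: the function name is made up for illustration, it ignores protocol overhead, latency, and concurrent client I/O, and it is an assumption whether "80 Mbps" in the message means megabytes or megabits per second, so both readings are computed.</div>

```python
def full_sync_hours(image_bytes, link_bits_per_s, disk_bytes_per_s):
    """Hours to copy one image, limited by the slower of link and disk.

    Hypothetical helper for illustration; not part of Ceph or OpenStack.
    """
    link_bytes_per_s = link_bits_per_s / 8
    bottleneck = min(link_bytes_per_s, disk_bytes_per_s)
    return image_bytes / bottleneck / 3600


TB = 10**12          # decimal terabyte, in bytes
GBIT = 10**9         # gigabit, in bits
MEGA = 10**6

# Reading 1 (assumption): "80 Mbps" means 80 megabytes per second.
hours_if_megabytes = full_sync_hours(1 * TB, 1 * GBIT, 80 * MEGA)

# Reading 2 (assumption): "80 Mbps" means 80 megabits per second (10 MB/s).
hours_if_megabits = full_sync_hours(1 * TB, 1 * GBIT, 80 * MEGA / 8)

print(f"1 TB image, disk at 80 MB/s:   {hours_if_megabytes:.1f} h")
print(f"1 TB image, disk at 80 Mbit/s: {hours_if_megabits:.1f} h")
```

<div>Under either reading the disk, not the 1 Gbps link (about 125 MB/s at best), is the bottleneck for the first full copy of a 1 TB image. Whether ongoing replication keeps up afterwards depends on the VMs' write rate, which the figures in this thread do not tell us.</div>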