Data Center Survival in case of Disaster / HW Failure in DC

CHALANSONNET Stéphane (Acoss) stephane.chalansonnet at acoss.fr
Thu May 5 14:31:09 UTC 2022


Hello,

First, OpenStack isn't VMware. Critical applications need to be designed to be multi-region/multi-AZ aware.
The best way is two regions, with two control planes (Keystone and a Galera cluster on each, not shared), two storage systems, and the application split across both. That is the best practice 😉
You can also do it with two AZs, but then the control plane (Galera cluster and RabbitMQ) is shared between the two AZs.
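If you do run the shared two-AZ control plane, one common mitigation (my suggestion, not something required by the setup above) is to put a Galera arbitrator (garbd) on a small third site so the cluster keeps a majority vote when one AZ drops. A minimal sketch, with placeholder host names:

```ini
# /etc/mysql/conf.d/galera.cnf -- illustrative node list only
[mysqld]
wsrep_provider = /usr/lib/galera/libgalera_smm.so
# two data nodes per AZ; garbd on a third, tie-breaker site
wsrep_cluster_address = gcomm://db1.az1.example,db2.az1.example,db1.az2.example,db2.az2.example
wsrep_cluster_name = openstack_galera
```

garbd joins as a voting member without storing any data, e.g. `garbd --group openstack_galera --address gcomm://db1.az1.example:4567`.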

If you are trying to do the same thing as a stretched VMware cluster on a dual site, it's complicated; if you have 3 sites (low latency, <1 ms between them), you can!
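The three-site requirement comes straight from majority quorum. A tiny self-contained sketch (plain Python, nothing OpenStack-specific) of the arithmetic:

```python
def has_quorum(alive: int, total: int) -> bool:
    """Galera / Ceph-monitor style clusters keep quorum only with a strict majority."""
    return alive > total // 2

# Two sites, 3 nodes split 2 + 1: losing the 2-node site leaves 1 of 3 -> no quorum.
assert not has_quorum(alive=1, total=3)

# Three sites, one node each: losing any single site leaves 2 of 3 -> quorum holds.
assert has_quorum(alive=2, total=3)
```

This is also why an even node count buys nothing: 2 of 4 alive is a tie, not a majority.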

1. In case of a sudden hardware failure of one or more controller, compute, or storage nodes, what immediate redundant recovery setup needs to be employed?
=> You need to split the control plane across multiple AZs (3 is best for the Galera cluster and RabbitMQ; 2 is possible if you have VMware HA underneath).
=> Neutron network nodes work with keepalived, so you also need to split them across two AZs/regions. The DHCP and metadata services also need to be redundant across multiple network nodes.
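For that DHCP/metadata redundancy, Neutron already has scheduler knobs; a sketch of the relevant neutron.conf settings (values are illustrative, pick what fits your agent count):

```ini
# neutron.conf -- run each network resource on more than one agent
[DEFAULT]
# schedule every network onto two DHCP agents (place them in different AZs)
dhcp_agents_per_network = 2
# create HA (VRRP/keepalived) routers by default, spread over two L3 agents
l3_ha = true
max_l3_agents_per_router = 2
```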
=> Compute: OpenStack instance HA (like VMware HA) seems to work since Wallaby with Masakari. Otherwise, you have to do the job yourself if you need to fail over instances.
Storage is also the biggest problem. You need stretched storage that works with Cinder: Ceph can do the job, but you also need three sites (the monitor quorum needs a majority of the monitors alive, e.g. 2 of 3). The NetApp solution seems to work well (Trident, NFS).
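Concretely for Ceph, that means one monitor per site, so any single site can fail while 2 of 3 monitors keep quorum. A sketch of the layout with placeholder host names (recent Ceph releases also offer a formal two-site-plus-tie-breaker "stretch mode" via `ceph mon enable_stretch_mode`):

```ini
# ceph.conf -- one monitor per site so any single site can fail
[global]
mon_initial_members = mon-a, mon-b, mon-c
mon_host = mon-a.site1.example, mon-b.site2.example, mon-c.site3.example
```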

2.  In case of H/W failure, our recovery needs to be as fast as possible, for example less than 30 minutes after the first failure occurs.
The Galera cluster and RabbitMQ need to be checked first.
With two sites, 30 minutes is a real challenge; with three sites it should be fine!
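As a concrete example of that first check: a Galera node is usable only if it reports a Primary component and is synced with it. A small sketch that evaluates the values returned by `SHOW STATUS LIKE 'wsrep_%'` (the sample values are illustrative); for RabbitMQ the equivalent first look is `rabbitmqctl cluster_status`:

```python
def galera_is_healthy(status: dict) -> bool:
    """Healthy = part of a Primary component, synced, and not alone in the cluster."""
    return (
        status.get("wsrep_cluster_status") == "Primary"
        and status.get("wsrep_local_state_comment") == "Synced"
        and int(status.get("wsrep_cluster_size", 0)) >= 2
    )

# Illustrative values, as returned by: mysql -e "SHOW STATUS LIKE 'wsrep_%'"
sample = {
    "wsrep_cluster_status": "Primary",
    "wsrep_local_state_comment": "Synced",
    "wsrep_cluster_size": "3",
}
assert galera_is_healthy(sample)
```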

3.  Is there setup options like a hot standby or similar setups or what  we need to employ ?
2 AZs, 2 storage systems, and the application split across both is the best practice 😉

4. How to meet the RTO (< 30 minutes downtime) and RPO (from the exact point of crash, all applications and data must be consistent)?


Regards,
Stéphane Chalansonnet




-----Original Message-----
From: openstack-discuss-request at lists.openstack.org <openstack-discuss-request at lists.openstack.org>
Sent: Thursday, 5 May 2022 11:56
To: openstack-discuss at lists.openstack.org
Subject: openstack-discuss Digest, Vol 43, Issue 12



Today's Topics:

   1. Data Center Survival in case of Disaster / HW Failure in DC
      (KK CHN)
   2. [nova][placement] Incomplete Consumers return negative value
      after upgrade (Jan Wasilewski)
   3. Re: [all][tc][Release Management] Improvements in project
      governance (Slawek Kaplonski)


----------------------------------------------------------------------

Message: 1
Date: Thu, 5 May 2022 14:46:11 +0530
From: KK CHN <kkchn.in at gmail.com>
To: openstack-discuss at lists.openstack.org
Subject: Data Center Survival in case of Disaster / HW Failure in DC
Message-ID:
	<CAKgGyB_UmDG+TknDaYLGAuyOnxY8ipSF3qC4tS90T0zHzu6fZg at mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

List,

We have an old cloud setup with OpenStack Ussuri on Debian (QEMU/KVM). I know it is very old and we can't upgrade to newer versions right now.

The  Deployment is as follows.

A.    3 controller nodes (also acting as compute nodes; VMs run on the controllers too) in HA mode.

B.    6 separate compute nodes

C.    3 separate storage nodes with Ceph RBD

My questions are:

1.  In case of a sudden hardware failure of one or more controller, compute, or storage nodes, what immediate redundant recovery setup needs to be employed?

2.  In case of H/W failure, our recovery needs to be as fast as possible, for example less than 30 minutes after the first failure occurs.

3.  Are there setup options like a hot standby or similar setups, or what do we need to employ?

4. How to meet the RTO (< 30 minutes downtime) and RPO (from the exact point of crash, all applications and data must be consistent)?

5. Please share your thoughts on reliable crash/fault-resistant configuration options in the DC.


We have a remote DR setup in a remote location right now. I would also like to know if there is a recommended way to bring the remote DR site up and running automatically, or how to automate failover to the DR site to meet the exact RTO and RPO.

Any thoughts are most welcome.

Regards,
Krish

------------------------------

Message: 2
Date: Thu, 5 May 2022 11:31:16 +0200
From: Jan Wasilewski <finarffin at gmail.com>
To: openstack-discuss at lists.openstack.org
Subject: [nova][placement] Incomplete Consumers return negative value
	after upgrade
Message-ID:
	<CAN4DDNgG=i8tGpPQW2PUXVUp72T9r4CWDdxmfRg-bX8T1eE5AA at mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi,

after an upgrade from Stein to Train, I hit an issue with a negative value during the upgrade check for placement:

# placement-status upgrade check
+------------------------------------------------------------------+
| Upgrade Check Results                                            |
+------------------------------------------------------------------+
| Check: Missing Root Provider IDs                                 |
| Result: Success                                                  |
| Details: None                                                    |
+------------------------------------------------------------------+
| Check: Incomplete Consumers                                      |
| Result: Warning                                                  |
| Details: There are -20136 incomplete consumers table records for |
|          existing allocations. Run the "placement-manage db      |
|          online_data_migrations" command.                        |
+------------------------------------------------------------------+


It seems the negative value comes from the counts I get from the consumers and allocations tables:

mysql> select count(id), consumer_id from allocations group by consumer_id;
...
1035 rows in set (0.00 sec)

mysql> select count(*) from consumers;
+----------+
| count(*) |
+----------+
|    21171 |
+----------+
1 row in set (0.04 sec)

Unfortunately, this warning cannot be resolved by running the suggested command (placement-manage db online_data_migrations), as it seems to add records to the consumers table, not to allocations, which looks like the problem here. I was following the recommendations from this discussion:
http://lists.openstack.org/pipermail/openstack-discuss/2020-November/018536.html
but unfortunately it doesn't solve the issue (it doesn't even change the negative value). I'm just wondering whether I skipped something important and whether you can suggest some (obvious?) solution.

Thank you in advance for your time and help.
Best regards,
Jan

------------------------------

Message: 3
Date: Thu, 05 May 2022 11:55:32 +0200
From: Slawek Kaplonski <skaplons at redhat.com>
To: openstack-discuss at lists.openstack.org
Cc: Előd Illés <elod.illes at est.tech>
Subject: Re: [all][tc][Release Management] Improvements in project
	governance
Message-ID: <3198538.44csPzL39Z at p1>
Content-Type: text/plain; charset="utf-8"

Hi,

On Wednesday, 20 April 2022 at 13:19:10 CEST, Előd Illés wrote:
> Hi,
> 
> At the very same time at the PTG we discussed this on the Release 
> Management session [1] as well. To release deliverables without 
> significant content is not ideal and this came up in previous 
> discussions as well. On the other hand unfortunately this is the most 
> feasible solution from release management team perspective especially 
> because the team is quite small (new members are welcome! feel free to 
> join the release management team! :)).
> 
> To change to independent release model is an option for some cases, 
> but not for every project. (It is less clear for consumers what 
> version is/should be used for which series; Fixing problems that comes 
> up in specific stable branches, is not possible; testing the 
> deliverable against a specific stable branch constraints is not 
> possiblel; etc.)
> 
> See some other comments inline.
> 
> [1] https://etherpad.opendev.org/p/april2022-ptg-rel-mgt#L44
> 
> Előd
> 
> On 2022. 04. 19. 18:01, Michael Johnson wrote:
> > Comments inline.
> >
> > Michael
> >
> > On Tue, Apr 19, 2022 at 6:34 AM Slawek Kaplonski<skaplons at redhat.com>  wrote:
> >> Hi,
> >>
> >>
> >> During the Zed PTG sessions in the TC room we were discussing some ideas how we can improve project governance.
> >>
> >> One of the topics was related to the projects which don't really have any changes in the cycle. Currently we are forcing to do new release of basically the same code when it comes to the end of the cycle.
> >>
> >> Can/Should we maybe change that and e.g. instead of forcing new release use last released version of the of the repo for new release too?
> > In the past this has created confusion in the community about if a 
> > project has been dropped/removed from OpenStack. That said, I think 
> > this is the point of the "independent" release classification.
> Yes, exactly as Michael says.
> >> If yes, should we then automatically propose change of the release model to the "independent" maybe?
> > Personally, I would prefer to send an email to the discuss list 
> > proposing the switch to independent. Patches can sometimes get 
> > merged before everyone gets to give input. Especially since the 
> > patch would be proposed in the "releases" project and may not be on 
> > the team's dashboards.
> The release process only catches libraries that had no merged change, 
> so the number is not that huge; sending a mail seems a fair option.
> 
> (The process says: "Evaluate any libraries that did not have any 
> change merged over the cycle to see if it is time to transition them 
> to the independent release model 
> <https://releases.openstack.org/reference/release_models.html#openstack-related-libraries>.
> Note: client libraries (and other libraries strongly tied to another
> deliverable) should generally follow their parent deliverable release 
> model, even if they did not have a lot of activity themselves).")
> >> What would be the best way how Release Management team can maybe notify TC about such less active projects which don't needs any new release in the cycle? That could be one of the potential conditions to check project's health by the TC team.
> > It seems like this would be a straight forward script to write given 
> > we already have tools to capture the list of changes included in a 
> > given release.
> 
> There are a couple of good signals already for TC to catch inactive 
> projects, like the generated patches that are not merged, for example:
> 
> https://review.opendev.org/q/topic:reno-yoga+is:open
> https://review.opendev.org/q/topic:create-yoga+is:open
> https://review.opendev.org/q/topic:add-xena-python-jobtemplates+is:open
> 
> (Note that in the past not merged patches caused issues and discussing 
> with the TC resulted a suggestion to force-merge them to avoid future
> issues)
> 
> >> Another question is related to the projects which aren't really active and are broken during the final release time. We had such problem in the last cycle, see [1] for details. Should we still force pushing fixes for them to be able to release or maybe should we consider deprecation of such projects and not to release it at all?
> > In the past we have simply not released projects that are broken and 
> > don't have people actively working on fixing them. It has been a 
> > signal to the community that if they value the project they need to 
> > contribute to it.
> 
> Yes, that's a fair point, too, maybe those broken deliverables should 
> not be released at all. I'm not sure, but that might cause other 
> issues for release management tooling, though...
> 
> Besides, during our PTG session we came to the conclusion that we need 
> another step in our process:
> * "propose DNM changes on every repository by RequirementsFreeze (5 
> weeks before final release) to check that tests are still passing with 
> the current set of dependencies"
> Hopefully this will catch broken things well in advance.
> 
> >> [1]http://lists.openstack.org/pipermail/openstack-discuss/2022-Marc
> >> h/027864.html
> >>
> >>
> >> --
> >>
> >> Slawek Kaplonski
> >>
> >> Principal Software Engineer
> >>
> >> Red Hat
> 

Thanks for all the input on this topic so far. Here is my summary and conclusion of what was said in this thread:

     *  we shouldn't try to automatically switch such "inactive" projects to the independent model, and we should continue bumping versions of such projects every cycle, as it makes many things easier,
     *  Release Management team will test projects about 5 weeks before final release - that may help us find broken projects which then can be discussed and eventually marked as deprecated to not release broken code finally,
     *  To check potentially inactive projects TC can:
         *  finish the script https://review.opendev.org/c/openstack/governance/+/810037[1] and use the stats generated by that script to periodically check projects' health,
         *  check projects with no merged generated patches, like:
	https://review.opendev.org/q/topic:reno-yoga+is:open[2]
	https://review.opendev.org/q/topic:create-yoga+is:open[3]
	https://review.opendev.org/q/topic:add-xena-python-jobtemplates+is:open[4]

Feel free to add/change anything in this summary if I missed or misunderstood anything, or if you have any ideas about other improvements we can make in this area.

--
Slawek Kaplonski
Principal Software Engineer
Red Hat

--------
[1] https://review.opendev.org/c/openstack/governance/+/810037
[2] https://review.opendev.org/q/topic:reno-yoga+is:open
[3] https://review.opendev.org/q/topic:create-yoga+is:open
[4] https://review.opendev.org/q/topic:add-xena-python-jobtemplates+is:open

------------------------------



------------------------------

End of openstack-discuss Digest, Vol 43, Issue 12
*************************************************

