Proposal to move cinder backup tests out of the integrated gate
I wanted to send this separately from the latest gate status update [1] since it's primarily about latent cinder bugs causing failures in the gate that no one is really investigating.
Running down our tracked gate bugs [2] there are several related to cinder-backup testing:
- http://status.openstack.org/elastic-recheck/#1483434
- http://status.openstack.org/elastic-recheck/#1745168
- http://status.openstack.org/elastic-recheck/#1739482
- http://status.openstack.org/elastic-recheck/#1635643
All of those bugs were reported a long time ago. I've done some investigation into them (at least at the time of reporting) and some are simply due to cinder-api using synchronous RPC calls to cinder-volume (or cinder-backup) and that doesn't scale. This bug isn't a backup issue, but it's definitely related to using RPC call rather than cast:
http://status.openstack.org/elastic-recheck/#1763712
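To make the call-vs-cast distinction concrete, here is a minimal oslo.messaging sketch; the topic, method name and arguments are illustrative and not the actual cinder RPC API:

    import oslo_messaging
    from oslo_config import cfg

    # Assumes a configured transport (e.g. rabbit) is available via cfg.CONF.
    transport = oslo_messaging.get_rpc_transport(cfg.CONF)
    target = oslo_messaging.Target(topic='cinder-backup', version='2.0')
    client = oslo_messaging.RPCClient(transport, target)
    context = {}  # stand-in for a real request context

    # call(): the API worker blocks until the remote service replies (or
    # the RPC timeout expires), so slow backends back up the API under load.
    result = client.call(context, 'create_backup', backup_id='fake-id')

    # cast(): fire-and-forget; the API returns immediately and the remote
    # service reports progress asynchronously via status updates.
    client.cast(context, 'create_backup', backup_id='fake-id')

The scaling problem comes from the first form: every slow backend operation holds an API worker for its full duration.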
Regarding the backup tests specifically, I don't see a reason why they need to be run in the integrated gate jobs, e.g. tempest-full(-py3). They don't involve other services, so in my opinion we should move the backup tests to a separate job that only runs on cinder changes, to keep these latent bugs from failing jobs for unrelated changes and resetting the entire gate.
I would need someone from the cinder team that is more involved in knowing what their job setup looks like to identify a candidate job for these tests if this is something everyone can agree on doing.
[1] http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000867....
[2] http://status.openstack.org/elastic-recheck/
Hi,
+1 from me. We also see it quite often failing in some neutron jobs.
— Slawek Kaplonski
Senior software engineer
Red Hat
Message written by Matt Riedemann mriedemos@gmail.com on 12.12.2018 at 21:00:
I wanted to send this separately from the latest gate status update [1] since it's primarily about latent cinder bugs causing failures in the gate that no one is really investigating.
Running down our tracked gate bugs [2] there are several related to cinder-backup testing:
- http://status.openstack.org/elastic-recheck/#1483434
- http://status.openstack.org/elastic-recheck/#1745168
- http://status.openstack.org/elastic-recheck/#1739482
- http://status.openstack.org/elastic-recheck/#1635643
All of those bugs were reported a long time ago. I've done some investigation into them (at least at the time of reporting) and some are simply due to cinder-api using synchronous RPC calls to cinder-volume (or cinder-backup) and that doesn't scale. This bug isn't a backup issue, but it's definitely related to using RPC call rather than cast:
http://status.openstack.org/elastic-recheck/#1763712
Regarding the backup tests specifically, I don't see a reason why they need to be run in the integrated gate jobs, e.g. tempest-full(-py3). They don't involve other services, so in my opinion we should move the backup tests to a separate job that only runs on cinder changes, to keep these latent bugs from failing jobs for unrelated changes and resetting the entire gate.
I would need someone from the cinder team that is more involved in knowing what their job setup looks like to identify a candidate job for these tests if this is something everyone can agree on doing.
[1] http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000867....
[2] http://status.openstack.org/elastic-recheck/
--
Thanks,
Matt
---- On Thu, 13 Dec 2018 05:00:25 +0900 Matt Riedemann mriedemos@gmail.com wrote ----
I wanted to send this separately from the latest gate status update [1] since it's primarily about latent cinder bugs causing failures in the gate that no one is really investigating.
Running down our tracked gate bugs [2] there are several related to cinder-backup testing:
I agree that those are long-pending bugs, but they don't seem to occur that frequently; the first two hit ~20 times in the last 10 days and the last two even less.
All of those bugs were reported a long time ago. I've done some investigation into them (at least at the time of reporting) and some are simply due to cinder-api using synchronous RPC calls to cinder-volume (or cinder-backup) and that doesn't scale. This bug isn't a backup issue, but it's definitely related to using RPC call rather than cast:
http://status.openstack.org/elastic-recheck/#1763712
Regarding the backup tests specifically, I don't see a reason why they need to be run in the integrated gate jobs, e.g. tempest-full(-py3). They don't involve other services, so in my opinion we should move the backup tests to a separate job that only runs on cinder changes, to keep these latent bugs from failing jobs for unrelated changes and resetting the entire gate.
I would need someone from the cinder team that is more involved in knowing what their job setup looks like to identify a candidate job for these tests if this is something everyone can agree on doing.
Also, I would like to know whether cinder backup is a standard feature (including snapshot backup etc.)? There is no harm in testing those in the integrated job. But I agree that if those tests/features are not stable, we can skip or remove them from integrated gate testing and let cinder test them in a specific job on their own gate until they are stable. As you mentioned, let's wait for the Cinder team to respond on this.
-gmann
[1] http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000867....
[2] http://status.openstack.org/elastic-recheck/
--
Thanks,
Matt
On Wed, 12 Dec 2018, Matt Riedemann wrote:
I wanted to send this separately from the latest gate status update [1] since it's primarily about latent cinder bugs causing failures in the gate that no one is really investigating.
Thanks for writing up this and the other message [1]. It provides much more visibility and context on the situation and can hopefully stimulate people to think about making fixes and perhaps changes to some of the ways we do things that aren't always working.
In that spirit...
Regarding the backup tests specifically, I don't see a reason why they need to be run in the integrated gate jobs, e.g. tempest-full(-py3). They don't involve other services, so in my opinion we should move the backup tests to a separate job that only runs on cinder changes, to keep these latent bugs from failing jobs for unrelated changes and resetting the entire gate.
I guess in this case these tests were exposed by their failing, and it was only once investigating that you realized they weren't truly integration tests? Have you, Matt, got any ideas on how to find other non-integration tests that are being treated as integration which we could move to their own things? De-tangling the spaghetti is likely to reveal plenty of improvements but also plenty of areas that need more attention.
A couple of things I've been working on lately that might be useful for refining tests, such that we catch failures in check before the integrated gate:
* In nova, we're in the process of removing placement, but continuing to use a real, instead of fake, placement fixture in the functional tests [2][3]. This is relatively straightforward for placement, since it is just a simple wsgi app, but it might be possible to create similar things for Cinder and other services, so that functional tests that are currently using a faked out stub of an alien service can use something a bit more robust.
If people are interested in trying to make that happen, I'd be happy to help make it go but I wouldn't be able until next year.
* In placement we wanted to do some very simple live performance testing but didn't want to pay the time cost of setting up a devstack or tempest node, so did something much more lightweight [4] which takes about 1/3rd or less of the time. This may be a repeatable pattern for other kinds of testing. Often times devstack and tempest are overkill but we default to them because they are there.
And finally, the use of wsgi-intercept[5] (usually with gabbi [6]), has made it possible to have reasonably high confidence of the behavior of the placement and nova APIs via functional tests, catching issues before anything as costly as tempest gets involved. Any service which presents its API as a WSGI should be able to use the same tooling if they want.
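For anyone who hasn't seen it, the pattern is roughly the following; the app factory, hostname and path here are made up purely to keep the sketch self-contained:

    import requests
    from wsgi_intercept.interceptor import RequestsInterceptor

    def app_factory():
        # In real use this would return the service's actual WSGI app
        # (e.g. the placement API); a trivial app keeps the sketch runnable.
        def app(environ, start_response):
            start_response('200 OK', [('Content-Type', 'text/plain')])
            return [b'ok']
        return app

    # Requests to the intercepted host/port are routed to the WSGI app
    # in-process: no server, no devstack, just a normal HTTP client call.
    with RequestsInterceptor(app_factory, host='api.test', port=80) as url:
        assert requests.get(url).status_code == 200

An interceptor like this can be wrapped in a test fixture, which is the shape of the real-placement-fixture idea above, so functional tests elsewhere can talk to a genuine API over HTTP without standing up a server.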
If anyone is curious about any of this stuff, please feel free to ask.
[1] http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000867....
[2] https://review.openstack.org/#/c/617941/
[3] https://git.openstack.org/cgit/openstack/placement/tree/placement/tests/func...
[4] https://review.openstack.org/#/c/619248/
[5] https://pypi.org/project/wsgi_intercept/
[6] https://gabbi.readthedocs.io
On 12/13/2018 7:27 AM, Chris Dent wrote:
I guess in this case these tests were exposed by their failing, and it was only once investigating that you realized they weren't truly integration tests? Have you, Matt, got any ideas on how to find other non-integration tests that are being treated as integration which we could move to their own things? De-tangling the spaghetti is likely to reveal plenty of improvements but also plenty of areas that need more attention.
I haven't done a full audit anytime recently, no. I'm sure there are lots of tempest tests that are single-service, e.g. doing things in just cinder, glance, nova or neutron, which don't require any other services (nova might be the exception in some cases since we need at least glance and neutron for building a server). There was a time a few years ago when QA folks were working on pulling stuff like that out of tempest and moving it into the project trees if their functional testing would cover it, e.g. anything that just required the compute API and DB was a candidate to move into nova functional tests (like flavor and aggregates tests). However, interop tests are based on tempest, so there are some tests you just can't remove because they are used by refstack.
On 12/12/18 14:00 -0600, Matt Riedemann wrote:
I wanted to send this separately from the latest gate status update [1] since it's primarily about latent cinder bugs causing failures in the gate that no one is really investigating.
Running down our tracked gate bugs [2] there are several related to cinder-backup testing:
- http://status.openstack.org/elastic-recheck/#1483434
- http://status.openstack.org/elastic-recheck/#1745168
- http://status.openstack.org/elastic-recheck/#1739482
- http://status.openstack.org/elastic-recheck/#1635643
All of those bugs were reported a long time ago. I've done some investigation into them (at least at the time of reporting) and some are simply due to cinder-api using synchronous RPC calls to cinder-volume (or cinder-backup) and that doesn't scale. This bug isn't a backup issue, but it's definitely related to using RPC call rather than cast:
http://status.openstack.org/elastic-recheck/#1763712
Regarding the backup tests specifically, I don't see a reason why they need to be run in the integrated gate jobs, e.g. tempest-full(-py3). They don't involve other services, so in my opinion we should move the backup tests to a separate job that only runs on cinder changes, to keep these latent bugs from failing jobs for unrelated changes and resetting the entire gate.
FWIW cinder backup by default uses swift as the backup driver and requires its services, and that's the way it is being run in this job [1].
The job could be modified to use e.g. NFS driver and not depend on other OpenStack services (unless one wanted to be fancy and have Manila provision the backup share).
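For reference, the swap is just a driver/option change in cinder.conf; this is a rough sketch from memory (the exact class paths vary by release and the NFS export is obviously made up):

    [DEFAULT]
    # Default: back up to swift, which pulls the swift services into the job.
    backup_driver = cinder.backup.drivers.swift.SwiftBackupDriver

    # Alternative: back up to an NFS share, no extra OpenStack services needed.
    # backup_driver = cinder.backup.drivers.nfs.NFSBackupDriver
    # backup_share = nfs-host.example.com:/srv/cinder-backups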
Cheers,
-- Tom Barron (tbarron)
[1] http://logs.openstack.org/55/569055/7/gate/nova-next/2e23975/logs/screen-c-b...
I would need someone from the cinder team that is more involved in knowing what their job setup looks like to identify a candidate job for these tests if this is something everyone can agree on doing.
[1] http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000867....
[2] http://status.openstack.org/elastic-recheck/
--
Thanks,
Matt
FWIW cinder backup by default uses swift as the backup driver and requires its services, and that's the way it is being run in this job [1].
The job could be modified to use e.g. NFS driver and not depend on other OpenStack services (unless one wanted to be fancy and have Manila provision the backup share).
Cheers,
-- Tom Barron (tbarron)
Tom,
Thanks for pointing this out. I am a little nervous about changing the default in testing given that it is the default configuration for backup. If we are not able to narrow down the source of the issues, this could be a road to investigate, but I don't think it is the first course of action we want to take.
Jay
On 13/12/18 09:15 -0600, Jay Bryant wrote:
FWIW cinder backup by default uses swift as the backup driver and requires its services, and that's the way it is being run in this job [1].
The job could be modified to use e.g. NFS driver and not depend on other OpenStack services (unless one wanted to be fancy and have Manila provision the backup share).
Cheers,
-- Tom Barron (tbarron)
Tom,
Thanks for pointing this out. I am a little nervous about changing the default in testing given that it is the default configuration for backup. If we are not able to narrow down the source of the issues, this could be a road to investigate, but I don't think it is the first course of action we want to take.
Jay
Yup, I was just pointing out that this *is* currently a service integration test, and then looking at two ends of the spectrum in terms of running it without dependencies on other services (except the normal keystone, etc.) and with dependencies other than swift.
FWIW I agree that keeping the default and fixing the issue makes sense :)
On 12/13/2018 8:16 AM, Tom Barron wrote:
FWIW cinder backup by default uses swift as the backup driver and requires its services, and that's the way it is being run in this job [1].
The job could be modified to use e.g. NFS driver and not depend on other OpenStack services (unless one wanted to be fancy and have Manila provision the backup share).
For the integrated gate, I'm specifically looking for *not fancy* combinations of services. The integrated gate jobs which run on most changes throughout the system should test the most boring scenarios possible and ideally hit as many different services as possible; more exotic configurations can be run in separate special-purpose jobs, much like third-party CI.
On 12/12/2018 2:00 PM, Matt Riedemann wrote:
I wanted to send this separately from the latest gate status update [1] since it's primarily about latent cinder bugs causing failures in the gate that no one is really investigating.
Matt, thank you for putting together this information. I am sorry that these issues with Cinder are impacting Nova's ability to merge code. I don't think we knew that this was having an impact on Nova.
Running down our tracked gate bugs [2] there are several related to cinder-backup testing:
- http://status.openstack.org/elastic-recheck/#1483434
- http://status.openstack.org/elastic-recheck/#1745168
- http://status.openstack.org/elastic-recheck/#1739482
- http://status.openstack.org/elastic-recheck/#1635643
All of those bugs were reported a long time ago. I've done some investigation into them (at least at the time of reporting) and some are simply due to cinder-api using synchronous RPC calls to cinder-volume (or cinder-backup) and that doesn't scale. This bug isn't a backup issue, but it's definitely related to using RPC call rather than cast:
Thanks to you bringing this up, Dan Smith has proposed a patch that may help with the timeouts: https://review.openstack.org/#/c/624809/ The thought is that concurrent LVM processes might be the source of the timeouts. We will continue to work with Dan on that patch.
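To illustrate the general idea (this is not the actual patch, just the usual shape of serializing a hot code path with oslo.concurrency):

    from oslo_concurrency import lockutils, processutils

    # Purely illustrative: take an external (inter-process) file lock so
    # only one LVM command runs at a time on the host, instead of many
    # concurrent ones competing with each other until something times out.
    @lockutils.synchronized('lvm-commands', external=True)
    def run_lvm(*cmd):
        return processutils.execute(*cmd)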
Regarding the backup tests specifically, I don't see a reason why they need to be run in the integrated gate jobs, e.g. tempest-full(-py3). They don't involve other services, so in my opinion we should move the backup tests to a separate job that only runs on cinder changes, to keep these latent bugs from failing jobs for unrelated changes and resetting the entire gate.
I would need someone from the cinder team that is more involved in knowing what their job setup looks like to identify a candidate job for these tests if this is something everyone can agree on doing.
We have a member of the team that might have some bandwidth to start working on check/gate issues. I have added this issue to our meeting agenda for next week. We should be able to get attention from the team members who can help at that point.
[1] http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000867....
[2] http://status.openstack.org/elastic-recheck/
On 12/13/2018 9:03 AM, Jay Bryant wrote:
Matt, thank you for putting together this information. I am sorry that these issues with Cinder are impacting Nova's ability to merge code. I don't think we knew that this was having an impact on Nova.
FWIW it's not a nova thing, it's everything that uses the integrated-gate jobs (tempest-full). So failures *anywhere* in these jobs will impact our (OpenStack as a whole) ability to get changes through on *all* projects. And this isn't just a cinder issue; nova has provided our fair share of gate breakers, but I like to think that we also try to stay on top of them.
On 12/12/2018 2:00 PM, Matt Riedemann wrote:
I wanted to send this separately from the latest gate status update [1] since it's primarily about latent cinder bugs causing failures in the gate that no one is really investigating.
Running down our tracked gate bugs [2] there are several related to cinder-backup testing:
- http://status.openstack.org/elastic-recheck/#1483434
- http://status.openstack.org/elastic-recheck/#1745168
- http://status.openstack.org/elastic-recheck/#1739482
- http://status.openstack.org/elastic-recheck/#1635643
All of those bugs were reported a long time ago. I've done some investigation into them (at least at the time of reporting) and some are simply due to cinder-api using synchronous RPC calls to cinder-volume (or cinder-backup) and that doesn't scale. This bug isn't a backup issue, but it's definitely related to using RPC call rather than cast:
http://status.openstack.org/elastic-recheck/#1763712
Regarding the backup tests specifically, I don't see a reason why they need to be run in the integrated gate jobs, e.g. tempest-full(-py3). They don't involve other services, so in my opinion we should move the backup tests to a separate job that only runs on cinder changes, to keep these latent bugs from failing jobs for unrelated changes and resetting the entire gate.
I would need someone from the cinder team that is more involved in knowing what their job setup looks like to identify a candidate job for these tests if this is something everyone can agree on doing.
[1] http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000867....
This is an old thread, but gmann's recent skip of a cinder backup test that was failing a lot [1] prompted me to revisit this.
As such I've proposed a change [2] which will disable the cinder-backup service in the tempest-full job which is in the integrated-gate project template and run by most projects.
There is a voting job running against cinder changes named "cinder-tempest-dsvm-lvm-lio-barbican" which will still test the backup service, but it's not gating - it's up to the cinder team if they want to make that job gating. The other thing is that it doesn't look like that job runs on glance (or swift) changes, so if the cinder team is interested in co-gating changes between at least cinder and glance, they could add cinder-tempest-dsvm-lvm-lio-barbican to glance so it runs there, and/or create a new cinder-backup job which just runs the backup tests and gate on it in both cinder and glance.
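For concreteness, the zuul configuration involved is roughly the following; this is only a sketch, the job name in the first block is hypothetical, and the real change [2] should be consulted for the actual details:

    # Variant of a tempest job that turns off the c-bak (cinder-backup)
    # devstack service so the backup tests no longer run there.
    - job:
        name: tempest-full-no-backup    # hypothetical name
        parent: tempest-full
        vars:
          devstack_services:
            c-bak: false

    # In glance's .zuul.yaml: also run the existing cinder job so cinder
    # and glance keep co-gating on backup coverage.
    - project:
        check:
          jobs:
            - cinder-tempest-dsvm-lvm-lio-barbican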
[1] https://review.openstack.org/#/c/651660/
[2] https://review.openstack.org/#/c/651865/
---- On Thu, 11 Apr 2019 11:50:06 -0500 Matt Riedemann mriedemos@gmail.com wrote ----
On 12/12/2018 2:00 PM, Matt Riedemann wrote:
I wanted to send this separately from the latest gate status update [1] since it's primarily about latent cinder bugs causing failures in the gate that no one is really investigating.
Running down our tracked gate bugs [2] there are several related to cinder-backup testing:
- http://status.openstack.org/elastic-recheck/#1483434
- http://status.openstack.org/elastic-recheck/#1745168
- http://status.openstack.org/elastic-recheck/#1739482
- http://status.openstack.org/elastic-recheck/#1635643
All of those bugs were reported a long time ago. I've done some investigation into them (at least at the time of reporting) and some are simply due to cinder-api using synchronous RPC calls to cinder-volume (or cinder-backup) and that doesn't scale. This bug isn't a backup issue, but it's definitely related to using RPC call rather than cast:
http://status.openstack.org/elastic-recheck/#1763712
Regarding the backup tests specifically, I don't see a reason why they need to be run in the integrated gate jobs, e.g. tempest-full(-py3). They don't involve other services, so in my opinion we should move the backup tests to a separate job that only runs on cinder changes, to keep these latent bugs from failing jobs for unrelated changes and resetting the entire gate.
I would need someone from the cinder team that is more involved in knowing what their job setup looks like to identify a candidate job for these tests if this is something everyone can agree on doing.
[1] http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000867....
This is an old thread, but gmann's recent skip of a cinder backup test that was failing a lot [1] prompted me to revisit this.
As such I've proposed a change [2] which will disable the cinder-backup service in the tempest-full job which is in the integrated-gate project template and run by most projects.
On the end goal I agree, but this will skip all backup tests, including the ones which are running fine, so let's wait until we move those tests to the cinder tempest plugin or run them in another integrated job, etc.
There is a voting job running against cinder changes named "cinder-tempest-dsvm-lvm-lio-barbican" which will still test the backup service, but it's not gating - it's up to the cinder team if they want to make that job gating. The other thing is that it doesn't look like that job runs on glance (or swift) changes, so if the cinder team is interested in co-gating changes between at least cinder and glance, they could add cinder-tempest-dsvm-lvm-lio-barbican to glance so it runs there, and/or create a new cinder-backup job which just runs the backup tests and gate on it in both cinder and glance.
Initially, I was on the side of testing/running everything together, but on second thought, and seeing how unstable tempest-full is, I agree with you that we should find some solution to make the integrated-gate template testing (tempest-full) more efficient and stable for each service.
Neutron also faces a lot of test failures due to volume backup or image tests which are definitely not related to neutron, and it's not worth blocking neutron development for that.
I have added this topic to the QA PTG etherpad to find the best possible solution: https://etherpad.openstack.org/p/qa-train-ptg
-gmann
[1] https://review.openstack.org/#/c/651660/
[2] https://review.openstack.org/#/c/651865/
--
Thanks,
Matt
participants (6)
- Chris Dent
- Ghanshyam Mann
- Jay Bryant
- Matt Riedemann
- Slawomir Kaplonski
- Tom Barron