[qa][ptg][nova][cinder][keystone][neutron][glance][swift][placement] How to make integrated-gate testing (tempest-full) more stable and fast

Ghanshyam Mann gmann at ghanshyammann.com
Mon May 27 11:35:25 UTC 2019

 ---- On Thu, 16 May 2019 20:48:30 +0900 Erno Kuvaja <ekuvaja at redhat.com> wrote ----
 > On Tue, May 7, 2019 at 12:31 AM Tim Burke <tim at swiftstack.com> wrote:
 > > On 5/5/19 12:18 AM, Ghanshyam Mann wrote:
 > > > The current integrated-gate jobs (tempest-full) are not very stable because of various bugs, especially timeouts. We tried to improve this by filtering the slow tests into the separate tempest-slow job, but the situation has not improved much.
 > > >
 > > > We talked about ideas to make it more stable and fast for projects, especially when a failure is not related to the project in question. We are planning to split the integrated-gate template (only the tempest-full job as a first step) per related services.
 > > >
 > > > Idea:
 > > > - Run only dependent service tests on the project gate.
 > >
 > > I love this plan already.
 > >
 > > > - The Tempest gate will keep running all the services' tests as the integrated gate at a centralized place, without any change to the current job.
 > > > - Each project can run the templates mentioned below.
 > > > - All of the templates below will be defined and maintained by the QA team.
 > >
 > > My biggest regret is that I couldn't figure out how to do this myself. Much thanks to the QA team!
 > >
 > > > I would like to hear from each of the 6 services which run integrated-gate jobs.
 > > >
 > > > 1. "Integrated-gate-networking" (job to run on neutron gate)
 > > > Tests to run in this template: neutron APIs, nova APIs, keystone APIs? All scenario tests currently running in tempest-full, in the same way (meaning non-slow and in serial).
 > > > Improvement for neutron gate: exclude the cinder API tests, glance API tests, swift API tests.
 > > >
 > > > 2. "Integrated-gate-storage" (job to run on cinder gate, glance gate)
 > > > Tests to run in this template: Cinder APIs, Glance APIs, Swift APIs, Nova APIs, and all scenario tests currently running in tempest-full, in the same way (meaning non-slow and in serial).
 > > > Improvement for cinder, glance gate: exclude the neutron API tests, Keystone API tests.
 > > >
 > > > 3. "Integrated-gate-object-storage" (job to run on swift gate)
 > > > Tests to run in this template: Cinder APIs, Glance APIs, Swift APIs, and all scenario tests currently running in tempest-full, in the same way (meaning non-slow and in serial).
 > > > Improvement for swift gate: exclude the neutron API tests, Keystone API tests, Nova API tests.
 > >
 > > This sounds great. My only question is why Cinder tests are still included, but I trust that it's there for a reason and I'm just revealing my own ignorance of Swift's consumers, however removed.
 > >
 > > > Note: swift does not run integrated-gate as of now.
 > >
 > > Correct, and for all the reasons that you're seeking to address. Some eight months ago I'd gotten tired of seeing spurious failures that had nothing to do with Swift, and I was hard pressed to find an instance where the tempest tests caught a regression or behavior change that wasn't already caught by Swift's own functional tests. In short, the signal-to-noise ratio for those particular tests was low enough that a failure only told me "you should leave a recheck comment," so I proposed https://review.opendev.org/#/c/601813/ . There was also a side benefit of having our longest-running job change from legacy-tempest-dsvm-neutron-full (at 90-100 minutes) to swift-probetests-centos-7 (at ~30 minutes), tightening developer feedback loops.
 > >
 > > It sounds like this proposal addresses both concerns: by reducing the scope of tests to what might actually exercise the Swift API (if indirectly), the signal-to-noise ratio should be much better and the wall-clock time will be reduced.
 > >
 > > > 4. "Integrated-gate-compute" (job to run on Nova gate)
 > > > Tests to run in this template: Nova APIs, Cinder APIs, Glance APIs?, neutron APIs, and all scenario tests currently running in tempest-full, in the same way (meaning non-slow and in serial).
 > > > Improvement for Nova gate: exclude the swift API tests (not running in the current job, but they might in the future), Keystone API tests.
 > > >
 > > > 5. "Integrated-gate-identity" (job to run on keystone gate)
 > > > Tests to run in this template: all. As all projects use keystone, we might need to run all tests as they run in integrated-gate today. But is keystone being used differently by each service? If not, is it enough to run only a single service's tests, say Nova or neutron?
 > > >
 > > > 6. "Integrated-gate-placement" (job to run on placement gate)
 > > > Tests to run in this template: Nova API tests, Neutron API tests + scenario tests + any new service that depends on placement APIs.
 > > > Improvement for placement gate: exclude the glance API tests, cinder API tests, swift API tests, keystone API tests.
 > > >
 > > > Thoughts on this approach?
 > > >
 > > > The important point is that we must not lose the coverage of integrated testing per project. So I would like to get each project's view on whether we are missing any dependency (proposed test removal) in the templates proposed above.
 > >
 > > As far as Swift is aware, these dependencies seem accurate; at any rate, *we* don't use anything other than Keystone, even by way of another API. Further, Swift does not use particularly esoteric Keystone APIs; I would be OK with integrated-gate-identity not exercising Swift's API with the assumption that some other (or indeed, almost *any* other) service would likely exercise the parts that we care about.
 > >
 > > > - https://etherpad.openstack.org/p/qa-train-ptg
 > > > -gmann
 > While I'm all for limiting the scope Tempest targets for each patch, to save time and our precious infra resources, I have a feeling that we might end up missing something here. Honestly, I'm not sure what that something would be, and maybe I'm thinking about the scopes the wrong way around.
 > For example: 4. "Integrated-gate-compute" (job to run on Nova gate)
 > I'm not exactly sure what any given Nova patch would be able to break in Cinder, Glance or Neutron or, on number 2, what Swift depends on in Glance and Cinder that we could break when we introduce a change.

There can be various scenarios where these services are cross-dependent, and it is difficult to judge the isolation among them. For example, the multi-attach feature depends on both Nova and Cinder to work correctly. A change on either side can break this feature.

 > Shouldn't we be asking "What projects are consuming service X?" and targeting those Tempest tests? From the Glance perspective this would be (from core projects) Glance, Cinder, Nova; Cinder is probably interested in Cinder, Glance and Nova (anyone else consuming Cinder?) etc.

I agree with your point about optimizing the testing further based on consumers only. But there are a few cross-service calls between consumer and consumed services. For example, Nova and Cinder call back to each other in the case of the swap volume feature.
To be honest, I want the broadest possible coverage, with cross-testing of both consumer and consumed services. It is possible to optimize further, but that carries the risk of losing some coverage and introducing a regression. That risk is the more dangerous one, and we should avoid it until we are very clear about service isolation.
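
As a minimal sketch (plain Python, not Zuul configuration), the six templates from the original proposal can be written down as data so that the excluded API test groups fall out as a set difference. The names ALL_API_GROUPS, TEMPLATES and excluded_api_tests are hypothetical, and the entries the proposal marked with a question mark ("keystone APIs?", "Glance APIs?") are assumed here to be included.

```python
# Illustrative only: template names and API test groups come from the
# proposal quoted above; the variable and function names are hypothetical.
ALL_API_GROUPS = {"nova", "cinder", "glance", "neutron", "keystone", "swift"}

# API test groups each template would run (scenario tests run in all of them).
# Assumption: the "?" entries in the proposal are treated as included.
TEMPLATES = {
    "integrated-gate-networking": {"neutron", "nova", "keystone"},
    "integrated-gate-storage": {"cinder", "glance", "swift", "nova"},
    "integrated-gate-object-storage": {"cinder", "glance", "swift"},
    "integrated-gate-compute": {"nova", "cinder", "glance", "neutron"},
    "integrated-gate-identity": set(ALL_API_GROUPS),
    "integrated-gate-placement": {"nova", "neutron"},
}


def excluded_api_tests(template):
    """Return the API test groups a template would skip, sorted by name."""
    return sorted(ALL_API_GROUPS - TEMPLATES[template])


for name in sorted(TEMPLATES):
    print(name, "excludes:", excluded_api_tests(name))
```

Writing the split down this way makes it easy to check each exclusion list against the proposal, and to see which service's API tests would disappear from which gate if a dependency turns out to be missing.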

 > I'd like to propose an approach where we define these jobs and run them in check to start with, and let gate run the full suites until we figure out whether we are catching something in gate that we did not catch in check. Once we have reached the understanding that we have sufficient coverage, we can go ahead and switch gate to those jobs as well. This approach would give us the benefit where the impact is highest until we are confident we got the coverage right. I think the biggest issue is that, for the transition period, _everyone_ needs to understand that gate might catch something check did not, and a simple "recheck" might not be sufficient when tempest succeeded in check but failed in gate.

I like your idea of trying this experimentally before the actual migration. But I am worried about how to do that. There are two challenges here:
1. Any job in the gate pipeline has to run in the check pipeline first. Replacing integrated-gate with integrated-gate-* in the check pipeline only would need an exception to that process.
2. How do we get metrics for the failure gap between the check and gate pipelines caused by this change? The OpenStack health dashboard does not collect check pipeline data.
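
To illustrate what a failure-gap measurement for point 2 could look like, here is a rough sketch that counts changes which passed the reduced job in check but failed the full job in gate. The Result record format and the sample data are entirely hypothetical; no existing OpenStack tool is assumed to produce them, and collecting the check-side data is exactly the open problem.

```python
from collections import namedtuple

# Hypothetical record: one job result for one change in one pipeline.
Result = namedtuple("Result", ["change", "pipeline", "job", "passed"])


def failure_gap(results, check_job, gate_job="tempest-full"):
    """Return changes that passed check_job in check but failed gate_job in gate."""
    passed_check = {
        r.change for r in results
        if r.pipeline == "check" and r.job == check_job and r.passed
    }
    failed_gate = {
        r.change for r in results
        if r.pipeline == "gate" and r.job == gate_job and not r.passed
    }
    return sorted(passed_check & failed_gate)


# Made-up sample data for illustration.
sample = [
    Result("change-A", "check", "integrated-gate-compute", True),
    Result("change-A", "gate", "tempest-full", False),
    Result("change-B", "check", "integrated-gate-compute", True),
    Result("change-B", "gate", "tempest-full", True),
]
print(failure_gap(sample, "integrated-gate-compute"))  # ['change-A']
```

A change in this list is only a candidate for missed coverage: the gate failure could also be one of the unrelated timeouts, so each hit would still need to be checked by hand.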


 > Best,
 > Erno "jokke_" Kuvaja
