[all][qa] Tempest jobs are swapping
cboylan at sapwetik.org
Tue Jul 2 22:37:34 UTC 2019
I've been working to bring up a new cloud as part of our nodepool resource set, and one of the things we do to sanity check a new cloud is run a default tempest full job. The first time I ran tempest it failed because I hadn't configured swap on the test node and we ran out of memory. I added swap, reran the job, and tempest passed just fine.
Our base jobs configure swap as a last-ditch effort to avoid failing jobs unnecessarily, but the ideal is to avoid swap entirely. In the past, 8GB of memory has been plenty to run the tempest test suite, so I think something has changed here, and I think we should be able to get back to running under 8GB of memory.
I bring this up because in recent weeks we've seen different groups attempt to reduce their resource footprint (which is good), but many of the approaches seem to ignore that making our jobs as quick and reliable as possible (e.g. by not using swap) will have a major impact on that footprint. This is due to the way gating works: a failure requires that we discard all results for subsequent changes in the gate, remove the change that failed, then re-enqueue jobs for the changes after the failed change. On top of that, the quicker our jobs run, the quicker we return resources to the pool.
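To put a rough number on the cost of a gate failure as described above, here is a minimal sketch. It is a simplified model, not how the gate is actually implemented: the function name, the 0-based queue position, and the fixed jobs_per_change figure are all assumptions made for illustration.

```python
def discarded_job_runs(queue_length, failed_position, jobs_per_change):
    """Count job runs thrown away when one change in the gate fails.

    Simplified, hypothetical model: every change behind the failed one
    (positions are 0-based, head of the queue is 0) has all of its job
    results discarded and must be re-enqueued.
    """
    if not 0 <= failed_position < queue_length:
        raise ValueError("failed_position must be inside the queue")
    changes_behind = queue_length - failed_position - 1
    return changes_behind * jobs_per_change


if __name__ == "__main__":
    # Illustrative numbers only: 10 changes in the gate, 5 jobs each,
    # and the third change (position 2) fails.
    print(discarded_job_runs(queue_length=10, failed_position=2, jobs_per_change=5))
```

In this model a single unnecessary failure near the head of a long queue costs far more than the one job that failed, which is why job reliability matters so much for the overall footprint.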
How do we debug this? Devstack jobs capture dstat data as well as memory-specific information that can be used to identify resource hogs. Taking a recent tempest-full job's dstat log, we can see that cinder-backup is using 785MB of memory all on its own (scroll to the bottom of the log). Devstack also captures memory usage of a larger set of processes in its peakmem_tracker log. This includes RSS specifically, which doesn't match up with dstat's number, making me think dstat's number may be virtual memory and not resident memory. The peakmem_tracker log identifies other processes which we might look at for improving this situation.
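As a quick way to compare resident and virtual memory per process on a test node, something like the sketch below could help. It does not parse the dstat or peakmem_tracker logs (their exact formats are not assumed here); instead it ranks the text output of `ps -eo pid,rss,vsz,comm`, which reports RSS and VSZ in KiB. The function name and the sample rows are hypothetical and only illustrate the input shape.

```python
def rank_by_rss(ps_output, top=3):
    """Return (command, rss_kib, vsz_kib) tuples, largest resident set first.

    Expects the text produced by `ps -eo pid,rss,vsz,comm`, header included.
    """
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        _pid, rss, vsz, comm = line.split(None, 3)
        rows.append((comm.strip(), int(rss), int(vsz)))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:top]


# Made-up sample rows, not taken from any real job log.
SAMPLE = """\
  PID   RSS    VSZ COMMAND
  101  2048 900000 proc-a
  102 51200  60000 proc-b
  103 10240 500000 proc-c
"""

if __name__ == "__main__":
    for comm, rss, vsz in rank_by_rss(SAMPLE):
        print(f"{comm}: rss={rss} KiB vsz={vsz} KiB")
```

A process with a large VSZ but a small RSS (proc-a in the sample) has reserved address space without occupying much physical memory, so ranking by RSS gives a better picture of what is actually pushing a node into swap.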
It would be great if the QA team and various projects could take a look at this to help improve the reliability and throughput of our testing. Thank you.
More information about the openstack-discuss mailing list