These jobs seem to time out regularly on every provider [1], but the issue is surely more apparent with tempest on FN. The result is quite a bit of lost time: 361 jobs that each run for several hours add up to a little over 1,000 hours of lost cycles.

[1] http://logstash.openstack.org/#/dashboard/file/logstash.json?query=filename:%5C%22job-output.txt%5C%22%20AND%20message:%5C%22RUN%20END%20RESULT_TIMED_OUT%5C%22&from=7d

On Thu, Aug 1, 2019 at 5:01 AM Ian Wienand <iwienand@redhat.com> wrote:
On Fri, Jul 26, 2019 at 04:53:28PM -0700, Clark Boylan wrote:
Given my change shows this can be so much quicker is there any interest in modifying devstack to be faster here? And if so what do we think an appropriate approach would be?
My first concern was if anyone considered openstack-client setting these things up as actually part of the testing. I'd say not, comments in [1] suggest similar views.
My second concern is that we keep sufficient track of complexity vs. speed. Obviously, doing things in a sequential manner via a script is pretty simple to follow, and as we start putting things into scripts we make it harder to debug when a monoscript dies and you have to start pulling apart where it was. With just a little JSON fiddling we can currently pull good stats from logstash ([2]), so I think as we go it would be good to make sure we account for the time using appropriate wrappers, etc.
Then the third concern is not to break anything for plugins -- devstack has a very very loose API which basically relies on plugin authors using a combination of good taste and copying other code to decide what's internal or not.
Which made me start thinking: I wonder whether, if we look at this closely, we might make inroads even without replacing things?
For example [3]; it seems like SERVICE_DOMAIN_NAME is never not default, so the get_or_create_domain call is always just overhead (the result is never used).
Then it seems that in the gate, basically all of the "get_or_create" calls will really just be "create" calls? Because we're always starting fresh. So we could cut out about half of the calls there pre-checking if we know we're under zuul (proof-of-concept [4]).
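To make the "skip the pre-check" idea concrete, here is a rough sketch, not the change in [4]. The names ASSUME_FRESH_CLOUD, find_role_assignment, add_role_assignment and get_or_add_role are all hypothetical and exist only for illustration; how a run would actually detect that it is under zuul is left open.

```shell
# Hypothetical sketch: none of these names are devstack API.
# ASSUME_FRESH_CLOUD is an illustrative switch for "we know nothing exists yet".
ASSUME_FRESH_CLOUD=${ASSUME_FRESH_CLOUD:-False}

# Lookup step: prints something if the assignment already exists.
# Arguments: role user project
find_role_assignment() {
    openstack role assignment list --role "$1" --user "$2" --project "$3" \
        -f value -c Role
}

# Create step. Arguments: role user project
add_role_assignment() {
    openstack role add "$1" --user "$2" --project "$3"
}

get_or_add_role() {
    if [ "$ASSUME_FRESH_CLOUD" != "True" ]; then
        # Normal path: pay for the lookup and stop if the assignment is there.
        if [ -n "$(find_role_assignment "$@")" ]; then
            return 0
        fi
    fi
    # Fresh-cloud path (or nothing found): go straight to the create call.
    add_role_assignment "$@"
}
```

With the switch set to True each assignment costs one client call instead of two, at the price of failing outright if the "always fresh" assumption is ever wrong.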
Then we have blocks like:
    get_or_add_user_project_role $member_role $demo_user $demo_project
    get_or_add_user_project_role $admin_role $admin_user $demo_project
    get_or_add_user_project_role $another_role $demo_user $demo_project
    get_or_add_user_project_role $member_role $demo_user $invis_project
If we wrapped that in something like
    start_osc_session
    ...
    end_osc_session
which sets a variable that means instead of calling directly, those functions write their arguments to a tmp file. Then at the end call, end_osc_session does
$ osc "$(< tmpfile)"
and uses the inbuilt batching? If that had half the calls by skipping the "get_or" bit, and used common authentication from batching, would that help?
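A minimal sketch of how such a session wrapper might look. The osc function, OSC_BATCH_FILE and OSC_BATCH_CMD are hypothetical names, and the sketch assumes the client accepts one command per line on stdin; nothing here is existing devstack code.

```shell
# Hypothetical sketch: these helpers do not exist in devstack.
OSC_BATCH_FILE=""                          # non-empty while a session is open
OSC_BATCH_CMD=${OSC_BATCH_CMD:-openstack}  # assumed to read commands from stdin

start_osc_session() {
    OSC_BATCH_FILE=$(mktemp)
}

# Stand-in for the place where devstack functions would call the client.
osc() {
    if [ -n "$OSC_BATCH_FILE" ]; then
        # Session open: record the arguments instead of running them.
        # Note: this simple join loses quoting for arguments with spaces.
        echo "$*" >> "$OSC_BATCH_FILE"
    else
        openstack "$@"
    fi
}

end_osc_session() {
    local batch=$OSC_BATCH_FILE
    local rc
    OSC_BATCH_FILE=""
    # One client start-up and one authentication for the whole batch.
    $OSC_BATCH_CMD < "$batch"
    rc=$?
    rm -f "$batch"
    return $rc
}
```

One limitation to keep in mind: a deferred call cannot hand a result back to its caller, so this would only suit calls whose output is not used straight away.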
And then I don't know if all the projects and groups are required for every devstack run? Maybe someone skilled in the art could do a bit of an audit and we could cut more of that out too?
So I guess my point is that maybe we could tweak what we have a bit to make some immediate wins, before anyone has to rewrite too much?
-i
[1] https://review.opendev.org/673018
[2] https://ethercalc.openstack.org/rzuhevxz7793
[3] https://review.opendev.org/673941
[4] https://review.opendev.org/673936