[qa][openstackclient] Debugging devstack slowness
Today I have been digging into devstack runtime costs to help Donny Davis understand why tempest jobs sometimes timeout on the FortNebula cloud. One thing I discovered was that the keystone user, group, project, role, and domain setup [0] can take many minutes [1][2] (in the examples here almost 5).
I've rewritten create_keystone_accounts to be a python tool [3] and get the runtime for that subset of setup from ~100s to ~9s [4]. I imagine that if we applied this to the other create_X_accounts functions we would see similar results.
I think this is so much faster because we avoid repeated costs in openstack client including: python process startup, pkg_resource disk scanning to find entrypoints, and needing to convert names to IDs via the API every time osc is run. Given my change shows this can be so much quicker is there any interest in modifying devstack to be faster here? And if so what do we think an appropriate approach would be?
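To give a sense of the shape of this, here is a rough sketch of the approach (not the actual change in [3]; it uses keystoneauth1/keystoneclient purely for illustration, and the URL/credentials are placeholders). One token is requested up front and the objects returned by each create call are kept in Python, so nothing needs to be resolved by name later:

from keystoneauth1.identity import v3
from keystoneauth1 import session
from keystoneclient.v3 import client

# Authenticate once; this session reuses the same token for every call below.
auth = v3.Password(auth_url="http://localhost/identity/v3",  # placeholder
                   username="admin", password="secret",      # placeholders
                   project_name="admin",
                   user_domain_id="default", project_domain_id="default")
ks = client.Client(session=session.Session(auth=auth))

# One python process, many API calls; keep the returned objects so later
# calls can use their IDs directly instead of resolving names via the API.
demo_project = ks.projects.create(name="demo", domain="default")
demo_user = ks.users.create(name="demo", domain="default", password="secret",
                            default_project=demo_project)
member_role = ks.roles.create(name="member")
ks.roles.grant(member_role, user=demo_user, project=demo_project)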
[0] https://opendev.org/openstack/devstack/src/commit/6aeaceb0c4ef078d028fb6605c... [1] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output.... [2] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output.... [3] https://review.opendev.org/#/c/673108/ [4] http://logs.openstack.org/08/673108/6/check/devstack-xenial/a4107d0/job-outp...
Note the jobs compared above all ran on rax-dfw.
Clark
On Fri, Jul 26, 2019 at 5:57 PM Clark Boylan cboylan@sapwetik.org wrote:
Today I have been digging into devstack runtime costs to help Donny Davis understand why tempest jobs sometimes timeout on the FortNebula cloud. One thing I discovered was that the keystone user, group, project, role, and domain setup [0] can take many minutes [1][2] (in the examples here almost 5).
I've rewritten create_keystone_accounts to be a python tool [3] and get the runtime for that subset of setup from ~100s to ~9s [4]. I imagine that if we applied this to the other create_X_accounts functions we would see similar results.
I think this is so much faster because we avoid repeated costs in openstack client including: python process startup, pkg_resource disk scanning to find entrypoints, and needing to convert names to IDs via the API every time osc is run. Given my change shows this can be so much quicker is there any interest in modifying devstack to be faster here? And if so what do we think an appropriate approach would be?
In tripleo, we've also run into the same thing for other actions. While you can do bulk openstack client actions[0], it's not the best thing if you need to create a resource and fetch an ID for a subsequent action. We ported our post-installation items to python[1] and noticed a dramatic improvement as well. It might be beneficial to maybe add some caching into openstackclient so that the startup cost isn't so large every time?
[0] https://review.opendev.org/#/c/521146/ [1] https://review.opendev.org/#/c/614540/
[0] https://opendev.org/openstack/devstack/src/commit/6aeaceb0c4ef078d028fb6605c... [1] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output.... [2] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output.... [3] https://review.opendev.org/#/c/673108/ [4] http://logs.openstack.org/08/673108/6/check/devstack-xenial/a4107d0/job-outp...
Note the jobs compared above all ran on rax-dfw.
Clark
On Mon, Jul 29, 2019, at 6:41 AM, Alex Schultz wrote:
On Fri, Jul 26, 2019 at 5:57 PM Clark Boylan cboylan@sapwetik.org wrote:
Today I have been digging into devstack runtime costs to help Donny Davis understand why tempest jobs sometimes timeout on the FortNebula cloud. One thing I discovered was that the keystone user, group, project, role, and domain setup [0] can take many minutes [1][2] (in the examples here almost 5).
I've rewritten create_keystone_accounts to be a python tool [3] and get the runtime for that subset of setup from ~100s to ~9s [4]. I imagine that if we applied this to the other create_X_accounts functions we would see similar results.
I think this is so much faster because we avoid repeated costs in openstack client including: python process startup, pkg_resource disk scanning to find entrypoints, and needing to convert names to IDs via the API every time osc is run. Given my change shows this can be so much quicker is there any interest in modifying devstack to be faster here? And if so what do we think an appropriate approach would be?
In tripleo, we've also run into the same thing for other actions. While you can do bulk openstack client actions[0], it's not the best thing if you need to create a resource and fetch an ID for a subsequent action. We ported our post-installation items to python[1] and noticed a dramatic improvement as well. It might be beneficial to maybe add some caching into openstackclient so that the startup cost isn't so large every time?
Reading more of what devstack does I've realized that there is quite a bit of logic tied around devstack's use of OSC. In particular if you select one option you get this endpoint and if you select another option you get that endpoint, or if this service and that service are enabled then they need this common role, etc. I think the best way to tackle this would be to have devstack write a manifest file, then have a tool (maybe in osc or sdk?) that can read a manifest and execute the api updates in order, storing intermediate results so that they can be referred to without doing further API lookups.
Sounds like such a thing would be useful outside of devstack as well. I brought this up briefly with Monty and he said he would explore it a bit on the SDK side of things. Does this seem like a reasonable approach? Anyone else have better ideas?
The big key here seems to be reusing authentication tokens and remembering resource ID data so that we can avoid unnecessary (and costly) lookups every time we want to modify a resource or associate resources.
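As a strawman, such a manifest runner could be as simple as the sketch below (purely illustrative, not an existing tool; 'ks' is a keystoneclient v3 client like in the earlier sketch, and the manifest format is made up):

import yaml

# Made-up manifest format, e.g.:
#   - resource: projects
#     args: {name: demo, domain: default}
#     register: demo_project
#   - resource: roles
#     action: grant
#     args: {role: $member_role, user: $demo_user, project: $demo_project}

def run_manifest(ks, path):
    created = {}  # intermediate results, so later steps need no API lookups

    def resolve(value):
        # "$demo_project" refers back to a resource created by an earlier step
        if isinstance(value, str) and value.startswith("$"):
            return created[value[1:]]
        return value

    with open(path) as f:
        steps = yaml.safe_load(f)
    for step in steps:
        manager = getattr(ks, step["resource"])              # e.g. ks.projects
        call = getattr(manager, step.get("action", "create"))
        result = call(**{k: resolve(v) for k, v in step["args"].items()})
        if "register" in step:
            created[step["register"]] = result
    return created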
[0] https://review.opendev.org/#/c/521146/ [1] https://review.opendev.org/#/c/614540/
[0] https://opendev.org/openstack/devstack/src/commit/6aeaceb0c4ef078d028fb6605c... [1] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output.... [2] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output.... [3] https://review.opendev.org/#/c/673108/ [4] http://logs.openstack.org/08/673108/6/check/devstack-xenial/a4107d0/job-outp...
Note the jobs compared above all ran on rax-dfw.
Clark
On Fri, Jul 26, 2019 at 04:53:28PM -0700, Clark Boylan wrote:
Given my change shows this can be so much quicker is there any interest in modifying devstack to be faster here? And if so what do we think an appropriate approach would be?
My first concern was whether anyone considered openstack-client setting these things up to actually be part of the testing. I'd say not; comments in [1] suggest similar views.
My second concern is that we keep sufficient track of complexity versus speed; obviously doing things in a sequential manner via a script is pretty simple to follow, and as we start putting things into scripts we make it harder to debug when a monoscript dies and you have to start pulling apart where it was. With just a little json fiddling we can currently pull good stats from logstash ([2]), so I think as we go it would be good to make sure we account for the time using appropriate wrappers, etc.
Then the third concern is not to break anything for plugins -- devstack has a very very loose API which basically relies on plugin authors using a combination of good taste and copying other code to decide what's internal or not.
Which made me start thinking I wonder if we look at this closely, even without replacing things we might make inroads?
For example [3]; it seems like SERVICE_DOMAIN_NAME is never not default, so the get_or_create_domain call is always just overhead (the result is never used).
Then it seems that in the gate, basically all of the "get_or_create" calls will really just be "create" calls, because we're always starting fresh. So we could cut out about half of the calls by pre-checking whether we know we're under zuul (proof-of-concept [4]).
Then we have blocks like:
get_or_add_user_project_role $member_role $demo_user $demo_project
get_or_add_user_project_role $admin_role $admin_user $demo_project
get_or_add_user_project_role $another_role $demo_user $demo_project
get_or_add_user_project_role $member_role $demo_user $invis_project
If we wrapped that in something like
start_osc_session ... end_osc_session
which sets a variable so that, instead of calling osc directly, those functions write their arguments to a tmp file. Then at the end, end_osc_session does
$ osc "$(< tmpfile)"
and uses the inbuilt batching? If that had half the calls by skipping the "get_or" bit, and used common authentication from batching, would that help?
And then I don't know if all the projects and groups are required for every devstack run? Maybe someone skilled in the art could do a bit of an audit and we could cut more of that out too?
So I guess my point is that maybe we could tweak what we have a bit to make some immediate wins, before anyone has to rewrite too much?
-i
[1] https://review.opendev.org/673018 [2] https://ethercalc.openstack.org/rzuhevxz7793 [3] https://review.opendev.org/673941 [4] https://review.opendev.org/673936
These jobs seem to time out from every provider on the regular[1], but the issue is surely more apparent with tempest on FN. The result is quite a bit of lost time: 361 jobs that each run for several hours add up to a little over 1,000 hours of lost cycles.
[1] http://logstash.openstack.org/#/dashboard/file/logstash.json?query=filename:...
On Thu, Aug 1, 2019 at 5:01 AM Ian Wienand iwienand@redhat.com wrote:
On Fri, Jul 26, 2019 at 04:53:28PM -0700, Clark Boylan wrote:
Given my change shows this can be so much quicker is there any interest in modifying devstack to be faster here? And if so what do we think an appropriate approach would be?
My first concern was whether anyone considered openstack-client setting these things up to actually be part of the testing. I'd say not; comments in [1] suggest similar views.
My second concern is that we keep sufficient track of complexity versus speed; obviously doing things in a sequential manner via a script is pretty simple to follow, and as we start putting things into scripts we make it harder to debug when a monoscript dies and you have to start pulling apart where it was. With just a little json fiddling we can currently pull good stats from logstash ([2]), so I think as we go it would be good to make sure we account for the time using appropriate wrappers, etc.
Then the third concern is not to break anything for plugins -- devstack has a very very loose API which basically relies on plugin authors using a combination of good taste and copying other code to decide what's internal or not.
Which made me start thinking I wonder if we look at this closely, even without replacing things we might make inroads?
For example [3]; it seems like SERVICE_DOMAIN_NAME is never not default, so the get_or_create_domain call is always just overhead (the result is never used).
Then it seems that in the gate, basically all of the "get_or_create" calls will really just be "create" calls, because we're always starting fresh. So we could cut out about half of the calls by pre-checking whether we know we're under zuul (proof-of-concept [4]).
Then we have blocks like:
get_or_add_user_project_role $member_role $demo_user $demo_project
get_or_add_user_project_role $admin_role $admin_user $demo_project
get_or_add_user_project_role $another_role $demo_user $demo_project
get_or_add_user_project_role $member_role $demo_user $invis_project
If we wrapped that in something like
start_osc_session ... end_osc_session
which sets a variable so that, instead of calling osc directly, those functions write their arguments to a tmp file. Then at the end, end_osc_session does
$ osc "$(< tmpfile)"
and uses the inbuilt batching? If that had half the calls by skipping the "get_or" bit, and used common authentication from batching, would that help?
And then I don't know if all the projects and groups are required for every devstack run? Maybe someone skilled in the art could do a bit of an audit and we could cut more of that out too?
So I guess my point is that maybe we could tweak what we have a bit to make some immediate wins, before anyone has to rewrite too much?
-i
[1] https://review.opendev.org/673018 [2] https://ethercalc.openstack.org/rzuhevxz7793 [3] https://review.opendev.org/673941 [4] https://review.opendev.org/673936
---- On Thu, 01 Aug 2019 17:58:18 +0900 Ian Wienand iwienand@redhat.com wrote ----
On Fri, Jul 26, 2019 at 04:53:28PM -0700, Clark Boylan wrote:
Given my change shows this can be so much quicker is there any interest in modifying devstack to be faster here? And if so what do we think an appropriate approach would be?
My first concern was whether anyone considered openstack-client setting these things up to actually be part of the testing. I'd say not; comments in [1] suggest similar views.
My second concern is that we keep sufficient track of complexity versus speed; obviously doing things in a sequential manner via a script is pretty simple to follow, and as we start putting things into scripts we make it harder to debug when a monoscript dies and you have to start pulling apart where it was. With just a little json fiddling we can currently pull good stats from logstash ([2]), so I think as we go it would be good to make sure we account for the time using appropriate wrappers, etc.
I agree with this concern about maintainability and debugging with scripts. Nowadays very few people have good knowledge of the devstack code, and debugging job-side failures is much harder for most developers. IMO maintainability and ease of debugging need to be the first priority.
If we wanted to replace OSC with something faster, the Tempest service clients come to mind. They call the APIs directly, though a token is requested for each API call. But that is something that needs a PoC, especially around how much speed it actually buys.
Then the third concern is not to break anything for plugins -- devstack has a very very loose API which basically relies on plugin authors using a combination of good taste and copying other code to decide what's internal or not.
Which made me start thinking I wonder if we look at this closely, even without replacing things we might make inroads?
For example [3]; it seems like SERVICE_DOMAIN_NAME is never not default, so the get_or_create_domain call is always just overhead (the result is never used).
Then it seems that in the gate, basically all of the "get_or_create" calls will really just be "create" calls, because we're always starting fresh. So we could cut out about half of the calls by pre-checking whether we know we're under zuul (proof-of-concept [4]).
Then we have blocks like:
get_or_add_user_project_role $member_role $demo_user $demo_project
get_or_add_user_project_role $admin_role $admin_user $demo_project
get_or_add_user_project_role $another_role $demo_user $demo_project
get_or_add_user_project_role $member_role $demo_user $invis_project
If we wrapped that in something like
start_osc_session ... end_osc_session
which sets a variable so that, instead of calling osc directly, those functions write their arguments to a tmp file. Then at the end, end_osc_session does
$ osc "$(< tmpfile)"
and uses the inbuilt batching? If that had half the calls by skipping the "get_or" bit, and used common authentication from batching, would that help?
And then I don't know if all the projects and groups are required for every devstack run? Maybe someone skilled in the art could do a bit of an audit and we could cut more of that out too?
Yeah, auditing for unused or unnecessary calls like that is a good idea. For example, in most places devstack needs just the resource ID or name or a few fields of the created resource, so a GET call that returns the complete set of resource fields might not be needed; for async calls we can make an exception and fetch the resource (e.g. 'addresses' on a server).
-gmann
So I guess my point is that maybe we could tweak what we have a bit to make some immediate wins, before anyone has to rewrite too much?
-i
[1] https://review.opendev.org/673018 [2] https://ethercalc.openstack.org/rzuhevxz7793 [3] https://review.opendev.org/673941 [4] https://review.opendev.org/673936
Just a reminder that there is also http://lists.openstack.org/pipermail/openstack-dev/2016-April/092546.html which was intended to address this same issue.
I toyed around with it a bit for TripleO installs back then and it did seem to speed things up, but at the time there was a bug in our client plugin where it was triggering a prompt for input that was problematic with the server running in the background. I never really got back to it once that was fixed. :-/
On 7/26/19 6:53 PM, Clark Boylan wrote:
Today I have been digging into devstack runtime costs to help Donny Davis understand why tempest jobs sometimes timeout on the FortNebula cloud. One thing I discovered was that the keystone user, group, project, role, and domain setup [0] can take many minutes [1][2] (in the examples here almost 5).
I've rewritten create_keystone_accounts to be a python tool [3] and get the runtime for that subset of setup from ~100s to ~9s [4]. I imagine that if we applied this to the other create_X_accounts functions we would see similar results.
I think this is so much faster because we avoid repeated costs in openstack client including: python process startup, pkg_resource disk scanning to find entrypoints, and needing to convert names to IDs via the API every time osc is run. Given my change shows this can be so much quicker is there any interest in modifying devstack to be faster here? And if so what do we think an appropriate approach would be?
[0] https://opendev.org/openstack/devstack/src/commit/6aeaceb0c4ef078d028fb6605c... [1] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output.... [2] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output.... [3] https://review.opendev.org/#/c/673108/ [4] http://logs.openstack.org/08/673108/6/check/devstack-xenial/a4107d0/job-outp...
Note the jobs compared above all ran on rax-dfw.
Clark
On Tue, Aug 6, 2019, at 8:26 AM, Ben Nemec wrote:
Just a reminder that there is also http://lists.openstack.org/pipermail/openstack-dev/2016-April/092546.html which was intended to address this same issue.
I toyed around with it a bit for TripleO installs back then and it did seem to speed things up, but at the time there was a bug in our client plugin where it was triggering a prompt for input that was problematic with the server running in the background. I never really got back to it once that was fixed. :-/
I'm not tied to any particular implementation. Mostly I wanted to show that we can take this ~5 minute portion of devstack and turn it into a 15 second portion of devstack by improving our use of the service APIs (and possibly even further if we apply it to all of the api interaction). Any idea how difficult it would be to get your client as a service stuff running in devstack again?
I do not think we should make a one off change like I've done in my POC. That will just end up being harder to understand and debug in the future since it will be different than all of the other API interaction. I like the idea of a manifest or feeding a longer lived process api update commands as we can then avoid requesting new tokens as well as pkg_resource startup time. Such a system could be used by all of devstack as well (avoiding the "this bit is special" problem).
Is there any interest from the QA team in committing to an approach and working to do a conversion? I don't want to commit any more time to this myself unless there is strong interest in getting changes merged (as I expect it will be a slow process weeding out places where we've made bad assumptions particularly around plugins).
One of the things I found was that using names with osc results in name to id lookups as well. We can avoid these entirely if we remember name to id mappings instead (which my POC does). Any idea if your osc as a service tool does or can do that? Probably have to be more careful for scoping things in a tool like that as it may be reused by people with name collisions across projects/users/groups/domains.
On 7/26/19 6:53 PM, Clark Boylan wrote:
Today I have been digging into devstack runtime costs to help Donny Davis understand why tempest jobs sometimes timeout on the FortNebula cloud. One thing I discovered was that the keystone user, group, project, role, and domain setup [0] can take many minutes [1][2] (in the examples here almost 5).
I've rewritten create_keystone_accounts to be a python tool [3] and get the runtime for that subset of setup from ~100s to ~9s [4]. I imagine that if we applied this to the other create_X_accounts functions we would see similar results.
I think this is so much faster because we avoid repeated costs in openstack client including: python process startup, pkg_resource disk scanning to find entrypoints, and needing to convert names to IDs via the API every time osc is run. Given my change shows this can be so much quicker is there any interest in modifying devstack to be faster here? And if so what do we think an appropriate approach would be?
[0] https://opendev.org/openstack/devstack/src/commit/6aeaceb0c4ef078d028fb6605c... [1] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output.... [2] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output.... [3] https://review.opendev.org/#/c/673108/ [4] http://logs.openstack.org/08/673108/6/check/devstack-xenial/a4107d0/job-outp...
Note the jobs compared above all ran on rax-dfw.
Clark
On 2019-08-06 08:49:17 -0700 (-0700), Clark Boylan wrote: [...]
One of the things I found was that using names with osc results in name to id lookups as well. We can avoid these entirely if we remember name to id mappings instead (which my POC does). Any idea if your osc as a service tool does or can do that? Probably have to be more careful for scoping things in a tool like that as it may be reused by people with name collisions across projects/users/groups/domains.
[...]
Out of curiosity, could OSC/SDK cache those relationships so they're only looked up once (or at least infrequently)? I guess there are cache invalidation concerns if an entity is deleted and another created out-of-band using the same name, but if it's all done through the same persistent daemon then that's less of a risk right?
On Tue, 6 Aug 2019, Jeremy Stanley wrote:
On 2019-08-06 08:49:17 -0700 (-0700), Clark Boylan wrote: [...]
One of the things I found was that using names with osc results in name to id lookups as well. We can avoid these entirely if we remember name to id mappings instead (which my POC does). Any idea if your osc as a service tool does or can do that? Probably have to be more careful for scoping things in a tool like that as it may be reused by people with name collisions across projects/users/groups/domains.
[...]
Out of curiosity, could OSC/SDK cache those relationships so they're only looked up once (or at least infrequently)? I guess there are cache invalidation concerns if an entity is deleted and another created out-of-band using the same name, but if it's all done through the same persistent daemon then that's less of a risk right?
If we are in a situation where name to id and id to name translations are slow at the services' API layer, isn't that a really big bug? One where the fixing is beneficial to everyone, including devstack users?
(Yes, I'm aware of TCP overhead and all that, but I reckon that's way down on the list of contributing factors here?)
On Tue, Aug 6, 2019 at 11:42 AM Chris Dent cdent+os@anticdent.org wrote:
If we are in a situation where name to id and id to name translations are slow at the services' API layer, isn't that a really big bug? One where the fixing is beneficial to everyone, including devstack users?
While the name->ID lookup is an additional API round trip, it does not cause an additional python startup scan, which is the major killer here. In fact, it is possible that there is more than one lookup and that at least one will always be done because we do not know if that value is a name or an ID. The GET is done in any case because nearly every time (in non-create operations) we probably want the full object anyway.
I also played with starting OSC as a background process a while back, it actually does work pretty well and with a bit more error handling would have been good enough(tm)[0]. The major concern with it then was it was not representative of how people actually use OSC and changed the testing value we get from doing that.
dt
[0] Basically run interactive mode in background, plumb up stdin/stdout to some descriptors and off to the races.
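In Python terms the background approach is roughly the sketch below (error handling and output parsing deliberately left out): one `openstack` process is started, paying the interpreter/pkg_resources/token cost once, and commands are fed to its interactive mode over stdin.

import subprocess

class OSCSession:
    def __init__(self):
        # Start one interactive `openstack` process and keep it around.
        self.proc = subprocess.Popen(["openstack"],
                                     stdin=subprocess.PIPE,
                                     stdout=subprocess.PIPE,
                                     text=True)

    def run(self, command):
        # Interactive-mode commands omit the leading "openstack".
        self.proc.stdin.write(command + "\n")
        self.proc.stdin.flush()

    def close(self):
        self.proc.stdin.close()
        output = self.proc.stdout.read()
        self.proc.wait()
        return output

osc = OSCSession()
osc.run("role add --user demo --project demo member")
osc.run("role add --user admin --project demo admin")
print(osc.close())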
On 2019-08-06 14:44:36 -0500 (-0500), Dean Troyer wrote: [...]
The major concern with it then was it was not representative of how people actually use OSC and changed the testing value we get from doing that.
[...]
In an ideal world, OSC would have explicit functional testing independent of the side effect of calling it when standing up DevStack.
On Aug 6, 2019, at 3:44 PM, Dean Troyer dtroyer@gmail.com wrote:
On Tue, Aug 6, 2019 at 11:42 AM Chris Dent cdent+os@anticdent.org wrote:
If we are in a situation where name to id and id to name translations are slow at the services' API layer, isn't that a really big bug? One where the fixing is beneficial to everyone, including devstack users?
While the name->ID lookup is an additional API round trip, it does not cause an additional python startup scan, which is the major killer here. In fact, it is possible that there is more than one lookup and that at least one will always be done because we do not know if that value is a name or an ID. The GET is done in any case because nearly every time (in non-create operations) we probably want the full object anyway.
I also played with starting OSC as a background process a while back, it actually does work pretty well and with a bit more error handling would have been good enough(tm)[0]. The major concern with it then was it was not representative of how people actually use OSC and changed the testing value we get from doing that.
dt
[0] Basically run interactive mode in background, plumb up stdin/stdout to some descriptors and off to the races.
-- Dean Troyer dtroyer@gmail.com
I made some notes about the plugin lookup issue a while back [1] and I looked at that again the most recent time we were in Denver [2], and came to the conclusion that the implementation was going to require more changes in osc-lib than I was going to have time to figure out on my own. Unfortunately, it’s not a simple matter of choosing between looking at 1 internal cache or doing the pkg_resource scan because of the plugin version management layer osc-lib added.
In any case, I think we’ve discussed the fact many times that the way to fix this is to not scan for plugins unless we have to do so. We just need someone to sit down and work on figuring out how to make that work.
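The general shape of "don't scan unless we have to" could look something like the sketch below (illustrative only; the entry point group name is an assumption, and osc's real plugin discovery goes through cliff/osc-lib and is more involved than this):

from importlib.metadata import entry_points  # select API needs Python 3.10+

def load_command(name, group="openstack.cli.extension"):
    # Look at a single entry point group and import only the plugin that
    # was actually asked for, instead of loading everything at startup.
    for ep in entry_points(group=group):
        if ep.name == name:
            return ep.load()
    raise LookupError("no entry point %r in group %r" % (name, group))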
Doug
[1] https://etherpad.openstack.org/p/mFsAgTZggf [2] https://etherpad.openstack.org/p/train-ptg-osc
On Tue, Aug 6, 2019, at 9:17 AM, Jeremy Stanley wrote:
On 2019-08-06 08:49:17 -0700 (-0700), Clark Boylan wrote: [...]
One of the things I found was that using names with osc results in name to id lookups as well. We can avoid these entirely if we remember name to id mappings instead (which my POC does). Any idea if your osc as a service tool does or can do that? Probably have to be more careful for scoping things in a tool like that as it may be reused by people with name collisions across projects/users/groups/domains.
[...]
Out of curiosity, could OSC/SDK cache those relationships so they're only looked up once (or at least infrequently)? I guess there are cache invalidation concerns if an entity is deleted and another created out-of-band using the same name, but if it's all done through the same persistent daemon then that's less of a risk right?
They could cache these things too. The concern is a valid one too; however, a relatively short TTL may address that as these resources tend to all be used near each other. For example create a router, network, subnet in neutron or a user, role, group/domain in keystone.
That said I think a bigger win would be caching tokens if we want to make changes to caching for osc (I think it can cache tokens but we don't set it up properly in devstack?) Every invocation of osc first hits the pkg_resources cost, then hits the catalog and token lookup costs, then does name to id translations, then does the actual thing you requested. Addressing the first two upfront costs likely has a bigger impact than name to id translations.
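For the name-to-ID side, a small TTL cache along these lines would probably be enough (a sketch only, nothing osc-specific; 'ks' is assumed to be a keystoneclient v3 client, and the key scheme is just one way to avoid collisions across domains):

import time

class TTLCache:
    def __init__(self, ttl=60):
        self.ttl = ttl
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry and time.monotonic() - entry[1] < self.ttl:
            return entry[0]
        return None

    def set(self, key, value):
        self._data[key] = (value, time.monotonic())

ids = TTLCache(ttl=60)

def project_id(ks, name, domain="default"):
    # Only hit the API if we have not resolved this name recently.
    key = ("project", domain, name)
    cached = ids.get(key)
    if cached:
        return cached
    project = ks.projects.list(domain=domain, name=name)[0]  # assumes one match
    ids.set(key, project.id)
    return project.id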
Clark
On 8/6/19 10:49 AM, Clark Boylan wrote:
On Tue, Aug 6, 2019, at 8:26 AM, Ben Nemec wrote:
Just a reminder that there is also http://lists.openstack.org/pipermail/openstack-dev/2016-April/092546.html which was intended to address this same issue.
I toyed around with it a bit for TripleO installs back then and it did seem to speed things up, but at the time there was a bug in our client plugin where it was triggering a prompt for input that was problematic with the server running in the background. I never really got back to it once that was fixed. :-/
I'm not tied to any particular implementation. Mostly I wanted to show that we can take this ~5 minute portion of devstack and turn it into a 15 second portion of devstack by improving our use of the service APIs (and possibly even further if we apply it to all of the api interaction). Any idea how difficult it would be to get your client as a service stuff running in devstack again?
I wish I could take credit, but this is actually Dan Berrange's work. :-)
I do not think we should make a one off change like I've done in my POC. That will just end up being harder to understand and debug in the future since it will be different than all of the other API interaction. I like the idea of a manifest or feeding a longer lived process api update commands as we can then avoid requesting new tokens as well as pkg_resource startup time. Such a system could be used by all of devstack as well (avoiding the "this bit is special" problem).
Is there any interest from the QA team in committing to an approach and working to do a conversion? I don't want to commit any more time to this myself unless there is strong interest in getting changes merged (as I expect it will be a slow process weeding out places where we've made bad assumptions particularly around plugins).
One of the things I found was that using names with osc results in name to id lookups as well. We can avoid these entirely if we remember name to id mappings instead (which my POC does). Any idea if your osc as a service tool does or can do that? Probably have to be more careful for scoping things in a tool like that as it may be reused by people with name collisions across projects/users/groups/domains.
I don't believe this would handle name to id mapping. It's a very thin wrapper around the regular client code that just makes it persistent so we don't pay the startup costs every call. On the plus side that means it basically works like the vanilla client, on the minus side that means it may not provide as much improvement as a more targeted solution.
IIRC it's pretty easy to use, so I can try it out again and make sure it still works and still provides a performance benefit.
On 7/26/19 6:53 PM, Clark Boylan wrote:
Today I have been digging into devstack runtime costs to help Donny Davis understand why tempest jobs sometimes timeout on the FortNebula cloud. One thing I discovered was that the keystone user, group, project, role, and domain setup [0] can take many minutes [1][2] (in the examples here almost 5).
I've rewritten create_keystone_accounts to be a python tool [3] and get the runtime for that subset of setup from ~100s to ~9s [4]. I imagine that if we applied this to the other create_X_accounts functions we would see similar results.
I think this is so much faster because we avoid repeated costs in openstack client including: python process startup, pkg_resource disk scanning to find entrypoints, and needing to convert names to IDs via the API every time osc is run. Given my change shows this can be so much quicker is there any interest in modifying devstack to be faster here? And if so what do we think an appropriate approach would be?
[0] https://opendev.org/openstack/devstack/src/commit/6aeaceb0c4ef078d028fb6605c... [1] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output.... [2] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output.... [3] https://review.opendev.org/#/c/673108/ [4] http://logs.openstack.org/08/673108/6/check/devstack-xenial/a4107d0/job-outp...
Note the jobs compared above all ran on rax-dfw.
Clark
On 8/6/19 11:34 AM, Ben Nemec wrote:
On 8/6/19 10:49 AM, Clark Boylan wrote:
On Tue, Aug 6, 2019, at 8:26 AM, Ben Nemec wrote:
Just a reminder that there is also http://lists.openstack.org/pipermail/openstack-dev/2016-April/092546.html
which was intended to address this same issue.
I toyed around with it a bit for TripleO installs back then and it did seem to speed things up, but at the time there was a bug in our client plugin where it was triggering a prompt for input that was problematic with the server running in the background. I never really got back to it once that was fixed. :-/
I'm not tied to any particular implementation. Mostly I wanted to show that we can take this ~5 minute portion of devstack and turn it into a 15 second portion of devstack by improving our use of the service APIs (and possibly even further if we apply it to all of the api interaction). Any idea how difficult it would be to get your client as a service stuff running in devstack again?
I wish I could take credit, but this is actually Dan Berrange's work. :-)
I do not think we should make a one off change like I've done in my POC. That will just end up being harder to understand and debug in the future since it will be different than all of the other API interaction. I like the idea of a manifest or feeding a longer lived process api update commands as we can then avoid requesting new tokens as well as pkg_resource startup time. Such a system could be used by all of devstack as well (avoiding the "this bit is special" problem).
Is there any interest from the QA team in committing to an approach and working to do a conversion? I don't want to commit any more time to this myself unless there is strong interest in getting changes merged (as I expect it will be a slow process weeding out places where we've made bad assumptions particularly around plugins).
One of the things I found was that using names with osc results in name to id lookups as well. We can avoid these entirely if we remember name to id mappings instead (which my POC does). Any idea if your osc as a service tool does or can do that? Probably have to be more careful for scoping things in a tool like that as it may be reused by people with name collisions across projects/users/groups/domains.
I don't believe this would handle name to id mapping. It's a very thin wrapper around the regular client code that just makes it persistent so we don't pay the startup costs every call. On the plus side that means it basically works like the vanilla client, on the minus side that means it may not provide as much improvement as a more targeted solution.
IIRC it's pretty easy to use, so I can try it out again and make sure it still works and still provides a performance benefit.
It still works and it still helps. Using the osc service cut about 3 minutes off my 21 minute devstack run. Subjectively I would say that most of the time was being spent cloning and installing services and their deps.
I guess the downside is that working around the OSC slowness in CI will reduce developer motivation to fix the problem, which affects all users too. Then again, this has been a problem for years and no one has fixed it, so apparently that isn't a big enough lever to get things moving anyway. :-/
On 7/26/19 6:53 PM, Clark Boylan wrote:
Today I have been digging into devstack runtime costs to help Donny Davis understand why tempest jobs sometimes timeout on the FortNebula cloud. One thing I discovered was that the keystone user, group, project, role, and domain setup [0] can take many minutes [1][2] (in the examples here almost 5).
I've rewritten create_keystone_accounts to be a python tool [3] and get the runtime for that subset of setup from ~100s to ~9s [4]. I imagine that if we applied this to the other create_X_accounts functions we would see similar results.
I think this is so much faster because we avoid repeated costs in openstack client including: python process startup, pkg_resource disk scanning to find entrypoints, and needing to convert names to IDs via the API every time osc is run. Given my change shows this can be so much quicker is there any interest in modifying devstack to be faster here? And if so what do we think an appropriate approach would be?
[0] https://opendev.org/openstack/devstack/src/commit/6aeaceb0c4ef078d028fb6605c...
[1] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output....
[2] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output....
[3] https://review.opendev.org/#/c/673108/ [4] http://logs.openstack.org/08/673108/6/check/devstack-xenial/a4107d0/job-outp...
Note the jobs compared above all ran on rax-dfw.
Clark
On Wed, 2019-08-07 at 08:33 -0500, Ben Nemec wrote:
On 8/6/19 11:34 AM, Ben Nemec wrote:
On 8/6/19 10:49 AM, Clark Boylan wrote:
On Tue, Aug 6, 2019, at 8:26 AM, Ben Nemec wrote:
Just a reminder that there is also http://lists.openstack.org/pipermail/openstack-dev/2016-April/092546.html
which was intended to address this same issue.
I toyed around with it a bit for TripleO installs back then and it did seem to speed things up, but at the time there was a bug in our client plugin where it was triggering a prompt for input that was problematic with the server running in the background. I never really got back to it once that was fixed. :-/
I'm not tied to any particular implementation. Mostly I wanted to show that we can take this ~5 minute portion of devstack and turn it into a 15 second portion of devstack by improving our use of the service APIs (and possibly even further if we apply it to all of the api interaction). Any idea how difficult it would be to get your client as a service stuff running in devstack again?
I wish I could take credit, but this is actually Dan Berrange's work. :-)
I do not think we should make a one off change like I've done in my POC. That will just end up being harder to understand and debug in the future since it will be different than all of the other API interaction. I like the idea of a manifest or feeding a longer lived process api update commands as we can then avoid requesting new tokens as well as pkg_resource startup time. Such a system could be used by all of devstack as well (avoiding the "this bit is special" problem).
Is there any interest from the QA team in committing to an approach and working to do a conversion? I don't want to commit any more time to this myself unless there is strong interest in getting changes merged (as I expect it will be a slow process weeding out places where we've made bad assumptions particularly around plugins).
One of the things I found was that using names with osc results in name to id lookups as well. We can avoid these entirely if we remember name to id mappings instead (which my POC does). Any idea if your osc as a service tool does or can do that? Probably have to be more careful for scoping things in a tool like that as it may be reused by people with name collisions across projects/users/groups/domains.
I don't believe this would handle name to id mapping. It's a very thin wrapper around the regular client code that just makes it persistent so we don't pay the startup costs every call. On the plus side that means it basically works like the vanilla client, on the minus side that means it may not provide as much improvement as a more targeted solution.
IIRC it's pretty easy to use, so I can try it out again and make sure it still works and still provides a performance benefit.
It still works and it still helps. Using the osc service cut about 3 minutes off my 21 minute devstack run. Subjectively I would say that most of the time was being spent cloning and installing services and their deps.
I guess the downside is that working around the OSC slowness in CI will reduce developer motivation to fix the problem, which affects all users too. Then again, this has been a problem for years and no one has fixed it, so apparently that isn't a big enough lever to get things moving anyway. :-/
Using osc directly, I don't think the slowness is really perceptible from a human standpoint, but it adds up in a CI run. There are larger problems with gate slowness than fixing osc will solve, but every little helps. I do agree, however, that the gate is not a big enough motivator for people to fix osc slowness: we can wait hours in some cases for jobs to start, so 3 minutes is not really a concern from a latency perspective. But if we saved 3 minutes on every run, that might in aggregate reduce the latency problems we have.
On 7/26/19 6:53 PM, Clark Boylan wrote:
Today I have been digging into devstack runtime costs to help Donny Davis understand why tempest jobs sometimes timeout on the FortNebula cloud. One thing I discovered was that the keystone user, group, project, role, and domain setup [0] can take many minutes [1][2] (in the examples here almost 5).
I've rewritten create_keystone_accounts to be a python tool [3] and get the runtime for that subset of setup from ~100s to ~9s [4]. I imagine that if we applied this to the other create_X_accounts functions we would see similar results.
I think this is so much faster because we avoid repeated costs in openstack client including: python process startup, pkg_resource disk scanning to find entrypoints, and needing to convert names to IDs via the API every time osc is run. Given my change shows this can be so much quicker is there any interest in modifying devstack to be faster here? And if so what do we think an appropriate approach would be?
[0] https://opendev.org/openstack/devstack/src/commit/6aeaceb0c4ef078d028fb6605c...
[1] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output....
[2] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output....
[3] https://review.opendev.org/#/c/673108/
[4] http://logs.openstack.org/08/673108/6/check/devstack-xenial/a4107d0/job-outp...
Note the jobs compared above all ran on rax-dfw.
Clark
On 8/7/19 9:37 AM, Sean Mooney wrote:
On Wed, 2019-08-07 at 08:33 -0500, Ben Nemec wrote:
On 8/6/19 11:34 AM, Ben Nemec wrote:
On 8/6/19 10:49 AM, Clark Boylan wrote:
On Tue, Aug 6, 2019, at 8:26 AM, Ben Nemec wrote:
Just a reminder that there is also http://lists.openstack.org/pipermail/openstack-dev/2016-April/092546.html
which was intended to address this same issue.
I toyed around with it a bit for TripleO installs back then and it did seem to speed things up, but at the time there was a bug in our client plugin where it was triggering a prompt for input that was problematic with the server running in the background. I never really got back to it once that was fixed. :-/
I'm not tied to any particular implementation. Mostly I wanted to show that we can take this ~5 minute portion of devstack and turn it into a 15 second portion of devstack by improving our use of the service APIs (and possibly even further if we apply it to all of the api interaction). Any idea how difficult it would be to get your client as a service stuff running in devstack again?
I wish I could take credit, but this is actually Dan Berrange's work. :-)
I do not think we should make a one off change like I've done in my POC. That will just end up being harder to understand and debug in the future since it will be different than all of the other API interaction. I like the idea of a manifest or feeding a longer lived process api update commands as we can then avoid requesting new tokens as well as pkg_resource startup time. Such a system could be used by all of devstack as well (avoiding the "this bit is special" problem).
Is there any interest from the QA team in committing to an approach and working to do a conversion? I don't want to commit any more time to this myself unless there is strong interest in getting changes merged (as I expect it will be a slow process weeding out places where we've made bad assumptions particularly around plugins).
One of the things I found was that using names with osc results in name to id lookups as well. We can avoid these entirely if we remember name to id mappings instead (which my POC does). Any idea if your osc as a service tool does or can do that? Probably have to be more careful for scoping things in a tool like that as it may be reused by people with name collisions across projects/users/groups/domains.
I don't believe this would handle name to id mapping. It's a very thin wrapper around the regular client code that just makes it persistent so we don't pay the startup costs every call. On the plus side that means it basically works like the vanilla client, on the minus side that means it may not provide as much improvement as a more targeted solution.
IIRC it's pretty easy to use, so I can try it out again and make sure it still works and still provides a performance benefit.
It still works and it still helps. Using the osc service cut about 3 minutes off my 21 minute devstack run. Subjectively I would say that most of the time was being spent cloning and installing services and their deps.
I guess the downside is that working around the OSC slowness in CI will reduce developer motivation to fix the problem, which affects all users too. Then again, this has been a problem for years and no one has fixed it, so apparently that isn't a big enough lever to get things moving anyway. :-/
Using osc directly, I don't think the slowness is really perceptible from a human standpoint, but it adds up in a CI run. There are larger problems with gate slowness than fixing osc will solve, but every little helps. I do agree, however, that the gate is not a big enough motivator for people to fix osc slowness: we can wait hours in some cases for jobs to start, so 3 minutes is not really a concern from a latency perspective. But if we saved 3 minutes on every run, that might in aggregate reduce the latency problems we have.
I find the slowness very noticeable in interactive use. It adds something like 2 seconds to a basic call like image list that returns almost instantly in the OSC interactive shell where there is no startup overhead. From my performance days, any latency over 1 second was considered unacceptable for an interactive call. The interactive shell does help with that if I'm doing a bunch of calls in a row though.
That said, you're right that 3 minutes multiplied by the number of jobs we run per day is significant. Picking 1000 as a round number (and I'm pretty sure we run a _lot_ more than that per day), a 3 minute decrease in runtime per job would save about 50 hours of CI time in total. Small things add up at scale. :-)
On 7/26/19 6:53 PM, Clark Boylan wrote:
Today I have been digging into devstack runtime costs to help Donny Davis understand why tempest jobs sometimes timeout on the FortNebula cloud. One thing I discovered was that the keystone user, group, project, role, and domain setup [0] can take many minutes [1][2] (in the examples here almost 5).
I've rewritten create_keystone_accounts to be a python tool [3] and get the runtime for that subset of setup from ~100s to ~9s [4]. I imagine that if we applied this to the other create_X_accounts functions we would see similar results.
I think this is so much faster because we avoid repeated costs in openstack client including: python process startup, pkg_resource disk scanning to find entrypoints, and needing to convert names to IDs via the API every time osc is run. Given my change shows this can be so much quicker is there any interest in modifying devstack to be faster here? And if so what do we think an appropriate approach would be?
[0] https://opendev.org/openstack/devstack/src/commit/6aeaceb0c4ef078d028fb6605c...
[1] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output....
[2] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output....
[3] https://review.opendev.org/#/c/673108/
[4] http://logs.openstack.org/08/673108/6/check/devstack-xenial/a4107d0/job-outp...
Note the jobs compared above all ran on rax-dfw.
Clark
On Wed, 2019-08-07 at 10:11 -0500, Ben Nemec wrote:
On 8/7/19 9:37 AM, Sean Mooney wrote:
On Wed, 2019-08-07 at 08:33 -0500, Ben Nemec wrote:
On 8/6/19 11:34 AM, Ben Nemec wrote:
On 8/6/19 10:49 AM, Clark Boylan wrote:
On Tue, Aug 6, 2019, at 8:26 AM, Ben Nemec wrote:
Just a reminder that there is also http://lists.openstack.org/pipermail/openstack-dev/2016-April/092546.html
which was intended to address this same issue.
I toyed around with it a bit for TripleO installs back then and it did seem to speed things up, but at the time there was a bug in our client plugin where it was triggering a prompt for input that was problematic with the server running in the background. I never really got back to it once that was fixed. :-/
I'm not tied to any particular implementation. Mostly I wanted to show that we can take this ~5 minute portion of devstack and turn it into a 15 second portion of devstack by improving our use of the service APIs (and possibly even further if we apply it to all of the api interaction). Any idea how difficult it would be to get your client as a service stuff running in devstack again?
I wish I could take credit, but this is actually Dan Berrange's work. :-)
I do not think we should make a one off change like I've done in my POC. That will just end up being harder to understand and debug in the future since it will be different than all of the other API interaction. I like the idea of a manifest or feeding a longer lived process api update commands as we can then avoid requesting new tokens as well as pkg_resource startup time. Such a system could be used by all of devstack as well (avoiding the "this bit is special" problem).
Is there any interest from the QA team in committing to an approach and working to do a conversion? I don't want to commit any more time to this myself unless there is strong interest in getting changes merged (as I expect it will be a slow process weeding out places where we've made bad assumptions particularly around plugins).
One of the things I found was that using names with osc results in name to id lookups as well. We can avoid these entirely if we remember name to id mappings instead (which my POC does). Any idea if your osc as a service tool does or can do that? Probably have to be more careful for scoping things in a tool like that as it may be reused by people with name collisions across projects/users/groups/domains.
I don't believe this would handle name to id mapping. It's a very thin wrapper around the regular client code that just makes it persistent so we don't pay the startup costs every call. On the plus side that means it basically works like the vanilla client, on the minus side that means it may not provide as much improvement as a more targeted solution.
IIRC it's pretty easy to use, so I can try it out again and make sure it still works and still provides a performance benefit.
It still works and it still helps. Using the osc service cut about 3 minutes off my 21 minute devstack run. Subjectively I would say that most of the time was being spent cloning and installing services and their deps.
I guess the downside is that working around the OSC slowness in CI will reduce developer motivation to fix the problem, which affects all users too. Then again, this has been a problem for years and no one has fixed it, so apparently that isn't a big enough lever to get things moving anyway. :-/
Using osc directly, I don't think the slowness is really perceptible from a human standpoint, but it adds up in a CI run. There are larger problems with gate slowness than fixing osc will solve, but every little helps. I do agree, however, that the gate is not a big enough motivator for people to fix osc slowness: we can wait hours in some cases for jobs to start, so 3 minutes is not really a concern from a latency perspective. But if we saved 3 minutes on every run, that might in aggregate reduce the latency problems we have.
I find the slowness very noticeable in interactive use. It adds something like 2 seconds to a basic call like image list that returns almost instantly in the OSC interactive shell where there is no startup overhead. From my performance days, any latency over 1 second was considered unacceptable for an interactive call. The interactive shell does help with that if I'm doing a bunch of calls in a row though.
Well, that was kind of my point: when we write scripts we invoke it over and over again. If I need to use osc to run lots of commands for some reason, I generally enter the interactive mode. The interactive mode already masks the pain, so any time it has bothered me in the past I have just ended up using it instead.
It's been a long time since I looked at this, but I think there were two reasons it is slow on startup: one is the need to get a token for each request, and the other was related to the way we scan for plugins. I honestly don't know if either has improved, but the interactive shell eliminates both as issues.
That said, you're right that 3 minutes multiplied by the number of jobs we run per day is significant. Picking 1000 as a round number (and I'm pretty sure we run a _lot_ more than that per day), a 3 minute decrease in runtime per job would save about 50 hours of CI time in total. Small things add up at scale. :-)
Yep, it definitely does.
On 7/26/19 6:53 PM, Clark Boylan wrote:
Today I have been digging into devstack runtime costs to help Donny Davis understand why tempest jobs sometimes timeout on the FortNebula cloud. One thing I discovered was that the keystone user, group, project, role, and domain setup [0] can take many minutes [1][2] (in the examples here almost 5).
I've rewritten create_keystone_accounts to be a python tool [3] and get the runtime for that subset of setup from ~100s to ~9s [4]. I imagine that if we applied this to the other create_X_accounts functions we would see similar results.
I think this is so much faster because we avoid repeated costs in openstack client including: python process startup, pkg_resource disk scanning to find entrypoints, and needing to convert names to IDs via the API every time osc is run. Given my change shows this can be so much quicker is there any interest in modifying devstack to be faster here? And if so what do we think an appropriate approach would be?
[0] https://opendev.org/openstack/devstack/src/commit/6aeaceb0c4ef078d028fb6605c...
[1] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output....
[2] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output....
[3] https://review.opendev.org/#/c/673108/
[4] http://logs.openstack.org/08/673108/6/check/devstack-xenial/a4107d0/job-outp...
Note the jobs compared above all ran on rax-dfw.
Clark
Just for reference, FortNebula does 73-80 jobs an hour, so that's 1700(ish) jobs a day at -3 minutes per job.
That is 5200(ish) cycle minutes a day, or about three and a half days' worth of computing time.
If there can be a fix that saves minutes, it's surely worth it.
On Wed, Aug 7, 2019 at 1:18 PM Sean Mooney smooney@redhat.com wrote:
On Wed, 2019-08-07 at 10:11 -0500, Ben Nemec wrote:
On 8/7/19 9:37 AM, Sean Mooney wrote:
On Wed, 2019-08-07 at 08:33 -0500, Ben Nemec wrote:
On 8/6/19 11:34 AM, Ben Nemec wrote:
On 8/6/19 10:49 AM, Clark Boylan wrote:
On Tue, Aug 6, 2019, at 8:26 AM, Ben Nemec wrote:
Just a reminder that there is also
http://lists.openstack.org/pipermail/openstack-dev/2016-April/092546.html
which was intended to address this same issue.
I toyed around with it a bit for TripleO installs back then and it did seem to speed things up, but at the time there was a bug in our client plugin where it was triggering a prompt for input that was problematic with the server running in the background. I never really got back to it once that was fixed. :-/
I'm not tied to any particular implementation. Mostly I wanted to show that we can take this ~5 minute portion of devstack and turn it into a 15 second portion of devstack by improving our use of the service APIs (and possibly even further if we apply it to all of the api interaction). Any idea how difficult it would be to get your client as a service stuff running in devstack again?
I wish I could take credit, but this is actually Dan Berrange's work. :-)
I do not think we should make a one off change like I've done in my POC. That will just end up being harder to understand and debug in the future since it will be different than all of the other API interaction. I like the idea of a manifest or feeding a longer lived process api update commands as we can then avoid requesting new tokens as well as pkg_resource startup time. Such a system could be used by all of devstack as well (avoiding the "this bit is special" problem).
Is there any interest from the QA team in committing to an approach and working to do a conversion? I don't want to commit any more time to this myself unless there is strong interest in getting changes merged (as I expect it will be a slow process weeding out places where we've made bad assumptions particularly around plugins).
One of the things I found was that using names with osc results in name to id lookups as well. We can avoid these entirely if we remember name to id mappings instead (which my POC does). Any idea if your osc as a service tool does or can do that? Probably have to be more careful for scoping things in a tool like that as it may be reused by people with name collisions across projects/users/groups/domains.
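(The following is a minimal sketch, not Clark's actual tool [3], of what the single-process, name-to-ID-caching approach described above might look like with openstacksdk; the cloud name, account names, password, and role name are made up for illustration.)

    import openstack

    conn = openstack.connect(cloud="devstack-admin")  # assumed clouds.yaml entry
    ids = {}  # (kind, name) -> id, so we never resolve names via the API again

    def create_project(name, domain_id):
        project = conn.identity.create_project(name=name, domain_id=domain_id)
        ids[("project", name)] = project.id
        return project.id

    def create_user(name, project_id, password, domain_id):
        user = conn.identity.create_user(
            name=name, default_project_id=project_id,
            password=password, domain_id=domain_id)
        ids[("user", name)] = user.id
        return user.id

    domain_id = conn.identity.find_domain("default").id
    project_id = create_project("demo", domain_id)      # illustrative names,
    user_id = create_user("demo", project_id, "secret", domain_id)  # not devstack's real layout

    role = conn.identity.find_role("member")  # may be "Member" on older clouds
    conn.identity.assign_project_role_to_user(project_id, user_id, role.id)

Everything happens in one process with one token, and later steps use the cached IDs instead of asking keystone to resolve names again.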
I don't believe this would handle name to id mapping. It's a very thin wrapper around the regular client code that just makes it persistent so we don't pay the startup costs every call. On the plus side that means it basically works like the vanilla client, on the minus side that means it may not provide as much improvement as a more targeted solution.
IIRC it's pretty easy to use, so I can try it out again and make sure it still works and still provides a performance benefit.
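(For what it's worth, a rough sketch of the persistent-wrapper idea being described here, not the actual client-as-a-service tool: keep one process alive so interpreter startup and plugin scanning happen once, and feed command lines to the normal osc entry point over a local socket. The socket path and line protocol are invented for illustration.)

    import shlex
    import socketserver

    from openstackclient import shell  # the regular osc entry point

    SOCKET_PATH = "/tmp/osc-service.sock"  # invented path for this sketch

    class Handler(socketserver.StreamRequestHandler):
        def handle(self):
            # One request = one command line, e.g. "project list -f value".
            line = self.rfile.readline().decode().strip()
            rc = shell.main(shlex.split(line))  # dispatch without re-importing anything
            self.wfile.write("exit {}\n".format(rc).encode())

    if __name__ == "__main__":
        with socketserver.UnixStreamServer(SOCKET_PATH, Handler) as server:
            server.serve_forever()

Each call would still authenticate, so a wrapper like this mainly saves the startup and entry-point scanning cost, which matches the "thin wrapper" description above.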
It still works and it still helps. Using the osc service cut about 3 minutes off my 21 minute devstack run. Subjectively I would say that most of the time was being spent cloning and installing services and their deps.
I guess the downside is that working around the OSC slowness in CI will reduce developer motivation to fix the problem, which affects all users too. Then again, this has been a problem for years and no one has fixed it, so apparently that isn't a big enough lever to get things moving anyway. :-/
Using osc directly I don't think the slowness is really perceptible from a human standpoint, but it adds up in a CI run. There are larger problems to kill with gate slowness than fixing osc will solve, but every little helps. I do agree, however, that the gate is not a big enough motivator for people to fix osc slowness, as we can wait hours in some cases for jobs to start, so 3 minutes is not really a concern from a latency perspective. But if we saved 3 minutes on every run, that might in aggregate reduce the latency problems we have.
I have a PoC patch up in devstack[0] to start using the openstack-server client. It passed the basic devstack test, and looking through the logs you can see that openstack calls are now completing in fractions of a second as opposed to 2.5 to 3 seconds, so I think it's working as intended.
That said, it needs quite a bit of refinement. For example, I think we should disable this on any OSC patches. I also suspect it will fall over for any projects that use an OSC plugin since the server is started before any plugins are installed. This could probably be worked around by restarting the service after a project is installed, but it's something that needs to be dealt with.
Before I start taking a serious look at those things, do we want to pursue this? It does add some potential complexity to debugging if a client call fails or if the server crashes. I'm not sure I can quantify the risk there though since it's always Just Worked(tm) for me.
-Ben
On Wed, Aug 14, 2019, at 10:07 AM, Ben Nemec wrote:
I have a PoC patch up in devstack[0] to start using the openstack-server client. It passed the basic devstack test and looking through the logs you can see that openstack calls are now completing in fractions of a second as opposed to 2.5 to 3, so I think it's working as intended.
That said, it needs quite a bit of refinement. For example, I think we should disable this on any OSC patches. I also suspect it will fall over for any projects that use an OSC plugin since the server is started before any plugins are installed. This could probably be worked around by restarting the service after a project is installed, but it's something that needs to be dealt with.
Before I start taking a serious look at those things, do we want to pursue this? It does add some potential complexity to debugging if a client call fails or if the server crashes. I'm not sure I can quantify the risk there though since it's always Just Worked(tm) for me.
Considering that our number one identified e-r bug is job timeouts [1], I think anything that reduces job time by a measurable amount is worthwhile. Additionally, if we save 5 minutes per devstack run and run devstack 10k times a day (not an up-to-date number, but it has been in that range in the past; someone can double-check this with grafana or logstash or the zuul dashboard), that is a massive savings when looked at on the whole. To me that makes it worthwhile.
[1] http://status.openstack.org/elastic-recheck/index.html#1686542
On Wed, Aug 14, 2019 at 12:07:01PM -0500, Ben Nemec wrote:
I have a PoC patch up in devstack[0] to start using the openstack-server client. It passed the basic devstack test and looking through the logs you can see that openstack calls are now completing in fractions of a second as opposed to 2.5 to 3, so I think it's working as intended.
I see this as having a couple of advantages
- no bespoke API interfacing code to maintain
- the wrapper is custom but pretty small
- plugins can benefit by using the same wrapper
- we can turn the wrapper off and fall back to the same calls directly with the client (also good for local interaction)
- in a similar theme, it's still pretty close to "what I'd type on the command line to do this" which is a bit of a devstack theme
So FWIW I'm positive on the direction, thanks!
-i
(some very experienced people have said "we know it's slow" and I guess we should take advice on if this is a temporary work-around, or an actual solution)
On 8/14/19 8:49 PM, Ian Wienand wrote:
On Wed, Aug 14, 2019 at 12:07:01PM -0500, Ben Nemec wrote:
I have a PoC patch up in devstack[0] to start using the openstack-server client. It passed the basic devstack test and looking through the logs you can see that openstack calls are now completing in fractions of a second as opposed to 2.5 to 3, so I think it's working as intended.
I see this as having a couple of advantages
- no bespoke API interfacing code to maintain
- the wrapper is custom but pretty small
- plugins can benefit by using the same wrapper
- we can turn the wrapper off and fall back to the same calls directly with the client (also good for local interaction)
- in a similar theme, it's still pretty close to "what I'd type on the command line to do this" which is a bit of a devstack theme
So FWIW I'm positive on the direction, thanks!
-i
(some very experienced people have said "we know it's slow" and I guess we should take advice on if this is a temporary work-around, or an actual solution)
Okay, I've got https://review.opendev.org/#/c/676016/ passing devstack ci now and I think it's ready for initial review. I don't know if everything I'm doing will fly with the devstack folks, but the reasons why should be covered in the commit message. I'm open to suggestions on alternate ways to accomplish the same things.
participants (11)
- Alex Schultz
- Ben Nemec
- Chris Dent
- Clark Boylan
- Dean Troyer
- Donny Davis
- Doug Hellmann
- Ghanshyam Mann
- Ian Wienand
- Jeremy Stanley
- Sean Mooney