[qa][openstackclient] Debugging devstack slowness
Today I have been digging into devstack runtime costs to help Donny Davis understand why tempest jobs sometimes timeout on the FortNebula cloud. One thing I discovered was that the keystone user, group, project, role, and domain setup [0] can take many minutes [1][2] (in the examples here almost 5).
I've rewritten create_keystone_accounts to be a python tool [3] and get the runtime for that subset of setup from ~100s to ~9s [4]. I imagine that if we applied this to the other create_X_accounts functions we would see similar results.
I think this is so much faster because we avoid repeated costs in openstack client including: python process startup, pkg_resource disk scanning to find entrypoints, and needing to convert names to IDs via the API every time osc is run. Given my change shows this can be so much quicker is there any interest in modifying devstack to be faster here? And if so what do we think an appropriate approach would be?
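To give a sense of the shape of this, here is a rough sketch of the approach (not the actual change in [3]; it uses keystoneauth1/keystoneclient purely for illustration, and the URL/credentials are placeholders). One token is requested up front and the objects returned by each create call are kept in Python, so nothing needs to be resolved by name later:

from keystoneauth1.identity import v3
from keystoneauth1 import session
from keystoneclient.v3 import client

# Authenticate once; this session reuses the same token for every call below.
auth = v3.Password(auth_url="http://localhost/identity/v3",  # placeholder
                   username="admin", password="secret",      # placeholders
                   project_name="admin",
                   user_domain_id="default", project_domain_id="default")
ks = client.Client(session=session.Session(auth=auth))

# One python process, many API calls; keep the returned objects so later
# calls can use their IDs directly instead of resolving names via the API.
demo_project = ks.projects.create(name="demo", domain="default")
demo_user = ks.users.create(name="demo", domain="default", password="secret",
                            default_project=demo_project)
member_role = ks.roles.create(name="member")
ks.roles.grant(member_role, user=demo_user, project=demo_project)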
[0] https://opendev.org/openstack/devstack/src/commit/6aeaceb0c4ef078d028fb6605c... [1] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output.... [2] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output.... [3] https://review.opendev.org/#/c/673108/ [4] http://logs.openstack.org/08/673108/6/check/devstack-xenial/a4107d0/job-outp...
Note the jobs compared above all ran on rax-dfw.
Clark
On Fri, Jul 26, 2019 at 5:57 PM Clark Boylan cboylan@sapwetik.org wrote:
Today I have been digging into devstack runtime costs to help Donny Davis understand why tempest jobs sometimes timeout on the FortNebula cloud. One thing I discovered was that the keystone user, group, project, role, and domain setup [0] can take many minutes [1][2] (in the examples here almost 5).
I've rewritten create_keystone_accounts to be a python tool [3] and get the runtime for that subset of setup from ~100s to ~9s [4]. I imagine that if we applied this to the other create_X_accounts functions we would see similar results.
I think this is so much faster because we avoid repeated costs in openstack client including: python process startup, pkg_resource disk scanning to find entrypoints, and needing to convert names to IDs via the API every time osc is run. Given my change shows this can be so much quicker is there any interest in modifying devstack to be faster here? And if so what do we think an appropriate approach would be?
In tripleo, we've also run into the same thing for other actions. While you can do bulk openstack client actions[0], it's not the best thing if you need to create a resource and fetch an ID for a subsequent action. We ported our post-installation items to python[1] and noticed a dramatic improvement as well. It might be beneficial to maybe add some caching into openstackclient so that the startup cost isn't so large every time?
[0] https://review.opendev.org/#/c/521146/ [1] https://review.opendev.org/#/c/614540/
[0] https://opendev.org/openstack/devstack/src/commit/6aeaceb0c4ef078d028fb6605c... [1] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output.... [2] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output.... [3] https://review.opendev.org/#/c/673108/ [4] http://logs.openstack.org/08/673108/6/check/devstack-xenial/a4107d0/job-outp...
Note the jobs compared above all ran on rax-dfw.
Clark
On Mon, Jul 29, 2019, at 6:41 AM, Alex Schultz wrote:
On Fri, Jul 26, 2019 at 5:57 PM Clark Boylan cboylan@sapwetik.org wrote:
Today I have been digging into devstack runtime costs to help Donny Davis understand why tempest jobs sometimes timeout on the FortNebula cloud. One thing I discovered was that the keystone user, group, project, role, and domain setup [0] can take many minutes [1][2] (in the examples here almost 5).
I've rewritten create_keystone_accounts to be a python tool [3] and get the runtime for that subset of setup from ~100s to ~9s [4]. I imagine that if we applied this to the other create_X_accounts functions we would see similar results.
I think this is so much faster because we avoid repeated costs in openstack client including: python process startup, pkg_resource disk scanning to find entrypoints, and needing to convert names to IDs via the API every time osc is run. Given my change shows this can be so much quicker is there any interest in modifying devstack to be faster here? And if so what do we think an appropriate approach would be?
In tripleo, we've also run into the same thing for other actions. While you can do bulk openstack client actions[0], it's not the best thing if you need to create a resource and fetch an ID for a subsequent action. We ported our post-installation items to python[1] and noticed a dramatic improvement as well. It might be beneficial to maybe add some caching into openstackclient so that the startup cost isn't so large every time?
Reading more of what devstack does I've realized that there is quite a bit of logic tied around devstack's use of OSC. In particular if you select one option you get this endpoint and if you select another option you get that endpoint, or if this service and that service are enabled then they need this common role, etc. I think the best way to tackle this would be to have devstack write a manifest file, then have a tool (maybe in osc or sdk?) that can read a manifest and execute the api updates in order, storing intermediate results so that they can be referred to without doing further API lookups.
Sounds like such a thing would be useful outside of devstack as well. I brought this up briefly with Monty and he said he would explore it a bit on the SDK side of things. Does this seem like a reasonable approach? Anyone else have better ideas?
The big key here seems to be reusing authentication tokens and remembering resource ID data so that we can avoid unnecessary (and costly) lookups every time we want to modify a resource or associate resources.
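As a strawman, such a manifest runner could be as simple as the sketch below (purely illustrative, not an existing tool; 'ks' is a keystoneclient v3 client like in the earlier sketch, and the manifest format is made up):

import yaml

# Made-up manifest format, e.g.:
#   - resource: projects
#     args: {name: demo, domain: default}
#     register: demo_project
#   - resource: roles
#     action: grant
#     args: {role: $member_role, user: $demo_user, project: $demo_project}

def run_manifest(ks, path):
    created = {}  # intermediate results, so later steps need no API lookups

    def resolve(value):
        # "$demo_project" refers back to a resource created by an earlier step
        if isinstance(value, str) and value.startswith("$"):
            return created[value[1:]]
        return value

    with open(path) as f:
        steps = yaml.safe_load(f)
    for step in steps:
        manager = getattr(ks, step["resource"])              # e.g. ks.projects
        call = getattr(manager, step.get("action", "create"))
        result = call(**{k: resolve(v) for k, v in step["args"].items()})
        if "register" in step:
            created[step["register"]] = result
    return created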
[0] https://review.opendev.org/#/c/521146/ [1] https://review.opendev.org/#/c/614540/
[0] https://opendev.org/openstack/devstack/src/commit/6aeaceb0c4ef078d028fb6605c... [1] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output.... [2] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output.... [3] https://review.opendev.org/#/c/673108/ [4] http://logs.openstack.org/08/673108/6/check/devstack-xenial/a4107d0/job-outp...
Note the jobs compared above all ran on rax-dfw.
Clark
On Fri, Jul 26, 2019 at 04:53:28PM -0700, Clark Boylan wrote:
Given my change shows this can be so much quicker is there any interest in modifying devstack to be faster here? And if so what do we think an appropriate approach would be?
My first concern was whether anyone considered openstack-client setting these things up to actually be part of the testing. I'd say not; comments in [1] suggest similar views.
My second concern is that we keep sufficient track of complexity versus speed; obviously doing things in a sequential manner via a script is pretty simple to follow, and as we start putting things into scripts we make it harder to debug when a monoscript dies and you have to start pulling apart where it was. With just a little json fiddling we can currently pull good stats from logstash ([2]), so I think as we go it would be good to make sure we account for the time using appropriate wrappers, etc.
Then the third concern is not to break anything for plugins -- devstack has a very very loose API which basically relies on plugin authors using a combination of good taste and copying other code to decide what's internal or not.
Which made me start thinking I wonder if we look at this closely, even without replacing things we might make inroads?
For example [3]; it seems like SERVICE_DOMAIN_NAME is never not default, so the get_or_create_domain call is always just overhead (the result is never used).
Then it seems that in the gate, basically all of the "get_or_create" calls will really just be "create" calls, because we're always starting fresh. So we could cut out about half of the calls by pre-checking whether we know we're under zuul (proof-of-concept [4]).
Then we have blocks like:
get_or_add_user_project_role $member_role $demo_user $demo_project
get_or_add_user_project_role $admin_role $admin_user $demo_project
get_or_add_user_project_role $another_role $demo_user $demo_project
get_or_add_user_project_role $member_role $demo_user $invis_project
If we wrapped that in something like
start_osc_session ... end_osc_session
which sets a variable so that, instead of calling osc directly, those functions write their arguments to a tmp file. Then at the end, end_osc_session does
$ osc "$(< tmpfile)"
and uses the inbuilt batching? If that had half the calls by skipping the "get_or" bit, and used common authentication from batching, would that help?
And then I don't know if all the projects and groups are required for every devstack run? Maybe someone skilled in the art could do a bit of an audit and we could cut more of that out too?
So I guess my point is that maybe we could tweak what we have a bit to make some immediate wins, before anyone has to rewrite too much?
-i
[1] https://review.opendev.org/673018 [2] https://ethercalc.openstack.org/rzuhevxz7793 [3] https://review.opendev.org/673941 [4] https://review.opendev.org/673936
These jobs seem to time out from every provider on the regular[1], but the issue is surely more apparent with tempest on FN. The result is quite a bit of lost time: 361 jobs that each run for several hours add up to a little over 1,000 hours of lost cycles.
[1] http://logstash.openstack.org/#/dashboard/file/logstash.json?query=filename:...
On Thu, Aug 1, 2019 at 5:01 AM Ian Wienand iwienand@redhat.com wrote:
On Fri, Jul 26, 2019 at 04:53:28PM -0700, Clark Boylan wrote:
Given my change shows this can be so much quicker is there any interest in modifying devstack to be faster here? And if so what do we think an appropriate approach would be?
My first concern was whether anyone considered openstack-client setting these things up to actually be part of the testing. I'd say not; comments in [1] suggest similar views.
My second concern is that we keep sufficient track of complexity versus speed; obviously doing things in a sequential manner via a script is pretty simple to follow, and as we start putting things into scripts we make it harder to debug when a monoscript dies and you have to start pulling apart where it was. With just a little json fiddling we can currently pull good stats from logstash ([2]), so I think as we go it would be good to make sure we account for the time using appropriate wrappers, etc.
Then the third concern is not to break anything for plugins -- devstack has a very very loose API which basically relies on plugin authors using a combination of good taste and copying other code to decide what's internal or not.
Which made me start thinking I wonder if we look at this closely, even without replacing things we might make inroads?
For example [3]; it seems like SERVICE_DOMAIN_NAME is never not default, so the get_or_create_domain call is always just overhead (the result is never used).
Then it seems that in the gate, basically all of the "get_or_create" calls will really just be "create" calls, because we're always starting fresh. So we could cut out about half of the calls by pre-checking whether we know we're under zuul (proof-of-concept [4]).
Then we have blocks like:
get_or_add_user_project_role $member_role $demo_user $demo_project
get_or_add_user_project_role $admin_role $admin_user $demo_project
get_or_add_user_project_role $another_role $demo_user $demo_project
get_or_add_user_project_role $member_role $demo_user $invis_project
If we wrapped that in something like
start_osc_session ... end_osc_session
which sets a variable so that, instead of calling osc directly, those functions write their arguments to a tmp file. Then at the end, end_osc_session does
$ osc "$(< tmpfile)"
and uses the inbuilt batching? If that had half the calls by skipping the "get_or" bit, and used common authentication from batching, would that help?
And then I don't know if all the projects and groups are required for every devstack run? Maybe someone skilled in the art could do a bit of an audit and we could cut more of that out too?
So I guess my point is that maybe we could tweak what we have a bit to make some immediate wins, before anyone has to rewrite too much?
-i
[1] https://review.opendev.org/673018 [2] https://ethercalc.openstack.org/rzuhevxz7793 [3] https://review.opendev.org/673941 [4] https://review.opendev.org/673936
---- On Thu, 01 Aug 2019 17:58:18 +0900 Ian Wienand iwienand@redhat.com wrote ----
On Fri, Jul 26, 2019 at 04:53:28PM -0700, Clark Boylan wrote:
Given my change shows this can be so much quicker is there any interest in modifying devstack to be faster here? And if so what do we think an appropriate approach would be?
My first concern was whether anyone considered openstack-client setting these things up to actually be part of the testing. I'd say not; comments in [1] suggest similar views.
My second concern is that we keep sufficient track of complexity versus speed; obviously doing things in a sequential manner via a script is pretty simple to follow, and as we start putting things into scripts we make it harder to debug when a monoscript dies and you have to start pulling apart where it was. With just a little json fiddling we can currently pull good stats from logstash ([2]), so I think as we go it would be good to make sure we account for the time using appropriate wrappers, etc.
I agree with this concern about maintainability and debugging with scripts. Nowadays very few people have good knowledge of the devstack code, and debugging job-side failures is much harder for most developers. IMO maintainability and ease of debugging need to be the first priority.
If we wanted to replace OSC with something faster, the Tempest service clients come to mind. They call the APIs directly, though a token is requested for each API call. But that is something that needs a PoC, especially around how much speed it actually buys.
Then the third concern is not to break anything for plugins -- devstack has a very very loose API which basically relies on plugin authors using a combination of good taste and copying other code to decide what's internal or not.
Which made me start thinking I wonder if we look at this closely, even without replacing things we might make inroads?
For example [3]; it seems like SERVICE_DOMAIN_NAME is never not default, so the get_or_create_domain call is always just overhead (the result is never used).
Then it seems that in the gate, basically all of the "get_or_create" calls will really just be "create" calls, because we're always starting fresh. So we could cut out about half of the calls by pre-checking whether we know we're under zuul (proof-of-concept [4]).
Then we have blocks like:
get_or_add_user_project_role $member_role $demo_user $demo_project
get_or_add_user_project_role $admin_role $admin_user $demo_project
get_or_add_user_project_role $another_role $demo_user $demo_project
get_or_add_user_project_role $member_role $demo_user $invis_project
If we wrapped that in something like
start_osc_session ... end_osc_session
which sets a variable so that, instead of calling osc directly, those functions write their arguments to a tmp file. Then at the end, end_osc_session does
$ osc "$(< tmpfile)"
and uses the inbuilt batching? If that had half the calls by skipping the "get_or" bit, and used common authentication from batching, would that help?
And then I don't know if all the projects and groups are required for every devstack run? Maybe someone skilled in the art could do a bit of an audit and we could cut more of that out too?
Yeah, auditing for unused or unnecessary calls like that is a good idea. For example, in most places devstack needs just the resource ID or name or a few fields of the created resource, so a GET call that returns the complete set of resource fields might not be needed; for async calls we can make an exception and fetch the resource (e.g. 'addresses' on a server).
-gmann
So I guess my point is that maybe we could tweak what we have a bit to make some immediate wins, before anyone has to rewrite too much?
-i
[1] https://review.opendev.org/673018 [2] https://ethercalc.openstack.org/rzuhevxz7793 [3] https://review.opendev.org/673941 [4] https://review.opendev.org/673936
Just a reminder that there is also http://lists.openstack.org/pipermail/openstack-dev/2016-April/092546.html which was intended to address this same issue.
I toyed around with it a bit for TripleO installs back then and it did seem to speed things up, but at the time there was a bug in our client plugin where it was triggering a prompt for input that was problematic with the server running in the background. I never really got back to it once that was fixed. :-/
On 7/26/19 6:53 PM, Clark Boylan wrote:
Today I have been digging into devstack runtime costs to help Donny Davis understand why tempest jobs sometimes timeout on the FortNebula cloud. One thing I discovered was that the keystone user, group, project, role, and domain setup [0] can take many minutes [1][2] (in the examples here almost 5).
I've rewritten create_keystone_accounts to be a python tool [3] and get the runtime for that subset of setup from ~100s to ~9s [4]. I imagine that if we applied this to the other create_X_accounts functions we would see similar results.
I think this is so much faster because we avoid repeated costs in openstack client including: python process startup, pkg_resource disk scanning to find entrypoints, and needing to convert names to IDs via the API every time osc is run. Given my change shows this can be so much quicker is there any interest in modifying devstack to be faster here? And if so what do we think an appropriate approach would be?
[0] https://opendev.org/openstack/devstack/src/commit/6aeaceb0c4ef078d028fb6605c... [1] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output.... [2] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output.... [3] https://review.opendev.org/#/c/673108/ [4] http://logs.openstack.org/08/673108/6/check/devstack-xenial/a4107d0/job-outp...
Note the jobs compared above all ran on rax-dfw.
Clark
On Tue, Aug 6, 2019, at 8:26 AM, Ben Nemec wrote:
Just a reminder that there is also http://lists.openstack.org/pipermail/openstack-dev/2016-April/092546.html which was intended to address this same issue.
I toyed around with it a bit for TripleO installs back then and it did seem to speed things up, but at the time there was a bug in our client plugin where it was triggering a prompt for input that was problematic with the server running in the background. I never really got back to it once that was fixed. :-/
I'm not tied to any particular implementation. Mostly I wanted to show that we can take this ~5 minute portion of devstack and turn it into a 15 second portion of devstack by improving our use of the service APIs (and possibly even further if we apply it to all of the api interaction). Any idea how difficult it would be to get your client as a service stuff running in devstack again?
I do not think we should make a one off change like I've done in my POC. That will just end up being harder to understand and debug in the future since it will be different than all of the other API interaction. I like the idea of a manifest or feeding a longer lived process api update commands as we can then avoid requesting new tokens as well as pkg_resource startup time. Such a system could be used by all of devstack as well (avoiding the "this bit is special" problem).
Is there any interest from the QA team in committing to an approach and working to do a conversion? I don't want to commit any more time to this myself unless there is strong interest in getting changes merged (as I expect it will be a slow process weeding out places where we've made bad assumptions particularly around plugins).
One of the things I found was that using names with osc results in name to id lookups as well. We can avoid these entirely if we remember name to id mappings instead (which my POC does). Any idea if your osc as a service tool does or can do that? Probably have to be more careful for scoping things in a tool like that as it may be reused by people with name collisions across projects/users/groups/domains.
On 7/26/19 6:53 PM, Clark Boylan wrote:
Today I have been digging into devstack runtime costs to help Donny Davis understand why tempest jobs sometimes timeout on the FortNebula cloud. One thing I discovered was that the keystone user, group, project, role, and domain setup [0] can take many minutes [1][2] (in the examples here almost 5).
I've rewritten create_keystone_accounts to be a python tool [3] and get the runtime for that subset of setup from ~100s to ~9s [4]. I imagine that if we applied this to the other create_X_accounts functions we would see similar results.
I think this is so much faster because we avoid repeated costs in openstack client including: python process startup, pkg_resource disk scanning to find entrypoints, and needing to convert names to IDs via the API every time osc is run. Given my change shows this can be so much quicker is there any interest in modifying devstack to be faster here? And if so what do we think an appropriate approach would be?
[0] https://opendev.org/openstack/devstack/src/commit/6aeaceb0c4ef078d028fb6605c... [1] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output.... [2] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output.... [3] https://review.opendev.org/#/c/673108/ [4] http://logs.openstack.org/08/673108/6/check/devstack-xenial/a4107d0/job-outp...
Note the jobs compared above all ran on rax-dfw.
Clark
On 2019-08-06 08:49:17 -0700 (-0700), Clark Boylan wrote: [...]
One of the things I found was that using names with osc results in name to id lookups as well. We can avoid these entirely if we remember name to id mappings instead (which my POC does). Any idea if your osc as a service tool does or can do that? Probably have to be more careful for scoping things in a tool like that as it may be reused by people with name collisions across projects/users/groups/domains.
[...]
Out of curiosity, could OSC/SDK cache those relationships so they're only looked up once (or at least infrequently)? I guess there are cache invalidation concerns if an entity is deleted and another created out-of-band using the same name, but if it's all done through the same persistent daemon then that's less of a risk right?
On Tue, 6 Aug 2019, Jeremy Stanley wrote:
On 2019-08-06 08:49:17 -0700 (-0700), Clark Boylan wrote: [...]
One of the things I found was that using names with osc results in name to id lookups as well. We can avoid these entirely if we remember name to id mappings instead (which my POC does). Any idea if your osc as a service tool does or can do that? Probably have to be more careful for scoping things in a tool like that as it may be reused by people with name collisions across projects/users/groups/domains.
[...]
Out of curiosity, could OSC/SDK cache those relationships so they're only looked up once (or at least infrequently)? I guess there are cache invalidation concerns if an entity is deleted and another created out-of-band using the same name, but if it's all done through the same persistent daemon then that's less of a risk right?
If we are in a situation where name to id and id to name translations are slow at the services' API layer, isn't that a really big bug? One where the fixing is beneficial to everyone, including devstack users?
(Yes, I'm aware of TCP overhead and all that, but I reckon that's way down on the list of contributing factors here?)
On Tue, Aug 6, 2019 at 11:42 AM Chris Dent cdent+os@anticdent.org wrote:
If we are in a situation where name to id and id to name translations are slow at the services' API layer, isn't that a really big bug? One where the fixing is beneficial to everyone, including devstack users?
While the name->ID lookup is an additional API round trip, it does not cause an additional python startup scan, which is the major killer here. In fact, it is possible that there is more than one lookup and that at least one will always be done because we do not know if that value is a name or an ID. The GET is done in any case because nearly every time (in non-create operations) we probably want the full object anyway.
I also played with starting OSC as a background process a while back, it actually does work pretty well and with a bit more error handling would have been good enough(tm)[0]. The major concern with it then was it was not representative of how people actually use OSC and changed the testing value we get from doing that.
dt
[0] Basically run interactive mode in background, plumb up stdin/stdout to some descriptors and off to the races.
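In Python terms the background approach is roughly the sketch below (error handling and output parsing deliberately left out): one `openstack` process is started, paying the interpreter/pkg_resources/token cost once, and commands are fed to its interactive mode over stdin.

import subprocess

class OSCSession:
    def __init__(self):
        # Start one interactive `openstack` process and keep it around.
        self.proc = subprocess.Popen(["openstack"],
                                     stdin=subprocess.PIPE,
                                     stdout=subprocess.PIPE,
                                     text=True)

    def run(self, command):
        # Interactive-mode commands omit the leading "openstack".
        self.proc.stdin.write(command + "\n")
        self.proc.stdin.flush()

    def close(self):
        self.proc.stdin.close()
        output = self.proc.stdout.read()
        self.proc.wait()
        return output

osc = OSCSession()
osc.run("role add --user demo --project demo member")
osc.run("role add --user admin --project demo admin")
print(osc.close())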
On 2019-08-06 14:44:36 -0500 (-0500), Dean Troyer wrote: [...]
The major concern with it then was it was not representative of how people actually use OSC and changed the testing value we get from doing that.
[...]
In an ideal world, OSC would have explicit functional testing independent of the side effect of calling it when standing up DevStack.
On Aug 6, 2019, at 3:44 PM, Dean Troyer dtroyer@gmail.com wrote:
On Tue, Aug 6, 2019 at 11:42 AM Chris Dent cdent+os@anticdent.org wrote:
If we are in a situation where name to id and id to name translations are slow at the services' API layer, isn't that a really big bug? One where the fixing is beneficial to everyone, including devstack users?
While the name->ID lookup is an additional API round trip, it does not cause an additional python startup scan, which is the major killer here. In fact, it is possible that there is more than one lookup and that at least one will always be done because we do not know if that value is a name or an ID. The GET is done in any case because nearly every time (in non-create operations) we probably want the full object anyway.
I also played with starting OSC as a background process a while back, it actually does work pretty well and with a bit more error handling would have been good enough(tm)[0]. The major concern with it then was it was not representative of how people actually use OSC and changed the testing value we get from doing that.
dt
[0] Basically run interactive mode in background, plumb up stdin/stdout to some descriptors and off to the races.
-- Dean Troyer dtroyer@gmail.com
I made some notes about the plugin lookup issue a while back [1] and I looked at that again the most recent time we were in Denver [2], and came to the conclusion that the implementation was going to require more changes in osc-lib than I was going to have time to figure out on my own. Unfortunately, it’s not a simple matter of choosing between looking at 1 internal cache or doing the pkg_resource scan because of the plugin version management layer osc-lib added.
In any case, I think we’ve discussed the fact many times that the way to fix this is to not scan for plugins unless we have to do so. We just need someone to sit down and work on figuring out how to make that work.
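The general shape of "don't scan unless we have to" could look something like the sketch below (illustrative only; the entry point group name is an assumption, and osc's real plugin discovery goes through cliff/osc-lib and is more involved than this):

from importlib.metadata import entry_points  # select API needs Python 3.10+

def load_command(name, group="openstack.cli.extension"):
    # Look at a single entry point group and import only the plugin that
    # was actually asked for, instead of loading everything at startup.
    for ep in entry_points(group=group):
        if ep.name == name:
            return ep.load()
    raise LookupError("no entry point %r in group %r" % (name, group))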
Doug
[1] https://etherpad.openstack.org/p/mFsAgTZggf [2] https://etherpad.openstack.org/p/train-ptg-osc
On Tue, Aug 6, 2019, at 9:17 AM, Jeremy Stanley wrote:
On 2019-08-06 08:49:17 -0700 (-0700), Clark Boylan wrote: [...]
One of the things I found was that using names with osc results in name to id lookups as well. We can avoid these entirely if we remember name to id mappings instead (which my POC does). Any idea if your osc as a service tool does or can do that? Probably have to be more careful for scoping things in a tool like that as it may be reused by people with name collisions across projects/users/groups/domains.
[...]
Out of curiosity, could OSC/SDK cache those relationships so they're only looked up once (or at least infrequently)? I guess there are cache invalidation concerns if an entity is deleted and another created out-of-band using the same name, but if it's all done through the same persistent daemon then that's less of a risk right?
They could cache these things too. The concern is a valid one too; however, a relatively short TTL may address that as these resources tend to all be used near each other. For example create a router, network, subnet in neutron or a user, role, group/domain in keystone.
That said I think a bigger win would be caching tokens if we want to make changes to caching for osc (I think it can cache tokens but we don't set it up properly in devstack?) Every invocation of osc first hits the pkg_resources cost, then hits the catalog and token lookup costs, then does name to id translations, then does the actual thing you requested. Addressing the first two upfront costs likely has a bigger impact than name to id translations.
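For the name-to-ID side, a small TTL cache along these lines would probably be enough (a sketch only, nothing osc-specific; 'ks' is assumed to be a keystoneclient v3 client, and the key scheme is just one way to avoid collisions across domains):

import time

class TTLCache:
    def __init__(self, ttl=60):
        self.ttl = ttl
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry and time.monotonic() - entry[1] < self.ttl:
            return entry[0]
        return None

    def set(self, key, value):
        self._data[key] = (value, time.monotonic())

ids = TTLCache(ttl=60)

def project_id(ks, name, domain="default"):
    # Only hit the API if we have not resolved this name recently.
    key = ("project", domain, name)
    cached = ids.get(key)
    if cached:
        return cached
    project = ks.projects.list(domain=domain, name=name)[0]  # assumes one match
    ids.set(key, project.id)
    return project.id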
Clark
On 8/6/19 10:49 AM, Clark Boylan wrote:
On Tue, Aug 6, 2019, at 8:26 AM, Ben Nemec wrote:
Just a reminder that there is also http://lists.openstack.org/pipermail/openstack-dev/2016-April/092546.html which was intended to address this same issue.
I toyed around with it a bit for TripleO installs back then and it did seem to speed things up, but at the time there was a bug in our client plugin where it was triggering a prompt for input that was problematic with the server running in the background. I never really got back to it once that was fixed. :-/
I'm not tied to any particular implementation. Mostly I wanted to show that we can take this ~5 minute portion of devstack and turn it into a 15 second portion of devstack by improving our use of the service APIs (and possibly even further if we apply it to all of the api interaction). Any idea how difficult it would be to get your client as a service stuff running in devstack again?
I wish I could take credit, but this is actually Dan Berrange's work. :-)
I do not think we should make a one off change like I've done in my POC. That will just end up being harder to understand and debug in the future since it will be different than all of the other API interaction. I like the idea of a manifest or feeding a longer lived process api update commands as we can then avoid requesting new tokens as well as pkg_resource startup time. Such a system could be used by all of devstack as well (avoiding the "this bit is special" problem).
Is there any interest from the QA team in committing to an approach and working to do a conversion? I don't want to commit any more time to this myself unless there is strong interest in getting changes merged (as I expect it will be a slow process weeding out places where we've made bad assumptions particularly around plugins).
One of the things I found was that using names with osc results in name to id lookups as well. We can avoid these entirely if we remember name to id mappings instead (which my POC does). Any idea if your osc as a service tool does or can do that? Probably have to be more careful for scoping things in a tool like that as it may be reused by people with name collisions across projects/users/groups/domains.
I don't believe this would handle name to id mapping. It's a very thin wrapper around the regular client code that just makes it persistent so we don't pay the startup costs every call. On the plus side that means it basically works like the vanilla client, on the minus side that means it may not provide as much improvement as a more targeted solution.
IIRC it's pretty easy to use, so I can try it out again and make sure it still works and still provides a performance benefit.
On 7/26/19 6:53 PM, Clark Boylan wrote:
Today I have been digging into devstack runtime costs to help Donny Davis understand why tempest jobs sometimes timeout on the FortNebula cloud. One thing I discovered was that the keystone user, group, project, role, and domain setup [0] can take many minutes [1][2] (in the examples here almost 5).
I've rewritten create_keystone_accounts to be a python tool [3] and get the runtime for that subset of setup from ~100s to ~9s [4]. I imagine that if we applied this to the other create_X_accounts functions we would see similar results.
I think this is so much faster because we avoid repeated costs in openstack client including: python process startup, pkg_resource disk scanning to find entrypoints, and needing to convert names to IDs via the API every time osc is run. Given my change shows this can be so much quicker is there any interest in modifying devstack to be faster here? And if so what do we think an appropriate approach would be?
[0] https://opendev.org/openstack/devstack/src/commit/6aeaceb0c4ef078d028fb6605c... [1] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output.... [2] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output.... [3] https://review.opendev.org/#/c/673108/ [4] http://logs.openstack.org/08/673108/6/check/devstack-xenial/a4107d0/job-outp...
Note the jobs compared above all ran on rax-dfw.
Clark
On 8/6/19 11:34 AM, Ben Nemec wrote:
On 8/6/19 10:49 AM, Clark Boylan wrote:
On Tue, Aug 6, 2019, at 8:26 AM, Ben Nemec wrote:
Just a reminder that there is also http://lists.openstack.org/pipermail/openstack-dev/2016-April/092546.html
which was intended to address this same issue.
I toyed around with it a bit for TripleO installs back then and it did seem to speed things up, but at the time there was a bug in our client plugin where it was triggering a prompt for input that was problematic with the server running in the background. I never really got back to it once that was fixed. :-/
I'm not tied to any particular implementation. Mostly I wanted to show that we can take this ~5 minute portion of devstack and turn it into a 15 second portion of devstack by improving our use of the service APIs (and possibly even further if we apply it to all of the api interaction). Any idea how difficult it would be to get your client as a service stuff running in devstack again?
I wish I could take credit, but this is actually Dan Berrange's work. :-)
I do not think we should make a one off change like I've done in my POC. That will just end up being harder to understand and debug in the future since it will be different than all of the other API interaction. I like the idea of a manifest or feeding a longer lived process api update commands as we can then avoid requesting new tokens as well as pkg_resource startup time. Such a system could be used by all of devstack as well (avoiding the "this bit is special" problem).
Is there any interest from the QA team in committing to an approach and working to do a conversion? I don't want to commit any more time to this myself unless there is strong interest in getting changes merged (as I expect it will be a slow process weeding out places where we've made bad assumptions particularly around plugins).
One of the things I found was that using names with osc results in name to id lookups as well. We can avoid these entirely if we remember name to id mappings instead (which my POC does). Any idea if your osc as a service tool does or can do that? Probably have to be more careful for scoping things in a tool like that as it may be reused by people with name collisions across projects/users/groups/domains.
I don't believe this would handle name to id mapping. It's a very thin wrapper around the regular client code that just makes it persistent so we don't pay the startup costs every call. On the plus side that means it basically works like the vanilla client, on the minus side that means it may not provide as much improvement as a more targeted solution.
IIRC it's pretty easy to use, so I can try it out again and make sure it still works and still provides a performance benefit.
It still works and it still helps. Using the osc service cut about 3 minutes off my 21 minute devstack run. Subjectively I would say that most of the time was being spent cloning and installing services and their deps.
I guess the downside is that working around the OSC slowness in CI will reduce developer motivation to fix the problem, which affects all users too. Then again, this has been a problem for years and no one has fixed it, so apparently that isn't a big enough lever to get things moving anyway. :-/
On 7/26/19 6:53 PM, Clark Boylan wrote:
Today I have been digging into devstack runtime costs to help Donny Davis understand why tempest jobs sometimes timeout on the FortNebula cloud. One thing I discovered was that the keystone user, group, project, role, and domain setup [0] can take many minutes [1][2] (in the examples here almost 5).
I've rewritten create_keystone_accounts to be a python tool [3] and get the runtime for that subset of setup from ~100s to ~9s [4]. I imagine that if we applied this to the other create_X_accounts functions we would see similar results.
I think this is so much faster because we avoid repeated costs in openstack client including: python process startup, pkg_resource disk scanning to find entrypoints, and needing to convert names to IDs via the API every time osc is run. Given my change shows this can be so much quicker is there any interest in modifying devstack to be faster here? And if so what do we think an appropriate approach would be?
[0] https://opendev.org/openstack/devstack/src/commit/6aeaceb0c4ef078d028fb6605c...
[1] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output....
[2] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output....
[3] https://review.opendev.org/#/c/673108/ [4] http://logs.openstack.org/08/673108/6/check/devstack-xenial/a4107d0/job-outp...
Note the jobs compared above all ran on rax-dfw.
Clark
On Wed, 2019-08-07 at 08:33 -0500, Ben Nemec wrote:
On 8/6/19 11:34 AM, Ben Nemec wrote:
On 8/6/19 10:49 AM, Clark Boylan wrote:
On Tue, Aug 6, 2019, at 8:26 AM, Ben Nemec wrote:
Just a reminder that there is also http://lists.openstack.org/pipermail/openstack-dev/2016-April/092546.html
which was intended to address this same issue.
I toyed around with it a bit for TripleO installs back then and it did seem to speed things up, but at the time there was a bug in our client plugin where it was triggering a prompt for input that was problematic with the server running in the background. I never really got back to it once that was fixed. :-/
I'm not tied to any particular implementation. Mostly I wanted to show that we can take this ~5 minute portion of devstack and turn it into a 15 second portion of devstack by improving our use of the service APIs (and possibly even further if we apply it to all of the api interaction). Any idea how difficult it would be to get your client as a service stuff running in devstack again?
I wish I could take credit, but this is actually Dan Berrange's work. :-)
I do not think we should make a one off change like I've done in my POC. That will just end up being harder to understand and debug in the future since it will be different than all of the other API interaction. I like the idea of a manifest or feeding a longer lived process api update commands as we can then avoid requesting new tokens as well as pkg_resource startup time. Such a system could be used by all of devstack as well (avoiding the "this bit is special" problem).
Is there any interest from the QA team in committing to an approach and working to do a conversion? I don't want to commit any more time to this myself unless there is strong interest in getting changes merged (as I expect it will be a slow process weeding out places where we've made bad assumptions particularly around plugins).
One of the things I found was that using names with osc results in name to id lookups as well. We can avoid these entirely if we remember name to id mappings instead (which my POC does). Any idea if your osc as a service tool does or can do that? Probably have to be more careful for scoping things in a tool like that as it may be reused by people with name collisions across projects/users/groups/domains.
I don't believe this would handle name to id mapping. It's a very thin wrapper around the regular client code that just makes it persistent so we don't pay the startup costs every call. On the plus side that means it basically works like the vanilla client, on the minus side that means it may not provide as much improvement as a more targeted solution.
IIRC it's pretty easy to use, so I can try it out again and make sure it still works and still provides a performance benefit.
It still works and it still helps. Using the osc service cut about 3 minutes off my 21 minute devstack run. Subjectively I would say that most of the time was being spent cloning and installing services and their deps.
I guess the downside is that working around the OSC slowness in CI will reduce developer motivation to fix the problem, which affects all users too. Then again, this has been a problem for years and no one has fixed it, so apparently that isn't a big enough lever to get things moving anyway. :-/
Using osc directly, I don't think the slowness is really perceptible from a human standpoint, but it adds up in a CI run. There are larger problems with gate slowness than fixing osc will solve, but every little helps. I do agree, however, that the gate is not a big enough motivator for people to fix osc slowness: we can wait hours in some cases for jobs to start, so 3 minutes is not really a concern from a latency perspective. But if we saved 3 minutes on every run, that might in aggregate reduce the latency problems we have.
On 7/26/19 6:53 PM, Clark Boylan wrote:
Today I have been digging into devstack runtime costs to help Donny Davis understand why tempest jobs sometimes timeout on the FortNebula cloud. One thing I discovered was that the keystone user, group, project, role, and domain setup [0] can take many minutes [1][2] (in the examples here almost 5).
I've rewritten create_keystone_accounts to be a python tool [3] and get the runtime for that subset of setup from ~100s to ~9s [4]. I imagine that if we applied this to the other create_X_accounts functions we would see similar results.
I think this is so much faster because we avoid repeated costs in openstack client including: python process startup, pkg_resource disk scanning to find entrypoints, and needing to convert names to IDs via the API every time osc is run. Given my change shows this can be so much quicker is there any interest in modifying devstack to be faster here? And if so what do we think an appropriate approach would be?
[0] https://opendev.org/openstack/devstack/src/commit/6aeaceb0c4ef078d028fb6605c...
[1] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output....
[2] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output....
[3] https://review.opendev.org/#/c/673108/
[4] http://logs.openstack.org/08/673108/6/check/devstack-xenial/a4107d0/job-outp...
Note the jobs compared above all ran on rax-dfw.
Clark
On 8/7/19 9:37 AM, Sean Mooney wrote:
On Wed, 2019-08-07 at 08:33 -0500, Ben Nemec wrote:
On 8/6/19 11:34 AM, Ben Nemec wrote:
On 8/6/19 10:49 AM, Clark Boylan wrote:
On Tue, Aug 6, 2019, at 8:26 AM, Ben Nemec wrote:
Just a reminder that there is also http://lists.openstack.org/pipermail/openstack-dev/2016-April/092546.html
which was intended to address this same issue.
I toyed around with it a bit for TripleO installs back then and it did seem to speed things up, but at the time there was a bug in our client plugin where it was triggering a prompt for input that was problematic with the server running in the background. I never really got back to it once that was fixed. :-/
I'm not tied to any particular implementation. Mostly I wanted to show that we can take this ~5 minute portion of devstack and turn it into a 15 second portion of devstack by improving our use of the service APIs (and possibly even further if we apply it to all of the api interaction). Any idea how difficult it would be to get your client as a service stuff running in devstack again?
I wish I could take credit, but this is actually Dan Berrange's work. :-)
I do not think we should make a one off change like I've done in my POC. That will just end up being harder to understand and debug in the future since it will be different than all of the other API interaction. I like the idea of a manifest or feeding a longer lived process api update commands as we can then avoid requesting new tokens as well as pkg_resource startup time. Such a system could be used by all of devstack as well (avoiding the "this bit is special" problem).
Is there any interest from the QA team in committing to an approach and working to do a conversion? I don't want to commit any more time to this myself unless there is strong interest in getting changes merged (as I expect it will be a slow process weeding out places where we've made bad assumptions particularly around plugins).
One of the things I found was that using names with osc results in name to id lookups as well. We can avoid these entirely if we remember name to id mappings instead (which my POC does). Any idea if your osc as a service tool does or can do that? Probably have to be more careful for scoping things in a tool like that as it may be reused by people with name collisions across projects/users/groups/domains.
I don't believe this would handle name to id mapping. It's a very thin wrapper around the regular client code that just makes it persistent so we don't pay the startup costs every call. On the plus side that means it basically works like the vanilla client, on the minus side that means it may not provide as much improvement as a more targeted solution.
IIRC it's pretty easy to use, so I can try it out again and make sure it still works and still provides a performance benefit.
It still works and it still helps. Using the osc service cut about 3 minutes off my 21 minute devstack run. Subjectively I would say that most of the time was being spent cloning and installing services and their deps.
I guess the downside is that working around the OSC slowness in CI will reduce developer motivation to fix the problem, which affects all users too. Then again, this has been a problem for years and no one has fixed it, so apparently that isn't a big enough lever to get things moving anyway. :-/
Using osc directly, I don't think the slowness is really perceptible from a human standpoint, but it adds up in a CI run. There are larger problems with gate slowness than fixing osc will solve, but every little helps. I do agree, however, that the gate is not a big enough motivator for people to fix osc slowness: we can wait hours in some cases for jobs to start, so 3 minutes is not really a concern from a latency perspective. But if we saved 3 minutes on every run, that might in aggregate reduce the latency problems we have.
I find the slowness very noticeable in interactive use. It adds something like 2 seconds to a basic call like image list that returns almost instantly in the OSC interactive shell where there is no startup overhead. From my performance days, any latency over 1 second was considered unacceptable for an interactive call. The interactive shell does help with that if I'm doing a bunch of calls in a row though.
That said, you're right that 3 minutes multiplied by the number of jobs we run per day is significant. Picking 1000 as a round number (and I'm pretty sure we run a _lot_ more than that per day), a 3 minute decrease in runtime per job would save about 50 hours of CI time in total. Small things add up at scale. :-)
On 7/26/19 6:53 PM, Clark Boylan wrote:
Today I have been digging into devstack runtime costs to help Donny Davis understand why tempest jobs sometimes timeout on the FortNebula cloud. One thing I discovered was that the keystone user, group, project, role, and domain setup [0] can take many minutes [1][2] (in the examples here almost 5).
I've rewritten create_keystone_accounts to be a python tool [3] and get the runtime for that subset of setup from ~100s to ~9s [4]. I imagine that if we applied this to the other create_X_accounts functions we would see similar results.
I think this is so much faster because we avoid repeated costs in openstack client including: python process startup, pkg_resource disk scanning to find entrypoints, and needing to convert names to IDs via the API every time osc is run. Given my change shows this can be so much quicker is there any interest in modifying devstack to be faster here? And if so what do we think an appropriate approach would be?
[0] https://opendev.org/openstack/devstack/src/commit/6aeaceb0c4ef078d028fb6605c...
[1] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output....
[2] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output....
[3] https://review.opendev.org/#/c/673108/
[4] http://logs.openstack.org/08/673108/6/check/devstack-xenial/a4107d0/job-outp...
Note the jobs compared above all ran on rax-dfw.
Clark
On Wed, 2019-08-07 at 10:11 -0500, Ben Nemec wrote:
On 8/7/19 9:37 AM, Sean Mooney wrote:
On Wed, 2019-08-07 at 08:33 -0500, Ben Nemec wrote:
On 8/6/19 11:34 AM, Ben Nemec wrote:
On 8/6/19 10:49 AM, Clark Boylan wrote:
On Tue, Aug 6, 2019, at 8:26 AM, Ben Nemec wrote:
Just a reminder that there is also http://lists.openstack.org/pipermail/openstack-dev/2016-April/092546.html
which was intended to address this same issue.
I toyed around with it a bit for TripleO installs back then and it did seem to speed things up, but at the time there was a bug in our client plugin where it was triggering a prompt for input that was problematic with the server running in the background. I never really got back to it once that was fixed. :-/
I'm not tied to any particular implementation. Mostly I wanted to show that we can take this ~5 minute portion of devstack and turn it into a 15 second portion of devstack by improving our use of the service APIs (and possibly even further if we apply it to all of the api interaction). Any idea how difficult it would be to get your client as a service stuff running in devstack again?
I wish I could take credit, but this is actually Dan Berrange's work. :-)
I do not think we should make a one off change like I've done in my POC. That will just end up being harder to understand and debug in the future since it will be different than all of the other API interaction. I like the idea of a manifest or feeding a longer lived process api update commands as we can then avoid requesting new tokens as well as pkg_resource startup time. Such a system could be used by all of devstack as well (avoiding the "this bit is special" problem).
Is there any interest from the QA team in committing to an approach and working to do a conversion? I don't want to commit any more time to this myself unless there is strong interest in getting changes merged (as I expect it will be a slow process weeding out places where we've made bad assumptions particularly around plugins).
One of the things I found was that using names with osc results in name to id lookups as well. We can avoid these entirely if we remember name to id mappings instead (which my POC does). Any idea if your osc as a service tool does or can do that? Probably have to be more careful for scoping things in a tool like that as it may be reused by people with name collisions across projects/users/groups/domains.
I don't believe this would handle name to id mapping. It's a very thin wrapper around the regular client code that just makes it persistent so we don't pay the startup costs every call. On the plus side that means it basically works like the vanilla client, on the minus side that means it may not provide as much improvement as a more targeted solution.
IIRC it's pretty easy to use, so I can try it out again and make sure it still works and still provides a performance benefit.
It still works and it still helps. Using the osc service cut about 3 minutes off my 21 minute devstack run. Subjectively I would say that most of the time was being spent cloning and installing services and their deps.
I guess the downside is that working around the OSC slowness in CI will reduce developer motivation to fix the problem, which affects all users too. Then again, this has been a problem for years and no one has fixed it, so apparently that isn't a big enough lever to get things moving anyway. :-/
Using osc directly, I don't think the slowness is really perceptible from a human standpoint, but it adds up in a CI run. There are larger problems with gate slowness than fixing osc will solve, but every little helps. I do agree, however, that the gate is not a big enough motivator for people to fix osc slowness: we can wait hours in some cases for jobs to start, so 3 minutes is not really a concern from a latency perspective. But if we saved 3 minutes on every run, that might in aggregate reduce the latency problems we have.
I find the slowness very noticeable in interactive use. It adds something like 2 seconds to a basic call like image list that returns almost instantly in the OSC interactive shell where there is no startup overhead. From my performance days, any latency over 1 second was considered unacceptable for an interactive call. The interactive shell does help with that if I'm doing a bunch of calls in a row though.
Well, that was kind of my point: when we write scripts we invoke it over and over again. If I need to use osc to run lots of commands for some reason, I generally enter the interactive mode. The interactive mode already masks the pain, so any time it has bothered me in the past I have just ended up using it instead.
It's been a long time since I looked at this, but I think there were two reasons it is slow on startup: one is the need to get a token for each request, and the other was related to the way we scan for plugins. I honestly don't know if either has improved, but the interactive shell eliminates both as issues.
That said, you're right that 3 minutes multiplied by the number of jobs we run per day is significant. Picking 1000 as a round number (and I'm pretty sure we run a _lot_ more than that per day), a 3 minute decrease in runtime per job would save about 50 hours of CI time in total. Small things add up at scale. :-)
Yep, it definitely does.
On 7/26/19 6:53 PM, Clark Boylan wrote:
Today I have been digging into devstack runtime costs to help Donny Davis understand why tempest jobs sometimes timeout on the FortNebula cloud. One thing I discovered was that the keystone user, group, project, role, and domain setup [0] can take many minutes [1][2] (in the examples here almost 5).
I've rewritten create_keystone_accounts to be a python tool [3] and get the runtime for that subset of setup from ~100s to ~9s [4]. I imagine that if we applied this to the other create_X_accounts functions we would see similar results.
I think this is so much faster because we avoid repeated costs in openstack client including: python process startup, pkg_resource disk scanning to find entrypoints, and needing to convert names to IDs via the API every time osc is run. Given my change shows this can be so much quicker is there any interest in modifying devstack to be faster here? And if so what do we think an appropriate approach would be?
[0] https://opendev.org/openstack/devstack/src/commit/6aeaceb0c4ef078d028fb6605c...
[1] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output....
[2] http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output....
[3] https://review.opendev.org/#/c/673108/
[4] http://logs.openstack.org/08/673108/6/check/devstack-xenial/a4107d0/job-outp...
Note the jobs compared above all ran on rax-dfw.
Clark
Just for reference, FortNebula does 73-80 jobs an hour, so that's 1700(ish) jobs a day at -3 minutes per job.
That is 5200(ish) cycle minutes a day, or about three and a half days' worth of computing time.
If there can be a fix that saves minutes, it's surely worth it.
On Wed, Aug 7, 2019 at 1:18 PM Sean Mooney smooney@redhat.com wrote:
On Wed, 2019-08-07 at 10:11 -0500, Ben Nemec wrote:
On 8/7/19 9:37 AM, Sean Mooney wrote:
On Wed, 2019-08-07 at 08:33 -0500, Ben Nemec wrote:
On 8/6/19 11:34 AM, Ben Nemec wrote:
On 8/6/19 10:49 AM, Clark Boylan wrote:
On Tue, Aug 6, 2019, at 8:26 AM, Ben Nemec wrote:
Just a reminder that there is also
http://lists.openstack.org/pipermail/openstack-dev/2016-April/092546.html
which was intended to address this same issue.
I toyed around with it a bit for TripleO installs back then and it did seem to speed things up, but at the time there was a bug in our client plugin where it was triggering a prompt for input that was problematic with the server running in the background. I never really got back to it once that was fixed. :-/
I'm not tied to any particular implementation. Mostly I wanted to show that we can take this ~5 minute portion of devstack and turn it into a 15 second portion of devstack by improving our use of the service APIs (and possibly even further if we apply it to all of the api interaction). Any idea how difficult it would be to get your client as a service stuff running in devstack again?
I wish I could take credit, but this is actually Dan Berrange's work. :-)
I do not think we should make a one off change like I've done in my POC. That will just end up being harder to understand and debug in the future since it will be different than all of the other API interaction. I like the idea of a manifest or feeding a longer lived process api update commands as we can then avoid requesting new tokens as well as pkg_resource startup time. Such a system could be used by all of devstack as well (avoiding the "this bit is special" problem).
Is there any interest from the QA team in committing to an approach and working to do a conversion? I don't want to commit any more time to this myself unless there is strong interest in getting changes merged (as I expect it will be a slow process weeding out places where we've made bad assumptions particularly around plugins).
One of the things I found was that using names with osc results in name to id lookups as well. We can avoid these entirely if we remember name to id mappings instead (which my POC does). Any idea if your osc as a service tool does or can do that? Probably have to be more careful for scoping things in a tool like that as it may be reused by people with name collisions across projects/users/groups/domains.
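(The following is a minimal sketch, not Clark's actual tool [3], of what the single-process, name-to-ID-caching approach described above might look like with openstacksdk; the cloud name, account names, password, and role name are made up for illustration.)

    import openstack

    conn = openstack.connect(cloud="devstack-admin")  # assumed clouds.yaml entry
    ids = {}  # (kind, name) -> id, so we never resolve names via the API again

    def create_project(name, domain_id):
        project = conn.identity.create_project(name=name, domain_id=domain_id)
        ids[("project", name)] = project.id
        return project.id

    def create_user(name, project_id, password, domain_id):
        user = conn.identity.create_user(
            name=name, default_project_id=project_id,
            password=password, domain_id=domain_id)
        ids[("user", name)] = user.id
        return user.id

    domain_id = conn.identity.find_domain("default").id
    project_id = create_project("demo", domain_id)      # illustrative names,
    user_id = create_user("demo", project_id, "secret", domain_id)  # not devstack's real layout

    role = conn.identity.find_role("member")  # may be "Member" on older clouds
    conn.identity.assign_project_role_to_user(project_id, user_id, role.id)

Everything happens in one process with one token, and later steps use the cached IDs instead of asking keystone to resolve names again.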
I don't believe this would handle name to id mapping. It's a very thin wrapper around the regular client code that just makes it persistent so we don't pay the startup costs every call. On the plus side that means it basically works like the vanilla client, on the minus side that means it may not provide as much improvement as a more targeted solution.
IIRC it's pretty easy to use, so I can try it out again and make sure it still works and still provides a performance benefit.
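(For what it's worth, a rough sketch of the persistent-wrapper idea being described here, not the actual client-as-a-service tool: keep one process alive so interpreter startup and plugin scanning happen once, and feed command lines to the normal osc entry point over a local socket. The socket path and line protocol are invented for illustration.)

    import shlex
    import socketserver

    from openstackclient import shell  # the regular osc entry point

    SOCKET_PATH = "/tmp/osc-service.sock"  # invented path for this sketch

    class Handler(socketserver.StreamRequestHandler):
        def handle(self):
            # One request = one command line, e.g. "project list -f value".
            line = self.rfile.readline().decode().strip()
            rc = shell.main(shlex.split(line))  # dispatch without re-importing anything
            self.wfile.write("exit {}\n".format(rc).encode())

    if __name__ == "__main__":
        with socketserver.UnixStreamServer(SOCKET_PATH, Handler) as server:
            server.serve_forever()

Each call would still authenticate, so a wrapper like this mainly saves the startup and entry-point scanning cost, which matches the "thin wrapper" description above.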
It still works and it still helps. Using the osc service cut about 3 minutes off my 21 minute devstack run. Subjectively I would say that most of the time was being spent cloning and installing services and their deps.
I guess the downside is that working around the OSC slowness in CI will reduce developer motivation to fix the problem, which affects all users too. Then again, this has been a problem for years and no one has fixed it, so apparently that isn't a big enough lever to get things moving anyway. :-/
Using osc directly I don't think the slowness is really perceptible from a human standpoint, but it adds up in a CI run. There are larger problems to kill with gate slowness than fixing osc will solve, but every little helps. I do agree, however, that the gate is not a big enough motivator for people to fix osc slowness, as we can wait hours in some cases for jobs to start, so 3 minutes is not really a concern from a latency perspective. But if we saved 3 minutes on every run, that might in aggregate reduce the latency problems we have.
I have a PoC patch up in devstack[0] to start using the openstack-server client. It passed the basic devstack test, and looking through the logs you can see that openstack calls are now completing in fractions of a second as opposed to 2.5 to 3 seconds, so I think it's working as intended.
That said, it needs quite a bit of refinement. For example, I think we should disable this on any OSC patches. I also suspect it will fall over for any projects that use an OSC plugin since the server is started before any plugins are installed. This could probably be worked around by restarting the service after a project is installed, but it's something that needs to be dealt with.
Before I start taking a serious look at those things, do we want to pursue this? It does add some potential complexity to debugging if a client call fails or if the server crashes. I'm not sure I can quantify the risk there though since it's always Just Worked(tm) for me.
-Ben
On Wed, Aug 14, 2019, at 10:07 AM, Ben Nemec wrote:
I have a PoC patch up in devstack[0] to start using the openstack-server client. It passed the basic devstack test and looking through the logs you can see that openstack calls are now completing in fractions of a second as opposed to 2.5 to 3, so I think it's working as intended.
That said, it needs quite a bit of refinement. For example, I think we should disable this on any OSC patches. I also suspect it will fall over for any projects that use an OSC plugin since the server is started before any plugins are installed. This could probably be worked around by restarting the service after a project is installed, but it's something that needs to be dealt with.
Before I start taking a serious look at those things, do we want to pursue this? It does add some potential complexity to debugging if a client call fails or if the server crashes. I'm not sure I can quantify the risk there though since it's always Just Worked(tm) for me.
Considering that our number one identified e-r bug is job timeouts [1], I think anything that reduces job time by a measurable amount is worthwhile. Additionally, if we save 5 minutes per devstack run and run devstack 10k times a day (not an up-to-date number, but it has been in that range in the past; someone can double-check this with grafana or logstash or the zuul dashboard), that is a massive savings when looked at on the whole. To me that makes it worthwhile.
[1] http://status.openstack.org/elastic-recheck/index.html#1686542
On Wed, Aug 14, 2019 at 12:07:01PM -0500, Ben Nemec wrote:
I have a PoC patch up in devstack[0] to start using the openstack-server client. It passed the basic devstack test and looking through the logs you can see that openstack calls are now completing in fractions of a second as opposed to 2.5 to 3, so I think it's working as intended.
I see this as having a couple of advantages
- no bespoke API interfacing code to maintain
- the wrapper is custom but pretty small
- plugins can benefit by using the same wrapper
- we can turn the wrapper off and fall back to the same calls directly with the client (also good for local interaction)
- in a similar theme, it's still pretty close to "what I'd type on the command line to do this" which is a bit of a devstack theme
So FWIW I'm positive on the direction, thanks!
-i
(some very experienced people have said "we know it's slow" and I guess we should take advice on if this is a temporary work-around, or an actual solution)
On 8/14/19 8:49 PM, Ian Wienand wrote:
On Wed, Aug 14, 2019 at 12:07:01PM -0500, Ben Nemec wrote:
I have a PoC patch up in devstack[0] to start using the openstack-server client. It passed the basic devstack test and looking through the logs you can see that openstack calls are now completing in fractions of a second as opposed to 2.5 to 3, so I think it's working as intended.
I see this as having a couple of advantages
- no bespoke API interfacing code to maintain
- the wrapper is custom but pretty small
- plugins can benefit by using the same wrapper
- we can turn the wrapper off and fall back to the same calls directly with the client (also good for local interaction)
- in a similar theme, it's still pretty close to "what I'd type on the command line to do this" which is a bit of a devstack theme
So FWIW I'm positive on the direction, thanks!
-i
(some very experienced people have said "we know it's slow" and I guess we should take advice on if this is a temporary work-around, or an actual solution)
Okay, I've got https://review.opendev.org/#/c/676016/ passing devstack ci now and I think it's ready for initial review. I don't know if everything I'm doing will fly with the devstack folks, but the reasons why should be covered in the commit message. I'm open to suggestions on alternate ways to accomplish the same things.
participants (11)
- Alex Schultz
- Ben Nemec
- Chris Dent
- Clark Boylan
- Dean Troyer
- Donny Davis
- Doug Hellmann
- Ghanshyam Mann
- Ian Wienand
- Jeremy Stanley
- Sean Mooney