[all] CI resources usage optimization
Hi,

In one of the recent weekly TC meetings we discussed the usage of our CI resources by projects. After that discussion I wrote a small script [1] to get some more data about it. I checked separately how many nodes and "nodehours" projects are using in the check and gate queues to validate patches. Full results for the time range 01.01.2025 to 13.02.2025 are in [2] for the check queue and in [3] for the gate queue. 1 "nodehour" means that 1 CI node was busy for 1 hour running a job.

This email is of course not sent just to tell teams to cut their testing coverage, but please take a closer look at your project's CI jobs configuration; maybe there is some way to improve it easily. For example, if you have jobs that have been running as non-voting for a long time, consider moving them to the experimental or periodic queue instead of running them in the check queue for every patch (or make them voting); there is a small illustrative example of such a change after the tables below. This may be one of those small steps that optimize things a bit and make our own lives easier, as less load on the infra means more stable jobs in general.

Below are the results for the top 50 projects, based on "nodehours" per patch.

# Check queue

+--------------------------------------------------+-------------------+------------------+----------------------------+-----------------------+---------------------------------+
| Project | Number of patches | Total Nodes used | Total time used [nodehour] | Nodes per patch (avg) | Time per patch (avg) [nodehour] |
+--------------------------------------------------+-------------------+------------------+----------------------------+-----------------------+---------------------------------+
| openstack/openstack-ansible-rabbitmq_server | 2 | 72 | 90:06:07 | 36.00 | 45:03:03 |
| openstack/octavia-tempest-plugin | 4 | 128 | 164:35:45 | 32.00 | 41:08:56 |
| openstack/openstack-ansible-haproxy_server | 3 | 90 | 119:12:36 | 30.00 | 39:44:12 |
| openstack/tacker | 14 | 1667 | 506:49:12 | 119.07 | 36:12:05 |
| openstack/openstack-ansible-os_neutron | 8 | 204 | 289:32:21 | 25.50 | 36:11:32 |
| openstack/openstack-ansible-os_keystone | 3 | 81 | 106:46:15 | 27.00 | 35:35:25 |
| openstack/kolla-ansible | 130 | 8461 | 4491:25:21 | 65.08 | 34:32:57 |
| openstack/openstack-ansible-os_octavia | 4 | 80 | 135:50:31 | 20.00 | 33:57:37 |
| openstack/openstack-ansible | 31 | 872 | 1012:42:34 | 28.13 | 32:40:04 |
| openstack/openstack-ansible-os_horizon | 5 | 115 | 162:39:57 | 23.00 | 32:31:59 |
| openstack/openstack-ansible-os_aodh | 3 | 66 | 94:01:09 | 22.00 | 31:20:23 |
| openstack/openstack-ansible-os_glance | 3 | 69 | 92:57:56 | 23.00 | 30:59:18 |
| openstack/openstack-ansible-os_cinder | 3 | 63 | 86:19:38 | 21.00 | 28:46:32 |
| openstack/openstack-ansible-os_magnum | 3 | 63 | 84:34:05 | 21.00 | 28:11:21 |
| openstack/openstack-ansible-os_cloudkitty | 3 | 48 | 78:30:50 | 16.00 | 26:10:16 |
| openstack/openstack-ansible-os_ironic | 6 | 93 | 153:06:27 | 15.50 | 25:31:04 |
| openstack/openstack-ansible-os_barbican | 3 | 54 | 76:32:05 | 18.00 | 25:30:41 |
| openstack/openstack-ansible-os_gnocchi | 2 | 32 | 51:00:15 | 16.00 | 25:30:07 |
| openstack/openstack-ansible-os_ceilometer | 3 | 48 | 75:49:07 | 16.00 | 25:16:22 |
| openstack/openstack-ansible-os_swift | 3 | 54 | 75:37:56 | 18.00 | 25:12:38 |
| openstack/openstack-ansible-os_skyline | 5 | 89 | 124:53:18 | 17.80 | 24:58:39 |
| openstack/openstack-ansible-os_placement | 2 | 36 | 49:46:31 | 18.00 | 24:53:15 |
| openstack/tempest | 26 | 854 | 633:21:36 | 32.85 | 24:21:36 |
| openstack/openstack-ansible-os_designate | 3 | 51 | 72:30:14 | 17.00 | 24:10:04 |
| openstack/openstack-ansible-os_tempest | 2 | 36 | 47:27:46 | 18.00 | 23:43:53 |
| openstack/openstack-ansible-os_mistral | 3 | 51 | 71:06:02 | 17.00 | 23:42:00 |
| openstack/openstack-ansible-os_heat | 3 | 51 | 70:23:53 | 17.00 | 23:27:57 |
| openstack/openstack-ansible-os_blazar | 2 | 32 | 46:30:44 | 16.00 | 23:15:22 |
| openstack/openstack-ansible-os_masakari | 3 | 48 | 68:52:54 | 16.00 | 22:57:38 |
| openstack/ansible-role-systemd_networkd | 13 | 277 | 297:37:09 | 21.31 | 22:53:37 |
| openstack/ironic-tempest-plugin | 12 | 398 | 271:54:14 | 33.17 | 22:39:31 |
| openstack/openstack-ansible-galera_server | 5 | 111 | 112:28:29 | 22.20 | 22:29:41 |
| openstack/openstack-ansible-os_nova | 3 | 51 | 66:54:40 | 17.00 | 22:18:13 |
| openstack/openstack-ansible-os_rally | 2 | 32 | 44:20:06 | 16.00 | 22:10:03 |
| openstack/cinder | 127 | 3518 | 2746:32:43 | 27.70 | 21:37:34 |
| openstack/openstack-ansible-os_tacker | 3 | 45 | 63:03:01 | 15.00 | 21:01:00 |
| openstack/octavia | 38 | 917 | 764:50:26 | 24.13 | 20:07:38 |
| openstack/cinder-tempest-plugin | 13 | 176 | 241:51:59 | 13.54 | 18:36:18 |
| openstack/neutron-tempest-plugin | 8 | 129 | 147:41:46 | 16.12 | 18:27:43 |
| openstack/devstack | 24 | 536 | 442:16:49 | 22.33 | 18:25:42 |
| openstack/neutron | 186 | 3580 | 3283:39:42 | 19.25 | 17:39:14 |
| openstack/nova | 251 | 6527 | 4383:59:41 | 26.00 | 17:27:58 |
| openstack/openstack-ansible-plugins | 7 | 127 | 116:04:29 | 18.14 | 16:34:55 |
| openstack/openstack-ansible-repo_server | 5 | 72 | 82:42:24 | 14.40 | 16:32:28 |
| openstack/ironic | 79 | 2159 | 1284:28:46 | 27.33 | 16:15:33 |
| openstack/ansible-role-systemd_service | 3 | 52 | 46:55:01 | 17.33 | 15:38:20 |
| openstack/manila | 36 | 856 | 561:16:54 | 23.78 | 15:35:28 |
| openstack/bifrost | 6 | 164 | 93:30:55 | 27.33 | 15:35:09 |
| openstack/kolla | 35 | 997 | 530:29:59 | 28.49 | 15:09:25 |
| openstack/glance | 43 | 718 | 616:18:37 | 16.70 | 14:19:58 |
| openstack/manila-tempest-plugin | 13 | 177 | 178:58:11 | 13.62 | 13:46:00 |
+--------------------------------------------------+-------------------+------------------+----------------------------+-----------------------+---------------------------------+

# Gate queue

+--------------------------------------------------+-------------------+------------------+----------------------------+-----------------------+---------------------------------+
| Project | Number of patches | Total Nodes used | Total time used [nodehour] | Nodes per patch (avg) | Time per patch (avg) [nodehour] |
+--------------------------------------------------+-------------------+------------------+----------------------------+-----------------------+---------------------------------+
| openstack/openstack-ansible-rabbitmq_server | 1 | 31 | 32:04:31 | 31.00 | 32:04:31 |
| openstack/octavia-tempest-plugin | 2 | 38 | 43:36:32 | 19.00 | 21:48:16 |
| openstack/openstack-ansible-galera_server | 3 | 60 | 64:21:11 | 20.00 | 21:27:03 |
| openstack/openstack-ansible-os_keystone | 1 | 18 | 20:01:48 | 18.00 | 20:01:48 |
| openstack/openstack-ansible | 24 | 460 | 462:28:00 | 19.17 | 19:16:10 |
| openstack/openstack-ansible-repo_server | 5 | 76 | 91:39:29 | 15.20 | 18:19:53 |
| openstack/openstack-ansible-os_horizon | 3 | 39 | 52:58:46 | 13.00 | 17:39:35 |
| openstack/openstack-ansible-plugins | 3 | 48 | 51:28:19 | 16.00 | 17:09:26 |
| openstack/openstack-ansible-os_aodh | 1 | 13 | 16:30:26 | 13.00 | 16:30:26 |
| openstack/openstack-ansible-os_octavia | 2 | 24 | 32:52:12 | 12.00 | 16:26:06 |
| openstack/openstack-ansible-os_neutron | 4 | 61 | 65:06:05 | 15.25 | 16:16:31 |
| openstack/openstack-ansible-os_cinder | 1 | 12 | 15:51:46 | 12.00 | 15:51:46 |
| openstack/nova | 91 | 1788 | 1326:41:46 | 19.65 | 14:34:44 |
| openstack/tempest | 8 | 181 | 116:36:02 | 22.62 | 14:34:30 |
| openstack/openstack-ansible-os_glance | 1 | 13 | 14:28:12 | 13.00 | 14:28:12 |
| openstack/openstack-ansible-os_skyline | 3 | 34 | 41:15:36 | 11.33 | 13:45:12 |
| openstack/neutron | 87 | 1149 | 1033:31:12 | 13.21 | 11:52:46 |
| openstack/openstack-ansible-lxc_container_create | 1 | 19 | 11:49:22 | 19.00 | 11:49:22 |
| openstack/openstack-ansible-os_ironic | 4 | 36 | 46:49:08 | 9.00 | 11:42:17 |
| openstack/devstack | 11 | 183 | 121:32:26 | 16.64 | 11:02:56 |
| openstack/openstack-ansible-lxc_hosts | 2 | 27 | 22:00:46 | 13.50 | 11:00:23 |
| openstack/ansible-role-httpd | 1 | 17 | 10:55:36 | 17.00 | 10:55:36 |
| openstack/openstack-ansible-os_swift | 1 | 9 | 10:54:59 | 9.00 | 10:54:59 |
| openstack/kolla-ansible | 44 | 847 | 468:00:27 | 19.25 | 10:38:11 |
| openstack/ironic-tempest-plugin | 7 | 121 | 73:14:18 | 17.29 | 10:27:45 |
| openstack/openstack-ansible-os_barbican | 1 | 9 | 10:22:22 | 9.00 | 10:22:22 |
| openstack/openstack-ansible-os_heat | 1 | 9 | 10:20:01 | 9.00 | 10:20:01 |
| openstack/openstack-ansible-os_ceilometer | 1 | 9 | 10:18:23 | 9.00 | 10:18:23 |
| openstack/bifrost | 6 | 108 | 60:34:38 | 18.00 | 10:05:46 |
| openstack/openstack-ansible-os_masakari | 1 | 9 | 10:00:38 | 9.00 | 10:00:38 |
| openstack/ansible-role-systemd_networkd | 11 | 136 | 109:26:42 | 12.36 | 09:56:58 |
| openstack/openstack-ansible-os_cloudkitty | 1 | 9 | 09:56:58 | 9.00 | 09:56:58 |
| openstack/openstack-ansible-os_designate | 1 | 9 | 09:54:08 | 9.00 | 09:54:08 |
| openstack/openstack-ansible-os_mistral | 1 | 9 | 09:50:16 | 9.00 | 09:50:16 |
| openstack/openstack-ansible-os_tacker | 1 | 9 | 09:38:43 | 9.00 | 09:38:43 |
| openstack/kolla | 14 | 223 | 134:48:14 | 15.93 | 09:37:43 |
| openstack/openstack-ansible-os_magnum | 1 | 9 | 09:11:37 | 9.00 | 09:11:37 |
| openstack/openstack-ansible-openstack_hosts | 1 | 12 | 08:58:07 | 12.00 | 08:58:07 |
| openstack/openstack-ansible-os_nova | 1 | 9 | 08:49:26 | 9.00 | 08:49:26 |
| openstack/swift | 21 | 484 | 180:31:42 | 23.05 | 08:35:47 |
| openstack/neutron-lib | 10 | 94 | 82:54:52 | 9.40 | 08:17:29 |
| openstack/ansible-role-pki | 3 | 40 | 24:50:35 | 13.33 | 08:16:51 |
| openstack/kayobe | 21 | 372 | 173:09:54 | 17.71 | 08:14:45 |
| openstack/glance | 12 | 140 | 94:42:31 | 11.67 | 07:53:32 |
| openstack/openstack-ansible-os_trove | 1 | 6 | 07:50:27 | 6.00 | 07:50:27 |
| openstack/ansible-role-python_venv_build | 1 | 6 | 07:34:53 | 6.00 | 07:34:53 |
| openstack/ironic | 46 | 671 | 337:51:59 | 14.59 | 07:20:41 |
| x/devstack-plugin-tobiko | 6 | 57 | 42:46:30 | 9.50 | 07:07:45 |
| openstack/cinder | 20 | 204 | 136:59:33 | 10.20 | 06:50:58 |
| openstack/requirements | 47 | 1404 | 320:35:23 | 29.87 | 06:49:15 |
| openstack/placement | 4 | 49 | 27:11:05 | 12.25 | 06:47:46 |
+--------------------------------------------------+-------------------+------------------+----------------------------+-----------------------+---------------------------------+

[1] https://github.com/slawqo/tools/blob/master/jobs_time/infra_usage_stats.py
[2] https://paste.opendev.org/show/bzS4jJgGxMq9UUBfwKNn/
[3] https://paste.opendev.org/show/bbrKIKUAGiHsntU1cQ7L/

--
Slawek Kaplonski
Principal Software Engineer
Red Hat
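To illustrate the suggestion above about long-standing non-voting jobs, the change is usually only a few lines in a project's Zuul configuration. The snippet below is a minimal, hypothetical sketch (the job names are invented and the exact layout differs between projects): voting jobs stay in check, while a slow non-voting job moves to the experimental pipeline (run on demand with a "check experimental" review comment) and/or the periodic pipeline (run on a schedule) instead of on every patch.

# .zuul.yaml - illustrative sketch only, job names are hypothetical
- project:
    check:
      jobs:
        - myproject-functional            # fast, voting jobs stay in check
    gate:
      jobs:
        - myproject-functional
    experimental:
      jobs:
        - myproject-scenario-slow         # was non-voting in check; now runs on demand
    periodic:
      jobs:
        - myproject-scenario-slow         # and/or once a day instead of per patch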
On Wed, Feb 26, 2025, at 1:57 AM, Sławek Kapłoński wrote:
Hi,
In one of the recent weekly TC meetings we discussed the usage of our CI resources by projects. After that discussion I wrote a small script [1] to get some more data about it.
Snip
This email is of course not sent just to tell teams to cut their testing coverage, but please take a closer look at your project's CI jobs configuration; maybe there is some way to improve it easily. For example, if you have jobs that have been running as non-voting for a long time, consider moving them to the experimental or periodic queue instead of running them in the check queue for every patch (or make them voting). This may be one of those small steps that optimize things a bit and make our own lives easier, as less load on the infra means more stable jobs in general.
To expand a bit on this, I'm going to use an example from Tacker, but I think that these issues aren't unique to that project.

If we look at https://review.opendev.org/c/openstack/tacker/+/942337 we can see there are ~33 tacker-ft-* jobs that ran. Each of these uses 3 or 4 nodes. The jobs all failed with a RETRY_LIMIT error, which means they were each attempted 3 times before Zuul gave up. This happens because there is a consistent failure in the jobs' pre-run playbook. In this case the issue appears to be Tacker using some undefined Singleton object from oslo.service.

I think there are three different things we can do to improve the situation for Tacker.

1) Stop running Devstack setup within pre-run. Pre-run playbooks should be used to set up the test environment in ways that aren't directly affected by the code under test, specifically so that the job is tried once rather than three times when the project itself is broken. Some projects (like Nodepool) do run Devstack in pre-run. This is OK because I can make any change to Nodepool and it will not break Devstack. That isn't the case with Tacker (and probably others).

2) Consider combining some of these similar tacker-ft-* jobs into fewer jobs. If you look at successful runs of these jobs, some of them appear to run with very similar configs and then simply run a different set of test cases at the very end of the job. In those cases, even if we ran the test cases from several of those jobs in a single job, the total test case runtime would still be much shorter than the setup cost.

3) Consider whether each test really needs 4 nodes. Remember that multinode testing is effectively a multiplier on the total cost of the job. We should use the bare minimum we can get away with. This has other upsides, including making it easier for people to reproduce failures locally should they need to.

I think any one of these improvements would be a great benefit, but together the impact would be quite large. To illustrate this, we currently use somewhere between 3 * 33 * 3 = 297 and 4 * 33 * 3 = 396 nodes for each run on this one change. If we implement 1) we get between 3 * 33 = 99 and 4 * 33 = 132 nodes. With 2), if we halve the number of jobs, we get 3 * 17 * 3 = 153 to 4 * 17 * 3 = 204. With 3), if we get away with 3 nodes in each job, then we get the floor of each of these ranges. Finally, if we do some combination of 1), 2) and 3) we get 3 * 17 = 51 nodes. That is roughly 1/6th of the previous total resource consumption, or less.
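To make suggestions 1) and 3) a bit more concrete, below is a rough sketch of what such a job definition could look like. Everything in it is hypothetical (the job name, node labels and playbook paths are invented and are not taken from Tacker's actual configuration): Devstack and the project's own setup run in the run phase, so a change that breaks them fails once and gets reported instead of being retried until RETRY_LIMIT, and the nodeset is trimmed to three nodes.

# Illustrative sketch only; names, labels and playbook paths are made up.
- nodeset:
    name: myproject-multinode-3
    nodes:
      - name: controller
        label: ubuntu-noble
      - name: compute1
        label: ubuntu-noble
      - name: compute2
        label: ubuntu-noble
    groups:
      - name: subnode
        nodes:
          - compute1
          - compute2

- job:
    name: myproject-ft-base
    nodeset: myproject-multinode-3
    # pre-run: only environment preparation that the change under test
    # cannot break; failures here are retried by Zuul.
    pre-run: playbooks/prepare-nodes.yaml
    # run: Devstack, the project's setup and the tests themselves, so a
    # broken change fails once and is reported instead of being retried.
    run: playbooks/devstack-and-functional-tests.yaml
    post-run: playbooks/collect-logs.yaml

Combining similar jobs (suggestion 2) would then mostly be a matter of feeding a broader test selection to that single run playbook rather than defining a separate job per test set.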
Hi Clark,

Thanks for your suggestions. We are already in the middle of improving Tacker's FTs, since they have become a mess, but I apologize that we have not paid much attention to the resource consumption. For 2) and 3), we have already started considering how to reduce the number of tests, focusing especially on the most resource-consuming scenarios, although keeping the coverage makes this somewhat difficult. We'll fix it soon. 1) was not in our plan, but it looks feasible; I'll try to revise the playbook.

Yasufumi