[watcher] nova cdm builder performance optimizations - summary
I wanted to summarize a series of changes which have improved the performance of the NovaClusterDataModel builder for audits across single and multiple cells (in the CERN case) by 20-30%.
There were initially three changes involved (in order):
1. https://review.opendev.org/#/c/659688/ - Optimize NovaClusterDataModelCollector.add_instance_node
Reports on that patch alone said it fixed a regression introduced in Stein with scoped audits:
"I checked this patch on the my test environment on the stable/stein branch. I have more than 1000 virtual servers (some real, some dummy). Previously, in the stable/rocky branch, the time to build a cluster was about 15-20 minutes, in the Stein branch there was a regression and the time increased to 90 minutes. After this patch, the build time is only 2 minutes."
That change was backported to stable/stein.
2. https://review.opendev.org/#/c/661121/ - Optimize hypervisor API calls (which requires https://review.opendev.org/#/c/659886/)
As noted, that change requires a patch to python-novaclient if you are looking to backport it. We can't backport it upstream because doing so would require bumping the minimum required version of python-novaclient on a stable branch, which is against stable branch policy (minimum versions of library dependencies are more or less frozen on stable branches).
That change also requires configuring watcher with:
[nova_client]
api_version = 2.53 # or greater; train now requires at least 2.56
3. https://review.opendev.org/#/c/662089/ - Optimize NovaHelper.get_compute_node_by_hostname
This optimizes code used to build/update the nova CDM during notification processing and also fixes a bug in how the compute service is looked up.
After those three changes were merged, Corne Lukken (Dantali0n) started doing scale and performance testing with and without the changes in a CERN 5-cell test cluster. Corne identified a regression for which Canwei Li determined the root cause and chenker fixed:
4. https://review.opendev.org/#/c/668100/ - Reduce the query time of the instances when call get_instance_list()
With that fix applied, Corne reported an overall improvement of 20-30% when building the nova CDM during an audit in various scenarios. The actual performance numbers will be sent later as part of a thesis Corne is working on.
I want to thank Dantali0n, licanwei and chenker for all of their help with this series of improvements.
--
Thanks,
Matt
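For anyone checking a deployment against the api_version requirement noted under change 2, here is a minimal sketch of that check. It is not Watcher code: Watcher reads its configuration through oslo.config, while this uses the standard library configparser as a stand-in, and the helper names (parse_microversion, nova_api_version_ok) are hypothetical. The point it illustrates is that a microversion has to be compared as a (major, minor) pair, since 2.6 is older than 2.53 even though it is larger as a decimal.

```python
import configparser

# Minimum microversion named in the message for the hypervisor API change.
MIN_API_VERSION = "2.53"


def parse_microversion(text):
    """Turn a string such as '2.53' into a comparable (major, minor) tuple."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def nova_api_version_ok(conf_text, minimum=MIN_API_VERSION):
    """Return True if [nova_client] api_version is at least `minimum`.

    Assumption for this sketch: a missing section or option counts as a
    failure. The real service has its own default for this option.
    """
    # Allow a trailing "# ..." note on the option line, as in the message.
    parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
    parser.read_string(conf_text)
    configured = parser.get("nova_client", "api_version", fallback=None)
    if configured is None:
        return False
    return parse_microversion(configured) >= parse_microversion(minimum)


# Sample configuration text mirroring the snippet in the message.
SAMPLE = """
[nova_client]
api_version = 2.53  # or greater
"""

if __name__ == "__main__":
    print(nova_api_version_ok(SAMPLE))
```

Passing minimum="2.56" applies the stricter floor the message mentions for train.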
Thanks a lot for the summary, this could be very helpful. We will test these :)
On Wed, Jul 10, 2019 at 2:29 AM Matt Riedemann <mriedemos@gmail.com> wrote:
I wanted to summarize a series of changes which have improved the performance of the NovaClusterDataModel builder for audits across single and multiple cells (in the CERN case) by 20-30%.
There were initially three changes involved (in order):
1. https://review.opendev.org/#/c/659688/ - Optimize NovaClusterDataModelCollector.add_instance_node
Reports on that patch alone said it fixed a regression introduced in Stein with scoped audits:
"I checked this patch on the my test environment on the stable/stein branch. I have more than 1000 virtual servers (some real, some dummy). Previously, in the stable/rocky branch, the time to build a cluster was about 15-20 minutes, in the Stein branch there was a regression and the time increased to 90 minutes. After this patch, the build time is only 2 minutes."
That change was backported to stable/stein.
2. https://review.opendev.org/#/c/661121/ - Optimize hypervisor API calls (which requires https://review.opendev.org/#/c/659886/)
As noted, that change requires a patch to python-novaclient if you are looking to backport it. We can't backport it upstream because doing so would require bumping the minimum required version of python-novaclient on a stable branch, which is against stable branch policy (minimum versions of library dependencies are more or less frozen on stable branches).
That change also requires configuring watcher with:
[nova_client]
api_version = 2.53 # or greater; train now requires at least 2.56
3. https://review.opendev.org/#/c/662089/ - Optimize NovaHelper.get_compute_node_by_hostname
This optimizes code used to build/update the nova CDM during notification processing and also fixes a bug in how the compute service is looked up.
After those three changes were merged, Corne Lukken (Dantali0n) started doing scale and performance testing with and without the changes in a CERN 5-cell test cluster. Corne identified a regression for which Canwei Li determined the root cause and chenker fixed:
4. https://review.opendev.org/#/c/668100/ - Reduce the query time of the instances when call get_instance_list()
With that fix applied, Corne reported an overall improvement of 20-30% when building the nova CDM during an audit in various scenarios. The actual performance numbers will be sent later as part of a thesis Corne is working on.
I want to thank Dantali0n, licanwei and chenker for all of their help with this series of improvements.
--
Thanks,
Matt
participants (2)
- Matt Riedemann
- Zhenyu Zheng