[watcher] nova cdm builder performance optimizations - summary

Matt Riedemann mriedemos at gmail.com
Tue Jul 9 18:13:14 UTC 2019


I wanted to summarize a series of changes that have improved the 
performance of the NovaClusterDataModel builder for audits across single 
and multiple cells (in the CERN case) by 20-30%.

There were initially three changes involved (in order):

1. https://review.opendev.org/#/c/659688/ - Optimize
NovaClusterDataModelCollector.add_instance_node

Reports on that patch alone said it fixed a regression introduced in 
Stein with scoped audits:

"I checked this patch on the my test environment on the stable/stein 
branch. I have more than 1000 virtual servers (some real, some dummy). 
Previously, in the stable/rocky branch, the time to build a cluster was 
about 15-20 minutes, in the Stein branch there was a regression and the 
time increased to 90 minutes. After this patch, the build time is only 2 
minutes."

That change was backported to stable/stein.

2. https://review.opendev.org/#/c/661121/ - Optimize hypervisor API 
calls (which requires https://review.opendev.org/#/c/659886/)

As noted, that change requires a patch to python-novaclient if you are 
looking to backport it. We can't backport it upstream because doing so 
would require bumping the minimum required version of python-novaclient 
on a stable branch, which is against stable branch policy (minimum 
versions of library dependencies are more or less frozen on stable 
branches).

That change also requires configuring watcher with:

[nova_client]
api_version = 2.53  # or greater; train now requires at least 2.56
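
As a rough sketch of how that setting could be sanity-checked before a 
backport is rolled out, the snippet below parses the option and compares 
it as a (major, minor) pair, since comparing microversions as strings or 
floats would rank 2.6 above 2.53. It uses Python's standard configparser 
purely for illustration; Watcher itself reads its configuration through 
oslo.config, and the helper names here are hypothetical, not part of 
Watcher.

```python
import configparser

# Minimum taken from the requirement stated above.
MIN_API_VERSION = (2, 53)


def parse_microversion(value):
    """Turn a string such as '2.53' into a comparable (2, 53) tuple."""
    major, minor = value.strip().split(".")
    return int(major), int(minor)


def nova_api_version_ok(conf_text, minimum=MIN_API_VERSION):
    """Hypothetical helper: True if [nova_client] api_version >= minimum."""
    # Assumption: tolerate a trailing '# ...' comment on the option line.
    parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
    parser.read_string(conf_text)
    raw = parser.get("nova_client", "api_version")
    return parse_microversion(raw) >= minimum


sample = """
[nova_client]
api_version = 2.53  # or greater
"""

if __name__ == "__main__":
    print(nova_api_version_ok(sample))
```

The tuple comparison is the point of the sketch: (2, 6) sorts below 
(2, 53), which matches how the microversions are ordered.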

3. https://review.opendev.org/#/c/662089/ - Optimize
NovaHelper.get_compute_node_by_hostname

This optimizes code used to build/update the nova CDM during 
notification processing and also fixes a bug in how the compute 
service is looked up.

After those three changes were merged, Corne Lukken (Dantali0n) started 
doing scale and performance testing with and without the changes in a 
CERN 5-cell test cluster. Corne identified a regression; Canwei Li 
determined the root cause and chenker fixed it:

4. https://review.opendev.org/#/c/668100/ - Reduce the query time of the 
instances when call get_instance_list()

With that fix applied, Corne reported an overall improvement of 20-30% 
when building the nova CDM during an audit in various scenarios. The 
actual performance numbers will be sent later as part of a thesis Corne 
is working on.

I want to thank Dantali0n, licanwei and chenker for all of their help 
with this series of improvements.

-- 

Thanks,

Matt


