Hello everyone,
We've been seeing steady memory leaks in some of our OpenStack API services. I think I've found the cause but I'd appreciate some feedback.
When a service creates a client object (e.g. placementclient) to use another service's API, the call goes through Proxy.request() in the openstacksdk module. That function calls _report_stats()[1] for every request, which in turn calls three reporting functions: one each for statsd,
Prometheus and InfluxDB. The Prometheus function records the request count and response time in a dict keyed by the full URL of the request.[2]
This data is recorded unless the _prometheus_counter or _prometheus_histogram member variables are None.[2] Those member variables are set by config.CloudRegion.get_prometheus_counter()[3] and config.CloudRegion.get_prometheus_histogram()[4], which do not
consult any config settings. It looks like the only way to prevent this behavior is to uninstall the prometheus_client module.[5] Unfortunately, prometheus_client is required by oslo.metrics[6], which is required by oslo.messaging[7].
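To make sure I'm describing that path correctly, here's a minimal sketch of the pattern as I understand it from [2]. The metric and label names below are mine, not the SDK's; the point is just that every distinct (method, URL) pair becomes a new child series that prometheus_client keeps forever:

    # Illustrative only: metric/label names are made up, but the shape matches
    # what [2] describes -- a Counter and a Histogram labelled with the full URL.
    from prometheus_client import Counter, Histogram

    requests_total = Counter(
        'openstack_http_requests', 'Requests by method and URL',
        ['method', 'url'])
    request_latency = Histogram(
        'openstack_http_latency_seconds', 'Response time by method and URL',
        ['method', 'url'])

    def report(method, url, seconds):
        # .labels() creates (and retains) one child series per unique
        # (method, url) pair inside the client library's internal dict.
        requests_total.labels(method=method, url=url).inc()
        request_latency.labels(method=method, url=url).observe(seconds)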
This is causing a lot of extra memory usage in our production environments. It's worst in our build farm, where we create over 500K VMs per week. nova-scheduler requests a large number of unique URLs because the Placement API puts the VM's UUID in the URL used to get its allocations[8],
and each unique URL gets its own counter and histogram. Even with 6 copies of nova-scheduler running (currently Caracal), after a few weeks each copy reaches 8 GB and its worker processes start getting OOM-killed, which puts the VM being scheduled into an error state. The nova-scheduler
parent process keeps running and restarts the workers, but they just keep getting killed, and VMs can't be reliably scheduled until nova-scheduler is restarted.
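To put a rough number on it, here's a toy reproduction (again with made-up metric names, and a URL shape that only approximates Placement's): after N requests with N different UUIDs, the client library is tracking N label sets that are never released.

    # Toy reproduction of the cardinality growth; scale the loop up toward our
    # ~500K/week and the process just keeps accumulating series.
    import uuid
    from prometheus_client import Counter

    requests_total = Counter('placement_requests', 'Requests by URL', ['url'])

    for _ in range(10_000):
        url = '/allocations/%s' % uuid.uuid4()  # every VM yields a new URL
        requests_total.labels(url=url).inc()

    # Count the distinct label sets now held in memory.
    metric = next(iter(requests_total.collect()))
    print(len({s.labels['url'] for s in metric.samples}))  # 10000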
This feels like two problems to me. First, we don't need Prometheus metrics from the OpenStack services, but there's no way to turn them off. The Session class in keystoneauth1 supports a collect-timing option[9] (default False), but it's only used in one or two places
that I can find. The CloudRegion methods get their config from the Adapter[10] class in keystoneauth1, which does not support collect-timing. Should it?
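For reference, this is roughly how the existing opt-in works on Session, if I'm reading it right (the connection details below are placeholders); I can't find an equivalent switch on Adapter:

    # Sketch of the keystoneauth1 Session opt-in as I understand it: timing
    # data is only collected when collect_timing=True is passed explicitly.
    from keystoneauth1 import session
    from keystoneauth1.identity import v3

    auth = v3.Password(auth_url='https://keystone.example.com/v3',
                       username='demo', password='secret',
                       project_name='demo', user_domain_id='default',
                       project_domain_id='default')
    sess = session.Session(auth=auth, collect_timing=True)  # default is False

    # ... make some API calls through sess ...
    for timing in sess.get_timings():
        print(timing)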
Second, even if we did use these Prometheus metrics, it seems like they shouldn't be allowed to grow without bound. Should the Proxy._report_stats_prometheus() function[2] limit the tracked metrics by age or quantity?
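One idea for bounding it (just a sketch, not something the SDK does today): normalize the URL before using it as a label, so UUIDs collapse into a placeholder and the label set is bounded by the number of API routes instead of the number of VMs.

    # Idea only: collapse UUID path segments so label cardinality is bounded
    # by the number of distinct API routes rather than the number of VMs.
    import re

    _UUID_RE = re.compile(
        r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}',
        re.IGNORECASE)

    def normalize_url(url):
        return _UUID_RE.sub('{uuid}', url)

    print(normalize_url('/allocations/3f8e2a6e-8e40-4f8b-9a6e-0d7c1b2a3c4d'))
    # -> /allocations/{uuid}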
Any thoughts? I've patched our services in-house to turn off these metrics (I'll have results in a week). In the meantime, are we doing something wrong, or is every OpenStack service collecting these metrics all the time?
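For anyone curious, the gist of the in-house change (a simplified sketch, not the exact patch we deployed) is just to no-op the Prometheus hook named in [2]:

    # Simplified sketch of the idea, not the exact patch we deployed:
    # replace the Prometheus reporting hook with a no-op.
    from openstack import proxy

    proxy.Proxy._report_stats_prometheus = lambda self, *args, **kwargs: None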
-- Sam Clippinger