[nova] Nova API call duration increases after upgrading to Caracal
Hello,

We recently upgraded from Bobcat to Caracal. Since the upgrade we have observed an interesting increase in the duration of Nova API calls. It affects multiple API calls, e.g. os-query-sets, /v2.1/servers/detail, etc., and it appears to be cumulative: whenever nova-api is restarted, responses are fairly fast again, around 80-90 ms, but after it has been running for up to 10 days or so, the duration grows to over 500 ms.

I noticed that Caracal is using the UnifiedLimitsDriver instead of the DbQuotaDriver. I tried changing nova.conf to use the DbQuotaDriver, but that did not noticeably reduce the duration. I understand that calls to Keystone and Placement (and likely other external APIs) can add latency, but I am pretty confused about why the slowdown accumulates over time.

Wondering if anyone else in the OpenStack community has noticed a similar issue. Thanks!
Hi,

I can't fully confirm your observation, but since we upgraded to Caracal a couple of weeks ago it feels like there has been a performance decrease in the Dashboard's Instances tab (which also uses nova-api, of course). I haven't had time to dig deeper, and unfortunately we also don't have any historical data to compare against. But after we upgraded to Antelope earlier this year, my colleagues and I felt that Dashboard performance was better than in earlier releases, and now on Caracal the opposite is being reported. Even worse: we've seen gateway timeouts a few times, for the first time in 5 or 6 years (I don't remember exactly). Back then it was a memcached misconfiguration, and we haven't changed anything in memcached in the past years; maybe we should, though.

I'm very curious whether others have observed similar things.

Regards, Eugen
I would have to try and find the bug, but I believe there was a performance regression related to the addition of the pinned_az field. That has been fixed on master, but I don't know how far the backport has gone.
Thanks Sean. Could you also point out the patch that fixed the pinned_az field issue?

My thinking so far:

1. The duration shows a clear increasing trend over time and drops back to very low values after simply restarting nova-api. I assume it is something cache-related(?), but I couldn't understand exactly what changed in Caracal. I also think it should be related to nova-api itself rather than the external Keystone or Placement API calls.

2. I added driver = nova.quota.DbQuotaDriver to the [quota] section in nova.conf (exact snippet below). It seems to slow the increase down a little(?), but I couldn't confirm whether that is consistent; I have only tried it once, and the increase takes a couple of days to show a clear trend.
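For reference, these are the lines as they appear in our nova.conf:

  [quota]
  driver = nova.quota.DbQuotaDriver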
I'm not entirely sure this is the issue, but I suspected you were hitting https://bugs.launchpad.net/nova/+bug/2095364, which was fixed by https://review.opendev.org/c/openstack/nova/+/939658. The Caracal backport is proposed at https://review.opendev.org/c/openstack/nova/+/955305, but that may or may not be the issue you are hitting.
Thanks. The os-query-sets API requests to Nova have been successful, with no errors reported; it's just that their duration keeps getting longer and longer unless we restart nova-api. It seems different from the bug fixed in the review above, but I will keep checking the nova-api logs to see if I can find anything more.
I think you mean the os-quota-sets API, right? In Caracal we had intended to change the default [quota]driver to the UnifiedLimitsDriver, but this did not happen due to some upgrade concerns, so the default should still be the DbQuotaDriver.

I skimmed through the commit differences in Nova between Bobcat and Caracal and did not find anything that looked suspect so far. The challenge is that the problem could be in a number of places: Nova, oslo.cache, dogpile.cache.<backend>, keystonemiddleware, and maybe more that I haven't thought of.

Similar to what you mentioned earlier, this seems most likely to be related to some sort of caching, and in the past we have seen such issues come from missing or incorrect configuration, as Eugen mentioned. It is known that if the cache is not configured correctly, you will see a progressive slowdown in the API over time.

This would be the first thing to check in your nova-api nova.conf; you should have the following configuration to enable keystone auth token caching:

  [cache]
  enabled = True
  backend = dogpile.cache.memcached
  memcache_servers = localhost:11211

and if you have multiple memcache_servers they should be comma separated, for example: host1:port1,host2:port2

If your configuration is correct, then it might be worth trying a different [cache]backend to isolate whether the problem is related to this cache or something else, and go from there.

-melwitt
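P.S. As an example of that kind of backend swap, something along these lines would exercise a different memcache client while keeping the rest of the setup the same (the server list here is a placeholder, keep your real one):

  [cache]
  enabled = True
  backend = oslo_cache.memcache_pool
  memcache_servers = host1:11211,host2:11211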
Hello,

We too observe the mentioned behavior; moreover, the CPU and RAM usage of the Nova API slowly increases over time, and the API process eventually eats 100% CPU.

@Melanie thank you for the suggestion! I have disabled the cache (see the snippet below) and will monitor how Nova performs.
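For clarity, by "disabled" I mean roughly the following in the nova-api nova.conf:

  [cache]
  enabled = False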
Hi,

I did not have the nova cache enabled before, and I found a config mistake in the keystone cache section, which I corrected. Then I enabled the nova cache and restarted apache and nova-api, but there is no difference at all in the response times in the dashboard. It feels like caching doesn't do anything here (for us). Maybe I should start a new thread about horizon performance so as not to hijack this one...

Thanks, Eugen
The caching in nova-api is really more useful for the metadata API. The only thing we really use caching for in the main API is keystone auth token validation. We do not cache API responses in the main API, but we do in the metadata API: building the metadata for a VM can be slow, and we use memcache to make sure that if subsequent requests from a VM are received by a different API worker process, it can share the metadata object built by the first worker.

For the main API we mainly use caching so we don't have to keep validating the same token over and over again with keystone when it is used to make multiple requests. That helps performance at the auth step, but it is not going to speed up server list or flavor show.
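(From memory, the knob that controls how long the metadata API keeps the built metadata around is in nova.conf, roughly:

  [api]
  metadata_cache_expiration = 15

please double-check the exact option name and default for your release; the [cache] settings discussed above only determine where that cached data is stored.)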
Thanks for the explanation, Sean, I understand. I will try to capture some metrics from my test cloud (there's no real load there, though) on Epoxy (or Caracal as well) and see if I can compare them reliably to the Antelope version, because our users were quite delighted about the improved dashboard performance in Antelope...

Have a great weekend!

Thanks! Eugen
Hi Sean,

I created https://bugs.launchpad.net/nova/+bug/2121607 since this still bothers us. I could not figure out what changed from Bobcat to Caracal that could cause the issue; I listed the way I tested it in my own dev cluster in the bug report. Could the team please help with this, or suggest where I could start debugging it? I'm new to the industry, so thanks in advance for any suggestions. Thanks!
Thanks @Melanie for the explanation and suggestion! We disabled the queries to os-quota-sets (actually we are only sending requests to servers/detail while debugging), and we still noticed the increase in both duration and memory after the change. I tried changing the cache backend from oslo_cache.memcache_pool to dogpile.cache.memcached (snippet below), but it didn't really help. So maybe it's not necessarily related to the cache? I'm not sure what else is worth investigating. -Chang
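For completeness, the backend change I tried was just swapping this line in the [cache] section of nova.conf:

  [cache]
  # before
  backend = oslo_cache.memcache_pool
  # after
  backend = dogpile.cache.memcached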
participants (5)
- Chang Xue
- Eugen Block
- Konstantin Larin
- melanie witt
- Sean Mooney