[nova] Nasty new gate regression as of 4/15 - bug 1825435

Matt Riedemann mriedemos at gmail.com
Thu Apr 25 14:48:53 UTC 2019


On 4/19/2019 8:22 AM, Matt Riedemann wrote:
> I spotted this yesterday [1] and according to logstash it showed up 
> around 4/15. It's only hitting on nova unit tests, and I think it's 
> somehow related to the TestRPC unit tests, or maybe those just stall 
> out as a result when we hit the stack overflow.
> 
> I don't think it's due to any new oslo.config or oslo.messaging versions 
> because there haven't been any, and it hits on both the 
> lower-constraints and py36 jobs (so different versions of those packages).
> 
> I've looked through the nova changes that merged since around 4/14 but 
> nothing is jumping out at me that might be causing this stack overflow 
> in the oslo.config code, but the cells v1 removal patches are pretty big 
> and I'm wondering if something snuck through there - the cells v1 unit 
> tests were doing some RPC stuff as I recall so maybe that's related.
> 
> We need all eyes on this since it's a high failure rate and we're 
> already experiencing really slow turnaround times in the gate.
> 
> [1] https://bugs.launchpad.net/nova/+bug/1825435

Just an update on this: yes, it's still happening, and at a high rate [1].

I have a couple of debug patches up to try to recreate the nova unit 
test failure by detecting an infinite loop in oslo.config and killing 
the process before we lose the useful part of the stacktrace [2]. The 
annoying thing is that my previous attempts in the debug patch wouldn't 
recreate the failure (or if they did, they didn't dump the output from 
the oslo.config patch). So the latest version of those debug patches 
just makes oslo.config raise explicitly, which should kill any nova 
unit test that hits it and produce a traceback in the console log 
output.
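For anyone curious about the general idea, here is a minimal sketch of 
that fail-fast approach (the names and the depth threshold are 
illustrative, not the actual oslo.config internals or the real debug 
patch): instead of letting runaway recursion hit the interpreter's stack 
limit, which cuts off the interesting frames, we raise explicitly once a 
call nests too deep, so the failing unit test dies with a readable 
traceback.

```python
import functools

# Assumed threshold for "this is looping forever" -- far below the
# interpreter's default recursion limit of ~1000.
MAX_DEPTH = 50


def fail_fast_on_recursion(func):
    """Raise RuntimeError instead of overflowing the stack.

    Wraps func with a per-function nesting counter; once the counter
    exceeds MAX_DEPTH we raise explicitly, which propagates up with the
    suspect frames still visible in the traceback.
    """
    depth = {'n': 0}

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        depth['n'] += 1
        try:
            if depth['n'] > MAX_DEPTH:
                raise RuntimeError('%s exceeded %d nested calls'
                                   % (func.__name__, MAX_DEPTH))
            return func(*args, **kwargs)
        finally:
            depth['n'] -= 1
    return wrapper


@fail_fast_on_recursion
def buggy(n):
    # Stand-in for the suspected looping code path: recurses forever.
    return buggy(n + 1)
```

Calling buggy(0) then raises RuntimeError after 50 nested calls rather 
than a truncated RecursionError, so the console log shows where the 
loop actually lives.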

[1] http://status.openstack.org/elastic-recheck/#1825435
[2] https://review.opendev.org/#/q/topic:bug/1825435+status:open

-- 

Thanks,

Matt


