[nova][gate] status of some gate bugs
Hey all,
We've been having a tough time lately in the gate hitting various bugs while our patches go through CI. I just wanted to mention a few of them that I've seen often in my gerrit notifications and give a brief status on fix efforts.
* http://status.openstack.org/elastic-recheck/#1813789
This one is where the nova-live-migration job fails a server evacuate test with: "Timeout waiting for [('network-vif-plugged', 'e3d3db3f-bce4-4889-b161-4b73648f79be')] for instance with vm_state error and task_state rebuild_spawning.: eventlet.timeout.Timeout: 300 seconds" in the screen-n-cpu.txt log.
lyarwood has a WIP patch here:
https://review.opendev.org/713674
and sean-k-mooney has a WIP patch here:
https://review.opendev.org/713342
* https://launchpad.net/bugs/1867380
This one is where the nova-live-migration or nova-grenade-multinode job fails due to n-cpu restarting slowly after being reconfigured for ceph. The server fails to build because the test begins before nova-compute has fully come up, and we see this error: "Instance spawn was interrupted before instance_claim, setting instance to ERROR state {{(pid=3783) _error_out_instances_whose_build_was_interrupted" in the screen-n-cpu.txt log.
lyarwood has a patch approved here that we've been rechecking the heck out of that has yet to merge:
https://review.opendev.org/713035
* https://launchpad.net/bugs/1844568
This one is where a job fails with: "Body: b'{"conflictingRequest": {"code": 409, "message": "Multiple possible networks found, use a Network ID to be more specific."}}'"
gmann has a patch proposed to fix some of these here:
https://review.opendev.org/711049
There might be more test classes that need create_default_network = True.
* http://status.openstack.org/elastic-recheck/#1844929
This one is where a job fails and the following error is seen in one of the logs, usually screen-n-sch.txt: "Timed out waiting for response from cell 8acfb79b-2e40-4e1c-bc3d-d404dac6db90".
The TL;DR on this one is there's no immediate clue why it's happening. This bug used to hit only occasionally, on "slow" nodes like nodes from the OVH or INAP providers (and OVH restricts disk iops [1]). Now, it seems like it's hitting much more often (still mostly on OVH nodes).
I've been looking at it for about a week now and I've been using a DNM patch to add debug logging, look at dstat --disk-wait output, try mysqld my.cnf settings, etc:
https://review.opendev.org/701478
So far, what I find is that when we get into the fail state, we get no rows back from the database server when we query for nova 'services' and 'compute_nodes' records, and we fail with the "Timed out waiting for response" error.
I haven't figured out why yet. The disk wait doesn't look high when this happens (or at any time during a run), so it doesn't seem to be related to disk IO. I'm continuing to look into it.
Cheers, -melanie
[1] http://lists.openstack.org/pipermail/openstack-discuss/2019-November/010505....
Hey Melanie, all,
About the OVH case (the company I work for): we are digging into the issue.
First thing, we do not limit the IOPS anymore. I don't remember when we removed this limit, but this is not new.
However, the hypervisors are quite old now, and our policy on these old servers was to use some swap. And we think that the hosts may slow down when overcommitting on RAM (swapping to disk).
Anyway, we also know that we can get better latency by upgrading QEMU. We are currently in the middle of testing a new QEMU version; I will push to upgrade your hypervisors first, so we will see if the latency improvement on the QEMU side can help the gate.
Finally, we plan to change the hardware and stop doing overcommit on RAM (and swapping to disk). I have no ETA for that yet, but this will certainly improve the IOPS.
I'll keep you in touch.
Cheers,
-- Arnaud Morin
On 18.03.20 - 22:50, melanie witt wrote:
Hey all,
We've been having a tough time lately in the gate hitting various bugs while our patches go through CI. I just wanted to mention a few of them that I've seen often in my gerrit notifications and give a brief status on fix efforts.
* http://status.openstack.org/elastic-recheck/#1813789
This one is where the nova-live-migration job fails a server evacuate test with: "Timeout waiting for [('network-vif-plugged', 'e3d3db3f-bce4-4889-b161-4b73648f79be')] for instance with vm_state error and task_state rebuild_spawning.: eventlet.timeout.Timeout: 300 seconds" in the screen-n-cpu.txt log.
lyarwood has a WIP patch here:
https://review.opendev.org/713674
and sean-k-mooney has a WIP patch here:
https://review.opendev.org/713342
* https://launchpad.net/bugs/1867380
This one is where the nova-live-migration or nova-grenade-multinode job fails due to n-cpu restarting slowly after being reconfigured for ceph. The server fails to build because the test begins before nova-compute has fully come up, and we see this error: "Instance spawn was interrupted before instance_claim, setting instance to ERROR state {{(pid=3783) _error_out_instances_whose_build_was_interrupted" in the screen-n-cpu.txt log.
lyarwood has a patch approved here that we've been rechecking the heck out of that has yet to merge:
https://review.opendev.org/713035
* https://launchpad.net/bugs/1844568
This one is where a job fails with: "Body: b'{"conflictingRequest": {"code": 409, "message": "Multiple possible networks found, use a Network ID to be more specific."}}'"
gmann has a patch proposed to fix some of these here:
https://review.opendev.org/711049
There might be more test classes that need create_default_network = True.
* http://status.openstack.org/elastic-recheck/#1844929
This one is where a job fails and the following error is seen in one of the logs, usually screen-n-sch.txt: "Timed out waiting for response from cell 8acfb79b-2e40-4e1c-bc3d-d404dac6db90".
The TL;DR on this one is there's no immediate clue why it's happening. This bug used to hit only occasionally, on "slow" nodes like nodes from the OVH or INAP providers (and OVH restricts disk iops [1]). Now, it seems like it's hitting much more often (still mostly on OVH nodes).
I've been looking at it for about a week now and I've been using a DNM patch to add debug logging, look at dstat --disk-wait output, try mysqld my.cnf settings, etc:
https://review.opendev.org/701478
So far, what I find is that when we get into the fail state, we get no rows back from the database server when we query for nova 'services' and 'compute_nodes' records, and we fail with the "Timed out waiting for response" error.
I haven't figured out why yet. The disk wait doesn't look high when this happens (or at any time during a run), so it doesn't seem to be related to disk IO. I'm continuing to look into it.
Cheers, -melanie
[1] http://lists.openstack.org/pipermail/openstack-discuss/2019-November/010505....
On Thu, 2020-03-19 at 08:25 +0000, Arnaud Morin wrote:
Hey Melanie, all,
About the OVH case (the company I work for): we are digging into the issue.
First thing, we do not limit the IOPS anymore. I don't remember when we removed this limit, but this is not new.
However, the hypervisors are quite old now, and our policy on these old servers was to use some swap. And we think that the hosts may slow down when overcommitting on RAM (swapping to disk).
Anyway, we also know that we can get better latency by upgrading QEMU. We are currently in the middle of testing a new QEMU version; I will push to upgrade your hypervisors first, so we will see if the latency improvement on the QEMU side can help the gate.
Finally, we plan to change the hardware and stop doing overcommit on RAM (and swapping to disk). I have no ETA for that yet, but this will certainly improve the IOPS.
If you stop doing overcommit on RAM, can I also suggest that you enable hugepages for hosting the VMs? If this is dedicated hardware for the CI, then you should not need to live migrate these VMs since they are short-lived (max 3-4 hours), so you can just disable the host, let it drain, do your maintenance and enable it again. When there is no oversubscription anyway, enabling hugepages will not affect capacity, but it will give a 30-40% performance boost.
Anyway, it's just something to consider; it won't allow or prevent any testing we do today, but in addition to the new QEMU version it should improve the performance of the OVH nodes. Not that they are generally slow; OVH used to perform quite well even on older hardware. But hugepages will improve all guest memory access times, and when combined with not swapping to disk that should result in a marked improvement in overall performance and CI job times. If ye don't want to enable hugepages that's fine too, but since ye are considering making changes to the hosts anyway I thought I would ask.
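For readers who have not set up hugepages before, here is a minimal, illustrative sketch (not from this thread, and the numbers are made up) of reserving 2 MiB pages through the kernel's procfs knob. Real deployments usually reserve pages on the kernel command line at boot instead, and on the nova side guests are only backed by hugepages when the flavor or image asks for it via hw:mem_page_size.

# Illustrative sketch only: reserve hugepages on a hypervisor via procfs
# (requires root). Numbers below are example values.

def reserve_hugepages(count: int) -> None:
    """Ask the kernel to reserve `count` 2 MiB hugepages."""
    with open("/proc/sys/vm/nr_hugepages", "w") as f:
        f.write(str(count))

def hugepage_stats() -> dict:
    """Read back the HugePages_* counters from /proc/meminfo."""
    stats = {}
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("HugePages_"):
                key, value = line.split(":")
                stats[key] = int(value.strip().split()[0])
    return stats

if __name__ == "__main__":
    reserve_hugepages(4096)          # roughly 8 GiB of 2 MiB pages, example value
    print(hugepage_stats())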
I'll keep you in touch.
Cheers,
On Thu, Mar 19, 2020, at 1:25 AM, Arnaud Morin wrote:
Hey Melanie, all,
About the OVH case (the company I work for): we are digging into the issue.
First thing, we do not limit the IOPS anymore. I don't remember when we removed this limit, but this is not new.
However, the hypervisors are quite old now, and our policy on these old servers was to use some swap. And we think that the hosts may slow down when overcommitting on RAM (swapping to disk).
Anyway, we also know that we can get better latency by upgrading QEMU. We are currently in the middle of testing a new QEMU version; I will push to upgrade your hypervisors first, so we will see if the latency improvement on the QEMU side can help the gate.
Finally, we plan to change the hardware and stop doing overcommit on RAM (and swapping to disk). I have no ETA for that yet, but this will certainly improve the IOPS.
You all likely know far more about this than I do, but our use case is likely ideal for kernel same page merging because we boot a relatively small number of identical images that rotate relatively slowly (24 hours). Turning that on, if not already, could potentially reduce memory pressure.
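For reference, KSM is controlled through sysfs, so a minimal sketch of turning it on looks like the following. The paths are the standard kernel ones; the scan-rate value is just an example, and guests only get merged if their memory is marked mergeable, which QEMU does by default.

# Illustrative sketch only: enable kernel same-page merging (KSM) via sysfs
# (requires root and a kernel built with CONFIG_KSM).

KSM = "/sys/kernel/mm/ksm"

def enable_ksm(pages_to_scan: int = 100) -> None:
    with open(f"{KSM}/pages_to_scan", "w") as f:
        f.write(str(pages_to_scan))   # pages ksmd scans per wake-up (example value)
    with open(f"{KSM}/run", "w") as f:
        f.write("1")                  # 1 = run ksmd, 0 = stop, 2 = unmerge everything

def ksm_pages_sharing() -> int:
    """Pages currently shared, a rough measure of how much memory KSM is saving."""
    with open(f"{KSM}/pages_sharing") as f:
        return int(f.read())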
I'll keep you in touch.
Cheers,
-- Arnaud Morin
On 18-03-20 22:50:18, melanie witt wrote:
Hey all,
We've been having a tough time lately in the gate hitting various bugs while our patches go through CI. I just wanted to mention a few of them that I've seen often in my gerrit notifications and give a brief status on fix efforts.
Many thanks for writing this up Mel! Comments below on issues I've been working on.
* http://status.openstack.org/elastic-recheck/#1813789
This one is where the nova-live-migration job fails a server evacuate test with: "Timeout waiting for [('network-vif-plugged', 'e3d3db3f-bce4-4889-b161-4b73648f79be')] for instance with vm_state error and task_state rebuild_spawning.: eventlet.timeout.Timeout: 300 seconds" in the screen-n-cpu.txt log.
lyarwood has a WIP patch here:
This looks like it actually fixes it by fencing the devstack@* services and running guest domains on the subnode. In the gate now, thanks gibi, sean-k-mooney and stephenfin!
and sean-k-mooney has a WIP patch here:
Might be something we still want to look into under another bug?
* https://launchpad.net/bugs/1867380
This one is where the nova-live-migration or nova-grenade-multinode job fails due to n-cpu restarting slowly after being reconfigured for ceph. The server fails to build because the test begins before nova-compute has fully come up, and we see this error: "Instance spawn was interrupted before instance_claim, setting instance to ERROR state {{(pid=3783) _error_out_instances_whose_build_was_interrupted" in the screen-n-cpu.txt log.
lyarwood has a patch approved here that we've been rechecking the heck out of that has yet to merge:
Merged on master and backported all the way back to stable/pike on the following topic: https://review.opendev.org/#/q/topic:bug/1867380+status:open
* https://launchpad.net/bugs/1844568
This one is where a job fails with: "Body: b'{"conflictingRequest": {"code": 409, "message": "Multiple possible networks found, use a Network ID to be more specific."}}'"
gmann has a patch proposed to fix some of these here:
https://review.opendev.org/711049
There might be more test classes that need create_default_network = True.
* http://status.openstack.org/elastic-recheck/#1844929
This one is where a job fails and the following error is seen in one of the logs, usually screen-n-sch.txt: "Timed out waiting for response from cell 8acfb79b-2e40-4e1c-bc3d-d404dac6db90".
The TL;DR on this one is there's no immediate clue why it's happening. This bug used to hit only occasionally, on "slow" nodes like nodes from the OVH or INAP providers (and OVH restricts disk iops [1]). Now, it seems like it's hitting much more often (still mostly on OVH nodes).
I've been looking at it for about a week now and I've been using a DNM patch to add debug logging, look at dstat --disk-wait output, try mysqld my.cnf settings, etc:
https://review.opendev.org/701478
So far, what I find is that when we get into the fail state, we get no rows back from the database server when we query for nova 'services' and 'compute_nodes' records, and we fail with the "Timed out waiting for response" error.
I haven't figured out why yet. The disk wait doesn't look high when this happens (or at any time during a run), so it doesn't seem to be related to disk IO. I'm continuing to look into it.
Cheers, -melanie
[1] http://lists.openstack.org/pipermail/openstack-discuss/2019-November/010505....
I've also hit the following a few times on master:
test_update_delete_extra_route failing due to timeout when creating subnets https://bugs.launchpad.net/neutron/+bug/1867936
I'll try to write up a logstash query for this now and post a review for recheck.
Thanks again,
-- Lee Yarwood A5D1 9385 88CB 7E5F BE64 6618 BCA6 6E33 F672 2D76
On Thu, 2020-03-19 at 14:27 +0000, Lee Yarwood wrote:
On 18-03-20 22:50:18, melanie witt wrote:
Hey all,
We've been having a tough time lately in the gate hitting various bugs while our patches go through CI. I just wanted to mention a few of them that I've seen often in my gerrit notifications and give a brief status on fix efforts.
Many thanks for writing this up Mel!
Comments below on issues I've been working on.
* http://status.openstack.org/elastic-recheck/#1813789
This one is where the nova-live-migration job fails a server evacuate test with: "Timeout waiting for [('network-vif-plugged', 'e3d3db3f-bce4-4889-b161-4b73648f79be')] for instance with vm_state error and task_state rebuild_spawning.: eventlet.timeout.Timeout: 300 seconds" in the screen-n-cpu.txt log.
lyarwood has a WIP patch here:
This looks like it actually fixes it by fencing the devstack@* services and running guest domains on the subnode. In the gate now, thanks gibi, sean-k-mooney and stephenfin!
and sean-k-mooney has a WIP patch here:
Might be something we still want to look into under another bug?
Yeah, so I think we can file a separate bug for the "intermittently fails with network-vif-plugged timeout exception for shelve" case. We have already fixed it for resize revert, but as part of that mriedemann noted that it was broken for shelve (https://bugs.launchpad.net/nova/+bug/1813789/comments/3) and it's theoretically broken for evacuate. So what I can do is file the new bug and then track https://bugs.launchpad.net/nova/+bug/1813789 as a related bug, rather than having the fix above specifically close it, since I think your patch above fixes it sufficiently. My [WIP] patch seems to have some other side effect that I was not expecting, so I need to look into that more closely.
* https://launchpad.net/bugs/1867380
This one is where the nova-live-migration or nova-grenade-multinode job fails due to n-cpu restarting slowly after being reconfigured for ceph. The server fails to build because the test begins before nova-compute has fully come up, and we see this error: "Instance spawn was interrupted before instance_claim, setting instance to ERROR state {{(pid=3783) _error_out_instances_whose_build_was_interrupted" in the screen-n-cpu.txt log.
lyarwood has a patch approved here that we've been rechecking the heck out of that has yet to merge:
Merged on master and backported all the way back to stable/pike on the following topic:
https://review.opendev.org/#/q/topic:bug/1867380+status:open
* https://launchpad.net/bugs/1844568
This one is where a job fails with: "Body: b'{"conflictingRequest": {"code": 409, "message": "Multiple possible networks found, use a Network ID to be more specific."}}'"
gmann has a patch proposed to fix some of these here:
https://review.opendev.org/711049
There might be more test classes that need create_default_network = True.
* http://status.openstack.org/elastic-recheck/#1844929
This one is where a job fails and the following error is seen in one of the logs, usually screen-n-sch.txt: "Timed out waiting for response from cell 8acfb79b-2e40-4e1c-bc3d-d404dac6db90".
The TL;DR on this one is there's no immediate clue why it's happening. This bug used to hit only occasionally, on "slow" nodes like nodes from the OVH or INAP providers (and OVH restricts disk iops [1]). Now, it seems like it's hitting much more often (still mostly on OVH nodes).
I've been looking at it for about a week now and I've been using a DNM patch to add debug logging, look at dstat --disk-wait output, try mysqld my.cnf settings, etc:
https://review.opendev.org/701478
So far, what I find is that when we get into the fail state, we get no rows back from the database server when we query for nova 'services' and 'compute_nodes' records, and we fail with the "Timed out waiting for response" error.
I haven't figured out why yet. The disk wait doesn't look high when this happens (or at any time during a run), so it doesn't seem to be related to disk IO. I'm continuing to look into it.
Cheers, -melanie
[1] http://lists.openstack.org/pipermail/openstack-discuss/2019-November/010505....
I've also hit the following a few times on master:
test_update_delete_extra_route failing due to timeout when creating subnets https://bugs.launchpad.net/neutron/+bug/1867936
I'll try to write up a logstash query for this now and post a review for recheck.
Thanks again,
Hey all,
I have a few updates since last time.
This morning we noticed http://status.openstack.org/elastic-recheck was behind on elastic search indexing by 90+ hours. I mentioned it in #openstack-infra and clarkb checked and saw that a worker was stuck on 226 MB of nova-api logs. I checked and found it was due to default policy deprecation messages being logged 10K+ times per run. I was able to get help from gmann on the best way to stop the bleeding, and he proposed a patch that we've since merged: https://review.opendev.org/717802
After this merged, clarkb cleared the elastic search indexing backlog, as it would never have been able to catch up.
gmann is working on some oslo.policy changes to better handle policy deprecation logging that we hope to be able to consume soon. Worst case, we'll leave the policy deprecation logging disabled for now, as the nova code is OR'ing new policy defaults with old policy defaults and will continue to do so for 1-2 cycles. So we have time to sort out the logging issue in the event that we can't get a solution via FFE in oslo.policy.
On 3/18/20 22:50, melanie witt wrote:
* http://status.openstack.org/elastic-recheck/#1813789
This one is where the nova-live-migration job fails a server evacuate test with: "Timeout waiting for [('network-vif-plugged', 'e3d3db3f-bce4-4889-b161-4b73648f79be')] for instance with vm_state error and task_state rebuild_spawning.: eventlet.timeout.Timeout: 300 seconds" in the screen-n-cpu.txt log.
lyarwood has a WIP patch here:
This patch has merged since last time and the fail rate is down to 1 fail in the last 10 days (note this is likely off a bit because of the indexing snafu from earlier).
and sean-k-mooney has a WIP patch here:
But, it is not completely gone yet (I just had a patch fail on it a few min ago), so we'll need to work further on this patch (relevant IRC convo is linked in a review comment).
* https://launchpad.net/bugs/1867380
This one is where the nova-live-migration or nova-grenade-multinode job fails due to n-cpu restarting slowly after being reconfigured for ceph. The server fails to build because the test begins before nova-compute has fully come up, and we see this error: "Instance spawn was interrupted before instance_claim, setting instance to ERROR state {{(pid=3783) _error_out_instances_whose_build_was_interrupted" in the screen-n-cpu.txt log.
lyarwood has a patch approved here that we've been rechecking the heck out of that has yet to merge:
This patch has merged and I noticed a big improvement in the fail rate in the gate since then (we did not have an e-r fingerprint for this one -- I'll try to add one so we can see when/if it crops back up).
* https://launchpad.net/bugs/1844568
This one is where a job fails with: "Body: b'{"conflictingRequest": {"code": 409, "message": "Multiple possible networks found, use a Network ID to be more specific."}}'"
gmann has a patch proposed to fix some of these here:
This patch also merged and its fail rate is down to 5 fails in the last 10 days.
There might be more test classes that need create_default_network = True.
One of my patches hit another instance of this bug and I've proposed a patch for that one specifically: https://review.opendev.org/716809
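For context, the shape of these fixes is just a class attribute on the affected tempest test classes. The sketch below is illustrative only: the class and test names are placeholders, not gmann's actual change, and whether a given base class honors the attribute depends on the tempest version in use.

# Placeholder sketch of the tempest-side pattern; see the reviews above for
# the real changes.
from tempest.api.compute import base


class ExampleServersTest(base.BaseV2ComputeTest):

    # Asks the test infrastructure to create a single default network for
    # this class's credentials, so server creates that pass no network do
    # not hit the 409 "Multiple possible networks found" error.
    create_default_network = True

    def test_boot_server_without_explicit_network(self):
        # No networks argument here -- exactly the pattern that trips the
        # bug when the tenant ends up with more than one network.
        server = self.create_test_server(wait_until='ACTIVE')
        self.assertIsNotNone(server['id'])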
* http://status.openstack.org/elastic-recheck/#1844929
This one is where a job fails and the following error is seen in one of the logs, usually screen-n-sch.txt: "Timed out waiting for response from cell 8acfb79b-2e40-4e1c-bc3d-d404dac6db90".
The TL;DR on this one is there's no immediate clue why it's happening. This bug used to hit only occasionally, on "slow" nodes like nodes from the OVH or INAP providers (and OVH restricts disk iops [1]). Now, it seems like it's hitting much more often (still mostly on OVH nodes).
I've been looking at it for about a week now and I've been using a DNM patch to add debug logging, look at dstat --disk-wait output, try mysqld my.cnf settings, etc:
https://review.opendev.org/701478
So far, what I find is that when we get into the fail state, we get no rows back from the database server when we query for nova 'services' and 'compute_nodes' records, and we fail with the "Timed out waiting for response" error.
I haven't figured out why yet. The disk wait doesn't look high when this happens (or at any time during a run), so it doesn't seem to be related to disk IO. I'm continuing to look into it.
I think I finally got to the bottom of this one and found that during the grenade runs after we restart nova-scheduler, while it's coming back up, requests are flowing in and the parent process is holding an oslo.db lock when the child process workers are forked. So, the child processes inherit the held locks and database access through those oslo.db objects can never succeed. This results in the CellTimeout error -- it behaves as though the database never sends us results, but what really happened was our database requests were never made because they were stuck outside the inherited held locks [2].
Inheriting of held standard library locks at fork is a known issue in python [3] that is currently being worked on [4].
I think we can handle this in nova though in the meantime by clearing our cell cache that holds oslo.db database transaction context manager objects during service start(). This way, we get fresh oslo.db objects with unlocked locks when child processes start up. I have proposed a patch for this here:
https://review.opendev.org/717662
and have been rechecking it a bunch of times to get more samples. So far, the grenade-py3 and nova-grenade-multinode jobs have both passed 7 runs in a row on the change.
Cheers, -melanie
[2] https://bugs.launchpad.net/nova/+bug/1844929/comments/28
[3] https://bugs.python.org/issue40089
[4] https://github.com/python/cpython/pull/19195
[1] http://lists.openstack.org/pipermail/openstack-discuss/2019-November/010505....
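To make the failure mode above concrete, here is a small standalone reproduction in plain Python (a sketch, not nova code): a lock held by the parent at the moment of fork() is inherited in the locked state by the child, and the child can never acquire it, which is the same shape as the CellTimeout symptom without any database involved.

# Standalone demo of fork() inheriting a held lock.
import os
import sys
import threading
import time

db_lock = threading.Lock()   # stands in for the lock inside a cached
                             # oslo.db transaction context manager

def parent_holds_lock_briefly():
    # A parent thread grabs the lock (like an in-flight DB request) and holds
    # it across the window in which the service forks its workers.
    with db_lock:
        time.sleep(2)

threading.Thread(target=parent_holds_lock_briefly).start()
time.sleep(0.5)              # make sure the lock is held when we fork

pid = os.fork()
if pid == 0:
    # Child worker: the lock object was copied in its held state, and the
    # thread that would release it does not exist in the child.
    acquired = db_lock.acquire(timeout=5)
    print("child acquired lock:", acquired)   # False -> would hang/time out
    sys.exit(0)

os.waitpid(pid, 0)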
Hello again,
I realized I didn't mention I had written updates inline as well and maybe they weren't seen because of it. Re-sending in case that happened; please see a few updates and a couple of requests for review inline.
On 4/6/20 17:16, melanie witt wrote:
Hey all,
I have a few updates since last time.
<snip>
On 3/18/20 22:50, melanie witt wrote:
* http://status.openstack.org/elastic-recheck/#1813789
This one is where the nova-live-migration job fails a server evacuate test with: "Timeout waiting for [('network-vif-plugged', 'e3d3db3f-bce4-4889-b161-4b73648f79be')] for instance with vm_state error and task_state rebuild_spawning.: eventlet.timeout.Timeout: 300 seconds" in the screen-n-cpu.txt log.
lyarwood has a WIP patch here:
This patch has merged since last time and the fail rate is down to 1 fail in the last 10 days (note this is likely off a bit because of the indexing snafu from earlier).
and sean-k-mooney has a WIP patch here:
But, it is not completely gone yet (I just had a patch fail on it a few min ago), so we'll need to work further on this patch (relevant IRC convo is linked in a review comment).
* https://launchpad.net/bugs/1867380
This one is where the nova-live-migration or nova-grenade-multinode job fails due to n-cpu restarting slowly after being reconfigured for ceph. The server fails to build because the test begins before nova-compute has fully come up, and we see this error: "Instance spawn was interrupted before instance_claim, setting instance to ERROR state {{(pid=3783) _error_out_instances_whose_build_was_interrupted" in the screen-n-cpu.txt log.
lyarwood has a patch approved here that we've been rechecking the heck out of that has yet to merge:
This patch has merged and I noticed a big improvement in the fail rate in the gate since then (we did not have an e-r fingerprint for this one -- I'll try to add one so we can see when/if it crops back up).
* https://launchpad.net/bugs/1844568
This one is where a job fails with: "Body: b'{"conflictingRequest": {"code": 409, "message": "Multiple possible networks found, use a Network ID to be more specific."}}'"
gmann has a patch proposed to fix some of these here:
This patch also merged and its fail rate is down to 5 fails in the last 10 days.
There might be more test classes that need create_default_network = True.
One of my patches hit another instance of this bug and I've proposed a patch for that one specifically:
I would appreciate review on this patch ^
* http://status.openstack.org/elastic-recheck/#1844929
This one is where a job fails and the following error is seen in one of the logs, usually screen-n-sch.txt: "Timed out waiting for response from cell 8acfb79b-2e40-4e1c-bc3d-d404dac6db90".
The TL;DR on this one is there's no immediate clue why it's happening. This bug used to hit only occasionally, on "slow" nodes like nodes from the OVH or INAP providers (and OVH restricts disk iops [1]). Now, it seems like it's hitting much more often (still mostly on OVH nodes).
I've been looking at it for about a week now and I've been using a DNM patch to add debug logging, look at dstat --disk-wait output, try mysqld my.cnf settings, etc:
https://review.opendev.org/701478
So far, what I find is that when we get into the fail state, we get no rows back from the database server when we query for nova 'services' and 'compute_nodes' records, and we fail with the "Timed out waiting for response" error.
I haven't figured out why yet. The disk wait doesn't look high when this happens (or at any time during a run), so it doesn't seem to be related to disk IO. I'm continuing to look into it.
I think I finally got to the bottom of this one and found that during the grenade runs after we restart nova-scheduler, while it's coming back up, requests are flowing in and the parent process is holding an oslo.db lock when the child process workers are forked. So, the child processes inherit the held locks and database access through those oslo.db objects can never succeed. This results in the CellTimeout error -- it behaves as though the database never sends us results, but what really happened was our database requests were never made because they were stuck outside the inherited held locks [2].
Inheriting of held standard library locks at fork is a known issue in python [3] that is currently being worked on [4].
I think we can handle this in nova though in the meantime by clearing our cell cache that holds oslo.db database transaction context manager objects during service start(). This way, we get fresh oslo.db objects with unlocked locks when child processes start up. I have proposed a patch for this here:
I would appreciate review on this patch ^
and have been rechecking it a bunch of times to get more samples. So far, the grenade-py3 and nova-grenade-multinode jobs have both passed 7 runs in a row on the change.
Cheers, -melanie
[2] https://bugs.launchpad.net/nova/+bug/1844929/comments/28
[3] https://bugs.python.org/issue40089
[4] https://github.com/python/cpython/pull/19195
[1] http://lists.openstack.org/pipermail/openstack-discuss/2019-November/010505....
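For reference, the mitigation described above boils down to clearing a cached mapping at service start so that each (possibly forked) worker rebuilds fresh oslo.db objects. The sketch below uses assumed names (CELL_CACHE, Service.start); the actual nova change is in https://review.opendev.org/717662.

# Rough sketch only, with assumed names -- not the actual nova diff.

# Hypothetical stand-in for a module-level cache of per-cell oslo.db
# transaction context managers, keyed by cell UUID.
CELL_CACHE = {}

class Service:
    def start(self):
        # Runs in each worker after fork: drop anything cached pre-fork so
        # no inherited, already-held locks can wedge database access.
        CELL_CACHE.clear()
        # ... continue with normal service startup ...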
On 4/7/20 11:49, melanie witt wrote:
<snip>
* http://status.openstack.org/elastic-recheck/#1844929
This one is where a job fails and the following error is seen in one of the logs, usually screen-n-sch.txt: "Timed out waiting for response from cell 8acfb79b-2e40-4e1c-bc3d-d404dac6db90".
The TL;DR on this one is there's no immediate clue why it's happening. This bug used to hit only occasionally, on "slow" nodes like nodes from the OVH or INAP providers (and OVH restricts disk iops [1]). Now, it seems like it's hitting much more often (still mostly on OVH nodes).
I've been looking at it for about a week now and I've been using a DNM patch to add debug logging, look at dstat --disk-wait output, try mysqld my.cnf settings, etc:
https://review.opendev.org/701478
So far, what I find is that when we get into the fail state, we get no rows back from the database server when we query for nova 'services' and 'compute_nodes' records, and we fail with the "Timed out waiting for response" error.
I haven't figured out why yet. The disk wait doesn't look high when this happens (or at any time during a run), so it doesn't seem to be related to disk IO. I'm continuing to look into it.
I think I finally got to the bottom of this one and found that during the grenade runs after we restart nova-scheduler, while it's coming back up, requests are flowing in and the parent process is holding an oslo.db lock when the child process workers are forked. So, the child processes inherit the held locks and database access through those oslo.db objects can never succeed. This results in the CellTimeout error -- it behaves as though the database never sends us results, but what really happened was our database requests were never made because they were stuck outside the inherited held locks [2].
Inheriting of held standard library locks at fork is a known issue in python [3] that is currently being worked on [4].
Correction: this issue [5] is not considered a bug in python and the recommended way of handling the problem is to use the os.register_at_fork() method (new in python 3.7) in oslo.db to reinitialize its lock [6].
For now, we still support python 3.6, so we have to handle it outside of oslo.db some way (https://review.opendev.org/717662 is proposed). But we can also add use of os.register_at_fork() in oslo.db for those who are running >= python 3.7 to handle the problem more generally.
[5] https://bugs.python.org/issue6721
[6] https://bugs.python.org/issue40091#msg365959
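For anyone curious what the >= python 3.7 approach would look like, here is a minimal sketch with made-up names (the real change would live in oslo.db, not in code like this): a library that guards shared state with a lock registers a hook so the lock is replaced with a fresh, unlocked one in every forked child.

# Sketch of using os.register_at_fork(); names are placeholders.
import os
import sys
import threading

_facade_lock = threading.Lock()   # stands in for the library's internal lock

def _reinit_lock_after_fork():
    global _facade_lock
    _facade_lock = threading.Lock()   # child starts with an unlocked lock

if sys.version_info >= (3, 7):
    # after_in_child runs in the child process right after os.fork()
    os.register_at_fork(after_in_child=_reinit_lock_after_fork)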
I think we can handle this in nova though in the meantime by clearing our cell cache that holds oslo.db database transaction context manager objects during service start(). This way, we get fresh oslo.db objects with unlocked locks when child processes start up. I have proposed a patch for this here:
I would appreciate review on this patch ^
and have been rechecking it a bunch of times to get more samples. So far, the grenade-py3 and nova-grenade-multinode jobs have both passed 7 runs in a row on the change.
Cheers, -melanie
[2] https://bugs.launchpad.net/nova/+bug/1844929/comments/28
[3] https://bugs.python.org/issue40089
[4] https://github.com/python/cpython/pull/19195
[1] http://lists.openstack.org/pipermail/openstack-discuss/2019-November/010505....
participants (5): Arnaud Morin, Clark Boylan, Lee Yarwood, melanie witt, Sean Mooney