State of the Gate
Things are great! Surprise! I just wanted to let everyone know. Later!

. . . . .

Now that you've been tricked, on Halloween no less, I'm here to tell you that things suck right now. This is your periodic digest of issues. Grab some fun-sized candy bars and read on.

I think right now we have three major issues.

1. http://status.openstack.org/elastic-recheck/index.html#1763070

This has resurfaced and I'm not sure why, nor do I think we ever had a great handle on what is causing this or how to work around it, so if anyone has new ideas please chip in.

2. http://status.openstack.org/elastic-recheck/index.html#1844929

I've done some digging into this one and my notes are in the bug report. It mostly affects grenade jobs but is not entirely restricted to grenade jobs. It's also mostly on OVH and FortNebula nodes but not totally.

From looking at the mysql logs in the grenade jobs, mysqld is (re)started three times. I think (1) for initial package install, (2) for stacking devstack on the old side, and (3) for stacking devstack on the new side. After the last restart, there are a lot of aborted connection messages in the mysql error log. It's around then that grenade is running post-upgrade smoke tests to create a server and the nova-scheduler times out communicating with the nova_cell1 database.

I have a few patches up to grenade/devstack [1] to try some things and get more mysql logs but so far they aren't really helpful. We need someone with some more mysql debugging experience to help here, maybe zzzeek or mordred?

3. CirrOS guest SSH issues

There are several (some might be duplicates):

http://status.openstack.org/elastic-recheck/index.html#1848078
http://status.openstack.org/elastic-recheck/index.html#1808010 (most hits)
http://status.openstack.org/elastic-recheck/index.html#1463631
http://status.openstack.org/elastic-recheck/index.html#1849857
http://status.openstack.org/elastic-recheck/index.html#1737039
http://status.openstack.org/elastic-recheck/index.html#1840355
http://status.openstack.org/elastic-recheck/index.html#1843610

A few notes here.

a) We're still using CirrOS 0.4.0 since Stein:

https://review.opendev.org/#/c/521825/

And that image was published nearly 2 years ago and there are no newer versions on the CirrOS download site, so we can't try a newer image to see if that fixes things.

b) Some of the issues above are related to running out of disk in the guest. I'm not sure what is causing that, but I have posted a devstack patch that is related:

https://review.opendev.org/#/c/690991

tl;dr: before Stein the tempest flavors we used had disk=0, so nova would fit the root disk to the size of the image. Since Stein the tempest flavors specify a root disk size (1GiB for the CirrOS images). My patch pads an extra 1GiB to the root disk on the tempest flavors. One side effect of that is the volumes tempest creates will go from 1GiB to 2GiB, which could be a problem if a lot of tempest volume tests run at the same time, though we do have a volume group size of 24GB in gate runs so I think we're OK for now. I'm not sure my patch would help, but it's an idea.

As for the other key generation and dhcp lease failures, I don't know what to do about those. We need more eyes on these issues to generate some ideas or see if we're doing something wrong in our tests, e.g. generating too much data for the config drive? Not using config drive in some cases? Metadata API server is too slow (note we cache the metadata since [2])? Other ideas on injecting logs for debug?
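Going back to #2 for a second, here's a rough sketch (just an idea, not one of the patches in [1]) of polling a couple of MySQL status counters so we can line up when connections start getting aborted against the grenade timeline. It assumes PyMySQL is available and the placeholder credentials below are replaced with whatever the job actually configures:

    # Hypothetical helper: poll MySQL connection/abort counters every few
    # seconds so the output can be correlated with the grenade console log.
    import time

    import pymysql

    def poll_status(interval=10, password='REPLACE_WITH_DB_PASSWORD'):
        while True:
            conn = pymysql.connect(host='127.0.0.1', user='root', password=password)
            try:
                with conn.cursor() as cur:
                    cur.execute(
                        "SHOW GLOBAL STATUS WHERE Variable_name IN "
                        "('Aborted_connects', 'Aborted_clients', 'Threads_connected')")
                    print(time.strftime('%H:%M:%S'), dict(cur.fetchall()))
            finally:
                conn.close()
            time.sleep(interval)

    if __name__ == '__main__':
        poll_status()

Reconnecting on every iteration also means the poller itself will blow up (and tell us when) if the server stops accepting connections.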
[1] https://review.opendev.org/#/q/topic:bug/1844929+status:open
[2] https://review.opendev.org/#/q/I9082be077b59acd3a39910fa64e29147cb5c2dd7

--
Thanks,
Matt
Hi,

On Thu, Oct 31, 2019 at 10:23:01AM -0500, Matt Riedemann wrote:
Things are great! Surprise! I just wanted to let everyone know. Later! . . . . . Now that you've been tricked, on Halloween no less, I'm here to tell you that things suck right now. This is your periodic digest of issues. Grab some fun-sized candy bars and read on.
I think right now we have three major issues.
1. http://status.openstack.org/elastic-recheck/index.html#1763070
This has resurfaced and I'm not sure why, nor do I think we ever had a great handle on what is causing this or how to work around it so if anyone has new ideas please chip in.
I think that this is "just" some slowdown of the node on which the job is running. I noticed it too in some neutron jobs and I checked some of them. It seems that one API request is processed for a very long time. For example, in one of the fresh examples: https://13cf3dd11b8f009809dc-97cb3b32849366f5bed744685e46b266.ssl.cf5.rackcd... it was a request to nova which took a very long time:

Oct 31 16:55:08.632162 ubuntu-bionic-inap-mtl01-0012620879 devstack@n-api.service[7191]: INFO nova.api.openstack.requestlog [None req-275af2df-bd4e-4e64-b46e-6582e8de5148 tempest-ServerDiskConfigTestJSON-1598674508 tempest-ServerDiskConfigTestJSON-1598674508] 198.72.124.104 "POST /compute/v2.1/servers/d15d2033-b29b-44f7-b619-ed7ef83fe477/action" status: 500 len: 216 microversion: 2.1 time: 161.951140
2. http://status.openstack.org/elastic-recheck/index.html#1844929
I've done some digging into this one and my notes are in the bug report. It mostly affects grenade jobs but is not entirely restricted to grenade jobs. It's also mostly on OVH and FortNebula nodes but not totally.
From looking at the mysql logs in the grenade jobs, mysqld is (re)started three times. I think (1) for initial package install, (2) for stacking devstack on the old side, and (3) for stacking devstack on the new side. After the last restart, there are a lot of aborted connection messages in the mysql error log. It's around then that grenade is running post-upgrade smoke tests to create a server and the nova-scheduler times out communicating with the nova_cell1 database.
I have a few patches up to grenade/devstack [1] to try some things and get more mysql logs but so far they aren't really helpful. We need someone with some more mysql debugging experience to help here, maybe zzzeek or mordred?
3. CirrOS guest SSH issues
There are several (some might be duplicates):
http://status.openstack.org/elastic-recheck/index.html#1848078
I think this one is the same as what we reported in https://bugs.launchpad.net/neutron/+bug/1850557. Basically we noticed issues with dhcp after resize/migration/shelve of an instance, but I didn't have time to investigate it yet.
http://status.openstack.org/elastic-recheck/index.html#1808010 (most hits)
http://status.openstack.org/elastic-recheck/index.html#1463631
http://status.openstack.org/elastic-recheck/index.html#1849857
http://status.openstack.org/elastic-recheck/index.html#1737039
http://status.openstack.org/elastic-recheck/index.html#1840355
http://status.openstack.org/elastic-recheck/index.html#1843610
A few notes here.
a) We're still using CirrOS 0.4.0 since Stein:
https://review.opendev.org/#/c/521825/
And that image was published nearly 2 years ago and there are no newer versions on the CirrOS download site so we can't try a newer image to see if that fixes things.
b) Some of the issues above are related to running out of disk in the guest. I'm not sure what is causing that, but I have posted a devstack patch that is related:
https://review.opendev.org/#/c/690991
tl;dr before Stein the tempest flavors we used had disk=0 so nova would fit the root disk to the size of the image. Since Stein the tempest flavors specify root disk size (1GiB for the CirrOS images). My patch pads an extra 1GiB to the root disk on the tempest flavors. One side effect of that is the volumes tempest creates will go from 1GiB to 2GiB which could be a problem if a lot of tempest volume tests run at the same time, though we do have a volume group size of 24GB in gate runs so I think we're OK for now. I'm not sure my patch would help, but it's an idea.
As for the other key generation and dhcp lease failures, I don't know what to do about those. We need more eyes on these issues to generate some ideas or see if we're doing something wrong in our tests, e.g. generating too much data for the config drive? Not using config drive in some cases? Metadata API server is too slow (note we cache the metadata since [2])? Other ideas on injecting logs for debug?
[1] https://review.opendev.org/#/q/topic:bug/1844929+status:open
[2] https://review.opendev.org/#/q/I9082be077b59acd3a39910fa64e29147cb5c2dd7
--
Thanks,
Matt
--
Slawek Kaplonski
Senior software engineer
Red Hat
On 10/31/2019 4:15 PM, Slawek Kaplonski wrote:
1. http://status.openstack.org/elastic-recheck/index.html#1763070
This has resurfaced and I'm not sure why, nor do I think we ever had a great handle on what is causing this or how to work around it so if anyone has new ideas please chip in.

I think that this is "just" some slowdown of the node on which the job is running. I noticed it too in some neutron jobs and I checked some of them. It seems that one API request is processed for a very long time. For example, in one of the fresh examples: https://13cf3dd11b8f009809dc-97cb3b32849366f5bed744685e46b266.ssl.cf5.rackcd... it was a request to nova which took a very long time:
Oct 31 16:55:08.632162 ubuntu-bionic-inap-mtl01-0012620879 devstack@n-api.service[7191]: INFO nova.api.openstack.requestlog [None req-275af2df-bd4e-4e64-b46e-6582e8de5148 tempest-ServerDiskConfigTestJSON-1598674508 tempest-ServerDiskConfigTestJSON-1598674508] 198.72.124.104 "POST /compute/v2.1/servers/d15d2033-b29b-44f7-b619-ed7ef83fe477/action" status: 500 len: 216 microversion: 2.1 time: 161.951140
That's really interesting, I hadn't noticed the MessagingTimeout in n-api linked to this. I dug into that one and it's really odd, but there is one known and one unknown issue here.

The known issue is resize/cold migrate is synchronous through API, conductor and scheduler while picking a host, and then we finally return to the API caller when we RPC cast to the selected destination host. If we RPC cast from API to conductor we'd avoid that MessagingTimeout issue.

The unknown weirdness that made this fail is it took ~3 minutes to move allocations from the instance to the migration record. We hit conductor about here:

Oct 31 16:52:24.300732 ubuntu-bionic-inap-mtl01-0012620879 nova-conductor[14891]: DEBUG nova.conductor.tasks.migrate [None req-275af2df-bd4e-4e64-b46e-6582e8de5148 tempest-ServerDiskConfigTestJSON-1598674508 tempest-ServerDiskConfigTestJSON-1598674508] [instance: d15d2033-b29b-44f7-b619-ed7ef83fe477] Requesting cell 374faad6-7a6f-40c7-b2f4-9138e0dde5d5(cell1) while migrating {{(pid=16132) _restrict_request_spec_to_cell /opt/stack/nova/nova/conductor/tasks/migrate.py:172}}

And then finally get the message that we moved allocations prior to scheduling:

Oct 31 16:55:08.634009 ubuntu-bionic-inap-mtl01-0012620879 nova-conductor[14891]: DEBUG nova.conductor.tasks.migrate [None req-275af2df-bd4e-4e64-b46e-6582e8de5148 tempest-ServerDiskConfigTestJSON-1598674508 tempest-ServerDiskConfigTestJSON-1598674508] Created allocations for migration 86ccf6f2-a2b0-4ae0-9c31-8bbdac4ceda9 on eccff5f8-3175-4170-b87e-cb556865fde0 {{(pid=16132) replace_allocation_with_migration /opt/stack/nova/nova/conductor/tasks/migrate.py:86}}

After that we call the scheduler to find a host and that only takes about 1 second. That POST /allocations call to placement shouldn't be taking around 3 minutes so something crazy is going on there.
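As an aside, one quick way to scan a whole service log for outliers like this is a small script along these lines (only a sketch; the regex assumes the uwsgi request-summary format, "... => generated N bytes in M msecs", shown in the placement log below, and would need adjusting for other log formats):

    # Hypothetical log scanner: print uwsgi request summaries slower than a threshold.
    import re
    import sys

    SUMMARY = re.compile(
        r'(?P<verb>[A-Z]+) (?P<path>\S+) => generated \d+ bytes '
        r'in (?P<msecs>\d+) msecs')

    def slow_requests(logfile, threshold_ms=2000):
        with open(logfile) as f:
            for line in f:
                m = SUMMARY.search(line)
                if m and int(m.group('msecs')) >= threshold_ms:
                    print(m.group('msecs'), m.group('verb'), m.group('path'))

    if __name__ == '__main__':
        slow_requests(sys.argv[1])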
If I'm reading the placement log correctly, that POST starts here:

Oct 31 16:52:24.721346 ubuntu-bionic-inap-mtl01-0012620879 devstack@placement-api.service[8591]: DEBUG placement.requestlog [req-275af2df-bd4e-4e64-b46e-6582e8de5148 req-295f7350-2f0b-4f85-8cdf-d76801637221 None None] Starting request: 198.72.124.104 "POST /placement/allocations" {{(pid=8593) __call__ /opt/stack/placement/placement/requestlog.py:61}}

And returns here:

Oct 31 16:55:08.629708 ubuntu-bionic-inap-mtl01-0012620879 devstack@placement-api.service[8591]: DEBUG placement.handlers.allocation [req-275af2df-bd4e-4e64-b46e-6582e8de5148 req-295f7350-2f0b-4f85-8cdf-d76801637221 service placement] Successfully wrote allocations [<placement.objects.allocation.Allocation object at 0x7f023c9bcfd0>, <placement.objects.allocation.Allocation object at 0x7f023c9bc6d8>, <placement.objects.allocation.Allocation object at 0x7f023c9bcc18>, <placement.objects.allocation.Allocation object at 0x7f023c9eb7f0>, <placement.objects.allocation.Allocation object at 0x7f023c9ebdd8>, <placement.objects.allocation.Allocation object at 0x7f023c9eb630>] {{(pid=8593) _create_allocations /opt/stack/placement/placement/handlers/allocation.py:524}}

Oct 31 16:55:08.629708 ubuntu-bionic-inap-mtl01-0012620879 devstack@placement-api.service[8591]: INFO placement.requestlog [req-275af2df-bd4e-4e64-b46e-6582e8de5148 req-295f7350-2f0b-4f85-8cdf-d76801637221 service placement] 198.72.124.104 "POST /placement/allocations" status: 204 len: 0 microversion: 1.28

Oct 31 16:55:08.629708 ubuntu-bionic-inap-mtl01-0012620879 devstack@placement-api.service[8591]: [pid: 8593|app: 0|req: 304/617] 198.72.124.104 () {72 vars in 1478 bytes} [Thu Oct 31 16:52:24 2019] POST /placement/allocations => generated 0 bytes in 163883 msecs (HTTP/1.1 204) 5 headers in 199 bytes (1 switches on core 0)

I know Chris Dent has done a lot of profiling on placement recently but I'm not sure if much has been done around profiling the POST /allocations call to move allocations from one consumer to another.

That also looks like some kind of anomaly because looking at other POST /allocations calls in that log, many take only 50-70 milliseconds. I wonder if there is a potential lock somewhere in placement slowing things down in that path.

--
Thanks,
Matt
On Thu, Oct 31, 2019 at 04:52:39PM -0500, Matt Riedemann wrote:
On 10/31/2019 4:15 PM, Slawek Kaplonski wrote:
1. http://status.openstack.org/elastic-recheck/index.html#1763070
This has resurfaced and I'm not sure why, nor do I think we ever had a great handle on what is causing this or how to work around it so if anyone has new ideas please chip in.

I think that this is "just" some slowdown of the node on which the job is running. I noticed it too in some neutron jobs and I checked some of them. It seems that one API request is processed for a very long time. For example, in one of the fresh examples: https://13cf3dd11b8f009809dc-97cb3b32849366f5bed744685e46b266.ssl.cf5.rackcd... it was a request to nova which took a very long time:
Oct 31 16:55:08.632162 ubuntu-bionic-inap-mtl01-0012620879 devstack@n-api.service[7191]: INFO nova.api.openstack.requestlog [None req-275af2df-bd4e-4e64-b46e-6582e8de5148 tempest-ServerDiskConfigTestJSON-1598674508 tempest-ServerDiskConfigTestJSON-1598674508] 198.72.124.104 "POST /compute/v2.1/servers/d15d2033-b29b-44f7-b619-ed7ef83fe477/action" status: 500 len: 216 microversion: 2.1 time: 161.951140
That's really interesting, I hadn't noticed the MessagingTimeout in n-api linked to this. I dug into that one and it's really odd, but there is one known and one unknown issue here.
I point to this one only as an example. I saw similar timeouts on calls to neutron, and it was e.g. POST /v2.0/networks which took about 90 seconds. In the same test run all other similar POST calls took less than a second. That's why it's hard to really debug what happens there, and I think it is "some" slowdown of the VM on which the job is running. I simply don't have any better explanation for now :/
The known issue is resize/cold migrate is synchronous through API, conductor and scheduler while picking a host and then we finally return to the API caller when we RPC cast to the selected destination host. If we RPC cast from API to conductor we'd avoid that MessagingTimeout issue.
The unknown weirdness that made this fail is it took ~3 minutes to move allocations from the instance to the migration record. We hit conductor about here:
Oct 31 16:52:24.300732 ubuntu-bionic-inap-mtl01-0012620879 nova-conductor[14891]: DEBUG nova.conductor.tasks.migrate [None req-275af2df-bd4e-4e64-b46e-6582e8de5148 tempest-ServerDiskConfigTestJSON-1598674508 tempest-ServerDiskConfigTestJSON-1598674508] [instance: d15d2033-b29b-44f7-b619-ed7ef83fe477] Requesting cell 374faad6-7a6f-40c7-b2f4-9138e0dde5d5(cell1) while migrating {{(pid=16132) _restrict_request_spec_to_cell /opt/stack/nova/nova/conductor/tasks/migrate.py:172}}
And then finally get the message that we moved allocations prior to scheduling:
Oct 31 16:55:08.634009 ubuntu-bionic-inap-mtl01-0012620879 nova-conductor[14891]: DEBUG nova.conductor.tasks.migrate [None req-275af2df-bd4e-4e64-b46e-6582e8de5148 tempest-ServerDiskConfigTestJSON-1598674508 tempest-ServerDiskConfigTestJSON-1598674508] Created allocations for migration 86ccf6f2-a2b0-4ae0-9c31-8bbdac4ceda9 on eccff5f8-3175-4170-b87e-cb556865fde0 {{(pid=16132) replace_allocation_with_migration /opt/stack/nova/nova/conductor/tasks/migrate.py:86}}
After that we call the scheduler to find a host and that only takes about 1 second. That POST /allocations call to placement shouldn't be taking around 3 minutes so something crazy is going on there.
If I'm reading the placement log correctly, that POST starts here:
Oct 31 16:52:24.721346 ubuntu-bionic-inap-mtl01-0012620879 devstack@placement-api.service[8591]: DEBUG placement.requestlog [req-275af2df-bd4e-4e64-b46e-6582e8de5148 req-295f7350-2f0b-4f85-8cdf-d76801637221 None None] Starting request: 198.72.124.104 "POST /placement/allocations" {{(pid=8593) __call__ /opt/stack/placement/placement/requestlog.py:61}}
And returns here:
Oct 31 16:55:08.629708 ubuntu-bionic-inap-mtl01-0012620879 devstack@placement-api.service[8591]: DEBUG placement.handlers.allocation [req-275af2df-bd4e-4e64-b46e-6582e8de5148 req-295f7350-2f0b-4f85-8cdf-d76801637221 service placement] Successfully wrote allocations [<placement.objects.allocation.Allocation object at 0x7f023c9bcfd0>, <placement.objects.allocation.Allocation object at 0x7f023c9bc6d8>, <placement.objects.allocation.Allocation object at 0x7f023c9bcc18>, <placement.objects.allocation.Allocation object at 0x7f023c9eb7f0>, <placement.objects.allocation.Allocation object at 0x7f023c9ebdd8>, <placement.objects.allocation.Allocation object at 0x7f023c9eb630>] {{(pid=8593) _create_allocations /opt/stack/placement/placement/handlers/allocation.py:524}} Oct 31 16:55:08.629708 ubuntu-bionic-inap-mtl01-0012620879 devstack@placement-api.service[8591]: INFO placement.requestlog [req-275af2df-bd4e-4e64-b46e-6582e8de5148 req-295f7350-2f0b-4f85-8cdf-d76801637221 service placement] 198.72.124.104 "POST /placement/allocations" status: 204 len: 0 microversion: 1.28 Oct 31 16:55:08.629708 ubuntu-bionic-inap-mtl01-0012620879 devstack@placement-api.service[8591]: [pid: 8593|app: 0|req: 304/617] 198.72.124.104 () {72 vars in 1478 bytes} [Thu Oct 31 16:52:24 2019] POST /placement/allocations => generated 0 bytes in 163883 msecs (HTTP/1.1 204) 5 headers in 199 bytes (1 switches on core 0)
I know Chris Dent has done a lot of profiling on placement recently but I'm not sure if much has been done around profiling the POST /allocations call to move allocations from one consumer to another.
That also looks like some kind of anomaly because looking at that log at other POST /allocations calls, many take only 50-70 milliseconds. I wonder if there is a potential lock somewhere in placement slowing things down in that path.
--
Thanks,
Matt
--
Slawek Kaplonski
Senior software engineer
Red Hat
On Thu, 31 Oct 2019, Matt Riedemann wrote:
After that we call the scheduler to find a host and that only takes about 1 second. That POST /allocations call to placement shouldn't be taking around 3 minutes so something crazy is going on there.
Yeah, this is indeed weird and wrong. There are two obvious ways, in the placement code, that _set_allocations (which underlies all the allocation changing requests) could be slowed down:

* If the resource provider's generation has changed during the request, because something else is also writing allocations at the same time, there will be up to 10 retries, server-side.

* For each of those tries, there's (apparently) a chance of a db deadlock, because the method is guarded by a retry on deadlock. We added that a long, long time ago.

However, if the generation retry was happening, it would show up in the logs as "Retrying allocations write on resource provider". And if we were getting deadlock we ought to see "Performing DB retry for function".

There could be other non-obvious things, but more data required... (more)
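For illustration only, here's a minimal sketch of the first mechanism (this is not the actual placement code, just the optimistic-concurrency pattern described above: re-read the provider and retry the write if the generation moved underneath us):

    # Sketch of generation-based retry; the callables are stand-ins for the
    # real "read provider" and "write allocations" database operations.
    class ConcurrentUpdate(Exception):
        """Raised when the provider generation changed under the writer."""

    RETRIES = 10  # placement retries the write up to 10 times server-side

    def set_allocations_with_retry(read_provider, write_allocations, allocs):
        for _ in range(RETRIES):
            provider = read_provider()  # carries the current generation
            try:
                # The write only succeeds if provider.generation is still current.
                write_allocations(provider, allocs)
                return
            except ConcurrentUpdate:
                continue  # someone else won the race; re-read and try again
        raise ConcurrentUpdate('gave up after %d retries' % RETRIES)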
Oct 31 16:52:24.721346 ubuntu-bionic-inap-mtl01-0012620879 devstack@placement-api.service[8591]: DEBUG placement.requestlog [req-275af2df-bd4e-4e64-b46e-6582e8de5148 req-295f7350-2f0b-4f85-8cdf-d76801637221 None None] Starting request: 198.72.124.104 "POST /placement/allocations" {{(pid=8593) __call__ /opt/stack/placement/placement/requestlog.py:61}}
We start the requestlog at the very beginning of the request, so there are a few steps between here and the actual data interaction. Another possible angle of investigation here is that keystonemiddleware was being slow to validate a token for some reason.
I know Chris Dent has done a lot of profiling on placement recently but I'm not sure if much has been done around profiling the POST /allocations call to move allocations from one consumer to another.
You're right that little was done to profile allocations, mostly because initial experiments showed that it was fast and get /a_c was slow. In the fullness of time I probably would have moved on to allocations but reality intervened. What's probably most important is evaluating allocation writes to the same set of several resource providers under high concurrency. However it doesn't seem like that is what is happening in this case (otherwise I'd expect to see more evidence in the logs). Instead it's "just being slow" which could be any of:

* something in the auth process (which if other services are also being slow could be a unifying thing)
* the database server having a sad (do we keep a slow query log?)
* the vm/vms has/have a noisy neighbor

Do we know anything about cpu and io stats during the slow period?

We've (sorry, let me rephrase, I've) known for the entire 5.5 years I've been involved in OpenStack that the CI resources are way over-subscribed and controller nodes in the wild are typically way under-specified. Yes, our code should be robust in the face of that, but...

Of course, I could be totally wrong, there could be something flat out wrong in the placement code, but if that were the case I'd expect (like Matt did) to see it more often.

--
Chris Dent ٩◔̯◔۶ https://anticdent.org/
freenode: cdent
On 11/1/2019 5:34 AM, Chris Dent wrote:
Instead it's "just being slow" which could be any of:
* something in the auth process (which if other services are also being slow could be a unifying thing)
* the database server having a sad (do we keep a slow query log?)
Not by default but it can be enabled in devstack, like [1] but the resulting log file is so big I can't open it in my browser. So hitting this kind of issue with that enabled is going to be like finding a needle in a haystack I think.
* the vm/vms has/have a noisy neighbor
This is my guess as to the culprit. Like Slawek said elsewhere, he's seen this in other APIs that are otherwise really fast.
Do we know anything about cpu and io stats during the slow period?
We have peakmemtracker [2] which shows a jump during the slow period noticed:

Oct 31 16:51:43.132081 ubuntu-bionic-inap-mtl01-0012620879 memory_tracker.sh[26186]: [iscsid (pid:18546)]=34352KB; [dmeventd (pid:18039)]=119732KB; [ovs-vswitchd (pid:3626)]=691960KB
Oct 31 16:51:43.133012 ubuntu-bionic-inap-mtl01-0012620879 memory_tracker.sh[26186]: ]]]
Oct 31 16:55:23.220099 ubuntu-bionic-inap-mtl01-0012620879 memory_tracker.sh[26186]: [[[
Oct 31 16:55:23.221255 ubuntu-bionic-inap-mtl01-0012620879 memory_tracker.sh[26186]: Thu Oct 31 16:55:23 UTC 2019

I don't know what that means though.

We also have dstat [3] (which I find hard as hell to read - do people throw that into a nice graphical tool to massage that data for inspection?) which shows a jump as well:

Oct 31 16:53:31.515005 ubuntu-bionic-inap-mtl01-0012620879 dstat.sh[25748]: 31-10 16:53:31| 19 7 12 61 0|5806M 509M 350M 1194M| 612B 0 | 0 0 | 0 0 |1510 2087 |17.7 9.92 5.82|3.0 13 2.0| 0 0 |qemu-system-x86 24768 13% 0 0 |python2 25797 99k 451B 0%|mysqld 464M| 20M 8172M| 37 513 0 4 12
Oct 31 16:55:08.604634 ubuntu-bionic-inap-mtl01-0012620879 dstat.sh[25748]: 31-10 16:53:32| 20 7 12 61 0|5806M 509M 350M 1194M|1052B 0 | 0 0 | 0 0 |1495 2076 |17.7 9.92 5.82|4.0 13 0| 0 0 |qemu-system-x86 24768 13% 0 0 |python2 25797 99k 446B0.1%|mysqld 464M| 20M 8172M| 37 513 0 4 12

Looks like around the time of slowness mysqld is the top consuming process which is probably not surprising.
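As for massaging the dstat data, here's a rough sketch of pulling the dstat CSV into pandas to flag intervals with no disk I/O; the skiprows value and the 'read'/'writ' column names are guesses that depend on the dstat version and options the job uses, so treat it as a starting point only:

    # Hypothetical dstat CSV analysis: count samples where disk I/O is zero.
    import pandas as pd

    def find_io_stalls(csv_path, skiprows=5):
        # dstat CSVs start with a few metadata rows before the column header;
        # adjust skiprows (and the column names) to match the actual file.
        df = pd.read_csv(csv_path, skiprows=skiprows)
        stalled = df[(df['read'] == 0) & (df['writ'] == 0)]
        print('%d of %d samples show zero disk I/O' % (len(stalled), len(df)))
        return stalled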
We've (sorry, let me rephrase, I've) known for the entire 5.5 years I've been involved in OpenStack that the CI resources are way over-subscribed and controller nodes in the wild are typically way under-specified. Yes, our code should be robust in the face of that, but...
Looking at logstash this mostly hits on OVH and INAP nodes. Question to infra: do we know if those are more oversubscribed than some others like RAX or VEXXHOST nodes?

[1] https://review.opendev.org/#/c/691995/
[2] https://13cf3dd11b8f009809dc-97cb3b32849366f5bed744685e46b266.ssl.cf5.rackcd...
[3] https://13cf3dd11b8f009809dc-97cb3b32849366f5bed744685e46b266.ssl.cf5.rackcd...

--
Thanks,
Matt
On 11/1/2019 8:30 AM, Matt Riedemann wrote:
Looks like around the time of slowness mysqld is the top consuming process which is probably not surprising.
Oh I also see this in the mysql error log [1] around the time of the slow period, right after things pick back up:

2019-10-31T16:55:08.603773Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 161281ms. The settings might not be optimal. (flushed=201 and evicted=0, during the time.)

So obviously something is going very wrong with mysql (or the node in general) during that time.

[1] https://13cf3dd11b8f009809dc-97cb3b32849366f5bed744685e46b266.ssl.cf5.rackcd...

--
Thanks,
Matt
On Fri, Nov 1, 2019, at 6:30 AM, Matt Riedemann wrote:
On 11/1/2019 5:34 AM, Chris Dent wrote:
Instead it's "just being slow" which could be any of:
* something in the auth process (which if other services are also being slow could be a unifying thing)
* the database server having a sad (do we keep a slow query log?)
Not by default but it can be enabled in devstack, like [1] but the resulting log file is so big I can't open it in my browser. So hitting this kind of issue with that enabled is going to be like finding a needle in a haystack I think.
* the vm/vms has/have a noisy neighbor
This is my guess as to the culprit. Like Slawek said elsewhere, he's seen this in other APIs that are otherwise really fast.
Do we know anything about cpu and io stats during the slow period?
We have peakmemtracker [2] which shows a jump during the slow period noticed:
Oct 31 16:51:43.132081 ubuntu-bionic-inap-mtl01-0012620879 memory_tracker.sh[26186]: [iscsid (pid:18546)]=34352KB; [dmeventd (pid:18039)]=119732KB; [ovs-vswitchd (pid:3626)]=691960KB Oct 31 16:51:43.133012 ubuntu-bionic-inap-mtl01-0012620879 memory_tracker.sh[26186]: ]]] Oct 31 16:55:23.220099 ubuntu-bionic-inap-mtl01-0012620879 memory_tracker.sh[26186]: [[[ Oct 31 16:55:23.221255 ubuntu-bionic-inap-mtl01-0012620879 memory_tracker.sh[26186]: Thu Oct 31 16:55:23 UTC 2019
I don't know what that means though.
We also have dstat [3] (which I find hard as hell to read - do people throw that into a nice graphical tool to massage that data for inspection?) which shows a jump as well:
I put the dstat csv files in https://lamada.eu/dstat-graph/ and that works reasonably well.
Oct 31 16:53:31.515005 ubuntu-bionic-inap-mtl01-0012620879 dstat.sh[25748]: 31-10 16:53:31| 19 7 12 61 0|5806M 509M 350M 1194M| 612B 0 | 0 0 | 0 0 |1510 2087 |17.7 9.92 5.82|3.0 13 2.0| 0 0 |qemu-system-x86 24768 13% 0 0 |python2 25797 99k 451B 0%|mysqld 464M| 20M 8172M| 37 513 0 4 12 Oct 31 16:55:08.604634 ubuntu-bionic-inap-mtl01-0012620879 dstat.sh[25748]: 31-10 16:53:32| 20 7 12 61 0|5806M 509M 350M 1194M|1052B 0 | 0 0 | 0 0 |1495 2076 |17.7 9.92 5.82|4.0 13 0| 0 0 |qemu-system-x86 24768 13% 0 0 |python2 25797 99k 446B0.1%|mysqld 464M| 20M 8172M| 37 513 0 4 12
Looks like around the time of slowness mysqld is the top consuming process which is probably not surprising.
We've (sorry, let me rephrase, I've) known for the entire 5.5 years I've been involved in OpenStack that the CI resources are way over-subscribed and controller nodes in the wild are typically way under-specified. Yes, our code should be robust in the face of that, but...
Looking at logstash this mostly hits on OVH and INAP nodes. Question to infra: do we know if those are more oversubscribed than some others like RAX or VEXXHOST nodes?
I believe both OVH and INAP give us dedicated hypervisors. This means that we will end up being our own noisy neighbors there. I don't know what level we oversubscribe at but amorin (OVH) and benj_ (INAP) can probably provide more info.

INAP was also recently turned back on. It had been offline for redeployment and that was completed and added back to the pool. Possible that more than just the openstack version has changed?

OVH controls the disk IOPs that we get pretty aggressively as well. Possible it is an IO thing?
[1] https://review.opendev.org/#/c/691995/
[2] https://13cf3dd11b8f009809dc-97cb3b32849366f5bed744685e46b266.ssl.cf5.rackcd...
[3] https://13cf3dd11b8f009809dc-97cb3b32849366f5bed744685e46b266.ssl.cf5.rackcd...
On 11/1/2019 9:55 AM, Clark Boylan wrote:
OVH controls the disk IOPs that we get pretty aggressively as well. Possible it is an IO thing?
Yeah, so looking at the dstat output in that graph (thanks for pointing out that site, really nice) we basically have 0 I/O from 16:53 to 16:55, so uh, that's probably not good.

--
Thanks,
Matt
On Fri, 1 Nov 2019, Matt Riedemann wrote:
On 11/1/2019 9:55 AM, Clark Boylan wrote:
OVH controls the disk IOPs that we get pretty aggressively as well. Possible it is an IO thing?
Yeah, so looking at the dstat output in that graph (thanks for pointing out that site, really nice) we basically have 0 I/O from 16:53 to 16:55, so uh, that's probably not good.
What happens in a case like this? Is there an official procedure for "hey, can you give us more IO?" or (if that's not an option) "can you give us less CPU?". Is that something that is automated, or is it something that is monitored and alarming? "INAP ran out of IO X times in the last N hours, light the beacons!"

--
Chris Dent ٩◔̯◔۶ https://anticdent.org/
freenode: cdent
On Mon, Nov 4, 2019, at 7:37 PM, Chris Dent wrote:
On Fri, 1 Nov 2019, Matt Riedemann wrote:
On 11/1/2019 9:55 AM, Clark Boylan wrote:
OVH controls the disk IOPs that we get pretty aggressively as well. Possible it is an IO thing?
Yeah, so looking at the dstat output in that graph (thanks for pointing out that site, really nice) we basically have 0 I/O from 16:53 to 16:55, so uh, that's probably not good.
What happens in a case like this? Is there an official procedure for "hey, can you give us more IO?" or (if that's not an option) "can you give us less CPU?". Is that something that is automated, or is it something that is monitored and alarming? "INAP ran out of IO X times in the last N hours, light the beacons!"
Typically we try to work with the clouds to properly root cause the issue. Then from there we can figure out what the best fix may be. They are running our software after all and there is a good chance the problems are in openstack.

I'm in Shanghai at the moment but if others want to reach out feel free. benj_ and mgagne are at inap and amorin has been helpful at ovh. The test node logs include a hostid in them somewhere which can be used to identify hypervisors if necessary.

Clark
On 11/4/2019 6:58 PM, Clark Boylan wrote:
Typically we try to work with the clouds to properly root cause the issue. Then from there we can figure out what the best fix may be. They are running our software after all and there is a good chance the problems are in openstack.
I'm in Shanghai at the moment but if others want to reach out feel free. benj_ and mgagne are at inap and amorin has been helpful at ovh. The test node logs include a hostid in them somewhere which can be used to identify hypervisors if necessary.
I noticed this today [1]. That doesn't always result in failed jobs but I correlated it to a timeout failure in a nova functional job [2] and those normally don't have these types of problems. Note the correlation to when it spikes, midnight and noon it looks like. The dip on 11/2 and 11/3 was the weekend. And it's mostly OVH nodes. So they must have some kind of cron or something that hits at those times?

Anecdotally, I'll also note that it seems like the gate is much more stable this week while the summit is happening. We're actually able to merge some changes in nova which is kind of amazing given the last month or so of rechecks we've had to do.

[1] http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Function%20'nova.servicegroup.drivers.db.DbDriver._report_state'%20run%20outlasted%20interval%20by%5C%22&from=7d
[2] https://zuul.opendev.org/t/openstack/build/63001bbd58c244cea70c995f1ebf61fb/...

--
Thanks,
Matt
On 11/4/19 16:58, Clark Boylan wrote:
On Mon, Nov 4, 2019, at 7:37 PM, Chris Dent wrote:
On Fri, 1 Nov 2019, Matt Riedemann wrote:
On 11/1/2019 9:55 AM, Clark Boylan wrote:
OVH controls the disk IOPs that we get pretty aggressively as well. Possible it is an IO thing?
Yeah, so looking at the dstat output in that graph (thanks for pointing out that site, really nice) we basically have 0 I/O from 16:53 to 16:55, so uh, that's probably not good.
What happens in a case like this? Is there an official procedure for "hey, can you give us more IO?" or (if that's not an option) "can you give us less CPU?". Is that something that is automated, or is it something that is monitored and alarming? "INAP ran out of IO X times in the last N hours, light the beacons!"
Typically we try to work with the clouds to properly root cause the issue. Then from there we can figure out what the best fix may be. They are running our software after all and there is a good chance the problems are in openstack.
I'm in Shanghai at the moment but if others want to reach out feel free. benj_ and mgagne are at inap and amorin has been helpful at ovh. The test node logs include a hostid in them somewhere which can be used to identify hypervisors if necessary.
Just wanted to throw this out there to the ML in case anyone has any thoughts: since we know that I/O is overloaded in these cases, would it make any sense to have infra/tempest use a flavor which sets disk I/O quotas [1] to help prevent any one process from getting starved out?

I agree that properly troubleshooting the root cause is necessary, and maybe adding limits would not be desired out of concern that it could potentially hide issues.

-melanie

[1] https://docs.openstack.org/nova/latest/user/flavors.html#extra-specs-disk-tu...
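For reference, a hedged sketch of what those disk-tuning extra specs from [1] might look like if applied with python-novaclient; the values and flavor name below are purely illustrative, not tuned recommendations:

    # Hypothetical example: cap disk I/O on a tempest flavor via extra specs.
    DISK_QUOTA_SPECS = {
        'quota:disk_read_iops_sec': '300',
        'quota:disk_write_iops_sec': '300',
        'quota:disk_total_bytes_sec': str(50 * 1024 * 1024),  # ~50 MB/s
    }

    def cap_flavor_io(nova, flavor_name='m1.nano'):
        # 'nova' is an authenticated novaclient.v2.client.Client instance.
        flavor = nova.flavors.find(name=flavor_name)
        flavor.set_keys(DISK_QUOTA_SPECS)

The same properties could of course be set with "openstack flavor set --property" in devstack if that's easier to wire in.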
On 11/1/2019 10:22 AM, Matt Riedemann wrote:
On 11/1/2019 9:55 AM, Clark Boylan wrote:
OVH controls the disk IOPs that we get pretty aggressively as well. Possible it is an IO thing?
Yeah, so looking at the dstat output in that graph (thanks for pointing out that site, really nice) we basically have 0 I/O from 16:53 to 16:55, so uh, that's probably not good.
Dredging this back up after discussing with Clark in IRC today. I've posted a patch to disable c-bak and etcd3 in grenade jobs [1]. The hope is this will alleviate some pressure on OVH nodes where iops are restricted. We aren't even effectively doing anything with c-bak and etcd3 in grenade jobs so this should not be controversial.

[1] https://review.opendev.org/#/c/700214/

--
Thanks,
Matt
On 12/20/2019 11:47 AM, Matt Riedemann wrote:
Dredging this back up after discussing with Clark in IRC today. I've posted a patch to disable c-bak and etcd3 in grenade jobs [1]. The hope is this will alleviate some pressure on OVH nodes where iops are restricted. We aren't even effectively doing anything with c-bak and etcd3 in grenade jobs so this should not be controversial.
Actually doing it in the grenade repo doesn't work because of how devstack-gate generates the enabled services list for legacy jobs, so I've got a series in devstack-gate that delivers the goods:

https://review.opendev.org/#/c/700233/

--
Thanks,
Matt
On 11/1/2019 9:55 AM, Clark Boylan wrote:
INAP was also recently turned back on. It had been offline for redeployment and that was completed and added back to the pool. Possible that more than just the openstack version has changed?
OVH controls the disk IOPs that we get pretty aggressively as well. Possible it is an IO thing?
Related to slow nodes, I noticed this failed recently: it's a synchronous RPC call from nova-api to nova-compute that timed out after 60 seconds [1]. Looking at MessagingTimeout errors in the nova-api logs shows it's mostly on INAP and OVH nodes [2], so there seems to be a pattern emerging with those being slow nodes causing issues.

There are ways we could work around this a bit on the nova side [3] but I'm not sure how much we want to make parts of nova super resilient to very slow nodes when real life operations would probably need to know about this kind of thing to scale up/out their control plane.

[1] https://zuul.opendev.org/t/openstack/build/ef0196fe84804b44ac106d011c8c29ea/...
[2] http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22MessagingTimeout%5C%22%20AND%20tags%3A%5C%22screen-n-api.txt%5C%22&from=7d
[3] https://review.opendev.org/#/c/692550/

--
Thanks,
Matt
On 10/31/2019 4:15 PM, Slawek Kaplonski wrote:
3. CirrOS guest SSH issues
There are several (some might be duplicates):
http://status.openstack.org/elastic-recheck/index.html#1848078

I think this one is the same as what we reported in https://bugs.launchpad.net/neutron/+bug/1850557. Basically we noticed issues with dhcp after resize/migration/shelve of an instance, but I didn't have time to investigate it yet.
Hmm, https://review.opendev.org/#/c/670591/ is new to devstack in Train and was backported to stable/stein. I wonder if that is too aggressive and is causing issues with operations where the guest is stopped and started. For resize/migrate/shelve/unshelve the guest is destroyed on one host and re-spawned on another, so I would think that having a graceful shutdown for the guest wouldn't matter in those cases, unless it has to do with leaving the guest "dirty" somehow before transferring the root disk / creating a snapshot (in the case of shelve). Maybe we should bump that back up to 10 seconds?

--
Thanks,
Matt
Hi,
On 31 Oct 2019, at 23:03, Matt Riedemann <mriedemos@gmail.com> wrote:
On 10/31/2019 4:15 PM, Slawek Kaplonski wrote:
3. CirrOS guest SSH issues
There are several (some might be duplicates):
http://status.openstack.org/elastic-recheck/index.html#1848078

I think this one is the same as what we reported in https://bugs.launchpad.net/neutron/+bug/1850557. Basically we noticed issues with dhcp after resize/migration/shelve of an instance, but I didn't have time to investigate it yet.
Hmm, https://review.opendev.org/#/c/670591/ is new to devstack in Train and was backported to stable/stein. I wonder if that is too aggressive and is causing issues with operations where the guest is stopped and started, though for resize/migrate/shelve/unshelve the guest is destroyed on one host and re-spawned on another, so I would think that having a graceful shutdown for the guest wouldn't matter in those cases, unless it has to do with leaving the guest "dirty" somehow before transferring the root disk / creating a snapshot (in the case of shelve). Maybe we should bump that back up to 10 seconds?
I finally spent some time investigating this issue. I was able to reproduce it locally and I found that the problem is in the openvswitch firewall. It's all described in [1]. I just pushed a patch which should fix this. It's in [2].

[1] https://bugs.launchpad.net/neutron/+bug/1850557
[2] https://review.opendev.org/696794
--
Thanks,
Matt
--
Slawek Kaplonski
Senior software engineer
Red Hat
Hi,
On 1 Dec 2019, at 11:44, Slawek Kaplonski <skaplons@redhat.com> wrote:
Hi,
On 31 Oct 2019, at 23:03, Matt Riedemann <mriedemos@gmail.com> wrote:
On 10/31/2019 4:15 PM, Slawek Kaplonski wrote:
3. CirrOS guest SSH issues
There are several (some might be duplicates):
http://status.openstack.org/elastic-recheck/index.html#1848078

I think this one is the same as what we reported in https://bugs.launchpad.net/neutron/+bug/1850557. Basically we noticed issues with dhcp after resize/migration/shelve of an instance, but I didn't have time to investigate it yet.
Hmm, https://review.opendev.org/#/c/670591/ is new to devstack in Train and was backported to stable/stein. I wonder if that is too aggressive and is causing issues with operations where the guest is stopped and started, though for resize/migrate/shelve/unshelve the guest is destroyed on one host and re-spawned on another, so I would think that having a graceful shutdown for the guest wouldn't matter in those cases, unless it has to do with leaving the guest "dirty" somehow before transferring the root disk / creating a snapshot (in the case of shelve). Maybe we should bump that back up to 10 seconds?
I finally spent some time investigating this issue. I was able to reproduce it locally and I found that the problem is in the openvswitch firewall. It's all described in [1]. I just pushed a patch which should fix this. It's in [2].
My patch is now merged so I hope we will not see this issue with SSH in resize/shelve tests anymore.
[1] https://bugs.launchpad.net/neutron/+bug/1850557
[2] https://review.opendev.org/696794
--
Thanks,
Matt
--
Slawek Kaplonski
Senior software engineer
Red Hat
--
Slawek Kaplonski
Senior software engineer
Red Hat
On 12/3/2019 2:13 AM, Slawek Kaplonski wrote:
I finally spent some time investigating this issue. I was able to reproduce it locally and I found that the problem is in the openvswitch firewall. It's all described in [1]. I just pushed a patch which should fix this. It's in [2].

My patch is now merged so I hope we will not see this issue with SSH in resize/shelve tests anymore.
Awesome. Thanks as always for digging into gate issues, Slawek!

--
Thanks,
Matt
On Thu, Oct 31, 2019 at 3:29 PM Matt Riedemann <mriedemos@gmail.com> wrote:
Things are great! Surprise! I just wanted to let everyone know. Later! . . . . . Now that you've been tricked, on Halloween no less, I'm here to tell you that things suck right now. This is your periodic digest of issues. Grab some fun-sized candy bars and read on.
I think right now we have three major issues.
1. http://status.openstack.org/elastic-recheck/index.html#1763070
This has resurfaced and I'm not sure why, nor do I think we ever had a great handle on what is causing this or how to work around it so if anyone has new ideas please chip in.
the only theory i have on that is, in the absence of any other indication, we could be running out of entropy in the vm which can lead to https connections failing. we might want to consider enabling the virtio random number generator in the gate vms.

https://github.com/openstack/glance/blob/master/etc/metadefs/compute-libvirt...

this needs to be enabled in the flavours too, but low entropy can cause ssl connections to fail:

https://major.io/2007/07/01/check-available-entropy-in-linux/
2. http://status.openstack.org/elastic-recheck/index.html#1844929
I've done some digging into this one and my notes are in the bug report. It mostly affects grenade jobs but is not entirely restricted to grenade jobs. It's also mostly on OVH and FortNebula nodes but not totally.
From looking at the mysql logs in the grenade jobs, mysqld is (re)started three times. I think (1) for initial package install, (2) for stacking devstack on the old side, and (3) for stacking devstack on the new side. After the last restart, there are a lot of aborted connection messages in the mysql error log. It's around then that grenade is running post-upgrade smoke tests to create a server and the nova-scheduler times out communicating with the nova_cell1 database.
I have a few patches up to grenade/devstack [1] to try some things and get more mysql logs but so far they aren't really helpful. We need someone with some more mysql debugging experience to help here, maybe zzzeek or mordred?
3. CirrOS guest SSH issues
There are several (some might be duplicates):
http://status.openstack.org/elastic-recheck/index.html#1848078
http://status.openstack.org/elastic-recheck/index.html#1808010 (most hits)
http://status.openstack.org/elastic-recheck/index.html#1463631
http://status.openstack.org/elastic-recheck/index.html#1849857
http://status.openstack.org/elastic-recheck/index.html#1737039
http://status.openstack.org/elastic-recheck/index.html#1840355
http://status.openstack.org/elastic-recheck/index.html#1843610
A few notes here.
a) We're still using CirrOS 0.4.0 since Stein:
https://review.opendev.org/#/c/521825/
And that image was published nearly 2 years ago and there are no newer versions on the CirrOS download site so we can't try a newer image to see if that fixes things.
b) Some of the issues above are related to running out of disk in the guest. I'm not sure what is causing that, but I have posted a devstack patch that is related:
https://review.opendev.org/#/c/690991
tl;dr before Stein the tempest flavors we used had disk=0 so nova would fit the root disk to the size of the image. Since Stein the tempest flavors specify root disk size (1GiB for the CirrOS images). My patch pads an extra 1GiB to the root disk on the tempest flavors. One side effect of that is the volumes tempest creates will go from 1GiB to 2GiB which could be a problem if a lot of tempest volume tests run at the same time, though we do have a volume group size of 24GB in gate runs so I think we're OK for now. I'm not sure my patch would help, but it's an idea.
As for the other key generation and dhcp lease failures, I don't know what to do about those. We need more eyes on these issues to generate
so the ssh key generation issue may also be down to entropy. i have not looked at those specific failures but i did note in some failed tests in the past that we printed the kernel entropy in the guest and it was like 36 or some other very low number (it should be in the hundreds). if we have low entropy, key generation will take a long time.

https://wiki.debian.org/BoottimeEntropyStarvation
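for reference, this is roughly the guest-side check i mean (just a sketch; it reads the standard procfs counter, and the "healthy" threshold below is a rule of thumb, not an official number):

    # Hypothetical guest-side check: print the kernel's available entropy.
    def available_entropy():
        with open('/proc/sys/kernel/random/entropy_avail') as f:
            return int(f.read().strip())

    if __name__ == '__main__':
        value = available_entropy()
        print('entropy_avail:', value)
        if value < 200:  # rule-of-thumb threshold, not an official limit
            print('warning: entropy pool looks starved')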
some ideas or see if we're doing something wrong in our tests, e.g. generating too much data for the config drive? Not using config drive in some cases? Metadata API server is too slow (note we cache the metadata since [2])? Other ideas on injecting logs for debug?
[1] https://review.opendev.org/#/q/topic:bug/1844929+status:open
[2] https://review.opendev.org/#/q/I9082be077b59acd3a39910fa64e29147cb5c2dd7
--
Thanks,
Matt
On 2019-12-03 13:53:04 +0000 (+0000), Sean Mooney wrote: [...]
we might want to consider enabling the virtio random number generator in the gate vms. [...]
That should be safe to do regardless; we already pre-install haveged in them so they have a limitless supply of entropy to provide to the guests Tempest boots.

--
Jeremy Stanley
Hi,

On Tue, Dec 03, 2019 at 01:53:04PM +0000, Sean Mooney wrote:
On Thu, Oct 31, 2019 at 3:29 PM Matt Riedemann <mriedemos@gmail.com> wrote:
Things are great! Surprise! I just wanted to let everyone know. Later! . . . . . Now that you've been tricked, on Halloween no less, I'm here to tell you that things suck right now. This is your periodic digest of issues. Grab some fun-sized candy bars and read on.
I think right now we have three major issues.
1. http://status.openstack.org/elastic-recheck/index.html#1763070
This has resurfaced and I'm not sure why, nor do I think we ever had a great handle on what is causing this or how to work around it so if anyone has new ideas please chip in.
the only theory i have on that is, in the absence of any other indication, we could be running out of entropy in the vm which can lead to https connections failing. we might want to consider enabling the virtio random number generator in the gate vms.

https://github.com/openstack/glance/blob/master/etc/metadefs/compute-libvirt...

this needs to be enabled in the flavours too, but low entropy can cause ssl connections to fail:

https://major.io/2007/07/01/check-available-entropy-in-linux/
You may be right that lack of entropy is the culprit in some issues. E.g. here: https://zuul.opendev.org/t/openstack/build/edcf837457a741abb752693723319b15/... it took 120 seconds to initialize the random number generator on the guest VM, and because of that the network wasn't configured properly and the test failed.
2. http://status.openstack.org/elastic-recheck/index.html#1844929
I've done some digging into this one and my notes are in the bug report. It mostly affects grenade jobs but is not entirely restricted to grenade jobs. It's also mostly on OVH and FortNebula nodes but not totally.
From looking at the mysql logs in the grenade jobs, mysqld is (re)started three times. I think (1) for initial package install, (2) for stacking devstack on the old side, and (3) for stacking devstack on the new side. After the last restart, there are a lot of aborted connection messages in the mysql error log. It's around then that grenade is running post-upgrade smoke tests to create a server and the nova-scheduler times out communicating with the nova_cell1 database.
I have a few patches up to grenade/devstack [1] to try some things and get more mysql logs but so far they aren't really helpful. We need someone with some more mysql debugging experience to help here, maybe zzzeek or mordred?
3. CirrOS guest SSH issues
There are several (some might be duplicates):
http://status.openstack.org/elastic-recheck/index.html#1848078
http://status.openstack.org/elastic-recheck/index.html#1808010 (most hits)
http://status.openstack.org/elastic-recheck/index.html#1463631
http://status.openstack.org/elastic-recheck/index.html#1849857
http://status.openstack.org/elastic-recheck/index.html#1737039
http://status.openstack.org/elastic-recheck/index.html#1840355
http://status.openstack.org/elastic-recheck/index.html#1843610
A few notes here.
a) We're still using CirrOS 0.4.0 since Stein:
https://review.opendev.org/#/c/521825/
And that image was published nearly 2 years ago and there are no newer versions on the CirrOS download site so we can't try a newer image to see if that fixes things.
b) Some of the issues above are related to running out of disk in the guest. I'm not sure what is causing that, but I have posted a devstack patch that is related:
https://review.opendev.org/#/c/690991
tl;dr before Stein the tempest flavors we used had disk=0 so nova would fit the root disk to the size of the image. Since Stein the tempest flavors specify root disk size (1GiB for the CirrOS images). My patch pads an extra 1GiB to the root disk on the tempest flavors. One side effect of that is the volumes tempest creates will go from 1GiB to 2GiB which could be a problem if a lot of tempest volume tests run at the same time, though we do have a volume group size of 24GB in gate runs so I think we're OK for now. I'm not sure my patch would help, but it's an idea.
As for the other key generation and dhcp lease failures, I don't know what to do about those. We need more eyes on these issues to generate
so the ssh key generation issue may also be down to entropy. i have not looked at those specific failures but i did note in some failed tests in the past that we printed the kernel entropy in the guest and it was like 36 or some other very low number (it should be in the hundreds). if we have low entropy, key generation will take a long time.

https://wiki.debian.org/BoottimeEntropyStarvation
some ideas or see if we're doing something wrong in our tests, e.g. generating too much data for the config drive? Not using config drive in some cases? Metadata API server is too slow (note we cache the metadata since [2])? Other ideas on injecting logs for debug?
[1] https://review.opendev.org/#/q/topic:bug/1844929+status:open
[2] https://review.opendev.org/#/q/I9082be077b59acd3a39910fa64e29147cb5c2dd7
--
Thanks,
Matt
--
Slawek Kaplonski
Senior software engineer
Red Hat
On Tue, Dec 3, 2019, at 7:12 AM, Slawek Kaplonski wrote:
Hi,
On Tue, Dec 03, 2019 at 01:53:04PM +0000, Sean Mooney wrote:
On Thu, Oct 31, 2019 at 3:29 PM Matt Riedemann <mriedemos@gmail.com> wrote:
Things are great! Surprise! I just wanted to let everyone know. Later! . . . . . Now that you've been tricked, on Halloween no less, I'm here to tell you that things suck right now. This is your periodic digest of issues. Grab some fun-sized candy bars and read on.
I think right now we have three major issues.
1. http://status.openstack.org/elastic-recheck/index.html#1763070
This has resurfaced and I'm not sure why, nor do I think we ever had a great handle on what is causing this or how to work around it so if anyone has new ideas please chip in.
the only theory i have on that is, in the absence of any other indication, we could be running out of entropy in the vm which can lead to https connections failing. we might want to consider enabling the virtio random number generator in the gate vms.

https://github.com/openstack/glance/blob/master/etc/metadefs/compute-libvirt...

this needs to be enabled in the flavours too, but low entropy can cause ssl connections to fail:

https://major.io/2007/07/01/check-available-entropy-in-linux/
You may be right that lack of entropy is the culprit in some issues. E.g. here: https://zuul.opendev.org/t/openstack/build/edcf837457a741abb752693723319b15/... it took 120 seconds to initialize the random number generator on the guest VM, and because of that the network wasn't configured properly and the test failed.
https://review.opendev.org/#/c/697194/ is a first pass attempt at implementing the virtio rng device. How do we verify it is actually working?

Clark
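One rough way to check from inside a booted guest (only a sketch; these are the standard sysfs/procfs paths, and the guest kernel needs the virtio_rng driver for the device to show up at all):

    # Hypothetical guest-side check: is a virtio RNG present and selected?
    def read(path):
        try:
            with open(path) as f:
                return f.read().strip()
        except OSError:
            return '<missing>'

    print('rng_available:', read('/sys/devices/virtual/misc/hw_random/rng_available'))
    print('rng_current:  ', read('/sys/devices/virtual/misc/hw_random/rng_current'))

If the flavor/image properties took effect, rng_current should list a virtio entry, and on the host side the libvirt domain XML should also contain an <rng model='virtio'> device. Combining that with the entropy_avail check sketched earlier in the thread should show whether the pool actually stays topped up during a tempest run.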