On Thu, Oct 31, 2019 at 04:52:39PM -0500, Matt Riedemann wrote:
On 10/31/2019 4:15 PM, Slawek Kaplonski wrote:
1. http://status.openstack.org/elastic-recheck/index.html#1763070
This has resurfaced and I'm not sure why, nor do I think we ever had a great handle on what is causing this or how to work around it, so if anyone has new ideas please chip in.

I think this is "just" some slowdown of the node the job is running on. I noticed it too in some neutron jobs and checked a few of them. It looks like a single API request gets processed for a very long time. For example, in one of the fresh examples: https://13cf3dd11b8f009809dc-97cb3b32849366f5bed744685e46b266.ssl.cf5.rackcd... it was a request to nova which took a very long time:
Oct 31 16:55:08.632162 ubuntu-bionic-inap-mtl01-0012620879devstack@n-api.service[7191]: INFO nova.api.openstack.requestlog [None req-275af2df-bd4e-4e64-b46e-6582e8de5148 tempest-ServerDiskConfigTestJSON-1598674508 tempest-ServerDiskConfigTestJSON-1598674508] 198.72.124.104 "POST /compute/v2.1/servers/d15d2033-b29b-44f7-b619-ed7ef83fe477/action" status: 500 len: 216 microversion: 2.1 time: 161.951140
That's really interesting, I hadn't noticed the MessagingTimeout in n-api linked to this. I dug into that one and it's really odd, but there is one known and one unknown issue here.
I point to this one only as an example. I saw similar timeouts on calls to neutron too, e.g. a POST /v2.0/networks which took about 90 seconds. In the same test run all other similar POST calls took less than a second. That's why it's hard to really debug what happens there, and I think it is "some" slowdown of the VM the job is running on. I simply don't have any better explanation for now :/
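For what it's worth, here is a rough sketch of how such outliers can be pulled out of a service's requestlog (the regex just keys off the requestlog format quoted above; the log file name and the 10 second threshold are arbitrary examples, not any existing tooling):

#!/usr/bin/env python3
"""Rough sketch: flag unusually slow API requests in a devstack requestlog."""
import re
import sys

# Matches e.g.: "POST /compute/v2.1/servers/<uuid>/action" ... time: 161.951140
PATTERN = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+)".*time: (?P<secs>\d+\.\d+)')

THRESHOLD = 10.0  # seconds; arbitrary cut-off for "suspiciously slow"


def slow_requests(logfile):
    with open(logfile) as log:
        for line in log:
            match = PATTERN.search(line)
            if match and float(match.group('secs')) > THRESHOLD:
                yield (float(match.group('secs')),
                       match.group('method'), match.group('path'))


if __name__ == '__main__':
    # e.g. python3 slow_requests.py <n-api log file>
    for secs, method, path in sorted(slow_requests(sys.argv[1]), reverse=True):
        print('%10.2fs %s %s' % (secs, method, path))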
The known issue is that resize/cold migrate is synchronous through the API, conductor and scheduler while picking a host, and we only return to the API caller once we RPC cast to the selected destination host. If we RPC cast from the API to conductor we'd avoid that MessagingTimeout issue.
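To make the call vs. cast difference concrete, here is a minimal oslo.messaging sketch (not nova's actual code paths; the topic, method name and arguments are just placeholders):

import oslo_messaging
from oslo_config import cfg

transport = oslo_messaging.get_rpc_transport(cfg.CONF)
target = oslo_messaging.Target(topic='conductor', version='1.0')
client = oslo_messaging.RPCClient(transport, target)


def migrate_blocking(ctxt, instance_uuid):
    # Synchronous: the API worker blocks here waiting for conductor's reply,
    # and oslo_messaging.MessagingTimeout is raised if that reply doesn't
    # arrive within rpc_response_timeout (60 seconds by default).
    return client.call(ctxt, 'migrate_server', instance_uuid=instance_uuid)


def migrate_async(ctxt, instance_uuid):
    # Asynchronous: fire-and-forget, the API returns to its caller right away
    # and any failure has to be reported some other way (instance fault,
    # notification, etc.).
    client.cast(ctxt, 'migrate_server', instance_uuid=instance_uuid)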
The unknown weirdness that made this fail is that it took ~3 minutes to move allocations from the instance to the migration record. We hit conductor about here:
Oct 31 16:52:24.300732 ubuntu-bionic-inap-mtl01-0012620879 nova-conductor[14891]: DEBUG nova.conductor.tasks.migrate [None req-275af2df-bd4e-4e64-b46e-6582e8de5148 tempest-ServerDiskConfigTestJSON-1598674508 tempest-ServerDiskConfigTestJSON-1598674508] [instance: d15d2033-b29b-44f7-b619-ed7ef83fe477] Requesting cell 374faad6-7a6f-40c7-b2f4-9138e0dde5d5(cell1) while migrating {{(pid=16132) _restrict_request_spec_to_cell /opt/stack/nova/nova/conductor/tasks/migrate.py:172}}
And then we finally get the message that the allocations were moved, just prior to scheduling:
Oct 31 16:55:08.634009 ubuntu-bionic-inap-mtl01-0012620879 nova-conductor[14891]: DEBUG nova.conductor.tasks.migrate [None req-275af2df-bd4e-4e64-b46e-6582e8de5148 tempest-ServerDiskConfigTestJSON-1598674508 tempest-ServerDiskConfigTestJSON-1598674508] Created allocations for migration 86ccf6f2-a2b0-4ae0-9c31-8bbdac4ceda9 on eccff5f8-3175-4170-b87e-cb556865fde0 {{(pid=16132) replace_allocation_with_migration /opt/stack/nova/nova/conductor/tasks/migrate.py:86}}
After that we call the scheduler to find a host and that only takes about 1 second. That POST /allocations call to placement shouldn't be taking around 3 minutes so something crazy is going on there.
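For reference, that "move" is a single POST /allocations request which writes the migration consumer and empties the instance consumer in one shot. Roughly, it looks like the sketch below (the consumer and resource provider UUIDs are taken from the logs above; the resource amounts, endpoint, token and project/user IDs are made up for illustration):

import requests

PLACEMENT = 'http://203.0.113.10/placement'  # example endpoint
HEADERS = {
    'x-auth-token': '<token>',
    'openstack-api-version': 'placement 1.28',
    'accept': 'application/json',
}

payload = {
    # The migration record picks up the instance's current allocations...
    '86ccf6f2-a2b0-4ae0-9c31-8bbdac4ceda9': {
        'allocations': {
            'eccff5f8-3175-4170-b87e-cb556865fde0': {
                'resources': {'VCPU': 1, 'MEMORY_MB': 512, 'DISK_GB': 1},
            },
        },
        'project_id': '<project>',
        'user_id': '<user>',
        'consumer_generation': None,  # new consumer
    },
    # ...and the instance's allocations are emptied in the same request.
    'd15d2033-b29b-44f7-b619-ed7ef83fe477': {
        'allocations': {},
        'project_id': '<project>',
        'user_id': '<user>',
        'consumer_generation': 1,  # made-up generation for the example
    },
}

resp = requests.post(PLACEMENT + '/allocations', json=payload, headers=HEADERS)
assert resp.status_code == 204, resp.text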
If I'm reading the placement log correctly, that POST starts here:
Oct 31 16:52:24.721346 ubuntu-bionic-inap-mtl01-0012620879 devstack@placement-api.service[8591]: DEBUG placement.requestlog [req-275af2df-bd4e-4e64-b46e-6582e8de5148 req-295f7350-2f0b-4f85-8cdf-d76801637221 None None] Starting request: 198.72.124.104 "POST /placement/allocations" {{(pid=8593) __call__ /opt/stack/placement/placement/requestlog.py:61}}
And returns here:
Oct 31 16:55:08.629708 ubuntu-bionic-inap-mtl01-0012620879 devstack@placement-api.service[8591]: DEBUG placement.handlers.allocation [req-275af2df-bd4e-4e64-b46e-6582e8de5148 req-295f7350-2f0b-4f85-8cdf-d76801637221 service placement] Successfully wrote allocations [<placement.objects.allocation.Allocation object at 0x7f023c9bcfd0>, <placement.objects.allocation.Allocation object at 0x7f023c9bc6d8>, <placement.objects.allocation.Allocation object at 0x7f023c9bcc18>, <placement.objects.allocation.Allocation object at 0x7f023c9eb7f0>, <placement.objects.allocation.Allocation object at 0x7f023c9ebdd8>, <placement.objects.allocation.Allocation object at 0x7f023c9eb630>] {{(pid=8593) _create_allocations /opt/stack/placement/placement/handlers/allocation.py:524}}
Oct 31 16:55:08.629708 ubuntu-bionic-inap-mtl01-0012620879 devstack@placement-api.service[8591]: INFO placement.requestlog [req-275af2df-bd4e-4e64-b46e-6582e8de5148 req-295f7350-2f0b-4f85-8cdf-d76801637221 service placement] 198.72.124.104 "POST /placement/allocations" status: 204 len: 0 microversion: 1.28
Oct 31 16:55:08.629708 ubuntu-bionic-inap-mtl01-0012620879 devstack@placement-api.service[8591]: [pid: 8593|app: 0|req: 304/617] 198.72.124.104 () {72 vars in 1478 bytes} [Thu Oct 31 16:52:24 2019] POST /placement/allocations => generated 0 bytes in 163883 msecs (HTTP/1.1 204) 5 headers in 199 bytes (1 switches on core 0)
I know Chris Dent has done a lot of profiling on placement recently but I'm not sure if much has been done around profiling the POST /allocations call to move allocations from one consumer to another.
That also looks like some kind of anomaly, because looking at the other POST /allocations calls in that log, many take only 50-70 milliseconds. I wonder if there is a potential lock somewhere in placement slowing things down in that path.
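As a quick sanity check on that, here is a rough sketch that pulls the timing of every POST /placement/allocations out of the placement uwsgi log (it keys off the "generated ... in N msecs" line format quoted above; the log file argument is just an example):

import re
import sys

# e.g. "POST /placement/allocations => generated 0 bytes in 163883 msecs"
UWSGI_LINE = re.compile(r'POST /placement/allocations => generated .* in (\d+) msecs')


def allocation_timings(logfile):
    with open(logfile) as log:
        for line in log:
            match = UWSGI_LINE.search(line)
            if match:
                yield int(match.group(1))


if __name__ == '__main__':
    timings = sorted(allocation_timings(sys.argv[1]))
    if not timings:
        sys.exit('no POST /placement/allocations entries found')
    print('count=%d min=%dms median=%dms max=%dms' % (
        len(timings), timings[0], timings[len(timings) // 2], timings[-1]))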
--
Thanks,
Matt
--
Slawek Kaplonski
Senior software engineer
Red Hat