State of the Gate (placement?)

Matt Riedemann mriedemos at gmail.com
Thu Oct 31 21:52:39 UTC 2019


On 10/31/2019 4:15 PM, Slawek Kaplonski wrote:
>> 1. http://status.openstack.org/elastic-recheck/index.html#1763070
>>
>> This has resurfaced and I'm not sure why, nor do I think we ever had a great
>> handle on what is causing this or how to work around it, so if anyone has new
>> ideas please chip in.
> I think this is "just" some slowdown of the node the job is running on. I
> noticed it in some neutron jobs too and checked a few of them. It seems that one API
> request is being processed for a very long time. For example, in one of the fresh examples:
> https://13cf3dd11b8f009809dc-97cb3b32849366f5bed744685e46b266.ssl.cf5.rackcdn.com/692206/3/check/tempest-integrated-compute/35ecb4a/job-output.txt
> it was a request to nova that took a very long time:
> 
> Oct 31 16:55:08.632162 ubuntu-bionic-inap-mtl01-0012620879 devstack@n-api.service[7191]: INFO nova.api.openstack.requestlog [None req-275af2df-bd4e-4e64-b46e-6582e8de5148 tempest-ServerDiskConfigTestJSON-1598674508 tempest-ServerDiskConfigTestJSON-1598674508] 198.72.124.104 "POST /compute/v2.1/servers/d15d2033-b29b-44f7-b619-ed7ef83fe477/action" status: 500 len: 216 microversion: 2.1 time: 161.951140
> 
> 

That's really interesting; I hadn't noticed the MessagingTimeout in 
n-api being linked to this. I dug into that one and it's really odd, but 
there is one known issue and one unknown issue here.

The known issue is that resize/cold migrate is synchronous through the 
API, conductor and scheduler while picking a host, and we only return 
to the API caller once we RPC cast to the selected destination host. If 
we RPC cast from the API to conductor we'd avoid that MessagingTimeout issue.
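
For anyone not following why the cast matters, here is a minimal sketch 
(not nova's actual code; the method name and arguments are made up) of 
the difference between a blocking call and a fire-and-forget cast with 
oslo.messaging:

import oslo_messaging as messaging
from oslo_config import cfg

transport = messaging.get_rpc_transport(cfg.CONF)
target = messaging.Target(topic='conductor', version='1.0')
client = messaging.RPCClient(transport, target)
ctxt = {}  # request context

# call(): the API worker blocks on the reply queue until conductor (and
# the scheduler work behind it) finishes; if that exceeds
# rpc_response_timeout we get the MessagingTimeout seen in n-api.
client.call(ctxt, 'migrate_server', instance_uuid='...')

# cast(): the message is dropped on the queue and the API returns
# immediately, so slow conductor/scheduler work can't bubble a timeout
# back to the API caller.
client.cast(ctxt, 'migrate_server', instance_uuid='...')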

The unknown weirdness that made this fail is that it took ~3 minutes to 
move the allocations from the instance to the migration record. We hit 
conductor about here:

Oct 31 16:52:24.300732 ubuntu-bionic-inap-mtl01-0012620879 
nova-conductor[14891]: DEBUG nova.conductor.tasks.migrate [None 
req-275af2df-bd4e-4e64-b46e-6582e8de5148 
tempest-ServerDiskConfigTestJSON-1598674508 
tempest-ServerDiskConfigTestJSON-1598674508] [instance: 
d15d2033-b29b-44f7-b619-ed7ef83fe477] Requesting cell 
374faad6-7a6f-40c7-b2f4-9138e0dde5d5(cell1) while migrating 
{{(pid=16132) _restrict_request_spec_to_cell 
/opt/stack/nova/nova/conductor/tasks/migrate.py:172}}

And then we finally get the message that we moved the allocations prior 
to scheduling:

Oct 31 16:55:08.634009 ubuntu-bionic-inap-mtl01-0012620879 
nova-conductor[14891]: DEBUG nova.conductor.tasks.migrate [None 
req-275af2df-bd4e-4e64-b46e-6582e8de5148 
tempest-ServerDiskConfigTestJSON-1598674508 
tempest-ServerDiskConfigTestJSON-1598674508] Created allocations for 
migration 86ccf6f2-a2b0-4ae0-9c31-8bbdac4ceda9 on 
eccff5f8-3175-4170-b87e-cb556865fde0 {{(pid=16132) 
replace_allocation_with_migration 
/opt/stack/nova/nova/conductor/tasks/migrate.py:86}}

After that we call the scheduler to find a host, and that only takes 
about 1 second. The POST /allocations call to placement shouldn't be 
taking around 3 minutes, so something crazy is going on there.
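
For reference, what replace_allocation_with_migration in conductor boils 
down to on the wire is roughly the following POST /allocations body at 
microversion 1.28 (the UUIDs are the ones from the logs above; the 
resource amounts and project/user IDs here are made up): the instance 
consumer's allocations are emptied and the migration consumer takes them 
over in a single request:

payload = {
    # instance consumer: allocations are cleared
    "d15d2033-b29b-44f7-b619-ed7ef83fe477": {
        "allocations": {},
        "consumer_generation": 1,
        "project_id": "<project uuid>",
        "user_id": "<user uuid>",
    },
    # migration consumer: takes over the instance's allocations against
    # the compute node resource provider
    "86ccf6f2-a2b0-4ae0-9c31-8bbdac4ceda9": {
        "allocations": {
            "eccff5f8-3175-4170-b87e-cb556865fde0": {
                "resources": {"VCPU": 1, "MEMORY_MB": 512, "DISK_GB": 1},
            },
        },
        "consumer_generation": None,  # new consumer
        "project_id": "<project uuid>",
        "user_id": "<user uuid>",
    },
}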

If I'm reading the placement log correctly, that POST starts here:

Oct 31 16:52:24.721346 ubuntu-bionic-inap-mtl01-0012620879 
devstack@placement-api.service[8591]: DEBUG placement.requestlog 
[req-275af2df-bd4e-4e64-b46e-6582e8de5148 
req-295f7350-2f0b-4f85-8cdf-d76801637221 None None] Starting request: 
198.72.124.104 "POST /placement/allocations" {{(pid=8593) __call__ 
/opt/stack/placement/placement/requestlog.py:61}}

And returns here:

Oct 31 16:55:08.629708 ubuntu-bionic-inap-mtl01-0012620879 
devstack@placement-api.service[8591]: DEBUG 
placement.handlers.allocation [req-275af2df-bd4e-4e64-b46e-6582e8de5148 
req-295f7350-2f0b-4f85-8cdf-d76801637221 service placement] Successfully 
wrote allocations [<placement.objects.allocation.Allocation object at 
0x7f023c9bcfd0>, <placement.objects.allocation.Allocation object at 
0x7f023c9bc6d8>, <placement.objects.allocation.Allocation object at 
0x7f023c9bcc18>, <placement.objects.allocation.Allocation object at 
0x7f023c9eb7f0>, <placement.objects.allocation.Allocation object at 
0x7f023c9ebdd8>, <placement.objects.allocation.Allocation object at 
0x7f023c9eb630>] {{(pid=8593) _create_allocations 
/opt/stack/placement/placement/handlers/allocation.py:524}}
Oct 31 16:55:08.629708 ubuntu-bionic-inap-mtl01-0012620879 
devstack@placement-api.service[8591]: INFO placement.requestlog 
[req-275af2df-bd4e-4e64-b46e-6582e8de5148 
req-295f7350-2f0b-4f85-8cdf-d76801637221 service placement] 
198.72.124.104 "POST /placement/allocations" status: 204 len: 0 
microversion: 1.28
Oct 31 16:55:08.629708 ubuntu-bionic-inap-mtl01-0012620879 
devstack@placement-api.service[8591]: [pid: 8593|app: 0|req: 304/617] 
198.72.124.104 () {72 vars in 1478 bytes} [Thu Oct 31 16:52:24 2019] 
POST /placement/allocations => generated 0 bytes in 163883 msecs 
(HTTP/1.1 204) 5 headers in 199 bytes (1 switches on core 0)

I know Chris Dent has done a lot of profiling on placement recently, but 
I'm not sure if much has been done around profiling the POST 
/allocations call that moves allocations from one consumer to another.

That also looks like some kind of anomaly because, looking at the other 
POST /allocations calls in that log, many take only 50-70 milliseconds. 
I wonder if there is a potential lock somewhere in placement slowing 
things down in that path.
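
If someone wants to confirm how much of an outlier this is, something 
like this quick-and-dirty sketch (not an existing tool; the log file 
name is whatever your local copy of the placement-api log is called) 
pulls the per-request durations for POST /placement/allocations out of 
those uWSGI lines:

import re

pattern = re.compile(
    r'POST /placement/allocations => generated \d+ bytes in (\d+) msecs')

durations = []
with open('screen-placement-api.txt') as f:  # local copy of the log
    for line in f:
        m = pattern.search(line)
        if m:
            durations.append(int(m.group(1)))

if durations:
    durations.sort()
    print('count:', len(durations))
    print('median: %d msecs' % durations[len(durations) // 2])
    print('max: %d msecs' % durations[-1])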

-- 

Thanks,

Matt


