On Thu, 31 Oct 2019, Matt Riedemann wrote:
> After that we call the scheduler to find a host and that only takes about 1 second. That POST /allocations call to placement shouldn't be taking around 3 minutes so something crazy is going on there.
Yeah, this is indeed weird and wrong. There are two obvious ways, in the placement code, that _set_allocations (which underlies all the allocation-changing requests) could be slowed down:

* If the resource provider's generation has changed during the request, because something else is writing allocations to the same provider at the same time, there will be up to 10 retries, server-side.

* For each of those tries, there's (apparently) a chance of a db deadlock, because the method is guarded by a retry-on-deadlock decorator. We added that a long, long time ago.

However, if the generation retry were happening, it would show up in the logs as "Retrying allocations write on resource provider". And if we were getting deadlocks we ought to see "Performing DB retry for function". There could be other non-obvious things, but more data is required.
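Roughly, those two layers nest like the sketch below. This is illustrative only, not the actual placement source: the function and exception names and the deadlock retry count are assumptions; the 10 generation retries and the two log messages are the real parts.

```python
# Illustrative sketch of the two retry layers around allocation writes.
# Names and the deadlock retry count are assumptions, not placement code.
import functools
import logging
import random
import time

LOG = logging.getLogger(__name__)

RP_CONFLICT_RETRIES = 10   # server-side generation-conflict retries


class GenerationConflict(Exception):
    """Raised when the resource provider generation moved under us."""


class DBDeadlock(Exception):
    """Stand-in for the DB layer's deadlock exception."""


def retry_on_deadlock(func, attempts=5):
    """Stand-in for the oslo.db-style retry-on-deadlock guard."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for attempt in range(attempts):
            try:
                return func(*args, **kwargs)
            except DBDeadlock:
                if attempt == attempts - 1:
                    raise
                # This is where "Performing DB retry for function" shows up.
                LOG.debug('Performing DB retry for function %s', func.__name__)
                time.sleep(random.uniform(0, 0.2) * (attempt + 1))
    return wrapper


@retry_on_deadlock
def set_allocations(context, allocations, write_fn):
    """Write allocations, retrying when a provider generation has moved."""
    for _ in range(RP_CONFLICT_RETRIES):
        try:
            return write_fn(context, allocations)
        except GenerationConflict:
            # This is where "Retrying allocations write on resource
            # provider" shows up.
            LOG.debug('Retrying allocations write on resource provider')
    raise GenerationConflict('too many generation conflicts')
```

If either layer were looping, the corresponding debug message would appear in the placement logs, which is why the absence of those messages matters here.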
> Oct 31 16:52:24.721346 ubuntu-bionic-inap-mtl01-0012620879 devstack@placement-api.service[8591]: DEBUG placement.requestlog [req-275af2df-bd4e-4e64-b46e-6582e8de5148 req-295f7350-2f0b-4f85-8cdf-d76801637221 None None] Starting request: 198.72.124.104 "POST /placement/allocations" {{(pid=8593) __call__ /opt/stack/placement/placement/requestlog.py:61}}
We start the requestlog at the very beginning of the request, so there are a few steps between here and the actual data interaction. Another possible angle of investigation is that keystonemiddleware was being slow to validate a token for some reason.
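If we wanted to isolate that, a crude option would be a timing middleware slotted into the pipeline after keystonemiddleware; the gap between requestlog's "Starting request" timestamp and this middleware's start would then be roughly the auth overhead. This is just a sketch, not anything that exists in placement today:

```python
# A minimal WSGI middleware sketch (not placement code) that logs how long
# everything downstream of it takes. Placed after keystonemiddleware in the
# pipeline, its start time minus requestlog's "Starting request" time is
# approximately the time spent validating the token.
import logging
import time

LOG = logging.getLogger(__name__)


class TimingMiddleware:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        start = time.monotonic()
        try:
            return self.app(environ, start_response)
        finally:
            LOG.debug('%s %s took %.3fs downstream of auth',
                      environ.get('REQUEST_METHOD'),
                      environ.get('PATH_INFO'),
                      time.monotonic() - start)
```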
> I know Chris Dent has done a lot of profiling on placement recently but I'm not sure if much has been done around profiling the POST /allocations call to move allocations from one consumer to another.
You're right that little was done to profile allocations, mostly because initial experiments showed that it was fast and GET /a_c was slow. In the fullness of time I probably would have moved on to allocations, but reality intervened.

What's probably most important is evaluating allocation writes to the same set of several resource providers under high concurrency. However, it doesn't seem like that is what is happening in this case (otherwise I'd expect to see more evidence in the logs). Instead it's "just being slow", which could be any of:

* something in the auth process (which, if other services are also being slow, could be a unifying thing)

* the database server having a sad (do we keep a slow query log?)

* the vm/vms has/have a noisy neighbor

Do we know anything about cpu and io stats during the slow period? We've (sorry, let me rephrase, I've) known for the entire 5.5 years I've been involved in OpenStack that the CI resources are way over-subscribed and controller nodes in the wild are typically way under-specified. Yes, our code should be robust in the face of that, but...

Of course, I could be totally wrong; there could be something flat-out wrong in the placement code, but if that were the case I'd expect (like Matt did) to see it more often.

--
Chris Dent                   ٩◔̯◔۶           https://anticdent.org/
freenode: cdent
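For what it's worth, the concurrency check mentioned above could be as simple as the following sketch: many consumers writing allocations against one shared provider at once. The endpoint, token, and UUIDs are placeholders, and the body follows the microversion 1.28-style allocations schema.

```python
# Sketch: hammer POST /allocations with many consumers that all touch the
# same resource provider. Endpoint, token, and UUIDs are placeholders.
import concurrent.futures
import uuid

import requests

PLACEMENT = 'http://127.0.0.1/placement'      # placeholder endpoint
HEADERS = {
    'x-auth-token': 'ADMIN_TOKEN',            # placeholder token
    'openstack-api-version': 'placement 1.28',
}
RP_UUID = 'RP_UUID_HERE'                      # one shared provider
PROJECT = USER = str(uuid.uuid4())


def write_allocation(consumer):
    body = {
        consumer: {
            'allocations': {RP_UUID: {'resources': {'VCPU': 1}}},
            'project_id': PROJECT,
            'user_id': USER,
            'consumer_generation': None,      # null for a new consumer
        }
    }
    resp = requests.post(PLACEMENT + '/allocations',
                         json=body, headers=HEADERS)
    return resp.status_code, resp.elapsed.total_seconds()


consumers = [str(uuid.uuid4()) for _ in range(50)]
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
    for status, elapsed in pool.map(write_allocation, consumers):
        print(status, '%.3fs' % elapsed)
```

If generation conflicts were the problem, a run like this against a busy provider should surface the "Retrying allocations write on resource provider" messages discussed earlier.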