-----Original Message-----
From: Matt Riedemann <mriedemos@gmail.com>
Sent: Thursday, June 6, 2019 1:33 PM
To: openstack-discuss@lists.openstack.org
Subject: Re: [nova] [cyborg] Impact of moving bind to compute
On 5/23/2019 7:00 AM, Nadathur, Sundar wrote:
[....] Moving the binding from [2] to [3] reduces this overlap. I did some measurements of the time window from [2] to [3]: it was consistently between 20 and 50 milliseconds, whether I launched 1 VM at a time, 2 at a time, etc. This seems acceptable.
[2] https://github.com/openstack/nova/blob/master/nova/conductor/manager.py#L150...
[3] https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L1882
Regards,
Sundar
I'm OK with binding in the compute since that's where we trigger the callback event and want to set up something to wait for it before proceeding, like we do with port binding.
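To illustrate the "set something up to wait before proceeding" pattern, here is a minimal sketch in plain Python. It is not Nova or Cyborg code: BindWaiter, bind_and_wait, start_bind and the timeout value are all hypothetical names chosen for illustration, and the real code would presumably reuse whatever Nova already does for port binding events. The point it shows is ordering: register the waiter first, then trigger the bind, so a callback that arrives early is not missed.

```python
import threading


class BindWaiter:
    """Hypothetical helper (not a Nova class): tracks one Event per ARQ id."""

    def __init__(self):
        self._events = {}
        self._lock = threading.Lock()

    def prepare(self, key):
        # Register interest BEFORE the bind is triggered.
        with self._lock:
            return self._events.setdefault(key, threading.Event())

    def notify(self, key):
        # Called when the (assumed) "bind finished" callback arrives.
        with self._lock:
            event = self._events.get(key)
        if event is not None:
            event.set()


def bind_and_wait(waiter, arq_id, start_bind, timeout=5.0):
    """Trigger a bind and block until its callback arrives or we time out."""
    event = waiter.prepare(arq_id)   # 1. set up the waiter
    start_bind(arq_id)               # 2. kick off the asynchronous bind
    if not event.wait(timeout):      # 3. wait for the callback
        raise TimeoutError("bind of %s did not complete" % arq_id)
    return True
```

Here start_bind stands in for whatever call asks the accelerator service to bind, and notify stands in for the handler of the callback event.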
What I've talked about in detail in the spec is doing the ARQ *creation* in conductor rather than compute. I realize that doing the creation in the compute service means fewer (if any) RPC API changes to get phase 1 of this code going, but I can't imagine any RPC API changes for that would be very big (it's a new parameter to the compute service methods, or something we lump into the RequestSpec).
The bigger concern I have is that we've long talked about moving port (and at times volume) creation from the compute service to conductor because it's less expensive to manage external resources there if something fails, e.g. going over-quota creating volumes. The problem with failing late in the compute is we have to clean up other things (ports and volumes) and then reschedule, which may also fail on the next alternate host.
The ARQ creation could be done at [1], followed by the binding, before acquiring the semaphore or creating other resources. Why is that not a good option?
[1] https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L1898
Failing fast in conductor is more efficient and also helps take some of the guesswork out of which service is managing the resources (we've had countless bugs over the years about ports and volumes being leaked because we didn't clean them up properly on failure). Take a look at any of the error handling in the server create flow in the ComputeManager and you'll see what I'm talking about.
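A toy model of the difference, as a sketch only. Nothing here is Nova code: create_arqs, build_late, build_early, QuotaExceeded and the quota check are hypothetical stand-ins, and the log is just a list of strings recording what each flow had to do. It compares creating the ARQs after ports and volumes exist with creating them before anything else.

```python
class QuotaExceeded(Exception):
    """Hypothetical failure raised when ARQ creation cannot be satisfied."""


def create_arqs(requested, quota):
    # Stand-in for the external call that creates accelerator requests.
    if requested > quota:
        raise QuotaExceeded("requested %d, quota %d" % (requested, quota))
    return ["arq-%d" % i for i in range(requested)]


def build_late(requested, quota, log):
    """ARQs created late, after other external resources already exist."""
    created = []
    try:
        for res in ("port", "volume"):
            created.append(res)
            log.append("create " + res)
        arqs = create_arqs(requested, quota)
        log.append("create arqs")
        return arqs
    except QuotaExceeded:
        # Failing late: undo everything created so far, then reschedule.
        for res in reversed(created):
            log.append("delete " + res)
        log.append("reschedule")
        raise


def build_early(requested, quota, log):
    """ARQs created first, so a failure leaves nothing to clean up."""
    arqs = create_arqs(requested, quota)
    log.append("create arqs")
    for res in ("port", "volume"):
        log.append("create " + res)
    return arqs
```

With a request that exceeds the quota, the late flow's log records two creates, two deletes and a reschedule before the error surfaces; the early flow's log stays empty.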
Anyway, if we're voting I vote that ARQ creation happens in conductor and binding happens in compute.
--
Thanks,
Matt
Regards, Sundar