[nova] [cyborg] Impact of moving bind to compute

Nadathur, Sundar sundar.nadathur at intel.com
Fri Jun 7 05:17:30 UTC 2019


> -----Original Message-----
> From: Matt Riedemann <mriedemos at gmail.com>
> Sent: Thursday, June 6, 2019 1:33 PM
> To: openstack-discuss at lists.openstack.org
> Subject: Re: [nova] [cyborg] Impact of moving bind to compute
> 
> On 5/23/2019 7:00 AM, Nadathur, Sundar wrote:
> > [....]
> > Moving the binding from [2] to [3] reduces this overlap. I did some
> > measurements of the time window from [2] to [3]: it was consistently
> > between 20 and 50 milliseconds, whether I launched 1 VM at a time, 2
> > at a time, etc. This seems acceptable.
> >
> > [2] https://github.com/openstack/nova/blob/master/nova/conductor/manager.py#L1501
> >
> > [3] https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L1882

> > Regards,
> >
> > Sundar

> I'm OK with binding in the compute since that's where we trigger the callback
> event and want to set up something to wait for it before proceeding, like we
> do with port binding.
> 
> What I've talked about in detail in the spec is doing the ARQ *creation* in
> conductor rather than compute. I realize that doing the creation in the
> compute service means fewer (if any) RPC API changes to get phase 1 of this
> code going, but I can't imagine the RPC API changes for that would be very big
> (it's a new parameter to the compute service methods, or something we lump
> into the RequestSpec).

> The bigger concern I have is that we've long talked about moving port (and at
> times volume) creation from the compute service to conductor because it's
> less expensive to manage external resources there if something fails, e.g.
> going over-quota creating volumes. The problem with failing late in the
> compute is we have to clean up other things (ports and volumes) and then
> reschedule, which may also fail on the next alternate host.

The ARQ creation could be done at [1], followed by the binding, before acquiring the semaphore or creating other resources. Why is that not a good option? 

[1] https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L1898
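
A minimal sketch of that ordering, for illustration only. Every name here (build_instance, create_arqs, bind_arqs, create_other_resources, delete_arqs, ArqError) is hypothetical and is not an actual Nova or Cyborg function; binding is shown as a plain call, whereas the real flow would wait for a callback event.

```python
# Hypothetical sketch: none of these names are real Nova or Cyborg APIs.
class ArqError(Exception):
    """Raised by the (hypothetical) ARQ creation step on failure."""


def build_instance(create_arqs, bind_arqs, create_other_resources, delete_arqs):
    """Create and bind ARQs first, then create everything else.

    Returns the list of steps that ran, so the ordering is easy to inspect.
    """
    steps = []
    try:
        arqs = create_arqs()
        steps.append("create_arqs")
    except ArqError:
        # ARQ creation failed before any other resource existed,
        # so there is nothing else to clean up at this point.
        steps.append("fail_fast")
        return steps

    try:
        bind_arqs(arqs)
        steps.append("bind_arqs")
        # Ports, volumes, etc. are only created after the ARQs are bound.
        create_other_resources()
        steps.append("create_other_resources")
    except Exception:
        # A later failure still has to release the ARQs created above.
        delete_arqs(arqs)
        steps.append("delete_arqs")
    return steps
```

If the ARQ step fails, the function returns before any other resource has been created; a failure in a later step still has to release the ARQs.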

> Failing fast in
> conductor is more efficient and also helps take some of the guesswork out of
> which service is managing the resources (we've had countless bugs over the
> years about ports and volumes being leaked because we didn't clean them up
> properly on failure). Take a look at any of the error handling in the server
> create flow in the ComputeManager and you'll see what I'm talking about.
> 
> Anyway, if we're voting I vote that ARQ creation happens in conductor and
> binding happens in compute.
> 
> --
> 
> Thanks,
> 
> Matt

Regards,
Sundar



