Re: [nova] [cyborg] Impact of moving bind to compute
Hi,

We had a thread [1] on this subject from May of this year. The preference there was that "ARQ creation happens in conductor and binding happens in compute" [2].

ARQ binding involves device preparation and FPGA programming, which may take a while, so it is done asynchronously. It is desirable to kick off the binding as early as possible, to maximize its overlap with the other tasks needed for VM creation.

We wound up doing all of the binding in the compute for the following reason: if we call Cyborg to initiate ARQ binding and then wait for the notification event, we may miss the event if it arrives in the window between those two steps. So we had to call wait_for_instance_event() and, within its scope, call Cyborg for binding. That logic moved everything to compute.

But now we are close to having an improved wait_for_instance_event() [3]. So I propose to:

A. Start the binding in the conductor. This gets maximum concurrency between binding and other tasks.

B. Wait for the binding notification in the compute manager (without losing the event). In fact, we can wait inside _build_resources, which is where Neutron/Cinder resources are gathered as well. That will allow for doing the cleanup in a consistent manner, as today.

C. Call Cyborg to get the ARQs in the virt driver, like today.

Please LMK if you have any objections.

[1] http://lists.openstack.org/pipermail/openstack-discuss/2019-May/006541.html
[2] http://lists.openstack.org/pipermail/openstack-discuss/2019-June/006979.html
[3] https://review.opendev.org/#/c/695985/

Regards,
Sundar
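[For readers following along, the workaround described above takes roughly this shape. This is a minimal sketch, not Nova's actual code: the cyborg_client object, its bind_arqs() signature, and the 'accelerator-request-bound' event name are stand-ins for the real interfaces. The essential point is that the waiter is registered before binding is initiated, so a notification that fires while binding is still in flight is queued rather than dropped.

    # Minimal sketch of the current workaround (helper names are
    # stand-ins, not Nova's real interfaces). wait_for_instance_event()
    # is a context manager that registers interest in the events on
    # entry and blocks on exit, so an event that fires while binding is
    # still in progress is queued instead of lost.

    def bind_and_wait(virtapi, cyborg_client, context, instance, arqs):
        events = [('accelerator-request-bound', arq['uuid'])
                  for arq in arqs]
        with virtapi.wait_for_instance_event(instance, events,
                                             deadline=300):
            # Initiate binding only after the waiter is registered.
            # Calling Cyborg first and registering afterwards opens the
            # window in which a fast notification would be missed.
            cyborg_client.bind_arqs(context, arqs)
        # Leaving the with-block means every event has arrived (or the
        # deadline expired and an exception was raised).

Because the Cyborg call has to sit inside the scope of the waiter, the whole sequence currently lives on the compute node, which is the constraint the proposal above removes.]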
> A. Start the binding in the conductor. This gets maximum concurrency between binding and other tasks.
>
> B. Wait for the binding notification in the compute manager (without losing the event). In fact, we can wait inside _build_resources, which is where Neutron/Cinder resources are gathered as well. That will allow for doing the cleanup in a consistent manner, as today.
+many
> But now we are close to having an improved wait_for_instance_event() [3]. So I propose to:
>
> A. Start the binding in the conductor. This gets maximum concurrency between binding and other tasks.
>
> B. Wait for the binding notification in the compute manager (without losing the event). In fact, we can wait inside _build_resources, which is where Neutron/Cinder resources are gathered as well. That will allow for doing the cleanup in a consistent manner, as today.
>
> C. Call Cyborg to get the ARQs in the virt driver, like today.
We actually collect the neutron event in the virt driver. We kick off some of the early stuff in _build_resources(), but those are things that we want to be able to do from conductor.

I'd ideally like to move the wait further down into the stack, purely so we overlap with the image fetch. That's the thing that will take the longest on the compute node. If the system is unloaded, the conductor->compute->virt stuff could happen pretty quickly, and if we wait a minute (for example) for programming to finish before we start spawn(), that's enough time that we could have potentially already finished the image fetch. This is also time where we're holding a spot in the parallel build limit queue, but not doing anything useful.

That said, things can move around inside the compute manager and virt driver without affecting upgrades, so if it's easier to do it in _build_resources() now, we can see about optimizing later. It should, however, happen as the last step in _build_resources(), so that we overlap with all the network and block stuff that happens there already.

--Dan
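[To make the ordering concrete: the suggestion is roughly the shape below. This is an illustration, not actual Nova code (in Nova, _build_resources() is a contextmanager, and the helper names here are invented); what matters is that the ARQ wait is the last step, so device programming overlaps with the network and block device work already in flight.

    # Illustrative ordering only. The accelerator wait happens last, so
    # the conductor-initiated binding runs concurrently with everything
    # started above it.

    def _build_resources(self, context, instance, requested_networks,
                         block_device_mapping):
        # Kick off asynchronous network allocation first.
        network_info = self._allocate_network(context, instance,
                                              requested_networks)
        # Prepare block devices next.
        block_device_info = self._prep_block_device(
            context, instance, block_device_mapping)
        # Only now block on the ARQ binding notifications; by this
        # point the binding has had the whole network/volume setup time
        # (and ideally the image fetch too) to make progress.
        accel_info = self._wait_for_arq_binding(context, instance)
        return network_info, block_device_info, accel_info
]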
> But now we are close to having an improved wait_for_instance_event() [3]. So I propose to:
>
> A. Start the binding in the conductor. This gets maximum concurrency between binding and other tasks.
>
> B. Wait for the binding notification in the compute manager (without losing the event). In fact, we can wait inside _build_resources, which is where Neutron/Cinder resources are gathered as well. That will allow for doing the cleanup in a consistent manner, as today.
>
> C. Call Cyborg to get the ARQs in the virt driver, like today.
Sorry, I missed this. No, I don't think this is reasonable. I'm -5 on where you have it today. However, there is zero point in calling out to Cyborg in _build_resources() and then calling it again in the virt driver just a couple of stack frames away.

The point of _build_resources() is to collect resources that we need to clean up if we fail, and yield them to the build process. Store your ARQs there, pass them to the virt driver, and roll them back if you fail.

--Dan
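[The contract being described is the contextmanager shape of _build_resources(): acquire each resource once, yield the whole set to the build process, and unwind everything in the error path. A rough sketch, with made-up helper names (_get_bound_arqs() and the _cyborg_client attribute are assumptions, not Nova's real identifiers):

    import contextlib

    @contextlib.contextmanager
    def _build_resources(self, context, instance):
        arqs = []
        try:
            # ... Neutron/Cinder resources are gathered here, as today ...
            # Resolve the ARQs exactly once, after the bound
            # notifications arrive, and hand them to the virt driver
            # via the yielded resources dict; no second Cyborg call
            # from inside the driver.
            arqs = self._get_bound_arqs(context, instance)
            yield {'accel_info': arqs}
        except Exception:
            # Roll back everything acquired above, ARQs included.
            self._cyborg_client.delete_arqs_for_instance(instance.uuid)
            raise

An exception raised by the build process inside the with-block propagates back into the generator at the yield, which is what lets the except clause roll the ARQs back alongside the other resources.]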
From: Dan Smith <dms@danplanet.com>
Sent: Tuesday, November 26, 2019 3:17 PM
Subject: Re: [nova] [cyborg] Impact of moving bind to compute
>> But now we are close to having an improved wait_for_instance_event() [3]. So I propose to:
>>
>> A. Start the binding in the conductor. This gets maximum concurrency between binding and other tasks.
>>
>> B. Wait for the binding notification in the compute manager (without losing the event). In fact, we can wait inside _build_resources, which is where Neutron/Cinder resources are gathered as well. That will allow for doing the cleanup in a consistent manner, as today.
>>
>> C. Call Cyborg to get the ARQs in the virt driver, like today.
>
> Sorry, I missed this. No, I don't think this is reasonable. I'm -5 on where you have it today. However, there is zero point in calling out to Cyborg in _build_resources() and then calling it again in the virt driver just a couple of stack frames away.
>
> The point of _build_resources() is to collect resources that we need to clean up if we fail, and yield them to the build process. Store your ARQs there, pass them to the virt driver, and roll them back if you fail.

Agreed, thanks.

> --Dan
Regards,
Sundar