[nova] NUMA live migration is ready for review and testing
tl;dr If you care about NUMA live migration, check out [1] and test it in your env(s), or review it.

Over the months that I've worked on NUMA LM, I've been pinged by various folks who were interested in helping out. At this point I've addressed all the issues that were found at the end of the Stein cycle, and the series is ready for review and testing, with the aim of getting it merged in Train (for real this time).

So if you care about NUMA-aware live migration and have some spare time and hardware (if you're in the former category I don't think I need to explain what kind of hardware - though I'll try to answer questions as best I can), I would greatly appreciate it if you deployed the patches and tested them. I've done that myself, of course, but, as at the end of Stein, I'm sure there are edge cases that I didn't think of (though I'm selfishly hoping that there aren't).

I believe the series is also ready for review, though I haven't put it in the runway queue just yet because the last functional test patch is still a WIP, as I need to fiddle with it to assert more things.

Thanks in advance, cheers!

[1] https://review.opendev.org/#/c/672595/8
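For anyone planning a test, a minimal sketch of the happy path using openstacksdk is below. The cloud, flavor, image, and network names are all assumptions about your environment, and the flavor is assumed to already carry NUMA extra specs (e.g. hw:cpu_policy=dedicated, hw:numa_nodes=2); this is just one way to drive the feature, not part of the series itself:

```python
import openstack

# 'devstack-admin' is a hypothetical clouds.yaml entry; live migration
# needs admin credentials.
conn = openstack.connect(cloud='devstack-admin')

# Assumed to exist already: a flavor with NUMA/pinning extra specs set,
# e.g. hw:cpu_policy=dedicated and hw:numa_nodes=2.
flavor = conn.compute.find_flavor('numa-pinned')
image = conn.image.find_image('cirros')
net = conn.network.find_network('private')

server = conn.compute.create_server(
    name='numa-lm-test',
    flavor_id=flavor.id,
    image_id=image.id,
    networks=[{'uuid': net.id}],
)
server = conn.compute.wait_for_server(server)

# Let the scheduler pick a destination; with the series under review the
# instance's NUMA topology should be recalculated against the destination
# host rather than carried over unchanged.
conn.compute.live_migrate_server(server, host=None, block_migration='auto')
```

After the migration it's worth comparing the instance's pinned CPUs and NUMA mapping on source and destination (e.g. in the libvirt domain XML), since recalculating those claims for the destination host is the point of the series.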
On 8/9/2019 4:11 PM, Artom Lifshitz wrote:
> tl;dr If you care about NUMA live migration, check out [1] and test it in your env(s), or review it.
As I've said in IRC a few times, this feature was mentioned (at the last summit/PTG in Denver) as being critical for the next StarlingX release so I'd really hope the StarlingX community can help review and test this. I know there was some help from WindRiver in Stein which uncovered some issues, so it would be good to have that same kind of attention here. Feature freeze for Train is less than a month away (Sept 12).
> So if you care about NUMA-aware live migration and have some spare time and hardware (if you're in the former category I don't think I need to explain what kind of hardware - though I'll try to answer questions as best I can), I would greatly appreciate it if you deployed the patches and tested them. I've done that myself, of course, but, as at the end of Stein, I'm sure there are edge cases that I didn't think of (though I'm selfishly hoping that there aren't).
Again the testing here with real hardware is key, and that's something I'd hope Intel/WindRiver/StarlingX folk can help with since I personally don't have a lab sitting around available for NUMA testing. Since we won't have third party CI for this feature, it's going to be important that at least someone is hitting this with a real environment, ideally with mixed Stein and Train compute services as well to make sure it behaves properly during rolling upgrades.

--

Thanks,

Matt
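To cover both directions of the mixed-version case, one option is to name the destination host explicitly; a sketch under the same assumptions as above (the hostnames here are hypothetical stand-ins for a Stein compute and a Train compute):

```python
import time

import openstack

conn = openstack.connect(cloud='devstack-admin')
server = conn.compute.find_server('numa-lm-test')

# Exercise both directions of a mixed Stein/Train pair by naming the
# destination explicitly; hostnames are hypothetical.
for dest in ('train-compute-0', 'stein-compute-0'):
    conn.compute.live_migrate_server(server, host=dest,
                                     block_migration='auto')
    # Poll until the instance lands on the destination (or give up);
    # compute_host maps to the admin-only OS-EXT-SRV-ATTR:host attribute.
    for _ in range(120):
        server = conn.compute.get_server(server.id)
        if server.compute_host == dest and server.status == 'ACTIVE':
            break
        time.sleep(5)
    else:
        raise RuntimeError('migration to %s did not complete' % dest)
```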
On Thu, Aug 15, 2019 at 2:31 PM Matt Riedemann <mriedemos@gmail.com> wrote:
> As I've said in IRC a few times, this feature was mentioned (at the last summit/PTG in Denver) as being critical for the next StarlingX release so I'd really hope the StarlingX community can help review and test this. I know there was some help from WindRiver in Stein which uncovered some issues, so it would be good to have that same kind of attention here. Feature freeze for Train is less than a month away (Sept 12).
StarlingX does have time built in for this testing, intending to be complete before the STX 2.0 release at the end of August. I've suggested that we need to test both Train and our Stein backport but I am not the one with the resources to allocate.
> Again the testing here with real hardware is key, and that's something I'd hope Intel/WindRiver/StarlingX folk can help with since I personally don't have a lab sitting around available for NUMA testing. Since we won't have third party CI for this feature, it's going to be important that at least someone is hitting this with a real environment, ideally with mixed Stein and Train compute services as well to make sure it behaves properly during rolling upgrades.
Oddly enough, in my $OTHER_DAY_JOB Intel's new Third Party CI is at the top of my list and we are getting dangerously close there in general, but this testing is unfortunately not first in line.

dt

--
Dean Troyer
dtroyer@gmail.com
On Thu, 2019-08-15 at 15:23 -0500, Dean Troyer wrote:
> On Thu, Aug 15, 2019 at 2:31 PM Matt Riedemann <mriedemos@gmail.com> wrote:
>> As I've said in IRC a few times, this feature was mentioned (at the last summit/PTG in Denver) as being critical for the next StarlingX release so I'd really hope the StarlingX community can help review and test this. I know there was some help from WindRiver in Stein which uncovered some issues, so it would be good to have that same kind of attention here. Feature freeze for Train is less than a month away (Sept 12).
> StarlingX does have time built in for this testing, intending to be complete before the STX 2.0 release at the end of August. I've suggested that we need to test both Train and our Stein backport but I am not the one with the resources to allocate.

I doubt you will be able to safely backport this to Stein, as it contains RPC/object changes which would normally break things on upgrade.
E.g. if you backport this to Stein in STX 1.Y.Z, going from 1.0 to 1.Y.Z would require you to treat it like a major upgrade and upgrade all your controllers first, followed by the computes, to ensure you never generate a copy of the updated object before the nodes that receive it are updated. If you don't do that, services will start exploding.

We did a partial backport of NUMA-aware vswitch internally and had to drop all the object changes and scheduler changes, and only backport the virt driver changes, as we could not figure out a safe way to backport OVO changes that would not break the deployment if you didn't sequence the update like a major version upgrade, which we can't assume for z releases (x.y.z).

But glad to hear that in either case you do plan to test it in some capacity. I have dual NUMA hardware of my own that I plan to test it on personally, but the more the better.
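To make the failure mode concrete, here is a minimal sketch of the oslo.versionedobjects compatibility pattern such a backport would need; the object and field names are hypothetical, not the actual nova objects. If a sender emits version 1.1 to a receiver that only knows 1.0 without this backlevelling, the receiver fails to deserialize the object, which is the "exploding services" above:

```python
from oslo_utils import versionutils
from oslo_versionedobjects import base as ovo_base
from oslo_versionedobjects import fields


@ovo_base.VersionedObjectRegistry.register
class FakeMigrateData(ovo_base.VersionedObject):
    # Version 1.0: initial; 1.1: added dst_numa_info (hypothetical field,
    # standing in for whatever a NUMA LM backport would add).
    VERSION = '1.1'

    fields = {
        'instance_uuid': fields.UUIDField(),
        'dst_numa_info': fields.StringField(nullable=True),
    }

    def obj_make_compatible(self, primitive, target_version):
        super(FakeMigrateData, self).obj_make_compatible(
            primitive, target_version)
        target = versionutils.convert_version_to_tuple(target_version)
        if target < (1, 1):
            # A 1.0 receiver (e.g. an un-upgraded compute) has never heard
            # of the new field, so strip it before the object crosses RPC.
            primitive.pop('dst_numa_info', None)
```

Roughly, nova picks the target version from the service versions recorded in the database (or from [upgrade_levels] pins), which is why the upgrade has to be sequenced controllers-then-computes; a z release can't assume that sequencing, hence dropping the object changes from the partial backport.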
>> Again the testing here with real hardware is key, and that's something I'd hope Intel/WindRiver/StarlingX folk can help with since I personally don't have a lab sitting around available for NUMA testing. Since we won't have third party CI for this feature, it's going to be important that at least someone is hitting this with a real environment, ideally with mixed Stein and Train compute services as well to make sure it behaves properly during rolling upgrades.
> Oddly enough, in my $OTHER_DAY_JOB Intel's new Third Party CI is at the top of my list and we are getting dangerously close there in general, but this testing is unfortunately not first in line.
Speaking of that, I see Igor rebased https://review.opendev.org/#/c/652197/. I haven't really looked at that since May, and it looks like some file permissions have changed, so it's currently broken. I'm not sure if he/you planned on taking that over or if he was just interested; either is fine.

My first party CI solution has kind of stalled, since I just have not had time to work on it (given it's not part of my $OTHER_DAY_JOB), so I'm looking forward to the third party CI you are working on. If I find time to work on it again I will, but it still didn't have full parity with what the Intel NFV CI was testing, as it was running with the single NUMA node guests we have in the gate. It would be nice to have even basic first party testing of pinning/hugepages at some point.

Even though I wrote it, I don't like the fact that I was forced to use Fedora with the virt-preview repos enabled to get a new enough qemu/libvirt to do even partial testing without nested virt, so I would still guess the third party CI would be more reliable, since it can actually use nested virt provided you replace the default Ubuntu kernel with something based on 4.19.
> dt
participants (4)
- Artom Lifshitz
- Dean Troyer
- Matt Riedemann
- Sean Mooney