Hi,
thank you all for your comments/suggestions.
Having a "custom" cpu_mode seems the best option for our use case. "host-passthrough" is problematic when the hardware is retired and instances need to be moved to newer compute nodes.
Belmiro
On Wed, Aug 19, 2020 at 11:21 AM Arnaud Morin <arnaud.morin@gmail.com> wrote:
Hello,
We have the same kind of issue. To help mitigate it, we do segregation and also use cpu_mode=custom, so we can use a model which is close to our hardware (cpu_model=Haswell-noTSX) and add extra_flags when needed.
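For illustration, that looks roughly like this in nova.conf on the compute nodes (the model and the extra flag are just examples, adjust them to your hardware; on newer releases the option is cpu_models):

  [libvirt]
  cpu_mode = custom
  cpu_model = Haswell-noTSX
  # only when a guest actually needs a flag the model does not include
  cpu_model_extra_flags = pcid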
This is painful.
Cheers,
-- Arnaud Morin
On 18.08.20 - 16:16, Sean Mooney wrote:
On Tue, 2020-08-18 at 17:06 +0200, Fabian Zimmermann wrote:
Hi,
We are using the "custom"-way. But this does not protect you from all issues.
We had problems with a new CPU generation not (yet) being detected correctly by a libvirt version. So libvirt fell back to the "desktop" CPU of that newer generation, but it didn't support/detect some features => blocked live-migration.
yes, that is common when using really new hardware. having previously worked at Intel and hit this often, that is one of the reasons I tend to default to host-passthrough and recommend using AZs or aggregates to segregate the cloud for live migration.
in the case where your libvirt does not know about the new CPUs, your best approach is to use the newest server CPU model that it knows about, and then, if you really need the new features, you can try to add them using the config options. but that is effectively the same as using host-passthrough, which is why I default to that as a workaround instead.
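as a sketch of the segregation idea (the aggregate, AZ and host names are just examples), one aggregate/AZ per hardware generation so live migration only happens between identical hosts:

  openstack aggregate create --zone az-haswell agg-haswell
  openstack aggregate add host agg-haswell compute-haswell-01
  openstack aggregate create --zone az-cascadelake agg-cascadelake
  openstack aggregate add host agg-cascadelake compute-cascadelake-01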
Fabian
On Tue, Aug 18, 2020 at 16:54, Belmiro Moreira <moreira.belmiro.email.lists@gmail.com> wrote:
Hi,
in our infrastructure we always have compute nodes that need a hardware intervention and, as a consequence, are rebooted, bringing a new kernel, kvm, ...
In order to have a good compromise between performance and flexibility (live migration), we have been using "host-model" for the "cpu_mode" configuration of our service VMs. We didn't expect to have CPU compatibility issues because we have the same hardware type per cell.
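That is, in nova.conf (simplified):

  [libvirt]
  cpu_mode = host-model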
The problem is that when a compute node is rebooted, the instance domain is recreated with the new CPU features that were introduced by the reboot (we use CentOS).
If new CPU features are exposed, this basically blocks live migration to all the non-rebooted compute nodes (where those CPU features are not exposed yet). The nova-scheduler doesn't know about them when scheduling the live migration destination.
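For illustration (the model and feature names below are just examples), with host-model the recreated domain ends up with an explicit CPU model plus "require" features, roughly like:

  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>Haswell-noTSX-IBRS</model>
    <feature policy='require' name='md-clear'/>
    <feature policy='require' name='ssbd'/>
  </cpu>

Any host that does not expose one of those required features is then refused as a live migration destination.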
I wonder how other operators are solving this issue. I don't like stopping OS upgrades. What I'm considering is to define a "custom" cpu_mode for each hardware type.
I would appreciate your comments and learn how you are solving this problem.
Belmiro