Hello,
We have the same kind of issue.
To help mitigate it, we do segregation and also use cpu_mode=custom, so we
can use a model which is close to our hardware (cpu_model=Haswell-noTSX)
and add extra_flags when needed.
This is painful.
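
For illustration, a minimal sketch of what such a setup might look like in
nova.conf. This assumes the "extra_flags" mentioned above refers to nova's
cpu_model_extra_flags option in the [libvirt] section; the pcid flag is only
a placeholder example, not necessarily one we add. On newer nova releases
the cpu_model option is deprecated in favor of cpu_models.

```ini
[libvirt]
# Expose a fixed, named CPU model to guests instead of mirroring the host.
cpu_mode = custom
# Model close to the real hardware, as mentioned above.
cpu_model = Haswell-noTSX
# Extra flags on top of the named model (example value only).
cpu_model_extra_flags = pcid
```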
Cheers,
--
Arnaud Morin
On 18.08.20 - 16:16, Sean Mooney wrote:
> On Tue, 2020-08-18 at 17:06 +0200, Fabian Zimmermann wrote:
> > Hi,
> >
> > We are using the "custom" approach, but it does not protect you from all issues.
> >
> > We had problems with a new cpu generation not (yet) detected correctly
> > by a libvirt version. So libvirt fell back to the "desktop" cpu of
> > this newer generation, but didn't support/detect some features =>
> > blocked live migration.
> yes, that is common when using really new hardware. having previously worked
> at intel and hit this often, that is one of the reasons i tend to default to host-passthrough
> and recommend using AZs or aggregates to segregate the cloud for live migration.
>
> in the case where your libvirt does not know about the new cpus, your best approach is to use the
> newest server cpu model that it knows about, and then if you really need the new features you can try
> to add them using the config options. but that is effectively the same as using host-passthrough,
> which is why i default to that as a workaround instead.
>
> >
> > Fabian
> >
> > Am Di., 18. Aug. 2020 um 16:54 Uhr schrieb Belmiro Moreira
> > <moreira.belmiro.email.lists@gmail.com>:
> > >
> > > Hi,
> > > in our infrastructure we always have compute nodes that need a hardware intervention and, as a consequence, they are
> > > rebooted, bringing in a new kernel, kvm, ...
> > >
> > > In order to have a good compromise between performance and flexibility (live migration) we have been using
> > > "host-model" for the "cpu_mode" configuration of our service VMs. We didn't expect to have CPU compatibility issues
> > > because we have the same hardware type per cell.
> > >
> > > The problem is that when a compute node is rebooted, the instance domain is recreated with any new CPU features
> > > that the reboot introduced (we use CentOS).
> > >
> > > If there are new CPU features exposed, this basically blocks live migration to all the non-rebooted compute nodes
> > > (those CPU features are not exposed there yet). The nova-scheduler doesn't know about them when scheduling the live
> > > migration destination.
> > >
> > > I wonder how other operators are solving this issue.
> > > I don't like stopping OS upgrades.
> > > What I'm considering is to define a "custom" cpu_mode for each hardware type.
> > >
> > > I would appreciate your comments and learn how you are solving this problem.
> > >
> > > Belmiro
> > >
> >
> >
>
>