[nova][ops] Live migration and CPU features
Hi, in our infrastructure we always have compute nodes that need a hardware intervention and are consequently rebooted, coming back with a new kernel, kvm, ...

To have a good compromise between performance and flexibility (live migration) we have been using "host-model" for the "cpu_mode" configuration of our service VMs. We didn't expect CPU compatibility issues because we have the same hardware type per cell.

The problem is that when a compute node is rebooted, the instance domain is recreated with the new CPU features introduced by the reboot (we run CentOS).

If new CPU features are exposed, this effectively blocks live migration to all the compute nodes that have not been rebooted (those CPU features are not exposed there yet). The nova-scheduler doesn't know about them when choosing the live migration destination.

I wonder how other operators are solving this issue. I don't like stopping OS upgrades. What I'm considering is defining a "custom" cpu_mode for each hardware type.

I would appreciate your comments and would like to learn how you are solving this problem.

Belmiro
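(For readers following the thread, the two options being compared live in the [libvirt] section of nova.conf on each compute node. A minimal sketch with an illustrative model name; depending on the nova release the option may be spelled cpu_models rather than cpu_model:)

    [libvirt]
    # current setting: guests get a copy of the host CPU model, which can
    # change when the host is rebooted onto a newer kernel/kvm
    cpu_mode = host-model

    # alternative under consideration: pin every host of a given hardware
    # type to one named model so the guest-visible CPU never changes
    #cpu_mode = custom
    #cpu_model = Haswell-noTSX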
Hi, Try to choose a custom cpu_model that fits your infra. This should be the best approach to avoid this kind of problem. If performance is not an issue for the tenants, KVM64 should be a good choice.

Br, Luis Rmz <https://www.linkedin.com/in/luisframirez/> Blockchain, DevOps & Open Source Cloud Solutions Architect | Founder & CEO OpenCloud.es <http://www.opencloud.es/> luis.ramirez@opencloud.es | Skype ID: d.overload | Hangouts: luis.ramirez@opencloud.es | +34 911 950 123 / +39 392 1289553 / +49 152 26917722 / Česká republika: +420 774 274 882
On Tue, 2020-08-18 at 17:01 +0200, Luis Ramirez wrote:
Hi,
Try to choose a custom cpu_model that fits into your infra. This should be the best approach to avoid this kind of problem. If the performance is not an issue for the tenants, KVM64 should be a good choice.

You should never use kvm64 in production: it is not maintained for security vulnerabilities, e.g. it is never updated with any of the feature flags needed to mitigate issues like Spectre etc.

It's fine for CI and test environments where you don't control the underlying cloud and are using nested virt, and it's semi-reasonable for nested VMs, but it's not a good choice for the host. You should either use host-passthrough and segregate your hosts using aggregates (or other means) to ensure live migration capability, or use a custom model. host-model is a good default provided you upgrade all hosts at the same time and you are OK with the feature set changing. host-model has a one-way migration problem: it is possible to migrate from an old host to a new one, but not from new to old, if the VM is hard rebooted in between. So when using host-model we still recommend segregating hosts by CPU generation to avoid that.
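(A rough sketch of that aggregate-based segregation; the aggregate name, property key and host/flavor names are made up for illustration, and the flavor part assumes the AggregateInstanceExtraSpecsFilter is enabled in the scheduler:)

    # group hosts of one CPU generation into an aggregate and tag it
    openstack aggregate create haswell-hosts
    openstack aggregate set --property cpu_gen=haswell haswell-hosts
    openstack aggregate add host haswell-hosts compute-01

    # pin a flavor to that generation so its instances (and their live
    # migrations) stay on identical CPUs
    openstack flavor set --property aggregate_instance_extra_specs:cpu_gen=haswell m1.medium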
Hi, We are using the "custom" way. But this does not protect you from all issues. We had problems with a new CPU generation not (yet) detected correctly by a libvirt version. So libvirt fell back to the "desktop" CPU of this newer generation, but didn't support/detect some features => blocked live migration.

Fabian
On Tue, 2020-08-18 at 17:06 +0200, Fabian Zimmermann wrote:
Hi,
We are using the "custom" way. But this does not protect you from all issues.

We had problems with a new CPU generation not (yet) detected correctly by a libvirt version. So libvirt fell back to the "desktop" CPU of this newer generation, but didn't support/detect some features => blocked live migration.

Yes, that is common when using really new hardware. Having previously worked at Intel and hit this often, that is one of the reasons I tend to default to host-passthrough and recommend using AZs or aggregates to segregate the cloud for live migration.

In the case where your libvirt does not know about the new CPUs, your best approach is to use the newest server CPU model that it does know about and then, if you really need the new feature, you can try to add it using the config options. But that is effectively the same as using host-passthrough, which is why I default to that as a workaround instead.
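(For concreteness, the config options being referred to are in the [libvirt] section of nova.conf; a sketch only, with example model and flag names, and on newer releases the option may be cpu_models (plural):)

    [libvirt]
    cpu_mode = custom
    # newest named server model this libvirt/QEMU knows about
    cpu_model = Skylake-Server-IBRS
    # add back individual features the named model is missing
    cpu_model_extra_flags = pdpe1gb, ssbd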
Hello, We have the same kind of issue. To help mitigate it, we do segregation and also use cpu_mode=custom, so we can use a model which is close to our hardware (cpu_model=Haswell-noTSX) and add extra_flags when needed.

This is painful.

Cheers,
--
Arnaud Morin
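(Roughly, a nova.conf along these lines; the extra flag shown is a placeholder for whatever the hardware and workloads actually need:)

    [libvirt]
    cpu_mode = custom
    cpu_model = Haswell-noTSX
    # re-add individual features the named model lacks, when needed
    cpu_model_extra_flags = pcid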
Hi, thank you all for your comments/suggestions. Having a "custom" cpu_mode seems the best option for our use case. "host-passthrough" is problematic when the hardware is retired and instances need to be moved to newer compute nodes.

Belmiro
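(One way to sanity-check a candidate custom model before rolling it out, a sketch assuming virsh is available on the compute nodes:)

    # list the named CPU models this host's libvirt/QEMU knows about
    virsh cpu-models x86_64

    # show which models this host can actually run, and with which features
    # (see the <cpu> / <mode name='custom'> section of the XML output)
    virsh domcapabilities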
participants (5)
- Arnaud Morin
- Belmiro Moreira
- Fabian Zimmermann
- Luis Ramirez
- Sean Mooney