*Nova Wallaby virtual PTG
people/colours:
sean-k-mooney/and this
stephenfin
lyarwood
bauzas
johnthetubaguy
brinzhang
xinranwang
kashyapc
aarents
gmann
*Schedule
October 26 - 30
All times are in UTC
28th Oct (Wednesday):
- 13:00 - 14:00: Nova-Cyborg cross project session
- 14:00 - 15:00: Nova-Neutron cross project session
29th Oct (Thursday)
30th Oct (Friday)
We will use the Liberty Zoom room for the sessions:
*PTG Topics
*Nova-Cyborg cross project
- (xinranwang, yumeng) SmartNIC: SRIOV nic support in nova, neutron and cyborg,
- and does not mean It's favorited.
- There is a simple changelog in the nova patch's commit message.
- this isn't really a good title, by the way; it's not really about SRIOV NIC support and more about SmartNIC support
- (alex) this spec seems more focused on SRIOV NICs; SmartNIC is a further step in the future.
- (gibi): Does nova need to know that a neutron port is backed by a SmartNIC? After reading the spec it seems that SmartNIC for nova is just a PCI backed neutron port with a (non bandwidth related) resource request.
- (xinranwang) nova should retrieve the device profile name from the neutron port's back_end field and put it into the resource request, in the neutron.py file.
- Yes. For Nova, this spec only covers a PCI device managed by Cyborg; the advanced SmartNIC features are handled by Cyborg.
- I would be happier not to have a specific implementation for this kind of neutron port. The resource request of the cyborg device could come from the neutron port directly. However after the binding, the PCI info needs to come from cyborg anyhow. :/
- (yongli) exactly, we cannot eliminate all interaction with cyborg. (gibi): yeah I have to accept this.
- generally speaking, is it OK?
- AGREED: overall direction seems OK
- Opens:
- co-existence with the neutron bandwidth QoS feature.
- keep the config file consistent between neutron QoS and cyborg (the "physical_network" traits and RP)
- (gibi): this is not QoS specific. The physical network - backing device mapping is defined both in neutron and nova even when QoS is not configured
- neutron: [sriov_nic]/physical_device_mappings
- nova: [pci]/passthrough_whitelist <-- during the discussion I realized that this is irrelevant for the SmartNIC case
- let neutron report physnet traits anyway; need to figure out what the RP tree structure looks like.
- (yongli) Which way is preferred? (change neutron or use a config file)
- AGREED: cyborg will have a physnet configuration
- To avoid representing the same physical device twice (once by neutron sriov nic agent and once by cyborg) we can try to move the QoS config and inventory reporting of SmartNIC to cyborg
- (Yongli) do we need a new vnic type?
- seems we don't need a new one. we support only "direct" (direct passthrough), and it's totally identical to PCI passthrough.
- Agree?-1
- (sean-k-mooney): there already is one vnic_type="smart-nic"
- https://github.com/openstack/neutron-lib/blob/52a75a02a204e1bcd5b9be6c9f13effa0380eaeb/neutron_lib/api/definitions/portbindings.py#L119
- nova doesn't know this type?
- direct is tied into lots of complicated logic; i would prefer not to reuse it for cyborg and keep direct for pci devices managed by nova in the pci tracker.
- specifically i do not want to have to rework this logic that creates a pci request spec for all neutron sriov vnic types https://github.com/openstack/nova/blob/a83001a903c50143afe2957be6ca72a6e3b884f5/nova/network/neutron.py#L2121-L2146
- based on what we learned from the POC, we don't need to change the "pci request" logic. (fix me)
- we would if the pci request filter is enabled, since the host won't have any pci devices in the nova database but that code will create a pci request.
- vnic types are cheap and it will make the support explicit, which is likely better in the long term. (gibi): +1 for explicitness
- AGREED:
- vnic_type=direct triggers the nova PCI tracker codepath. We don't want to trigger that for cyborg SmartNICs
- vnic_type=smart-nic is already used for Ironic -> do not reuse it
- so we need a new vnic_type for cyborg SmartNICs. Bikeshed on the naming in the spec (e.g. accelerator, device_profile)
- (gibi): UX questions:
- Will neutron reject port creation with an invalid cyborg device profile? Or will only nova detect that the device profile stored in the neutron port is invalid (e.g. does not exist in cyborg)?
- (yongli) This is an open question; in the POC, only nova detects this kind of error. Is it a must-have?
- Is the device profile in the neutron port changeable after port creation? Only for unbound ports?
- (yongli) seems the simplest choice is to not allow changes; if you want to do that, delete the port and create a new one.
- Will the implementation support attaching / detaching a port that has a device profile? If not, then we need to explicitly reject it.
- (yongli) not going to support this right now (out of scope for this spec), so there should be an explicit rejection, yes.
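- Sketch of the AGREED vnic_type split and the explicit attach rejection. This is only an illustration under stated assumptions: the vnic type name `cyborg-nic`, the port key `device_profile` and both function names are hypothetical placeholders (the naming is left to the spec); only `binding:vnic_type` and the existing SR-IOV vnic types are taken from neutron as it is today.

```python
# Hypothetical sketch; CYBORG_VNIC_TYPE and "device_profile" are placeholders,
# not existing nova or neutron names.
CYBORG_VNIC_TYPE = "cyborg-nic"
# Existing vnic types that go through the nova PCI tracker code path.
PCI_TRACKER_VNIC_TYPES = {"direct", "macvtap", "direct-physical"}


def device_owner_for_port(port):
    """Return which component is expected to provide the backing device."""
    vnic_type = port.get("binding:vnic_type", "normal")
    if vnic_type == CYBORG_VNIC_TYPE:
        if not port.get("device_profile"):
            raise ValueError("cyborg-backed port needs a device profile")
        return "cyborg"
    if vnic_type in PCI_TRACKER_VNIC_TYPES:
        return "nova-pci-tracker"
    return "none"


def check_attach_allowed(port):
    """Explicitly reject attach/detach for ports that carry a device profile."""
    if device_owner_for_port(port) == "cyborg":
        raise NotImplementedError(
            "attaching a port with a device profile is not supported")
```

- Keeping the cyborg case on its own vnic type means the `direct` branch never has to learn about device profiles.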
- Nova-cyborg integration: VGPU support (Yumeng, Brin, Wenping)
- nova-libvirt-support: (Brin, wenping) nova side spec: https://review.opendev.org/#/c/750116/
- (Yumeng) changes on the nova side mainly lie in the libvirt driver process, and include:
- 1. when spawning a vm with vGPU, the UUID of an mdev device is passed to the libvirt driver by acc_info, then the libvirt driver creates an mdev with that given UUID and later writes the UUID into the instance's XML.
- 2. we should make sure that the new mdev support does not conflict with current mdev management process.
- maybe we can avoid such conflict by defining a new tag mdev_tags in spawn( ) and then use it where necessary.
- the enumerated values of mdev_tags could be ("COMPUTE", "ACCELERATOR"); "COMPUTE" is the default value, which means the mdev is managed by nova, "ACCELERATOR" means the mdev is managed by cyborg
- 3. extend _guest_add_accel_pci_devices to support both pci devices and mdev devices
- (Yumeng) cyborg-driver-support: cyborg-side spec https://review.opendev.org/758925
- changes on the cyborg side mainly lie in the nvidia driver, and should include:
- 1. cyborg supports configuring vgpu_type for each device in cyborg.conf
- 2. the driver should distinguish whether the device is a pass-through GPU or a virtualized GPU, and then generate the resource_class (RC)
- the driver can distinguish by the existence of "/sys/bus/pci/devices/{PCI_ADDRESS}/mdev_supported_types". If the path exists, it is a virtualized GPU, otherwise it is a pass-through GPU.
- 3. generate traits:
- We need to define the traits format. Is this format ok? CUSTOM_<VENDOR_NAME>_<PRODUCT_ID>_<Virtual_GPU_Type>?
- cyborg traits: OWNER_CYBORG_CUSTOM_GPU_NVIDIA_1EB8_T41B
- nova traits: OWNER_NOVA_CUSTOM_GPU_NVIDIA\OWNER_NOVA_CUSTOM_GPU_1EB8
- The traits will be reported to both cyborg-db and placement and will be used in device_profile when requesting an accelerator and in the nova scheduler. What format of Virtual_GPU_Type should we use?
- 4. generate device, deployable, controlpath_id, attach_handle objects
- deployable ---> resource_provider
- vgpus attach_handles ---> allocations
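- Sketch of driver steps 2 and 3 above (GPU classification and trait naming). Assumptions: the resource class names `VGPU` / `PGPU` and both function names are illustrative, the notes do not fix them; the trait layout follows the proposed CUSTOM_<VENDOR_NAME>_<PRODUCT_ID>_<Virtual_GPU_Type> format, which was still an open question.

```python
import os
import re


def gpu_resource_class(pci_address, sysfs_root="/sys/bus/pci/devices"):
    """Classify a GPU by the existence of its mdev_supported_types path."""
    mdev_path = os.path.join(sysfs_root, pci_address, "mdev_supported_types")
    # Assumption: VGPU / PGPU are the resource class names being generated.
    return "VGPU" if os.path.exists(mdev_path) else "PGPU"


def custom_trait(vendor_name, product_id, vgpu_type):
    """Build CUSTOM_<VENDOR_NAME>_<PRODUCT_ID>_<Virtual_GPU_Type>."""
    raw = "CUSTOM_%s_%s_%s" % (vendor_name, product_id, vgpu_type)
    # Custom trait names may only contain A-Z, 0-9 and underscore,
    # so every other character is replaced with an underscore.
    return re.sub(r"[^A-Z0-9_]", "_", raw.upper())
```

- `sysfs_root` is a parameter only so the check can be exercised against a fake directory tree.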
- (bauzas) We need more details about:
- the Placement Resource Providers modeling
- for example, 1 GPU device with 4 vGPUs will report 1 resource provider with 4 in the inventory
- Q: will the new GPU resource provider have the compute node as its parent?
- Yes
- observation: same as vgpu, except cyborg owns the resource provider
- Current Nova documentation for mdevs : https://docs.openstack.org/nova/latest/admin/virtual-gpu.html
- how to make sure that Cyborg mdevs won't be used by the mdev management we have in nova
- add mdev_tag to confirm where the mdev request comes from: if
- ah_types_set = {arq['attach_handle_type'] for arq in accel_info} contains "MDEV", then it is from cyborg (need to make sure this gpu is not managed by nova)
- Observation: if nova and cyborg conflict you just get "resource in use" errors
- Does it mean we can enable the same vgpu_type in nova.conf and cyborg.conf at the same time?
- ... it's worse, Nova needs to know not to create mdev devices when it's cyborg?
- how the libvirt spawn() will not use the mdev management if needed
- (gibi): flavor based VGPU resource should be allocated from the nova managed RP. While device profile based VGPU resource should be allocated from the cyborg managed RP.
- use owner (nova, cyborg) trait in Placement when the inventory is reported
- e.g. OWNER_NOVA, OWNER_CYBORG; this would be a new namespace in os-traits
- this is similar to consumer type (instance, migration). Neither consumer types nor RP types (or owners, or factories) exist in placement today
- thanks, forgot the name (consumer type)
- AGREED: use OWNER_* traits
- include the owner trait in the allocation candidate query
- libvirt driver should only create MDEVs if the allocation is for a nova managed RP.
- NEXT STEPS: detail out the needed nova changes in the spec
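- Sketch of the two agreed pieces: restricting the allocation candidate query by owner trait, and telling cyborg mdevs apart via the ARQ attach handle type. Assumptions: the function names are illustrative, the OWNER_* trait names are the ones proposed in this session, and `accel_info` is taken to be a list of ARQ dicts as in the snippet quoted above.

```python
from urllib.parse import urlencode


def allocation_candidates_query(owner, amount=1):
    """Build a GET /allocation_candidates query limited to one owner's RPs."""
    if owner not in ("nova", "cyborg"):
        raise ValueError("unknown owner: %s" % owner)
    params = {
        "resources": "VGPU:%d" % amount,
        # Proposed trait namespace: OWNER_NOVA / OWNER_CYBORG.
        "required": "OWNER_%s" % owner.upper(),
    }
    return "/allocation_candidates?" + urlencode(params)


def mdev_is_cyborg_managed(accel_info):
    """True if any ARQ carries an MDEV attach handle, i.e. cyborg owns it."""
    ah_types_set = {arq["attach_handle_type"] for arq in accel_info}
    return "MDEV" in ah_types_set
```

- With this split the libvirt driver would only create mdevs itself when `mdev_is_cyborg_managed()` is false.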
*Nova-Neutron cross project
- (bauzas): the status of the scheduler support for routed network+1
- spec: https://review.opendev.org/733703
- we will continue working on this in Wallaby
- Future work: to provide the routed network information via the port resource_request. Bauzas has no bandwidth to make this happen so it can be taken over by other developers
- (sean-k-mooney) nova support for neutron port NUMA affinity
- basically this is a follow up to the vm wide pci numa affinity image and flavor extra spec.
- neutron has been enhanced to add a new property to the neutron port that is the numa affinity policy that should be applied
- this would allow requiring strict numa affinity for sriov ports while disabling affinity for, say, the management interface.
- this would also allow you to opt out of numa aware vswitch, which today enforces strict numa affinity for all vms on the host if it's configured.
- the nova changes are basically as follows:
- extend the nova.network.model.vif object to store the numa affinity policy.
- extend the nova neutron model to populate that when we build the port from the neutron response.
- check the numa affinity policy of ports as part of the scheduling process and propagate it to the existing numa affinity code for sriov interfaces and numa vswitches.
- optionally allow a default affinity policy to be set for numa vswitches in the nova.conf other than strict. this would allow you to configure the feature so that the info is available to nova to schedule on but not enforce affinity unless asked.
- TODO(sean-k-mooney): file a spec for this assuming we have agreement
- the optional default affinity policy for numa vswitch is one of the design points i want to get input on, but we could leave that to the spec too
- Neutron code: https://review.opendev.org/#/q/topic:bug/1886798+(status:open+OR+status:merged)
- AGREED: let's have a small spec in nova about it. But overall direction seems OK.
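- Sketch of how a per-port policy could be resolved against the flavor and a configurable default. Everything here is an assumption pending the spec: the precedence (port, then flavor, then config default), the port key `numa_affinity_policy`, the policy names, and the function name.

```python
# Assumed policy names; "required" stands in for the strict behaviour
# mentioned in the notes.
VALID_POLICIES = ("required", "preferred", "legacy")


def effective_numa_policy(port, flavor_policy=None, config_default="required"):
    """Pick the first policy that is set: port, then flavor, then config."""
    candidates = (port.get("numa_affinity_policy"), flavor_policy,
                  config_default)
    for candidate in candidates:
        if candidate is None:
            continue
        if candidate not in VALID_POLICIES:
            raise ValueError("unknown NUMA affinity policy: %s" % candidate)
        return candidate
    return None
```

- A non-strict `config_default` is what would let the information reach the scheduler without enforcing affinity unless a port or flavor asks for it.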
- (sean-k-mooney): vhost-vdpa support in nova/neutron via ml2/ovs and libvirt.
- https://review.opendev.org/#/q/topic:vhost-vdpa+(status:open+OR+status:merged)
- neutron changes:
- nova changes
- the libvirt driver needs to support generating the libvirt interface xml
- the libvirt driver needs to track vdpa devices and their association to VFs.
- this can use the resources infrastructure added for PMEM or it can extend the pci tracker code with a new device type type_vdpa.
- adding a new device type to the pci tracker is the preferred option as it's the smallest change and will allow us to block requesting vdpa devices via pci aliases, which is always incorrect to do.
- this simply amounts to marking something as type-vdpa when it reports the vdpa capability here: https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L7062
- the libvirt xml requires that we specify the /dev/* device path in the xml; as a result we need to modify this on live migration, so we need to extend the migration_data object as we did for sriov and numa migration.
- eventually nova should report these vdpa resources to placement after pci devices are tracked in placement.
- initially nova will just report a new trait hw_vhost_vdpa for all hosts that have pci devices with the vdpa capability.
- this will be automatically added with a placement prefilter to any vms with a vdpa type port. (note that, like other sriov-like interfaces, ports will have to be precreated with vnic type vdpa before booting the vm with those ports)
- as the vdpa devices are associated with VFs, transitively that gives them a numa affinity. the initial support could be done without numa affinity and that can be added later without too much effort, just reuse the affinity policies for the neutron port or the flavor to control it.
- overall the nova changes will be very similar to https://specs.openstack.org/openstack/nova-specs/specs/pike/implemented/netronome-smartnic-enablement.html
- the only real delta between vhost-vdpa and virtio forwarder from a nova perspective is that the hardware natively speaks the virtio packet format, so no user space application is required to do the translation, and the kernel provides a vhost backend for it via the new vhost_vdpa module.
- future work
- numa
- as with virtio-forwarder and sriov, numa affinity can be enabled via the numa affinity of the VF
- technically vdpa devices do not have to be associated with VFs; they can be implemented purely in software.
- as an initial implementation we could leverage the numa affinity of the vf when using the sriov nic agent as the ml2 driver
- for other backends that do not leverage sriov we can look at numa affinity when they exist.
- placement
- we could track vdpa devices in placement without numa affinity or generic pci passthrough devices in placement.
- just like vGPUs we would add a nested RP for vdpa devices and create an inventory of them.
- we could create 1 rp per parent PF device so that we can use traits to model physnets.
- ovs-dpdk
- ovs-dpdk support is out of scope at this time. it can be added in the future if support for vdpa is added, without modifying neutron, by simply looking at the datapath type in the port bindings to determine if it's dpdk or kernel ovs.
- linuxbridge
- in theory adding the vf representor to a linux bridge is all that is required; however at this time no vendor has implemented that yet.
- os-vif would handle adding the vif on nova's behalf but the linux bridge plugin would have to be enhanced to do this.
- there is no documentation on using vdpa with linux bridge but i'm told it should just work in the future.
- on the neutron side it would just have to accept the vdpa vnic_type in the ml2 driver but that is currently out of scope.
- sriov-nic-agent
- at present vdpa can only be used with Mellanox nics in switchdev mode. in that mode they cannot be managed correctly with the sriov-nic agent so this is out of scope.
- future nics from other vendors might support non switchdev mode but until that happens ovs is the only control plane that we can enable.
- AGREED:
- Go with the PCI tracker based solution to gain the NUMA affinity support for free
- Add the minimal placement interaction; the capability trait
- Discuss moving PCI tracking to Placement separately
- We need a nova spec
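- Sketch of the two libvirt XML pieces mentioned under "nova changes": generating the vdpa interface element and rewriting its /dev/* path for the destination host on live migration. Assumptions: the function names and example device paths are illustrative; the element layout follows libvirt's `<interface type='vdpa'>` form, and the real nova code may build it differently.

```python
import xml.etree.ElementTree as ET


def vdpa_interface_xml(dev_path, mac_address):
    """Build a libvirt <interface type='vdpa'> element as a string."""
    iface = ET.Element("interface", type="vdpa")
    ET.SubElement(iface, "mac", address=mac_address)
    # The vhost-vdpa character device on this host, e.g. /dev/vhost-vdpa-0.
    ET.SubElement(iface, "source", dev=dev_path)
    ET.SubElement(iface, "model", type="virtio")
    return ET.tostring(iface, encoding="unicode")


def retarget_vdpa_device(interface_xml, dest_dev_path):
    """Point the interface at the device path claimed on the destination."""
    iface = ET.fromstring(interface_xml)
    iface.find("source").set("dev", dest_dev_path)
    return ET.tostring(iface, encoding="unicode")
```

- The destination path is the piece of information that would have to travel in the extended migration_data object.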
- Discussion around port flavor / default vnic_type on a neutron network
- similar to network qos policy that is the default qos policy of the ports created on that network
- => johnthetubaguy + mlavalle to continue discussing
- NEXT STEP: John to draft a spec
*Nova PTG topics
*Wednesday (28th) 15:10 UTC - 16:59 UTC - dev process and team related topics
- (gibi): Victoria retrospective
- What went well?
- We closed 9 bp out of the 16 approved, this is 56% completion rate
- I think the runway process helped focusing on the almost done bps +1
- (gibi): bauzas suggested that you can optionally put your name on a runway slot item to indicate you will review it.
- We managed to push down the number of untriaged bugs and also kept the number low during the last couple of months+1000 +1
- We have a lot of stale stuff in launchpad, we can try to look at them during W.
- E.g. Move bugs in Fix Committed to Fix Released state
- E.g. in progress bugs without open patches
- Incomplete bugs expire automatically
- Check bug in progress without any tag
- No API microversion in Victoria :).
- Or DB migrations (though that one's not new)
- What should we do differently?
- johnthetubaguy: my time management/availability has got terrible again :'( working on that...
- lyarwood: Create more ER queries & LP bugs when CI issues are encountered, less blind rechecks! +1
- (stephenfin) I can do this but I have no idea how /o\ Something for our docs?
- => Let's report critical bugs for intermittent gate failures.
- => lyarwood will document this.
- (stephenfin) Interaction with placement probably needs to grow again (there's a topic for that later)
- (stephenfin) As always, growing our core list would be nice, though that's easier said than done :)
- Any other feedback
- (lyarwood) CODEOWNERS
- Do we want to codify https://wiki.openstack.org/wiki/Nova/BugTriage in the tree somehow so interested devs can be found and even automatically added to reviews?
- While this was introduced as a platform specific thing it does appear we could use tooling locally to query the file
- (gibi): the table on the wiki is pretty outdated and also I feel that we are not really following the process described on that wiki. So I support rethinking our bug triage process as a whole and automating what we can. But please do not codify the current table in git as that only increases the confusion
- https://wiki.openstack.org/wiki/Nova/BugTriage#Tag_Owner_List this is what i normally look at not the CODEOWNERS
- We already have the MAINTAINERS file pointing to this wikipage https://github.com/openstack/nova/blob/master/MAINTAINERS
- this too? -https://wiki.openstack.org/wiki/Nova#Developer_Contacts
- (gibi): PTL guide can be extended with an extra step to trigger an update of the list at the beginning of every release
- NEXT STEP:
- lyarwood will propose the initial list in gerrit
- gibi will propose a PTL guide update to refresh the list regularly
- (stephenfin) Do we still need release candidates?
- (bauzas) The alternative being https://releases.openstack.org/reference/release_models.html#cycle-with-intermediary
- I personally tend to give credit to the existing model, which allows us to focus more on bugs during the RC period
- (gibi): I think cycle-with-intermediary also requires a release around RC1 so we would not get rid of the RC
- (gmann) yeah at least one release for the cycle.
- (gibi): I like the fact that we feature freeze at M3 and slow down landing code close to RC1. This makes it possible to re-focus my time on bugs and incoming spec reviews instead of the code reviews.
- (gibi): This RC period went pretty smoothly, we did not detect any fresh regression that needed immediate action and an RC2. If we can keep this track record then I can be convinced that we can land more code after RC1 but before the release on master branch.
- AGREED:
- Keep the current model
- What not to merge to master between RC1 and final release:
- Massive changes (e.g. refactors) that make bugfix backporting hard due to conflicts
- (gibi): Placement project is now without PTL
- (bauzas) Deprecate os-hypervisors API
- Are we on par with osc-placement?
- What signal do we care about for the API deprecation?
- bauzas to spec up
- (sean-k-mooney) hypervisors api
- in most cases i think you can get more correct info from placement
- can we deprecate it in W and remove it in X+?
- (stephenfin) +1 to deprecation at least, +0 to removal for now unless it saves us a lot
- (stephenfin) What are the gaps? What can it give us that placement can't? Can we close those gaps, if any? +1
- Horizon depends on os-hypervisors to show capacity to the admin
- cpu, ram, disk without overallocation
- horizon could use the old microversion
- os-hosts APIs were previously deprecated in favor of os-hypervisors API - https://github.com/openstack/nova/blob/master/api-ref/source/os-hosts.inc
- The current_workload is the number of tasks the hypervisor is responsible for. This will be equal or greater than the number of active VMs on the system (it can be greater when VMs are being deleted and the hypervisor is still cleaning up).
- that is what we say in the current api ref
- AGREED:
- Sanitize the os-hypervisor API in the proposed microversion
- Add a placement CLI command that list all the info from placement in one shot
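- Sketch of the capacity figures Horizon would need if they were derived from placement inventories instead of os-hypervisors. The inventory fields (`total`, `reserved`, `allocation_ratio`) are the standard placement ones; the function names and the sample numbers are illustrative only.

```python
def capacity(inventory, with_overcommit=True):
    """Capacity of one inventory record: (total - reserved) * ratio."""
    usable = inventory["total"] - inventory["reserved"]
    if with_overcommit:
        return int(usable * inventory["allocation_ratio"])
    # "Without overallocation": ignore the allocation ratio.
    return usable


def summarize(inventories, with_overcommit=False):
    """Map each resource class (VCPU, MEMORY_MB, DISK_GB, ...) to capacity."""
    return {rc: capacity(inv, with_overcommit)
            for rc, inv in inventories.items()}
```

- Example: `summarize({"VCPU": {"total": 16, "reserved": 2, "allocation_ratio": 4.0}})` gives `{"VCPU": 14}`, and 56 with `with_overcommit=True`.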
*Thursday (29th) 13:00 UTC - 15:00 UTC - deprecations
- (stephenfin) Tech debt/simplification
- Would like to continue the vein of reducing nova's KLOC because nova has too many dark, rarely explored corners that make things harder to parse, particularly for a smaller maintenance team
- (sean-k-mooney) devname support in pci passthrough list
- i wanted to remove it in train but it still causes issues when you do OS upgrades or firmware updates
- could we deprecate it in W with the replacement being just using the pci address or vendor id/product id instead.
- devname also only works for nics unlike the other options.
- installers could be extended to translate the devname to the pci address to provide an upgrade path and they can then support that for as long as they like.
- (stephenfin) I started on this https://review.opendev.org/#/c/670585/ Based on that discussion it seems we should bring this up with deployment frameworks?
- AGREED:
- Deprecate devname support with a bugreport and a big warning at nova-compute startup
- Remove the support later
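- Sketch of the upgrade path suggested for installers: translating a `devname` whitelist entry into an `address` entry. Assumptions: the function names are illustrative, and the lookup relies on `/sys/class/net/<devname>/device` being a symlink to the PCI device directory, which holds for PCI NICs but not necessarily for every interface type.

```python
import os


def pci_address_for_devname(devname, sysfs_net="/sys/class/net"):
    """Resolve a network device name to the PCI address behind it."""
    device_link = os.path.join(sysfs_net, devname, "device")
    if not os.path.exists(device_link):
        raise LookupError("no backing device found for %s" % devname)
    # The symlink target's last path component is the PCI address.
    return os.path.basename(os.path.realpath(device_link))


def translate_whitelist_entry(entry, sysfs_net="/sys/class/net"):
    """Return a copy of a whitelist entry with devname replaced by address."""
    if "devname" not in entry:
        return dict(entry)
    translated = {k: v for k, v in entry.items() if k != "devname"}
    translated["address"] = pci_address_for_devname(entry["devname"],
                                                    sysfs_net)
    return translated
```

- Other keys such as `physical_network` are carried over unchanged, so only the device match changes.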
- (stephenfin) DB migrations < Ussuri
- These take a not-insignificant amount of time to test and deploy and their presence seems particularly egregious for things we don't even support anymore
- Removal means folding the individual migrations into the initial migration.
- Would also allow us to remove a number of nova-manage commands.
- Can we switch to alembic like everyone else? zzzeek says it's "way better" (TM)
- (melwitt): There were discussions in the past around this and reasons not to move to alembic, we need to dig up the history on it to consider in a new discussion.
- Nevermind, I couldn't find anything.
- (stephenfin) Move to alembic
- We have limited DB migration to Train and no API DB migrations since Stein (iirc); if we were ever going to do this, now (post-cleanup) would be the time
- Any reason not to, if it wasn't crazy complex?
- Backports would have to be rewritten, but they're exceptionally rare, particularly when we're not actually making DB changes
- sqlalchemy-migrate is not maintained. Alembic is maintained.
- (stephenfin) johnthetubaguy says there might be some issues with online/offline migrations, probably just need to make sure we can add new columns and FKs OK
- AGREED:
- make a spec defining the upgrade path from sqlalchemy-migrate to alembic
- stephenfin will do a PoC patch with a complicated migration (index, FK)
- (stephenfin) Support for schedulers != filter_scheduler
- We deprecated this support in Ussuri. Would simplify a lot of our code paths if we knew we could rely on placement being present. Will require removing that safe_connect decorator but efried had patches started for this
- (melwitt): I notice the @safe_connect decorator also swallows ks_exc.ConnectFailure so that's likely one of many reasons it'd be good to remove it.
- we have required placement for nova to function for several releases. the compute agents require it to function now so we can remove the checks for placement even if we allow the scheduler to be pluggable.
- with that said yes i would be +1 on simplification
- AGREED:
- (stephenfin) Shadow tables+1
- Does anyone rely on these? Could we make them optional or remove them entirely?
- i would really like to see them gone and soft delete in general too.
- (melwitt): FWIW they have helped me identify bugs around archive (example: https://review.opendev.org/757656) so if soft-delete exists, then archive exists, and the shadow tables are helpful in debugging archive issues. So from my perspective, removing them should coincide with removal of soft-delete.+1 i had put soft delete and shadow tables together in my head but i guess they are separate. i do think soft-delete should be considered for removal or deprecation too.
- local delete relies on marking something deleted
- archive can be changed to not move data to the shadow tables but just purge the deleted rows from the db
- (melwitt): Can already do this via 'nova-manage db archive_deleted_rows --purge' (though again, it helps me debug when users don't do that ;) )
- Do see if this makes moving to alembic easier
- AGREED:
- removing the shadow tables needs deployer discussions. Send mail to ML
- rediscuss this at the next PTG
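- Toy sketch of the two archive behaviours discussed above: moving soft-deleted rows into a shadow table versus just deleting them. This uses an in-memory SQLite database with a made-up three-column schema; it is not nova's schema or its archive code, and it assumes `deleted != 0` marks a soft-deleted row.

```python
import sqlite3


def make_db():
    """Create a main table and its shadow twin with identical columns."""
    conn = sqlite3.connect(":memory:")
    for table in ("instances", "shadow_instances"):
        conn.execute(
            "CREATE TABLE %s (id INTEGER PRIMARY KEY, name TEXT, "
            "deleted INTEGER DEFAULT 0)" % table)
    return conn


def archive_deleted_rows(conn, use_shadow_table=True):
    """Remove soft-deleted rows, optionally copying them to the shadow table."""
    where = "WHERE deleted != 0"
    if use_shadow_table:
        conn.execute(
            "INSERT INTO shadow_instances SELECT * FROM instances " + where)
    removed = conn.execute("DELETE FROM instances " + where).rowcount
    conn.commit()
    return removed
```

- With `use_shadow_table=False` nothing is left behind for debugging, which is the trade-off melwitt points out.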
- (stephenfin) RPC version bump
- Would allow us to remove a *lot* of hairy, complex code
- Q: Are there disadvantages to this?
- (gibi) Remove code that guards against old, not supported compute versions
- (sean-k-mooney): virtio-memory balloon.
- nova only uses the virtio memory balloon to collect metrics to send to ceilometer.
- nova/libvirt used to allow turning it off by disabling the memory metrics collection but libvirt started adding the balloon device by default at some point
- can we just remove it entirely or disable it by default and provide a way to opt in?
- keeping it adds guest overhead in the form of qemu needing to periodically update it with stats on the guest memory usage.
- nova does not actually configure the balloon to auto deflate on host OOM so it provides no benefit in that edge case
- to use actual automatic memory ballooning you need to write an external agent to actively manage the guest memory; oVirt used to have one called MOM (Memory Overcommitment Manager) https://github.com/oVirt/mom
- but that is not actually supported even in ovirt based products so i don't know of any valid usage of the balloon other than for ceilometer metrics.
- https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.mem_stats_period_seconds
- AGREED:
- turn virtio-memory balloon off by default, option to turn it on
- when it's on, always do auto deflate, to help those wanting to overcommit loads of memory
- DOCS: be clear to operators on how to help avoid OOM kill, i.e. have swap
- have a specless bp
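- Sketch of the guest XML the AGREED behaviour implies: no balloon by default, and a virtio balloon with auto deflate when an operator opts in. Assumptions: the function name and its parameters are illustrative (no config option name was decided here); the `model`, `autodeflate` and `<stats period=...>` attributes follow the libvirt domain XML format.

```python
import xml.etree.ElementTree as ET


def memballoon_xml(enabled=False, stats_period=10):
    """Build the <memballoon> element for a guest."""
    if not enabled:
        # model='none' tells libvirt not to add a balloon device at all.
        balloon = ET.Element("memballoon", model="none")
        return ET.tostring(balloon, encoding="unicode")
    balloon = ET.Element("memballoon", model="virtio", autodeflate="on")
    # Stats collection interval in seconds, as used for memory metrics.
    ET.SubElement(balloon, "stats", period=str(stats_period))
    return ET.tostring(balloon, encoding="unicode")
```

- Emitting `model="none"` explicitly matters because libvirt adds a balloon device by default when the element is absent.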
- (stephenfin) Remove the old nova-network-related parameters from various APIs
- eg. The 'security_groups' field in the 'POST /servers' API (currently a no-op)
- isn't it used for the default security group of nova created ports? e.g. if you use --network and nova creates the ports.
- (stephenfin) Nope, we validate in the API, build up security group objects...and then ignore it entirely if using neutron.
- AGREED:
- nothing to remove now
- re-discuss it when new items collected for removal
- (gmann): Deprecation warnings for Policy
*Any other business
- (tbarron/gouthamr/manila contributors) VirtIO-FS
- Please consider scheduling this on Thursday(15:00-17:00 UTC) since the rest of the times coincide with openstack manila PTG discussions :)
- Feature: https://virtio-fs.gitlab.io/
- Concept: Looking to discuss the possibility of exposing shared directories to VMs with the nova team
- startup etherpad: https://etherpad.opendev.org/p/nova-manila-virtio-fs-support-vptg-october-2020
- demo of manual remote mount to staging dir on fedora33 host and exposing to alpine guest via virtiofs: https://asciinema.org/a/IZ7UrhwspxBN63XsOl9JrTcUX?speed=1.5
- johnthetubaguy: I remember we deferred this in the past, but I think it's because vsock wasn't ready/stable. I like this more than I used to, now I have an HPC hat. Currently I get around not having this by telling people to use K8s, but the networking is *hard* and partly broken really (for the CephFS case). Basically I like this so I don't have to push CephFS through neutron routers.
- johnthetubaguy: I think this has to be a separate kind of attachment to Cinder, but "similar", it's an FS not a block
- (sean-k-mooney): yes i discussed this with tbarron a little on irc before and the idea was to have a share attach/detach like cinder but for manila shares, host mounting the shares and exposing them to the guest via virtio-fs +1, I think we almost agreed on this direction before? maybe :)
- one advantage to this approach is that the guest now would have an abstraction over the backend storage layer. e.g. nfs or cephfs are normalised to virtio-fs in the guest, allowing windows to use both as it just looks like a virtio-fs share. Right, guest doesn't need to know about remote FS protocols, install client packages, get auth keys, etc.
- yeah, I need this for Lustre really, as that is even more terrible with version problems
- cloud-provider openstack could support this new approach
- yeah, we have manila-csi in CPO now and would extend it and make sure that attaching shares works in that context. CPO will basically just call into OpenStack services but will be an important use case to be accommodated in new APIs.
- supported since QEMU 5.0 and libvirt 6.2. Nova's next min is QEMU 4.2, libvirt 6.0.0
- UCA support?
- not currently; the victoria branch just has the 20.04 version, we would need 20.10's packages
- we can use fedora 33 if we needed
- we only need the newer libvirt; the qemu version will be fine on 20.04
- i could update https://opendev.org/x/devstack-plugin-libvirt-qemu/src/branch/master/devstack to provide this too. it compiles libvirt and qemu from source but i have not updated it in a year or two so i need to add 20.04 support.
- apparently alpine can be a possible guest in the gate where this works
- Claims performance is good now, and I think we tested this for kata: https://www.stackhpc.com/images/IO-Performance-of-Kata-Containers-TheNewStack.pdf
- We are mostly discussing on having another full API support exactly like we have with BDMs plus a new os-something library as a common lib between Manila and us. That's a massive effort, we absolutely need testing and baby steps.+1+1+1
- How to stage the addition of this, so we make progress?
- API: we can first add adding a share to a running VM, but second thing, changing the create server API?
- Params:
- tag
- manila share uuid
- read vs readwrite
- key initial bits are ensure we add the tag in metadata
- ... but we add the API last
- add simple API first
- Q:os-brick style attachment
- is there an agent on the hypervisor or not?
- NOTE: we can only depend on released libs!
- ganesha nfs in manila does bits of this
- ACTION: agree details in spec
- API in manila that Nova talks to to get the attachment details
- is there anything new we need? Maybe not I guess?
- Ideally something the os-brick style replacement can deal with
- Q: how much does "nova" have to do, vs os-brick like thing, vs libvirt
- system calls for running the virtio-fsd agent, who would be responsible for them?
- lyarwood: I was about to ask this earlier, FWIW it isn't listed in the libvirt docs as something the caller would worry about.
- Testability
- needs to run in the gate.. seems possible
- Upgrades
- Should we wait for the whole cloud to be up to date ?
- yes, usual service version checks for this one
- johnthetubaguy:
- I should say, this excites me, I will see if we can work on it from StackHPC, but likely my time to help review it is the most I offer right now
- ... I want to add lustre support using this, medium term, not got anyone to pay us to do that yet+1
- AGREED
- its in scope, its a new API for attaching a share to a running VM
- Do the POST /servers API extension later
- you can find shares via tags
- share concepts with volume mount.. but note we hate loads of that :)
- it is multi-cycle work
- First spec deals with a bite size step forward, proposed usecase being a fs attach/detach to a running instance
- (gmann) Allow system admin to create server for other projects
- Context: With the new API policy and scope_type, many API operations became system scoped, like hypervisor info, host info etc.
- Create server is a project level API, the default policy is PROJECT_MEMBER, and in code the request context's project_id is used as the project to create the server for.
- But in the create server request we do have some system level request parameters, like forcing a server onto a particular host, and that is controlled by the separate policy 'os_compute_api:servers:create:forced_host'
- Other examples besides force_host are creating a server with a zero disk flavor, attaching external networks, and requesting a destination.
- which defaults to PROJECT_ADMIN for now because there is no way for system level users to create a server request ( TODO in above link ^^)
- To allow system level roles to request server creation for other projects with system level request parameters, we need to allow project_id as one of the request parameters
- (melwitt): Just an idle comment, disappointed to know we need this ... I had thought that scope types would work in combination with the credential project/user so that we still use the project_id and user_id from the RequestContext like we normally do.
- Yeah with access control (say to allow server with zero disk flavor) it works but for things like force_host they need to get host info and pass in request.
- Yeah when I first commented I didn't realize it was just for getting host info via a separate API. it makes sense to me (and I thought it was intended) that if a user needs to do this it is expected that they have the appropriate role (and use system token) to query for the host list and then when they want to boot the server, they do that with a project scoped token. IMHO it feels like it's working as intended today.
- For this special case yes, but if we want to have complete isolation between system level tokens and project level tokens then it does not. It means a single user needs to have a system level token to perform a project level operation, which they can use to get other project info :)
- Right, I honestly thought that was the point of all of the policy and scope types work, to carve up the API permissions/access to need system or project token, then the appropriate roles. And that lumping them together was the old way and what we were trying to get away from. Just my perspective. Like johnthetubaguy said, we "could" do it but it can get confusing in a similar way it was confusing with the legacy admin model. And I'm not sure it's a good idea.
- and the API will create the server for the requested project_id if any, instead of the context's project_id; if no project_id is requested then use the context's project_id like it does currently.
- Or make project_id mandatory in the request body so that each user needs to explicitly mention the project they want to create the server for, even if it is their own project?
- This way we can make the 'os_compute_api:servers:create:forced_host' policy default to system_admin. Eek. We should rather kill this horrible hack. It's super needed :'( I know, that's the difference between a wish and the reality. Oh, just tbc, I was speaking of the semantics with the AZ field that is hacky, not the fact to ask for a target if you're admin, which sounds a reasonable usecase. true, the az syntax is mad.
- This change needs a microversion as we need to modify the request body.
- neutron advanced thing https://github.com/openstack/neutron/blob/master/neutron/policy.py#L43 "context_is_advsvc"
- slightly wacky idea is that if you use tenant isolation with aggregates and you are a project admin you could list the hosts for your project only as a project admin.
- in general however i don't think project admins should be able to list hosts.
- (melwitt): (after the discussion) it seems to me that the way it works today is the model we were trying to get to with the policy changes and scope types, that you need to request the appropriate token per API to carry out the actions you need and that you need to be given appropriate roles along with that. If we start rolling things together, wouldn't that put us backward more toward what we originally had with the global admin concept?
- AGREED:
- Document better how to do the above use case today.
- Gather information from the users of the new policy system and rediscuss this use case based on that
- (lyarwood) Image and flavor defined ephemeral storage encryption
- https://review.opendev.org/#/c/752284/
- tl;dr Introduce image properties, flavor extra_specs and configurables to bring Nova's ephemeral storage encryption in-line with the UX of Cinder's volume encryption using encrypted volume types.
- The spec also suggests the use of compute traits and a pre-filter when initially scheduling instances.
- The alternative or possible followup in a future release would be to expand the block_device_mapping_v2 API to handle this directly instead.
- The libvirt virt driver implementation will need an additional spec, I'm currently blocked on some internal debt around how we transform API bdms into BlockDeviceMappings into driver BDMs before finally into block_info within the driver.
- Open Questions:
- Does this require a microversion to indicate support for the new flavors and extra specs? I'm assuming yes as was the case with BFV rescue etc.
- Was that to just advertise it's available, not restrict its use?
- normally we don't bump microversion for extra specs or image properties so i would assume no+1+1
- Enabled / Disabled config, but only works with LVM
- flavor: explicitly enable or disable?
- https://github.com/openstack/nova/blob/master/nova/compute/api.py#L585-L609 we do the validate in _validate_flavor_image_nostatus
- Idea: metadata report if your disks are encrypted or not
- (Luzi) There is currently ongoing work for Image encryption in Glance, they do have some encryption metadata: https://specs.openstack.org/openstack/cinder-specs/specs/wallaby/image-encryption.html#proposed-change
- AGREED:
- No microversion bump is needed
- but a validator should be added to 'nova.api.validation.extra_specs.hw'
- add encryption to metadata for ephemeral disks so users can confirm if their disks are actually encrypted at rest.
- I assume we should also do this for encrypted volumes as well?+1
- you can look at the Image encryption in Glance for an example
- Deprecate the old config param
- (lyarwood) Device detach improvements
- (gibi): Do we talk about both volume and interface detach?
- (lyarwood): Yes this would cover both disks and interfaces.
- API
- Reject requests to detach devices from instances on down computes
- Recent examples of async (cast) APIs that have started to do this
- johnthetubaguy: should we do this for all operations, similar to the task check decorator we have?
- Except delete
- (melwitt): +1, and we "guarantee" we will delete it later via the periodic reap task in nova-compute so from a general perspective we are accurately deleting async as communicated to the user via the 204
- history: I think this is because people should avoid getting billed for things they don't want, regardless of if its possible to complete the request right away
- Yeah, totally agree delete should never be blocked (for many reasons).+1
- AGREED:
- reject the operation with a vague error message to prevent a leak
- For volumes introduce a force-detach admin only action to allow operators to detach volumes from down computes
- Call c-api to delete the attachment moving the volume to available
- AGREED:
- libvirt
- Switch to using libvirt events
- (lyarwood) Nope, I'm talking about actual libvirt events for device detach, this isn't related to block jobs and their associated QEMU events that n-cpu shouldn't be waiting on either IMHO. We should only care about events directly from libvirt. — ah-ha, okay, separate topic - the below QEMU BLOCK_ events thing; I left it for the record)
- Specifically the following device detached events from libvirt:
- WIP code I posted during the Focal detach issue https://review.opendev.org/#/c/749929/ this just wires up some logging of the events.
- The idea here would be to remove the retry loops that hit the undefined QEMU behaviour when we send overlapping requests to detach.
- AGREED:
- [SKIP - not for Wallaby] (kashyapc) - the below QEMU events are a separate topic to Lee's "libvirt events"; leaving the below here for the record — I'll file a separate Blueprint for the record
tl;dr -- Switching to these "events" is largely to avoid race conditions for long-running block device operations like blockCopy(), blockCommit(), etc that Nova runs into sometimes.
Some available events to 'listen for': BLOCK_IO_ERROR; BLOCK_JOB_{CANCELLED, COMPLETED, ERROR, READY}
Note: Switching to this event infrastructure may require non-trivial reworking of existing code, brand new unit tests, and probably more. But it solves some flaky race condition problems.
Additional info, for others who don't normally dwell in this area: FWIW, in the past I did a small "dissection" of where these events come into the picture while debugging a Gate problem: http://lists.openstack.org/pipermail/openstack-dev/2016-October/105158.html
*We stopped here on Thursday
- (sean-k-mooney): add a server recreate API
- Example:
- openstack server recreate --image <image-id> --flavor <flavor-id> <server>
- recreate is a data and port/resource preserving move operation by default
- image and flavor would be optional
- if neither image nor flavor is passed then recreate consults the scheduler for a new host that is compatible with the latest state of the image metadata and flavor extra specs instead of the embedded copy
so this is a cold migrate; the new thing is updating the image metadata if that was updated since the boot
- if image is passed recreate is a rebuild that is allowed to change resource usage/numa topology since it is a move operation and can/likely will change host.
- if flavor is passed recreate is a resize that auto confirms.
- this is negotiable but i think we should not have to manually confirm resizes or migrations today, so i would prefer not to introduce that requirement to recreate. i would prefer to eventually remove resize verify and explicit confirm/revert from the other apis but that is out of scope. it is the reason why i would prefer this api to work like rebuild, live-migrate and shelve rather than emulate resize.
- This can be closely emulated today by setting resize_confirm_window to a small positive number
- if both flavor and image are passed, this action recreates the server by reusing its volumes and ports etc but reimaging the root disk and updating the flavor as a single atomic operation
- currently this requires stopping the vm, then rebuilding and then resizing before starting it again.
- this is useful for cloud instances that use rebuild for upgrades where the resource requirements of the updated stateless application differ from n to n+1
- this is an edge case that customers have encountered and providing a single atomic operation would simplify their workflow.
- recreate could support shelved instances
- i think resize should support shelved instances today, as resizing a shelved instance is just confirming its new flavor is compatible with the image and then updating the flavor in the db.
- the new flavor will take effect when the instance is unshelved.
- this allows upgrading of shelved instances to new flavors if operators want to discontinue an old flavor without
- this could be deferred to a follow up that adds support for shelved instances to resize and recreate.
- (melwitt): Any reason not to do this as 'openstack server rebuild --keep-ports-and-volumes'? Maybe it's just me but that would feel more intuitive.
- yes, i want this to be able to change host so that the requirements in the image can change. rebuilds are always to the same host; they are not move operations.
- also we want to be able to change the image and flavor at the same time in one operation
- (melwitt): Gah, sorry, I meant resize. But noted on wanting to change image and flavor at the same time :/
- (johnthetubaguy): Is this really resize with --rebuild <image-id>?
- It feels close to what we said about "no more orchestration" APIs +1 as far as I see a heat template can automate the stop + rebuild + resize + confirm + start process+1
- New Resize API microversion idea:
- get rid of confirm resize / revert resize, or default to auto confirm resize+1
- -- flavor to the same flavor, allow that if extra specs have changed, or maybe allow always+1
- shelve offloaded, allow resize, in the new microversion, disallow auto confirm resize
- -- image addition, so you can avoid copying the disk during resize -1 from me (bauzas)... if this works with volumes
- AGREED:
- new microversion
- auto confirming resize by default
- allow resizing to the same flavor to update extra_spec
- allow resizing shelve offloaded instances
- wait for the rebuild of bfv solution before we do the --image part
- (stephenfin) NUMA in placement
- This is clearly a hard thing to do. Given reduced resources, is it still a thing we care about?
- (alex_xu) probably leaves a lot of features that can't do numa affinity: vpmem, gpu, cyborg devices
- is it ok to continue using the existing host numa object to do the numa affinity?
- What is it blocking? What does it give us besides warm fuzzy feelings?
- (melwitt): Existing NUMA and affinity claiming processes race and cause reschedule cascades + NoValidHost. It would be best to solve this (do the right thing: in placement), else customers continue to suffer with it.+100
- (sean-k-mooney): we can also solve this without placement, we just need to create the claim in the resource tracker on the host in the conductor/scheduler. we have never needed placement to fix that.
- (melwitt): So what would that look like? "If NUMA, do a two-phase commit style claim on resource tracker, else claim in placement"? (That's beside the fact that placement was created to perform all the placement things).
- (sean-k-mooney): in the conductor/scheduler where we currently do the placement allocation claim we would make a call to the compute node to claim the resource in the resource tracker, similar to a move claim for cold migration. if that succeeds then we proceed with that host; if not we free the placement allocation, go to the next alternate host and repeat. this is the same solution we proposed before placement was created and it's still equally valid today.
- (melwitt): Sorry, I don't see how adding a second way of "claims in the scheduler" is valid after we chose to do it in placement after a lot of discussion. To elaborate (since this discussion will likely be before I come online in my time zone), I feel like we did a whole lot of work to get placement going and do claims in placement. And now it feels like we're saying, freeze placement and add another way to do a claim in the scheduler, and do all the things without placement going forward. It is a thing we could do, but I'm not sure "valid" is the best word.
- I like a two-phase commit solution for things like server group affinity / anti-affinity. That stuff really sucks in a similar way. There is always a point where the hypervisor has to reject something... maybe we just draw the line at NUMA for now? Seems a shame though :'(
- (gibi): For me it feels like we have a wish and a dream to do NUMA in placement but we don't have the human resources to make it happen. Do we have developers who will spend time on implementing the necessary changes? If yes, then it makes sense to agree on those changes.
- (sean-k-mooney): i agree with both points. we do really want to do numa in placement, but wallaby will be the base of our next lts downstream product and they are pushing to land as many user visible features as we can, since it will be 3 more releases before our next lts based on Z, e.g. anything that is not in Wallaby will not be available to customers until Z is released in our downstream product, modulo a few very small feature backports. from an engineering point of view this is still a high priority but from a pm point of view it's not. the flip side of this is we have been saying let's wait for placement to have support for this for 3 years now; it predates the creation of placement and it's why nested resource providers were added. placement now has support for nested resource providers but i don't think this is something we could have delivered years ago and i don't think it's reasonable to keep deferring this just because we don't do it in placement. we will eventually do it in placement but it will be hard to justify working on this in wallaby from a redhat point of view.
- (melwitt): I know this is going to be an unpopular opinion, but I feel like at some point, we do need to learn how to develop in placement being that critical functions we depend on are in it. From my perspective, I think of it as though part of nova is in placement and it feels like we keep pushing it off to learn to develop in it. Are we thinking of effectively freezing placement except for bug fixes from now on?
- +1 melwitt here. Could I turn this upside down though... can we think of an easier way of doing this that involves breaking some "rules".
- (gibi): I don't think that the real blocking issue is that we don't know placement internals. I think we in general lack developers who dedicate time to make the necessary code changes happen. I totally agree that it would be good to have NUMA in placement.
- (melwitt): Not sure I understand this, the developers would and should be the devs working on this feature right (i.e. there are devs who dedicate time to work on the feature)
Stupid question, is there an easier but non-perfect path forward?
Step1: if we agree a design, we have the option of asking for help for people to follow the spec and implement it
i.e. what about every nova compute reporting a hypervisor for every NUMA node, a bit like ironic-compute reporting lots of ironic nodes?
this breaks targeting a specific hypervisor, but whatever, you just target a specific NUMA node now.
for a new hypervisor, or a hypervisor that is empty (has had all its VMs live-migrated off it), we "just" need to start reporting two sets of resources
we make it opt in, a bit like opting into doing pCPU vs vCPU... I presume the hard bit is transitioning... is there a way to make this feasible if we break a few rules/guidelines? like only allowing empty hypervisors to move from a NUMA reporting system to non-NUMA reporting system?.. maybe this is stupid.
- Context : https://specs.openstack.org/openstack/nova-specs/specs/victoria/approved/numa-topology-with-rps.html
- AGREED:
- Implementing the two phase commit is less complex than modeling NUMA in placement and both could stop the race condition on resource claim
- move the current NUMA in placement spec to backlog
- stephenfin will propose a spec about two phase commit
- we only approve the spec if the PoC code is proposed
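A rough sketch of the two-phase claim loop described above (claim in placement, then claim on the host, free the allocation and try the next alternate on failure). This is only an illustration under stated assumptions: `FakeHost`, `FakePlacement`, `ClaimFailed` and `select_host` are hypothetical stand-ins, not Nova or placement APIs, and the "NUMA cells" counter is a deliberately crude model of what the resource tracker would check.

```python
from dataclasses import dataclass, field


class ClaimFailed(Exception):
    """Raised by a host when it cannot honour a claim (hypothetical)."""


@dataclass
class FakeHost:
    """Stand-in for a compute node's resource tracker (hypothetical)."""
    name: str
    free_numa_cells: int
    claims: list = field(default_factory=list)

    def claim(self, instance_id: str, cells: int) -> None:
        if cells > self.free_numa_cells:
            raise ClaimFailed(f"{self.name}: not enough NUMA cells")
        self.free_numa_cells -= cells
        self.claims.append(instance_id)


class FakePlacement:
    """Stand-in for placement allocations (hypothetical)."""

    def __init__(self):
        self.allocations = {}

    def allocate(self, instance_id: str, host: str) -> None:
        self.allocations[instance_id] = host

    def free(self, instance_id: str) -> None:
        self.allocations.pop(instance_id, None)


def select_host(placement, hosts, instance_id, cells):
    """Try each candidate host in order; return the first that accepts."""
    for host in hosts:
        # Phase 1: coarse-grained claim in placement.
        placement.allocate(instance_id, host.name)
        try:
            # Phase 2: fine-grained claim on the host itself.
            host.claim(instance_id, cells)
        except ClaimFailed:
            # Roll back phase 1 and move on to the next alternate host.
            placement.free(instance_id)
            continue
        return host
    return None
```

The point of the sketch is only the ordering: the host gets the final say, and a rejected host never leaves a dangling placement allocation behind.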
- (sean-k-mooney): socket affinity as a new numa affinity policy for devices or memory.
- because of the sub numa stuff? bummer.
- ya, basically as a simpler alternative to doing numa distance modeling. just do affinity to any numa node on the same socket
- numa distance is better but might be more complex
- this is important for cluster on die/sub-numa clustering on intel cpus or amd zen cpus, including zen2 and zen3
- AGREED:
- more discussion needed in an additional spec
- (sean-k-mooney): numa balancing https://bugs.launchpad.net/nova/+bug/1893121
- this was discussed when we first added numa support but was originally declared out of scope.
- today we deterministically order the numa nodes and iterate over them using itertools permutations.
- we have a minor optimisation to avoid numa nodes with pci devices if you do not request one, by sorting the list before passing it to permutations
- we should do the opposite if you request a pci device but don't today
- this means that if you have a pci device on numa node 1 and the prefer pci affinity policy and hw:numa_nodes=1, then if the vm can fit on numa node 0 we will select it instead of numa node 1 even though we said to prefer numa affinity, since we bail out of the function when the first numa node matches.
- ideally we would also sort on other factors too
- since python supports stable sorts, complex sorts can be achieved by multiple stable sorts https://docs.python.org/3/howto/sorting.html#sort-stability-and-complex-sorts
- so we can support numa balancing by doing the following sorts of the numa nodes.
- first sorting by instance per numa node
- then sorting by free memory per numa node
- then by cpus per numa node and finally
- by pci device per numa node.
- i believe this is a bug that should be addressed and backported to all maintained releases. if we do numa in placement, providing this functionality is fundamentally achieved by sorting the allocation candidates instead of sorting the numa nodes in the hardware module. if we don't address this before we add numa to placement then we can't easily address this for existing releases. if we consider this a feature, not a significant performance bug, then it can be addressed with or without numa in placement.
- a similar bug has since been filed independently https://bugs.launchpad.net/nova/+bug/1901371 which i have marked as a duplicate of https://bugs.launchpad.net/nova/+bug/1893121
- AGREED:
- sean-k-mooney will propose patch for this
- have a spec and debate there about the need for a config that allows packing / spreading on NUMA level
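The chained stable sorts proposed for NUMA balancing could look roughly like the sketch below. The `Cell` class, its field names and the sort directions (fewer instances first, more free memory and CPUs first) are assumptions made for illustration; the notes only give the order of the sort keys, not the directions, and this is not the code in nova's hardware module. Because each sort is stable, the last sort applied is the most significant key and the earlier ones only break ties.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical, simplified view of a host NUMA node."""
    id: int
    instances: int
    free_memory_mb: int
    free_cpus: int
    pci_devices: int


def order_cells(cells, wants_pci):
    """Order host NUMA cells by repeated stable sorts.

    The last sort applied is the most significant key; earlier sorts only
    break ties among cells that compare equal on the later keys.
    """
    # Least significant key: fewer instances first (assumed direction).
    ordered = sorted(cells, key=lambda c: c.instances)
    # More free memory first (assumed direction).
    ordered.sort(key=lambda c: c.free_memory_mb, reverse=True)
    # More free CPUs first (assumed direction).
    ordered.sort(key=lambda c: c.free_cpus, reverse=True)
    # Most significant key: prefer cells with PCI devices only when the
    # instance asks for one, otherwise keep those cells for later.
    if wants_pci:
        ordered.sort(key=lambda c: c.pci_devices, reverse=True)
    else:
        ordered.sort(key=lambda c: c.pci_devices)
    return ordered
```

With this ordering a request that wants a PCI device would try the PCI-local node before node 0, which is the behaviour the bug report asks for.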
- (sean-k-mooney): related topic: default numa instance to hw:mem_page_size=any
- we basically need to do this if we track mempages in placement, and we can't track numa in placement usefully if we don't do that.
- we have had several customer and upstream threads related to OOM issues when using numa instances.
- experience has shown that users do not understand how to configure numa related features correctly.
- unless you set hw:numa_nodes=<number of nodes on a host> there is currently no valid way to define a numa instance without setting hw:mem_page_size to a valid value.
- setting hw:mem_page_size or the image property hw_mem_page_size enables per numa memory tracking
- without it we pin the instance to a host numa node without checking the free capacity of that node or claiming the memory.
- the vm will not be able to use memory from other nodes and if you just use hw:numa_nodes=1 we will pin all instances to numa 0 and never use numa 1+
- hw:mem_page_size=any means use small pages not hugepages but allow the image to request a different page size. this is the same behavior as not setting the value in the flavor except
- memory is tracked and claimed in the numa topology blob, allowing instances to use all numa nodes on the host.
- OOM issues will be prevented as memory will be claimed and a host will not be selected if there is not enough memory.
- oversubscription will be disabled for all numa instances
- note the only way to use oversubscription + numa today is if you only set hw:numa_nodes without cpu pinning or explicit page size requests
- this is only valid if you set hw:numa_nodes=<number of host numa nodes> since we don't enable per numa memory tracking, so oversubscription will only "work" if the instance can use memory from all numa nodes
- AGREED
- Treat this as a bugfix with a fat releasenote describing the behavior change
- (lyarwood) Removing the use of persistent domains in the libvirt driver +1
- (lyarwood) We can skip this unless someone else wants to take it as I'm already fully loaded for the cycle.
- (stephenfin) Could you say why we do this at the moment and what the implications are? I'm not sure
- i believe we do this today because it's the default way to interact with libvirt via persistent domains.
- with that said we have never used them the way that libvirt intended them to be used. persistent domains are intended to be used when you don't delete them every time you start an instance, to provide stable device names and other things
- nova throws away the domain every time we start an instance so it provides no benefit to us at all and encourages people to do bad things like modify the domain behind nova's back.
- (stephenfin) On this note, should we be more explicit about adding things that we get automatically on x86 (keyboard, mouse, ...). Translation: do we care about ARM?
- yes i think so. ideally i think we should not be using libvirt to automagically add anything for us and only adding what we require or are asked to add. +1 for being explicit, it causes lots of bugs/CVEs.. I wish we could tell libvirt we only want what we ask for?
- (kashyapc)
- AGREED:
- direction is OK
- we don't have bandwidth right now to work on it. If you are interested please ping lyarwood on IRC
- lyarwood to raise this in the libvirt group etherpad and mailing list.
- (bauzas) Beware of the potential goals, one being python-novaclient being dropped in favor of OSC
- (gibi) policy default to yaml (W cycle goal)
- In nova it was already done by gmann
- AGREED: gibi will propose a patch for Placement
- (stephenfin) Support for the Q35 machine type in the libvirt virt driver
- This is potentially critical for RH leading into OSP 17 and RHEL. Do we understand the problem and can we prioritize work here?
- (lyarwood) I've picked this up from a RH PoV. To answer the question, yes I believe I understand the problem and the work required, namely the need to stash the machine type, stand up nova-next to use q35, resolve the volume extend bug (listed below) and anything else that comes up.
- (kashyapc)
- Need to stash existing machine types before allowing operators to change the default to avoid breaking users by changing the underlying emulated hardware in their instances.
- This will be outlined in a spec to be written shortly
- this is the stable abi thing i was mentioning earlier
- at a minimum we need to record the machine type, which we agreed to do last ptg, i just didn't get to it. (required for q35 change)
- ideally i think we should record all values settable via image properties for an instance (nice to have)
- The libvirt virt driver being responsible for capturing this for existing instances at startup etc.
- Obviously this is going to require some extensive docs/releasenotes setting out the upgrade case, we should also test this in grenade.
- CI
- nova-next: Start testing the 'q35' machine type
- Bugs
- tempest.api.volume.test_volumes_extend.VolumesExtendAttachedTest.test_extend_attached_volume failing when using the Q35 machine type
- same behavior, attachment seems ok from libvirt/nova, then libvirt receives a "DEVICE_DELETED" event from qemu.
- (lyarwood) Can you add that context to the bug please.
- Done
- AGREED:
- lyarwood to write it up as a generic record all the things spec.
- record all image props to allow future changes to defaults
- take special care about upgrade
- stash current machine type at compute startup and hard reboot
- make an upgrade check that warns if there are instances without machine type record
- (gibi) finishing V cycle goal
- We still need to move tempest-integrated-gate back to Focal as the nova libvirt fix has been merged
- lyarwood will follow it up
- Zuul v3 migration
- (kashyapc) Secure Boot spec
- Spec: https://review.opendev.org/759731
- WIP code (before I paused the effort): https://review.opendev.org/#/q/topic:bp/allow-secure-boot-for-qemu-kvm-guests+(status:open+OR+status:merged)
- Q35 is needed for secure boot but we are not depending on the Q35 as a default spec
- Must make sure we enforce this -- yes
- TODO
- (Stephen raised this) Double-check if there's any other host-level config required: there _isn't_ any, off the top of my head, besides the virt component versions below
- libvirt, QEMU versions
- ovmf package (perhaps this is flagged as a domain capability by libvirt?)
- Yes, there's a way, from the spec:
- "Make Nova programatically query the ``getDomainCapabilities()`` API to check if libvirt supports the relevant Secure Boot-related features. [...] This can be done by checking for the presence of ``efi`` and ``secure`` XML attributes from the output of the ``$getDomainCapabilities()`` API."
- image metadata would ask for the trait
- AGREED:
- direction looks good.
- Let's fix up the nits in the spec and approve it
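A minimal sketch of the capability check quoted from the spec above, i.e. looking for ``efi`` and ``secure`` in the domain capabilities XML. Assumptions are flagged loudly: `SAMPLE_CAPS` is a hand-written, trimmed approximation of what libvirt returns, the element paths used here should be checked against the libvirt domain capabilities documentation, and `supports_secure_boot` is a hypothetical helper name, not something in Nova. In a real driver the XML string would come from libvirt's getDomainCapabilities() call rather than a constant.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed approximation of domain capabilities XML; a real
# host reports many more elements (assumption for illustration only).
SAMPLE_CAPS = """
<domainCapabilities>
  <os supported='yes'>
    <enum name='firmware'>
      <value>bios</value>
      <value>efi</value>
    </enum>
    <loader supported='yes'>
      <enum name='secure'>
        <value>yes</value>
        <value>no</value>
      </enum>
    </loader>
  </os>
</domainCapabilities>
"""


def _enum_values(root, path):
    """Return the set of <value> texts under the enum found at `path`."""
    node = root.find(path)
    if node is None:
        return set()
    return {value.text for value in node.findall("value")}


def supports_secure_boot(caps_xml):
    """True if the XML advertises EFI firmware and a secure loader."""
    root = ET.fromstring(caps_xml)
    firmware = _enum_values(root, "./os/enum[@name='firmware']")
    secure = _enum_values(root, "./os/loader/enum[@name='secure']")
    return "efi" in firmware and "yes" in secure
```

A host where this returns False would simply not report the trait that the image metadata asks for.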
- (sean-k-mooney): dnspython 2.0 support
- eventlet currently does not support it
- debian is already shipping it so victoria won't actually work properly.
- rhel9 will probably have it too so we should ideally fix it, but that needs changes to eventlet
- the only thing broken in nova that we know of right now is the vnc console proxy but there might be others.
- AGREED:
- make a release note stating the issue
- gibi will look into the eventlet codebase to see how hard it is to fix
- bump lower constraints on master on eventlet to cap dnspython
- bump upper-constraint dnspython < 2.0 and backport it
- gibi will coordinate with neutron
- (stephenfin) Filters -> placement pre-filters
- Are there other filters we might want to look at replacing with placement queries? ComputeFilter jumps to mind.
- AGREED:
- if there is no external visible behavior change by transforming a filter to a prefilter then specless bp is enough
- Only deprecate the translated filters but not remove them until we gather feedback about the pre-filter (this could be after more than a few cycles +1).
- We can remove the replaced filter from the default filter list
- Emit a debug log for each pre-filter when the pre-filter has changed the allocation candidate request
- (lyarwood) Enabling admin only move operations for instances with associated barbican secrets
- https://bugs.launchpad.net/nova/+bug/1895848
- https://docs.openstack.org/barbican/latest/api/reference/acls.html#patch-v1-containers-uuid-acl
- mgoddard: Feel free to ping me for this one, since I raised the bug.
- Q: Should we try to workaround this in code or just document the suggested workaround from the bug (using a migrator role who can read secrets) as Cinder does for other issues during the initial creation of an encrypted volume by a user:
- there are things that try to do things with an admin context without a user token
- resize auto confirm periodic task <- if the guest is running in resize verify this should not fail, right?
- rebooting instance at compute startup due to resume_guests_state_on_host_boot config
- AGREED:
- add a new user to nova conf for barbican
- when nova creates the secret in barbican with the user's token then nova needs to add an ACL so that the nova's barbican user can read the token later
- alternative: service user token used in a similar way alongside the user admin token
- lyarwood to write up a spec for this in W
- (lyarwood) refresh connection_info / avoid storing stale connection_info in Nova
- https://review.opendev.org/#/q/topic:bug/1860913
- https://bugs.launchpad.net/cinder/+bug/1860913
- Also useful when MON IPs change with Ceph. +1+1(I see the Ghost of Chet)
- This also came up during the cinder PTG, during the Kioxia driver discussion.
- https://etherpad.opendev.org/p/wallaby-ptg-cinder (~L314)
- It was agreed that as a workaround the new cinder volume driver will not provide stateful connection_info for the time being due to the issue of stale connection_info in Nova. Instead os_brick will request this stateful data via an agent installed on the computes before connecting the volume that is itself made up of various NVMe connections to storage across several nodes. These connections can change over time leading to the issue.
- Q: Really Nova should stop storing connection_info between instance ops and start pulling it more often from Cinder API? If yes, is this a spec?
- AGREED:
- have a small spec, but overall looks good.
- (gibi): Interface attach with qos ports+1
- (sean-k-mooney): live migration and scheduler defaults
- can we make these the defaults:
- live_migration_permit_post_copy=true
- (kashyapc) A concern: If you're not using network-based storage, a serious downside to post-copy is that if you have a network failure while migration is in progress, you'll lose the guest
- This problem of "post-copy recovery" is not yet resolved in QEMU
- Despite the above downside, for most users it does make sense to have 'post-copy' enabled—especially for network-based storage, because network will already be a critical point for you
- IOW, enabling post-copy is based on your risk tolerance
- (johnthetubaguy) +1 keep this off
- live_migration_timeout_action=force_complete
- live_migration_permit_auto_converge=true
- not sure about that one... although the OPS meetup folks said to turn this on
- (aarents) turning this on helps us a lot
- Also: 500 -> 2000 or 1000?
- https://review.opendev.org/#/c/760050/
- when we introduced multiple port bindings, one of the motivators was fast network setup with less packet loss.
- this only happens if you use post copy
- post copy only triggers if the timeout action is force_complete
- for large VMs or when you use hugepages, post copy significantly reduces live migration time
- when using 1G hugepages, if you write a single byte to a hugepage the whole page needs to be copied again if you don't use post copy
- for busy VMs, auto converge ensures the live migration will eventually complete
- let's default to the fast option and allow operators to opt out if they don't want it.
- AGREED:
- make force_complete the default action at live migration timeout
- bump to 2000
- document the hugepage recommendation
- for the scheduler it would be nice to set
- [filter_scheduler]/host_subset_size=3
- this should really be in the range of 3-10 in most cases, but large values basically break the weighing and make it random
- basically this should be roughly the same order of magnitude as, or smaller than, your typical max-size multi-create, and it should be several orders of magnitude smaller than the cloud size.
- you want it to be small relative to the cloud so that the weighers are statistically relevant, but large relative to the request so that the entropy of the node selection for alternates is enough to ensure we don't hit the retry limit.
- if you have a 500 node cloud then this should be set to at most something in the 5-50 range to keep it 1 to 2 orders of magnitude smaller than the cloud. (for weighers to matter this needs to be statistically small relative to the cloud size)
- if your typical multi-create is 100 instances then you would want the value to be somewhere between 1-10
- so for a cloud of this size we would recommend something between 5-10, though smaller values are still ok
- 3 is a guess but should work for many clouds over ~30 nodes; for smaller clouds 1 is probably the right default. i chose 3 mainly because that is our default retry value, so if you have 3 or fewer hosts it's going to try them all anyway.
- I'm personally against changing this default since it would make scheduling unpredictable for the slight benefit of performance (particularly for large clouds, but not for small ones). TBH, I just feel we need better docs. +1
- AGREED:
- only document, don't change the default
- [filter_scheduler]/shuffle_best_same_weighed_hosts=true
- https://docs.openstack.org/nova/latest/configuration/config.html#filter_scheduler.host_subset_size
- https://docs.openstack.org/nova/latest/configuration/config.html#filter_scheduler.shuffle_best_same_weighed_hosts
- https://review.opendev.org/#/c/760055/
- this is more or less required for multi-create to work properly when you have any retries. the weighers work on the current state, not the pending state, so we need extra entropy to ensure the retries don't all try the same host.
- AGREED:
- agreed to change the default
- both the live migration and filter scheduler values have come up several times with customers over the last 6-12 months, so i think it's worth making these our defaults.
- we could address the scheduler default with better docs, but i think the live migration options should be changed in code.
- johnthetubaguy: ... but this is much less of a big deal when we do two-phase commit (whoop, whoop).
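To illustrate the two scheduler options discussed above, here is a minimal Python sketch. It assumes the documented behaviour: the chosen host is picked at random from the best host_subset_size weighed hosts, and shuffle_best_same_weighed_hosts randomises the order of hosts that tie for the best weight. The function name `pick_host` and the host list are hypothetical; this is not nova's actual code.

```python
import random


def pick_host(weighed_hosts, host_subset_size=1,
              shuffle_best_same_weighed=False, rng=random):
    """Pick one host name from (name, weight) pairs; higher weight wins."""
    # Stable sort: hosts with equal weight keep their input order.
    ordered = sorted(weighed_hosts, key=lambda hw: hw[1], reverse=True)
    if shuffle_best_same_weighed and ordered:
        best_weight = ordered[0][1]
        best = [hw for hw in ordered if hw[1] == best_weight]
        rng.shuffle(best)  # spread load across the tied best hosts
        ordered = best + ordered[len(best):]
    # Choose randomly among the top N hosts; N=1 is fully deterministic.
    subset = ordered[:max(1, host_subset_size)]
    return rng.choice(subset)[0]


# Hypothetical weights, for illustration only.
hosts = [("node-a", 9.0), ("node-b", 9.0), ("node-c", 7.5), ("node-d", 1.0)]
default_pick = pick_host(hosts)
spread_picks = {pick_host(hosts, host_subset_size=3) for _ in range(200)}
```

With the default subset size every request in a multi-create lands on the same top host until its state is updated, which is the retry problem described above. A subset of 3 spreads requests over the three best hosts but never reaches the worst one.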
- (jrosser): Instance identity documents+1
- See here for a nice introduction to the subject https://smallstep.com/blog/embarrassingly-easy-certificates-on-aws-azure-gcp/
- Here is a POC http://paste.openstack.org/show/798996/
- The motivation is to provide a cryptographically verifiable identity document for an instance which can then be used outside the instance to bootstrap other operations which require trust
- Issuance of certificates
- ...
- The JWT approach from google appears simple and as the POC shows is a relatively trivial addition to the metadata api service
- i mentioned on irc that this is almost small enough to be a specless blueprint, but i forgot that all api changes require a spec. with that said, i do think this is a reasonable thing to add.
- Things to think about
- Rotation of the private key - existing commercial public clouds may do this daily, so what's the hook to allow a deployer to do this without lots of service restarts?
- How to pass the 'audience' for the JWT, either an HTTP header or query parameter, but how would this get from handler.py to base.py, and how would it pass through the metadata proxy
- Where to publish the public key for validation of the identity document
- What is the information that should & should not go in the identity document
- AGREED:
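As an illustration of the JWT approach discussed in this topic (not an agreed design and not the POC linked above), here is a minimal Python sketch of issuing and checking a signed identity document, including the 'audience' check. Python's standard library has no RSA signing, so the sketch swaps in HMAC-SHA256 (HS256) purely to show the token structure; a real identity document would need an asymmetric algorithm so that verifiers only need the published public key. The function names and claim values are hypothetical.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Decode base64url text, restoring the stripped padding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_identity(claims: dict, key: bytes) -> str:
    """Build header.payload.signature with an HS256 signature."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":"), sort_keys=True).encode())
        for part in (header, claims)
    )
    sig = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + b64url(sig)


def verify_identity(token: str, key: bytes, audience: str) -> dict:
    """Check the signature and the audience, then return the claims."""
    signing_input, _, sig = token.rpartition(".")
    expected = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, b64url_decode(sig)):
        raise ValueError("bad signature")
    claims = json.loads(b64url_decode(signing_input.split(".")[1]))
    if claims.get("aud") != audience:
        raise ValueError("wrong audience")
    return claims


# Hypothetical claims: "sub" identifies the instance, "aud" the consumer.
token = sign_identity({"sub": "instance-uuid", "aud": "cert-issuer"}, b"demo-key")
claims = verify_identity(token, b"demo-key", "cert-issuer")
```

Binding the token to an audience is what stops a document requested for one consumer from being replayed against another, which is why the question of how the audience reaches the signing code matters.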
- (gibi)(stephen): Static typing in nova with mypy+1+1
- What should we do about oslo versionedobjects in nova?
- The class members of o.vos are defined at runtime via the register, register_if or objectify class decorators, based on the fields dict. So mypy does not know about the members of the class or the types of those members
- A mypy plugin like https://github.com/gibizer/ovo-mypy-plugin can help mypy to understand which fields exist on an o.vo at runtime
- The o.vo value coercion mechanism makes it hard to define exact types for o.vo defined fields. E.g. an IntegerField accepts a string value during assignment if int(value) can be used to convert the string to int
- given that the fields are objects, could we define a mypy_type property on them and have the plugin call that to get the type to check? then we can handle this per field, e.g. IntegerField could return Union[str, int]
- in the base field object we could have mypy_type return Any
- AGREED:
- do type annotation for hardware.py
- only best effort for o.vos
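The typing problem described in this topic can be shown with a toy Python sketch. The classes below are simplified stand-ins written for illustration only; they are not the oslo.versionedobjects implementation. A class decorator turns each entry of a `fields` dict into an attribute at runtime, and the field coerces values on assignment.

```python
from typing import Any, Dict


class IntegerField:
    """Toy stand-in for an o.vo integer field: coerces on assignment."""

    def coerce(self, value: Any) -> int:
        return int(value)


def register(cls):
    """Toy class decorator: turn each entry of ``fields`` into a property."""
    for name, field in cls.fields.items():
        def getter(self, _name=name):
            return self._values[_name]

        def setter(self, value, _name=name, _field=field):
            self._values[_name] = _field.coerce(value)

        # The attribute only comes into existence here, at import time,
        # so a static checker reading the class body never sees it.
        setattr(cls, name, property(getter, setter))
    return cls


@register
class ToyFlavor:
    fields: Dict[str, IntegerField] = {"vcpus": IntegerField()}

    def __init__(self) -> None:
        self._values: Dict[str, int] = {}


flavor = ToyFlavor()
flavor.vcpus = "4"  # accepted at runtime because int("4") works
```

The stored value is an int even though a str was assigned, so the accepted type on assignment is wider than the type on read. That is the reason a per-field type such as Union[str, int] was floated above.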
- (bauzas) Stop recreating mdevs on reboot
- Context : https://bugs.launchpad.net/nova/+bug/1900800
- mdevs aren't persisted, so after a reboot we can't tell which parent device they had before
- a workaround does exist : https://github.com/mdevctl/mdevctl
- two possible outcomes:
- only stop trying to recreate at reboot, delete a single method and done. Profit.
- or, stop creating mdevs on the fly at instance creation, which would require ops to pre-create mdevs, like we do for VFs. Harder to tell to ops tho and
- AGREED:
- continue discussing on IRC
- (brinzhang): Cyborg shelve/unshelve support
- https://review.opendev.org/#/c/729563/
- This blueprint was approved in Victoria, but it was not merged before the Victoria release
- (sean-k-mooney) this should be moved up to the cyborg session slot but yes i think we should just continue with this
- AGREED: just review this spec; no objections to continuing it
- (brinzhang): Proposal for a safer remote console with password authentication
- https://review.opendev.org/#/c/623120/
- This spec was already re-proposed in Victoria, but it could not be completed because the contributor's work assignment changed. We will re-propose it in Wallaby, and I will change the owner
- AGREED:
- review the spec; it is a re-proposal so it is OK.
- (brinzhang): Cyborg suspend/resume support
- https://review.opendev.org/#/c/729945
- We would like to complete this feature in Wallaby
- (sean-k-mooney) this should be moved up to the cyborg session slot but yes i think we should just continue with this
- AGREED: just review this spec; no objections to continuing it
- (stephenfin) Incorporating IP information in libvirt instance metadata
- https://review.opendev.org/#/c/750552/3
- this specific patch is incomplete since it does not take into account interface attach/detach or floating ips, as far as i can see.
- As an aside, our guideline for what someone should do with specless blueprints seems to be lacking
- as long as it is not an api change, if we agree it's well scoped and not a bug it can be a specless blueprint. this is not an api change, so a specless blueprint sounds reasonable to me; it's what i was going to do.
- (melwitt): As johnthetubaguy said, that's how I think of it too, if a thing needs any discussion or questions/answers, a spec is what we use to have that discussion +1
- (sean-k-mooney): actually this reminds me of a patch i started working on last cycle. i was going to also add all the flavor and image info, e.g. the extra specs and image properties.
- basically i was going to add all the info needed to debug why the xml was generated the way it was and how it landed on a given host. that way, without needing to look up the flavor and image,
- we can debug things just from the xml. this is very useful for us downstream when looking at sos reports from customers, as we often do not get the flavor and image info and have to ask
- them to provide it, but if it's in the metadata we can validate our assumptions.
- Related: https://review.opendev.org/#/c/749977/ (add requested_networks to RequestSpec)
- AGREED:
- we need a small spec to answer stephen's questions
- (johnthetubaguy) adding non-fatal filters, deprecate weighers
- Make everything return a True/False, i.e. a filter
- have an ordered list of the above filters, generates a bit field, use bit field generated as the weight
- Advantage: the precedence order is clear
- Existing weighers can be converted by adding a threshold for True/False given a weigher that doesn't just return 1/0
- ... might need to have config that allows a weigher to be used twice with two different thresholds
- AIM: no extra code needed when deciding if something moves from a filter to a weight or v.v.
- existing list becomes a list of "fatal" filters, i.e. a host must pass such filters to be a valid candidate
- ... actually this is just one more weigher, eventually we make it the only default weigher
- please tell me more :)
- (bauzas) Trying to gather old' Dublin etherpad...
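The bit field idea from this topic can be sketched in a few lines of Python. This is only an illustration of the proposal as noted above, under the assumption that earlier entries in the ordered list take higher bits; the helper names, host data and thresholds are hypothetical.

```python
def bitfield_weight(host, ordered_checks):
    """Earlier checks land in higher bits, so they dominate later ones."""
    weight = 0
    for check in ordered_checks:
        weight = (weight << 1) | int(bool(check(host)))
    return weight


def threshold(weigher, minimum):
    """Convert a numeric weigher into a True/False check."""
    return lambda host: weigher(host) >= minimum


def free_ram(host):
    return host["free_ram_mb"]


def image_cached(host):
    return host["has_image_cached"]


# Hypothetical hosts, for illustration only.
hosts = {
    "node-a": {"free_ram_mb": 2048, "has_image_cached": True},
    "node-b": {"free_ram_mb": 16384, "has_image_cached": False},
    "node-c": {"free_ram_mb": 16384, "has_image_cached": True},
}

# The same weigher appears twice with two different thresholds.
checks = [image_cached, threshold(free_ram, 8192), threshold(free_ram, 1024)]

weights = {name: bitfield_weight(host, checks) for name, host in hosts.items()}
ranked = sorted(weights, key=weights.get, reverse=True)
```

Because the first check occupies the highest bit, node-a (image cached, little RAM) outranks node-b (no cached image, lots of RAM), which makes the precedence order explicit in the list order itself.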
- (sean-k-mooney): static rest endpoint to list available features.
- simple list of named tuples
- <> required [] optional
- <feature-string>:[min microversion]:[description]
- semantics are: if the feature is supported it's listed, and you can use it for discoverability.
- useful for flavor extra specs or other virt driver/backend specific features that are independent of microversions.
- (gmann): this seems the same as providing capabilities?
- Using JSON home was one option we discussed in the past.
- AGREED:
- write a spec, define what is a "feature"
- probably we don't have time for this in the W cycle
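For illustration, here is a minimal Python sketch of parsing the `<feature-string>:[min microversion]:[description]` format noted in this topic into named tuples. It assumes the feature string itself contains no colon, which the notes do not settle; the type name, function name and sample entries are hypothetical.

```python
from typing import NamedTuple, Optional


class Feature(NamedTuple):
    name: str
    min_microversion: Optional[str] = None
    description: Optional[str] = None


def parse_feature(entry: str) -> Feature:
    """Parse '<feature-string>:[min microversion]:[description]'."""
    name, _, rest = entry.partition(":")
    if not name:
        raise ValueError("the feature string is required")
    # Only the first two colons are separators; later ones stay in
    # the description.
    microversion, _, description = rest.partition(":")
    return Feature(name, microversion or None, description or None)


# Hypothetical entries, for illustration only.
features = [
    parse_feature("feature-a:2.50:An example: with a colon"),
    parse_feature("feature-b::No minimum microversion"),
    parse_feature("feature-c"),
]
supported = {f.name for f in features}
```

A client would then test membership in `supported` to discover whether a feature can be used, matching the "if it is supported, it is listed" semantics.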
- (sean-k-mooney) on behalf of legochen: add a new domain prefilter to map keystone domains to host aggregates
- https://github.com/openstack/nova/commit/732e202e81142a8ea462a9ebcde9a7226a62a60b
- basically the same as the existing tenant one ^ but swap project id for domain, and TENANT_METADATA_KEY for a DOMAIN_METADATA_KEY in the host aggregate
- additionally check if the AZ is valid for a given user based on project or domain, and return an error in server create instead of no valid host
- ... erm but why? 5 AZs in the cloud, but this domain only has access to 3 AZs
- also could filter the AZ list on the same criteria
- user scenario - https://docs.google.com/document/d/1Cv3FB3HLc70o4EcFh9aLxzPVszRgulkADnmJxmT65a8/edit#
- AGREED:
- need a spec/blueprint on domain prefilter, seems good
- need a spec to expand flavor ACL for domains as well as project, seems good
- have a spec on filtering AZs by tenants and domain, need more details on use cases
- NOTE: we have plans to use "distance" or service group affinity/anti-affinity
- (legochen) one more question is about the access-control of flavors
- so far, flavor offers a way to share by multiple-tenants. But, I'm looking for a solution for domain level control.
- Can this be done with system scope instead of domain?
- (belmoreira) Support guests from different architectures
- Probably I can't make it for the discussion but if you reach this topic it would be great if this gets some attention.
- My use case is to run aarch64 guests on top of x86_64
- In the past we had a generic bug to discuss this issue: https://bugs.launchpad.net/nova/+bug/1863728
- I'm now opening target bugs:
- Need to be sure we don't break multi-arch clouds by accident
- AGREED:
- Thanks for the bugs. We need somebody who works on the bugfixes
- To avoid breaking multi-arch clouds we need a spec to discuss separation