[openstack-dev] [nova] Rocky PTG summary - miscellaneous topics from Friday

melanie witt melwittt at gmail.com
Tue Mar 20 22:57:25 UTC 2018

Howdy all,

I've put together an etherpad [0] with summaries of the items from the 
Friday miscellaneous session from the PTG at the Croke Park Hotel "game 
room" across from the bar area. I didn't summarize all of the items, but 
attempted to do so for most of them, namely the ones that had 
discussion/decisions about them.


[0] https://etherpad.openstack.org/p/nova-ptg-rocky-misc-summary

*Friday Miscellaneous: Rocky PTG Summary

https://etherpad.openstack.org/p/nova-ptg-rocky (line 281)

*Key topics

   * Team / review policy
   * Technical debt and cleanup
     * Removing nova-network and legacy cells v1
     * Community goal to remove usage of mox3 in unit tests
     * Dropping support of running nova-api and the metadata API service 
under eventlet
     * Cruft surrounding rebuild and evacuate
     * Bumping the minimum required version of libvirt
     * Nova's 'enabled_perf_events' feature will be broken with Linux 
Kernel 4.14+ (the feature has been removed from the kernel)
   * Miscellaneous topics from the PTG etherpad

*Agreements and decisions

   * On team / review policy, for the Rocky cycle we're going to 
experiment with a process for "runways" wherein we'll focus review 
bandwidth on selected blueprints in 2 week time-boxes
     * Details here: https://etherpad.openstack.org/p/nova-runways-rocky
   * On technical debt and cleanup:
     * We're going to remove nova-network this cycle and see how it 
goes. Then we'll look toward removing legacy cells v1.
     * NOTE: If you're planning to work on the community-wide goal of 
removing mox3 usage, don't bother refactoring nova-network and legacy 
cells v1 unit tests. Those tests will be entirely removed soon-ish.
     * We're going to emit a warning on service startup, add a release 
note for the deprecation, and plan to remove support for running 
nova-api and the metadata API service under eventlet in the "S" release
       * Patch: https://review.openstack.org/#/c/549510/
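       * As a rough illustration only (the service names, message 
wording, and function are assumptions, not the patch under review), the 
startup warning could look something like:

```python
# Illustrative sketch -- not the actual Nova patch. The idea: when one
# of the affected services starts under eventlet, log a deprecation
# warning steering operators toward a WSGI deployment instead.
import logging

LOG = logging.getLogger(__name__)

# Hypothetical set of service binaries affected by the deprecation.
EVENTLET_DEPRECATED_SERVICES = {'osapi_compute', 'metadata'}

def warn_if_eventlet(service_name):
    """Log a deprecation warning; return True if one was emitted."""
    if service_name in EVENTLET_DEPRECATED_SERVICES:
        LOG.warning(
            'Running %s under eventlet is deprecated and is planned '
            'for removal in the "S" release; deploy it as a WSGI '
            'application instead.', service_name)
        return True
    return False
```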
     * For rebuild, we're going to defer the instance.save() until 
after conductor has finished scheduling and before it casts to 
compute, in order to avoid having to roll back instance values if 
something fails during rebuild scheduling
       * For future work on rebuild tech debt, there was an idea to 
deprecate "evacuate" and add an option to rebuild like "--elsewhere" to 
collapse the two into using nearly the same code path. Evacuate is a 
rebuild and it would be nice to represent it as such. Someone would need 
to write up a spec for this.
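       * A minimal sketch of the agreed ordering (all names here are 
hypothetical, not conductor's real ones):

```python
# Hypothetical sketch of the rebuild flow discussed above. The point:
# stage the new instance values and persist them only after scheduling
# has succeeded, so a NoValidHost-style failure leaves the instance
# record untouched and nothing needs rolling back.
def rebuild_instance(instance, updates, schedule, cast_to_compute):
    """Persist rebuild updates only once a host has been selected."""
    host = schedule(instance)      # may raise if no host is found
    instance.update(updates)       # stage new image ref, metadata, ...
    # instance.save() would go here: after scheduling, before the cast
    cast_to_compute(host, instance)
    return host
```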
     * We're going to bump the minimum required libvirt version: 
       * kashyap is going to do this
     * We're going to log a warning if enabled_perf_events is set in 
nova.conf and mark it as deprecated for removal
       * kashyap is going to do this
   * Abort Cold Migration
     * This would add a new API and a significant amount of complexity, 
as it is prone to race conditions (for example, an abort request 
landing just after the disk migration has finished, requiring the 
original instance to be restored), etc.
     * We would like to have greater interest from operators for the 
feature before going down that path
     * takashin will email openstack-operators at lists.openstack.org to 
ask if there is broader interest in the feature
   * Abort live migrations in queued status
     * We agreed this is reasonable functionality to add, just need to 
work out the details on the spec
     * Kevin_Zheng will update the spec: 
   * Adding request_id field to migrations object
     * The goal here is to be able to look up the instance action for 
a failed migration to determine why it failed, and the request_id is 
needed to look up the instance action.
     * We agreed to add the request_id instance action notification 
instead and gibi will do this: 
   * Returning Flavor Extra Specs in GET /flavors/detail and GET 
     * Doing this would create parity between the servers API (when 
showing the instance.flavor) and the flavors API
     * We agreed to add a new microversion and implement it the same way 
as we have for instance.flavor using policy as the control on whether to 
show the extra specs
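     * A minimal sketch of the agreed behavior, not Nova's 
implementation (the microversion value and names are placeholders; the 
real microversion is assigned when the change merges):

```python
# Hypothetical sketch: a new microversion adds extra_specs to flavor
# views, gated by the same kind of policy check already used when
# showing the embedded instance.flavor in the servers API.
EXTRA_SPECS_MICROVERSION = (2, 999)  # placeholder value

def flavor_view(flavor, microversion, can_read_extra_specs):
    """Build one entry of a GET /flavors/detail response."""
    view = {'id': flavor['id'], 'name': flavor['name']}
    # Policy decides whether the caller may see the extra specs.
    if microversion >= EXTRA_SPECS_MICROVERSION and can_read_extra_specs:
        view['extra_specs'] = dict(flavor.get('extra_specs', {}))
    return view
```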
   * Adding host and Error Code field to instance action event
     * We agreed that it would be reasonable to add a new microversion 
to add the host (whether it's shown is based on a policy check) to the 
instance action event, but the error code is a much more complex, 
cross-project, community-wide effort, so we're not going to pursue 
that for now
     * Spec for adding host: https://review.openstack.org/#/c/543277/
   * Allow specifying tolerance for (soft)(anti-)affinity groups
     * This requirement is about adding an attribute to the group to 
limit how strict the (anti-)affinity is in the filter. Today hard 
anti-affinity implies a maximum of 1 server per host
     * We agreed this feature is reasonable, just need to sort out the 
api model vs data model in the spec: 
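     * As a sketch of the proposed filter behavior (the attribute name 
max_server_per_host is an assumption pending the spec):

```python
# Sketch of the proposed relaxation of hard anti-affinity. The group
# attribute name is an assumption to be settled in the spec review.
def host_passes_anti_affinity(instances_on_host, group_members,
                              max_server_per_host=1):
    """Can one more group member land on this host within the limit?

    The default of 1 matches today's hard anti-affinity behavior.
    """
    members_here = sum(1 for i in instances_on_host if i in group_members)
    return members_here + 1 <= max_server_per_host
```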
   * XenAPI: support non-file-system-based SR types - e.g. LVM, iSCSI
     * Currently the xenapi driver only supports file-system-based 
SRs; it cannot yet use the LVM and iSCSI SR types that XenServer 
supports
     * We agreed that a specless blueprint is fine for this: 
   * Supporting live-migration for xapi pool based hosts
     * This has to be implemented before the aggregate upcall can be 
removed; otherwise the removal would break live migration for shared 
storage
     * We agreed to do a specless blueprint for this: 
     * The removal of the aggregate upcall will be a patch stacked on 
top of the live migration implementation ^
   * Preemptible instances
     * The scientific SIG has been experimenting with doing this 
completely outside of Nova and it was a bad user experience, so they're 
interested in what a minimal integration in Nova could look like
     * There was an idea suggested to put instances that failed to boot 
in a new non-ERROR vm_state. The external reaper service could go make 
some room, then issue a rebuild on that instance. Notifications could be 
used to determine the ordering of the instances that landed in the new 
non-ERROR state. Then the reaper service could use that info to decide 
what to do. Maybe use reset-state to put an instance into ERROR if the 
reaper is giving up on it. A new notification is not needed, we already 
have one. The configurable bit will be whether to use the new non-ERROR 
"pending" state. We'll need a way to remove the instance from cell0, 
etc. "Evacuate for cell0" when you do the rebuild.
     * There was agreement that the above idea seemed reasonable (though 
need to check if there is any special handling needed for boot-from-volume)
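     * A very rough sketch of the reaper idea above (the "pending" 
state, the ordering source, and all callbacks are assumptions, not Nova 
APIs):

```python
# Hypothetical reaper loop: pick the oldest instances in the new
# non-ERROR "pending" state (ordering known from notifications), try to
# free capacity by removing preemptible instances, then rebuild each
# pending instance, or give up on it (e.g. reset-state to ERROR).
def reap(pending, free_room, rebuild, give_up, needed=1):
    """Try to rebuild up to `needed` pending instances, oldest first."""
    # Notifications told us when each instance entered "pending".
    for instance in sorted(pending, key=lambda i: i['pending_since'])[:needed]:
        if free_room():            # e.g. delete some preemptible instances
            rebuild(instance)      # the "evacuate from cell0" rebuild
        else:
            give_up(instance)      # e.g. reset-state the instance to ERROR
```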
   * Exposing virt driver capabilities out of a REST API
     * Simple way to tie driver capabilities to scheduling requests 
using Placement and provider traits, for things like tagged devices and 
volume multiattach
     * We agreed this is OK, so do it. Jay has a patch to register the 
traits in os-traits; the nova patch will depend on a release of that
     * We also agreed to merge traits, somehow. But we'll finish 
update_provider_tree as written first
   * libvirt: add support for virtio-net rx/tx queue sizes
     * Spec: https://review.openstack.org/#/c/539605
     * It looks like we thought we had agreement to go with a global 
config option for this but it's still being actively discussed on the 
spec. Please see the spec discussion for details
   * Adding support for rebuild of volume-backed instances
     * Review of the spec is underway: 
   * Strict isolation of group of hosts for image and flavor
     * Spec: https://review.openstack.org/#/c/381912
     * The only remaining problem is that if an image doesn't contain 
any of the relevant properties, the instance can land on any host 
aggregate group
     * We agreed that this should be done with traits. From the spec 
review comments, it sounds like the desired behavior will be possible 
once the "placement request filtering" work lands: 
       * The solution will involve applying custom traits to compute 
nodes and using the placement request filtering to use data from the 
RequestSpec to filter the request only for host aggregates that meet the 
requirement. Discussion on the spec is in progress.
   * Reliable port ordering in Nova
     * Ports are not given a guaranteed order; a backup/restore of the 
database can result in the port ordering changing
     * We agreed the existing device tagging feature can be used to get 
reliable ordering for devices
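     * For illustration, tagging devices at boot and looking them up 
from the guest might look like this (the UUIDs are placeholders and 
the metadata shape is simplified):

```python
# Tags supplied at boot are exposed to the guest via the metadata API,
# so identifying a NIC no longer depends on database row order.
# Hypothetical, simplified request body and metadata shape:
boot_request = {
    'server': {
        'name': 'web-1',
        'networks': [
            {'uuid': 'NET-A-UUID', 'tag': 'management'},
            {'uuid': 'NET-B-UUID', 'tag': 'data'},
        ],
    },
}

def nic_mac_by_tag(device_metadata, tag):
    """Find a NIC's MAC address by tag in the guest's device metadata."""
    for dev in device_metadata:
        if dev.get('type') == 'nic' and tag in dev.get('tags', []):
            return dev['mac']
    return None
```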
   * Granular API policy
     * This is about making policy more granular for GET, POST, PUT, 
DELETE separately for APIs
     * Interested people should review the spec: 
   * Block device mapping creation races during attach volume
     * We agreed to create a nova-manage command to do BDM clean up and 
then add a unique constraint in S
     * mriedem will restore the device name spec and someone else can 
pick it up
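     * A hypothetical sketch of what the cleanup could do (not the 
real nova-manage code): collapse duplicate BDM rows for the same 
instance/volume pair, keeping the newest, so the unique constraint can 
be added safely later.

```python
# Illustrative dedupe pass over block device mapping rows; the row
# shape here is a simplified assumption.
def dedupe_bdms(rows):
    """rows: dicts with instance_uuid, volume_id and a monotonic id.

    Returns (rows_to_keep, ids_to_delete).
    """
    keep = {}
    for row in rows:
        key = (row['instance_uuid'], row['volume_id'])
        # Keep only the newest (highest id) row per instance/volume pair.
        if key not in keep or row['id'] > keep[key]['id']:
            keep[key] = row
    doomed = [r['id'] for r in rows
              if r is not keep[(r['instance_uuid'], r['volume_id'])]]
    return sorted(keep.values(), key=lambda r: r['id']), doomed
```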
   * Ironic instance switchover
     * With boot-from-volume, a failed bare metal node can be switched 
to another node by booting the alternative node from the same volume. 
There are some options regarding which Compute API could be used to 
trigger this switchover
     * This is about wanting to be able to migrate baremetal instances
     * We agreed this would be adding parity with other virt drivers, so 
it's okay in that regard. Details need to be worked out on the spec 
review: https://review.openstack.org/#/c/449155
   * Add UEFI Secure Boot support for QEMU/KVM guests, using OVMF
     * Spec: https://review.openstack.org/#/c/506720/
     * We agreed that providing this feature would be reasonable. 
Details need to be worked out on the spec review
   * Validate policy when creating a server group
     * Currently we can create a server group that has no policies (an 
empty policy list). A server can be created with such a group, but all 
related scheduler filters return True, so it is useless
     * Spec: https://review.openstack.org/#/c/546484
     * We agreed this should be a simple thing to do, spec review is 
underway. We also said we should consider lumping in some other trivial 
API cleanup into the same microversion - we have a lot of TODOs for 
similar stuff like this in the API
   * Add force flag in cold migration
     * We agreed not to add a way to force a bypass of the scheduler
   * Skip instance backup image creation when the rotation is 0
     * Spec: https://review.openstack.org/511825
     * The spec ^ is old and needs to be re-proposed for Rocky
     * We agreed that the approach should be to add a microversion to 
disallow 0 and then also add a new API for "purge all backups" to be 
used instead of passing 0
   * Live resize for hyper-v
     * Spec: https://review.openstack.org/#/c/141219
     * This uses same workflow as cold migrate, we'll increase 
allocations in placement when a host is found, but there's no 
confirm/revert step because it's live
     * Proposes a new API and resize up only will be allowed. Virt 
driver will have a new method "can_live_resize" (or similar) to check if 
it supports it before proceeding past the API layer
     * Stage 1 only, no automatic live migration fallback
     * PowerVM wants to do this too. Should be possible in libvirt 
driver too
     * We agreed this idea sounds fine, just a matter of getting 
interest and review on the spec. Interested parties should review the spec
