On Mon, 2019-01-21 at 10:45 +0000, Balázs Gibizer wrote:
On Fri, Jan 18, 2019 at 7:40 PM, Dan Smith <dms@danplanet.com> wrote:
* There will be a second new microversion (probably in Train) that will enable move operations for servers that have resource-aware ports. This microversion split will allow us not to block the create / delete support for the feature in Stein.
* The new microversions will act as a feature flag in the code. This will allow merging single use cases (e.g.: server create with one ovs backed resource aware port) and functionally verifying it before the whole generic create use case is ready and enabled.
* A nova-manage command will be provided to heal the port allocations without moving the servers if there is enough resource inventory available for it on the current host. This tool will only work online as it will call neutron and placement APIs.
* Server move operations with the second new microversion will automatically heal the server allocation.
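To make the "microversion as feature flag" idea in the list above concrete, here is a minimal sketch of how the two-step split could gate requests. The microversion numbers, the operation names, and the function `is_allowed` are all hypothetical placeholders chosen for illustration; this is not nova code and the thread does not name real version numbers.

```python
from typing import Tuple

# Hypothetical placeholder microversions; the thread does not give real numbers.
CREATE_DELETE_MICROVERSION = (2, 100)  # first step: create / delete support
MOVE_MICROVERSION = (2, 101)           # second step: move operations

# Illustrative set of move operations discussed in the thread.
MOVE_OPERATIONS = {"migrate", "live_migrate", "resize", "shelve", "unshelve"}


def parse_microversion(value: str) -> Tuple[int, int]:
    """Turn a string such as '2.100' into a comparable (major, minor) tuple."""
    major, minor = value.split(".")
    return int(major), int(minor)


def is_allowed(operation: str, requested: str, has_resource_aware_port: bool) -> bool:
    """Decide whether an operation passes the microversion gate."""
    if not has_resource_aware_port:
        # Servers without such ports are not affected by the gate at all.
        return True
    version = parse_microversion(requested)
    if operation in MOVE_OPERATIONS:
        return version >= MOVE_MICROVERSION
    if operation == "create":
        return version >= CREATE_DELETE_MICROVERSION
    # Everything else (e.g. delete) is left ungated in this sketch.
    return True
```

Versions are compared as integer tuples rather than as strings or floats, so that "2.9" sorts before "2.100".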
I wasn't on this call, so apologies if I'm missing something important.
Having a microversion that allows move operations for an instance configured with one of these ports seems really terrible to me. What exactly is the point of that? To distinguish between Stein and Train systems purely because Stein didn't have time to finish the feature?
I think in Stein we have time to finish the boot / delete use case of the feature but most probably do not have time to finish the move use cases. I believe that the boot / delete use case is already useful for end users. There are plenty of features in nova that are enabled before supporting all the cases, like move operations with NUMA.
that is true, however numa in particular was due to an oversight, not by design. as is the case with macvtap sriov, numa was intended to support live migration from its introduction, even if that support is only now being completed. numa, even without Artom's work, has always supported cold migration; the same is true of cpu pinning, hugepages and pci/sriov passthrough.
IMHO, we should really avoid abusing microversions for that sort of thing. I would tend to err on the side of "if it's not ready, then it's not ready" for Stein, but I'm sure the desire to get this in (even if partially) is too strong for that sort of restraint.
Why is it an abuse of the microversion to use it to signal that a new use case is supported? I'm confused. I was asked to use microversions to signal that a feature is ready. So I'm not sure why it is OK to use a microversion for a feature (feature == one or more use cases) but not OK when a single use case (e.g. boot/delete) is completed.
dan can speak for himself, but i would assume it is because the microversion does not signal that the use case is supported. it merely signals that the codebase could support it. as move operations can be disabled via config or may not be supported by the selected hypervisor (ironic), the presence of the microversion alone is not enough to determine that the use case is supported. unlike neutron extensions, microversions are not advertised individually and can't be enabled only when the deployment is configured to support a feature.
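The distinction between "the codebase could support it" and "this deployment supports it" can be sketched roughly as follows. The `Deployment` fields and the microversion number are hypothetical and only illustrate the argument; they do not correspond to real nova configuration options.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical placeholder for the microversion that would enable move operations.
MOVE_MICROVERSION = (2, 101)


@dataclass
class Deployment:
    max_microversion: Tuple[int, int]  # highest microversion the API advertises
    move_enabled_by_config: bool       # hypothetical: operator has not disabled moves
    driver_supports_move: bool         # hypothetical: e.g. False for an ironic-style driver


def codebase_could_support_move(deployment: Deployment) -> bool:
    """All that the microversion by itself can tell a client."""
    return deployment.max_microversion >= MOVE_MICROVERSION


def move_actually_usable(deployment: Deployment) -> bool:
    """What a client really wants to know; needs more than the microversion."""
    return (
        codebase_could_support_move(deployment)
        and deployment.move_enabled_by_config
        and deployment.driver_supports_move
    )
```

In this sketch a deployment can advertise the new microversion and still reject every move request, which is why a client has to handle a rejection either way.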
Can we not return 403 in Stein, since moving instances is disable-able anyway, and just make it work in Train? Having a new microversion with a description of "nothing changed except we finished a feature so you can do this very obscure thing now" seems like we're just using them as an
I think "nothing is changed" would not be true. Some operation (e.g. server move) that was rejected before (or even accepted but caused unintentional resource overallocation) now works properly.
Isn't it the "you can do this very obscure thing now" documentation of a microversion that makes the new API behavior discoverable?
experimental feature flag, which was definitely not the intent. I know returning 403 for "you can't do this right now" isn't *as* discoverable, but you kinda have to handle 403 for operations that could be disabled anyway, so...
The boot / delete use case would not be experimental, that would be final.
403 is a client error but in this case, in Stein, move operations would not be implemented yet. So for me that error is not a client error (e.g. there is no way a client can fix it) but a server error, like HTTP 501. a 501 "not implemented" would be a valid error code to use with the new microversion
since the min bandwidth before was best effort, any overallocation was not a bug or unintentional; it was allowed by design, given that we initially planned to delegate the bandwidth management to the sdn controller. as matt pointed out, the apis for creating qos rules and policies are admin only, as are most of the move operations. a tenant could have chosen to apply the QOS policy, but the admin had to create it in the first place. that declares support for bandwidth based scheduling.
resize today does not return 501
https://developer.openstack.org/api-ref/compute/?expanded=resize-server-resi...
nor do shelve/unshelve
https://developer.openstack.org/api-ref/compute/#shelve-server-shelve-action
https://developer.openstack.org/api-ref/compute/#unshelve-restore-shelved-se...
the same is true of migrate and live migrate
https://developer.openstack.org/api-ref/compute/?expanded=#migrate-server-mi...
https://developer.openstack.org/api-ref/compute/?expanded=#live-migrate-serv...
as such, for older microversions returning 501 would be incorrect, as it's a change in the set of response codes that existing clients should expect from those endpoints. while i agree it is not a client error, being consistent with existing behavior could be preferable, as clients presumably know how to deal with it.
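One possible reading of the response-code positions above, written as a sketch rather than as anything the thread agreed on: requests at older microversions keep a status code from the set clients already expect, and 501 is only introduced together with the new microversion. The microversion number is a hypothetical placeholder, and using 403 as the legacy code is an assumption taken from the suggestion earlier in the thread.

```python
from http import HTTPStatus
from typing import Tuple

# Hypothetical placeholder for the new microversion; not a real nova version number.
NEW_MICROVERSION = (2, 100)


def parse_microversion(value: str) -> Tuple[int, int]:
    """Turn a string such as '2.100' into a comparable (major, minor) tuple."""
    major, minor = value.split(".")
    return int(major), int(minor)


def move_rejection_status(
    requested: str,
    legacy_status: HTTPStatus = HTTPStatus.FORBIDDEN,  # assumption: 403, as suggested in the thread
) -> HTTPStatus:
    """Pick the status code for a move request that cannot be served yet."""
    if parse_microversion(requested) >= NEW_MICROVERSION:
        # A new response code is acceptable here because the client opted in
        # to the new microversion and its documented behaviour.
        return HTTPStatus.NOT_IMPLEMENTED
    # Older microversions must not grow a new response code.
    return legacy_status
```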
Cheers, gibi
--Dan