[openstack-dev] [nova] Stein PTG summary

melanie witt melwittt at gmail.com
Wed Sep 26 22:10:46 UTC 2018


Hello everybody,

I've written up a high level summary of the discussions we had at the 
PTG -- please feel free to reply to this thread to fill in anything I've 
missed.

We used our PTG etherpad:

https://etherpad.openstack.org/p/nova-ptg-stein

as an agenda and each topic we discussed was filled in with agreements, 
todos, and action items during the discussion. Please check out the 
etherpad to find notes relevant to your topics of interest, and reach 
out to us on IRC in #openstack-nova, on this mailing list with the 
[nova] tag, or by email to me if you have any questions.

Now, onto the high level summary:

Rocky retrospective
===================
We began Wednesday morning with a retro on the Rocky cycle and captured 
notes on this etherpad:

https://etherpad.openstack.org/p/nova-rocky-retrospective

The runways review process was seen as overall positive and helped get 
some blueprint implementations merged that had languished in previous 
cycles. We agreed to continue with the runways process as-is in Stein 
and use it for approved blueprints. We did note that we could do better 
at queuing important approved work into runways, such as 
placement-related efforts that were not added to runways last cycle.

We discussed whether or not to move the spec freeze deadline back to 
milestone 1 (we used milestone 2 in Rocky). I have an action item to dig 
into whether or not the late-breaking regressions we found at RC time:

https://etherpad.openstack.org/p/nova-rocky-release-candidate-todo

were related to the later spec freeze at milestone 2. The question we 
want to answer is: did a later spec freeze lead to implementations 
landing later and resulting in the late detection of regressions at 
release candidate time?

Finally, we discussed a lot of things around project management, 
end-to-end themes for a cycle, and people generally not feeling they had 
clarity throughout the cycle about which efforts and blueprints were 
most important, aside from runways. We got a lot of work done in Rocky, 
but not as much of it materialized into user-facing features and 
improvements as it did in Queens. Last cycle, we had thought runways 
would capture what is a priority at any given time, but looking back, we 
determined it would be helpful if we still had over-arching 
goals/efforts/features written down for people to refer to throughout 
the cycle. We dove deeper into that discussion on Friday during the hour 
before lunch, where we came up with user-facing themes we aim to 
accomplish in the Stein cycle:

https://etherpad.openstack.org/p/nova-ptg-stein-priorities

Note that these are _not_ meant to preempt anything in runways; they are 
just 1) for my use as a project manager and 2) for everyone's use to 
keep a bigger picture of our goals for the cycle in their heads, to aid 
in their work and review outside of runways.

Themes
======
With that, I'll briefly mention the themes we came up with for the cycle:

* Compute nodes capable of upgrading to, and existing with, nested 
resource providers for multiple GPU types

* Multi-cell operational enhancements: resilience to "down" or 
poor-performing cells and cross-cell instance migration

* Volume-backed user experience and API hardening: ability to specify 
volume type during boot-from-volume, detach/attach of root volume, and 
volume-backed rebuild

These are the user-visible features and functionality we aim to deliver, 
and we'll keep tabs on these efforts throughout the cycle to make sure 
they keep making progress.

Placement
=========
As usual, we had a lot of discussions on placement-related topics, so 
I'll try to highlight the main things that stand out to me. Please see 
the "Placement" section of our PTG etherpad for all the details and 
additional topics we discussed.

We discussed the regression in behavior that happened when we removed 
the Aggregate[Core|Ram|Disk]Filters from the scheduler filters -- these 
filters allowed operators to set overcommit allocation ratios per 
aggregate instead of per host. We agreed on the importance of restoring 
this functionality and hashed out a concrete plan, with two specs needed 
to move forward:

https://review.openstack.org/552105
https://review.openstack.org/544683
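
For anyone less familiar with the allocation ratio concept, here's a 
tiny, purely illustrative sketch (not code from either spec) of how an 
overcommit ratio turns physical inventory into the capacity the scheduler 
sees -- the per-aggregate vs per-host question is about where operators 
get to set that ratio:

# Illustrative only -- not code from either spec. An overcommit allocation
# ratio scales the physical inventory (minus anything reserved) into the
# capacity that scheduling and quota accounting work against.

def schedulable_capacity(total, reserved, allocation_ratio):
    """Capacity exposed for one resource class, placement-style."""
    return int((total - reserved) * allocation_ratio)

# A host with 32 physical cores, 2 reserved for the host OS, in an
# aggregate where operators want 8x CPU overcommit:
print(schedulable_capacity(total=32, reserved=2, allocation_ratio=8.0))  # 240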

The other standout discussions were around the placement extraction and 
closing the gaps in nested resource providers. For the placement 
extraction, we are focusing on full support of an upgrade from 
integrated placement => extracted placement, including assisting with 
making sure deployment tools like OpenStack-Ansible and TripleO are able 
to support the upgrade. For closing the gaps in nested resource 
providers, there are many parts to it that are documented on the 
aforementioned PTG etherpads. By closing the gaps with nested resource 
providers, we'll open the door for being able to support minimum 
bandwidth scheduling as well.
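
As a rough mental model for nested resource providers (the names and 
numbers below are purely illustrative, not placement API payloads or 
agreed resource classes), a compute node becomes the root of a tree whose 
child providers carry their own inventory, which is what makes per-device 
accounting for things like multiple GPU types -- and eventually bandwidth 
-- possible:

# Purely illustrative data: a compute node as a root resource provider
# with child providers for two physical GPUs and a NIC physical function,
# each carrying its own inventory.
compute_node_tree = {
    "name": "compute-1",
    "inventories": {"VCPU": 64, "MEMORY_MB": 262144},
    "children": [
        {"name": "compute-1_pgpu_0", "inventories": {"VGPU": 4}},
        {"name": "compute-1_pgpu_1", "inventories": {"VGPU": 8}},
        # Hypothetical resource class name for NIC bandwidth; the exact
        # class for minimum bandwidth scheduling was still being settled.
        {"name": "compute-1_eth0_pf",
         "inventories": {"NET_BANDWIDTH_KB": 10000000}},
    ],
}

def total_inventory(provider, resource_class):
    """Sum a resource class across a provider and all of its descendants."""
    total = provider.get("inventories", {}).get(resource_class, 0)
    for child in provider.get("children", []):
        total += total_inventory(child, resource_class)
    return total

print(total_inventory(compute_node_tree, "VGPU"))  # 12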

Cells
=====
On cells, the main discussions were around resilience to "down" and 
poor-performing cells, and around cross-cell migration. Please see the "Cells" 
section of our PTG etherpad for all the details and additional topics we 
discussed.

Some multi-cell resiliency work was completed in Rocky and more is in 
progress for Stein, so there are no surprises there. Based on 
discussion at the PTG, there's enough info to start work on the 
cross-cell migration functionality.

"Cross-project Day"
===================
We had all of our cross-project discussions with the Cinder, Cyborg, 
Neutron, and Ironic teams on Thursday. Please see the "Thursday" section 
of our etherpad for details of all topics discussed.

With the Cinder team, we went over plans for volume-backed rebuild, 
improving the boot-from-volume experience by accepting volume type, and 
detach/attach of root volumes. We agreed to move forward with these 
features. This was also the start of a discussion around transfer of 
ownership of resources (volume/instance/port/etc) from one project/user 
to another. The current idea is to develop a tool that will do the 
database surgery correctly, instead of trying to implement ownership 
transfer APIs in each service and orchestrating them. More details on 
that are to come.

With the Cyborg team, we focused on solidifying what Nova changes would 
be needed to integrate with Cyborg, and the Cyborg team is going to 
propose a Nova spec for those changes:

https://etherpad.openstack.org/p/stein-ptg.cyborg-nova-new

With the Neutron team, we had a demo of minimum bandwidth scheduling to 
kick things off. A link to a writeup about the demo is available here if 
you missed it:

http://lists.openstack.org/pipermail/openstack-dev/2018-September/134957.html

Afterward, we discussed heterogeneous (linuxbridge, ovs, etc) Neutron 
ML2 backends and the current inability to migrate an instance between 
them -- we thought we had gained the ability by way of leveraging the 
newest Neutron port binding API but it turns out there are still some 
gaps. We discussed minimum bandwidth scheduling and ownership transfer 
of a port. We quickly realized transferring a port from a non-shared 
network would be really complicated, so we suspect the more realistic 
use case for someone wanting to transfer an instance and its ports to 
another project/user would involve an instance on a shared network, in 
which case the transfer is just database surgery.

With the Ironic team, we discussed the problem of Nova/Ironic power sync 
wherein an instance that has been powered off via the Nova API is turned 
on via IPMI by a maintenance engineer to perform maintenance, and is then 
turned back off by Nova, disrupting that maintenance. We agreed that 
Ironic will leverage Nova's external events API to notify Nova when a 
node has been powered on and should be considered ON, so that Nova will 
not try to shut it down. We also discussed the need for failure domains 
for nova-computes controlling subsets of Ironic nodes and agreed to 
implement this as config options in the [ironic] section specifying an 
Ironic partition key and a list of services with which a node should peer 
(see the illustrative sketch at the end of this section). We also 
discussed whether to deprecate the ComputeCapabilities filter and agreed 
to deprecate it. But, judging from the ML thread about it:

http://lists.openstack.org/pipermail/openstack-dev/2018-September/135059.html

I'm not sure it's appropriate to deprecate yet.
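
Going back to the failure domain idea, here's a rough sketch of what the 
[ironic] options could look like using oslo.config. The option names are 
my own placeholders, not necessarily what the eventual change will use:

# Rough sketch only: option names are placeholders, not necessarily what
# the eventual implementation will use. A partition key identifies the
# failure domain of Ironic nodes this nova-compute service manages, and
# the peer list names the other nova-compute services for that partition.
from oslo_config import cfg

ironic_group = cfg.OptGroup('ironic')

failure_domain_opts = [
    cfg.StrOpt('partition_key',
               help='Partition (failure domain) of Ironic nodes that this '
                    'nova-compute service is responsible for.'),
    cfg.ListOpt('peer_list',
                default=[],
                help='Hostnames of the other nova-compute services managing '
                     'the same partition, with which this service peers.'),
]

def register_opts(conf):
    # e.g. register_opts(cfg.CONF) at service startup
    conf.register_group(ironic_group)
    conf.register_opts(failure_domain_opts, group=ironic_group)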

Tech Debt and Miscellaneous Topic Day
=====================================
Friday was our day for discussing topics from the "Tech Debt/Project 
Management" and "Miscellaneous" sections of our PTG etherpad. Please see 
the etherpad for all the notes taken on those discussions.

The major topics that stand out to me were the proposal to move to 
Keystone unified limits, and filling in gaps in openstackclient (OSC) to 
support newer compute API microversions and achieve parity with 
novaclient (for example, migrations and boot-from-volume work differently 
between openstackclient and novaclient). OSC support is coming up on the 
ML now as a prospective community-wide goal for the T series:

http://lists.openstack.org/pipermail/openstack-dev/2018-September/135107.html

On unified limits, we agreed that we should migrate to them, noting that 
I think we must wait for a few more oslo.limit changes to land first. We 
agreed to drop per-user limits on resources when we move to unified 
limits; this means we will no longer allow setting a limit on a resource 
for a particular user -- only for a particular project. Note that with 
unified limits, we will gain the ability to have a strict two-level 
project hierarchy, which should address the reasons why admins leverage 
per-user limits today. We will signal the upcoming change 
with a 'nova-status upgrade check'. And we're freezing all other 
quota-related features until we integrate with unified limits.
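
To make the per-user limits change concrete, here's a toy, self-contained 
sketch (not Nova's actual code or the upgrade check framework) of the 
kind of signal a 'nova-status upgrade check' could give -- warn the 
operator if any quota limits are scoped to a user rather than only to a 
project:

# Toy sketch only -- not Nova's upgrade check framework. Warns when any
# quota limit is scoped to a user, since unified limits are per-project
# only and such limits will no longer be honored.

# Each record: (project_id, user_id or None, resource, limit)
EXAMPLE_LIMITS = [
    ("proj-a", None, "instances", 20),      # project-scoped: fine
    ("proj-a", "user-1", "instances", 5),   # user-scoped: will be dropped
    ("proj-b", None, "cores", 100),
]

def check_per_user_limits(limits):
    """Return a (code, message) pair in the spirit of an upgrade check."""
    user_scoped = [l for l in limits if l[1] is not None]
    if user_scoped:
        return ("WARNING",
                "%d per-user quota limit(s) found; these are not supported "
                "with unified limits and will be ignored." % len(user_scoped))
    return ("SUCCESS", "No per-user quota limits found.")

print(check_per_user_limits(EXAMPLE_LIMITS))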

I think that's about it for the "summary", which has gotten pretty long 
here. Find us on IRC in #openstack-nova or email us on this mailing list 
with the [nova] tag if you have any questions about any discussions from 
the PTG.

Cheers,
-melanie



