[openstack-dev] [nova] Stein PTG summary
melanie witt
melwittt at gmail.com
Wed Sep 26 22:10:46 UTC 2018
Hello everybody,
I've written up a high level summary of the discussions we had at the
PTG -- please feel free to reply to this thread to fill in anything I've
missed.
We used our PTG etherpad:
https://etherpad.openstack.org/p/nova-ptg-stein
as an agenda and each topic we discussed was filled in with agreements,
todos, and action items during the discussion. Please check out the
etherpad to find notes relevant to your topics of interest, and reach
out to us on IRC in #openstack-nova, on this mailing list with the
[nova] tag, or by email to me if you have any questions.
Now, onto the high level summary:
Rocky retrospective
===================
We began Wednesday morning with a retro on the Rocky cycle and captured
notes on this etherpad:
https://etherpad.openstack.org/p/nova-rocky-retrospective
The runways review process was seen as overall positive and helped get
some blueprint implementations merged that had languished in previous
cycles. We agreed to continue with the runways process as-is in Stein
and use it for approved blueprints. We did note that we could do better
at queuing important approved work into runways, such as
placement-related efforts that were not added to runways last cycle.
We discussed whether or not to move the spec freeze deadline back to
milestone 1 (we used milestone 2 in Rocky). I have an action item to dig
into whether or not the late-breaking regressions we found at RC time:
https://etherpad.openstack.org/p/nova-rocky-release-candidate-todo
were related to the later spec freeze at milestone 2. The question we
want to answer is: did a later spec freeze lead to implementations
landing later and resulting in the late detection of regressions at
release candidate time?
Finally, we discussed a lot of things around project management,
end-to-end themes for a cycle, and people generally not feeling they had
clarity throughout the cycle about which efforts and blueprints were
most important, aside from runways. We got a lot of work done in Rocky,
but not as much of it materialized into user-facing features and
improvements as it did in Queens. Last cycle, we had thought runways
would capture what is a priority at any given time, but looking back, we
determined it would be helpful if we still had over-arching
goals/efforts/features written down for people to refer to throughout
the cycle. We dove deeper into that discussion on Friday during the hour
before lunch, where we came up with user-facing themes we aim to
accomplish in the Stein cycle:
https://etherpad.openstack.org/p/nova-ptg-stein-priorities
Note that these are _not_ meant to preempt anything in runways; they are
just 1) for my use as a project manager and 2) for everyone's use to
keep a bigger picture of our goals for the cycle in their heads, to aid
in their work and review outside of runways.
Themes
======
With that, I'll briefly mention the themes we came up with for the cycle:
* Compute nodes capable of upgrading to, and operating with, nested
resource providers for multiple GPU types
* Multi-cell operational enhancements: resilience to "down" or
poor-performing cells and cross-cell instance migration
* Volume-backed user experience and API hardening: ability to specify
volume type during boot-from-volume, detach/attach of root volume, and
volume-backed rebuild
These are the user-visible features and functionality we aim to deliver
and we'll keep tabs on these efforts throughout the cycle to keep them
making progress.
Placement
=========
As usual, we had a lot of discussions on placement-related topics, so
I'll try to highlight the main things that stand out to me. Please see
the "Placement" section of our PTG etherpad for all the details and
additional topics we discussed.
We discussed the regression in behavior that happened when we removed
the Aggregate[Core|Ram|Disk]Filters from the scheduler filters -- these
filters allowed operators to set overcommit allocation ratios per
aggregate instead of per host. We agreed on the importance of restoring
this functionality and hashed out a concrete plan, with two specs needed
to move forward:
https://review.openstack.org/552105
https://review.openstack.org/544683
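For reference, the knob these specs deal with is the overcommit
(allocation) ratio that placement tracks per resource provider. As a
rough illustration of the mechanism (not the spec implementation -- the
endpoint and fields are from the placement API reference, while the
endpoint URL, token, and provider UUID are placeholders), bumping the
VCPU allocation ratio on one compute node's provider looks something
like:

    import requests

    PLACEMENT = "http://placement.example.com/placement"  # placeholder
    HEADERS = {
        "X-Auth-Token": "<keystone-token>",        # placeholder token
        "OpenStack-API-Version": "placement 1.26",
    }
    RP = "<compute-node-rp-uuid>"                  # placeholder UUID

    # Fetch current inventories; the response includes the provider
    # generation, which must be echoed back to avoid racing updates.
    data = requests.get(
        "%s/resource_providers/%s/inventories" % (PLACEMENT, RP),
        headers=HEADERS).json()

    vcpu = data["inventories"]["VCPU"]
    vcpu["allocation_ratio"] = 16.0  # the per-provider overcommit knob
    vcpu["resource_provider_generation"] = data[
        "resource_provider_generation"]

    requests.put(
        "%s/resource_providers/%s/inventories/VCPU" % (PLACEMENT, RP),
        headers=HEADERS, json=vcpu).raise_for_status()

The specs are about letting operators express this per aggregate rather
than per provider/host, as the removed filters allowed.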
The other standout discussions were around the placement extraction and
closing the gaps in nested resource providers. For the placement
extraction, we are focusing on full support of an upgrade from
integrated placement => extracted placement, including assisting with
making sure deployment tools like OpenStack-Ansible and TripleO are able
to support the upgrade. For closing the gaps in nested resource
providers, there are many parts to it that are documented on the
aforementioned PTG etherpads. By closing the gaps with nested resource
providers, we'll open the door for being able to support minimum
bandwidth scheduling as well.
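For anyone unfamiliar with the term, "nested" here means modeling
resources like GPUs or NICs as child resource providers under the
compute node provider in the placement tree. A minimal sketch of
creating such a child, reusing the placeholder session from the
previous example (parent_provider_uuid requires placement microversion
>= 1.14):

    # Create a child provider representing a GPU under the compute node.
    requests.post(
        "%s/resource_providers" % PLACEMENT,
        headers=HEADERS,
        json={
            "name": "compute1_gpu_0",      # placeholder name
            "parent_provider_uuid": RP,    # nests under the compute node
        }).raise_for_status()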
Cells
=====
On cells, the main discussions were around resilience to "down" or
poor-performing cells, and around cross-cell migration. Please see the "Cells"
section of our PTG etherpad for all the details and additional topics we
discussed.
Some multi-cell resiliency work was completed in Rocky and continues
in Stein, so there are no surprises there. Based on
discussion at the PTG, there's enough info to start work on the
cross-cell migration functionality.
"Cross-project Day"
===================
We had all of our cross-project discussions with the Cinder, Cyborg,
Neutron, and Ironic teams on Thursday. Please see the "Thursday" section
of our etherpad for details of all topics discussed.
With the Cinder team, we went over plans for volume-backed rebuild,
improving the boot-from-volume experience by accepting volume type, and
detach/attach of root volumes. We agreed to move forward with these
features. This was also the start of a discussion around transfer of
ownership of resources (volume/instance/port/etc) from one project/user
to another. The current idea is to develop a tool that will do the
database surgery correctly, instead of trying to implement ownership
transfer APIs in each service and orchestrating them. More details on
that are to come.
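To make "database surgery" a bit more concrete: for an instance, the
owner lives in columns like project_id and user_id on the instances
table, and a transfer tool would rewrite those and all related records
consistently. A deliberately simplified sketch, assuming direct
SQLAlchemy access to a cell database (illustrative only -- this is not
the proposed tool):

    from sqlalchemy import create_engine, text

    # Placeholder connection URL; a real tool would read nova's config.
    engine = create_engine("mysql+pymysql://nova:<password>@db/nova")

    with engine.begin() as conn:  # one transaction: all-or-nothing
        conn.execute(
            text("UPDATE instances "
                 "SET project_id = :proj, user_id = :user "
                 "WHERE uuid = :uuid AND deleted = 0"),
            {"proj": "<new-project-id>", "user": "<new-user-id>",
             "uuid": "<instance-uuid>"})

A real tool would also have to fix up related rows, plus the other
services' databases for volumes and ports, which is exactly why one
well-tested tool is preferable to ad hoc SQL.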
With the Cyborg team, we focused on solidifying what Nova changes would
be needed to integrate with Cyborg, and the Cyborg team is going to
propose a Nova spec for those changes:
https://etherpad.openstack.org/p/stein-ptg.cyborg-nova-new
With the Neutron team, we had a demo of minimum bandwidth scheduling to
kick things off. A link to a writeup about the demo is available here if
you missed it:
http://lists.openstack.org/pipermail/openstack-dev/2018-September/134957.html
Afterward, we discussed heterogeneous (linuxbridge, ovs, etc) Neutron
ML2 backends and the current inability to migrate an instance between
them -- we thought we had gained the ability by way of leveraging the
newest Neutron port binding API, but it turns out there are still some
gaps. We discussed minimum bandwidth scheduling and ownership transfer
of a port. We quickly realized transferring a port from a non-shared
network would be really complicated, so we suspect the more realistic
use case for someone wanting to transfer an instance and its ports to
another project/user would involve an instance on a shared network, in
which case the transfer is just database surgery.
With the Ironic team, we discussed the Nova/Ironic power sync problem,
wherein an instance that has been powered off via the Nova API is
powered on via IPMI by a maintenance engineer to perform maintenance,
only to be turned back off by Nova, disrupting the maintenance. We
agreed that Ironic will
leverage Nova's external events API to notify Nova when a node has been
powered on and should be considered ON so that Nova will not try to shut
it down. We also discussed the need for failure domains for
nova-computes controlling subsets of Ironic nodes and agreed to
implement it as a config option in the [ironic] section to specify an
Ironic partition key and a list of services with which a node should peer.
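On the power sync piece, the external events API itself already exists
(POST /os-server-external-events); what's needed is a new event for
power state changes. A hedged sketch of such a notification -- the
"power-update" event name and "POWER_ON" tag are assumptions, since the
spec hadn't been written yet:

    # Hypothetical notification from Ironic to Nova via the existing
    # os-server-external-events API.
    requests.post(
        "http://nova.example.com/v2.1/os-server-external-events",
        headers={"X-Auth-Token": "<keystone-token>"},  # placeholders
        json={"events": [{
            "name": "power-update",          # assumed new event name
            "tag": "POWER_ON",               # assumed tag
            "server_uuid": "<instance-uuid>",
        }]})

On the failure domains piece, the config-driven approach could look
something like the following nova.conf sketch; the option names are
placeholders until the change actually merges:

    [ironic]
    # Hypothetical option names, for illustration only:
    partition_key = rack42          # subset of Ironic nodes managed here
    peer_list = compute1,compute2   # nova-computes sharing the subset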
We also discussed deprecating the ComputeCapabilities filter and agreed
to do so. But, judging from the ML thread about it:
http://lists.openstack.org/pipermail/openstack-dev/2018-September/135059.html
I'm not sure it's appropriate to deprecate it yet.
Tech Debt and Miscellaneous Topic Day
=====================================
Friday was our day for discussing topics from the "Tech Debt/Project
Management" and "Miscellaneous" sections of our PTG etherpad. Please see
the etherpad for all the notes taken on those discussions.
The major topics that stand out to me were the proposal to move to
Keystone unified limits and filling in gaps in openstackclient (OSC) for
support of newer compute API microversions and achieving parity with
novaclient. For example, migrations and boot-from-volume work
differently between openstackclient and novaclient. Support for OSC is
coming up on the ML now as a prospective community-wide goal for the T series:
http://lists.openstack.org/pipermail/openstack-dev/2018-September/135107.html
On unified limits, we agreed we should migrate to them, though I think
we must wait for a few more oslo.limit changes to land first. We agreed
to drop per-user limits on resources when we move to
unified limits. This means that we will no longer allow setting a limit
on a resource for a particular user -- only for a particular project.
Note that with unified limits, we will gain the ability to have a strict
two-level hierarchy, which should address the reasons admins leverage
per-user limits at present. We will signal the upcoming change
with a 'nova-status upgrade check'. And we're freezing all other
quota-related features until we integrate with unified limits.
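For intuition on the strict two-level hierarchy point: a child project's
usage counts against its parent's limit, so a cap on the parent covers
the whole two-level tree. A toy sketch of that enforcement logic --
purely illustrative, not oslo.limit's actual API:

    # Hypothetical per-project limits/usage for "cores".
    limits = {"parent": 20, "child-a": 8, "child-b": 10}
    usage = {"parent": 2, "child-a": 4, "child-b": 6}

    def can_allocate(project, requested, parent="parent"):
        # The project's own limit must not be exceeded...
        if usage[project] + requested > limits[project]:
            return False
        # ...and the parent's limit caps the combined usage of the
        # parent and all of its children.
        return sum(usage.values()) + requested <= limits[parent]

    print(can_allocate("child-a", 2))  # True: fits both caps
    print(can_allocate("child-b", 9))  # False: exceeds child-b's limit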
I think that's about it for the "summary", which has gotten pretty long.
Find us on IRC in #openstack-nova or email us on this mailing list
with the [nova] tag if you have any questions about any discussions from
the PTG.
Cheers,
-melanie