[openstack-dev] [ironic] PTG Summary
juliaashleykreger at gmail.com
Thu Mar 8 21:07:47 UTC 2018
The Ironic PTG Summary - The blur(b) from the East
In an effort to provide visibility and awareness of all the things
related to Ironic, I've typed up a summary below. I've tried to keep
this fairly generalized with enough context and convey action items or
the instances of consensus where applicable. It goes without saying
that the week went by as a complete blur. Because we had to abruptly
change our schedule, some finely detailed topics were missed. A special
thanks to Ruby Loo for taking some time to proofread this for me.
From our retrospective:
As seems to be the norm with retrospectives, we did bring up a number
of issues that slowed us down, hindered us, or hindered the ability to
move faster. A great deal of this revolved around specifications, and
the perceptions that tend to occur.
* Jroll will bring up for discussion if we can update the theme for
rendered specs documentation to highlight that the specs are points in
time references for design, and are not final documentation.
* TheJulia will revise our specification template to attempt to be
more clear about *why* we are asking the questions, and to suggest,
but not require, proof-of-concept code.
After our retrospective, we spoke about things that can improve our
velocity. This sort of discussion tends to always come up, and focused
on community cultural aspects of revising/helping land code. The
conclusion we quickly came to was that communication or context of the
contributor is required. One of the points raised, that we did not get
to, was that we should listen to contributor's perceptions, which
really goes back to communication.
As time went on, we shifted gears to a high level status of ironic,
and there are some items to take away:
* Inspector, at a high level, could use some additional work and
contributors. Virtual media boot support would be helpful, and we may
look at breaking some portions out and moving them into ironic.
Additional High Availability work may or may not be needed; that is
entirely to be determined.
* Ironic-ui presently has no active contributors, but is stable. Major
risk right now is a breaking change coming from Horizon, which was
also discussed earlier in the week with Horizon. Will add testing such
that horizon's gate triggers ironic-ui testing and raises visibility
to breaking changes.
* Ironic itself got a lot completed this past cycle, and we should
expect quite a bit in the coming cycle in terms of clean-up from
deprecation.
* Networking-baremetal received a good portion of work this cycle due
to routed networks support. \o/
* Networking-generic-switch seems to be in a fairly stable state at
this point. Some trunk awareness has been added, as well as some new
switches and bug fixes.
* Bifrost has low activity, but at the same time we're seeing new
contributors fix issues or improve things, which is a good sign.
* Sushy got authentication and introspection support added this cycle.
We discussed that we may want to consider supporting RAID (in terms of
client actions), as well as composable hardware.
After statuses, we shifted into discussing the future.
We started the entire discussion of the future with a visioning
exercise to help frame the future, so we were all using the same words
and had the same scope in mind when discussing the future of Ironic.
One thing worth noting is that there was a lot of alignment up front, but we
sometimes were just using slightly different words or concepts. Taking
a little more time to reconcile those differences allowed us to relate
additional words to the same meaning. Truly this set the stage for all
of the other topics, and gave us the common reference point to grasp
if what we were talking about made sense. Expect Jroll to send out an
email to the mailing list to summarize this further, and from this
initial discussion we will likely draft a formal vision document that
will allow us to continue having the same reference point for
discussions. Maybe one day your light bulb will be provisioned with
In terms of the future, we again returned to the concept of breaking
up deployments into a series of steps. Without going deep into detail,
this is a very large piece of functionality and would help solve many
problems and desires that exist today, especially where some operators
wish things like deploy-time raid, or to flash firmware as part of the
baremetal node provisioning process. This work is also influenced by
traits, because traits can map to actions that need to be performed
automatically. In the end, we agreed to take a small step, and iterate
from there. Specifically adding a deploy steps framework and splitting
our current deploy process into two logical steps.
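
As a rough illustration of the deploy steps idea (this is not the design
that was agreed, which still needs to be written up), the current deploy
could be modelled as an ordered list of named steps that run by priority.
Every name below (DeployStep, run_deploy, write_image, prepare_boot) is
hypothetical and exists only for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DeployStep:
    """One unit of deploy work. All names here are hypothetical."""
    name: str
    priority: int  # higher runs first; 0 means "skip"
    action: Callable[[Dict], None]


def write_image(node: Dict) -> None:
    # Stand-in for writing the image to the node's disk.
    node["log"].append("write_image")


def prepare_boot(node: Dict) -> None:
    # Stand-in for configuring the node to boot the deployed image.
    node["log"].append("prepare_boot")


def run_deploy(node: Dict, steps: List[DeployStep]) -> List[str]:
    """Run the enabled steps in descending priority order."""
    enabled = [s for s in steps if s.priority > 0]
    for step in sorted(enabled, key=lambda s: s.priority, reverse=True):
        step.action(node)
    return node["log"]


# The "two logical steps" split of today's deploy, as an assumption.
DEFAULT_STEPS = [
    DeployStep("prepare_boot", 50, prepare_boot),
    DeployStep("write_image", 100, write_image),
]
```

With a shape like this, something such as deploy-time RAID or a firmware
flash would just be another step with its own priority, which is what
makes the small first iteration useful to build on.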
"Location awareness" as we are calling it, or possibly better stated
as "conductor to node affinity" is a topic that we again revisited.
This is important as many operators desire a single pane of glass for
their entire baremetal fleet. Some operators would like to isolate
conductors per rack, per data center, per customer, per sets of data
centers in close proximity, per continent. This is a common problem of
creating failure domains that match the environment and have optimal
performance, as opposed to deploying across a point-to-point circuit.
We agreed this is something that we need to make happen, as it is a
very common operational problem. We may further work on this in the
future to provide a scoring and anti-affinity system, but right now
our focus is hard affinity to clusters of conductors.
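
To make "hard affinity" concrete, here is a minimal sketch under the
assumption that both nodes and conductors carry a single group label and
that a node may only be managed by conductors with exactly the same
label. The function name and the label values are hypothetical; no
scoring or anti-affinity is attempted.

```python
from typing import Dict, List


def eligible_conductors(node_group: str, conductors: Dict[str, str]) -> List[str]:
    """Return the conductor hosts whose group exactly matches the node's.

    `conductors` maps a conductor hostname to its group label. This is a
    hypothetical data shape, not Ironic's actual model.
    """
    return sorted(
        host for host, group in conductors.items() if group == node_group
    )


# Example fleet: two conductors serve rack1, one serves rack2.
FLEET = {"cond-a": "rack1", "cond-b": "rack2", "cond-c": "rack1"}
```

A node in a group that no conductor serves gets an empty list, which is
the point of hard affinity: the node is never handed to a conductor
across a point-to-point circuit just because one happens to be free.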
We revisited the topic of graphical consoles, which is one of the
topics we made very little progress on this past cycle. This is
difficult because there are several different ways to architect and
develop this functionality. And then we realized libvirt offers a VNC
server that we could very easily leverage as someone was kind enough
to stub it out already in our virtualized BMC services. TL;DR We are
going to pick this back up and try to reach consensus and try to land
the framework this cycle. We know we are likely to want to land a
distinct driver interface to support this since our existing console
is designed around serial console usage. We also know we can use our
virtualized BMC for testing.
Going beyond the qcow2
Next up on the topic list was partitioning and getting beyond our
current use case. Where this topic came from was several different
topics with the same central theme of "what if I don't want to make or
deploy a qcow2 file?" Historically, we have resisted this as it is
more a pattern of pet management. The reality in that consensus is
that we agree pets will happen, and have to be able to happen.
So what does this mean for the average user? Not much right now. We
still have some things to think about, such as what would be a good
way to tell Ironic about disk partitioning? And then what to do with
the contents of the image?
This also had an interesting shift of "what if we supported a generic
TFTP interface?" which gets us towards things like where we can
configure new switches and non-traditional devices upon power-up. The
possibilities are somewhat endless. The surprising thing... there was
no disagreement. We even had consensus that this sort of thing would
be useful, and be a step towards deploying that light bulb with
* Jroll to look at ways we could allow for user definable partition
data, and what that might look like.
Security Interfaces/TPM modules!
As a topic which the PTL mainly drove, there was a general consensus
amongst the room that it could be useful, but that a greater
understanding was required. Our consensus may be in part due to
learning that Thursday we would likely have fewer attendees due to the
incoming weather system.
As a follow-up note: I was approached by the Cyborg PTL to see if
there could be an opportunity to collaborate. At present we are unsure
given our use model and workflow, but there may be some more
discussions in the future.
* TheJulia needs to sit down and write a spec and popularize the concept.
One of our goals during the past cycle was to create a set of
reference architecture documentation. We didn't quite get to that
work. One of the advantages to being on the same page and having the
same words was that we quickly determined the challenge that deterred
us, which was a lack of clear scope. After some discussion, we were
able to refine the scope into smaller logical blocks that would build
upon each other to help convey how things fit together and how they
can be fit together differently. This also raised some greater
visibility on where we have an opportunity to improve our developer
* dtantsur and jroll to begin creating high level control plane
diagrams covering API -> RabbitMQ -> Conductor communication. With
this we intend to iterate.
* Sambetts to update the development docs on how the networking works
to help developers troubleshoot.
Cleaning - Firmware versions
One topic that has come up a number of times is how to manage firmware
efficiently and effectively, since there are substantial barriers to
entry, which are compounded by differing vendors and hardware fleets.
The ask from the community is to help spur further discussion to lower
the bar to entry and make it easier to apply firmware updates to
hardware nodes, in a way that also provides some level of visibility
in that the process has completed, or that the latest firmware has
been applied. This is further complicated by the fact that
some operators have expressed a need to apply firmware updates prior to
the deploy being completed. Ultimately this takes us down the road of
the deploy steps topic, since we should then be able to determine and
handle cases where a BIOS image needs to be soft reset for in-band
firmware updates, or turned off prior to out-of-band firmware updates.
* TheJulia is going to try and spur further community discussion in
regards to standardization in two weeks.
Cleaning - Burn-in
As part of discussing cleaning changes, we discussed supporting a
"burn-in" mode where hardware could be left to run load, memory, or
other tests for a period of time. We did not have consensus on a
generic solution, other than that this should likely involve
clean-steps that we already have, and maybe another entry point into
cleaning. Since we didn't really have consensus on use cases, we
decided the logical thing was to write them down, and then go from
* Community members to document varying burn-in use cases for
hardware, as they may vary based upon industry.
* Community to try and come up with a couple example clean-steps.
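
As one possible starting point for the example clean-steps, here is a
very small sketch of a memory burn-in step. It is only an assumption of
what such a step could look like: the function names, the dict fields,
and the default duration are illustrative, and a real burn-in would use
a proper stress tool rather than a Python loop.

```python
import time
from typing import Dict, List


def burnin_memory(duration_s: float, block_bytes: int = 1 << 20) -> int:
    """Write and verify a byte pattern until the duration has elapsed.

    Returns the number of passes completed (always at least one).
    Raises RuntimeError if a read-back does not match what was written.
    """
    expected = b"\xaa" * block_bytes
    deadline = time.monotonic() + duration_s
    passes = 0
    while True:
        block = bytearray(expected)  # allocate and fill a fresh block
        if block != expected:
            raise RuntimeError("memory pattern mismatch during burn-in")
        passes += 1
        if time.monotonic() >= deadline:
            return passes


def get_burnin_steps() -> List[Dict]:
    """Describe the step; field names are illustrative only.

    Priority 0 is used here to mean "only run when explicitly requested".
    """
    return [
        {"step": "burnin_memory", "priority": 0, "args": {"duration_s": 3600}},
    ]
```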
Planning for Rocky
Rocky Planning was performed in record time, but in part because the
ironic community performs the initial on-site prioritization via a
poll of the room and then five votes per person. This is in turn
transformed into our cycle priorities, which are posted to Gerrit.
This can be viewed at https://review.openstack.org/#/c/550174/. We
must stress that due to the notice of the need to vacate the building
by 2PM on Thursday, we chose to move up our planning session and not
everyone was able to attend. Thoughts, feedback, and needs should be
communicated via the posted change set for community participants that
were not present during the planning process.
Due to the abrupt schedule changes and need of contributors to begin
re-booking flights, we lost some of our time for a little while on
Thursday. As a result, we were largely unable to discuss
miscellaneous items like communication flow changes, changing the
default boot mode, and alternative DHCP servers. None of which is
Towards the end of Thursday, Ironic was able to convene with the Nova
team to discuss topics of interest.
One of the common asks, especially in large scale deployments, or
where things such as RAID are needed, is for the requester to be able
to define what the machine should look like. This is not a simple need
to fulfill given that it is not a "cloudy" behavior. We discussed
various options, and a spec is going to be proposed that will allow
nova to pass a pointer of some sort to ironic that would define the
disk and file system profile for the node.
* jroll to write a spec on how to allow user supplied partition/raid
configuration to reach Ironic.
Virt driver interactions
There are several cases where the ironic virt driver in nova does
things that are not ideal. Also, because of long-lived processes,
hardware is not immediately freed to the resource tracker, which can
lead to issues. There is a mutual desire to fix these issues, and the
work largely revolves around ensuring that we provide information correctly
and set the state for the resources such that the virt driver does not
encounter issues with placement.
* jroll to fix the nova-compute crash upon start-up if there are
issues talking to Ironic such that it raises NotReadyYet.
API Version Negotiation
One of the biggest headaches that Ironic has encountered as time has
gone on is the compliance with testing scenarios within the framework,
as right now we force a very particular testing order. One of the
things that makes this difficult is that we include a pin with our
current API client usage (in nova's ironic virt driver) that locks the
version the client speaks, and if the server does not speak it, the
nova-compute process fails to start.
The solution to this is to begin replacing the use of
python-ironicclient in the virt driver with REST statements that
explicitly state the API version they need to operate. This provides
greater visibility, and maximum flexibility moving forward.
* TheJulia to work on updating the virt driver to use REST calls
instead of the client library.
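
For illustration, a REST call that states its own API version might look
like the sketch below. It only builds the request and does not send it.
The X-OpenStack-Ironic-API-Version header is how Ironic's REST API
accepts a requested microversion; the helper name, the endpoint, and the
version value are assumptions made for this example, and the real virt
driver work would presumably go through an authenticated session rather
than raw urllib.

```python
import urllib.request

API_VERSION_HEADER = "X-OpenStack-Ironic-API-Version"


def build_node_request(endpoint: str, node_uuid: str,
                       api_version: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET for one node at an explicit version.

    `build_node_request` is a hypothetical helper; each call site states
    the API version it needs instead of relying on a client-wide pin.
    """
    url = f"{endpoint.rstrip('/')}/v1/nodes/{node_uuid}"
    headers = {
        API_VERSION_HEADER: api_version,
        "X-Auth-Token": token,
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers, method="GET")
```

Because the version travels with each request, a call that only needs an
older API version keeps working against an older server, instead of the
whole nova-compute process failing to start over a single pinned value.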
And then there was Friday
On Friday, the available team members discussed the bios_interface and
how to handle the getting/setting of properties considering what was
proposed is very different from how we presently handle RAID.
Additionally, the team discussed the deprecation of VIF port IDs being
stored in the port's (and portgroup's) extra field. This was
originally how networking information was conveyed from Nova to
Ironic, but that mechanism was replaced with the vif-attach and
vif-detach APIs in a previous cycle.
Additional items (from discussions outside ironic sessions):
* Ironic to attempt to implement a CI job triggered in the horizon CI
check queue to allow for some level of integration testing to help
provide feedback if a horizon change breaks ironic-ui. This is the
first logical step to support the future of plugins with Horizon, and
lowers effort on our end to maintain. Please blame TheJulia if there
are any questions.
* Scientific SIG will be creating use cases for ironic as RFEs. Things
like kexec from deployment ramdisk for extremely time consuming
reboots, and pure booting from a ramdisk.
* Scientific SIG will also be exploring things like BFV based cluster
booting, so we may receive some interest and RFEs as a result.
Joking about deploying a light bulb aside, it was a positive
experience to talk about our mutual shared visions and really reach
the same page. While last week was a complete blur, this is an
exciting time, now onward to seize it!