A couple of weeks ago (February 25-26), the Ironic community convened its first mid-cycle in quite a long time at the invitation and encouragement of CERN. A special thanks goes to Arne Wiebalck for organizing the gathering. We spent two days discussing, learning, sharing, and working together to form a path forward. As long-time contributors, some of us were able to bring context on not just how, but why. Other community members brought questions and requirements, while one of the hardware vendors brought their context and needs. And CERN was kind enough to show us how our work matters and makes a difference, which was the most inspiring part of all!

Special thanks goes to Dmitry Tantsur, Riccardo Pittau, and Iury Gregory for helping me keep momentum moving forward on this summary.

---------------------------------------

Deploy Steps
==========

We discussed issues related to the deploy step workflow, concerns about process efficiency, and the path forward. The issue in question was the advance validation of deploy steps when some of the steps come from the ironic-python-agent ramdisk and are not reflected in the server code. Creating the whole list of steps for validation and execution requires information from the ramdisk, but it is only available once the ramdisk is booted. We discussed the following alternatives:

* Start the ramdisk before deploy step execution. This was ruled out for the following reasons:
** Some steps need to be executed out-of-band before the ramdisk is running. This is already an issue with iDRAC clean steps.
** The first deploy step validation happens in the node validation API, when the ramdisk is clearly not running.
* Use cached deploy steps from the previous cleaning run. This was ruled out because:
** Some deployments disable automated cleaning.
** The deploy step list can change in between, e.g. because of hardware changes or other external input.
* Accept that we cannot provide early validation of deploy steps and validate them as we go. This involves booting the ramdisk as one of the deploy steps (no special handling), with only out-of-band steps executed before that (ignoring any in-band steps with a higher priority).

We decided to go with the third option (a rough sketch of the resulting ordering appears at the end of this section).

In a side discussion we decided to file an RFE for a driver_info flag preventing booting the ramdisk during manual cleaning. Solving the cleaning issues completely probably requires making booting the ramdisk a separate clean step, similarly to the deploy steps above. No final plan has been made for it, but we have more clarity than we did before.
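As a minimal, illustrative sketch of the agreed ordering (the step dictionaries follow the usual interface/step/priority shape, but the step names here are hypothetical): out-of-band steps that outrank the step which boots the ramdisk run first, and in-band agent steps can only be discovered and validated once the agent is up.

```python
# Illustrative only: out-of-band steps with a priority higher than the
# "boot the ramdisk" step run before it; in-band agent steps are validated
# later, as we go. Step names are hypothetical.
OUT_OF_BAND_STEPS = [
    {"interface": "management", "step": "example_bios_config", "priority": 120},
    {"interface": "deploy", "step": "boot_agent_ramdisk", "priority": 100},
    {"interface": "management", "step": "example_post_boot_tweak", "priority": 40},
]


def steps_before_ramdisk(steps, boot_step="boot_agent_ramdisk"):
    """Return the out-of-band steps that run before the ramdisk boots."""
    ordered = sorted(steps, key=lambda step: step["priority"], reverse=True)
    before = []
    for step in ordered:
        if step["step"] == boot_step:
            break
        before.append(step)
    return before

# steps_before_ramdisk(OUT_OF_BAND_STEPS)
# -> [{"interface": "management", "step": "example_bios_config", "priority": 120}]
```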
Security
======

Firmware Management
--------------------------------

We entered a discussion about creating a possible "meta step". After some back and forth, we reached a consensus that it is likely not possible given the different vendor parameters and requirements.

During this topic, we also reached the point of discussing changes to "active node" configuration, as it relates in large part to firmware updates, is necessary for larger fleet management, and eventually even for attestation process integration. The consensus largely revolved around leveraging rescue as the way to help enable some of this process, though this is only a theory. Hopefully we'll have operator feedback in the next year on this subject and can make more informed decisions. By then, we should have deploy steps in a form that one doesn't need to be a Python developer to leverage, and the team should have bandwidth to explore this further with operators.

Attestation
---------------

This is a topic that Julia has been raising for a while, because there is a logical and legitimate reason to implement some sort of integration with an attestation platform to perform system measurement and attestation during the cleaning and deployment processes, in order to help identify whether machines have been tampered with. In our case, remote attestation is likely the way forward, and inspiration can come from looking at Keylime (a TPM-based boot attestation and runtime integrity measurement solution, and most importantly, open source). We'll need an implementation covering at least clean/deploy steps, to be able to run and validate TPM measurements and fail the deployment if attestation fails. We still need to figure out the actual impact on firmware upgrades, how to safely determine whether a re-measurement is valid, and when to trust that a measurement is actually valid.

Ironic's next step is to begin talking to the Keylime folks in more depth. Also, one of our contributors, Kaifeng, who read our notes etherpad, indicated that he is working in the same area, so we may see some interesting and fruitful collaboration, because ultimately we all have some of the same needs.

Agent Tokens
------------------

Agent tokens was possibly the quickest topic that we visited, with Dmitry suggesting we just needed to add a unit test and merge the code. To further secure things, we need the agent ramdisk to begin using TLS.

TODO: Julia is to send out an email to the mailing list to serve as notice to operators that ironic intends to break backwards IPA compatibility next cycle by removing support for agents that do not support agent tokens.

NOTE: As we type/refine this summary for distribution, the agent token code has largely merged, and should be completely merged before the end of the current development cycle.

TLS in virtual media
---------------------------

In order to secure agent token use, we need to secure their transmission to the ironic-python-agent when commands are issued to the agent from the conductor. Ultimately we're hoping to begin work on this soon in order to better secure interactions and communications with machines in remote "edge" environments. An RFE has been filed to automatically generate certificates and exchange them with the ramdisk: https://storyboard.openstack.org/#!/story/2007214. Implementing it may require downstream consumers to update their USA export control certification.

FIPS 140-2
---------------

This was a late-addition topic, raised largely for the purposes of community visibility. In short, we know from some recently fixed bugs that operators are starting to deploy Ironic in environments and on hosts configured for FIPS 140-2 operating mode, which is, in short, a much stricter cryptography configuration. We ought to make sure that we don't have any other surprises waiting for us, so the call went out for someone to review the standard at some point and sanity check Ironic and its components.

Post-IPMI universe
===============

The decline of IPMI is one that we, as a community, need to plan ahead for, as some things become a little more difficult.

Discovery
-------------

Node discovery, as a feature, is anticipated to become a little more complicated. While we should still be able to identify a BMC address, that address may be the in-band communications channel address once vendors support the Redfish host interface specification.

This spurred discussion of alternatives, and one of the items raised was possibly supporting the discovery of BMCs using SSDP and UPnP. This raises an interesting possibility in that the UUID of the BMC is retrievable through these means. It seems logical for us to one day consider the detection and enrollment of machines using an operator tool of some sort. This functionality is defined by the Redfish standard and, as with everything in Redfish, is optional. The DMTF-provided Redfish library contains an example implementation: https://github.com/DMTF/python-redfish-library/blob/master/src/redfish/disco....
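As a rough illustration of what such discovery involves, here is a minimal sketch of an SSDP M-SEARCH for Redfish services (the search target is the one the Redfish specification defines for its REST service). This is illustrative only, not how the DMTF library or any future ironic tooling actually implements it.

```python
import socket

# Minimal SSDP discovery sketch for Redfish services (illustrative only).
MSEARCH = (
    "M-SEARCH * HTTP/1.1\r\n"
    "HOST: 239.255.255.250:1900\r\n"
    'MAN: "ssdp:discover"\r\n'
    "ST: urn:dmtf-org:service:redfish-rest:1\r\n"
    "MX: 2\r\n"
    "\r\n"
)


def discover_redfish_services(timeout=5):
    """Return a mapping of responder IP address -> SSDP response headers."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.settimeout(timeout)
    sock.sendto(MSEARCH.encode("ascii"), ("239.255.255.250", 1900))
    found = {}
    try:
        while True:
            data, (address, _port) = sock.recvfrom(4096)
            headers = {}
            for line in data.decode("ascii", "ignore").splitlines()[1:]:
                if ":" in line:
                    key, value = line.split(":", 1)
                    headers[key.strip().upper()] = value.strip()
            # The USN header typically embeds the service UUID, e.g.
            # "uuid:<uuid>::urn:dmtf-org:service:redfish-rest:1".
            found[address] = headers
    except socket.timeout:
        pass
    return found
```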
Using System ID/GUID
-------------------------------

The discovery topic spurred another question: should we use the system UUID or GUID identifier in addition to, or instead of, a MAC address on the chassis? The Ironic Inspector folks have been considering additional or even pluggable matching for a long time. The system UUID can be discovered in-band via SMBIOS before launching inspection/discovery. Supporting this would largely be a feature for matching a physical machine, but some of our code requires network information anyway, so it may not bring a huge benefit upfront beyond trying to match BMC<->host.

IPMI cipher changes
----------------------------

The CERN folks were kind enough to raise an issue that has brought them some headaches recently: some BMC vendors have changed cipher logic and keying, so they have had to put workarounds and modified ipmitool builds on their systems. As far as we're aware as a community, there is really nothing we can directly do to help them remove this workaround, but ultimately this headache may cause them to begin looking at Redfish and drive some development on serial console support for Redfish.

Making inspector data a time series
===========================

One of the challenges in data centers is identifying when the underlying hardware changes. When a disk is replaced, its serial number changes, and if that disk is in a running system, the machine would traditionally have needed to be re-inspected in order for its recorded information to be updated. We added the ability to manually execute inspection and submit this data in the last development cycle, so if inspection data had any time-series nature, changes could be identified, new serial numbers recorded, and so on.

The overwhelming response during the discussion was "Yes please!", in that such a feature would help a number of cases. Of course, what we quickly reached was disagreement over meaning. It turns out the purpose is more about auditing and identifying changes, so even if there are only two copies, the latest and the previous inspection data, the differences could be identified by external tooling (a rough sketch of such a comparison follows). A spec document or some sort of written MVP will ultimately be required, but the overall concept was submitted to Outreachy.
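As a minimal sketch of what such external tooling could do, assuming the usual ironic-python-agent inventory layout (a top-level "inventory" dict with a "disks" list); the field names below are the common ones, but treat this as illustrative:

```python
def changed_disk_serials(previous, latest):
    """Compare two inspection payloads and report disks whose serial changed."""
    def serials(payload):
        disks = payload.get("inventory", {}).get("disks", [])
        return {disk.get("name"): disk.get("serial") for disk in disks}

    before = serials(previous)
    after = serials(latest)
    return {
        name: {"previous": before.get(name), "latest": after.get(name)}
        for name in set(before) | set(after)
        if before.get(name) != after.get(name)
    }

# Example (values invented): a replaced /dev/sdb would show up as
# {"/dev/sdb": {"previous": "SERIAL-OLD", "latest": "SERIAL-NEW"}}
```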
DHCP-less deploy
==============

In regard to the DHCP-less deploy specification (https://review.opendev.org/#/c/672780/), we touched upon several areas of the specification. We settled on Nova's network metadata format (as implemented by Glean) as the API format for this feature; a sketch of its shape follows this section. Ilya has voiced concerns that it will tie us to Glean more closely than we may want.

We also discussed the scalability of rebuilding ISO images per node. The CERN folks rightfully expressed concern that a parallel deployment of several hundred nodes can put a significant load on conductors, especially in terms of disk space.

* In the case of hardware that has more than one usable virtual media slot, we can keep the base ISO intact and use a second slot (e.g. virtual USB) to provide configuration.
* The only other option is documenting this as a limitation of our virtual media implementation.

To get rid of the MAC address requirement for DHCP-less virtual media deployments, we determined that it will be necessary to return to sending the node UUID and other configuration to the ramdisk via boot parameters. This way we can avoid the requirement for MAC addresses, although this has to be navigated carefully and with operator feedback.

An additional concern, beyond parallel deployment load, was "rescue" support. The consensus seemed to be that we put giant security warnings in the documentation to signal the security risk of the ramdisk being exposed to a potentially untrusted network. The agent token work _does_ significantly help improve operational security in these cases, but operators must be cognizant of the risks and potentially consider that rescue may be something they do not want to use under normal circumstances with network edge deployments.
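For reference, Nova's network metadata (the format Glean consumes) looks roughly like the following; all addresses and identifiers here are illustrative, and the exact fields the DHCP-less work will rely on were still being settled in the specification:

```python
# Roughly the shape of Nova's network_data.json as consumed by Glean
# (values illustrative only).
network_data = {
    "links": [
        {
            "id": "port-0",
            "type": "phy",
            "ethernet_mac_address": "52:54:00:12:34:56",
            "mtu": 1500,
        },
    ],
    "networks": [
        {
            "id": "network0",
            "type": "ipv4",
            "link": "port-0",
            "ip_address": "192.0.2.10",
            "netmask": "255.255.255.0",
            "routes": [
                {"network": "0.0.0.0", "netmask": "0.0.0.0", "gateway": "192.0.2.1"},
            ],
        },
    ],
    "services": [
        {"type": "dns", "address": "192.0.2.53"},
    ],
}
```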
OOB DHCP-less deploy
===================

We briefly touched on out-of-band DHCP-less deployment. This is HTTPBoot asserted through to the BMC with sufficient configuration details, ultimately looking a lot like DHCP-less deployments. Interest does seem to exist in this topic, but we can revisit it once the base DHCP-less deployment work is done and, hopefully, an ironic contributor has access to hardware where this is an explicit feature of the BMC.

CI Improvements and Changes
========================

The upstream CI, and ways to make it more stable and more efficient, is a recurring discussion topic, and not only at meetups. The impact of the CI on day-to-day work is very high, which is why we took our time to talk about different aspects and do a proper analysis of the different jobs involved.

The discussion started with the proposal of reducing the usage of the ironic-python-agent images based on TinyCoreLinux (the so-called tinyipa images) and relying more on images built using diskimage-builder (DIB), specifically with CentOS 8 as the base. This proposal is based on the fact that DIB-built images are what we recommend for production usage, while tinyIPA images have known issues on real hardware. Their only real benefit is a much smaller memory footprint (roughly 400MiB versus 2GiB for a CentOS 8 image). We agreed to switch all jobs that use one testing VM to pre-built CentOS 8 images. This covers all jobs except for the standalone, multi-node, and grenade (upgrade testing) ones.

While reviewing the current list of jobs, we realized that there is a lot of duplication between them. Essentially, most of the image type and deploy interface combinations are already tested in the standalone job. As these combinations are orthogonal to the management technology, we can use Redfish instead of IPMI for some of the tests. We decided to split the ironic-standalone job, since it covers a lot of tempest scenarios and has a high failure rate. The idea is to have one job testing software RAID, manual cleaning, and rescue, while the remaining image type and deploy interface combinations will be split into two jobs (one using IPMI and the other using Redfish).

One other point we reached some consensus on was that the more exotic, non-OpenStack-focused CI jobs are likely best implemented using Bifrost as opposed to Tempest.

Third Party CI/Driver Requirements
-----------------------------------------------

The question has been raised within the community of whether we should reconsider Third Party CI requirements. For those who are unaware, it has been a requirement for drivers merged into ironic to have third-party operated CI. Operating Third Party CI helps exercise drivers to ensure that the driver code is functional, and provides the community information in the event that a breaking change or enhancement is made. The community recognizes that third party CI is difficult, and can be hard at times to keep working as the entire code base and its dependencies evolve. We discussed why some of these things are difficult, and what we, and the larger community, can do to try and make it easier. As one would expect, a few questions arose:

Q: Do we consider "supported = False" and keeping drivers in-tree until we know they no longer work?
A: The consensus was that this is acceptable, and that the community can keep unit tests working and the code looking sane.

Q: Do we consider such drivers essentially frozen?
A: The consensus is that drivers without third party CI will be functionally frozen unless changes are required to the driver for the project to move forward.

Q: How do we provide visibility into the state of the driver?
A: The current thought is to return a field in the /v1/drivers list to signal whether the driver has upstream testing. The thought is to use a field named "Tested" or something similar, as opposed to the internal name in the driver interface, which is "supported".

Q: Will we make it easier to merge a driver?
A: The consensus was that we basically want to see it work at least once before we merge drivers. It was pointed out that this helped provide visibility with some of the recently proposed driver code, which was originally developed against a much older version of ironic.

Q: Do third party CI systems need to run on every patch?
A: Consensus is no! A number of paths in the repository can be ignored. In other words, there is no reason to trigger an integration test of Third Party CI for a documentation change, an update to a release note, or a change to another vendor's drivers.

In summary, drivers without Third Party CI are "use at your own risk", and removal is moving towards a model of "don't be brutal". This leaves us with a number of tasks in the coming months:

* Update the contributor documentation with the questions and answers above.
* Author an explicit exception path for the process of bringing CI back up as it pertains to drivers, essentially focusing on communication between the ironic community and driver maintainers.
* Author a policy stating that unsupported drivers shall be removed immediately upon the community being made aware that a driver is no longer functional and lacks a clear/easy fix or path to resolution.
* Solicit pain points from driver maintainers who have recently set up or presently maintain Third Party CI, try to aggregate the data, and maybe find some ways of improving the situation.

"Fishy Politics": Adapting sushy for Redfish spec versus implementation reality
============================================================

Everyone's favorite topic is how implementations differ from specification documents. In part, the community is increasingly seeing cases where different vendors have behavior oddities in their Redfish implementations.
We discussed various examples, such as https://storyboard.openstack.org/#!/story/2007071, and the current issues we're observing with two different vendors around setting the machine boot mode and next boot device at the same time. For some of these issues, the idea of having some sort of Redfish flavor indicator was suggested, so that an appropriate plugin could be loaded to handle larger differences such as major field name differences, or endpoint behavior differences like using "PUT" instead of "PATCH". This has not yet been explored but will likely need to be moving forward.

Another item for the ironic team to be mindful of moving forward is that newer UEFI-specific boot setting fields have been created, which we may want to explore using. This could give us a finer level of granularity of control, but at the same time may not be really usable across vendors' hardware, due to the data in the field and how or what to map it back to.

Kexec (or "Faster booting, yes?")
=========================

This topic concerns using the kexec mechanism instead of rebooting from the ramdisk to the final instance. Additionally, if it is acceptable to run the agent on the user instance, it can be used for rebuilding and tearing down an instance, potentially saving numerous reboots and "Power On Self Tests" in the process.

We have one potential issue: with multi-tenant networking there is a possibility of a race between kexec and switching from the provisioning network to the tenant network(s). In a normal deploy we avoid it by powering the node off first, then flipping the networks, then booting the final instance (on success). There is no such opportunity with kexec, meaning that this feature will be restricted to flat network cases.

The group expressed lots of interest in providing this feature as an option for advanced operators, in other words "those who run large scale computing farms". Julia proposed making a demo of super-fast deployment using fast-track and kexec as a goal moving forward, and this received lots of positive feedback.
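For context, in this model the agent would roughly load the instance kernel and ramdisk from the freshly written image and jump straight into it, skipping firmware initialization. A minimal sketch using the standard kexec-tools CLI follows; the paths and kernel command line are illustrative, and this is not how ironic-python-agent currently implements anything.

```python
import subprocess


def kexec_into_instance(kernel, initrd, cmdline):
    """Load the instance kernel with kexec and execute it, skipping POST."""
    subprocess.check_call([
        "kexec", "--load", kernel,
        "--initrd=" + initrd,
        "--command-line=" + cmdline,
    ])
    # Replaces the running kernel immediately; nothing after this line runs.
    subprocess.check_call(["kexec", "--exec"])

# Illustrative usage (paths and cmdline depend entirely on the written image):
# kexec_into_instance("/mnt/boot/vmlinuz", "/mnt/boot/initramfs.img",
#                     "root=UUID=1234-abcd ro")
```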
Partitioning, What is next?
====================

Dmitry noted that there seems to be a lot of interest from EU operators in supporting disk partitioning. This has been long sought, but with minimal consensus. We discussed some possible cases for how this could be supported, and we reached the conclusion that the basis is largely just supporting the Linux Logical Volume Manager in the simplest possible configuration. At the same time, the point was raised that parity basically means some mix of software RAID through LVM and UEFI boot. We soon realized we needed more information! So the decision was reached to start by creating a poll, with questions in three different languages, to try and identify community requirements using some simple and feasible scenarios, such as LVM on a single disk, LVM on multiple disks, and LVM plus image extraction on top of the LVM.

The partitioning topic was actually very productive, in that we covered a number of different topics that we were not otherwise planning to explicitly cover.

Network booting
----------------------

One of them was why we do not simply use network booting. The point was made that network booting is our legacy and is fundamental for iSCSI-based boot and ramdisk booting (such as the deployment ramdisk). During this dive into ironic's history, we did reach an important point of consensus: ironic should switch the default boot mode, as previously planned, while still keeping at least one scenario test running in CI which uses network booting.

Stated operator wants in terms of Partitioning
-------------------------------------------------------------

Dmitry was able to provide some insight into what the Russian operator community is seeking from ironic, and Julia confirmed she had heard similar wants from public cloud operators wanting to offer Bare Metal as a Service. Largely these wants revolve around LVM capability in the most basic possible scenarios, such as a single disk or a partition image with LVM, or even software RAID with partition images. What has likely stalled these discussions in the past is an immediate focus on the more complex partitioning scenarios sought by some operators, whose complexity of requirements bogged the conversations down.

Traits/Scheduling/Flavor Explosion
===========================

Arne from CERN raised this topic to bring greater awareness. CERN presently has more than 100 flavors representing their hardware fleet, as each physical machine type gets its own flavor. This has resulted in pain from the lack of flavor-specific quotas. What may help in this area is resource class based quotas, but presently the state of that work is unknown. The bottom line: a user does not have clarity into their resource usage. The question then shifted to being able to report utilization, since the current quota model is based on cores/RAM/instances but not resource_class consumption as a whole. The question is largely "How many am I allowed to create? [before I eat someone else's capacity]".

https://github.com/stackhpc/os-capacity was raised as a reporting tool that may help with these sorts of situations for bare metal cloud operators. Another point raised in this discussion was the inability to tie consumer and project ID to consumed resources, but it turns out the allocation list functionality in Placement now provides this. At the end of this discussion, there was consensus that this should be brought back onto the Nova community's radar.
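For context on how resource classes relate to flavors today (and what a resource class based quota would have to look at): a bare metal flavor typically zeroes out the standard classes and requests a single custom class. The sketch below is illustrative, with a hypothetical flavor and resource class name.

```python
# Illustrative only: a typical ironic-backed flavor requests one custom
# resource class and zeroes out the standard ones. Names are hypothetical.
FLAVOR_EXTRA_SPECS = {
    "resources:CUSTOM_BAREMETAL_GOLD": "1",
    "resources:VCPU": "0",
    "resources:MEMORY_MB": "0",
    "resources:DISK_GB": "0",
}


def requested_resource_classes(extra_specs):
    """Return the resource classes (and amounts) a flavor would consume."""
    prefix = "resources:"
    return {
        key[len(prefix):]: int(value)
        for key, value in extra_specs.items()
        if key.startswith(prefix) and int(value) > 0
    }

# requested_resource_classes(FLAVOR_EXTRA_SPECS)
# -> {'CUSTOM_BAREMETAL_GOLD': 1}
```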
Machine Burn-in
=============

Burning in machines is a regular topic that comes up, and it looks like we're getting much closer to being able to support such functionality. Part of the reason for discussing this was to determine how to move forward with what organizations like CERN can offer the community.

There are two use cases. The most important is to ensure that hardware does not fail. That doesn't seem like a "cloudy" thing to have to worry about, but when you're building your cloud, you want to make sure that half your hardware isn't suddenly going to fail as soon as you put a workload on it. The second use case is nearly as important: ensuring that you are obtaining the performance you expect from the hardware.

This brought a bit of discussion, because there are fundamentally two different paths that could be taken. The first is to leverage the inspector, whereas the second is to use clean steps. Both have useful possible configurations, but largely there is no sophisticated collection of performance data today. That being said, the consensus seemed to be that actual data collection was less of a problem than the flexibility to invoke burn-in as part of cleaning and preparing a machine for deployment. In other words, the consensus seemed to be that clean steps would be ideal for community adoption and code acceptance.

IPv6/Dual Stack
============

TL;DR: we need to remove the ip_version setting field. This is mostly a matter of time to sift through the PXE code and determine the code paths that need to be taken, e.g. for IPv6 we would likely only want to signal flags for it if the machine is in UEFI mode. The dhcp-less work should provide some of the API-side capabilities this will really need in terms of interaction with Neutron.

Graphical Console
==============

The question was raised: "what would it take to finally get this moving forward again?" There is initial interface code, two proofs of concept, and it should be relatively straightforward to implement Redfish support, OEM capability dependent. The answer was functionally "someone needs to focus on this for a couple of months, keep it rebased, and engage the community". The community expressed an absolute willingness to mentor.

Software RAID - Specifying devices
===========================

We briefly discussed RFE https://storyboard.openstack.org/#!/story/2006369, which proposes a way to define which physical devices participate in software RAID. Similar functionality already exists in the RAID configuration format for hardware RAID, but software RAID currently always spans all hard drives, no matter how many. The RFE proposes re-using the same dictionary format as used for root device hints in a "physical_disks" field of the RAID configuration (a rough sketch follows). This idea was accepted by the audience, with Arne proposing to extend the supported hints with a new "type" hint with values like "rotational", "nvme" or "ssd".
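As a rough sketch, and only a sketch, of what the proposed format might look like (the hint values are illustrative, and Arne's "type" hint does not exist yet):

```python
# Hypothetical software RAID configuration using the proposed "physical_disks"
# device hints; today, software RAID simply spans all disks.
target_raid_config = {
    "logical_disks": [
        {
            "size_gb": "MAX",
            "raid_level": "1",
            "controller": "software",
            "physical_disks": [
                {"size": ">= 500"},   # root-device-hint style dictionaries
                {"size": ">= 500"},
                # {"type": "nvme"},   # Arne's proposed, not-yet-existing hint
            ],
        },
    ],
}
```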
Stickers
======

Yes, we really did discuss next steps for stickers. We have ideas. Lots of ideas... and we are all very busy. So we shall see if we're able to make some awesome and fun stickers appear for the Berlin time frame.

On 3/16/20 16:25, Julia Kreger wrote:
<snip>
> Traits/Scheduling/Flavor Explosion
> ===========================
>
> Arne with CERN raised this topic to bring greater awareness. CERN presently has greater than 100 flavors representing their hardware fleet as each physical machine type gets its own flavor. This has resulted in pain from the lack of flavor specific quotas. What may help in this area is for Resource Class based quotas, but presently the state of that work is unknown. The bottom line: A user does not have clarity into their resource usage. The question then shifted to being able to report utilization since the current quota model is based on cores/RAM/instances but not resource_class consumption as a whole.
>
> The question largely being "How many am I allowed to create? [before I eat someone else's capacity]".
>
> https://github.com/stackhpc/os-capacity was raised as a reporting tool that may help with these sorts of situations with bare metal cloud operators. Another point raised in this discussion was the lack of being able to tie consumer and project ID to consumed resources, but it turns out the allocation list functionality in Placement now has this functionality.
>
> In the end of this discussion, there was consensus that this should be brought back to the Nova community radar.
<snip>

FYI, work is in progress to add the ability to have resource class based quota limits as part of the larger effort to add support for unified limits in nova:

https://review.opendev.org/#/q/topic:bp/unified-limits-nova+(status:open+OR+...)

Specifically, this work-in-progress patch will extract resource classes from a flavor and use them during quota limit enforcement:

https://review.opendev.org/615180

Cheers,
-melanie