[ironic][ptg] Summary of discussions/happenings related to ironic

Julia Kreger juliaashleykreger at gmail.com
Wed Nov 13 19:35:23 UTC 2019


Overall, there was quite a bit of interest in Ironic. We had great
attendance for the Project Update, Rico Lin’s Heat/Ironic integration
presentation, the demonstration of dhcp-less virtual media boot, the
forum discussion on snapshot support for bare metal machines, and
more! We also learned there are some very large bare metal clouds in
China, even larger than the clouds we typically talk about when we
discuss scale issues. As such, I think it would behoove the ironic
community and OpenStack in general to be mindful of hyper-scale. These
are not clouds with hundreds of compute nodes, but bare metal clouds
containing thousands to tens of thousands of physical bare metal
machines.

So in no particular order, below is an overview of the sessions,
discussions, and commentary with additional status where applicable.

My apologies now since this is over 4,000 words in length.

Project Update
===========

The project update was fairly quick. I’ll try and record a video of it
sometime this week or next and post it online. Essentially Ironic’s
code addition/deletion levels are relatively stable cycle to cycle.
Our developer and Ironic operator commit contribution levels have
increased in Train over Stein, while the overall pool of contributors
has continued to decline cycle after cycle, although not dramatically.
I think the takeaway from this is that ironic has become more and more
stable, and that the problems being solved are in many cases
operator-specific needs or wants, or bug fixes for issues that only
arise in particular environment configurations.

The only real question that came out of the project update, if my
memory is correct, was “What does Metal^3 mean for Ironic?” along with
“Who is driving forward Metal^3?” The answers are fairly
straightforward: more ironic users, and more use cases from Metal^3
driving ironic to deploy machines. As for who is driving it forward,
it is largely being driven by Red Hat along with interested
communities and hardware vendors.

Quick, Solid, and Automatic OpenStack Bare-Metal Orchestration
==================================================

Rico Lin, the Heat PTL, proposed this talk promoting the possibility
of using ironic natively from Heat to deploy bare metal nodes,
specifically where configuration pass-through can’t be made generic or
somehow articulated through the compute API. One such case is where
someone wishes to utilize something like our “ramdisk”
deploy_interface, which does not deploy an image to the actual
physical disk. The only real question that I remember coming up was
why someone might want or need to do this, which again becomes more of
a question of doing things that are not quite “compute” API-ish. The
patches are available in gerrit[10].

Operator Feedback Session
=====================

The operator feedback[0] session was not as well attended, with maybe
20-25 people present. Overall the feeling of the room was that
“everything works”; however, there is a need and desire for
information and additional capabilities.

* Detailed driver support matrix
* Reduce the deployment times further
* Disk key rotation is an ask from operators for drives that claim
smart erase support but end up doing a drive wipe instead; in essence,
to reduce the overall time spent cleaning.
* Software RAID is needed at deploy time.
* IPA needs improved error handling. This may be a case where some of
the communication flow changes that had been previously discussed
could help, in that we could actively try to keep track of the agent a
little more. Additional discussion will definitely be required.
* There does still seem to be some interest in graphical console
support. A contributor has been revising patches, but I think it would
really help for a vendor to become involved here and support accessing
their graphical interface through such a method.
* Information and an information sharing location is needed. I’ve
reached out to the Foundation staff regarding the Bare Metal Logo
Program to see if we can find a common place that we can build/foster
moving forward. During this topic, one major pain point began to be
stressed: issues with the resource tracker at 3,500 bare metal nodes.
Privately, another operator reached out with the same issue at the
scale of tens of thousands of bare metal nodes. As such, this became a
topic during the PTG which gained further discussion. I’ll cover that
later.

Ironic – Snapshots?
===============

As a result of some public discussion of adding snapshot capability, I
proposed a forum session to discuss the topic[1] such that
requirements can be identified and the discussion can continue over
the next cycle.
I didn't expect the number of attendees to swell compared to the
operator feedback session. The discussion of requirements went back
and forth to ultimately define "what is a snapshot" in this case, and
"what should Ironic do?"

There was quite a bit of interaction in this session and the consensus
seemed to be the following:
* Don’t make it reliant on nova, as standalone users may want/need to use it.
* This could be a very powerful feature as an operator could ``adopt``
a machine into ironic and then ``snapshot`` it to capture the disk
contents.
* Block level only, and we can’t forget about capturing/storing a
content checksum.
* Capture the machine’s contents with the same expectation as we would
have for a VM, and upload them to someplace.

In order to make this happen in a fashion which will scale, the ironic
team will likely need to leverage application credentials.
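
As a very rough illustration (names, endpoints, and file paths are
made up, and none of this is implemented), a snapshot task
authenticating with an application credential and pushing a captured
image to Glance might look something like this with keystoneauth1:

    from keystoneauth1 import session
    from keystoneauth1.identity import v3

    # Hypothetical: authenticate with an application credential rather
    # than a full user token, so the upload can happen without nova.
    auth = v3.ApplicationCredential(
        auth_url='https://keystone.example.com/v3',
        application_credential_id='APP_CRED_ID',
        application_credential_secret='APP_CRED_SECRET')
    sess = session.Session(auth=auth)

    # Register the image record in Glance (v2 API)...
    image = sess.post(
        '/v2/images',
        endpoint_filter={'service_type': 'image'},
        json={'name': 'node-1234-snapshot',
              'disk_format': 'raw',
              'container_format': 'bare'}).json()

    # ...then stream the captured block-level contents; Glance stores
    # the checksum on upload.
    with open('/tmp/node-1234.raw', 'rb') as data:
        sess.put('/v2/images/%s/file' % image['id'],
                 endpoint_filter={'service_type': 'image'},
                 data=data,
                 headers={'Content-Type': 'application/octet-stream'})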

Ironically reeling in large bare metal deployment without PXE
==============================================

This was a talk submitted by Ilya Etingof, who unfortunately was
unable to make it to the summit. Special thanks go to both Ilya and
Richard Pioso for working together to make this demonstration happen.
The idea was to demonstrate where the ironic team sees the future of
deployment of machines on the edge using virtual media, and how
vendors would likely interact with that, since in some cases slightly
different mechanics may be required even if the BMCs all speak
Redfish, as is the case for a Dell iDRAC BMC.

The idea[2] is ultimately that the conductor would inject the
configuration information into the ISO image that is attached via
virtual media, negating the need for DHCP. We have videos posted that
allow those interested to see what this functionality looks like with
neutron[3] and without neutron[4].
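
For those curious what “configuration information” means here, the
rough shape (purely illustrative, with made-up values) is a
config-drive style network_data.json baked into the ISO so the ramdisk
can configure static addressing instead of asking DHCP:

    import json

    # Illustrative static network configuration the conductor could
    # embed in the virtual media image for the ramdisk to consume.
    network_data = {
        "links": [{"id": "port-0", "type": "phy",
                   "ethernet_mac_address": "52:54:00:12:34:56"}],
        "networks": [{"id": "provisioning", "link": "port-0",
                      "type": "ipv4", "ip_address": "192.0.2.10",
                      "netmask": "255.255.255.0",
                      "routes": [{"network": "0.0.0.0",
                                  "netmask": "0.0.0.0",
                                  "gateway": "192.0.2.1"}]}],
        "services": [{"type": "dns", "address": "192.0.2.53"}],
    }

    with open("network_data.json", "w") as f:
        json.dump(network_data, f)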

While the large audience was impressed, it seemed to be a general
surprise that Ironic already had virtual media support in some of its
drivers. This talk spurred quite a bit of conversation and hallway
track style discussion after the presentation concluded, which is
always an excellent sign.

Project Teams Gathering
===================

The ironic community PTG attendance was nothing short of excellent.
Thank you everyone who attended! At one point we had fifteen people
and a chair had to be pulled up to our table for a 16th person to join
us. At which point, we may have captured another table and created
confusion.

We did things a little differently this time around. Given some of the
unknowns, we did not create a strict schedule around the topics. We
simply went through and prioritized topics and tried to discuss them
each as thoroughly as possible until we had reached the conclusion or
a consensus on the topic.

The topics we discussed, with a few words on each, can be found in the
notes section of the PTG etherpad[5].

On-boarding
-----------------

We had three contributors attend a fairly brief on-boarding overview
of Ironic. Two of them were more developer focused, whereas the third
had more of an operator focus, looking to leverage ironic and see how
they can contribute back to the community.

BareMetal SIG - Next Steps
-------------------------------------

Arne Wiebalck and I provided an update covering current conversations,
where we see the SIG going, the Logo Program, the white paper, and
what the SIG should do beyond the white paper.

To start with the Logo Program: it largely seems that somewhere along
the way a message or document got lost, and that impacted the Logo
Program -> SIG feedback mechanism. I’m working with the OpenStack
Foundation to fix that and get communication going again. What largely
spurred that was that some vendors expressed interest in joining and
wanted additional information.

As for the white paper, contributions are welcome and progress is
being made again.

From a next steps standpoint, the question was raised of how we build
up an improved operator point of contact. There was some consensus
that we as a community should try to encourage at least one
contributor to attend the operations mid-cycles. This allows for a
somewhat shorter feedback loop with a different audience.

We also discussed knowledge sharing, or how to improve it. Included
with this is how we share best practices. I’ve put the question out to
folks at the foundation as to whether there is a better way as part of
the Logo Program, or if we should just use the Wiki. I think this will
be an open discussion topic in the coming weeks.

The final question that came up as part of the SIG is how to show
activity. I reached out to Amy on the UC regarding this, and it seems
the process is largely just to reach out to the current leaders of the
SIG, so it is critical that we keep that list up to date moving
forward.

Sensor Data/Metrics
---------------------------

The barrier between tenant-level information and operator-level
information makes this topic difficult.

The consensus among the group was that the capability to collect some
level of OOB sensor data should be present in all drivers, but there
is also a recognition that this comes at a cost and possible
performance impact. This performance impact question was mainly raised
with Redfish, because the data is scattered around the API such that
multiple API calls are required, and actively querying some data
points may even cause some interruption.

The middle ground in the discussion came to adding a capability of
somehow saying “collect power status and temperature every minute, fan
speeds every five minutes, drive/cpu health data maybe every 30
minutes”. I would be remiss if I didn't note that there was joking
about how this would in essence be a re-implementation of cron. What
this would end up looking like, we don’t know, but it would provide
operators the data resolution necessary for the failure risk/impact.
The analogy used was: “If the temperature sensor has risen to an alarm
level, either an AC failure or a thermal hot spot forming based upon
load in the data center, checking the sensor too often is just not
going to result in a human investigating that on the data center floor
any faster.”
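
To make that concrete, here is a purely hypothetical sketch of what
such per-sensor-group resolution could look like; nothing like this
exists in ironic today, and the names are invented:

    # Hypothetical per-group collection intervals, in seconds.
    SENSOR_COLLECTION_INTERVALS = {
        'power_state': 60,
        'temperature': 60,
        'fan_speed': 300,
        'drive_health': 1800,
        'cpu_health': 1800,
    }

    def due_for_collection(group, last_run, now):
        """Return True when a sensor group should be polled again."""
        interval = SENSOR_COLLECTION_INTERVALS.get(group, 300)
        return (now - last_run) >= interval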

Mainly, I believe this discussion stresses that the information is for
the operator of the bare metal and not to provide insight into a
tenant monitoring system; those activities should largely be done
within the operating system.

One question among the group was whether anyone was already using the
metrics framework built into ironic for metrics of ironic itself, to
see if we can re-use it. Well, it uses a plugin interface! In any
event, I've sent a post to the openstack-discuss mailing list seeking
usage information.


Node Retirement
-----------------------

This is a returning discussion from the last PTG, and in discussing
the topic we figured out where the discussion became derailed
previously. In essence, the desire was to mix this with the concept of
being able to take a node “out of service”. However, taking a node out
of service is an immediate state-related flag, whereas retiring might
happen as soon as the current tenant vacates the machine… possibly in
three to six months.

In other words, one is “do something or nothing now”, and the other is
“do something later when a particular state boundary is crossed”.
Trying to make one solution for both doesn’t exactly work.

Unanimous consensus among those present was that in order to provide
node retirement functionality, the logic should be similar to
maintenance/maintenance_reason: a top-level field in the node object
that would allow API queries for nodes slated for retirement, which
helps solve an operator workflow conundrum: “How do I know what is
slated for retirement but not yet vacated?”
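
As a sketch of the operator workflow this would enable (the field
name, query parameter, and microversion below are all hypothetical at
this point), a query against the bare metal API might look like:

    import requests

    # Hypothetical: list nodes flagged for retirement, similar to how
    # maintenance/maintenance_reason can be queried today.
    resp = requests.get(
        'https://ironic.example.com/v1/nodes',
        params={'retired': 'True',
                'fields': 'uuid,name,retirement_reason'},
        headers={'X-Auth-Token': 'TOKEN',
                 'X-OpenStack-Ironic-API-Version': '1.99'})

    for node in resp.json()['nodes']:
        print(node['uuid'], node.get('retirement_reason'))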

Going back to the “out of service” discussion, we reached consensus
that this is in essence a “user declarable failed state”, and as such
it should be handled only in the state machine as a present action,
not a future one. Should we implement out of service, we’ll need to
check the nova.virt.ironic code and related virt code to properly
handle nodes dropping from `ACTIVE` state, which could also be
problematic and would need to be API version guarded to prevent
machines from accidentally entering `ERROR` state if they are not
automatically recovered in nova.

Multi-tenancy
------------------

Lots of interest existed around making the API multi-tenant aware,
though the exact interactions and uses involved there are not exactly
clear. What IS clear is that providing such functionality will allow
operators to remove complication in their resource classes and
tenant-specific flavors, which are presently being used to enable
tenant-specific hardware pools. The added benefit of providing some
level of ironic API access for normally non-admin users is that it
would allow those tenants to have a clear understanding of their used
and available resources by directly asking ironic, whereas presently
they don’t have a good way to collect or understand that short of
asking the cloud operator when it comes to bare metal. Initial work
has been posted for this to gerrit[6].

In terms of how tenant resources would be shared, there was consensus
that the community should stress that new special-use tenants should
be created for collaborative efforts.

There was some discussion regarding explicitly dropping fields, such
as driver_info and possibly even driver_internal_info, for
non-privileged users that can see the nodes. This is definitely a
topic that requires more discussion, but it would solve operator
reporting and usage headaches.
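
A minimal sketch of the field-dropping idea, assuming the policy check
and the exact field list are still to be determined:

    # Fields that arguably should only be visible to privileged users.
    OPERATOR_ONLY_FIELDS = ('driver_info', 'driver_internal_info')

    def redact_node_for_tenant(node_dict, is_privileged):
        """Return a node representation safe for a non-admin tenant."""
        if is_privileged:
            return node_dict
        return {key: value for key, value in node_dict.items()
                if key not in OPERATOR_ONLY_FIELDS}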

Manual Cleaning Out-Of-Band
----------------------------------------

The point was raised that we unconditionally start the agent ramdisk
to perform manual cleaning. However, we should support a method for
out-of-band cleaning operations to be executed on their own, so the
bare metal node doesn’t need to be booted to a ramdisk.

The consensus seemed to be that we should consider a decorator, or a
change to an existing decorator, that allows the conductor to hold off
actually powering the node on for ramdisk boot unless or until a step
is reached that is not purely out of band.

In essence, fixing this allows a “fix_bmc” out-of-band clean step to
be executed first, without first trying to modify BMC settings to boot
the ramdisk, which would presently fail.
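
A rough sketch of the decorator idea, assuming a new flag (the
``requires_ramdisk`` name is invented here) on top of what a clean
step decorator already records:

    # Illustrative only; the real decorator lives in ironic.drivers.base
    # and the flag name/semantics have not been agreed on.
    def clean_step(priority, abortable=False, argsinfo=None,
                   requires_ramdisk=True):
        def decorator(func):
            func._is_clean_step = True
            func._clean_step_priority = priority
            func._clean_step_abortable = abortable
            func._clean_step_argsinfo = argsinfo
            func._clean_step_requires_ramdisk = requires_ramdisk
            return func
        return decorator

    class ExampleManagement(object):
        @clean_step(priority=100, requires_ramdisk=False)
        def fix_bmc(self, task):
            # Purely out-of-band: talk to the BMC, never boot the node.
            pass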

Scale issues
-----------------

A number of scaling issues exist in how nova and ironic interact,
specifically with the resource tracker and how inventory is updated
from ironic and loaded into nova. Largely this issue revolves around
the concept in nova that each ``nova-compute`` is a hypervisor. And
while one can run multiple ``nova-compute`` processes to serve as the
connection to ironic, the underlying lock in nova is at the level of
the compute node, not the individual bare metal node. This means that
as thousands of records are downloaded, synced, and copied into the
resource tracker, the compute process is essentially blocked from
other actions while this serialized job runs.

In a typical VM case, you may only have at most a couple hundred VMs
on a hypervisor, whereas with bare metal we’re potentially servicing
thousands of physical machines.

It should be noted that several large scale operators indicated during
the PTG that this was their pain point. Some of the contributors from
CERN sat down with us and the nova team to try and hammer out a
solution to this issue. A summary of that cross project session can be
found at line 212 in the PTG etherpad[5].

But there is another pain point that contributes to this performance
issue, and that is the speed at which records are returned by our API.
We’ve had some operators voice frustration with this before, and we
should at least be mindful of it and hopefully see if we can improve
record retrieval performance. In addition, if we supported some form
of bulk “GET” of nodes, it might be leveraged as opposed to a GET on
each node, one at a time, which is presently what occurs in the
nova-compute process.
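
For illustration with openstacksdk (assuming a ``clouds.yaml`` entry
named ``example``), the difference between the two access patterns
looks roughly like this:

    import openstack

    conn = openstack.connect(cloud='example')

    node_uuids = ['uuid-1', 'uuid-2']  # in practice, thousands of nodes

    # Roughly what happens today: one API round trip per node.
    nodes = [conn.baremetal.get_node(uuid) for uuid in node_uuids]

    # What a bulk retrieval would allow: a single (paginated) detailed
    # listing instead of thousands of individual GETs.
    nodes = list(conn.baremetal.nodes(details=True))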

Boot Mode Config
------------------------

Previously, when scheduling occurred with flavors and filters
appropriately set, if a machine was declared as supporting only one
boot mode, requests would only ever land on that node. Now with
traits, this is a bit different and unfortunately optional, without
logic to really guard how the setting is applied for an instance.

So in this case, if filters are such that a request for a Legacy boot
instance lands on a UEFI only machine, we’ll still try to deploy it.
In reality, we really should try and fail fast.

Ideally the solution here is that we consult the BMC through some sort
of get_supported_boot_modes method, and if we determine a mismatch
between what the node supports and what the requested instance needs,
based on the data we have, we fail the deploy.
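
A hedged sketch of what that fail-fast check might look like;
get_supported_boot_modes does not exist yet, and the way the requested
boot mode is derived is simplified here:

    from ironic.common import exception

    def validate_boot_mode(task):
        """Fail early when the requested boot mode cannot be honored."""
        requested = task.node.instance_info.get('deploy_boot_mode',
                                                'uefi')
        supported = task.driver.management.get_supported_boot_modes(task)
        if requested not in supported:
            raise exception.InvalidParameterValue(
                "Boot mode %(req)s requested for node %(node)s, but the "
                "BMC reports support for: %(sup)s" %
                {'req': requested, 'node': task.node.uuid,
                 'sup': ', '.join(supported)})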

This ultimately may require work in the nova.virt.ironic driver code
to identify the cause of the failure being an invalid configuration
and reporting that back, however it may not be fatal on another
machine.

Security of /heartbeat and /lookup endpoints
-----------------------------------------------------------

We had a discussion about adding some additional layers of security
mechanics around the /heartbeat and /lookup endpoints in ironic’s REST
API. These limited endpoints are documented as being unauthenticated,
so naturally some issues can arise from them, and we want to minimize
the vectors by which an attacker that has gained access to a
cleaning/provisioning/rescue network could possibly impersonate a
running ironic-python-agent. Conversely, the ironic-python-agent runs
in a similar fashion, intended to run on secure trusted networks
accessible only to the ironic-conductor. As such, we also want to add
some validation that the API request is from the same ironic
deployment that IPA is heart-beating to.

The solution to this is to introduce a limited-lifetime token that is
unique per node per deployment. It would be stored in RAM on the agent
and in node.driver_internal_info so it is available to the conductor.
It would be provided only once, via out-of-band means OR via the first
“lookup” of a node, and then only become accessible again during known
reboot steps.
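
A very rough sketch of the token handling on the conductor side (the
field and function names are illustrative; the in-progress
patches[7][8] define the real mechanics):

    import secrets

    def allocate_agent_token(node):
        """Generate a one-time token and stash it for the conductor."""
        token = secrets.token_urlsafe(32)
        info = node.driver_internal_info
        info['agent_secret_token'] = token
        node.driver_internal_info = info
        node.save()
        return token

    def validate_agent_token(node, presented_token):
        """Check a token presented on /heartbeat against the stored one."""
        stored = node.driver_internal_info.get('agent_secret_token')
        return stored is not None and secrets.compare_digest(
            stored, presented_token or '')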

Conceptually the introduction of tokens was well supported in the
discussions and there were zero objections to doing so. Some initial
patches[7][8] are under development to move this forward.

An additional item is to add IP address filtering capabilities to both
endpoints such that we only process a heartbeat/lookup if we know it
came from the correct IP address. An operator has written this feature
downstream and consensus was unanimous at the PTG that we should
accept this feature upstream. We should expect a patch for this
functionality to be posted soon.

Persistent Agents
------------------------

The use case behind persistent agents is “I want to kexec my way to
the agent ramdisk, or to the next operating system” and “I want to
have up-to-date inspection data.” We’ve already somewhat solved the
latter, but the former is a harder problem requiring the previously
mentioned endpoint security enhancements to be in place first. There
is some interest from CERN and some other large scale operators.

In other words, we should expect more of this from a bare metal fleet
operations point of view for some environments as we move forward.

“Managing hardware the Ironic way”
-------------------------------------------------

The question that spurred this discussion was “How do I provide a way
for my hardware manager to know what it might need to do by default?”
However, those defaults may differ between racks that serve different
purposes. “Rack 1, node0” may need a port set to Fibre Channel mode,
whereas “Rack 2, node1” may require it to be Ethernet.

This quickly also reaches the discussion of “What if I need different
firmware versions by default?”

This topic quickly evolved from there, and the idea that surfaced was
that we introduce a new field on the node object for the storage of
such data. Something like ``node.default_config``: a dictionary, sort
of like what a user provides for cleaning steps or deploy steps, that
provides argument values to be iterated through when in automated
cleaning mode, allowing operators to fill in configuration requirement
gaps for hardware managers.
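
To make the shape a little more concrete, here is a purely
hypothetical example of what such a field could hold, mirroring the
clean/deploy step format; none of these step names are real:

    # Hypothetical defaults an operator might attach to a node so
    # hardware managers know what to apply during automated cleaning.
    node.default_config = [
        {'interface': 'management',
         'step': 'set_port_mode',
         'args': {'port': 'nic.slot.1-1', 'mode': 'fibre_channel'}},
        {'interface': 'management',
         'step': 'update_firmware',
         'args': {'component': 'bmc', 'version': '4.22.00.00'}},
    ]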

Interestingly enough, even today we just had someone ask a similar
question in IRC.

This should ultimately be usable to assert desired/default firmware
from an administrative point of view. Adrianc (Mellanox) is going to
reach out to bdobb (DMTF) regarding the redfish PLDM firmware update
interface to see where this may go from here.

Edge computing working group session
----------------------------------------------------

The edge working group largely became a session to update everyone on
where Ironic was going and where we see things going in terms of
managing bare metal at the edge/far-edge. This included some in-depth
questions about dhcp-less deployment and related mechanics as well as
HTTPBoot’ing machines.

Supporting HTTPBoot definitely seems to be of interest to a number of
people, although at least after sharing my context, only five or six
people in attendance really seemed interested in ironic prioritizing
such functionality. The primary blocker, for those who are unaware, is
the lack of pre-built UEFI firmware images with which we could do
integration testing of IPv4 HTTPBoot. Functionally, ironic already
supports IPv6 HTTPBoot via DHCPv6 as part of our IPv6 support with
PXE/iPXE; however, we also don’t have an integration test job for that
code path for the same reason: pre-built UEFI firmware images lack the
built-in support.

More minor PTG topics
-------------------------------

* Smartnics - A desire to attach virtual ports to ironic bare metal
nodes with smartnics was raised. It seems that we don’t need to try to
create a port entry in ironic; we only need to track/signal and remove
the “vif” attachment to the node in general, as there is no physical
MAC address required for that virtual port in ironic. The constraint
that at least one MAC address would be required to identify the
machine is understood. If anyone sees an issue with this, please raise
it with adrianc.
* Metal^3 - Within the group attending the PTG, there was not much
interest in Metal^3 or using CRDs to manage bare metal resources with
ironic hidden behind the CRD. One factor related to this is the desire
to define more data to be passed through to ironic, which is not
presently supported in the CRD definition.

Stable Backports with Ironic's release model
==================================

I was pulled into a discussion with the TC and the Stable team
regarding frustrations that have been expressed within the ironic team
regarding stable back-porting of fixes, mainly for drivers. There is
consensus that it is okay for us as the ironic team to backport
drivery things when needed to support vendors, as long as they are not
breaking overall behavior contracts. This quickly leads us to needing
to also modify constraints for drivery things as well. Constraints
changes will continue to be evaluated on a case by case basis, but the
general consensus is there is full support to "do the right thing" for
ironic's users, vendors, and community. The key is making sure we are
on the same page and agreeing on what that right thing is. This is
where asynchronous communication can get us into trouble, and I would
highly encourage trying to start higher-bandwidth discussion when
these cases arise in the future. The key takeaway that we should
likely keep in mind is that policy is there for good reasons, but
policy is not and cannot be a crutch to prevent the right thing from
being done.

Additional items worth noting - Q1 Gatherings
===================================

There will be an operations mid-cycle at Bloomberg in London, January
7th-8th, 2020[9]. It would be good if at least one ironic contributor
could attend, as the operators group tends to be closer to the
physical bare metal, and it is a good chance to build mutual context
between developers and the operations people actually using our
software.

Additionally, we want to gauge the interest in having an ironic
mid-cycle in central Europe in Q1 of 2020. We need to identify the
number of contributors that would be interested in and able to attend,
since the next PTG will be in June. Please email me off-list if you're
interested in attending and I'll make a note of it, as we're still
having initial discussions.


And now I've reached a buffer under-run on words. If there are any
questions, just reply to the list.

-Julia

Links:

[0]: https://etherpad.openstack.org/p/PVG-ironic-operator-feedback
[1]: https://etherpad.openstack.org/p/PVG-ironic-snapshot-support
[2]: https://review.opendev.org/#/c/672780/
[3]: https://tinyurl.com/vwts36l
[4]: https://tinyurl.com/st6azrw
[5]: https://etherpad.openstack.org/p/PVG-Ironic-Planning
[6]: https://review.opendev.org/#/c/689551/
[7]: https://review.opendev.org/692609
[8]: https://review.opendev.org/692614
[9]: https://etherpad.openstack.org/p/ops-meetup-1st-2020
[10]: https://review.opendev.org/#/q/topic:story/2006403+(status:open+OR+status:merged)


