A couple of weeks ago (February 25-26), the Ironic community convened its first mid-cycle in quite a long time at the invitation and encouragement of CERN. A special thanks goes to Arne Wiebalck for organizing the gathering. We spent two days discussing, learning, sharing, and working together to form a path forward. As long-time contributors, some of us were able to bring context on not just how, but why. Other community members brought questions and requirements, while one of the hardware vendors brought their context and needs. And CERN was kind enough to show us how our work matters and makes a difference, which was the most inspiring part of all!

Special thanks goes to Dmitry Tantsur, Riccardo Pittau, and Iury Gregory for helping me keep momentum moving forward on this summary.

---------------------------------------

Deploy Steps
==========

We discussed issues related to the deploy step workflow, concerns about process efficiency, and the path forward. The issue in question was the advance validation of deploy steps when some of the steps come from the ironic-python-agent ramdisk and are not reflected in the server code. Creating the whole list of steps for validation and execution requires information from the ramdisk, but it is only available once the ramdisk is booted. We discussed the following alternatives:

* Start the ramdisk before deploy step execution. This was ruled out for the following reasons:
** Some steps need to be executed out-of-band before the ramdisk is running. This is already an issue with iDRAC clean steps.
** The first deploy step validation happens in the node validation API, when the ramdisk is clearly not running.
* Use cached deploy steps from the previous cleaning run. This was ruled out because:
** Some deployments disable automated cleaning.
** The deploy step list can change in between, e.g. because of hardware changes or other external input.
* Accept that we cannot provide early validation of deploy steps and validate them as we go. This involves booting the ramdisk as one of the deploy steps (no special handling), with only out-of-band steps executed before that (ignoring any in-band steps with a higher priority).

We decided to go with the third option (a rough sketch of the resulting ordering appears at the end of this section).

In a side discussion we decided to file an RFE for a driver_info flag preventing booting the ramdisk during manual cleaning. Solving the cleaning issues completely probably requires making booting the ramdisk a separate clean step, similarly to the deploy steps above. No final plan has been made for it, but we have more clarity than we did before.
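As a minimal, illustrative sketch of the agreed ordering (the step dictionaries follow the usual interface/step/priority shape, but the step names here are hypothetical): out-of-band steps that outrank the step which boots the ramdisk run first, and in-band agent steps can only be discovered and validated once the agent is up.

```python
# Illustrative only: out-of-band steps with a priority higher than the
# "boot the ramdisk" step run before it; in-band agent steps are validated
# later, as we go. Step names are hypothetical.
OUT_OF_BAND_STEPS = [
    {"interface": "management", "step": "example_bios_config", "priority": 120},
    {"interface": "deploy", "step": "boot_agent_ramdisk", "priority": 100},
    {"interface": "management", "step": "example_post_boot_tweak", "priority": 40},
]


def steps_before_ramdisk(steps, boot_step="boot_agent_ramdisk"):
    """Return the out-of-band steps that run before the ramdisk boots."""
    ordered = sorted(steps, key=lambda step: step["priority"], reverse=True)
    before = []
    for step in ordered:
        if step["step"] == boot_step:
            break
        before.append(step)
    return before

# steps_before_ramdisk(OUT_OF_BAND_STEPS)
# -> [{"interface": "management", "step": "example_bios_config", "priority": 120}]
```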
Security
======

Firmware Management
--------------------------------

We entered a discussion about creating a possible "meta step". After some back and forth, we reached a consensus that it is likely not possible given the different vendor parameters and requirements.

During this topic, we also reached the point of discussing changes to "active node" configuration, as it relates in large part to firmware updates, is necessary for larger fleet management, and eventually even for attestation process integration. The consensus largely revolved around leveraging rescue as the way to help enable some of this process, though this is only a theory. Hopefully we'll have operator feedback in the next year on this subject and can make more informed decisions. By then, we should have deploy steps in a form that one doesn't need to be a Python developer to leverage, and the team should have bandwidth to explore this further with operators.

Attestation
---------------

This is a topic that Julia has been raising for a while, because there is a logical and legitimate reason to implement some sort of integration with an attestation platform to perform system measurement and attestation during the cleaning and deployment processes, in order to help identify whether machines have been tampered with. In our case, remote attestation is likely the way forward, and inspiration can come from looking at Keylime (a TPM-based boot attestation and runtime integrity measurement solution, and most importantly, open source). We'll need an implementation covering at least clean/deploy steps, to be able to run and validate TPM measurements and fail the deployment if attestation fails. We still need to figure out the actual impact on firmware upgrades, how to safely determine whether a re-measurement is valid, and when to trust that a measurement is actually valid.

Ironic's next step is to begin talking to the Keylime folks in more depth. Also, one of our contributors, Kaifeng, who read our notes etherpad, indicated that he is working in the same area, so we may see some interesting and fruitful collaboration, because ultimately we all have some of the same needs.

Agent Tokens
------------------

Agent tokens was possibly the quickest topic that we visited, with Dmitry suggesting we just needed to add a unit test and merge the code. To further secure things, we need the agent ramdisk to begin using TLS.

TODO: Julia is to send out an email to the mailing list to serve as notice to operators that ironic intends to break backwards IPA compatibility next cycle by removing support for agents that do not support agent tokens.

NOTE: As we type/refine this summary for distribution, the agent token code has largely merged, and should be completely merged before the end of the current development cycle.

TLS in virtual media
---------------------------

In order to secure agent token use, we need to secure their transmission to the ironic-python-agent when commands are issued to the agent from the conductor. Ultimately we're hoping to begin work on this soon in order to better secure interactions and communications with machines in remote "edge" environments. An RFE has been filed to automatically generate certificates and exchange them with the ramdisk: https://storyboard.openstack.org/#!/story/2007214. Implementing it may require downstream consumers to update their USA export control certification.

FIPS 140-2
---------------

This was a late-addition topic, raised largely for the purposes of community visibility. In short, we know from some recently fixed bugs that operators are starting to deploy Ironic in environments and on hosts configured for FIPS 140-2 operating mode, which is, in short, a much stricter cryptography configuration. We ought to make sure that we don't have any other surprises waiting for us, so the call went out for someone to review the standard at some point and sanity check Ironic and its components.

Post-IPMI universe
===============

The decline of IPMI is one that we, as a community, need to plan ahead for, as some things become a little more difficult.

Discovery
-------------

Node discovery, as a feature, is anticipated to become a little more complicated. While we should still be able to identify a BMC address, that address may be the in-band communications channel address once vendors support the Redfish host interface specification.

This spurred discussion of alternatives, and one of the items raised was possibly supporting the discovery of BMCs using SSDP and UPnP. This raises an interesting possibility in that the UUID of the BMC is retrievable through these means. It seems logical for us to one day consider the detection and enrollment of machines using an operator tool of some sort. This functionality is defined by the Redfish standard and, as with everything in Redfish, is optional. The DMTF-provided Redfish library contains an example implementation: https://github.com/DMTF/python-redfish-library/blob/master/src/redfish/disco....
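As a rough illustration of what such discovery involves, here is a minimal sketch of an SSDP M-SEARCH for Redfish services (the search target is the one the Redfish specification defines for its REST service). This is illustrative only, not how the DMTF library or any future ironic tooling actually implements it.

```python
import socket

# Minimal SSDP discovery sketch for Redfish services (illustrative only).
MSEARCH = (
    "M-SEARCH * HTTP/1.1\r\n"
    "HOST: 239.255.255.250:1900\r\n"
    'MAN: "ssdp:discover"\r\n'
    "ST: urn:dmtf-org:service:redfish-rest:1\r\n"
    "MX: 2\r\n"
    "\r\n"
)


def discover_redfish_services(timeout=5):
    """Return a mapping of responder IP address -> SSDP response headers."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.settimeout(timeout)
    sock.sendto(MSEARCH.encode("ascii"), ("239.255.255.250", 1900))
    found = {}
    try:
        while True:
            data, (address, _port) = sock.recvfrom(4096)
            headers = {}
            for line in data.decode("ascii", "ignore").splitlines()[1:]:
                if ":" in line:
                    key, value = line.split(":", 1)
                    headers[key.strip().upper()] = value.strip()
            # The USN header typically embeds the service UUID, e.g.
            # "uuid:<uuid>::urn:dmtf-org:service:redfish-rest:1".
            found[address] = headers
    except socket.timeout:
        pass
    return found
```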
Using System ID/GUID
-------------------------------

The discovery topic spurred another question: should we use the system UUID or GUID identifier in addition to, or instead of, a MAC address on the chassis? The Ironic Inspector folks have been considering additional or even pluggable matching for a long time. The system UUID can be discovered in-band via SMBIOS before launching inspection/discovery. Supporting this would largely be a feature for matching a physical machine, but some of our code requires network information anyway, so it may not bring a huge benefit upfront beyond trying to match BMC<->host.

IPMI cipher changes
----------------------------

The CERN folks were kind enough to raise an issue that has brought them some headaches recently: some BMC vendors have changed cipher logic and keying, so they have had to put workarounds and modified ipmitool builds on their systems. As far as we're aware as a community, there is really nothing we can directly do to help them remove this workaround, but ultimately this headache may cause them to begin looking at Redfish and drive some development on serial console support for Redfish.

Making inspector data a time series
===========================

One of the challenges in data centers is identifying when the underlying hardware changes. When a disk is replaced, its serial number changes, and if that disk is in a running system, the machine would traditionally have needed to be re-inspected in order for its recorded information to be updated. We added the ability to manually execute inspection and submit this data in the last development cycle, so if inspection data had any time-series nature, changes could be identified, new serial numbers recorded, and so on.

The overwhelming response during the discussion was "Yes please!", in that such a feature would help a number of cases. Of course, what we quickly reached was disagreement over meaning. It turns out the purpose is more about auditing and identifying changes, so even if there are only two copies, the latest and the previous inspection data, the differences could be identified by external tooling (a rough sketch of such a comparison follows). A spec document or some sort of written MVP will ultimately be required, but the overall concept was submitted to Outreachy.
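As a minimal sketch of what such external tooling could do, assuming the usual ironic-python-agent inventory layout (a top-level "inventory" dict with a "disks" list); the field names below are the common ones, but treat this as illustrative:

```python
def changed_disk_serials(previous, latest):
    """Compare two inspection payloads and report disks whose serial changed."""
    def serials(payload):
        disks = payload.get("inventory", {}).get("disks", [])
        return {disk.get("name"): disk.get("serial") for disk in disks}

    before = serials(previous)
    after = serials(latest)
    return {
        name: {"previous": before.get(name), "latest": after.get(name)}
        for name in set(before) | set(after)
        if before.get(name) != after.get(name)
    }

# Example (values invented): a replaced /dev/sdb would show up as
# {"/dev/sdb": {"previous": "SERIAL-OLD", "latest": "SERIAL-NEW"}}
```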
DHCP-less deploy
==============

In regard to the DHCP-less deploy specification (https://review.opendev.org/#/c/672780/), we touched upon several areas of the specification. We settled on Nova's network metadata format (as implemented by Glean) as the API format for this feature; a sketch of its shape follows this section. Ilya has voiced concerns that it will tie us to Glean more closely than we may want.

We also discussed the scalability of rebuilding ISO images per node. The CERN folks rightfully expressed concern that a parallel deployment of several hundred nodes can put a significant load on conductors, especially in terms of disk space.

* In the case of hardware that has more than one usable virtual media slot, we can keep the base ISO intact and use a second slot (e.g. virtual USB) to provide configuration.
* The only other option is documenting this as a limitation of our virtual media implementation.

To get rid of the MAC address requirement for DHCP-less virtual media deployments, we determined that it will be necessary to return to sending the node UUID and other configuration to the ramdisk via boot parameters. This way we can avoid the requirement for MAC addresses, although this has to be navigated carefully and with operator feedback.

An additional concern, beyond parallel deployment load, was "rescue" support. The consensus seemed to be that we put giant security warnings in the documentation to signal the security risk of the ramdisk being exposed to a potentially untrusted network. The agent token work _does_ significantly help improve operational security in these cases, but operators must be cognizant of the risks and potentially consider that rescue may be something they do not want to use under normal circumstances with network edge deployments.
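For reference, Nova's network metadata (the format Glean consumes) looks roughly like the following; all addresses and identifiers here are illustrative, and the exact fields the DHCP-less work will rely on were still being settled in the specification:

```python
# Roughly the shape of Nova's network_data.json as consumed by Glean
# (values illustrative only).
network_data = {
    "links": [
        {
            "id": "port-0",
            "type": "phy",
            "ethernet_mac_address": "52:54:00:12:34:56",
            "mtu": 1500,
        },
    ],
    "networks": [
        {
            "id": "network0",
            "type": "ipv4",
            "link": "port-0",
            "ip_address": "192.0.2.10",
            "netmask": "255.255.255.0",
            "routes": [
                {"network": "0.0.0.0", "netmask": "0.0.0.0", "gateway": "192.0.2.1"},
            ],
        },
    ],
    "services": [
        {"type": "dns", "address": "192.0.2.53"},
    ],
}
```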
OOB DHCP-less deploy
===================

We briefly touched on out-of-band DHCP-less deployment. This is HTTPBoot asserted through to the BMC with sufficient configuration details, ultimately looking a lot like DHCP-less deployments. Interest does seem to exist in this topic, but we can revisit it once the base DHCP-less deployment work is done and, hopefully, an ironic contributor has access to hardware where this is an explicit feature of the BMC.

CI Improvements and Changes
========================

The upstream CI, and ways to make it more stable and more efficient, is a recurring discussion topic, and not only at meetups. The impact of the CI on day-to-day work is very high, which is why we took our time to talk about different aspects and do a proper analysis of the different jobs involved.

The discussion started with the proposal of reducing the usage of the ironic-python-agent images based on TinyCoreLinux (the so-called tinyipa images) and relying more on images built using diskimage-builder (DIB), specifically with CentOS 8 as the base. This proposal is based on the fact that DIB-built images are what we recommend for production usage, while tinyIPA images have known issues on real hardware. Their only real benefit is a much smaller memory footprint (roughly 400MiB versus 2GiB for a CentOS 8 image). We agreed to switch all jobs that use one testing VM to pre-built CentOS 8 images. This covers all jobs except for the standalone, multi-node, and grenade (upgrade testing) ones.

While reviewing the current list of jobs, we realized that there is a lot of duplication between them. Essentially, most of the image type and deploy interface combinations are already tested in the standalone job. As these combinations are orthogonal to the management technology, we can use Redfish instead of IPMI for some of the tests. We decided to split the ironic-standalone job, since it covers a lot of tempest scenarios and has a high failure rate. The idea is to have one job testing software RAID, manual cleaning, and rescue, while the remaining image type and deploy interface combinations will be split into two jobs (one using IPMI and the other using Redfish).

One other point we reached some consensus on was that the more exotic, non-OpenStack-focused CI jobs are likely best implemented using Bifrost as opposed to Tempest.

Third Party CI/Driver Requirements
-----------------------------------------------

The question has been raised within the community of whether we should reconsider Third Party CI requirements. For those who are unaware, it has been a requirement for drivers merged into ironic to have third-party operated CI. Operating Third Party CI helps exercise drivers to ensure that the driver code is functional, and provides the community information in the event that a breaking change or enhancement is made. The community recognizes that third party CI is difficult, and can be hard at times to keep working as the entire code base and its dependencies evolve. We discussed why some of these things are difficult, and what we, and the larger community, can do to try and make it easier. As one would expect, a few questions arose:

Q: Do we consider "supported = False" and keeping drivers in-tree until we know they no longer work?
A: The consensus was that this is acceptable, and that the community can keep unit tests working and the code looking sane.

Q: Do we consider such drivers essentially frozen?
A: The consensus is that drivers without third party CI will be functionally frozen unless changes are required to the driver for the project to move forward.

Q: How do we provide visibility into the state of the driver?
A: The current thought is to return a field in the /v1/drivers list to signal whether the driver has upstream testing. The thought is to use a field named "Tested" or something similar, as opposed to the internal name in the driver interface, which is "supported".

Q: Will we make it easier to merge a driver?
A: The consensus was that we basically want to see it work at least once before we merge drivers. It was pointed out that this helped provide visibility with some of the recently proposed driver code, which was originally developed against a much older version of ironic.

Q: Do third party CI systems need to run on every patch?
A: Consensus is no! A number of paths in the repository can be ignored. In other words, there is no reason to trigger an integration test of Third Party CI for a documentation change, an update to a release note, or a change to another vendor's drivers.

In summary, drivers without Third Party CI are "use at your own risk", and removal is moving towards a model of "don't be brutal". This leaves us with a number of tasks in the coming months:

* Update the contributor documentation with the questions and answers above.
* Author an explicit exception path for the process of bringing CI back up as it pertains to drivers, essentially focusing on communication between the ironic community and driver maintainers.
* Author a policy stating that unsupported drivers shall be removed immediately upon the community being made aware that a driver is no longer functional and lacks a clear/easy fix or path to resolution.
* Solicit pain points from driver maintainers who have recently set up or presently maintain Third Party CI, try to aggregate the data, and maybe find some ways of improving the situation.

"Fishy Politics": Adapting sushy for Redfish spec versus implementation reality
============================================================

Everyone's favorite topic is how implementations differ from specification documents. In part, the community is increasingly seeing cases where different vendors have behavior oddities in their Redfish implementations.
We discussed various examples, such as https://storyboard.openstack.org/#!/story/2007071, and the current issues we're observing with two different vendors around setting the machine boot mode and next boot device at the same time. For some of these issues, the idea of having some sort of Redfish flavor indicator was suggested, so that an appropriate plugin could be loaded to handle larger differences such as major field name differences, or endpoint behavior differences like using "PUT" instead of "PATCH". This has not yet been explored but will likely need to be moving forward.

Another item for the ironic team to be mindful of moving forward is that newer UEFI-specific boot setting fields have been created, which we may want to explore using. This could give us a finer level of granularity of control, but at the same time may not be really usable across vendors' hardware, due to the data in the field and how or what to map it back to.

Kexec (or "Faster booting, yes?")
=========================

This topic concerns using the kexec mechanism instead of rebooting from the ramdisk to the final instance. Additionally, if it is acceptable to run the agent on the user instance, it can be used for rebuilding and tearing down an instance, potentially saving numerous reboots and "Power On Self Tests" in the process.

We have one potential issue: with multi-tenant networking there is a possibility of a race between kexec and switching from the provisioning network to the tenant network(s). In a normal deploy we avoid it by powering the node off first, then flipping the networks, then booting the final instance (on success). There is no such opportunity with kexec, meaning that this feature will be restricted to flat network cases.

The group expressed lots of interest in providing this feature as an option for advanced operators, in other words "those who run large scale computing farms". Julia proposed making a demo of super-fast deployment using fast-track and kexec as a goal moving forward, and this received lots of positive feedback.
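For context, in this model the agent would roughly load the instance kernel and ramdisk from the freshly written image and jump straight into it, skipping firmware initialization. A minimal sketch using the standard kexec-tools CLI follows; the paths and kernel command line are illustrative, and this is not how ironic-python-agent currently implements anything.

```python
import subprocess


def kexec_into_instance(kernel, initrd, cmdline):
    """Load the instance kernel with kexec and execute it, skipping POST."""
    subprocess.check_call([
        "kexec", "--load", kernel,
        "--initrd=" + initrd,
        "--command-line=" + cmdline,
    ])
    # Replaces the running kernel immediately; nothing after this line runs.
    subprocess.check_call(["kexec", "--exec"])

# Illustrative usage (paths and cmdline depend entirely on the written image):
# kexec_into_instance("/mnt/boot/vmlinuz", "/mnt/boot/initramfs.img",
#                     "root=UUID=1234-abcd ro")
```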
Partitioning, What is next?
====================

Dmitry noted that there seems to be a lot of interest from EU operators in supporting disk partitioning. This has been long sought, but with minimal consensus. We discussed some possible cases for how this could be supported, and we reached the conclusion that the basis is largely just supporting the Linux Logical Volume Manager in the simplest possible configuration. At the same time, the point was raised that parity basically means some mix of software RAID through LVM and UEFI boot. We soon realized we needed more information! So the decision was reached to start by creating a poll, with questions in three different languages, to try and identify community requirements using some simple and feasible scenarios, such as LVM on a single disk, LVM on multiple disks, and LVM plus image extraction on top of the LVM.

The partitioning topic was actually very productive, in that we covered a number of different topics that we were not otherwise planning to explicitly cover.

Network booting
----------------------

One of them was why we do not simply use network booting. The point was made that network booting is our legacy and is fundamental for iSCSI-based boot and ramdisk booting (such as the deployment ramdisk). During this dive into ironic's history, we did reach an important point of consensus: ironic should switch the default boot mode, as previously planned, while still keeping at least one scenario test running in CI which uses network booting.

Stated operator wants in terms of Partitioning
-------------------------------------------------------------

Dmitry was able to provide some insight into what the Russian operator community is seeking from ironic, and Julia confirmed she had heard similar wants from public cloud operators wanting to offer Bare Metal as a Service. Largely these wants revolve around LVM capability in the most basic possible scenarios, such as a single disk or a partition image with LVM, or even software RAID with partition images. What has likely stalled these discussions in the past is an immediate focus on the more complex partitioning scenarios sought by some operators, whose complexity of requirements bogged the conversations down.

Traits/Scheduling/Flavor Explosion
===========================

Arne from CERN raised this topic to bring greater awareness. CERN presently has more than 100 flavors representing their hardware fleet, as each physical machine type gets its own flavor. This has resulted in pain from the lack of flavor-specific quotas. What may help in this area is resource class based quotas, but presently the state of that work is unknown. The bottom line: a user does not have clarity into their resource usage. The question then shifted to being able to report utilization, since the current quota model is based on cores/RAM/instances but not resource_class consumption as a whole. The question is largely "How many am I allowed to create? [before I eat someone else's capacity]".

https://github.com/stackhpc/os-capacity was raised as a reporting tool that may help with these sorts of situations for bare metal cloud operators. Another point raised in this discussion was the inability to tie consumer and project ID to consumed resources, but it turns out the allocation list functionality in Placement now provides this. At the end of this discussion, there was consensus that this should be brought back onto the Nova community's radar.
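For context on how resource classes relate to flavors today (and what a resource class based quota would have to look at): a bare metal flavor typically zeroes out the standard classes and requests a single custom class. The sketch below is illustrative, with a hypothetical flavor and resource class name.

```python
# Illustrative only: a typical ironic-backed flavor requests one custom
# resource class and zeroes out the standard ones. Names are hypothetical.
FLAVOR_EXTRA_SPECS = {
    "resources:CUSTOM_BAREMETAL_GOLD": "1",
    "resources:VCPU": "0",
    "resources:MEMORY_MB": "0",
    "resources:DISK_GB": "0",
}


def requested_resource_classes(extra_specs):
    """Return the resource classes (and amounts) a flavor would consume."""
    prefix = "resources:"
    return {
        key[len(prefix):]: int(value)
        for key, value in extra_specs.items()
        if key.startswith(prefix) and int(value) > 0
    }

# requested_resource_classes(FLAVOR_EXTRA_SPECS)
# -> {'CUSTOM_BAREMETAL_GOLD': 1}
```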
Machine Burn-in
=============

Burning in machines is a regular topic that comes up, and it looks like we're getting much closer to being able to support such functionality. Part of the reason for discussing this was to determine how to move forward with what organizations like CERN can offer the community.

There are two use cases. The most important is to ensure that hardware does not fail. That doesn't seem like a "cloudy" thing to have to worry about, but when you're building your cloud, you want to make sure that half your hardware isn't suddenly going to fail as soon as you put a workload on it. The second use case is nearly as important: ensuring that you are obtaining the performance you expect from the hardware.

This brought a bit of discussion, because there are fundamentally two different paths that could be taken. The first is to leverage the inspector, whereas the second is to use clean steps. Both have useful possible configurations, but largely there is no sophisticated collection of performance data today. That being said, the consensus seemed to be that actual data collection was less of a problem than the flexibility to invoke burn-in as part of cleaning and preparing a machine for deployment. In other words, the consensus seemed to be that clean steps would be ideal for community adoption and code acceptance.

IPv6/Dual Stack
============

TL;DR: we need to remove the ip_version setting field. This is mostly a matter of time to sift through the PXE code and determine the code paths that need to be taken, e.g. for IPv6 we would likely only want to signal flags for it if the machine is in UEFI mode. The dhcp-less work should provide some of the API-side capabilities this will really need in terms of interaction with Neutron.

Graphical Console
==============

The question was raised: "what would it take to finally get this moving forward again?" There is initial interface code, two proofs of concept, and it should be relatively straightforward to implement Redfish support, OEM capability dependent. The answer was functionally "someone needs to focus on this for a couple of months, keep it rebased, and engage the community". The community expressed an absolute willingness to mentor.

Software RAID - Specifying devices
===========================

We briefly discussed RFE https://storyboard.openstack.org/#!/story/2006369, which proposes a way to define which physical devices participate in software RAID. Similar functionality already exists in the RAID configuration format for hardware RAID, but software RAID currently always spans all hard drives, no matter how many. The RFE proposes re-using the same dictionary format as used for root device hints in a "physical_disks" field of the RAID configuration (a rough sketch follows). This idea was accepted by the audience, with Arne proposing to extend the supported hints with a new "type" hint with values like "rotational", "nvme" or "ssd".
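As a rough sketch, and only a sketch, of what the proposed format might look like (the hint values are illustrative, and Arne's "type" hint does not exist yet):

```python
# Hypothetical software RAID configuration using the proposed "physical_disks"
# device hints; today, software RAID simply spans all disks.
target_raid_config = {
    "logical_disks": [
        {
            "size_gb": "MAX",
            "raid_level": "1",
            "controller": "software",
            "physical_disks": [
                {"size": ">= 500"},   # root-device-hint style dictionaries
                {"size": ">= 500"},
                # {"type": "nvme"},   # Arne's proposed, not-yet-existing hint
            ],
        },
    ],
}
```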
Stickers
======

Yes, we really did discuss next steps for stickers. We have ideas. Lots of ideas... and we are all very busy. So we shall see if we're able to make some awesome and fun stickers appear for the Berlin time frame.

On 3/16/20 16:25, Julia Kreger wrote:
<snip>
> Traits/Scheduling/Flavor Explosion
> ===========================
>
> Arne with CERN raised this topic to bring greater awareness. CERN presently has greater than 100 flavors representing their hardware fleet as each physical machine type gets its own flavor. This has resulted in pain from the lack of flavor specific quotas. What may help in this area is for Resource Class based quotas, but presently the state of that work is unknown. The bottom line: A user does not have clarity into their resource usage. The question then shifted to being able to report utilization since the current quota model is based on cores/RAM/instances but not resource_class consumption as a whole.
>
> The question largely being "How many am I allowed to create? [before I eat someone else's capacity]".
>
> https://github.com/stackhpc/os-capacity was raised as a reporting tool that may help with these sorts of situations with bare metal cloud operators. Another point raised in this discussion was the lack of being able to tie consumer and project ID to consumed resources, but it turns out the allocation list functionality in Placement now has this functionality.
>
> In the end of this discussion, there was consensus that this should be brought back to the Nova community radar.
<snip>

FYI, work is in progress to add the ability to have resource class based quota limits as part of the larger effort to add support for unified limits in nova:

https://review.opendev.org/#/q/topic:bp/unified-limits-nova+(status:open+OR+...)

Specifically, this work-in-progress patch will extract resource classes from a flavor and use them during quota limit enforcement:

https://review.opendev.org/615180

Cheers,
-melanie