A minor revision: we have new links for the videos, as it seems there was an access permission issue. Links are replaced below. On Wed, Nov 13, 2019 at 11:35 AM Julia Kreger <juliaashleykreger@gmail.com> wrote:
Overall, there was quite a bit of interest in Ironic. We had great attendance for the Project Update, Rico Lin’s Heat/Ironic integration presentation, the demonstration of dhcp-less virtual media boot, the forum discussion on snapshot support for bare metal machines, and more! We also learned there are some very large bare metal clouds in China, even larger than the clouds we typically talk about when we discuss scale issues. As such, I think it would behoove the ironic community and OpenStack in general to be mindful of hyper-scale. These are not clouds with hundreds of compute nodes, but bare metal clouds containing thousands to tens of thousands of physical machines.
So in no particular order, below is an overview of the sessions, discussions, and commentary with additional status where applicable.
My apologies now since this is over 4,000 words in length.
Project Update ===========
The project update was fairly quick. I’ll try and record a video of it sometime this week or next and post it online. Essentially Ironic’s code addition/deletion levels are relatively stable cycle to cycle. Our developer and Ironic operator commit contribution levels have increased in Train over Stein, while the overall pool of contributors has continued to decline cycle after cycle, although not dramatically. I think the takeaway from this is that ironic has become more and more stable, and the problems being solved are in many cases operator-specific needs or wants, or bug fixes for issues that only arise in particular environment configurations.
The only real question that came out of the project update, if my memory is correct, was “What does Metal^3 mean for Ironic”, and “Who is driving forward Metal^3?” The answers are fairly straightforward: more ironic users, and more use cases from Metal^3 driving ironic to deploy machines. As for who is driving it forward, it is largely being driven forward by Red Hat along with interested communities and hardware vendors.
Quick, Solid, and Automatic OpenStack Bare-Metal Orchestration ==================================================
Rico Lin, the Heat PTL, proposed this talk promoting the possibility of using ironic natively to deploy bare metal nodes, specifically where configuration pass-through can’t be made generic or somehow articulated through the compute API. One such case is where someone wishes to utilize something like our “ramdisk” deploy_interface, which does not deploy an image to the actual physical disk. The only real question that I remember coming up was why someone might want or need to do this, which again becomes more of a question of doing things that are not quite “compute” API-ish. The patches are available in gerrit[10].
Operator Feedback Session =====================
The operator feedback[0] session was not as well populated, with maybe 20-25 people present. Overall the feeling of the room was that “everything works”; however, there is a need and desire for information and additional capabilities:
* A detailed driver support matrix.
* Reduce deployment times further.
* Disk key rotation is an ask from operators for drives that claim smart erase support but end up doing a drive wipe instead; in essence, to reduce the overall time spent cleaning.
* Software RAID is needed at deploy time.
* IPA needs improved error handling. This may be a case where some of the communication flow changes previously discussed could help, in that we could actively try and keep track of the agent a little more. Additional discussion will definitely be required.
* There does still seem to be some interest in graphical console support. A contributor has been revising patches, but I think it would really help for a vendor to become involved here and support accessing their graphical interface through such a method.
* Information, and an information sharing location, is needed. I’ve reached out to the Foundation staff regarding the Bare Metal Logo Program to see if we can find a common place that we can build/foster moving forward.

On this topic, one major pain point began to be stressed: issues with the resource tracker at 3,500 bare metal nodes. Privately, another operator reached out with the same issue at the scale of tens of thousands of bare metal nodes. As such, this became a topic during the PTG which gained further discussion. I’ll cover that later.
Ironic – Snapshots? ===============
As a result of some public discussion of adding snapshot capability, I proposed a forum session to discuss the topic[1] such that requirements could be identified and the discussion could continue over the next cycle. I didn't expect the number of attendees to swell compared to the operator feedback session. The discussion of requirements went back and forth, ultimately working to define "what is a snapshot" in this case, and "what should Ironic do?"
There was quite a bit of interaction in this session, and the consensus seemed to be the following:
* Don’t make it reliant on nova, as standalone users may want/need to use it.
* This could be a very powerful feature, as an operator could ``adopt`` a machine into ironic and then ``snapshot`` it to capture the disk contents.
* Block level only, and we can’t forget about capturing/storing content checksums.
* Capture the machine’s contents with the same expectation as we would have for a VM, and upload this to someplace.
In order to make this happen in a fashion which will scale, the ironic team will likely need to leverage keystone application credentials.
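Just to make that a little more concrete, here is a minimal sketch, not an agreed design, of how a snapshot-upload task might authenticate with an application credential (so it does not depend on a user password or on nova) and push the captured image plus its checksum to Glance. The credential IDs, image name, and file path are illustrative assumptions.

```python
# Sketch only: authenticate with a keystone application credential and upload
# a captured block-level snapshot image to Glance. All identifiers and paths
# below are illustrative assumptions.
import openstack.connection
from keystoneauth1 import session
from keystoneauth1.identity import v3

auth = v3.ApplicationCredential(
    auth_url='https://keystone.example.com/v3',
    application_credential_id='<pre-created credential id>',
    application_credential_secret='<credential secret>',
)
conn = openstack.connection.Connection(session=session.Session(auth=auth))

image = conn.image.create_image(
    name='node-1234-snapshot',
    disk_format='raw',
    container_format='bare',
    filename='/var/lib/ironic/snapshots/node-1234.raw',  # hypothetical staging path
)
# The checksum is retained alongside the snapshot so consumers can verify it.
print(image.id, image.checksum)
```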
Ironically reeling in large bare metal deployment without PXE ==============================================
This was a talk submitted by Ilya Etingof, who unfortunately was unable to make it to the summit. Special thanks go to both Ilya and Richard Pioso for working together to make this demonstration happen. The idea was to demonstrate where the ironic team sees the future of deploying machines at the edge using virtual media, and how vendors would likely interact with that, since slightly different mechanics may be required in some cases even if the BMCs all speak Redfish, as is the case for a Dell iDRAC BMC.
The idea[2] is ultimately that the conductor would inject the configuration information into the ISO image that is attached via virtual media, negating the need for DHCP. We have videos posted that allow those interested to see what this functionality looks like with neutron[3] and without neutron[4].
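To illustrate the kind of configuration being injected, here is a minimal sketch, assuming a Glean/cloud-init style network_data.json dropped into the ISO's config-drive path. The layout, addresses, and paths are assumptions for the example, not the exact mechanism from the patches.

```python
# Illustrative only: the sort of static network_data.json a conductor could
# place in the virtual media image so the ramdisk (e.g. via Glean) can bring
# up networking without DHCP. Values and paths are assumptions.
import json
import os

network_data = {
    "links": [
        {"id": "port-0", "type": "phy", "ethernet_mac_address": "52:54:00:12:34:56"}
    ],
    "networks": [
        {
            "id": "provisioning",
            "type": "ipv4",
            "link": "port-0",
            "ip_address": "192.0.2.10",
            "netmask": "255.255.255.0",
            "routes": [
                {"network": "0.0.0.0", "netmask": "0.0.0.0", "gateway": "192.0.2.1"}
            ],
        }
    ],
    "services": [{"type": "dns", "address": "192.0.2.53"}],
}

# Written into the ISO staging directory before the image is attached via the BMC.
dest = "iso_root/openstack/latest"  # assumed config-drive-style layout
os.makedirs(dest, exist_ok=True)
with open(os.path.join(dest, "network_data.json"), "w") as fp:
    json.dump(network_data, fp, indent=2)
```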
While the large audience was impressed, it seemed to be a general surprise that ironic already had virtual media support in some of its drivers. This talk spurred quite a bit of conversation and hallway track style discussion after the presentation concluded, which is always an excellent sign.
Project Teams Gathering ===================
The ironic community PTG attendance was nothing short of excellent. Thank you everyone who attended! At one point we had fifteen people and a chair had to be pulled up to our table for a 16th person to join us. At which point, we may have captured another table and created confusion.
We did things a little differently this time around. Given some of the unknowns, we did not create a strict schedule around the topics. We simply went through and prioritized topics and tried to discuss them each as thoroughly as possible until we had reached the conclusion or a consensus on the topic.
Topics, and a few words on each topic we discussed, are in the notes section on the PTG etherpad[5].
On-boarding -----------------
We had three contributors attend a fairly brief on-boarding overview of ironic. Two of them were more developer focused, whereas the third was more operator focused, looking to leverage ironic and see how they can contribute back to the community.
BareMetal SIG - Next Steps -------------------------------------
Arne Wiebalck and I provided an update, including current conversations, on where we see the SIG, the Logo Program, the white paper, and what the SIG should do beyond the white paper.
To start with the Logo Program: it seems that somewhere along the way a message or document got lost, and that largely impacted the Logo Program -> SIG feedback mechanism. I’m working with the OpenStack Foundation to fix that and get communication going again. What largely spurred this was that some vendors expressed interest in joining and wanted additional information.
As for the white paper, contributions are welcome and progress is being made again.
From a next steps standpoint, the question was raised of how we build up an improved operator point of contact. There was some consensus that we as a community should try to encourage at least one contributor to attend the operations mid-cycles. This allows for a somewhat shorter feedback loop with a different audience.
We also discussed knowledge sharing, or how to improve it. Included with this is how we share best practices. I’ve put the question to folks at the foundation as to whether there is a better way as part of the Logo Program, or if we should just use the wiki. I think this will be an open discussion topic in the coming weeks.
The final question that came up as part of the SIG is how to show activity. I reached out to Amy on the UC regarding this, and it seems the process is largely to just reach out to the current leaders of the SIG, so it is critical that we keep that list up to date moving forward.
Sensor Data/Metrics ---------------------------
Drawing the line between tenant-level information and operator-level information is the difficult part of this topic.
The consensus among the group was that the capability to collect some level of OOB sensor data should be present in all drivers, but there is also a recognition that this comes at a cost and possible performance impact. This performance impact question was raised mainly with Redfish, because the data is scattered around the API such that multiple API calls are required, and actively querying some data points may even cause interruptions.
The middle ground in the discussion came down to adding a capability of somehow saying “collect power status and temperature every minute, fan speeds every five minutes, drive/cpu health data maybe every 30 minutes”. I would be remiss if I didn't note that there was joking about how this would in essence be a re-implementation of cron. What this would end up looking like, we don’t know, but it would provide operators the data resolution appropriate for the failure risk/impact. The analogy used was that “If the temperature sensor has risen to an alarm level, either an AC failure or a thermal hot spot forming based upon load in the data center, checking the sensor too often is just not going to result in a human investigating that on the data center floor any faster.”
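To make the "re-implementing cron" joke a little more concrete, here is a purely illustrative sketch of per-category collection intervals; none of these names correspond to an existing ironic interface.

```python
# Not an existing interface -- just a sketch of per-category OOB polling
# intervals so low-churn data is not collected as aggressively as power state.
import time

COLLECTION_INTERVALS = {   # seconds between polls, per sensor category
    'power_state': 60,
    'temperature': 60,
    'fan_speed': 300,
    'drive_health': 1800,
    'cpu_health': 1800,
}

last_run = {category: 0.0 for category in COLLECTION_INTERVALS}

def due_categories(now=None):
    """Return the sensor categories whose polling interval has elapsed."""
    now = time.monotonic() if now is None else now
    due = [c for c, interval in COLLECTION_INTERVALS.items()
           if now - last_run[c] >= interval]
    for category in due:
        last_run[category] = now
    return due
```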
Mainly, I believe this discussion stresses that the information is for the operator of the bare metal and not intended to provide insight into a tenant monitoring system; those activities should largely be done within the operating system.
One question among the group was whether anyone was already using the metrics framework built into ironic for metrics of ironic itself, to see if we can re-use it. Well, it uses a plugin interface! In any event, I've sent a post to the openstack-discuss mailing list seeking usage information.
Node Retirement -----------------------
This is a returning discussion from the last PTG, and in discussing the topic we figured out where the discussion became derailed previously. In essence, the desire was to mix this with the concept of being able to take a node “out of service”. Except, taking a node out of service is an immediate, state-related flag, whereas retiring might happen as soon as the current tenant vacates the machine… possibly in three to six months.
In other words, one is “do something or nothing now”, and the other is “do something later when a particular state boundary is crossed”. Trying to make one solution for both doesn’t exactly work.
Unanimous consensus among those present was that, in order to provide node retirement functionality, the logic should be similar to maintenance/maintenance_reason: a top-level field on the node object that would allow API queries for nodes slated for retirement, which helps solve the operator workflow conundrum of “How do I know what is slated for retirement but not yet vacated?”
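As a purely hypothetical illustration of that workflow, the field name, query parameter, and API version below do not exist yet and are assumptions; the point is only that a single filtered query could answer the question.

```python
# Hypothetical sketch: if nodes gained a top-level 'retired' field (mirroring
# maintenance/maintenance_reason), an operator tool could ask "what is slated
# for retirement but not yet vacated?" in one query. Field and parameter names
# here are assumptions, not an existing ironic API.
import requests

IRONIC = 'https://ironic.example.com/baremetal'
HEADERS = {
    'X-Auth-Token': '<token>',
    'X-OpenStack-Ironic-API-Version': '1.99',  # placeholder version
}

resp = requests.get(f'{IRONIC}/v1/nodes',
                    params={'retired': 'True',
                            'fields': 'uuid,provision_state'},
                    headers=HEADERS)
for node in resp.json().get('nodes', []):
    if node['provision_state'] == 'active':
        print(f"{node['uuid']} is slated for retirement but still occupied")
```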
Going back to the “out of service” discussion, we reached consensus that this is in essence a “user declarable failed state”, and as such it should be handled in the state machine as an immediate action, not a future one. Should we implement out of service, we’ll need to check the nova.virt.ironic code and related virt code to properly handle nodes dropping from `ACTIVE` state, which could be problematic and would need to be API version guarded to prevent machines from accidentally entering `ERROR` state if they are not automatically recovered in nova.
Multi-tenancy ------------------
Lots of interest exists around making the API multi-tenant aware, though the exact interactions and uses involved are not entirely clear. What IS clear is that providing such functionality will allow operators to remove complication from the resource classes and tenant-specific flavors presently being used to enable tenant-specific hardware pools. The added benefit of providing some level of ironic API access to normally non-admin users is that it would allow those tenants to have a clear understanding of their used and available resources by directly asking ironic, whereas presently they don’t have a good way to collect or understand that short of asking the cloud operator when it comes to bare metal. Initial work has been posted for this to gerrit[6].
In terms of how tenants’ resources would be shared, there was consensus that the community should stress that new special-use tenants should be created for collaborative efforts.
There was some discussion regarding explicitly dropping fields, such as driver_info and possibly even driver_internal_info, for non-privileged users that can see the nodes. This is definitely a topic that requires more discussion, but it would solve operator reporting and usage headaches.
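As a rough sketch of that field-dropping idea, and only a sketch, the policy name, field list, and helper below are all assumptions rather than anything agreed upon:

```python
# Sketch only: one way an API layer might hide operator-only fields from
# non-privileged users who are allowed to see "their" nodes. The policy rule
# name and helper signature are hypothetical.
SENSITIVE_FIELDS = ('driver_info', 'driver_internal_info')

def scrub_node_for_request(node_dict, context, policy_check):
    """Return a copy of the node with operator-only fields removed unless the
    caller passes the (hypothetical) admin-fields policy check."""
    if policy_check(context, 'baremetal:node:get:admin_fields'):
        return dict(node_dict)
    return {k: v for k, v in node_dict.items() if k not in SENSITIVE_FIELDS}
```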
Manual Cleaning Out-Of-Band ----------------------------------------
The point was raised that we unconditionally start the agent ramdisk to perform manual cleaning. Instead, we should support a way for purely out-of-band clean steps to be executed on their own, so the bare metal node doesn’t need to be booted to a ramdisk.
The consensus seemed to be that we should consider a new decorator, or a change to the existing decorator, that allows the conductor to hold off actually powering the node on for ramdisk boot unless or until a step is reached that is not purely out of band.
In essence, fixing this allows a “fix_bmc” out-of-band clean step to be executed first, before we try to boot the ramdisk and adjust BMC settings, which would presently fail.
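A minimal sketch of the decorator idea follows. This is not ironic's actual clean_step decorator; the requires_ramdisk flag and everything around it are assumptions used to show the shape of the change being discussed.

```python
# Sketch of the idea: extend a clean-step decorator with a flag the conductor
# could consult, so purely out-of-band steps (such as a "fix_bmc" step) can
# run before the node is ever powered on. Not ironic's real decorator.
def clean_step(priority, requires_ramdisk=True, argsinfo=None):
    """Mark a method as a clean step; requires_ramdisk=False would let the
    conductor defer booting the agent until an in-band step is reached."""
    def decorator(func):
        func._is_clean_step = True
        func._clean_step_priority = priority
        func._clean_step_requires_ramdisk = requires_ramdisk
        func._clean_step_argsinfo = argsinfo
        return func
    return decorator


class ExampleManagement(object):
    @clean_step(priority=100, requires_ramdisk=False)
    def fix_bmc(self, task):
        # Purely out-of-band: talk to the BMC directly, no agent required.
        pass
```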
Scale issues -----------------
A number of scaling issues exist in how nova and ironic interact, specifically with the resource tracker and how inventory is updated from ironic and loaded into nova. Largely this issue revolves around the concept in nova that each ``nova-compute`` is a hypervisor. And while one can run multiple ``nova-compute`` processes to serve as the connection to ironic, the underlying lock in nova is at the level of the compute node, not the ironic node. This means that as thousands of records are downloaded, synced, and copied into the resource tracker, the compute process is essentially blocked from other actions while this serialized job runs.
In a typical VM case, you may only have at most a couple hundred VMs on a hypervisor, whereas with bare metal we’re potentially servicing thousands of physical machines.
It should be noted that there are several large scale operators that indicated during the PTG that this was their pain point. Some of the contributors from CERN sat down with us and the nova team to try and hammer out a solution to this issue. A summary of that cross project session can be found at line 212 in the PTG etherpad[0].
But there is another pain point that contributes to this performance issue, and that is the speed at which records are returned by our API. We’ve had some operators voice frustration with this before, and we should at least be mindful of it and hopefully see if we can improve record retrieval performance. In addition, if we supported some form of bulk “GET” of nodes, it could be leveraged instead of the per-node GETs that presently occur in the nova-compute process.
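For illustration, assuming openstacksdk is used as the client, the per-node pattern versus a bulk listing looks roughly like this; the cloud name is an assumption and the per-node loop is only a stand-in for the behaviour described above.

```python
# Illustration of the API-call difference: against thousands of nodes the
# per-node loop is N round trips, while a detailed listing is a single
# (paginated) request stream. Cloud name is an assumption.
import openstack

conn = openstack.connect(cloud='bare-metal-cloud')

# Per-node pattern, roughly what hurts at scale: list, then one GET per node.
uuids = [n.id for n in conn.baremetal.nodes()]
nodes_one_by_one = [conn.baremetal.get_node(u) for u in uuids]

# Bulk alternative: one detailed listing of every node.
nodes_bulk = list(conn.baremetal.nodes(details=True))
```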
Boot Mode Config ------------------------
Previously, when scheduling occurred with flavors and the filters were appropriately set, if a machine was declared as supporting only one boot mode, matching requests would only ever land on that node. Now, with traits, this is a bit different and unfortunately optional, without logic to really guard how the setting is applied for an instance.
So in this case, if the filters are such that a request for a legacy boot instance lands on a UEFI-only machine, we’ll still try to deploy it. In reality, we really should try to fail fast.
Ideally the solution here is that we consult the BMC through some sort of get_supported_boot_modes method, and if we determine a mismatch between the configured settings and the requested instance from the data we have, we fail the deploy.
This may ultimately require work in the nova.virt.ironic driver code to identify that the cause of the failure is an invalid configuration and report that back, since the same request may not be fatal on another machine.
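A minimal sketch of that fail-fast check follows; get_supported_boot_modes is the method name from the discussion, while the function, exception, and example values are assumptions made for illustration.

```python
# Sketch of the fail-fast idea: compare the requested boot mode with what the
# BMC reports as supported, and abort the deploy immediately on a mismatch
# rather than timing out mid-provision. Names other than
# get_supported_boot_modes (from the discussion) are illustrative.
class BootModeMismatch(Exception):
    pass

def validate_boot_mode(node_uuid, requested_mode, supported_modes):
    """Fail fast when the requested boot mode is not supported by the BMC.

    supported_modes would come from a driver call such as the discussed
    get_supported_boot_modes() method.
    """
    if requested_mode not in supported_modes:
        raise BootModeMismatch(
            'Node %s does not support boot mode %r (supports: %s)'
            % (node_uuid, requested_mode, ', '.join(supported_modes)))

# e.g. validate_boot_mode('1234', 'bios', ['uefi']) raises, so the deploy
# fails immediately instead of after a lengthy provisioning timeout.
```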
Security of /heartbeat and /lookup endpoints -----------------------------------------------------------
We had a discussion of adding some additional layers of security mechanics around the /heartbeat and /lookup endpoints in ironic’s REST API. These limited endpoints are documented as being unauthenticated, so naturally some issues can arise from them, and we want to minimize the vectors by which an attacker who has gained access to a cleaning/provisioning/rescue network could impersonate a running ironic-python-agent. Conversely, the ironic-python-agent’s own API runs in a similar fashion, intended for secure, trusted networks accessible only to the ironic-conductor. As such, we also want to add some validation that an API request comes from the same ironic deployment that IPA is heartbeating to.
The solution to this is to introduce a limited-lifetime token that is unique per node per deployment. It would be stored in RAM on the agent, and in node.driver_internal_info so it is available to the conductor. It would be provided only once, either out of band OR via the first “lookup” of a node, and would then only become accessible again during known reboot steps.
Conceptually the introduction of tokens was well supported in the discussions and there were zero objections to doing so. Some initial patches[7][8] are under development to move this forward.
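Conceptually, and only as a sketch of the idea rather than the in-flight patches, the token handling could look something like the following; the field name and helpers are illustrative.

```python
# Conceptual sketch of the per-node agent token: generate a random token once,
# hand it out a single time (via virtual media or the first /lookup), keep a
# copy in node.driver_internal_info, and require it on every heartbeat.
# Field and function names here are illustrative, not the in-flight patches.
import hmac
import secrets

def issue_agent_token(driver_internal_info):
    """Generate a token once and persist it for the conductor's later checks."""
    if 'agent_secret_token' in driver_internal_info:
        raise RuntimeError('Token already issued; refusing to hand it out again')
    token = secrets.token_urlsafe(32)
    driver_internal_info['agent_secret_token'] = token
    return token  # delivered out of band or in the first /lookup response

def heartbeat_is_valid(driver_internal_info, presented_token):
    """Reject heartbeats that do not carry the token issued for this node."""
    expected = driver_internal_info.get('agent_secret_token')
    return bool(expected) and hmac.compare_digest(expected, presented_token)
```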
An additional item is to add IP address filtering capabilities to both endpoints, such that we only process a heartbeat/lookup request if we know it came from the correct IP address. An operator has written this feature downstream, and consensus was unanimous at the PTG that we should accept this feature upstream. We should expect a patch for this functionality to be posted soon.
Persistent Agents ------------------------
The use case behind persistent agents is “I want to kexec my way to the agent ramdisk, or the next operating system.” and “I want to have up to date inspection data.” We’ve already somewhat solved the latter, but the former is a harder problem, requiring the previously mentioned endpoint security enhancements to be in place first. There is some interest from CERN and some other large scale operators.
In other words, we should expect more of this from a bare metal fleet operations point of view for some environments as we move forward.
“Managing hardware the Ironic way” -------------------------------------------------
The question that spurred this discussion was “How do I provide a way for my hardware manager to know what it might need to do by default?” Except, those defaults may differ between racks that serve different purposes. “Rack 1, node0” may need a port set to FiberChannel mode, whereas “Rack2, node1” may require it to be Ethernet.
This quickly also reaches the discussion of “What if I need different firmware versions by default?”
This topic quickly evolved from there, and the idea that surfaced was to introduce a new field on the node object for storing such data. Something like ``node.default_config``: a dictionary, similar to what a user provides for clean steps or deploy steps, that supplies argument values and is iterated through during automated cleaning to allow operators to fill in configuration requirement gaps for hardware managers.
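Purely as an illustration of what such a dictionary might hold, with step and argument names invented for the example, it could look something like this:

```python
# Illustration only: the kind of per-node defaults a hardware manager could
# consume during automated cleaning if something like node.default_config
# existed. Step names, argument names, and values are made up.
default_config = {
    'configure_nic_mode': {
        'port_1': 'fibrechannel',   # "Rack 1, node0" style default
        'port_2': 'ethernet',       # "Rack2, node1" style default
    },
    'apply_firmware': {
        'bmc_minimum_version': '4.40',
        'nic_firmware_url': 'https://firmware.example.com/nic/22.31.1014.bin',
    },
}
```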
Interestingly enough, even today we just had someone ask a similar question in IRC.
This should ultimately be usable to assert desired/default firmware from an administrative point of view. Adrianc (Mellanox) is going to reach out to bdobb (DMTF) regarding the redfish PLDM firmware update interface to see where this may go from here.
Edge computing working group session ----------------------------------------------------
The edge working group largely became a session to update everyone on where Ironic was going and where we see things going in terms of managing bare metal at the edge/far-edge. This included some in-depth questions about dhcp-less deployment and related mechanics as well as HTTPBoot’ing machines.
Supporting HTTPBoot definitely does seem to be of interest to a number of people, although, at least after sharing my context, only five or six people in attendance really seemed interested in ironic prioritizing such functionality. The primary blocker, for those that are unaware, is the lack of pre-built UEFI firmware images for us to do integration testing of IPv4 HTTPBoot. Functionally, ironic already supports IPv6 HTTPBoot via DHCPv6 as part of our IPv6 support with PXE/iPXE; however, we also don’t have an integration test job for this code path, for the same reason: pre-built UEFI firmware images lack the built-in support.
More minor PTG topics -------------------------------
* Smartnics - A desire to attach virtual ports to ironic bare metal nodes with smartnics was raised. It seems that we don’t need to try and create a port entry in ironic; we only need to track/signal and remove the “vif” attachment to the node in general, as there is no physical MAC required for that virtual port in ironic. The constraint that at least one MAC address would be required to identify the machine is understood. If anyone sees an issue with this, please raise it with adrianc.
* Metal^3 - Within the group attending the PTG, there was not much interest in Metal^3 or in using CRDs to manage bare metal resources with ironic hidden behind the CRD. One factor related to this is the desire to define more data to be passed through to ironic, which is not presently supported in the CRD definition.
Stable Backports with Ironic's release model ==================================
I was pulled into a discussion with the TC and the Stable team regarding frustrations that have been expressed within the ironic team regarding stable back-porting of fixes, mainly driver fixes. There is consensus that it is okay for us as the ironic team to backport drivery things when needed to support vendors, as long as they are not breaking APIs or overall behavior contracts. This quickly leads us to needing to also modify constraints for drivery things as well. Constraints changes will continue to be evaluated on a case by case basis, but the general consensus is there is full support to "do the right thing" for ironic's users, vendors, and community. The key is making sure we are on the same page and agreeing on what that right thing is. This is where asynchronous communication can get us into trouble, and I would highly encourage trying to start higher bandwidth discussion when these cases arise in the future. The key takeaway that we should likely keep in mind is that policy is there for good reasons, but policy is not and cannot be a crutch to prevent the right thing from being done.
Additional items worth noting - Q1 Gatherings ===================================
There will be an operations mid-cycle[9] at Bloomberg in London, January 7th-8th, 2020. It would be good if at least one ironic contributor could attend, as the operators group tends to be closer to the physical bare metal, and it is a good chance to build mutual context between developers and the operations people actually using our software.
Additionally, we want to gauge the interest of having an ironic mid-cycle in central Europe in Q1 of 2020. We need to identify the number of contributors that would be interested in and able to attend, since the next PTG will be in June. Please email me off-list if you're interested in attending and I'll make a note of it, as we're still having initial discussions.
And now I've reached a buffer under-run on words. If there are any questions, just reply to the list.
-Julia
Links:
[0]: https://etherpad.openstack.org/p/PVG-ironic-operator-feedback
[1]: https://etherpad.openstack.org/p/PVG-ironic-snapshot-support
[2]: https://review.opendev.org/#/c/672780/
[3]: https://drive.google.com/file/d/1_PaPM5FvCyM6jkACADwQtDeoJkfuZcAs/view?usp=s...
[4]: https://drive.google.com/file/d/1YUFmwblLbJ9uJgW6Rkf6pkW8ouU-PYFK/view?usp=s...
[5]: https://etherpad.openstack.org/p/PVG-Ironic-Planning
[6]: https://review.opendev.org/#/c/689551/
[7]: https://review.opendev.org/692609
[8]: https://review.opendev.org/692614
[9]: https://etherpad.openstack.org/p/ops-meetup-1st-2020
[10]: https://review.opendev.org/#/q/topic:story/2006403+(status:open+OR+status:me...)