[openstack-dev] [ironic] The state of the ironic universe - July 2nd, 2018

Julia Kreger juliaashleykreger at gmail.com
Tue Jul 3 17:21:37 UTC 2018

The state of the ironic universe

This month we're trying a new format to keep those interested updated
on what is going on in ironic. The intent is for our weekly updates to
now take the form of a monthly newsletter to cover highlights of what
is going on in the ironic community. If you have something to add in
the future, please feel free to reach out and we can add it to the
next edition.

- Long-deprecated 'classic' ('pxe_*', 'agent_*') drivers are being
removed. They will not be present in the next major version of ironic.
- Ironic now has support to return nodes from maintenance state when
BMC connectivity is restored from an outage.
- The BIOS configuration caching and setting assertion interface has
merged, and vendors are working on their implementations.

From OpenInfra Days China!
* Users in China are interested in ironic!
* Deployments range from the small hundreds to the thousands, and from
basic OS installation to supercomputing use cases!
* The larger deployments are encountering some of the scale issues
larger operators have experienced in the past.
* The language barrier is making it difficult to grasp the finer
details of deployment error reporting/troubleshooting and high
availability mechanics.
* Some operators are interested in the ability to "clone" or "backup"
an ironic node's contents in order to redeploy elsewhere and/or
restore the machine state.
* Many operators wishing to contribute felt that they were unable to
because "we are not [a] big name", and that they would be unable to
gain traction or build consensus without already being a major
contributor.
In these discussions, I stressed that we all have similar, if not the
same, problems that we are trying to solve. Julia wrote a recent
SuperUser post about this.[1]

From the OpenStack Summit
Operator interests vary, but there are some common problems that
operators have or are interested in solving.

Attestation/Security Integration
Some operators and deployers seek to strengthen their security posture
via the use of TPMs, registration and attestation of status with
attestation servers. In a sense, think of it as profile enforcement of
bare metal. An initial specification [2] has been posted to try and
figure out some of the details and potential integration points.

Firmware Management (Version Discovery/Assertion)
Operators are seeking more capabilities to discover the current
version of firmware on a particular bare metal node and then possibly
take corrective action through ironic in order to update the firmware.

The developer community does not presently have a plan to tackle this
challenge, however doing so moves us closer to being a sort of
attestation service.

RAID prior to deploy and Software RAID
One of the frequent asks is for support for enabling the assertion of
RAID configuration prior to deployment. Naturally this is somewhat
problematic as this task CAN greatly extend the deployment time.
Presently deployment steps[6] are anticipated to enable these
sorts of workflows.

Additionally the ask for Software RAID support seems to be ramping up.
This is not a simple task for us to achieve, but conceivably it might
take the same shape as hardware RAID presently does, just with
appropriate interface mechanisms during the deployment process. There
are several conundrums, and the community needs to better understand
desired cases before development of a solution can take place.

Serial Console Logging
Numerous operators expressed interest in having console logging
support. This seems to have last been worked on last year[3] and
likely needs a contributor to pick it back up and champion it forward.

Hardware CMDB/Asset Discovery/Recording and Tracking
While not a problem of deploying bare metal directly, information
about hardware is needed to tie in tasks such as repairs, warranty,
tax accounting, and so on. Often these problems become "solved" by
disparate processes tracking information in
several different places. There is a growing ask for something in the
community to aid in this effort. Jokingly, we've already kind of come
up with a name, but the current main ironic developer community
doesn't have time to take on this challenge.

The most viable path forward for interested operators is likely to
detail the requirements and begin working together to implement
something with integration with ironic.

Rack Awareness/Conductor Locality
Ironic is working on conductor locality, in terms of pinning specific
bare metal nodes to specific ironic conductors. We hope that this
functionality will be available in the final Rocky release.[4]
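
As a rough illustration only, and not taken from the specification[4],
pinning could be thought of as restricting each node to the conductors
that share a label with it. Everything in the sketch below is a
hypothetical stand-in: the pick_conductor helper, the group labels, and
the hostname-to-group mapping are assumptions made for the example and
are not ironic's actual interface.

```python
import hashlib
from typing import Dict, List


def pick_conductor(node_uuid: str, node_group: str,
                   conductors: Dict[str, str]) -> str:
    """Return a conductor hostname from the same group as the node.

    `conductors` maps hostname -> group label. This layout is a
    hypothetical one chosen for the example, not ironic's data model.
    """
    # Only conductors carrying the node's group label are eligible.
    eligible: List[str] = sorted(
        host for host, group in conductors.items() if group == node_group
    )
    if not eligible:
        raise LookupError(f"no conductor available for group {node_group!r}")
    # A stable hash keeps a given node on the same conductor between calls.
    digest = hashlib.sha256(node_uuid.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(eligible)
    return eligible[index]


conductors = {"cond-a": "rack1", "cond-b": "rack1", "cond-c": "rack2"}
print(pick_conductor("node-1", "rack2", conductors))
```

The point of the sketch is only that a node labelled for one rack is
never handed to a conductor from another; how the real feature selects
among eligible conductors is described in the specification.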

Burn-in Tests
Operators expressed interest in having the capability to use ironic as
part of burn-in processes for hardware being added to the deployment.
The developer community discussed implementing such tooling at the
Rocky PTG and those discussions seemed to center around this being a
clean step to perform some unknown actions on the ramdisk. The missing
piece of the puzzle would be creating a "meta" step, and then
executing additional steps. We mainly need to understand what would be
good steps to implement in terms of actual actions to take for
burning-in the node.
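
To make the "meta" step idea a little more concrete, here is a minimal
sketch, assuming a meta step is simply a step that runs a list of
sub-steps in priority order and collects their results. The Step class,
the run_meta_step helper, and the two placeholder checks are all
hypothetical names invented for this example; they are not ironic or
ironic-python-agent code, and the actual burn-in actions remain the open
question described above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    """A hypothetical burn-in sub-step: a name, a priority, an action."""
    name: str
    priority: int
    action: Callable[[], str]


def run_meta_step(sub_steps: List[Step]) -> Dict[str, str]:
    """Run every sub-step, highest priority first, and collect results."""
    results: Dict[str, str] = {}
    for step in sorted(sub_steps, key=lambda s: s.priority, reverse=True):
        results[step.name] = step.action()
    return results


# Placeholder actions only; which real checks to run is still undecided.
burn_in_steps = [
    Step("disk_check", 10, lambda: "ok"),
    Step("memory_check", 20, lambda: "ok"),
]
results = run_meta_step(burn_in_steps)
print(results)
```

Keeping the sub-steps as plain data means operators could propose
different burn-in actions without changing how the meta step itself is
executed.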

Issues reported at the Summit

L3 Networking Documentation
Operators expressed a need for improved documentation in
L3/multi-tenant networking integration. This is something the active
developer community is attempting to improve as time permits.

Multitenant networking + boot from volume without HBAs
An increasing desire seems to exist to operate boot from volume with
multi-tenant networking. However, without direct storage attachment on
to that network, the routers need to take on the networking load for
the IO operations. This is something that we never anticipated during
development of the feature. The community needs more information to
better understand the operational scenario.

Recent New Specifications

* L3 based ironic deployments[7]

This work aims to allow operators to deploy utilizing virtual media in
remote data centers, where no DHCP is present.

* Boot from Ramdisk[8]

This is an often requested feature from the scientific computing
community, and may allow us to better support other types of ramdisk
based booting, such as root on NFS and root on RBD.

* Security Interface[9]

There is a growing desire for support for integration into security
frameworks, ultimately to enable better use of TPMs and/or enable
tighter operator specific workflow integrations. This would benefit
from operator feedback.

* Synchronize events with neutron[10]

This describes introduction of processes to enable ironic to better
synchronize its actions with neutron.

* Direct Deploy with local HTTP server[11]

This is a feature that would allow operators to utilize the "direct"
deployment interface with a local HTTP server instead of glance being
backed by swift and using swift tempurls.

Recently merged specifications

* VNC Graphical Console [5]
* Conductor/Node locality [4]

Things that might make good Summit or conference talks
* Talks about experiences scaling ironic or running ironic at scale.
* Experiences about customizing drivers or hardware types.
* New use cases!

[1]: http://superuser.openstack.org/articles/translating-context-understanding-the-global-open-source-community/
[2]: https://review.openstack.org/576718
[3]: https://review.openstack.org/#/c/453627
[4]: https://review.openstack.org/#/c/559420
[5]: https://review.openstack.org/306074
[6]: https://review.openstack.org/#/c/549493/
[7]: https://review.openstack.org/543936
[8]: https://review.openstack.org/576717
[9]: https://review.openstack.org/576718
[10]: https://review.openstack.org/343684
[11]: https://review.openstack.org/#/c/504039/
