[openstack-dev] [ironic] my notes from the PTG in Denver

Dmitry Tantsur dtantsur at redhat.com
Fri Oct 6 14:52:44 UTC 2017


Hi all!

Here are my notes from the ironic (and a bit of nova) room in Denver.
The same content in a nicely rendered form is on my blog:
http://dtantsur.github.io/posts/ironic-ptg-denver-2017-1.html
http://dtantsur.github.io/posts/ironic-ptg-denver-2017-2.html

Here goes the raw rst-formatted version. Feel free to comment and ask questions 
here or there.

Status of Pike priorities
-------------------------

In the Pike cycle, we had 22 priority items. Quite a few planned priorities
did land completely, despite the well-known staffing problems.

Finished
~~~~~~~~

Booting from cinder volumes
^^^^^^^^^^^^^^^^^^^^^^^^^^^

This includes the iRMC implementation, but excludes the iLO one. There is
also a nova patch for updating IP addresses for volume connectors on review:
https://review.openstack.org/#/c/468353/.

Next, we need to update cinder to support FCoE - then we'll be able to
support it in the generic PXE boot interface. Finally, there is some interest
in implementing out-of-band BFV for UCS drivers too.

Rolling (online) upgrades between releases
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We've found a bug; the fix was backported to stable/pike soon after the release
and now awaits a point release. We also need developer documentation and
some post-Pike clean-ups.

We also discussed fast-forward upgrades. We may need an explicit migration
for VIFs from port.extra to port.internal_info, **rloo** will track this.
Overall, we need to always make our migrations explicit and runnable without
the services running.

The driver composition reform
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Finished, with hardware types created for all supported hardware, and the
classic drivers pending deprecation in Queens.

`Removing the classic drivers`_ is planned for Rocky.

Standalone jobs (jobs without nova)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

These are present and voting, but we're not using their potential. The
discussion is summarized below in `Future development of our CI`_.

Feature parity between two CLI implementations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``openstack baremetal`` CLI is now complete and preferred, with the
deprecation of the ``ironic`` CLI expected in Queens.

We would like OSC to have fewer dependencies though. There were talks about
having a standalone ``openstack`` command without dependencies on other
clients, only on ``keystoneauth1``. **rloo** will follow up here.

**TheJulia** will check if there are any implications from the
interoperability team point of view.

Redfish hardware type
^^^^^^^^^^^^^^^^^^^^^

The ``redfish`` hardware type now provides all the basic stuff we need, i.e.
power and boot device management. There is an ongoing effort to implement
inspection. It is unclear whether more features can be implemented in a
vendor-agnostic fashion; **rpioso** is looking into Dell, while **crushil**
is looking into Lenovo.

Other
^^^^^

Also finished are:

* Post-deploy VIF attach/detach.

* Physical network awareness.

Not finished
~~~~~~~~~~~~

OSC default API version
^^^^^^^^^^^^^^^^^^^^^^^

We now issue a warning if no explicit version is provided to the CLI.
The next step will be to change the version to latest, but our current
definition of latest does not fit this goal really well. We use the latest
version known to the client, which will prevent it from working out-of-box
with older clouds. Instead, we need to finally implement API version
negotiation in ironicclient, and negotiate the latest version.
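
A minimal sketch of what such negotiation could look like, assuming the server
advertises the minimum and maximum microversions it supports. The function
names here are hypothetical and are not part of ironicclient:

.. code-block:: python

     def parse_version(version):
         """Convert a microversion string such as "1.34" to a tuple."""
         major, minor = version.split(".")
         return int(major), int(minor)


     def negotiate_version(client_max, server_min, server_max):
         """Pick the highest microversion supported by both sides.

         All arguments are strings like "1.34". Raises ValueError if the
         client is too old for the server.
         """
         client = parse_version(client_max)
         low, high = parse_version(server_min), parse_version(server_max)
         if client < low:
             raise ValueError("client supports at most %s, server requires %s"
                              % (client_max, server_min))
         # Tuples compare numerically, so (1, 9) < (1, 10), unlike strings.
         return "%d.%d" % min(client, high)

With this, a new client talking to an older cloud falls back to the maximum
version of that cloud instead of failing.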

Reference architectures guide
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There is one patch that lays out considerations that are going to be shared
between all proposed architectures. The use cases we would like to cover:

* Admin-only provisioner (standalone architectures)

   * Small fleet and/or rare provisions.

     Here a non-HA architecture may be acceptable, and a *noop* or *flat*
     networking can be used.

   * Large fleet or frequent provisions.

     Here we will recommend HA and *neutron* networking. *Noop* networking is
     also acceptable.

* Bare metal cloud for end users (with nova)

   * Smaller single-site cloud.

     Non-HA architecture and *flat* or *noop* networking is acceptable.
     Ironic conductors can live on OpenStack controller nodes.

   * Large single-site cloud.

     HA is required, and it is recommended to split ironic conductors with
     their TFTP/HTTP servers to separate machines. *Neutron* networking
     should be used, and thus compatible switches will be required, as well
     as their ML2 mechanism drivers.

     It is preferred to use virtual media instead of PXE/iPXE for deployment
     and cleaning, if supported by hardware. Otherwise, especially large
     clouds may consider splitting away TFTP servers.

   * Large multi-site cloud.

     The same as a single-site cloud plus using Cells v2.

Deploy steps
^^^^^^^^^^^^

We agreed to continue this effort, even though the ansible deploy driver solves
some of its use cases. The crucial point is how to pass the requested deploy
steps parameters from a user to ironic. For a non-standalone case it means
passing them through nova.

In a discussion in the nova room we converged on the idea of introducing a new
CRUD API for *deploy templates* (the exact name to be defined) on the ironic
side. Each such template will have a unique name and will correspond to a
*deploy step* and a set of arguments for it. On the nova side, a *trait* can
be requested with a name matching (in some sense) the name of a deploy
template. It will be passed to ironic, and ironic will apply the action,
specified in the template, during deployment.

The exact implementation and API will be defined in a spec, **johnthetubaguy**
is writing it.
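
Until the spec is written, here is a rough sketch of the idea, assuming that a
trait matches a template when the names are exactly equal. The template names,
steps and arguments below are made up for illustration only:

.. code-block:: python

     # Hypothetical deploy templates: name -> deploy step and its arguments.
     DEPLOY_TEMPLATES = {
         "CUSTOM_RAID1": {"step": "raid.create_configuration",
                          "args": {"raid_level": "1"}},
         "CUSTOM_HYPERTHREADING_OFF": {"step": "bios.apply_configuration",
                                       "args": {"hyperthreading": "disabled"}},
     }


     def steps_for_traits(traits, templates=DEPLOY_TEMPLATES):
         """Return deploy steps for the traits that match a template name.

         Traits without a matching template are ignored, since they may be
         ordinary scheduling traits.
         """
         return [templates[trait] for trait in traits if trait in templates]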

Networking features
^^^^^^^^^^^^^^^^^^^

Routed network support is close to completion, we need to finish a patch for
networking-baremetal.

The neutron event processing work is at the spec stage, but does not look
controversial for now.

We also have patches up for deprecating DHCP providers and for making our DHCP
code less dnsmasq-specific.

ironic-inspector HA
^^^^^^^^^^^^^^^^^^^

Preparation work is under way. We are making our PXE boot management
pluggable, with a new implementation on review that manages a *dnsmasq*
process directly, instead of changing *iptables*.

We seem to agree that rolling upgrades are not a priority for
ironic-inspector, as it's never hit by end users either directly or through
another service. It's a purely admin-only API, and admins can plan for a
potential outage.

There is a proposal to support ironic boot interfaces instead of a home-grown
implementation for boot management. The discussion of it launched a more
global discussion about the future of ironic-inspector, which continued the
next day.

Just Do It
^^^^^^^^^^

The following former priorities have all or most of their patches up for review,
and just require some attention:

* Node tags

* IPA API versioning

* Rescue mode

* Supported power states API

* E-Tags in API

.. _public etherpads: https://etherpad.openstack.org/p/ironic-queens-ptg
.. _Removing the classic drivers: http://specs.openstack.org/openstack/ironic-specs/specs/approved/classic-drivers-future.html

OpenStack goals status
----------------------

We have not completed either of the two goals for the Pike cycle, and now we
have two more goals to complete. All four goals are relatively close to
completion.

Python 3
~~~~~~~~

We have a non-voting integration job on ironic and a voting functional test
job on ironic-inspector. The missing steps are:

* make the python 3 job voting on ironic
* implement a job with IPA running on python 3 (blocked by pyudev weirdness)
* create an integration job with python 3 for ironic-inspector (mostly blocked
   by swift, will have reduced coverage; an alternative is to try RadosGW)

Switching to uWSGI
~~~~~~~~~~~~~~~~~~

Ironic standalone tests are running with mod_wsgi and voting, we only need to
switch to uWSGI.

For ironic-inspector it's much more complicated: it does not have a separate
API service for now at all. It's unclear if we'll be able to just launch the
current service as it is behind a WSGI container, as we actively use green
threads. We probably have to wait until the HA work is done.

Splitting away the tempest plugin
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We have a script to extract git history for a sub-tree. We need to create a
separate git repository somewhere, so that we do not submit 60-80 related
patches to zuul. Then this repository will be imported by the infra team, and
we'll proceed with the migration.

At the previous (ATL) PTG we decided to have ironic and ironic-inspector
plugins co-located. This will be less confusing for external users, as many of
them do not understand the difference clearly, but it will also complicate the
migration.

We will need to plan the actual migration in advance, and freeze the version
in-tree for some time.

Policy in the code
~~~~~~~~~~~~~~~~~~

The ironic part is essentially done, we just need to change the way we
document policy: https://review.openstack.org/#/c/502519/.

No policy support exists in ironic-inspector, and it's unclear if this goal
assumes adding it. There is a desire to do so anyway.

Future development of our CI
----------------------------

Standalone tests
~~~~~~~~~~~~~~~~

We have standalone tests voting, but we're not fully using their potential.
In the end, we want to reduce the number of **non**-standalone jobs to:

#. a whole disk image job,
#. a partition images job,
#. a boot-from-volume job,
#. a multi-node job with advanced networking (can be merged with one of the
    first two),
#. two grenade jobs: full and partial.

The following tests can likely be part of the standalone job:

* tests for all combinations of disk types and deploy methods,
* tests covering all community-supported drivers (snmp, redfish),
* tests on different boot options (local vs network boot),
* tests on root device hints (we plan to cover serial number, wwn and size
   with operators),
* node adoption.

Take over testing
~~~~~~~~~~~~~~~~~

The take over feature is very important for our HA model, but is completely
untested. We discussed the two most important test cases:

#. conductor failure during deployment with node in ``deploy wait``,
#. conductor failure for an active node using network boot.

We discussed two ways of implementing the test: using a multi-node job with two
conductors or using only one conductor. The latter requires a trick: after
killing the conductor, change its host name, so that it looks like a new
conductor. In either case, we can combine both tests into one run:

#. start deploying two nodes with netboot:

    #. ``driver=manual-management deploy_interface=iscsi``,
    #. ``driver=manual-management deploy_interface=direct``,

    The remaining steps will be repeated for both nodes.

#. Wait for the nodes' ``provision_state`` to become ``deploy wait``.
#. Kill the conductor.
#. Manually clean up the files from the TFTP and HTTP directories and the
    master image cache.
#. Change the conductor host name in ``ironic.conf``.
#. Wait for directories to be populated again.

    .. note:: We should aim to remove this step eventually.

#. ``virsh start`` the nodes to continue their deployment.
#. Wait for nodes to become ``active``.

Here is where the second test starts:

#. Repeat steps 3 - 6.
#. ``virsh reboot`` the nodes.
#. Check SSH connection to the rebooted instances.

In the future, we would also like to have negative tests on failed take over
for nodes in ``deploying``. We should also have similar tests for cleaning.

Pike retrospective
------------------

We've had a short retrospective. Positive items:

* Virtual midcycle
* Weekly bug liaison (action: start doing it again),
* Weekly priorities
* Landed some big features
* Acknowledge that vendors need more attention
* Did not drive our PTL away :)

Not so positive:

* Loss of people
* Gate breakages (action: better hand off of current mitigation actions
   between timezones, report on IRC and the whiteboard what you've done and
   what's left)
* Took too many priorities (action: take fewer, make the community understand
   that priorities != full backlog)
* Still not enough attention to vendors (action: accept one patch per vendor
   as part of weekly priorities; the same for subteams)
* Soft feature freeze
* Need more folks reviewing (action: **jlvillal** considers picking up the
   weekly review call)
* Releasing and cutting stable/pike was a mess (discussed in `Release cycle`_)
* No alignment between OpenStack releases and vendor hardware releases.

Release cycle
-------------

We had a really hard time releasing Pike. Grenade was branched before us,
essentially messing up our upgrade testing. We had to cut out stable/pike at a
random point, and then backport quite a few features, after repairing the CI.

When discussing that, we noted that we committed to releasing often and early,
but we'd never done it, at least not for ironic itself. Having regular
releases can help us avoid getting overloaded at the end of the cycle.
We've decided:

* Keep master as close to a releasable state as possible, including not
   exposing incomplete features to users and keeping release notes polished.
* Release regularly, especially when we feel that something is ready to go
   out. Let us aim for releasing roughly once a month.
* Let us cut stable/queens at the same time as the other projects. We will use
   the last released version as a basis for it.
* We are going back to feature freeze at the same time as the other projects,
   two weeks before the branching at milestone 3. This will allow us to finish
   anything requiring finishing, particularly rolling upgrade preparation,
   documentation and release notes.

Nova virt driver API compatibility
----------------------------------

Currently, we hardcode the required Bare Metal API microversion in our virt
driver. This introduces a hard dependency on a certain version of ironic, even
when it is not mandatory in reality, and enforces a particular upgrade order
between nova and ironic. For example, when we introduced boot-from-volume
support, we had to bump the required version, even though the feature itself
is optional. Cinder support, on the other hand, has multiple code paths
in nova, depending on which API version is available.

We would like to support the current and the previous versions of ironic in
the virt driver. For that we will need more advanced support for API
microversion negotiation in *ironicclient*. Currently it's only possible to
request one version during client creation. What we want to end up with is to
request the **minimum** version in get_client_, and then provide an ability
to specify a version in each call. For example,

.. code-block:: python

     ir_client = ironicclient.get_client(session=session,
                                         os_ironic_api_version="1.28")
     nodes = ir_client.node.list()  # using 1.28
     ports = ir_client.port.list(os_ironic_api_version="1.34")  # overriding

Another idea was to allow specifying several versions in get_client_. The
highest available version will be chosen and used for all calls:

.. code-block:: python

     ir_client = ironicclient.get_client(session=session,
                                         os_ironic_api_version=["1.28", "1.34"])
     if ir_client.negotiated_api_version == (1, 34):
         # do something

Nothing prevents us from implementing both, but the former seems to be what
the API SIG recommends (unofficially, **dtantsur** to follow up with a formal
guideline). It seems that we can reuse the newly introduced version discovery
support from the *keystoneauth1* library. **TheJulia** will look into it.

.. _get_client: https://docs.openstack.org/python-ironicclient/latest/api/ironicclient.client.html

What do we consider a deploy?
-----------------------------

We had a heated discussion on our deploy interfaces. Currently, the whole
business logic of provisioning, unprovisioning, taking over and cleaning nodes
is spread between the conductor and a deploy driver, with the deploy driver
containing most of it. This results in a lot of duplication, and also
with vendor-specific deploy interfaces, which is something we would want to
avoid. It also ends up with a lot of conditionals in the deploy interfaces
code, as e.g. boot-from-volume does not need half of the actions.
A few options were considered without a clear winner:

#. Move orchestration to the conductor, keep only image flashing logic in
    deploy interfaces. This is arguably how we planned on using deploy
    interfaces. But doing so would limit the ability of drivers to change how
    deploy is orchestrated, if e.g. they need to change the order of some
    operations or add a driver-specific operation in between them.

#. Create a new *orchestration* interface, keep only image flashing logic in
    deploy interfaces. That will fix the problem with customization, but it
    will complicate our interfaces matrix even further. And such change would
    break all out-of-tree drivers with custom deploy interfaces.

#. Do nothing and just try our best to clean up the duplication.

The last option is what we're going to do for Queens. Then we will re-evaluate
the remaining options.

Available clean steps API
-------------------------

We currently have no way to indicate which clean steps are available for which
node. Implementing such an API is complicated by the fact that some clean steps
come from hardware interfaces, while some come from the ramdisk (at least for
IPA-based drivers). The exact API was discussed in the API SIG room, and then
later in the ironic room.

We agreed that clean steps need to be cached to make sure we can return them
in a synchronous GET request, like ``GET /v1/nodes/<UUID>/cleaning/steps``
(the exact URI to be discussed in the spec). The caching itself will happen in
two cases:

#. Implicitly on every cleaning
#. Explicitly when a user requests manual cleaning without clean steps

A standard ``updated_at`` field will be provided, so that users know when the
cached steps were last updated. **rloo** to follow up on the spec with it.

We decided to not take any actions to invalidate the cache for now.
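
Roughly, the caching could behave like the in-memory sketch below. The class
and its methods are hypothetical; the real storage and API are up to the spec:

.. code-block:: python

     import datetime


     class CleanStepsCache(object):
         """In-memory stand-in for per-node cached clean steps."""

         def __init__(self):
             self._cache = {}

         def update(self, node_uuid, steps):
             """Store the steps, e.g. at the end of a cleaning run."""
             self._cache[node_uuid] = {
                 "steps": list(steps),
                 "updated_at": datetime.datetime.now(datetime.timezone.utc),
             }

         def get(self, node_uuid):
             """Return cached steps without contacting the node or ramdisk."""
             return self._cache.get(node_uuid,
                                    {"steps": [], "updated_at": None})

A node that has never been cleaned simply reports an empty list with no
timestamp, which tells the user that an explicit refresh is needed.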

Rethinking the vendor passthru API
----------------------------------

Two problems were discussed:

#. For dynamic drivers, the driver vendor passthru API only works with
    the default *vendor* interface implementation
#. No more support for mixing several vendor passthru implementations

For the first issue, we probably need to do the same thing as we plan to do
with driver properties: https://review.openstack.org/#/c/471174/. This does
not seem to be a high priority, so **dtantsur** will just file an RFE and
leave it there.

For the second issue, we don't have a clean solution now. It can be worked
around by changing ``node.vendor_interface`` on the fly. **pas-ha** will
document it.

Future of bare metal scheduling
-------------------------------

We have discussed the present and the future of scheduling bare metal
instances using nova. The discussion has started in the nova room and
continued in our room afterwards and on Friday.

Node availability
~~~~~~~~~~~~~~~~~

First, we discussed marking a node as unavailable for nova. Currently, when a
node is cleaning or otherwise unavailable, we set its resource classes count
to zero. This is, of course, hacky, and we want to get rid of it. I was
thinking about a new virt driver method to express availability, like

.. code-block:: python

     def is_operational(self, hostname):
         """Returns whether the host can be used for deployment."""

However, it was pointed out that ironic would probably be the only user of
such a feature. Instead, it was proposed to use the ``RESERVED`` field when
reporting resource classes. Indeed, cleaning can be treated as a temporary
reservation of the node by ironic for its internal business.

We will return ``RESERVED=0`` when a node is active or available. Otherwise,
``RESERVED`` will equal the total amount of reported resources (``1``
in case of a custom resource class). This will ensure that no resources are
available for scheduling without messing with the reported inventory.
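
The rule is simple enough to express in a few lines. This is only a sketch:
the function name and the shape of the returned record are assumptions, not
the actual virt driver code:

.. code-block:: python

     # Provision states in which nothing is reserved by ironic itself.
     UNRESERVED_STATES = ("available", "active")


     def resource_class_inventory(provision_state, total=1):
         """Build an inventory record for a node's custom resource class."""
         reserved = 0 if provision_state in UNRESERVED_STATES else total
         return {"total": total, "reserved": reserved}

Note that ``total`` never changes: a node in cleaning still reports one unit
of its resource class, it is just fully reserved.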

Advanced configuration
~~~~~~~~~~~~~~~~~~~~~~

Then we discussed means of passing from nova to ironic such information as
BIOS configuration or requested RAID layout. We agreed (again) that we don't
want nova to just pipe JSON blobs from a user to ironic. Instead, we will use
*traits* on the nova side and a new entity tentatively called *deploy
templates* on the ironic side.

A user will request a *deploy template* to be applied on a node by requesting
an appropriate trait. All matched traits will be passed from nova to ironic in
a similar way to how capabilities are passed now. Then ironic will fetch
*deploy templates* corresponding to traits and apply them.

The exact form of a *deploy template* is to be defined. A *deploy template*
will probably contain a *deploy step* name and its arguments. Thus, this work
will require the *deploy steps* work to be revived and finished.

**johnthetubaguy** will write specs on both topics.

Ownership of bare metal nodes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We want to allow nodes to be optionally owned by a particular tena^Wproject.
We discussed how to make the nova side work, with ironic still being the source
of truth for who owns which node. We decided that we can probably make it work
with *traits* as well.

Quantitative scheduling
~~~~~~~~~~~~~~~~~~~~~~~

Next, by request of some of the community members, we have discussed bringing
back the ability to use quantitative scheduling with bare metal instances.
We ended up with the same outcome as previously. Starting with Pike, bare
metal scheduling has to be done in terms of *custom resource classes* and
*traits* (ah, those magical traits!), and quantitative scheduling is not
coming back.

Inspection and resource classes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

After the switch to resource classes, inspection is much less useful.
Previously the information it provided was enough for scheduling. Now we don't
care too much about CPU/memory/disk properties, but we do care about the
resource class. Essentially, inspection is only useful for discovering ports
and capabilities.

In-band inspection (using ironic-inspector) has a good work-around though: its
*introspection rules* (mini-DSL to run on the discovered data) can be used to
set the resource class based on logic provided by an operator. These rules are
part of the ironic-inspector API, and thus out-of-band inspection does not
benefit from them.
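
To illustrate the idea, here is a simplified sketch of evaluating such a rule.
The rule layout only loosely follows the ironic-inspector one, and
``apply_rule`` is a hypothetical helper, not the real implementation:

.. code-block:: python

     import operator

     # A small subset of comparison operators, keyed by short names.
     OPERATORS = {"eq": operator.eq, "ge": operator.ge, "le": operator.le}


     def apply_rule(rule, data, node):
         """Run the rule actions on node if all conditions match the data.

         Returns True if the rule was applied, False otherwise.
         """
         for cond in rule["conditions"]:
             compare = OPERATORS[cond["op"]]
             if not compare(data[cond["field"]], cond["value"]):
                 return False
         for action in rule["actions"]:
             node[action["attribute"]] = action["value"]
         return True


     rule = {
         "conditions": [{"op": "ge", "field": "memory_mb", "value": 65536}],
         "actions": [{"attribute": "resource_class",
                      "value": "baremetal-large"}],
     }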

A potential solution is to move introspection rules API to ironic itself. That
would require agreeing on a common inventory format for both in-band and
out-of-band inspection. This is likely to be the `IPA inventory format`_.
Then we'll have to change the *inspect* interface. Currently we have one call
that does the whole inspection process, we need a call that returns
an inventory. Then ironic itself will run introspection rules, create ports
and update properties and capabilities.

A big problem here is that the discovery process, implemented purely within
ironic-inspector, also heavily relies on introspection rules. We cannot
remove/deprecate the introspection rules API in ironic-inspector until this is
solved. The two APIs will have to co-exist for the time being. We should
probably put the mechanism behind introspection rules to ironic-lib.

**sambetts** plans to summarize a potential solution on the ML.

We also discussed potentially having the default resource class to use for new
nodes, if none is provided. That would simplify things for some consumers,
like TripleO. Another option is to generate a resource class based on some
template. We can even implement both:

.. code-block:: ini

     default_hardware_type = baremetal

results in ``baremetal`` resource class for new nodes, while

.. code-block:: ini

     inspected_hardware_type = bm-{memory_mb}-{cpus}-{cpu_arch}

results in a templated resource class to be set for inspected nodes that do
not have a resource class already set.
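
The templating itself could be plain Python string formatting over the node
properties. A sketch, with a hypothetical helper name:

.. code-block:: python

     def render_resource_class(template, properties):
         """Fill a resource class template from inspected node properties."""
         return template.format(**properties)


     properties = {"memory_mb": 65536, "cpus": 16, "cpu_arch": "x86_64"}
     print(render_resource_class("bm-{memory_mb}-{cpus}-{cpu_arch}",
                                 properties))
     # prints "bm-65536-16-x86_64"

A template referring to a property that inspection did not discover would
raise ``KeyError`` here, so a real implementation has to decide what to do in
that case.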

.. _IPA inventory format: https://docs.openstack.org/ironic-python-agent/latest/admin/how_it_works.html#hardware-inventory

Future ironic-inspector architecture
------------------------------------

The discussion in `Inspection and resource classes`_ brought us to an idea of
slowly merging most of ironic-inspector into ironic. Ironic will benefit by
receiving introspection rules and optional inventory storage, while
ironic-inspector will benefit from using the boot interface and from the
existing HA architecture. In the end, the only part remaining in a separate
project will be PXE handling for introspection of nodes without ports and
for auto-discovery.

It's not clear how that will look. We could not discuss it in-depth, as a core
contributor (**milan**) was not able to come to the PTG. However, we have a
rough plan for the next steps:

#. Implement optional support for using boot interfaces in the ``Inspector``
    *inspect* interface: https://review.openstack.org/305864.

    When discussing its technical details, we agreed that instead of having a
    configuration option in ironic to force using a boot interface, we should
    introduce a configuration option in ironic-inspector to completely disable
    its boot management.

#. Implement optional support for using network interfaces in the ``Inspector``
    *inspect* interface: https://review.openstack.org/320003.

#. Move introspection rules to ironic itself as discussed in `Inspection
    and resource classes`_.

#. Move the whole data processing to ironic and stop using ironic-inspector
    when a boot interface has all required information.

The first item is planned for Queens, the second can fit as well. The timeline
for the other items is unclear. A separate call will be scheduled soon to
discuss this.

BIOS configuration
------------------

This feature has been discussed several times already. This time we came up
with a more or less solid plan to implement it in Queens.

* We have confirmed the current plan to use clean steps for starting the
   configuration, similar to how RAID already works. There will be two new clean
   steps: ``bios.apply_configuration`` and ``bios.factory_reset``.

* We discussed having a new BIOS interface versus introducing new methods on
   the management interface. We agreed that we want to allow mix-and-match of
   interfaces, e.g. using Redfish power with a vendor BIOS interface.

* We also discussed the name of the new interface. While the name "BIOS" is
   not ideal, as some systems use UEFI and some don't even have a BIOS, we
   could not come up with a better proposal.

* We will apply only very minimal validation to requested parameters.

Eventually, we will want to expose this feature as a deploy step as well.

A point of contention was how to display available BIOS configuration to a
user. Vendor representatives told us that available configurable parameters
may vary from node to node even within the same generation, so doing it
per-driver is not an option. We decided to go with the following approach:

* Introduce a new API endpoint to return cached available parameters. The
   response will contain the standard ``updated_at`` field, informing a user
   when the cache was last updated.

* The cache will be updated every time the configuration is changed via
   the clean steps mentioned above.

* The cache will also be updated on moving a node from ``enroll`` to
   ``manageable`` provision states.

API for single request deploy
-----------------------------

This idea has been in the air for a really long time. Currently, a deployment
via the ironic API involves:

* locking a node by setting ``instance_uuid``,
* attaching VIFs via the VIF API,
* updating ``instance_info`` with a few fields,
* requesting provision state ``active``, providing a configdrive.

In addition to being not user-friendly, this complex procedure makes it harder
to configure policies in a way to allow a user to only deploy/undeploy nodes
and nothing else.
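
To show how many calls are involved, the sketch below builds the list of
requests for the procedure above without sending anything. The helper is
hypothetical, and the paths and bodies reflect my understanding of the current
API, so double-check them against the API reference:

.. code-block:: python

     def deploy_requests(node, instance_uuid, vif_ids, instance_info,
                         configdrive):
         """Return the (method, path, body) requests for one deployment."""
         node_path = "/v1/nodes/%s" % node
         requests = [
             # Lock the node by setting instance_uuid.
             ("PATCH", node_path,
              [{"op": "add", "path": "/instance_uuid",
                "value": instance_uuid}]),
         ]
         # Attach each VIF via the VIF API.
         for vif_id in vif_ids:
             requests.append(("POST", node_path + "/vifs", {"id": vif_id}))
         # Fill in instance_info, one JSON patch operation per field.
         requests.append(
             ("PATCH", node_path,
              [{"op": "add", "path": "/instance_info/%s" % key,
                "value": value}
               for key, value in sorted(instance_info.items())]))
         # Request the "active" provision state, providing a configdrive.
         requests.append(("PUT", node_path + "/states/provision",
                          {"target": "active", "configdrive": configdrive}))
         return requests

Even with a single VIF that is four requests against three different
endpoints, each of which needs its own policy rule.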

Essentially, three ideas were considered:

#. Introduce a completely new API endpoint. This may complicate our already
    quite complex API.

#. Make working with the existing node more restful. For example, allow a PUT
    request against a node updating both ``instance_uuid`` and
    ``instance_info``, and changing ``provision_state`` to ``active``.

    It was noted, however, that directly changing ``provision_state`` is
    confusing, as the result will not match it (the value of ``provision_state``
    will become ``deploying``, not ``active``). This can be fixed by setting
    ``target_provision_state`` instead.

#. Introduce a new *deployment* object and CRUD API associated with it. A UUID
    of this object will replace ``instance_uuid``, while its body will contain
    what we have in ``instance_info`` now. A deploy request would look like::

     POST /v1/deployments {'node_uuid': '...', 'root_gb': '...', 'config_drive': '...'}

    A request to undeploy will be just::

     DELETE /v1/deployments/<DEPLOY UUID>

    Finally, an update of this object will cause a reprovision::

     PUT /v1/deployments/<DEPLOY UUID> {'config_drive': '...'}

    This is also a restful option, which is also the hardest to implement.

We did not agree to implement any (or some) of these options. Instead,
**pas-ha** will look into possible policy adjustments to allow a non-admin
user to provision and unprovision instances. A definition of success is to be
able to switch nova to a non-admin user.

Bare metal instance HA
----------------------

This session was dedicated to the proposal of implementing ``nova migrate``
for bare metal instances: https://review.openstack.org/#/c/449155/. This spec
is against nova, and no ironic changes are expected.

The idea is to enable moving an instance from one ironic node to another,
assuming that any valuable data is stored only on remote volumes. We agreed
that in the cloud case local disks should not be treated as reliable
persistent storage.

We discussed using ``nova migrate`` vs ``nova evacuate`` and decided that the
former probably will work better, as we won't mark a nova compute handling the
source node as down (it will bring down many more nodes). The only caveat is
that the users should not set any destination for the migration API call,
allowing nova to pick the destination itself.

Two more potential issues were spotted that need clarifying in the spec:

* How to update hash ring? The compute services for ironic are organized in a
   hash ring, but once a node is provisioned, it is attached to a compute
   service. Probably just a database update is enough.

* How exactly to replug VIFs.

A bonus point for implementing this feature will be support for resizing bare
metal instances, as migration is implemented as resizing without changing the
flavor.

**hshiina** will update and clarify the spec.

Ansible deploy method
---------------------

This was a short session. The proposed ``ansible`` deploy interface already
exists in ironic-staging-drivers and has a voting CI job. We are more or less
in agreement that we need it to satisfy cases requiring extensive
customizations.

**pas-ha** presented a benchmark, showing that this method is only slightly
slower than the ``direct`` deploy method:
http://pshchelo.github.io/ansible-deploy-perf.html. A major optimization
would be calling ansible only once, when deploying several nodes, but
the current ironic architecture does not quite allow that.

Console log
-----------

We already have support for serial console, so it feels natural to also
implement console log. Not everything, however, is obvious in the
implementation.

First, we discussed the amount of data to store. The current proposal captures
the log indefinitely, which is not perfect. It looks like we can document
enabling *logrotate* to handle this problem outside of ironic. A mailing list
thread can be started to learn what people are using. In any case, we should
return only the last N KiB to nova, where N is to be defined.
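
Returning only the tail of the log is cheap even for large files. A sketch,
assuming the log is a plain file on the conductor (the function name is
hypothetical):

.. code-block:: python

     import os


     def tail_console_log(path, max_kib):
         """Return at most the last max_kib KiB of the log file as bytes."""
         limit = max_kib * 1024
         with open(path, "rb") as log:
             # Find the file size, then jump to where the tail begins.
             log.seek(0, os.SEEK_END)
             size = log.tell()
             log.seek(max(0, size - limit))
             return log.read()

The file is opened in binary mode because the cut may land in the middle of a
multi-byte character; decoding is left to the caller.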

Next, we discussed when exactly to start the logging. Logging during
cleaning/provisioning may be helpful, but can potentially expose sensitive
information to end users. We agreed to start logging on starting a provisioned
instance.

**tiendc** will update the spec with the outcome of this discussion.

Graphical console
-----------------

This has been discussed several times already. We confirmed our plan to
introduce a new hardware interface - ``graphical_console_interface``.
**pas-ha** will update the existing spec, as well as the implementation for
the *idrac* hardware type.

Queens priorities
-----------------

This time we decided to take fewer priorities for the cycle, and make it clear
to the community that the priorities list is **not** our complete backlog.
That means, we will accept work that is not on the priorities list, so not
everything has to be fitted in it.

The list was finalized as a spec after the PTG:
http://specs.openstack.org/openstack/ironic-specs/priorities/queens-priorities.html.


