Open Stack

Mon Jun 13 14:41:10 UTC 2016

Hi all,

As we are close to merging the initial Nodes Registration workflows and 
action [1, 2] using Mistral, which successfully provide the current 
registration logic via the common API, I'd like to start a discussion on 
how to improve them so they satisfy the GUI and CLI requirements. I'd like 
to describe the clients' goals and define requirements, describe the 
current workflow problems and propose a solution. I'd like to record the 
result of the discussion in the Blueprint [3] which Ryan already created.

CLI goals and optimal workflow
==============================

The CLI's main benefit is that its commands can simply become part of a 
script, so it is important that the operation is idempotent. The optimal 
CLI workflow is:

The user runs 'openstack baremetal import' and provides an instackenv.json 
file which includes all the nodes' information. When the registration 
fails at some point, the user is notified about the error and re-runs the 
command with the same set of nodes. Rinse and repeat until all nodes are 
properly registered.

GUI goals and optimal workflow
==============================

The GUI's main goal is to provide a user-friendly way to register nodes, 
inform the user about the process, handle problems and let the user fix 
them. The GUI strives to be responsive and interactive.

The GUI allows the user to add node specifications manually, one by one, 
using a provided form, or (in the same manner as the CLI) to provide the 
instackenv.json file which holds the node descriptions. Importing the 
file (or adding a node manually) populates an array of nodes the user 
wants to register. The user is able to browse these nodes and make 
corrections to their configuration. The GUI provides client-side 
validations to verify inputs (node name format, required fields, MAC 
address, IP address format etc.)
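
To make the kind of client-side validation described above concrete, here is a minimal sketch in Python (the real GUI would do this in its own language). The field names pm_type, pm_addr and mac are assumptions modelled on a typical instackenv.json entry, and treating pm_addr strictly as an IP address is a simplification made for the example.

```python
import ipaddress
import re

# Hypothetical required fields, modelled on an instackenv.json node entry.
REQUIRED_FIELDS = ("pm_type", "pm_addr", "mac")
MAC_RE = re.compile(r"^[0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5}$")


def validate_node(node):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for field in REQUIRED_FIELDS:
        if not node.get(field):
            errors.append("missing required field: %s" % field)
    for mac in node.get("mac", []):
        if not MAC_RE.match(mac):
            errors.append("invalid MAC address: %s" % mac)
    addr = node.get("pm_addr")
    if addr:
        try:
            ipaddress.ip_address(addr)
        except ValueError:
            errors.append("invalid IP address: %s" % addr)
    return errors
```

Returning a list of problems rather than raising on the first one lets a form show every issue for a node at once.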

Then the user triggers the registration. The nodes are moved to the nodes 
table as they are being registered. If an error occurs during the 
registration of any node, the user is notified about the issue, can fix it 
in the registration form and can re-trigger registration for the failed 
nodes. Rinse and repeat until all nodes are successfully registered and in 
the proper state (manageable).

Such a workflow keeps the GUI interactive: the user does not have to look 
at a spinner for several minutes (in the case of registering hundreds of 
nodes), hoping that nothing goes wrong. The user is constantly informed 
about the progress, is able to react to the situation as they see fit, and 
is able to freely interact with the GUI while registration is happening in 
the background. The user is able to register nodes in batches.

Current solution
================

The current solution uses the register_or_update_nodes workflow [1] which 
takes a nodes_json array and runs the register_or_update_nodes and 
set_nodes_managed tasks. When the whole operation completes it sends a 
Zaqar message notifying about the result of the registration of the 
whole batch of nodes.

register_or_update_nodes runs the tripleo.register_or_update_nodes action 
[2] which uses the business logic in tripleo_common/utils/nodes.py

The utils/nodes.py module was originally extracted from tripleoclient 
to put the business logic behind the common API. It does the following:

- converts the instackenv.json nodes format to appropriate ironic driver 
format (driver-info fields)
- sets kernel and ramdisk ids defaults if they're not provided
- for each node it tests if node already exists (finds nodes by mac 
addresses) and updates it or registers it as new based on the result.
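
The register-or-update step in the last bullet can be sketched as follows. This is not the code from utils/nodes.py: FakeIronic is a hypothetical in-memory stand-in for the Ironic API, and the method names are made up for the example. It only illustrates the lookup-by-MAC decision.

```python
import uuid


class FakeIronic:
    """In-memory stand-in for the Ironic API (hypothetical, illustration only)."""

    def __init__(self):
        self.nodes = {}  # node uuid -> node dict

    def find_by_mac(self, macs):
        # Return the first known node sharing at least one MAC address.
        for node in self.nodes.values():
            if set(node["mac"]) & set(macs):
                return node
        return None

    def create(self, node):
        created = dict(node, uuid=str(uuid.uuid4()))
        self.nodes[created["uuid"]] = created
        return created

    def update(self, node_uuid, node):
        self.nodes[node_uuid].update(node)
        return self.nodes[node_uuid]


def register_or_update(ironic, node):
    """Update the node that owns one of the MACs, or register a new one."""
    existing = ironic.find_by_mac(node["mac"])
    if existing is not None:
        return ironic.update(existing["uuid"], node), "updated"
    return ironic.create(node), "created"
```

Running register_or_update twice with the same MAC address yields one node, which is what makes re-running the import safe. It also shows the weakness: if the MAC address changes, the lookup misses and a duplicate node is created.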

Current Problems:
- no Zaqar notification is sent for each node
- nodes are registered in a batch; registration fails when an error 
happens on any node, leaving already registered nodes in an inconsistent 
state
- the workflow does not tell the user which nodes have been registered 
and which failed; the only thing the user gets is the relevant error 
message
- when the workflow succeeds, the registered_nodes list sent in the Zaqar 
message has outdated information
- when nodes are updated using nodes registration, the workflow ends up 
as failed, without any error output, although the nodes are updated 
successfully

- utils/nodes.py decides whether the node should be created or updated 
based on the MAC address, which is subject to change. It needs to be done 
by UUID, which is fixed.
- utils/nodes.py uses the instackenv.json nodes list format; the 
conversion should be done in the client

- instackenv.json uses a nodes list format which is not compatible with 
Ironic, which forces us to do the format conversions and limits the 
Ironic driver support

Proposed changes
================

To satisfy the clients' requirements we need to:
- ensure the idempotency of running the nodes registration with a given 
instackenv.json
- enable the workflow to track the registration of each node separately

The changes can be done in 2 steps:
1. refactor register_or_update_nodes workflow and utils/nodes.py

- the register_or_update_nodes workflow calls a register_or_update_node 
workflow for each node; that workflow then runs the tasks 
tripleo.register_or_update_node (action), set_node_managed and 
send_message. When the whole workflow finishes, a summary message is sent.
- reduce the register action and utils/nodes.py to the mechanism of 
deciding whether we create a new node or update an existing one, based on 
whether the UUID is provided in the nodes list and whether the node is 
present in Ironic
- move the format conversion from utils/nodes.py to the client

These changes allow each node to finish its registration without being 
interrupted by a failure on another node, avoid nodes getting into an 
inconsistent state, allow interactive reporting on each node's state, 
and allow running validations on each node separately.
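
A minimal sketch of the per-node handling proposed in step 1, in plain Python rather than as a Mistral workflow definition. All names here (decide_action, register_nodes, the register_one and notify callables, the message fields) are hypothetical. The assumption is that register_one stands in for the per-node action and notify for sending a Zaqar message.

```python
def decide_action(node, known_uuids):
    """Pick 'update' only when a UUID is given and the node already exists."""
    node_uuid = node.get("uuid")
    if node_uuid and node_uuid in known_uuids:
        return "update"
    return "create"


def register_nodes(nodes, known_uuids, register_one, notify):
    """Process every node independently and return a summary of the batch."""
    summary = {"registered": [], "failed": []}
    for node in nodes:
        action = decide_action(node, known_uuids)
        try:
            result = register_one(node, action)
        except Exception as exc:  # one failing node must not stop the others
            failure = {"node": node, "error": str(exc)}
            summary["failed"].append(failure)
            notify(dict(failure, status="FAILED"))
        else:
            summary["registered"].append(result)
            notify({"status": "SUCCESS", "node": result})
    # One final message once every node has been attempted.
    notify({
        "status": "SUMMARY",
        "registered": len(summary["registered"]),
        "failed": len(summary["failed"]),
    })
    return summary
```

Because each node is wrapped in its own try block, a client receives one message per node plus a final summary, and can offer only the failed nodes for correction and retry.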

2. change the instackenv.json nodes format to match Ironic
- this change allows the client to pass data from instackenv.json to 
Ironic without intervention. Since the data then flows directly from the 
file to Ironic, the user can specify any Ironic driver in 
instackenv.json. In addition, the GUI can dynamically display the 
relevant properties the user needs to fill in based on the selected 
driver.
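
To illustrate the difference between the two formats, here is a rough sketch. The pm_* to ipmi_* mapping is an assumption covering a single IPMI-style driver and is not taken from utils/nodes.py. The function names are hypothetical.

```python
# Illustrative mapping for one driver only; field names are assumptions.
IPMI_FIELD_MAP = {
    "pm_addr": "ipmi_address",
    "pm_user": "ipmi_username",
    "pm_password": "ipmi_password",
}


def convert_instackenv_node(node):
    """Translate an instackenv-style node into an Ironic-style one."""
    driver_info = {
        ironic_key: node[env_key]
        for env_key, ironic_key in IPMI_FIELD_MAP.items()
        if env_key in node
    }
    return {"driver": node["pm_type"], "driver_info": driver_info}


def passthrough_node(node):
    """With an Ironic-shaped file there is nothing to translate."""
    return dict(node)
```

The first function needs a mapping table per supported driver, which is what limits driver support today. The second needs none, so any driver the user names in the file can be handed to Ironic as is.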

Possible problems:
We need to consider the performance impact of running a separate workflow 
for each node; doing some benchmark tests would be beneficial.

Alternative solution:
If we decide to stick with a single workflow, it will have a UX impact, as 
we won't be able to interactively report on each node's changes. But we'll 
still be able to report on each task result. The other requirements still 
stand:
- nodes which got registered before the failure happened need to 
continue with the next tasks (set_node_managed)
- the workflow needs to report the registered nodes list, the failed 
nodes list and the error.

[1] https://review.openstack.org/#/c/300200/
[2] https://review.openstack.org/#/c/319587/
[3] https://blueprints.launchpad.net/tripleo-common/+spec/improve-baremetal-workflows

-- Jirka

[openstack-dev] [TripleO] Nodes Registration workflow improvements