[openstack-dev] [TripleO] Nodes Registration workflow improvements
Ben Nemec
openstack at nemebean.com
Mon Jun 13 19:14:21 UTC 2016
On 06/13/2016 12:10 PM, Dmitry Tantsur wrote:
> On 06/13/2016 06:28 PM, Ben Nemec wrote:
>> On 06/13/2016 09:41 AM, Jiri Tomasek wrote:
>>> Hi all,
>>>
>>> As we are close to merging the initial Nodes Registration workflows and
>>> action [1, 2] using Mistral which successfully provides the current
>>> registration logic via common API, I'd like to start discussion on how
>>> to improve it so it satisfies GUI and CLI requirements. I'd like to try
>>> to describe the clients goals and define requirements, describe current
>>> workflow problems and propose a solution. I'd like to record the result
>>> of discussion to Blueprint [3] which Ryan already created.
>>>
>>>
>>> CLI goals and optimal workflow
>>> ========================
>>>
>>> CLI's main benefit is based on the fact that it's commands can simply
>>> become part of a script, so it is important that the operation is
>>> idempotent. The optimal CLI workflow is:
>>>
>>> User runs 'openstack baremetal import' and provides instackenv.json file
>>> which includes all nodes information. When the registration fails at
>>> some point, user is notified about the error and re-runs the command
>>> with the same set of nodes. Rinse and repeat until all nodes are
>>> properly registered.
>>
>> I would actually not describe this as the optimal workflow for CLI
>> registration either. It would be much better if the registration
>> completed for all the nodes that it can in the first place and then any
>> failed nodes can be cleaned up later. There's no reason one bad node in
>> a file containing 100 nodes should block the entire deployment.
>>
>> On that note, the only part of your proposal that I'm not sure would be
>> useful for the CLI is the non-blocking part. I don't know that a CLI
>> fire-and-forget mode makes a lot of sense, although if there's a way for
>> the CLI to check the status then that might be fine too. However,
>> pretty much all of the other usability stuff you're talking about would
>> benefit the CLI too.
>>
>>>
>>>
>>> GUI goals and optimal workflow
>>> =========================
>>>
>>> GUI's main goal is to provide a user friendly way to register nodes,
>>> inform the user on the process, handle the problems and lets user fix
>>> them. GUI strives for being responsive and interactive.
>>>
>>> GUI allows user to add nodes specification manually one by one by
>>> provided form or allow user (in same manner as CLI) to provide the
>>> instackenv.json file which holds the nodes description. Importing the
>>> file (or adding node manually) will populate an array of nodes the user
>>> wants to register. User is able to browse these nodes and make
>>> corrections to their configuration. GUI provides client side validations
>>> to verify inputs (node name format, required fields, mac address, ip
>>> address format etc.)
>>>
>>> Then user triggers the registration. The nodes are moved to nodes table
>>> as they are being registered. If an error occurs during registration of
>>> any of the nodes, user is notified about the issue and can fix it in
>>> registration form and can re-trigger registration for failed nodes.
>>> Rinse and repeat until all nodes are successfully registered and in
>>> proper state (manageable).
>>>
>>> Such workflow keeps the GUI interactive, user does not have to look at
>>> the spinner for several minutes (in case of registering hundreds of
>>> nodes), hoping that something does not happen wrong. User is constantly
>>> informed about the progress, user is able to react to the situation as
>>> he wants, User is able to freely interact with the GUI while
>>> registration is happening on the background. User is able to register
>>> nodes in batches.
>>>
>>>
>>> Current solution
>>> =============
>>>
>>> Current solution uses register_or_update_nodes workflow [1] which takes
>>> a nodes_json array and runs register_or_update_nodes and
>>> set_nodes_managed tasks. When the whole operation completes it sends
>>> Zaqar message notifying about the result of the registration of the
>>> whole batch of nodes.
>>>
>>> register_or_update_nodes runs tripleo.register_or_update_nodes action
>>> [2] which uses business logic in tripleo_common/utils/nodes.py
>>>
>>> utils.nodes.py module has been originally extracted from tripleoclient
>>> to get the business logic behind the common API. It does following:
>>>
>>> - converts the instackenv.json nodes format to appropriate ironic driver
>>> format (driver-info fields)
>>> - sets kernel and ramdisk ids defaults if they're not provided
>>> - for each node it tests if node already exists (finds nodes by mac
>>> addresses) and updates it or registers it as new based on the result.
>>>
>>>
>>> Current Problems:
>>> - no zaqar notification is sent for each node
>>> - nodes are registered in batch, registration fails when error happens
>>> on a certain node, leaving already registered nodes in inconsistent state
>>> - workflow does not notify user about what nodes have been registered
>>> and what failed, only thing user gets is relevant error message
>>> - when the workflow succeeds, the registered_nodes list sent by Zaqar
>>> message has outdated information
>>> - when nodes are updated using nodes registration, the forkflow ends up
>>> as failed, without any error output, although the nodes are updated
>>> successfully
>>>
>>> - utils/nodes.py decides whether the node should be created or updated
>>> based on mac address which is subject to change. It needs to be done by
>>> UUID which is fixed.
>>
>> The UUID is not fixed. If you register the same node twice, it will get
>> a different UUID each time (unless you explicitly set the UUID as I see
>> you can now do, but we can't and shouldn't require that). The MAC
>> shouldn't change unless you spoof it (which is not something I think we
>> should be designing around) or point it at a different NIC. In the
>> latter case, we have no way of knowing whether you changed to a
>> different NIC on the same system or reassigned an IPMI address to a
>> completely different system, so requiring that to be a new node is the
>> only sane way to handle it AFAICT.
>
> Starting last week you can put UUID to instackenv.json and ironic will
> use it.
Sure, but are we also saying that is now mandatory? I don't think I'd
be in favor of that (particularly because it means we can no longer be
sure of the UU part of UUID).
Maybe the change should be to use only the UUID if it's specified, and
otherwise do the MAC/IPMI matching? I don't think that's what it's
doing now, but it probably makes sense to.
>
>>
>> I'm not actually sure from just reading the code that we would require a
>> new node on a mac change anyway. In that case I think it would match
>> the IPMI address of the node instead and select that existing node for
>> update. I could be wrong about that since I haven't actually tested it,
>> but that's how it looks to me.
>
> We already use IPMI address to match, but
> 1. it can be changed
> 2. lookup will fail if we switch driver
Only if the driver changes how the unique id is generated though, right?
Most of them seem to use the IPMI address anyway.
>
>>
>>> - utils/nodes.py uses instackenv.json nodes list format - the conversion
>>> should be done in client
>>
>> Why? Putting logic like that in the client is evil.
>>
>>>
>>> - instackenv.json uses nodes list format which is not compatible with
>>> ironic which forces us to do the format conversions and limit the ironic
>>> driver support
>>
>> Totally agree that we should just drop the translation layer. It isn't
>> a useful abstraction - if you're using a vendor-specific driver, then
>> you probably have to set driver-specific keys, so you can't just swap
>> out drivers in your config anyway. If you're not using driver-specific
>> keys then you still aren't gaining anything from this extra layer.
>>
>> From looking at the existing code, the only _possible_ benefit from this
>> layer is some abstraction of drivers that use a different key for the
>> actual IPMI address, but honestly that inconsistency sounds like
>> something that should be fixed in Ironic, not papered over in the node
>> registration code. I guess that's a discussion we'd have to have with
>> the Ironic team.
>
> +1. we were talking about it, but it wasn't a big priority. Some push
> from TripleO side would probably help :)
>
>>
>>>
>>>
>>> Proposed changes
>>> ===============
>>>
>>> To satisfy clients requirements we need to:
>>> - assure the idempotency of idempotency of running the nodes
>>> registration providing the instackenv.json
>>> - enable the workflow to track each node registration workflow separately
>>>
>>>
>>> The changes can be done in 2 steps:
>>> 1. refactor register_or_update_nodes workflow and utils/nodes.py
>>>
>>> - register_or_update_nodes workflow calls register_or_update_node
>>> workflow for each node, that workflow then runs tasks:
>>> tripleo.register_or_update_node (action), set_node_managed,
>>> send_message. When whole workflow finishes, summary message is sent.
>>> - reduce the register action and utils/nodes.py to the mechanism of
>>> deciding on whether we create new or update the node - based on whether
>>> the uuid is provided in nodes list and whether node is present in ironic
>>> - move the format conversion from utils/nodes.py to client
>>>
>>> These changes allow each node to finish it's registration without being
>>> interrupted by failure on other node, avoid nodes getting into
>>> inconsistent state, allow interactive reporting on each nodes state,
>>> allows to run validations on each node separately
>>>
>>> 2. change instackenv.json nodes format to match Ironic
>>> - this change allows the client to pass data from instackenv.json to
>>> ironic without intervention, so since the interaction is data > ironic,
>>> user can specify any ironic driver in instackenv.json. In addition, GUI
>>> can dynamically display relevant properties user needs to fill in based
>>> on selected driver.
>>>
>>>
>>> Possible problems:
>>> We need to consider performance impact by running separate workflow for
>>> each node, doing some benchmark tests would be beneficial.
>>>
>>> Alternative solution:
>>> If we decide to stick with single workflow it will have the UX impact as
>>> we won't be able to interactively report on each node changes. But we'll
>>> still be able to report on each task result. Other requirements still stand:
>>> - nodes which got registered before the failure happened need to
>>> continue with next tasks (set_node_managed)
>>> - workflow needs to report registered nodes list, failed nodes list and
>>> error.
>>>
>>>
>>>
>>> [1] https://review.openstack.org/#/c/300200/
>>> [2] https://review.openstack.org/#/c/319587/
>>> [3]
>>> https://blueprints.launchpad.net/tripleo-common/+spec/improve-baremetal-workflows
>>>
>>>
>>> -- Jirka
>>>
>>> __________________________________________________________________________
>>> OpenStack Development Mailing List (not for usage questions)
>>> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>>
>>
>>
>> __________________________________________________________________________
>> OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
>
More information about the OpenStack-dev
mailing list