Open Stack

Wed May 28 13:02:17 UTC 2014

Hi Ironic folks, hi Devananda!

I'd like to share with you my thoughts on asynchronous API, which is
spec https://review.openstack.org/#/c/94923
First I was planned this as comments to the review, but it proved to be
much larger, so I post it for discussion on ML.

Here is list of different consideration, I'd like to take into account
when prototyping async support, some are reflected in spec already, some
are from my and other's comments:

1. "Executability"
We need to make sure that request can be theoretically executed,
which includes:
a) Validating request body
b) For each of entities (e.g. nodes) touched, check that they are
available
   at the moment (at least exist).
   This is arguable, as checking for entity existence requires going to
DB.

2. Appropriate state
For each entity in question, ensure that it's either in a proper state
or
moving to a proper state.
It would help avoid users e.g. setting deploy twice on the same node
It will still require some kind of NodeInAWrongStateError, but we won't
necessary need a client retry on this one.

Allowing the entity to be _moving_ to appropriate state gives us a
problem:
Imagine OP1 was running and OP2 got scheduled, hoping that OP1 will come
to desired state. What if OP1 fails? What if conductor, doing OP1
crashes?
That's why we may want to approve only operations on entities that do
not
undergo state changes. What do you think?

Similar problem with checking node state.
Imagine we schedule OP2 while we had OP1 - regular checking node state.
OP1 discovers that node is actually absent and puts it to maintenance
state.
What to do with OP2?
a) Obvious answer is to fail it
b) Can we make client wait for the results of periodic check?
   That is, wait for OP1 _before scheduling_ OP2?

Anyway, this point requires some state framework, that knows about
states,
transitions, actions and their compatibility with each other.

3. Status feedback
People would like to know, how things are going with their task.
What they know is that their request was scheduled. Options:
a) Poll: return some REQUEST_ID and expect users to poll some endpoint.
   Pros:
   - Should be easy to implement
   Cons:
   - Requires persistent storage for tasks. Does AMQP allow to do this
kinds
     of queries? If not, we'll need to duplicate tasks in DB.
   - Increased load on API instances and DB
b) Callback: take endpoint, call it once task is done/fails.
   Pros:
   - Less load on both client and server
   - Answer exactly when it's ready
   Cons:
   - Will not work for cli and similar
   - If conductor crashes, there will be no callback.

Seems like we'd want both (a) and (b) to comply with current needs.

If we have a state framework from (2), we can also add notifications to
it.

4. Debugging consideration
a) This is an open question: how to debug, if we have a lot of requests
   and something went wrong?
b) One more thing to consider: how to make command like `node-show`
aware of
   scheduled transitioning, so that people don't try operations that are
   doomed to failure.

5. Performance considerations
a) With async approach, users will be able to schedule nearly unlimited
   number of tasks, thus essentially blocking work of Ironic, without
any
   signs of the problem (at least for some time).
   I think there are 2 common answers to this problem:
   - Request throttling: disallow user to make too many requests in some
     amount of time. Send them 503 with Retry-After header set.
   - Queue management: watch queue length, deny new requests if it's too
large.
   This means actually getting back error 503 and will require retrying
again!
   At least it will be exceptional case, and won't affect Tempest run...
b) State framework from (2), if invented, can become a bottleneck as
well.
   Especially with polling approach.

6. Usability considerations
a) People will be unaware, when and whether their request is going to be
   finished. As there will be tempted to retry, we may get flooded by
   duplicates. I would suggest at least make it possible to request
canceling
   any task (which will be possible only if it is not started yet,
obviously).
b) We should try to avoid scheduling contradictive requests.
c) Can we somehow detect duplicated requests and ignore them?
   E.g. we won't want user to make 2-3-4 reboots in a row just because
the user
   was not patient enough.

------

Possible takeaways from this letter:
- We'll need at least throttling to avoid DoS
- We'll still need handling of 503 error, though it should not happen
under
  normal conditions
- Think about state framework that unifies all this complex logic with
features:
  * Track entities, their states and actions on entities
  * Check whether new action is compatible with states of entities it
touches
    and with other ongoing and scheduled actions on these entities.
  * Handle notifications for finished and failed actions by providing
both
    pull and push approaches.
  * Track whether started action is still executed, perform error
notification,
    if not.
  * HA and high performance
- Think about policies for corner cases
- Think, how we can make a user aware of what is going on with both
request
  and entity that some requests may touch. Also consider canceling
requests.

Please let me know, what you think.

Dmitry.

Open Stack

[openstack-dev] [Ironic] Random thoughts on asynchronous API spec

OpenStack

Community

Documentation

Branding & Legal