<div dir="ltr">Accidentally sent this privately.<div><br><div class="gmail_quote">---------- Forwarded message ----------<br>From: <b class="gmail_sendername">Matthew Booth</b> <span dir="ltr"><<a href="mailto:mbooth@redhat.com">mbooth@redhat.com</a>></span><br>Date: Fri, Oct 9, 2015 at 6:14 PM<br>Subject: Re: [openstack-dev] [nova][mistral] Automatic evacuation as a long running task<br>To: "Deja, Dawid" <<a href="mailto:dawid.deja@intel.com">dawid.deja@intel.com</a>><br><br><br><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><span class="">On Thu, Oct 8, 2015 at 12:51 PM, Deja, Dawid <span dir="ltr"><<a href="mailto:dawid.deja@intel.com" target="_blank">dawid.deja@intel.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Matthew,<br>

<br>

Thanks for shedding some light on the problems nova has with evacuating an instance. It is very important to keep those limitations in mind when preparing the final solution. Or to fix them, as you proposed.<br>

<br>

Nevertheless, I would say that evacuationD does more than calling 'nova host-evacuate' does. Let's consider this scenario:<br>

<br>

1. Call 'nova host-evacuate HostX'<br>

2. Caller dies during call - information that some VMs are still to be evacuated is lost.<br></blockquote><div><br></div></span><div>No, it's not lost because the instances still have instance.host == source. This means that you can (and must, in fact) simply run 'nova host-evacuate' again if it didn't complete successfully the first time.</div><div><br></div><div>Note that an external agent (let's call it pacemaker) must solve exactly the same problem in order to send a message to evacuated. It must assure itself that it successfully 'sent a message' at least once, somewhere. Now replace 'sent a message' with 'ran nova host-evacuate'.</div><span class=""><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Such a thing would not happen with evacuationD, because it prepares one RabbitMQ message for each VM that needs to be evacuated. Moreover, it deals with the situation where the process that lists VMs crashes. In that case, the whole operation would be continued by another daemon.<br>

<br>

EvacD may also handle another problem that you mentioned: failure of the target host of an evacuation. In such a scenario, an 'evacuate host' message will be sent for the new host and EvacD will try to evacuate all of its VMs - even those in rebuild state. Of course, evacuation of such instances fails, but they would eventually enter error state and evacuationD would start the resurrection process. This can be sped up by setting the instances' state to 'error' (except those which are in 'active' state) at the beginning of the whole 'evacuate host' process.<br></blockquote><div><br></div></span><div>Again, this situation is identical to simply running nova host-evacuate. EvacD doesn't do any monitoring, and requires an external agent (we called it pacemaker) to invoke it for the newly failed host. In this scenario, whatever sends 'evacuate host' can instead run 'nova host-evacuate', and the behaviour is identical.</div><span class=""><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

Finally, another action - called 'Look for VM' - could be added. It would check whether a given VM ended up in active state on its new host; if not, the VM could be rebuilt. I hope this would give us as much certainty as possible that the VM is alive.<br></blockquote><div><br></div></span><div>This would add behaviour over nova host-evacuate. However, it would also be considerably more complex to implement and what's there currently doesn't add any infrastructure which enables it.</div><div><br></div><div>Remember that nova evacuate is not a heavy operation for the caller. It is literally just a nova api call which returns after kicking off a task in conductor. Running nova host-evacuate does:</div><div><br></div><div>1. List all instances</div><div>2. For instance 0, tell nova to initiate evac</div><div>3 ...</div><div><br></div><div>Running evacD does:</div><div>1. List all instances</div><div>2. For instance 0, send ourselves a message to initiate evac</div><div>... rabbit ...</div><div>3. For instance 0, tell nova to initiate evac</div><div><br></div><div>In other words, evacD just makes the call chain longer. It adds overhead and additional potential points of failure. Ironically, this means the resulting solution will be less robust.</div><div><br></div><div>Matt</div><div><div class="h5"><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div>

On Tue, 2015-10-06 at 16:34 +0100, Matthew Booth wrote:<br>

Hi, Roman,<br>

<br>

Evacuated has been on my radar for a while and this post has prodded me to take a look at the code. I think it's worth starting by explaining the problems in the current solution. Nova client is currently responsible for doing this evacuate. It does:<br>

<br>

1. List all instances on the source host<br>

2. Initiate evacuate for each instance<br>

<br>

Evacuating a single instance does:<br>

<br>

API:<br>

1. Set instance task state to rebuilding<br>

2. Create a migration record with source and dest if specified<br>

<br>

Conductor:<br>

3. Call the scheduler to get a destination host if not specified<br>

4. Get the migration object from the db<br>

<br>

Compute:<br>

5. Rebuild the instance on dest<br>

6. Update instance.host to dest<br>

<br>

Examining single instance evacuation, the first obvious thing to look at is what happens if 2 evacuations run simultaneously. Because step 1 is atomic, it should not be possible to initiate 2 evacuations of a single instance simultaneously. However, note that this atomic action hasn't updated the instance host, meaning the source host remains the owner of this instance. If the evacuation process fails to complete, the source host will automatically delete the instance if it comes back up, because it will find a migration record, but the instance will not be rebuilt anywhere else. Evacuating it again will fail, because its task state is already rebuilding.<br>

<br>

Also, let's imagine that the conductor crashes. There is not enough state for any tool, whether internal or external, to be able to know if the rebuild is ongoing somewhere or not, and therefore whether it is safe to retry even if that retry would succeed, which it wouldn't.<br>

<br>

Which is to say that we can't currently robustly evacuate one instance!<br>
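<br>

A minimal sketch of this failure mode in plain Python, not Nova code. The Instance class, api_evacuate and compute_rebuild are hypothetical stand-ins for the steps listed above, and the atomic database update is modelled as a simple check-and-set.<br>

<pre>
```python
class EvacuateRejected(Exception):
    """Raised when an instance is already being rebuilt."""


class Instance:
    def __init__(self, name, host):
        self.name = name
        self.host = host
        self.task_state = None


def api_evacuate(instance):
    """API steps: claim the instance by setting its task state.

    Assumption: a plain check-and-set stands in for the atomic update.
    """
    if instance.task_state is not None:
        raise EvacuateRejected(instance.name)
    instance.task_state = "rebuilding"
    # instance.host is left unchanged, as in the current flow.


def compute_rebuild(instance, dest):
    """Compute steps: rebuild on dest, then record the new owner."""
    instance.host = dest
    instance.task_state = None


vm = Instance("vm-0", host="source")
api_evacuate(vm)

# Simulate a crash before compute_rebuild ever runs, then retry.
try:
    api_evacuate(vm)
    retry_accepted = True
except EvacuateRejected:
    retry_accepted = False

print(vm.host, vm.task_state, retry_accepted)
```
</pre>

The instance is left owned by the source host with task state rebuilding, and the retry is refused, which is the hole described above.<br>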

<br>

Looking at the nova client side, there is an obvious race there: there is no guarantee in step 2 that instances returned in step 1 have not already been evacuated by another process. We're protected here, though, because evacuating a single instance twice will fail the second time. Note that the process isn't idempotent, though, because an evacuation which falls into a hole will never be retried.<br>

<br>

Moving on to what evacuated does. Evacuated uses rabbit to distribute jobs reliably. There are 2 jobs in evacuated:<br>

<br>

1. Evacuate host:<br>

  1.1 Get list of all instances on the source host from Nova<br>

  1.2 Send an evacuate vm job for each instance<br>

2. Evacuate vm:<br>

  2.1 Tell Nova to start evacuating an instance<br>

<br>

Because we're using rabbit as a reliable message bus, the initiator of one of the tasks knows that it will eventually run to completion at least once. Note that there's nothing to prevent the task being executed more than once per call, though. A task may crash before sending an ack, or may just be really slow. However, in both cases, for exactly the same reasons as for the implementation in nova client, running more than once should not race. It is still not idempotent, though, again for exactly the same reasons as nova client.<br>
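<br>

To illustrate at-least-once delivery, here is a small in-memory simulation. It does not use RabbitMQ or evacuated's code; run_queue, evacuate_vm and the crash_before_ack parameter are hypothetical names, and redelivery is modelled by putting an unacknowledged job back on the queue.<br>

<pre>
```python
from collections import deque


def run_queue(jobs, handler, crash_before_ack):
    """Deliver each job until it has been acknowledged.

    crash_before_ack holds jobs whose first delivery is processed
    but never acked, so the broker delivers them a second time.
    """
    queue = deque(jobs)
    pending_crashes = set(crash_before_ack)
    calls = []
    while queue:
        job = queue.popleft()
        calls.append(job)
        handler(job)
        if job in pending_crashes:
            pending_crashes.discard(job)
            queue.append(job)  # no ack: the job goes back on the queue
    return calls


started = set()
rejected = []


def evacuate_vm(vm):
    # Stand-in for the nova evacuate call: a second attempt is refused.
    if vm in started:
        rejected.append(vm)
    else:
        started.add(vm)


calls = run_queue(["vm-0", "vm-1"], evacuate_vm, crash_before_ack={"vm-0"})
print(calls, rejected)
```
</pre>

The job for vm-0 runs twice, and the second run is simply refused, so the duplicate does no harm but also achieves nothing.<br>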

<br>

Also notice that, exactly as in the nova client implementation, we are not asserting that an instance has been evacuated. We are only asserting that we called nova.evacuate, which is to say that we got as far as step 2 in the evacuation sequence above.<br>

<br>

In other words, in terms of robustness, calling evacuated's evacuate host is identical to asserting that nova client's evacuate host ran to completion at least once, which is quite a lot simpler to do. That's still not very robust, though: we don't recover from failures, and we don't ensure that an instance is evacuated, only that we started an attempt to evacuate at least once. I'm obviously not satisfied with nova client, however as the implementation is simpler I would favour it over evacuated.<br>

<br>

I believe we can solve this problem, but I think that without fixing single-instance evacuate we're just pushing the problem around (or creating new places for it to live). I would base the robustness of my implementation on a single principle:<br>

<br>

  An instance has a single owner, which is exclusively responsible for rebuilding it.<br>

<br>

In outline, I would redefine the evacuate process to do:<br>

<br>

API:<br>

1. Call the scheduler to get a destination for the evacuate if none was given.<br>

2. Atomically update instance.host to this destination, and task state to rebuilding.<br>

<br>

Compute:<br>

3. Rebuild the instance.<br>

<br>

This would be supported by a periodic task on the compute host which looks for rebuilding instances assigned to this host which aren't currently rebuilding, and kicks off a rebuild for them. This would cover the compute going down during a rebuild, or the api going down before messaging the compute.<br>
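<br>

A rough sketch of the proposed flow and its periodic task, as a toy model rather than Nova code. The Instance and Compute classes, the in_progress set and the method names are assumptions made for illustration.<br>

<pre>
```python
class Instance:
    def __init__(self, name, host):
        self.name = name
        self.host = host
        self.task_state = None


def api_evacuate(instance, dest):
    """API steps: move ownership to dest and mark the rebuild together."""
    instance.host, instance.task_state = dest, "rebuilding"


class Compute:
    def __init__(self, name):
        self.name = name
        self.in_progress = set()  # names of instances being rebuilt now

    def periodic_task(self, instances):
        """Pick up rebuilds owned by this host that nobody is working on."""
        recovered = []
        for inst in instances:
            if inst.host != self.name or inst.task_state != "rebuilding":
                continue
            if inst.name in self.in_progress:
                continue
            self.rebuild(inst)
            recovered.append(inst.name)
        return recovered

    def rebuild(self, inst):
        self.in_progress.add(inst.name)
        # ... the real rebuild work would happen here ...
        inst.task_state = None
        self.in_progress.discard(inst.name)


vms = [Instance("vm-0", "source"), Instance("vm-1", "source")]
api_evacuate(vms[0], "dest")  # the API dies before messaging the compute

dest = Compute("dest")
first = dest.periodic_task(vms)
second = dest.periodic_task(vms)
print(first, second)
```
</pre>

The first pass of the periodic task recovers vm-0 even though no message ever reached the compute; the second pass finds nothing left to do.<br>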

<br>

Implementing this gives us several things:<br>

<br>

1. The list instances, evacuate all instances process becomes idempotent, because as soon as the evacuate is initiated, the instance is removed from the source host.<br>

2. We get automatic recovery of failure of the target compute. Because we atomically moved the instance to the target compute immediately, if the target compute also has to be evacuated, our instance won't fall through the gap.<br>

3. We don't need an additional place for the code to run, because it will run on the compute. All the work has to be done by the compute anyway. By farming the evacuates out directly and immediately to the target compute we reduce both overhead and complexity.<br>
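<br>

Points 1 and 2 can be sketched with a dictionary mapping instance name to owning host. This is an illustration only: host_evacuate and pick_dest are hypothetical names, and the host names are made up.<br>

<pre>
```python
def host_evacuate(instances, source, pick_dest):
    """List instances owned by source and hand each one to a new owner.

    instances maps instance name to owning host and is updated in place.
    """
    moved = []
    for name in sorted(instances):
        if instances[name] == source:
            instances[name] = pick_dest(name)
            moved.append(name)
    return moved


instances = {"vm-0": "hostX", "vm-1": "hostX", "vm-2": "hostY"}

# An earlier run moved vm-0 and then died.
instances["vm-0"] = "hostZ"

# Running again only touches what is still owned by hostX.
first = host_evacuate(instances, "hostX", lambda name: "hostZ")
second = host_evacuate(instances, "hostX", lambda name: "hostZ")

# If hostZ fails as well, its instances are found by the same listing.
third = host_evacuate(instances, "hostZ", lambda name: "hostY")
print(first, second, third)
```
</pre>

Because ownership changes as soon as the evacuate is initiated, repeating the run is harmless, and a later failure of the target is handled by the same call.<br>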

<br>

The coordination becomes very simple. If we've run the nova client evacuation anywhere at least once, the actual evacuations are now Somebody Else's Problem (to quote h2g2), and will complete eventually. As evacuation in any case involves a forced change of owner, it requires fencing of the source and implies an external agent such as pacemaker. The nova client evacuation can run in pacemaker.<br>

<br>

Matt<br>

<br>

</div></div><div><div>On Fri, Oct 2, 2015 at 2:05 PM, Roman Dobosz <<a href="mailto:roman.dobosz@intel.com" target="_blank">roman.dobosz@intel.com</a>> wrote:<br>

Hi all,<br>

<br>

The case of automatic evacuation (or resurrection, currently) is a topic<br>

which surfaces once in a while, but it isn't yet fully supported by<br>

OpenStack and/or by the cluster services. There were some attempts to<br>

bring the feature into OpenStack, however it turns out it cannot be<br>

easily integrated. On the other hand, evacuation may be executed<br>

from the outside using Nova client or Nova API calls for evacuation<br>

initiation.<br>

<br>

I did some research into how it could be designed, based<br>

on Russell Bryant's blog post[1] as a starting point. Apart from it, I've<br>

also taken high availability and reliability into consideration when<br>

designing the solution.<br>

<br>

Together with a coworker, we did a first PoC[2] to enable a cluster<br>

to perform evacuation. The idea behind that PoC was simple - providing<br>

an additional, small service which would trigger and supervise the<br>

evacuation process, which would be triggered from the outside (in this<br>

example we were using the Pacemaker fencing facility, but it might be<br>

anything) using RabbitMQ directly. Those services are running on the<br>

control plane in active-active (AA) fashion.<br>

<br>

That worked well for us. So we started exploring other possibilities like<br>

oslo.messaging, to use it in the same manner as we did in the PoC.<br>

It turns out that the implementation will not be as easy, because there<br>

is no facility in oslo.messaging for sending an ACK from the<br>

client after the job is done (rather than as soon as it gets the message). We<br>

also looked at the existing OpenStack projects for a candidate which<br>

provides a service for managing long running tasks.<br>

<br>

There is the Mistral project, which gives us almost all the features we<br>

need. The one missing feature is HA of Mistral task execution.<br>

<br>

The question is, how could such a problem (long running tasks) be resolved<br>

in OpenStack?<br>

<br>

[1] <a href="http://blog.russellbryant.net/2014/10/15/openstack-instance-ha-proposal/" rel="noreferrer" target="_blank">http://blog.russellbryant.net/2014/10/15/openstack-instance-ha-proposal/</a><br>

[2] <a href="https://github.com/dawiddeja/evacuationd" rel="noreferrer" target="_blank">https://github.com/dawiddeja/evacuationd</a><br>

<br>

--<br>

Cheers,<br>

Roman Dobosz<br>

<br>

__________________________________________________________________________<br>

OpenStack Development Mailing List (not for usage questions)<br>

</div></div>Unsubscribe: <a href="http://OpenStack-dev-request@lists.openstack.org?subject:unsubscribe" rel="noreferrer" target="_blank">OpenStack-dev-request@lists.openstack.org?subject:unsubscribe</a><<a href="http://OpenStack-dev-request@lists.openstack.org?subject:unsubscribe" rel="noreferrer" target="_blank">http://OpenStack-dev-request@lists.openstack.org?subject:unsubscribe</a>><br>

<a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev" rel="noreferrer" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev</a><br>


</blockquote></div></div></div><br></div></div>

</div><br></div></div>