<div dir="ltr">I'll second much of what Rob said:<div>An API that indicated how many live migrations (l-m) were in progress would be good.</div><div>An API that reported the progress (and start time) of a given l-m would be great.</div><div>An API to cancel a given l-m would also be great. I think this is preferable to an auto timeout (though it would give us the tools we need to implement an auto timeout ourselves).</div><div><br></div><div>I like the idea of trying auto-convergence (and agree it should be a flavor feature and likely not the default). I suspect this one needs some testing. It may be fine to do this automatically if it doesn't actually throttle the VM some 90-99% of the time. (Presumably this could also increase the max downtime at cutover as well as throttling the VM.)</div><div><br></div><div>Thanks Daniel/Rob,</div><div>-dave</div><div><br></div><div>FYI: I'm an operator/developer on the Time Warner Cable OpenStack cloud.</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Sun, Feb 1, 2015 at 12:24 PM, Robert Collins <span dir="ltr"><<a href="mailto:robertc@robertcollins.net" target="_blank">robertc@robertcollins.net</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On 31 January 2015 at 05:47, Daniel P. Berrange <<a href="mailto:berrange@redhat.com">berrange@redhat.com</a>> wrote:<br>
> In working on a recent Nova migration bug<br>
><br>
> <a href="https://bugs.launchpad.net/nova/+bug/1414065" target="_blank">https://bugs.launchpad.net/nova/+bug/1414065</a><br>
><br>
> I had cause to refactor the way the nova libvirt driver monitors live<br>
> migration completion/failure/progress. This refactor has opened the<br>
> door for doing more intelligent active management of the live migration<br>
> process.<br>
</span>...<br>
<span class="">> What kind of things would be the biggest win from Operators' or tenants'<br>
> POV ?<br>
<br>
</span>Awesome. A couple of thoughts from my perspective. Firstly, there's a bunch<br>
of situation-dependent tuning. One thing Crowbar does really nicely is<br>
that you specify the host layout in broad abstract terms - e.g. 'first<br>
10G network link' and so on: some of your settings above, like whether<br>
to compress pages, are going to be heavily dependent on the bandwidth<br>
available (I doubt that compression is a win on a 100G link for<br>
instance, and would be suspect at 10G even). So it would be nice if<br>
there was a single dial or two to set and Nova would auto-calculate<br>
good defaults from that (with appropriate overrides being available).<br>
<br>
Operationally, avoiding trouble is better than being able to fix it, so<br>
I quite like the idea of defaulting the auto-converge option on, or<br>
perhaps making it controllable via flavours, so that operators can<br>
offer (and identify!) those particularly performance-sensitive<br>
workloads rather than having to guess which instances are special and<br>
which aren't.<br>
<br>
Being able to cancel the migration would be good. Relatedly, being able<br>
to restart nova-compute while a migration is going on would be good<br>
(or put differently, a migration happening shouldn't prevent a deploy<br>
of Nova code: interlocks like that make continuous deployment much<br>
harder).<br>
<br>
If we can't already, I'd like as a user to be able to see that the<br>
migration is happening (allows diagnosis of transient issues during<br>
the migration). Some ops folk may want to hide that of course.<br>
<br>
I'm not sure that automatically rolling back after N minutes makes<br>
sense: if the impact on the cluster is significant then 1 minute vs<br>
10 doesn't intrinsically matter. What matters more is preventing too<br>
many concurrent migrations, so that would be another feature that I<br>
don't think we have yet: don't allow more than some N inbound and M<br>
outbound live migrations to a compute host at any time, to prevent IO<br>
storms. We may want to log with NOTIFICATION migrations that are still<br>
progressing but appear to be having trouble completing. And of course<br>
an admin API to query all migrations in progress to allow API driven<br>
health checks by monitoring tools - which gives the power to manage<br>
things to admins without us having to write a probably-too-simple<br>
config interface.<br>
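<br>
A minimal sketch of the per-host cap described above, in Python. Everything<br>
here is hypothetical: MigrationLimiter and its methods are illustrative names,<br>
not existing Nova code, and a real implementation would need to share this<br>
state across services rather than keep it in one process.<br>
<pre>
```python
import threading
from collections import defaultdict


class MigrationLimiter:
    """Caps concurrent inbound and outbound live migrations per host.

    Hypothetical helper for illustration only; not part of Nova.
    """

    def __init__(self, max_inbound, max_outbound):
        self.max_inbound = max_inbound      # the "N" inbound limit
        self.max_outbound = max_outbound    # the "M" outbound limit
        self._inbound = defaultdict(int)    # dest host -> active count
        self._outbound = defaultdict(int)   # source host -> active count
        self._lock = threading.Lock()

    def try_start(self, source, dest):
        """Reserve a slot on both hosts; return False if either is full."""
        with self._lock:
            if self._outbound[source] >= self.max_outbound:
                return False
            if self._inbound[dest] >= self.max_inbound:
                return False
            self._outbound[source] += 1
            self._inbound[dest] += 1
            return True

    def finish(self, source, dest):
        """Release the slots once a migration completes, fails or is cancelled.

        Assumes it is only called for a migration that try_start accepted.
        """
        with self._lock:
            self._outbound[source] -= 1
            self._inbound[dest] -= 1

    def in_progress(self):
        """Per-host counts, the sort of data an admin query API could expose."""
        with self._lock:
            hosts = set(self._inbound) | set(self._outbound)
            return {
                h: {"inbound": self._inbound[h], "outbound": self._outbound[h]}
                for h in hosts
            }
```
</pre>
A caller would check try_start(source, dest) before kicking off a migration<br>
and call finish(source, dest) when it ends for any reason. A monitoring tool<br>
could poll in_progress() for health checks instead of relying on a fixed<br>
timeout.<br>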
<br>
HTH,<br>
Rob<br>
<span class="HOEnZb"><font color="#888888"><br>
--<br>
Robert Collins <<a href="mailto:rbtcollins@hp.com">rbtcollins@hp.com</a>><br>
Distinguished Technologist<br>
HP Converged Cloud<br>
</font></span><div class="HOEnZb"><div class="h5"><br>
_______________________________________________<br>
OpenStack-operators mailing list<br>
<a href="mailto:OpenStack-operators@lists.openstack.org">OpenStack-operators@lists.openstack.org</a><br>
<a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators</a><br>
</div></div></blockquote></div><br></div>