<div dir="ltr">I'll second much of what Rob said:<div>An API that reports how many live migrations (l-m) are in progress would be good.</div><div>An API that reports the progress (and start time) of a given l-m would be great.</div><div>An API to cancel a given l-m would also be great. I think this is preferable to an automatic timeout (though it would also give us the tools we need to implement one ourselves).</div><div><br></div><div>I like the idea of trying auto-convergence (and agree it should be a flavor feature and likely not the default). I suspect this one needs some testing. It may be fine to do this automatically if it doesn't actually throttle the VM some 90-99% of the time. (Presumably this could also increase the max downtime at cutover, in addition to throttling the VM.)</div><div><br></div><div>Thanks Daniel/Rob,</div><div>-dave</div><div><br></div><div>fyi: I'm an operator/devel on the Time Warner Cable OpenStack cloud.</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Sun, Feb 1, 2015 at 12:24 PM, Robert Collins <span dir="ltr"><<a href="mailto:robertc@robertcollins.net" target="_blank">robertc@robertcollins.net</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On 31 January 2015 at 05:47, Daniel P. Berrange <<a href="mailto:berrange@redhat.com">berrange@redhat.com</a>> wrote:<br>

> In working on a recent Nova migration bug<br>

><br>

>   <a href="https://bugs.launchpad.net/nova/+bug/1414065" target="_blank">https://bugs.launchpad.net/nova/+bug/1414065</a><br>

><br>

> I had cause to refactor the way the nova libvirt driver monitors live<br>

> migration completion/failure/progress. This refactor has opened the<br>

> door for doing more intelligent active management of the live migration<br>

> process.<br>

</span>...<br>

<span class="">> What kind of things would be the biggest win from Operators' or tenants'<br>

> POV?<br>

<br>

</span>Awesome. A couple of thoughts from my perspective. Firstly, there's a bunch<br>

of situation-dependent tuning. One thing Crowbar does really nicely is<br>

that you specify the host layout in broad abstract terms - e.g. 'first<br>

10G network link' and so on: some of your settings above, like whether<br>

to compress pages, are going to be heavily dependent on the bandwidth<br>

available (I doubt that compression is a win on a 100G link for<br>

instance, and would be suspect at 10G even). So it would be nice if<br>

there was a single dial or two to set and Nova would auto-calculate<br>

good defaults from that (with appropriate overrides being available).<br>

<br>

Operationally, avoiding trouble is better than being able to fix it, so<br>

I quite like the idea of defaulting the auto-converge option on, or<br>

perhaps making it controllable via flavours, so that operators can<br>

offer (and identify!) those particularly performance-sensitive<br>

workloads rather than having to guess which instances are special and<br>

which aren't.<br>

<br>

Being able to cancel the migration would be good. Relatedly, being able<br>

to restart nova-compute while a migration is going on would be good<br>

(or put differently, a migration happening shouldn't prevent a deploy<br>

of Nova code: interlocks like that make continuous deployment much<br>

harder).<br>

<br>

If we can't already, I'd like as a user to be able to see that the<br>

migration is happening (allows diagnosis of transient issues during<br>

the migration). Some ops folk may want to hide that of course.<br>

<br>

I'm not sure that automatically rolling back after N minutes makes<br>

sense: if the impact on the cluster is significant, then 1 minute vs<br>

10 doesn't intrinsically matter. What matters more is preventing too<br>

many concurrent migrations, so that would be another feature that I<br>

don't think we have yet: don't allow more than some N inbound and M<br>

outbound live migrations to a compute host at any time, to prevent IO<br>

storms. We may want to log with NOTIFICATION migrations that are still<br>

progressing but appear to be having trouble completing. And of course<br>

an admin API to query all migrations in progress to allow API driven<br>

health checks by monitoring tools - which gives the power to manage<br>

things to admins without us having to write a probably-too-simple<br>

config interface.<br>

<br>

HTH,<br>

Rob<br>

<span class="HOEnZb"><font color="#888888"><br>

--<br>

Robert Collins <<a href="mailto:rbtcollins@hp.com">rbtcollins@hp.com</a>><br>

Distinguished Technologist<br>

HP Converged Cloud<br>

</font></span><div class="HOEnZb"><div class="h5"><br>

_______________________________________________<br>

OpenStack-operators mailing list<br>

<a href="mailto:OpenStack-operators@lists.openstack.org">OpenStack-operators@lists.openstack.org</a><br>

<a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators</a><br>

</div></div></blockquote></div><br></div>