<div dir="ltr">Hello, Jay!<br><br>Thanks for the answer. It seems you effectively gave me the answer by asking about Rabbit: that made me take a closer look at it as the source of these problems. What I missed was defining the HA policy explicitly (queue mirroring has not been enabled by default since v3.0, which is practically forever). At least I can't reproduce these problems anymore.<div>
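<div>For reference, here is a minimal sketch of what defining the HA policy explicitly could look like. It only builds the rabbitmqctl set_policy command line and does not run it; the policy name ha-all, the pattern that skips amq.* queues, the default vhost and the helper name ha_policy_command are assumptions for illustration, not necessarily the settings used on this cluster.</div>

```python
import json


def ha_policy_command(name="ha-all", pattern="^(?!amq\\.).*", vhost="/"):
    """Build (but do not run) a rabbitmqctl command that mirrors matching queues."""
    # "ha-mode": "all" asks RabbitMQ to mirror each matching queue on every node.
    definition = json.dumps({"ha-mode": "all"})
    return ["rabbitmqctl", "set_policy", "-p", vhost, name, pattern, definition]


# Print the command so it can be reviewed before running it by hand.
print(" ".join(ha_policy_command()))
```

<div>The returned list can be passed to subprocess.run on a RabbitMQ node once the pattern and vhost have been checked against the real setup.</div>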


<div><br></div><div><font face="arial, helvetica, sans-serif">While we're here, I'd like to ask a few more questions about architecture for small clusters. </font><span style="font-family:arial,helvetica,sans-serif">Say we split the servers logically into 3 groups (controller / storage / compute), all roughly equal in terms of resources, and we want to add some HA magic. </span></div>


<div><span style="font-family:arial,helvetica,sans-serif">The first thing that comes to mind: we can run HAProxy on every node and configure things so that each node reaches any external service it relies on through its local HAProxy. Adding a server of any kind is then just a matter of reconfiguring and reloading HAProxy. Is there anything wrong with this approach (apart from the increased latency)? And are there any OpenStack services that, for some reason, should not be scaled by spawning additional instances of them?</span></div>
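<div><span style="font-family:arial,helvetica,sans-serif">To make the idea concrete, here is a rough sketch of generating one HAProxy listen section per service from a list of backends, so that adding a server means regenerating the file and reloading HAProxy. The function name render_listen, the service name, the ports and the 192.0.2.x addresses are all made up for illustration; a real config would also need global and defaults sections and service-specific health checks.</span></div>

```python
def render_listen(name, bind, servers, mode="tcp"):
    """Render one HAProxy 'listen' section.

    servers is a list of (label, host, port) tuples, one per backend.
    """
    lines = [
        f"listen {name}",
        f"    bind {bind}",
        f"    mode {mode}",
        "    balance roundrobin",
    ]
    for label, host, port in servers:
        # 'check' enables basic health checking for this backend.
        lines.append(f"    server {label} {host}:{port} check")
    return "\n".join(lines) + "\n"


# Hypothetical controllers; 192.0.2.0/24 is a documentation-only range.
controllers = [("ctl1", "192.0.2.11", 5000), ("ctl2", "192.0.2.12", 5000)]
print(render_listen("keystone_api", "127.0.0.1:5000", controllers))
```

<div><span style="font-family:arial,helvetica,sans-serif">Adding a third controller would then be one more tuple in the list, followed by a reload of HAProxy on each node.</span></div>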


<div><font face="arial, helvetica, sans-serif">Another thing that bothers me: how should Neutron-related services be distributed if there's no dedicated networking node?</font></div><div><font face="arial, helvetica, sans-serif"><br>


</font></div><div><font face="arial, helvetica, sans-serif">I'd really appreciate it if you could find a few minutes to answer these questions, as it's not easy to find real-life, production-ready examples for small deployments.</font></div>


<div><font face="arial, helvetica, sans-serif"><br></font></div><div><font face="arial, helvetica, sans-serif">P.S. My name is Sergey :) And I should definitely add a signature</font></div><div><br></div><div><br></div><div class="gmail_extra">


<div class="gmail_quote">2014-05-16 16:33 GMT+03:00 Jay Pipes <span dir="ltr">&lt;<a href="mailto:jaypipes@gmail.com" target="_blank">jaypipes@gmail.com</a>&gt;</span>:<br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">On 05/14/2014 02:49 PM, Сергей Мотовиловец wrote:<br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

Hello everyone!<br>

</blockquote>

<br>

Hi Motovilovets :) Comments and questions for you inline...<br>

<br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div>

I'm facing some troubles with nova and cinder here.<br>

<br>

I have 2 control nodes (active/active) in my testing environment with<br>

Percona XtraDB cluster (Galera+xtrabackup) + garbd on a separate node<br></div>

(to avoid split-brain) + OpenStack Icehouse, latest from Ubuntu 14.04<div><br>

main repo.<br>

<br>

The problem is horizontal scalability of nova-conductor and<br>

cinder-scheduler services, seems like all active instances of these<br></div>

services are trying to execute same MySQL queries they get from<br>

Rabbit, which leads to numerous deadlocks in set-up with Galera.<br>

</blockquote>

<br>

Are you using RabbitMQ in clustered mode? Also, how are you doing your load balancing? Do you use HAProxy or some appliance? Do you have sticky sessions enabled for your load balancing?<div><br>

<br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

In case when multiple nova-conductor services are running (and using<br>

MySQL instances on corresponding control nodes) it appears as "Deadlock<br>

found when trying to get lock; try restarting transaction" in log.<br>

With cinder-scheduler it leads to "InvalidBDM: Block Device Mapping is<br>

Invalid."<br>

</blockquote>

<br></div>

So, it's not actually a deadlock that is occurring... unless I'm mistaken (I've asked a Percona engineer to take a look at this thread to double-check me), the error about "Deadlock found..." is actually *not* a deadlock. It's just that Galera uses the same InnoDB error code as a normal deadlock to indicate that the WSREP certification process has timed out between the cluster nodes. Would you mind pastebin'ing your wsrep.cnf and my.cnf files for us to take a look at? I presume that you do not have much latency between the cluster nodes (i.e. they are not over a WAN)... let me know if that is not the case.<br>
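As a rough illustration of how a client can cope with this error: because it is transient, the failed transaction can usually just be retried, as the error message itself suggests. The sketch below is only an example under stated assumptions. DeadlockError and retry_on_deadlock are hypothetical names standing in for whatever exception your database driver raises for MySQL error 1213; they are not an OpenStack or driver API.<br>

```python
import functools
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for the driver error carrying MySQL code 1213."""


def retry_on_deadlock(max_attempts=3, delay=0.1):
    """Retry the wrapped function when it raises DeadlockError."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except DeadlockError:
                    if attempt == max_attempts:
                        raise  # give up after the last attempt
                    # Linear backoff before restarting the transaction.
                    time.sleep(delay * attempt)
        return wrapper
    return decorator


attempts = []


@retry_on_deadlock(max_attempts=3, delay=0)
def update_row():
    """Toy transaction that fails twice, then succeeds."""
    attempts.append(1)
    if len(attempts) != 3:
        raise DeadlockError("Deadlock found when trying to get lock")
    return "committed"


print(update_row(), "after", len(attempts), "attempts")
```

The whole transaction has to be re-run from the start on each retry, not just the statement that failed.<br>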


<br>

It would also be helpful to see your rabbit and load balancer configs if you can pastebin those, too.<br>

<br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div>

Is there any possible way to make multiple instances of these services<br></div>

running simultaneously and not duplicating queries?<br>

</blockquote>

<br>

Yes, it most certainly is. At AT&amp;T, we ran Galera clusters of much bigger size with absolutely no problems due to this cert timeout problem that manifests itself as a deadlock, so I know it's definitely possible to have a clean, performant, multi-writer Galera solution for OpenStack. :)<br>


<br>

Best,<br>

-jay<br>

<br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div>

(I don't really like the idea of handling this with Heartbeat+Pacemaker<br>

or other similar stuff, mostly because I'm thinking about equal load<br>

distribution across control nodes, but in this case it seems like it has<br>

an opposite effect, multiplying load on MySQL)<br>

<br>

Another thing that is extremely annoying: if an instance gets stuck in ERROR<br>

state because of a deadlock during its termination, it is no longer possible to<br>

terminate the instance in Horizon; it can only be done via nova-api with<br>

reset-state. How can this be handled?<br>

<br>

I'd really appreciate any help/advice/thoughts regarding these problems.<br>

<br>

<br>

Best regards,<br>

Motovilovets Sergey<br>

Software Operation Engineer<br>

<br>

<br></div><div>

_______________________________________________<br>

Mailing list: <a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack</a><br>

Post to     : <a href="mailto:openstack@lists.openstack.org" target="_blank">openstack@lists.openstack.org</a><br>

Unsubscribe : <a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack</a><br>

<br>

</div></blockquote><div><div>

<br>

<br>


</div></div></blockquote></div><br></div></div></div>