<div dir="ltr">That particular update query is issued by the agent state report handler, and it looks like the handler might be falling behind: the timestamp it's trying to write to the DB is 14:46:35, but the log statement is from 14:50:29, almost four minutes later.<div><br></div><div>Can you try increasing the rpc_state_report_workers value? If you haven't modified it, the default value is only 1. You can probably cut the number of RPC workers down to make up for the difference.</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Feb 27, 2017 at 11:39 AM, Satyanarayana Patibandla <span dir="ltr"><<a href="mailto:satya.patibandla@gmail.com" target="_blank">satya.patibandla@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hi Kevin,<div><br></div><div>After increasing the parameter values mentioned in the mail below, we were able to create a few hundred VMs without problems, and there were no errors related to neutron. Our environment contains multiple regions. One of our team members mistakenly ran the tempest tests for all OpenStack services against the site. After the tempest run, we saw the "504 gateway timeout" error again. This time the neutron CLI stayed unresponsive, with the same gateway timeout error, even after restarting all the neutron agent containers.</div><div><br></div><div>We ran SHOW PROCESSLIST in MySQL, and we can see a lock on the query against the agents table. 
</div><div><br></div><div>In the logs we can see the error below.</div><div><br></div><div>2017-02-27 14:50:29.085 38 ERROR oslo_messaging.rpc.server DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'UPDATE agents SET heartbeat_timestamp=%(heartbeat_timestamp)s WHERE agents.id = %(agents_id)s'] [parameters: {'heartbeat_timestamp': datetime.datetime(2017, 2, 27, 14, 46, 35, 229400), 'agents_id': u'94535d12-4b04-42c2-8a74-f2358db41634'}]<br></div><div><br></div><div>We are using stable/ocata code in our environment. We had to reimage and redeploy all the nodes to continue our testing. Could you please let us know your thoughts on the above issue?</div><div><br></div><div>Thanks,</div><div>Satya.P</div></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Feb 27, 2017 at 12:32 PM, Satyanarayana Patibandla <span dir="ltr"><<a href="mailto:satya.patibandla@gmail.com" target="_blank">satya.patibandla@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hi,<div><br></div><div>We increased <span style="font-size:12.8px">api_workers, </span><span style="font-size:12.8px">rpc_workers and </span><span style="font-size:12.8px">metadata_workers based on the number of cores on the controller node (each worker count is half the number of cores, i.e. with 24 cores we run 12 workers for each). We also increased </span><span style="font-size:12.8px">rpc_connect_timeout to 180 and </span><span style="font-size:12.8px">rpc_response_timeout to 600. 
So far these values seem to work fine.</span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">Let me know if you have any comments or suggestions about increasing those parameter values.</span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">Thanks,</span></div><div><span style="font-size:12.8px">Satya.P</span></div></div><div class="m_6392895348368319497HOEnZb"><div class="m_6392895348368319497h5"><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Feb 27, 2017 at 11:16 AM, Kevin Benton <span dir="ltr"><<a href="mailto:kevin@benton.pub" target="_blank">kevin@benton.pub</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Thanks for following up. Would you mind sharing the parameters you had to tune (db pool limits, etc.) just in case someone comes across this same thread in a Google search?<div><br></div><div>Thanks,</div><div>Kevin Benton</div></div><div class="m_6392895348368319497m_5210366294875791424HOEnZb"><div class="m_6392895348368319497m_5210366294875791424h5"><div class="gmail_extra"><br><div class="gmail_quote">On Sun, Feb 26, 2017 at 8:48 PM, Satyanarayana Patibandla <span dir="ltr"><<a href="mailto:satya.patibandla@gmail.com" target="_blank">satya.patibandla@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hi Saverio,<div><br></div><div>The issue seems to be related to neutron tuning. We observed the same issue with the stable/ocata branch code. 
After we tuned a few neutron parameters, it works fine.</div><div>Thanks for your suggestion.</div><div><br></div><div>Thanks,</div><div>Satya.P</div></div><div class="m_6392895348368319497m_5210366294875791424m_6765418772109138313HOEnZb"><div class="m_6392895348368319497m_5210366294875791424m_6765418772109138313h5"><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Feb 22, 2017 at 10:10 AM, Satyanarayana Patibandla <span dir="ltr"><<a href="mailto:satya.patibandla@gmail.com" target="_blank">satya.patibandla@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hi Saverio,<div><br></div><div>Thanks for your input. We will test with the stable/ocata branch code and share the result.</div><div><br></div><div>Thanks,</div><div>Satya.P</div></div><div class="m_6392895348368319497m_5210366294875791424m_6765418772109138313m_520937954979450718HOEnZb"><div class="m_6392895348368319497m_5210366294875791424m_6765418772109138313m_520937954979450718h5"><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Feb 22, 2017 at 1:54 AM, Saverio Proto <span dir="ltr"><<a href="mailto:zioproto@gmail.com" target="_blank">zioproto@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hello,<br>

<br>

I would use at least the stable/ocata branch. Master is not supposed<br>

to be stable, and I am not sure you can file a bug against a specific<br>

commit in master.<br>

<span class="m_6392895348368319497m_5210366294875791424m_6765418772109138313m_520937954979450718m_-186719621597207765HOEnZb"><font color="#888888"><br>

Saverio<br>

</font></span><div class="m_6392895348368319497m_5210366294875791424m_6765418772109138313m_520937954979450718m_-186719621597207765HOEnZb"><div class="m_6392895348368319497m_5210366294875791424m_6765418772109138313m_520937954979450718m_-186719621597207765h5"><br>

2017-02-21 21:12 GMT+01:00 Satyanarayana Patibandla<br>

<<a href="mailto:satya.patibandla@gmail.com" target="_blank">satya.patibandla@gmail.com</a>>:<br>

> Hi Saverio,<br>

><br>

> We tried creating VMs in batches of 20 using a heat template, with a 1 second<br>

> gap between VM creation requests. When we reached 114 VMs we got the error<br>

> mentioned in the mail below. The heat template boots each instance from a<br>

> volume and assigns a floating IP to it.<br>

><br>

> We restarted all the neutron agent containers on every network and compute<br>

> node, but not the neutron-server container. We are using kolla to deploy<br>

> the OpenStack services.<br>

><br>

> We are deploying our services from OpenStack master branch code that is<br>

> about 1 month old.<br>

><br>

> Please find the error logs in the below link.<br>

> <a href="http://paste.openstack.org/show/599892/" rel="noreferrer" target="_blank">http://paste.openstack.org/show/599892/</a><br>

><br>

> Thanks,<br>

> Satya.P<br>

><br>

> On Wed, Feb 22, 2017 at 12:21 AM, Saverio Proto <<a href="mailto:zioproto@gmail.com" target="_blank">zioproto@gmail.com</a>> wrote:<br>

>><br>

>> Hello Satya,<br>

>><br>

>> I would file a bug on Launchpad for this issue.<br>

>> 114 VMs is not much. Can you identify how to trigger the issue so that<br>

>> it can be reproduced, or does it just happen randomly?<br>

>><br>

>> When you say rebooting the network node, do you mean the server<br>

>> running the neutron-server process?<br>

>><br>

>> What version and distribution of OpenStack are you using?<br>

>><br>

>> thank you<br>

>><br>

>> Saverio<br>

>><br>

>><br>

>> 2017-02-21 13:54 GMT+01:00 Satyanarayana Patibandla<br>

>> <<a href="mailto:satya.patibandla@gmail.com" target="_blank">satya.patibandla@gmail.com</a>>:<br>

>> > Hi All,<br>

>> ><br>

>> > We are trying to deploy OpenStack in our production environment. For<br>

>> > networking we are using DVR without L3 HA. We are able to create 114<br>

>> > VMs without any issue. After creating 114 VMs we get the error<br>

>> > below.<br>

>> ><br>

>> > Error: <html><body><h1>504 Gateway Time-out</h1> The server didn't<br>

>> > respond<br>

>> > in time. </body></html><br>

>> ><br>

>> > Neutron services are freezing up due to a persistent lock on the<br>

>> > agents table. It seems one of the network nodes is holding the lock on<br>

>> > the table. After rebooting the network node, the Neutron CLI was<br>

>> > responsive again.<br>

>> ><br>

>> > The neutron agents and the neutron server are throwing the errors below.<br>

>> ><br>

>> > Neutron-server errors:<br>

>> > ERROR oslo_db.sqlalchemy.exc_filters     "Can't reconnect until invalid<br>

>> > "<br>

>> > ERROR oslo_db.sqlalchemy.exc_filters InvalidRequestError: Can't<br>

>> > reconnect<br>

>> > until invalid transaction is rolled back<br>

>> > ERROR neutron.api.v2.resource [req-24fa6eaa-a9e0-4f55-97e0-59db203e72c6<br>

>> > 3eb776587c9c40569731ebe5c3557bc7 f43e8699cd5a46e89ffe39e3cac75341 - - -]<br>

>> > index failed: No details.<br>

>> > ERROR neutron.api.v2.resource DBError: Can't reconnect until invalid<br>

>> > transaction is rolled back<br>

>> ><br>

>> ><br>

>> > Neutron agents errors:<br>

>> > MessagingTimeout: Timed out waiting for a reply to message ID<br>

>> > 40638b6bf12c44cd9a404ecaa14a9909<br>

>> ><br>

>> > Could you please provide us your valuable inputs or suggestions for<br>

>> > above<br>

>> > errors.<br>

>> ><br>

>> > Thanks,<br>

>> > Satya.P<br>

>> ><br>

>> > ______________________________<wbr>_________________<br>

>> > OpenStack-operators mailing list<br>

>> > <a href="mailto:OpenStack-operators@lists.openstack.org" target="_blank">OpenStack-operators@lists.open<wbr>stack.org</a><br>

>> > <a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators" rel="noreferrer" target="_blank">http://lists.openstack.org/cgi<wbr>-bin/mailman/listinfo/openstac<wbr>k-operators</a><br>

>> ><br>

><br>

><br>

</div></div></blockquote></div><br></div>

</div></div></blockquote></div><br></div>

</div></div><br>

<br></blockquote></div><br></div>

</div></div></blockquote></div><br></div>

</div></div></blockquote></div><br></div>

</div></div></blockquote></div><br></div>
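For anyone who finds this thread in a search: the "falling behind" reading at the top of the thread comes from comparing two timestamps in the DBDeadlock log entry. Below is a minimal Python sketch of that arithmetic. The two datetimes are copied from the log entry quoted in the thread. The 30-second report interval is an assumption (believed to be Neutron's default report_interval) and should be replaced with whatever the deployment actually uses. The helper name report_lag_seconds is purely illustrative and is not part of Neutron.

```python
from datetime import datetime

# Values copied from the DBDeadlock log entry quoted in this thread.
log_time = datetime(2017, 2, 27, 14, 50, 29, 85000)     # when the error was logged
heartbeat = datetime(2017, 2, 27, 14, 46, 35, 229400)   # heartbeat_timestamp being written

# Assumption: agents send a state report every 30 seconds. Adjust this if the
# deployment overrides the report interval.
REPORT_INTERVAL_SECONDS = 30


def report_lag_seconds(logged_at: datetime, heartbeat_at: datetime) -> float:
    """Seconds between an agent's heartbeat timestamp and the server logging it."""
    return (logged_at - heartbeat_at).total_seconds()


lag = report_lag_seconds(log_time, heartbeat)
missed_intervals = int(lag // REPORT_INTERVAL_SECONDS)
print(f"state report lag: {lag:.1f}s (~{missed_intervals} report intervals behind)")
```

If the lag is several report intervals long, as it is here, the server is still working through heartbeats that newer reports have already superseded, which fits the suggestion to raise rpc_state_report_workers.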