[Openstack-operators] [Large deployments] Neutron issues in Openstack Large deployment using DVR

Satyanarayana Patibandla satya.patibandla at gmail.com
Tue Feb 28 09:27:08 UTC 2017


Hi Kevin,

Thanks for your suggestion. I will modify the parameter value and will test
the changes.

Could you also suggest how to recover to a normal state after hitting this
error? Once it occurs, the neutron CLI returns "504 gateway timeout". We
tried restarting all of the neutron-server and neutron-agent containers,
but we still get the same "504 gateway timeout" error. Every time, we have
to reimage the servers and redeploy from scratch to make the neutron CLI
work again.

Thanks,
Satya.P

On Tue, Feb 28, 2017 at 2:18 PM, Kevin Benton <kevin at benton.pub> wrote:

> That particular update query is issued by the agent state report handler.
> And it looks like they might be falling behind based on the timestamp it's
> trying to update in the DB (14:46:35) and the log statement (14:50:29).
>
> Can you try increasing the rpc_state_report_workers value? If you haven't
> modified it, the default value is only 1. You can probably cut the number
> of RPC workers down to make up for the difference.
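>
> For example (the counts here are only illustrative; the right numbers
> depend on your controllers), in neutron.conf on the neutron-server hosts:
>
>     [DEFAULT]
>     # give agent heartbeat/state reports their own worker processes
>     rpc_state_report_workers = 4
>     # and reclaim some of those processes from the general RPC workers
>     rpc_workers = 8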
>
> On Mon, Feb 27, 2017 at 11:39 AM, Satyanarayana Patibandla <
> satya.patibandla at gmail.com> wrote:
>
>> Hi Kevin,
>>
>> After increasing the parameter values mentioned in the mail below, we were
>> able to create a few hundred VMs without any Neutron-related errors. Our
>> environment contains multiple regions, and one of our team members
>> mistakenly ran the full set of OpenStack tempest tests against this site.
>> After that tempest run we saw the "504 gateway timeout" error again, and
>> this time the neutron CLI stayed unresponsive even after we restarted all
>> of the neutron agent containers.
>>
>> We ran SHOW PROCESSLIST in MySQL and could see a lock held on the query
>> against the agents table.
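>>
>> For reference, the kind of inspection we mean is roughly the following
>> (the thread ID is only a placeholder, and the KILL is an idea we have not
>> actually tried):
>>
>>     mysql> SHOW PROCESSLIST;           -- list connections; look for the stuck UPDATE on agents
>>     mysql> SHOW ENGINE INNODB STATUS;  -- the TRANSACTIONS section shows who holds the row lock
>>     mysql> KILL 12345;                 -- kill that blocking connection instead of reimaging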
>>
>> In the logs we can see the error below.
>>
>> 2017-02-27 14:50:29.085 38 ERROR oslo_messaging.rpc.server DBDeadlock:
>> (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try
>> restarting transaction') [SQL: u'UPDATE agents SET
>> heartbeat_timestamp=%(heartbeat_timestamp)s WHERE agents.id =
>> %(agents_id)s'] [parameters: {'heartbeat_timestamp':
>> datetime.datetime(2017, 2, 27, 14, 46, 35, 229400), 'agents_id':
>> u'94535d12-4b04-42c2-8a74-f2358db41634'}]
>>
>> We are using stable/ocata code in our environment. We had to reimage and
>> redeploy all the nodes to continue our testing. Could you please let us
>> know your thoughts on the above issue?
>>
>> Thanks,
>> Satya.P
>>
>> On Mon, Feb 27, 2017 at 12:32 PM, Satyanarayana Patibandla <
>> satya.patibandla at gmail.com> wrote:
>>
>>> Hi,
>>>
>>> We increased api_workers, rpc_workers and metadata_workers based on the
>>> number of cores on the controller node (each worker count is half the
>>> number of cores, i.e. with 24 cores we run 12 workers of each kind). We
>>> also increased rpc_connect_timeout to 180 and rpc_response_timeout to
>>> 600. So far these values seem to be fine.
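>>>
>>> For the record, the relevant settings look roughly like this on a 24-core
>>> controller (which config file and section each option lives in can vary
>>> with the deployment; metadata_workers, for example, normally belongs to
>>> the metadata agent's config):
>>>
>>>     [DEFAULT]
>>>     api_workers = 12
>>>     rpc_workers = 12
>>>     metadata_workers = 12
>>>     rpc_connect_timeout = 180
>>>     rpc_response_timeout = 600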
>>>
>>> Let me know if you have any comments or suggestions about increasing
>>> those parameter values.
>>>
>>> Thanks,
>>> Satya.P
>>>
>>> On Mon, Feb 27, 2017 at 11:16 AM, Kevin Benton <kevin at benton.pub> wrote:
>>>
>>>> Thanks for following up. Would you mind sharing the parameters you had
>>>> to tune (db pool limits, etc) just in case someone comes across this same
>>>> thread in a google search?
>>>>
>>>> Thanks,
>>>> Kevin Benton
>>>>
>>>> On Sun, Feb 26, 2017 at 8:48 PM, Satyanarayana Patibandla <
>>>> satya.patibandla at gmail.com> wrote:
>>>>
>>>>> Hi Saverio,
>>>>>
>>>>> The issue seems to be related to Neutron tuning. We observed the same
>>>>> issue with the stable/ocata branch code, and after tuning a few Neutron
>>>>> parameters it is working fine.
>>>>> Thanks for your suggestion.
>>>>>
>>>>> Thanks,
>>>>> Satya.P
>>>>>
>>>>> On Wed, Feb 22, 2017 at 10:10 AM, Satyanarayana Patibandla <
>>>>> satya.patibandla at gmail.com> wrote:
>>>>>
>>>>>> Hi Saverio,
>>>>>>
>>>>>> Thanks for your inputs. We will test with the stable/ocata branch code
>>>>>> and share the result.
>>>>>>
>>>>>> Thanks,
>>>>>> Satya.P
>>>>>>
>>>>>> On Wed, Feb 22, 2017 at 1:54 AM, Saverio Proto <zioproto at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I would use at least the stable/ocata branch. Master is not supposed
>>>>>>> to be stable, and I am also not sure you can file a bug against a
>>>>>>> specific commit on master.
>>>>>>>
>>>>>>> Saverio
>>>>>>>
>>>>>>> 2017-02-21 21:12 GMT+01:00 Satyanarayana Patibandla
>>>>>>> <satya.patibandla at gmail.com>:
>>>>>>> > Hi Saverio,
>>>>>>> >
>>>>>>> > We have been creating 20 VMs at a time using a Heat template, with a
>>>>>>> > 1 second gap between each VM creation request. When we reached 114 VMs
>>>>>>> > we got the error mentioned in the mail below. The Heat template boots
>>>>>>> > the instance from a volume and assigns a floating IP to it.
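>>>>>>> >
>>>>>>> > For context, a minimal sketch of that kind of template (the resource,
>>>>>>> > image, flavor and network names here are placeholders, not our real
>>>>>>> > ones):
>>>>>>> >
>>>>>>> >     heat_template_version: 2016-10-14
>>>>>>> >     resources:
>>>>>>> >       boot_volume:
>>>>>>> >         type: OS::Cinder::Volume
>>>>>>> >         properties:
>>>>>>> >           image: cirros            # placeholder image
>>>>>>> >           size: 10
>>>>>>> >       vm_port:
>>>>>>> >         type: OS::Neutron::Port
>>>>>>> >         properties:
>>>>>>> >           network: tenant-net      # placeholder tenant network
>>>>>>> >       vm:
>>>>>>> >         type: OS::Nova::Server
>>>>>>> >         properties:
>>>>>>> >           flavor: m1.small         # placeholder flavor
>>>>>>> >           networks:
>>>>>>> >             - port: { get_resource: vm_port }
>>>>>>> >           block_device_mapping:
>>>>>>> >             - device_name: vda
>>>>>>> >               volume_id: { get_resource: boot_volume }
>>>>>>> >               delete_on_termination: true
>>>>>>> >       vm_fip:
>>>>>>> >         type: OS::Neutron::FloatingIP
>>>>>>> >         properties:
>>>>>>> >           floating_network: public # placeholder external network
>>>>>>> >           port_id: { get_resource: vm_port }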
>>>>>>> >
>>>>>>> > We restarted all of the Neutron agent containers on the network and
>>>>>>> > compute nodes, but not the neutron-server container. We are using Kolla
>>>>>>> > to deploy the OpenStack services.
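>>>>>>> >
>>>>>>> > On a Kolla deployment that restart amounts to something like the
>>>>>>> > following on each node (the container names may differ between Kolla
>>>>>>> > versions, so this is only a sketch):
>>>>>>> >
>>>>>>> >     docker restart neutron_openvswitch_agent neutron_l3_agent \
>>>>>>> >         neutron_dhcp_agent neutron_metadata_agent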
>>>>>>> >
>>>>>>> > Our services are deployed from roughly month-old master branch
>>>>>>> > OpenStack code.
>>>>>>> >
>>>>>>> > Please find the error logs at the link below:
>>>>>>> > http://paste.openstack.org/show/599892/
>>>>>>> >
>>>>>>> > Thanks,
>>>>>>> > Satya.P
>>>>>>> >
>>>>>>> > On Wed, Feb 22, 2017 at 12:21 AM, Saverio Proto <
>>>>>>> zioproto at gmail.com> wrote:
>>>>>>> >>
>>>>>>> >> Hello Satya,
>>>>>>> >>
>>>>>>> >> I would file a bug on Launchpad for this issue.
>>>>>>> >> 114 VMs is not much. Can you identify how to trigger the issue so it
>>>>>>> >> can be reproduced, or does it just happen randomly?
>>>>>>> >>
>>>>>>> >> When you say rebooting the network node, do you mean the server
>>>>>>> >> running the neutron-server process?
>>>>>>> >>
>>>>>>> >> What version and distribution of OpenStack are you using?
>>>>>>> >>
>>>>>>> >> thank you
>>>>>>> >>
>>>>>>> >> Saverio
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> 2017-02-21 13:54 GMT+01:00 Satyanarayana Patibandla
>>>>>>> >> <satya.patibandla at gmail.com>:
>>>>>>> >> > Hi All,
>>>>>>> >> >
>>>>>>> >> > We are trying to deploy OpenStack in our production environment. For
>>>>>>> >> > networking we are using DVR without L3 HA. We were able to create 114
>>>>>>> >> > VMs without any issue; after creating 114 VMs we get the error below.
>>>>>>> >> >
>>>>>>> >> > Error: <html><body><h1>504 Gateway Time-out</h1> The server didn't
>>>>>>> >> > respond in time. </body></html>
>>>>>>> >> >
>>>>>>> >> > The Neutron services freeze up due to a persistent lock on the agents
>>>>>>> >> > table; it seems one of the network nodes is holding the lock on that
>>>>>>> >> > table. After rebooting that network node, the Neutron CLI was
>>>>>>> >> > responsive again.
>>>>>>> >> >
>>>>>>> >> > The Neutron agents and neutron-server are throwing the errors below.
>>>>>>> >> >
>>>>>>> >> > Neutron-server errors:
>>>>>>> >> > ERROR oslo_db.sqlalchemy.exc_filters "Can't reconnect until invalid "
>>>>>>> >> > ERROR oslo_db.sqlalchemy.exc_filters InvalidRequestError: Can't
>>>>>>> >> > reconnect until invalid transaction is rolled back
>>>>>>> >> > ERROR neutron.api.v2.resource [req-24fa6eaa-a9e0-4f55-97e0-59db203e72c6
>>>>>>> >> > 3eb776587c9c40569731ebe5c3557bc7 f43e8699cd5a46e89ffe39e3cac75341 - - -]
>>>>>>> >> > index failed: No details.
>>>>>>> >> > ERROR neutron.api.v2.resource DBError: Can't reconnect until invalid
>>>>>>> >> > transaction is rolled back
>>>>>>> >> >
>>>>>>> >> >
>>>>>>> >> > Neutron agent errors:
>>>>>>> >> > MessagingTimeout: Timed out waiting for a reply to message ID
>>>>>>> >> > 40638b6bf12c44cd9a404ecaa14a9909
>>>>>>> >> >
>>>>>>> >> > Could you please give us your inputs or suggestions on the above
>>>>>>> >> > errors?
>>>>>>> >> >
>>>>>>> >> > Thanks,
>>>>>>> >> > Satya.P
>>>>>>> >> >
>>>>>>> >
>>>>>>> >
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>