[Openstack-operators] [Large deployments] Neutron issues in Openstack Large deployment using DVR

Kevin Benton kevin at benton.pub
Tue Feb 28 10:59:47 UTC 2017


What do you see in the Neutron server logs while it's not responding?

On Tue, Feb 28, 2017 at 1:27 AM, Satyanarayana Patibandla
<satya.patibandla at gmail.com> wrote:
> Hi Kevin,
>
> Thanks for your suggestion. I will modify the parameter value and will test
> the changes.
>
> Could you please also suggest how to recover to a normal state after
> hitting this error. Once we hit it, the neutron CLI returns "504 Gateway
> Time-out". We tried restarting all the neutron-server and neutron-agent
> containers, but we still get the same "504 Gateway Time-out" error. Every
> time we have to reimage the servers and redeploy from scratch to make the
> neutron CLI work again.
>
> Thanks,
> Satya.P
>
> On Tue, Feb 28, 2017 at 2:18 PM, Kevin Benton <kevin at benton.pub> wrote:
>>
>> That particular UPDATE query is issued by the agent state report handler,
>> and it looks like the state reports are falling behind, judging by the gap
>> between the timestamp it is trying to write to the DB (14:46:35) and the
>> log statement (14:50:29).
>>
>> Can you try increasing the rpc_state_report_workers value? If you haven't
>> modified it, the default is only 1. You can probably cut the number of RPC
>> workers down to make up for the difference.
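>>
>> Something along these lines in neutron.conf on the API nodes, for example
>> (the numbers are only illustrative, adjust them to your core count):
>>
>>     [DEFAULT]
>>     # more workers dedicated to agent state reports (default is 1)
>>     rpc_state_report_workers = 4
>>     # regular RPC workers can be reduced to compensate
>>     rpc_workers = 8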
>>
>> On Mon, Feb 27, 2017 at 11:39 AM, Satyanarayana Patibandla
>> <satya.patibandla at gmail.com> wrote:
>>>
>>> Hi Kevin,
>>>
>>> After increasing the parameter values mentioned in the mail below, we were
>>> able to create a few hundred VMs without any neutron-related errors. Our
>>> environment contains multiple regions, and one of our team members
>>> mistakenly ran the full OpenStack tempest test suite against the site.
>>> After the tempest run we again observed the "504 Gateway Time-out" error,
>>> and this time the neutron CLI stayed unresponsive even after restarting
>>> all of the neutron agent containers.
>>>
>>> We ran SHOW PROCESSLIST in MySQL and can see a lock held on the query
>>> against the agents table.
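>>>
>>> To dig into which transaction is actually holding the blocking lock, a
>>> query roughly like this against the InnoDB tables in information_schema
>>> can help (sketch; column names as in MySQL 5.6/5.7 / MariaDB 10.x):
>>>
>>> SELECT r.trx_mysql_thread_id AS waiting_thread,
>>>        r.trx_query           AS waiting_query,
>>>        b.trx_mysql_thread_id AS blocking_thread,
>>>        b.trx_query           AS blocking_query
>>> FROM information_schema.innodb_lock_waits w
>>> JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
>>> JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id;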
>>>
>>> In the logs we can see the error below.
>>>
>>> 2017-02-27 14:50:29.085 38 ERROR oslo_messaging.rpc.server DBDeadlock:
>>> (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try
>>> restarting transaction') [SQL: u'UPDATE agents SET
>>> heartbeat_timestamp=%(heartbeat_timestamp)s WHERE agents.id =
>>> %(agents_id)s'] [parameters: {'heartbeat_timestamp': datetime.datetime(2017,
>>> 2, 27, 14, 46, 35, 229400), 'agents_id':
>>> u'94535d12-4b04-42c2-8a74-f2358db41634'}]
>>>
>>> We are using stable/ocata code in our environment. We had to reimage and
>>> redeploy all the nodes to continue our testing. Could you please let us
>>> know your thoughts on the above issue?
>>>
>>> Thanks,
>>> Satya.P
>>>
>>> On Mon, Feb 27, 2017 at 12:32 PM, Satyanarayana Patibandla
>>> <satya.patibandla at gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> We increased api_workers, rpc_workers and metadata_workers based on the
>>>> number of cores on the controller node (each worker count is half the
>>>> number of cores, i.e. with 24 cores we run 12 workers for each). We also
>>>> increased rpc_connect_timeout to 180 and rpc_response_timeout to 600. So
>>>> far these values seem to be fine.
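>>>>
>>>> For reference, on a 24-core controller the relevant settings look roughly
>>>> like this (exact file and section may differ depending on how kolla
>>>> renders the configs; metadata_workers normally belongs to the metadata
>>>> agent's config):
>>>>
>>>>     [DEFAULT]
>>>>     api_workers = 12
>>>>     rpc_workers = 12
>>>>     metadata_workers = 12
>>>>     rpc_response_timeout = 600
>>>>     rpc_connect_timeout = 180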
>>>>
>>>> Let me know if you have any comments or suggestions about increasing
>>>> those parameter values.
>>>>
>>>> Thanks,
>>>> Satya.P
>>>>
>>>> On Mon, Feb 27, 2017 at 11:16 AM, Kevin Benton <kevin at benton.pub> wrote:
>>>>>
>>>>> Thanks for following up. Would you mind sharing the parameters you had
>>>>> to tune (db pool limits, etc) just in case someone comes across this same
>>>>> thread in a google search?
>>>>>
>>>>> Thanks,
>>>>> Kevin Benton
>>>>>
>>>>> On Sun, Feb 26, 2017 at 8:48 PM, Satyanarayana Patibandla
>>>>> <satya.patibandla at gmail.com> wrote:
>>>>>>
>>>>>> Hi Saverio,
>>>>>>
>>>>>> The issue seems to be related to neutron tuning. We observed the same
>>>>>> issue with stable/ocata branch code; after tuning a few neutron
>>>>>> parameters it is working fine.
>>>>>> Thanks for your suggestion.
>>>>>>
>>>>>> Thanks,
>>>>>> Satya.P
>>>>>>
>>>>>> On Wed, Feb 22, 2017 at 10:10 AM, Satyanarayana Patibandla
>>>>>> <satya.patibandla at gmail.com> wrote:
>>>>>>>
>>>>>>> Hi Saverio,
>>>>>>>
>>>>>>> Thanks for your input. We will test with the stable/ocata branch code
>>>>>>> and share the result.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Satya.P
>>>>>>>
>>>>>>> On Wed, Feb 22, 2017 at 1:54 AM, Saverio Proto <zioproto at gmail.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I would use at least the stable/ocata branch. Master is not supposed
>>>>>>>> to be stable, and I am also not sure you can file a bug against a
>>>>>>>> specific commit on master.
>>>>>>>>
>>>>>>>> Saverio
>>>>>>>>
>>>>>>>> 2017-02-21 21:12 GMT+01:00 Satyanarayana Patibandla
>>>>>>>> <satya.patibandla at gmail.com>:
>>>>>>>> > Hi Saverio,
>>>>>>>> >
>>>>>>>> > We create VMs in batches of 20 using a heat template, with a 1
>>>>>>>> > second gap between each VM creation request. When we reached 114 VMs
>>>>>>>> > we got the error mentioned in the mail below. The heat template
>>>>>>>> > boots each instance from a volume and assigns a floating IP to it.
>>>>>>>> >
>>>>>>>> > We restarted all the neutron agent containers on all network and
>>>>>>>> > compute nodes, except the neutron-server container. We are using
>>>>>>>> > kolla to deploy the openstack services.
>>>>>>>> >
>>>>>>>> > We are deploying our services from openstack master branch code
>>>>>>>> > that is about a month old.
>>>>>>>> >
>>>>>>>> > Please find the error logs in the below link.
>>>>>>>> > http://paste.openstack.org/show/599892/
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> > Satya.P
>>>>>>>> >
>>>>>>>> > On Wed, Feb 22, 2017 at 12:21 AM, Saverio Proto
>>>>>>>> > <zioproto at gmail.com> wrote:
>>>>>>>> >>
>>>>>>>> >> Hello Satya,
>>>>>>>> >>
>>>>>>>> >> I would file a bug on Launchpad for this issue.
>>>>>>>> >> 114 VMs is not much. Can you identify how to trigger the issue so
>>>>>>>> >> it can be reproduced, or does it just happen randomly?
>>>>>>>> >>
>>>>>>>> >> When you say rebooting the network node, do you mean the server
>>>>>>>> >> running the neutron-server process?
>>>>>>>> >>
>>>>>>>> >> What version and distribution of openstack are you using?
>>>>>>>> >>
>>>>>>>> >> thank you
>>>>>>>> >>
>>>>>>>> >> Saverio
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >> 2017-02-21 13:54 GMT+01:00 Satyanarayana Patibandla
>>>>>>>> >> <satya.patibandla at gmail.com>:
>>>>>>>> >> > Hi All,
>>>>>>>> >> >
>>>>>>>> >> > We are trying to deploy Openstack in our production environment.
>>>>>>>> >> > For networking we are using DVR without L3 HA. We are able to
>>>>>>>> >> > create 114 VMs without any issue; after that we get the error
>>>>>>>> >> > below.
>>>>>>>> >> >
>>>>>>>> >> > Error: <html><body><h1>504 Gateway Time-out</h1> The server
>>>>>>>> >> > didn't
>>>>>>>> >> > respond
>>>>>>>> >> > in time. </body></html>
>>>>>>>> >> >
>>>>>>>> >> > Neutron services are freezing up due to a persistent lock on the
>>>>>>>> >> > agents table; it seems one of the network nodes is holding the
>>>>>>>> >> > lock on the table. After rebooting that network node, the Neutron
>>>>>>>> >> > CLI was responsive again.
>>>>>>>> >> >
>>>>>>>> >> > The neutron agents and neutron-server are throwing the errors below.
>>>>>>>> >> >
>>>>>>>> >> > Neutron-server errors:
>>>>>>>> >> > ERROR oslo_db.sqlalchemy.exc_filters "Can't reconnect until invalid "
>>>>>>>> >> > ERROR oslo_db.sqlalchemy.exc_filters InvalidRequestError: Can't
>>>>>>>> >> >     reconnect until invalid transaction is rolled back
>>>>>>>> >> > ERROR neutron.api.v2.resource [req-24fa6eaa-a9e0-4f55-97e0-59db203e72c6
>>>>>>>> >> >     3eb776587c9c40569731ebe5c3557bc7 f43e8699cd5a46e89ffe39e3cac75341 - - -]
>>>>>>>> >> >     index failed: No details.
>>>>>>>> >> > ERROR neutron.api.v2.resource DBError: Can't reconnect until invalid
>>>>>>>> >> >     transaction is rolled back
>>>>>>>> >> >
>>>>>>>> >> >
>>>>>>>> >> > Neutron agents errors:
>>>>>>>> >> > MessagingTimeout: Timed out waiting for a reply to message ID
>>>>>>>> >> > 40638b6bf12c44cd9a404ecaa14a9909
>>>>>>>> >> >
>>>>>>>> >> > Could you please give us your input or suggestions on the above
>>>>>>>> >> > errors?
>>>>>>>> >> >
>>>>>>>> >> > Thanks,
>>>>>>>> >> > Satya.P
>>>>>>>> >> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


