[magnum][api] Error system library fopen too many open files with magnum-auto-healer

Erik Olof Gunnar Andersson eandersson at blizzard.com
Mon Jan 4 19:49:16 UTC 2021


Sure looks like RabbitMQ. How many workers do you have configured?

Could you try changing it to workers=1 (or processes=1) and then see whether it goes beyond 30 connections to AMQP?

Best Regards, Erik Olof Gunnar Andersson
Technical Lead, Senior Cloud Engineer

From: Ionut Biru <ionut at fleio.com>
Sent: Monday, January 4, 2021 4:07 AM
To: Erik Olof Gunnar Andersson <eandersson at blizzard.com>
Cc: feilong <feilong at catalyst.net.nz>; openstack-discuss <openstack-discuss at lists.openstack.org>
Subject: Re: [magnum][api] Error system library fopen too many open files with magnum-auto-healer

Hi Erik,

Here is the lsof output of one uwsgi API process: https://paste.xinu.at/5YUWf/

I have kubernetes 12.0.1 installed in the env.


On Sun, Jan 3, 2021 at 3:06 AM Erik Olof Gunnar Andersson <eandersson at blizzard.com> wrote:
Maybe something similar to this?
https://github.com/kubernetes-client/python/issues/1158

What does lsof say?
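
For a quick look without lsof, the same information can be approximated on Linux by reading the symlinks under /proc/<pid>/fd. The following is only a minimal sketch; the function name classify_fds and the four categories are illustrative choices, not part of Magnum or any tool mentioned in this thread.

```python
import os
from collections import Counter


def classify_fds(fd_dir):
    """Count the entries of a /proc/<pid>/fd-style directory by target kind."""
    counts = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink(),
            # or the entry is not a symlink.
            continue
        if target.startswith("socket:"):
            counts["socket"] += 1
        elif target.startswith("pipe:"):
            counts["pipe"] += 1
        elif target.startswith("anon_inode:"):
            counts["anon_inode"] += 1
        else:
            # Everything else is treated as a regular path (files, devices).
            counts["file"] += 1
    return counts
```

Calling classify_fds("/proc/<pid>/fd") with the PID of one uwsgi worker (as root or as the process owner) a few minutes apart should show whether the growth is in sockets or in regular files.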

________________________________
From: Erik Olof Gunnar Andersson <eandersson at blizzard.com>
Sent: Saturday, January 2, 2021 4:54 PM
To: Ionut Biru <ionut at fleio.com>; feilong <feilong at catalyst.net.nz>
Cc: openstack-discuss <openstack-discuss at lists.openstack.org>
Subject: Re: [magnum][api] Error system library fopen too many open files with magnum-auto-healer

Are you sure you aren't just looking at the connection pool expanding? Each worker has a max number of connections it can use. Maybe look at lowering rpc_conn_pool_size. By default I believe each worker might create a pool of up to 30 connections.
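
As a rough sanity check on the pool explanation, the ceiling is simply workers times pool size. The sketch below assumes the default of 30 suggested above and takes the worker count as an input, since it is not stated in this thread; the function names are illustrative, and the bound ignores any connections opened outside the pool.

```python
def max_pooled_connections(workers, rpc_conn_pool_size=30):
    """Rough upper bound on pooled AMQP connections across all workers."""
    return workers * rpc_conn_pool_size


def exceeds_pool_ceiling(observed, workers, rpc_conn_pool_size=30):
    """True if the observed connection count cannot be explained by pooling alone."""
    return observed > max_pooled_connections(workers, rpc_conn_pool_size)


# Hypothetical example: a single worker with the default pool size.
print(max_pooled_connections(workers=1))            # 30
print(exceeds_pool_ceiling(observed=45, workers=1))  # True
```

If the count keeps climbing past this ceiling with a single worker, pool expansion alone is unlikely to be the explanation.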

Looking at the code, it could also have something to do with the k8s client, since it creates a new instance each time it does a health check. What version of the k8s client do you have installed?

________________________________
From: Ionut Biru <ionut at fleio.com>
Sent: Tuesday, December 29, 2020 2:20 PM
To: feilong <feilong at catalyst.net.nz>
Cc: openstack-discuss <openstack-discuss at lists.openstack.org>
Subject: Re: [magnum][api] Error system library fopen too many open files with magnum-auto-healer

Hi,

Not sure if my suspicion is true, but I think that for each update a new notifier is prepared and used without closing the connection. My understanding of oslo is nonexistent, though.

https://opendev.org/openstack/magnum/src/branch/master/magnum/conductor/utils.py#L147
https://opendev.org/openstack/magnum/src/branch/master/magnum/common/rpc.py#L173
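
To make the suspicion concrete, here is a toy illustration of the difference between building a new connection-holding object on every call and reusing one per process. FakeTransport, notify_fresh, notify_cached and get_transport are all hypothetical stand-ins; this is not Magnum's code and not the oslo.messaging API, only a sketch of the pattern being described.

```python
import functools


class FakeTransport:
    """Stand-in for a message-bus transport; counts how many were opened."""

    opened = 0

    def __init__(self):
        FakeTransport.opened += 1

    def send(self, event):
        return f"sent {event}"


def notify_fresh(event):
    # Suspected pattern: a new transport per call that is never closed.
    return FakeTransport().send(event)


@functools.lru_cache(maxsize=None)
def get_transport():
    # Build the transport once per process and return the same object after that.
    return FakeTransport()


def notify_cached(event):
    return get_transport().send(event)
```

Five calls to notify_fresh create five transports, while five calls to notify_cached create one. Whether the real notifier behaves like the first function is exactly what would need to be confirmed in the code linked above.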

On Tue, Dec 29, 2020 at 11:52 PM Ionut Biru <ionut at fleio.com> wrote:
Hi Feilong,

I found out that each time the update_health_status periodic task is run, a new connection (for each uwsgi) is made to RabbitMQ.

root@ctrl1cj-magnum-container-7a7a412a:~# netstat -npt | grep 5672 | wc -l
229
root@ctrl1cj-magnum-container-7a7a412a:~# netstat -npt | grep 5672 | wc -l
234
root@ctrl1cj-magnum-container-7a7a412a:~# netstat -npt | grep 5672 | wc -l
238
root@ctrl1cj-magnum-container-7a7a412a:~# netstat -npt | grep 5672 | wc -l
241
root@ctrl1cj-magnum-container-7a7a412a:~# netstat -npt | grep 5672 | wc -l
244
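
The totals above do not show which process owns the growing connections. A small sketch that groups the same netstat -npt output by its PID/Program column could narrow that down. The function name count_by_process is illustrative, and unlike the plain grep it matches only rows whose remote port is exactly 5672.

```python
from collections import Counter


def count_by_process(netstat_output, port=5672):
    """Count TCP connections whose remote port is `port`, keyed by PID/program."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows look like:
        # tcp  0  0  <local addr>  <foreign addr>  <state>  <pid>/<program>
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue
        foreign_port = fields[4].rsplit(":", 1)[-1]
        if foreign_port == str(port):
            counts[fields[6]] += 1
    return counts
```

Feeding it the text of `netstat -npt` (for example via subprocess.run(["netstat", "-npt"], capture_output=True, text=True).stdout) and printing counts.most_common() on each run would show whether the uwsgi workers or the conductor processes account for the increase.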

Not sure

Dec 29 21:51:22 ctrl1cj-magnum-container-7a7a412a magnum-conductor[262800]: 2020-12-29 21:51:22.024 262800 DEBUG magnum.service.periodic [req-3b495326-cf80-481e-b3c6-c741f05b7f0e - - - - -]
Dec 29 21:51:22 ctrl1cj-magnum-container-7a7a412a magnum-conductor[262800]: 2020-12-29 21:51:22.024 262800 DEBUG oslo_service.periodic_task [-] Running periodic task MagnumPeriodicTasks.sync
Dec 29 21:51:16 ctrl1cj-magnum-container-7a7a412a magnum-conductor[262804]: 2020-12-29 21:51:16.462 262804 DEBUG magnum.conductor.handlers.cluster_conductor [req-284ac12b-d76a-4e50-8e74-5bfb
Dec 29 21:51:15 ctrl1cj-magnum-container-7a7a412a magnum-conductor[262800]: 2020-12-29 21:51:15.573 262800 DEBUG magnum.service.periodic [-] Status for cluster 118 updated to HEALTHY ({'api'
Dec 29 21:51:15 ctrl1cj-magnum-container-7a7a412a magnum-conductor[262805]: 2020-12-29 21:51:15.572 262805 DEBUG magnum.conductor.handlers.cluster_conductor [req-3fc29ee9-4051-42e7-ae19-3a49
Dec 29 21:51:15 ctrl1cj-magnum-container-7a7a412a magnum-conductor[262800]: 2020-12-29 21:51:15.572 262800 DEBUG magnum.service.periodic [-] Status for cluster 121 updated to HEALTHY ({'api'
Dec 29 21:51:15 ctrl1cj-magnum-container-7a7a412a magnum-conductor[262800]: 2020-12-29 21:51:15.572 262800 DEBUG magnum.service.periodic [-] Status for cluster 122 updated to HEALTHY ({'api'
Dec 29 21:51:15 ctrl1cj-magnum-container-7a7a412a magnum-conductor[262800]: 2020-12-29 21:51:15.553 262800 DEBUG magnum.service.periodic [-] Updating health status for cluster 122 update_hea
Dec 29 21:51:15 ctrl1cj-magnum-container-7a7a412a magnum-conductor[262800]: 2020-12-29 21:51:15.544 262800 DEBUG magnum.service.periodic [-] Updating health status for cluster 121 update_hea
Dec 29 21:51:15 ctrl1cj-magnum-container-7a7a412a magnum-conductor[262800]: 2020-12-29 21:51:15.535 262800 DEBUG magnum.service.periodic [-] Updating health status for cluster 118 update_hea
Dec 29 21:51:15 ctrl1cj-magnum-container-7a7a412a magnum-conductor[262800]: 2020-12-29 21:51:15.494 262800 DEBUG magnum.service.periodic [req-405b1fed-0b8a-4a60-b6ae-834f548b21d1 - - -


2020-12-29 21:51:14.082 [info] <0.953.1293> accepting AMQP connection <0.953.1293> (172.29.93.14:48474 -> 172.29.95.38:5672)
2020-12-29 21:51:14.083 [info] <0.953.1293> Connection <0.953.1293> (172.29.93.14:48474 -> 172.29.95.38:5672) has a client-provided name: uwsgi:262739:f86c0570-8739-4b74-8102-76b5357acd71
2020-12-29 21:51:14.084 [info] <0.953.1293> connection <0.953.1293> (172.29.93.14:48474 -> 172.29.95.38:5672 - uwsgi:262739:f86c0570-8739-4b74-8102-76b5357acd71): user 'magnum' authenticated and granted access to vhost '/magnum'
2020-12-29 21:51:15.560 [info] <0.1656.1283> accepting AMQP connection <0.1656.1283> (172.29.93.14:48548 -> 172.29.95.38:5672)
2020-12-29 21:51:15.561 [info] <0.1656.1283> Connection <0.1656.1283> (172.29.93.14:48548 -> 172.29.95.38:5672) has a client-provided name: uwsgi:262744:2c9792ab-9198-493a-970c-f6ccfd9947d3
2020-12-29 21:51:15.561 [info] <0.1656.1283> connection <0.1656.1283> (172.29.93.14:48548 -> 172.29.95.38:5672 - uwsgi:262744:2c9792ab-9198-493a-970c-f6ccfd9947d3): user 'magnum' authenticated and granted access to vhost '/magnum'

On Tue, Dec 22, 2020 at 4:12 AM feilong <feilong at catalyst.net.nz> wrote:

Hi Ionut,

I haven't seen this before in our production. Magnum auto healer simply sends a POST request to the Magnum API to update the health status, so I would suggest first writing a small script, or even using curl, to see if you can reproduce this.


On 19/12/20 2:27 am, Ionut Biru wrote:
Hi again,

I failed to mention that this is stable/victoria with a couple of patches from review. Ignore the fact that the logs show version 19.1.4 in the venv path.

On Fri, Dec 18, 2020 at 3:22 PM Ionut Biru <ionut at fleio.com> wrote:
Hi guys,

I have an issue with magnum api returning an error after a while:
Server-side error: "[('system library', 'fopen', 'Too many open files'), ('BIO routines', 'BIO_new_file', 'system lib'), ('x509 certificate routines', 'X509_load_cert_crl_file', 'system lib')]"

Log file: https://paste.xinu.at/6djE/

This started to appear after I enabled auto healing in the template with auto_healing_controller = magnum-auto-healer and magnum_auto_healer_tag = v1.19.0.

Currently, I only have 4 clusters.

After that, the API is in an error state and doesn't work unless I restart it.


--
Ionut Biru - https://fleio.com


--

Cheers & Best regards,

Feilong Wang (王飞龙)

------------------------------------------------------

Senior Cloud Software Engineer

Tel: +64-48032246

Email: flwang at catalyst.net.nz

Catalyst IT Limited

Level 6, Catalyst House, 150 Willis Street, Wellington

------------------------------------------------------

