[magnum][api] Error system library fopen too many open files with magnum-auto-healer
Hi guys,

I have an issue with the magnum API returning an error after a while:

Server-side error: "[('system library', 'fopen', 'Too many open files'), ('BIO routines', 'BIO_new_file', 'system lib'), ('x509 certificate routines', 'X509_load_cert_crl_file', 'system lib')]"

Log file: https://paste.xinu.at/6djE/

This started to appear after I enabled the template labels auto_healing_controller = magnum-auto-healer and magnum_auto_healer_tag = v1.19.0. Currently, I only have 4 clusters.

After that, the API is in an error state and doesn't work unless I restart it.

-- Ionut Biru - https://fleio.com
Hi again,

I failed to mention that this is stable/victoria with a couple of patches from review. Ignore the fact that the logs show version 19.1.4 in the venv path.

On Fri, Dec 18, 2020 at 3:22 PM Ionut Biru <ionut@fleio.com> wrote:
Hi Ionut,

I haven't seen this before on our production. The magnum auto healer simply sends a POST request to the Magnum API to update the health status. So I would suggest writing a small script, or even using curl, to see if you can reproduce this first.

On 19/12/20 2:27 am, Ionut Biru wrote:
--
Cheers & Best regards,
Feilong Wang (王飞龙)
------------------------------------------------------
Senior Cloud Software Engineer
Tel: +64-48032246
Email: flwang@catalyst.net.nz
Catalyst IT Limited
Level 6, Catalyst House, 150 Willis Street, Wellington
------------------------------------------------------
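Feilong's suggestion above (reproduce the auto-healer's health update outside of Kubernetes) could be sketched roughly as below. This is a hypothetical sketch, not the auto-healer's actual code: the endpoint path, token handling, and patch fields are assumptions based on Magnum's cluster-update API and should be checked against your API version.

```python
import json
import urllib.request


def build_health_patch(status, reason):
    """Build a JSON-patch body updating a cluster's health fields.

    The field names here are assumptions; verify them against your
    Magnum API version before use.
    """
    return [
        {"op": "replace", "path": "/health_status", "value": status},
        {"op": "replace", "path": "/health_status_reason",
         "value": json.dumps(reason)},
    ]


def update_cluster_health(magnum_url, token, cluster_id, status, reason):
    """PATCH the cluster to mimic one periodic health update (hypothetical)."""
    body = json.dumps(build_health_patch(status, reason)).encode()
    req = urllib.request.Request(
        f"{magnum_url}/v1/clusters/{cluster_id}",
        data=body,
        method="PATCH",
        headers={"X-Auth-Token": token, "Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)


# Inspect the patch body that would be sent (no network access needed here).
print(json.dumps(build_health_patch("HEALTHY", {"api": "ok"})))
```

Running `update_cluster_health` in a tight loop against a test cluster would show whether repeated updates alone are enough to exhaust file descriptors on the API side.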
Hi Feilong,

I found out that each time the update_health_status periodic task runs, a new connection (for each uwsgi worker) is made to rabbitmq:

root@ctrl1cj-magnum-container-7a7a412a:~# netstat -npt | grep 5672 | wc -l
229
root@ctrl1cj-magnum-container-7a7a412a:~# netstat -npt | grep 5672 | wc -l
234
root@ctrl1cj-magnum-container-7a7a412a:~# netstat -npt | grep 5672 | wc -l
238
root@ctrl1cj-magnum-container-7a7a412a:~# netstat -npt | grep 5672 | wc -l
241
root@ctrl1cj-magnum-container-7a7a412a:~# netstat -npt | grep 5672 | wc -l
244

Not sure if these help, but here are the magnum-conductor logs around one run (lines truncated as captured):

Dec 29 21:51:22 ctrl1cj-magnum-container-7a7a412a magnum-conductor[262800]: 2020-12-29 21:51:22.024 262800 DEBUG magnum.service.periodic [req-3b495326-cf80-481e-b3c6-c741f05b7f0e - - - - -]
Dec 29 21:51:22 ctrl1cj-magnum-container-7a7a412a magnum-conductor[262800]: 2020-12-29 21:51:22.024 262800 DEBUG oslo_service.periodic_task [-] Running periodic task MagnumPeriodicTasks.sync
Dec 29 21:51:16 ctrl1cj-magnum-container-7a7a412a magnum-conductor[262804]: 2020-12-29 21:51:16.462 262804 DEBUG magnum.conductor.handlers.cluster_conductor [req-284ac12b-d76a-4e50-8e74-5bfb
Dec 29 21:51:15 ctrl1cj-magnum-container-7a7a412a magnum-conductor[262800]: 2020-12-29 21:51:15.573 262800 DEBUG magnum.service.periodic [-] Status for cluster 118 updated to HEALTHY ({'api'
Dec 29 21:51:15 ctrl1cj-magnum-container-7a7a412a magnum-conductor[262805]: 2020-12-29 21:51:15.572 262805 DEBUG magnum.conductor.handlers.cluster_conductor [req-3fc29ee9-4051-42e7-ae19-3a49
Dec 29 21:51:15 ctrl1cj-magnum-container-7a7a412a magnum-conductor[262800]: 2020-12-29 21:51:15.572 262800 DEBUG magnum.service.periodic [-] Status for cluster 121 updated to HEALTHY ({'api'
Dec 29 21:51:15 ctrl1cj-magnum-container-7a7a412a magnum-conductor[262800]: 2020-12-29 21:51:15.572 262800 DEBUG magnum.service.periodic [-] Status for cluster 122 updated to HEALTHY ({'api'
Dec 29 21:51:15 ctrl1cj-magnum-container-7a7a412a magnum-conductor[262800]: 2020-12-29 21:51:15.553 262800 DEBUG magnum.service.periodic [-] Updating health status for cluster 122 update_hea
Dec 29 21:51:15 ctrl1cj-magnum-container-7a7a412a magnum-conductor[262800]: 2020-12-29 21:51:15.544 262800 DEBUG magnum.service.periodic [-] Updating health status for cluster 121 update_hea
Dec 29 21:51:15 ctrl1cj-magnum-container-7a7a412a magnum-conductor[262800]: 2020-12-29 21:51:15.535 262800 DEBUG magnum.service.periodic [-] Updating health status for cluster 118 update_hea
Dec 29 21:51:15 ctrl1cj-magnum-container-7a7a412a magnum-conductor[262800]: 2020-12-29 21:51:15.494 262800 DEBUG magnum.service.periodic [req-405b1fed-0b8a-4a60-b6ae-834f548b21d1 - - -

And the matching rabbitmq logs, where new connections are accepted from the uwsgi workers:

2020-12-29 21:51:14.082 [info] <0.953.1293> accepting AMQP connection <0.953.1293> (172.29.93.14:48474 -> 172.29.95.38:5672)
2020-12-29 21:51:14.083 [info] <0.953.1293> Connection <0.953.1293> (172.29.93.14:48474 -> 172.29.95.38:5672) has a client-provided name: uwsgi:262739:f86c0570-8739-4b74-8102-76b5357acd71
2020-12-29 21:51:14.084 [info] <0.953.1293> connection <0.953.1293> (172.29.93.14:48474 -> 172.29.95.38:5672 - uwsgi:262739:f86c0570-8739-4b74-8102-76b5357acd71): user 'magnum' authenticated and granted access to vhost '/magnum'
2020-12-29 21:51:15.560 [info] <0.1656.1283> accepting AMQP connection <0.1656.1283> (172.29.93.14:48548 -> 172.29.95.38:5672)
2020-12-29 21:51:15.561 [info] <0.1656.1283> Connection <0.1656.1283> (172.29.93.14:48548 -> 172.29.95.38:5672) has a client-provided name: uwsgi:262744:2c9792ab-9198-493a-970c-f6ccfd9947d3
2020-12-29 21:51:15.561 [info] <0.1656.1283> connection <0.1656.1283> (172.29.93.14:48548 -> 172.29.95.38:5672 - uwsgi:262744:2c9792ab-9198-493a-970c-f6ccfd9947d3): user 'magnum' authenticated and granted access to vhost '/magnum'

On Tue, Dec 22, 2020 at 4:12 AM feilong <feilong@catalyst.net.nz> wrote:
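The growth observed above can be tracked without eyeballing repeated netstat runs. A small sketch that counts established AMQP connections from `netstat -npt` output; the parsing is purely textual, so the sample below just uses rows shaped like the ones in this thread:

```python
def count_amqp_connections(netstat_output, port=5672):
    """Count netstat -npt rows whose remote address uses the given port
    (5672 is the default AMQP/RabbitMQ port)."""
    count = 0
    for line in netstat_output.splitlines():
        parts = line.split()
        # Row shape: Proto Recv-Q Send-Q Local-Address Foreign-Address State PID/Program
        if len(parts) >= 5 and parts[4].endswith(f":{port}"):
            count += 1
    return count


sample = """\
tcp 0 0 172.29.93.14:48474 172.29.95.38:5672 ESTABLISHED 262739/uwsgi
tcp 0 0 172.29.93.14:48548 172.29.95.38:5672 ESTABLISHED 262744/uwsgi
tcp 0 0 172.29.93.14:41022 172.29.92.10:3306 ESTABLISHED 262739/uwsgi
"""
print(count_amqp_connections(sample))  # 2 of the 3 rows target port 5672
```

Feeding this a fresh `netstat -npt` capture on a timer would make the 229 → 244 climb easy to graph and correlate with the periodic task runs.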
-- Ionut Biru - https://fleio.com
Hi,

Not sure if my suspicion is true, but I think a new notifier is prepared and used for each update without closing the connection; my understanding of oslo is nonexistent, though.

https://opendev.org/openstack/magnum/src/branch/master/magnum/conductor/utils.py#L147
https://opendev.org/openstack/magnum/src/branch/master/magnum/common/rpc.py#L173

On Tue, Dec 29, 2020 at 11:52 PM Ionut Biru <ionut@fleio.com> wrote:
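If that suspicion is right, the leak pattern would look like the naive version below, where every periodic run builds a fresh connection-owning object; caching the instance at module scope is the usual fix. This is an illustrative sketch, not Magnum's or oslo's actual code: `FakeTransport` is a hypothetical stand-in for whatever object owns the AMQP connection.

```python
import functools


class FakeTransport:
    """Stand-in for a connection-owning object (hypothetical)."""
    instances = 0

    def __init__(self):
        # In the real system, each instance would open a new AMQP connection.
        FakeTransport.instances += 1


def get_notifier_naive():
    # Suspected pattern: a new transport (and connection) per health update.
    return FakeTransport()


@functools.lru_cache(maxsize=None)
def get_notifier_cached():
    # Fix pattern: build the transport once and reuse it for every update.
    return FakeTransport()


for _ in range(5):
    get_notifier_naive()
naive_count = FakeTransport.instances  # one transport per update

FakeTransport.instances = 0
for _ in range(5):
    get_notifier_cached()
cached_count = FakeTransport.instances  # one transport reused across updates

print(naive_count, cached_count)
```

The same reuse-versus-recreate question applies to any per-call client object, which is why confirming whether the transport in magnum.common.rpc is actually cached would settle the suspicion.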
Are you sure you aren't just looking at the connection pool expanding? Each worker has a max number of connections it can use. Maybe look at lowering rpc_conn_pool_size. By default I believe each worker might create a pool of up to 30 connections.

Looking at the code, it could also have something to do with the k8s client, since it creates a new instance each time it does a health check. What version of the k8s client do you have installed?

________________________________
From: Ionut Biru <ionut@fleio.com>
Sent: Tuesday, December 29, 2020 2:20 PM
To: feilong <feilong@catalyst.net.nz>
Cc: openstack-discuss <openstack-discuss@lists.openstack.org>
Subject: Re: [magnum][api] Error system library fopen too many open files with magnum-auto-healer
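The pool-size suggestion above would be a config change along these lines. The section and value are assumptions to verify against your oslo.messaging version; the default pool size is commonly 30, which matches the "up to 30 connections per worker" estimate:

```ini
# magnum.conf (hypothetical excerpt)
[DEFAULT]
# Cap the per-worker oslo.messaging connection pool (default is commonly 30).
rpc_conn_pool_size = 10
```

Lowering this doesn't fix a genuine leak, but if the connection count plateaus at workers x pool size instead of growing without bound, that would point at normal pool expansion rather than leaked notifiers.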
Maybe something similar to this? https://github.com/kubernetes-client/python/issues/1158

What does lsof say?

________________________________
From: Erik Olof Gunnar Andersson <eandersson@blizzard.com>
Sent: Saturday, January 2, 2021 4:54 PM
To: Ionut Biru <ionut@fleio.com>; feilong <feilong@catalyst.net.nz>
Cc: openstack-discuss <openstack-discuss@lists.openstack.org>
Subject: Re: [magnum][api] Error system library fopen too many open files with magnum-auto-healer
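Besides `lsof`, the open-file count of a suspect uwsgi worker can be sampled straight from `/proc` on Linux; a minimal stdlib sketch (falls back to the current process when no PID is given):

```python
import os


def open_fd_count(pid=None):
    """Return the number of open file descriptors for a process, by
    counting the entries under /proc/<pid>/fd (Linux only; "self"
    means the current process)."""
    target = str(pid) if pid is not None else "self"
    return len(os.listdir(f"/proc/{target}/fd"))


# Sampling this periodically for the uwsgi PIDs would show whether the
# count climbs toward the ulimit before "Too many open files" appears.
print(open_fd_count())
```

Comparing the count against `ulimit -n` for the same process makes it obvious how much headroom is left between periodic-task runs.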
Hi Erik,
Here is the lsof output of one uwsgi api: https://paste.xinu.at/5YUWf/
I have kubernetes 12.0.1 installed in the env.
On Sun, Jan 3, 2021 at 3:06 AM Erik Olof Gunnar Andersson <eandersson@blizzard.com> wrote:
Maybe something similar to this? https://github.com/kubernetes-client/python/issues/1158
What does lsof say?
------------------------------ *From:* Erik Olof Gunnar Andersson <eandersson@blizzard.com> *Sent:* Saturday, January 2, 2021 4:54 PM *To:* Ionut Biru <ionut@fleio.com>; feilong <feilong@catalyst.net.nz> *Cc:* openstack-discuss <openstack-discuss@lists.openstack.org> *Subject:* Re: [magnum][api] Error system library fopen too many open files with magnum-auto-healer
Are you sure you aren't just looking at the connection pool expanding? Each worker has a max number of connections it can use. Maybe look at lowering rpc_conn_pool_size. By default I believe each worker might create a pool of up to 30 connections.
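For reference, that pool is tunable through oslo.messaging; a sketch of the relevant knob in magnum.conf (30 is the library default, the lower value here is only an example, not a recommendation):

```ini
[DEFAULT]
# oslo.messaging RPC connection pool, grown lazily per worker (default: 30)
rpc_conn_pool_size = 10
```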
Looking at the code, it could also have something to do with the k8s client, since it creates a new instance each time it does a health check. What version of the k8s client do you have installed?
------------------------------ *From:* Ionut Biru <ionut@fleio.com> *Sent:* Tuesday, December 29, 2020 2:20 PM *To:* feilong <feilong@catalyst.net.nz> *Cc:* openstack-discuss <openstack-discuss@lists.openstack.org> *Subject:* Re: [magnum][api] Error system library fopen too many open files with magnum-auto-healer
Hi,
Not sure if my suspicion is true, but I think for each update a new notifier is prepared and used without closing the connection; my understanding of oslo is nonexistent, though.
https://opendev.org/openstack/magnum/src/branch/master/magnum/conductor/utils.py#L147
https://opendev.org/openstack/magnum/src/branch/master/magnum/common/rpc.py#L173
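If that suspicion is right, the usual remedy is to build the transport/notifier once per process and reuse it, rather than constructing a fresh one per update. A minimal sketch of the caching pattern, using a stand-in class instead of the real oslo.messaging API (everything here is illustrative, not Magnum's actual code):

```python
# Stand-in sketch: cache the notifier so repeated health updates reuse one
# underlying AMQP connection. FakeTransport replaces the real
# oslo.messaging transport purely for illustration.

class FakeTransport:
    """Creating an instance stands in for opening an AMQP connection."""
    instances = 0

    def __init__(self):
        FakeTransport.instances += 1


_NOTIFIER = None


def get_notifier():
    """Return one process-wide notifier instead of building one per call."""
    global _NOTIFIER
    if _NOTIFIER is None:
        _NOTIFIER = FakeTransport()
    return _NOTIFIER


for _ in range(100):            # 100 periodic health updates...
    get_notifier()
print(FakeTransport.instances)  # ...open only one "connection": prints 1
```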
On Tue, Dec 29, 2020 at 11:52 PM Ionut Biru <ionut@fleio.com> wrote:
Hi Feilong,
I found out that each time the update_health_status periodic task is run, a new connection (for each uwsgi) is made to rabbitmq.
root@ctrl1cj-magnum-container-7a7a412a:~# netstat -npt | grep 5672 | wc -l
229
root@ctrl1cj-magnum-container-7a7a412a:~# netstat -npt | grep 5672 | wc -l
234
root@ctrl1cj-magnum-container-7a7a412a:~# netstat -npt | grep 5672 | wc -l
238
root@ctrl1cj-magnum-container-7a7a412a:~# netstat -npt | grep 5672 | wc -l
241
root@ctrl1cj-magnum-container-7a7a412a:~# netstat -npt | grep 5672 | wc -l
244
Not sure if these are related:
Dec 29 21:51:22 ctrl1cj-magnum-container-7a7a412a magnum-conductor[262800]: 2020-12-29 21:51:22.024 262800 DEBUG magnum.service.periodic [req-3b495326-cf80-481e-b3c6-c741f05b7f0e - - - - -]
Dec 29 21:51:22 ctrl1cj-magnum-container-7a7a412a magnum-conductor[262800]: 2020-12-29 21:51:22.024 262800 DEBUG oslo_service.periodic_task [-] Running periodic task MagnumPeriodicTasks.sync
Dec 29 21:51:16 ctrl1cj-magnum-container-7a7a412a magnum-conductor[262804]: 2020-12-29 21:51:16.462 262804 DEBUG magnum.conductor.handlers.cluster_conductor [req-284ac12b-d76a-4e50-8e74-5bfb
Dec 29 21:51:15 ctrl1cj-magnum-container-7a7a412a magnum-conductor[262800]: 2020-12-29 21:51:15.573 262800 DEBUG magnum.service.periodic [-] Status for cluster 118 updated to HEALTHY ({'api'
Dec 29 21:51:15 ctrl1cj-magnum-container-7a7a412a magnum-conductor[262805]: 2020-12-29 21:51:15.572 262805 DEBUG magnum.conductor.handlers.cluster_conductor [req-3fc29ee9-4051-42e7-ae19-3a49
Dec 29 21:51:15 ctrl1cj-magnum-container-7a7a412a magnum-conductor[262800]: 2020-12-29 21:51:15.572 262800 DEBUG magnum.service.periodic [-] Status for cluster 121 updated to HEALTHY ({'api'
Dec 29 21:51:15 ctrl1cj-magnum-container-7a7a412a magnum-conductor[262800]: 2020-12-29 21:51:15.572 262800 DEBUG magnum.service.periodic [-] Status for cluster 122 updated to HEALTHY ({'api'
Dec 29 21:51:15 ctrl1cj-magnum-container-7a7a412a magnum-conductor[262800]: 2020-12-29 21:51:15.553 262800 DEBUG magnum.service.periodic [-] Updating health status for cluster 122 update_hea
Dec 29 21:51:15 ctrl1cj-magnum-container-7a7a412a magnum-conductor[262800]: 2020-12-29 21:51:15.544 262800 DEBUG magnum.service.periodic [-] Updating health status for cluster 121 update_hea
Dec 29 21:51:15 ctrl1cj-magnum-container-7a7a412a magnum-conductor[262800]: 2020-12-29 21:51:15.535 262800 DEBUG magnum.service.periodic [-] Updating health status for cluster 118 update_hea
Dec 29 21:51:15 ctrl1cj-magnum-container-7a7a412a magnum-conductor[262800]: 2020-12-29 21:51:15.494 262800 DEBUG magnum.service.periodic [req-405b1fed-0b8a-4a60-b6ae-834f548b21d1 - - -
2020-12-29 21:51:14.082 [info] <0.953.1293> accepting AMQP connection <0.953.1293> (172.29.93.14:48474 -> 172.29.95.38:5672)
2020-12-29 21:51:14.083 [info] <0.953.1293> Connection <0.953.1293> (172.29.93.14:48474 -> 172.29.95.38:5672) has a client-provided name: uwsgi:262739:f86c0570-8739-4b74-8102-76b5357acd71
2020-12-29 21:51:14.084 [info] <0.953.1293> connection <0.953.1293> (172.29.93.14:48474 -> 172.29.95.38:5672 - uwsgi:262739:f86c0570-8739-4b74-8102-76b5357acd71): user 'magnum' authenticated and granted access to vhost '/magnum'
2020-12-29 21:51:15.560 [info] <0.1656.1283> accepting AMQP connection <0.1656.1283> (172.29.93.14:48548 -> 172.29.95.38:5672)
2020-12-29 21:51:15.561 [info] <0.1656.1283> Connection <0.1656.1283> (172.29.93.14:48548 -> 172.29.95.38:5672) has a client-provided name: uwsgi:262744:2c9792ab-9198-493a-970c-f6ccfd9947d3
2020-12-29 21:51:15.561 [info] <0.1656.1283> connection <0.1656.1283> (172.29.93.14:48548 -> 172.29.95.38:5672 - uwsgi:262744:2c9792ab-9198-493a-970c-f6ccfd9947d3): user 'magnum' authenticated and granted access to vhost '/magnum'
On Tue, Dec 22, 2020 at 4:12 AM feilong <feilong@catalyst.net.nz> wrote:
Hi Ionut,
I didn't see this before on our production. The Magnum auto healer just sends a POST request to the Magnum API to update the health status. So I would suggest writing a small script, or even using curl, to see if you can reproduce this first.
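That reproduction could look something like this sketch: build the same kind of JSON-patch the auto-healer sends and fire it at the cluster in a loop while watching fd counts. The endpoint URL, cluster UUID, and token are placeholders, and the exact fields and PATCH method are assumed from the Magnum v1 cluster API, not taken from this thread:

```python
# Hedged reproduction sketch for the health-status update the auto-healer
# performs. Placeholders throughout; run only against a test deployment.
import json
import urllib.request


def build_health_patch(status, reason):
    """JSON-patch body updating a cluster's health fields (assumed layout)."""
    return json.dumps([
        {"op": "replace", "path": "/health_status", "value": status},
        {"op": "replace", "path": "/health_status_reason", "value": reason},
    ]).encode()


def update_health(endpoint, cluster_uuid, token, body):
    """PATCH one health update at the Magnum API (real network call)."""
    req = urllib.request.Request(
        f"{endpoint}/v1/clusters/{cluster_uuid}", data=body, method="PATCH",
        headers={"X-Auth-Token": token,
                 "Content-Type": "application/json",
                 "OpenStack-API-Version": "container-infra latest"})
    return urllib.request.urlopen(req)


if __name__ == "__main__":
    body = build_health_patch("HEALTHY", {"api": "ok"})
    # Loop this against a real deployment while watching fd/connection counts:
    # update_health("https://magnum.example:9511", "<cluster-uuid>", "<token>", body)
    print(len(json.loads(body)))  # prints 2
```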
-- Cheers & Best regards, Feilong Wang (王飞龙) ------------------------------------------------------ Senior Cloud Software Engineer Tel: +64-48032246 Email: flwang@catalyst.net.nz Catalyst IT Limited Level 6, Catalyst House, 150 Willis Street, Wellington ------------------------------------------------------
-- Ionut Biru - https://fleio.com
Sure looks like RabbitMQ. How many workers do you have configured? Could you try changing the uwsgi configuration to workers=1 (or processes=1) and then see if it goes beyond 30 connections to amqp?

Best Regards,
Erik Olof Gunnar Andersson
Technical Lead, Senior Cloud Engineer
On Tue, Jan 5, 2021 at 9:36 AM Ionut Biru <ionut@fleio.com> wrote:
Hi,
I tried with process=1 and it reached 1016 connections to rabbitmq. lsof: https://paste.xinu.at/jGg/
I think it goes into error when it reaches 1024 file descriptors.
I'm out of ideas on how to resolve this. I only have 3 clusters available and it's kind of weird; it doesn't scale.
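The 1024 ceiling matches the classic default soft limit on open file descriptors, which lines up with the fopen errors above. A quick way to check it from Python; raising it for uwsgi itself would happen in the service definition (e.g. LimitNOFILE= under systemd, an assumption about the deployment):

```python
# Inspect the per-process open-file limits; 1024 is the usual default soft
# limit, and hitting it produces exactly the "Too many open files" errors
# quoted in this thread. For a running worker you can also read
# /proc/<pid>/limits on Linux.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft} hard={hard}")
```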
No issues here with 100s of clusters. Not sure what doesn't scale.

* Maybe your rabbit is flooded with notifications that are not consumed?
* You can use way more than 1024 file descriptors, maybe 2^16?

Spyros
On Mon, Jan 4, 2021 at 9:53 PM Erik Olof Gunnar Andersson < eandersson@blizzard.com> wrote:
Sure looks like RabbitMQ. How many workers do you have configured?
Could you try changing the uwsgi configuration to workers=1 (or processes=1) and then see if it goes beyond 30 connections to amqp?
Hi,
Here is my config, maybe something is fishy. I did have around 300 messages in the queue in notification.info and notification.err and I purged them.
https://paste.xinu.at/woMt/
On Tue, Jan 5, 2021 at 11:23 AM Erik Olof Gunnar Andersson <eandersson@blizzard.com> wrote:
Yea - tested locally as well and wasn't able to reproduce it either. I changed the health service job to run every second and maxed out at about 42 connections to RabbitMQ with two conductor workers.
/etc/magnum/magnum.conf
[conductor]
workers = 2
------------------------------ *From:* Spyros Trigazis <strigazi@gmail.com> *Sent:* Tuesday, January 5, 2021 12:59 AM *To:* Ionut Biru <ionut@fleio.com> *Cc:* Erik Olof Gunnar Andersson <eandersson@blizzard.com>; feilong < feilong@catalyst.net.nz>; openstack-discuss < openstack-discuss@lists.openstack.org> *Subject:* Re: [magnum][api] Error system library fopen too many open files with magnum-auto-healer
On Tue, Jan 5, 2021 at 9:36 AM Ionut Biru <ionut@fleio.com> wrote:
Hi,
I tried with process=1 and it reached 1016 connections to rabbitmq. lsof https://paste.xinu.at/jGg/ <https://urldefense.com/v3/__https://paste.xinu.at/jGg/__;!!Ci6f514n9QsL8ck!w-sy8zu-TkPMcmlD3ZhyxEiBTRWikibrBZOfumXkqKodtdcI4FD236uNMmjynMvIcA$>
i think it goes into error when it reaches 1024 file descriptors.
I'm out of ideas of how to resolve this. I only have 3 clusters available and it's kinda weird and It doesn't scale.
No issues here with 100s of clusters. Not sure what doesn't scale.
* Maybe your rabbit is flooded with notifications that are not consumed? * You can use way more than 1024 file descriptors, maybe 2^10?
Spyros
On Mon, Jan 4, 2021 at 9:53 PM Erik Olof Gunnar Andersson < eandersson@blizzard.com> wrote:
Sure looks like RabbitMQ. How many workers do have you configured?
Could you try to changing the uwsgi configuration to workers=1 (or processes=1) and then see if it goes beyond 30 connections to amqp.
*From:* Ionut Biru <ionut@fleio.com> *Sent:* Monday, January 4, 2021 4:07 AM *To:* Erik Olof Gunnar Andersson <eandersson@blizzard.com> *Cc:* feilong <feilong@catalyst.net.nz>; openstack-discuss < openstack-discuss@lists.openstack.org> *Subject:* Re: [magnum][api] Error system library fopen too many open files with magnum-auto-healer
Hi Erik,
Here is lsof of one uwsgi api: https://paste.xinu.at/5YUWf/
I have kubernetes 12.0.1 installed in the env.
On Sun, Jan 3, 2021 at 3:06 AM Erik Olof Gunnar Andersson <eandersson@blizzard.com> wrote:
Maybe something similar to this? https://github.com/kubernetes-client/python/issues/1158
What does lsof say?
-- Ionut Biru - https://fleio.com
-- Ionut Biru - https://fleio.com
Sorry, being repetitive here, but maybe try adding this to your magnum config as well. If you have a lot of cores it could add up to a crazy amount of connections.
[conductor]
workers = 2
Hi, I found this story regarding disabling cluster update notifications in rabbitmq: https://storyboard.openstack.org/#!/story/2008308 I think this will help me.
-- Ionut Biru - https://fleio.com
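For the notification build-up mentioned above, one way to stop unconsumed notifications from piling up in notification.info and notification.err is to route them to oslo.messaging's no-op driver. A magnum.conf sketch (verify the option name against your release's oslo.messaging docs):

```ini
[oslo_messaging_notifications]
driver = noop
```

This drops notifications entirely, so only do it if nothing (e.g. ceilometer) consumes them.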
I pushed a couple of patches that you can try out.
This is the most likely culprit: https://review.opendev.org/c/openstack/magnum/+/769471 - Re-use rpc client
I also created this one, but I doubt this is the issue, as the implementation here is the same as I use in Designate: https://review.opendev.org/c/openstack/magnum/+/769457 - [WIP] Singleton notifier
Finally, I also created a PR to add magnum-api testing using uwsgi: https://review.opendev.org/c/openstack/magnum/+/769450
Let me know if any of these patches help!
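The idea behind the first patch can be sketched as caching one RPC client at module scope, instead of constructing a new client (and dialing a new AMQP connection) per API request. This is an illustration of the pattern only; the class is a stand-in, not magnum's or oslo.messaging's actual code:

```python
# Illustrative singleton pattern for an RPC client: build it once, hand the
# same instance to every request. FakeRPCClient is a stand-in that just
# counts how many clients (i.e. broker connections) were ever created.
import threading

class FakeRPCClient:
    instances = 0

    def __init__(self):
        FakeRPCClient.instances += 1  # pretend we opened a connection

_client = None
_lock = threading.Lock()

def get_rpc_client() -> FakeRPCClient:
    """Return a process-wide shared client, creating it on first use."""
    global _client
    if _client is None:
        with _lock:
            if _client is None:  # re-check after acquiring the lock
                _client = FakeRPCClient()
    return _client
```

Every request handler then calls `get_rpc_client()` instead of constructing its own, so the connection count stays flat no matter how many requests arrive.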
Hi Erik, thanks a lot for the patch. Indeed, 769471 fixes my problem at first glance. I'll let it run for a couple of days.
-- Ionut Biru - https://fleio.com
Glad it helped. Going to work with the magnum team to get it merged.
Would it be possible for you to document the issue and create a bug here: https://storyboard.openstack.org/#!/project/openstack/magnum
Hi Erik, here is the story: https://storyboard.openstack.org/#!/story/2008494
Glad it helped . Going to work with the magnum team to get it merged.
Would it be possible for you to document the issue and create a bug here https://storyboard.openstack.org/#!/project/openstack/magnum
------------------------------ *From:* Ionut Biru <ionut@fleio.com> *Sent:* Wednesday, January 6, 2021 3:37 AM *To:* Erik Olof Gunnar Andersson <eandersson@blizzard.com> *Cc:* Spyros Trigazis <strigazi@gmail.com>; feilong < feilong@catalyst.net.nz>; openstack-discuss < openstack-discuss@lists.openstack.org> *Subject:* Re: [magnum][api] Error system library fopen too many open files with magnum-auto-healer
Hi Erik,
Thanks a lot for the patch. Indeed 769471 fixes my problem at first glance.
I'll let it run for a couple of days.
On Wed, Jan 6, 2021 at 12:23 PM Erik Olof Gunnar Andersson < eandersson@blizzard.com> wrote:
I pushed a couple of patches that you can try out.
This is the most likely culprit. https://review.opendev.org/c/openstack/magnum/+/769471 <https://urldefense.com/v3/__https://review.opendev.org/c/openstack/magnum/*/769471__;Kw!!Ci6f514n9QsL8ck!yNYepzwGOz5tzeQ62h1r5z7iHBcYFnMmO9kzEmWdJqo-BK9PgMQWoB-IT5ji6cXKKQ$> - Re-use rpc client
I also created this one, but doubt this is an issue as the implementation here is the same as I use in Designate https://review.opendev.org/c/openstack/magnum/+/769457 <https://urldefense.com/v3/__https://review.opendev.org/c/openstack/magnum/*/769457__;Kw!!Ci6f514n9QsL8ck!yNYepzwGOz5tzeQ62h1r5z7iHBcYFnMmO9kzEmWdJqo-BK9PgMQWoB-IT5i2M52Ovw$> - [WIP] Singleton notifier
Finally I also created a PR to add magnum-api testing using uwsgi. https://review.opendev.org/c/openstack/magnum/+/769450 <https://urldefense.com/v3/__https://review.opendev.org/c/openstack/magnum/*/769450__;Kw!!Ci6f514n9QsL8ck!yNYepzwGOz5tzeQ62h1r5z7iHBcYFnMmO9kzEmWdJqo-BK9PgMQWoB-IT5hi_0tIMw$>
Let me know if any of these patches help!
------------------------------ *From:* Ionut Biru <ionut@fleio.com> *Sent:* Tuesday, January 5, 2021 8:36 AM *To:* Erik Olof Gunnar Andersson <eandersson@blizzard.com> *Cc:* Spyros Trigazis <strigazi@gmail.com>; feilong < feilong@catalyst.net.nz>; openstack-discuss < openstack-discuss@lists.openstack.org> *Subject:* Re: [magnum][api] Error system library fopen too many open files with magnum-auto-healer
Hi,
I found this story: https://storyboard.openstack.org/#!/story/2008308 regarding disabling cluster update notifications in rabbitmq.
I think this will help me.
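[Editor's note] With oslo.messaging, a service can stop emitting notifications altogether by switching the notification driver to `noop`. Whether that is the exact mechanism the linked story settled on is not confirmed here, but it is the usual knob when notification queues pile up unconsumed:

```ini
# magnum.conf -- drop all notifications instead of queueing them
# (only appropriate if nothing, e.g. ceilometer, consumes them)
[oslo_messaging_notifications]
driver = noop
```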
On Tue, Jan 5, 2021 at 12:21 PM Erik Olof Gunnar Andersson < eandersson@blizzard.com> wrote:
Sorry for being repetitive here, but maybe try adding this to your magnum config as well. If you have a LOT of cores, it could add up to a crazy number of connections.
[conductor]
workers = 2
------------------------------ *From:* Ionut Biru <ionut@fleio.com> *Sent:* Tuesday, January 5, 2021 1:50 AM *To:* Erik Olof Gunnar Andersson <eandersson@blizzard.com> *Cc:* Spyros Trigazis <strigazi@gmail.com>; feilong < feilong@catalyst.net.nz>; openstack-discuss < openstack-discuss@lists.openstack.org> *Subject:* Re: [magnum][api] Error system library fopen too many open files with magnum-auto-healer
Hi,
Here is my config. maybe something is fishy.
I did have around 300 messages in the queue in notification.info and notification.err and I purged them.
https://paste.xinu.at/woMt/
On Tue, Jan 5, 2021 at 11:23 AM Erik Olof Gunnar Andersson < eandersson@blizzard.com> wrote:
Yea - tested locally as well and wasn't able to reproduce it either. I changed the health service job to run every second and maxed out at about 42 connections to RabbitMQ with two conductor workers.
/etc/magnum/magnum.conf
[conductor]
workers = 2
------------------------------ *From:* Spyros Trigazis <strigazi@gmail.com> *Sent:* Tuesday, January 5, 2021 12:59 AM *To:* Ionut Biru <ionut@fleio.com> *Cc:* Erik Olof Gunnar Andersson <eandersson@blizzard.com>; feilong < feilong@catalyst.net.nz>; openstack-discuss < openstack-discuss@lists.openstack.org> *Subject:* Re: [magnum][api] Error system library fopen too many open files with magnum-auto-healer
On Tue, Jan 5, 2021 at 9:36 AM Ionut Biru <ionut@fleio.com> wrote:
Hi,
I tried with processes=1 and it reached 1016 connections to rabbitmq. lsof: https://paste.xinu.at/jGg/
I think it goes into error when it reaches 1024 file descriptors.
I'm out of ideas on how to resolve this. I only have 3 clusters available, and it's kind of weird that it doesn't scale.
No issues here with 100s of clusters. Not sure what doesn't scale.
* Maybe your rabbit is flooded with notifications that are not consumed?
* You can use way more than 1024 file descriptors, maybe 2^10?
Spyros
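[Editor's note] On the file-descriptor point: the per-process soft limit (often 1024 by default) can be inspected, and raised up to the hard limit, from Python. In production the effective limit also depends on things like systemd's `LimitNOFILE` or the uwsgi settings, so treat this as a diagnostic sketch only:

```python
# Inspect and (within the hard limit) raise the fd limit of the current
# process -- Linux/Unix only; a diagnostic sketch, not Magnum code.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft} hard={hard}")  # soft is often 1024 on many distros

# Raising the soft limit up to the hard limit needs no privileges;
# 65536 below is an arbitrary illustrative value for unlimited hards.
target = hard if hard != resource.RLIM_INFINITY else 65536
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
```

Raising the limit only buys headroom, of course; if connections are being leaked per request, any limit is eventually exhausted.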
On Mon, Jan 4, 2021 at 9:53 PM Erik Olof Gunnar Andersson < eandersson@blizzard.com> wrote:
Sure looks like RabbitMQ. How many workers do you have configured?
Could you try changing the uwsgi configuration to workers=1 (or processes=1) and then see if it goes beyond 30 connections to amqp?
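[Editor's note] The "watch the connection count" check can be scripted instead of eyeballing lsof output. A Linux-only sketch that counts a process's open descriptors and sockets via /proc (the pid you pass in would be the uwsgi or conductor worker you are watching):

```python
# Count open fds and sockets for a pid via /proc (Linux only) --
# a scriptable stand-in for the manual lsof checks in this thread.
import os


def count_fds(pid: int) -> int:
    """Total open file descriptors for the process."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def count_sockets(pid: int) -> int:
    """Open descriptors that are sockets (AMQP connections included)."""
    n = 0
    for fd in os.listdir(f"/proc/{pid}/fd"):
        try:
            if os.readlink(f"/proc/{pid}/fd/{fd}").startswith("socket:"):
                n += 1
        except OSError:  # fd went away while we were iterating
            pass
    return n


if __name__ == "__main__":
    pid = os.getpid()  # substitute the magnum-api / uwsgi pid here
    print(count_fds(pid), count_sockets(pid))
```

Polling this in a loop while the auto-healer posts health updates would show whether the count climbs monotonically (a leak) or plateaus (normal pooling).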
*From:* Ionut Biru <ionut@fleio.com> *Sent:* Monday, January 4, 2021 4:07 AM *To:* Erik Olof Gunnar Andersson <eandersson@blizzard.com> *Cc:* feilong <feilong@catalyst.net.nz>; openstack-discuss < openstack-discuss@lists.openstack.org> *Subject:* Re: [magnum][api] Error system library fopen too many open files with magnum-auto-healer
Hi Erik,
Here is lsof of one uwsgi api: https://paste.xinu.at/5YUWf/
I have kubernetes 12.0.1 installed in env.
On Sun, Jan 3, 2021 at 3:06 AM Erik Olof Gunnar Andersson < eandersson@blizzard.com> wrote:
Maybe something similar to this? https://github.com/kubernetes-client/python/issues/1158
What does lsof say?
-- Ionut Biru - https://fleio.com
Thanks, I added it to the commit.
Could you share your uwsgi config as well?
Best Regards, Erik Olof Gunnar Andersson
Technical Lead, Senior Cloud Engineer
Hi Erik,
Here it is: https://paste.xinu.at/LgH8dT/
-- Ionut Biru - https://fleio.com
Thanks Ionut.
If you are able, could you test this patch instead? I think I better understand what the issue was now. We were not only creating a new RPC client for each HTTP request, but also a brand-new transport for each request. https://review.opendev.org/c/openstack/magnum/+/770707
Hi Erik,
Seems that this one works better than the previous one. I have 19 connections with this patch vs 38. I'll keep it running for the following days.
-- Ionut Biru - https://fleio.com
participants (4)
- Erik Olof Gunnar Andersson
- feilong
- Ionut Biru
- Spyros Trigazis