Re: [Swift] [kolla] Swift issues in one cluster
We're running 2.23.3, but it appears that we are experiencing the bug (1). We tracked the problem down to a client who recently started using a Java library to read large files from Swift. When he moves his activity to the other QA cluster, the problem follows. Am I guessing correctly that the bug was never fixed, and that (2) fixes a different problem?

On Thursday, June 23, 2022, 10:43:44 AM EDT, Clay Gerrard <clay.gerrard@gmail.com> wrote:

https://review.opendev.org/c/openstack/swift/+/575254 has been included in every released swift tag since 2.21.0. I believe Train included a swift version of at least 2.22: https://releases.openstack.org/train/#swift

Nvidia doesn't use haproxy in front of our swift proxies, and we don't see BlockingIOError in tracebacks. The traceback might go away if you upgrade to the latest swift (2.29) and/or eventlet and/or Python 3.8-ish.

On Thu, Jun 23, 2022 at 8:49 AM Albert Braden <ozzzo@yahoo.com> wrote:

Can anyone help with this Swift issue? It looks like we are being hit with this bug (1), but the bug report doesn't say where or whether it was ever fixed. This (2) appears to be a fix, and it appears to have been merged, but it doesn't mention the bug, and it's not obvious to me which versions it affects. Is anyone else encountering this problem? It appears that customers in this one cluster may be doing something to cause it; we're still trying to track down specifically what they are doing that they aren't doing in the other clusters. We are running kolla-ansible Train on RHEL7.

1. https://bugs.launchpad.net/swift/+bug/1572719
2. https://opendev.org/openstack/swift/commit/0e81ffd1e1481a73146fce17f61f2ab9e...

On Wednesday, June 22, 2022, 05:10:10 PM EDT, Albert Braden <ozzzo@yahoo.com> wrote:

Is this bug fixed in Train? https://bugs.launchpad.net/swift/+bug/1572719

On Tuesday, June 21, 2022, 09:26:47 AM EDT, Albert Braden <ozzzo@yahoo.com> wrote:

All of our endpoints are https:

+------+----------+--------------+--------------+---------+-----------+------------------------------------------------------------+
| ID   | Region   | Service Name | Service Type | Enabled | Interface | URL                                                        |
+------+----------+--------------+--------------+---------+-----------+------------------------------------------------------------+
| <ID> | <region> | keystone     | identity     | True    | internal  | https://api-int.<region>.<domain>:5000                     |
| <ID> | <region> | swift        | object-store | True    | public    | https://swift.<region>.<domain>:8080/v1/AUTH_%(tenant_id)s |
| <ID> | <region> | swift        | object-store | True    | internal  | https://swift.<region>.<domain>:8080/v1/AUTH_%(tenant_id)s |
| <ID> | <region> | keystone     | identity     | True    | admin     | https://api-int.<region>.<domain>:35357                    |
| <ID> | <region> | keystone     | identity     | True    | public    | https://api-ext.<region>.<domain>:5000                     |
| <ID> | <region> | swift        | object-store | True    | admin     | https://swift.<region>.<domain>:8080/v1                    |
+------+----------+--------------+--------------+---------+-----------+------------------------------------------------------------+

I don't think this is causing the issue; all of our clusters are set up the same. We did think it was load at first, and got the 2 heaviest users to stop what they were doing, but that didn't make a difference. Our other QA cluster has similar load and identical hardware. When I look at the network graphs, I see traffic spiking up to 1G, but these are 25G interfaces, and none of the resources on the boxes are exhausted. CPU is 97% idle, memory is 30% used, and the disks are not full. It doesn't look like the problem is load-related. We see the haproxy connections stacking up even when load is low. What else could be causing this?

On Friday, June 17, 2022, 11:12:36 PM EDT, Pete Zaitcev <zaitcev@redhat.com> wrote:

On Fri, 17 Jun 2022 17:33:27 +0000 (UTC) Albert Braden <ozzzo@yahoo.com> wrote:
> $ openstack container list
> Unable to establish connection to https://swift.<region>.<domain>:8080/v1/AUTH_<project>: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
Right away I have a question: why in the world are you connecting to 8080 with HTTPS?
> (from Splunk):
> Payload: swift-proxy-server: STDERR: File "/usr/lib64/python3.6/socket.py", line 604, in write#012 return self._sock.send(b)
> Payload: swift-proxy-server: STDERR: BlockingIOError
> Payload: swift-proxy-server: STDERR: os.read(self.rfd, 1)
> Payload: swift-proxy-server: STDERR: File "/usr/lib/python3.6/site-packages/eventlet/wsgi.py", line 818, in process_request#012 proto.__init__(conn_state, self)
This looks quite fishy to me, because the os.read is in swift/common/utils.py and it's responsible for the mutex.
> When we look at network connections, we see haproxy stacking up (many lines like this):
>
> # netstat -ntup | sort -b -k2 -n -r | head -n +100
> tcp  5976932  0  127.0.0.1:60738  127.0.0.1:8080  ESTABLISHED  13045/haproxy
> tcp  5976446  0  127.0.0.1:58480  127.0.0.1:8080  ESTABLISHED  13045/haproxy
> tcp  5973217  0  127.0.0.1:33244  127.0.0.1:8080  ESTABLISHED  13045/haproxy
> tcp  5973120  0  127.0.0.1:51836  127.0.0.1:8080  ESTABLISHED  13045/haproxy
> tcp  5971968  0  127.0.0.1:58516  127.0.0.1:8080  ESTABLISHED  13045/haproxy
> ...
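(To quantify the backlog shown in the netstat output above, something like the sketch below can be run on a proxy node. It is not from the thread: it assumes a Linux host with the Swift proxy listening on 127.0.0.1:8080 behind haproxy as shown, and it reads /proc/net/tcp, whose rx_queue field corresponds to netstat's Recv-Q column.)

    # Count ESTABLISHED connections to the local Swift proxy (127.0.0.1:8080)
    # and total the bytes stuck in their receive queues.
    PROXY_REMOTE = "0100007F:1F90"   # 127.0.0.1:8080 in /proc/net/tcp hex form
    ESTABLISHED = "01"

    def stuck_connections(path="/proc/net/tcp"):
        stuck = []
        with open(path) as fh:
            next(fh)                 # skip the header line
            for line in fh:
                fields = line.split()
                local, remote, state = fields[1], fields[2], fields[3]
                _tx_q, rx_q = (int(x, 16) for x in fields[4].split(":"))
                if remote == PROXY_REMOTE and state == ESTABLISHED:
                    stuck.append((local, rx_q))
        return stuck

    if __name__ == "__main__":
        conns = stuck_connections()
        print("%d ESTABLISHED connections to 127.0.0.1:8080, %d bytes queued"
              % (len(conns), sum(rx for _, rx in conns)))

Running it a few times makes it obvious whether the queued byte count is still growing between samples, which netstat alone doesn't show as directly.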
> If we restart the swift_haproxy and swift_proxy_server containers then the problem goes away, but comes back within a few minutes. Where should we be looking for the root cause of this issue?
Indeed, if so many requests are established, you're in trouble. The best fix, I think, is to find the customer who's doing it and punish them. Otherwise, quotas and the ratelimiting middleware are your friends. There's also a possibility that your cluster is underperforming, although usually that results in 500s first. But then again, at times users will "compensate" for issues by just launching way more requests, in effect DoS-ing the cluster even worse.

-- Pete

--
Clay Gerrard
210 788 9431
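(For anyone following along, the ratelimiting middleware Pete mentions is Swift's built-in ratelimit filter in the proxy pipeline. The fragment below is only a sketch: the values are placeholders and the option names should be checked against the middleware documentation for the Swift release in use. Note that, as far as I know, this middleware throttles writes - container and object PUT/POST/DELETE - so it reins in abusive clients rather than slow GET readers.)

    # Sketch only -- merge into the existing proxy-server.conf.
    # "ratelimit" must be added to [pipeline:main] after the cache (memcache)
    # middleware, because it keeps its counters in memcached.

    [filter:ratelimit]
    use = egg:swift#ratelimit
    # container PUT/DELETE requests per second allowed per account
    account_ratelimit = 20
    # upper bound on how long any single request may be delayed
    max_sleep_time_seconds = 60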
Reading in [1] I see this: "Having another look at that issue, it sounds like slow client shouldn't be handled by OpenStack services but rather with a load balancer, especially if the service is Internet facing"

I don't understand what is being recommended here. We have 60 Swift servers, and customer traffic goes directly to those servers. It seems like a load balancer would be a performance-reducing bottleneck. Our clusters are not internet-facing, and I haven't seen internet-facing Swift clusters at any of my employers. What is the solution for this problem? We have worked around it by getting our customer to stop using his Java app to download large files for now, but we would like to find a long-term solution. Is there any hope of getting this bug fixed?
On Fri, Jun 24, 2022 at 10:29 AM Albert Braden <ozzzo@yahoo.com> wrote:
"Having another look at that issue, it sounds like slow client shouldn't be handled by OpenStack services but rather with a load balancer, especially if the service is Internet facing"
> I don't understand what is being recommended here.
I think they were suggesting that an HTTP proxy application - maybe haproxy - will have more options for protecting network resources from misbehaving clients than the swift proxy application does: things like kicking off keep-alive connections after a while, or slow clients that tie up resources.
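(As a concrete illustration of the timeouts just described, here is a hedged haproxy sketch. The directive names are standard haproxy options, but the values are placeholders and none of this is taken from the kolla-generated configuration.)

    defaults
        mode http
        # give up on clients that are too slow to send a complete request
        timeout http-request    10s
        # close idle keep-alive connections instead of holding them open
        timeout http-keep-alive 10s
        # client-side inactivity timeout, which also catches slow readers
        timeout client          60s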
> We have 60 Swift servers, and customer traffic goes directly to those servers. It seems like a load-balancer would be a performance-reducing bottleneck.
That's cool, do you use round robin dns or something?
> Is there any hope of getting this bug fixed?
If we can reproduce the problem you're seeing, there's some chance we could offer a solution through just a code change, but it's going to be difficult if the repro requires haproxy in the pipeline. If there is a problem w/o haproxy, it might have more to do with eventlet.wsgi or Python's base HTTP server than with swift... can you confirm the issue when clients talk directly to the python/eventlet/swift application?

--
Clay Gerrard
AFAIK we are running vanilla Swift. Clients usually connect to the Swift endpoint by running "swift" or "openstack container" commands, and Swift uses exabgp to route the traffic. I'm not exactly sure what the Java library is doing, but I can dig more into that if it would help. I see swift_proxy_server and swift_haproxy containers running on every Swift node. Is that not the normal configuration?

According to [1] it's pretty easy to reproduce: "You can reproduce this by issuing a GET request for a few hundred MB file and never consuming the response, but keep the client socket open. Swift will log a 499 but the socket does not always close."
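(For reference, the reproduction quoted from the bug can be approximated with a short client along the lines of the sketch below. This is only an illustration of that description, not a script from the thread; the URL, token, container, and object name are placeholders.)

    import time
    import requests  # any HTTP client that lets the body go unread would do

    # Placeholders: substitute a real endpoint, token, and a few-hundred-MB object.
    url = ("https://swift.<region>.<domain>:8080/v1/AUTH_<project>"
           "/<container>/<large-object>")
    headers = {"X-Auth-Token": "<token>"}

    # stream=True keeps requests from consuming the response body for us.
    resp = requests.get(url, headers=headers, stream=True)
    print("status:", resp.status_code)

    # Read nothing and keep the TCP connection open. Per the bug description,
    # Swift eventually logs a 499 for this request while the socket lingers.
    time.sleep(3600)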
"Having another look at that issue, it sounds like slow client shouldn't be handled by OpenStack services but rather with a load balancer, especially if the service is Internet facing"
I don't understand what is being recommended here.
I think they were suggesting using a http proxy application - maybe haproxy - will have more options to protect network resources from misbehaving clients than the swift proxy application. Like kicking off keep-alive connections after a while, or slow clients that hang up resources.
We have 60 Swift servers, and customer traffic goes directly to those servers. It seems like a load-balancer would be a performance-reducing bottleneck.
That's cool, do you use round robin dns or something?
Is there any hope of getting this bug fixed?
If we can reproduce the problem you're seeing there's some chance we could offer a solution through just a code change, but it's going to be difficult if repro requires haproxy in the pipeline. If there is a problem w/o haproxy, it might have more to do with eventlet.wsgi or python's base http server than swift... can you affirm the issue when clients talk directly to the python/eventlet/swift application? -- Clay Gerrard
On Fri, Jun 24, 2022 at 11:46 AM Albert Braden <ozzzo@yahoo.com> wrote:
"You can reproduce this by issuing a GET request for a few hundred MB file and never consuming the response, but keep the client socket open. Swift will log a 499 but the socket does not always close."
Was that the behavior you were seeing? Swift logs a 499, but the socket stays open? Eventlet 0.22.0 says eventlet.wsgi will time out idle clients now: https://eventlet.net/doc/changelog.html#id23 ... so maybe that bug as written is no longer valid - but it could still be related to what you're seeing. Wasn't the original traceback a BlockingIOError? That seems different from a hung client socket from an idle client.
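(One practical follow-up, not from the thread: since the advice above depends on which eventlet and Swift versions are actually installed inside the proxy container, a quick check with the container's Python interpreter is sketched below. The __version__ attributes are the usual convention, but worth verifying against the installed packages.)

    import eventlet
    import swift

    # These attributes are the usual convention; fall back to `pip show`
    # inside the container if either is missing.
    print("eventlet", eventlet.__version__)
    print("swift", swift.__version__)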
We were seeing the ChunkWriteTimeout all along, but I wasn't focusing on it at first. We're working on getting the customer set up in our lab environment so that he can duplicate the issue there.
"You can reproduce this by issuing a GET request for a few hundred MB file and never consuming the response, but keep the client socket open. Swift will log a 499 but the socket does not always close."
Was that the behavior that you were seeing? Swift logs 499, but the socket stays open? Eventlet 0.22.0 says eventlet.wsgi will timeout idle clients now: https://eventlet.net/doc/changelog.html#id23 ... so maybe that bug as written is no longer valid - but it could still be related to what you're seeing. Wasn't the original traceback a BlockingIOError - that seems different from a hung client socket from an idle client.