<div>                We're running 2.23.3 but it appears that we are experiencing the bug (1). We tracked the problem down to a client who recently started using a Java library to read large files from Swift. When he moves his activity to the other QA cluster, the problem follows. Am I guessing correctly that the bug was never fixed, and that (2) fixes a different problem?<br>            </div>            <div class="yahoo_quoted" style="margin:10px 0px 0px 0.8ex;border-left:1px solid #ccc;padding-left:1ex;">                        <div style="font-family:'Helvetica Neue', Helvetica, Arial, sans-serif;font-size:13px;color:#26282a;">                                <div>                    On Thursday, June 23, 2022, 10:43:44 AM EDT, Clay Gerrard <clay.gerrard@gmail.com> wrote:                </div>                <div><br></div>                <div><br></div>                <div><div id="yiv9982257154"><div><div dir="ltr"><div dir="ltr"><a rel="nofollow noopener noreferrer" shape="rect" target="_blank" href="https://review.opendev.org/c/openstack/swift/+/575254">https://review.opendev.org/c/openstack/swift/+/575254</a> has been included in every released swift tag since 2.21.0<br clear="none"></div><div dir="ltr"><br clear="none"></div><div>I believe Train included a swift version of at least 2.22 <a rel="nofollow noopener noreferrer" shape="rect" target="_blank" href="https://releases.openstack.org/train/#swift">https://releases.openstack.org/train/#swift</a></div><div><br clear="none"></div><div>Nvidia doesn't use haproxy in front of our swift proxies, and we don't see BlockingIOError in tracebacks - the traceback might go away if you upgrade to the latest swift (2.29) and/or eventlet and/or python 3.8ish</div><div><br clear="none"></div><div id="yiv9982257154yqt52608" class="yiv9982257154yqt8121672626"><div class="yiv9982257154gmail_quote"><div dir="ltr" class="yiv9982257154gmail_attr">On Thu, Jun 23, 2022 at 8:49 AM Albert Braden <<a rel="nofollow noopener 
noreferrer" shape="rect" ymailto="mailto:ozzzo@yahoo.com" target="_blank" href="mailto:ozzzo@yahoo.com">ozzzo@yahoo.com</a>> wrote:<br clear="none"></div><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex;" class="yiv9982257154gmail_quote"><div>                Can anyone help with this Swift issue? It looks like we are being hit with this bug (1) but this bug doesn't seem to mention where/whether it was ever fixed. This (2) appears to be a fix, and it appears to have been merged, but it doesn't mention the bug, and it's not obvious to me what version it affects. Is anyone else encountering this problem? It appears that customers in this one cluster may be doing something to cause it; we're still trying to track down specifically what they are doing, that they aren't doing in the other clusters.<br clear="none"><br clear="none">We are running kolla-ansible Train on RHEL7.<br clear="none"><br clear="none">1. <a rel="nofollow noopener noreferrer" shape="rect" target="_blank" href="https://bugs.launchpad.net/swift/+bug/1572719">https://bugs.launchpad.net/swift/+bug/1572719</a><br clear="none">2. 
<a rel="nofollow noopener noreferrer" shape="rect" target="_blank" href="https://opendev.org/openstack/swift/commit/0e81ffd1e1481a73146fce17f61f2ab9e01eb684">https://opendev.org/openstack/swift/commit/0e81ffd1e1481a73146fce17f61f2ab9e01eb684</a><br clear="none">            </div>            <div style="margin:10px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex;">                        <div style="font-family:Helvetica, Arial, sans-serif;font-size:13px;color:rgb(38,40,42);">                                <div>                    On Wednesday, June 22, 2022, 05:10:10 PM EDT, Albert Braden <<a rel="nofollow noopener noreferrer" shape="rect" ymailto="mailto:ozzzo@yahoo.com" target="_blank" href="mailto:ozzzo@yahoo.com">ozzzo@yahoo.com</a>> wrote:                </div>                <div><br clear="none"></div>                <div><br clear="none"></div>                <div><div id="yiv9982257154gmail-m_-4181266032143233798yiv4414005737"><div><div>                Is this bug fixed in Train?<br clear="none"><br clear="none"><a rel="nofollow noopener noreferrer" shape="rect" target="_blank" href="https://bugs.launchpad.net/swift/+bug/1572719">https://bugs.launchpad.net/swift/+bug/1572719</a><br clear="none">            </div>            <div id="yiv9982257154gmail-m_-4181266032143233798yiv4414005737yqt52880"><div style="margin:10px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex;">                        <div style="font-family:Helvetica, Arial, sans-serif;font-size:13px;color:rgb(38,40,42);">                                <div>                    On Tuesday, June 21, 2022, 09:26:47 AM EDT, Albert Braden <<a rel="nofollow noopener noreferrer" shape="rect" ymailto="mailto:ozzzo@yahoo.com" target="_blank" href="mailto:ozzzo@yahoo.com">ozzzo@yahoo.com</a>> wrote:                </div>                <div><br clear="none"></div>                <div><br clear="none"></div>                <div><div 
id="yiv9982257154gmail-m_-4181266032143233798yiv4414005737"><div><div>                All of our endpoints are https:<br clear="none"><br clear="none">+----------------------------------+--------+--------------+--------------+---------+-----------+---------------------------------------------------------+<br clear="none">| ID                               | Region | Service Name | Service Type | Enabled | Interface | URL                                                     |<br clear="none">+----------------------------------+--------+--------------+--------------+---------+-----------+---------------------------------------------------------+<br clear="none">| <ID> | <region>  | keystone     | identity     | True    | internal  | <a rel="nofollow noopener noreferrer" shape="rect" target="_blank" href="https://api-int">https://api-int</a>.<region>.<domain>:5000                     |<br clear="none">| <ID> | <region>| swift        | object-store | True    | public    | <a rel="nofollow noopener noreferrer" shape="rect" target="_blank" href="https://swift">https://swift</a>. <region>.<domain>:8080/v1/AUTH_%(tenant_id)s |<br clear="none">| <ID> | <region>| swift        | object-store | True    | internal  | <a rel="nofollow noopener noreferrer" shape="rect" target="_blank" href="https://swift">https://swift</a>.<region>.<domain>:8080/v1/AUTH_%(tenant_id)s |<br clear="none">| <ID> | <region>| keystone     | identity     | True    | admin     | <a rel="nofollow noopener noreferrer" shape="rect" target="_blank" href="https://api-int">https://api-int</a>. <region>.<domain>:35357                    |<br clear="none">| <ID> | <region>| keystone     | identity     | True    | public    | <a rel="nofollow noopener noreferrer" shape="rect" target="_blank" href="https://api-ext">https://api-ext</a>. 
<region>.<domain>:5000                     |<br clear="none">| <ID> | <region>| swift        | object-store | True    | admin     | <a rel="nofollow noopener noreferrer" shape="rect" target="_blank" href="https://swift">https://swift</a>. <region>.<domain>:8080/v1                    |<br clear="none">+----------------------------------+--------+--------------+--------------+---------+-----------+---------------------------------------------------------+<br clear="none"><br clear="none">I don't think this is causing the issue; all of our clusters are set up the same. We did think it was load at first, and got the 2 heaviest users to stop what they were doing, but that didn't make a difference. Our other QA cluster has similar load and identical hardware. When I look at the network graphs, I see traffic spiking up to 1G, but these are 25G interfaces, and none of the resources on the boxes are exhausted. CPU is 97% idle; memory is 30% used, disk is not full. It doesn't look like the problem is load-related. We see the haproxy connections stacking up even when load is low. 
What else could be causing this?<br clear="none">            </div>            <div id="yiv9982257154gmail-m_-4181266032143233798yiv4414005737yqt33862"><div style="margin:10px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex;">                        <div style="font-family:Helvetica, Arial, sans-serif;font-size:13px;color:rgb(38,40,42);">                                <div>                    On Friday, June 17, 2022, 11:12:36 PM EDT, Pete Zaitcev <<a rel="nofollow noopener noreferrer" shape="rect" ymailto="mailto:zaitcev@redhat.com" target="_blank" href="mailto:zaitcev@redhat.com">zaitcev@redhat.com</a>> wrote:                </div>                <div><br clear="none"></div>                <div><br clear="none"></div>                <div><div dir="ltr">On Fri, 17 Jun 2022 17:33:27 +0000 (UTC)<br clear="none">Albert Braden <<a rel="nofollow noopener noreferrer" shape="rect" ymailto="mailto:ozzzo@yahoo.com" target="_blank" href="mailto:ozzzo@yahoo.com">ozzzo@yahoo.com</a>> wrote:<br clear="none"><br clear="none">> $ openstack container list<br clear="none">> Unable to establish connection to <a rel="nofollow noopener noreferrer" shape="rect" target="_blank" href="https://swift">https://swift</a>.<region>.<domain>:8080/v1/AUTH_<project>: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')<br clear="none"><br clear="none">Right away I have a question: why in the world are you connecting<br clear="none">to 8080 with HTTPS?<br clear="none"><br clear="none">> (from Splunk):<br clear="none">> Payload: swift-proxy-server: STDERR: File "/usr/lib64/python3.6/socket.py", line 604, in write#012 return self._sock.send(b)<br clear="none">> Payload: swift-proxy-server: STDERR: BlockingIOError<br clear="none">> Payload: swift-proxy-server: STDERR: os.read(self.rfd, 1)<br clear="none">> Payload: swift-proxy-server: STDERR: File "/usr/lib/python3.6/site-packages/eventlet/wsgi.py", line 818, in process_request#012 
proto.__init__(conn_state, self)<br clear="none"><br clear="none">This looks quite fishy to me, because the os.read is in swift/common/utils.py<br clear="none">and it's responsible for the mutex.<div id="yiv9982257154gmail-m_-4181266032143233798yiv4414005737yqtfd07167"><br clear="none"><br clear="none">> When we look at network connections, we see haproxy stacking up (many lines of this):<br clear="none">>  <br clear="none">> # netstat -ntup | sort -b -k2 -n -r | head -n +100<br clear="none">> tcp   5976932      0 <a rel="nofollow noopener noreferrer" shape="rect" target="_blank" href="http://127.0.0.1:60738">127.0.0.1:60738</a>         <a rel="nofollow noopener noreferrer" shape="rect" target="_blank" href="http://127.0.0.1:8080">127.0.0.1:8080</a>          ESTABLISHED 13045/haproxy      <br clear="none">> tcp   5976446      0 <a rel="nofollow noopener noreferrer" shape="rect" target="_blank" href="http://127.0.0.1:58480">127.0.0.1:58480</a>         <a rel="nofollow noopener noreferrer" shape="rect" target="_blank" href="http://127.0.0.1:8080">127.0.0.1:8080</a>          ESTABLISHED 13045/haproxy      <br clear="none">> tcp   5973217      0 <a rel="nofollow noopener noreferrer" shape="rect" target="_blank" href="http://127.0.0.1:33244">127.0.0.1:33244</a>         <a rel="nofollow noopener noreferrer" shape="rect" target="_blank" href="http://127.0.0.1:8080">127.0.0.1:8080</a>          ESTABLISHED 13045/haproxy      <br clear="none">> tcp   5973120      0 <a rel="nofollow noopener noreferrer" shape="rect" target="_blank" 
href="http://127.0.0.1:51836">127.0.0.1:51836</a>         <a rel="nofollow noopener noreferrer" shape="rect" target="_blank" href="http://127.0.0.1:8080">127.0.0.1:8080</a>          ESTABLISHED 13045/haproxy      <br clear="none">> tcp   5971968      0 <a rel="nofollow noopener noreferrer" shape="rect" target="_blank" href="http://127.0.0.1:58516">127.0.0.1:58516</a>         <a rel="nofollow noopener noreferrer" shape="rect" target="_blank" href="http://127.0.0.1:8080">127.0.0.1:8080</a>          ESTABLISHED 13045/haproxy      <br clear="none">>  ...<br clear="none">> <br clear="none">> If we restart the swift_haproxy and swift_proxy_server containers then the problem goes away, and comes back over a few minutes. Where should we be looking for the root cause of this issue?</div><br clear="none"><br clear="none">Indeed if so many requests are established, you're in trouble.<br clear="none">The best fix, I think, is to find the customer who's doing it and punish them.<br clear="none">Otherwise, quotas and the ratelimiting middleware are your friends.<br clear="none"><br clear="none">There's also a possibility that your cluster is underperforming, although<br clear="none">usually that results in 500 responses first. 
But then again, at times<br clear="none">users would "compensate" for issues by just launching way more requests,<br clear="none">in effect DoS-ing the cluster even worse.<br clear="none"><br clear="none">-- Pete<div id="yiv9982257154gmail-m_-4181266032143233798yiv4414005737yqtfd97721"><br clear="none"><br clear="none"><br clear="none"></div></div></div>            </div>                </div></div></div></div></div>            </div>                </div></div></div></div></div>            </div>                </div></blockquote></div></div><br clear="all"><div><br clear="none"></div>-- <br clear="none"><div dir="ltr" class="yiv9982257154gmail_signature"><div dir="ltr">Clay Gerrard<div>210 788 9431<br clear="none"></div></div></div></div></div></div></div>            </div>                </div>