Is this bug fixed in Train?

https://bugs.launchpad.net/swift/+bug/1572719

On Tuesday, June 21, 2022, 09:26:47 AM EDT, Albert Braden <ozzzo@yahoo.com> wrote:

All of our endpoints are https:

+------+----------+--------------+--------------+---------+-----------+------------------------------------------------------------+
| ID   | Region   | Service Name | Service Type | Enabled | Interface | URL                                                        |
+------+----------+--------------+--------------+---------+-----------+------------------------------------------------------------+
| <ID> | <region> | keystone     | identity     | True    | internal  | https://api-int.<region>.<domain>:5000                     |
| <ID> | <region> | swift        | object-store | True    | public    | https://swift.<region>.<domain>:8080/v1/AUTH_%(tenant_id)s |
| <ID> | <region> | swift        | object-store | True    | internal  | https://swift.<region>.<domain>:8080/v1/AUTH_%(tenant_id)s |
| <ID> | <region> | keystone     | identity     | True    | admin     | https://api-int.<region>.<domain>:35357                    |
| <ID> | <region> | keystone     | identity     | True    | public    | https://api-ext.<region>.<domain>:5000                     |
| <ID> | <region> | swift        | object-store | True    | admin     | https://swift.<region>.<domain>:8080/v1                    |
+------+----------+--------------+--------------+---------+-----------+------------------------------------------------------------+

I don't think this is causing the issue; all of our clusters are set up the same. We did think it was load at first, and got the two heaviest users to stop what they were doing, but that didn't make a difference. Our other QA cluster has similar load and identical hardware. When I look at the network graphs, I see traffic spiking up to 1G, but these are 25G interfaces, and none of the resources on the boxes are exhausted: CPU is 97% idle, memory is 30% used, and the disks are not full. It doesn't look like the problem is load-related. We see the haproxy connections stacking up even when load is low.
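One thing we can try is grouping the established front-end connections by client address, to see whether a single client is responsible. Rough sketch only (assumes ss is available on the proxy node and that haproxy is the 8080 listener there; adjust the port for your layout):

# count ESTABLISHED connections to the 8080 front end, grouped by peer IP;
# loopback is excluded so the haproxy->swift-proxy backend connections don't drown out real clients
ss -tn state established '( sport = :8080 )' \
  | awk 'NR > 1 { sub(/:[0-9]+$/, "", $4); print $4 }' \
  | grep -v '^127\.' \
  | sort | uniq -c | sort -rn | head -20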
What else could be causing this?

On Friday, June 17, 2022, 11:12:36 PM EDT, Pete Zaitcev <zaitcev@redhat.com> wrote:

On Fri, 17 Jun 2022 17:33:27 +0000 (UTC)
Albert Braden <ozzzo@yahoo.com> wrote:

> $ openstack container list
> Unable to establish connection to https://swift.<region>.<domain>:8080/v1/AUTH_<project>: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')

Right away I have a question: why in the world are you connecting
to 8080 with HTTPS?

> (from Splunk):
> Payload: swift-proxy-server: STDERR: File "/usr/lib64/python3.6/socket.py", line 604, in write#012 return self._sock.send(b)
> Payload: swift-proxy-server: STDERR: BlockingIOError
> Payload: swift-proxy-server: STDERR: os.read(self.rfd, 1)
> Payload: swift-proxy-server: STDERR: File "/usr/lib/python3.6/site-packages/eventlet/wsgi.py", line 818, in process_request#012 proto.__init__(conn_state, self)

This looks quite fishy to me, because the os.read is in swift/common/utils.py
and it's responsible for the mutex.

> When we look at network connections, we see haproxy stacking up (many lines of this):
>
> # netstat -ntup | sort -b -k2 -n -r | head -n +100
> tcp   5976932      0 127.0.0.1:60738         127.0.0.1:8080          ESTABLISHED 13045/haproxy
> tcp   5976446      0 127.0.0.1:58480         127.0.0.1:8080          ESTABLISHED 13045/haproxy
> tcp   5973217      0 127.0.0.1:33244         127.0.0.1:8080          ESTABLISHED 13045/haproxy
> tcp   5973120      0 127.0.0.1:51836         127.0.0.1:8080          ESTABLISHED 13045/haproxy
> tcp   5971968      0 127.0.0.1:58516         127.0.0.1:8080          ESTABLISHED 13045/haproxy
> ...
>
> If we restart the swift_haproxy and swift_proxy_server containers then the problem goes away, and comes back over a few minutes. Where should we be looking for the root cause of this issue?

Indeed if so many requests are established, you're in trouble.
The best fix, I think, is to find the customer who's doing it and punish them.
Otherwise, quotas and the ratelimiting middleware are your friends.

There's also a possibility that your cluster is underperforming, although
usually that results in 500 results first.
But then again, at times
users would "compensate" for issues by just launching way more requests,
in effect DoS-ing the cluster even worse.

-- Pete
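In case we go the ratelimiting route, the middleware Pete mentions is configured in proxy-server.conf; a minimal sketch (the values are illustrative placeholders, not recommendations, and the pipeline shown here is abbreviated):

[pipeline:main]
# ratelimit needs memcache, so it sits after the cache filter and before auth
pipeline = catch_errors healthcheck proxy-logging cache ratelimit authtoken keystoneauth proxy-logging proxy-server

[filter:ratelimit]
use = egg:swift#ratelimit
# per-account cap on container PUT/DELETE requests per second (0 = disabled)
account_ratelimit = 20
# throttle writes to containers once they hold 100+ objects (illustrative threshold/rate)
container_ratelimit_100 = 100
# longest a request is delayed before being refused with 498 Rate Limited
max_sleep_time_seconds = 60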