All of our endpoints are https:

+----------------------------------+----------+--------------+--------------+---------+-----------+-------------------------------------------------------------+
| ID                               | Region   | Service Name | Service Type | Enabled | Interface | URL                                                         |
+----------------------------------+----------+--------------+--------------+---------+-----------+-------------------------------------------------------------+
| <ID>                             | <region> | keystone     | identity     | True    | internal  | https://api-int.<region>.<domain>:5000                      |
| <ID>                             | <region> | swift        | object-store | True    | public    | https://swift.<region>.<domain>:8080/v1/AUTH_%(tenant_id)s  |
| <ID>                             | <region> | swift        | object-store | True    | internal  | https://swift.<region>.<domain>:8080/v1/AUTH_%(tenant_id)s  |
| <ID>                             | <region> | keystone     | identity     | True    | admin     | https://api-int.<region>.<domain>:35357                     |
| <ID>                             | <region> | keystone     | identity     | True    | public    | https://api-ext.<region>.<domain>:5000                      |
| <ID>                             | <region> | swift        | object-store | True    | admin     | https://swift.<region>.<domain>:8080/v1                     |
+----------------------------------+----------+--------------+--------------+---------+-----------+-------------------------------------------------------------+

I don't think this is causing the issue; all of our clusters are set up the same. We did think it was load at first, and had the two heaviest users stop what they were doing, but that didn't make a difference. Our other QA cluster has similar load and identical hardware. When I look at the network graphs I see traffic spiking up to 1G, but these are 25G interfaces, and none of the resources on the boxes are exhausted: CPU is 97% idle, memory is 30% used, and the disks are not full. It doesn't look like the problem is load-related; we see the haproxy connections stacking up even when load is low.
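One thing I can check next is which clients are actually holding connections open on the proxy frontend. Rough sketch only (it assumes the public frontend listens on 8080, per the endpoint list above, and it only handles IPv4 addresses):

  # group established connections with local port 8080 by remote address
  netstat -nt | awk '$6 == "ESTABLISHED" && $4 ~ /:8080$/ {split($5, a, ":"); print a[1]}' \
      | sort | uniq -c | sort -rn | head

Entries from 127.0.0.1 there will be the loopback haproxy-to-swift-proxy connections; anything else is an external client.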
What else could be causing this?

On Friday, June 17, 2022, 11:12:36 PM EDT, Pete Zaitcev <zaitcev@redhat.com> wrote:

On Fri, 17 Jun 2022 17:33:27 +0000 (UTC)
Albert Braden <ozzzo@yahoo.com> wrote:

> $ openstack container list
> Unable to establish connection to https://swift.<region>.<domain>:8080/v1/AUTH_<project>: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

Right away I have a question: why in the world are you connecting
to 8080 with HTTPS?

> (from Splunk):
> Payload: swift-proxy-server: STDERR: File "/usr/lib64/python3.6/socket.py", line 604, in write#012 return self._sock.send(b)
> Payload: swift-proxy-server: STDERR: BlockingIOError
> Payload: swift-proxy-server: STDERR: os.read(self.rfd, 1)
> Payload: swift-proxy-server: STDERR: File "/usr/lib/python3.6/site-packages/eventlet/wsgi.py", line 818, in process_request#012 proto.__init__(conn_state, self)

This looks quite fishy to me, because the os.read is in swift/common/utils.py
and it's responsible for the mutex.

> When we look at network connections, we see haproxy stacking up (many lines of this):
>
> # netstat -ntup | sort -b -k2 -n -r | head -n +100
> tcp   5976932      0 127.0.0.1:60738         127.0.0.1:8080          ESTABLISHED 13045/haproxy
> tcp   5976446      0 127.0.0.1:58480         127.0.0.1:8080          ESTABLISHED 13045/haproxy
> tcp   5973217      0 127.0.0.1:33244         127.0.0.1:8080          ESTABLISHED 13045/haproxy
> tcp   5973120      0 127.0.0.1:51836         127.0.0.1:8080          ESTABLISHED 13045/haproxy
> tcp   5971968      0 127.0.0.1:58516         127.0.0.1:8080          ESTABLISHED 13045/haproxy
> ...
>
> If we restart the swift_haproxy and swift_proxy_server containers then the problem goes away, and comes back over a few minutes. Where should we be looking for the root cause of this issue?

Indeed if so many requests are established, you're in trouble.
The best fix, I think, is to find the customer who's doing it and punish them.
Otherwise, quotas and the ratelimiting middleware are your friends.

There's also a possibility that your cluster is underperforming, although
usually that results in 500 results first.
But then again, at times
users would "compensate" for issues by just launching way more requests,
in effect DoS-ing the cluster even worse.

-- Pete
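Noting for the thread, in case we go the ratelimiting route: as I understand it that is the ratelimit middleware in Swift's proxy-server.conf. A minimal sketch with placeholder values only; the limits, the pipeline order, and where the file lives all depend on the deployment:

  # add "ratelimit" to [pipeline:main], after the cache (memcache) middleware,
  # then configure the filter; the numbers below are placeholders to tune
  [filter:ratelimit]
  use = egg:swift#ratelimit
  # container-level PUT/DELETE per second, per account (0 = unlimited)
  account_ratelimit = 20
  # object writes per second once a container holds at least N objects
  container_ratelimit_100 = 100
  container_ratelimit_1000 = 50
  # cap on how long a request may be delayed before it gets a 498 instead
  max_sleep_time_seconds = 60
  # log any imposed sleep longer than this many seconds
  log_sleep_time_seconds = 10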