Swift issues in one cluster

Pete Zaitcev zaitcev at redhat.com
Sat Jun 18 03:05:56 UTC 2022


On Fri, 17 Jun 2022 17:33:27 +0000 (UTC)
Albert Braden <ozzzo at yahoo.com> wrote:

> $ openstack container list
> Unable to establish connection to https://swift.<region>.<domain>:8080/v1/AUTH_<project>: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

Right away I have a question: why in the world are you connecting
to 8080 with HTTPS?
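
Typically the proxy listens for plain HTTP on 8080 and TLS, if any, is
terminated in a load balancer in front of it. It's worth checking what
your catalog actually advertises for the object-store endpoint, e.g.:

  $ openstack catalog show object-store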

> (from Splunk):
> Payload: swift-proxy-server: STDERR: File "/usr/lib64/python3.6/socket.py", line 604, in write#012 return self._sock.send(b)
> Payload: swift-proxy-server: STDERR: BlockingIOError
> Payload: swift-proxy-server: STDERR: os.read(self.rfd, 1)
> Payload: swift-proxy-server: STDERR: File "/usr/lib/python3.6/site-packages/eventlet/wsgi.py", line 818, in process_request#012 proto.__init__(conn_state, self)

This looks quite fishy to me, because that os.read is in swift/common/utils.py,
in the code responsible for the mutex.
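
For context, the mutex there is backed by a pipe so that acquiring it can
block; the pattern is roughly this (a simplified sketch, not the actual
Swift code):

  import os

  class PipeMutexSketch(object):
      # Simplified sketch of a pipe-backed mutex: one token byte lives in
      # the pipe; acquire() blocks in os.read() until it can pull the token
      # out, and release() puts it back.

      def __init__(self):
          self.rfd, self.wfd = os.pipe()
          os.write(self.wfd, b'-')   # the token starts out available

      def acquire(self):
          os.read(self.rfd, 1)       # blocks here until the token can be taken

      def release(self):
          os.write(self.wfd, b'-')   # put the token back, waking one waiter

So the os.read(self.rfd, 1) in the traceback is simply a caller waiting
on that lock.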

> When we look at network connections, we see haproxy stacking up (many lines of this):
>  
> # netstat -ntup | sort -b -k2 -n -r | head -n +100
> tcp   5976932      0 127.0.0.1:60738         127.0.0.1:8080          ESTABLISHED 13045/haproxy      
> tcp   5976446      0 127.0.0.1:58480         127.0.0.1:8080          ESTABLISHED 13045/haproxy      
> tcp   5973217      0 127.0.0.1:33244         127.0.0.1:8080          ESTABLISHED 13045/haproxy      
> tcp   5973120      0 127.0.0.1:51836         127.0.0.1:8080          ESTABLISHED 13045/haproxy      
> tcp   5971968      0 127.0.0.1:58516         127.0.0.1:8080          ESTABLISHED 13045/haproxy      
>  ...
> 
> If we restart the swift_haproxy and swift_proxy_server containers then the problem goes away, and comes back over a few minutes. Where should we be looking for the root cause of this issue?

Indeed, with that many connections piled up, you're in trouble.
The best fix, I think, is to find the customer who's doing it and punish them.
Otherwise, quotas and the ratelimiting middleware are your friends.
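
For reference, the ratelimit middleware lives in the proxy pipeline; a
minimal sketch in proxy-server.conf would be along these lines (the numbers
are placeholders, and the exact knobs are in the middleware documentation):

  [pipeline:main]
  # keep your existing pipeline, just add "ratelimit";
  # it needs "cache" (memcache) earlier in the line
  pipeline = catch_errors ... cache ... ratelimit ... proxy-server

  [filter:ratelimit]
  use = egg:swift#ratelimit
  # container PUT/DELETE per account, and object writes per container
  # once a container grows past 1000 objects -- placeholder values
  account_ratelimit = 20
  container_ratelimit_1000 = 100
  max_sleep_time_seconds = 60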

There's also a possibility that your cluster is underperforming, although
usually that shows up as 500 errors first. But then again, users sometimes
"compensate" for issues by just launching way more requests, in effect
DoS-ing the cluster even worse.
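
If it isn't obvious who is generating the load, counting accounts in the
proxy access log usually points at the offender quickly; something along
these lines (the log location depends on your deployment, and this assumes
the default AUTH_ reseller prefix):

  # grep -o 'AUTH_[^/" ]*' /var/log/swift/proxy-access.log | sort | uniq -c | sort -rn | head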

-- Pete