Hi Tim, All,

Thank you very much for your message. I have added some comments to Thomas's replies below. :)
Hi Tim!

I am aware of what my colleague is doing, and I hope he's OK with me adding
a few words.

On 7/9/24 02:00, Tim Burke wrote:
From the proxy-server changes, I take it you found the error-limiting we already have -- if some number of errors happen within some window, the proxy ignores the node for some other window. I take it this was insufficient, however -- presumably, something to do with the periodic high-latency when the node comes out of the error-limited state? Have you looked at the concurrent_gets, concurrency_timeout, and concurrent_ec_extra_requests options? I've found these to be quite useful in reducing request timings in the face of some slow backends.

The current approach of Swift is that when a proxy sees a backend is down,
it will blacklist (error-limit) it for a while. Unfortunately, that only
really works if you have a single proxy server. When, like us, one runs a
dozen or more proxies, each proxy has to see the failures for itself, so for
a backend to be fully blacklisted it can take up to 12 x 3 tries...
That's not practical, and that's what Olivier is trying to address.
 
We didn't try concurrent_gets, which can certainly help in some cases indeed. For the rest, as Thomas explained, most of the time the existing error-limiting doesn't detect a failed disk fast enough, or a slow disk at all, because of the dozen proxies we have. It would certainly work well if the errors reported by the proxies were shared between them.
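For reference, the knobs involved here are all per-proxy settings in proxy-server.conf, roughly like this (the values below are just illustrative, not our production tuning):

[app:proxy-server]
# per-proxy error limiting: ignore a node once it racks up this many
# errors within this interval (seconds) -- each proxy counts on its own
error_suppression_limit = 10
error_suppression_interval = 60
# concurrent GET/HEAD requests to several primaries to hide one slow node
concurrent_gets = true
concurrency_timeout = 0.5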

Or is it more that you want to relieve a node so that it can focus on replication/reconstruction?

Yeah, there's that too...
Yes, replication/reconstruction can take days on our petabyte-scale clusters.



I usually recommend having separate proxy-facing/replication-facing servers -- that opens up lots of opportunities to differentiate configuration options to adjust traffic flows: this includes workers, max_clients, backlog, on down to ionice_class and ionice_priority. You could also look at our backend ratelimiting middleware to try to shed load faster; I often find that a fast 503 is better than a slow timeout.

We already run rsync under a systemd slice, so that it only takes half of
the speed of the HDDs (which is way nicer than just ionice...).

For example, something like this (example only for sda):

# cap rsync's write bandwidth and IOPS on the data disk via a slice
echo "[Slice]
IOAccounting=1
IOWriteBandwidthMax=/srv/node/sda 120M
IOWriteIOPSMax=/srv/node/sda 80
" >/etc/systemd/system/rsync.slice

# attach the rsync service to that slice and lower its I/O priority
# (drop-in directories need the full unit name, hence rsync.service.d)
mkdir -p /etc/systemd/system/rsync.service.d
echo "[Service]
Slice=rsync.slice
IOSchedulingClass=best-effort
IOSchedulingPriority=7
" >/etc/systemd/system/rsync.service.d/io-limit-slice.conf

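(A `systemctl daemon-reload` and a restart of rsync.service are needed afterwards for the new slice and drop-in to take effect.)
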
How have you used this so far? Do you find yourself typically excluding a single drive (by running with servers_per_port and just disabling that drive's port), a single box, or even a whole rack at a time? How long do you typically go between turning on the exclusion and turning it off again? Are we talking hours, days, ...?

Yes, we use servers_per_port, so each drive has a specific port.
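For completeness, that's the usual setup where the ring gives each disk its own port and the object server forks workers per port; roughly like this (the IP, ports and weights below are only an example):

# object-server.conf
[DEFAULT]
servers_per_port = 4

# ring devices, one port per disk on the same IP, e.g.:
# swift-ring-builder object.builder add r1z1-10.0.0.1:6200/sda 100
# swift-ring-builder object.builder add r1z1-10.0.0.1:6201/sdb 100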

One case is when adding new servers to the ring: we use servers with 20+ TB,
so it takes a few days to fill them up. Another is when a drive is unmounted
and needs to be replaced (broken drive); we only re-add the drive to the
available pool once rsync has finished.

All of this is also pretty new to us, so we haven't finished all of the
automation yet, but we hope to get there.
 
One use case was when we added 12 servers with 24 disks each: we excluded them using the middleware so that requests go to the other primary nodes while the objects are being replicated.
Another is when a disk fails and is unmounted: it can be excluded until reconstruction has completed.

Is there any need/desire to perform this exclusion for background swift processes? expirer, replicator, etc?

We don't need to. What we care about in this case is for our customers to
access their data fast, without any delay.

It seems to be useful for you, which is enough motivation that we ought to at least look at it! To really answer the question, though, I'd want to know more about the use-case that drove this -- what was going wrong, and what kind of client impacts did it have?

As I wrote above: we want to eliminate any kind of delay for customers
to read or write data to the cluster.
 
Client impacts can be:
- latency on HEAD/GET/PUT requests that hit timeouts such as conn_timeout (see the snippet below) when a disk/server is down, or slow due to reconstruction/replication
- errors, when a GET/HEAD returns 404 because it queried a server that does not hold the object yet (or no longer does) due to reconstruction/replication
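For context, the proxy-side timeouts involved are configured in proxy-server.conf; something like this (values are roughly the defaults, shown only as an illustration):

[app:proxy-server]
# time allowed to open a connection to a backend node
conn_timeout = 0.5
# time allowed for the backend to respond once connected
node_timeout = 10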

Are there any concerns regarding the code
implementation or its impact on existing functionalities?

This seems like a lot of file I/O for such a low-level function -- I'm surprised it hasn't introduced a new performance bottleneck. But maybe it's all in RAM anyway? Is the idea to mount /var/cache as tmpfs? If it might actually be on a disk, we probably want

- some kind of periodic reload mechanism, similar to what we do with rings or backend ratelimiting and
- some error handling, such that a failure to open the file doesn't cause the whole proxy to fall over.

Even assuming tmpfs, I think you'll want some file locking -- it looks to me that concurrent removals can have a bad race.

Olivier and I discussed this. I was the one who told him not to worry too
much about it, because the kernel:
- writes dirty pages back to disk periodically (every couple of seconds)
- keeps the files it has read cached in RAM as long as there's enough space

So in fact, the only annoying part of such a read/write-intensive task is
not performance, but wearing the HDD/SSD out and having it fail
prematurely.
 
Indeed, it is I/O intensive to read the list of excluded nodes from a file. I first tried to use the file only when starting the proxy-server workers, loading the list once and then working in memory, but I ran into thread-locking issues. So for the time being we store the file in RAM to avoid a bottleneck. We definitely have to rewrite this part in a smarter way, but I'm not an expert in Python, so it will take me a bit of time :)
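To be explicit about "store the file in RAM": we simply keep the exclusion file on a tmpfs, e.g. something like the following (the path is just an example, it depends on where the middleware is configured to write):

mount -t tmpfs -o size=16m,mode=0755 tmpfs /var/cache/swift
# plus a matching /etc/fstab entry so it survives reboots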

Given that it's all stored in some flat file anyway, why the API for manipulations? Why not use whatever config-management mechanism you already have to push around a canonical exclusion list? As it is, I'd worry about some config skew because some proxies missed an update -- by hooking into your config management, I'd expect that sort of thing to automatically get flagged. Or is there an expectation of one-off exclusions? That proxy A can't talk to object-server B but any other proxies should be fine with it?

Config management is typically very slow. In our case, puppet runs every
30 minutes. That's not what one wants. Ideally, we'd like a device or a
server to be excluded from all proxies instantly. For example, we could
have some mechanism for an object server to tell all proxies to
blacklist a device.
 
The idea of the API is to let people plug it easily into their existing configuration/monitoring system and allow Swift operators to take action immediately. A simple example is to take the output of `swift-recon -u` and automatically exclude the reported nodes using a cron job.
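As an illustration only (the exclusion call itself is left as a placeholder, since the API shape is exactly what is up for discussion here), such a cron job could look something like this:

#!/bin/sh
# Assumes swift-recon -u reports unmounted drives as lines like
#   "Not mounted: sdb on 10.0.0.1:6200"
swift-recon -u 2>/dev/null | awk '/Not mounted:/ {print $3, $5}' |
while read -r dev node; do
    # placeholder: call whatever exclusion mechanism/API we agree on
    echo "would exclude device $dev on $node"
done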

Assuming you keep the API, it surely should have some authorization required -- presumably only reseller-admins should be able to use any of it.

IMO, that'd be an admin-only API. It's just that we haven't worked on this
part yet; I was planning to contribute keystoneauth support (as I'm on
holiday until the end of the month, that would probably be for the end of
summer).
 
That is one of the reasons I wanted to discuss this before coding further. I agree with Thomas that this would most probably be an admin action. So far we have simply restricted the API call to the bind_ip defined in proxy-server.conf, which means it is open to everyone if that is 0.0.0.0, or limited to localhost (plus firewall rules) if it is localhost. In my opinion the best approach would be a single entry point, with a single call that propagates the change to all proxies. It could be a new openstack admin command.

If you don't need the API (at least, not at the proxy) and you expect to want identical configs across the fleet, it might be worth considering doing this as a flag in the device list in the ring. Somewhat similar to how each device has a weight, you could have a flag for whether it should be participating -- maybe even make it a float rather than a boolean, so you could let it take, say, 1% of expected traffic in a kind of canary deployment. This would have a few advantages in my mind:

- Keeps as much knowledge about cluster topology and desired traffic flows in the ring as possible.
- Automatic support for periodic reloads.
- Obvious way to support mass operations, such as toggling all ports for an IP or all disks in a zone.

I believe Olivier is happy with his API approach, and I'm not really sure
about the ring idea, which would take longer to deploy. Though we do need
to find ways to automate blacklisting devices quickly from all our
proxies. Hopefully he can reply better than I just did with what he has
in his productive mind. :)
 
Integrating ring knowledge is indeed something I have already started to implement, allowing an entire node to be excluded simply by providing an IP. That is useful for maintenance requiring a server reboot, for instance.
Unless I missed something in your flag approach, I believe the advantage of the API over ring flags is that it avoids ring manipulations/deployments, which, depending on the environment, can be scary, challenging, and slow. We have disks failing every day; that would mean deploying a new ring several times across hundreds of nodes, which doesn't seem like the right approach.
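Just to spell out what I mean by ring manipulations: with today's tooling, draining or re-enabling a single disk is roughly the following, and the result then has to be shipped to every proxy and object node (the device ID below is made up):

# set the device's weight to 0 (or back to its normal value), then rebalance
swift-ring-builder object.builder set_weight d42 0
swift-ring-builder object.builder rebalance
# then distribute the new object.ring.gz to all nodes (rsync, puppet, ...)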

Cheers,

Thomas Goirand (zigo)
Thank you again for your comments,
Olivier Chaze