Hi Tim, All,

Thank you very much for your message. I have added some comments to Thomas's replies below. :)
Hi Tim!

I am aware of what my colleague is doing, and I hope he's OK with me adding
a few words.

On 7/9/24 02:00, Tim Burke wrote:
From the proxy-server changes, I take it you found the error-limiting we already have -- if some number of errors happen within some window, the proxy ignores the node for some other window. I take it this was insufficient, however -- presumably, something to do with the periodic high-latency when the node comes out of the error-limited state? Have you looked at the concurrent_gets, concurrency_timeout, and concurrent_ec_extra_requests options? I've found these to be quite useful in reducing request timings in the face of some slow backends.

The current approach of Swift is that when a proxy sees a backend is down,
it will blacklist (error-limit) it for a while. Unfortunately, that only
really works if you have a single proxy server. When, like us, one runs a
dozen or more proxies, each proxy has to see the failures for itself, so for
a backend to be fully blacklisted it can take up to 12 x 3 tries...
That's not practical, and that's what Olivier is trying to address.
 
We didn't try concurrent_gets, which can certainly help in some cases indeed. For the rest, as Thomas explained, most of the time the existing error-limiting doesn't detect a failed disk fast enough, or a slow disk at all, because of the dozen proxies we have. It would certainly work well if the errors reported by the proxies were shared between them.
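For reference, the knobs involved here are all per-proxy settings in proxy-server.conf, roughly like this (the values below are just illustrative, not our production tuning):

[app:proxy-server]
# per-proxy error limiting: ignore a node once it racks up this many
# errors within this interval (seconds) -- each proxy counts on its own
error_suppression_limit = 10
error_suppression_interval = 60
# concurrent GET/HEAD requests to several primaries to hide one slow node
concurrent_gets = true
concurrency_timeout = 0.5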

Or is it more that you want to relieve a node so that it can focus on replication/reconstruction?

Yeah, there's that too...
Yes, replication/reconstruction can take days on our petabyte-scale clusters.



I usually recommend having separate proxy-facing/replication-facing servers -- that opens up lots of opportunities to differentiate configuration options to adjust traffic flows: this includes workers, max_clients, backlog, on down to ionice_class and ionice_priority. You could also look at our backend ratelimiting middleware to try to shed load faster; I often find that a fast 503 is better than a slow timeout.

We already run rsync under a systemd slice, so that it only takes half of
the speed of the HDDs (which is way nicer than just ionice...).

For example, something like this (example only for sda):

# cap rsync's write bandwidth and IOPS on the data disk via a slice
echo "[Slice]
IOAccounting=1
IOWriteBandwidthMax=/srv/node/sda 120M
IOWriteIOPSMax=/srv/node/sda 80
" >/etc/systemd/system/rsync.slice

# attach the rsync service to that slice and lower its I/O priority
# (drop-in directories need the full unit name, hence rsync.service.d)
mkdir -p /etc/systemd/system/rsync.service.d
echo "[Service]
Slice=rsync.slice
IOSchedulingClass=best-effort
IOSchedulingPriority=7
" >/etc/systemd/system/rsync.service.d/io-limit-slice.conf

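(A `systemctl daemon-reload` and a restart of rsync.service are needed afterwards for the new slice and drop-in to take effect.)
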
How have you used this so far? Do you find yourself typically excluding a single drive (by running with servers_per_port and just disabling that drive's port), a single box, or even a whole rack at a time? How long do you typically go between turning on the exclusion and turning it off again? Are we talking hours, days, ...?

Yes, we use servers_per_port, so each drive has a specific port.
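For completeness, that's the usual setup where the ring gives each disk its own port and the object server forks workers per port; roughly like this (the IP, ports and weights below are only an example):

# object-server.conf
[DEFAULT]
servers_per_port = 4

# ring devices, one port per disk on the same IP, e.g.:
# swift-ring-builder object.builder add r1z1-10.0.0.1:6200/sda 100
# swift-ring-builder object.builder add r1z1-10.0.0.1:6201/sdb 100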

One case is when adding new servers to the ring: we use servers with 20+ TB,
so it takes a few days to fill them up. Another is when a drive is unmounted
and needs to be replaced (broken drive); we only re-add the drive to the
available pool once rsync has finished.

All of this is also pretty new to us, so we haven't finished all of the
automation yet, but we hope to get there.
 
One use case was when we added 12 servers with 24 disks each: we excluded them using the middleware so that requests go to the other primary nodes while the objects are being replicated.
Another is when a disk fails and is unmounted: it can be excluded until reconstruction has completed.

Is there any need/desire to perform this exclusion for background swift processes? expirer, replicator, etc?

We don't need to. What we care about in this case is for our customers to
access their data fast, without any delay.

It seems to be useful for you, which is enough motivation that we ought to at least look at it! To really answer the question, though, I'd want to know more about the use-case that drove this -- what was going wrong, and what kind of client impacts did it have?

As I wrote above: we want to eliminate any kind of delay for customers
to read or write data to the cluster.
 
Client impacts can be:
- latency on HEAD/GET/PUT requests that hit timeouts such as conn_timeout (see the snippet below) when a disk/server is down, or slow due to reconstruction/replication
- errors, when a GET/HEAD returns 404 because it queried a server that does not hold the object yet (or no longer does) due to reconstruction/replication
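For context, the proxy-side timeouts involved are configured in proxy-server.conf; something like this (values are roughly the defaults, shown only as an illustration):

[app:proxy-server]
# time allowed to open a connection to a backend node
conn_timeout = 0.5
# time allowed for the backend to respond once connected
node_timeout = 10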

Are there any concerns regarding the code
implementation or its impact on existing functionalities?

This seems like a lot of file I/O for such a low-level function -- I'm surprised it hasn't introduced a new performance bottleneck. But maybe it's all in RAM anyway? Is the idea to mount /var/cache as tmpfs? If it might actually be on a disk, we probably want

- some kind of periodic reload mechanism, similar to what we do with rings or backend ratelimiting and
- some error handling, such that a failure to open the file doesn't cause the whole proxy to fall over.

Even assuming tmpfs, I think you'll want some file locking -- it looks to me that concurrent removals can have a bad race.

Olivier and I discussed this. I was the one who told him not to worry too
much about it, because the kernel:
- writes dirty pages back to disk periodically (every couple of seconds)
- keeps the files it has read cached in RAM as long as there's enough space

So in fact, the only annoying part of such a read/write-intensive task is
not performance, but wearing the HDD/SSD out and having it fail
prematurely.
 
Indeed, it is I/O intensive to read the list of excluded nodes from a file. I first tried to use the file only when starting the proxy-server workers, loading the list once and then working in memory, but I ran into thread-locking issues. So for the time being we store the file in RAM to avoid a bottleneck. We definitely have to rewrite this part in a smarter way, but I'm not an expert in Python, so it will take me a bit of time :)
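To be explicit about "store the file in RAM": we simply keep the exclusion file on a tmpfs, e.g. something like the following (the path is just an example, it depends on where the middleware is configured to write):

mount -t tmpfs -o size=16m,mode=0755 tmpfs /var/cache/swift
# plus a matching /etc/fstab entry so it survives reboots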

Given that it's all stored in some flat file anyway, why the API for manipulations? Why not use whatever config-management mechanism you already have to push around a canonical exclusion list? As it is, I'd worry about some config skew because some proxies missed an update -- by hooking into your config management, I'd expect that sort of thing to automatically get flagged. Or is there an expectation of one-off exclusions? That proxy A can't talk to object-server B but any other proxies should be fine with it?

Config management is typically very slow. In our case, puppet runs every
30 minutes. That's not what one wants. Ideally, we'd like a device or a
server to be excluded from all proxies instantly. For example, we could
have some mechanism for an object server to tell all proxies to
blacklist a device.
 
The idea of the API is to let people plug it easily into their existing configuration/monitoring system and allow Swift operators to take action immediately. A simple example is to take the output of `swift-recon -u` and automatically exclude the reported nodes using a cron job.
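As an illustration only (the exclusion call itself is left as a placeholder, since the API shape is exactly what is up for discussion here), such a cron job could look something like this:

#!/bin/sh
# Assumes swift-recon -u reports unmounted drives as lines like
#   "Not mounted: sdb on 10.0.0.1:6200"
swift-recon -u 2>/dev/null | awk '/Not mounted:/ {print $3, $5}' |
while read -r dev node; do
    # placeholder: call whatever exclusion mechanism/API we agree on
    echo "would exclude device $dev on $node"
done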

Assuming you keep the API, it surely should have some authorization required -- presumably only reseller-admins should be able to use any of it.

IMO, that'd be an admin-only API. It's just that we haven't worked on this
part yet; I was planning to contribute keystoneauth support (as I'm on
holiday until the end of the month, that would probably be for the end of
summer).
 
That is one of the reasons I wanted to discuss this before coding further. I agree with Thomas that this would most probably be an admin action. So far we have simply restricted the API call to the bind_ip defined in proxy-server.conf, which means it is open to everyone if that is 0.0.0.0, or limited to localhost (plus firewall rules) if it is localhost. In my opinion the best approach would be a single entry point, with a single call that propagates the change to all proxies. It could be a new openstack admin command.

If you don't need the API (at least, not at the proxy) and you expect to want identical configs across the fleet, it might be worth considering doing this as a flag in the device list in the ring. Somewhat similar to how each device has a weight, you could have a flag for whether it should be participating -- maybe even make it a float rather than a boolean, so you could let it take, say, 1% of expected traffic in a kind of canary deployment. This would have a few advantages in my mind:

- Keeps as much knowledge about cluster topology and desired traffic flows in the ring as possible.
- Automatic support for periodic reloads.
- Obvious way to support mass operations, such as toggling all ports for an IP or all disks in a zone.

I believe Olivier is happy with his API approach, and I'm not really sure
about the ring idea, which would take longer to deploy. Though we do need
to find ways to automate blacklisting devices quickly from all our
proxies. Hopefully he can reply better than I just did with what he has
in his productive mind. :)
 
Integrating ring knowledge is indeed something I have already started to implement, allowing an entire node to be excluded simply by providing an IP. That is useful for maintenance requiring a server reboot, for instance.
Unless I missed something in your flag approach, I believe the advantage of the API over ring flags is that it avoids ring manipulations/deployments, which, depending on the environment, can be scary, challenging, and slow. We have disks failing every day; that would mean deploying a new ring several times across hundreds of nodes, which doesn't seem like the right approach.
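Just to spell out what I mean by ring manipulations: with today's tooling, draining or re-enabling a single disk is roughly the following, and the result then has to be shipped to every proxy and object node (the device ID below is made up):

# set the device's weight to 0 (or back to its normal value), then rebalance
swift-ring-builder object.builder set_weight d42 0
swift-ring-builder object.builder rebalance
# then distribute the new object.ring.gz to all nodes (rsync, puppet, ...)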

Cheers,

Thomas Goirand (zigo)
Thank you again for your comments,
Olivier Chaze