Hi Olivier,
The main reason for developing this middleware is to avoid object request latency/failures when nodes are down, or when rebalancing objects after adding new nodes, for example.
I think I get it -- failures can be unexpectedly expensive, depending on the nature of the failure:

- Box powered down (or even whole rack)
- Box online but process down
- Box online, process up, but overloaded
- Box online, process up, but disk unmounted
- Box online, process up, but disk failing
- Box online, process up, but disk overloaded

From the proxy-server changes, I take it you found the error-limiting we already have -- if some number of errors happen within some window, the proxy ignores the node for some other window. I take it this was insufficient, however -- presumably, something to do with the periodic high latency when the node comes out of the error-limited state?

Have you looked at the concurrent_gets, concurrency_timeout, and concurrent_ec_extra_requests options? I've found these to be quite useful in reducing request timings in the face of some slow backends.

Or is it more that you want to relieve a node so that it can focus on replication/reconstruction? I usually recommend having separate proxy-facing/replication-facing servers -- that opens up lots of opportunities to differentiate configuration options to adjust traffic flows: this includes workers, max_clients, and backlog, on down to ionice_class and ionice_priority. You could also look at our backend ratelimiting middleware to try to shed load faster; I often find that a fast 503 is better than a slow timeout.

How have you used this so far? Do you find yourself typically excluding a single drive (by running with servers_per_port and just disabling that drive's port), a single box, or even a whole rack at a time? How long do you typically go between turning on the exclusion and turning it off again? Are we talking hours, days, ...?

Is there any need/desire to perform this exclusion for background Swift processes? The expirer, replicator, etc.?
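For reference, those concurrency options live in the proxy app's config section; something like the following (the values here are illustrative, not tuned recommendations):

```ini
[app:proxy-server]
use = egg:swift#proxy
# Send GET/HEAD requests to multiple primaries concurrently,
# rather than trying them one at a time
concurrent_gets = true
# How long to wait before spawning the next concurrent request
concurrency_timeout = 0.5
# For EC policies, request this many fragments beyond the minimum
# needed, so one slow backend doesn't stall the whole GET
concurrent_ec_extra_requests = 1
```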
Does the middleware fulfill a common need?
It seems to be useful for you, which is enough motivation that we ought to at least look at it! To really answer the question, though, I'd want to know more about the use-case that drove this -- what was going wrong, and what kind of client impacts did it have? Especially since it requires some (reasonably minor) changes in the proxy server app, it definitely seems worth trying to upstream -- I know from experience, it's a pain to try to carry a patch in your own fork long-term.
Are there any concerns regarding the code implementation or its impact on existing functionalities?
This seems like a lot of file I/O for such a low-level function -- I'm surprised it hasn't introduced a new performance bottleneck. But maybe it's all in RAM anyway? Is the idea to mount /var/cache as tmpfs? If it might actually be on a disk, we probably want

- some kind of periodic reload mechanism, similar to what we do with rings or backend ratelimiting, and
- some error handling, such that a failure to open the file doesn't cause the whole proxy to fall over.

Even assuming tmpfs, I think you'll want some file locking -- it looks to me like concurrent removals can have a bad race.

Given that it's all stored in some flat file anyway, why the API for manipulations? Why not use whatever config-management mechanism you already have to push around a canonical exclusion list? As it is, I'd worry about config skew because some proxies missed an update -- by hooking into your config management, I'd expect that sort of thing to automatically get flagged. Or is there an expectation of one-off exclusions? That proxy A can't talk to object-server B, but any other proxies should be fine with it?

Assuming you keep the API, it surely should have some authorization required -- presumably only reseller-admins should be able to use any of it.

If you don't need the API (at least, not at the proxy) and you expect to want identical configs across the fleet, it might be worth considering doing this as a flag in the device list in the ring. Somewhat similar to how each device has a weight, you could have a flag for whether it should be participating -- maybe even make it a float rather than a boolean, so you could let it take, say, 1% of expected traffic in a kind of canary deployment. This would have a few advantages in my mind:

- Keeps as much knowledge about cluster topology and desired traffic flows in the ring as possible.
- Automatic support for periodic reloads.
- An obvious way to support mass operations, such as toggling all ports for an IP or all disks in a zone.
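Concretely, the sort of locking I have in mind might look like this -- just a sketch, not a drop-in patch, assuming the flat one-ip:port-per-line file format from your post (helper names are hypothetical, and flock only coordinates processes on the same host):

```python
import fcntl
import os


def add_excluded_node(path, ip, port):
    """Append ip:port to the exclusion file under an exclusive lock.

    The lock is held while we read-check-append, so two concurrent
    additions can't interleave and duplicate or clobber entries.
    """
    entry = "%s:%d" % (ip, port)
    with open(path, "a+") as fp:
        fcntl.flock(fp, fcntl.LOCK_EX)
        fp.seek(0)
        existing = set(line.strip() for line in fp if line.strip())
        if entry not in existing:
            fp.write(entry + "\n")  # "a+" mode always appends
        fcntl.flock(fp, fcntl.LOCK_UN)


def remove_excluded_node(path, ip, port):
    """Remove ip:port by rewriting a temp file and renaming it over
    the original, so readers never see a half-written list."""
    entry = "%s:%d" % (ip, port)
    with open(path, "a+") as fp:
        fcntl.flock(fp, fcntl.LOCK_EX)
        fp.seek(0)
        remaining = [line.strip() for line in fp
                     if line.strip() and line.strip() != entry]
        tmp = path + ".tmp"
        with open(tmp, "w") as out:
            out.write("".join(e + "\n" for e in remaining))
        os.rename(tmp, path)  # atomic replace on POSIX
        fcntl.flock(fp, fcntl.LOCK_UN)
```

The atomic-rename trick on removal is the part that matters most: truncating and rewriting the file in place is exactly where concurrent readers could observe an empty or partial list.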
Tim

________________________________________
From: Olivier Chaze <olivier.chaze@infomaniak.com>
Sent: Thursday, July 4, 2024 6:41 AM
To: openstack-discuss@lists.openstack.org <openstack-discuss@lists.openstack.org>
Subject: [Swift - proxy server] manually exclude nodes via new middleware

Hi all,

I am writing to introduce a new middleware component, ExcludeNode, which we've just started to develop to enhance node management within the Swift proxy server.

[Middleware Overview]

ExcludeNode is designed to dynamically exclude specific nodes from the proxy's operations. This middleware reads a list of nodes to be excluded from a specified file and prevents these nodes from being considered during normal operations. The main reason for developing this middleware is to avoid object request latency/failures when nodes are down, or when rebalancing objects after adding new nodes, for example.

The main functionalities include:
- Node Exclusion Check: Validates whether a node (IP:PORT) is listed in the exclusion file.
- Dynamic Updates: Allows updating the exclusion list by writing node information (IP:PORT) to the specified file.
- Clearing the exclusion list.

How to use it? Assuming storage node 1.2.3.4 is down, requests will fail with:

proxy-server: ERROR with Object server 1.2.3.4:6210/sdm re: Trying to GET /v1/AUTH_[...]: ConnectionTimeout (2.0s) (txn: tx19b7067dcb0c442e96a10-0066867068)

To ban the disk 1.2.3.4:6210/sdm, on each proxy-server:

curl -X POST `hostname`:8080/exclude_node -H "Content-Type: application/json" -d '{"ip":"1.2.3.4","port":6210}'

In proxy-server.log will be logged:

proxy-server: Node added to exclusion list: 1.2.3.4:6210 (txn: tx0f292e3b08f940b5be3d9-0066867068)

This way, the node won't serve any request and, depending on the request type, the request will be redirected to another primary or handoff node.
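For a rough idea of the mechanism, the exclusion check boils down to something like the following (a simplified, hypothetical sketch with illustrative names; the linked exclude_node.py paste is the real implementation):

```python
def load_excluded_nodes(filename):
    """Read the exclusion file into a set of 'ip:port' strings.

    A missing or unreadable file is treated as an empty exclusion
    list, so a bad file can't take down the proxy.
    """
    try:
        with open(filename) as fp:
            return set(line.strip() for line in fp if line.strip())
    except IOError:
        return set()


def is_excluded(node, excluded):
    """node is a ring-style dict with 'ip' and 'port' keys."""
    return "%(ip)s:%(port)s" % node in excluded
```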
It will remain like this until the disk is manually removed from the exclusion list, and every time this node would have been chosen to serve a request you'll get a log line:

proxy-server: Node 1.2.3.4:6210 (sdm) explicitly excluded by exclude_node middleware. (txn: txeb9f5953ee3343dbb9699-006686707a)

I would greatly appreciate the community's feedback on the following:

Functionality: Does the middleware fulfill a common need?
Implementation: Are there any concerns regarding the code implementation or its impact on existing functionalities?

We'd like to submit this middleware upstream, but to be efficient and avoid wasting everybody's time I'd like some feedback first :) The code itself is quite simple today, but we plan to add features like automatically excluding a node reported as unmounted by swift-recon, for example.

exclude_node.py middleware: https://kpaste.infomaniak.com/nRp8AdJGnMuLCoP7czIbL9IBJjlSkyYS#ARXTJo5ZnGa5U...
server.py modification: https://kpaste.infomaniak.com/Qvy992wlRbc6yEgrIPhAhyrh2V3-Gipy#E2TaKDAXQdXpe...

proxy-server.conf:

[pipeline:main]
- pipeline = [...] proxy-server
+ pipeline = [...] exclude_node proxy-server

[filter:exclude_node]
use = egg:swift#exclude_node
exclude_nodes_filename = /dev/shm/excluded_nodes

Thank you in advance for your feedback,
Olivier