[Openstack] [Swift] : Work-flow

John Dickinson me at not.mn
Fri Apr 28 17:37:34 UTC 2017


Great question. There's not a simple yes/no answer, so let me go into some detail about the different ways storage nodes can be selected to handle a request.

There are a few config settings in the proxy server config that can affect how nodes are selected for reads. Instead of describing these directly (or pasting the docs), let me describe it from an implementation perspective.

When the proxy server gets a read request for an object, the proxy looks up in the ring which storage nodes (object servers) may know something about that object. The proxy server builds two lists[1]. The first is for "primary" nodes. These are the drives where the data is supposed to be. For a replicated storage policy with three replicas, the primary nodes will be a list of three items[2]. For an 8+4 erasure coded storage policy, it will be the list of 12 nodes where the EC fragments are supposed to be. The second list the proxy makes is the list of "handoff" nodes. These are alternative places where an object (or fragment) may be found if it isn't on a primary node.
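
To make that concrete, here's a minimal sketch of that lookup using Swift's ring API (the ring path and the account/container/object names are just placeholders):

    from swift.common.ring import Ring

    # Load the object ring the same way the proxy server does.
    ring = Ring('/etc/swift/object.ring.gz')

    # Primary nodes: the drives where the data is supposed to be.
    partition, primaries = ring.get_nodes('AUTH_test', 'photos', 'cat.jpg')
    for node in primaries:
        print('%(ip)s:%(port)s/%(device)s in region %(region)s' % node)

    # Handoff nodes: alternative locations, lazily evaluated (see [1]).
    handoffs = ring.get_more_nodes(partition)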

Once the proxy has the list of primary nodes, there are a few ways it can iterate over that list. The `sorting_method` config option determines this. The default `shuffle` value means that the list of primary nodes is randomly sorted. When a proxy server makes a connection to a storage node, it tracks how long it took to create the connection. The `timing` value of `sorting_method` will sort the list by these saved connection timings. The idea is that a busy server will take longer to respond to connection requests and will therefore get moved lower in the sorted list. The `affinity` value will cause the list of nodes to be sorted according to the rules set in the `read_affinity` config option. This allows a proxy server to prioritize connections that are local (same DC) and de-prioritize remote connections. The `read_affinity` setting is fantastic when Swift is deployed with more than one region (i.e. global clusters).
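
Roughly, the sorting step looks like this (a simplified sketch, not the actual proxy code; `timings` and `affinity_rank` stand in for state the real proxy keeps around):

    import random

    def sort_primary_nodes(nodes, sorting_method, timings=None, affinity_rank=None):
        nodes = list(nodes)
        if sorting_method == 'shuffle':
            # default: randomize to spread load across replicas
            random.shuffle(nodes)
        elif sorting_method == 'timing':
            # prefer nodes with the fastest saved connection timings;
            # untried nodes sort early
            nodes.sort(key=lambda n: timings.get((n['ip'], n['port']), 0))
        elif sorting_method == 'affinity':
            # lower rank = higher priority, per the read_affinity rules,
            # e.g. read_affinity = r1=100, r2=200 prefers region 1
            nodes.sort(key=affinity_rank)
        return nodes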

Once the list of primary nodes is sorted, the proxy will start contacting each in turn until the request can be satisfied. With erasure codes, at least <number of data fragments> nodes need to be contacted (e.g. 8 in the 8+4 example above), so the sorting method doesn't do much to change performance. For replicas, though, only one node is needed to satisfy a read request. The naive way to go through the list is: contact the first node, wait for the response, and on error repeat with the next node in the list.
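
That naive loop is essentially this (`fetch_object` is a stand-in for the proxy's actual GET handling):

    def naive_read(sorted_primaries, fetch_object):
        # Try each primary in order; the first good response wins.
        for node in sorted_primaries:
            try:
                return fetch_object(node)
            except Exception:  # connection error or node_timeout
                continue       # fall through to the next primary
        raise Exception('no primary node could satisfy the read')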

However, just walking through the list can be really slow. Swift has another config option, `concurrency_timeout` (defaulting to 0.5 seconds), which is the delay before the next request is attempted. Basically, it's a trade-off between the number of network connections created and end-user latency. Let's say `node_timeout` is set to 5 seconds. If a server can accept a connection with no problem but a disk is slow, the proxy might start a read request but wait five seconds before timing out and moving on. Worst case, this could result in a 10 second delay before the last primary node is even asked if it has the data (the first two time out after 5 seconds each). With `concurrency_timeout`, the proxy will only wait 500ms before starting the next connection in the primary node list. Whichever node responds first will be the one used to send data to the client, and the rest are closed and cleaned up.
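
Here's a toy illustration of that staggered approach, using threads instead of Swift's eventlet coroutines (in the real proxy this behavior is tied to the `concurrent_gets` option; `fetch_object` is the same stand-in as above):

    import queue
    import threading

    def concurrent_read(sorted_primaries, fetch_object, concurrency_timeout=0.5):
        results = queue.Queue()

        def attempt(node):
            try:
                results.put(fetch_object(node))
            except Exception:
                results.put(None)  # record the failure

        pending = 0
        for node in sorted_primaries:
            threading.Thread(target=attempt, args=(node,), daemon=True).start()
            pending += 1
            try:
                # Give the in-flight attempts up to 500ms before
                # firing the next connection.
                result = results.get(timeout=concurrency_timeout)
                pending -= 1
                if result is not None:
                    return result  # first responder wins
            except queue.Empty:
                pass  # no answer yet; start the next connection
        while pending:  # all attempts started; wait out the stragglers
            result = results.get()
            pending -= 1
            if result is not None:
                return result
        raise Exception('no primary node responded')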

That's an overview of how the proxy chooses which nodes to talk to when handling a read request. There are a few different options that can be tuned depending on your particular deployment, but the default values (shuffle, 500ms concurrency timeout) are really good for most cases.

As a final note, there's also a `write_affinity` setting for the write data path. This works very similarly to the `read_affinity` setting, but I'm not a big fan of it. It seems to cause more problems than it solves. It causes the proxy server to mix some local handoff nodes into the primary node list on a write. This means that all writes in a global cluster will be satisfied in the local DC, but it doesn't mean the WAN traversal work goes away. Swift's background consistency process will move the data to the right place later, but this is more expensive than putting it in the right place to start with. I strongly recommend that you do not use `write_affinity` in your global Swift clusters.
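
For illustration, that mixing works roughly like this (a simplified sketch; `is_local` stands in for the write_affinity rules and `replica_count` for the policy's replica count):

    def write_affinity_nodes(primaries, handoffs, is_local, replica_count):
        # Keep the local primaries, then pad with local handoffs so that
        # every copy of the write initially lands in the local DC.
        nodes = [n for n in primaries if is_local(n)]
        for node in handoffs:
            if len(nodes) >= replica_count:
                break
            if is_local(node):
                nodes.append(node)
        # The background replicator must later move the handoff copies to
        # the remote primaries over the WAN; that's the extra cost above.
        return nodes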


[1] technically, a list and a lazy-eval'd iterator
[2] I sometimes switch between "node" and "drive" and "server". Each element in these lists has (IP, port, mount point) for finding a particular drive.


Hope this info helps you understand more about how Swift works and how you can best tune it for your use.


--John





On 28 Apr 2017, at 2:46, Sameer Kulkarni wrote:

> Hi All,
>
> I had a doubt regarding the work-flow of Swift.
>
> 'For read operation, we need to read from one of the three replicas. We
> are aware that geographical origin of request is one of the factors to
> decide which replica to read from (usually the nearest replica). But is
> also the load on the nodes containing these replicas taken into account?
> i.e. will all read requests for the same object from a given location read
> from the same replica, or when load increases (on the node containing that
> replica) will the requests be directed to a different replica?'
>
>
> Cheers,
> Sameer

