[openstack-dev] [Swift] Design note of geo-distributed Swift cluster

Oleg Gelbukh ogelbukh at mirantis.com
Wed Feb 27 13:45:06 UTC 2013


Hello, Adrian, Yuzawa-san

We are still pursuing the global ring implementation with minimal
changes to the replication algorithm. However, it has a number of
drawbacks. Some of them were obvious from the very beginning (for
example, the need to tweak rebalance to minimize data transfers between
regions, or the operational overhead required to distribute ring files
in a multi-region environment); others have been made visible by this
very discussion.

We identified three basic ways to implement inter-region replication:

1. Introduce replicator affinity, which in general resembles proxy
affinity in the sense that the replicator handles replication to
devices in local and foreign regions differently. For example, limit
the number of REPLICATE calls to foreign regions to one in ten
replicator runs, and connect to only a single foreign-region server in
a single run. This is the approach we are going to take in the first
iteration.
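As an illustrative sketch only (this is not actual Swift code; the function and constant names are hypothetical), the per-run throttling of foreign-region targets described above might look like:

```python
# Hypothetical sketch of replicator affinity: replicate to foreign
# regions only once every N replicator runs, and connect to at most
# one foreign-region server per run. Names are illustrative.
FOREIGN_RUN_INTERVAL = 10  # cross-region replication in 1 of 10 runs

def select_sync_targets(nodes, local_region, run_count):
    """Split candidate nodes into local targets plus, on every
    FOREIGN_RUN_INTERVAL-th run, a single foreign-region target."""
    local = [n for n in nodes if n['region'] == local_region]
    foreign = [n for n in nodes if n['region'] != local_region]
    targets = list(local)
    if foreign and run_count % FOREIGN_RUN_INTERVAL == 0:
        # connect to only a single foreign-region server in this run
        targets.append(foreign[0])
    return targets
```

The key property is that local replication proceeds on every run, while cross-region traffic is bounded by the interval.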

2. Implement a separate replicator process for cross-region
replication. The original replicator handles replication to devices in
the local region and ignores devices in foreign regions, while the
region-replicator acts symmetrically, ignoring local devices. This
approach is basically an extension of the first, but allows the
changes to be isolated from the core code.
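A minimal sketch of the symmetric filtering, assuming both daemons share one helper (the helper name and device dict layout are hypothetical, not Swift's API):

```python
# Hypothetical sketch of option 2: the standard replicator and the
# region-replicator run with opposite device filters, so their
# responsibilities never overlap. Names are illustrative.
def partition_devices(devices, local_region, cross_region=False):
    """Return the devices this replicator instance is responsible for.

    The standard replicator runs with cross_region=False and sees only
    local devices; the region-replicator runs with cross_region=True
    and sees only devices in foreign regions.
    """
    if cross_region:
        return [d for d in devices if d['region'] != local_region]
    return [d for d in devices if d['region'] == local_region]
```

Because the two filters are complementary, every device in the ring is handled by exactly one of the two daemons.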

3. Create a replication server that sits on the edge of the region's
replication network (or storage network, if a replication network is
not used) and controls replication to foreign regions. This server
won't store any data, only a database of hashes in a sort of
'ring-of-rings', used to determine whether replication to a foreign
region is required.
In this case, the global namespace moves to that 'ring-of-rings', and
the standard ring is used for intra-region replication.
The replication server is represented as a special device with a very
large 'weight' parameter, so that it gets information about replicas
in the local cluster from the standard replicators. This server will
also have to 'proxy' replication traffic when it detects that a
partition has been modified in the local cluster.
Unlike #1 and #2, this option supports only replication between
regions; no proxy server can talk to a storage server in a foreign
region. However, it allows more sophisticated algorithms for
inter-region replication.
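To make the edge-server decision concrete, here is a hedged sketch of how the 'ring-of-rings' comparison could work (the function name and hash-map shape are assumptions for illustration, not an existing Swift interface):

```python
# Hypothetical sketch of option 3: the replication server holds, in
# its 'ring-of-rings' database, the suffix hashes recorded for a
# partition in a foreign region, and compares them against the hashes
# reported by local replicators to decide whether a cross-region
# REPLICATE is required. Names are illustrative.
def needs_cross_region_sync(local_hashes, remote_hashes):
    """Return True if any local suffix hash is missing from, or
    differs from, the hash recorded for the foreign region."""
    for suffix, digest in local_hashes.items():
        if remote_hashes.get(suffix) != digest:
            return True
    return False
```

Only when a mismatch is detected would the server 'proxy' the actual replication traffic across regions, keeping the steady-state inter-region load down to hash exchanges.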

--
Best regards,
Oleg Gelbukh
Mirantis, Inc.

On Tue, Feb 19, 2013 at 9:36 AM, YUZAWA Takahiko
<yuzawataka at intellilink.co.jp> wrote:
>
> Oleg-san,
>
> Has the implementation of the geo-distributed Swift cluster entirely changed from proxy affinity and regions to inter-region replication?
>
> I have questions like following.
>
> * Has the namespace been split between the normal rings and the 'ring-of-rings' of the inter-region replicator? If so, how do clients reach objects in another region?
> * Must inter-region replicators store one replica of the objects of each region? Will that scale?
>
> Could you tell us more details of this idea?
>
> Thank you.
>
>
> (2013/02/18 17:14), Oleg Gelbukh wrote:
>>
>> Hello,
>>
>> I would like to continue this insightful discussion by dropping a couple
>> of suggestions inline.
>>
>> On Tue, Feb 5, 2013 at 3:47 PM, Caitlin Bestler
>> <caitlin.bestler at nexenta.com <mailto:caitlin.bestler at nexenta.com>> wrote:
>>
>>     While we don't want to solve every possible topology, I think we
>>     really need to pay attention to what multi-site really requires.
>>
>>     I haven't done any studies of the entire market, but in my
>>     experience inter-site replication used by storage services is almost
>>     always
>>     via dedicated or VPN tunnels, and when VPN tunnels are used they are
>>     traffic shaped.
>>
>>     This is not just a matter of connecting a bunch of IP addresses on
>>     the internet and then form a vague impression as to which ones
>>     are "far" away. It is more like the type of discovery routers do
>>     where each tunnel is a "link".
>>
>>     A proper remote replication solution will be aware of these links,
>>     and take that into account in its replication strategy. One example
>>     topology that I believe is very likely is a distributed corporate
>>     intranet. The branch offices are very unlikely to connect with each
>>     other, but rather mostly connect with the central office (and maybe
>>     one alternate location).
>>
>>     If the communications capacity favors communicating with certain
>>     sites, then we should favor replicating to those sites. Communications
>>     capacity between corporate sites is typically provisioned (whether
>>     with dedicated lines or just VPN) and not something you will be able
>>     to just increase on demand instantly. Inter-site bandwidth is still
>>     expensive.
>>
>>     That said, there are still two important things to reach a consensus on:
>>
>>     * Are we talking about enabling the Swift proxy to access content
>>     that is at multiple sites, with each object linked to a specific site?
>>         Or are we creating a global namespace with eventual consistency,
>>     and smart assignment of objects to the sites where they are
>>         actually referenced? The first goal is certainly easier.
>>
>> Our initial idea was to create a global namespace, i.e. have a single
>> ring shared across all regions and containing all devices, and have
>> proxy-servers access data based on the ring location with a preference
>> for local servers. Now, after some work done on the replication
>> network feature, we understand that the most likely deployment
>> topology is regions with replication networks connected by a VPN of
>> some sort and storage networks totally isolated. In such a deployment,
>> no proxy server will ever access a remote region's storage server, so
>> there is no need for a global namespace for accessing data. What we
>> actually need a global namespace for is inter-region replication,
>> which brings us to the second question:
>>
>>     * What forms of site-to-site replication are we going to support? Is
>>     this something each system administrator specifies (such as
>>          by adding policies along the lines of "all new objects created
>>     at a branch office will be replicated to the two central sites on
>>          a daily basis. Only objects actually referenced at a branch
>>     office will be cached there.") or something more akin to how Swift
>>          operates locally where the user does not specify where specific
>>     things are stored?
>>
>>
>> It looks like we need a kind of 'ring-of-rings' and a server (or
>> servers) controlling inter-region replication in every region. This
>> server might be represented as a device with very high weight, or
>> some special device, which basically holds at least one replica of
>> most partitions (or each partition) in the cluster. This ensures that
>> the local replicators report the number of replicas in the local
>> cluster to the inter-region replicator. Inter-region replicators, in
>> turn, compare that number of replicas to the value recorded in the
>> 'ring-of-rings' and initiate cross-region replication if the local
>> region has lost all configured replicas of a partition.
>>
>>
>>
>>
>>
>>
>>     _______________________________________________
>>     OpenStack-dev mailing list
>>     OpenStack-dev at lists.openstack.org
>>     http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
>>
>>
>> --
>> Best regards,
>> Oleg Gelbukh
>> Mirantis, Inc.
>>
>>
>>
>
>
