[Openstack-operators] [Openstack] [SWIFT] Change the partition power to recreate the RING
Alejandro Comisario
alejandro.comisario at mercadolibre.com
Wed Jan 16 12:29:02 UTC 2013
Thanks everyone.
At first glance, it seems that the application using Swift is pushing
all of its PUT operations to a single container (100+ PUT/sec), so the
developers are making a quick change to spread the load across many
containers and scale horizontally, since we are getting a lot of
concurrency on a single container.
No doubt, if that's the problem, our next move is to move account/container
to SSD devices.
I'll keep you posted!
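
For reference, the kind of split described above could look roughly like this. It is only a sketch under assumptions: the helper name `pick_container`, the `uploads` base name and the shard count of 256 are made up for illustration and are not taken from the application. The idea is to hash the object name and use the result to choose one of N containers, so that writes no longer pile up on a single container DB.

```python
import hashlib


def pick_container(object_name, base="uploads", shards=256):
    """Map an object name to one of `shards` containers, deterministically.

    Hypothetical helper: `base` and `shards` are illustrative values only.
    """
    digest = hashlib.md5(object_name.encode("utf-8")).hexdigest()
    index = int(digest, 16) % shards
    return "%s_%03d" % (base, index)


if __name__ == "__main__":
    for name in ("img/0001.jpg", "img/0002.jpg", "img/0003.jpg"):
        print(name, "->", pick_container(name))
```

Because the mapping depends only on the object name, a reader can recompute the container without any lookup table. The trade-off is that changing the shard count later remaps existing names, so it is worth picking a generous count up front.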
Alejandro.
On Wed, Jan 16, 2013 at 5:13 AM, Ywang225 <ywang225 at 126.com> wrote:
> If you care about PUT performance, one thing to check: are you
> placing account and container data together with objects? If so, this can
> become a bottleneck; you could place account and container on dedicated
> nodes or dedicated faster disks. Of course, this involves ring changes.
>
> The other point is the parameters for the account and container servers:
> workers=48 seems too high and will increase contention when accessing the
> account or container DBs.
>
> -ywang
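
A quick way to put the workers=48 remark in numbers is to compare the configured value with the core count of a datanode (2 hexacore CPUs, so 12 cores, per the hardware described later in the thread). The sketch below is an illustration only: `workers_vs_cores` and `SAMPLE_CONF` are hypothetical names, and no particular ratio is claimed to be the right one.

```python
import configparser

# Trimmed-down stand-in for a container-server.conf like the one pasted below.
SAMPLE_CONF = """
[DEFAULT]
bind_port = 6011
workers = 48
"""


def workers_vs_cores(conf_text, cores):
    """Return (workers, workers per core) for a server config given as text."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    workers = int(parser.defaults()["workers"])
    return workers, workers / float(cores)


if __name__ == "__main__":
    workers, ratio = workers_vs_cores(SAMPLE_CONF, cores=12)
    print("workers=%d, %.1f workers per core" % (workers, ratio))
```

If each device runs its own account, container and object server (an assumption based on the per-device configs in this thread), the per-node total is several times higher than this single-file ratio suggests.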
>
> On 2013-1-15, at 4:01, Alejandro Comisario <alejandro.comisario at mercadolibre.com>
> wrote:
>
> Chuck et al.
>
> Let me go through the points one by one.
>
> #1 Even though "object-auditor" always runs and never stops, we
> stopped the swift-*-auditor processes and didn't see any improvement. Across
> all the datanodes we have an average of 8% iowait (using iostat). The only
> thing we see is the "xfsbuf" process running once in a while and causing 99%
> iowait for a second; we delayed the run interval for that process and didn't
> see any change either.
>
> Our object-auditor config for all devices is as follows:
>
> [object-auditor]
> files_per_second = 5
> zero_byte_files_per_second = 5
> bytes_per_second = 3000000
>
> #2 Our 12 proxies are 6 physical machines and 6 KVM instances running on Nova.
> Checking iftop, we are at an average of 15Mb/s of bandwidth usage, so I don't
> think we are saturating the network.
> #3 The overall idle CPU on all datanodes is 80%. I'm not sure how to check
> the CPU usage per worker, so let me paste the config for one device for object,
> account and container.
>
> *object-server.conf*
> *------------------*
> [DEFAULT]
> devices = /srv/node/sda3
> mount_check = false
> bind_port = 6010
> user = swift
> log_facility = LOG_LOCAL2
> log_level = DEBUG
> workers = 48
> disable_fallocate = true
>
> [pipeline:main]
> pipeline = object-server
>
> [app:object-server]
> use = egg:swift#object
>
> [object-replicator]
> vm_test_mode = yes
> concurrency = 8
> run_pause = 600
>
> [object-updater]
> concurrency = 8
>
> [object-auditor]
> files_per_second = 5
> zero_byte_files_per_second = 5
> bytes_per_second = 3000000
>
> *account-server.conf*
> *-------------------*
> [DEFAULT]
> devices = /srv/node/sda3
> mount_check = false
> bind_port = 6012
> user = swift
> log_facility = LOG_LOCAL2
> log_level = DEBUG
> workers = 48
> db_preallocation = on
> disable_fallocate = true
>
> [pipeline:main]
> pipeline = account-server
>
> [app:account-server]
> use = egg:swift#account
>
> [account-replicator]
> vm_test_mode = yes
> concurrency = 8
> run_pause = 600
>
> [account-auditor]
>
> [account-reaper]
>
> *container-server.conf*
> *---------------------*
> [DEFAULT]
> devices = /srv/node/sda3
> mount_check = false
> bind_port = 6011
> user = swift
> workers = 48
> log_facility = LOG_LOCAL2
> allow_versions = True
> disable_fallocate = true
>
> [pipeline:main]
> pipeline = container-server
>
> [app:container-server]
> use = egg:swift#container
> allow_versions = True
>
> [container-replicator]
> vm_test_mode = yes
> concurrency = 8
> run_pause = 500
>
> [container-updater]
> concurrency = 8
>
> [container-auditor]
>
> #4 We don't use SSL for Swift, so no latency there.
>
> Hope you guys can shed some light.
>
>
> Alejandro Comisario
> #melicloud CloudBuilders
> Arias 3751, Piso 7 (C1430CRG)
> Ciudad de Buenos Aires - Argentina
> Cel: +549(11) 15-3770-1857
> Tel : +54(11) 4640-8443
>
>
> On Mon, Jan 14, 2013 at 1:23 PM, Chuck Thier <cthier at gmail.com> wrote:
>
>> Hi Alejandro,
>>
>> I really doubt that partition size is causing these issues. It can be
>> difficult to debug these types of issues without access to the
>> cluster, but I can think of a couple of things to look at.
>>
>> 1. Check your disk io usage and io wait on the storage nodes. If
>> that seems abnormally high, then that could be one of the sources of
>> problems. If this is the case, then the first things that I would
>> look at are the auditors, as they can use up a lot of disk io if not
>> properly configured. I would try turning them off for a bit
>> (swift-*-auditor) and see if that makes any difference.
>>
>> 2. Check your network io usage. You haven't described what type of
>> network you have going to the proxies, but if they share a single GigE
>> interface, if my quick calculations are correct, you could be
>> saturating the network.
>>
>> 3. Check your CPU usage. I listed this one last as you have said
>> that you have already worked at tuning the number of workers (though I
>> would be interested to hear how many workers you have running for each
>> service). The main thing to look for, is to see if all of your
>> workers are maxed out on CPU, if so, then you may need to bump
>> workers.
>>
>> 4. SSL Termination? Where are you terminating the SSL connection?
>> If you are terminating SSL in Swift directly with the swift proxy,
>> then that could also be a source of issue. This was only meant for
>> dev and testing, and you should use an SSL terminating load balancer
>> in front of the swift proxies.
>>
>> That's what I could think of right off the top of my head.
>>
>> --
>> Chuck
>>
>> On Mon, Jan 14, 2013 at 5:45 AM, Alejandro Comisario
>> <alejandro.comisario at mercadolibre.com> wrote:
>> > Chuck / John.
>> > We are handling 50,000 requests per minute (of which 10,000+ are PUTs
>> > of small objects, from 10KB to 150KB).
>> >
>> > We are using Swift 1.7.4 with Keystone token caching, so no latency
>> > there.
>> > We have 12 proxies and 24 datanodes divided into 4 zones (each
>> > datanode has 48GB of RAM, 2 hexacore CPUs and 4 devices of 3TB each).
>> >
>> > The workers that are putting objects into Swift are seeing awful
>> > performance, and so are we, with peaks of 2 to 15 seconds per PUT
>> > operation coming from the datanodes.
>> > We tuned db_preallocation, disable_fallocate, workers and concurrency,
>> > but we can't reach the request rate that we need (24,000 PUTs per
>> > minute of small objects), and we can't find where the problem is,
>> > other than that it comes from the datanodes.
>> >
>> > Maybe it's worth pasting our config here?
>> > Thanks in advance.
>> >
>> > alejandro
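
The figures in this message can be turned into rough per-second numbers, which helps when judging the network and disk points raised in the reply. The sketch below is back-of-the-envelope arithmetic only. The proxy and device counts come from the message above; the replica count of 3 is an assumption (it is not stated in the thread), `put_load` is a hypothetical helper, and 1 KB is taken as 1000 bytes.

```python
def put_load(puts_per_minute, object_kb, proxies=12, devices=24 * 4, replicas=3):
    """Rough PUT load estimate. `replicas=3` is an assumed value."""
    puts_per_sec = puts_per_minute / 60.0
    # KB/s -> kbit/s -> Mbit/s, treating 1 KB as 1000 bytes.
    client_mbit = puts_per_sec * object_kb * 8 / 1000.0
    backend_mbit = client_mbit * replicas
    return {
        "puts_per_sec": puts_per_sec,
        "client_mbit_per_sec": client_mbit,
        "backend_mbit_per_sec": backend_mbit,
        "backend_mbit_per_sec_per_proxy": backend_mbit / proxies,
        "object_writes_per_sec_per_device": puts_per_sec * replicas / devices,
    }


if __name__ == "__main__":
    # Target rate from the thread, with the largest object size mentioned.
    for key, value in sorted(put_load(24000, 150).items()):
        print("%-34s %8.1f" % (key, value))
```

This counts object payload only. Each PUT also triggers container updates, which is where a single hot container would show up rather than in raw bandwidth.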
>> >
>> > On 12 Jan 2013 02:01, "Chuck Thier" <cthier at gmail.com> wrote:
>> >>
>> >> Looking at this from a different perspective. Having 2500 partitions
>> >> per drive shouldn't be an absolutely horrible thing either. Do you
>> >> know how many objects you have per partition? What types of problems
>> >> are you seeing?
>> >>
>> >> --
>> >> Chuck
>> >>
>> >> On Fri, Jan 11, 2013 at 3:28 PM, John Dickinson <me at not.mn> wrote:
>> >> > In effect, this would be a complete replacement of your rings, and
>> >> > that is essentially a whole new cluster. All of the existing data
>> >> > would need to be rehashed into the new ring before it is available.
>> >> >
>> >> > There is no process that rehashes the data to ensure that it is
>> >> > still in the correct partition. Replication only ensures that the
>> >> > partitions are on the right drives.
>> >> >
>> >> > To change the number of partitions, you will need to GET all of the
>> >> > data from the old ring and PUT it to the new ring. A more complicated
>> >> > (but perhaps more efficient) solution may include something like
>> >> > walking each drive and rehashing+moving the data to the right
>> >> > partition and then letting replication settle it down.
>> >> >
>> >> > Either way, 100% of your existing data will need to at least be
>> >> > rehashed (and probably moved). Your CPU (hashing), disks (read+write),
>> >> > RAM (directory walking), and network (replication) may all be limiting
>> >> > factors in how long it will take to do this. Your per-disk free space
>> >> > may also determine what method you choose.
>> >> >
>> >> > I would not expect any data loss while doing this, but you will
>> >> > probably have availability issues, depending on the data access
>> >> > patterns.
>> >> >
>> >> > I'd like to eventually see something in swift that allows for
>> >> > changing the partition power in existing rings, but that will be
>> >> > hard/tricky/non-trivial.
>> >> >
>> >> > Good luck.
>> >> >
>> >> > --John
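
To illustrate why every object is affected, here is a simplified sketch of how a ring maps an object path to a partition: hash the path, then keep the top `part_power` bits. It is modeled on the general scheme rather than copied from Swift; `partition_for`, the `changeme` suffix and the powers 18 and 14 are illustrative assumptions, not values from this cluster.

```python
import hashlib
import struct


def partition_for(path, part_power, hash_suffix=b"changeme"):
    """Simplified path-to-partition mapping (illustrative, not Swift's code).

    Hashes the path plus a cluster-wide suffix with MD5 and keeps the top
    `part_power` bits of the first 4 bytes. `hash_suffix` is a placeholder.
    """
    digest = hashlib.md5(path.encode("utf-8") + hash_suffix).digest()
    return struct.unpack_from(">I", digest)[0] >> (32 - part_power)


if __name__ == "__main__":
    path = "/AUTH_test/photos/cat.jpg"
    old = partition_for(path, 18)
    new = partition_for(path, 14)
    print("part_power 18 -> partition", old)
    print("part_power 14 -> partition", new)
    # Lowering the power by 4 drops the 4 lowest bits of the partition number.
    print("old >> 4 == new:", (old >> 4) == new)
```

Under this scheme the partition number changes for essentially every object when the power changes, so data ends up under the wrong partition directory on disk even though the object hash itself is the same. That matches the point above that replication alone cannot fix it, since replication moves whole partitions between drives and does not reassign objects to partitions.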
>> >> >
>> >> >
>> >> > On Jan 11, 2013, at 1:17 PM, Alejandro Comisario
>> >> > <alejandro.comisario at mercadolibre.com> wrote:
>> >> >
>> >> >> Hi guys.
>> >> >> We created a Swift cluster several months ago. The thing is that
>> >> >> right now we can't add hardware, and we configured lots of
>> >> >> partitions thinking about the final picture of the cluster.
>> >> >>
>> >> >> Today each datanode has 2500+ partitions per device, and even after
>> >> >> tuning the background processes (replicator, auditor & updater)
>> >> >> we really want to try to lower the partition power.
>> >> >>
>> >> >> Since it's not possible to do that without recreating the ring, we
>> >> >> have the luxury of being able to recreate it with a much lower
>> >> >> partition power and rebalance / deploy the new ring.
>> >> >>
>> >> >> The question is: with a working cluster holding *existing data*, is
>> >> >> it possible to do this and wait for the data to move around
>> >> >> *without data loss*?
>> >> >> If so, would it be reasonable to expect an improvement in overall
>> >> >> cluster performance?
>> >> >>
>> >> >> We have no problem with having a non-working cluster (while the
>> >> >> data moves) even for an entire weekend.
>> >> >>
>> >> >> Cheers.
>> >> >>
>> >> >>
>> >> >
>> >> >
>> >> > _______________________________________________
>> >> > Mailing list: https://launchpad.net/~openstack
>> >> > Post to : openstack at lists.launchpad.net
>> >> > Unsubscribe : https://launchpad.net/~openstack
>> >> > More help : https://help.launchpad.net/ListHelp
>> >> >
>>
>
> _______________________________________________
> OpenStack-operators mailing list
> OpenStack-operators at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
>