[Openstack-operators] [Openstack] [SWIFT] Change the partition power to recreate the RING

Chuck Thier cthier at gmail.com
Mon Jan 14 21:00:50 UTC 2013


Hey Alejandro,

Those are the most common causes I see when people have performance
problems with swift.  The other thing to check is the logs, to make sure
there are no major issues (bad drives, misconfigured nodes, etc.) that
could add latency to the requests.  After that, I'm starting to run out of
the usual suspects, and it might be worth contracting with one of the many
swift consulting companies to help you out.  If you have time and can hop
on #openstack-swift on freenode IRC, we might be able to have a more
interactive discussion, or someone else may come up with some ideas.

--
Chuck


On Mon, Jan 14, 2013 at 2:01 PM, Alejandro Comisario <
alejandro.comisario at mercadolibre.com> wrote:

> Chuck et al.
>
> Let me go through the points one by one.
>
> #1 Even though the "object-auditor" always runs and never stops, we
> stopped the swift-*-auditor processes and didn't see any improvement.
> Across all the datanodes we have an average of 8% IO-WAIT (using iostat).
> The only thing we see is the "xfsbuf" process running once in a while,
> causing 99% iowait for a second; we delayed the run interval for that
> process and didn't see any change either.
>
> Our object-auditor config for all devices is as follows:
>
> [object-auditor]
> files_per_second = 5
> zero_byte_files_per_second = 5
> bytes_per_second = 3000000
>
> #2 Our 12 proxies are 6 physical machines and 6 KVM instances running on
> nova. Checking iftop, we are at an average of 15Mb/s of bandwidth usage, so
> I don't think we are saturating the network.
> #3 The overall idle CPU on all datanodes is 80%. I'm not sure how to check
> the CPU usage per worker, so let me paste the config for one device for the
> object, account and container servers.
>
> *object-server.conf*
> *------------------*
> [DEFAULT]
> devices = /srv/node/sda3
> mount_check = false
> bind_port = 6010
> user = swift
> log_facility = LOG_LOCAL2
> log_level = DEBUG
> workers = 48
> disable_fallocate = true
>
> [pipeline:main]
> pipeline = object-server
>
> [app:object-server]
> use = egg:swift#object
>
> [object-replicator]
> vm_test_mode = yes
> concurrency = 8
> run_pause = 600
>
> [object-updater]
> concurrency = 8
>
> [object-auditor]
> files_per_second = 5
> zero_byte_files_per_second = 5
> bytes_per_second = 3000000
>
> *account-server.conf*
> *-------------------*
> [DEFAULT]
> devices = /srv/node/sda3
> mount_check = false
> bind_port = 6012
> user = swift
> log_facility = LOG_LOCAL2
> log_level = DEBUG
> workers = 48
> db_preallocation = on
> disable_fallocate = true
>
> [pipeline:main]
> pipeline = account-server
>
> [app:account-server]
> use = egg:swift#account
>
> [account-replicator]
> vm_test_mode = yes
> concurrency = 8
> run_pause = 600
>
> [account-auditor]
>
> [account-reaper]
>
> *container-server.conf*
> *---------------------*
> [DEFAULT]
> devices = /srv/node/sda3
> mount_check = false
> bind_port = 6011
> user = swift
> workers = 48
> log_facility = LOG_LOCAL2
> allow_versions = True
> disable_fallocate = true
>
> [pipeline:main]
> pipeline = container-server
>
> [app:container-server]
> use = egg:swift#container
> allow_versions = True
>
> [container-replicator]
> vm_test_mode = yes
> concurrency = 8
> run_pause = 500
>
> [container-updater]
> concurrency = 8
>
> [container-auditor]
>
> #4 We don't use SSL for swift, so no latency there.
>
> Hope you guys can shed some light.
>
>
> *Alejandro Comisario
> #melicloud CloudBuilders*
> Arias 3751, Piso 7 (C1430CRG)
> Ciudad de Buenos Aires - Argentina
> Cel: +549(11) 15-3770-1857
> Tel : +54(11) 4640-8443
>
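One rough way to read the worker settings in the configs above is to multiply them out per datanode. The sketch below is illustrative arithmetic only: the 48 workers, 4 devices and 2 hexa-core CPUs come from the thread, while the idea that every device runs its own object, account and container server with that worker count is an assumption the thread does not confirm. The function names are hypothetical.

```python
# Figures quoted in the thread: workers = 48 in each pasted config,
# 4 devices per datanode, 2 hexa-core CPUs (12 cores) per datanode.
# ASSUMPTION: each device runs its own object, account and container
# server, and each of those servers uses the pasted worker count.

def total_workers(workers_per_server: int, servers_per_device: int, devices: int) -> int:
    """Worker processes on one datanode under the assumption above."""
    return workers_per_server * servers_per_device * devices


def workers_per_core(total: int, cores: int) -> float:
    """How many worker processes compete for each CPU core."""
    return total / cores


if __name__ == "__main__":
    total = total_workers(workers_per_server=48, servers_per_device=3, devices=4)
    print("workers per datanode:", total)
    print("workers per core:", workers_per_core(total, cores=12))
```

If the assumption holds, the ratio of workers to cores is one more number to compare against the per-worker CPU usage asked about in the reply below.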
>
> On Mon, Jan 14, 2013 at 1:23 PM, Chuck Thier <cthier at gmail.com> wrote:
>
>> Hi Alejandro,
>>
>> I really doubt that partition size is causing these issues.  It can be
>> difficult to debug these types of issues without access to the
>> cluster, but I can think of a couple of things to look at.
>>
>> 1.  Check your disk io usage and io wait on the storage nodes.  If
>> that seems abnormally high, then that could be one of the sources of
>> problems.  If this is the case, then the first things that I would
>> look at are the auditors, as they can use up a lot of disk io if not
>> properly configured.  I would try turning them off for a bit
>> (swift-*-auditor) and see if that makes any difference.
>>
>> 2.  Check your network io usage.  You haven't described what type of
>> network you have going to the proxies, but if they share a single GigE
>> interface, if my quick calculations are correct, you could be
>> saturating the network.
>>
>> 3.  Check your CPU usage.  I listed this one last as you have said
>> that you have already worked at tuning the number of workers (though I
>> would be interested to hear how many workers you have running for each
>> service).  The main thing to look for, is to see if all of your
>> workers are maxed out on CPU, if so, then you may need to bump
>> workers.
>>
>> 4.  SSL Termination?  Where are you terminating the SSL connection?
>> If you are terminating SSL in Swift directly with the swift proxy,
>> then that could also be a source of issue.  This was only meant for
>> dev and testing, and you should use an SSL terminating load balancer
>> in front of the swift proxies.
>>
>> That's what I could think of right off the top of my head.
>>
>> --
>> Chuck
>>
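The "quick calculations" on network saturation in point 2 can be sketched roughly as follows. This is a back-of-the-envelope estimate, not a measurement: the 24,000 PUTs per minute target and the 10KB to 150KB object sizes come from the thread, while the replica count of 3, the 1 KB = 1000 bytes convention and the function name are assumptions. HTTP overhead and GET traffic are ignored.

```python
def put_bandwidth_mbps(puts_per_minute: float, object_kb: float, copies: int = 1) -> float:
    """Payload bandwidth in megabits per second (1 KB = 1000 bytes)."""
    bytes_per_second = puts_per_minute / 60 * object_kb * 1000 * copies
    return bytes_per_second * 8 / 1e6


TARGET_PUTS_PER_MINUTE = 24000  # target rate quoted in the thread
REPLICAS = 3                    # ASSUMPTION: not stated in the thread

if __name__ == "__main__":
    for size_kb in (10, 150):
        # Client -> proxy traffic: one copy of each object.
        inbound = put_bandwidth_mbps(TARGET_PUTS_PER_MINUTE, size_kb)
        # Proxy -> storage traffic: the proxy sends one copy per replica.
        outbound = put_bandwidth_mbps(TARGET_PUTS_PER_MINUTE, size_kb, copies=REPLICAS)
        print(f"{size_kb} KB objects: {inbound:.0f} Mb/s in, {outbound:.0f} Mb/s out (cluster-wide)")
```

These totals are for the whole cluster; how close they come to a single GigE link depends on how many proxies share the load and how their interfaces are laid out.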
>> On Mon, Jan 14, 2013 at 5:45 AM, Alejandro Comisario
>> <alejandro.comisario at mercadolibre.com> wrote:
>> > Chuck / John.
>> > We are handling 50,000 requests per minute (of which 10,000+ are PUTs of
>> > small objects, from 10KB to 150KB).
>> >
>> > We are using swift 1.7.4 with keystone token caching, so no latency
>> > there.
>> > We have 12 proxies and 24 datanodes divided into 4 zones (each datanode
>> > has 48GB of RAM, 2 hexa-core CPUs and 4 devices of 3TB each).
>> >
>> > The workers that are putting objects into swift are seeing awful
>> > performance, and so are we, with peaks of 2 to 15 seconds per PUT
>> > operation coming from the datanodes.
>> > We tuned db_preallocation, disable_fallocate, workers and concurrency,
>> > but we can't reach the request rate that we need (24,000 PUTs per minute
>> > of small objects), and we can't find where the problem is, other than
>> > that it comes from the datanodes.
>> >
>> > Maybe it's worth pasting our config here?
>> > Thanks in advance.
>> >
>> > alejandro
>> >
>> > On 12 Jan 2013 02:01, "Chuck Thier" <cthier at gmail.com> wrote:
>> >>
>> >> Looking at this from a different perspective.  Having 2500 partitions
>> >> per drive shouldn't be an absolutely horrible thing either.  Do you
>> >> know how many objects you have per partition?  What types of problems
>> >> are you seeing?
>> >>
>> >> --
>> >> Chuck
>> >>
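To put the "2500 partitions per drive" figure in context, the number of partition replicas a device holds follows from the partition power, the replica count and the number of devices. The sketch below is a minimal illustration assuming equally weighted devices: the 24 datanodes with 4 devices each come from the thread, while the replica count of 3 and the function name are assumptions. The thread does not state the cluster's actual partition power, so the loop just prints a range.

```python
def replica_partitions_per_device(part_power: int, replicas: int, devices: int) -> float:
    """Average partition replicas per device, assuming equal device weights."""
    return (2 ** part_power) * replicas / devices


DEVICES = 24 * 4  # 24 datanodes x 4 devices each, as described in the thread
REPLICAS = 3      # ASSUMPTION: not stated in the thread

if __name__ == "__main__":
    for part_power in range(14, 19):
        per_device = replica_partitions_per_device(part_power, REPLICAS, DEVICES)
        print(f"part_power={part_power}: {per_device:.0f} partitions per device")
```

Each step down in partition power halves the per-device count, which is why lowering it looks attractive when hardware cannot be added.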
>> >> On Fri, Jan 11, 2013 at 3:28 PM, John Dickinson <me at not.mn> wrote:
>> >> > In effect, this would be a complete replacement of your rings, and that
>> >> > is essentially a whole new cluster. All of the existing data would need
>> >> > to be rehashed into the new ring before it is available.
>> >> >
>> >> > There is no process that rehashes the data to ensure that it is still
>> >> > in the correct partition. Replication only ensures that the partitions
>> >> > are on the right drives.
>> >> >
>> >> > To change the number of partitions, you will need to GET all of the
>> >> > data from the old ring and PUT it to the new ring. A more complicated
>> >> > (but perhaps more efficient) solution may include something like
>> >> > walking each drive and rehashing+moving the data to the right partition
>> >> > and then letting replication settle it down.
>> >> >
>> >> > Either way, 100% of your existing data will need to at least be
>> >> > rehashed (and probably moved). Your CPU (hashing), disks (read+write),
>> >> > RAM (directory walking), and network (replication) may all be limiting
>> >> > factors in how long it will take to do this. Your per-disk free space
>> >> > may also determine what method you choose.
>> >> >
>> >> > I would not expect any data loss while doing this, but you will
>> >> > probably have availability issues, depending on the data access
>> >> > patterns.
>> >> >
>> >> > I'd like to eventually see something in swift that allows for changing
>> >> > the partition power in existing rings, but that will be
>> >> > hard/tricky/non-trivial.
>> >> >
>> >> > Good luck.
>> >> >
>> >> > --John
>> >> >
>> >> >
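Why every object has to be rehashed becomes clearer from how a ring maps a name to a partition. The following is a simplified sketch modelled on Swift's approach (MD5 of the object path plus a cluster-wide suffix, keeping the top `part_power` bits of the first four bytes). The suffix value, the example names and the function name are hypothetical, and this is not the library's actual code.

```python
import hashlib
import struct

# ASSUMPTION: placeholder value; a real cluster sets its own secret suffix.
HASH_PATH_SUFFIX = b"example-suffix"


def object_partition(account: str, container: str, obj: str, part_power: int) -> int:
    """Map an object name to a partition number for a given partition power."""
    path = "/" + "/".join((account, container, obj))
    digest = hashlib.md5(path.encode("utf-8") + HASH_PATH_SUFFIX).digest()
    # Interpret the first 4 bytes as a big-endian unsigned integer and keep
    # only its top `part_power` bits.
    top32 = struct.unpack_from(">I", digest)[0]
    return top32 >> (32 - part_power)


if __name__ == "__main__":
    name = ("AUTH_test", "photos", "cat.jpg")  # hypothetical object
    old = object_partition(*name, part_power=17)
    new = object_partition(*name, part_power=16)
    print("partition at power 17:", old)
    print("partition at power 16:", new)
    print("new equals old >> 1:", new == old >> 1)
```

The object's hash itself does not change, but the partition number derived from it does, and because that number is part of where the data is stored, data filed under the old partition numbers is not where the new ring expects it.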
>> >> > On Jan 11, 2013, at 1:17 PM, Alejandro Comisario
>> >> > <alejandro.comisario at mercadolibre.com> wrote:
>> >> >
>> >> >> Hi guys.
>> >> >> We created a swift cluster several months ago. The thing is that
>> >> >> right now we can't add hardware, and we configured lots of partitions
>> >> >> thinking about the final picture of the cluster.
>> >> >>
>> >> >> Today each datanode has 2500+ partitions per device, and even after
>> >> >> tuning the background processes (replicator, auditor & updater) we
>> >> >> really want to try to lower the partition power.
>> >> >>
>> >> >> Since it's not possible to do that without recreating the ring, we
>> >> >> have the luxury of being able to recreate it with a much lower
>> >> >> partition power, and rebalance / deploy the new ring.
>> >> >>
>> >> >> The question is: with a working cluster holding *existing data*, is it
>> >> >> possible to do this and wait for the data to move around *without data
>> >> >> loss*?
>> >> >> If so, is it reasonable to expect an improvement in the overall
>> >> >> cluster performance?
>> >> >>
>> >> >> We have no problem with having a non-working cluster (while the data
>> >> >> moves) even for an entire weekend.
>> >> >>
>> >> >> Cheers.
>> >> >>
>> >> >>
>> >> >
>> >> >
>> >> > _______________________________________________
>> >> > Mailing list: https://launchpad.net/~openstack
>> >> > Post to     : openstack at lists.launchpad.net
>> >> > Unsubscribe : https://launchpad.net/~openstack
>> >> > More help   : https://help.launchpad.net/ListHelp
>> >> >
>>
>
>