[Openstack-operators] [Openstack] [SWIFT] Change the partition power to recreate the RING

Alejandro Comisario alejandro.comisario at mercadolibre.com
Mon Jan 14 20:01:25 UTC 2013


Chuck et al.

Let me go through the points one by one.

#1 Even though the object-auditor always runs and never stops, we
stopped the swift-*-auditor processes and didn't see any improvement.
Across all the datanodes we average 8% IO-wait (using iostat); the only
thing we see is the kernel thread "xfsbufd" running once in a while,
causing 99% iowait for a second. We delayed the run interval for that
process and didn't see any change either.

Our object-auditor config for all devices is as follows:

[object-auditor]
files_per_second = 5
zero_byte_files_per_second = 5
bytes_per_second = 3000000
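
As a sanity check on whether the auditor could even generate much IO
under those limits, here is a rough back-of-the-envelope sketch (plain
Python; the object count and average object size are assumptions, not
measured values):

# Rough sketch: how long a full audit pass of one 3TB device would
# take under the rate limits above. Object size is an assumption.
files_per_second = 5
bytes_per_second = 3000000           # 3 MB/s, from the config above
disk_bytes = 3 * 10**12              # one 3TB device
avg_object_bytes = 100 * 1024        # assume ~100KB average object
num_objects = disk_bytes // avg_object_bytes

days_by_bytes = disk_bytes / float(bytes_per_second) / 86400
days_by_files = num_objects / float(files_per_second) / 86400
print("byte limit : %.1f days per pass" % days_by_bytes)   # ~11.6
print("file limit : %.1f days per pass" % days_by_files)   # ~67.8

At 3MB/s a full pass of one device takes on the order of 12 days
anyway, so it makes sense that stopping the auditors changed nothing;
they can't be generating much of that 8% IO-wait.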

#2 Our 12 proxies are 6 physical machines and 6 KVM instances running
on Nova; checking iftop we average about 15Mb/s of bandwidth usage, so
I don't think we are saturating the network.
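
To put numbers on Chuck's point 2, a rough sketch using the figures
from my earlier mail (10,000+ PUTs/min of 10-150KB objects; the replica
count of 3 and the use of the max object size are assumptions):

# Worst-case bandwidth estimate for the PUT traffic described below.
puts_per_minute = 10000
max_object_bytes = 150 * 1024
replicas = 3
proxies = 12

inbound = puts_per_minute * max_object_bytes / 60.0   # bytes/s into proxies
to_disks = inbound * replicas                         # proxies -> datanodes
print("proxy inbound total : %.0f Mb/s" % (inbound * 8 / 1e6))            # ~205
print("per proxy           : %.0f Mb/s" % (inbound * 8 / 1e6 / proxies))  # ~17
print("to datanodes total  : %.0f Mb/s" % (to_disks * 8 / 1e6))           # ~614

That comes out to roughly 17Mb/s inbound per proxy, which lines up with
the ~15Mb/s we see in iftop, so unless everything funnels through a
single GigE link somewhere, the network does look fine.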
#3 The overall idle CPU on all datanodes is 80%. I'm not sure how to
check the CPU usage per worker (a rough sketch of one way to do it
follows the configs); let me paste the config for a device for object,
account and container.

*object-server.conf*
*------------------*
[DEFAULT]
devices = /srv/node/sda3
mount_check = false
bind_port = 6010
user = swift
log_facility = LOG_LOCAL2
log_level = DEBUG
workers = 48
disable_fallocate = true

[pipeline:main]
pipeline = object-server

[app:object-server]
use = egg:swift#object

[object-replicator]
vm_test_mode = yes
concurrency = 8
run_pause = 600

[object-updater]
concurrency = 8

[object-auditor]
files_per_second = 5
zero_byte_files_per_second = 5
bytes_per_second = 3000000

*account-server.conf*
*-------------------*
[DEFAULT]
devices = /srv/node/sda3
mount_check = false
bind_port = 6012
user = swift
log_facility = LOG_LOCAL2
log_level = DEBUG
workers = 48
db_preallocation = on
disable_fallocate = true

[pipeline:main]
pipeline = account-server

[app:account-server]
use = egg:swift#account

[account-replicator]
vm_test_mode = yes
concurrency = 8
run_pause = 600

[account-auditor]

[account-reaper]

*container-server.conf*
*---------------------*
[DEFAULT]
devices = /srv/node/sda3
mount_check = false
bind_port = 6011
user = swift
workers = 48
log_facility = LOG_LOCAL2
allow_versions = True
disable_fallocate = true

[pipeline:main]
pipeline = container-server

[app:container-server]
use = egg:swift#container
allow_versions = True

[container-replicator]
vm_test_mode = yes
concurrency = 8
run_pause = 500

[container-updater]
concurrency = 8

[container-auditor]
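
And since I wasn't sure how to measure per-worker CPU, here is a
minimal sketch of one way to do it (assumes the third-party psutil
package is installed; matching on the command line is a guess and may
need adjusting for your setup):

# Minimal sketch: sample CPU usage per swift worker over one second.
import time
import psutil

workers = []
for p in psutil.process_iter():
    try:
        cmd = ' '.join(p.cmdline())
    except psutil.Error:
        continue
    if 'swift-object-server' in cmd or 'swift-container-server' in cmd \
            or 'swift-account-server' in cmd:
        workers.append(p)

for p in workers:
    try:
        p.cpu_percent(None)     # prime the per-process counter
    except psutil.Error:
        pass
time.sleep(1.0)                 # sampling window
for p in workers:
    try:
        print("%6d %5.1f%% %s" % (p.pid, p.cpu_percent(None),
                                  ' '.join(p.cmdline()[:2])))
    except psutil.Error:
        pass

If individual workers sit near 100% while the box is 80% idle overall,
we would need to bump workers per Chuck's point 3; with 48 workers per
service on a 12-core box I doubt it, but it's worth confirming.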

#4 We don't use SSL for swift, so no latency there.

Hope you guys can shed some light.


*Alejandro Comisario
#melicloud CloudBuilders*
Arias 3751, Piso 7 (C1430CRG)
Ciudad de Buenos Aires - Argentina
Cel: +549(11) 15-3770-1857
Tel : +54(11) 4640-8443


On Mon, Jan 14, 2013 at 1:23 PM, Chuck Thier <cthier at gmail.com> wrote:

> Hi Alejandro,
>
> I really doubt that partition size is causing these issues.  It can be
> difficult to debug these types of issues without access to the
> cluster, but I can think of a couple of things to look at.
>
> 1.  Check your disk io usage and io wait on the storage nodes.  If
> that seems abnormally high, then that could be one of the sources of
> problems.  If this is the case, then the first things that I would
> look at are the auditors, as they can use up a lot of disk io if not
> properly configured.  I would try turning them off for a bit
> (swift-*-auditor) and see if that makes any difference.
>
> 2.  Check your network io usage.  You haven't described what type of
> network you have going to the proxies, but if they share a single GigE
> interface, then if my quick calculations are correct, you could be
> saturating the network.
>
> 3.  Check your CPU usage.  I listed this one last as you have said
> that you have already worked at tuning the number of workers (though I
> would be interested to hear how many workers you have running for each
> service).  The main thing to look for is whether all of your workers
> are maxed out on CPU; if so, you may need to bump the number of
> workers.
>
> 4.  SSL Termination?  Where are you terminating the SSL connection?
> If you are terminating SSL directly with the swift proxy, then that
> could also be a source of issues.  That path was only meant for dev
> and testing; you should use an SSL-terminating load balancer in front
> of the swift proxies.
>
> That's what I could think of right off the top of my head.
>
> --
> Chuck
>
> On Mon, Jan 14, 2013 at 5:45 AM, Alejandro Comisario
> <alejandro.comisario at mercadolibre.com> wrote:
> > Chuck / John.
> > We are handling 50,000 requests per minute (where 10,000+ are PUTs
> > of small objects, from 10KB to 150KB).
> >
> > We are using swift 1.7.4 with keystone token caching, so no latency
> > there.
> > We have 12 proxies and 24 datanodes divided into 4 zones (each
> > datanode has 48GB of RAM, 2 hexa-core CPUs and 4 devices of 3TB
> > each).
> >
> > The workers that are putting objects into swift are seeing awful
> > performance, and so are we, with peaks of 2 to 15 seconds per PUT
> > operation coming from the datanodes.
> > We tuned db_preallocation, disable_fallocate, workers and
> > concurrency, but we can't reach the request rate we need (24,000
> > PUTs per minute of small objects), and we can't seem to find where
> > the problem is, other than on the datanodes.
> >
> > Maybe worth pasting our config over here?
> > Thanks in advance.
> >
> > alejandro
> >
> > On 12 Jan 2013 02:01, "Chuck Thier" <cthier at gmail.com> wrote:
> >>
> >> Looking at this from a different perspective.  Having 2500 partitions
> >> per drive shouldn't be an absolutely horrible thing either.  Do you
> >> know how many objects you have per partition?  What types of problems
> >> are you seeing?
> >>
> >> --
> >> Chuck
> >>
> >> On Fri, Jan 11, 2013 at 3:28 PM, John Dickinson <me at not.mn> wrote:
> >> > In effect, this would be a complete replacement of your rings,
> >> > and that is essentially a whole new cluster. All of the existing
> >> > data would need to be rehashed into the new ring before it is
> >> > available.
> >> >
> >> > There is no process that rehashes the data to ensure that it is
> >> > still in the correct partition. Replication only ensures that the
> >> > partitions are on the right drives.
> >> >
> >> > To change the number of partitions, you will need to GET all of
> >> > the data from the old ring and PUT it to the new ring. A more
> >> > complicated (but perhaps more efficient) solution may include
> >> > something like walking each drive and rehashing+moving the data
> >> > to the right partition and then letting replication settle it
> >> > down.
> >> >
> >> > Either way, 100% of your existing data will need to at least be
> >> > rehashed (and probably moved). Your CPU (hashing), disks
> >> > (read+write), RAM (directory walking), and network (replication)
> >> > may all be limiting factors in how long it will take to do this.
> >> > Your per-disk free space may also determine what method you
> >> > choose.
> >> >
> >> > I would not expect any data loss while doing this, but you will
> >> > probably have availability issues, depending on the data access
> >> > patterns.
> >> >
> >> > I'd like to eventually see something in swift that allows for changing
> >> > the partition power in existing rings, but that will be
> >> > hard/tricky/non-trivial.
> >> >
> >> > Good luck.
> >> >
> >> > --John
> >> >
> >> >
> >> > On Jan 11, 2013, at 1:17 PM, Alejandro Comisario
> >> > <alejandro.comisario at mercadolibre.com> wrote:
> >> >
> >> >> Hi guys.
> >> >> We created a swift cluster several months ago; the thing is
> >> >> that right now we can't add hardware, and we configured lots of
> >> >> partitions thinking about the final picture of the cluster.
> >> >>
> >> >> Today each datanode has 2500+ partitions per device, and even
> >> >> after tuning the background processes (replicator, auditor &
> >> >> updater) we really want to try to lower the partition power.
> >> >>
> >> >> Since it's not possible to do that without recreating the ring,
> >> >> we have the luxury of recreating it with a much lower partition
> >> >> power, and rebalancing / deploying the new ring.
> >> >>
> >> >> The question is: with a working cluster holding *existing
> >> >> data*, is it possible to do this and wait for the data to move
> >> >> around *without data loss*? If so, would it be reasonable to
> >> >> expect an improvement in the overall cluster performance?
> >> >>
> >> >> We have no problem with having a non-working cluster (while
> >> >> moving the data), even for an entire weekend.
> >> >>
> >> >> Cheers.
> >> >>
> >> >>
> >> >
> >> >