[Openstack] Swift reliability

Andrew Hale andy at tannhauser-gate.org
Wed Sep 26 17:46:20 UTC 2012


Further to this, I wouldn't recommend using an ext filesystem for critical data in a Swift install. The auditor processes use zero-byte files to trigger re-replication, whereas orphaned files in a drive's lost+found wouldn't be audited. Removing files from the Swift directory structures doesn't trigger a hash invalidation the way a zero-byte file from XFS would.
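
For anyone wanting to look for these two cases by hand, here is a minimal sketch. It is not part of Swift. The function name, the /srv/node/sdb1 mount point and the ".data" suffix filter are assumptions for illustration; the filter is there because other zero-length files in the tree may be legitimate.

```python
import os


def find_suspect_files(mount_point, suffix=".data"):
    """Return (zero_byte_files, lost_found_entries) found under mount_point.

    zero_byte_files: files ending in `suffix` whose size is 0 bytes.
    lost_found_entries: every file under <mount_point>/lost+found.
    Both the suffix and the directory layout are assumptions.
    """
    zero_byte = []
    lost_found = []
    lf_dir = os.path.join(mount_point, "lost+found")

    for root, _dirs, files in os.walk(mount_point):
        in_lost_found = root == lf_dir or root.startswith(lf_dir + os.sep)
        for name in files:
            path = os.path.join(root, name)
            if in_lost_found:
                # Orphans recovered by fsck land here, outside the audited tree.
                lost_found.append(path)
                continue
            if not name.endswith(suffix):
                continue
            try:
                if os.path.getsize(path) == 0:
                    zero_byte.append(path)
            except OSError:
                # The file vanished or is unreadable; skip it.
                pass

    return sorted(zero_byte), sorted(lost_found)


if __name__ == "__main__":
    # Hypothetical mount point; adjust to the device being checked.
    empty, orphans = find_suspect_files("/srv/node/sdb1")
    print("zero-byte files:", len(empty))
    print("lost+found entries:", len(orphans))
```

Note that lost+found is normally readable only by root, so an unprivileged run will silently report no orphans.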

Andrew

----- Original Message -----
From: "John Dickinson" <me at not.mn>
To: "Phil Holden" <Phil.Holden at cognitomobile.com>
Cc: "openstack at lists.launchpad.net" <openstack at lists.launchpad.net>
Sent: Wednesday, September 26, 2012 6:37:05 PM
Subject: Re: [Openstack] Swift reliability

The 404s on object PUTs are probably related to the timeout errors you are seeing on the container servers. This may be because of IO contention on your hardware (e.g., overtaxed drives). How does the disk IO look on your physical hardware?

The disk full errors may be because you are running out of inodes on the filesystem. You can check this with `df -i`. This is possible if you are using many small files.
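
If it is easier to check this from a script on each data node, the same figure that `df -i` reports can be approximated with `os.statvfs`. This is only a sketch: the function names and the /srv/node/sdb1 path are assumptions, not anything Swift provides.

```python
import os


def inode_percent_used(total, free):
    """Percentage of inodes in use, or None if the filesystem reports no total."""
    if total == 0:
        return None
    return 100.0 * (total - free) / total


def inode_usage(path):
    """Inode usage (percent) for the filesystem that contains `path`."""
    st = os.statvfs(path)
    # f_files is the total number of inodes, f_ffree the number still free.
    return inode_percent_used(st.f_files, st.f_ffree)


if __name__ == "__main__":
    # Hypothetical mount point; fall back to "/" if it does not exist.
    path = "/srv/node/sdb1" if os.path.exists("/srv/node/sdb1") else "/"
    print(path, inode_usage(path))
```

A value near 100 would mean the filesystem can report "disk full" on new files even while `df` still shows free blocks.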

--John


On Sep 26, 2012, at 3:39 AM, Phil Holden <Phil.Holden at cognitomobile.com> wrote:

> Hello,
> 
> I have been continuing to run the Swift reliability test described at 
>    https://answers.launchpad.net/swift/+question/201627
> This is now using ext4 filesystems but continues to have some issues.  
> The test has been resized a little and now consists of 40 threads doing 
> a PUT with an object, then a GET on it some time later. Each thread will 
> eventually PUT 15,000 objects in 1 container per thread.  The object 
> number then wraps around and it should thereafter be over-writing 
> objects which already exist.  The data objects are very small, e.g.,
>    "Content of object 11234 in container 15-1 \n"
> The test is rate limited.  It has been run at up to 2,100 HTTP requests 
> (GET or PUT) per minute which is the expected traffic rate we want it to 
> support.  
> 
> The Swift cluster consists of a load balancer in front of 2 x Swift 
> proxies, in turn connected to 6 Swift data nodes. All these systems are 
> VMs in a managed cluster of physical servers and so may compete for 
> physical resources, but we think they are provisioned adequately for 
> this phase of testing.  Other tests have achieved over 3,500 HTTP 
> requests/minute using this cluster.  The rings are configured for 3 
> replicas of the data.  The Swift version is Essex (2012.1).  
> 
> A number of problems continue to be encountered with the test.  These 
> have been as follows:
> 
> The problems described in question 201627 (above) continued to occur 
> when XFS filesystems were used.  This problem is not seen if ext4 
> filesystems are used.  
> 
> The remaining problems have only been seen using ext4 filesystems.  They 
> occur after the test has been running for some time, several days.  
> Using xfs filesystems, the test gets stuck as in question 201627 before 
> encountering any of these.  
> 
> After the test has wrapped around on the object number that it is 
> writing, space usage continues to grow, eventually filling all the data 
> nodes.  If an object is over-written, replacing its contents, is the 
> old data freed immediately or is it left around, waiting to be tidied 
> away later by some clean-up process?  The object-expirer is being run on 
> one of the proxy nodes, but all objects should be over-written well 
> before their expiry time.  
> 
> On one occasion half the data nodes were completely filled at 100% and 
> the cluster overall became unresponsive.  This situation was solved by a 
> rolling restart where each of the data nodes is restarted, one-by-one.  
> 
> HTTP 404 : Not Found is repeatedly reported on a PUT to an object in an 
> existing container.  The test gets stuck on this until it is resolved.  
> This can often be resolved by a rolling restart where each of the data 
> nodes is restarted, one-by-one.  
> 
> One of the Swift proxy server processes became unresponsive.  This meant 
> that only half the requests succeeded, the ones which went through the 
> other proxy.  There was nothing evident in the logfiles.  The proxy 
> process did not respond to an ordinary kill (SIGTERM).  A SIGKILL was 
> needed to remove it.  The object-expirer which was running at the same 
> time on the same host did respond to SIGTERM and stopped.  Everything 
> continued normally after the proxy server and object-expirer were 
> restarted.  
> 
> 
> Further testing is being performed at a reduced rate of 525 HTTP 
> requests per minute (25% of the target rate) to see if this Swift 
> cluster will perform more reliably at this reduced rate. 
> 
> Can anyone shed any light on the problems described above and suggest 
> ways they could be prevented from happening?
> 
> 
> The overall purpose of the test is to determine if Swift can be reliably 
> used for storage of mission-critical data.  Obviously open source 
> software such as this comes with no warranty, but, in a similar manner 
> to making a judgement about use of the Linux kernel and filesystems and 
> related software for mission-critical activities, a judgement about the 
> use of Swift needs to be made.  This test is intended to support the 
> ability to make this decision.  
> 
> 
>                 Regards
>                   - Phil -
> 
> 
> 
> 
> 
> _______________________________________________
> Mailing list: https://launchpad.net/~openstack
> Post to     : openstack at lists.launchpad.net
> Unsubscribe : https://launchpad.net/~openstack
> More help   : https://help.launchpad.net/ListHelp





