Open Stack

Wed Sep 26 10:39:13 UTC 2012

Hello,

I have been continuing to run the Swift reliability test described at 
    https://answers.launchpad.net/swift/+question/201627
The test now uses ext4 filesystems, but some issues remain.  
It has been resized a little and now consists of 40 threads, each doing 
a PUT of an object, then a GET on it some time later.  Each thread will 
eventually PUT 15,000 objects into its own container.  The object 
number then wraps around, so from that point on the test should be 
over-writing objects which already exist.  The data objects are very 
small, e.g.,
    "Content of object 11234 in container 15-1 \n"
The test is rate limited.  It has been run at up to 2,100 HTTP requests 
(GET or PUT) per minute, which is the traffic rate we want the cluster 
to support.  
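
The workload arithmetic above can be sketched as follows.  This is only 
an illustration, not the actual test harness: the function names are 
hypothetical, zero-based object numbering is an assumption, and the 
container name is passed in because its naming scheme is not described 
here.  

```python
THREADS = 40                        # worker threads, one container each
OBJECTS_PER_CONTAINER = 15000       # object number wraps after this many PUTs
TARGET_REQUESTS_PER_MINUTE = 2100   # GETs and PUTs combined


def object_index(sequence_number):
    """Map a thread's running PUT count onto an object number.

    Assumes zero-based numbering; after 15,000 PUTs the index wraps and
    the thread starts over-writing existing objects.
    """
    return sequence_number % OBJECTS_PER_CONTAINER


def object_body(index, container):
    """Build the small payload, following the example quoted in the text."""
    return "Content of object %d in container %s \n" % (index, container)


def per_thread_interval(requests_per_minute=TARGET_REQUESTS_PER_MINUTE,
                        threads=THREADS):
    """Seconds each thread leaves between its own requests so that all
    threads together reach the overall request rate."""
    return 60.0 * threads / requests_per_minute


# Distinct objects in the cluster once every thread has wrapped around.
TOTAL_OBJECTS = THREADS * OBJECTS_PER_CONTAINER
```

At the target rate each of the 40 threads would issue a request a little 
more often than once every 1.2 seconds, and the steady-state object count 
is 600,000 before replication.  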

The Swift cluster consists of a load balancer in front of 2 x Swift 
proxies, in turn connected to 6 Swift data nodes. All these systems are 
VMs in a managed cluster of physical servers and so may compete for 
physical resources, but we think they are provisioned adequately for 
this phase of testing.  Other tests have achieved over 3,500 HTTP 
requests/minute using this cluster.  The rings are configured for 3 
replicas of the data.  The Swift version is Essex (2012.1).  

A number of problems continue to be encountered with the test.  These 
have been as follows:

The problems described in question 201627 (above) continued to occur 
when XFS filesystems were used.  They are not seen if ext4 filesystems 
are used.  

The remaining problems have only been seen using ext4 filesystems.  They 
occur after the test has been running for some time (several days).  
Using XFS filesystems, the test gets stuck as in question 201627 before 
encountering any of these.  

After the test has wrapped around on the object number that it is 
writing, space usage continues to grow, eventually filling all the data 
nodes.  If an object is over-written, replacing its contents, is the 
old data freed immediately or is it left around, waiting to be tidied 
away later by some clean-up process?  The object-expirer is being run on 
one of the proxy nodes, but all objects should be over-written well 
before their expiry time.  
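
One way to investigate the space growth would be to check on a data node 
whether superseded copies of an object are still on disk.  The following 
is a minimal sketch, assuming the usual Swift on-disk layout in which 
each object has its own directory holding files named after their 
timestamp with a ".data" suffix, and that those timestamps are 
fixed-width so that sorting the names orders them by age.  The function 
name and the root path passed to it are hypothetical.  

```python
import os


def find_stale_data_files(objects_root):
    """Return {directory: [older .data files]} for every directory under
    objects_root that holds more than one .data file.

    Assumes file names are fixed-width timestamps, so the last name after
    sorting is the newest copy and the others are superseded.
    """
    stale = {}
    for dirpath, _dirnames, filenames in os.walk(objects_root):
        data_files = sorted(f for f in filenames if f.endswith(".data"))
        if len(data_files) > 1:
            stale[dirpath] = data_files[:-1]
    return stale
```

It could be called with the objects directory of one device (for 
example a path such as /srv/node/<device>/objects, which is an 
assumption about the deployment).  An empty result would suggest the 
growth comes from somewhere other than leftover object copies.  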

On one occasion half the data nodes were 100% full and the cluster 
overall became unresponsive.  This situation was resolved by a rolling 
restart, in which each of the data nodes was restarted one by one.  

HTTP 404 (Not Found) is repeatedly reported on a PUT to an object in an 
existing container.  The test gets stuck on this until it is resolved.  
This can often be resolved by a rolling restart, in which each of the 
data nodes is restarted one by one.  

One of the Swift proxy server processes became unresponsive.  This meant 
that only half the requests succeeded, the ones which went through the 
other proxy.  There was nothing evident in the logfiles.  The proxy 
process did not respond to an ordinary kill (SIGTERM).  A SIGKILL was 
needed to remove it.  The object-expirer which was running at the same 
time on the same host did respond to SIGTERM and stopped.  Everything 
continued normally after the proxy server and object-expirer were 
restarted.  
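
The SIGTERM-then-SIGKILL sequence used above can be sketched as follows.  
This is a minimal illustration, assuming Python 3 on a POSIX system and 
a process started from the same script; it is not how the Swift proxy 
was actually stopped, and the function name and grace period are 
hypothetical.  

```python
import subprocess


def stop_process(proc, grace_seconds=10.0):
    """Ask a child process to exit, escalating if it does not respond.

    proc is a subprocess.Popen object.  Returns the name of the signal
    that was needed to stop it.
    """
    proc.terminate()                  # sends SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        proc.kill()                   # sends SIGKILL on POSIX
        proc.wait()                   # reap the process
        return "SIGKILL"
```

For a daemon that was not started as a child, the same escalation would 
have to be done by process ID instead.  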

Further testing is being performed at a reduced rate of 525 HTTP 
requests per minute (25% of the target rate) to see if this Swift 
cluster will perform more reliably at this reduced rate. 

Can anyone shed any light on the problems described above and suggest 
ways they could be prevented from happening?  

The overall purpose of the test is to determine if Swift can be reliably 
used for storage of mission-critical data.  Obviously open source 
software such as this comes with no warranty, but, in a similar manner 
to making a judgement about use of the Linux kernel and filesystems and 
related software for mission-critical activities, a judgement about the 
use of Swift needs to be made.  This test is intended to inform that 
decision.  

                 Regards
                   - Phil -


[Openstack] Swift reliability