[Openstack] Disk Full. What To Do?
Emre Sokullu
emre at groups-inc.com
Sun Aug 3 01:33:06 UTC 2014
Hi folks,
This is Emre from GROU.PS -- we have been operating an OpenStack Swift
cluster since 2011, and it has served us well.
We have a standard installation with a single proxy server (proxy1) and
three storage servers (storage1, storage2, storage3) each with 5x1TB disks.
Following a chain of mistakes initiated by our hosting provider, who
replaced the wrong disk on one of our OpenStack Swift storage servers, we
ended up in the following situation:
root@proxy1:/etc/swift# swift-ring-builder container.builder
container.builder, build version 41
1048576 partitions, 3 replicas, 3 zones, 14 devices, 60.00 balance
The minimum number of hours before a partition can be reassigned is 1
Devices: id  zone   ip address  port    name  weight  partitions  balance  meta
          0     1  192.168.1.3  6001  c0d1p1   80.00      262144    37.50
          1     1  192.168.1.3  6001  c0d2p1   80.00      262144    37.50
          2     1  192.168.1.3  6001  c0d3p1   80.00      262144    37.50
          3     2  192.168.1.4  6001  c0d1p1  100.00      238312    -0.00
          4     2  192.168.1.4  6001  c0d2p1  100.00      238312    -0.00
          5     2  192.168.1.4  6001  c0d3p1  100.00      238312    -0.00
          6     3  192.168.1.5  6001  c0d1p1  100.00      209715   -12.00
          7     3  192.168.1.5  6001  c0d2p1  100.00      209715   -12.00
          8     3  192.168.1.5  6001  c0d3p1  100.00      209715   -12.00
         10     2  192.168.1.4  6001  c0d5p1  100.00      238312    -0.00
         11     3  192.168.1.5  6001  c0d5p1  100.00      209716   -12.00
         14     3  192.168.1.5  6001  c0d6p1  100.00      209715   -12.00
         15     1  192.168.1.3  6001  c0d5p1   80.00      262144    37.50
         16     2  192.168.1.4  6001  c0d6p1  100.00       95328   -60.00
root@proxy1:/etc/swift# ssh storage1 df -h
Filesystem Size Used Avail Use% Mounted on
/dev/cciss/c0d0p5 1.8T 38G 1.7T 3% /
none 3.9G 220K 3.9G 1% /dev
none 4.0G 0 4.0G 0% /dev/shm
none 4.0G 60K 4.0G 1% /var/run
none 4.0G 0 4.0G 0% /var/lock
none 4.0G 0 4.0G 0% /lib/init/rw
/dev/cciss/c0d1p1 1.9T 1.9T 239M 100% /srv/node/c0d1p1
/dev/cciss/c0d2p1 1.9T 1.9T 210M 100% /srv/node/c0d2p1
/dev/cciss/c0d3p1 1.9T 1.9T 104K 100% /srv/node/c0d3p1
/dev/cciss/c0d5p1 1.9T 1.2T 643G 66% /srv/node/c0d5p1
/dev/cciss/c0d0p2 92M 51M 37M 59% /boot
/dev/cciss/c0d0p3 1.9G 35M 1.8G 2% /tmp
root@proxy1:/etc/swift# ssh storage2 df -h
Filesystem Size Used Avail Use% Mounted on
/dev/cciss/c0d0p5 1.8T 33G 1.7T 2% /
none 3.9G 220K 3.9G 1% /dev
none 4.0G 0 4.0G 0% /dev/shm
none 4.0G 108K 4.0G 1% /var/run
none 4.0G 0 4.0G 0% /var/lock
none 4.0G 0 4.0G 0% /lib/init/rw
/dev/cciss/c0d0p3 1.9G 35M 1.8G 2% /tmp
/dev/cciss/c0d0p2 92M 51M 37M 59% /boot
/dev/cciss/c0d1p1 1.9T 1.5T 375G 80% /srv/node/c0d1p1
/dev/cciss/c0d2p1 1.9T 1.5T 385G 80% /srv/node/c0d2p1
/dev/cciss/c0d3p1 1.9T 1.5T 382G 80% /srv/node/c0d3p1
/dev/cciss/c0d4p1 1.9T 1.5T 377G 80% /srv/node/c0d5p1
/dev/cciss/c0d5p1 1.9T 519G 1.4T 28% /srv/node/c0d6p1
root@proxy1:/etc/swift# ssh storage3 df -h
Filesystem Size Used Avail Use% Mounted on
/dev/cciss/c0d0p5 1.8T 90G 1.7T 6% /
none 3.9G 224K 3.9G 1% /dev
none 4.0G 0 4.0G 0% /dev/shm
none 4.0G 112K 4.0G 1% /var/run
none 4.0G 0 4.0G 0% /var/lock
none 4.0G 0 4.0G 0% /lib/init/rw
/dev/cciss/c0d1p1 1.9T 1.1T 741G 61% /srv/node/c0d1p1
/dev/cciss/c0d2p1 1.9T 1.1T 741G 61% /srv/node/c0d2p1
/dev/cciss/c0d3p1 1.9T 1.1T 758G 60% /srv/node/c0d3p1
/dev/cciss/c0d5p1 1.9T 1.1T 765G 59% /srv/node/c0d5p1
/dev/cciss/c0d6p1 1.9T 1.1T 772G 59% /srv/node/c0d6p1
/dev/cciss/c0d0p2 92M 51M 37M 59% /boot
/dev/cciss/c0d0p3 1.9G 35M 1.8G 2% /tmp
As you can see:
* The balances are messed up and never return to normal, no matter how long
we wait, although behavior for the end user is still stable.
* We erased the contents of a disk on storage1 (/dev/cciss/c0d5p1) that had
reached 100% before all the others (which were still at 95%), but it filled
up again quickly, with the others soon catching up to 100%. We expected
each disk to balance to the same level, because storage2 and storage3 (with
5 disks each) are set to a weight of 100, whereas storage1 (with only 4
disks) is set to 80.
* There was a failing disk on storage2, so we replaced it
(/dev/cciss/c0d5p1), but it is not filling up as quickly; storage2 is at
almost 80%.
* Storage3 is healthy.
* Storage1 is currently taken offline, because it has been failing
constantly and its disk usage does not balance.
What is the best course of action in this scenario? I believe we can
either:
1) Completely dump storage1: remove zone 1 from the ring on the proxy, get
a new server with a similar setup, and add it to the ring as a new zone.
2) Stop storage1 and erase the contents of its full disks. On the proxy,
remove the full disks from the cluster, then add them back as new devices.
(Again, a delay may be a concern here.)
3) Or something completely different?
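For reference, here is roughly what option 2 would look like on the proxy
for a single device -- a sketch assuming the classic swift-ring-builder
syntax from that era (repeat for each full disk, and for account.builder
and object.builder as well):

```shell
# Sketch of option 2 for one full device on storage1 (zone 1).
# Assumes the classic swift-ring-builder CLI.

cd /etc/swift

# Remove the full device from the ring (search value: zone-ip:port/device).
swift-ring-builder container.builder remove z1-192.168.1.3:6001/c0d1p1

# After wiping the disk on storage1, add it back as a fresh device.
swift-ring-builder container.builder add z1-192.168.1.3:6001/c0d1p1 80

# Recompute partition assignments (honors min_part_hours, 1h here).
swift-ring-builder container.builder rebalance

# Push the new ring to every node so proxies and storage servers agree.
scp container.ring.gz storage1:/etc/swift/
scp container.ring.gz storage2:/etc/swift/
scp container.ring.gz storage3:/etc/swift/
```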
My fear is that, with both the first and second alternatives, if there is a
delay between removing the zones or disks and adding new ones, the other
zones/disks will fill up. I would therefore need to choose the alternative
with the least delay.
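One way to sidestep that delay, if this Swift version supports set_weight,
might be to drain the problem disks gradually rather than removing them
outright -- a sketch:

```shell
# Sketch: drain a device gradually by stepping its weight down and
# rebalancing between steps, so no other disk takes a sudden flood of
# partitions. Assumes set_weight exists in this Swift version.

cd /etc/swift

# Halve the weight of the problem device, then rebalance.
swift-ring-builder container.builder set_weight z1-192.168.1.3:6001/c0d1p1 40
swift-ring-builder container.builder rebalance

# Wait at least min_part_hours (1h here) and for replication to settle,
# then step down again (e.g. 40 -> 20 -> 0) before removing the device.
```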
Last but not least, please note that this Swift installation is outdated;
it has never been updated since it was installed. (I am to blame!)
Thanks in advance for your suggestions and directions.
Cheers,
--
Emre