Open Stack

Fri Mar 16 16:11:49 UTC 2012

Hi Stackers,

So, in diagnosing a few things on TryStack yesterday, I ran into an 
interesting problem with snapshotting that I'm hoping to get some advice on.

== The Problem ==

TryStack runs the Diablo codebase, but I believe the code involved in 
this particular problem is the same in Essex.

A user was attempting to snapshot a tiny instance (512MB/1-core) 
through the dashboard. The dashboard returned and reported that a 
snapshot had been created and was in Queued status.

The snapshot never left Queued status, so I logged into the compute 
node that housed the instance in question to see if I could figure out 
what was going on.

Grepping through the compute log, I found the following:

(nova.rpc): TRACE: Traceback (most recent call last):
(nova.rpc): TRACE:   File "/usr/lib/python2.7/dist-packages/nova/rpc/impl_kombu.py", line 628, in _process_data
(nova.rpc): TRACE:     rval = node_func(context=ctxt, **node_args)
(nova.rpc): TRACE:   File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 100, in wrapped
(nova.rpc): TRACE:     return f(*args, **kw)
(nova.rpc): TRACE:   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 687, in snapshot_instance
(nova.rpc): TRACE:     self.driver.snapshot(context, instance_ref, image_id)
(nova.rpc): TRACE:   File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 100, in wrapped
(nova.rpc): TRACE:     return f(*args, **kw)
(nova.rpc): TRACE:   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/connection.py", line 479, in snapshot
(nova.rpc): TRACE:     utils.execute(*qemu_img_cmd)
(nova.rpc): TRACE:   File "/usr/lib/python2.7/dist-packages/nova/utils.py", line 190, in execute
(nova.rpc): TRACE:     cmd=' '.join(cmd))
(nova.rpc): TRACE: ProcessExecutionError: Unexpected error while running command.
(nova.rpc): TRACE: Command: qemu-img convert -f qcow2 -O raw -s e7ba4fb5f6f04f99b07d1d222ada0219 /opt/openstack/nova/instances/instance-00000548/disk /tmp/tmpIuOQo0/e7ba4fb5f6f04f99b07d1d222ada0219
(nova.rpc): TRACE: Exit code: 1
(nova.rpc): TRACE: Stdout: ''
(nova.rpc): TRACE: Stderr: 'qemu-img: error while writing\n'

QEMU was unhelpfully returning a vague error message of "error while 
writing".

After speaking with a couple of folks on IRC (thx vishy and rmk!), it 
turned out that the snapshot process (the qemu-img convert command 
above) writes its output (the snapshot) to a temporary directory 
created with tempfile.mkdtemp() in nova/virt/libvirt/connection.py.

As it turns out, the base operating system we install on our compute 
nodes in TryStack has a (very) small root partition -- only 2GB in size 
(we use the devstack build_pxe_env.sh script to create the base Ubuntu 
image that is netbooted on the compute nodes).

Looking at the free disk space on the compute node in question, the 
problem was apparent:

root@freecloud102:/var/log/nova# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/ram0             2.0G  1.4G  535M  73% /
devtmpfs               48G  240K   48G   1% /dev
none                   48G     0   48G   0% /dev/shm
none                   48G  212K   48G   1% /var/run
none                   48G     0   48G   0% /var/lock
/dev/md0              5.4T   93G  5.1T   2% /opt/openstack

There simply isn't enough free space on the root partition (which is 
where /tmp is housed) for the snapshot to be created.

== Possible Solutions ==

So, there are a number of solutions that we can work on here, and I'm 
wondering what the preference would be. Here are the solutions I have 
come up with, along with a no-brainer improvement to Nova that would 
help in diagnosing this problem:

The no-brainer: Before attempting a snapshot, check that there is enough 
free space on the target device to perform the operation, and if not, 
pass a useful error message up the stack.
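
To make the no-brainer concrete, here is a rough sketch of such a 
pre-flight check. It is not existing Nova code: the function and 
exception names are made up for illustration, and how required_bytes 
gets estimated (for example, from the instance's disk size) is left as 
an assumption. It relies only on os.statvfs(), so it is Unix-only.

```python
import os


class InsufficientDiskSpace(Exception):
    """Hypothetical exception for this sketch; not part of Nova."""


def free_bytes(path):
    """Return the bytes available to unprivileged users on path's filesystem."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize


def ensure_free_space(path, required_bytes):
    """Raise a descriptive error if path cannot hold required_bytes."""
    available = free_bytes(path)
    if available < required_bytes:
        raise InsufficientDiskSpace(
            "Need %d bytes in %s for the snapshot, but only %d are available"
            % (required_bytes, path, available))
    return available
```

Calling something like ensure_free_space(tempfile.gettempdir(), 
estimated_size) before running qemu-img would turn "error while 
writing" into a message that names the directory and the shortfall.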

Solutions to the disk space problem:

(1) Silly Jay, change the damn size of the root partition in your PXE 
base OS install!

Now, I'm no expert in creating customized base disk images, but from 
looking at the build_pxe_env.sh script in devstack [1], it seems pretty 
trivial to change the ramdisk_size parameter in the startup options to 
something larger than 2109600. We could do this and reimage the compute 
nodes one by one.
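
For picking a new value, a quick bit of arithmetic helps. This sketch 
assumes the kernel's ramdisk_size option is expressed in KiB (1024-byte 
units); the helper names are mine, not anything from devstack. Keep in 
mind that the root filesystem here is a ramdisk, so whatever we add 
comes out of the node's RAM.

```python
KIB_PER_GIB = 1024 * 1024


def ramdisk_size_kib(gib):
    """Return the ramdisk_size value (in KiB) for a ramdisk of gib GiB."""
    return int(gib * KIB_PER_GIB)


def kib_to_gib(kib):
    """Convert a ramdisk_size value (in KiB) back to GiB."""
    return kib / float(KIB_PER_GIB)


print(round(kib_to_gib(2109600), 2))  # current setting: 2.01
print(ramdisk_size_kib(8))            # an 8 GiB root: 8388608
```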

(2) Make the location in which the snapshot is made configurable.

Right now, as mentioned above, tempfile.mkdtemp() is used, which creates 
a directory in the user's TMPDIR (typically /tmp, which is usually on 
the root partition).

We could add an option (--libvirt-snapshot-dir?) that would allow 
nova-compute to override where that snapshot is built.
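
A minimal sketch of what option (2) could look like, assuming a 
hypothetical snapshot_dir setting (the name and the helper below are 
illustrative, not existing Nova code). tempfile.mkdtemp() already 
accepts a dir argument, so the change amounts to passing the configured 
value through and falling back to the default when it is unset.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def snapshot_workdir(snapshot_dir=None):
    """Yield a scratch directory for building a snapshot, then remove it.

    snapshot_dir stands in for a hypothetical config option; None keeps
    today's behaviour (the system default temporary directory).
    """
    path = tempfile.mkdtemp(dir=snapshot_dir)
    try:
        yield path
    finally:
        shutil.rmtree(path, ignore_errors=True)


def snapshot_output_path(workdir, snapshot_name):
    """Build the file path that qemu-img convert would write to."""
    return os.path.join(workdir, snapshot_name)
```

With snapshot_dir pointed somewhere under /opt/openstack, the converted 
image would land on the large /dev/md0 volume instead of the root 
ramdisk.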

(3) Change the TMPDIR setting of the user running nova-compute to 
something other than /tmp on the root partition.
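
Option (3) works without touching Nova because Python's tempfile module 
consults the TMPDIR environment variable when choosing its default 
directory. The sketch below just demonstrates that behaviour; the 
helper name is made up, and it clears tempfile's cached value so the 
lookup is redone. In practice TMPDIR would be exported in whatever 
starts nova-compute rather than set from Python.

```python
import os
import tempfile


def tempdir_for(tmpdir_value):
    """Return the directory tempfile would pick if TMPDIR were tmpdir_value."""
    saved_env = os.environ.get("TMPDIR")
    saved_cache = tempfile.tempdir
    try:
        os.environ["TMPDIR"] = tmpdir_value
        tempfile.tempdir = None  # drop the cached choice so TMPDIR is re-read
        return tempfile.gettempdir()
    finally:
        # Restore the environment and the module-level cache.
        if saved_env is None:
            os.environ.pop("TMPDIR", None)
        else:
            os.environ["TMPDIR"] = saved_env
        tempfile.tempdir = saved_cache
```

Note that tempfile only accepts the directory if it exists and is 
writable by the user; otherwise it quietly falls back to the next 
candidate, such as /tmp.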

Thoughts?
-jay

[1] https://github.com/openstack-dev/devstack/blob/stable/diablo/tools/build_pxe_env.sh


[Openstack] [NOVA] Snapshotting may require significant disk space (in /tmp). How to properly solve disk space issues?
