[openstack-dev] Feedback wanted please on proposal to make root disk size variable

Scott Moser smoser at ubuntu.com
Thu Jun 6 15:57:55 UTC 2013


On Thu, 6 Jun 2013, Robert Collins wrote:

> On 6 June 2013 06:50, Scott Moser <smoser at ubuntu.com> wrote:
>
> > The time and IO it takes to create a filesystem on a "device" (file or
> > block device, doesn't really matter) is very real.  I think the caching of
> > that is not an insignificant performance improvement.
> >
> > Here's an example:
> > $ truncate --size=20G my.img
> > $ ls -s my.img
> > 0 my.img
> > $ time mkfs.ext3 -F my.img >/dev/null 2>&1
> > real  0m15.279s
> > user  0m0.020s
> > sys   0m0.980s
> > $ ls -s my.img
> > 464476 my.img
> >
> > So it looks to me that it did ~400M of IO in order to put a filesystem on
> > a 20G image.  If you push that off to the guest, it's only going to perform
> > worse (as IO will be slower).
>
> For ext4 this is:
> $ truncate --size=20G my.img
> $ ls -s my.img
> 0 my.img
> $ time mkfs.ext4 -F my.img >/dev/null 2>&1
>
> real    0m1.408s
> user    0m0.060s
> sys     0m0.096s
> $ ls -s my.img
> 135480 my.img

Yeah, but putting ext4 on there breaks users' expectations (as you pointed
out).  It may "just work", since most OSes today ship an ext4 driver and
most images running now are probably just normal Linux distributions.  But
in the future it is very possible for appliances to be heavily tuned, with
waste (like unused filesystem code) stripped from the images.  In that
future, a change from ext3 to ext4 breaks people.  I suspect that even
*now* some things would break if you did it.

Here's one interesting addition to the stuff I showed above:

$ for t in ext4 ext3; do
  img=my-$t.img; rm -f $img;
  truncate --size=20G $img
  mkfs.$t -F $img;
  cp --sparse=always $img $img.copy
done
$ ls -1hs my-*
454M my-ext3.img
 11M my-ext3.img.copy
133M my-ext4.img
4.3M my-ext4.img.copy
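
(Quick sanity check, not part of the proposal: the sparse copies are
byte-for-byte identical to the originals, only the on-disk allocation
differs.)

$ cmp my-ext3.img my-ext3.img.copy && echo identical
$ cmp my-ext4.img my-ext4.img.copy && echo identical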

So, to cache the created filesystem, instead of using qcow, we could
actually do once:
 pristine="pristine_ephemeral_20G_ext3.img"
 truncate --size=20G $pristine.working
 mkfs.ext3 -L ephemeral0 -F $pristine.working
 cp --sparse=always $pristine.working $pristine

Then, instead of using that as a qcow backing, we can just
 cp --sparse=always $pristine \
     /var/lib/instances/instance/ephemeral0.img

And then, if we want to fix Phil's issue with unique UUIDs, we just then
do:
 tune2fs -U $(uuidgen) /var/lib/instances/instance/ephemeral0.img

This solves the migration issue as now the instance is not dependent on a
qcow backed image where the backing store differs from compute-node to
compute-node.

I put a gist at https://gist.github.com/smoser/5722502 with a script and
some results.  In my limited testing, it seems that the 'cp' is actually
faster than the qcow-create until the size got > 20G.
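
Roughly, what that comparison boils down to is the following (the gist has
the actual numbers and loops over several sizes; this is just the shape of
it):

$ pristine=pristine_ephemeral_20G_ext3.img
$ time cp --sparse=always $pristine ephemeral0.img
$ time qemu-img create -f qcow2 -b $pristine ephemeral0.qcow2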

For some reason, e2label and tune2fs are very slow, almost as slow as the
copy from the original.  I suspect they may not be sparse-aware.
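
(If anyone wants to poke at that, it is easy to reproduce on the pristine
file from above:)

$ time tune2fs -U $(uuidgen) pristine_ephemeral_20G_ext3.img
$ time e2label pristine_ephemeral_20G_ext3.img ephemeral0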

So the tl;dr of this is that it might make a lot of sense to drop the qcow
backing of ephemeral images and instead just use sparse raw files.

>
>
> Does anyone use ext3 these days? ;)
>
> Seriously though, I think keeping existing expectations as simple and
> robust as possible is important. I know I was weirded out ~ a year
> back when I spun up an HPCS ultra large instance and got 10G on / and
> 1T on /mnt : on EC2 when you grab an ultra large / is the thing that
> is large ;).
>
> Phil - I don't quite understand the operational use case: resizing the
> root-fs is an in-OS operation, and snapshotting takes advantage of COW
> semantics if qcow2 or similar backing storage is used. How does making
> the size of the root exactly match that of the image make snapshotting
> more efficient in that case? Or is it for deployments [there may be

It does make sense.
Consider the Ubuntu cloud images.  In pristine form they are 1.4G
filesystems with about 700M of populated disk space.  If you boot one with
a 10G root filesystem, it grows to 10G.  That immediately dirties a bunch
of space in the backing device, and over time, with a few rounds of
'apt-get update && apt-get dist-upgrade' and plenty of open/write/close/
unlink, more and more of that disk will be dirty.  It's quite possible
that a user never touches more than 2G of the root, but snapshotting that
image is now 10G of data.

If the user had no use for that 8G of disk, then everyone involved would
have been better off if the disk was never created to be 10G.
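
You can see the effect without booting anything, just with a loop mount on
the host (illustration only; needs root for the mount):

$ truncate --size=10G demo.img
$ mkfs.ext3 -F demo.img
$ mkdir -p mnt && sudo mount -o loop demo.img mnt
$ sudo dd if=/dev/zero of=mnt/junk bs=1M count=2048
$ sudo rm mnt/junk && sudo umount mnt
$ ls -hs demo.img

Even though the filesystem is essentially empty again, roughly 2G of
blocks stay allocated in demo.img.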

I think that 'trim' can be sent all the way down through kvm and back to
the backing files... or at least I think there was work on that.  That
would get your snapshot clean again, if ext4 were correctly passing 'trim'
down.
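
The guest-side piece of that is basically just (assuming the virtual disk
is actually configured to pass discard through, which is the part I'm not
sure is wired up):

 $ sudo fstrim -v /mnt    # or wherever ephemeral0 is mounted
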
> some ;)] that don't use qcow2 style backing storage? Things like Ceph
> mounted block devices will still be doing dirty block tracking, so
> should be super efficient (and be able to do COW snapshots even more
> efficiently in fact).
>
> One thing I will note is that as a user, I *do not want* to specify
> the root size in the images I build: that's what flavor is for; it's
> how I define the runtime environment for my machine images; so - if we
> do make flavor have more dynamic roots - I think it would be very
> helpful to make sure that that can be overridden by the user [or that
> the user has to opt into it]. (I realise that implies that
> python-novaclient changes as well, not to mention documentation).
>
> -Rob
>
> --
> Robert Collins <rbtcollins at hp.com>
> Distinguished Technologist
> HP Cloud Services
>
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
>


