<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Oct 1, 2014 at 4:08 AM, Jesse Pretorius <span dir="ltr"><<a href="mailto:jesse.pretorius@gmail.com" target="_blank">jesse.pretorius@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">I'd like to clarify a few things, specifically related to Ceph usage, in less of a rushed response. :)<div><br></div><div>Note - my production experience has only been with Ceph Dumpling. Plenty of great patches which resolve many of the issues I've experienced have landed, so YMMV.<br><div class="gmail_extra"><br><div class="gmail_quote"><span class="">On 30 September 2014 15:06, Jesse Pretorius <span dir="ltr"><<a href="mailto:jesse.pretorius@gmail.com" target="_blank">jesse.pretorius@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div>I would recommend ensuring that:<br></div><div><br></div><div>1) ceph-mon's and ceph-osd's are not hosted on the same server - they both demand plenty of cpu cycles</div></div></div></div></blockquote><div><br></div></span><div>The ceph-mon will generally not use much CPU. If a whole chassis is lost, you'll see it spike heavily, but it'll drop off again after the rebuild is complete. I would still recommend keeping at least one ceph-mon on a host that isn't hosting OSD's. <span style="font-family:arial,sans-serif;font-size:13px">The mons are where all clients get the data location details from, so at least one really needs to be available no matter what happens.</span></div><div><span style="font-family:arial,sans-serif;font-size:13px"><br></span></div></div></div></div></div></blockquote><div>At the beginning when things are small (few OSD) I'm intending to run mons on the osd nodes. When I start to grow it, my plan is to start deploying separate monitors and eventually disable the mons on the OSD nodes entirely. </div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div class="gmail_extra"><div class="gmail_quote"><div><span style="font-family:arial,sans-serif;font-size:13px"></span></div><div><span style="font-family:arial,sans-serif;font-size:13px">And, FYI, I would definitely recommend implementing separate networks for client access and the storage back-end. This can allow you to ensure that your storage replication traffic is separated and you can tune the QoS for each differently.</span></div></div></div></div></div></blockquote><div><br></div><div>I've got a dedicated, isolated 10 GB network between the Ceph nodes dedicated purely to replication traffic. Another interface (also 10 GB) will handle traffic from Openstack, and a 3rd (1 GB) will deal with RadosGW traffic from the public side. 

>> 5) instance storage on ceph doesn't work very well if you're trying to use
>> the kernel module or cephfs - make sure you're using ceph volumes as the
>> underlying storage (I believe this has been patched in for Juno)
>
> CephFS, certainly in Dumpling, is not production ready - our experiment with
> using it in production was quickly rolled back when one of the client
> servers lost its connection to the ceph-mds for some reason and the storage
> on it became inaccessible. The client connection to the MDS in Dumpling
> isn't as resilient as the client connection for the block device.
>
> By "use the kernel module" I mean creating an image, mounting it on the
> server through the Ceph block device (RBD) kernel module, building a file
> system on it, and using it like you would any network-based storage.
> We found that when using one image as shared storage between servers,
> updates from one server weren't always visible quickly enough (within a
> minute) on the other server. If you use a single image per server, and only
> mount server2's image on server1 in a disaster-recovery situation, it should
> be just fine.
> We also found that mounting a file system through the kernel module would
> tend to cause a kernel panic when disconnecting the storage. Note that there
> have been several improvements in the releases after Dumpling, including
> some bug fixes for issues that look similar to what we experienced.
>
> By "make sure you're using ceph volumes as the underlying storage" I meant
> that each instance root disk should be stored as its own Ceph image in a
> storage pool. This can be done directly from nova by setting
> images_type=rbd in nova.conf, which became available in OpenStack Havana.
> Support for using RBD for ephemeral disks as well finally landed in Juno
> (see https://bugs.launchpad.net/nova/+bug/1226351), as did support for
> copy-on-write cloning (see
> https://blueprints.launchpad.net/nova/+spec/rbd-clone-image-handler), which
> rounds out the feature set for using an RBD back-end quite nicely. :)
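To make sure I'm picturing both approaches correctly: the kernel-module route
you describe is, roughly, something like this (pool and image names are made
up, and this is just my sketch of the rbd CLI from that era):

    rbd create shared/server2-data --size 102400    # 100 GB image (size in MB)
    rbd map shared/server2-data                     # appears as /dev/rbd0 and /dev/rbd/shared/server2-data
    mkfs.xfs /dev/rbd/shared/server2-data
    mount /dev/rbd/shared/server2-data /mnt/server2-data
    # ...and the reverse when detaching, which is where the panics hit you:
    umount /mnt/server2-data
    rbd unmap /dev/rbd/shared/server2-data

And the nova-managed route, as I understand it, comes down to nova.conf
settings along these lines (option names as of Juno - Havana used
libvirt_images_type and friends under [DEFAULT], so check the docs for your
release; the pool name and user are only examples):

    [libvirt]
    images_type = rbd
    images_rbd_pool = vms
    images_rbd_ceph_conf = /etc/ceph/ceph.conf
    rbd_user = cinder
    rbd_secret_uuid = <libvirt secret uuid>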

I was originally planning on doing what you describe with images_type=rbd, my
main wish being the ability to live-migrate instances off a compute node. I
discovered yesterday that block migration works just fine with kvm/libvirt
now, despite assertions to the contrary in the OpenStack documentation, and I
can live with that for now. The last time I tried the RBD back-end was in
Havana and it had some goofy behavior, so I think I'll let this idea sit for
a while and maybe try again in Kilo, once the new copy-on-write code has had
a chance to age a bit ;).
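For what it's worth, the block migration that worked for me is just the
standard nova client call, something like the following (instance and host
names are made up, and flag spellings may vary a little between client
versions):

    nova live-migration --block-migrate <instance-uuid> <target-compute-host>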