[Openstack] [Openstack-operators] [openstack][nova] Several questions/experiences about _base directory on a big production environment

Antonio Messina antonio.s.messina at gmail.com
Fri Apr 4 18:09:21 UTC 2014


Hi Alejandro,

On Thu, Apr 3, 2014 at 11:41 PM, Alejandro Comisario
<alejandro.comisario at mercadolibre.com> wrote:
> I would love to hear insights from people using _base with no shared
> storage, keeping it locally on the compute nodes instead: upsides and
> downsides, experiences & comments.

We currently have a small cloud made of heterogeneous hardware.
Whenever we can, we try to use the local disks of the nodes. This is
the deployment scenario I would advise if possible; if you *really*
need to move an instance off a node for maintenance purposes you can
still try a live migration with --block-migration.
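For example (the exact flag spelling varies with the novaclient
version, and the instance ID and target host below are placeholders):

    # move a guest off a node under maintenance, without shared storage
    nova live-migration --block-migrate <instance-uuid> <target-compute-host>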

Unfortunately, we also have a couple of nodes with very little
storage, and for them we use a shared NFS filesystem mounted at
/var/lib/nova/instances.
This lets us share the _base directory (improving deployment speed)
and also allows live migration.
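The mount itself is nothing special, roughly like this (the server
name and export path are made up; see the locking caveat below for
which NFS version to use):

    # /etc/fstab on the storage-poor compute nodes
    filer:/export/nova-instances  /var/lib/nova/instances  nfs  defaults,_netdev  0 0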

We don't store Glance images directly in _base: the first node that
needs an image downloads it, and the others then use the
already-downloaded copy.

One important caveat: when you share _base among multiple compute
nodes, a lock file is created in /var/lib/nova/instances/locks by the
machine that is downloading the base image, and fcntl() is called on
it. If you use an NFSv3 filesystem you *will* have trouble, because
locking is very poorly implemented in NFSv3, so you should definitely
use NFSv4.
If you can't (we couldn't), use a Linux box to export an NFSv4
filesystem and mount it on /var/lib/nova/instances/locks.
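In practice that boils down to something like this (hostnames, paths
and the subnet are placeholders; a sketch rather than a tested recipe):

    # on a spare Linux box, /etc/exports: a tiny NFSv4 export for the locks
    /srv/nova-locks  10.0.0.0/16(rw,sync,no_subtree_check,fsid=0)

    # on every compute node, /etc/fstab: mount it over the locks directory only
    lockserver:/  /var/lib/nova/instances/locks  nfs4  defaults,_netdev  0 0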

Also, you need to disable the nova-compute periodic task that cleans
up the _base directory, since an image that is not in use on one
compute node may still be in use by another!
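Concretely, that means turning off the image cache cleanup in
nova.conf on every node that shares _base (this is the
remove_unused_base_images option that comes up again further down in
the thread):

    # /etc/nova/nova.conf
    [DEFAULT]
    # never let a single node decide that a cached base image is unused
    remove_unused_base_images=False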

In a couple of months we are going to deploy a few hundred nodes with
very little internal storage, and for them we are planning to deploy
GlusterFS for /var/lib/nova/instances, but since we don't have much
experience with it yet I can't tell you whether this is actually
advisable. I can't tell you that I would have avoided it if that had
been possible, though :)

Final note: we run Folsom on Ubuntu 12.04

.a.

>> On Thu, Apr 3, 2014 at 12:28 AM, Joe Topjian <joe at topjian.net> wrote:
>> > Is it Ceph live migration that you don't think is mature for production or
>> > live migration in general? If the latter, I'd like to understand why you
>> > feel that way.
>> >
>> > Looping back to Alejandro's original message: I share his pain of _base
>> > issues. It's happened to me before and it sucks.
>> >
>> > We use shared storage for a production cloud of ours. The cloud has a 24x7
>> > SLA and shared storage with live migration helps us achieve that. It's not a
>> > silver bullet, but it has saved us so many hours of work.
>> >
>> > The remove_unused_base_images option is stable and works. I still disagree
>> > with the default value being "true", but I can vouch that it has worked
>> > without harm for the past year in an environment where it previously shot me
>> > in the foot.
>> >
>> > With that option enabled, you should not have to go into _base at all. The
>> > only work we do in _base is manual audits and the rare occasion when the
>> > database is inconsistent with what's really hosted.
>> >
>> > To mitigate potential _base issues, we just try to be as careful as
>> > possible -- measure 5 times before cutting. Our standard procedure is to
>> > move the files we plan to remove into a temporary directory and wait a few
>> > days to see whether any users raise an alarm.
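>> > In shell terms it's nothing fancier than something like this (paths are
>> > illustrative, and <suspect-image> stands for whatever the audit flagged):
>> >
>> >     # park removal candidates instead of deleting them outright
>> >     mkdir -p /var/lib/nova/_base_quarantine
>> >     mv /var/lib/nova/instances/_base/<suspect-image> /var/lib/nova/_base_quarantine/
>> >     # wait a few days; if nobody complains, delete the quarantined files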
>> >
>> > Diego has a great point about not using qemu backing files: if your backend
>> > storage implements deduplication and/or compression, you should see the same
>> > savings as what _base is trying to achieve.
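>> >
>> > If anyone wants to try that approach, the relevant knob should be the
>> > use_cow_images flag, roughly like this (check the documentation for your
>> > release before relying on it):
>> >
>> >     # /etc/nova/nova.conf: give each instance a full, un-backed copy of
>> >     # its image and let the storage backend deduplicate/compress it
>> >     [DEFAULT]
>> >     use_cow_images=False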
>> >
>> > We're in the process of building a new public cloud and made the decision
>> > not to implement shared storage. I have a queue of blog posts that I'd love
>> > to write, and the thinking behind this decision is one of them. Very
>> > briefly, the decision was based on the SLA that the public cloud will have,
>> > combined with our feeling that "cattle" instances are more acceptable to
>> > the average end-user nowadays.
>> >
>> > That's not to say that I'm "done" with shared storage. IMO, it all depends
>> > on the environment. One great thing about OpenStack is that it can be
>> > tailored to work in so many different environments.
>> >
>> >
>> >
>> > On Wed, Apr 2, 2014 at 5:48 PM, matt <matt at nycresistor.com> wrote:
>> >>
>> >> there's shared storage on a centralized network filesystem... then there's
>> >> shared storage on a distributed network filesystem.  thus the age old
>> >> openafs vs nfs war is reborn.
>> >>
>> >> i'd check out the ceph block device for live migration... but that said...
>> >> live migration hasn't reached a maturity level at which i'd even consider
>> >> trying it in production.
>> >>
>> >> -matt
>> >>
>> >>
>> >> On Wed, Apr 2, 2014 at 7:40 PM, Chris Friesen
>> >> <chris.friesen at windriver.com> wrote:
>> >>>
>> >>> So if you're recommending not using shared storage, what's your answer to
>> >>> people asking for live-migration?  (Given that block migration is supposed
>> >>> to be going away.)
>> >>>
>> >>> Chris
>> >>>
>> >>>
>> >>> On 04/02/2014 05:08 PM, George Shuklin wrote:
>> >>>>
>> >>>> Every time anyone starts to consolidate resources (shared storage, a
>> >>>> virtual chassis for routers, etc.), they consolidate all failures into
>> >>>> one: a single failure, and every consolidated system joins the festival.
>> >>>>
>> >>>> Then they start raising the fault tolerance of the consolidated system,
>> >>>> pushing the administrative bar to the sky, requesting more and more
>> >>>> hardware for clustering, requesting enterprise-grade everything, because
>> >>>> "no one was ever fired for buying enterprise <bullshit-brand-name-here>".
>> >>>> As a result, the consolidated system ends up with the same MTBF as the
>> >>>> non-consolidated one, "saving costs" only in comparison with an even more
>> >>>> enterprise-grade super-solution costing a few percent of a country's GDP,
>> >>>> while actually costing more than the non-consolidated solution.
>> >>>>
>> >>>> For x86, failure is ALWAYS an option: the processor cannot replay
>> >>>> instructions, there is no comparator across redundant parallel
>> >>>> processors, and so on; compare that with mainframes. So, if failure is an
>> >>>> option, the goal is to reduce the importance of each failure and its
>> >>>> scope.
>> >>>>
>> >>>> If one of 1,000 hosts goes down for three hours, that is sad. But it is
>> >>>> much, much better than a central system that all 1,000 hosts depend on
>> >>>> going down for even 11 seconds (3h * 3600 / 1000 ~ 11 s, the same
>> >>>> aggregate downtime).
>> >>>>
>> >>>> So the answer is simple: do not aggregate. Put _base on slower drives if
>> >>>> you want to save costs, but do not consolidate failures.
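>> >>>> (For example, simply mounting a cheap, slow local disk at the cache
>> >>>> path; the device name below is purely illustrative:)
>> >>>>
>> >>>>     # /etc/fstab on a compute node: image cache on a slow local disk
>> >>>>     /dev/sdb1   /var/lib/nova/instances/_base   ext4   defaults   0   2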
>> >>>>
>> >>>> On 04/02/2014 09:04 PM, Alejandro Comisario wrote:
>> >>>>>
>> >>>>> Hi guys ...
>> >>>>> We have a pretty big OpenStack environment, and we use a shared NFS
>> >>>>> filesystem to populate the backing-file directory (the famous _base
>> >>>>> directory located at /var/lib/nova/instances/_base). Due to a human
>> >>>>> error, the backing file used by thousands of guests was deleted, causing
>> >>>>> those guests to go read-only on their filesystems in a second.
>> >>>>>
>> >>>>> Until that moment we were convinced that serving the _base directory
>> >>>>> over shared NFS was the right choice because:
>> >>>>>
>> >>>>> * spawning a new AMI gives total visibility to the whole cloud, so
>> >>>>> instances take almost no time to boot regardless of the nova region
>> >>>>> * it eases the Glance workload
>> >>>>> * management is easier: no need to constantly replicate files or push
>> >>>>> bandwidth usage internally
>> >>>>>
>> >>>>> But after this really big issue, and after what it took us to recover
>> >>>>> from it, we started thinking about how to protect against this kind of
>> >>>>> "single point of failure".
>> >>>>> Our first approach these days has been to make the NFS share read-only,
>> >>>>> so that computes (and humans) cannot write to that directory, and to
>> >>>>> give write permission to just one compute node, which is responsible for
>> >>>>> spawning an instance from a new AMI and writing the file to the
>> >>>>> directory. Still, the storage remains the SPOF.
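>> >>>>> (Roughly, the export now looks like this; the hostname and subnet are
>> >>>>> placeholders:)
>> >>>>>
>> >>>>>     # /etc/exports on the filer: _base read-only for the fleet,
>> >>>>>     # read-write only for the single "seeder" compute node
>> >>>>>     /export/nova-base  seeder-compute(rw,sync,no_subtree_check)  10.0.0.0/16(ro,sync,no_subtree_check)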
>> >>>>>
>> >>>>> So we are considering keeping the backing files in use LOCAL on every
>> >>>>> compute node ( +1K hosts ) to reduce the chance of failure to a minimum,
>> >>>>> obviously with a parallel discussion about which technology to use to
>> >>>>> keep the data replicated among computes when a new AMI is launched,
>> >>>>> launch times, the performance impact on compute nodes of storing backing
>> >>>>> files locally, etc.
>> >>>>>
>> >>>>> This made me realize that I have a huge community behind OpenStack, so
>> >>>>> I wanted to hear from it:
>> >>>>>
>> >>>>> * what are your thoughts about what happened / what we are thinking
>> >>>>> right now?
>> >>>>> * how do other users manage the backing-file ( _base ) directory, given
>> >>>>> all these considerations, on big OpenStack deployments?
>> >>>>>
>> >>>>> I will be thrilled to read other users' experiences and thoughts.
>> >>>>>
>> >>>>> As always, best.
>> >>>>> Alejandro



-- 
antonio.s.messina at gmail.com
antonio.messina at uzh.ch                     +41 (0)44 635 42 22
GC3: Grid Computing Competence Center      http://www.gc3.uzh.ch/
University of Zurich
Winterthurerstrasse 190
CH-8057 Zurich Switzerland



