[Openstack-operators] [openstack][nova] Several questions/experiences about _base directory on a big production environment

matt matt at nycresistor.com
Wed Apr 2 23:48:17 UTC 2014


there's shared storage on a centralized network filesystem... and then
there's shared storage on a distributed network filesystem.  thus the
age-old openafs vs nfs war is reborn.

i'd check out the ceph block device for live migration... but that said,
live migration has not reached a maturity level where i'd even consider
trying it in production.
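
(for the curious, a minimal sketch of what rbd-backed disks look like in
an icehouse-era nova.conf; the pool and cephx user names here are
placeholders for whatever your ceph deployment actually uses:

    [libvirt]
    images_type = rbd                      # ephemeral disks live in ceph, no local _base copy
    images_rbd_pool = vms                  # placeholder pool name
    images_rbd_ceph_conf = /etc/ceph/ceph.conf
    rbd_user = cinder                      # placeholder cephx user
    rbd_secret_uuid = <your libvirt secret uuid>

with images_type=rbd the hypervisor reads the disk straight out of the
ceph pool, which is also what would make live migration possible without
block migration.)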

-matt


On Wed, Apr 2, 2014 at 7:40 PM, Chris Friesen
<chris.friesen at windriver.com> wrote:

> So if you're recommending not using shared storage, what's your answer to
> people asking for live-migration?  (Given that block migration is supposed
> to be going away.)
>
> Chris
>
>
> On 04/02/2014 05:08 PM, George Shuklin wrote:
>
>> Every time anyone consolidates resources (shared storage, a virtual
>> chassis for routers, etc.), they consolidate all the failures into one:
>> a single failure, and every system participating in the consolidation
>> joins the festival.
>>
>> Then they start raising the fault tolerance of the consolidated system,
>> lifting the administrative bar to the sky, requesting more and more
>> hardware for clustering, requesting enterprise-grade everything ("no
>> one ever got fired for buying enterprise <bullshit-brand-name-here>").
>> As a result, the consolidated system ends up with the same MTBF as the
>> non-consolidated one, "saving costs" only in comparison with an even
>> more enterprise-grade super-solution priced at a few percent of a
>> country's GDP, while actually costing more than the non-consolidated
>> solution.
>>
>> For x86, failure is ALWAYS an option: the processor cannot retry
>> instructions, there is no comparator voting across parallel processors,
>> and so on (compare with mainframes). So if failure is always an option,
>> the job is to reduce the importance of each failure, to limit its scope.
>>
>> If one of 1k hosts goes down for three hours, that is sad. But it is
>> much, much, much better than the central system that every one of those
>> 1k hosts depends on going down for just 11 seconds (3h*3600/1000).
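>>
>> (Worked out: 3 hours of downtime on one host out of 1000 amortizes to
>> 3 * 3600 s / 1000 = 10.8 s, about 11 s, of fleet-wide downtime; a
>> central dependency that is down for only 11 s stalls all 1000 hosts
>> and costs the same aggregate time.)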
>>
>> So the answer is simple: do not aggregate. Put _base on slower local
>> drives if you want to save costs, but do not consolidate failures.
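>>
>> (Concretely, a sketch: dedicate a cheap local disk to _base on each
>> compute; the device name is a placeholder:
>>
>>    mount /dev/sdb1 /var/lib/nova/instances/_base   # slow, cheap local disk
>>
>> the backing files are re-creatable from glance, so nothing precious
>> lives there.)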
>>
>> On 04/02/2014 09:04 PM, Alejandro Comisario wrote:
>>
>>> Hi guys ...
>>> We have a pretty big OpenStack environment, and we use a shared NFS
>>> mount to populate the backing-file directory (the famous _base
>>> directory located at /var/lib/nova/instances/_base). Due to a human
>>> error, the backing file used by thousands of guests was deleted,
>>> sending those guests to a read-only filesystem within a second.
>>>
>>> Until that moment we were convinced that keeping the _base directory
>>> on shared NFS was right because:
>>>
>>> * spawning a new AMI gives total visibility to the whole cloud, so
>>> instances take almost no time to boot regardless of the nova region
>>> * it eases the glance workload
>>> * it is the easiest to manage: no need to constantly replicate files
>>> or push internal bandwidth usage
>>>
>>> But after this really big issue, and after what it took us to recover
>>> from it, we started thinking about how to protect ourselves against
>>> this kind of "single point of failure".
>>> Our first approach these days was to make the NFS share read-only,
>>> so that computes (and humans) cannot write to that directory, and to
>>> give write permission to just one compute, the one responsible for
>>> spawning an instance from a new AMI and writing the file to the
>>> directory. Still... the storage remains the SPOF.
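>>>
>>> (Roughly, the read-only variant is just a mount option; in /etc/fstab
>>> on every compute except the single writer, with "nfs-server:/nova_base"
>>> a placeholder for the real export:
>>>
>>>    nfs-server:/nova_base  /var/lib/nova/instances/_base  nfs  ro,hard  0 0
>>>
>>> the designated writer mounts the same export with "rw" instead of "ro".)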
>>>
>>> So now we are considering keeping the backing files LOCAL on every
>>> compute (+1K hosts) to reduce the failure domain to the minimum,
>>> obviously with a parallel discussion about which technology to use to
>>> keep the data replicated among computes when a new AMI is launched,
>>> about launch times, about the performance impact on compute nodes of
>>> storing backing files locally, etc. (one candidate is sketched below).
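>>>
>>> (For example, a pull-based rsync sketch; "seed-host" is a placeholder
>>> for whichever node is allowed to fetch new images, and every compute
>>> would pull from it periodically:
>>>
>>>    rsync -a --ignore-existing \
>>>        seed-host:/var/lib/nova/instances/_base/ \
>>>        /var/lib/nova/instances/_base/
>>>
>>> --ignore-existing never rewrites a file that is already present, so a
>>> re-run cannot touch backing files that running guests depend on.)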
>>>
>>> This made me realize that I have a huge community behind OpenStack,
>>> so I wanted to hear from it:
>>>
>>> * What are your thoughts about what happened and about what we are
>>> considering right now?
>>> * How do other users manage the backing-file (_base) directory on big
>>> OpenStack deployments, given all these considerations?
>>>
>>> I will be thrilled to hear other users' experiences and thoughts.
>>>
>>> As always, best.
>>> Alejandro
>>>