[Openstack-operators] [openstack][nova] Several questions/experiences about _base directory on a big production environment
chris.friesen at windriver.com
Wed Apr 2 23:40:49 UTC 2014
So if you're recommending not using shared storage, what's your answer
to people asking for live-migration? (Given that block migration is
supposed to be going away.)
On 04/02/2014 05:08 PM, George Shuklin wrote:
> Every time anyone starts to consolidate resources (shared storage,
> virtual chassis for routers, etc.), they consolidate all failures into
> one. One failure, and every consolidated system joins the festival.
> Then they start to increase the fault tolerance of the consolidated
> system, raising the administrative bar to the sky, requesting more and
> more hardware for clustering, requesting enterprise-grade gear ("no one
> was ever fired for buying enterprise <bullshit-brand-name-here>"). As a
> result, the consolidated system works with the same MTBF as the
> non-consolidated one. It "saves costs" only compared to an even more
> enterprise-grade super-solution costing a few percent of a country's
> GDP, and it actually costs more than the non-consolidated solution.
> Failure on x86 is ALWAYS an option. The processor cannot retry
> instructions, there is no comparator between parallel processors, and
> so on. Compare that to mainframes. So, if failure is an option, reduce
> the importance of that failure: its scope.
> If one of 1k hosts goes down for three hours, that is sad. But it is
> much, much, much better than a central system that all 1k hosts depend
> on going down for just 11 seconds (3h*3600/1000).
> So the answer is simple: do not aggregate. Put _base on slower drives
> if you want to save costs, but do not consolidate failures.
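The arithmetic behind the 11-second figure can be checked with a short Python sketch. It assumes, as the quoted example does, that impact is measured in total host-seconds of downtime. The function name and its parameters are illustrative only and do not come from the thread.

```python
def equivalent_shared_outage_seconds(total_hosts, failed_hosts, outage_hours):
    """Length of a shared-system outage (hitting every host) that costs
    the same number of host-seconds as `failed_hosts` hosts being down
    for `outage_hours` hours."""
    # Host-seconds lost when only the failed hosts are down.
    host_seconds_lost = failed_hosts * outage_hours * 3600
    # Spread the same loss across every host that depends on the shared system.
    return host_seconds_lost / total_hosts


# 1 of 1000 hosts down for 3 hours vs. a shared system all 1000 depend on.
print(equivalent_shared_outage_seconds(1000, 1, 3))  # 10.8
```

Under this assumption, a shared dependency that fails for about 10.8 seconds costs as much as one host in a thousand being down for three hours, which is where the "11 seconds" comes from.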
> On 04/02/2014 09:04 PM, Alejandro Comisario wrote:
>> Hi guys ...
>> We have a pretty big OpenStack environment, and we use a shared NFS
>> to populate the backing file directory (the famous _base directory,
>> located at /var/lib/nova/instances/_base). Due to human error, a
>> backing file used by thousands of guests was deleted, causing those
>> guests to go to a read-only filesystem within a second.
>> Until that moment we were convinced that using a shared NFS for the
>> _base directory was right because:
>> * spawning a new AMI makes it visible to the whole cloud at once, so
>> instances take no time to boot regardless of the nova region
>> * it eases the glance workload
>> * it is the easiest to manage: no need to replicate files constantly,
>> and no extra internal bandwidth usage
>> But after this really big issue, and after what it took us to recover
>> from it, we started thinking about how to protect against this kind of
>> "single point of failure".
>> Our first approach these days was to make the NFS share read-only,
>> making it impossible for computes (and humans) to write to that
>> directory, and giving write permission to just one compute, which is
>> responsible for spawning an instance from a new AMI and writing the
>> file to the directory. Still ... the storage remains the SPOF.
>> So, we are considering the possibility of keeping the backing files in
>> use LOCAL on every compute (1K+ hosts) to reduce the chances of failure
>> to a minimum, obviously with a parallel discussion about what
>> technology to use to keep data replicated among computes when a new
>> AMI is launched, launch times, performance concerns on compute nodes
>> having to store backing files locally, etc.
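Whatever replication technology is chosen, each compute would probably need a way to confirm that its local copies of the backing files are present and intact. A minimal Python sketch of such a check follows. It assumes a hypothetical manifest mapping file names to expected SHA-256 digests; that manifest, and both function names, are illustrative and are not a Nova or Glance feature.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large backing files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_bad_backing_files(base_dir, manifest):
    """Compare a local _base directory against a (hypothetical) manifest.

    manifest: dict mapping file name -> expected SHA-256 hex digest.
    Returns a dict mapping file name -> "missing" or "corrupt".
    """
    problems = {}
    for name, expected in manifest.items():
        path = Path(base_dir) / name
        if not path.is_file():
            problems[name] = "missing"
        elif sha256_of(path) != expected:
            problems[name] = "corrupt"
    return problems
```

A compute could run something like `find_bad_backing_files("/var/lib/nova/instances/_base", manifest)` after each sync and re-fetch anything reported, so a deleted or damaged file affects one host rather than every guest in the cloud.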
>> This made me realize I have a huge community behind OpenStack, so I
>> wanted to hear from it:
>> * what are your thoughts about what happened / what we are thinking
>> right now?
>> * how do other users manage the backing file (_base) directory, given
>> all these considerations, on big OpenStack deployments?
>> I will be thrilled to read other users' experiences and thoughts.
>> As always, best.
>> OpenStack-operators mailing list
>> OpenStack-operators at lists.openstack.org