[Openstack-operators] [openstack][nova] Several questions/experiences about _base directory on a big production environment

George Shuklin george.shuklin at gmail.com
Wed Apr 2 23:08:27 UTC 2014


Every time anyone starts to consolidate resources (shared storage, a 
virtual chassis for routers, etc.), they consolidate all failures into 
one. One failure, and every system participating in the consolidation 
joins the festival.

Then they start to increase the fault tolerance of the consolidated 
system: raising the administrative bar to the sky, requesting more and 
more hardware for clustering, requesting enterprise-grade everything 
("no one was ever fired for buying enterprise 
<bullshit-brand-name-here>"). As a result, the consolidated system ends 
up with the same MTBF as the non-consolidated one, "saving costs" only 
in comparison to an even more enterprise-grade super-solution priced at 
a few percent of a country's GDP, while actually costing more than the 
non-consolidated solution.

On x86, failure is ALWAYS an option: the processor cannot retry 
instructions, there is no comparator between a few processors running 
in parallel, and so on (compare with mainframes). So if failure is an 
option, the goal is to reduce the importance of that failure, and its 
scope.

If one of 1k hosts goes down for three hours, that is sad. But it is 
much, much better than a central system that every one of the 1k hosts 
depends on going down for just 11 seconds, which causes the same total 
damage (3 h * 3600 s / 1000 hosts = 10.8 s).
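
That equivalence as a tiny Python sketch (pure arithmetic, numbers 
taken from the paragraph above):

    # One of 1000 hosts down for 3 hours: 3 instance-host-hours lost.
    hosts = 1000
    local_outage_hours = 3.0

    # A central system that all 1000 hosts depend on only has to stay
    # down this long to cause the same total damage:
    central_equivalent_s = local_outage_hours * 3600 / hosts
    print(central_equivalent_s)  # -> 10.8 (about 11 seconds)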

So the answer is simple: do not aggregate. Put _base on slower local 
drives if you want to save costs, but do not consolidate failures.
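
To illustrate "local, not consolidated": below is a minimal sketch of a 
sanity check one could run on each compute node to confirm that _base 
sits on a local filesystem rather than NFS. The script is my own 
assumption, not anything shipped with nova; only the 
/var/lib/nova/instances/_base path comes from this thread.

    #!/usr/bin/env python
    # Warn if nova's _base directory is served over NFS
    # (hypothetical check, not part of nova).
    import os

    BASE_DIR = "/var/lib/nova/instances/_base"

    def mount_fstype(path):
        """Return the fs type of the mount that contains `path`."""
        path = os.path.realpath(path)
        best, fstype = "", "unknown"
        with open("/proc/mounts") as mounts:
            for line in mounts:
                _dev, mnt, fs = line.split()[:3]
                # The longest mount point that is a prefix of
                # `path` wins.
                prefix = mnt.rstrip("/") + "/"
                if (path == mnt or path.startswith(prefix)) \
                        and len(mnt) > len(best):
                    best, fstype = mnt, fs
        return fstype

    fs = mount_fstype(BASE_DIR)
    if fs.startswith("nfs"):
        print("WARNING: %s is on %s -- shared failure domain"
              % (BASE_DIR, fs))
    else:
        print("%s is on local filesystem %s" % (BASE_DIR, fs))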

On 04/02/2014 09:04 PM, Alejandro Comisario wrote:
> Hi guys ...
> We have a pretty big OpenStack environment and we use a shared NFS
> mount for the backing file directory ( the famous _base directory
> located at /var/lib/nova/instances/_base ). Due to a human error, the
> backing file used by thousands of guests was deleted, causing those
> guests' filesystems to go read-only in a second.
>
> Until that moment we were convinced that keeping the _base directory
> on shared NFS was right because:
>
> * spawning a new AMI makes it instantly visible to the whole cloud,
> so instances take almost no time to boot regardless of the nova
> region
> * it eases the glance workload
> * it is the easiest to manage: no need to replicate files constantly
> or push bandwidth usage internally
>
> But after this really big issue, and after what it took us to
> recover from it, we started thinking about how to protect against
> this kind of "single point of failure".
> Our first approach these days was to make the NFS share read-only,
> making it impossible for computes ( and humans ) to write to that
> directory, and giving write permission to just one compute, which is
> the one responsible for spawning an instance from a new AMI and
> writing the file to the directory. Still ... the storage keeps being
> the SPOF.
>
> So, we are considering the possibility of keeping the backing files
> LOCAL on every compute ( +1K hosts ) to reduce the failure chances
> to the minimum, obviously with a parallel discussion about what
> technology to use to keep data replicated among computes when a new
> AMI is launched, launch times, performance concerns on compute nodes
> having to store backing files locally, etc.
>
> This made me realize I have a huge community behind OpenStack, so I
> wanted to hear from it:
>
> * what are your thoughts about what happened / what we are thinking
> right now?
> * how do other users manage the backing file ( _base ) directory,
> given all these considerations, on big OpenStack deployments?
>
> I will be thrilled to read other users' experiences and thoughts.
>
> As always, best.
> Alejandro
>
> _______________________________________________
> OpenStack-operators mailing list
> OpenStack-operators at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators



