<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Wed, Mar 13, 2013 at 5:29 PM, Michael Still <span dir="ltr"><<a href="mailto:mikal@stillhq.com" target="_blank">mikal@stillhq.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">On Wed, Mar 13, 2013 at 5:23 PM, Joe Topjian <<a href="mailto:joe.topjian@cybera.ca">joe.topjian@cybera.ca</a>> wrote:<br>


> On Wed, Mar 13, 2013 at 5:12 PM, Michael Still <<a href="mailto:mikal@stillhq.com">mikal@stillhq.com</a>> wrote:<br>

>> On Wed, Mar 13, 2013 at 4:42 PM, Joe Topjian <<a href="mailto:joe.topjian@cybera.ca">joe.topjian@cybera.ca</a>> wrote:<br>

>> > It would, yes, but I think your caveat trumps that idea. Having x nodes<br>
>> > able to work with a shared _base directory is great for saving space and<br>
>> > for using images centrally. As an example, the _base directory of one of<br>
>> > my OpenStack clouds is 650 GB in size. It's currently shared via NFS. If<br>
>> > it were not shared, or used a _base_$host scheme, that would be 650 GB per<br>
>> > compute node. With 10 nodes you're already at 6.5 TB.<br>

>><br>

>> Is that _base directory so large because it's never been cleaned up,<br>
>> though? What sort of maintenance are you performing on it?<br>

><br>

> It's true that I haven't done any maintenance on _base. By my estimate,<br>
> a cleanup wouldn't reclaim enough space to warrant doing one (basically,<br>
> the "benefit of disk space reclaimed" is not yet greater than the "risk of<br>
> accidentally corrupting x users' instances").<br>

<br>

</div>What release of OpenStack are you running? I think you might get<br>
significant benefits from turning cleanup on, so long as you're using<br>
Grizzly [1]. I'd be very, very interested in the results of a lab test.<br></blockquote><div><br></div><div style>I am using Folsom and plan to test Grizzly when it's released.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<br>

Michael<br>

<br>

1: Yes, I know it's not released yet, but if you found a bug now we<br>
could fix it before it hurts everyone else...<br>

</blockquote></div><br>The scenario I ran into had these conditions:</div><div class="gmail_extra"><br></div><div class="gmail_extra">1. The _base directory is on shared storage</div><div class="gmail_extra">2. All instances (one or more) of a certain image or snapshot are running on one compute node</div>

<div class="gmail_extra">3. That compute node is taken down for 10 minutes or so (reboot, maintenance, etc.)</div><div class="gmail_extra">4. That compute node is unable to mark its _base files as being in use since it's offline</div>

<div class="gmail_extra">5. Other compute nodes see that those _base files are not in use and delete them</div><div class="gmail_extra">6. The compute node comes back online, and the instances of the image/snapshot in question are now broken</div>
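<div class="gmail_extra"><br></div><div class="gmail_extra">To make steps 1-6 concrete, here is a toy Python model of them. It is only a sketch under my own assumptions: the timestamp-based "in use" marks, the 300-second threshold, and the names cleanup_pass, simulate, and consult_all_hosts are hypothetical, and none of it is nova's actual image cache manager code. It shows that a pass run only by the nodes that are up deletes a base file whose sole user is down, while a check that also counts instances on offline hosts would keep it.</div>

```python
from dataclasses import dataclass, field


# Toy model of the six-step scenario. The timestamp-based "in use" marks and
# the age threshold are assumptions for illustration, not nova's real logic.
@dataclass
class Node:
    name: str
    online: bool = True
    # Base files used by the instances running on this node (hypothetical).
    instances: set = field(default_factory=set)


def cleanup_pass(nodes, base_dir, now, max_unused_age, consult_all_hosts=False):
    """Run one periodic cleanup pass over a shared _base directory.

    base_dir maps a base file name to the time it was last marked in use.
    Offline nodes skip the pass entirely: they can neither mark nor delete.
    """
    for node in nodes:
        if not node.online:
            continue
        # An online node refreshes the marks on the files its instances use.
        for name in node.instances:
            if name in base_dir:
                base_dir[name] = now
        # By default a node only knows about its own instances. The
        # hypothetical consult_all_hosts flag also counts instances on every
        # other host, including hosts that are currently down.
        in_use = set(node.instances)
        if consult_all_hosts:
            for other in nodes:
                in_use |= other.instances
        # Delete files that have not been marked recently and look unused.
        for name in list(base_dir):
            stale = now - base_dir[name] >= max_unused_age
            if stale and name not in in_use:
                del base_dir[name]


def simulate(consult_all_hosts=False):
    """Replay steps 1-6 and return the base files that survive."""
    compute1 = Node("compute1", instances={"ubuntu-image"})
    compute2 = Node("compute2", instances={"user-snapshot"})
    nodes = [compute1, compute2]
    # Step 1: one shared _base directory; both files last marked at t=0.
    base_dir = {"ubuntu-image": 0, "user-snapshot": 0}
    # Step 2: every instance of user-snapshot runs on compute2.
    # Step 3: compute2 goes down for about ten minutes.
    compute2.online = False
    # Steps 4-5: compute2 cannot refresh its mark, and compute1 runs a pass
    # at t=600 s with a (made-up) 300 s threshold.
    cleanup_pass(nodes, base_dir, now=600, max_unused_age=300,
                 consult_all_hosts=consult_all_hosts)
    # Step 6: compute2 comes back; its instance's base file may be gone.
    compute2.online = True
    return sorted(base_dir)


if __name__ == "__main__":
    print(simulate())                        # ['ubuntu-image']
    print(simulate(consult_all_hosts=True))  # ['ubuntu-image', 'user-snapshot']
```

<div class="gmail_extra">With the node holding the snapshot offline, simulate() leaves only ubuntu-image behind; with consult_all_hosts=True both files survive.</div>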

<div class="gmail_extra"><br></div><div class="gmail_extra">Has that scenario been accounted for or fixed?<br clear="all"><div><br></div><div style>The probability of this happening is low, but it actually happened to me twice before I figured out what was going on. Two users had each created a snapshot and then launched a single instance from it. One compute node went offline due to a hardware failure, and the _base file of one of those snapshots was removed while the node was down.</div>

<div style><br></div><div style>A second compute node was accidentally rebooted (a fat-finger mistake) while someone was looking at the first node, and the same thing happened to the other snapshot's _base file.</div><div style><br></div><div style>Fortunately, only two instances were lost. However, I don't like having to email users saying "oops - your instance is gone" when it's something I could have prevented. </div>

<div style><br></div><div style>Had the compute nodes been shut down deliberately, I could have live-migrated everything off of them first, but since both outages were unplanned, I had no time to react like that.</div>

<div><br></div><div style>Joe</div><div><br></div>-- <br>Joe Topjian<div>Systems Administrator</div><div>Cybera Inc.</div><div><br></div><div><a href="http://www.cybera.ca" target="_blank">www.cybera.ca</a></div><div><br>

</div><div><font color="#666666"><span>Cybera</span><span> is a not-for-profit organization that works to spur and support innovation, for the economic benefit of Alberta, through the use of cyberinfrastructure.</span></font></div>


</div></div>