[OpenStack-Infra] Spot instances for CI

James E. Blair corvus at inaugust.com
Thu Dec 17 22:27:45 UTC 2015


Jeremy Stanley <fungi at yuggoth.org> writes:

> On 2015-12-17 09:24:33 +0000 (+0000), Jean-Daniel Bonnetot wrote:
>> You probably know that the foundation needs more resources for the
>> CI. What do you think about pushing spot instances? It could be a
>> great solution for our CI and make it a smaller step for providers to
>> give resources « when they can ».
>> 
>> https://review.openstack.org/#/c/104883/
>
> The idea, from a high level at least, looks reasonable (though I'm
> not heavily involved with Nova development so don't really know how
> well it fits with their plans). Since our CI is somewhat tolerant of
> nodepool-managed instances dying on their own (because some jobs can
> actually crash the operating system on them, and because cloud
> instances and networks can be unreliable at times), the impact of
> some fraction of job workers getting deleted out from under us
> simply delays test results reporting while those jobs are rerun. If
> it were constant and a significant enough percentage of our
> aggregate quota impacted in this way then it would likely be pretty
> crippling for us, but as we diversify across an increasing number of
> donors the risk there diminishes as well I expect.

This becomes more complex in the multi-node situation, where we count on
having more than one node at a time in the same provider; if one of
those is deleted, we also waste the other(s) tied to it and have to
relaunch them all.  Multi-node tests would, statistically, be more
likely to fail because a spot instance expired, yet they are also more
difficult to schedule.
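
To put a rough number on the "statistically more likely" point, here is
a minimal sketch.  It assumes each node is reclaimed independently with
the same probability during a job, which is a simplification, and the
5% figure is purely illustrative rather than a measurement from any
provider.  The function name is made up for this example.

```python
def job_loss_probability(p_node: float, nodes: int) -> float:
    """Chance that at least one of `nodes` instances is reclaimed mid-job.

    Assumes (hypothetically) that each node is reclaimed independently
    with probability `p_node` over the lifetime of the job.
    """
    if not 0.0 <= p_node <= 1.0:
        raise ValueError("p_node must be between 0 and 1")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    # The job survives only if every node survives.
    return 1.0 - (1.0 - p_node) ** nodes


if __name__ == "__main__":
    # Illustrative reclaim probability, not real data.
    for n in (1, 2, 3, 5):
        print(n, round(job_loss_probability(0.05, n), 4))
```

Under those assumptions the failure chance grows with every node added
to a job, and each failure throws away the work done on all of the
job's nodes, not just the one that was deleted.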

Being able to detect this situation and recover from it would require
some additional complexity across our CI system.  It's "just
programming" but it's also not trivial.  This approach would probably
have a visible impact on developers, and in the case where we
continuously but slowly lose our spot instances, we could end up in a
situation where changes take a long time to complete a test run.

While I think this approach could work, it doesn't feel like a natural
fit to me.  Our tests, and therefore our instances, are actually rather
short-lived.  Rather than using a system where, at 57 minutes into a 60
minute job, the instance is deleted and we need to restart the work, it
would be better to simply let that job complete, let the instance be
naturally deleted, and not replace it.  In other words, a smoother
system for us that may achieve similar results more efficiently would be
one that simply adjusted our quota dynamically.
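
The difference between the two approaches can be sketched as follows.
This is a toy model only: RunningJob, preempt, and drain are
hypothetical names invented for this example and are not part of
nodepool or Nova, and the job durations are made up.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RunningJob:
    elapsed_min: int  # minutes of work already done
    total_min: int    # expected total job length in minutes


def preempt(jobs: List[RunningJob], reclaim: int) -> Tuple[int, int]:
    """Spot-style: delete `reclaim` instances right now.

    Returns (wasted work in minutes, minutes until capacity is returned).
    The first `reclaim` jobs in the list are the ones deleted.
    """
    victims = jobs[:max(0, reclaim)]
    return sum(j.elapsed_min for j in victims), 0


def drain(jobs: List[RunningJob], reclaim: int) -> Tuple[int, int]:
    """Quota-style: let jobs finish, then do not replace their instances.

    Returns (wasted work in minutes, minutes until capacity is returned).
    """
    count = min(max(0, reclaim), len(jobs))
    if count == 0:
        return 0, 0
    remaining = sorted(j.total_min - j.elapsed_min for j in jobs)
    # Capacity is fully returned once the `count`-th job to finish is done.
    return 0, remaining[count - 1]


if __name__ == "__main__":
    jobs = [RunningJob(57, 60), RunningJob(10, 60), RunningJob(30, 60)]
    print("preempt:", preempt(jobs, 1))
    print("drain:  ", drain(jobs, 1))
```

In this model, deleting the instance that is 57 minutes into a 60
minute job throws away 57 minutes of work to return capacity
immediately, while lowering the quota returns the same capacity 3
minutes later with nothing wasted.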

-Jim


