[openstack-dev] [nova] [qa] EC2 status and call for assistance

Sean Dague sean at dague.net
Thu Apr 24 10:51:03 UTC 2014


On 04/24/2014 05:12 AM, Thierry Carrez wrote:
> Michael Still wrote:
>> On Thu, Apr 24, 2014 at 7:39 AM, Joe Gordon <joe.gordon0 at gmail.com> wrote:
>>
>>> So no one is seriously discussing moving EC2 out of nova right now. The
>>> issue is that the EC2 code and tempest tests aren't being maintained and
>>> are slowly rotting. The goal of this thread is to get some volunteers to
>>> work on EC2.
>>>
>>> "I'd like to see if there are any more people interested in keeping these
>>> interfaces functional (by contributing both on the nova and tempest
>>> sides). If so, great!"
>>
>> Sure, but if this code continues being ignored, then I don't see how
>> we have a choice long term.
>>
>> So a few questions:
>>
>>  - who currently ships a private cloud with EC2 enabled?
>>  - who has a public cloud with EC2 enabled?
> 
> In the user survey from October 2013, about 30% of respondents claim to
> have the EC2 API enabled.
> 
> It's not the first time we've made the sad observation that no one actually
> cares enough about the EC2 API to invest the one FTE that would be
> required to maintain it in good shape. That session is up at every
> Design Summit, but every time we get various resource pledges that never
> pan out.
> 
> Now that we have raised the QA bar for hypervisors, plugins and backend
> drivers to stay in mainline code, it's only a matter of time until the
> EC2 API is held to the same standards... I think it's important that we
> keep that API though, so I really hope someone will step up soon.

It's also important to realize that outside of *very* basic stuff, it
doesn't really work very well. For instance, the bug in question that I
was digging into was a simple scenario: boot a compute instance, attach
a volume, detach the volume, and bring down the instance.

It failed at some noticeable rate every day. The issue is that OpenStack
uses a single status for the volume lifecycle, while EC2 uses two. So we
have to collapse our 'attaching' and 'detaching' states into 'in-use'.
We were waiting for 'in-use' before proceeding to the volume detach.
That's not sufficient. There is a second status in EC2 land that tells
you the attachment state. The first fix was to check for:
not in ('attaching', 'detaching'). But it turns out we can't do that,
because the EC2 implementation in Nova never actually sets those states.
It only sets the attachment state to None or 'attached'. So we changed that.
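
To make the race concrete, here is a minimal sketch of one plausible form
of the wait: poll until both the volume status and the attachment state
have settled before detaching. FakeVolume and wait_for_attached are
hypothetical names for illustration, not the actual tempest or Nova code;
the status / update() / attachment_state() names are modeled on boto's
volume object and should be treated as assumptions.

```python
import time


class FakeVolume:
    """Hypothetical stand-in for an EC2 volume; not boto or tempest code."""

    def __init__(self, attach_states):
        # Attachment states reported on successive refreshes.
        self._attach_states = list(attach_states)
        self.status = "in-use"
        self._attach_state = None

    def update(self):
        # Pretend to re-query the API and advance to the next state.
        if self._attach_states:
            self._attach_state = self._attach_states.pop(0)
        return self.status

    def attachment_state(self):
        return self._attach_state


def wait_for_attached(volume, timeout=10.0, interval=0.0):
    """Poll until status is 'in-use' AND attachment state is 'attached'."""
    deadline = time.monotonic() + timeout
    while True:
        volume.update()
        if volume.status == "in-use" and volume.attachment_state() == "attached":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Checking 'in-use' alone would return on the first poll here, which is
exactly the window in which a detach can race the attach.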

All of this was to handle the fact that volumes.detach() with boto on an
unattached volume is a 500 - Unidentified error that creates a stack
trace in n-cpu. And if you get there first, you are clearly going to
leak a volume.

Which means our test will no longer explode and randomly fail unrelated
code changes (Good). However, things are still pretty broken in the
teardown path (Bad). And for normal use there is still a volume leak
path (Bad).
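
On the test side, the detach-on-unattached-volume 500 can be avoided with
a guard like the one sketched below. This is only an illustration under
assumptions: StubVolume and detach_if_attached are hypothetical names, and
the stub merely mimics the failure described above. It does nothing about
the underlying leak in Nova.

```python
class StubVolume:
    """Hypothetical stand-in; detach() fails on an unattached volume."""

    def __init__(self, attach_state):
        self._attach_state = attach_state
        self.detach_calls = 0

    def attachment_state(self):
        return self._attach_state

    def detach(self):
        self.detach_calls += 1
        if self._attach_state != "attached":
            # Mimics the server-side 500 on an unattached volume.
            raise RuntimeError("500: detach on unattached volume")
        self._attach_state = None
        return True


def detach_if_attached(volume):
    """Call detach() only when the attachment state is 'attached'."""
    if volume.attachment_state() != "attached":
        return False  # nothing to detach; skip the call that would 500
    volume.detach()
    return True
```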

This issue has existed for a long time. In the past we just turned off
this test for some period of time because no one seemed willing to dive
in and debug what was actually going on. It took me a linear week to get
to the bottom of this; realistically it was probably about a day or two
of actual work to unwind what was going on and figure out where our race
in the test was.

I don't intend to fix the underlying Nova issue (don't explode and don't
leak), because the EC2 paths in Nova need more than random fixes; they
need some real love. My interest was sorting out why this was impacting
unrelated code changes, and I believe we have a test-case workaround
for that now.

	-Sean

-- 
Sean Dague
http://dague.net
