[openstack-dev] [glance] proposed priorities for Mitaka
Kuvaja, Erno
kuvaja at hpe.com
Mon Sep 14 15:02:59 UTC 2015
> -----Original Message-----
> From: Flavio Percoco [mailto:flavio at redhat.com]
> Sent: Monday, September 14, 2015 1:41 PM
> To: OpenStack Development Mailing List (not for usage questions)
> Subject: Re: [openstack-dev] [glance] proposed priorities for Mitaka
>
> On 14/09/15 08:10 -0400, Doug Hellmann wrote:
> >
> >After having some conversations with folks at the Ops Midcycle a few
> >weeks ago, and observing some of the more recent email threads related
> >to glance, glance-store, the client, and the API, I spent last week
> >contacting a few of you individually to learn more about some of the
> >issues confronting the Glance team. I had some very frank, but I think
> >constructive, conversations with all of you about the issues as you see
> >them. As promised, this is the public email thread to discuss what I
> >found, and to see if we can agree on what the Glance team should be
> >focusing on going into the Mitaka summit and development cycle and how
> >the rest of the community can support you in those efforts.
> >
> >I apologize for the length of this email, but there's a lot to go over.
> >I've identified 2 high priority items that I think are critical for the
> >team to be focusing on starting right away in order to use the upcoming
> >summit time effectively. I will also describe several other issues that
> >need to be addressed but that are less immediately critical. First the
> >high priority items:
> >
> >1. Resolve the situation preventing the DefCore committee from
> > including image upload capabilities in the tests used for trademark
> > and interoperability validation.
> >
> >2. Follow through on the original commitment of the project to
> > provide an image API by completing the integration work with
> > nova and cinder to ensure V2 API adoption.
>
> Hi Doug,
>
> First and foremost, I'd like to thank you for taking the time to dig into these
> issues, and for reaching out to the community seeking information and a
> better understanding of what the real issues are. I can imagine how much
> time you had to dedicate to this and I'm glad you did.
++ Thank you very much for taking the time to do this.
>
> Now, to your email, I very much agree with the priorities you mentioned
> above and I'd like whoever wins Glance's PTL election to bring focus
> back on that.
>
> Please, find some comments in-line for each point:
>
>
> >
> >I. DefCore
> >
> >The primary issue that attracted my attention was the fact that DefCore
> >cannot currently include an image upload API in its interoperability
> >test suite, and therefore we do not have a way to ensure
> >interoperability between clouds for users or for trademark use. The
> >DefCore process has been long, and at times confusing, even to those of
> >us following it sort of closely. It's not entirely surprising that some
> >projects haven't been following the whole time, or aren't aware of
> >exactly what the whole thing means. I have proposed a cross-project
> >summit session for the Mitaka summit to address this need for
> >communication more broadly, but I'll try to summarize a bit here.
>
Looking at how different OpenStack-based public clouds limit, or fully prevent, their users from uploading images to their deployments, I'm not convinced that Image Upload should be included in this definition.
> +1
>
> I think it's quite sad that some projects, especially those considered to be
> part of the `starter-kit:compute`[0], don't follow closely what's going on in
> DefCore. I personally consider this a task PTLs should incorporate in their role
> duties. I'm glad you proposed such a session; I hope it'll help raise awareness
> of this effort and help move things forward on that front.
>
>
> >
> >DefCore is using automated tests, combined with business policies, to
> >build a set of criteria for allowing trademark use. One of the goals of
> >that process is to ensure that all OpenStack deployments are
> >interoperable, so that users who write programs that talk to one cloud
> >can use the same program with another cloud easily. This is a *REST
> >API* level of compatibility. We cannot insert cloud-specific behavior
> >into our client libraries, because not all cloud consumers will use
> >those libraries to talk to the services. Similarly, we can't put the
> >logic in the test suite, because that defeats the entire purpose of
> >making the APIs interoperable. For this level of compatibility to work,
> >we need well-defined APIs, with a long support period, that work the
> >same no matter how the cloud is deployed. We need the entire community
> >to support this effort. From what I can tell, that is going to require
> >some changes to the current Glance API to meet the requirements. I'll
> >list those requirements, and I hope we can discuss them to a degree
> >that ensures everyone understands them. I don't want this email thread
> >to get bogged down in implementation details or API designs, though, so
> >let's try to keep the discussion at a somewhat high level, and leave
> >the details for specs and summit discussions. I do hope you will
> >correct any misunderstandings or misconceptions, because unwinding this
> >as an outside observer has been quite a challenge and it's likely I
> >have some details wrong.
This just reinforces my doubt above. Including upload in the DefCore requirements would probably just shut out many of the public clouds out there. Is that the intention here?
> >
> >As I understand it, there are basically two ways to upload an image to
> >glance using the V2 API today. The "POST" API pushes the image's bits
> >through the Glance API server, and the "task" API instructs Glance to
> >download the image separately in the background. At one point
> >apparently there was a bug that caused the results of the two different
> >paths to be incompatible, but I believe that is now fixed.
> >However, the two separate APIs each have different issues that make
> >them unsuitable for DefCore.
While it is true that there are two ways to get an image into Glance via the V2 Images API, the use cases for the two are completely different. Some (like Flavio) might argue that tasks are an internal-only API, which may well be the case in a private cloud. Others might be willing to expose only tasks to their public cloud users because of the improved processing they allow (antivirus, etc.). And the last group is simply not willing to let their users bring in their own images at all.
Thinking outside the box, the Tasks API should not be included in any core definition, as it is just an interface to _optional_ plugins. Obviously, if there are different classifications, it might be included in some of them.
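To make the difference between the two paths concrete, here is a rough sketch of the request shapes as I understand them from the upstream V2 API reference. Nothing is sent over the wire. The helper names (direct_upload_requests, import_task_request) are made up for illustration, and a given deployment may accept different "input" keys for the import task, which is exactly the problem being discussed.

```python
import json


def direct_upload_requests(name, disk_format, container_format):
    """Describe the direct path: create the image record, then push the bits.

    Hypothetical helper; it only builds request descriptions.
    """
    create_body = {
        "name": name,
        "disk_format": disk_format,
        "container_format": container_format,
    }
    return [
        {
            "method": "POST",
            "path": "/v2/images",
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(create_body),
        },
        {
            # {image_id} is a placeholder for the id returned by the POST.
            "method": "PUT",
            "path": "/v2/images/{image_id}/file",
            "headers": {"Content-Type": "application/octet-stream"},
            "body": None,  # the raw image bytes would be streamed here
        },
    ]


def import_task_request(source_url, disk_format, container_format):
    """Describe the task path: ask Glance to fetch the image in the background.

    The "input" keys follow the upstream docs as I read them; other
    deployments may expect something else.
    """
    task_body = {
        "type": "import",
        "input": {
            "import_from": source_url,
            "import_from_format": disk_format,
            "image_properties": {
                "disk_format": disk_format,
                "container_format": container_format,
            },
        },
    }
    return {
        "method": "POST",
        "path": "/v2/tasks",
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(task_body),
    }
```

The point of laying them side by side: the direct path has a fixed, small set of arguments, while everything interesting in the task path lives inside the free-form "input" object.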
> >
> >The DefCore process relies on several factors when designating APIs for
> >compliance. One factor is the technical direction, as communicated by
> >the contributor community -- that's where we tell them things like "we
> >plan to deprecate the Glance V1 API". In addition to the technical
> >direction, DefCore looks at the deployment history of an API. They do
> >not want to require deploying an API if it is not seen as widely
> >usable, and they look for some level of existing adoption by cloud
> >providers and distributors as an indication that the API is desired
> >and can be successfully used. Because we have multiple upload APIs, the
> >message we're sending on technical direction is weak right now, and so
> >they have focused on deployment considerations to resolve the question.
>
> The task upload process you're referring to is the one that uses the `import`
> task, which allows you to download an image from an external source,
> asynchronously, and import it in Glance. This is the old `copy-from` behavior
> that was moved into a task.
>
> The "fun" thing about this - and I'm sure other folks in the Glance community
> will disagree - is that I don't consider tasks to be a public API. That is to say, I
> would expect tasks to be an internal API used by cloud admins to perform
> some actions (based on its current implementation). Eventually, some of
> these tasks could be triggered from the external API but as background
> operations that are triggered by the well-known public ones and not through
> the task API.
>
> Ultimately, I believe end-users of the cloud simply shouldn't care about what
> tasks are or aren't and more importantly, as you mentioned later in the
> email, tasks make clouds not interoperable. I'd be pissed if my public image
> service would ask me to learn about tasks to be able to use the service.
I'd like to bring another argument here. I think our public Images API should behave consistently regardless of whether tasks are enabled in the deployment, and regardless of which plugins are installed. This means that _if_ we expect Glance upload to work over the POST API, and that endpoint is available in the deployment, I would expect a) my image hash to match the one the cloud returns, b) all or none of the clouds to reject my image if it gets flagged by Vendor X's virus definitions, and c) the image to be bootable across the clouds, provided it's in a supported format. On the other hand, if the vendor tells me that I need to use a cloud-specific task that accepts only OVA-compliant image packages, and that the image will be checked before acceptance, my expectations are quite different, and I would expect all of that to happen outside the standard API, as it's not consistent behavior.
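Expectation a) is something a user can check mechanically. A minimal sketch, assuming the cloud reports an MD5 hex digest in the image record's "checksum" field (which is what Glance does today, as far as I know). The helper names are hypothetical, and image_record stands for whatever dict you decoded from the image's GET response.

```python
import hashlib


def local_md5(path, chunk_size=64 * 1024):
    """Return the MD5 hex digest of a local file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as image_file:
        for chunk in iter(lambda: image_file.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, image_record):
    """True only if the cloud reported a checksum and it equals the local one.

    Assumption: image_record["checksum"] is an MD5 hex digest of the
    stored image data.
    """
    reported = image_record.get("checksum")
    return reported is not None and reported == local_md5(path)
```

If a deployment rewrites or converts the image during import, this check would fail even though the upload "worked", which is the kind of inconsistency I mean.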
>
> Long story short, I believe the only upload API that should be considered is
> the one that uses HTTP and, eventually, to bring compatibility with v1 as far
> as the copy-from behavior goes, Glance could bring back that behavior on
> top of the task (just dropping this here for the sake of discussion and
> interoperability).
>
>
> >The POST API is enabled in many public clouds, but not consistently.
> >In some clouds like HP, a tenant requires special permission to use the
> >API. At least one provider, Rackspace, has disabled the API entirely.
> >This is apparently due to what seems like a fair argument that
> >uploading the bits directly to the API service presents a possible
> >denial of service vector. Without arguing the technical merits of that
> >decision, the fact remains that without a strong consensus from
> >deployers that the POST API should be publicly and consistently
> >available, it does not meet the requirements to be used for DefCore
> >testing.
>
> This is definitely unfortunate. I believe a good step forward for this
> discussion would be to create a list of issues related to uploading images and
> see how those issues can be addressed. The result from that work might be
> that it's not recommended to make that endpoint public but again, without
> going through the issues, it'll be hard to understand how we can improve this
> situation. I expect most of these issues to have a security impact.
>
++. Regardless of how helpful that discussion would be, I don't think it's realistic to expect that work to be prioritized so highly that the majority of those issues would be solved within a cycle, alongside the priorities at the top of this e-mail.
>
> >The task API is also not widely deployed, so its adoption for DefCore
> >is problematic. If we provide a clear technical direction that this API
> >is preferred, that may overcome the lack of adoption, but the current
> >task API seems to have technical issues that make it fundamentally
> >unsuitable for DefCore consideration. While the task API addresses the
> >problem of a denial of service, and includes useful features such as
> >processing of the image during import, it is not strongly enough
> >defined in its current form to be interoperable.
> >Because it's a generic API, the caller must know how to fully construct
> >each task, and know what task types are supported in the first place.
> >There is only one "import" task type supported in the Glance code
> >repository right now, but it is not clear that "import"
> >always uses the same arguments, or interprets them in the same way.
> >For example, the upstream documentation [1] describes a task that
> >appears to use a URL as source, while the Rackspace documentation [2]
> >describes a task that appears to take a swift storage location.
> >I wasn't able to find JSONSchema validation for the "input" blob
> >portion of the task in the code [3], though that may happen down inside
> >the task implementation itself somewhere.
>
>
> The above sounds pretty accurate as there's currently just 1 flow that can be
> triggered (the import flow) and that accepts an input, which is a json. As I
> mentioned above, I don't believe tasks should be part of the public API and
> this is yet another reason why I think so. The tasks API is not well defined as
> there's currently no good way to define the expected input in a backwards
> compatible way and to provide all the required validation.
>
> I like having tasks in Glance, despite my comments above - but I like them for
> cloud usage and not public usage.
>
> As far as Rackspace's docs/endpoint goes, I'd assume this is an error in their
> documentation since Glance currently doesn't allow[0] for swift URLs to be
> imported (not even in juno[1]).
>
> [0] http://git.openstack.org/cgit/openstack/glance/tree/glance/common/scripts/utils.py#n84
> [1] http://git.openstack.org/cgit/openstack/glance/tree/glance/common/scripts/utils.py?h=stable/juno#n83
>
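For the sake of discussion, here is roughly what a strictly defined "input" check for the import task could look like. This is only a sketch: the three required keys follow the upstream task documentation as I read it, while the allowed URL schemes and the function name validate_import_input are my assumptions, not Glance's actual validation code.

```python
from urllib.parse import urlparse

# Keys the upstream import task appears to expect in its "input" object.
REQUIRED_KEYS = ("import_from", "import_from_format", "image_properties")

# Assumption for illustration: only plain HTTP(S) sources are accepted.
ALLOWED_SCHEMES = ("http", "https")


def validate_import_input(task_input):
    """Return a list of problems with an import task's input (empty if OK)."""
    if not isinstance(task_input, dict):
        return ["input must be a JSON object"]

    errors = []
    for key in REQUIRED_KEYS:
        if key not in task_input:
            errors.append("missing required key: %s" % key)

    source = task_input.get("import_from")
    if isinstance(source, str):
        scheme = urlparse(source).scheme
        if scheme not in ALLOWED_SCHEMES:
            errors.append("unsupported source scheme: %r" % scheme)
    elif source is not None:
        errors.append("import_from must be a string")

    properties = task_input.get("image_properties")
    if properties is not None and not isinstance(properties, dict):
        errors.append("image_properties must be a JSON object")

    return errors
```

With a check like this published as part of the API, a caller would at least know up front that a swift location is not a valid source, instead of finding out per deployment.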
> >Tasks also come from plugins, which may be installed differently based
> >on the deployment. This is an interesting approach to creating API
> >extensions, but isn't discoverable enough to write interoperable tools
> >against. Most of the other projects are starting to move away from
> >supporting API extensions at all because of interoperability concerns
> >they introduce. Deployers should be able to configure their clouds to
> >perform well, but not to behave in fundamentally different ways.
> >Extensions are just that, extensions. We can't rely on them for
> >interoperability testing.
>
> This is, indeed, an interesting interpretation of what tasks are for.
> I'd probably just blame us (Glance team) for not communicating properly
> what tasks are meant to be. I don't believe tasks are a way to extend the
> *public* API and I'd be curious to know if others see it that way. I fully agree
> that just breaks interoperability and as I've mentioned a couple of times in
> this reply already, I don't even think tasks should be part of the public API.
Hmm, that's exactly how I have seen it: plugins that can be provided to expand the standard functionality. I totally agree that these should not be relied on for interoperability. I've always assumed that this is also the reason tasks have not had much focus, as we've prioritized the actual API functionality and stability over its extensibility.
>
> But again, very poor job communicating so[0]. Nonetheless, for the sake of
> providing enough information about tasks and sources to read from, I'd also
> like to point out the original blueprint[1], some discussions during the
> havana's summit[2], the wiki page for tasks[3] and a patch I just reviewed
> today (thanks Brian) that introduces docs for tasks[4]. These links show
> already some differences in what tasks are.
>
> [0] http://git.openstack.org/cgit/openstack/glance/tree/etc/policy.json?h=stable/juno#n28
> [1] https://blueprints.launchpad.net/glance/+spec/async-glance-workers
> [2] https://etherpad.openstack.org/p/havana-glance-requirements
> [3] https://wiki.openstack.org/wiki/Glance-tasks-api
> [4] https://review.openstack.org/#/c/220166/
>
> >
> >There is a lot of fuzziness around exactly what is supported for image
> >upload, both in the documentation and in the minds of the developers
> >I've spoken to this week, so I'd like to take a step back and try to
> >work through some clear requirements, and then we can have folks
> >familiar with the code help figure out if we have a real issue, if a
> >minor tweak is needed, or if things are good as they stand today and
> >it's all a misunderstanding.
> >
> >1. We need a strongly defined and well documented API, with arguments
> > that do not change based on deployment choices. The behind-the-scenes
> > behaviors can change, but the arguments provided by the caller
> > must be the same and the responses must look the same. The
> > implementation can run as a background task rather than receiving
> > the full image directly, but the current task API is too vaguely
> > defined to meet this requirement, and IMO we need an entry point
> > focused just on uploading or importing an image.
> >
> >2. Glance cannot require having a Swift deployment. It's not clear
> > whether this is actually required now, so if it's not then we're
> > in a good state.
>
> This is definitely not the case. Glance doesn't require any specific store to be
> deployed. It does require at least one other than the http one (because it
> doesn't support write operations).
>
> > It's fine to provide an optional way to take
> > advantage of Swift if it is present, but it cannot be a required
> > component. There are three separate trademark "programs", with
> > separate policies attached to them. There is an umbrella "Platform"
> > program that is intended to include all of the TC approved release
> > projects, such as nova, glance, and swift. However, there is
> > also a separate "Compute" program that is intended to include
> > Nova, Glance, and some others but *not* Swift. This is an important
> > distinction, because there are many use cases both for distributors
> > and public cloud providers that do not incorporate Swift for a
> > variety of reasons. So, we can't have Glance's primary configuration
> > require Swift and we need to provide tests for the DefCore team
> > that run without Swift. Duplicate tests that do use Swift are
> > fine, and might be used for "Platform" compliance tests.
It really saddens me, and shows how narrowly focused we have been, that point 2 even needs discussion.
> >
> >3. We need an integration test suite in tempest that fully exercises
> > the public image API by talking directly to Glance. This applies
> > to the entire API, not just image uploads. It's fine to have
> > duplicate tests using the proxy in Nova if the Nova team wants
> > those, but DefCore should be using tests that talk directly to
> > the service that owns each feature, without relying on any
> > proxying. We've already missed the chance to deal with this in
> > the current DefCore definition, which uses image-related tests
> > that talk to the Nova proxy [4][5], so we'll have to maintain
> > the proxy for the required deprecation period. But we won't be
> > able to consider removing that proxy until we provide alternate
> > tests for those features that speak directly to Glance. We may
> > have some coverage already, but I wasn't able to find a task-based
> > image upload test and there is no "image create" mentioned in
> > the current draft of capabilities being reviewed [6]. There may
> > be others missing, so someone more familiar with the feature set
> > of Glance should do an audit and document what tests are needed
> > so the work can be split up.
> >
>
> +1 This should become one of the top priorities for Mitaka (as you
> mentioned at the beginning of this email).
But I hope DefCore does not treat this tempest integration test suite as de facto required functionality, as those two should be different things.
>
> >4. Once identified and incorporated into the DefCore capabilities
> > set, the selected API needs to remain stable for an extended
> > period of time and follow the deprecation timelines defined by
> > DefCore. That has implications for the V3 API currently in
> > development to turn Glance into a more generic artifacts service.
> > There are a lot of ways to handle those implications, and no
> > choice needs to be made today, so I only mention it to make sure
> > it's clear that (a) we must get V2 into shape for DefCore and
> > (b) when that happens, we will need to maintain V2 even if V3
> > is finished. We won't be able to deprecate V2 quickly.
This is absolutely reasonable and pretty much sums up what our near-future focus should be going forward.
> >
> >Now, it's entirely possible that we can meet all of those requirements
> >today, and that would be great. If that's the case, then the problem is
> >just one of clear communication and documentation. I think there's
> >probably more work to be done than that, though.
>
>
> There's clearly a communication problem. The fact that this very email has
> been sent out is a sign of that. However, I'd like to say, in a very optimistic
> way, that Glance is not so far away from the expected status. There are things
> to fix, other things to clarify, tons to discuss but, IMHO, besides the tempest
> tests and DefCore, the most critical one is the one you mentioned in the
> following section.
Not being such an optimistic person, I think we're just a bit lost, but still fairly close.
>
> >
> >[1] http://developer.openstack.org/api-ref-image-v2.html#os-tasks-v2
> >[2] http://docs.rackspace.com/images/api/v2/ci-devguide/content/POST_importImage_tasks_Image_Task_Calls.html#d6e4193
> >[3] http://git.openstack.org/cgit/openstack/glance/tree/glance/api/v2/tasks.py
> >[4] http://git.openstack.org/cgit/openstack/defcore/tree/2015.05.json#n70
> >[5] http://git.openstack.org/cgit/openstack/defcore/tree/doc/source/guidelines/2015.07.rst
> >[6] https://review.openstack.org/#/c/213353/
> >
> >II. Complete Cinder and Nova V2 Adoption
> >
> >The Glance team originally committed to providing an Image Service API.
> >Besides our end users, both Cinder and Nova consume that API.
> >The shift from V1 to V2 has been a long road. We're far enough along,
> >and the V1 API has enough issues preventing us from using it for
> >DefCore, that we should push ahead and complete the V2 adoption. That
> >will let us properly deprecate and drop V1 support, and concentrate on
> >maintaining V2 for the necessary amount of time.
> >
> >There are a few specs for the work needed in Nova, but that work didn't
> >land in Liberty for a variety of reasons. We need resources from both
> >the Glance and Nova teams to work together to get this done as early as
> >possible in Mitaka to ensure that it actually lands this time. We
> >should be able to schedule a joint session at the summit to have the
> >conversation, and we need to take advantage of that opportunity to
> >ensure the details are fully resolved so that everyone understands the
> >plan.
>
> Super important point. I'd like people replying to this email to focus on what
> we can do next and not why this hasn't been done. The latter will take us
> down a path that won't be useful at all as it'll just waste everyone's time.
++
>
> That said, I fully agree with the above. Last time we talked, John Garbutt and
> Jay Pipes, from the nova team, raised their hands to help out with this effort.
> From Glance's side, Fei Long Wang and myself were working on the
> implementation. To help move this forward and to follow on the latest
> plan, which allows this migration to be smoother than our original plan, we
> need folks from Glance to raise their hand.
>
> If I'm not elected PTL, I'm more than happy to help out here but we need
> someone that can commit to the above right now and we'll likely need a
> team of at least 2 people to help move this forward in early Mitaka.
>
>
> >The work in Cinder is more complete, but may need to be reviewed to
> >ensure that it is using the API correctly, safely, and efficiently.
> >Again, this is a joint effort between the Glance and Cinder teams to
> >identify any issues and work out a resolution.
> >
> >Part of this work will also be to audit the Glance API documentation,
> >to ensure it accurately reflects what the APIs expect to receive and
> >return. There are reportedly at least a few cases where things are out
> >of sync right now. This will require some coordination with the
> >Documentation team.
> >
> >
> >Those are the two big priorities I see, based on things the rest of the
> >community needs from the team and existing commitments that have been
> >made. There are some other things that should also be addressed.
> >
> >
> >III. Security audits & bug fixes
> >
> >Five of 18 recent security reports were related to Glance [7]. It's not
> >surprising, given recent resource constraints, that addressing these
> >has been a challenge. Still, these should be given high priority.
> >
> >[7] https://security.openstack.org/search.html?q=glance&check_keywords=yes&area=default
I'm not sure whether I'm more ashamed or happy about this. The fact that someone is actually looking into it and working on these issues is nice, though.
>
>
> +1 FWIW, we're in the process of growing Glance's security team. But
> it's clear from the above that there need to be quicker replies to security
> issues.
>
> >IV. Sorting out the glance-store question
> >
> >This was perhaps the most confusing thing I learned about this week.
> >The perception outside of the Glance team is that the library is meant
> >to be used by Nova and Cinder to communicate directly with the image
> >store, bypassing the REST API, to improve performance in several cases.
> >I know the Cinder team is especially interested in some sort of
> >interface for manipulating images inside the storage system without
> >having to download them to make copies (for RBD and other systems that
> >support CoW natively).
>
> Correct, the above was one of the triggers for this effort and I like to think
> it's still one of the main drivers. There are other fancier things that could be
> done in the future assuming the library's API is refactored in a way that such
> features can be implemented.[0]
>
> [0] https://review.openstack.org/#/c/188050/
>
> >That doesn't seem to be
> >what the library is actually good for, though, since most of the Glance
> >core folks I talked to thought it was really a caching layer.
> >This discrepancy in what folks wanted vs. what they got may explain
> >some of the heated discussions in other email threads.
>
> It's strange that some folks think of it as a caching layer. I believe one of the
> reasons there's such a discrepancy is because not enough effort has been put
> into the refactor this library requires. The reason this library requires such a
> refactor is that it came out from the old `glance/store` code which was very
> specific to Glance's internal use.
>
> The mistake here could be that the library should've been refactored
> *before* adopting it in Glance.
>
> >
> >Frankly, given the importance of the other issues, I recommend leaving
> >glance-store standalone this cycle. Unless the work for dealing with
> >priorities I and II is made *significantly* easier by not having a
> >library, the time and energy it will take to re-integrate it with the
> >Glance service seems like a waste of limited resources.
> >The time to even discuss it may be better spent on the planning work
> >needed. That said, if the library doesn't provide the features its
> >users were expecting, it may be better to fold it back in and create a
> >different library with a better understanding of the requirements at
> >some point. The path to take is up to the Glance team, of course, but
> >we're already down far enough on the priority list that I think we'll
> >be lucky to finish the preceding items this cycle.
>
I don't think we should put too much effort into this, given that we do not even have agreement within the team on what the motivators are.
>
> I don't think merging glance-store back into Glance will help with any of the
> priorities mentioned in this thread. If anything, refactoring the API might help
> with future work that could come after the v1 -> v2 migration is complete.
>
Well, it would close some discussions that have been causing confusion lately, but I do agree it might not be worth it just now.
> >
> >
> >Those are the development priorities I was able to identify in my
> >interviews this week, and there is one last thing the team needs to do
> >this cycle: Recruit more contributors.
> >
> >Almost every current core contributor I spoke with this week indicated
> >that their time was split between another project and Glance. Often
> >higher priority had to be given, understandably, to internal product
> >work. That's the reality we work in, and everyone feels the same
> >pressures to some degree. One way to address that pressure is to bring
> >in help. So, we need a recruiting drive to find folks willing to
> >contribute code and reviews to the project to keep the team healthy. I
> >listed this item last because if you've made it this far you should see
> >just how much work the team has ahead. We're a big community, and I'm
> >confident that we'll be able to find help for the Glance team, but it
> >will require mentoring and education to bring people up to speed to
> >make them productive.
>
I'm almost sad to say it, but I'm not really convinced that our issues are due to a lack of manpower. Obviously any help is welcome to improve the current situation, but I think this discussion is extremely important to have before we take in the whole crowd that wants to be part of developing Glance. ;)
> Fully agree here as well. However, I also believe that the fact that some
> efforts have gone to the wrong tasks has taken Glance to the situation it is
> in today. More help is welcomed and required but a good strategy is more
> important right now.
>
> FWIW, I agree that our focus has gone to different things and this has taken us
> to the status you mentioned above. More importantly, it's postponed some
> important tasks. However, I don't believe Glance is completely broken - I
> know you are not saying this but I'd like to mention it - and I certainly believe
> we can bring it back to a good state faster than expected, but I'm known for
> being a bit optimistic sometimes.
>
> In this reply I was hard on us (Glance team), because I tend to be hard on
> myself and to dig deep into the things that are not working well. Many times
> I do this based on the feedback provided by others, which I personally value
> **a lot**. Unfortunately, I have to say that there hasn't been enough
> feedback about these issues until now. There was Mike's email[0] where I
> explicitly asked the community to speak up. This is to say that I appreciate
> the time you've taken to dig into this a lot and to encourage folks to *always*
> speak up and reach out through every *public* medium possible.
>
> No one can fix rumors, we can fix issues, though.
>
> Thanks again and let's all work together to improve this situation, Flavio
All of the above is just so easy to agree with!
>
> [0] http://lists.openstack.org/pipermail/openstack-dev/2015-August/071971.html
>
> --
> @flaper87
> Flavio Percoco
- Erno (jokke) Kuvaja