[openstack-dev] [glance] proposed priorities for Mitaka

Doug Hellmann doug at doughellmann.com
Mon Sep 14 19:51:24 UTC 2015


Excerpts from Flavio Percoco's message of 2015-09-14 14:41:00 +0200:
> On 14/09/15 08:10 -0400, Doug Hellmann wrote:
> >
> >After having some conversations with folks at the Ops Midcycle a
> >few weeks ago, and observing some of the more recent email threads
> >related to glance, glance-store, the client, and the API, I spent
> >last week contacting a few of you individually to learn more about
> >some of the issues confronting the Glance team. I had some very
> >frank, but I think constructive, conversations with all of you about
> >the issues as you see them. As promised, this is the public email
> >thread to discuss what I found, and to see if we can agree on what
> >the Glance team should be focusing on going into the Mitaka summit
> >and development cycle and how the rest of the community can support
> >you in those efforts.
> >
> >I apologize for the length of this email, but there's a lot to go
> >over. I've identified 2 high priority items that I think are critical
> >for the team to be focusing on starting right away in order to use
> >the upcoming summit time effectively. I will also describe several
> >other issues that need to be addressed but that are less immediately
> >critical. First the high priority items:
> >
> >1. Resolve the situation preventing the DefCore committee from
> >   including image upload capabilities in the tests used for trademark
> >   and interoperability validation.
> >
> >2. Follow through on the original commitment of the project to
> >   provide an image API by completing the integration work with
> >   nova and cinder to ensure V2 API adoption.
> 
> Hi Doug,
> 
> First and foremost, I'd like to thank you for taking the time to dig
> into these issues, and for reaching out to the community seeking
> information and a better understanding of what the real issues are. I
> can imagine how much time you had to dedicate to this and I'm glad you
> did.
> 
> Now, to your email, I very much agree with the priorities you
> mentioned above and I'd like whoever wins Glance's PTL election to
> bring the focus back to that.
> 
> Please, find some comments in-line for each point:
> 
> >
> >I. DefCore
> >
> >The primary issue that attracted my attention was the fact that
> >DefCore cannot currently include an image upload API in its
> >interoperability test suite, and therefore we do not have a way to
> >ensure interoperability between clouds for users or for trademark
> >use. The DefCore process has been long, and at times confusing,
> >even to those of us following it sort of closely. It's not entirely
> >surprising that some projects haven't been following the whole time,
> >or aren't aware of exactly what the whole thing means. I have
> >proposed a cross-project summit session for the Mitaka summit to
> >address this need for communication more broadly, but I'll try to
> >summarize a bit here.
> 
> +1
> 
> I think it's quite sad that some projects, especially those considered
> to be part of the `starter-kit:compute`[0], don't follow closely
> what's going on in DefCore. I personally consider this a task PTLs
> should incorporate in their role duties. I'm glad you proposed such a
> session; I hope it'll help raise awareness of this effort and
> help move things forward on that front.

Until fairly recently a lot of the discussion was around process
and priorities for the DefCore committee. Now that those things are
settled, and we have some approved policies, it's time to engage
more fully.  I'll be working during Mitaka to improve the two-way
communication.

> 
> >
> >DefCore is using automated tests, combined with business policies,
> >to build a set of criteria for allowing trademark use. One of the
> >goals of that process is to ensure that all OpenStack deployments
> >are interoperable, so that users who write programs that talk to
> >one cloud can use the same program with another cloud easily. This
> >is a *REST API* level of compatibility. We cannot insert cloud-specific
> >behavior into our client libraries, because not all cloud consumers
> >will use those libraries to talk to the services. Similarly, we
> >can't put the logic in the test suite, because that defeats the
> >entire purpose of making the APIs interoperable. For this level of
> >compatibility to work, we need well-defined APIs, with a long support
> >period, that work the same no matter how the cloud is deployed. We
> >need the entire community to support this effort. From what I can
> >tell, that is going to require some changes to the current Glance
> >API to meet the requirements. I'll list those requirements, and I
> >hope we can discuss them to a degree that ensures everyone understands
> >them. I don't want this email thread to get bogged down in
> >implementation details or API designs, though, so let's try to keep
> >the discussion at a somewhat high level, and leave the details for
> >specs and summit discussions. I do hope you will correct any
> >misunderstandings or misconceptions, because unwinding this as an
> >outside observer has been quite a challenge and it's likely I have
> >some details wrong.
> >
> >As I understand it, there are basically two ways to upload an image
> >to glance using the V2 API today. The "POST" API pushes the image's
> >bits through the Glance API server, and the "task" API instructs
> >Glance to download the image separately in the background. At one
> >point apparently there was a bug that caused the results of the two
> >different paths to be incompatible, but I believe that is now fixed.
> >However, the two separate APIs each have different issues that make
> >them unsuitable for DefCore.
> >
> >The DefCore process relies on several factors when designating APIs
> >for compliance. One factor is the technical direction, as communicated
> >by the contributor community -- that's where we tell them things
> >like "we plan to deprecate the Glance V1 API". In addition to the
> >technical direction, DefCore looks at the deployment history of an
> >API. They do not want to require deploying an API if it is not seen
> >as widely usable, and they look for some level of existing adoption
> >by cloud providers and distributors as an indication that the
> >API is desired and can be successfully used. Because we have multiple
> >upload APIs, the message we're sending on technical direction is
> >weak right now, and so they have focused on deployment considerations
> >to resolve the question.
> 
> The task upload process you're referring to is the one that uses the
> `import` task, which allows you to download an image from an external
> source, asynchronously, and import it in Glance. This is the old
> `copy-from` behavior that was moved into a task.
> 
> The "fun" thing about this - and I'm sure other folks in the Glance
> community will disagree - is that I don't consider tasks to be a
> public API. That is to say, I would expect tasks to be an internal API
> used by cloud admins to perform some actions (based on its current
> implementation). Eventually, some of these tasks could be triggered
> from the external API but as background operations that are triggered
> by the well-known public ones and not through the task API.

Does that mean it's more of an "admin" API?

> 
> Ultimately, I believe end-users of the cloud simply shouldn't care
> about what tasks are or aren't and more importantly, as you mentioned
> later in the email, tasks make clouds not interoperable. I'd be pissed
> if my public image service would ask me to learn about tasks to be
> able to use the service.

It would be OK if a public API set up to do a specific task returned a
task ID that could be used with a generic task API to check status, etc.
So the idea of tasks isn't completely bad, it's just too vague as it's
exposed right now.
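
As a rough illustration of that shape, here is a minimal sketch in Python.
It is a toy in-memory model, not Glance code: the names ImageService,
import_image, get_task, and run_pending are hypothetical, and the argument
checks are assumptions chosen only to show a narrowly defined entry point
paired with a generic status lookup.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class Task:
    """Status record for one background operation (hypothetical shape)."""
    id: str
    type: str
    status: str = "pending"
    result: dict = field(default_factory=dict)


class ImageService:
    """Toy in-memory stand-in for an image service; not Glance code."""

    def __init__(self):
        self._tasks = {}
        self._ids = itertools.count(1)

    def import_image(self, name, source_url):
        """Purpose-built entry point: fixed arguments, validated up front."""
        if not isinstance(name, str) or not name:
            raise ValueError("name must be a non-empty string")
        if not isinstance(source_url, str) or not source_url.startswith(
            ("http://", "https://")
        ):
            raise ValueError("source_url must be an http(s) URL")
        task = Task(id=f"task-{next(self._ids)}", type="import")
        task.result = {"image_name": name, "source_url": source_url}
        self._tasks[task.id] = task
        # The caller only ever sees an opaque task ID.
        return task.id

    def get_task(self, task_id):
        """Generic status lookup; the same call works for any task type."""
        return self._tasks[task_id]

    def run_pending(self):
        """Pretend a background worker finished every pending task."""
        for task in self._tasks.values():
            if task.status == "pending":
                task.status = "success"


service = ImageService()
task_id = service.import_image("cirros", "https://example.com/cirros.img")
print(service.get_task(task_id).status)
service.run_pending()
print(service.get_task(task_id).status)
```

The point of the split is that the caller never has to construct a
free-form task "input" blob: the specific call defines and checks its own
arguments, and the task ID is only used for polling.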

> Long story short, I believe the only upload API that should be
> considered is the one that uses HTTP and, eventually, to bring
> compatibility with v1 as far as the copy-from behavior goes, Glance
> could bring back that behavior on top of the task (just dropping this
> here for the sake of discussion and interoperability).
> 
> >The POST API is enabled in many public clouds, but not consistently.
> >In some clouds like HP, a tenant requires special permission to use
> >the API. At least one provider, Rackspace, has disabled the API
> >entirely. This is apparently due to what seems like a fair argument
> >that uploading the bits directly to the API service presents a
> >possible denial of service vector. Without arguing the technical
> >merits of that decision, the fact remains that without a strong
> >consensus from deployers that the POST API should be publicly and
> >consistently available, it does not meet the requirements to be
> >used for DefCore testing.
> 
> This is definitely unfortunate. I believe a good step forward for this
> discussion would be to create a list of issues related to uploading
> images and see how those issues can be addressed. The result from that
> work might be that it's not recommended to make that endpoint public
> but again, without going through the issues, it'll be hard to
> understand how we can improve this situation. I expect most of these
> issues to have a security impact.

A report like that would be good to have. Can someone on the Glance team
volunteer to put it together?

> 
> >The task API is also not widely deployed, so its adoption for DefCore
> >is problematic. If we provide a clear technical direction that this
> >API is preferred, that may overcome the lack of adoption, but the
> >current task API seems to have technical issues that make it
> >fundamentally unsuitable for DefCore consideration. While the task
> >API addresses the problem of a denial of service, and includes
> >useful features such as processing of the image during import, it
> >is not strongly enough defined in its current form to be interoperable.
> >Because it's a generic API, the caller must know how to fully
> >construct each task, and know what task types are supported in the
> >first place. There is only one "import" task type supported in the
> >Glance code repository right now, but it is not clear that "import"
> >always uses the same arguments, or interprets them in the same way.
> >For example, the upstream documentation [1] describes a task that
> >appears to use a URL as source, while the Rackspace documentation [2]
> >describes a task that appears to take a swift storage location.
> >I wasn't able to find JSONSchema validation for the "input" blob
> >portion of the task in the code [3], though that may happen down
> >inside the task implementation itself somewhere.
> 
> 
> The above sounds pretty accurate as there's currently just 1 flow that
> can be triggered (the import flow) and that accepts an input, which is
> a json. As I mentioned above, I don't believe tasks should be part of
> the public API and this is yet another reason why I think so. The
> tasks API is not well defined as there's, currently, no good way to
> define the expected input in a backwards compatible way and to provide
> all the required validation.
> 
> I like having tasks in Glance, despite my comments above - but I like
> them for cloud usage and not public usage.
> 
> As far as Rackspace's docs/endpoint goes, I'd assume this is an error
> in their documentation since Glance currently doesn't allow[0] for
> swift URLs to be imported (not even in juno[1]).
> 
> [0] http://git.openstack.org/cgit/openstack/glance/tree/glance/common/scripts/utils.py#n84
> [1] http://git.openstack.org/cgit/openstack/glance/tree/glance/common/scripts/utils.py?h=stable/juno#n83
> 
> >Tasks also come from plugins, which may be installed differently
> >based on the deployment. This is an interesting approach to creating
> >API extensions, but isn't discoverable enough to write interoperable
> >tools against. Most of the other projects are starting to move away
> >from supporting API extensions at all because of interoperability
> >concerns they introduce. Deployers should be able to configure their
> >clouds to perform well, but not to behave in fundamentally different
> >ways. Extensions are just that, extensions. We can't rely on them
> >for interoperability testing.
> 
> This is, indeed, an interesting interpretation of what tasks are for.
> I'd probably just blame us (Glance team) for not communicating
> properly what tasks are meant to be. I don't believe tasks are a way
> to extend the *public* API and I'd be curious to know if others see it
> that way. I fully agree that just breaks interoperability and as I've
> mentioned a couple of times in this reply already, I don't even think
> tasks should be part of the public API.

Whether or not they are intended to be an extension mechanism, they
effectively are right now, as far as I can tell.

> 
> But again, very poor job communicating so[0]. Nonetheless, for the
> sake of providing enough information about tasks and sources to read
> from, I'd also like to point out the original blueprint[1], some
> discussions during the Havana summit[2], the wiki page for tasks[3]
> and a patch I just reviewed today (thanks Brian) that introduces docs
> for tasks[4]. These links show already some differences in what tasks
> are.
> 
> [0] http://git.openstack.org/cgit/openstack/glance/tree/etc/policy.json?h=stable/juno#n28
> [1] https://blueprints.launchpad.net/glance/+spec/async-glance-workers
> [2] https://etherpad.openstack.org/p/havana-glance-requirements
> [3] https://wiki.openstack.org/wiki/Glance-tasks-api
> [4] https://review.openstack.org/#/c/220166/
> 
> >
> >There is a lot of fuzziness around exactly what is supported for
> >image upload, both in the documentation and in the minds of the
> >developers I've spoken to this week, so I'd like to take a step
> >back and try to work through some clear requirements, and then we
> >can have folks familiar with the code help figure out if we have a
> >real issue, if a minor tweak is needed, or if things are good as
> >they stand today and it's all a misunderstanding.
> >
> >1. We need a strongly defined and well documented API, with arguments
> >   that do not change based on deployment choices. The behind-the-scenes
> >   behaviors can change, but the arguments provided by the caller
> >   must be the same and the responses must look the same. The
> >   implementation can run as a background task rather than receiving
> >   the full image directly, but the current task API is too vaguely
> >   defined to meet this requirement, and IMO we need an entry point
> >   focused just on uploading or importing an image.
> >
> >2. Glance cannot require having a Swift deployment. It's not clear
> >   whether this is actually required now, so if it's not then we're
> >   in a good state.
> 
> This is definitely not the case. Glance doesn't require any specific
> store to be deployed. It does require at least one other than the http
> one (because it doesn't support write operations).
> 
> > It's fine to provide an optional way to take
> >   advantage of Swift if it is present, but it cannot be a required
> >   component. There are three separate trademark "programs", with
> >   separate policies attached to them. There is an umbrella "Platform"
> >   program that is intended to include all of the TC approved release
> >   projects, such as nova, glance, and swift. However, there is
> >   also a separate "Compute" program that is intended to include
> >   Nova, Glance, and some others but *not* Swift. This is an important
> >   distinction, because there are many use cases both for distributors
> >   and public cloud providers that do not incorporate Swift for a
> >   variety of reasons. So, we can't have Glance's primary configuration
> >   require Swift and we need to provide tests for the DefCore team
> >   that run without Swift. Duplicate tests that do use Swift are
> >   fine, and might be used for "Platform" compliance tests.
> >
> >3. We need an integration test suite in tempest that fully exercises
> >   the public image API by talking directly to Glance. This applies
> >   to the entire API, not just image uploads. It's fine to have
> >   duplicate tests using the proxy in Nova if the Nova team wants
> >   those, but DefCore should be using tests that talk directly to
> >   the service that owns each feature, without relying on any
> >   proxying. We've already missed the chance to deal with this in
> >   the current DefCore definition, which uses image-related tests
> >   that talk to the Nova proxy [4][5], so we'll have to maintain
> >   the proxy for the required deprecation period. But we won't be
> >   able to consider removing that proxy until we provide alternate
> >   tests for those features that speak directly to Glance. We may
> >   have some coverage already, but I wasn't able to find a task-based
> >   image upload test and there is no "image create" mentioned in
> >   the current draft of capabilities being reviewed [6]. There may
> >   be others missing, so someone more familiar with the feature set
> >   of Glance should do an audit and document what tests are needed
> >   so the work can be split up.
> >
> 
> +1 This should become one of the top priorities for Mitaka (as you
> mentioned at the beginning of this email).
> 
> >4. Once identified and incorporated into the DefCore capabilities
> >   set, the selected API needs to remain stable for an extended
> >   period of time and follow the deprecation timelines defined by
> >   DefCore.  That has implications for the V3 API currently in
> >   development to turn Glance into a more generic artifacts service.
> >   There are a lot of ways to handle those implications, and no
> >   choice needs to be made today, so I only mention it to make sure
> >   it's clear that (a) we must get V2 into shape for DefCore and
> >   (b) when that happens, we will need to maintain V2 even if V3
> >   is finished. We won't be able to deprecate V2 quickly.
> >
> >Now, it's entirely possible that we can meet all of those requirements
> >today, and that would be great. If that's the case, then the problem
> >is just one of clear communication and documentation. I think there's
> >probably more work to be done than that, though.
> 
> 
> There's clearly a communication problem. The fact that this very email
> has been sent out is a sign of that. However, I'd like to say, in a
> very optimistic way, that Glance is not so far away from the expected
> status. There are things to fix, other things to clarify, tons to
> discuss but, IMHO, besides the tempest tests and DefCore, the most
> critical one is the one you mentioned in the following section.
> 
> >
> >[1] http://developer.openstack.org/api-ref-image-v2.html#os-tasks-v2
> >[2] http://docs.rackspace.com/images/api/v2/ci-devguide/content/POST_importImage_tasks_Image_Task_Calls.html#d6e4193
> >[3] http://git.openstack.org/cgit/openstack/glance/tree/glance/api/v2/tasks.py
> >[4] http://git.openstack.org/cgit/openstack/defcore/tree/2015.05.json#n70
> >[5] http://git.openstack.org/cgit/openstack/defcore/tree/doc/source/guidelines/2015.07.rst
> >[6] https://review.openstack.org/#/c/213353/
> >
> >II. Complete Cinder and Nova V2 Adoption
> >
> >The Glance team originally committed to providing an Image Service
> >API. Besides our end users, both Cinder and Nova consume that API.
> >The shift from V1 to V2 has been a long road. We're far enough
> >along, and the V1 API has enough issues preventing us from using
> >it for DefCore, that we should push ahead and complete the V2
> >adoption. That will let us properly deprecate and drop V1 support,
> >and concentrate on maintaining V2 for the necessary amount of time.
> >
> >There are a few specs for the work needed in Nova, but that work
> >didn't land in Liberty for a variety of reasons. We need resources
> >from both the Glance and Nova teams to work together to get this
> >done as early as possible in Mitaka to ensure that it actually lands
> >this time. We should be able to schedule a joint session at the
> >summit to have the conversation, and we need to take advantage of
> >that opportunity to ensure the details are fully resolved so that
> >everyone understands the plan.
> 
> Super important point. I'd like people replying to this email to focus
> on what we can do next and not why this hasn't been done. The latter
> will take us down a path that won't be useful at all, as it'll just
> waste everyone's time.
> 
> That said, I fully agree with the above. Last time we talked, John
> Garbutt and Jay Pipes, from the nova team, raised their hands to help
> out with this effort. From Glance's side, Fei Long Wang and myself
> were working on the implementation. To help move this forward and to
> follow on the latest plan, which allows this migration to be smoother
> than our original plan, we need folks from Glance to raise their hand.
> 
> If I'm not elected PTL, I'm more than happy to help out here but we
> need someone that can commit to the above right now and we'll likely
> need a team of at least 2 people to help move this forward in early
> Mitaka.

Right, the work needs to be starting now to ensure the relevant specs
are ready for review and approval, summit discussions can be planned,
etc.

> 
> >The work in Cinder is more complete, but may need to be reviewed
> >to ensure that it is using the API correctly, safely, and efficiently.
> >Again, this is a joint effort between the Glance and Cinder teams
> >to identify any issues and work out a resolution.
> >
> >Part of this work will also be to audit the Glance API documentation,
> >to ensure it accurately reflects what the APIs expect to receive
> >and return. There are reportedly at least a few cases where things
> >are out of sync right now. This will require some coordination with
> >the Documentation team.
> >
> >
> >Those are the two big priorities I see, based on things the rest
> >of the community needs from the team and existing commitments that
> >have been made. There are some other things that should also be
> >addressed.
> >
> >
> >III. Security audits & bug fixes
> >
> >Five of 18 recent security reports were related to Glance [7]. It's
> >not surprising, given recent resource constraints, that addressing
> >these has been a challenge. Still, these should be given high
> >priority.
> >
> >[7] https://security.openstack.org/search.html?q=glance&check_keywords=yes&area=default
> 
> 
> +1 FWIW, we're in the process of growing Glance's security team. But
> it's clear from the above that there needs to be quicker replies to
> security issues.
> 
> >IV. Sorting out the glance-store question
> >
> >This was perhaps the most confusing thing I learned about this week.
> >The perception outside of the Glance team is that the library is
> >meant to be used by Nova and Cinder to communicate directly with
> >the image store, bypassing the REST API, to improve performance in
> >several cases. I know the Cinder team is especially interested in
> >some sort of interface for manipulating images inside the storage
> >system without having to download them to make copies (for RBD and
> >other systems that support CoW natively).
> 
> Correct, the above was one of the triggers for this effort and I
> like to think it's still one of the main drivers. There are other
> fancier things that could be done in the future assuming the
> library's API is refactored in a way that such features can be
> implemented.[0]
> 
> [0] https://review.openstack.org/#/c/188050/
> 
> >That doesn't seem to be
> >what the library is actually good for, though, since most of the
> >Glance core folks I talked to thought it was really a caching layer.
> >This discrepancy in what folks wanted vs. what they got may explain
> >some of the heated discussions in other email threads.
> 
> It's strange that some folks think of it as a caching layer. I believe
> one of the reasons there's such a discrepancy is that not enough
> effort has been put into the refactor this library requires. The reason
> this library requires such a refactor is that it came out from the old
> `glance/store` code which was very specific to Glance's internal use.
> 
> The mistake here could be that the library should've been refactored
> *before* adopting it in Glance.

The fact that there is disagreement over the intent of the library makes
me think the plan for creating it wasn't sufficiently circulated or
detailed.

> 
> >
> >Frankly, given the importance of the other issues, I recommend
> >leaving glance-store standalone this cycle. Unless the work for
> >dealing with priorities I and II is made *significantly* easier by
> >not having a library, the time and energy it will take to re-integrate
> >it with the Glance service seems like a waste of limited resources.
> >The time to even discuss it may be better spent on the planning
> >work needed. That said, if the library doesn't provide the features
> >its users were expecting, it may be better to fold it back in and
> >create a different library with a better understanding of the
> >requirements at some point. The path to take is up to the Glance
> >team, of course, but we're already down far enough on the priority
> >list that I think we'll be lucky to finish the preceding items this
> >cycle.
> 
> 
> I don't think merging glance-store back into Glance will help with any
> of the priorities mentioned in this thread. If anything, refactoring
> the API might help with future work that could come after the v1 -> v2
> migration is complete.
> 
> >
> >
> >Those are the development priorities I was able to identify in my
> >interviews this week, and there is one last thing the team needs
> >to do this cycle: Recruit more contributors.
> >
> >Almost every current core contributor I spoke with this week indicated
> >that their time was split between another project and Glance. Often
> >higher priority had to be given, understandably, to internal product
> >work. That's the reality we work in, and everyone feels the same
> >pressures to some degree. One way to address that pressure is to
> >bring in help. So, we need a recruiting drive to find folks willing
> >to contribute code and reviews to the project to keep the team
> >healthy. I listed this item last because if you've made it this far
> >you should see just how much work the team has ahead. We're a big
> >community, and I'm confident that we'll be able to find help for
> >the Glance team, but it will require mentoring and education to
> >bring people up to speed to make them productive.
> 
> Fully agree here as well. However, I also believe that the fact that
> some efforts have gone to the wrong tasks has taken Glance to the
> situation it is in today. More help is welcome and required but a good
> strategy is more important right now.
> 
> FWIW, I agree that our focus has gone to different things and this has
> taken us to the status you mentioned above. More importantly, it's
> postponed some important tasks. However, I don't believe Glance is
> completely broken - I know you are not saying this but I'd like to
> mention it - and I certainly believe we can bring it back to a good
> state faster than expected, but I'm known for being a bit optimistic
> sometimes.
> 
> In this reply I was hard on us (Glance team), because I tend to be
> hard on myself and to dig deep into the things that are not working
> well. Many times I do this based on the feedback provided by others,
> which I personally value **a lot**. Unfortunately, I have to say that
> there hasn't been enough feedback about these issues until now. There
> was Mike's email[0] where I explicitly asked the community to speak
> up. This is to say that I appreciate the time you've taken to dig into
> this a lot and to encourage folks to *always* speak up and reach out
> through every *public* medium possible.
> 
> No one can fix rumors, we can fix issues, though.
> 
> Thanks again and let's all work together to improve this situation,
> Flavio
> 
> [0] http://lists.openstack.org/pipermail/openstack-dev/2015-August/071971.html
> 
