Open Stack

Mon Sep 14 12:10:16 UTC 2015

After having some conversations with folks at the Ops Midcycle a
few weeks ago, and observing some of the more recent email threads
related to glance, glance-store, the client, and the API, I spent
last week contacting a few of you individually to learn more about
some of the issues confronting the Glance team. I had some very
frank, but I think constructive, conversations with all of you about
the issues as you see them. As promised, this is the public email
thread to discuss what I found, and to see if we can agree on what
the Glance team should be focusing on going into the Mitaka summit
and development cycle and how the rest of the community can support
you in those efforts.

I apologize for the length of this email, but there's a lot to go
over. I've identified 2 high priority items that I think are critical
for the team to be focusing on starting right away in order to use
the upcoming summit time effectively. I will also describe several
other issues that need to be addressed but that are less immediately
critical. First the high priority items:

1. Resolve the situation preventing the DefCore committee from
   including image upload capabilities in the tests used for trademark
   and interoperability validation.

2. Follow through on the original commitment of the project to
   provide an image API by completing the integration work with
   nova and cinder to ensure V2 API adoption.

I. DefCore

The primary issue that attracted my attention was the fact that
DefCore cannot currently include an image upload API in its
interoperability test suite, and therefore we do not have a way to
ensure interoperability between clouds for users or for trademark
use. The DefCore process has been long, and at times confusing,
even to those of us following it sort of closely. It's not entirely
surprising that some projects haven't been following the whole time,
or aren't aware of exactly what the whole thing means. I have
proposed a cross-project summit session for the Mitaka summit to
address this need for communication more broadly, but I'll try to
summarize a bit here.

DefCore is using automated tests, combined with business policies,
to build a set of criteria for allowing trademark use. One of the
goals of that process is to ensure that all OpenStack deployments
are interoperable, so that users who write programs that talk to
one cloud can use the same program with another cloud easily. This
is a *REST API* level of compatibility. We cannot insert cloud-specific
behavior into our client libraries, because not all cloud consumers
will use those libraries to talk to the services. Similarly, we
can't put the logic in the test suite, because that defeats the
entire purpose of making the APIs interoperable. For this level of
compatibility to work, we need well-defined APIs, with a long support
period, that work the same no matter how the cloud is deployed. We
need the entire community to support this effort. From what I can
tell, that is going to require some changes to the current Glance
API to meet the requirements. I'll list those requirements, and I
hope we can discuss them to a degree that ensures everyone understands
them. I don't want this email thread to get bogged down in
implementation details or API designs, though, so let's try to keep
the discussion at a somewhat high level, and leave the details for
specs and summit discussions. I do hope you will correct any
misunderstandings or misconceptions, because unwinding this as an
outside observer has been quite a challenge and it's likely I have
some details wrong.

As I understand it, there are basically two ways to upload an image
to glance using the V2 API today. The "POST" API pushes the image's
bits through the Glance API server, and the "task" API instructs
Glance to download the image separately in the background. At one
point apparently there was a bug that caused the results of the two
different paths to be incompatible, but I believe that is now fixed.
However, the two separate APIs each have different issues that make
them unsuitable for DefCore.

The DefCore process relies on several factors when designating APIs
for compliance. One factor is the technical direction, as communicated
by the contributor community -- that's where we tell them things
like "we plan to deprecate the Glance V1 API". In addition to the
technical direction, DefCore looks at the deployment history of an
API. They do not want to require deploying an API if it is not seen
as widely usable, and they look for some level of existing adoption
by cloud providers and distributors as an indication of that the
API is desired and can be successfully used. Because we have multiple
upload APIs, the message we're sending on technical direction is
weak right now, and so they have focused on deployment considerations
to resolve the question.

The POST API is enabled in many public clouds, but not consistently.
In some clouds like HP, a tenant requires special permission to use
the API. At least one provider, Rackspace, has disabled the API
entirely. This is apparently due to what seems like a fair argument
that uploading the bits directly to the API service presents a
possible denial of service vector. Without arguing the technical
merits of that decision, the fact remains that without a strong
consensus from deployers that the POST API should be publicly and
consistently available, it does not meet the requirements to be
used for DefCore testing.

The task API is also not widely deployed, so its adoption for DefCore
is problematic. If we provide a clear technical direction that this
API is preferred, that may overcome the lack of adoption, but the
current task API seems to have technical issues that make it
fundamentally unsuitable for DefCore consideration. While the task
API addresses the problem of a denial of service, and includes
useful features such as processing of the image during import, it
is not strongly enough defined in its current form to be interoperable.
Because it's a generic API, the caller must know how to fully
construct each task, and know what task types are supported in the
first place. There is only one "import" task type supported in the
Glance code repository right now, but it is not clear that "import"
always uses the same arguments, or interprets them in the same way.
For example, the upstream documentation [1] describes a task that
appears to use a URL as source, while the Rackspace documentation [2]
describes a task that appears to take a swift storage location.
I wasn't able to find JSONSchema validation for the "input" blob
portion of the task in the code [3], though that may happen down
inside the task implementation itself somewhere.

Tasks also come from plugins, which may be installed differently
based on the deployment. This is an interesting approach to creating
API extensions, but isn't discoverable enough to write interoperable
tools against. Most of the other projects are starting to move away
from supporting API extensions at all because of interoperability
concerns they introduce. Deployers should be able to configure their
clouds to perform well, but not to behave in fundamentally different
ways. Extensions are just that, extensions. We can't rely on them
for interoperability testing.

There is a lot of fuzziness around exactly what is supported for
image upload, both in the documentation and in the minds of the
developers I've spoken to this week, so I'd like to take a step
back and try to work through some clear requirements, and then we
can have folks familiar with the code help figure out if we have a
real issue, if a minor tweak is needed, or if things are good as
they stand today and it's all a misunderstanding.

1. We need a strongly defined and well documented API, with arguments
   that do not change based on deployment choices. The behind-the-scenes
   behaviors can change, but the arguments provided by the caller
   must be the same and the responses must look the same. The
   implementation can run as a background task rather than receiving
   the full image directly, but the current task API is too vaguely
   defined to meet this requirement, and IMO we need an entry point
   focused just on uploading or importing an image.

2. Glance cannot require having a Swift deployment. It's not clear
   whether this is actually required now, so if it's not then we're
   in a good state. It's fine to provide an optional way to take
   advantage of Swift if it is present, but it cannot be a required
   component. There are three separate trademark "programs", with
   separate policies attached to them. There is an umbrella "Platform"
   program that is intended to include all of the TC approved release
   projects, such as nova, glance, and swift. However, there is
   also a separate "Compute" program that is intended to include
   Nova, Glance, and some others but *not* Swift. This is an important
   distinction, because there are many use cases both for distributors
   and public cloud providers that do not incorporate Swift for a
   variety of reasons. So, we can't have Glance's primary configuration
   require Swift and we need to provide tests for the DefCore team
   that run without Swift. Duplicate tests that do use Swift are
   fine, and might be used for "Platform" compliance tests.

3. We need an integration test suite in tempest that fully exercises
   the public image API by talking directly to Glance. This applies
   to the entire API, not just image uploads. It's fine to have
   duplicate tests using the proxy in Nova if the Nova team wants
   those, but DefCore should be using tests that talk directly to
   the service that owns each feature, without relying on any
   proxying. We've already missed the chance to deal with this in
   the current DefCore definition, which uses image-related tests
   that talk to the Nova proxy [4][5], so we'll have to maintain
   the proxy for the required deprecation period. But we won't be
   able to consider removing that proxy until we provide alternate
   tests for those features that speak directly to Glance. We may
   have some coverage already, but I wasn't able to find a task-based
   image upload test and there is no "image create" mentioned in
   the current draft of capabilities being reviewed [6]. There may
   be others missing, so someone more familiar with the feature set
   of Glance should do an audit and document what tests are needed
   so the work can be split up.

4. Once identified and incorporated into the DefCore capabilities
   set, the selected API needs to remain stable for an extended
   period of time and follow the deprecation timelines defined by
   DefCore.  That has implications for the V3 API currently in
   development to turn Glance into a more generic artifacts service.
   There are a lot of ways to handle those implications, and no
   choice needs to be made today, so I only mention it to make sure
   it's clear that (a) we must get V2 into shape for DefCore and
   (b) when that happens, we will need to maintain V2 even if V3
   is finished. We won't be able to deprecate V2 quickly.

Now, it's entirely possible that we can meet all of those requirements
today, and that would be great. If that's the case, then the problem
is just one of clear communication and documentation. I think there's
probably more work to be done than that, though.

[1] http://developer.openstack.org/api-ref-image-v2.html#os-tasks-v2
[2] http://docs.rackspace.com/images/api/v2/ci-devguide/content/POST_importImage_tasks_Image_Task_Calls.html#d6e4193
[3] http://git.openstack.org/cgit/openstack/glance/tree/glance/api/v2/tasks.py
[4] http://git.openstack.org/cgit/openstack/defcore/tree/2015.05.json#n70
[5] http://git.openstack.org/cgit/openstack/defcore/tree/doc/source/guidelines/2015.07.rst
[6] https://review.openstack.org/#/c/213353/

II. Complete Cinder and Nova V2 Adoption

The Glance team originally committed to providing an Image Service
API. Besides our end users, both Cinder and Nova consume that API.
The shift from V1 to V2 has been a long road. We're far enough
along, and the V1 API has enough issues preventing us from using
it for DefCore, that we should push ahead and complete the V2
adoption. That will let us properly deprecate and drop V1 support,
and concentrate on maintaining V2 for the necessary amount of time.

There are a few specs for the work needed in Nova, but that work
didn't land in Liberty for a variety of reasons. We need resources
from both the Glance and Nova teams to work together to get this
done as early as possible in Mitaka to ensure that it actually lands
this time. We should be able to schedule a joint session at the
summit to have the conversation, and we need to take advantage of
that opportunity to ensure the details are fully resolved so that
everyone understands the plan.

The work in Cinder is more complete, but may need to be reviewed
to ensure that it is using the API correctly, safely, and efficiently.
Again, this is a joint effort between the Glance and Cinder teams
to identify any issues and work out a resolution.

Part of this work will also be to audit the Glance API documentation,
to ensure it accurately reflects what the APIs expect to receive
and return. There are reportedly at least a few cases where things
are out of sync right now. This will require some coordination with
the Documentation team.

Those are the two big priorities I see, based on things the rest
of the community needs from the team and existing commitments that
have been made. There are some other things that should also be
addressed.

III. Security audits & bug fixes

Five of 18 recent security reports were related to Glance [7]. It's
not surprising, given recent resource constraints, that addressing
these has been a challenge. Still, these should be given high
priority.

[7] https://security.openstack.org/search.html?q=glance&check_keywords=yes&area=default

IV. Sorting out the glance-store question

This was perhaps the most confusing thing I learned about this week.
The perception outside of the Glance team is that the library is
meant to be used by Nova and Cinder to communicate directly with
the image store, bypassing the REST API, to improve performance in
several cases. I know the Cinder team is especially interested in
some sort of interface for manipulating images inside the storage
system without having to download them to make copies (for RBD and
other systems that support CoW natively). That doesn't seem to be
what the library is actually good for, though, since most of the
Glance core folks I talked to thought it was really a caching layer.
This discrepancy in what folks wanted vs. what they got may explain
some of the heated discussions in other email threads.

Frankly, given the importance of the other issues, I recommend
leaving glance-store standalone this cycle. Unless the work for
dealing with priorities I and II is made *significantly* easier by
not having a library, the time and energy it will take to re-integrate
it with the Glance service seems like a waste of limited resources.
The time to even discuss it may be better spent on the planning
work needed. That said, if the library doesn't provide the features
its users were expecting, it may be better to fold it back in and
create a different library with a better understanding of the
requirements at some point. The path to take is up to the Glance
team, of course, but we're already down far enough on the priority
list that I think we'll be lucky to finish the preceding items this
cycle.

Those are the development priorities I was able to identify in my
interviews this week, and there is one last thing the team needs
to do this cycle: Recruit more contributors.

Almost every current core contributor I spoke with this week indicated
that their time was split between another project and Glance. Often
higher priority had to be given, understandibly, to internal product
work. That's the reality we work in, and everyone feels the same
pressures to some degree. One way to address that pressure is to
bring in help. So, we need a recruiting drive to find folks willing
to contribute code and reviews to the project to keep the team
healthy. I listed this item last because if you've made it this far
you should see just how much work the team has ahead. We're a big
community, and I'm confident that we'll be able to find help for
the Glance team, but it will require mentoring and education to
bring people up to speed to make them productive.

Doug

Open Stack

[openstack-dev] [glance] proposed priorities for Mitaka

OpenStack

Community

Documentation

Branding & Legal