Fri Nov 15 21:10:35 UTC 2013

Hi folks,

My summary notes from the OpenStack Design Summit Glance sessions follow.
Enjoy, and please help correct any misunderstandings.

Image State Consistency:
------------------------

https://etherpad.openstack.org/p/icehouse-summit-image-state-consistency

In this session, we focused on the problem that snapshots which fail
after the image is created but before the image data is uploaded
leave behind a pending image that will never become active, and the
only operation nova can perform on it is to delete it. Thus there is
not a very good way to communicate the failure to users without
just leaving a useless image record around.

A solution was proposed to allow Nova to directly set the status
of the image, say to "killed" or some other state.

A problem with the proposed solution is that we generally have
kept the "status" field internally controlled by glance, which
means there are some modeling and authorization concerns.
However, it is actually something Nova could do today through
the hacky mechanism of initiating a PUT with data, but then
terminating the connection without sending a complete body. So
the authorization aspects are not really a fundamental concern.

It was suggested that the solution to this problem
is to make Nova responsible for reporting these failures rather
than Glance. In the short term, we could do the following:
 - have nova delete the image when snapshot fails (already merged)
 - merge nova patch to report the failure as part of instance
   error reporting
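
To make the two short-term steps concrete, here is a minimal sketch of
the cleanup-and-report pattern. Everything in it is an assumption made
for illustration: FakeImageService is an in-memory stand-in rather than
the Glance API, and snapshot(), SnapshotUploadError, and the
instance_faults list are hypothetical names, not nova code.

```python
class SnapshotUploadError(Exception):
    """Raised when the image data upload fails (hypothetical)."""


class FakeImageService:
    """In-memory stand-in for an image service; not the Glance API."""

    def __init__(self):
        self.images = {}
        self._next_id = 0

    def create(self, name):
        # The record exists before any data does, which is the window
        # in which a failed snapshot leaves a useless image behind.
        self._next_id += 1
        image_id = "img-%d" % self._next_id
        self.images[image_id] = {"name": name, "status": "queued"}
        return image_id

    def upload(self, image_id, data):
        if data is None:
            raise SnapshotUploadError("no data for %s" % image_id)
        self.images[image_id]["status"] = "active"

    def delete(self, image_id):
        del self.images[image_id]


def snapshot(service, instance_faults, name, data):
    """Create an image record, upload data, and clean up on failure."""
    image_id = service.create(name)
    try:
        service.upload(image_id, data)
    except SnapshotUploadError as exc:
        # Step 1: do not leave a never-active image record around.
        service.delete(image_id)
        # Step 2: report the failure on the instance side instead.
        instance_faults.append("snapshot %r failed: %s" % (name, exc))
        return None
    return image_id
```

With this shape the user learns about the failure from the instance's
error reporting, and no orphaned image record remains to explain.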

In the longer term, it was seen as desirable for nova to treat
snapshots as asynchronous tasks and reflect those tasks in the
api, including the failure/success of those tasks.

Another long term option that was viewed mostly favorably was
to add another asynchronous task to glance for vanilla uploads
so that nova snapshots can avoid creating the image until it
is fully active.

Fei Long Wang is going to follow up on what approach makes the
most sense for Nova and report back for our next steps.

What to do about v1?
--------------------

https://etherpad.openstack.org/p/icehouse-summit-images-v1-api

In this discussion, we hammered out the details for how to drop
the v1 api and on what timetable.

Leaning heavily on cinder's experience dropping v1, we came
up with the following schedule.

Icehouse:
    - Announce plan to deprecate the V1 API and registry in J and remove
      it in K
    - Announce feature freeze for v1 API immediately
    - Make sure everything in OpenStack is using v2 (cinder, nova, ?)
    - Ensure v2 is being fully covered in tempest tests
    - Ensure there are no gaps in the migration strategy from v1 to v2
        - after the fact, it seems to me we need to produce a migration
          guide as a way to evaluate the presence of such gaps
    - Make v2 the default in glanceclient
    - Turn v2 on by default in glance API

"J":
    - Mark v1 as deprecated
    - Turn v1 off by default in config

"K":
    - Delete v1 api and v1 registry

A few gotchas were identified. In particular, a concern was raised
about breaking stable branch testing when we switch the default in
glanceclient to v2, since the latest glanceclient will be used to test
glance in, say, Folsom or Grizzly, where the v2 api didn't really
work at all.

In addition, it was suggested that we should be very aggressive
in using deprecation warnings for config options to communicate
this change as loudly as possible.

Image Sharing
-------------

https://etherpad.openstack.org/p/icehouse-summit-enhance-v2-image-sharing

This session focused on the gaps between the current image sharing
functionality and what is needed to establish an image marketplace.

One issue was the lack of verification of project ids when sharing an image.

A few other issues were identified:
- there is no way to share an image with a large number of projects in a
single api operation
- membership lists are not currently paged
- there is no way to share an image with everyone; you must know each
other project's id

We identified a potential issue with bulk operations and
verification--namely there is no way to do bulk verification of project ids
in keystone that we know of, so probably keystone work would be needed to
have both of these features in place without implying super slow api calls.

In addition, we spent some time toying with the idea of image catalogs. If
publishers put images in catalogs, rather than having shared images show up
directly in other users' image lists, things would be a lot safer and we
could relax some of our restrictions. However, there are some issues with
this approach as well:
- How do you find the catalog of a trusted image publisher?
- Are we just pushing the issue of sensible world-listings to another
resource?
- This would be a big change.

Enhancing Image Locations:
--------------------------

https://etherpad.openstack.org/p/icehouse-summit-enhance-image-location-property

This session proposed adding several attributes to image locations:

1. Add 'status' to each location.

I think consensus was that this approach makes sense moving forward. In
particular, it would be nice to have a 'pending-delete' status for image
locations, so that when you delete a single location from an image it can
be picked up properly by the glance scrubber.

There was some concern about how we define the overall image status if we
allow other statuses on locations. Is image status just stored
independently of image locations statuses? Or is it newly defined as a
function of those image locations statuses?
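
As one way to picture the second option, the sketch below defines image
status purely as a function of location statuses. The specific rule
(any active location makes the image active, all locations in
'pending-delete' make the image 'pending-delete', anything else stays
queued) is my own assumption for illustration and was not decided in
the session; derive_image_status is a hypothetical name.

```python
def derive_image_status(location_statuses):
    """Return an image status computed only from its location statuses.

    The rule here is an assumed example, not an agreed design.
    """
    statuses = list(location_statuses)
    if not statuses:
        # No locations yet: the image has no data to serve.
        return "queued"
    if any(status == "active" for status in statuses):
        # One usable location is enough to serve the image.
        return "active"
    if all(status == "pending-delete" for status in statuses):
        # Every location is waiting for the scrubber.
        return "pending-delete"
    return "queued"
```

Under this rule, deleting one of two locations leaves the image active,
which is the behavior you would want for the scrubber case above.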

2. Allow disk_format, container_format, and checksum to vary per location.

The use case here is that if you have a multi-hypervisor cloud, where
different formats are needed, the client can automatically select the
correct format when it downloads an image.

This idea was initially met with some skepticism because we have a strong
view that an image is immutable once it is created, and the checksum is a
big part of how we enforce that.

However it was correctly pointed out that the immutability we care about is
actually a property of the block device that each image format represents.
But for the moment we were unsure how to enforce that block device
immutability other than by keeping the checksum and image formats the same.

3. Add metrics to each image location.

The essential idea here is to track the performance metrics of each image
location to ensure we choose the fastest location. These metrics would not
be revealed as part of the API.

I think most of us were initially a bit confused by this suggestion.
However, after talking with Zhi Yan after the session, I think it makes
sense to support this in a local sense rather than storing such information
in the database. Locality is critical because different glance nodes likely
have different relationships to the underlying locations in terms of
network distance, so each node should be gearing towards what is best for
it.
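
A rough sketch of what node-local tracking could look like follows.
It keeps everything in process memory, so each glance node builds its
own view and nothing is written to the database or exposed in the API.
The class name, the use of throughput as the metric, and the
exponential smoothing are all assumptions of mine, not something
proposed in the session.

```python
class LocalLocationMetrics:
    """Per-process throughput estimates for image locations (a sketch)."""

    def __init__(self, smoothing=0.5):
        # Weight given to the newest sample in the moving average.
        self.smoothing = smoothing
        self._throughput = {}  # location url -> bytes per second

    def record(self, url, num_bytes, seconds):
        """Fold one observed transfer into the estimate for url."""
        if seconds <= 0:
            raise ValueError("seconds must be positive")
        sample = num_bytes / seconds
        previous = self._throughput.get(url)
        if previous is None:
            self._throughput[url] = sample
        else:
            self._throughput[url] = (
                self.smoothing * sample + (1 - self.smoothing) * previous
            )

    def fastest(self, urls):
        """Return the url with the best estimate.

        Unmeasured locations count as 0.0, and ties keep list order.
        Raises ValueError if urls is empty.
        """
        return max(urls, key=lambda url: self._throughput.get(url, 0.0))
```

Because the estimates live only in the node that measured them, two
nodes can legitimately disagree about which location is fastest.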

We can also probably reuse a local metrics tracking library to enable
similar optimizations in a future incarnation of the glance client.

Images and Taskflow
-------------------

https://etherpad.openstack.org/p/icehouse-summit-taskflow-and-glance

In this session we discussed both the general layout of taskflow and the
strategy for porting the image tasks currently under development to use
taskflow, and came up with the following basic outline.

Short Term:

As we add more and more complexity to the import task, we can try to
compose the work as a flow of tasks. With this set up, our local,
eventlet-backed executor (glance task execution engine) could be just a
thin wrapper around a local taskflow engine.

Medium Term:

At some point pretty early on we are going to want to have glance tasks
running on distributed worker processes, most likely having the tasks
triggered by rpc. At this point, we can copy the existing approach in
cinder circa Havana.

Longer Term:

When taskflow engines support distributing tasks across different workers,
we can fall back to having a local task engine that is distributing tasks
using that engine.

During the discussion a few concerns were raised about working with
taskflow:
- tasks have to be structured in the right way to make restart, recovery,
and rollback work
  - in other words, if we don't think about this carefully, we'll likely
screw things up
- it remains difficult to determine if a task has stalled or failed
- we are not sure how to restart a failed task at this point

Some of these concerns may already be being addressed in the library.
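
To illustrate the structuring concern, here is a plain-Python sketch of
the execute/revert pattern. It deliberately does not use the taskflow
library; Step, run_flow, and the example steps are hypothetical names
chosen for illustration, and the real library's behavior may differ.

```python
class Step:
    """One unit of work that knows how to undo itself."""

    def execute(self, context):
        raise NotImplementedError

    def revert(self, context):
        # Default: nothing to undo.
        pass


def run_flow(steps, context):
    """Run steps in order; on failure, revert completed steps in reverse."""
    completed = []
    for step in steps:
        try:
            step.execute(context)
        except Exception:
            for done in reversed(completed):
                done.revert(context)
            raise
        completed.append(step)
    return context


class StageData(Step):
    def execute(self, context):
        context["staged"] = True

    def revert(self, context):
        # Undo exactly what execute() did, and nothing more.
        context["staged"] = False


class VerifyChecksum(Step):
    def execute(self, context):
        if context.get("checksum") != context.get("expected"):
            raise ValueError("checksum mismatch")
```

Rollback only works here because each step's revert() undoes exactly
its own execute(); a step that changes state without a matching revert
leaves the flow unrecoverable, which is the "think about this
carefully" point above.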
[openstack-dev] [Glance] Summit Session Summaries
