<div dir="ltr">Hi folks,<div><br></div><div>My summary notes from the OpenStack Design Summit Glance sessions follow. Enjoy, and please help correct any misunderstandings.</div><div><br></div><div><br></div><div><div><br></div>
<div>Image State Consistency:</div><div>------------------------</div><div><br></div><div><a href="https://etherpad.openstack.org/p/icehouse-summit-image-state-consistency">https://etherpad.openstack.org/p/icehouse-summit-image-state-consistency</a></div>
<div><br></div><div>In this session, we focused on the problem that a snapshot which fails</div><div>after the image is created but before the image data is uploaded</div><div>leaves behind a pending image that will never become active, and the</div>
<div>only operation nova can perform on it is to delete the image. Thus there is</div><div>not a very good way to communicate the failure to users without</div><div>just leaving a useless image record around.</div><div><br></div><div>
A solution was proposed to allow Nova to directly set the status</div><div>of the image, say to "killed" or some other state.</div><div><br></div><div>A problem with the proposed solution is that we generally have</div>
<div>kept the "status" field internally controlled by glance, which</div><div>means there are some modeling and authorization concerns.</div><div>However, it is actually something Nova could do today through</div>
<div>the hacky mechanism of initiating a PUT with data, but then</div><div>terminating the connection without sending a complete body. So</div><div>the authorization aspects are not really a fundamental concern.</div><div>
<br></div><div>It was suggested that the solution to this problem</div><div>is to make Nova responsible for reporting these failures rather</div><div>than Glance. In the short term, we could do the following:</div><div> - have nova delete the image when a snapshot fails (already merged)</div>
<div> - merge nova patch to report the failure as part of instance </div><div> error reporting</div><div><br></div><div>In the longer term, it was seen as desirable for nova to treat</div><div>snapshots as asynchronous tasks and reflect those tasks in the</div>
<div>api, including the failure/success of those tasks.</div><div><br></div><div>Another long term option that was viewed mostly favorably was</div><div>to add another asynchronous task to glance for vanilla uploads</div>
<div>so that nova snapshots can avoid creating the image until it</div><div>is fully active.</div><div><br></div><div>Fei Long Wang is going to follow up on what approach makes the</div><div>most sense for Nova and report back for our next steps.</div>
<div><br></div><div><br></div><div><br></div><div>What to do about v1?</div><div>--------------------</div><div><br></div><div><a href="https://etherpad.openstack.org/p/icehouse-summit-images-v1-api">https://etherpad.openstack.org/p/icehouse-summit-images-v1-api</a></div>
<div><br></div><div>In this discussion, we hammered out the details for how to drop</div><div>the v1 api and in what timetable.</div><div><br></div><div>Leaning heavily on cinder's experience dropping v1, we came</div>
<div>up with the following schedule.</div><div><br></div><div>Icehouse:</div><div> - Announce plan to deprecate the V1 API and registry in J and remove it in K</div><div> - Announce feature freeze for v1 API immediately</div>
<div> - Make sure everything in OpenStack is using v2 (cinder, nova, ?)</div><div> - Ensure v2 is being fully covered in tempest tests</div><div> - Ensure there are no gaps in the migration strategy from v1 to v2</div>
<div> - after the fact, it seems to me we need to produce a migration guide as a way to evaluate the presence of such gaps</div><div> - Make v2 the default in glanceclient</div><div> - Turn v2 on by default in glance API</div>
<div><br></div><div>"J":</div><div> - Mark v1 as deprecated</div><div> - Turn v1 off by default in config</div><div><br></div><div>"K":</div><div> - Delete v1 api and v1 registry</div><div><br>
</div><div><br></div><div>A few gotchas were identified. In particular, a concern was raised</div><div>about breaking stable branch testing when we switch the default in </div><div>glanceclient to v2, since the latest glanceclient will be used to test</div>
<div>glance in, say, Folsom or Grizzly, where the v2 api didn't really </div><div>work at all.</div><div><br></div><div>In addition, it was suggested that we should be very aggressive</div><div>in using deprecation warnings for config options to communicate</div>
<div>this change as loudly as possible.</div><div><br></div><div><br></div><div><br></div><div><br></div><div>Image Sharing</div><div>-------------</div><div><br></div><div><a href="https://etherpad.openstack.org/p/icehouse-summit-enhance-v2-image-sharing">https://etherpad.openstack.org/p/icehouse-summit-enhance-v2-image-sharing</a></div>
<div><br></div><div>This session focused on the gaps between the current image sharing functionality and what is needed to establish an image marketplace.</div><div><br></div><div>One issue was the lack of verification of project ids when sharing an image.</div>
<div><br></div><div>A few other issues were identified:</div><div>- there is no way to share an image with a large number of projects in a single api operation</div><div>- membership lists are not currently paged</div><div>
- there is no way to share an image with everyone; you must know every other project's id</div><div><br></div><div>We identified a potential issue with bulk operations and verification: there is no way to do bulk verification of project ids in keystone that we know of, so keystone work would probably be needed to have both of these features in place without implying super slow api calls.</div>
<div><br></div><div>In addition, we spent some time toying with the idea of image catalogs. If publishers put images in catalogs, rather than having shared images show up directly in other users' image lists, things would be a lot safer and we could relax some of our restrictions. However, there are some issues with this approach as well,</div>
<div>- How do you find the catalog of a trusted image publisher?</div><div>- Are we just pushing the issue of sensible world-listings to another resource?</div><div>- This would be a big change.</div><div><br></div><div><br>
</div><div><br></div><div>Enhancing Image Locations:</div><div>--------------------------</div><div><br></div><div><a href="https://etherpad.openstack.org/p/icehouse-summit-enhance-image-location-property">https://etherpad.openstack.org/p/icehouse-summit-enhance-image-location-property</a></div>
<div><br></div><div>This session proposed adding several attributes to image locations</div><div><br></div><div>1. Add 'status' to each location.</div><div><br></div><div>I think consensus was that this approach makes sense moving forward. In particular, it would be nice to have a 'pending-delete' status for image locations, so that when you delete a single location from an image it can be picked up properly by the glance scrubber.</div>
<div><br></div><div>There was some concern about how we define the overall image status if we allow other statuses on locations. Is image status just stored independently of image locations statuses? Or is it newly defined as a function of those image locations statuses?</div>
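<div><br></div><div>To make the second option concrete, here is a minimal sketch of image status defined as a function of location statuses. The rule itself (an image is usable if at least one location is usable) and the function name are my own assumptions for illustration; the session did not settle on any rule, and only 'pending-delete' was named as a candidate location status.</div>

```python
def derive_image_status(location_statuses):
    """Collapse per-location statuses into a single image status.

    Assumed rule (NOT agreed in the session): the image counts as
    active as long as at least one of its locations is active.
    """
    statuses = list(location_statuses)
    if not statuses:
        return "queued"          # no data registered anywhere yet
    if "active" in statuses:
        return "active"          # at least one usable copy exists
    if all(s == "pending-delete" for s in statuses):
        return "pending_delete"  # every copy is waiting for the scrubber
    return "queued"              # copies exist, but none is usable yet


print(derive_image_status(["active", "pending-delete"]))
```

<div>Under this assumed rule, deleting one location out of several leaves the image active, which is the behaviour the 'pending-delete' location status was meant to enable.</div>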
<div><br></div><div>2. Allow disk_format, container_format, and checksum to vary per location.</div><div><br></div><div>The usecase here is that if you have a multi-hypervisor cloud, where different formats are needed, the client can automatically select the correct format when it downloads an image.</div>
<div><br></div><div>This idea was initially met with some skepticism because we have a strong view that an image is immutable once it is created, and the checksum is a big part of how we enforce that.</div><div><br></div>
<div>However it was correctly pointed out that the immutability we care about is actually a property of the block device that each image format represents. But for the moment we were unsure how to enforce that block device immutability save keeping the checksum and image formats the same.</div>
<div><br></div><div><br></div><div>3. Add metrics to each image location.</div><div><br></div><div>The essential idea here is to track the performance metrics of each image location to ensure we choose the fastest location. These metrics would not be revealed as part of the API.</div>
<div><br></div><div>I think most of us were initially a bit confused by this suggestion. However, after talking with Zhi Yan after the session, I think it makes sense to support this in a local sense rather than storing such information in the database. Locality is critical because different glance nodes likely have different relationships to the underlying locations in terms of network distance, so each node should be gearing towards what is best for it.</div>
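<div><br></div><div>As a rough illustration of what "local" tracking could look like, the sketch below keeps a smoothed throughput figure per location in process memory and ranks locations by it. The class name, the smoothing scheme, and the example URLs are all hypothetical; nothing like this was specified in the session.</div>

```python
class LocalLocationMetrics:
    """Per-node, in-memory record of observed download throughput.

    Hypothetical sketch: nothing is written to the database, so each
    glance node learns its own view of which location is fastest.
    """

    def __init__(self, smoothing=0.3):
        self.smoothing = smoothing  # weight given to the newest sample
        self._throughput = {}       # location url -> smoothed bytes/sec

    def record(self, url, num_bytes, seconds):
        """Fold one observed transfer into the running average."""
        if seconds <= 0:
            return                  # ignore unusable samples
        sample = num_bytes / seconds
        previous = self._throughput.get(url)
        if previous is None:
            self._throughput[url] = sample
        else:
            a = self.smoothing
            self._throughput[url] = a * sample + (1 - a) * previous

    def rank(self, urls):
        """Return urls fastest first; unmeasured locations sort last."""
        return sorted(urls,
                      key=lambda u: self._throughput.get(u, 0.0),
                      reverse=True)


metrics = LocalLocationMetrics(smoothing=0.5)
metrics.record("swift://a", 100, 1.0)
metrics.record("file:///b", 400, 1.0)
print(metrics.rank(["swift://a", "http://c", "file:///b"]))
```

<div>Because the state lives only in the process, two nodes with different network distance to the same store will naturally end up with different rankings.</div>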
<div><br></div><div>We can also probably reuse a local metrics tracking library to enable similar optimizations in a future incarnation of the glance client.</div><div><br></div><div><br></div><div><br></div><div><br></div>
<div>Images and Taskflow</div><div>-------------------</div><div><br></div><div><a href="https://etherpad.openstack.org/p/icehouse-summit-taskflow-and-glance">https://etherpad.openstack.org/p/icehouse-summit-taskflow-and-glance</a></div>
<div><br></div><div>In this session we discussed both the general layout of taskflow and the strategy for porting the current image tasks under development to use taskflow, and came up with the following basic outline.</div><div>
<br></div><div>Short Term:</div><div><br></div><div>As we add more and more complexity to the import task, we can try to compose the work as a flow of tasks. With this set up, our local, eventlet-backed executor (glance task execution engine) could be just a thin wrapper around a local taskflow engine.</div>
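<div><br></div><div>To show the shape of "a flow of tasks", here is a small standard-library sketch of an import broken into steps that each know how to undo themselves. This is deliberately not taskflow's API; the class names, the step names, and the choice to revert only completed steps are assumptions made for illustration.</div>

```python
class Task:
    """One step of a flow; subclasses override execute and maybe revert."""

    def execute(self, context):
        raise NotImplementedError

    def revert(self, context):
        pass  # default: nothing to undo


def run_linear_flow(tasks, context):
    """Run tasks in order; on failure, revert completed ones in reverse."""
    done = []
    try:
        for t in tasks:
            t.execute(context)
            done.append(t)
    except Exception:
        for t in reversed(done):
            t.revert(context)
        raise
    return context


# Hypothetical import steps, standing in for the real import task.
class StageData(Task):
    def execute(self, context):
        context["staged"] = True

    def revert(self, context):
        context["staged"] = False  # discard the staged bits


class VerifyChecksum(Task):
    def execute(self, context):
        if context["checksum"] != context["expected_checksum"]:
            raise ValueError("checksum mismatch")


class Activate(Task):
    def execute(self, context):
        context["status"] = "active"


result = run_linear_flow(
    [StageData(), VerifyChecksum(), Activate()],
    {"checksum": "abc", "expected_checksum": "abc"},
)
print(result["status"])
```

<div>With the work split this way, the eventlet-backed executor only has to hand the list of steps to an engine, which is what would make it a thin wrapper.</div>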
<div><br></div><div>Medium Term:</div><div><br></div><div>At some point pretty early on we are going to want to have glance tasks running on distributed worker processes, most likely having the tasks triggered by rpc. At this point, we can copy the existing approach in cinder circa Havana.</div>
<div><br></div><div>Longer Term:</div><div><br></div><div>When taskflow engines support distributing tasks across different workers, we can fall back to having a local task engine that is distributing tasks using that engine.</div>
<div><br></div><div>During the discussion a few concerns were raised about working with taskflow:</div><div>- tasks have to be structured in the right way to make restart, recovery, and rollback work</div><div> - in other words, if we don't think about this carefully, we'll likely screw things up</div>
<div>- it remains difficult to determine if a task has stalled or failed</div><div>- we are not sure how to restart a failed task at this point</div><div><br></div><div>Some of these concerns may already be being addressed in the library.</div>
</div></div>