<div dir="ltr">Thanks to Chris and Monty for the great conversation on this issue.<div><br></div><div>As you both pointed out, one of the problems is that we lack good enough abstractions for the backends. </div><div><br></div><div>What could we do to solve this problem? Should we consider pushing a DefCore workaround to handle the short-term problem, and discuss the long-term fix with the Glance team at the Barcelona summit?</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Sep 14, 2016 at 8:44 PM, Monty Taylor <span dir="ltr"><<a href="mailto:mordred@inaugust.com" target="_blank">mordred@inaugust.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">On 09/14/2016 12:30 AM, Chris Hoge wrote:<br>

> I’ve been thinking quite a bit about your response Monty, and have<br>

> some observations and suggestions below.<br>

><br>

>> On Sep 2, 2016, at 8:30 AM, Monty Taylor <<a href="mailto:mordred@inaugust.com">mordred@inaugust.com</a>> wrote:<br>

>><br>

>> On 09/02/2016 01:47 AM, Zhenyu Zheng wrote:<br>

>>> Hi All,<br>

>>><br>

>>> We are deploying a public cloud platform based on OpenStack in the EU. We are<br>

>>> now working on DefCore certification for our public cloud platform, and we<br>

>>> have run into some problems:<br>

>>><br>

>>> OpenStack (Nova) supports both "boot from Image" and "boot from Volume"<br>

>>> when launching instances. For large-scale commercial<br>

>>> deployments such as a public cloud, the reliability of the service is<br>

>>> considered the key factor.<br>

>>><br>

>>> When we use "boot from Image" we can have two kinds of deployments: 1.<br>

>>> Nova-compute with no shared storage backend; 2. Nova-compute with a shared<br>

>>> storage backend. In case 1, the system disk created from the image<br>

>>> is placed on the local disk of the host that nova-compute runs on.<br>

>>> The reliability of the user data is considered low, and it is<br>

>>> very hard to manage such a large number of disks on different hosts all<br>

>>> over the deployment, so case 1 can be considered not commercially ready<br>

>>> for large-scale deployments. In case 2, the reliability and<br>

>>> management problems can be solved, but a new problem is introduced: resource<br>

>>> usage and capacity tracking are incorrect. This has been a<br>

>>> known issue [1] in Nova for a long time, and the Nova team is trying to<br>

>>> solve it by introducing a new "resource provider" architecture<br>

>>> [2]. This new architecture will need a few releases to be fully<br>

>>> functional, so case 2 is also considered not commercially ready.<br>

>>><br>

>>> For the reasons listed above, we have chosen "boot from Volume"<br>

>>> as the only way of booting instances in our public cloud. By doing<br>

>>> this, we avoid the drawbacks mentioned above and gain other benefits,<br>

>>> such as:<br>

>>><br>

>>> Resiliency - Cloud Block Storage is a persistent volume, users can<br>

>>> retain it after the server is deleted. Users can then use the volume to<br>

>>> create a new server.<br>

>>> Flexibility - Users control the size and type (SSD or SATA)<br>

>>> of the volume used to boot the server. This control enables users to<br>

>>> fine-tune the storage to the needs of their operating system or application.<br>

>>> Improvements in managing and recovering from server outages<br>

>>> Unified volume management<br>

>>><br>

>>> Supporting only "boot from Volume" causes problems for us when pursuing<br>

>>> DefCore certification:<br>

>>><br>

>>> Some tests try to list instances filtered by "image_id",<br>

>>> which is None for volume-booted instances:<br>

>>><br>

>>> tempest.api.compute.servers.<wbr>test_create_server.<wbr>ServersTestJSON.test_verify_<wbr>server_details<br>

>>> tempest.api.compute.servers.<wbr>test_create_server.<wbr>ServersTestManualDisk.test_<wbr>verify_server_details<br>

>>> tempest.api.compute.servers.<wbr>test_list_server_filters.<wbr>ListServerFiltersTestJSON.<wbr>test_list_servers_detailed_<wbr>filter_by_image<br>

>>> tempest.api.compute.servers.<wbr>test_list_server_filters.<wbr>ListServerFiltersTestJSON.<wbr>test_list_servers_filter_by_<wbr>image<br>

>>><br>

>>>    - The detailed information for instances booted from volumes does<br>

>>> not contain an image_id, so the test cases that filter<br>

>>> instances by image ID cannot pass.<br>

>>><br>

>>><br>

>>> we also have tests like this:<br>

>>><br>

>>> tempest.api.compute.images.<wbr>test_images.ImagesTestJSON.<wbr>test_delete_saving_image<br>

>>><br>

>>>    - This test creates an image from an instance, then deletes the<br>

>>> created instance snapshot while the image status is “saving”. For<br>

>>> instances booted from images, the snapshot status flow is<br>

>>> queued->saving->active. But for instances booted from volumes,<br>

>>> snapshotting the instance is actually a volume snapshot action<br>

>>> done by Cinder. The image saved in Glance only holds a link to the<br>

>>> created Cinder volume snapshot, and the image status changes directly<br>

>>> to “active”. Because the test waits for the image<br>

>>> status in Glance to change to “saving”, it cannot pass for volume-booted<br>

>>> instances.<br>

>>><br>

>>><br>

>>> Also:<br>

>>><br>

>>> test_attach_volume.<wbr>AttachVolumeTestJSON.test_<wbr>list_get_volume_attachments<br>

>>><br>

>>>    - This test attaches one volume to an instance and then counts the<br>

>>> number of attachments for that instance. The expected count is<br>

>>> hardcoded to 1. For volume-booted instances, the system disk is<br>

>>> already an attachment, so the actual attachment count is 2, and<br>

>>> the test fails.<br>
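<br>
The attachment-count problem described above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper name <code>attachments_added</code> and the sample data are hypothetical and are not taken from Tempest. The idea is to compare the attachment list before and after the attach call instead of asserting a fixed total of 1.<br>
<pre>
```python
def attachments_added(before, after):
    """Return the attachment IDs present in `after` but not in `before`."""
    before_ids = {a["id"] for a in before}
    return [a["id"] for a in after if a["id"] not in before_ids]


# Hypothetical data: a volume-booted server already has its root disk attached.
before = [{"id": "root-disk", "device": "/dev/vda"}]
after = before + [{"id": "data-vol", "device": "/dev/vdb"}]

# Exactly one new attachment appears, whatever the starting count was.
new_ids = attachments_added(before, after)
assert new_ids == ["data-vol"]
```
</pre>
With a relative check like this, the same assertion would hold for an image-booted server (baseline of 0 attachments) and a volume-booted one (baseline of 1).<br>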

>>><br>

>>> And finally:<br>

>>><br>

>>> tempest.api.compute.servers.<wbr>test_server_actions.<wbr>ServerActionsTestJSON.test_<wbr>rebuild_server<br>

>>> tempest.api.compute.servers.<wbr>test_servers_negative.<wbr>ServersNegativeTestJSON.test_<wbr>rebuild_deleted_server<br>

>>> tempest.api.compute.servers.<wbr>test_servers_negative.<wbr>ServersNegativeTestJSON.test_<wbr>rebuild_non_existent_server<br>

>>><br>

>>>    - The rebuild action is not supported when the instance is booted<br>

>>> from a volume.<br>

>>><br>

>>><br>

>>> None of the tests mentioned above are friendly to "boot from Volume"<br>

>>> instances. We hope we can find some workarounds for these<br>

>>> tests, because the problems with "boot from Image" are<br>

>>> really stopping us from using it. It would also be good for DefCore if we<br>

>>> could figure out how to deal with these two types of instance creation.<br>

>>><br>

>>><br>

>>> References:<br>

>>> [1] Bugs related to resource usage reporting and calculation:<br>

>>><br>

>>> * Hypervisor summary shows incorrect total storage (Ceph)<br>

>>>  <a href="https://bugs.launchpad.net/nova/+bug/1387812" rel="noreferrer" target="_blank">https://bugs.launchpad.net/<wbr>nova/+bug/1387812</a><br>

>>> * rbd backend reports wrong 'local_gb_used' for compute node<br>

>>>  <a href="https://bugs.launchpad.net/nova/+bug/1493760" rel="noreferrer" target="_blank">https://bugs.launchpad.net/<wbr>nova/+bug/1493760</a><br>

>>> * nova hypervisor-stats shows wrong disk usage with shared storage<br>

>>>  <a href="https://bugs.launchpad.net/nova/+bug/1414432" rel="noreferrer" target="_blank">https://bugs.launchpad.net/<wbr>nova/+bug/1414432</a><br>

>>> * report disk consumption incorrect in nova-compute<br>

>>>  <a href="https://bugs.launchpad.net/nova/+bug/1315988" rel="noreferrer" target="_blank">https://bugs.launchpad.net/<wbr>nova/+bug/1315988</a><br>

>>> * VMWare: available disk spaces(hypervisor-list) only based on a single<br>

>>>  datastore instead of all available datastores from cluster<br>

>>>  <a href="https://bugs.launchpad.net/nova/+bug/1347039" rel="noreferrer" target="_blank">https://bugs.launchpad.net/<wbr>nova/+bug/1347039</a><br>

>>><br>

>>> [2] BP about solving resource usage reporting and calculation with a<br>

>>> generic resource pool (resource provider):<br>

>>><br>

>>> <a href="https://git.openstack.org/cgit/openstack/nova-specs/tree/specs/newton/approved/generic-resource-pools.rst" rel="noreferrer" target="_blank">https://git.openstack.org/<wbr>cgit/openstack/nova-specs/<wbr>tree/specs/newton/approved/<wbr>generic-resource-pools.rst</a><br>

>><br>

>> I can totally understand why you would value boot from volume. It has<br>

>> a bunch of great features, as you mention.<br>

>><br>

>> However, running a cloud that disables boot from image is a niche choice<br>

>> and I do not think that we should allow such a cloud to be considered<br>

>> "normal". If I were to encounter such a cloud, based on the workloads I<br>

>> currently run in 10 other public OpenStack clouds, I would consider it<br>

>> broken - and none of my automation that has been built based on how<br>

>> OpenStack clouds work consistently would work with that cloud.<br>

><br>

> If I understand their implementation correctly, the boot from volume<br>

> is an implementation detail of booting from what appears to be a<br>

> standard image from an API standpoint. We need to differentiate<br>

> between disallowing a set of APIs and implementing a different<br>

> backend for a particular API.<br>

<br>

</div></div>Yes, I wholeheartedly agree!<br>

<span class=""><br>

>> I do think that we should do whatever we need to to push that boot from<br>

>> volume is a regular, expected and consistent thing that people who are<br>

>> using clouds can count on. I do not think that we should accept lack of<br>

>> boot from image as a valid choice. It does not promote interoperability,<br>

>> and it removes choice from the end user, which is a Bad Thing.<br>

><br>

> I think a better point of view is if a vendor chooses to use a different<br>

> backend, we should still expect the user-facing API to behave predictably.<br>

> It seems that we’re facing a leaky abstraction more than we are<br>

> some decision to not conform to the expected API behavior.<br>

<br>

</span>I believe that we are in agreement but that you have stated it better<br>

than I did.<br>

<span class=""><br>

>> It seems that some analysis has been done to determine that<br>

>> boot-from-image is somehow not production ready or scalable.<br>

>><br>

>> To counter that, I would like to point out that the OpenStack Infra<br>

>> team, using resources in Rackspace, OVH, Vexxhost, Internap, BlueBox,<br>

>> the OpenStack Innovation Center, a private cloud run by the TripleO team<br>

>> and a private cloud run by the Infra team boot 20k instances per day<br>

>> using custom images. We upload those custom-made images using Glance<br>

>> image upload daily. We have over 10 different custom images - each about<br>

>> 7.7G in size. While we _DO_ have node-launch errors given the number we<br>

>> launch each day:<br>

>><br>

>> <a href="http://grafana.openstack.org/dashboard/db/nodepool?panelId=16&fullscreen" rel="noreferrer" target="_blank">http://grafana.openstack.org/<wbr>dashboard/db/nodepool?panelId=<wbr>16&fullscreen</a><br>

>><br>

>> it's a small number compared to the successful node launches:<br>

>><br>

>> <a href="http://grafana.openstack.org/dashboard/db/nodepool?panelId=15&fullscreen" rel="noreferrer" target="_blank">http://grafana.openstack.org/<wbr>dashboard/db/nodepool?panelId=<wbr>15&fullscreen</a><br>

>><br>

>> And we have tracked ZERO of the problems down to anything related to<br>

>> images. (it's most frequently networking related)<br>

>><br>

>> We _do_ have issues successfully uploading new images to the cloud - but<br>

>> we also have rather large images since they contain a bunch of cached<br>

>> data ... and the glance team is working on making the image upload<br>

>> process more resilient and scalable.<br>

>><br>

>> In summary:<br>

>><br>

>> * Please re-enable boot from image on your cloud if you care about<br>

>> interoperability and end users<br>

><br>

> -or- fix the unexpected behavior in the interoperable API to account<br>

> for the implementation details.<br>

<br>

</span>++<br>

<span class=""><br>

>> * Please do not think that after having disabled one of the most common<br>

>> and fundamental features of the cloud that the group responsible for<br>

>> ensuring cloud interoperability should change anything to allow your<br>

>> divergent cloud to be considered interoperable. It is not. It needs to<br>

>> be fixed.<br>

><br>

> I don’t disagree, but I also think it’s important to work with the issues that<br>

> vendors who deploy OpenStack are facing and try to understand how<br>

> they fit into the larger ecosystem. Part of what we’re trying to accomplish<br>

> here is build a virtuous circle between upstream developers, downstream<br>

> deployers, and users.<br>

<br>

</span>I absolutely agree. Again, you said words better than I did. If/when<br>

there is an issue a deployer has, it's important to fix it.<br>

<span class=""><br>

>> If the tests we have right now are only testing boot-from-image as an<br>

>> implementation happenstance, we should immediately add tests that<br>

>> EXPLICITLY test for boot-from-image. If we cannot count on that basic<br>

>> functionality, then we truly will have given up on the entire idea of<br>

>> interoperable clouds.<br>

><br>

> We can count on the basic black-box functionality at some level. It’s<br>

> the interaction of the implementation with the rest of the API that’s<br>

> causing problems.<br>

><br>

> The ‘create from image’ call itself does part of what it’s advertised to do,<br>

> it boots an image. And at first pass (the tests for the actual launching<br>

> of the image) all looks well. The issue comes later when the user<br>

> queries the ‘read vms that are images’ api, and it’s clear the abstraction<br>

> loop hasn’t been closed. This is exactly what the interoperability<br>

> tests are meant to catch, and indicates what needs to be fixed.<br>

<br>

</span>Yup<br>

<span class=""><br>

> How it’s fixed is a different story, and I think that we as a community<br>

> need to be careful about prescribing one solution over another.<br>

<br>

</span>Totally.<br>

<br>

I should be clear about my POV on this, just for the record.<br>

<br>

I by and large speak from the perspective of someone who consumes<br>

OpenStack APIs across a lot of clouds. One of the things I think is<br>

great about OpenStack is that in theory I should never need to know the<br>

implementation details that someone has chosen to make.<br>

<br>

A good example of this is cells v1. Rackspace is the only public cloud<br>

I'm aware of that runs cells v1 ... but I do not know this as a consumer<br>

of the API. It's completely transparent to me, even though it's an<br>

implementation choice Rackspace made to deal with scaling issues. It's a<br>

place where our providers have been able to make the choices that make<br>

sense and our end-users don't suffer because of it. This is good!<br>

<br>

Very related to this thread, one of the places where the abstraction<br>

breaks down is, in fact, with Images - as I have to, as an end-user,<br>

know what image format the cloud I'm using has decided to use. All of<br>

the choices that deployers make around this are valid choices, but we<br>

haven't done a good enough job in OpenStack to hide this from our users,<br>

so they suffer.<br>

<br>

The above thread about a user issuing boot from image and the backend<br>

doing boot from volume I think should (and can) be more like cells and<br>

less like image type. I'm confident that it can be done.<br>

<br>

The ceph driver may be a place to look for inspiration. ceph has a<br>

glance driver, and when you upload an image to glance, glance stores it<br>

in ceph. BUT - when the nova driver decides to boot from an image when<br>

the image is stored in ceph, it bypasses all of the normal image<br>

download/caching code and essentially does a boot from volume behind the<br>

scenes. (it's a zero-cost COW operation, so booting vms when the ceph<br>

glance and nova drivers are used is very quick)<br>

<br>

Now, those interactions do not result in a volume object that's visible<br>

through the Cinder API. They're implementation details, so the user<br>

doesn't have to know that they are booting from a COW volume in ceph.<br>

<br>

I do not know enough details about how this is implemented, but I'd<br>

imagine that if ceph was able to achieve what sounds like the semantics<br>

that are desired here without introducing API issues, it should be<br>

totally possible to achieve them here too.<br>
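<br>
To make the idea concrete, here is a minimal toy model of a cloud that accepts boot-from-image requests while booting every server from a volume behind the scenes. It is a sketch under assumptions, not Nova or ceph driver code: the class <code>VolumeBackedCloud</code>, its methods, and the IDs are all hypothetical. It only shows that recording the source image and hiding the root volume keeps image filtering and attachment listing consistent for the API consumer.<br>
<pre>
```python
class VolumeBackedCloud:
    """Toy model (hypothetical, not Nova code): the API accepts boot-from-image
    requests while the backend boots every server from a volume."""

    def __init__(self):
        self._servers = {}

    def create_server(self, name, image_id):
        # Backend detail: the root disk is a volume cloned from the image.
        self._servers[name] = {
            "image_id": image_id,          # kept so the API still reports the source image
            "root_volume": "vol-" + name,  # internal only, never listed as an attachment
            "attachments": [],             # user-visible attachments
        }

    def attach_volume(self, name, volume_id):
        self._servers[name]["attachments"].append(volume_id)

    def list_servers(self, image_id=None):
        # Filtering by image works because image_id was recorded at create time.
        return sorted(
            name for name, server in self._servers.items()
            if image_id is None or server["image_id"] == image_id
        )

    def list_attachments(self, name):
        # Return a copy so callers cannot mutate internal state.
        return list(self._servers[name]["attachments"])


cloud = VolumeBackedCloud()
cloud.create_server("web", image_id="img-ubuntu")
cloud.create_server("db", image_id="img-centos")
cloud.attach_volume("web", "data-vol")

assert cloud.list_servers(image_id="img-ubuntu") == ["web"]
assert cloud.list_attachments("web") == ["data-vol"]
```
</pre>
In this model the consumer sees the same results as on an image-backed cloud, which is the cells-like transparency described above.<br>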

<br>

Thank you Chris for being much clearer and much more helpful in what you<br>

said than what I did.<br>

<div class="HOEnZb"><div class="h5"><br>

Monty<br>

<br>

______________________________<wbr>_________________<br>

Defcore-committee mailing list<br>

<a href="mailto:Defcore-committee@lists.openstack.org">Defcore-committee@lists.<wbr>openstack.org</a><br>

<a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/defcore-committee" rel="noreferrer" target="_blank">http://lists.openstack.org/<wbr>cgi-bin/mailman/listinfo/<wbr>defcore-committee</a><br>

</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr">Zhipeng (Howard) Huang</div><div dir="ltr"><br></div><div dir="ltr">Standard Engineer</div><div>IT Standard & Patent/IT Product Line</div><div dir="ltr">Huawei Technologies Co., Ltd</div><div dir="ltr">Email: <a href="mailto:huangzhipeng@huawei.com" target="_blank">huangzhipeng@huawei.com</a></div><div dir="ltr">Office: Huawei Industrial Base, Longgang, Shenzhen</div><div dir="ltr"><br></div><div dir="ltr">(Previous)<br><div>Research Assistant</div><div>Mobile Ad-Hoc Network Lab, Calit2</div><div>University of California, Irvine</div><div>Email: <a href="mailto:zhipengh@uci.edu" target="_blank">zhipengh@uci.edu</a></div><div>Office: Calit2 Building Room 2402</div><div><br></div><div>OpenStack, OPNFV, OpenDaylight, OpenCompute Aficionado</div></div></div></div></div></div></div>

</div>