[openstack-dev] [Nova][blueprint] Accelerate the booting process of a number of vms via VMThunder
Zhi Yan Liu
lzy.dev at gmail.com
Fri Apr 18 15:33:55 UTC 2014
On Fri, Apr 18, 2014 at 10:52 PM, lihuiba <magazine.lihuiba at 163.com> wrote:
>>btw, I see, but at the moment we have fixed it in the network interface
>>device driver instead of using a workaround that limits (slows down)
>>network traffic.
> Which kind of driver, in host kernel, in guest kernel or in openstack?
>
In the compute host kernel; it isn't related to OpenStack.
>
>
>>Some work has been done in Glance
>>(https://blueprints.launchpad.net/glance/+spec/glance-cinder-driver),
>>but more work still needs to be done, I'm sure. Some things are still
>>being drafted, and some dependencies need to be resolved as well.
> I read the blueprints carefully, but still have some doubts.
> Will it store an image as a single volume in cinder?
Yes
> Or store all image files
> in one shared volume (with a file system on the volume, of course)?
> Openstack already has support to convert an image to a volume, and to boot
> from a volume. Are these features similar to this blueprint?
Not similar, but they could be leveraged for this case.
>
I'd prefer to discuss these details on IRC. (I also read through all the
VMThunder code early today (my timezone), and I have some questions as
well.)
zhiyan
>
> Huiba Li
>
> National Key Laboratory for Parallel and Distributed
> Processing, College of Computer Science, National University of Defense
> Technology, Changsha, Hunan Province, P.R. China
> 410073
>
>
> At 2014-04-18 12:14:25,"Zhi Yan Liu" <lzy.dev at gmail.com> wrote:
>>On Fri, Apr 18, 2014 at 10:53 AM, lihuiba <magazine.lihuiba at 163.com> wrote:
>>>>It's not 100% true, in my case at least. We fixed this problem in the
>>>>network interface driver, which was actually causing the kernel panic
>>>>and read-only issues under heavy networking workload.
>>>
>>> Network traffic control could help. The point is to ensure no instance
>>> is starved to death. Traffic control can be done with tc.
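One way to picture the tc suggestion above is an HTB hierarchy in which each instance gets a guaranteed minimum rate and may borrow up to the full link. This is only a sketch: it builds the command lines and prints them without running anything. The device name, the rates and the instance names are made-up values, and the tc filters that would steer each instance's traffic into its class are left out.

```python
def htb_commands(dev, link_mbit, instances, min_mbit):
    """Return tc command lines (as argv lists) for a simple HTB setup."""
    if min_mbit * len(instances) > link_mbit:
        raise ValueError("guaranteed rates exceed the link capacity")
    cmds = [
        # Root HTB qdisc on the device.
        ["tc", "qdisc", "add", "dev", dev, "root", "handle", "1:", "htb"],
        # Parent class representing the whole link; children borrow from it.
        ["tc", "class", "add", "dev", dev, "parent", "1:", "classid", "1:1",
         "htb", "rate", f"{link_mbit}mbit"],
    ]
    # One child class per instance: "rate" is the guaranteed share,
    # "ceil" is how far the class may borrow when the link is idle.
    # tc reads class minor numbers as hexadecimal, hence the :x format.
    for minor, _name in enumerate(instances, start=0x10):
        cmds.append(["tc", "class", "add", "dev", dev, "parent", "1:1",
                     "classid", f"1:{minor:x}", "htb",
                     "rate", f"{min_mbit}mbit", "ceil", f"{link_mbit}mbit"])
    return cmds


# Hypothetical values: a 1000 Mbit/s NIC shared by two instances.
for cmd in htb_commands("eth0", 1000, ["vm-a", "vm-b"], 100):
    print(" ".join(cmd))
```

Because every class keeps its own "rate", a burst from one instance can slow the others down but cannot starve them completely.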
>>>
>>
>>btw, I see, but at the moment we have fixed it in the network interface
>>device driver instead of using a workaround that limits (slows down)
>>network traffic.
>>
>>>
>>>
>>>>btw, we are working on making Glance integrate Cinder as a
>>>>unified block storage backend.
>>> That sounds interesting. Are there any more materials?
>>>
>>
>>Some work has been done in Glance
>>(https://blueprints.launchpad.net/glance/+spec/glance-cinder-driver),
>>but more work still needs to be done, I'm sure. Some things are still
>>being drafted, and some dependencies need to be resolved as well.
>>
>>>
>>>
>>> At 2014-04-18 06:05:23,"Zhi Yan Liu" <lzy.dev at gmail.com> wrote:
>>>>Replied as inline comments.
>>>>
>>>>On Thu, Apr 17, 2014 at 9:33 PM, lihuiba <magazine.lihuiba at 163.com>
>>>> wrote:
>>>>>>IMO we'd better use a backend-storage-optimized approach to access
>>>>>>the remote image from the compute node instead of using iSCSI only.
>>>>>>And from my experience, I'm sure iSCSI lacks stability under heavy
>>>>>>I/O workload in a production environment; it can cause either the VM
>>>>>>filesystem to be marked read-only or a VM kernel panic.
>>>>>
>>>>> Yes, in this situation, the problem lies in the backend storage, so no
>>>>> other protocol will perform better. However, P2P transferring will
>>>>> greatly reduce workload on the backend storage, so as to increase
>>>>> responsiveness.
>>>>>
>>>>
>>>>It's not 100% true, in my case at least. We fixed this problem in the
>>>>network interface driver, which was actually causing the kernel panic
>>>>and read-only issues under heavy networking workload.
>>>>
>>>>>
>>>>>
>>>>>>As I said, Nova currently already has an image caching mechanism, so
>>>>>>in this case P2P is just an approach that could be used for
>>>>>>downloading or preheating the image cache.
>>>>>
>>>>> Nova's image caching is file-level, while VMThunder's is block-level.
>>>>> And VMThunder is for working in conjunction with Cinder, not Glance.
>>>>> VMThunder currently uses Facebook's flashcache to realize caching, and
>>>>> dm-cache and bcache are also options in the future.
>>>>>
>>>>
>>>>Hm, if you mention bcache, dm-cache and flashcache, I'm just wondering
>>>>whether they could be leveraged at the operation/best-practice level.
>>>>
>>>>btw, we are working on making Glance integrate Cinder as a
>>>>unified block storage backend.
>>>>
>>>>>
>>>>>>I think P2P transferring/pre-caching sounds like a good way to go, as
>>>>>>I mentioned as well, but actually in this area I'd like to see
>>>>>>something like zero-copy + CoR. On one hand we can leverage on-demand
>>>>>>downloading of image bits via the zero-copy approach; on the other
>>>>>>hand, CoR lets us avoid reading data from the remote image every
>>>>>>time.
>>>>>
>>>>> Yes, on-demand transferring is what you mean by "zero-copy", and
>>>>> caching is something close to CoR. In fact, we are working on a kernel
>>>>> module called foolcache that realizes a true CoR. See
>>>>> https://github.com/lihuiba/dm-foolcache.
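To make the "on-demand + CoR" idea concrete, here is a minimal user-space sketch of copy-on-read: a block is fetched from the remote image only the first time it is read, and every later read is served from the local copy. It is not how foolcache or flashcache are implemented (those are kernel modules); the class, the block contents and the dict standing in for the remote image are all made up for illustration.

```python
class CopyOnReadCache:
    """Serve block reads locally, fetching each block remotely at most once."""

    def __init__(self, fetch_remote_block):
        self._fetch = fetch_remote_block  # callable: block index -> bytes
        self._local = {}                  # block index -> locally cached bytes
        self.remote_reads = 0             # how often we went to the remote image

    def read_block(self, index):
        if index not in self._local:
            # First access: copy the block to local storage while reading it.
            self._local[index] = self._fetch(index)
            self.remote_reads += 1
        return self._local[index]

    def cached_blocks(self):
        return sorted(self._local)


# A dict stands in for the remote image volume (hypothetical contents).
remote_image = {0: b"boot", 1: b"kern", 2: b"root"}
cache = CopyOnReadCache(remote_image.__getitem__)

cache.read_block(1)
cache.read_block(1)  # second read is served locally
print(cache.remote_reads, cache.cached_blocks())
```

Only the blocks a VM actually touches are ever transferred, which is the on-demand part; the local copy is what keeps repeated reads off the storage server.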
>>>>>
>>>>
>>>>Yup. And it's really interesting to me; I will take a look. Thanks for
>>>>sharing.
>>>>
>>>>>
>>>>>
>>>>>
>>>>> National Key Laboratory for Parallel and Distributed
>>>>> Processing, College of Computer Science, National University of Defense
>>>>> Technology, Changsha, Hunan Province, P.R. China
>>>>> 410073
>>>>>
>>>>>
>>>>> At 2014-04-17 17:11:48,"Zhi Yan Liu" <lzy.dev at gmail.com> wrote:
>>>>>>On Thu, Apr 17, 2014 at 4:41 PM, lihuiba <magazine.lihuiba at 163.com>
>>>>>> wrote:
>>>>>>>>IMHO, zero-copy approach is better
>>>>>>> VMThunder's "on-demand transferring" is the same thing as your
>>>>>>> "zero-copy approach".
>>>>>>> VMThunder uses iSCSI as the transferring protocol, which is your
>>>>>>> option #b.
>>>>>>>
>>>>>>
>>>>>>IMO we'd better use a backend-storage-optimized approach to access
>>>>>>the remote image from the compute node instead of using iSCSI only.
>>>>>>And from my experience, I'm sure iSCSI lacks stability under heavy
>>>>>>I/O workload in a production environment; it can cause either the VM
>>>>>>filesystem to be marked read-only or a VM kernel panic.
>>>>>>
>>>>>>>
>>>>>>>>Under #b approach, my former experience from our previous similar
>>>>>>>>Cloud deployment (not OpenStack) was that: under 2 PC server storage
>>>>>>>>nodes (general *local SAS disk*, without any storage backend) +
>>>>>>>>2-way/multi-path iSCSI + 1G network bandwidth, we could provision
>>>>>>>>500 VMs in a minute.
>>>>>>> Suppose booting one instance requires reading 300MB of data, so 500
>>>>>>> instances require 150GB. Each of the two storage servers needs to
>>>>>>> send data at a rate of 150GB/2/60s = 1.25GB/s on average. This is
>>>>>>> absolutely a heavy burden even for high-end storage appliances. In
>>>>>>> production systems, this request (booting 500 VMs in one shot) will
>>>>>>> significantly disturb other running instances accessing the same
>>>>>>> storage nodes.
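The estimate above is easy to re-run with other numbers. A small sketch, using decimal units (1GB = 1000MB) as the estimate does; the comparison against a 1 Gbit/s link is taken from the "1G network bandwidth" figure in the quoted setup, and the 300MB read per boot is an assumption of the estimate, not a measurement.

```python
def per_server_rate_gb_s(vms, mb_per_boot, servers, seconds):
    """Average send rate each storage server must sustain, in GB/s."""
    total_gb = vms * mb_per_boot / 1000  # 500 * 300MB = 150GB
    return total_gb / servers / seconds


rate = per_server_rate_gb_s(vms=500, mb_per_boot=300, servers=2, seconds=60)

# A 1 Gbit/s link moves at most 1/8 GB per second, ignoring protocol overhead.
link_gb_s = 1 / 8

print(f"required per server: {rate:.2f} GB/s")
print(f"1 Gbit/s link limit: {link_gb_s:.3f} GB/s")
print(f"ratio: {rate / link_gb_s:.0f}x")
```

With these inputs the required rate is ten times what a single 1 Gbit/s link can carry, so the two figures can only be reconciled if each boot reads far less than 300MB from the storage nodes.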
>>>>>>>
>>>>
>>>>btw, I believe these numbers are not accurate either, since remote
>>>>image bits can be loaded on demand instead of all being loaded at the
>>>>boot stage.
>>>>
>>>>zhiyan
>>>>
>>>>>>> VMThunder eliminates this problem by P2P transferring and
>>>>>>> on-compute-node caching. Even a PC server with one 1Gb NIC (this is
>>>>>>> a true PC server!) can boot 500 VMs in a minute with ease. For the
>>>>>>> first time, VMThunder makes bulk provisioning of VMs practical for
>>>>>>> production cloud systems. This is the essential value of VMThunder.
>>>>>>>
>>>>>>
>>>>>>As I said, Nova currently already has an image caching mechanism, so
>>>>>>in this case P2P is just an approach that could be used for
>>>>>>downloading or preheating the image cache.
>>>>>>
>>>>>>I think P2P transferring/pre-caching sounds like a good way to go, as
>>>>>>I mentioned as well, but actually in this area I'd like to see
>>>>>>something like zero-copy + CoR. On one hand we can leverage on-demand
>>>>>>downloading of image bits via the zero-copy approach; on the other
>>>>>>hand, CoR lets us avoid reading data from the remote image every
>>>>>>time.
>>>>>>
>>>>>>zhiyan
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ===================================================
>>>>>>> From: Zhi Yan Liu <lzy.dev at gmail.com>
>>>>>>> Date: 2014-04-17 0:02 GMT+08:00
>>>>>>> Subject: Re: [openstack-dev] [Nova][blueprint] Accelerate the booting
>>>>>>> process of a number of vms via VMThunder
>>>>>>> To: "OpenStack Development Mailing List (not for usage questions)"
>>>>>>> <openstack-dev at lists.openstack.org>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hello Yongquan Fu,
>>>>>>>
>>>>>>> My thoughts:
>>>>>>>
>>>>>>> 1. Currently Nova already supports an image caching mechanism. It
>>>>>>> can cache the image on a compute host where a VM was previously
>>>>>>> provisioned from it, so the next provisioning (booting the same
>>>>>>> image) doesn't need to transfer it again unless the cache manager
>>>>>>> has cleaned it up.
>>>>>>> 2. P2P transferring and prefetching are still based on a copy
>>>>>>> mechanism. IMHO, a zero-copy approach is better, and even
>>>>>>> transferring/prefetching could be optimized by such an approach. (I
>>>>>>> have not checked the "on-demand transferring" of VMThunder, but it is
>>>>>>> a kind of transferring as well, at least from its literal meaning.)
>>>>>>> And btw, IMO, there are two ways we can follow the zero-copy idea:
>>>>>>> a. When Nova and Glance use the same backend storage, we could use a
>>>>>>> storage-specific CoW/snapshot approach to prepare the VM disk instead
>>>>>>> of copying/transferring image bits (through HTTP/network or local
>>>>>>> copy).
>>>>>>> b. Without "unified" storage, we could attach a volume/LUN to the
>>>>>>> compute node from the backend storage as a base image, then do such a
>>>>>>> CoW/snapshot on it to prepare the root/ephemeral disk of the VM. This
>>>>>>> is just like boot-from-volume, except that we do the CoW/snapshot on
>>>>>>> the Nova side instead of the Cinder/storage side.
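Both options rely on the same CoW idea: many VM disks share one read-only base image, and each VM stores only the blocks it has written. A toy sketch of that behaviour follows; the class and block contents are invented for illustration, and real deployments would use the storage backend's snapshot/clone feature or a qcow2-style overlay rather than anything like this.

```python
class CowDisk:
    """A VM disk backed by a shared base image plus a private overlay."""

    def __init__(self, base_blocks):
        self._base = base_blocks  # shared between disks, never modified here
        self._overlay = {}        # block index -> data written by this VM

    def read_block(self, index):
        # Prefer this VM's own copy; fall back to the shared base image.
        if index in self._overlay:
            return self._overlay[index]
        return self._base[index]

    def write_block(self, index, data):
        # Writes never touch the base image, only the private overlay.
        self._overlay[index] = data

    def private_blocks(self):
        return len(self._overlay)


# One base image (a tuple, so it cannot be changed) shared by two VM disks.
base_image = (b"boot", b"kern", b"root")
vm1 = CowDisk(base_image)
vm2 = CowDisk(base_image)

vm1.write_block(2, b"vm1-root")
print(vm1.read_block(2), vm2.read_block(2), vm1.private_blocks())
```

Preparing a new disk costs nothing up front, because no image bits are copied until a VM writes to a block.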
>>>>>>>
>>>>>>> For option #a, we have already got some progress:
>>>>>>> https://blueprints.launchpad.net/nova/+spec/image-multiple-location
>>>>>>> https://blueprints.launchpad.net/nova/+spec/rbd-clone-image-handler
>>>>>>>
>>>>>>> https://blueprints.launchpad.net/nova/+spec/vmware-clone-image-handler
>>>>>>>
>>>>>>> Under #b approach, my former experience from our previous similar
>>>>>>> Cloud deployment (not OpenStack) was that: under 2 PC server storage
>>>>>>> nodes (general *local SAS disk*, without any storage backend) +
>>>>>>> 2-way/multi-path iSCSI + 1G network bandwidth, we could provision
>>>>>>> 500 VMs in a minute.
>>>>>>>
>>>>>>> On the vmThunder topic, I think it sounds like a good idea; IMO P2P
>>>>>>> and prefetching are valuable optimizations for image transferring.
>>>>>>>
>>>>>>> zhiyan
>>>>>>>
>>>>>>> On Wed, Apr 16, 2014 at 9:14 PM, yongquan Fu <quanyongf at gmail.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Dear all,
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> We would like to present an extension to the VM-booting
>>>>>>>> functionality of Nova for when a number of homogeneous VMs need to
>>>>>>>> be launched at the same time.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> The motivation for our work is to increase the speed of provisioning
>>>>>>>> VMs for large-scale scientific computing and big data processing. In
>>>>>>>> that case, we often need to boot tens or hundreds of virtual machine
>>>>>>>> instances at the same time.
>>>>>>>>
>>>>>>>>
>>>>>>>> Currently, under OpenStack, we found that creating a large number of
>>>>>>>> virtual machine instances is very time-consuming. The reason is that
>>>>>>>> the booting procedure is a centralized operation that involves
>>>>>>>> performance bottlenecks. Before a virtual machine can actually be
>>>>>>>> started, OpenStack either copies the image file (swift) or attaches
>>>>>>>> the image volume (cinder) from the storage server to the compute
>>>>>>>> node via the network. Booting a single VM needs to read a large
>>>>>>>> amount of image data from the image storage server, so creating a
>>>>>>>> large number of virtual machine instances causes a significant
>>>>>>>> workload on the servers. The servers become quite busy, or even
>>>>>>>> unavailable, during the deployment phase. It takes a very long time
>>>>>>>> before the whole virtual machine cluster is usable.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Our extension is based on our work on vmThunder, a novel mechanism
>>>>>>>> for accelerating the deployment of a large number of virtual machine
>>>>>>>> instances. It is written in Python and can be integrated with
>>>>>>>> OpenStack easily. VMThunder addresses the problem described above
>>>>>>>> with the following improvements: on-demand transferring (network
>>>>>>>> attached storage), compute node caching, P2P transferring and
>>>>>>>> prefetching. VMThunder is a scalable and cost-effective accelerator
>>>>>>>> for bulk provisioning of virtual machines.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> We hope to receive your feedback. Any comments are extremely
>>>>>>>> welcome. Thanks in advance.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> PS:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> VMThunder enhanced nova blueprint:
>>>>>>>> https://blueprints.launchpad.net/nova/+spec/thunderboost
>>>>>>>>
>>>>>>>> VMThunder standalone project: https://launchpad.net/vmthunder
>>>>>>>>
>>>>>>>> VMThunder prototype: https://github.com/lihuiba/VMThunder
>>>>>>>>
>>>>>>>> VMThunder etherpad: https://etherpad.openstack.org/p/vmThunder
>>>>>>>>
>>>>>>>> VMThunder portal: http://www.vmthunder.org/
>>>>>>>>
>>>>>>>> VMThunder paper:
>>>>>>>> http://www.computer.org/csdl/trans/td/preprint/06719385.pdf
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Regards
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> vmThunder development group
>>>>>>>>
>>>>>>>> PDL
>>>>>>>>
>>>>>>>> National University of Defense Technology
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> OpenStack-dev mailing list
>>>>>>>> OpenStack-dev at lists.openstack.org
>>>>>>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Yongquan Fu
>>>>>>> PhD, Assistant Professor,
>>>>>>> National Key Laboratory for Parallel and Distributed
>>>>>>> Processing, College of Computer Science, National University of
>>>>>>> Defense
>>>>>>> Technology, Changsha, Hunan Province, P.R. China
>>>>>>> 410073