[Openstack] Compute downloading corrupted image from Glance

Rick Jones rick.jones2 at hpe.com
Tue Mar 29 20:19:37 UTC 2016


On 03/29/2016 01:01 PM, Kaustubh Kelkar wrote:
> -----Original Message-----
> From: Rick Jones [mailto:rick.jones2 at hpe.com]
> Sent: Tuesday, March 29, 2016 1:43 PM
> To: openstack at lists.openstack.org
> Subject: Re: [Openstack] Compute downloading corrupted image from Glance
>
> On 03/29/2016 10:17 AM, Kaustubh Kelkar wrote:
>> Every time I tried to download the image on the compute, I get a new
>> hash value (albeit, a wrong one).
>
> On the compute node, what is the type of NIC and its driver and such?
> [Kaustubh] It is an Intel X710 NIC with i40e driver. The NIC is part of the integrated card on a Dell R730.
>
> lspci -v | grep -A 1 Ethernet
> [Kaustubh] (Output redacted to show only the relevant interface)
> 01:00.1 Ethernet controller: Intel Corporation Ethernet 10G 2P X710 Adapter (rev 01)
>          Subsystem: Dell Device 0000

It wasn't assigned a sub-device ID? (Device 0000).  I'm not all that 
familiar with Dell kit but that seems a trifle odd.

> ethtool -i <interfacename>
> [Kaustubh] root@dchi:/home/kkelkar# ethtool -i em2
> driver: i40e
> version: 1.4.25
> firmware-version: 4.41 0x80001863 16.5.20
> bus-info: 0000:01:00.1
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: yes
> supports-register-dump: yes
> supports-priv-flags: yes
>
> And are any of the stateless offloads enabled?
>
> ethtool -k <interfacename>
> [Kaustubh] root@dchi:/home/kkelkar# ethtool -k em2
> Features for em2:

trimmed...

>
> Those would include checksum offload, and things built on top of it
> like TSO, GSO, LRO and/or GRO.
>
> If you find that checksum offload is enabled, and you disable it,
> does the corrupt image download problem go away?  If so, you have a
> problem with your NIC and/or its driver getting the offloads wrong
> and/or corrupting the traffic in a place outside the protection of
> the offloaded checksumming.  One of the central assumptions with the
> likes of checksum offload in a NIC is that anything "above" the
> checksum offload in the NIC has some sort of data protection - at
> least parity, if not ECC.  This includes components in the NIC
> itself, the I/O bus etc etc.
>
> If disabling checksum offload on the compute node doesn't resolve the
> matter, you might consider the same on the controller.
>
> [Kaustubh] I ended up disabling checksumming, TSO, GSO and GRO on
> both controller and the compute so the ethtool output looks as above.
> Now, the problem can only be reproduced intermittently. At times,
> compute node still gets a corrupted image.

Ah, so that ethtool -k output was taken after the change, not before.  I
was initially confused because I'd not expected the offloads to be
disabled by default.

If the issue is still intermittent I'd *guess* it was timing related.
You might see if there are any increases in the bad checksum stats
reported by netstat -s.
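
A rough sketch of one way to watch those counters on Linux, assuming the
kernel exposes an InCsumErrors column in /proc/net/snmp (newer kernels do
for Tcp and Udp; older ones may not).  The function names here are made
up for illustration.  The idea is to take a snapshot before and after a
download and compare the two:

```python
def parse_snmp(text):
    """Parse /proc/net/snmp-style text into {protocol: {counter: value}}."""
    counters = {}
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # The file comes in pairs: a line of counter names, then a line of values.
    for header, values in zip(lines[0::2], lines[1::2]):
        proto, names = header.split(":", 1)
        value_proto, nums = values.split(":", 1)
        if proto != value_proto:
            continue  # unexpected layout, skip this pair
        counters[proto] = dict(zip(names.split(), (int(n) for n in nums.split())))
    return counters


def checksum_errors(counters):
    """Keep only the InCsumErrors counter for protocols that report one."""
    return {
        proto: vals["InCsumErrors"]
        for proto, vals in counters.items()
        if "InCsumErrors" in vals
    }


def delta(before, after):
    """Per-protocol increase in checksum errors between two snapshots."""
    return {proto: after[proto] - before.get(proto, 0) for proto in after}


def read_checksum_errors(path="/proc/net/snmp"):
    """Snapshot the current checksum-error counters (Linux only)."""
    with open(path) as f:
        return checksum_errors(parse_snmp(f.read()))
```

Call read_checksum_errors() before the image download and again after
it, then pass both results to delta().  A non-zero increase on the
compute node would suggest corrupted segments are arriving and being
caught; no increase alongside a bad image hash would point at corruption
the checksum is not seeing.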

Other bits of straw-grasping would include, but not be limited to:

*) Transfer the image via scp and see if that always works OK
*) Run something like netperf TCP_STREAM or iperf and see if you see 
checksum errors accumulating.
*) Perhaps create a fake image of the same size with a fixed pattern and 
transfer that via glance and see if it ever complains.  If it does, you 
can look to see where the pattern breaks in terms of offset into the 
file and how it breaks.  If it is reproducible, you can then
consider getting packet traces at either end and looking through those
to see whether the data was already bad when it left the sender.
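
The fixed-pattern idea could be sketched along these lines.  This is
only an illustration, not anything Glance provides: the pattern (each
8-byte word holds its own word index, big-endian) and the function names
are my own choices, and it assumes Glance will take an arbitrary file
uploaded as a raw image.  Because every word encodes its position, the
offset where the pattern breaks falls straight out of the comparison:

```python
import struct

WORD = 8  # bytes per pattern word


def pattern_bytes(start_word, n_words):
    """Pattern for n_words words: each word is its own index as a big-endian u64."""
    return b"".join(struct.pack(">Q", i) for i in range(start_word, start_word + n_words))


def write_pattern_file(path, size_bytes, chunk_words=8192):
    """Write a pattern file of size_bytes (must be a multiple of WORD)."""
    if size_bytes % WORD:
        raise ValueError("size_bytes must be a multiple of %d" % WORD)
    total_words = size_bytes // WORD
    with open(path, "wb") as f:
        word = 0
        while word < total_words:
            n = min(chunk_words, total_words - word)
            f.write(pattern_bytes(word, n))
            word += n


def first_bad_offset(path, expected_size, chunk_words=8192):
    """Return the byte offset of the first deviation from the pattern, or None."""
    offset = 0
    with open(path, "rb") as f:
        while offset < expected_size:
            n_words = min(chunk_words, (expected_size - offset) // WORD)
            expected = pattern_bytes(offset // WORD, n_words)
            actual = f.read(len(expected))
            if actual != expected:
                for i, (a, e) in enumerate(zip(actual, expected)):
                    if a != e:
                        return offset + i
                # All bytes read so far matched, so the file is truncated.
                return offset + len(actual)
            offset += len(expected)
        if f.read(1):
            return expected_size  # file is longer than expected
    return None
```

Generate the file with write_pattern_file() at the same size as the real
image (rounded to a multiple of 8), upload it, and run first_bad_offset()
on whatever the compute node downloads.  Looking at the bytes around the
returned offset should show whether data was flipped, shifted, or
replaced with data from some other offset.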

rick jones



