[Openstack-operators] 4K block size

Mark Mielke mark.mielke at gmail.com
Tue Apr 24 01:22:18 UTC 2018

On Mon, Apr 23, 2018 at 3:54 PM, Sean McGinnis <sean.mcginnis at gmx.com> wrote:

> On Mon, Apr 23, 2018 at 05:46:40PM +0000, Tim Bell wrote:
> > Has anyone experience of working with local disks or volumes with
> > physical/logical block sizes of 4K rather than 512?
> >
> > There seems to be KVM support for this
> > (http://fibrevillage.com/sysadmin/216-how-to-make-qemu-kvm-accept-4k-sector-sized-disks)
> > but I could not see how to get the appropriate flavors/volumes in an
> > OpenStack environment?
> >
> > Is there any performance improvement from moving to 4K rather than 512
> > byte sectors?
>
> I haven't seen much of a performance difference between drives with one
> exception that I don't think will apply here. For backward compatibility,
> there is something that is called "512e mode". This basically takes a 4k
> sector size and, using software abstraction, presents it to the host as a
> 512 byte sector drive. So with this abstraction being done in between,
> there can be a slight performance hit as things are translated.

Today, the most commonly used file systems, where performance matters,
already use 4K blocks underneath. So it mostly doesn't matter whether the
disk is 512n, 512e, or 4Kn. They all work approximately the same. In theory
the disk performance is better with "Advanced Format" disks as there are
fewer gaps between the data sectors, but you can get similar gains with
denser platters or faster rotation. For example, you might have a 5-platter
4TB disk with 512n, or a 4-platter 4TB disk with 4Kn. The 4-platter disk
might require less energy to run, and may have higher sequential read and
write performance. But, if the disk specs meet your requirements, you often
wouldn't care if it was 512n, 512e, or 4Kn.

One case where it definitely does matter is alignment. If the logical
sectors are not aligned with the physical sectors, this can have a crippling
impact. A 512e drive "emulates" 512n. But, if your write is out of
alignment, and bridges the end of one physical sector and the beginning of
another physical sector, how does the drive safely write in units of the
physical sector? Unless the physical sectors happen to be in cache, it has
to first read each physical sector before it can rewrite it (a
read-modify-write cycle).
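
To make the cost concrete, here is a minimal Python sketch of the arithmetic.
The function name is made up for illustration, and it assumes a 4096-byte
physical sector and a write described by a byte offset and length. It is only
a model of the reasoning above, not of any particular drive's firmware.

```python
def physical_sectors_touched(offset_bytes, length_bytes, physical=4096):
    """Return (sectors_touched, needs_read_modify_write) for one write.

    Hypothetical helper: models a drive with `physical`-byte sectors.
    """
    first = offset_bytes // physical
    last = (offset_bytes + length_bytes - 1) // physical
    # The drive can skip the read only if the write covers whole
    # physical sectors: aligned start and a whole-sector length.
    whole_sectors = offset_bytes % physical == 0 and length_bytes % physical == 0
    return last - first + 1, not whole_sectors


# A 4 KiB file system block at the start of a partition that begins
# at 512-byte sector 63 (the traditional MBR layout).
print(physical_sectors_touched(63 * 512, 4096))    # (2, True)

# The same block when the partition begins at 1 MiB (sector 2048).
print(physical_sectors_touched(2048 * 512, 4096))  # (1, False)
```

In the misaligned case every 4 KiB write straddles two physical sectors, and
both have to be read before they can be written back.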

I believe GRUB Legacy is not 512e/4Kn aware. RHEL 5 systems, RHEL 5
systems upgraded to RHEL 6, or systems that were created with fixed
partition tables designed before 512e/4Kn drives existed, can end up with
the more traditional MBR layout where the first partition of the disk
begins at 512-byte sector 63, which is not a multiple of 4K. Using such a
partition table on a 512e disk can be very bad news. In our case, we had
real life RHEL 7 systems naively imaged using a Kickstart configuration
that was originally designed for RHEL 5. They were configured to use the
Docker lvm-thinp driver, and this particular use case was very heavy on
random I/O through the thin volume layers. One set of users was reporting
good performance, and another set of users was reporting really bad
performance. I looked at the systems and determined that they had different
makes and models of disk, and the "slow" systems all had 512e, and the
"fast" systems all had 512n. I checked the Kickstart configuration they
were using, and sure enough they were using the original layout.

Modern partition tools leave a full 1 MB free at the start of the disk, so
the first partition is aligned on 512 bytes, on 4K, and on 1 MB. This
leaves more room for the boot code, and it mostly eliminates alignment
problems.
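
A quick way to compare the two layouts is to convert the starting sector to
a byte offset and test it against each boundary. This is a small Python
sketch; the function name and the dictionary keys are just illustrative, and
it assumes the start is given in 512-byte sectors.

```python
MIB = 1024 * 1024


def alignment_report(start_sector, logical=512, physical=4096):
    """Describe where a partition start falls relative to 4K and 1 MiB.

    Hypothetical helper: `start_sector` is counted in `logical`-byte units.
    """
    offset = start_sector * logical
    return {
        "byte_offset": offset,
        "aligned_4k": offset % physical == 0,
        "aligned_1mib": offset % MIB == 0,
        # How far past the previous physical-sector boundary the start is.
        "bytes_past_boundary": offset % physical,
    }


for start in (63, 2048):
    print(start, alignment_report(start))
```

Sector 63 lands 3584 bytes past a 4K boundary, while sector 2048 sits exactly
on the 1 MiB mark and is therefore aligned for both 512 and 4K.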

If you did have a file system that was of the belief that it could read and
write at the 512-byte sector level, it would also have worst case behaviour
similar to the above. I don't think EXT4 or XFS do this, so it is outside
my concern and I didn't research which ones still do this. But, knowing all
of the above, I patched our SolidFire driver so that 512e information is
properly exposed through libvirt and qemu/virtio to the guest. The guest
can clearly see "4K" physical sectors and "512" logical
sectors, and it can then make the best decision based upon this information.
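
For reference, libvirt's domain XML has a <blockio> sub-element of <disk>
with logical_block_size and physical_block_size attributes, which is how a
512e geometry can be described to the guest. The Python sketch below only
builds that fragment; the function name is made up, and it is not the driver
patch mentioned above, just an illustration of what the resulting disk
definition would need to contain.

```python
import xml.etree.ElementTree as ET


def blockio_element(logical=512, physical=4096):
    """Build a libvirt <blockio> element describing a 512e disk.

    Hypothetical helper for illustration; the element belongs inside a
    <disk> element of the domain XML.
    """
    elem = ET.Element("blockio")
    elem.set("logical_block_size", str(logical))
    elem.set("physical_block_size", str(physical))
    return elem


# Print the fragment as text so it can be compared with a domain's XML.
print(ET.tostring(blockio_element(), encoding="unicode"))
```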

I did suggest that the SolidFire people adopt this, but with my local patch
the pressure to follow up here was eliminated, and I didn't get back to

I think the guest should have the right information so that it can make the
correct decision. If the information is filtered, and a guest is presented
with 512 byte physical and logical, even though the physical is 4K, then
certain use cases may exhibit very bad behaviour. You probably won't
notice, because typical benchmark runs would show good speed, and you would
be unaware that the overhead in those use cases is actually due to
misalignment, or partial sector reads and writes.
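
One way to check what a Linux guest has actually been told is to read the
block queue attributes in sysfs. The Python sketch below assumes the usual
/sys/block/<device>/queue/logical_block_size and physical_block_size files;
the function names and the sysfs_root parameter are only there for
illustration and testing.

```python
from pathlib import Path


def block_sizes(device, sysfs_root="/sys/block"):
    """Return (logical, physical) block sizes in bytes for a block device."""
    queue = Path(sysfs_root) / device / "queue"
    logical = int((queue / "logical_block_size").read_text())
    physical = int((queue / "physical_block_size").read_text())
    return logical, physical


def describe(logical, physical):
    """Name the sector format that the reported sizes correspond to."""
    if (logical, physical) == (512, 512):
        return "512n (or a 4K physical size that has been hidden)"
    if (logical, physical) == (512, 4096):
        return "512e"
    if (logical, physical) == (4096, 4096):
        return "4Kn"
    return "other"


if __name__ == "__main__":
    root = Path("/sys/block")
    for dev in sorted(root.iterdir()) if root.is_dir() else []:
        try:
            logical, physical = block_sizes(dev.name)
        except (OSError, ValueError):
            continue  # skip devices without readable queue attributes
        print(dev.name, logical, physical, describe(logical, physical))
```

A guest that reports 512/512 for a volume known to be backed by 4K sectors
is in the filtered situation described above.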

Mark Mielke <mark.mielke at gmail.com>