Here is the bug which I'm trying to fix -

In VMs (set up with fuel-virtualbox) a kernel panic occurs every time you
delete a node, and the stack trace shows an error in the ext4 driver [1],
the same as in the bug.

Here is a patch - https://review.openstack.org/297669 . I've checked it
with VirtualBox VMs and it works fine.

I also propose that we don't reboot nodes on kernel panic, so that we
catch possible errors, but maybe that's too dangerous before the release.

[13607.545119] EXT4-fs error (device dm-0) in ext4_reserve_inode_write:4928: IO failure
[13608.157968] EXT4-fs error (device dm-0) in ext4_reserve_inode_write:4928: IO failure
[13608.780695] EXT4-fs error (device dm-0) in ext4_reserve_inode_write:4928: IO failure
[13609.471245] Aborting journal on device dm-0-8.
[13609.478549] EXT4-fs error (device dm-0) in ext4_dirty_inode:5047: IO
[13610.069244] EXT4-fs error (device dm-0) in ext4_dirty_inode:5047: IO
[13610.698915] Kernel panic - not syncing: EXT4-fs (device dm-0): panic forced after error
[13611.060673] CPU: 0 PID: 8676 Comm: systemd-udevd Not tainted 3.13.0-83-generic #127-Ubuntu
[13611.236566] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[13611.887198]  00000000fffffffb ffff88003b6e9a08 ffffffff81725992
[13612.527154]  ffff88003b6e9a80 ffffffff8171e80b ffffffff00000010
[13613.037061]  ffff88003b6e9a30 ffff88003b6e9a50 ffff8800367f2ad0
[13613.717119] Call Trace:
[13613.927162]  [<ffffffff81725992>] dump_stack+0x45/0x56
[13614.306858]  [<ffffffff8171e80b>] panic+0xc8/0x1e1
[13614.767154]  [<ffffffff8125e7c6>] ext4_handle_error.part.187+0xa6/0xb0
[13615.187201]  [<ffffffff8125eddb>] __ext4_std_error+0x7b/0x100
[13615.627960]  [<ffffffff81244c64>] ext4_reserve_inode_write+0x44/0xa0
[13616.007943]  [<ffffffff81247f80>] ? ext4_dirty_inode+0x40/0x60
[13616.448084]  [<ffffffff81244d04>] ext4_mark_inode_dirty+0x44/0x1f0
[13616.917611]  [<ffffffff8126f7f9>] ? __ext4_journal_start_sb+0x69/0xe0
[13617.367730]  [<ffffffff81247f80>] ext4_dirty_inode+0x40/0x60
[13617.747567]  [<ffffffff811e858a>] __mark_inode_dirty+0x10a/0x2d0
[13618.088060]  [<ffffffff811d94e1>] update_time+0x81/0xd0
[13618.467965]  [<ffffffff811d96f0>] file_update_time+0x80/0xd0
[13618.977649]  [<ffffffff811511f0>] __generic_file_aio_write+0x180/0x3d0
[13619.467993]  [<ffffffff81151498>] generic_file_aio_write+0x58/0xa0
[13619.978080]  [<ffffffff8123c712>] ext4_file_write+0xa2/0x3f0
[13620.467624]  [<ffffffff81158066>] ? free_hot_cold_page_list+0x46/0xa0
[13621.038045]  [<ffffffff8115d400>] ? release_pages+0x80/0x210
[13621.408080]  [<ffffffff811bdf5a>] do_sync_write+0x5a/0x90
[13621.818155]  [<ffffffff810e52f6>] do_acct_process+0x4e6/0x5c0
[13622.278005]  [<ffffffff810e5a91>] acct_process+0x71/0xa0
[13622.597617]  [<ffffffff8106a3cf>] do_exit+0x80f/0xa50
[13622.968015]  [<ffffffff811c041e>] ? ____fput+0xe/0x10
[13623.337738]  [<ffffffff8106a68f>] do_group_exit+0x3f/0xa0
[13623.738020]  [<ffffffff8106a704>] SyS_exit_group+0x14/0x20
[13624.137447]  [<ffffffff8173659d>] system_call_fastpath+0x1a/0x1f
[13624.518044] Rebooting in 10 seconds..

On Tue, Mar 22, 2016 at 1:07 PM, Dmitry Guryanov <dguryanov at mirantis.com> wrote:

> Hello,
> Here is the start of the discussion -
> http://lists.openstack.org/pipermail/openstack-dev/2015-December/083021.html
> . I subscribed to this mailing list later, so I can't reply there.
> Currently we clear a node's disks in two places: first before the reboot
> into the bootstrap image [0], and again just before provisioning in
> fuel-agent [1].
> Erasing the first megabyte of disk data is meant to solve two problems:
> the node should not boot from HDD after reboot, and the new partitioning
> scheme should overwrite the previous one.
> The first problem can be solved by zeroing the first 512 bytes of each
> disk (not partition). Only 446, to be precise, because the last 66 bytes
> hold the partition table (64 bytes) and the boot signature (2 bytes), see
> https://wiki.archlinux.org/index.php/Master_Boot_Record .
> The second problem should be solved only after the reboot into bootstrap.
> If we bring a new node into the cluster from some other place and boot it
> with the bootstrap image, it may well have disks with existing
> partitions, md devices and LVM volumes. So all these entities should be
> correctly cleared before provisioning, not before reboot. And fuel-agent
> does that in [1].
> I propose to stop erasing the first 1M of each partition, because it can
> lead to errors in FS kernel drivers and a kernel panic. The existing
> workaround, rebooting on kernel panic, is bad: the panic may occur just
> after clearing the first partition of the first disk, and after reboot
> the BIOS will read the MBR of the second disk and boot from it instead of
> from the network. Let's just clear the first 446 bytes of each disk.
> [0]
> https://github.com/openstack/fuel-astute/blob/master/mcagents/erase_node.rb#L162-L174
> [1]
> https://github.com/openstack/fuel-agent/blob/master/fuel_agent/manager.py#L194-L221
> --
> Dmitry Guryanov
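
To make the proposal quoted above concrete, here is a minimal sketch in
Python of zeroing only the 446-byte boot code area while leaving the
partition table and boot signature alone. This is an illustration under
my own assumptions, not the code from the patch or from erase_node.rb:
the function name clear_boot_code and the constants are hypothetical,
and the demo runs against a scratch file rather than a real disk.

```python
import os
import tempfile

MBR_SIZE = 512        # one classic MBR sector
BOOT_CODE_SIZE = 446  # bootstrap code area; the remaining 66 bytes are
                      # the partition table (64) and boot signature (2)


def clear_boot_code(path):
    """Overwrite the first 446 bytes of `path` with zeros.

    `path` is a hypothetical whole-disk device (e.g. /dev/sda) or an
    image file. Everything past byte 446 is left untouched.
    """
    # "r+b" opens for in-place update without truncating the target.
    with open(path, "r+b") as dev:
        dev.write(b"\x00" * BOOT_CODE_SIZE)
        dev.flush()
        os.fsync(dev.fileno())


if __name__ == "__main__":
    # Demonstrate on a scratch file instead of a real disk.
    with tempfile.NamedTemporaryFile(delete=False) as img:
        img.write(b"\xaa" * MBR_SIZE)
    try:
        clear_boot_code(img.name)
        with open(img.name, "rb") as f:
            sector = f.read()
        print(sector[:BOOT_CODE_SIZE] == b"\x00" * BOOT_CODE_SIZE)
        print(sector[BOOT_CODE_SIZE:] == b"\xaa" * (MBR_SIZE - BOOT_CODE_SIZE))
    finally:
        os.unlink(img.name)
```

Since nothing is written to any partition, a mounted filesystem on the
disk is not touched by this step.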

Dmitry Guryanov
