[openstack-dev] [Fuel] Wiping node's disks on delete
Dmitry Guryanov
dguryanov at mirantis.com
Fri Mar 25 14:29:56 UTC 2016
On Fri, 2016-03-25 at 08:00 -0600, Alex Schultz wrote:
>
> On Fri, Mar 25, 2016 at 7:32 AM, Dmitry Guryanov <dguryanov at mirantis.com> wrote:
> > Here is the bug I'm trying to fix:
> > https://bugs.launchpad.net/fuel/+bug/1538587
> >
> > In VMs set up with fuel-virtualbox, a kernel panic occurs every
> > time you delete a node. The stack trace shows an error in the ext4
> > driver [1], the same as in the bug report.
> >
> > Here is a patch: https://review.openstack.org/297669 . I've
> > checked it with VirtualBox VMs and it works fine.
> >
> > I also propose not rebooting nodes on kernel panic, so that
> > we can catch possible errors, but maybe that's too dangerous before
> > the release.
> >
> >
> The panic is in there to prevent controllers from staying active with
> a bad disk. If the file system on a controller goes read-only, the
> node stays in the cluster and causes errors with the OpenStack
> deployment. The node erase code tries to disable this prior to
> erasing the disk so if it's not working we need to fix that, not
> remove it.
With my patch, erasing the disks will not cause any filesystem errors.
The node will stay fully operable until it is rebooted.
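
For anyone who wants to check whether a node would panic on an ext4
error at all: the "panic forced after error" line in trace [1] is what
ext4 prints when the errors=panic behaviour is in effect. Below is a
minimal sketch, not the code from erase_node.rb, that scans text in
/proc/mounts format for ext4 mounts carrying that option. The function
name is hypothetical, and a default stored in the superblock may not
show up in the mount options, so an empty result is not a guarantee.

```python
def panic_on_error_mounts(mounts_text):
    """Return mount points of ext4 filesystems listed with errors=panic.

    `mounts_text` is expected in /proc/mounts format:
    device, mount point, fs type, comma-separated options, ...
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type == "ext4" and "errors=panic" in options.split(","):
            result.append(mount_point)
    return result
```

On a live node this could be called as
panic_on_error_mounts(open("/proc/mounts").read()).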
> Thanks,
> -Alex
>
> > [1]
> > [13607.545119] EXT4-fs error (device dm-0) in
> > ext4_reserve_inode_write:4928: IO failure
> > [13608.157968] EXT4-fs error (device dm-0) in
> > ext4_reserve_inode_write:4928: IO failure
> > [13608.780695] EXT4-fs error (device dm-0) in
> > ext4_reserve_inode_write:4928: IO failure
> > [13609.471245] Aborting journal on device dm-0-8.
> > [13609.478549] EXT4-fs error (device dm-0) in
> > ext4_dirty_inode:5047: IO failure
> > [13610.069244] EXT4-fs error (device dm-0) in
> > ext4_dirty_inode:5047: IO failure
> > [13610.698915] Kernel panic - not syncing: EXT4-fs (device dm-0):
> > panic forced after error
> > [13610.698915]
> > [13611.060673] CPU: 0 PID: 8676 Comm: systemd-udevd Not tainted
> > 3.13.0-83-generic #127-Ubuntu
> > [13611.236566] Hardware name: innotek GmbH VirtualBox/VirtualBox,
> > BIOS VirtualBox 12/01/2006
> > [13611.887198] 00000000fffffffb ffff88003b6e9a08 ffffffff81725992
> > ffffffff81a77878
> > [13612.527154] ffff88003b6e9a80 ffffffff8171e80b ffffffff00000010
> > ffff88003b6e9a90
> > [13613.037061] ffff88003b6e9a30 ffff88003b6e9a50 ffff8800367f2ad0
> > 0000000000000040
> > [13613.717119] Call Trace:
> > [13613.927162] [<ffffffff81725992>] dump_stack+0x45/0x56
> > [13614.306858] [<ffffffff8171e80b>] panic+0xc8/0x1e1
> > [13614.767154] [<ffffffff8125e7c6>]
> > ext4_handle_error.part.187+0xa6/0xb0
> > [13615.187201] [<ffffffff8125eddb>] __ext4_std_error+0x7b/0x100
> > [13615.627960] [<ffffffff81244c64>]
> > ext4_reserve_inode_write+0x44/0xa0
> > [13616.007943] [<ffffffff81247f80>] ? ext4_dirty_inode+0x40/0x60
> > [13616.448084] [<ffffffff81244d04>]
> > ext4_mark_inode_dirty+0x44/0x1f0
> > [13616.917611] [<ffffffff8126f7f9>] ?
> > __ext4_journal_start_sb+0x69/0xe0
> > [13617.367730] [<ffffffff81247f80>] ext4_dirty_inode+0x40/0x60
> > [13617.747567] [<ffffffff811e858a>] __mark_inode_dirty+0x10a/0x2d0
> > [13618.088060] [<ffffffff811d94e1>] update_time+0x81/0xd0
> > [13618.467965] [<ffffffff811d96f0>] file_update_time+0x80/0xd0
> > [13618.977649] [<ffffffff811511f0>]
> > __generic_file_aio_write+0x180/0x3d0
> > [13619.467993] [<ffffffff81151498>]
> > generic_file_aio_write+0x58/0xa0
> > [13619.978080] [<ffffffff8123c712>] ext4_file_write+0xa2/0x3f0
> > [13620.467624] [<ffffffff81158066>] ?
> > free_hot_cold_page_list+0x46/0xa0
> > [13621.038045] [<ffffffff8115d400>] ? release_pages+0x80/0x210
> > [13621.408080] [<ffffffff811bdf5a>] do_sync_write+0x5a/0x90
> > [13621.818155] [<ffffffff810e52f6>] do_acct_process+0x4e6/0x5c0
> > [13622.278005] [<ffffffff810e5a91>] acct_process+0x71/0xa0
> > [13622.597617] [<ffffffff8106a3cf>] do_exit+0x80f/0xa50
> > [13622.968015] [<ffffffff811c041e>] ? ____fput+0xe/0x10
> > [13623.337738] [<ffffffff8106a68f>] do_group_exit+0x3f/0xa0
> > [13623.738020] [<ffffffff8106a704>] SyS_exit_group+0x14/0x20
> > [13624.137447] [<ffffffff8173659d>] system_call_fastpath+0x1a/0x1f
> > [13624.518044] Rebooting in 10 seconds..
> >
> > On Tue, Mar 22, 2016 at 1:07 PM, Dmitry Guryanov <dguryanov at mirantis.com> wrote:
> > > Hello,
> > >
> > > Here is the start of the discussion:
> > > http://lists.openstack.org/pipermail/openstack-dev/2015-December/083021.html
> > > I subscribed to this mailing list later, so I can't reply in that
> > > thread.
> > >
> > > Currently we clear a node's disks in two places: first before the
> > > reboot into the bootstrap image [0], and second just before
> > > provisioning in fuel-agent [1].
> > >
> > > Erasing the first megabyte of disk data is meant to solve two
> > > problems: the node should not boot from the HDD after reboot, and
> > > the new partitioning scheme should overwrite the previous one.
> > >
> > > The first problem could be solved by zeroing the first 512 bytes of
> > > each disk (not partition). Only 446, to be precise, because the
> > > last 66 bytes hold the partition table (64 bytes) and the boot
> > > signature (2 bytes), see
> > > https://wiki.archlinux.org/index.php/Master_Boot_Record .
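
To make the 446/66 split concrete, here is a small sketch of the
classic MBR layout: 446 bytes of bootstrap code, four 16-byte partition
entries and a 2-byte boot signature. The function and constant names
are only illustrative and do not come from fuel-astute or fuel-agent.

```python
SECTOR_SIZE = 512
BOOT_CODE_SIZE = 446        # bootstrap code area
PARTITION_ENTRY_SIZE = 16   # one partition table entry
PARTITION_ENTRIES = 4       # classic MBR has four primary entries


def split_mbr(sector):
    """Split a 512-byte MBR sector into (boot_code, partition_table, signature)."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("expected exactly 512 bytes")
    table_end = BOOT_CODE_SIZE + PARTITION_ENTRY_SIZE * PARTITION_ENTRIES
    boot_code = sector[:BOOT_CODE_SIZE]
    partition_table = sector[BOOT_CODE_SIZE:table_end]
    signature = sector[table_end:]
    return boot_code, partition_table, signature
```

Zeroing only the boot_code part is enough to stop the BIOS from
booting that disk, while the partition table stays readable.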
> > >
> > > The second problem should be solved only after the reboot into
> > > bootstrap, because if we bring a new node into the cluster from
> > > some other place and boot it with the bootstrap image, its disks
> > > may already contain partitions, md devices and LVM volumes. So all
> > > of these should be correctly cleared before provisioning, not
> > > before reboot, and fuel-agent does that in [1].
> > >
> > > I propose that we stop erasing the first 1M of each partition,
> > > because it can lead to errors in the filesystem kernel drivers and
> > > to a kernel panic. The existing workaround, rebooting on kernel
> > > panic, is bad because the panic may occur just after the first
> > > partition of the first disk has been cleared; after the reboot the
> > > BIOS will then read the MBR of the second disk and boot from it
> > > instead of from the network. Let's just clear the first 446 bytes
> > > of each disk.
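
A minimal sketch of what "clear the first 446 bytes" could look like,
assuming the disk is reachable as a path such as /dev/sda and the
process has write access to it. This is an illustration only, not the
code from the patch; the function name is hypothetical.

```python
import os

BOOT_CODE_SIZE = 446  # MBR bootstrap code area; partition table follows it


def wipe_boot_code(path, size=BOOT_CODE_SIZE):
    """Overwrite the first `size` bytes of `path` with zeros.

    Everything after `size` bytes is left untouched, so no mounted
    filesystem on the disk has its metadata overwritten.
    """
    # "r+b" opens for in-place update without truncating the target.
    with open(path, "r+b") as dev:
        dev.seek(0)
        dev.write(b"\x00" * size)
        dev.flush()
        os.fsync(dev.fileno())  # push the write down to the device
```

It can be tried safely on a regular file first, for example a disk
image, before pointing it at a real device.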
> > >
> > >
> > > [0] https://github.com/openstack/fuel-astute/blob/master/mcagents/erase_node.rb#L162-L174
> > > [1] https://github.com/openstack/fuel-agent/blob/master/fuel_agent/manager.py#L194-L221
> > >
> > >
> > > --
> > > Dmitry Guryanov
> > >
> >
> >
> > --
> > Dmitry Guryanov
> >
> > ___________________________________________________________________
> > _______
> > OpenStack Development Mailing List (not for usage questions)
> > Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsu
> > bscribe
> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> >