[Openstack] BUG: soft lockup messages

Matej Mailing mailing at tam.si
Wed Jan 7 10:40:17 UTC 2015


2015-01-07 9:06 GMT+01:00 Ritesh Raj Sarraf <rrs at researchut.com>:
>
>
> On 01/05/2015 03:46 PM, Matej Mailing wrote:
>> Hello Erhan,
>>
>> soft lock-up has just happened again on the instance node. I am
>> monitoring traffic on the network interface through which the NFS
>> server is connected, and the interface has constantly been under 20%
>> of its capacity; the load on the NFS server is also low - though I am
>> unsure whether this is relevant at all...
>>
>> What makes me wonder - is it "normal" that both lock-ups report the
>> same duration (51s), even though they occurred on two different CPUs
>> and in two different PIDs?
>>
>> The output from the log is:
>>
>> Jan  5 11:01:13 postar kernel: [477123.485080] NMI watchdog: BUG: soft
>> lockup - CPU#3 stuck for 51s! [mysqld:2612]
>> Jan  5 11:01:13 postar kernel: [477123.485151] Modules linked in:
>> xt_hl ip6t_rt nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT
>> nf_reject_ipv4 xt_limit
>> xt_tcpudp xt_addrtype nf_conntrack_ipv4 nf_defrag_ipv4 xt_state
>> ip6table_filter ip6_tables nfsd nfs_acl auth_rpcgss
>> nf_conntrack_netbios_ns
>> nf_conntrack_broadcast nfs nf_nat_ftp nf_nat ppdev nf_conntrack_ftp
>> nf_conntrack fscache parport_pc parport lockd joydev pvpanic
>> iptable_filter cirrus
>> ip_tables 8250_fintek ttm psmouse drm_kms_helper sunrpc x_tables
>> serio_raw hid_generic drm grace mac_hid sysimgblt sysfillrect
>> syscopyarea i2c_piix4
>> usbhid hid floppy
>> Jan  5 11:01:13 postar kernel: [477123.485200] CPU: 3 PID: 2612 Comm:
>> mysqld Tainted: G             L 3.18.1-031801-generic #201412170637
>> Jan  5 11:01:13 postar kernel: [477123.485202] Hardware name:
>> OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011
>> Jan  5 11:01:13 postar kernel: [477123.485205] task: ffff880037ab6400
>> ti: ffff880206e40000 task.ti: ffff880206e40000
>> Jan  5 11:01:13 postar kernel: [477123.485207] RIP:
>> 0010:[<ffffffff813a97dc>]  [<ffffffff813a97dc>]
>> copy_user_generic_string+0x2c/0x40
>> Jan  5 11:01:13 postar kernel: [477123.485217] RSP:
>> 0018:ffff880206e43c90  EFLAGS: 00010246
>> Jan  5 11:01:13 postar kernel: [477123.485219] RAX: 0000000076793000
>> RBX: ffff880206e43ea0 RCX: 0000000000000200
>> Jan  5 11:01:13 postar kernel: [477123.485221] RDX: 0000000000000000
>> RSI: ffff880076793000 RDI: 00007fd365448000
>> Jan  5 11:01:13 postar kernel: [477123.485223] RBP: ffff880206e43d08
>> R08: ffffea0001d9e4dc R09: ffff880206e43ca0
>> Jan  5 11:01:13 postar kernel: [477123.485225] R10: ffff8801b76ca6f0
>> R11: 0000000000000293 R12: 0000000000000000
>> Jan  5 11:01:13 postar kernel: [477123.485227] R13: ffff880206e43ec8
>> R14: 0000000000001000 R15: 00007fd365448000
>> Jan  5 11:01:13 postar kernel: [477123.485235] FS:
>> 00007fd36c1fb700(0000) GS:ffff88023fd80000(0000)
>> knlGS:0000000000000000
>> Jan  5 11:01:13 postar kernel: [477123.485237] CS:  0010 DS: 0000 ES:
>> 0000 CR0: 0000000080050033
>> Jan  5 11:01:13 postar kernel: [477123.485239] CR2: 000000000b416003
>> CR3: 00000002317ee000 CR4: 00000000000006e0
>> Jan  5 11:01:13 postar kernel: [477123.485253] Stack:
>> Jan  5 11:01:13 postar kernel: [477123.485255]  ffffffff811a0f75
>> 000000000012f2ac ffff8801b76ca6f0 ffff880206e43cc8
>> Jan  5 11:01:13 postar kernel: [477123.485259]  ffffffff8117864e
>> ffff880076793000 ffffea0001d9e4c0 0000000000001000
>> Jan  5 11:01:13 postar kernel: [477123.485262]  ffff880076793000
>> ffffea0001d9e4c0 ffffea0001d9e4c0 ffff880231e4b930
>> Jan  5 11:01:13 postar kernel: [477123.485265] Call Trace:
>> Jan  5 11:01:13 postar kernel: [477123.485273]  [<ffffffff811a0f75>] ?
>> copy_page_to_iter_iovec+0xe5/0x300
>> Jan  5 11:01:13 postar kernel: [477123.485279]  [<ffffffff8117864e>] ?
>> find_get_entry+0x1e/0x90
>> Jan  5 11:01:13 postar kernel: [477123.485282]  [<ffffffff811a14a6>]
>> copy_page_to_iter+0x16/0x70
>> Jan  5 11:01:13 postar kernel: [477123.485286]  [<ffffffff81179428>]
>> do_generic_file_read+0x1f8/0x490
>> Jan  5 11:01:13 postar kernel: [477123.485289]  [<ffffffff8117a234>]
>> generic_file_read_iter+0xf4/0x150
>> Jan  5 11:01:13 postar kernel: [477123.485294]  [<ffffffff810aade1>] ?
>> update_curr+0x141/0x1f0
>> Jan  5 11:01:13 postar kernel: [477123.485298]  [<ffffffff811eef28>]
>> new_sync_read+0x78/0xb0
>> Jan  5 11:01:13 postar kernel: [477123.485301]  [<ffffffff811f013b>]
>> vfs_read+0xab/0x180
>> Jan  5 11:01:13 postar kernel: [477123.485304]  [<ffffffff811f0402>]
>> SyS_pread64+0x92/0xa0
>> Jan  5 11:01:13 postar kernel: [477123.485309]  [<ffffffff817b376d>]
>> system_call_fastpath+0x16/0x1b
>> Jan  5 11:01:13 postar kernel: [477123.485311] Code: 66 90 83 fa 08 72
>> 27 89 f9 83 e1 07 74 15 83 e9 08 f7 d9 29 ca 8a 06 88 07 48 ff c6 48
>> ff c7 ff c9 75 f$
>> Jan  5 11:01:13 postar kernel: [477123.486574] NMI watchdog: BUG: soft
>> lockup - CPU#1 stuck for 51s! [mysqld:2282]
>> Jan  5 11:01:13 postar kernel: [477123.486633] Modules linked in:
>> xt_hl ip6t_rt nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT
>> nf_reject_ipv4 xt_limit xt_tcpud$
>> Jan  5 11:01:13 postar kernel: [477123.486669] CPU: 1 PID: 2282 Comm:
>> mysqld Tainted: G             L 3.18.1-031801-generic #201412170637
>> Jan  5 11:01:13 postar kernel: [477123.486671] Hardware name:
>> OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011
>> Jan  5 11:01:13 postar kernel: [477123.486673] task: ffff880232bbb200
>> ti: ffff880232bd8000 task.ti: ffff880232bd8000
>> Jan  5 11:01:13 postar kernel: [477123.486675] RIP:
>> 0010:[<ffffffff813a97dc>]  [<ffffffff813a97dc>]
>> copy_user_generic_string+0x2c/0x40
>> Jan  5 11:01:13 postar kernel: [477123.486680] RSP:
>> 0018:ffff880232bdbc18  EFLAGS: 00010246
>> Jan  5 11:01:13 postar kernel: [477123.486683] RAX: 00007fd35f714000
>> RBX: 0000000000001000 RCX: 0000000000000200
>> Jan  5 11:01:13 postar kernel: [477123.486685] RDX: 0000000000000000
>> RSI: 00007fd35f714000 RDI: ffff88007131c000
>> Jan  5 11:01:13 postar kernel: [477123.486687] RBP: ffff880232bdbc28
>> R08: ffffea0001c4c700 R09: 00000000fffff000
>> Jan  5 11:01:13 postar kernel: [477123.486689] R10: ffff8801b77814e0
>> R11: 0000000000000293 R12: 0000000000001000
>> Jan  5 11:01:13 postar kernel: [477123.486691] R13: ffff880232bdbea0
>> R14: 0000000000000000 R15: 0000000000001000
>> Jan  5 11:01:13 postar kernel: [477123.486697] FS:
>> 00007fd35afe7700(0000) GS:ffff88023fc80000(0000)
>> knlGS:0000000000000000
>> Jan  5 11:01:13 postar kernel: [477123.486699] CS:  0010 DS: 0000 ES:
>> 0000 CR0: 0000000080050033
>> Jan  5 11:01:13 postar kernel: [477123.486702] CR2: 000000000cdc3001
>> CR3: 00000002317ee000 CR4: 00000000000006e0
>> Jan  5 11:01:13 postar kernel: [477123.486714] Stack:
>> Jan  5 11:01:13 postar kernel: [477123.486716]  ffffffff811a0386
>> 0000000120094000 ffff880232bdbc78 ffffffff811a0a35
>> Jan  5 11:01:13 postar kernel: [477123.486719]  ffff880232bdbc88
>> 0000000000000000 ffff880232bdbc78 0000000120094000
>> Jan  5 11:01:13 postar kernel: [477123.486722]  0000000000001000
>> ffff880232bdbea0 ffff880231e4b930 0000000000000000
>> Jan  5 11:01:13 postar kernel: [477123.486726] Call Trace:
>> Jan  5 11:01:13 postar kernel: [477123.486752]  [<ffffffff811a0386>] ?
>> copy_from_user_atomic_iovec+0x56/0x80
>> Jan  5 11:01:13 postar kernel: [477123.486757]  [<ffffffff811a0a35>]
>> iov_iter_copy_from_user_atomic+0xd5/0xe0
>> Jan  5 11:01:13 postar kernel: [477123.486761]  [<ffffffff81177410>]
>> generic_perform_write+0xe0/0x1c0
>> Jan  5 11:01:13 postar kernel: [477123.486766]  [<ffffffff8120a0f1>] ?
>> update_time+0x81/0xc0
>> Jan  5 11:01:13 postar kernel: [477123.486770]  [<ffffffff8120e4a2>] ?
>> mnt_clone_write+0x12/0x30
>> Jan  5 11:01:13 postar kernel: [477123.486773]  [<ffffffff81179e9f>]
>> __generic_file_write_iter+0x16f/0x350
>> Jan  5 11:01:13 postar kernel: [477123.486778]  [<ffffffff8126b7d9>]
>> ext4_file_write_iter+0x119/0x3d0
>> Jan  5 11:01:13 postar kernel: [477123.486783]  [<ffffffff810efcd8>] ?
>> get_futex_key+0x1f8/0x2e0
>> Jan  5 11:01:13 postar kernel: [477123.486786]  [<ffffffff811ef0eb>]
>> new_sync_write+0x7b/0xb0
>> Jan  5 11:01:13 postar kernel: [477123.486789]  [<ffffffff811eff67>]
>> vfs_write+0xc7/0x1f0
>> Jan  5 11:01:13 postar kernel: [477123.486792]  [<ffffffff811f04a2>]
>> SyS_pwrite64+0x92/0xa0
>> Jan  5 11:01:13 postar kernel: [477123.486795]  [<ffffffff817b376d>]
>> system_call_fastpath+0x16/0x1b
>> Jan  5 11:01:13 postar kernel: [477123.486797] Code: 66 90 83 fa 08 72
>> 27 89 f9 83 e1 07 74 15 83 e9 08 f7 d9 29 ca 8a 06 88 07 48 ff c6 48
>> ff c7 ff c9
>> 75 f2 89 d1 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 31 c0 66 66 90 c3
>> 0f 1f 80 00 00 00 00 66
>>
>>
>> Thanks,
>> Matej
>
>
> The profile of this hardware looks virtualized. That makes me ask: have
> you installed the para-virtualized drivers?
>
>
> --
> Given the large number of mailing lists I follow, I request you to CC me
> in replies for quicker response
>


Hi all, just an update:
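
Re: Ritesh's question about para-virtualized drivers - the guest disk
shows up as vda in the iostat output below, so virtio-blk appears to be
in use (there are no virtio modules in the module lists above, so
presumably the drivers are built into this mainline kernel). If I have
it right, a quick way to double-check is:

    ls -l /sys/block/vda/device/driver   # should point at virtio_blk
    lsmod | grep virtio                  # empty if the drivers are built in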

today we got an RCU stall and then a soft lock-up:
Jan  7 08:05:28 postar kernel: [639021.260041] INFO: rcu_sched
self-detected stall on CPU { 2}  (t=91560 jiffies g=4180521 c=4180520
q=0)
Jan  7 08:05:28 postar kernel: [639021.260041] Task dump for CPU 2:
Jan  7 08:05:28 postar kernel: [639021.260041] /usr/sbin/spamd R
running task        0  9932  24127 0x00000008
Jan  7 08:05:28 postar kernel: [639021.260041]  ffffffff81c55b00
ffff88023fd03cf8 ffffffff810a0ee8 0000000000000002
Jan  7 08:05:28 postar kernel: [639021.260041]  ffffffff81c55b00
ffff88023fd03d18 ffffffff810a388f ffff88023fd03d58
Jan  7 08:05:28 postar kernel: [639021.260041]  ffffffff81c55b00
ffff88023fd03d48 ffffffff810d3e7b ffffffff81c55b00
Jan  7 08:05:28 postar kernel: [639021.260041] Call Trace:
Jan  7 08:05:28 postar kernel: [639021.260041]  <IRQ>
[<ffffffff810a0ee8>] sched_show_task+0xa8/0x110
Jan  7 08:05:28 postar kernel: [639021.260041]  [<ffffffff810a388f>]
dump_cpu_task+0x3f/0x50
Jan  7 08:05:28 postar kernel: [639021.260041]  [<ffffffff810d3e7b>]
rcu_dump_cpu_stacks+0x8b/0xc0
Jan  7 08:05:28 postar kernel: [639021.260041]  [<ffffffff810d5cb7>]
print_cpu_stall+0xd7/0x140
Jan  7 08:05:28 postar kernel: [639021.260041]  [<ffffffff810d6405>]
check_cpu_stall.isra.53+0xa5/0xb0
Jan  7 08:05:28 postar kernel: [639021.260041]  [<ffffffff810d6448>]
__rcu_pending+0x38/0x1d0
Jan  7 08:05:28 postar kernel: [639021.260041]  [<ffffffff810d8065>]
rcu_check_callbacks+0xa5/0x190
Jan  7 08:05:28 postar kernel: [639021.260041]  [<ffffffff810dd648>]
update_process_times+0x48/0x80
Jan  7 08:05:28 postar kernel: [639021.260041]  [<ffffffff810ed1a3>]
tick_sched_handle.isra.17+0x33/0x70
Jan  7 08:05:28 postar kernel: [639021.260041]  [<ffffffff810ed2dc>]
tick_sched_timer+0x4c/0x80
Jan  7 08:05:28 postar kernel: [639021.260041]  [<ffffffff810de106>]
__run_hrtimer+0x76/0x230
Jan  7 08:05:28 postar kernel: [639021.260041]  [<ffffffff810ed290>] ?
tick_nohz_handler+0xb0/0xb0
Jan  7 08:05:28 postar kernel: [639021.260041]  [<ffffffff810de90f>]
hrtimer_interrupt+0xff/0x260
Jan  7 08:05:28 postar kernel: [639021.260041]  [<ffffffff8104ae3b>]
local_apic_timer_interrupt+0x3b/0x60
Jan  7 08:05:28 postar kernel: [639021.260041]  [<ffffffff817b65f5>]
smp_apic_timer_interrupt+0x45/0x60
Jan  7 08:05:28 postar kernel: [639021.260041]  [<ffffffff817b46bd>]
apic_timer_interrupt+0x6d/0x80
....

and then:
Jan  7 08:05:28 postar kernel: [639377.621033] NMI watchdog: BUG: soft
lockup - CPU#1 stuck for 92s! [apache2:5279]
....
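
Two asides on the numbers: if I read it right, t=91560 jiffies in the
stall message works out to roughly 366 seconds at HZ=250, which I believe
the Ubuntu mainline builds use. Also, the soft-lockup detector's window
is tunable via kernel.watchdog_thresh, so I may widen it temporarily to
cut log noise while we debug - something like this (a sketch; the default
of 10 is from memory):

    sysctl kernel.watchdog_thresh        # default should be 10 (seconds)
    sysctl -w kernel.watchdog_thresh=20  # temporarily widen the window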


This time I have iostat stats for this exact moment, since I kept iostat
logging with a timestamping loop along these lines (a sketch; the exact
flags may have differed):
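
    iostat -xk 1 | while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
    done >> iostat.log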
The last entry before Jan  7 08:05:28 is:
2015-01-07 07:59:31 avg-cpu:  %user   %nice %system %iowait  %steal   %idle
2015-01-07 07:59:31            0.00    0.00   48.46    0.00   17.75   33.79
2015-01-07 07:59:31
2015-01-07 07:59:31 Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
2015-01-07 07:59:31 vda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
2015-01-07 07:59:31


and then:
2015-01-07 08:05:28 avg-cpu:  %user   %nice %system %iowait  %steal   %idle
2015-01-07 08:05:28            5.88    0.00   11.76    0.00   17.65   64.71
2015-01-07 08:05:28
2015-01-07 08:05:28 Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
2015-01-07 08:05:28 vda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
2015-01-07 08:05:28
2015-01-07 08:05:29 avg-cpu:  %user   %nice %system %iowait  %steal   %idle
2015-01-07 08:05:29            8.59    0.00   15.46   16.15   13.40   46.39
2015-01-07 08:05:29
2015-01-07 08:05:29 Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
2015-01-07 08:05:29 vda               0.00     0.00    0.00  230.69     0.00  1679.21    14.56    32.52  140.98    0.00  140.98   3.30  76.04
2015-01-07 08:05:29
2015-01-07 08:05:30 avg-cpu:  %user   %nice %system %iowait  %steal   %idle
2015-01-07 08:05:30            4.21    0.00   52.87    5.36   21.84   15.71
2015-01-07 08:05:30
2015-01-07 08:05:30 Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
2015-01-07 08:05:30 vda               0.00     0.00    0.00  111.00     0.00   448.00     8.07     0.83    7.50    0.00    7.50   5.05  56.00
2015-01-07 08:05:30
2015-01-07 08:05:31 avg-cpu:  %user   %nice %system %iowait  %steal   %idle
2015-01-07 08:05:31           15.41    0.00   54.14    3.01   15.41   12.03
2015-01-07 08:05:31
2015-01-07 08:05:31 Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
2015-01-07 08:05:31 vda               0.00     0.00    0.00  266.00     0.00  1044.00     7.85     0.78    2.93    0.00    2.93   1.68  44.80
2015-01-07 08:05:31
2015-01-07 08:05:32 avg-cpu:  %user   %nice %system %iowait  %steal   %idle
2015-01-07 08:05:32           13.79    0.00   63.60    2.30   19.16    1.15
2015-01-07 08:05:32
2015-01-07 08:05:32 Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
2015-01-07 08:05:32 vda               0.00     0.00    0.00  120.59     0.00   478.43     7.93     0.38    3.15    0.00    3.15   2.05  24.71
2015-01-07 08:05:32
2015-01-07 08:05:33 avg-cpu:  %user   %nice %system %iowait  %steal   %idle
2015-01-07 08:05:33           18.61    0.00   33.94    8.39   14.60   24.45
2015-01-07 08:05:33
2015-01-07 08:05:33 Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
2015-01-07 08:05:33 vda               0.00     0.00    0.00  274.26     0.00  6637.62    48.40     4.82   17.56    0.00   17.56   1.47  40.40
2015-01-07 08:05:33

Then the load slowly goes down. Can we blame the I/O load?
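
One thing I notice in the output above: %steal stays around 13-22% the
whole time, so CPU contention on the hypervisor side may be a factor as
well as guest I/O. I will keep watching steal time next to the disk
stats, e.g. with:

    vmstat 1    # the "st" column is CPU time stolen by the hypervisor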

Thanks,
Matej



