lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [day] [month] [year] [list]
Message-ID: <accb30e3-9d6d-777e-fe84-7ae810cb2495@suddenlinkmail.com>
Date:   Sun, 3 Jun 2018 03:08:21 -0500
From:   "David C. Rankin" <drankinatty@...denlinkmail.com>
To:     kernel <linux-kernel@...r.kernel.org>
Subject: 4.16.x takes 3 minutes to boot on Supermicro H8DM8-2/H8DM8-2, 4.14
 LTS boots fine

All,

  I have been trying to find the reason the 4.16.x (x could begin about 4 or
so) kernel now reports:

NMI handler (perf_event_nmi_handler) took too long to run

which starts a chain of checks that ends up looping through each core on the
dual-quad-core Opeterons pushing the boot time out to 4+ minutes. The sequence
of failures on boot progresses as follows:


[    0.464392] NET: Registered protocol family 1
[    0.464769] PCI Interrupt Link [LUB0] enabled at IRQ 23
<all OK to here>

[   10.412957] perf: interrupt took too long (10078 > 2500), lowering
kernel.perf_event_max_sample_rate to 19800
[   10.466267] INFO: NMI handler (perf_event_nmi_handler) took too long to
run: 3.398 msecs
[   10.466267] perf: interrupt took too long (23235 > 12597), lowering
kernel.perf_event_max_sample_rate to 8400
[   11.139636] INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to
run: 38.086 msecs
[   11.138340] perf: interrupt took too long (979330 > 911352), lowering
kernel.perf_event_max_sample_rate to 300
[   11.391081] INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to
run: 72.692 msecs
[   11.139636] INFO: NMI handler (perf_event_nmi_handler) took too long to
run: 381.378 msecs
[   12.138796] perf: interrupt took too long (1897735 > 1224162), lowering
kernel.perf_event_max_sample_rate to 300
[   25.142864] INFO: NMI handler (perf_event_nmi_handler) took too long to
run: 1947.833 msecs
[   28.140139] perf: interrupt took too long (13755889 > 2372168), lowering
kernel.perf_event_max_sample_rate to 300
[   28.138850] watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [swapper/0:0]
[   28.138850] Modules linked in:
[   28.138850] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.16.13-1-ARCH #1
[   86.138261] Hardware name: Supermicro H8DM8-2/H8DM8-2, BIOS 080014  11/06/2009
[   90.139366] RIP: 0010:__do_softirq+0x75/0x2e0
[   90.139366] RSP: 0018:ffff9b36e7c03f70 EFLAGS: 00000297 ORIG_RAX:
ffffffffffffff12
[   90.139366] RAX: 0000000000021000 RBX: ffff9b36e7c15f00 RCX: 000000003543891f
[   91.138917] INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to
run: 467.840 msecs
[   91.138917] RDX: 0000000000000000 RSI: 000000003543891f RDI: 0000000000000100
[   91.138917] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000078
[   91.138917] R10: 000000005b1397ce R11: 0000000000000000 R12: 0000000000000000
[   91.138917] R13: 0000000000000000 R14: 0000000000000282 R15: 0000000000000000
[   91.138917] FS:  0000000000000000(0000) GS:ffff9b36e7c00000(0000)
knlGS:0000000000000000
[   91.138917] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   91.138917] NMI watchdog: Watchdog detected hard LOCKUP on cpu 4
[   92.138146] CR2: ffff9b33d2655000 CR3: 000000011200a000 CR4: 00000000000006f0
[   92.138146] Call Trace:
[   92.138146]  <IRQ>
[   92.138146]  irq_exit+0xc9/0xe0
[   93.138944]  smp_apic_timer_interrupt+0x73/0x160
[   94.137773] Modules linked in:
[   94.137772]  apic_timer_interrupt+0xf/0x20
[   95.137578]  </IRQ>
[   95.137578] RIP: 0010:native_safe_halt+0x2/0x10
[   95.137578] RSP: 0018:ffffffffbf003ea0 EFLAGS: 00000202
[   96.136670] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.16.13-1-ARCH #1
[   97.138510] Hardware name: Supermicro H8DM8-2/H8DM8-2, BIOS 080014  11/06/2009
[   97.138081]  ORIG_RAX: ffffffffffffff12

<Roughly 900 lines snipped>

then continues

[  151.438482] Sending NMI from CPU 1 to CPUs 3:
[  151.469721] NMI backtrace for cpu 3
[  151.469722] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G             L
4.16.13-1-ARCH #1
[  151.469723] Hardware name: Supermicro H8DM8-2/H8DM8-2, BIOS 080014  11/06/2009
[  151.469724] RIP: 0010:native_queued_spin_lock_slowpath+0x19a/0x1c0
[  151.469725] RSP: 0018:ffff9b36e7cc3ef8 EFLAGS: 00000046
[  151.469726] RAX: 0000000000000000 RBX: ffffffffbf051640 RCX: ffff9b36e7ce2480
[  151.469726] RDX: 0000000000000002 RSI: ffffffffbee70e1c RDI: 0000000000100000
[  151.469727] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000101
[  151.469728] R10: 0000000000338518 R11: 0000000000000000 R12: ffff9b36e7ce2480
[  151.469728] R13: ffffffffbeed81f2 R14: 0000000000000009 R15: 0000000000000000
[  151.469729] FS:  0000000000000000(0000) GS:ffff9b36e7cc0000(0000)
knlGS:0000000000000000
[  151.469730] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  151.469731] CR2: 0000000000000000 CR3: 000000011200a000 CR4: 00000000000006e0
[  151.469731] Call Trace:
[  151.469732]  <IRQ>
[  151.469733]  _raw_spin_lock_irqsave+0x44/0x50
[  151.469735]  rcu_process_callbacks+0xce/0x3b0
[  151.469737]  __do_softirq+0xf1/0x2e0
[  151.469739]  irq_exit+0xc9/0xe0
[  151.469741]  smp_apic_timer_interrupt+0x73/0x160
[  151.469743]  apic_timer_interrupt+0xf/0x20
[  151.469744]  </IRQ>
[  151.469745] RIP: 0010:native_safe_halt+0x2/0x10
[  151.469745] RSP: 0018:ffff9d6a431bbec0 EFLAGS: 00000202 ORIG_RAX:
ffffffffffffff12
[  151.469747] RAX: 0000000000000003 RBX: 0000000000000003 RCX: 0000000000000000
[  151.469747] RDX: 0000000000000000 RSI: ffffffffbee70e1c RDI: ffffffffbee7107a
[  151.469748] RBP: ffffffffbf132b60 R08: 0000000000000000 R09: 0000000000000000
[  151.469748] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffbeed81f2
[  151.469749] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  151.469752]  default_idle+0x18/0x100
[  151.469753]  do_idle+0x179/0x1b0
[  151.469755]  cpu_startup_entry+0x6f/0x80
[  151.469757]  start_secondary+0x1aa/0x200
[  151.469758]  secondary_startup_64+0xa5/0xb0
[  151.469760] Code: 12 83 e0 03 83 ea 01 48 c1 e0 04 48 63 d2 48 05 80 24 02
00 48 03 04 d5 80 04 ef be 48 89 08 8b 41 08 85 c0 75 09 f3 90 8b 41 08 <85>
c0 74 f7 48 8b 01 48 85 c0 0f 84 54 ff ff ff 49 89 c0 0f 0d
[  151.438482] Sending NMI from CPU 1 to CPUs 4:
[  151.470739] NMI backtrace for cpu 4 skipped: idling at
native_safe_halt+0x2/0x10
[  151.438482] Sending NMI from CPU 1 to CPUs 5:
[  151.470007] NMI backtrace for cpu 5 skipped: idling at
native_safe_halt+0x2/0x10
[  151.438482] Sending NMI from CPU 1 to CPUs 7:
[  151.471011] NMI backtrace for cpu 7 skipped: idling at
native_safe_halt+0x2/0x10
[  151.474216] PCI Interrupt Link [LUB2] enabled at IRQ 22
[  151.490261] pci 0000:00:00.0: Found enabled HT MSI Mapping

and boot then completes normally.

All earlier kernels continue to boot fine without this delay. I have Archlinux
with both 4.16.13 and 4.14.47-LTS installed on this server, and 4.14.47 has no
such delay.

I have had no success trying to bisect the differences (I'm not that practiced
at it), but need help figuring out what changed and why I now have a 150
second delay where some test no loops through every core dumping info -- just
to figure out everything is OK and then continue the boot.

I have attached the full dmesg output here. If you need anything further, just
ask.


-- 
David C. Rankin, J.D.,P.E.

View attachment "dmesg_4.16.13" of type "text/plain" (116591 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ