Date:   Thu, 15 Jun 2023 12:29:19 +0100
From:   "Richard W.M. Jones" <rjones@...hat.com>
To:     YiFei Zhu <zhuyifei@...gle.com>
Cc:     dev@...ont.org, linux-kernel@...r.kernel.org, peterz@...radead.org,
        zhuyifei1999@...il.com
Subject: Re: printk.time causes rare kernel boot hangs

On Thu, Jun 15, 2023 at 11:04:29AM +0000, YiFei Zhu wrote:
> > FWIW attached is a test program that runs the qemu instances in
> > parallel (up to 8 threads), which seems to be a quicker way to hit the
> > problem for me.  Even on Intel, with this test I can hit the bug in a
> > few hundred iterations.
> 
> A friend sent me here so I took a look.
> 
> I was unable to reproduce with this script after 10000 iterations,
> on an AMD Gentoo Linux host:
> 
> Host kernel:  6.3.3 vanilla
> Guest kernel: git commit f31dcb152a3d0816e2f1deab4e64572336da197d
> Guest config: Provided full-fat Fedora config + CONFIG_GDB_SCRIPTS
> QEMU:         8.0.2 (with kvm_amd)
> Hardware:     AMD Ryzen 7 PRO 5850U
> 
> I wonder if anything on the host side affects this, or could be some
> sort of race condition.

We've had multiple independent reports of people reproducing the bug
since this story (unfortunately) hit Hacker News.  Your configuration
above should be able to reproduce it, so I still don't know what the
deciding factor is.

[...]

> If you can reproduce the original bug (without the msleep or busy wait
> patch), could you check if you can reproduce that with idle=poll? If so,
> can you run "p show_state_filter(0)" so we get a stack trace of kernel_init,
> assuming it hit a similar issue as if msleep was added. If idle=poll does
> not work, or you can't call functions from within gdb (some old qemu versions
> did not support this), see if you can send an alt-sysrq-w to show stacks of
> blocked tasks.

(1) Adding idle=poll to the guest kernel command line

=> Bug still occurs, with about the same frequency as before.
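
For anyone who wants to compare hang frequency with and without a
given kernel parameter, here is a minimal sketch of a parallel boot
loop.  It is not the test program attached earlier in the thread; the
function names, the timeout, and the qemu command line in the comment
are all assumptions for illustration.  A boot that does not finish
within the timeout is counted as a hang.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def boot_once(cmd, timeout_s):
    """Run one boot command.

    Returns True if the command finished within timeout_s seconds,
    False if it had to be killed (treated here as a hang).
    """
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        return True
    except subprocess.TimeoutExpired:
        # subprocess.run() kills and reaps the child on timeout.
        return False


def count_hangs(cmd, iterations, workers=8, timeout_s=60.0):
    """Run cmd `iterations` times, up to `workers` at once; count hangs."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(
            pool.map(lambda _: boot_once(cmd, timeout_s), range(iterations))
        )
    return results.count(False)


# Hypothetical usage (kernel path and guest options are assumptions;
# the guest must power off by itself for a boot to count as finished):
#
#   cmd = ["qemu-system-x86_64", "-nographic", "-no-reboot",
#          "-kernel", "vmlinux",
#          "-append", "console=ttyS0 idle=poll"]
#   print(count_hangs(cmd, iterations=1000))
```

Running the same loop with and without idle=poll in the -append
string gives two hang counts that can be compared directly.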

(2) Connect with gdb to qemu's gdb-stub:

Trying to evaluate show_state_filter(0) didn't work for reasons I
don't understand:

(gdb) target remote localhost:1234
Remote debugging using localhost:1234
warning: Remote gdbserver does not support determining executable automatically.
RHEL <=6.8 and <=7.2 versions of gdbserver do not support such automatic execut.
The following versions of gdbserver support it:
- Upstream version of gdbserver (unsupported) 7.10 or later
- Red Hat Developer Toolset (DTS) version of gdbserver from DTS 4.0 or later (o)
- RHEL-7.3 versions of gdbserver (on any architecture)
arch_static_branch (branch=false, key=<optimized out>)
    at ./arch/x86/include/asm/jump_label.h:27
27     asm_volatile_goto("1:"
(gdb) bt
#0  arch_static_branch (branch=false, key=<optimized out>)
    at ./arch/x86/include/asm/jump_label.h:27
#1  static_key_false (key=<optimized out>) at ./include/linux/jump_label.h:207
#2  native_write_msr (high=222, low=719927812, msr=1760)
    at ./arch/x86/include/asm/msr.h:147
#3  wrmsrl (val=954202667524, msr=1760) at ./arch/x86/include/asm/msr.h:262
#4  lapic_next_deadline (delta=474, evt=0xffff88804e81bf40)
    at arch/x86/kernel/apic/apic.c:491
#5  0xffffffff81143667 in clockevents_program_event (dev=0xffff88804e81bf40, 
    expires=<optimized out>, force=<optimized out>)
    at kernel/time/clockevents.c:334
#6  0xffffffff81143c0b in tick_handle_periodic (dev=0xffff88804e81bf40)
    at kernel/time/tick-common.c:133
#7  0xffffffff8105d01c in local_apic_timer_interrupt ()
    at arch/x86/kernel/apic/apic.c:1095
#8  __sysvec_apic_timer_interrupt (regs=regs@...ry=0xffffc90000003ee8)
    at arch/x86/kernel/apic/apic.c:1112
#9  0xffffffff81e61a91 in sysvec_apic_timer_interrupt (regs=0xffffc90000003ee8)
    at arch/x86/kernel/apic/apic.c:1106
#10 0xffffffff8200144a in asm_sysvec_apic_timer_interrupt ()
    at ./arch/x86/include/asm/idtentry.h:645
#11 0x0000000000000000 in ?? ()
(gdb) p show_state_filter(0)
[Inferior 1 (process 1) exited normally]
The program being debugged exited while in a function called from GDB.
Evaluation of the expression containing the function
(show_state_filter) will be abandoned.
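
As an aid to reading frames #2 to #4 of the backtrace: gdb prints the
MSR number and value in decimal.  The short sketch below converts
them, assuming the usual x86 convention that wrmsr takes the 64-bit
value as two 32-bit halves.  The constant name follows the kernel's
msr-index.h; the helper name is made up for this example.

```python
# TSC-deadline MSR number as defined in the kernel's msr-index.h.
MSR_IA32_TSC_DEADLINE = 0x6E0


def combine_msr(high, low):
    """Join the two 32-bit halves passed to native_write_msr()."""
    return (high << 32) | low


# Values copied from frames #2 and #3 of the backtrace above.
msr = 1760
high = 222
low = 719927812

val = combine_msr(high, low)

print(hex(msr))                      # 0x6e0
print(msr == MSR_IA32_TSC_DEADLINE)  # True
print(val)                           # 954202667524, as shown in frame #3
```

So the vCPU was stopped while programming the next TSC-deadline timer
from the periodic tick handler, which is consistent with
lapic_next_deadline() in frame #4.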

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top
