linux-kernel - Re: [BUG] workqueues and printk not playing nice since next-20240130

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <c6ce8816-c4ff-4668-8cbb-88285330057d@huaweicloud.com>
Date: Fri, 2 Feb 2024 15:26:14 +0100
From: Jonas Oberhauser <jonas.oberhauser@...weicloud.com>
To: paulmck@...nel.org, Tejun Heo <tj@...nel.org>,
 Lai Jiangshan <jiangshanlai@...il.com>, Petr Mladek <pmladek@...e.com>,
 Steven Rostedt <rostedt@...dmis.org>, John Ogness
 <john.ogness@...utronix.de>, Sergey Senozhatsky <senozhatsky@...omium.org>
Cc: Stephen Rothwell <sfr@...b.auug.org.au>, linux-kernel@...r.kernel.org,
 rcu@...r.kernel.org
Subject: Re: [BUG] workqueues and printk not playing nice since next-20240130


Am 2/2/2024 um 2:04 PM schrieb Paul E. McKenney:
> Hello!
>
> Starting with next-20240130 (and perhaps a bit earlier), rcutorture gets
> what initially looked like early-boot hangs, but only when running on
> dual-socket x86 systems [1], as it it works just fine on my x86 laptop [2].
> But when running on dual-socket systems, this happens all the time,
> perhaps because rcutorture works hard to split each guest OS across a
> socket boundary.
>
> This is the reproducer:
>
> tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 1m --configs "10*TREE01" --trust-make
>
> By "looked like early-boot hangs" I mean that qemu was quite happy,
> but there was absolutely no console output.
>
> Bisection identified this commit:
>
> 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues")
>
> Reverting this commit made the problem go away.  Except that it is really
> hard to imagine this commit having any effect whatsoever on early boot
> execution.  Of course, this might be a failure of imagination on my part,
> so I enlisted the aid of gdb:
>
> tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 1m --configs "TREE01" --trust-make --gdb
>
> After following the resulting gdb startup instructions and waiting for
> about ten seconds, I hit control-C on the gdb window and then:
>
> 	(gdb) bt
> 	#0  default_idle () at arch/x86/kernel/process.c:743
> 	#1  0xffffffff81e94d34 in default_idle_call () at kernel/sched/idle.c:97
> 	#2  0xffffffff810d746d in cpuidle_idle_call () at kernel/sched/idle.c:170
> 	#3  do_idle () at kernel/sched/idle.c:312
> 	#4  0xffffffff810d76a4 in cpu_startup_entry (state=state@...ry=CPUHP_ONLINE)
> 	    at kernel/sched/idle.c:410
> 	#5  0xffffffff81e95417 in rest_init () at init/main.c:730
> 	#6  0xffffffff8329adf2 in start_kernel () at init/main.c:1067
> 	#7  0xffffffff832a5038 in x86_64_start_reservations (
> 	    real_mode_data=real_mode_data@...ry=0x13d50 <exception_stacks+32080> <error: Cannot access memory at address 0x13d50>) at arch/x86/kernel/head64.c:555
> 	#8  0xffffffff832a513c in x86_64_start_kernel (
> 	    real_mode_data=0x13d50 <exception_stacks+32080> <error: Cannot access memory at address 0x13d50>) at arch/x86/kernel/head64.c:536
> 	#9  0xffffffff810001d2 in secondary_startup_64 ()
> 	    at arch/x86/kernel/head_64.S:461
> 	#10 0x0000000000000000 in ?? ()
> 	(gdb) print jiffies
> 	$1 = 4294676330
> 	(gdb) print system_state
> 	$2 = SYSTEM_RUNNING
>
> In other words, the system really has booted, and at least one CPU is
> happily idling in the idle loop.  And another CPU is (maybe not quite
> so happily) running rcutorture:
>
> 	(gdb) thread 6
> 	[Switching to thread 6 (Thread 1.6)]
> 	#0  0xffffffff8111160b in rcu_torture_one_read (
> 	    trsp=trsp@...ry=0xffffc900004abe90, myid=myid@...ry=4)
> 	    at kernel/rcu/rcutorture.c:2003
> 	2003            completed = cur_ops->get_gp_seq();
> 	(gdb) bt
> 	#0  0xffffffff8111160b in rcu_torture_one_read (
> 	    trsp=trsp@...ry=0xffffc900004abe90, myid=myid@...ry=4)
> 	    at kernel/rcu/rcutorture.c:2003
> 	#1  0xffffffff81111bef in rcu_torture_reader (arg=0x4 <fixed_percpu_data+4>)
> 	    at kernel/rcu/rcutorture.c:2097
> 	#2  0xffffffff810af3e0 in kthread (_create=0xffff8880047aa480)
> 	    at kernel/kthread.c:388
> 	#3  0xffffffff8103af1f in ret_from_fork (prev=<optimized out>,
> 	    regs=0xffffc900004abf58, fn=0xffffffff810af300 <kthread>,
> 	    fn_arg=0xffff8880047aa480) at arch/x86/kernel/process.c:147
> 	#4  0xffffffff8100247a in ret_from_fork_asm () at arch/x86/entry/entry_64.S:242
> 	#5  0x0000000000000000 in ?? ()
>
> So the system really did boot and is running just fine.  It is just that
> there is no console output.  Details, details!
>
> Is there anything I can do to some combination of workqueues and printk
> to help debug this?  Or that I can do to anything else, as I am not
> feeling all that picky.  ;-)


Just to exclude one source of troubles, have you tried turning all 
atomics into full barriers and seen if the issue still reproduces?

jonas


>
> 							Thanx, Paul
>
> [1] The dual-socket system is an 80-hardware-thread (20 cores per socket)
>      system with model name Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz.
>      I get the same results when using either of these compilers:
>      gcc version 8.5.0 20210514 (Red Hat 8.5.0-21) (GCC)
>      gcc version 11.4.1 20230605 (Red Hat 11.4.1-2) (GCC)
>
> [2] My laptop is a 16-hardware-thread (8 cores) single-socket system with
>      model name "Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz" and
>      gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04).