Message-ID: <7cdc0f04-819d-429c-9a2c-9ad25d85db55@paulmck-laptop>
Date: Mon, 21 Oct 2024 12:25:41 -0700
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Tomas Glozar <tglozar@...hat.com>
Cc: Valentin Schneider <vschneid@...hat.com>, Chen Yu <yu.c.chen@...el.com>,
	Peter Zijlstra <peterz@...radead.org>, linux-kernel@...r.kernel.org,
	sfr@...b.auug.org.au, linux-next@...r.kernel.org,
	kernel-team@...a.com
Subject: Re: [BUG almost bisected] Splat in dequeue_rt_stack() and build error

On Mon, Oct 14, 2024 at 11:55:05AM -0700, Paul E. McKenney wrote:
> On Thu, Oct 10, 2024 at 04:28:38PM -0700, Paul E. McKenney wrote:
> > On Thu, Oct 10, 2024 at 08:01:35AM -0700, Paul E. McKenney wrote:
> > > On Thu, Oct 10, 2024 at 01:24:11PM +0200, Tomas Glozar wrote:
> > > > On Wed, Oct 2, 2024 at 11:01 AM Tomas Glozar <tglozar@...hat.com> wrote:
> > > > >
> > > > > FYI I have managed to reproduce the bug on our infrastructure after 21
> > > > > hours of 7*TREE03 and I will continue with trying to reproduce it with
> > > > > the tracers we want.
> > > > >
> > > > > Tomas
> > > > 
> > > > I successfully reproduced the bug also with the tracers active after a
> > > > few 8-hour test runs on our infrastructure:
> > > > 
> > > > [    0.000000] Linux version 6.11.0-g2004cef11ea0-dirty (...) #1 SMP
> > > > PREEMPT_DYNAMIC Wed Oct  9 12:13:40 EDT 2024
> > > > [    0.000000] Command line: debug_boot_weak_hash panic=-1 selinux=0
> > > > initcall_debug debug console=ttyS0 rcutorture.n_barrier_cbs=4
> > > > rcutorture.stat_interval=15 rcutorture.shutdown_secs=25200
> > > > rcutorture.test_no_idle_hz=1 rcutorture.verbose=1
> > > > rcutorture.onoff_interval=200 rcutorture.onoff_holdoff=30
> > > > rcutree.gp_preinit_delay=12 rcutree.gp_init_delay=3
> > > > rcutree.gp_cleanup_delay=3 rcutree.kthread_prio=2 threadirqs
> > > > rcutree.use_softirq=0
> > > > trace_event=sched:sched_switch,sched:sched_wakeup
> > > > ftrace_filter=dl_server_start,dl_server_stop trace_buf_size=2k
> > > > ftrace=function torture.ftrace_dump_at_shutdown=1
> > > > ...
> > > > [13550.127541] WARNING: CPU: 1 PID: 155 at
> > > > kernel/sched/deadline.c:1971 enqueue_dl_entity+0x554/0x5d0
> > > > [13550.128982] Modules linked in:
> > > > [13550.129528] CPU: 1 UID: 0 PID: 155 Comm: rcu_torture_rea Tainted: G
> > > >        W          6.11.0-g2004cef11ea0-dirty #1
> > > > [13550.131419] Tainted: [W]=WARN
> > > > [13550.131979] Hardware name: Red Hat KVM/RHEL, BIOS 1.16.3-2.el9 04/01/2014
> > > > [13550.133230] RIP: 0010:enqueue_dl_entity+0x554/0x5d0
> > > > ...
> > > > [13550.151286] Call Trace:
> > > > [13550.151749]  <TASK>
> > > > [13550.152141]  ? __warn+0x88/0x130
> > > > [13550.152717]  ? enqueue_dl_entity+0x554/0x5d0
> > > > [13550.153485]  ? report_bug+0x18e/0x1a0
> > > > [13550.154149]  ? handle_bug+0x54/0x90
> > > > [13550.154792]  ? exc_invalid_op+0x18/0x70
> > > > [13550.155484]  ? asm_exc_invalid_op+0x1a/0x20
> > > > [13550.156249]  ? enqueue_dl_entity+0x554/0x5d0
> > > > [13550.157055]  dl_server_start+0x36/0xf0
> > > > [13550.157709]  enqueue_task_fair+0x220/0x6b0
> > > > [13550.158447]  activate_task+0x26/0x60
> > > > [13550.159131]  attach_task+0x35/0x50
> > > > [13550.159756]  sched_balance_rq+0x663/0xe00
> > > > [13550.160511]  sched_balance_newidle.constprop.0+0x1a5/0x360
> > > > [13550.161520]  pick_next_task_fair+0x2f/0x340
> > > > [13550.162290]  __schedule+0x203/0x900
> > > > [13550.162958]  ? enqueue_hrtimer+0x35/0x90
> > > > [13550.163703]  schedule+0x27/0xd0
> > > > [13550.164299]  schedule_hrtimeout_range_clock+0x99/0x120
> > > > [13550.165239]  ? __pfx_hrtimer_wakeup+0x10/0x10
> > > > [13550.165954]  torture_hrtimeout_us+0x7b/0xe0
> > > > [13550.166624]  rcu_torture_reader+0x139/0x200
> > > > [13550.167284]  ? __pfx_rcu_torture_timer+0x10/0x10
> > > > [13550.168019]  ? __pfx_rcu_torture_reader+0x10/0x10
> > > > [13550.168764]  kthread+0xd6/0x100
> > > > [13550.169262]  ? __pfx_kthread+0x10/0x10
> > > > [13550.169860]  ret_from_fork+0x34/0x50
> > > > [13550.170424]  ? __pfx_kthread+0x10/0x10
> > > > [13550.171020]  ret_from_fork_asm+0x1a/0x30
> > > > [13550.171657]  </TASK>
> > > > 
> > > > Unfortunately, the following rcu stalls appear to have resulted in
> > > > abnormal termination of the VM, which led to the ftrace buffer not
> > > > being dumped into the console. Currently re-running the same test with
> > > > the addition of "ftrace_dump_on_oops panic_on_warn=1" and hoping for
> > > > the best.
> > > 
> > > Another approach would be rcupdate.rcu_cpu_stall_suppress=1.
> > > 
> > > We probably need to disable RCU CPU stall warnings automatically while
> > > dumping ftrace buffers, but the asynchronous nature of printk() makes
> > > it difficult to work out when to automatically re-enable them...
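
For anyone wanting to try that, the suppression can be handed to the
rcutorture guests on the kernel command line via kvm.sh's --bootargs
option.  Something like the following untested sketch should do it
(scenario, duration, and the extra dump/panic options are of course
adjustable):

	tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 8h \
		--configs "7*TREE03" \
		--bootargs "rcupdate.rcu_cpu_stall_suppress=1 ftrace_dump_on_oops panic_on_warn=1" \
		--trust-make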
> > 
> > And in the meantime, for whatever it is worth...
> > 
> > The pattern of failures motivated me to add to rcutorture a real-time
> > task that randomly preempts a randomly chosen online CPU.  Here are
> > the (new and not-to-be-trusted) commits on -rcu's "dev" branch:
> > 
> > d1b99fa42af7 ("torture: Add dowarn argument to torture_sched_setaffinity()")
> > aed555adc22a ("rcutorture: Add random real-time preemption")
> > b09bcf8e1406 ("rcutorture: Make the TREE03 scenario do preemption")
> > 
> > Given these, the following sort of command, when run on dual-socket
> > systems, reproduces a silent failure within a few minutes:
> > 
> > tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 4h --configs "4*TREE03" --kconfig "CONFIG_NR_CPUS=4" --trust-make
> > 
> > But on my laptop, a 30-minute run resulted in zero failures.  I am now
> > retrying with a four-hour laptop run.
> 
> And this silent failure was me hurting myself with a change to scripting
> to better handle test hosts disappearing (it does sometimes happen).
> With the scripting fixed, I am getting simple too-short grace periods,
> though only a few per 8-hour 400*TREE03 4-CPU guest-OS run.
> 
> > I am also adjusting the preemption duration and frequency to see if a
> > more edifying failure mode might make itself apparent.  :-/
> 
> But no big wins thus far, so this will be a slow process.  My current test
> disables CPU hotplug.  I will be disabling other things in the hope of
> better identifying the code paths that should be placed under suspicion.

Disabling CPU hotplug seems to make the problem go away (though
all I can really say is that I am 99% confident that it reduces the
problem's incidence by at least a factor of two).  This problem affects
non-preemptible kernels and non-preemptible RCU, though it is possible
that use of the latter reduces the failure rate (which is of course *not*
what you want for testing).
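
For the record, the CPU-hotplug testing visible in the boot line above is
driven by rcutorture.onoff_interval=200; overriding that to zero is one
way to take hotplug out of the picture.  A rough (untested) example,
reusing the command from earlier in this thread:

	tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 4h \
		--configs "4*TREE03" --kconfig "CONFIG_NR_CPUS=4" \
		--bootargs "rcutorture.onoff_interval=0" --trust-make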

A number of experiments failed to significantly/usefully increase the
failure rate.

The problem does not seem to happen on straight normal call_rcu()
grace periods (rcutorture.gp_normal=1), synchronize_rcu() grace periods
(rcutorture.gp_sync=1), or synchronize_rcu_expedited() grace periods.
Of course, my reproduction rate is still low enough that I might be
fooled here.
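
For those following along, each of those grace-period flavors can be
forced individually from the kernel command line, for example by adding
something like the following to the kvm.sh --bootargs string (the
expedited case presumably uses the corresponding rcutorture.gp_exp=1,
and, as I understand it, leaving all of the gp_* parameters at their
defaults lets rcutorture mix the available flavors):

	--bootargs "rcutorture.gp_normal=1 rcutorture.gp_sync=1"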

However, the problem does occur reasonably often on polled grace periods
(rcutorture.gp_poll=1 rcutorture.gp_poll_exp=1 rcutorture.gp_poll_full=1
rcutorture.gp_poll_exp_full=1).  This might be a bug in the polling
code itself, or it might be because polling grace periods do not incur
callback and/or wakeup delays (that is, the bug might still be in the
underlying grace-period computation, with polling merely making it more
apparent).  Polling also appears to make the problem happen more
frequently, but there is not yet enough data to be sure.
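
To make the polled case concrete, here is a rough, uncompiled sketch of
the pattern that the gp_poll family of options exercises, using the
kernel's polled grace-period API (start_poll_synchronize_rcu() and
friends; the _full and expedited variants follow the same shape with
their own cookie handling).  A too-short grace period would show up as
poll_state_synchronize_rcu() reporting completion while a reader that
started before the cookie was obtained is still running:

	unsigned long cookie;

	/* Snapshot the grace-period state and start a new grace period. */
	cookie = start_poll_synchronize_rcu();

	/* ... do other work while the grace period proceeds ... */

	/* If the grace period has not yet completed, wait for it. */
	if (!poll_state_synchronize_rcu(cookie))
		synchronize_rcu();

	/* All readers that began before the snapshot must now have finished. */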

Currently, rcutorture does millisecond-scale waits of duration randomly
chosen between zero and 15 milliseconds.  I have started a run with
microsecond-scale waits of duration randomly chosen between zero and
128 microseconds.  Here is hoping that this will cause the problem to
reproduce more quickly, and I will know more this evening, Pacific Time.

							Thanx, Paul
