Message-ID: <CAP4=nvTqnABSzYXiDfizoaeviqLtC87jG1fnGH4XFV+xQGn-2Q@mail.gmail.com>
Date: Mon, 16 Dec 2024 15:38:20 +0100
From: Tomas Glozar <tglozar@...hat.com>
To: paulmck@...nel.org
Cc: Valentin Schneider <vschneid@...hat.com>, Chen Yu <yu.c.chen@...el.com>, 
	Peter Zijlstra <peterz@...radead.org>, linux-kernel@...r.kernel.org, sfr@...b.auug.org.au, 
	linux-next@...r.kernel.org, kernel-team@...a.com
Subject: Re: [BUG almost bisected] Splat in dequeue_rt_stack() and build error

On Sun, Dec 15, 2024 at 7:41 PM Paul E. McKenney <paulmck@...nel.org> wrote:
>
> And the fix for the TREE03 too-short grace periods is finally in, at
> least in prototype form:
>
> https://lore.kernel.org/all/da5065c4-79ba-431f-9d7e-1ca314394443@paulmck-laptop/
>
> Or this commit on -rcu:
>
> 22bee20913a1 ("rcu: Fix get_state_synchronize_rcu_full() GP-start detection")
>
> This passes more than 30 hours of 400 concurrent instances of rcutorture's
> TREE03 scenario, with modifications that brought the bug reproduction
> rate up to 50 per hour.  I therefore have strong reason to believe that
> this fix is a real fix.
>
> With this fix in place, a 20-hour run of 400 concurrent instances
> of rcutorture's TREE03 scenario resulted in 50 instances of the
> enqueue_dl_entity() splat pair.  One (untrimmed) instance of this pair
> of splats is shown below.
>
> You guys did reproduce this some time back, so unless you tell me
> otherwise, I will assume that you have this in hand.  I would of course
> be quite happy to help, especially with adding carefully chosen debug
> (heisenbug and all that) or testing of alleged fixes.
>

The same splat was recently reported to LKML [1], and a patchset that
fixes a few bugs around double enqueue of the deadline server was sent
and merged into tip/sched/urgent [2]. I'm currently re-running TREE03
with those patches, hoping they also fix this issue.

Also, last week I came up with some more extensive tracing, which
showed dl_server_update and dl_server_start happening right after each
other, apparently during the same run of enqueue_task_fair (see
below). I'm currently looking into that to figure out whether the
mechanism shown by the trace is fixed by the patchset.
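
To make the suspected mechanism easier to follow, here is a minimal
userspace sketch. It is a toy model, not kernel code: the function
names merely mirror the trace below, and the on_rq flag is a
hypothetical stand-in for membership on the dl_rq rbtree. It only
illustrates how two enqueues of the same server entity within one
enqueue_task_fair() call would trip a double-enqueue check.

/*
 * Toy model only: names mirror the trace, the assert stands in for
 * the enqueue_dl_entity() splat.  Build with: cc -o toy toy.c
 */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

struct dl_se_model {
	bool on_rq;	/* models "this entity is already on the dl_rq" */
};

static void model_enqueue_dl_entity(struct dl_se_model *dl_se, const char *caller)
{
	printf("enqueue via %s\n", caller);
	assert(!dl_se->on_rq);	/* stands in for the WARN/splat */
	dl_se->on_rq = true;
}

static void model_dequeue_dl_entity(struct dl_se_model *dl_se)
{
	dl_se->on_rq = false;
}

int main(void)
{
	struct dl_se_model server = { .on_rq = false };

	/* dequeue_entities() -> dl_server_stop(): server leaves the dl_rq */
	model_dequeue_dl_entity(&server);

	/*
	 * Then, within a single enqueue_task_fair() call:
	 * update_curr() -> dl_server_update() -> update_curr_dl_se()
	 * re-enqueues the server entity ...
	 */
	model_enqueue_dl_entity(&server, "update_curr_dl_se");

	/*
	 * ... and dl_server_start() enqueues it again with no
	 * intervening dequeue, tripping the double-enqueue check.
	 */
	model_enqueue_dl_entity(&server, "dl_server_start");

	return 0;
}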

--------------------------

rcu_tort-148       1dN.3. 20531758076us : dl_server_stop <-dequeue_entities
rcu_tort-148       1dN.2. 20531758076us : dl_server_queue: cpu=1 level=2 enqueue=0
rcu_tort-148       1dN.3. 20531758078us : <stack trace>
 => trace_event_raw_event_dl_server_queue
 => dl_server_stop
 => dequeue_entities
 => dequeue_task_fair
 => __schedule
 => schedule
 => schedule_hrtimeout_range_clock
 => torture_hrtimeout_us
 => rcu_torture_writer
 => kthread
 => ret_from_fork
 => ret_from_fork_asm
rcu_tort-148       1dN.3. 20531758095us : dl_server_update <-update_curr
rcu_tort-148       1dN.3. 20531758097us : dl_server_update <-update_curr
rcu_tort-148       1dN.2. 20531758101us : dl_server_queue: cpu=1 level=2 enqueue=1
rcu_tort-148       1dN.3. 20531758103us : <stack trace>
rcu_tort-148       1dN.2. 20531758104us : dl_server_queue: cpu=1 level=1 enqueue=1
rcu_tort-148       1dN.3. 20531758106us : <stack trace>
rcu_tort-148       1dN.2. 20531758106us : dl_server_queue: cpu=1 level=0 enqueue=1
rcu_tort-148       1dN.3. 20531758108us : <stack trace>
 => trace_event_raw_event_dl_server_queue
 => rb_insert_color
 => enqueue_dl_entity
 => update_curr_dl_se
 => update_curr
 => enqueue_task_fair
 => enqueue_task
 => activate_task
 => attach_task
 => sched_balance_rq
 => sched_balance_newidle.constprop.0
 => pick_next_task_fair
 => __schedule
 => schedule
 => schedule_hrtimeout_range_clock
 => torture_hrtimeout_us
 => rcu_torture_writer
 => kthread
 => ret_from_fork
 => ret_from_fork_asm
rcu_tort-148       1dN.3. 20531758110us : dl_server_start <-enqueue_task_fair
rcu_tort-148       1dN.2. 20531758110us : dl_server_queue: cpu=1 level=2 enqueue=1
rcu_tort-148       1dN.3. 20531760934us : <stack trace>
 => trace_event_raw_event_dl_server_queue
 => enqueue_dl_entity
 => dl_server_start
 => enqueue_task_fair
 => enqueue_task
 => activate_task
 => attach_task
 => sched_balance_rq
 => sched_balance_newidle.constprop.0
 => pick_next_task_fair
 => __schedule
 => schedule
 => schedule_hrtimeout_range_clock
 => torture_hrtimeout_us
 => rcu_torture_writer
 => kthread
 => ret_from_fork
 => ret_from_fork_asm

[1] - https://lore.kernel.org/lkml/571b2045-320d-4ac2-95db-1423d0277613@ovn.org/
[2] - https://lore.kernel.org/lkml/20241213032244.877029-1-vineeth@bitbyteword.org/

> Just let me know!
>
>                                                         Thanx, Paul

Tomas

