Message-ID: <87h5s4mjqw.ffs@tglx>
Date: Thu, 29 Jan 2026 18:06:15 +0100
From: Thomas Gleixner <tglx@...nel.org>
To: Ihor Solodrai <ihor.solodrai@...ux.dev>, Shrikanth Hegde
<sshegde@...ux.ibm.com>, Peter Zijlstra <peterz@...radead.org>, LKML
<linux-kernel@...r.kernel.org>
Cc: Gabriele Monaco <gmonaco@...hat.com>, Mathieu Desnoyers
<mathieu.desnoyers@...icios.com>, Michael Jeanson <mjeanson@...icios.com>,
Jens Axboe <axboe@...nel.dk>, "Paul E. McKenney" <paulmck@...nel.org>,
"Gautham R. Shenoy" <gautham.shenoy@....com>, Florian Weimer
<fweimer@...hat.com>, Tim Chen <tim.c.chen@...el.com>, Yury Norov
<yury.norov@...il.com>, bpf <bpf@...r.kernel.org>,
sched-ext@...ts.linux.dev, Kernel Team <kernel-team@...a.com>, Alexei
Starovoitov <ast@...nel.org>, Andrii Nakryiko <andrii@...nel.org>, Daniel
Borkmann <daniel@...earbox.net>, Puranjay Mohan <puranjay@...nel.org>,
Tejun Heo <tj@...nel.org>
Subject: Re: [patch V5 00/20] sched: Rewrite MM CID management
On Wed, Jan 28 2026 at 15:08, Ihor Solodrai wrote:
> On 1/28/26 2:33 PM, Ihor Solodrai wrote:
>> [...]
>>
>> We have a steady stream of jobs running, so if it's not a one-off it's
>> likely to happen again. I'll share if we get anything.
>
> Here is another one, with backtraces of other CPUs:
>
> [ 59.133925] CPU: 2 UID: 0 PID: 127 Comm: test_progs Tainted: G OE 6.19.0-rc5-gbe9790cb9e63-dirty #1 PREEMPT(full)
> [ 59.133935] RIP: 0010:queued_spin_lock_slowpath+0x3a9/0xac0
> [ 59.133985] do_raw_spin_lock+0x1d9/0x270
> [ 59.134001] task_rq_lock+0xcf/0x3c0
> [ 59.134007] mm_cid_fixup_task_to_cpu+0xb0/0x460
> [ 59.134025] sched_mm_cid_fork+0x6da/0xc20
Compared to Shrikanth's splat this is the reverse situation, i.e. fork()
reached the point where it needs to switch to per CPU mode and the fixup
function is stuck on a runqueue lock.
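The shape of that walk, as far as the backtrace suggests, is roughly the
following. This is just a sketch to illustrate the locking pattern; the
loop body and the helper's argument are assumptions, only the function
names come from the backtrace:

    for_each_thread(group_leader, t) {
        struct rq_flags rf;
        /* Spins when another CPU holds t's runqueue lock */
        struct rq *rq = task_rq_lock(t, &rf);

        /* Move the task's CID over to the CPU it runs on */
        mm_cid_fixup_task_to_cpu(t);
        task_rq_unlock(rq, t, &rf);
    }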
> [ 59.134176] CPU: 3 UID: 0 PID: 67 Comm: kworker/3:1 Tainted: G OE 6.19.0-rc5-gbe9790cb9e63-dirty #1 PREEMPT(full)
> [ 59.134186] Workqueue: events drain_vmap_area_work
> [ 59.134194] RIP: 0010:smp_call_function_many_cond+0x772/0xe60
> [ 59.134250] on_each_cpu_cond_mask+0x24/0x40
> [ 59.134254] flush_tlb_kernel_range+0x402/0x6b0
CPU3 is unrelated as it does not hold a runqueue lock.
> [ 59.134374] NMI backtrace for cpu 1
> [ 59.134388] RIP: 0010:_find_first_zero_bit+0x50/0x90
> [ 59.134423] __schedule+0x3312/0x4390
> [ 59.134430] ? __pfx___schedule+0x10/0x10
> [ 59.134434] ? trace_rcu_watching+0x105/0x150
> [ 59.134440] schedule_idle+0x59/0x90
CPU1 holds runqueue lock and find_first_zero_bit() suggests that this
comes from mm_get_cid(), but w/o decoding the return address it's hard
to tell for sure.
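For illustration, a bitmap based allocator loop of that sort would look
roughly like this. A sketch, not the code from the series; the cpumask
helpers are the real kernel ones, the body is an assumption:

    static int mm_get_cid(struct mm_struct *mm)
    {
        int cid;

        for (;;) {
            cid = cpumask_first_zero(mm_cidmask(mm));
            if (cid < nr_cpu_ids &&
                !cpumask_test_and_set_cpu(cid, mm_cidmask(mm)))
                return cid;
            /* Pool exhausted: retry with the runqueue lock held */
            cpu_relax();
        }
    }

cpumask_first_zero() ends up in _find_first_zero_bit(), which would
match the RIP above.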
> [ 59.134474] NMI backtrace for cpu 0 skipped: idling at default_idle+0xf/0x20
CPU0 is idle and not involved at all.
So the situation is:
test_progs creates the 4th child, which exceeds the number of CPUs, so
it switches to per CPU mode.

At this point each task of test_progs has a CID associated. Let's
assume thread creation order assignment for simplicity.
    T0 (main thread)    CID0    runs fork()
    T1 (1st child)      CID1
    T2 (2nd child)      CID2
    T3 (3rd child)      CID3
    T4 (4th child)      ---     is about to be forked and causes
                                the mode switch

T0 sets mm_cid::percpu = true, transfers the CID from T0 to CPU2 and
starts the fixup which walks through the threads.
During that, T1 - T3 are free to schedule in and out before the fixup
has caught up with them. Now I played through all possible permutations
with a Python script and came up with the following snafu:
T1 schedules in on CPU3 and observes percpu == true, so it transfers
its CID to CPU3.

T1 is then migrated to CPU1 and on schedule-in observes percpu == true,
but CPU1 does not have a CID associated and T1 transferred its own to
CPU3. So it has to allocate one with the CPU1 runqueue lock held, but
the pool is empty, so it keeps looping.
Now T0 reaches T1 in the thread walk and tries to lock the corresponding
runqueue lock, which is held. ---> Livelock
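In code terms T1 is then stuck in something like this (shape assumed)
while T0 spins on the very lock which the looping code holds:

    /* sched_in() on CPU1, runqueue lock held */
    if (READ_ONCE(mm->mm_cid.percpu)) {
        /*
         * T1 handed its CID to CPU3 and CPU1 has none, so a new
         * one has to come from the pool, which is empty:
         * mm_get_cid() loops forever under the runqueue lock,
         * which T0's fixup walk is waiting for.
         */
        cid = mm_get_cid(mm);
    }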
So this side needs the same MM_CID_TRANSIT treatment as the other side,
which brings me back to the splat Shrikanth observed.
I used the same script to run through all possible permutations on
that side too, but nothing showed up there, and yesterday's finding is
harmless because it only creates slightly inconsistent state, as the
task is already marked CID inactive. But the CID has the MM_CID_TRANSIT
bit set, so the CID is dropped back into the pool when the exiting task
schedules out via preemption or the final schedule().
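I.e. the schedule-out side does something along these lines. Again a
sketch; aside from MM_CID_TRANSIT the names are assumptions:

    /*
     * A CID which carries the transit marker is unconditionally
     * dropped back into the pool on schedule-out
     */
    if (cid & MM_CID_TRANSIT) {
        cpumask_clear_cpu(cid & ~MM_CID_TRANSIT, mm_cidmask(mm));
        t->mm_cid.cid = -1;    /* assumed "no CID" encoding */
    }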
So I scratched my head some more and stared at the code with two things
in mind:
1) It seems to be hard to reproduce
2) It happened on a weakly ordered architecture
and indeed there is an opportunity to get this wrong:
The mode switch does:

    WRITE_ONCE(mm->mm_cid.transit, MM_CID_TRANSIT);
    WRITE_ONCE(mm->mm_cid.percpu, ....);

sched_in() does:

    if (!READ_ONCE(mm->mm_cid.percpu))
        ...
    cid |= READ_ONCE(mm->mm_cid.transit);
Nothing orders these stores and loads against each other, so on a
weakly ordered architecture a task can observe percpu == false and
transit == 0 even if the fixup function has not yet completed. As a
consequence the task will not drop the CID when scheduling out before
the fixup is completed, which means the CID space can be exhausted. The
next task scheduling in will then loop in mm_get_cid() and the fixup
thread can livelock on the held runqueue lock as above.
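The obvious cure is a release/acquire pairing, which guarantees that a
task observing the updated percpu value also observes the transit
marker. Whether the series ends up doing exactly this remains to be
seen; the sketch below just illustrates the required ordering:

    /* Mode switch */
    WRITE_ONCE(mm->mm_cid.transit, MM_CID_TRANSIT);
    smp_store_release(&mm->mm_cid.percpu, ....);

    /* sched_in() */
    if (!smp_load_acquire(&mm->mm_cid.percpu))
        ...
    cid |= READ_ONCE(mm->mm_cid.transit);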
I'll send out a series to address all of that later this evening when
tests have completed and changelogs are polished.
Thanks,
tglx