Message-ID: <20250422101930.GD14170@noisy.programming.kicks-ass.net>
Date: Tue, 22 Apr 2025 12:19:30 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Frederic Weisbecker <frederic@...nel.org>
Cc: John Stultz <jstultz@...gle.com>, LKML <linux-kernel@...r.kernel.org>,
Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>,
K Prateek Nayak <kprateek.nayak@....com>, kernel-team@...roid.com,
Frederic Weisbecker <fweisbec@...il.com>
Subject: Re: [RFC][PATCH] sched/core: Tweak wait_task_inactive() to force
dequeue sched_delayed tasks

On Tue, Apr 22, 2025 at 11:55:31AM +0200, Frederic Weisbecker wrote:
> On Tue, Apr 22, 2025 at 10:56:28AM +0200, Peter Zijlstra wrote:
> > On Mon, Apr 21, 2025 at 09:43:45PM -0700, John Stultz wrote:
> > > It was reported that in 6.12, smpboot_create_threads() was
> > > taking much longer than in 6.6.
> > >
> > > I narrowed down the call path to:
> > > smpboot_create_threads()
> > > -> kthread_create_on_cpu()
> > > -> kthread_bind()
> > > -> __kthread_bind_mask()
> > > -> wait_task_inactive()
> > >
> > > Where in wait_task_inactive() we were regularly hitting the
> > > queued case, which sets a 1-tick timeout; when that is hit
> > > multiple times in a row, the waits quickly accumulate into a
> > > long delay.
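
For reference, the queued case being hit is (roughly, paraphrased rather
than quoted verbatim) this fallback in wait_task_inactive():

	if (unlikely(queued)) {
		/*
		 * The task is still on the runqueue: sleep for a full
		 * tick and retry.  Each pass through here costs ~1/HZ,
		 * so hitting it repeatedly for one kthread adds up fast.
		 */
		ktime_t to = NSEC_PER_SEC / HZ;

		set_current_state(TASK_UNINTERRUPTIBLE);
		schedule_hrtimeout(&to, HRTIMER_MODE_REL);
		continue;
	}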
> >
> > Argh, this is all stupid :-(
> >
> > The whole __kthread_bind_*() thing is a bit weird, but fundamentally it
> > tries to avoid a race vs current. Notably task_struct::flags is only ever
> > modified by current, except here.
> >
> > delayed_dequeue is fine, except wait_task_inactive() hasn't been
> > told about it (I hate that function, murder death kill etc.).
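
IOW, with DELAY_DEQUEUE a sleeping task can remain on the runqueue with
p->se.sched_delayed set, so task_on_rq_queued() stays true and
wait_task_inactive() keeps taking the 1-tick retry above. The obvious shape
of a fix (a sketch only, with the task's rq lock held as that function
already does; not necessarily John's exact patch) is to finish the dequeue
once we notice it:

	if (task_on_rq_queued(p) && p->se.sched_delayed)
		/* sketch: complete the delayed dequeue so the wait can make progress */
		dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_DELAYED);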
> >
> > But more fundamentally, we've put so much crap into struct kthread and
> > kthread() itself by now, why not also pass down the whole per-cpu-ness
> > thing and simply do it there. Heck, Frederic already made it do affinity
> > crud.
> >
> > On that, Frederic, *why* do you do that after started=1? That seems like
> > a weird place; should this not be done before complete(), like next to
> > sched_setscheduler_nocheck() or so?
>
> You mean the call to kthread_affine_node()? Because it is a default behaviour
> that only happens if no call to kthread_bind() or kthread_affine_preferred()
> has been issued before the first wake-up of the kthread.
>
> If kthread_affine_node() were instead called up front by default, we
> would get its unconditional overhead for all started kthreads. Plus,
> kthread_bind() and kthread_affine_preferred() would then need to undo
> kthread_affine_node().

Urgh, I see. Perhaps we should put a comment on it, because I'm sure I'll
have this same question again next time (probably in another few years)
when I look at this code.
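
Something along these lines would already help (untested sketch; the guard
and the field names are just from memory of kthread() and may differ):

	/*
	 * Default NUMA affinity is applied only here, after the first
	 * wakeup, and only when nothing more specific was requested:
	 * doing it unconditionally at creation time would add overhead
	 * for every kthread and would then need to be undone again by
	 * kthread_bind*() / kthread_affine_preferred().
	 *
	 * (Guard below is illustrative; see kthread() for the actual check.)
	 */
	if (!(current->flags & PF_NO_SETAFFINITY) && !self->preferred_affinity)
		kthread_affine_node();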