Message-ID: <aMMzpnyx__ZgZGRc@slm.duckdns.org>
Date: Thu, 11 Sep 2025 10:40:06 -1000
From: Tejun Heo <tj@...nel.org>
To: Peter Zijlstra <peterz@...radead.org>
Cc: linux-kernel@...r.kernel.org, mingo@...hat.com, juri.lelli@...hat.com,
vincent.guittot@...aro.org, dietmar.eggemann@....com,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
vschneid@...hat.com, longman@...hat.com, hannes@...xchg.org,
mkoutny@...e.com, void@...ifault.com, arighi@...dia.com,
changwoo@...lia.com, cgroups@...r.kernel.org,
sched-ext@...ts.linux.dev, liuwenfang@...or.com, tglx@...utronix.de
Subject: Re: [PATCH 13/14] sched: Add {DE,EN}QUEUE_LOCKED
Hello,
On Thu, Sep 11, 2025 at 11:42:40AM +0200, Peter Zijlstra wrote:
...
> I didn't immediately see how to do that. Doesn't that
> list_for_each_entry_safe_reverse() rely on rq->lock to retain integrity?
Ah, sorry, I was thinking it was iterating the scx_tasks list. Yes, as
implemented, it needs to hold the rq lock throughout.
> Moreover, since the goal is to allow:
>
> __schedule()
> lock(rq->lock);
> next = pick_task() := pick_task_scx()
> lock(dsq->lock);
> p = some_dsq_task(dsq);
> task_unlink_from_dsq(p, dsq);
> set_task_cpu(p, cpu_of(rq));
> move_task_to_local_dsq(p, ...);
> return p;
>
> without dropping rq->lock, by relying on dsq->lock to serialize things,
> I don't see how we can retain the runnable list at all.
>
> And at this point, I'm not sure I understand ext well enough to know
> what this bypass stuff does at all, let alone suggest means to
> re-architect this.
Bypass mode is enabled when the kernel side can't trust the BPF scheduling
anymore and wants to fall back to dumb FIFO scheduling to guarantee forward
progress (e.g. so that we can switch back to fair).
It comes down to flipping scx_rq_bypassing() on, which makes the scheduling
paths bypass most of the BPF parts and fall back to FIFO behavior, and then
making sure every thread is on FIFO behavior. The latter part is what the
loop is doing. It scans all currently runnable tasks and cycles each one
through a dequeue and re-enqueue. As scx_rq_bypassing() is true at this
point, if a task was queued on the BPF side, the cycling takes it out of the
BPF side and puts it on the fallback FIFO queue.
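Roughly, the cycling part looks like the following (a simplified sketch
from memory; the helper names approximate the in-tree code, and the rq
lock is held across the whole walk):

    /*
     * Cycle every runnable task through a dequeue + re-enqueue. With
     * scx_rq_bypassing() already true, anything that was queued on the
     * BPF side comes back in on the fallback FIFO.
     */
    list_for_each_entry_safe_reverse(p, n, &rq->scx.runnable_list,
                                     scx.runnable_node) {
            struct sched_enq_and_set_ctx ctx;

            sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE, &ctx);
            sched_enq_and_set_task(&ctx);
    }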
If we want to get rid of the locking requirement:
- Walk the scx_tasks list, which is iterated with a cursor and allows
  dropping locks while iterating (see the sketch after this list). However,
  on some hardware there are cases where CPUs are slowed down extremely by
  the BPF scheduler making bad decisions and causing a lot of sync-cacheline
  ping-ponging across e.g. NUMA nodes. As scx_bypass() is what's supposed to
  extricate the system from this state, walking all tasks while taking each
  one's locks probably isn't going to be great.
- We can update ->runnable_list iteration to allow dropping the rq lock,
  e.g. with cursor-based iteration. Maybe some code can be shared with the
  scx_tasks iteration. Cycling through locks still isn't going to be great,
  but here there are likely a lot fewer of them at least.
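For reference, the scx_tasks walk in the first option would follow the
usual cursor pattern, something like this (sketch only; locking details
elided, and the iterator names are from memory):

    struct scx_task_iter sti;
    struct task_struct *p;

    scx_task_iter_start(&sti);
    while ((p = scx_task_iter_next_locked(&sti))) {
            /* p's rq lock is held here; deq/enq p as above */
    }
    scx_task_iter_stop(&sti);

The cursor is what lets the iteration survive locks being dropped between
tasks, but each task still costs a lock/unlock round trip, which is the
problem described above.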
Neither option is great. Leave it as-is for now?
Thanks.
--
tejun