linux-kernel - Very high scheduling delay with plenty of idle CPUs

Message-ID: <CAGETcx830PZyr_oZAkghR=CLsThLUX1hZRxrNK_FNSLuF2TBAw@mail.gmail.com>
Date: Thu, 7 Nov 2024 23:28:07 -0800
From: Saravana Kannan <saravanak@...gle.com>
To: Ingo Molnar <mingo@...hat.com>, "Peter Zijlstra (Intel)" <peterz@...radead.org>, 
	Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>, 
	Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, 
	Benjamin Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, 
	Valentin Schneider <vschneid@...hat.com>, LKML <linux-kernel@...r.kernel.org>, 
	wuyun.abel@...edance.com, youssefesmat@...omium.org, 
	Thomas Gleixner <tglx@...utronix.de>, efault@....de, 
	K Prateek Nayak <kprateek.nayak@....com>, John Stultz <jstultz@...gle.com>, 
	Vincent Palomares <paillon@...gle.com>
Subject: Very high scheduling delay with plenty of idle CPUs

Hi scheduler folks,

I'm running into some weird scheduling issues when testing non-sched
changes on a Pixel 6 running a kernel close to 6.12-rc5. I'm not sure
whether earlier kernel versions have the same issue.

The async suspend/resume code calls async_schedule_dev_nocall() to
queue up a bunch of work. The queued work items appear to run in
kworker threads.

However, there have been many times where I see scheduling latency
(runnable, but not running) of 4.5 ms or higher for a kworker thread
when there are plenty of idle CPUs.
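
For reference, "scheduling latency" here means the gap between a thread
becoming runnable (its wakeup) and the moment it is actually switched
in on a CPU. A minimal Python sketch of that calculation follows. The
Event type, the simplified "wakeup"/"switch_in" kinds, and the
runnable_delays() helper are all hypothetical names for illustration;
they are not the ftrace or perfetto format, and the sketch ignores
threads that become runnable again through preemption.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Event(NamedTuple):
    """Hypothetical, simplified scheduler trace event."""
    ts_us: int   # timestamp in microseconds
    kind: str    # "wakeup" (became runnable) or "switch_in" (started running)
    tid: int     # thread the event refers to


def runnable_delays(events: Iterable[Event],
                    threshold_us: int = 4500) -> List[Tuple[int, int]]:
    """Return (tid, delay_us) for every wakeup-to-run gap >= threshold_us."""
    woke_at = {}   # tid -> timestamp of the earliest pending wakeup
    delays = []
    for ev in sorted(events, key=lambda e: e.ts_us):
        if ev.kind == "wakeup":
            # Keep the first wakeup if several arrive before the thread runs.
            woke_at.setdefault(ev.tid, ev.ts_us)
        elif ev.kind == "switch_in" and ev.tid in woke_at:
            delay = ev.ts_us - woke_at.pop(ev.tid)
            if delay >= threshold_us:
                delays.append((ev.tid, delay))
    return delays


# Made-up sample data: thread 101 waits 4.6 ms, thread 102 waits 0.2 ms.
sample = [
    Event(0, "wakeup", 101),
    Event(100, "wakeup", 102),
    Event(300, "switch_in", 102),
    Event(4600, "switch_in", 101),
]
slow = runnable_delays(sample)
```

With this made-up sample, only thread 101 crosses the 4.5 ms threshold.
Whether idle CPUs were available during that window has to be checked
separately, for example by looking at the per-CPU tracks in the trace.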

Does async_schedule_dev_nocall() have some limitation on where the
queued work can run? I know it has some NUMA-related handling, but the
Pixel 6 isn't a NUMA system. This delay unnecessarily increases
suspend/resume latency because it adds up across kworker threads, so
I'd appreciate any insight into what might be happening.

If you know how to use perfetto (it's really pretty simple, all you
need to know is WASD and clicking), here's an example:
https://ui.perfetto.dev/#!/?s=e20045736e7dfa1e897db6489710061d2495be92

Thanks,
Saravana