Message-ID: <khfhrdrxesp645dy7hphefvovhfarjke2qn5nvldyjavhg2j7p@vbv3jsskhc2j>
Date: Wed, 16 Jul 2025 19:19:53 +0100
From: Mel Gorman <mgorman@...hsingularity.net>
To: Chris Mason <clm@...a.com>
Cc: Peter Zijlstra <peterz@...radead.org>, mingo@...hat.com, 
	juri.lelli@...hat.com, vincent.guittot@...aro.org, dietmar.eggemann@....com, 
	rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de, vschneid@...hat.com, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 02/12] sched/deadline: Less agressive dl_server
 handling

On Tue, Jul 15, 2025 at 10:55:03AM -0400, Chris Mason wrote:
> On 7/14/25 6:56 PM, Mel Gorman wrote:
> > On Wed, Jul 02, 2025 at 01:49:26PM +0200, Peter Zijlstra wrote:
> >> Chris reported that commit 5f6bd380c7bd ("sched/rt: Remove default
> >> bandwidth control") caused a significant dip in his favourite
> >> benchmark of the day. Simply disabling dl_server cured things.
> >>
> > 
> > Unrelated to the patch but I've been doing a bit of archaeology recently
> > finding the motivation for various decisions and paragraphs like this
> > have been painful (most recent was figuring out why a decision was made
> > for 2.6.32). If the load was described, can you add a Link: tag?  If the
> > workload is proprietary, cannot be described or would be impractical to
> > independently recreate, then can that be stated here instead?
> > 
> 
> Hi Mel,
> 
> "benchmark of the day" is pretty accurate, since I usually just bash on
> schbench until I see roughly the same problem that I'm debugging from
> production.  This time, it was actually a networking benchmark (uperf),
> but setup for that is more involved.
> 
> This other thread describes the load, with links to schbench and command
> line:
> 
> https://lore.kernel.org/lkml/20250626144017.1510594-2-clm@fb.com/
> 
> The short version:
> 
> https://github.com/masoncl/schbench.git
> schbench -L -m 4 -M auto -t 256 -n 0 -r 0 -s 0
> 
> - 4 CPUs waking up all the other CPUs constantly
>   - (pretending to be network irqs)

Ok, so the 4 CPUs are a simulation of network traffic arriving that can be
delivered to any CPU. Sounds similar to MSI-X, where interrupts can arrive
on any CPU, and I'm guessing you're not doing any packet steering in the
"real" workload. I'm also guessing there is nothing special about "4"
other than it being enough threads to keep the SUT active even if the
worker tasks did no work.
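
For reference, something like the following would confirm whether the
interrupts really can land on any CPU and that no steering is active. It's
only a generic sketch, not taken from your setup, and eth0 is a placeholder
device name:

  # Allowed CPU set for each of the NIC's vectors (eth0 is a placeholder)
  for irq in $(awk -F: '/eth0/ {print $1}' /proc/interrupts); do
          cat /proc/irq/$irq/smp_affinity_list
  done

  # RPS packet steering is disabled when these masks are all zeroes
  cat /sys/class/net/eth0/queues/rx-*/rps_cpus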

> - 1024 total worker threads spread over the other CPUs

Ok.

> - all the workers immediately going idle after waking

So 0 think time to stress a corner case.

> - single socket machine with ~250 cores and HT.
> 

To be 100% sure, 250 cores + HT is 500 logical CPUs, correct? Using 1024
workers would appear to be an attempt to simulate strict deadlines for
minimal processing of data received from the network while the processors
are saturated. IIUC, the workload would stress wakeup preemption, load
balancing and idle-CPU selection decisions while ensuring EEVDF rules are
adhered to.
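
For my own notes, the arithmetic as I understand it; the wrapper below is
only illustrative, the schbench flags are yours:

  # 4 message threads x 256 workers = 1024 workers against ~500 logical
  # CPUs, so roughly 2:1 oversubscription with zero think time.
  nr_cpus=$(nproc)
  workers=$((4 * 256))
  echo "workers=$workers logical_cpus=$nr_cpus"
  ./schbench -L -m 4 -M auto -t 256 -n 0 -r 0 -s 0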

> The basic recipe for the regression is as many CPUs as possible going in
> and out of idle.
> 
> (I know you're really asking for these details in the commit or in the
> comments, but hopefully this is useful for Link:'ing)
> 

Yes, it is. Even just adding this will capture the specific benchmark
for future reference -- at least as long as lore lives.

Link: https://lore.kernel.org/r/3c67ae44-5244-4341-9edd-04a93b1cb290@meta.com

Do you mind adding this, or ensuring it makes it into the final changelog?
It's not a big deal, just a preference. Historically there was no push
for something like this, but recent history was dominated by CFS.
There were a lot of subtle heuristics there that are hard to replicate in
EEVDF without violating the intent of EEVDF.

I had seen that schbench invocation and I was 99% certain it was the
"favourite benchmark of the day". The pattern seems reasonable as a
microbenchmark favouring latency over throughput for fast dispatching of
work from network ingress to backend processing. That's enough to name an
mmtests configuration based on the existing schbench implementation. Maybe
something like schbench-fakenet-fastdispatch. This sort of pattern is not
even that unusual, as IO-intensive workloads may exhibit a similar one,
particularly if XFS is the filesystem. That is a reasonable scenario
whether DL is involved or not.
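
Roughly what I have in mind is little more than a wrapper around your
invocation. A hypothetical sketch, where the SCHBENCH_* variable names are
placeholders rather than what mmtests actually exports today:

  # configs/config-schbench-fakenet-fastdispatch (hypothetical sketch)
  export MMTESTS="schbench"
  export SCHBENCH_MESSAGE_THREADS=4        # fake "network irq" CPUs (-m 4)
  export SCHBENCH_WORKERS_PER_THREAD=256   # 1024 workers total (-t 256)
  export SCHBENCH_ZERO_THINK_TIME=yes      # -n 0 -r 0 -s 0, latency over throughput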

Thanks Chris.

-- 
Mel Gorman
SUSE Labs
