Message-ID: <m3og4sktkzf6j62terh4xcbfiw45ziymhmt7x7iuyzcogl67cy@ufqvgzttd2n7>
Date: Fri, 21 Feb 2025 16:36:02 +0100
From: Michal Koutný <mkoutny@...e.com>
To: Johannes Weiner <hannes@...xchg.org>
Cc: Tejun Heo <tj@...nel.org>, Abel Wu <wuyun.abel@...edance.com>, 
	Jonathan Corbet <corbet@....net>, Ingo Molnar <mingo@...hat.com>, 
	Peter Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>, 
	Vincent Guittot <vincent.guittot@...aro.org>, Dietmar Eggemann <dietmar.eggemann@....com>, 
	Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, 
	Valentin Schneider <vschneid@...hat.com>, Thomas Gleixner <tglx@...utronix.de>, 
	Yury Norov <yury.norov@...il.com>, Andrew Morton <akpm@...ux-foundation.org>, 
	Bitao Hu <yaoma@...ux.alibaba.com>, Chen Ridong <chenridong@...wei.com>, 
	"open list:CONTROL GROUP (CGROUP)" <cgroups@...r.kernel.org>, "open list:DOCUMENTATION" <linux-doc@...r.kernel.org>, 
	open list <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2 3/3] cgroup/rstat: Add run_delay accounting for cgroups

On Mon, Feb 10, 2025 at 01:25:45PM -0500, Johannes Weiner <hannes@...xchg.org> wrote:
> Yes, a more detailed description of the usecase would be helpful.
> 
> I'm not exactly sure how the sum of wait times in a cgroup would be
> used to gauge load without taking available concurrency into account.
> One second of aggregate wait time means something very different if
> you have 200 cpus compared to if you have 2.
> 
> This is precisely what psi tries to capture. "Some" does provide group
> loading information in a sense, but it's a ratio over available
> concurrency,

This comes as a surprise to me (I originally assumed it's only
time(some)/time(interval)).
But after actually looking at the avg* values, I can confirm that it is
normalized by nr_tasks.
If the value is already normalized by nr_tasks, I see less benefit in
Σ run_delay.

> and currently capped at 100%. I.e.  if you have N cpus, 100% some is
> "at least N threads waiting at all times." There is a gradient below
> that, but not above.

Is this a typo? (s/some/full/ or s/at least N/at least 1/)

(Actually, if I correct my thinking with the nr_tasks normalization,
then your statement makes sense. OTOH, what is the difference between
'full' and 'some' at 100%?)
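
To make the normalization concrete, here is a minimal sketch of the
instantaneous "some" model as I understand it from the header comment of
kernel/sched/psi.c: threads = min(nr_nonidle_tasks, nr_cpus) and
SOME = min(nr_delayed_tasks / threads, 1). The function name
psi_some_model and its argument order are hypothetical, and the real
implementation works on per-CPU time slices rather than task counts, so
treat this as an illustration only.

```shell
#!/bin/bash
# Hypothetical helper (not a kernel interface).
# Usage: psi_some_model NR_DELAYED NR_NONIDLE NR_CPUS
# Prints the modelled "some" pressure as a percentage, capped at 100.
psi_some_model() {
	awk -v d="$1" -v n="$2" -v c="$3" 'BEGIN {
		# available concurrency: cannot exceed tasks or CPUs
		threads = (n < c) ? n : c
		if (threads == 0) { print "0.0"; exit }
		some = d / threads
		if (some > 1) some = 1	# no gradient above 100%
		printf "%.1f\n", 100 * some
	}'
}

psi_some_model 2 200 200	# 2 waiters on a busy 200-CPU machine
psi_some_model 2 4 2		# 2 waiters on a busy 2-CPU machine
psi_some_model 5 7 2		# more waiters than CPUs: still capped
```

Under these assumptions the same two waiting tasks give 1.0 with 200 CPUs
and 100.0 with 2 CPUs, which is the 200-vs-2 point from the quoted text.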

Also I played a bit.

cat >/root/cpu_n.sh <<'EOD'
#!/bin/bash

worker() {
	echo "$BASHPID: starting on $1"
	taskset -c -p "$1" $BASHPID
	while true ; do
		true
	done
}

for i in $(seq ${1:-1}) ; do
	worker $i &
	pids+=($!)
done

echo pids: ${pids[*]}
wait
EOD
chmod +x /root/cpu_n.sh

systemd-run -u test.service /root/cpu_n.sh 2
# test.service/cpu.pressure:some is ~0

systemd-run -u pressure.service /root/cpu_n.sh 1
# test.service/cpu.pressure:some settles at ~25%, cpu1 is free, cpu2 half
# test.service/cpu.pressure:full settles at ~25% too(?!), I'd expect 0
                                            ^^^^^^^^^^^^

(kernel v6.13)

# pressure.service/cpu.pressure:some settles at ~50%, makes sense
# pressure.service/cpu.pressure:full settles at ~50%, makes sense
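
As a back-of-the-envelope check of the numbers above, here is a sketch
that assumes the per-CPU aggregation described in the header comment of
kernel/sched/psi.c, i.e. sum(tSOME[i] * tNONIDLE[i]) / sum(tNONIDLE[i]).
The helper name psi_weighted_pct and the "stall:nonidle" input format are
made up for this illustration, and the per-CPU fractions are my guess at
what the experiment produces (one worker of test.service sharing its CPU
half of the time, the other running alone).

```shell
#!/bin/bash
# Hypothetical helper (not a kernel interface).
# Each argument is "tSTALL:tNONIDLE" for one CPU, both given as
# fractions of the sampling period. Prints the non-idle-weighted
# stall share as a percentage.
psi_weighted_pct() {
	printf '%s\n' "$@" | awk -F: '
		{ num += $1 * $2; den += $2 }
		END {
			if (den > 0) printf "%.1f\n", 100 * num / den
			else print "0.0"
		}'
}

# test.service: one CPU stalled 50% of the time, one never stalled
psi_weighted_pct 0.5:1 0:1
# pressure.service: its only CPU is stalled 50% of the time
psi_weighted_pct 0.5:1
```

With these assumed inputs the sketch prints 25.0 and 50.0, in line with
the observed ~25% and ~50%. It does not by itself explain why "full"
matches "some" for test.service.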

Thanks,
Michal