linux-kernel - Re: Re: [PATCH v2 3/3] cgroup/rstat: Add run


Message-ID: <4c0f852e-bf79-4e59-be42-bdf11fb92f3b@bytedance.com>
Date: Wed, 12 Feb 2025 23:12:29 +0800
From: Abel Wu <wuyun.abel@...edance.com>
To: Michal Koutný <mkoutny@...e.com>
Cc: Tejun Heo <tj@...nel.org>, Johannes Weiner <hannes@...xchg.org>,
 Jonathan Corbet <corbet@....net>, Ingo Molnar <mingo@...hat.com>,
 Peter Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>,
 Vincent Guittot <vincent.guittot@...aro.org>,
 Dietmar Eggemann <dietmar.eggemann@....com>,
 Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
 Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
 Thomas Gleixner <tglx@...utronix.de>, Yury Norov <yury.norov@...il.com>,
 Andrew Morton <akpm@...ux-foundation.org>, Bitao Hu
 <yaoma@...ux.alibaba.com>, Chen Ridong <chenridong@...wei.com>,
 "open list:CONTROL GROUP (CGROUP)" <cgroups@...r.kernel.org>,
 "open list:DOCUMENTATION" <linux-doc@...r.kernel.org>,
 open list <linux-kernel@...r.kernel.org>
Subject: Re: Re: [PATCH v2 3/3] cgroup/rstat: Add run_delay accounting for
 cgroups

On 2/10/25 11:38 PM, Michal Koutný wrote:
> Hello Abel (sorry for my delay).
> 
> On Wed, Jan 29, 2025 at 12:48:09PM +0800, Abel Wu <wuyun.abel@...edance.com> wrote:
>> PSI tracks stall times for each cpu, and
>>
>> 	tSOME[cpu] = time(nr_delayed_tasks[cpu] != 0)
>>
>> which turns nr_delayed_tasks[cpu] into boolean value, hence loses
>> insight into how severely this task group is stalled on this cpu.
> 
> Thanks for the example. So the lost information is kind of a group load.

Exactly.

> What meaning does it have when there is no group throttling?

It indicates how severely this cgroup is interfered with by co-located
tasks. Both psi and run_delay are tracked in (part of) our fleet, and
spikes in either usually lead to poor SLI. But we do find cases in which
run_delay correlates better with SLI, due to the abovementioned method
of stall time accounting.

Both are used as signals to trigger throttling or eviction of the
co-located low-priority jobs.

In fact we also track per-cpu stats (cpu.stat.percpu) for cgroups,
including run_delay, which has helped us decide which job should be the
victim and has also provided useful info when diagnosing issues.

> 
> Honestly, I can't reason about either PSI.some or Σ run_delay wrt
> feedback for resource control. What is slightly bugging me is the
> introduction of another stats field before the first one was explored :-)
> 
> But if there's information loss with PSI -- could cpu.pressure:some be
> removed in favor of Σ run_delay? (The former could be calculated from
> latter if you're right :-p)

It is not my intent to replace cpu.pressure:some with run_delay. The
former provides a normalized value that can be compared across
different cgroups, while the latter cannot.

> 
> (I didn't like the before/after shuffling with enum cpu_usage_stat
> NR_STATS but I saw v4 where you tackled that.)
> 
> Michal
> 
> 
> More context from the previous message, the difference is between a) and c),
> or better equal lanes:
> 
> a')
>     t1 |----|
>     t2 |xx--|
>     t3 |----|
> 
> c)
>     t1 |----|
>     t2 |xx--|
>     t3 |xx--|
> 
>        <-Δt->

Yes, a) and c) have the same cpu.pressure:some but make different progress.
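
To make the difference concrete, here is a small sketch that replays the
two lane diagrams quoted above. It is only an illustration, not kernel
code: it assumes all three tasks sit on the same cpu, that 'x' marks a
slot in which the task is runnable but waiting, and that every slot has
the same length. The helper names some_time() and run_delay() are made
up for this example.

```python
# Lane diagrams from the quoted text, one string per task (t1, t2, t3).
# 'x' = runnable but delayed in that slot, '-' = not delayed.
A_PRIME = ["----", "xx--", "----"]
C = ["----", "xx--", "xx--"]


def some_time(lanes, slot=1):
    """Time during which at least one task is delayed (PSI 'some' style)."""
    # zip(*lanes) walks the diagram column by column, i.e. slot by slot.
    return sum(slot for column in zip(*lanes) if "x" in column)


def run_delay(lanes, slot=1):
    """Delay summed over all tasks, so overlapping stalls add up."""
    return sum(lane.count("x") for lane in lanes) * slot


for name, lanes in (("a')", A_PRIME), ("c)", C)):
    print(name, "some =", some_time(lanes), "run_delay =", run_delay(lanes))
```

Both diagrams give the same "some" time, because the boolean test only
asks whether any task is delayed, while the summed delay doubles once
the second task stalls as well.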

> 
> run_delay can be calculated independently of cpu.pressure:some
> because there is still a difference between a') and c) in terms of total
> cpu usage.
> 
> 	Δrun_delay = nr * Δt - Δusage
> 
> The challenge is with nr (assuming they're all runnable during Δt), that
> would need to be sampled from /sys/kernel/debug/sched/debug. But then
> you can get whatever load for individual cfs_rqs from there. Hm, does it
> even make sense to add up run_delays from different CPUs?

Very good question. In our case, this summed value is used as a general
indicator to trigger a strategy that in turn depends on the raw per-cpu
data provided by cpu.stat.percpu, which implies that what we actually
want is the per-cpu data.
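
For reference, the relation quoted above (Δrun_delay = nr * Δt - Δusage)
can be checked against the same two lane diagrams. This is a rough
sketch under the assumption stated in the quote, namely that all nr
tasks are runnable for the whole interval, plus the further assumption
that every '-' slot is spent running. The function names are
hypothetical and exist only for this example.

```python
# Lane diagrams from the quoted text, one string per task (t1, t2, t3).
# 'x' = runnable but delayed, '-' = assumed to be running in that slot.
A_PRIME = ["----", "xx--", "----"]
C = ["----", "xx--", "xx--"]


def usage_from_lanes(lanes):
    """Total cpu usage in slots, counting every '-' slot as running."""
    return sum(lane.count("-") for lane in lanes)


def delta_run_delay(nr, dt, d_usage):
    """delta run_delay = nr * dt - delta usage, all in the same time unit."""
    return nr * dt - d_usage


for name, lanes in (("a')", A_PRIME), ("c)", C)):
    nr = len(lanes)      # number of runnable tasks
    dt = len(lanes[0])   # interval length in slots
    usage = usage_from_lanes(lanes)
    print(name, "usage =", usage, "run_delay =", delta_run_delay(nr, dt, usage))
```

The result matches a direct count of the 'x' slots in each diagram, but
only because nr is known here; in practice nr would have to be sampled,
which is the difficulty raised in the quoted text.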

Thanks & Best Regards,
	Abel