linux-kernel - Re: [RFC PATCH] cgroup: Track time in cgroup v2 freezer

Message-ID: <dbx8o6tn8jae.fsf@ynaffit-andsys.c.googlers.com>
Date: Sun, 13 Jul 2025 21:53:45 -0700
From: Tiffany Yang <ynaffit@...gle.com>
To: "Michal Koutný" <mkoutny@...e.com>
Cc: linux-kernel@...r.kernel.org, cgroups@...r.kernel.org, 
	kernel-team@...roid.com, John Stultz <jstultz@...gle.com>, 
	Thomas Gleixner <tglx@...utronix.de>, Stephen Boyd <sboyd@...nel.org>, 
	Anna-Maria Behnsen <anna-maria@...utronix.de>, Frederic Weisbecker <frederic@...nel.org>, 
	Tejun Heo <tj@...nel.org>, Johannes Weiner <hannes@...xchg.org>, 
	"Rafael J. Wysocki" <rafael@...nel.org>, Pavel Machek <pavel@...nel.org>, 
	Roman Gushchin <roman.gushchin@...ux.dev>, Chen Ridong <chenridong@...wei.com>, 
	Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>, 
	Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>, 
	Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, 
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, 
	Valentin Schneider <vschneid@...hat.com>
Subject: Re: [RFC PATCH] cgroup: Track time in cgroup v2 freezer

Michal Koutný <mkoutny@...e.com> writes:

> Would it be sufficient to measure that deadline against
> cpu.stat:usage_usec (CPU time consumed by the cgroup)? Or do I
> misunderstand your latter deadline metric?

CPU time is a good way to think about the quantity we are trying to
measure against, but it does not account for sleep time (whether
voluntary or spent waiting on a futex, etc.). Unlike freeze time, we
would want sleep time to count against our deadline because a timeout
there would likely indicate a problem in the application's logic.
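
To make the distinction concrete, here is a rough sketch of the accounting
described above. It is a simplified model, not code from the patch: the
Interval type and its field names are hypothetical, and it assumes an
interval splits cleanly into running, sleeping, and frozen time.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    """Hypothetical breakdown of one timer interval, in nanoseconds."""
    cpu_ns: int     # time spent running on a CPU
    sleep_ns: int   # voluntary sleep or blocking (e.g. on a futex)
    frozen_ns: int  # time the cgroup spent frozen by an external manager

    def wall_ns(self) -> int:
        # Total wall-clock length of the interval under this model.
        return self.cpu_ns + self.sleep_ns + self.frozen_ns

    def charged_ns(self) -> int:
        # What should count against the deadline: everything except
        # the time spent frozen.
        return self.wall_ns() - self.frozen_ns


iv = Interval(cpu_ns=30, sleep_ns=50, frozen_ns=200)
```

In this model, CPU time alone (30) undercounts the charged time (80)
because it drops the sleep, while wall-clock time (280) overcounts it
because it includes the freeze.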

> (Note that SIGSTOP may be sent to self or within the group but) mind
> that even the category "not requested" is split into two other: resource
> contention and freezing management. And the latter should be under
> control of the agent that sets the deadlines.


This would be ideal, but in our case, the agent that sets/enforces the
deadlines is a task in the same application. It has no control over
freezing events and (currently) no way to know when one has
occurred. Consequently, even if the freezing manager were to send the
relevant information to our agent, none of those messages could be
processed until the application was unfrozen.

The agent would then have to either compete directly against the task
under deadline (to handle messages as they came in) or delay
corrective-action decisions (by waiting until the deadline to process
any messages). If the
application were frozen multiple times during the timer interval, that
cost would be incurred each time. As an alternative, the watchdog could
request this information from the freezing manager upon timer elapse,
but that would also introduce significant latency to deadline
enforcement.
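
For contrast, if the watchdog could read a cumulative frozen-time counter
locally, the check at timer expiry might look roughly like the sketch
below. This is only an illustration under assumptions: on_timer_expiry and
its parameters are hypothetical names, and the two frozen-time readings
stand in for whatever interface would eventually expose that value.

```python
def on_timer_expiry(budget_ns, started_ns, now_ns,
                    frozen_at_start_ns, frozen_now_ns):
    """Decide what a watchdog does when its timer fires.

    frozen_at_start_ns and frozen_now_ns are two readings of a
    (hypothetical) cumulative frozen-time counter, taken when the
    timer was armed and when it fired.
    """
    # Time spent frozen since the timer was armed, however many
    # separate freeze events contributed to it.
    frozen_ns = frozen_now_ns - frozen_at_start_ns
    # Wall-clock time since arming, minus the frozen portion.
    used_ns = (now_ns - started_ns) - frozen_ns
    remaining_ns = budget_ns - used_ns
    if remaining_ns <= 0:
        return ("timeout", 0)
    # Some of the budget was lost to freezing: re-arm for the rest.
    return ("rearm", remaining_ns)


first = on_timer_expiry(100, 0, 100, 0, 40)   # 40 ns of the interval frozen
second = on_timer_expiry(100, 0, 140, 0, 40)  # no further freezing
```

The counter is read only when the timer fires, so multiple freezes within
one interval cost a single subtraction rather than one message each.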

> Those are order(s) of magnitude different. I can't imagine that using
> freezer for jobs where also wakeup latency matters.

This is true! These examples were mainly to illustrate the breadth of
the problem space and how slippery it can be to generalize.

> Well, there are multiple similar metrics: various (cgroup) PSI, (global)
> steal time, cpu.stat:throttled_usage and perhaps some more.

Ah! Thanks for noting these. It's helpful to have these concrete
examples to find ways to think about this problem.

Philosophically, I think the time we're trying to account for is most
similar to steal time because it allows a VM to correct the internal
accounting it uses to enforce policy. After considering how the delay
we're trying to track fits among these, I think one quality that makes
it somewhat difficult to formalize is that we are trying to account for
multiple external sources of delay, but we also want to exclude
"internal" delay (contention, voluntary sleep). The specificity of this
is making an iterative approach seem more appealing...

> Tejun's suggestion with tracking cgroup's frozen time of whole cgroup
> could complement other "debugging" stats provided by cgroups but I tend
> to think that it's not a good (and certainly not complete) solution to
> your problem.

I agree that it doesn't necessarily feel complete, but after spending
this time mulling over the problem, I think it still feels too narrow to
know what a more general solution should look like.

Since there isn't yet a clear way to identify a set of "lost" time that
everyone (or at least a wider group of users) cares about, it seems like
iterating over components of interest is the best way to make progress
for now. That way, at least folks can track some combination of the
values that matter to them. (One aspect of this I find interesting is
time that is accounted for in multiple metrics. Maybe a better way to
think about this problem can be found in some relation between these
overlaps.)

I really appreciate the effort that you've put into trying to understand
the larger problem and the questions you've asked to help me think about
it. Thank you very much for your time!

-- 
Tiffany Y. Yang