linux-kernel - Re: [RFC PATCH] cgroup: Track time in cgroup v2 freezer

Message-ID: <dbx8h601k4ms.fsf@ynaffit-andsys.c.googlers.com>
Date: Fri, 27 Jun 2025 00:47:23 -0700
From: Tiffany Yang <ynaffit@...gle.com>
To: Michal Koutný <mkoutny@...e.com>
Cc: linux-kernel@...r.kernel.org,  cgroups@...r.kernel.org,
  kernel-team@...roid.com,  John Stultz <jstultz@...gle.com>,  Thomas
 Gleixner <tglx@...utronix.de>,  Stephen Boyd <sboyd@...nel.org>,
  Anna-Maria Behnsen <anna-maria@...utronix.de>,  Frederic Weisbecker
 <frederic@...nel.org>,  Tejun Heo <tj@...nel.org>,  Johannes Weiner
 <hannes@...xchg.org>,  "Rafael J. Wysocki" <rafael@...nel.org>,  Pavel
 Machek <pavel@...nel.org>,  Roman Gushchin <roman.gushchin@...ux.dev>,
  Chen Ridong <chenridong@...wei.com>,  Ingo Molnar <mingo@...hat.com>,
  Peter Zijlstra <peterz@...radead.org>,  Juri Lelli
 <juri.lelli@...hat.com>,  Vincent Guittot <vincent.guittot@...aro.org>,
  Dietmar Eggemann <dietmar.eggemann@....com>,  Steven Rostedt
 <rostedt@...dmis.org>,  Ben Segall <bsegall@...gle.com>,  Mel Gorman
 <mgorman@...e.de>,  Valentin Schneider <vschneid@...hat.com>
Subject: Re: [RFC PATCH] cgroup: Track time in cgroup v2 freezer

Michal Koutný <mkoutny@...e.com> writes:

Hello! Thanks for taking the time to respond!

> Hello.
>
> On Tue, Jun 03, 2025 at 10:43:05PM +0000, Tiffany Yang <ynaffit@...gle.com> wrote:
>> The cgroup v2 freezer controller allows user processes to be dynamically
>> added to and removed from an interruptible frozen state from
>> userspace.
>
> Beware of freezing by migration vs freezing by cgroup attribute change.
> The latter is primary design of cgroup v2, the former is "only" for
> consistency.
>
>> This feature is helpful for application management, as it
>> allows background tasks to be frozen to prevent them from being
>> scheduled or otherwise contending with foreground tasks for resources.
>
>> Still, applications are usually unaware of their having been placed in
>> the freezer cgroup, so any watchdog timers they may have set will fire
>> when they exit. To address this problem, I propose tracking the per-task
>> frozen time and exposing it to userland via procfs.
>
> But the watchdog fires rightfully when the application does not run,
> doesn't it?

Good question. I should've been clearer about our use case. In both
cases, the watchdog is being used to ensure that a job is completed
before some deadline. When the deadline is relative to the system time,
then yes, it would be firing correctly. In our case, the deadline is
meant to be relative to the time our task spends running; since we don't
have a clock for that, we set our timer against the system time
(CLOCK_MONOTONIC, in this case) as an approximation.

This timer may fire (correctly) while our application is still frozen,
but our watchdog task won't run until it's unfrozen. At that point, it
can check how much time has been spent in the cgroup v2 freezer and
decide whether to rearm the timer or to initiate a corrective action.
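
As a rough illustration of that check (a sketch only, not code from the
patch): the helper below is hypothetical and takes the frozen time as a
plain number of seconds accumulated since the timer was armed, since
the format of the proposed procfs file is not settled in this thread.
It approximates running time as elapsed CLOCK_MONOTONIC time minus
frozen time.

```python
from typing import Optional


def watchdog_decision(budget_s: float, armed_at_s: float,
                      now_s: float, frozen_s: float) -> Optional[float]:
    """Decide what a watchdog should do once it finally gets to run.

    budget_s   -- running time the job is allowed (hypothetical parameter)
    armed_at_s -- CLOCK_MONOTONIC reading when the timer was armed
    now_s      -- CLOCK_MONOTONIC reading now
    frozen_s   -- time spent frozen since the timer was armed (assumed to
                  come from whatever interface ends up exposing it)

    Returns the number of seconds to rearm the timer for, or None when
    the budget is exhausted and corrective action should be taken.
    """
    elapsed = now_s - armed_at_s
    # Approximation: time not spent frozen counts as time spent running.
    running = elapsed - frozen_s
    remaining = budget_s - running
    return remaining if remaining > 0 else None


# 12s elapsed, 5s of it frozen: 7s of a 10s budget used, rearm for 3s.
print(watchdog_decision(10.0, 100.0, 112.0, 5.0))
# 12s elapsed, none frozen: the deadline was genuinely missed.
print(watchdog_decision(10.0, 100.0, 112.0, 0.0))
```

In a real task the two clock readings would come from something like
time.clock_gettime(time.CLOCK_MONOTONIC).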

> It should be responsibility of the "freezing agent" to prepare or notify
> the application about expected latencies.
>

Fair point! The freezing agent could roughly track freeze-entrance and
freeze-exit times, but this gets a little messy: it is unclear how the
agent would communicate those values to every application being frozen,
or who would be responsible for keeping track of per-thread accumulated
frozen times. The accuracy of those user timestamps, compared to ones
taken in the kernel, may be further degraded by possible preemptions,
etc.
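
To make the bookkeeping concrete, here is a minimal sketch of what a
userspace agent would have to maintain under that approach. The
FreezeLedger class and its method names are hypothetical, timestamps are
passed in by the caller, and the sketch deliberately ignores both how
the values would reach the frozen applications and the timestamp skew
mentioned above.

```python
from collections import defaultdict


class FreezeLedger:
    """Per-thread frozen-time accounting kept by a (hypothetical) agent."""

    def __init__(self):
        self._frozen_since = {}            # tid -> timestamp of freeze entry
        self._total = defaultdict(float)   # tid -> accumulated frozen seconds

    def on_freeze(self, tids, now_s):
        # Record freeze entry; a thread that is already frozen keeps its
        # original entry timestamp.
        for tid in tids:
            self._frozen_since.setdefault(tid, now_s)

    def on_thaw(self, tids, now_s):
        # Close out the open interval for each thread that was frozen.
        for tid in tids:
            start = self._frozen_since.pop(tid, None)
            if start is not None:
                self._total[tid] += now_s - start

    def frozen_time(self, tid, now_s):
        # Completed intervals plus the in-progress one, if any.
        total = self._total.get(tid, 0.0)
        start = self._frozen_since.get(tid)
        if start is not None:
            total += now_s - start
        return total
```

Even in this simplified form, the agent needs to know which threads are
in the cgroup at every freeze and thaw, which is part of what makes the
userspace approach awkward.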

>> but the main focus in this initial submission is establishing the
>> right UAPI for this accounting information.
>
> /proc/<pid>/cgroup_v2_freezer_time_frozen looks quite extraordinary with

Agreed.

> other similar metrics, my first thought would be a field in
> /proc/<pid>/stat (or track it per cgroup as Tejun suggests).
>

Adding it to /proc/<pid>/stat is an option, but because this metric
isn't very widely used and exactly what it measures is pretty particular
("freezer time, but no, cgroup freezer time, but v2 and not v1"), we
were hesitant to add it there and make this interface even more
difficult for folks to parse.

> Could you please primarily explain why the application itself should
> care about the frozen time (and not other causes of delay)?
>

Thank you for asking this! This is a very helpful question. My answer is
that other causes of delay may be equally important, but this is another
place where things get messy because of the spectrum of types of
"delay". If we break delays into 2 categories, delays that were
requested (sleep) and delays that were not (SIGSTOP), I can say that we
are primarily interested in delays that were not requested. However,
there are many cases that fall somewhere in between, like the wakeup
latency after a sleep, or that are difficult to account for, like
blocking on a futex (requested), where the owner might be preempted (not
requested).

All of which is to say that it is hard to pin down generalized
semantics for this.

We can usually ignore the smaller sources of delay on a time-shared
system, but larger causes of delay (e.g., cgroup v2 freezer, SIGSTOP,
or really bad cases of scheduler starvation) can cause problems.

In this case, we've focused on a narrowish solution to just the cgroup
v2 freezer delays because it's fairly tractable. Ideally, we could
abstract this out in a more general way to other delays (like SIGSTOP),
but the challenge here is that there isn't a clear line that separates a
problematic delay from an acceptable delay. Suggestions for a framework
to approach this more generally are very welcome.

In the meantime, focusing on task frozen/stopped time seems like the
most reasonable approach. Maybe that would be clear enough to make it
palatable for /proc/<pid>/stat?

-- 
Tiffany Y. Yang