Message-ID: <dbx8y0tej595.fsf@ynaffit-andsys.c.googlers.com>
Date: Thu, 26 Jun 2025 19:19:18 -0700
From: Tiffany Yang <ynaffit@...gle.com>
To: Tejun Heo <tj@...nel.org>
Cc: linux-kernel@...r.kernel.org,  cgroups@...r.kernel.org,
  kernel-team@...roid.com,  John Stultz <jstultz@...gle.com>,  Thomas
 Gleixner <tglx@...utronix.de>,  Stephen Boyd <sboyd@...nel.org>,
  Anna-Maria Behnsen <anna-maria@...utronix.de>,  Frederic Weisbecker
 <frederic@...nel.org>,  Johannes Weiner <hannes@...xchg.org>,  Michal
 Koutný <mkoutny@...e.com>,  "Rafael J. Wysocki"
 <rafael@...nel.org>,
  Pavel Machek <pavel@...nel.org>,  Roman Gushchin
 <roman.gushchin@...ux.dev>,  Chen Ridong <chenridong@...wei.com>,  Ingo
 Molnar <mingo@...hat.com>,  Peter Zijlstra <peterz@...radead.org>,  Juri
 Lelli <juri.lelli@...hat.com>,  Vincent Guittot
 <vincent.guittot@...aro.org>,  Dietmar Eggemann
 <dietmar.eggemann@....com>,  Steven Rostedt <rostedt@...dmis.org>,  Ben
 Segall <bsegall@...gle.com>,  Mel Gorman <mgorman@...e.de>,  Valentin
 Schneider <vschneid@...hat.com>
Subject: Re: [RFC PATCH] cgroup: Track time in cgroup v2 freezer

Tejun Heo <tj@...nel.org> writes:

> Hello, Tiffany.
>
> On Wed, Jun 04, 2025 at 07:39:29PM +0000, Tiffany Yang wrote:
> ...
>> Thanks for taking a look! In this case, I would argue that the value we
>> are accounting for (time that a task has not been able to run because it
>> is in the cgroup v2 frozen state) is task-specific and distinct from the
>> time that the cgroup it belongs to has been frozen.
>> 
>> A cgroup is not considered frozen until all of its members are frozen,
>> and if one task then leaves the frozen state, the entire cgroup is
>> considered no longer frozen, even if its other members stay in the
>> frozen state. Similarly, even if a task is migrated from one frozen
>> cgroup (A) to another frozen cgroup (B), the time cgroup B has been
>> frozen would not be representative of that task even though it is a
>> member.
>> 
>> There is also latency between when each task in a cgroup is marked as
>> to-be-frozen/unfrozen and when it actually enters the frozen state, so
>> each descendant task has a different frozen time. For watchdogs that
>> elapse on a per-task basis, a per-cgroup time-in-frozen value would
>> underreport the actual time each task spent unable to run. Tasks that
>> miss a deadline might incorrectly be considered misbehaving when the
>> time they spent suspended was not correctly accounted for.
>> 
>> Please let me know if that answers your question or if there's something
>> I'm missing. I agree that it would be cleaner/preferable to keep this
>> accounting under a cgroup-specific umbrella, so I hope there is some way
>> to get around these issues, but it doesn't look like cgroup fs has a
>> good way to keep task-specific stats at the moment.
>
> I'm not sure the freezing/frozen distinction is that meaningful. If each
> cgroup tracks total durations for both states, most threads should be able
> to rely on the freezing duration delta, right? There shouldn't be a
> significant time gap between freezing starting and most threads being
> frozen, although the cgroup may not reach the fully frozen state due to,
> e.g., NFS and whatnot.
>
> As long as a thread is not migrated across cgroups, it should be able to
> do something like:
>
> 1. Read /proc/self/cgroup to determine the current cgroup.
> 2. Read and remember the freezing duration from $CGRP/cgroup.stat.
> 3. Do the time-taking operation.
> 4. Read $CGRP/cgroup.stat again, calculate the delta, and deduct it from
>    the time taken.
>
> Would that work?
>
> Thanks.

Hi Tejun,

Thank you for your feedback! You made a good point: it's really the
duration delta that matters here. I looked at tracking the time from when
we set/clear a cgroup's CGRP_FREEZE flag and compared that against the
per-task measurements of its members. For large (1000+ thread) cgroups,
the latency between when a cgroup starts freezing and when a task near
the tail end of its cset->tasks list actually enters the handler is
fairly significant. On an x86 VM, I saw a difference of about 1 tick per
hundred tasks (i.e., the 6000th task would have been frozen for 60 ticks
less than the duration reported by its cgroup). We'd expect this latency
to accumulate more slowly on bare metal, but it would still grow linearly
with the number of tasks.

Fortunately, the same latency is present when we unfreeze a cgroup and
each of its member tasks, so it effectively cancels out when we look at
the freezing duration of tasks in cgroups that are not currently frozen.
For a running task, the measured time it had spent frozen in the past
was within 1-2 ticks of its cgroup's value. Our use case does not look
at this accounting until after a task has been unfrozen, so the
per-cgroup values seem like a reasonable substitute for our purposes!
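
To spell out why the offsets cancel: if the i-th task actually freezes
d_i ticks after CGRP_FREEZE is set at t_freeze, and thaws roughly d_i
ticks after the flag clears at t_thaw, then its true frozen time is
(t_thaw + d_i) - (t_freeze + d_i) = t_thaw - t_freeze, which is exactly
the per-cgroup duration.

Concretely, the consumer side would follow your steps 1-4. As a rough
sketch only: the "freeze_time_total_usec" key below is invented for
illustration (the interface is still under discussion), the cgroup2
mount point is assumed to be /sys/fs/cgroup, and step 1 (resolving the
cgroup path from /proc/self/cgroup) is elided:

#include <stdio.h>
#include <string.h>

static long long read_freeze_usec(const char *cgrp)
{
	char path[512], key[64];
	long long tmp, val = -1;
	FILE *f;

	snprintf(path, sizeof(path), "/sys/fs/cgroup%s/cgroup.stat", cgrp);
	f = fopen(path, "r");
	if (!f)
		return -1;
	/* cgroup.stat is a flat "key value" file, one pair per line */
	while (fscanf(f, "%63s %lld", key, &tmp) == 2) {
		if (!strcmp(key, "freeze_time_total_usec")) { /* invented key */
			val = tmp;
			break;
		}
	}
	fclose(f);
	return val;
}

int main(void)
{
	long long before, after;

	before = read_freeze_usec("/");		/* step 2 */
	/* step 3: the time-sensitive operation runs here */
	after = read_freeze_usec("/");		/* step 4 */
	if (before >= 0 && after >= 0)
		printf("frozen for %lld usec during the operation\n",
		       after - before);
	return 0;
}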

That being said, I realized from Michal's reply that the tracked value
doesn't have to be as narrow as the cgroup v2 freezing time. Basically,
we just want to give userspace some measure of the time a task could not
run when it expected to be running. It doesn't seem practical to give an
exact accounting, but tracking the time each task spends in some
combination of the stopped and frozen states might provide a useful
estimate, along the lines of the sketch below.
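
Purely for illustration (none of these names exist in the kernel or in
the RFC patch; they are invented for this sketch), that per-task
accounting could be as simple as stamping entry to and exit from the
stopped/frozen states and accumulating the delta:

#include <stdint.h>

/* Invented for illustration; not an actual kernel structure. */
struct suspended_acct {
	uint64_t enter_ns;	/* when the task stopped/froze; 0 if runnable */
	uint64_t total_ns;	/* cumulative stopped+frozen time */
};

/* Called when the task enters the stopped or frozen state. */
static void suspended_enter(struct suspended_acct *a, uint64_t now_ns)
{
	if (!a->enter_ns)
		a->enter_ns = now_ns;
}

/* Called when the task becomes runnable again. */
static void suspended_exit(struct suspended_acct *a, uint64_t now_ns)
{
	if (a->enter_ns) {
		a->total_ns += now_ns - a->enter_ns;
		a->enter_ns = 0;
	}
}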

What do you think?

-- 
Tiffany Y. Yang
