linux-kernel - Re: [PATCH v4 1/2] cgroup: cgroup.stat.local time accounting

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1b6498f3-ca07-41d5-9637-f20a58184e60@huaweicloud.com>
Date: Sat, 23 Aug 2025 09:45:26 +0800
From: Chen Ridong <chenridong@...weicloud.com>
To: Tiffany Yang <ynaffit@...gle.com>
Cc: linux-kernel@...r.kernel.org, John Stultz <jstultz@...gle.com>,
 Thomas Gleixner <tglx@...utronix.de>, Stephen Boyd <sboyd@...nel.org>,
 Anna-Maria Behnsen <anna-maria@...utronix.de>,
 Frederic Weisbecker <frederic@...nel.org>, Tejun Heo <tj@...nel.org>,
 Johannes Weiner <hannes@...xchg.org>, Michal Koutný
 <mkoutny@...e.com>, "Rafael J. Wysocki" <rafael@...nel.org>,
 Pavel Machek <pavel@...nel.org>, Roman Gushchin <roman.gushchin@...ux.dev>,
 Chen Ridong <chenridong@...wei.com>, kernel-team@...roid.com,
 Jonathan Corbet <corbet@....net>, Shuah Khan <shuah@...nel.org>,
 cgroups@...r.kernel.org, linux-doc@...r.kernel.org,
 linux-kselftest@...r.kernel.org
Subject: Re: [PATCH v4 1/2] cgroup: cgroup.stat.local time accounting



On 2025/8/23 3:32, Tiffany Yang wrote:
> Hi Chen,
> 
> Thanks again for taking a look!
> 
> Chen Ridong <chenridong@...weicloud.com> writes:
> 
>> On 2025/8/22 14:14, Chen Ridong wrote:
> 
> 
>>> On 2025/8/22 9:37, Tiffany Yang wrote:
>>>> There isn't yet a clear way to identify a set of "lost" time that
>>>> everyone (or at least a wider group of users) cares about. However,
>>>> users can perform some delay accounting by iterating over components of
>>>> interest. This patch allows cgroup v2 freezing time to be one of those
>>>> components.
> 
>>>> Track the cumulative time that each v2 cgroup spends freezing and expose
>>>> it to userland via a new local stat file in cgroupfs. Thank you to
>>>> Michal, who provided the ASCII art in the updated documentation.
> 
>>>> To access this value:
>>>>    $ mkdir /sys/fs/cgroup/test
>>>>    $ cat /sys/fs/cgroup/test/cgroup.stat.local
>>>>    freeze_time_total 0
> 
>>>> Ensure consistent freeze time reads with freeze_seq, a per-cgroup
>>>> sequence counter. Writes are serialized using the css_set_lock.
> 
> ...
> 
>>>>       spin_lock_irq(&css_set_lock);
>>>> -    if (freeze)
>>>> +    write_seqcount_begin(&cgrp->freezer.freeze_seq);
>>>> +    if (freeze) {
>>>>           set_bit(CGRP_FREEZE, &cgrp->flags);
>>>> -    else
>>>> +        cgrp->freezer.freeze_start_nsec = ts_nsec;
>>>> +    } else {
>>>>           clear_bit(CGRP_FREEZE, &cgrp->flags);
>>>> +        cgrp->freezer.frozen_nsec += (ts_nsec -
>>>> +            cgrp->freezer.freeze_start_nsec);
>>>> +    }
>>>> +    write_seqcount_end(&cgrp->freezer.freeze_seq);
>>>>       spin_unlock_irq(&css_set_lock);
> 
> 
>>> Hello Tiffany,
> 
>>> I wanted to check if there are any specific considerations regarding how we should input the ts_nsec
>>> value.
> 
>>> Would it be possible to define this directly within the cgroup_do_freeze function rather than
>>> passing it as a parameter? This approach might simplify the implementation and potentially improve
>>> timing accuracy when it have lots of descendants.
> 
> 
>> I revisited v3, and this was Michal's point.
>>     p
>>       /  |  \
>>      1  ...  n
>> When we freeze the parent group p, is it expected that all descendant cgroups (1 to n) should share
>> the same frozen timestamp?
> 
> 
> Yes, this is the expectation from the current change. I understand your
> concern about the accuracy of this measurement (especially when there
> are many descendants), but I agree with Michal's point that the time to
> traverse the descendant cgroups is basically noise relative to the
> quantity we're trying to measure here.
> 
>> If the cgroup tree structure is stable, the exact frozen time may not be really matter. However, if
>> the tree is not stable, obtaining the same frozen time is acceptable?
> 
> I'm a little unclear as to what you mean about when the cgroup tree is
> unstable. In the case where a new descendant of p is being created, I
> believe the cgroup_mutex prevents that from happening at the same time
> as we are freezing p's other descendants. If it won the race, was
> created unfrozen under p, and then became frozen during cgroup_freeze,
> it would have the same timestamp as the other descendants. If it lost
> the race and was created as a frozen cgroup under p, it would get its
> own timestamp in cgroup_create, so its freezing duration would be
> slightly less than that of the others in the hierarchy. Both values
> would be acceptable for our purposes, but if there was a different case
> you had in mind, please let me know!
> 
> Thanks,

What I mean by "stable" is that while cgroup 1 through n might be deleted or have more descendants
created. For example:

         n  n-1  n-2  ... 1
frozen   a  a+1  a+2     a+n
unfozen  b  b+1  b+2  ... b+n
nsec     b-a ...

In this case, all frozen_nsec values are b - a, which I believe is correct.
However, consider a scenario where some cgroups are deleted:

         n  n-1  n-2  ... 1
frozen   a  a+1  a+2     a+n
// 2 ... n-1 are deleted.
unfozen  b               b+1

Here, the frozen_nsec for cgroup n would be b - a, but for cgroup 1 it would be (b + 1) - (a + n).
This could introduce some discrepancy / timing inaccuracies.

-- 
Best regards,
Ridong