Message-ID: <3061f280-5a4a-4ff6-8b90-ab667a7e8e48@redhat.com>
Date: Mon, 15 Jul 2024 10:00:17 -0400
From: Waiman Long <longman@...hat.com>
To: Kamalesh Babulal <kamalesh.babulal@...cle.com>, Tejun Heo
 <tj@...nel.org>, Zefan Li <lizefan.x@...edance.com>,
 Johannes Weiner <hannes@...xchg.org>, Michal Koutný
 <mkoutny@...e.com>, Jonathan Corbet <corbet@....net>
Cc: cgroups@...r.kernel.org, linux-doc@...r.kernel.org,
 linux-kernel@...r.kernel.org, Roman Gushchin <roman.gushchin@...ux.dev>
Subject: Re: [PATCH v6] cgroup: Show # of subsystem CSSes in cgroup.stat

On 7/13/24 11:50, Kamalesh Babulal wrote:
> On 7/13/24 4:59 AM, Waiman Long wrote:
>> Cgroup subsystem state (CSS) is an abstraction in the cgroup layer to
>> help manage different structures in various cgroup subsystems by being
>> an embedded element inside a larger structure like cpuset or mem_cgroup.
>>
>> The /proc/cgroups file shows the number of cgroups for each of the
>> subsystems.  With cgroup v1, the number of CSSes is the same as the
>> number of cgroups.  That is not the case anymore with cgroup v2. The
>> /proc/cgroups file cannot show the actual number of CSSes for the
>> subsystems that are bound to cgroup v2.
>>
>> So if a v2 cgroup subsystem is leaking cgroups (usually memory cgroup),
>> we can't tell by looking at /proc/cgroups which cgroup subsystems may
>> be responsible.
>>
>> As cgroup v2 has deprecated the use of /proc/cgroups, the hierarchical
>> cgroup.stat file is now being extended to show the number of live and
>> dying CSSes associated with each non-inhibited cgroup subsystem that is
>> bound to cgroup v2, as long as the two counts are not both zero.
>> The number includes CSSes in the current cgroup as well as in all the
>> descendants underneath it.  This will help us pinpoint which subsystems
>> are responsible for the increasing number of dying (nr_dying_descendants)
>> cgroups.
>>
>> The CSS dying counts are stored in the cgroup structure itself
>> instead of inside the CSS, as suggested by Johannes. This allows
>> us to accurately track the dying counts of cgroup subsystems that
>> have recently been disabled in a cgroup. As a result, a zero live
>> subsystem count can now be coupled with a non-zero dying subsystem
>> count.
>>
>> The cgroup-v2.rst file is updated to discuss this new behavior.
>>
>> With this patch applied, a sample output from the root cgroup.stat
>> file is shown below.
>>
>> 	nr_descendants 56
>> 	nr_subsys_cpuset 1
>> 	nr_subsys_cpu 43
>> 	nr_subsys_io 43
>> 	nr_subsys_memory 56
>> 	nr_subsys_perf_event 57
>> 	nr_subsys_hugetlb 1
>> 	nr_subsys_pids 56
>> 	nr_subsys_rdma 1
>> 	nr_subsys_misc 1
>> 	nr_dying_descendants 30
>> 	nr_dying_subsys_cpuset 0
>> 	nr_dying_subsys_cpu 0
>> 	nr_dying_subsys_io 0
>> 	nr_dying_subsys_memory 30
>> 	nr_dying_subsys_perf_event 0
>> 	nr_dying_subsys_hugetlb 0
>> 	nr_dying_subsys_pids 0
>> 	nr_dying_subsys_rdma 0
>> 	nr_dying_subsys_misc 0
>>
>> Another sample output, from system.slice/cgroup.stat, is:
>>
>> 	nr_descendants 32
>> 	nr_subsys_cpu 30
>> 	nr_subsys_io 30
>> 	nr_subsys_memory 32
>> 	nr_subsys_perf_event 33
>> 	nr_subsys_pids 32
>> 	nr_dying_descendants 32
>> 	nr_dying_subsys_cpu 0
>> 	nr_dying_subsys_io 0
>> 	nr_dying_subsys_memory 32
>> 	nr_dying_subsys_perf_event 0
>> 	nr_dying_subsys_pids 0
>>
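As a rough illustration of how the new per-subsystem dying counts could be
consumed from user space, a minimal helper (hypothetical, not part of this
series) that scans a cgroup.stat file and reports the subsystems still
holding dying csses might look like this:

/*
 * Hypothetical user-space helper: parse the nr_dying_subsys_<name> lines
 * added by this patch and print the subsystems with a non-zero count.
 */
#include <stdio.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/sys/fs/cgroup/cgroup.stat";
	char line[256], name[64];
	long cnt;
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		/* Lines look like "nr_dying_subsys_memory 30". */
		if (sscanf(line, "nr_dying_subsys_%63s %ld", name, &cnt) == 2 &&
		    cnt > 0)
			printf("%s: %ld dying css(es)\n", name, cnt);
	}
	fclose(f);
	return 0;
}

Against the root cgroup.stat sample above this would report only
"memory: 30 dying css(es)", which is exactly the kind of pinpointing the
patch is after.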
> [...]
>
>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
>> index c8e4b62b436a..73774c841100 100644
>> --- a/kernel/cgroup/cgroup.c
>> +++ b/kernel/cgroup/cgroup.c
>> @@ -3669,12 +3669,43 @@ static int cgroup_events_show(struct seq_file *seq, void *v)
>>   static int cgroup_stat_show(struct seq_file *seq, void *v)
>>   {
>>   	struct cgroup *cgroup = seq_css(seq)->cgroup;
>> +	struct cgroup_subsys_state *css;
>> +	int dying_cnt[CGROUP_SUBSYS_COUNT];
>> +	int ssid;
>>   
>>   	seq_printf(seq, "nr_descendants %d\n",
>>   		   cgroup->nr_descendants);
>> +
>> +	/*
>> +	 * Show the number of live and dying csses associated with each of
>> +	 * non-inhibited cgroup subsystems that is either enabled in current
>> +	 * cgroup or has non-zero dying count.
>> +	 *
>> +	 * Without proper lock protection, racing is possible. So the
>> +	 * numbers may not be consistent when that happens.
>> +	 */
>> +	rcu_read_lock();
>> +	for (ssid = 0; ssid < CGROUP_SUBSYS_COUNT; ssid++) {
>> +		dying_cnt[ssid] = -1;
>> +		if (BIT(ssid) & cgrp_dfl_inhibit_ss_mask)
>> +			continue;
>> +		css = rcu_dereference_raw(cgroup->subsys[ssid]);
>> +		if (!css && !cgroup->nr_dying_subsys[ssid])
>> +			continue;
> Sorry if I have misread the discussion from the other thread about displaying
> nr_descendants and nr_dying_subsys_<subsys>. I believe the idea was to print
> them for both enabled and disabled cgroup controllers, so the output stays
> consistent and does not vary depending on the currently enabled controllers or
> on a previously enabled controller with nr_dying_subsys > 0.
>
> For example, on a rebooted vm:
>
> # cd /sys/fs/cgroup/
> # cat cgroup.subtree_control
> cpu memory pids
>
> # mkdir foo
> # cat foo/cgroup.stat
> nr_descendants 0
> nr_subsys_cpu 1
> nr_subsys_memory 1
> nr_subsys_perf_event 1
> nr_subsys_pids 1
> nr_dying_descendants 0
> nr_dying_subsys_cpu 0
> nr_dying_subsys_memory 0
> nr_dying_subsys_perf_event 0
> nr_dying_subsys_pids 0
>
> # echo '+cpuset' > cgroup.subtree_control
>
> # cat foo/cgroup.stat
> nr_descendants 0
> nr_subsys_cpuset 1
> nr_subsys_cpu 1
> nr_subsys_memory 1
> nr_subsys_perf_event 1
> nr_subsys_pids 1
> nr_dying_descendants 0
> nr_dying_subsys_cpuset 0
> nr_dying_subsys_cpu 0
> nr_dying_subsys_memory 0
> nr_dying_subsys_perf_event 0
> nr_dying_subsys_pids 0

I am fine with that. I will update the patch as suggested.
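
Roughly, I'd expect that to mean dropping the "!css && !nr_dying_subsys"
skip and printing a line for every non-inhibited controller bound to cgroup
v2, whether or not it is enabled in this cgroup, so the set of keys in
cgroup.stat stays stable. A sketch of that direction (not the final patch;
the cgrp_dfl_root check is just one possible way to filter out v1-bound
controllers):

	rcu_read_lock();
	for (ssid = 0; ssid < CGROUP_SUBSYS_COUNT; ssid++) {
		dying_cnt[ssid] = -1;
		/*
		 * Skip inhibited controllers and controllers still bound
		 * to a v1 hierarchy; show everything else, enabled in
		 * this cgroup or not.
		 */
		if ((BIT(ssid) & cgrp_dfl_inhibit_ss_mask) ||
		    (cgroup_subsys[ssid]->root != &cgrp_dfl_root))
			continue;
		css = rcu_dereference_raw(cgroup->subsys[ssid]);
		dying_cnt[ssid] = cgroup->nr_dying_subsys[ssid];
		seq_printf(seq, "nr_subsys_%s %d\n", cgroup_subsys[ssid]->name,
			   css ? (css->nr_descendants + 1) : 0);
	}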

Thanks,
Longman

>> +
>> +		dying_cnt[ssid] = cgroup->nr_dying_subsys[ssid];
>> +		seq_printf(seq, "nr_subsys_%s %d\n", cgroup_subsys[ssid]->name,
>> +			   css ? (css->nr_descendants + 1) : 0);
>> +	}
>> +
>>   	seq_printf(seq, "nr_dying_descendants %d\n",
>>   		   cgroup->nr_dying_descendants);
>> -
>> +	for (ssid = 0; ssid < CGROUP_SUBSYS_COUNT; ssid++) {
>> +		if (dying_cnt[ssid] >= 0)
>> +			seq_printf(seq, "nr_dying_subsys_%s %d\n",
>> +				   cgroup_subsys[ssid]->name, dying_cnt[ssid]);
>> +	}
>> +	rcu_read_unlock();
>>   	return 0;
>>   }
>>   

