Message-ID: <f5c701d5-c501-4179-959c-85057705a09d@huaweicloud.com>
Date: Mon, 19 May 2025 21:39:34 +0800
From: Chen Ridong <chenridong@...weicloud.com>
To: Alexey Gladkov <legion@...nel.org>
Cc: akpm@...ux-foundation.org, paulmck@...nel.org, bigeasy@...utronix.de,
 roman.gushchin@...ux.dev, brauner@...nel.org, tglx@...utronix.de,
 frederic@...nel.org, peterz@...radead.org, oleg@...hat.com,
 joel.granados@...nel.org, viro@...iv.linux.org.uk,
 lorenzo.stoakes@...cle.com, avagin@...gle.com, mengensun@...cent.com,
 linux@...ssschuh.net, jlayton@...nel.org, ruanjinjie@...wei.com,
 kees@...nel.org, linux-kernel@...r.kernel.org, lujialin4@...wei.com,
 Eric Biederman <ebiederm@...ssion.com>
Subject: Re: [RFC next v2 0/5] ucount: add rlimit cache for ucount



On 2025/5/16 19:48, Alexey Gladkov wrote:
> On Fri, May 09, 2025 at 07:20:49AM +0000, Chen Ridong wrote:
>> The will-it-scale test case signal1 [1] has been observed, and the test
>> results reveal that the signal-sending system call does not scale
>> linearly.
> 
> The signal1 testcase is pretty synthetic. It sends a signal in a busy loop.
> 
> Do you have an example of a closer-to-life scenario where this delay
> becomes a bottleneck?
> 
> https://github.com/antonblanchard/will-it-scale/blob/master/tests/signal1.c
> 

Thank you for your prompt reply. Unfortunately, I do not have a
specific real-world scenario.

Motivation:
I plan to use servers with 384 cores, and potentially even more in the
future. Therefore, I am testing these system calls to identify any
scalability bottlenecks that could arise in massively parallel
high-density computing environments.

In addition, we want containers to be isolated from each other as much
as possible so that they do not interfere with one another.

Best regards,
Ridong

>> To investigate this issue further, we ran a series of tests, launching
>> varying numbers of Docker containers and closely monitoring the
>> throughput of each individual container. The detailed test results are
>> as follows:
>>
>> 	| Dockers     |1      |4      |8      |16     |32     |64     |
>> 	| Throughput  |380068 |353204 |308948 |306453 |180659 |129152 |
>>
>> The data demonstrates a clear trend: as the number of containers
>> increases, the throughput per container progressively declines.
>> In-depth analysis has identified the root cause of this performance
>> degradation. The ucounts module accounts rlimit usage with a
>> significant number of atomic operations. When these atomic operations
>> act on the same variable, they trigger a substantial number of cache
>> misses and remote accesses, ultimately resulting in the drop in
>> performance.
>>
>> Notably, even though a new user_namespace is created on container
>> startup, the problem persists. This is because all these containers
>> share the same parent node, so the rlimit accounting continuously
>> modifies the same atomic variable.
>>
>> Currently, when incrementing a specific rlimit within a child user
>> namespace by 1, the corresponding rlimit in the parent node must also be
>> incremented by 1. Specifically, if the ucounts corresponding to a task in
>> Docker B is ucount_b_1, after incrementing the rlimit of ucount_b_1 by 1,
>> the rlimit of the parent node, init_ucounts, must also be incremented by 1.
>> This operation must also be checked against the limits set for the
>> user namespaces.
>>
>> 	init_user_ns                             init_ucounts
>> 	^                                              ^
>> 	|                        |                     |
>> 	|<---- usr_ns_a(docker A)|usr_ns_a->ucount---->|
>> 	|                        |                     |
>> 	|<---- usr_ns_b(docker B)|usr_ns_b->ucount---->|
>> 					^
>> 					|
>> 					|
>> 					|
>> 					ucount_b_1
>>
>> What is expected is that containers operating within separate
>> namespaces should remain isolated and not interfere with one another.
>> Regrettably, the current signal system call fails to achieve this
>> desired level of isolation.
>>
>> Proposal:
>>
>> To address the issues above, we propose a per-user-namespace cache for
>> rlimit accounting. With such a cache, a batch of rlimit counts can be
>> allocated to a namespace from its parent in one go. While resources
>> are abundant, they do not need to be returned to the parent node
>> immediately, and as long as the cache has available counts, increments
>> within the user namespace need not request anything from the parent
>> node.
>>
>> 	init_user_ns                             init_ucounts
>> 	^                                              ^
>> 	|                        |                     |
>> 	|<---- usr_ns_a(docker A)|usr_ns_a->ucount---->|
>> 	|                        |                     |
>> 	|<---- usr_ns_b(docker B)|usr_ns_b->ucount---->|
>> 			^		^
>> 			|		|
>> 			cache_rlimit--->|
>> 					|
>> 					ucount_b_1
>>
>>
>> The ultimate objective of this solution is to achieve complete
>> isolation among namespaces. After applying this patch set, the signal1
>> test results no longer deteriorate as the number of containers
>> increases, which effectively meets the goal of linear scalability.
>>
>> 	| Dockers     |1      |4      |8      |16     |32     |64     |
>> 	| Throughput  |381809 |382284 |380640 |383515 |381318 |380120 |
>>
>> Challenges:
>>
>> When checking the pending signals in the parent node with the command
>> cat /proc/self/status | grep SigQ, the retrieved value includes the
>> signal counts cached by its child nodes. As a result, the SigQ value
>> in the parent node no longer accurately and instantaneously reflects
>> the actual number of pending signals.
>>
>> 	# cat /proc/self/status | grep SigQ
>> 	SigQ:	16/6187667
>>
>> TODO:
>>
>> Add caches for the other rlimits.
>>
>> [1] https://github.com/antonblanchard/will-it-scale/blob/master/tests/
>>
>> Chen Ridong (5):
>>   user_namespace: add children list node
>>   usernamespace: make usernamespace rcu safe
>>   user_namespace: add user_ns iteration helper
>>   uounts: factor out __inc_rlimit_get_ucounts/__dec_rlimit_put_ucounts
>>   ucount: add rlimit cache for ucount
>>
>>  include/linux/user_namespace.h |  23 ++++-
>>  kernel/signal.c                |   2 +-
>>  kernel/ucount.c                | 181 +++++++++++++++++++++++++++++----
>>  kernel/user.c                  |   2 +
>>  kernel/user_namespace.c        |  60 ++++++++++-
>>  5 files changed, 243 insertions(+), 25 deletions(-)
>>
>> -- 
>> 2.34.1
>>
> 

