lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <6c3b4ac3-fe33-45d0-b7e8-d722e29c8240@huaweicloud.com>
Date: Wed, 21 May 2025 09:32:12 +0800
From: Chen Ridong <chenridong@...weicloud.com>
To: Alexey Gladkov <legion@...nel.org>
Cc: akpm@...ux-foundation.org, paulmck@...nel.org, bigeasy@...utronix.de,
 roman.gushchin@...ux.dev, brauner@...nel.org, tglx@...utronix.de,
 frederic@...nel.org, peterz@...radead.org, oleg@...hat.com,
 joel.granados@...nel.org, viro@...iv.linux.org.uk,
 lorenzo.stoakes@...cle.com, avagin@...gle.com, mengensun@...cent.com,
 linux@...ssschuh.net, jlayton@...nel.org, ruanjinjie@...wei.com,
 kees@...nel.org, linux-kernel@...r.kernel.org, lujialin4@...wei.com,
 Eric Biederman <ebiederm@...ssion.com>
Subject: Re: [RFC next v2 0/5] ucount: add rlimit cache for ucount



On 2025/5/20 0:32, Alexey Gladkov wrote:
> On Mon, May 19, 2025 at 09:39:34PM +0800, Chen Ridong wrote:
>>
>>
>> On 2025/5/16 19:48, Alexey Gladkov wrote:
>>> On Fri, May 09, 2025 at 07:20:49AM +0000, Chen Ridong wrote:
>>>> The will-it-scale test case signal1 [1] has been observed. and the test
>>>> results reveal that the signal sending system call lacks linearity.
>>>
>>> The signal1 testcase is pretty synthetic. It sends a signal in a busy loop.
>>>
>>> Do you have an example of a closer-to-life scenario where this delay
>>> becomes a bottleneck ?
>>>
>>> https://github.com/antonblanchard/will-it-scale/blob/master/tests/signal1.c
>>>
>>
>> Thank you for your prompt reply. Unfortunately, I do not have the
>> specific scenario.
>>
>> Motivation:
>> I plan to use servers with 384 cores, and potentially even more in the
>> future. Therefore, I am testing these system calls to identify any
>> scalability bottlenecks that could arise in massively parallel
>> high-density computing environments.
> 
> But it turns out that you're proposing complex changes for something that

Using percpu_couter is not as complex as this patch set.

> is essentially a non-issue. In the real world, applications don't spam
> signals and I'm not sure we want to support that scenario.
> 
>> In addition, we hope that the containers can be isolated as much as
>> possible to avoid interfering with each other.
> 
> But that's impossible. Even before migration to ucounts, some rlimits
> (RLIMIT_MSGQUEUE, RLIMIT_MEMLOCK, RLIMIT_SIGPENDING, RLIMIT_NPROC) were
> bound to user_struct. I mean, atomic counter and "bottleneck" was there.
> We can't remove the counters for that rlimits and they will have an
> impact.
> 
> These rlimits are now counted per-namespace. In real life, docker/podman
> creates a separate user namespace for each container from init_user_ns.
> Usually only one additional counter is added for each rlimit in this way.
> 
The problem is that when increasing an rlimit in Docker, the limit has
to be increased in the init_user_ns even if a separate user namespace
has been created. This is why I believe this issue deserves to be fixed.


> All I'm saying is that "bottleneck" with atomic counter was there before
> and can't be removed anywhere.
> 

Yes, it can not be removed anywhere, maybe we can make it better.

Best regards,
Ridong

>>
>> Best regards,
>> Ridong
>>
>>>> To further investigate this issue, we initiated a series of tests by
>>>> launching varying numbers of dockers and closely monitored the throughput
>>>> of each individual docker. The detailed test outcomes are presented as
>>>> follows:
>>>>
>>>> 	| Dockers     |1      |4      |8      |16     |32     |64     |
>>>> 	| Throughput  |380068 |353204 |308948 |306453 |180659 |129152 |
>>>>
>>>> The data clearly demonstrates a discernible trend: as the quantity of
>>>> dockers increases, the throughput per container progressively declines.
>>>> In-depth analysis has identified the root cause of this performance
>>>> degradation. The ucouts module conducts statistics on rlimit, which
>>>> involves a significant number of atomic operations. These atomic
>>>> operations, when acting on the same variable, trigger a substantial number
>>>> of cache misses or remote accesses, ultimately resulting in a drop in
>>>> performance.
>>>>
>>>> Notably, even though a new user_namespace is created upon docker startup,
>>>> the problem persists. This is because all these dockers share the same
>>>> parent node, meaning that rlimit statistics continuously modify the same
>>>> atomic variable.
>>>>
>>>> Currently, when incrementing a specific rlimit within a child user
>>>> namespace by 1, the corresponding rlimit in the parent node must also be
>>>> incremented by 1. Specifically, if the ucounts corresponding to a task in
>>>> Docker B is ucount_b_1, after incrementing the rlimit of ucount_b_1 by 1,
>>>> the rlimit of the parent node, init_ucounts, must also be incremented by 1.
>>>> This operation should be ensured to stay within the limits set for the
>>>> user namespaces.
>>>>
>>>> 	init_user_ns                             init_ucounts
>>>> 	^                                              ^
>>>> 	|                        |                     |
>>>> 	|<---- usr_ns_a(docker A)|usr_ns_a->ucount---->|
>>>> 	|                        |                     |
>>>> 	|<---- usr_ns_b(docker B)|usr_ns_a->ucount---->|
>>>> 					^
>>>> 					|
>>>> 					|
>>>> 					|
>>>> 					ucount_b_1
>>>>
>>>> What is expected is that dockers operating within separate namespaces
>>>> should remain isolated and not interfere with one another. Regrettably,
>>>> the current signal system call fails to achieve this desired level of
>>>> isolation.
>>>>
>>>> Proposal:
>>>>
>>>> To address the aforementioned issues, the concept of implementing a cache
>>>> for each namespace's rlimit has been proposed. If a cache is added for
>>>> each user namespace's rlimit, a certain amount of rlimits can be allocated
>>>> to a particular namespace in one go. When resources are abundant, these
>>>> resources do not need to be immediately returned to the parent node. Within
>>>> a user namespace, if there are available values in the cache, there is no
>>>> need to request additional resources from the parent node.
>>>>
>>>> 	init_user_ns                             init_ucounts
>>>> 	^                                              ^
>>>> 	|                        |                     |
>>>> 	|<---- usr_ns_a(docker A)|usr_ns_a->ucount---->|
>>>> 	|                        |                     |
>>>> 	|<---- usr_ns_b(docker B)|usr_ns_b->ucount---->|
>>>> 			^		^
>>>> 			|		|
>>>> 			cache_rlimit--->|
>>>> 					|
>>>> 					ucount_b_1
>>>>
>>>>
>>>> The ultimate objective of this solution is to achieve complete isolation
>>>> among namespaces. After applying this patch set, the final test results
>>>> indicate that in the signal1 test case, the performance does not
>>>> deteriorate as the number of containers increases. This effectively meets
>>>
>>>> the goal of linear scalability.
>>>>
>>>> 	| Dockers     |1      |4      |8      |16     |32     |64     |
>>>> 	| Throughput  |381809 |382284 |380640 |383515 |381318 |380120 |
>>>>
>>>> Challenges:
>>>>
>>>> When checking the pending signals in the parent node using the command
>>>>  cat /proc/self/status | grep SigQ, the retrieved value includes the
>>>> cached signal counts from its child nodes. As a result, the SigQ value
>>>> in the parent node fails to accurately and instantaneously reflect the
>>>> actual number of pending signals.
>>>>
>>>> 	# cat /proc/self/status | grep SigQ
>>>> 	SigQ:	16/6187667
>>>>
>>>> TODO:
>>>>
>>>> Add cache for the other rlimits.
>>>>
>>>> [1] https://github.com/antonblanchard/will-it-scale/blob/master/tests/
>>>>
>>>> Chen Ridong (5):
>>>>   user_namespace: add children list node
>>>>   usernamespace: make usernamespace rcu safe
>>>>   user_namespace: add user_ns iteration helper
>>>>   uounts: factor out __inc_rlimit_get_ucounts/__dec_rlimit_put_ucounts
>>>>   ucount: add rlimit cache for ucount
>>>>
>>>>  include/linux/user_namespace.h |  23 ++++-
>>>>  kernel/signal.c                |   2 +-
>>>>  kernel/ucount.c                | 181 +++++++++++++++++++++++++++++----
>>>>  kernel/user.c                  |   2 +
>>>>  kernel/user_namespace.c        |  60 ++++++++++-
>>>>  5 files changed, 243 insertions(+), 25 deletions(-)
>>>>
>>>> -- 
>>>> 2.34.1
>>>>
>>>
>>
> 


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ