Message-ID: <f5c701d5-c501-4179-959c-85057705a09d@huaweicloud.com>
Date: Mon, 19 May 2025 21:39:34 +0800
From: Chen Ridong <chenridong@...weicloud.com>
To: Alexey Gladkov <legion@...nel.org>
Cc: akpm@...ux-foundation.org, paulmck@...nel.org, bigeasy@...utronix.de,
roman.gushchin@...ux.dev, brauner@...nel.org, tglx@...utronix.de,
frederic@...nel.org, peterz@...radead.org, oleg@...hat.com,
joel.granados@...nel.org, viro@...iv.linux.org.uk,
lorenzo.stoakes@...cle.com, avagin@...gle.com, mengensun@...cent.com,
linux@...ssschuh.net, jlayton@...nel.org, ruanjinjie@...wei.com,
kees@...nel.org, linux-kernel@...r.kernel.org, lujialin4@...wei.com,
Eric Biederman <ebiederm@...ssion.com>
Subject: Re: [RFC next v2 0/5] ucount: add rlimit cache for ucount
On 2025/5/16 19:48, Alexey Gladkov wrote:
> On Fri, May 09, 2025 at 07:20:49AM +0000, Chen Ridong wrote:
>> The will-it-scale test case signal1 [1] has been examined, and the test
>> results reveal that the signal-sending system call does not scale
>> linearly.
>
> The signal1 testcase is pretty synthetic. It sends a signal in a busy loop.
>
> Do you have an example of a closer-to-real-life scenario where this
> delay becomes a bottleneck?
>
> https://github.com/antonblanchard/will-it-scale/blob/master/tests/signal1.c
>
Thank you for your prompt reply. Unfortunately, I do not have a
specific real-world scenario.

Motivation:

I plan to use servers with 384 cores, and potentially even more in the
future. Therefore, I am testing these system calls to identify any
scalability bottlenecks that could arise in massively parallel,
high-density computing environments.

In addition, we hope that containers can be isolated from each other as
much as possible so that they do not interfere with one another.

Best regards,
Ridong
>> To further investigate this issue, we ran a series of tests, launching
>> varying numbers of dockers and closely monitoring the throughput of
>> each individual docker. The detailed results are as follows:
>>
>> | Dockers    | 1      | 4      | 8      | 16     | 32     | 64     |
>> | Throughput | 380068 | 353204 | 308948 | 306453 | 180659 | 129152 |
>>
>> The data demonstrates a clear trend: as the number of dockers
>> increases, the throughput per container progressively declines.
>> In-depth analysis has identified the root cause of this performance
>> degradation: the ucounts module accounts for rlimit usage with a
>> significant number of atomic operations. When these atomic operations
>> act on the same variable, they trigger a substantial number of cache
>> misses or remote accesses, ultimately resulting in the drop in
>> performance.
>>
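
For reference, the charging path behind this contention currently looks
roughly like the sketch below, simplified from kernel/ucount.c (field
names and the limit accessor are approximate and vary between kernel
versions). Every charge walks up the ucounts hierarchy and performs an
atomic read-modify-write on each ancestor's counter, so all containers
end up modifying the same cache line in init_ucounts:

/*
 * Simplified sketch of the current hierarchical rlimit charging in
 * kernel/ucount.c; names and details are approximate.
 */
long inc_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type,
                        long v)
{
        struct ucounts *iter;
        long max = LONG_MAX;
        long ret = 0;

        /* Walk from the task's ucounts all the way up to init_ucounts. */
        for (iter = ucounts; iter; iter = iter->ns->ucounts) {
                /*
                 * Atomic RMW on a counter shared by every container below
                 * this level -> cache-line bouncing under load.
                 */
                long new = atomic_long_add_return(v, &iter->rlimit[type]);

                if (new < 0 || new > max)
                        ret = LONG_MAX;
                else if (iter == ucounts)
                        ret = new;
                max = iter->ns->rlimit_max[type];
        }
        return ret;
}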
>> Notably, even though a new user_namespace is created upon docker startup,
>> the problem persists. This is because all these dockers share the same
>> parent node, meaning that rlimit statistics continuously modify the same
>> atomic variable.
>>
>> Currently, when a specific rlimit within a child user namespace is
>> incremented by 1, the corresponding rlimit in the parent node must
>> also be incremented by 1. Specifically, if the ucounts corresponding
>> to a task in docker B is ucount_b_1, then after incrementing the
>> rlimit of ucount_b_1 by 1, the rlimit of the parent node,
>> init_ucounts, must also be incremented by 1. Each such increment must
>> also stay within the limits set for the user namespaces.
>>
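When a signal is queued, the charge is taken by a get/put variant of
the above that must also undo everything if any level would exceed its
limit (inc_rlimit_get_ucounts() in kernel/ucount.c, which this series
factors out into __inc_rlimit_get_ucounts()). A simplified sketch, with
names and details approximate:

/*
 * Simplified sketch of the limit-checked charge used on the signal
 * path; the real code is inc_rlimit_get_ucounts() in kernel/ucount.c.
 */
long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type)
{
        struct ucounts *iter, *bad;
        long max = LONG_MAX;
        long ret = 0;

        for (iter = ucounts; iter; iter = iter->ns->ucounts) {
                long new = atomic_long_add_return(1, &iter->rlimit[type]);

                if (new < 0 || new > max)
                        goto unwind;    /* limit exceeded at this level */
                if (iter == ucounts)
                        ret = new;
                max = iter->ns->rlimit_max[type];
        }
        return ret;

unwind:
        /* Roll back the charges taken so far, including the failing level. */
        bad = iter;
        for (iter = ucounts; iter != bad; iter = iter->ns->ucounts)
                atomic_long_sub(1, &iter->rlimit[type]);
        atomic_long_sub(1, &bad->rlimit[type]);
        return 0;
}

On failure every ancestor is touched twice, and even on success every
ancestor is touched once, which is why the contention grows with the
number of containers sharing init_ucounts.
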
>> init_user_ns                                   init_ucounts
>>      ^                                              ^
>>      |                        |                     |
>>      |<---- usr_ns_a(docker A)|usr_ns_a->ucount---->|
>>      |                        |                     |
>>      |<---- usr_ns_b(docker B)|usr_ns_b->ucount---->|
>>                                       ^
>>                                       |
>>                                       |
>>                                       |
>>                                  ucount_b_1
>>
>> What is expected is that dockers operating within separate namespaces
>> should remain isolated and not interfere with one another. Regrettably,
>> the current signal system call fails to achieve this desired level of
>> isolation.
>>
>> Proposal:
>>
>> To address the aforementioned issues, a per-namespace rlimit cache is
>> proposed. With a cache added to each user namespace's rlimit, a batch
>> of rlimit charges can be allocated to a namespace in one go. When
>> resources are abundant, they do not need to be returned to the parent
>> node immediately. Within a user namespace, as long as the cache still
>> holds spare charges, there is no need to request additional resources
>> from the parent node.
>>
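One possible shape of this cached fast path, purely as an illustration
of the idea (a hypothetical sketch, not the code from these patches;
rlimit_cache and CACHE_BATCH are made-up names, and limit checking is
omitted):

/*
 * Hypothetical sketch of a per-namespace rlimit cache fast path; the
 * actual patches may be structured very differently.  rlimit_cache and
 * CACHE_BATCH are illustrative names, not existing kernel symbols.
 */
#define CACHE_BATCH 64

long inc_rlimit_cached(struct ucounts *ucounts, enum rlimit_type type)
{
        struct user_namespace *ns = ucounts->ns;

        /*
         * Fast path: consume a charge already reserved for this
         * namespace; no atomic touches the parent (e.g. init_ucounts).
         */
        if (atomic_long_add_unless(&ns->rlimit_cache[type], -1, 0))
                return atomic_long_add_return(1, &ucounts->rlimit[type]);

        /*
         * Slow path: reserve a whole batch from the ancestors in one go,
         * take one charge for ourselves and keep the rest cached for
         * later requests in this namespace.
         */
        inc_rlimit_ucounts(ns->ucounts, type, CACHE_BATCH);
        atomic_long_add(CACHE_BATCH - 1, &ns->rlimit_cache[type]);
        return atomic_long_add_return(1, &ucounts->rlimit[type]);
}

Freeing would work symmetrically: charges go back to the cache first
and are only returned to the parent in batches. That is also why the
parent's raw counter (and hence SigQ, see the Challenges section below)
stops being an exact instantaneous count.
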
>> init_user_ns                                   init_ucounts
>>      ^                                              ^
>>      |                        |                     |
>>      |<---- usr_ns_a(docker A)|usr_ns_a->ucount---->|
>>      |                        |                     |
>>      |<---- usr_ns_b(docker B)|usr_ns_b->ucount---->|
>>                    ^                  ^
>>                    |                  |
>>               cache_rlimit----------->|
>>                    |
>>                ucount_b_1
>>
>>
>> The ultimate objective of this solution is to achieve complete isolation
>> among namespaces. After applying this patch set, the final test results
>> indicate that in the signal1 test case, the performance does not
>> deteriorate as the number of containers increases. This effectively meets
>> the goal of linear scalability.
>>
>> | Dockers    | 1      | 4      | 8      | 16     | 32     | 64     |
>> | Throughput | 381809 | 382284 | 380640 | 383515 | 381318 | 380120 |
>>
>> Challenges:
>>
>> When checking the pending signals of the parent node with the command
>> cat /proc/self/status | grep SigQ, the retrieved value includes the
>> signal charges cached by its child namespaces. As a result, the SigQ
>> value of the parent node no longer accurately and instantaneously
>> reflects the actual number of pending signals.
>>
>> # cat /proc/self/status | grep SigQ
>> SigQ: 16/6187667
>>
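For context on why the cached charges show up here: the SigQ numerator
is read directly from the ucounts counter, so charges parked in a child
namespace's cache are indistinguishable from real pending signals. A
rough sketch of the relevant logic (see task_sig() in fs/proc/array.c;
render_sigq() is just an illustrative wrapper and details are
approximate):

/*
 * Rough sketch of how the SigQ line of /proc/<pid>/status is produced
 * (task_sig() in fs/proc/array.c); render_sigq() is an illustrative
 * name, details approximate.
 */
static void render_sigq(struct seq_file *m, struct task_struct *p)
{
        long qsize = get_rlimit_value(task_ucounts(p),
                                      UCOUNT_RLIMIT_SIGPENDING);
        unsigned long qlim = task_rlimit(p, RLIMIT_SIGPENDING);

        /* The counter includes charges cached by child namespaces. */
        seq_printf(m, "SigQ:\t%ld/%lu\n", qsize, qlim);
}
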
>> TODO:
>>
>> Add cache for the other rlimits.
>>
>> [1] https://github.com/antonblanchard/will-it-scale/blob/master/tests/
>>
>> Chen Ridong (5):
>> user_namespace: add children list node
>> usernamespace: make usernamespace rcu safe
>> user_namespace: add user_ns iteration helper
>> uounts: factor out __inc_rlimit_get_ucounts/__dec_rlimit_put_ucounts
>> ucount: add rlimit cache for ucount
>>
>> include/linux/user_namespace.h | 23 ++++-
>> kernel/signal.c | 2 +-
>> kernel/ucount.c | 181 +++++++++++++++++++++++++++++----
>> kernel/user.c | 2 +
>> kernel/user_namespace.c | 60 ++++++++++-
>> 5 files changed, 243 insertions(+), 25 deletions(-)
>>
>> --
>> 2.34.1
>>
>