Message-ID: <d3f27222-7262-4f04-8c80-329852bc251e@huaweicloud.com>
Date: Tue, 13 May 2025 09:48:31 +0800
From: Chen Ridong <chenridong@...weicloud.com>
To: Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
 Andrew Morton <akpm@...ux-foundation.org>
Cc: paulmck@...nel.org, legion@...nel.org, roman.gushchin@...ux.dev,
 brauner@...nel.org, tglx@...utronix.de, frederic@...nel.org,
 peterz@...radead.org, oleg@...hat.com, joel.granados@...nel.org,
 viro@...iv.linux.org.uk, lorenzo.stoakes@...cle.com, avagin@...gle.com,
 mengensun@...cent.com, linux@...ssschuh.net, jlayton@...nel.org,
 ruanjinjie@...wei.com, kees@...nel.org, linux-kernel@...r.kernel.org,
 lujialin4@...wei.com, Chen Ridong <chenridong@...wei.com>
Subject: Re: [RFC next v2 0/5] ucount: add rlimit cache for ucount



On 2025/5/12 18:48, Sebastian Andrzej Siewior wrote:
> On 2025-05-09 13:18:49 [-0700], Andrew Morton wrote:
>> On Fri,  9 May 2025 07:20:49 +0000 Chen Ridong <chenridong@...weicloud.com> wrote:
>>
>>> We ran the will-it-scale test case signal1 [1], and the results
>>> reveal that the signal-sending system call does not scale linearly.
>>> To investigate this further, we launched varying numbers of Docker
>>> containers and closely monitored the throughput of each individual
>>> container. The detailed test outcomes are presented as follows:
>>>
>>> 	| Dockers     |1      |4      |8      |16     |32     |64     |
>>> 	| Throughput  |380068 |353204 |308948 |306453 |180659 |129152 |
>>>
>>> The data demonstrates a clear trend: as the number of containers
>>> increases, the throughput of each container progressively declines.
>>> In-depth analysis identified the root cause of this performance
>>> degradation: the ucounts module performs rlimit accounting with a
>>> significant number of atomic operations. When these atomic operations
>>> act on the same variable, they trigger a substantial number of cache
>>> misses or remote accesses, ultimately resulting in a drop in
>>> performance.
>>
>> Did you consider simply turning that atomic_t counter into a
>> percpu_counter?
> 
> That sounds like a smaller change. Also, do these 1…64 docker containers
> play signal ping-pong or is there a real workload behind it?
> 
> Sebastian

Hi Andrew and Sebastian,

Thanks for your prompt reply. I'm currently conducting a "will-it-scale"
test on a 384-core machine configured to run up to 64 Docker containers
(the maximum for this setup). Even with just 32 containers, the
throughput drops by 53%, which indicates significant scaling challenges.

Your suggestion about using percpu_counter was spot on. I've since
implemented a demo to benchmark its performance. Here's the code I wrote
for testing:

@@ -281,10 +289,10 @@ static void do_dec_rlimit_put_ucounts(struct ucounts *ucounts,
 {
        struct ucounts *iter, *next;
        for (iter = ucounts; iter != last; iter = next) {
-               long dec = atomic_long_sub_return(1, &iter->rlimit[type]);
-               WARN_ON_ONCE(dec < 0);
+               percpu_counter_sub(&iter->rlimit[type], 1);
+
                next = iter->ns->ucounts;
-               if (dec == 0)
+               if (percpu_counter_compare(&iter->rlimit[type], 0) == 0)
                        put_ucounts(iter);
        }
 }
@@ -295,36 +303,40 @@ void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type)
 }

 long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
-                           bool override_rlimit)
+                           bool override_rlimit, long limit)
 {
        /* Caller must hold a reference to ucounts */
        struct ucounts *iter;
        long max = LONG_MAX;
-       long dec, ret = 0;
+       long ret = 0;
+
+       if (override_rlimit)
+               limit = LONG_MAX;

        for (iter = ucounts; iter; iter = iter->ns->ucounts) {
-               long new = atomic_long_add_return(1, &iter->rlimit[type]);
-               if (new < 0 || new > max)
+               max = min(limit, max);
+
+               if (!percpu_counter_limited_add(&iter->rlimit[type], max, 1)) {
+                       ret = -1;
                        goto dec_unwind;
-               if (iter == ucounts)
-                       ret = new;
+               }
                if (!override_rlimit)
                        max = get_userns_rlimit_max(iter->ns, type);
                /*
                 * Grab an extra ucount reference for the caller when
                 * the rlimit count was previously 0.
                 */
-               if (new != 1)
+               if (percpu_counter_compare(&iter->rlimit[type], 1) != 0)
                        continue;
-               if (!get_ucounts(iter))
+               if (!get_ucounts(iter)) {
+                       ret = 0;
                        goto dec_unwind;
+               }
        }
-       return ret;
+       return 1;
 dec_unwind:
-       dec = atomic_long_sub_return(1, &iter->rlimit[type]);
-       WARN_ON_ONCE(dec < 0);
        do_dec_rlimit_put_ucounts(ucounts, iter, type);
-       return 0;
+       return ret;
 }
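
One caveat worth noting for anyone reproducing the demo: unlike the
atomic_long_t fields, each struct percpu_counter needs its per-CPU
storage allocated when the ucounts is created (and destroyed when it is
freed). A minimal sketch of that init step, assuming a hypothetical
helper called from the ucounts allocation path:

static int ucounts_init_rlimit_counters(struct ucounts *ucounts)
{
	int i;

	/* Allocate the per-cpu slots for every rlimit counter; unwind
	 * the ones already initialised if an allocation fails. */
	for (i = 0; i < UCOUNT_RLIMIT_COUNTS; i++) {
		if (percpu_counter_init(&ucounts->rlimit[i], 0, GFP_KERNEL)) {
			while (--i >= 0)
				percpu_counter_destroy(&ucounts->rlimit[i]);
			return -ENOMEM;
		}
	}
	return 0;
}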

As shown in the demo code, the current implementation takes a reference
on the ucounts when the rlimit count first increments from zero and
releases it when the count drops back to zero. This mechanism was
introduced by the following commits:

  fda31c50292a5062332fa0343c084bd9f46604d9
    ("signal: avoid double atomic counter increments for user accounting")
  d64696905554e919321e31afc210606653b8f6a4
    ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
  15bc01effefe97757ef02ca09e9d1b927ab22725
    ("ucounts: Fix signal ucount refcounting")

This means we have to sum all the percpu rlimit counters every time we
increment or decrement the rlimit, and this is expensive.
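
The expense is concentrated in the comparisons against 0 and 1: when the
approximate central count is within the per-CPU batch error of the
target, percpu_counter_compare() has to fall back to a precise sum. A
rough sketch of that logic, paraphrased from lib/percpu_counter.c (not a
verbatim copy):

int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
{
	s64 count;

	count = percpu_counter_read(fbc);
	/* The rough central count is good enough when far from rhs. */
	if (abs(count - rhs) > (batch * num_online_cpus())) {
		if (count > rhs)
			return 1;
		else
			return -1;
	}
	/* Too close to call: take the O(nr_cpus) precise sum. */
	count = percpu_counter_sum(fbc);
	if (count > rhs)
		return 1;
	else if (count < rhs)
		return -1;
	else
		return 0;
}

Since the rlimit count in this workload hovers right around 0 and 1, the
fast path almost never helps and nearly every call ends up in the
precise sum.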

Running the demo code in a single Docker container yielded a throughput
of 73,970, significantly lower than expected. Profiling with perf showed
that __percpu_counter_sum is the primary performance bottleneck:

+   97.44%     0.27%  signal1_process    [k] entry_SYSCALL_64_after_hwframe
+   97.13%     1.96%  signal1_process    [k] do_syscall_64
+   91.54%     0.00%  signal1_process    [.] 0x00007fb905c8d13f
+   78.66%     0.03%  signal1_process    [k] __percpu_counter_compare
+   78.63%    68.18%  signal1_process    [k] __percpu_counter_sum
+   45.17%     0.37%  signal1_process    [k] syscall_exit_to_user_mode
+   44.95%     0.20%  signal1_process    [k] __x64_sys_tgkill
+   44.51%     0.40%  signal1_process    [k] do_send_specific
+   44.16%     0.07%  signal1_process    [k] arch_do_signal_or_restart
+   43.03%     0.37%  signal1_process    [k] do_send_sig_info
+   42.08%     0.34%  signal1_process    [k] __send_signal_locked
+   40.87%     0.03%  signal1_process    [k] sig_get_ucounts
+   40.74%     0.44%  signal1_process    [k] inc_rlimit_get_ucounts
+   40.55%     0.54%  signal1_process    [k] get_signal
+   39.81%     0.07%  signal1_process    [k] dequeue_signal
+   39.00%     0.07%  signal1_process    [k] __sigqueue_free
+   38.94%     0.27%  signal1_process    [k] do_dec_rlimit_put_ucounts
+    8.60%     8.60%  signal1_process    [k] _find_next_or_bit
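
The _find_next_or_bit samples at the bottom of the profile come from the
cpumask walk inside the precise sum, which visits every online CPU's
slot under the counter's spinlock. Roughly, again paraphrased from
lib/percpu_counter.c rather than quoted verbatim:

s64 __percpu_counter_sum(struct percpu_counter *fbc)
{
	unsigned long flags;
	s64 ret;
	int cpu;

	raw_spin_lock_irqsave(&fbc->lock, flags);
	ret = fbc->count;
	/* One iteration per online (or dying) CPU: 384 on this machine. */
	for_each_cpu_or(cpu, cpu_online_mask, cpu_dying_mask)
		ret += *per_cpu_ptr(fbc->counters, cpu);
	raw_spin_unlock_irqrestore(&fbc->lock, flags);
	return ret;
}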

However, if we can implement a mechanism to unpin ucounts for
signal-pending operations (eliminating the need for the get/put refcount
operations), the percpu_counter approach should effectively resolve this
scalability issue. I am trying to figure this out; if you have any
ideas, please let me know and I will test them.
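
For illustration only, here is a hypothetical sketch of what the
increment hot path could shrink to if the ucounts references were pinned
elsewhere for as long as signals can be queued (the helper name and the
unwind policy are my assumptions, not a worked-out design):

static long inc_rlimit_ucounts_unpinned(struct ucounts *ucounts,
					enum rlimit_type type, long limit)
{
	struct ucounts *iter;
	long max = limit;

	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
		/* Per-cpu fast path; only sums when close to the limit. */
		if (!percpu_counter_limited_add(&iter->rlimit[type], max, 1))
			goto unwind;
		max = min(max, get_userns_rlimit_max(iter->ns, type));
	}
	return 1;

unwind:
	for (; ucounts != iter; ucounts = ucounts->ns->ucounts)
		percpu_counter_sub(&ucounts->rlimit[type], 1);
	return -1;
}

With no get_ucounts()/put_ucounts() on this path, the exact comparisons
against 0 and 1 disappear, and with them the precise sums.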

Thank you for your guidance on this matter.

Best regards,
Ridong

