Message-ID: <eea0d7c3-9aed-4c1f-8146-23b82e611899@kernel.dk>
Date: Fri, 23 Jan 2026 09:52:34 -0700
From: Jens Axboe <axboe@...nel.dk>
To: Pavel Begunkov <asml.silence@...il.com>,
Yuhao Jiang <danisjiang@...il.com>
Cc: io-uring@...r.kernel.org, linux-kernel@...r.kernel.org,
stable@...r.kernel.org
Subject: Re: [PATCH v2] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass by removing cross-buffer accounting
On 1/23/26 8:04 AM, Jens Axboe wrote:
> On 1/23/26 7:50 AM, Jens Axboe wrote:
>> On 1/23/26 7:26 AM, Pavel Begunkov wrote:
>>> On 1/22/26 21:51, Pavel Begunkov wrote:
>>> ...
>>>>>>> I already briefly touched on that earlier, for sure not going to be of
>>>>>>> any practical concern.
>>>>>>
>>>>>> Modest 16 GB can give 1M entries. Assuming 50ns-100ns per entry for the
>>>>>> xarray business, that's 50-100ms. It's all serialised, so multiply by
>>>>>> the number of CPUs/threads, e.g. 10-100, that's 0.5-10s. Account sky
>>>>>> high spinlock contention, and it jumps again, and there can be more
>>>>>> memory / CPUs / numa nodes. Not saying that it's worse than the
>>>>>> current O(n^2), I have a test program that borderline hangs the
>>>>>> system.
>>>>>
>>>>> It's definitely not worse than the existing system, which is why I don't
>>>>> think it's a big deal. Nobody has ever complained about time to register
>>>>> buffers. It's inherently a slow path, and quite slow at that depending
>>>>> on the use case. Out of curiosity, I ran some silly testing on
>>>>> registering 16GB of memory, with 1..32 threads. Each will do 16GB, so
>>>>> 512GB registered in total for the 32 case. Before is the current kernel,
>>>>> after is with per-user xarray accounting:
>>>>>
>>>>> before
>>>>>
>>>>> nthreads 1: 646 msec
>>>>> nthreads 2: 888 msec
>>>>> nthreads 4: 864 msec
>>>>> nthreads 8: 1450 msec
>>>>> nthreads 16: 2890 msec
>>>>> nthreads 32: 4410 msec
>>>>>
>>>>> after
>>>>>
>>>>> nthreads 1: 650 msec
>>>>> nthreads 2: 888 msec
>>>>> nthreads 4: 892 msec
>>>>> nthreads 8: 1270 msec
>>>>> nthreads 16: 2430 msec
>>>>> nthreads 32: 4160 msec
>>>>>
>>>>> This includes both registering buffers, cloning all of them to another
>>>>> ring, and unregistering times, and nowhere is locking scalability an
>>>>> issue for the xarray manipulation. The box has 32 nodes and 512 CPUs. So
>>>>> no, I strongly believe this isn't an issue.
>>>>>
>>>>> IOW, accurate accounting is cheaper than the stuff we have now. None of
>>>>> them are super cheap. Does it matter? I really don't think so, or people
>>>>> would've complained already. The only complaint I got on these kinds of
>>>>> things was for cloning, which did get fixed up some releases ago.
>>>>
>>>> You need compound pages
>>>>
>>>> always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
>>>>
>>>> And use update() instead of register() as accounting dedup for
>>>> registration is broken-disabled. For the current kernel:
>>>>
>>>> Single threaded:
>>>> 1x1G: 7.5s
>>>> 2x1G: 45s
>>>> 4x1G: 190s
>>>>
>>>> 16x should be ~3000s, not going to run it. Uninterruptible and no
>>>> cond_resched, so spawn NR_CPUS threads and the system is completely
>>>> unresponsive (I guess it depends on the preemption mode).
>>> The program is below for reference, but it's trivial. THP setting
>>> is done inside for convenience. There are ways to make the runtime
>>> even worse, but that should be enough.
>>
>> Thanks for sending that. Ran it on the same box, on current -git and
>> with user_struct xarray accounting. Modified it so that 2nd arg is
>> number of threads, for easy running:
>
> Should've tried 32x32 as well, that ends up going deep into "this sucks"
> territory:
>
> git
>
> good luck
>
> git + user_struct
>
> axboe@...25 ~> time ./ppage 32 32
> register 32 GB, num threads 32
>
> ________________________________________________________
> Executed in 16.34 secs fish external
> usr time 0.54 secs 497.00 micros 0.54 secs
> sys time 451.94 secs 55.00 micros 451.94 secs
OK, here's the same test with a per-ctx btree instead; otherwise the code
is the same:
axboe@...25 ~> for i in 1 2 4 8 16; time ./ppage $i $i; end
register 1 GB, num threads 1

________________________________________________________
Executed in   54.06 millis    fish           external
   usr time   41.70 millis  382.00 micros   41.32 millis
   sys time   10.64 millis  314.00 micros   10.33 millis

register 2 GB, num threads 2

________________________________________________________
Executed in  105.56 millis    fish           external
   usr time   60.65 millis  485.00 micros   60.16 millis
   sys time   40.11 millis    0.00 micros   40.11 millis

register 4 GB, num threads 4

________________________________________________________
Executed in  209.98 millis    fish           external
   usr time   38.57 millis  447.00 micros   38.12 millis
   sys time  190.61 millis    0.00 micros  190.61 millis

register 8 GB, num threads 8

________________________________________________________
Executed in  423.37 millis    fish           external
   usr time  130.50 millis  470.00 micros  130.03 millis
   sys time  380.80 millis    0.00 micros  380.80 millis

register 16 GB, num threads 16

________________________________________________________
Executed in  832.71 millis    fish           external
   usr time    0.27 secs    470.00 micros    0.27 secs
   sys time    1.04 secs      0.00 micros    1.04 secs
and the crazier cases:

axboe@...25 ~> time ./ppage 32 32
register 32 GB, num threads 32

________________________________________________________
Executed in    2.81 secs    fish           external
   usr time    0.71 secs    497.00 micros    0.71 secs
   sys time   19.57 secs    183.00 micros   19.57 secs
which isn't insane. Obviously also needs conditional rescheduling in the
page loops, as those can take a loooong time for large amounts of
memory.
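
For anyone who wants to redo the arithmetic, here is a minimal Python
sketch of the back-of-envelope estimate quoted above, plus the ratio
between the two 32x32 runs. The function name and the 16 KiB entry
granularity are assumptions made for illustration (the quoted estimate
only says 16 GB gives about 1M entries); the timings are copied from the
runs in this mail.

```python
GIB = 1024 ** 3
KIB = 1024


def serialized_cost_s(mem_bytes, entry_bytes, ns_per_entry, nthreads):
    """Seconds spent if every per-entry update is fully serialised.

    Hypothetical helper: entries * per-entry cost * number of threads.
    """
    entries = mem_bytes // entry_bytes
    return entries * ns_per_entry * nthreads / 1e9


# Assumption: one accounting entry per 16 KiB of registered memory.
entries = 16 * GIB // (16 * KIB)
low = serialized_cost_s(16 * GIB, 16 * KIB, ns_per_entry=50, nthreads=10)
high = serialized_cost_s(16 * GIB, 16 * KIB, ns_per_entry=100, nthreads=100)
print(f"entries: {entries}")
print(f"estimated serialised cost: {low:.2f}s .. {high:.2f}s")

# Measured 32x32 runs from this mail: git + user_struct vs per-ctx btree.
wall_ratio = 16.34 / 2.81
sys_ratio = 451.94 / 19.57
print(f"wall-clock ratio: {wall_ratio:.1f}x, sys time ratio: {sys_ratio:.1f}x")
```

This only reproduces the rough model from the thread; it says nothing
about actual lock contention or NUMA effects.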
--
Jens Axboe