Message-ID: <efe080c9-5176-4fa1-9f65-5be44074779e@gmail.com>
Date: Thu, 22 Jan 2026 21:51:10 +0000
From: Pavel Begunkov <asml.silence@...il.com>
To: Jens Axboe <axboe@...nel.dk>, Yuhao Jiang <danisjiang@...il.com>
Cc: io-uring@...r.kernel.org, linux-kernel@...r.kernel.org,
stable@...r.kernel.org
Subject: Re: [PATCH v2] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass by removing
cross-buffer accounting
On 1/22/26 17:47, Jens Axboe wrote:
> On 1/22/26 4:43 AM, Pavel Begunkov wrote:
>> On 1/21/26 14:58, Jens Axboe wrote:
>>> On 1/20/26 2:45 PM, Pavel Begunkov wrote:
>>>> On 1/20/26 17:03, Jens Axboe wrote:
>>>>> On 1/20/26 5:05 AM, Pavel Begunkov wrote:
>>>>>> On 1/20/26 07:05, Yuhao Jiang wrote:
>>>> ...
>>>>>>>
>>>>>>> I've been implementing the xarray-based ref tracking approach for v3.
>>>>>>> While working on it, I discovered an issue with buffer cloning.
>>>>>>>
>>>>>>> If ctx1 has two buffers sharing a huge page, ctx1->hpage_acct[page] = 2.
>>>>>>> Clone to ctx2, now both have a refcount of 2. On cleanup both hit zero
>>>>>>> and unaccount, so we double-unaccount and user->locked_vm goes negative.
>>>>>>>
>>>>>>> The per-context xarray can't coordinate across clones - each context
>>>>>>> tracks its own refcount independently. I think we either need a global
>>>>>>> xarray (shared across all contexts), or just go back to v2. What do
>>>>>>> you think?
>>>>>>
>>>>>> Jens' diff is functionally equivalent to your v1 and has
>>>>>> exactly the same problems. Global tracking won't work well.
>>>>>
>>>>> Why not? My thinking was that we just use xa_lock() for this, with
>>>>> a global xarray. It's not like register+unregister is a high frequency
>>>>> thing. And if they are, then we've got much bigger problems than the
>>>>> single lock as the runtime complexity isn't ideal.
>>>>
>>>> 1. There could be quite a lot of entries even for a single ring
>>>> with a realistic amount of memory. If lots of threads start up
>>>> at the same time taking it in a loop, it might become a choke
>>>> point for large systems. Should be even more spectacular for
>>>> some NUMA setups.
>>>
>>> I already briefly touched on that earlier, for sure not going to be of
>>> any practical concern.
>>
>> A modest 16 GB can give 1M entries. Assuming 50ns-100ns per entry for the
>> xarray business, that's 50-100ms. It's all serialised, so multiply by
>> the number of CPUs/threads, e.g. 10-100, that's 0.5-10s. Account for
>> sky-high spinlock contention, and it jumps again, and there can be more
>> memory / CPUs / NUMA nodes. Not saying that it's worse than the
>> current O(n^2), I have a test program that borderline hangs the
>> system.
>
> It's definitely not worse than the existing system, which is why I don't
> think it's a big deal. Nobody has ever complained about time to register
> buffers. It's inherently a slow path, and quite slow at that depending
> on the use case. Out of curiosity, I ran some silly testing on
> registering 16GB of memory, with 1..32 threads. Each will do 16GB, so
> 512GB registered in total for the 32 case. Before is the current kernel,
> after is with per-user xarray accounting:
>
> before
>
> nthreads 1: 646 msec
> nthreads 2: 888 msec
> nthreads 4: 864 msec
> nthreads 8: 1450 msec
> nthreads 16: 2890 msec
> nthreads 32: 4410 msec
>
> after
>
> nthreads 1: 650 msec
> nthreads 2: 888 msec
> nthreads 4: 892 msec
> nthreads 8: 1270 msec
> nthreads 16: 2430 msec
> nthreads 32: 4160 msec
>
> This includes both registering buffers, cloning all of them to another
> ring, and unregistering times, and nowhere is locking scalability an
> issue for the xarray manipulation. The box has 32 nodes and 512 CPUs. So
> no, I strongly believe this isn't an issue.
>
> IOW, accurate accounting is cheaper than the stuff we have now. None of
> them are super cheap. Does it matter? I really don't think so, or people
> would've complained already. The only complaint I got on these kinds of
> things was for cloning, which did get fixed up some releases ago.

You need compound pages:

always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled

And use update() instead of register(), as accounting dedup for
registration is broken-disabled. For the current kernel,
single threaded:

1x1G: 7.5s
2x1G: 45s
4x1G: 190s

16x should be ~3000s, so I'm not going to run it. It's uninterruptible
and there is no cond_resched(), so spawn NR_CPUS threads and the system
is completely unresponsive (I guess it depends on the preemption mode).

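
For what it's worth, the ~3000s figure is just the 4x1G number scaled
quadratically. A rough sketch of that arithmetic, assuming "NxG" means N
buffers of 1G each and that the cost grows as n^2 (both are assumptions,
and the helper names are made up for illustration):

```python
# Timings quoted above (single threaded, current kernel): buffers -> seconds.
measured = {1: 7.5, 2: 45.0, 4: 190.0}


def extrapolate_quadratic(t_ref, n_ref, n):
    """Scale a reference timing to n buffers, assuming O(n^2) cost."""
    return t_ref * (n / n_ref) ** 2


def growth_ratios(samples):
    """Ratio between consecutive timings as the buffer count doubles."""
    keys = sorted(samples)
    return [samples[b] / samples[a] for a, b in zip(keys, keys[1:])]


if __name__ == "__main__":
    # Each doubling costs at least 4x, i.e. at least quadratic growth.
    print(growth_ratios(measured))
    # Quadratic scaling from the 4x1G data point to 16 buffers.
    print(extrapolate_quadratic(measured[4], 4, 16))
```

Every doubling in the measured data costs 4x or more, so the quadratic
estimate for 16x is, if anything, on the optimistic side.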
--
Pavel Begunkov