Message-ID: <3b7e6088-7d92-4d5c-96c7-f8c0e2cc7745@kernel.dk>
Date: Thu, 22 Jan 2026 10:47:56 -0700
From: Jens Axboe <axboe@...nel.dk>
To: Pavel Begunkov <asml.silence@...il.com>,
 Yuhao Jiang <danisjiang@...il.com>
Cc: io-uring@...r.kernel.org, linux-kernel@...r.kernel.org,
 stable@...r.kernel.org
Subject: Re: [PATCH v2] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass by removing
 cross-buffer accounting

On 1/22/26 4:43 AM, Pavel Begunkov wrote:
> On 1/21/26 14:58, Jens Axboe wrote:
>> On 1/20/26 2:45 PM, Pavel Begunkov wrote:
>>> On 1/20/26 17:03, Jens Axboe wrote:
>>>> On 1/20/26 5:05 AM, Pavel Begunkov wrote:
>>>>> On 1/20/26 07:05, Yuhao Jiang wrote:
>>> ...
>>>>>>
>>>>>> I've been implementing the xarray-based ref tracking approach for v3.
>>>>>> While working on it, I discovered an issue with buffer cloning.
>>>>>>
>>>>>> If ctx1 has two buffers sharing a huge page, ctx1->hpage_acct[page] = 2.
>>>>>> Clone to ctx2, now both have a refcount of 2. On cleanup both hit zero
>>>>>> and unaccount, so we double-unaccount and user->locked_vm goes negative.
>>>>>>
>>>>>> The per-context xarray can't coordinate across clones - each context
>>>>>> tracks its own refcount independently. I think we either need a global
>>>>>> xarray (shared across all contexts), or just go back to v2. What do
>>>>>> you think?
>>>>>
>>>>> The Jens' diff is functionally equivalent to your v1 and has
>>>>> exactly same problems. Global tracking won't work well.
>>>>
>>>> Why not? My thinking was that we just use xa_lock() for this, with
>>>> a global xarray. It's not like register+unregister is a high frequency
>>>> thing. And if they are, then we've got much bigger problems than the
>>>> single lock as the runtime complexity isn't ideal.
>>>
>>> 1. There could be quite a lot of entries even for a single ring
>>> with realistic amount of memory. If lots of threads start up
>>> at the same time taking it in a loop, it might become a choking
>>> point for large systems. Should be even more spectacular for
>>> some numa setups.
>>
>> I already briefly touched on that earlier, for sure not going to be of
>> any practical concern.
> 
> Modest 16 GB can give 1M entries. Assuming 50ns-100ns per entry for the
> xarray business, that's 50-100ms. It's all serialised, so multiply by
> the number of CPUs/threads, e.g. 10-100, that's 0.5-10s. Account sky
> high spinlock contention, and it jumps again, and there can be more
> memory / CPUs / numa nodes. Not saying that it's worse than the
> current O(n^2), I have a test program that borderline hangs the
> system.
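
The back-of-envelope estimate quoted above can be reproduced with a few lines of Python. This is only a sketch of that arithmetic: the function name is hypothetical, and the inputs (1M entries, 50-100ns per entry, 10-100 threads, fully serialised) are the assumptions stated in the quoted message, not measurements.

```python
def serialized_cost_seconds(entries, ns_per_entry, nthreads):
    """Total time if every thread's xarray work runs one after another."""
    total_ns = entries * ns_per_entry * nthreads
    return total_ns / 1e9


ENTRIES = 1_000_000  # "Modest 16 GB can give 1M entries" (quoted assumption)

# Best case from the quote: 50ns per entry, 10 threads.
low = serialized_cost_seconds(ENTRIES, 50, 10)
# Worst case from the quote: 100ns per entry, 100 threads.
high = serialized_cost_seconds(ENTRIES, 100, 100)

print(f"estimated range: {low}s to {high}s")
```

The range matches the 0.5-10s in the quote. Lock contention is not modelled here.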

It's definitely not worse than the existing system, which is why I don't
think it's a big deal. Nobody has ever complained about time to register
buffers. It's inherently a slow path, and quite slow at that depending
on the use case. Out of curiosity, I ran some silly testing on
registering 16GB of memory, with 1..32 threads. Each will do 16GB, so
512GB registered in total for the 32 case. Before is the current kernel,
after is with per-user xarray accounting:

before

nthreads 1:      646 msec
nthreads 2:      888 msec
nthreads 4:      864 msec
nthreads 8:     1450 msec
nthreads 16:    2890 msec
nthreads 32:    4410 msec

after

nthreads 1:      650 msec
nthreads 2:      888 msec
nthreads 4:      892 msec
nthreads 8:     1270 msec
nthreads 16:    2430 msec
nthreads 32:    4160 msec

This includes the time to register the buffers, clone all of them to
another ring, and unregister them, and nowhere is locking scalability an
issue for the xarray manipulation. The box has 32 nodes and 512 CPUs. So
no, I strongly believe this isn't an issue.
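
To make the before/after numbers easier to compare, the relative change per thread count can be computed as below. A minimal sketch: the timings are copied from the tables above, and the helper name is made up for illustration.

```python
# Timings in msec, copied from the tables above, keyed by thread count.
before = {1: 646, 2: 888, 4: 864, 8: 1450, 16: 2890, 32: 4410}
after = {1: 650, 2: 888, 4: 892, 8: 1270, 16: 2430, 32: 4160}


def percent_change(old, new):
    """Relative change in percent; negative means 'after' is faster."""
    return (new - old) / old * 100.0


changes = {n: percent_change(before[n], after[n]) for n in before}

for n, pct in changes.items():
    print(f"nthreads {n:2d}: {pct:+6.1f}%")
```

Up to 4 threads the two runs are within roughly 3% of each other, and from 8 threads up the "after" run is about 6% to 16% faster.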

IOW, accurate accounting is cheaper than the stuff we have now. None of
them are super cheap. Does it matter? I really don't think so, or people
would've complained already. The only complaint I got on these kinds of
things was for cloning, which did get fixed up some releases ago.
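
For reference, the clone problem quoted at the top of this mail, and why shared tracking avoids it, can be illustrated with a toy model. This is not the kernel code. Plain dicts stand in for the xarray, a threading.Lock stands in for xa_lock(), and every class and function name here is hypothetical. It also assumes a clone takes its own reference on each page in the shared table.

```python
import threading


class PerContextAcct:
    """Toy model: every ring context owns a private page -> refcount map."""

    def __init__(self, user):
        self.user = user  # shared dict holding the user's locked_vm
        self.refs = {}

    def register(self, page):
        if self.refs.get(page, 0) == 0:
            self.user["locked_vm"] += 1  # first reference accounts the page
        self.refs[page] = self.refs.get(page, 0) + 1

    def clone(self):
        other = PerContextAcct(self.user)
        other.refs = dict(self.refs)  # refcounts copied, nothing accounted
        return other

    def unregister_all(self):
        for page in list(self.refs):
            del self.refs[page]
            self.user["locked_vm"] -= 1  # each context unaccounts on its own


class SharedAcct:
    """Toy model: one page -> refcount map shared by all contexts."""

    def __init__(self):
        self.locked_vm = 0
        self.refs = {}
        self.lock = threading.Lock()

    def get(self, page):
        with self.lock:
            if self.refs.get(page, 0) == 0:
                self.locked_vm += 1
            self.refs[page] = self.refs.get(page, 0) + 1

    def put(self, page):
        with self.lock:
            self.refs[page] -= 1
            if self.refs[page] == 0:
                del self.refs[page]
                self.locked_vm -= 1  # only the last reference unaccounts


def per_context_demo():
    user = {"locked_vm": 0}
    ctx1 = PerContextAcct(user)
    ctx1.register("hpage0")
    ctx1.register("hpage0")  # two buffers sharing one huge page
    ctx2 = ctx1.clone()
    ctx1.unregister_all()
    ctx2.unregister_all()
    return user["locked_vm"]


def shared_demo():
    acct = SharedAcct()
    ctx1 = ["hpage0", "hpage0"]
    for page in ctx1:
        acct.get(page)
    ctx2 = list(ctx1)  # clone: takes its own references (assumption)
    for page in ctx2:
        acct.get(page)
    for page in ctx1 + ctx2:
        acct.put(page)
    return acct.locked_vm


print(per_context_demo(), shared_demo())
```

In the per-context model the page is accounted once but unaccounted twice, so locked_vm ends up negative. In the shared model it returns to zero.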

> Look, I don't care what it'd be, whether it stutters or blows up the
> kernel, I only took a quick look since you pinged me and was asking
> "why not". If you don't want to consider my reasoning, as the
> maintainer you can merge whatever you like, and it'll be easier for
> me as I won't be wasting more time.

I do consider your reasoning, but you also need to consider mine rather
than assuming there's only one answer here, or that yours is invariably
the correct one and being stubborn about it. The above test obviously
isn't the end-all be-all of testing, but it would show if we had issues
with scaling to the extent that you assume.

It's also worth considering that for these kinds of parallel setups, by
far the most common use case is threads. Hence you're going to be
banging on the shared mm anyway for a lot of these memory-related setup
operations.

-- 
Jens Axboe
