Message-ID: <596bc7ac-3d24-43a7-9e7e-e59189525ebc@gmail.com>
Date: Fri, 23 Jan 2026 14:26:39 +0000
From: Pavel Begunkov <asml.silence@...il.com>
To: Jens Axboe <axboe@...nel.dk>, Yuhao Jiang <danisjiang@...il.com>
Cc: io-uring@...r.kernel.org, linux-kernel@...r.kernel.org,
 stable@...r.kernel.org
Subject: Re: [PATCH v2] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass by removing
 cross-buffer accounting

On 1/22/26 21:51, Pavel Begunkov wrote:
...
>>>> I already briefly touched on that earlier, for sure not going to be of
>>>> any practical concern.
>>>
>>> Modest 16 GB can give 1M entries. Assuming 50ns-100ns per entry for the
>>> xarray business, that's 50-100ms. It's all serialised, so multiply by
>>> the number of CPUs/threads, e.g. 10-100, that's 0.5-10s. Account sky
>>> high spinlock contention, and it jumps again, and there can be more
>>> memory / CPUs / numa nodes. Not saying that it's worse than the
>>> current O(n^2), I have a test program that borderline hangs the
>>> system.
>>
>> It's definitely not worse than the existing system, which is why I don't
>> think it's a big deal. Nobody has ever complained about time to register
>> buffers. It's inherently a slow path, and quite slow at that depending
>> on the use case. Out of curiosity, I ran some silly testing on
>> registering 16GB of memory, with 1..32 threads. Each will do 16GB, so
>> 512GB registered in total for the 32 case. Before is the current kernel,
>> after is with per-user xarray accounting:
>>
>> before
>>
>> nthreads 1:      646 msec
>> nthreads 2:      888 msec
>> nthreads 4:      864 msec
>> nthreads 8:     1450 msec
>> nthreads 16:    2890 msec
>> nthreads 32:    4410 msec
>>
>> after
>>
>> nthreads 1:      650 msec
>> nthreads 2:      888 msec
>> nthreads 4:      892 msec
>> nthreads 8:     1270 msec
>> nthreads 16:    2430 msec
>> nthreads 32:    4160 msec
>>
>> This includes both registering buffers, cloning all of them to another
>> ring, and unregistering times, and nowhere is locking scalability an
>> issue for the xarray manipulation. The box has 32 nodes and 512 CPUs. So
>> no, I strongly believe this isn't an issue.
>>
>> IOW, accurate accounting is cheaper than the stuff we have now. None of
>> them are super cheap. Does it matter? I really don't think so, or people
>> would've complained already. The only complaint I got on these kinds of
>> things was for cloning, which did get fixed up some releases ago.
> 
> You need compound pages
> 
> always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
> 
> And use update() instead of register() as accounting dedup for
> registration is broken-disabled. For the current kernel:
> 
> Single threaded:
> 1x1G: 7.5s
> 2x1G: 45s
> 4x1G: 190s
> 
> 16x should be ~3000s, not going to run it. Uninterruptible and no
> cond_resched, so spawn NR_CPUS threads and the system is completely
> unresponsive (I guess it depends on the preemption mode).
The program is below for reference, but it's trivial. THP setting
is done inside for convenience. There are ways to make the runtime
even worse, but that should be enough.


#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include "liburing.h"

#define NUM_THREADS 1
#define BUFFER_SIZE (1024UL * 1024 * 1024)
#define MAX_IOVS 64

static int num_iovs = 1;
static void *buffer;
static pthread_barrier_t barrier;

static void *thread_func(void *arg)
{
	struct io_uring ring;
	struct iovec iov[MAX_IOVS];
	int th_idx = (long)arg;
	int ret, i;

	/* one iovec per 1G chunk; only the first num_iovs entries are used */
	for (i = 0; i < num_iovs; i++) {
		iov[i].iov_base = (char *)buffer + i * BUFFER_SIZE;
		iov[i].iov_len  = BUFFER_SIZE;
	}

	ret = io_uring_queue_init(8, &ring, 0);
	if (ret) {
		fprintf(stderr, "ring init failed: %i\n", ret);
		return NULL;
	}

	ret = io_uring_register_buffers_sparse(&ring, MAX_IOVS);
	if (ret < 0) {
		fprintf(stderr, "reg sparse failed\n");
		return NULL;
	}

	pthread_barrier_wait(&barrier);

	ret = io_uring_register_buffers_update_tag(&ring, 0, iov, NULL, num_iovs);
	if (ret < 0)
		fprintf(stderr, "buffer update failed: %i\n", ret);

	printf("thread %i finished\n", th_idx);
	io_uring_queue_exit(&ring);
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t threads[NUM_THREADS];
	int sys_fd;
	int ret;

	if (argc != 2) {
		fprintf(stderr, "invalid number of arguments\n");
		return 1;
	}
	num_iovs = strtoul(argv[1], NULL, 0);
	if (num_iovs < 1 || num_iovs > MAX_IOVS) {
		fprintf(stderr, "num_iovs must be 1..%i\n", MAX_IOVS);
		return 1;
	}
	printf("register %i GB, num threads %i\n", num_iovs, NUM_THREADS);

	// always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
	sys_fd = open("/sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled", O_RDWR);
	if (sys_fd < 0) {
		fprintf(stderr, "thp sys open failed %i\n", errno);
		return 1;
	}

	const char str[] = "always";
	ret = write(sys_fd, str, sizeof(str));
	if (ret != sizeof(str)) {
		fprintf(stderr, "thp sys write failed %i\n", errno);
		return 1;
	}

	buffer = aligned_alloc(64 * 1024, BUFFER_SIZE * num_iovs);
	if (!buffer) {
		fprintf(stderr, "allocation failed\n");
		return 1;
	}
	memset(buffer, 0, BUFFER_SIZE * num_iovs);

	pthread_barrier_init(&barrier, NULL, NUM_THREADS);
	for (long i = 0; i < NUM_THREADS; i++) {
		ret = pthread_create(&threads[i], NULL, thread_func, (void *)i);
		if (ret) {
			fprintf(stderr, "pthread_create failed for thread %ld\n", i);
			return 1;
		}
	}

	for (int i = 0; i < NUM_THREADS; i++)
		pthread_join(threads[i], NULL);
	pthread_barrier_destroy(&barrier);
	free(buffer);
	return 0;
}
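For completeness, the reproducer builds against liburing in the usual way. The file name and the liburing install location are assumptions about the local setup; the sysfs write for THP requires root:

```shell
# assumes liburing headers/library are installed; adjust -I/-L as needed
gcc -O2 -o reg_repro reg_repro.c -luring -lpthread

# register 4 x 1G buffers; root needed for the THP sysfs write
sudo ./reg_repro 4
```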
