Message-ID: <005161f7-d849-41a9-8350-f56e52c49e7e@lucifer.local>
Date: Sat, 17 May 2025 19:50:34 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: SeongJae Park <sj@...nel.org>
Cc: David Hildenbrand <david@...hat.com>, Usama Arif <usamaarif642@...il.com>,
        Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
        hannes@...xchg.org, shakeel.butt@...ux.dev, riel@...riel.com,
        ziy@...dia.com, laoar.shao@...il.com, baolin.wang@...ux.alibaba.com,
        Liam.Howlett@...cle.com, npache@...hat.com, ryan.roberts@....com,
        linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org,
        kernel-team@...a.com
Subject: Re: Is number of process_madvise()-able ranges limited to 8? (was
 Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process)

Hi SJ,

I'm happy to discuss this, and reply below, but I _think_ replying in this
thread is really not optimal, as we're digressing quite a bit from the
proposal/issue at hand and the cc list is now not quite aligned, potentially
creating confusion and noise.

I know you're acting in good faith, based on your (excellent :) series in
this area, so I presume it was just to provide context (as to why you're
raising it) more than anything else.

This is more of an 'email development is sucky' comment, but I _think_ this
would be better as a [DISCUSSION] thread, maybe linking back to this
original one or something.

But anyway, getting to the point - my answer is simple, so not much
discussion is _really_ required here - you're right, I'm wrong! :)

I go into detail inline below:

On Sat, May 17, 2025 at 09:20:48AM -0700, SeongJae Park wrote:
> Hi Lorenzo,
>
> On Fri, 16 May 2025 13:57:18 +0100 Lorenzo Stoakes <lorenzo.stoakes@...cle.com> wrote:
> [...]
> > Right now madvise() has limited utility because:
> [...]
> > - While you can perform multiple operations at once via process_madvise(),
> >   even to the current process (after my changes to extend it), it's limited
> >   to a single advice over 8 ranges.
>
> I'm a bit confused by the last part, since I'm understanding your point as
> the 'vlen' parameter of process_madvise() being limited to 8, but my test
> code below succeeds with 'vlen' parameter value 512.  Could you please
> enlighten me?

Let's keep this simple - I'm just wrong here :) apologies, entirely my
fault.

We have discussed this a few times before, and I suspect my incorrect
assertion on this has led to you also assuming the wrong thing (again,
apologies!).

But it does raise an important point - we need to re-examine your changes
(see [0]), where this assumption reduced the urgency of considering
contention issues.

Let's take a look at that again on Monday. Though I do strongly suspect
it's fine honestly. We just need to take a look...!

[0]: https://lore.kernel.org/linux-mm/5fc4e100-70d3-44c1-99f7-f8a5a6a0ba65@lucifer.local/

Anyway, let's dig into the code to get things right:

SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
		size_t, vlen, int, behavior, unsigned int, flags)
{
	...
	struct iovec iovstack[UIO_FASTIOV];
	struct iovec *iov = iovstack;
	struct iov_iter iter;
	...

	ret = import_iovec(ITER_DEST, vec, vlen, ARRAY_SIZE(iovstack), &iov, &iter);
	if (ret < 0)
		goto out;

	...
}

My mistake was assuming that UIO_FASTIOV was the hard limit on the number
of ranges. This is not the case - it's just an optimisation: if the iovec
array is small enough to fit in the on-stack array we use that, otherwise
we allocate one.

We can see this by examining the comment from import_iovec():

/*
 * ...
 * If the array pointed to by *@iov is large enough to hold all @nr_segs,
 * then this function places %NULL in *@iov on return. Otherwise, a new
 * array will be allocated and the result placed in *@iov. This means that
 * the caller may call kfree() on *@iov regardless of whether the small
 * on-stack array was used or not (and regardless of whether this function
 * returns an error or not).
 * ...
 */
ssize_t import_iovec(int type, const struct iovec __user *uvec,
		 unsigned nr_segs, unsigned fast_segs,
		 struct iovec **iovp, struct iov_iter *i)
{
	return __import_iovec(type, uvec, nr_segs, fast_segs, iovp, i,
			      in_compat_syscall());
}

Where nr_segs == vlen, fast_segs == UIO_FASTIOV (8), iovp is &iov, and I
think the iov referred to by the comment is *iovp (the only sensible
conclusion, really).

Looking into the code further we see:

ssize_t __import_iovec(int type, const struct iovec __user *uvec,
		 unsigned nr_segs, unsigned fast_segs, struct iovec **iovp,
		 struct iov_iter *i, bool compat)
{
	...
	iov = iovec_from_user(uvec, nr_segs, fast_segs, *iovp, compat);
	...
}

struct iovec *iovec_from_user(const struct iovec __user *uvec,
		unsigned long nr_segs, unsigned long fast_segs,
		struct iovec *fast_iov, bool compat)
{
	...
	if (nr_segs > UIO_MAXIOV)
		return ERR_PTR(-EINVAL);
	if (nr_segs > fast_segs) {
		iov = kmalloc_array(nr_segs, sizeof(struct iovec), GFP_KERNEL);
		if (!iov)
			return ERR_PTR(-ENOMEM);
	}
	...
}

So - this confirms it - we're fine, it just tries to use the stack-based
array if it can - otherwise it kmalloc()'s.

Of course, UIO_MAXIOV remains the _actual_ hard limit (hardcoded to 1,024
in include/uapi/linux/uio.h).
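
For completeness - and this is just a rough, untested userspace sketch of
mine (the helper name is made up), not anything from the kernel or your
series - if a caller ever did need to advise more than UIO_MAXIOV ranges in
one go, it could simply batch the calls from userspace:

#include <stddef.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#define BATCH_MAX 1024	/* mirrors UIO_MAXIOV in include/uapi/linux/uio.h */

/* Hypothetical helper: advise an arbitrary number of ranges in batches. */
static int process_madvise_batched(int pidfd, const struct iovec *vec,
				   size_t vlen, int behavior)
{
	while (vlen) {
		size_t batch = vlen > BATCH_MAX ? BATCH_MAX : vlen;
		ssize_t ret = syscall(SYS_process_madvise, pidfd, vec, batch,
				      behavior, 0);

		/*
		 * A negative return is an error (errno is set); a short
		 * positive return means only part of the batch was advised,
		 * which this sketch doesn't try to recover from.
		 */
		if (ret < 0)
			return -1;

		vec += batch;
		vlen -= batch;
	}
	return 0;
}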

The other points I made about the proposed interface remain, but I won't go
into more detail as we are obviously lacking that context here.

Thanks for bringing this up and correcting my misinterpretation, as well as
providing the repro code below. Let's revisit your old series... but on
Monday :)

I should really not be looking at work mail on a Saturday (mea culpa, once
again... :)

One small nit in the repro code below (hey I'm a kernel dev, can't help
myself... ;)

Cheers, Lorenzo

>
> Attaching my test code below.  You could simply run it as below.
>
>     gcc test.c && ./a.out
>
> ==== Attachment 0 (test.c) ====
> #define _GNU_SOURCE
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/mman.h>
> #include <sys/syscall.h>
> #include <sys/uio.h>
> #include <unistd.h>
>
> #define SZ_PAGE (4096)
> #define NR_PAGES (512)
> #define MMAP_SZ	(SZ_PAGE * NR_PAGES)
>
> int main(void)
> {
> 	char *buf;
> 	unsigned int i;
> 	int ret;
> 	pid_t pid = getpid();
> 	int pidfd = syscall(SYS_pidfd_open, pid, 0);
> 	struct iovec *vec;
>
> 	buf = mmap(NULL, MMAP_SZ, PROT_READ | PROT_WRITE, MAP_PRIVATE |
> 			MAP_ANON, -1, 0);
> 	if (buf == MAP_FAILED) {
> 		printf("mmap fail\n");
> 		return -1;
> 	}
>
> 	for (i = 0; i < MMAP_SZ; i++)
> 		buf[i] = 123;
>
> 	vec = malloc(sizeof(*vec) * NR_PAGES);
> 	for (i = 0; i < NR_PAGES; i++) {
> 		vec[i].iov_base = &buf[i * SZ_PAGE];
> 		vec[i].iov_len = SZ_PAGE;
> 	}
>
> 	ret = syscall(SYS_process_madvise, pidfd, vec, NR_PAGES,
> 			MADV_DONTNEED, 0);
> 	if (ret != MMAP_SZ) {
> 		printf("process_madvise fail\n");
> 		return -1;
> 	}

To be pedantic, you are really only checking whether an error was returned.
In theory no error might have been returned but the operation might not have
proceeded, so a more proper check here would be to populate the anon memory
with non-zero data, then check afterwards that it's zeroed.

Given this outcome would probably imply iovec issues, it's not likely, but
to really prove the point you'd probably want to do that!
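
Something like this after the process_madvise() call (a rough, untested
sketch on my part) would do it - MADV_DONTNEED on private anonymous memory
means subsequent reads see zero-filled pages, so the 123s written earlier
should all be gone:

	for (i = 0; i < MMAP_SZ; i++) {
		if (buf[i] != 0) {
			/* Data survived MADV_DONTNEED - advice not applied. */
			printf("MADV_DONTNEED did not take effect\n");
			return -1;
		}
	}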

>
> 	ret = munmap(buf, MMAP_SZ);
> 	if (ret) {
> 		printf("munmap failed\n");
> 		return -1;
> 	}
>
> 	close(pidfd);
> 	printf("good\n");
> 	return 0;
> }
