linux-kernel - Re: [PATCH] selftests/mm: add test for (BATCH_PROCESS)MADV

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8fb737cc-ba83-4949-b4fb-2a2e1af0967a@lucifer.local>
Date: Mon, 23 Jun 2025 13:35:10 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: wang lian <lianux.mm@...il.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, Shuah Khan <shuah@...nel.org>,
        Christian Brauner <brauner@...nel.org>, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org, linux-kselftest@...r.kernel.org,
        SeongJae Park <sj@...nel.org>, zijing.zhang@...ton.me,
        ryncsn@...il.com, p1ucky0923@...il.com, gkwang@...x-info.com,
        "Liam R. Howlett" <Liam.Howlett@...cle.com>,
        David Hildenbrand <david@...hat.com>, Vlastimil Babka <vbabka@...e.cz>,
        Jann Horn <jannh@...gle.com>
Subject: Re: [PATCH] selftests/mm: add test for (BATCH_PROCESS)MADV_DONTNEED

+cc Liam, David, Vlastimil, Jann

(it might not be obvious from get_maintainers.pl but please cc
maintainers/reviewers of the thing you are adding a test for, thanks!)

Overall I'm not in favour of us taking this patch.

There are a number of issues with it (explained inline below), but those aside,
it seems to be:

- Checking whether a simple anon buffer of arbitrary size is zapped by
  MADV_DONTNEED.

- Printing out a dubious microbenchmark that seems to be mostly asserting that
  fewer sycalls are faster when using process_madvise() locally.

And I'm struggling to see the value of that.

The test is also slow and will slow down a test run for little benefit.

On Sat, Jun 21, 2025 at 09:30:03PM +0800, wang lian wrote:
> From: Lian Wang <lianux.mm@...il.com>
>
> Let's add a simple test for MADV_DONTNEED and PROCESS_MADV_DONTNEED,

I'm not sure what PROCESS_MADV_DONTNEED is?

you mean process_madvise(..., MADV_DONTNEED, ...)?

> and inspired by SeongJae Park's test at GitHub[1] add batch test
> for PROCESS_MADV_DONTNEED,but for now it influence by workload and
> need add some race conditions test.We can add it later.
>
> Signed-off-by: Lian Wang <lianux.mm@...il.com>
> References
> ==========
>
> [1] https://github.com/sjp38/eval_proc_madvise
>
> ---
>  tools/testing/selftests/mm/.gitignore      |   1 +
>  tools/testing/selftests/mm/Makefile        |   1 +
>  tools/testing/selftests/mm/madv_dontneed.c | 220 +++++++++++++++++++++
>  tools/testing/selftests/mm/run_vmtests.sh  |   5 +
>  4 files changed, 227 insertions(+)
>  create mode 100644 tools/testing/selftests/mm/madv_dontneed.c
>
> diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
> index 824266982aa3..911f39d634be 100644
> --- a/tools/testing/selftests/mm/.gitignore
> +++ b/tools/testing/selftests/mm/.gitignore
> @@ -25,6 +25,7 @@ pfnmap
>  protection_keys
>  protection_keys_32
>  protection_keys_64
> +madv_dontneed
>  madv_populate
>  uffd-stress
>  uffd-unit-tests
> diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
> index ae6f994d3add..2352252f3914 100644
> --- a/tools/testing/selftests/mm/Makefile
> +++ b/tools/testing/selftests/mm/Makefile
> @@ -67,6 +67,7 @@ TEST_GEN_FILES += hugepage-mremap
>  TEST_GEN_FILES += hugepage-shm
>  TEST_GEN_FILES += hugepage-vmemmap
>  TEST_GEN_FILES += khugepaged
> +TEST_GEN_FILES += madv_dontneed
>  TEST_GEN_FILES += madv_populate
>  TEST_GEN_FILES += map_fixed_noreplace
>  TEST_GEN_FILES += map_hugetlb
> diff --git a/tools/testing/selftests/mm/madv_dontneed.c b/tools/testing/selftests/mm/madv_dontneed.c
> new file mode 100644
> index 000000000000..b88444da7f9e
> --- /dev/null
> +++ b/tools/testing/selftests/mm/madv_dontneed.c
> @@ -0,0 +1,220 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * MADV_DONTNEED and PROCESS_MADV_DONTNEED tests
> + *
> + * Copyright (C) 2025, Linx Software Corp.
> + *
> + * Author(s): Lian Wang <lianux.mm@...il.com>
> + */
> +#define _GNU_SOURCE
> +#include <stdlib.h>
> +#include <string.h>
> +#include <stdbool.h>
> +#include <stdint.h>
> +#include <unistd.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <linux/mman.h>
> +#include <sys/mman.h>
> +#include <sys/syscall.h>
> +#include "vm_util.h"
> +#include <time.h>
> +
> +#include "../kselftest.h"
> +
> +/*
> + * For now, we're using 2 MiB of private anonymous memory for all tests.
> + */
> +#define SIZE (256 * 1024 * 1024)

This is 256 MB?

Also you don't need to do:

/*
 * foo bar baz bar foo
 */

Just do:

/* foo bar baz bar foo */

> +
> +static size_t pagesize;
> +
> +static void sense_support(void)
> +{
> +	char *addr;
> +	int ret;
> +
> +	addr = mmap(0, pagesize, PROT_READ | PROT_WRITE,
> +		    MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);

This is incorrect, you need to specify -1 for fd for anon mappings, not zero.

Also the first parameter in mmap() is a pointer, use NULL.

> +	if (!addr)
> +		ksft_exit_fail_msg("mmap failed\n");

This is incorrect, you need to check for MAP_FAILED.

> +
> +	ret = madvise(addr, pagesize, MADV_DONTNEED);
> +	if (ret)
> +		ksft_exit_skip("MADV_DONTNEED is not available\n");

This is incorrect, you're testing for a current kernel, MADV_DONTNEED is very
definitely always available.

I'm guessing you've copy/pasted from madv_populate.c, which really shouldn't be
'sensing' if this option is available either - the tests are for the most recent
kernel, in which you know the behaviour is supported.

Anyway you should just drop this.

> +
> +	munmap(addr, pagesize);
> +}
> +
> +/*
> + * Read pagemap to check page is present in mermory
> + */
> +static bool is_page_present(void *addr)

This seems a flakey thing to test? While unlikely, a page might be swapped
out. So you will want to add a check for this also.

> +{
> +	uintptr_t vaddr = (uintptr_t)addr;
> +	uintptr_t offset = (vaddr / pagesize) * sizeof(uint64_t);
> +	ssize_t bytes_read;
> +	uint64_t entry;
> +	bool ret;
> +	int fd;
> +
> +	fd = open("/proc/self/pagemap", O_RDONLY);
> +	if (fd < 0) {
> +		ksft_exit_fail_msg("opening pagemap failed\n");
> +		ret = false;

Why are you doing stuff after ksft_exit_fail_msg()? As the name suggests, it
literally exits the whole process.

> +	}
> +
> +	if ((lseek(fd, offset, SEEK_SET)) == -1) {
> +		close(fd);
> +		ret = false;
> +	}
> +
> +	bytes_read = read(fd, &entry, sizeof(entry));
> +	close(fd);
> +
> +	if (bytes_read != sizeof(entry)) {
> +		perror("read failed");
> +		return false;
> +	}
> +
> +	if (entry & (1ULL << 63))

This is a 'magical number'. Please #define something like:

#define PAGEMAP_PRESENT_BIT (63)
#define PAGEMAP_PRESENT_MASK (1ULL << 63)

> +		ret = true;
> +
> +	return ret;
> +}
> +
> +/*
> + * test madvsise_dontneed

Useless comment.

> + */
> +static void test_madv_dontneed(void)
> +{
> +	unsigned long rss_anon_before, rss_anon_after;
> +	bool present, rss;
> +	char *addr;
> +	int ret;
> +
> +	ksft_print_msg("[RUN] %s\n", __func__);

Not sure why we need so much noise.

> +
> +	addr = mmap(0, SIZE, PROT_READ | PROT_WRITE,
> +		    MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);

Again this is incorrect - you need to set -1 for the fd, NULL for the address.

> +	if (!addr)
> +		ksft_exit_fail_msg("mmap failed\n");

This is wildly incorrect and means you will segfault the test, check for
MAP_FAILED.

> +
> +	memset(addr, 0x7A, SIZE);

I've said it elsewhere but I'm not sure you should write 256 megabytes of data
when you could write 65,536 to fault in memory.

Or even use MADV_POPULATE_READ/WRITE...?

> +
> +	rss_anon_before = rss_anon();
> +	if (!rss_anon_before)
> +		ksft_exit_fail_msg("No RssAnon is allocated before dontneed\n");

This seems copy/pasted from split_huge_page_test.c.

I mean I'm not sure it's really useful to assert that RSS goes down. Again, if
RSS doesn't go down we have bigger problems...

> +	ret = madvise(addr, SIZE, MADV_DONTNEED);
> +	ksft_test_result(!ret, "MADV_DONTNEED\n");
> +
> +	rss_anon_after = rss_anon();
> +	if (rss_anon_after < rss_anon_before)
> +		rss = true;
> +	ksft_test_result(rss, "MADV_DONTNEED rss is correct\n");

This is incorrect, rss is uninitialised here so you're reading uninitialised
memory unless your condition is met.

I'm not sure why you don't just pass in the condition... or use
kselftest_harness rather than do everything manually here.

> +
> +	for (size_t i = 0; i < SIZE; i += pagesize) {
> +		present = is_page_present(addr + i);

You're doing this in a super slow way, you're checking 256 MB's worth of memory
via /proc/pagemap, opening this every single time

> +		if (present) {
> +			ksft_print_msg("Page not zero at offset %zu\n",
> +				       (size_t)i);

I'm not sure what 'page not zero' means.

And also if this failed, you're going to spam like crazy. 256 * 1024 / 4 =
65,536 lines of output if somehow, in an imagined scenario where zapping no
longer works, this fails.

> +		}
> +	}
> +
> +	ksft_test_result(!present, "MADV_DONTNEED page is present\n");

Are you not duplicating test result here? So this function named test_xxx()
isn't actually testing 1 thing it's testing multiple things output as separate
test results...

kselftest_harness again would help you a lot.

> +	munmap(addr, SIZE);
> +}
> +
> +/*
> + * Measure performance of batched process_madvise vs madvise
> + */
> +static int measure_process_madvise_batching(int hint, int total_size,

I have no idea what you mean by 'hint'?

You're passing in MADV_DONTNEED, surely this should be 'behavior'?

> +					    int single_unit, int batch_size)

On my system this took a long time to run, and I have a powerful system. I just
don't think this is any way appropriate or suited to mm self tests, sorry.

At the very least it should be hugely sped up.

> +{
> +	struct iovec *vec = malloc(sizeof(*vec) * batch_size);
> +	struct timespec start, end;
> +	unsigned long elapsed_ns = 0;
> +	unsigned long nr_measures = 0;
> +	pid_t pid = getpid();
> +	char *buf;
> +	int pidfd;
> +
> +	pidfd = syscall(SYS_pidfd_open, pid, 0);
> +	if (pidfd == -1) {
> +		perror("pidfd_open fail");
> +		return -1;

Why -1? And this is just ignored by caller?

In any case you don't need to do any of this, just pass PIDFD_SELF, I went to
great pains to add this and have process_madvise() work locally so you don't
have to do this kind of thing :)

> +	}
> +
> +	buf = mmap(NULL, total_size, PROT_READ | PROT_WRITE,
> +		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

Hm strangely here you invoke mmap() correctly (+ test for failure correctly) but
not elsewhere?

I don't know why you get the pidfd first before doing this.

> +	if (buf == MAP_FAILED) {
> +		perror("mmap fail");

Are we good to just randomly perror() in test code? Not 100% sure that will
work?

> +		goto out;
> +	}
> +
> +	if (!vec) {
> +		perror("malloc vec failed");
> +		goto unmap_out;
> +	}

Why are you checking whether a malloc() failed a million years after
malloc()'ing?

> +
> +	while (elapsed_ns < 5UL * 1000 * 1000 * 1000) {

This is just horrible. Completely arbitrary and a magical number.

And making these tests run for 3 * 5 seconds to not assert anything but to just
print out some stats is not ok.

> +		memset(buf, 0x7A, total_size);

I don't know why you feel the need to memset() the whole thing each time? If the
point is faulting in maybe better to just fault in each page?

> +
> +		clock_gettime(CLOCK_MONOTONIC, &start);
> +
> +		for (int off = 0; off < total_size;
> +		     off += single_unit * batch_size) {
> +			for (int i = 0; i < batch_size; i++) {
> +				vec[i].iov_base = buf + off + i * single_unit;
> +				vec[i].iov_len = single_unit;
> +			}
> +			syscall(SYS_process_madvise, pidfd, vec, batch_size,
> +				hint, 0);

Yeah 'hint' here is awful as a name, it's actively misleading. You're not
'hinting' at a process_madvise() behaviour, you're specifying it.

Hint usually implies something like 'map this around <hint> address'.

> +		}
> +
> +		clock_gettime(CLOCK_MONOTONIC, &end);
> +		elapsed_ns += (end.tv_sec - start.tv_sec) * 1e9 +
> +			      (end.tv_nsec - start.tv_nsec);

> +		nr_measures++;
> +	}
> +
> +	ksft_print_msg("[RESULT] batch=%d time=%.3f us/op\n", batch_size,
> +		       (double)(elapsed_ns / nr_measures) /
> +			       (total_size / single_unit));
> +
> +	free(vec);

Surely this should be under out?

> +unmap_out:
> +	munmap(buf, total_size);
> +out:
> +	close(pidfd);
> +	return 0;
> +}

Anyway I seriously question the value of this. The syscall overhead savings are
pretty obvious, but I'm not sure what we're asserting here? Are we trying to
avoid perf regressions in future? Who will read this?

Just an 'advert for use of process_madvise()' I think is not really a great
justification.

> +
> +static void test_perf_batch_process(void)
> +{
> +	ksft_print_msg("[RUN] %s\n", __func__);
> +	measure_process_madvise_batching(MADV_DONTNEED, SIZE, pagesize, 1);
> +	measure_process_madvise_batching(MADV_DONTNEED, SIZE, pagesize, 2);
> +	measure_process_madvise_batching(MADV_DONTNEED, SIZE, pagesize, 4);

I wonder whether this really measures anything re: TLB batching, because you're
doing 4x fewer syscalls in the last instance over probably millions of
invocations.

Asserting 'fewer syscalls are cheaper' is you know, pretty obvious :)

> +	ksft_test_result(1, "All test were done\n");

This is really weird output, it reads like you're saying all tests have passed
or something, but these aren't tests, and this isn't a test result.

> +}
> +
> +int main(int argc, char **argv)
> +{
> +	int err;
> +
> +	pagesize = getpagesize();
> +
> +	ksft_print_header();
> +	ksft_set_plan(4);

I absolutely hate this manual style of writing tests. We kselftest_harness -
there's literally no need to write tests in this unmaintainable way any more.

> +
> +	sense_support();
> +	test_madv_dontneed();
> +	test_perf_batch_process();
> +
> +	err = ksft_get_fail_cnt();
> +	if (err)
> +		ksft_exit_fail_msg("%d out of %d tests failed\n", err,
> +				   ksft_test_num());
> +	ksft_exit_pass();

More useless manual stuff. Use kselftest_harness.

> +}
> diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
> index dddd1dd8af14..f96d43153fc0 100755
> --- a/tools/testing/selftests/mm/run_vmtests.sh
> +++ b/tools/testing/selftests/mm/run_vmtests.sh
> @@ -47,6 +47,8 @@ separated by spaces:
>  	hmm smoke tests
>  - madv_guard
>  	test madvise(2) MADV_GUARD_INSTALL and MADV_GUARD_REMOVE options
> +- madv_dontneed
> +	test memadvise(2) MADV_DONTNEED and PROCESS_MADV_DONTNEED options
>  - madv_populate
>  	test memadvise(2) MADV_POPULATE_{READ,WRITE} options
>  - memfd_secret
> @@ -422,6 +424,9 @@ CATEGORY="hmm" run_test bash ./test_hmm.sh smoke
>  # MADV_GUARD_INSTALL and MADV_GUARD_REMOVE tests
>  CATEGORY="madv_guard" run_test ./guard-regions
>
> +# MADV_DONTNEED and PROCESS_DONTNEED tests
> +CATEGORY="madv_dontneed" run_test ./madv_dontneed
> +
>  # MADV_POPULATE_READ and MADV_POPULATE_WRITE tests
>  CATEGORY="madv_populate" run_test ./madv_populate
>
> --
> 2.43.0
>
>
>