linux-kernel - Re: [PATCH v4 09/14] selftests: harness: Move teardown conditional into test metadata

Message-ID: <aEp6tGUEFCQz1prh@nvidia.com>
Date: Wed, 11 Jun 2025 23:59:00 -0700
From: Nicolin Chen <nicolinc@...dia.com>
To: Jason Gunthorpe <jgg@...dia.com>
CC: Thomas Weißschuh <thomas.weissschuh@...utronix.de>,
	Shuah Khan <shuah@...nel.org>, Shuah Khan <skhan@...uxfoundation.org>, "Willy
 Tarreau" <w@....eu>, Thomas Weißschuh
	<linux@...ssschuh.net>, Kees Cook <kees@...nel.org>, Andy Lutomirski
	<luto@...capital.net>, Will Drewry <wad@...omium.org>, Mark Brown
	<broonie@...nel.org>, Muhammad Usama Anjum <usama.anjum@...labora.com>,
	<linux-kernel@...r.kernel.org>, <linux-kselftest@...r.kernel.org>
Subject: Re: [PATCH v4 09/14] selftests: harness: Move teardown conditional
 into test metadata

On Wed, Jun 11, 2025 at 08:51:17PM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 11, 2025 at 04:43:00PM -0700, Nicolin Chen wrote:
> > So, the test case sets an alignment with HUGEPAGE_SIZE=512MB while
> > allocating buffer_size=64MB:
> > 	rc = posix_memalign(&self->buffer, HUGEPAGE_SIZE, variant->buffer_size);
> > 	vrc = mmap(self->buffer, variant->buffer_size, PROT_READ | PROT_WRITE,
> > this gives the self->buffer a location that is 512MB aligned, but
> > only mmap part of one 512MB huge page.
> > 
> > On the other hand, _metadata->no_teardown was mmap() outside the
> > range of the [self->buffer, self->buffer + 64MB), but within the
> > range of [self->buffer, self->buffer + 512MB).
> > 
> > E.g.
> >    _metadata->no_teardown = 0xfffbfc610000 // inside range2 below
> >    buffer=[0xfffbe0000000, fffbe4000000) // range1
> >    buffer=[0xfffbe0000000, fffc00000000) // range2
> > 
> > Then, the "vrc = mmap(..." overwrites the _metadata->no_teardown
> > location to NULL..
> > 
> > The following change can fix, though it feels odd that the buffer
> > has to be preserved with the entire huge page:
> > ---------------------------------------------------------------
> > @@ -2024,3 +2027,4 @@ FIXTURE_SETUP(iommufd_dirty_tracking)
> > 
> > -       rc = posix_memalign(&self->buffer, HUGEPAGE_SIZE, variant->buffer_size);
> > +       rc = posix_memalign(&self->buffer, HUGEPAGE_SIZE,
> > +                           __ALIGN_KERNEL(variant->buffer_size, HUGEPAGE_SIZE));
> >         if (rc || !self->buffer) {
> > ---------------------------------------------------------------
> > 
> > Any thought?
> 
> This seems like something, variant->buffer_size should not
> be less than HUGEPAGE_SIZE. I guess that is possible on 64K ARM64
> 
> But I still don't quite get it..
> 
>         rc = posix_memalign(&self->buffer, HUGEPAGE_SIZE, variant->buffer_size);
> 
> Should allocate buffer_size
> 
>         mmap_flags = MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED;
>                 mmap_flags |= MAP_HUGETLB | MAP_POPULATE;
>         vrc = mmap(self->buffer, variant->buffer_size, PROT_READ | PROT_WRITE,
>                    mmap_flags, -1, 0);
> 
> Should fail if buffer_size is not a multiple of HUGEPAGE_SIZE? 

Yea, I think you are right. But..

> It certainly shouldn't mmap past the provided buffer_size!!!
> 
> Are you seeing the above mmap succeed and also map beyond buffer -> buffer + buffer_size?
> 
> I think that would be a kernel bug in MAP_HUGETLB!

..I did some bpftrace:

ksys_mmap_pgoff() addr=ffff80000000, len=4000000
    hugetlb_file_setup(): size=0x20000000
        hugetlb_reserve_pages() from=0, to=1
        hugetlb_reserve_pages() returned: ret=1
    hugetlb_file_setup() returned: size=0x20000000 ret=-281471746619776
    vm_mmap_pgoff() addr=ffff80000000, len=20000000
        do_mmap() addr=ffff80000000, len=20000000
            hugetlb_reserve_pages() from=0, to=1
            hugetlb_reserve_pages() returned: ret=1
        do_mmap() returned: addr=0xffff80000000 ret=ffff80000000, pop=20000000
    vm_mmap_pgoff() returned: addr=0xffff80000000 ret=ffff80000000
ksys_mmap_pgoff() returned: addr=0xffff80000000 ret=ffff80000000

The trace shows ksys_mmap_pgoff() rounding the 64MB length up to
512MB before passing it to hugetlb_file_setup(), at:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/mmap.c?h=v6.16-rc1#n594
"		len = ALIGN(len, huge_page_size(hs));"

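To make the overlap concrete, here is a small userspace sketch of that
round-up using the addresses from the example quoted above. It is only
an illustration: align_up(), in_range(), overlap_demo() and the macro
names are made up for this sketch and are not kernel or selftest
symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: same round-up arithmetic as the kernel's ALIGN()
 * for a power-of-two alignment. */
static inline uint64_t align_up(uint64_t len, uint64_t align)
{
	return (len + align - 1) & ~(align - 1);
}

/* True if addr lies in the half-open range [start, start + len). */
static inline bool in_range(uint64_t addr, uint64_t start, uint64_t len)
{
	return addr >= start && addr - start < len;
}

/* Addresses and sizes copied from the example quoted above. */
#define BUF_START   0xfffbe0000000ULL
#define BUF_SIZE    0x4000000ULL      /* 64MB requested by the test */
#define HPAGE_SIZE  0x20000000ULL     /* 512MB huge page */
#define NO_TEARDOWN 0xfffbfc610000ULL

void overlap_demo(void)
{
	uint64_t mapped = align_up(BUF_SIZE, HPAGE_SIZE);

	/* Outside the 64MB the test asked for ... */
	assert(!in_range(NO_TEARDOWN, BUF_START, BUF_SIZE));
	/* ... but inside the 512MB that actually gets mapped. */
	assert(in_range(NO_TEARDOWN, BUF_START, mapped));
	printf("requested=0x%llx mapped=0x%llx\n",
	       (unsigned long long)BUF_SIZE, (unsigned long long)mapped);
}
```

With MAP_FIXED, anything already mapped inside the rounded-up range is
replaced, which would explain why _metadata->no_teardown gets wiped.
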
Looking at the comment here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/hugetlbfs/inode.c#n1521
"
/*
 * Note that size should be aligned to proper hugepage size in caller side,
 * otherwise hugetlb_reserve_pages reserves one less hugepages than intended.
 */
struct file *hugetlb_file_setup(const char *name, size_t size,
"

..I guess this function was supposed to fail the not-a-multiple
case, as you remarked? But it can't do that when the size passed
in has already been aligned to the huge page size by its caller..
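
The "one less" case in that comment can be illustrated with the numbers
from the trace. This is a sketch only, assuming the reservation count
is derived by shifting size right by the huge page shift;
hugepages_to_reserve(), reserve_demo() and HPAGE_SHIFT_512M are made-up
names, not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for "size >> huge_page_shift(h)"; not kernel code. */
static inline uint64_t hugepages_to_reserve(uint64_t size,
					    unsigned int hpage_shift)
{
	return size >> hpage_shift;
}

#define HPAGE_SHIFT_512M 29U	/* 512MB == 1 << 29 */

void reserve_demo(void)
{
	/* Unaligned 64MB: the shift truncates to zero pages, one fewer
	 * than the single huge page the mapping needs. */
	assert(hugepages_to_reserve(0x4000000ULL, HPAGE_SHIFT_512M) == 0);
	/* Already rounded up to 512MB by the caller: one page, which
	 * lines up with the from=0, to=1 seen in the trace. */
	assert(hugepages_to_reserve(0x20000000ULL, HPAGE_SHIFT_512M) == 1);
}
```

So by the time the size reaches this point, the original 64MB request
is no longer visible and there is nothing left to reject.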

It feels like a kernel bug as you suspect :-/


And I just found one more weird thing...

In the iommufd.c selftest code, we have:
"static __attribute__((constructor)) void setup_sizes(void)"
which does another posix_memalign()/mmap() pair, although that
one doesn't set MAP_HUGETLB and shouldn't affect what comes
next...

If I keep this code, the first hugepage test case can pass (64MB
buffer_size; 512MB THP), but all the following cases will fail,
as I reported here:
https://lore.kernel.org/all/aEm6tuzy7WK12sMh@nvidia.com/

If I remove this code, the hugepage test cases fail starting from
the first one, with signal 11. But this time it is not because
mmap() overwrites _metadata->no_teardown; it's because the mmap()
call itself crashed...

And in both the failing (crashed) case and the passing case, the
top-level kernel function ksys_mmap_pgoff() returned successfully,
which suggests the crash happened inside libc?

Thanks
Nicolin