Message-ID: <aEp6tGUEFCQz1prh@nvidia.com>
Date: Wed, 11 Jun 2025 23:59:00 -0700
From: Nicolin Chen <nicolinc@...dia.com>
To: Jason Gunthorpe <jgg@...dia.com>
CC: Thomas Weißschuh <thomas.weissschuh@...utronix.de>,
Shuah Khan <shuah@...nel.org>, Shuah Khan <skhan@...uxfoundation.org>, "Willy
Tarreau" <w@....eu>, Thomas Weißschuh
<linux@...ssschuh.net>, Kees Cook <kees@...nel.org>, Andy Lutomirski
<luto@...capital.net>, Will Drewry <wad@...omium.org>, Mark Brown
<broonie@...nel.org>, Muhammad Usama Anjum <usama.anjum@...labora.com>,
<linux-kernel@...r.kernel.org>, <linux-kselftest@...r.kernel.org>
Subject: Re: [PATCH v4 09/14] selftests: harness: Move teardown conditional
into test metadata
On Wed, Jun 11, 2025 at 08:51:17PM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 11, 2025 at 04:43:00PM -0700, Nicolin Chen wrote:
> > So, the test case sets an alignment with HUGEPAGE_SIZE=512MB while
> > allocating buffer_size=64MB:
> > rc = posix_memalign(&self->buffer, HUGEPAGE_SIZE, variant->buffer_size);
> > vrc = mmap(self->buffer, variant->buffer_size, PROT_READ | PROT_WRITE,
> > this gives the self->buffer a location that is 512MB aligned, but
> > only mmap part of one 512MB huge page.
> >
> > On the other hand, _metadata->no_teardown was mmap() outside the
> > range of the [self->buffer, self->buffer + 64MB), but within the
> > range of [self->buffer, self->buffer + 512MB).
> >
> > E.g.
> > _metadata->no_teardown = 0xfffbfc610000 // inside range2 below
> > buffer=[0xfffbe0000000, fffbe4000000) // range1
> > buffer=[0xfffbe0000000, fffc00000000) // range2
> >
> > Then, the "vrc = mmap(..." overwrites the _metadata->no_teardown
> > location to NULL..
> >
> > The following change can fix, though it feels odd that the buffer
> > has to be preserved with the entire huge page:
> > ---------------------------------------------------------------
> > @@ -2024,3 +2027,4 @@ FIXTURE_SETUP(iommufd_dirty_tracking)
> >
> > - rc = posix_memalign(&self->buffer, HUGEPAGE_SIZE, variant->buffer_size);
> > + rc = posix_memalign(&self->buffer, HUGEPAGE_SIZE,
> > + __ALIGN_KERNEL(variant->buffer_size, HUGEPAGE_SIZE));
> > if (rc || !self->buffer) {
> > ---------------------------------------------------------------
> >
> > Any thought?
>
> This seems like something, variant->buffer_size should not
> be less than HUGEPAGE_SIZE I guess that is possible on 64K ARM64
>
> But I still don't quite get it..
>
> rc = posix_memalign(&self->buffer, HUGEPAGE_SIZE, variant->buffer_size);
>
> Should allocate buffer_size
>
> mmap_flags = MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED;
> mmap_flags |= MAP_HUGETLB | MAP_POPULATE;
> vrc = mmap(self->buffer, variant->buffer_size, PROT_READ | PROT_WRITE,
> mmap_flags, -1, 0);
>
> Should fail if buffer_size is not a multiple of HUGEPAGE_SIZE?
Yea, I think you are right. But..
> It certainly shouldn't mmap past the provided buffer_size!!!
>
> Are you seeing the above mmap succeed and also map beyond buffer -> buffer + buffer_size?
>
> I think that would be a kernel bug in MAP_HUGETLB!
..I did some tracing with bpftrace:
ksys_mmap_pgoff() addr=ffff80000000, len=4000000
hugetlb_file_setup(): size=0x20000000
hugetlb_reserve_pages() from=0, to=1
hugetlb_reserve_pages() returned: ret=1
hugetlb_file_setup() returned: size=0x20000000 ret=-281471746619776
vm_mmap_pgoff() addr=ffff80000000, len=20000000
do_mmap() addr=ffff80000000, len=20000000
hugetlb_reserve_pages() from=0, to=1
hugetlb_reserve_pages() returned: ret=1
do_mmap() returned: addr=0xffff80000000 ret=ffff80000000, pop=20000000
vm_mmap_pgoff() returned: addr=0xffff80000000 ret=ffff80000000
ksys_mmap_pgoff() returned: addr=0xffff80000000 ret=ffff80000000
We can see that ksys_mmap_pgoff() rounded the 64MB length (0x4000000)
up to 512MB (0x20000000) before passing it to hugetlb_file_setup(), at:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/mmap.c?h=v6.16-rc1#n594
" len = ALIGN(len, huge_page_size(hs)); "
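The rounding seen in the trace can be reproduced in user space. This is only a minimal sketch, not the kernel's code: align_up() is a hypothetical helper that performs the same power-of-two round-up as ALIGN(), and the 512MB hugepage size is taken from the trace above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round len up to the next multiple of align.
 * Hypothetical helper; align is assumed to be a non-zero power of two,
 * which holds for hugepage sizes.
 */
static uint64_t align_up(uint64_t len, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (len + align - 1) & ~(align - 1);
}
```

With len=0x4000000 and a hugepage size of 0x20000000, align_up() gives 0x20000000, which matches the len that vm_mmap_pgoff() and do_mmap() report in the trace.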
Looking at the comment here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/hugetlbfs/inode.c#n1521
"
/*
* Note that size should be aligned to proper hugepage size in caller side,
* otherwise hugetlb_reserve_pages reserves one less hugepages than intended.
*/
struct file *hugetlb_file_setup(const char *name, size_t size,
"
..I guess this function was supposed to reject a size that is not a
multiple of the hugepage size, as you remarked? But it can't do that
when the size passed in has already been rounded up by the caller..
It feels like a kernel bug as you suspect :-/
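To make the overlap concrete, the addresses from the example quoted earlier can be checked against both lengths. This is a sketch under stated assumptions: in_range() is a hypothetical helper, and the addresses and sizes are the ones from the quoted example, not new measurements.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if addr lies in the half-open range [start, start + len).
 * Hypothetical helper; written as addr - start < len to avoid overflow
 * in start + len.
 */
static bool in_range(uint64_t addr, uint64_t start, uint64_t len)
{
	assert(len != 0);
	return addr >= start && addr - start < len;
}
```

For _metadata->no_teardown at 0xfffbfc610000 and a buffer starting at 0xfffbe0000000, in_range() is false for the requested 64MB (0x4000000) but true for the rounded 512MB (0x20000000). Since MAP_FIXED replaces any existing mapping in the range, a 512MB mapping at that address would cover the no_teardown location.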
And I just found one more weird thing...
In iommufd.c selftest code, we have:
"static __attribute__((constructor)) void setup_sizes(void)"
which does another posix_memalign()/mmap() pair, although that one
doesn't set MAP_HUGETLB and shouldn't affect what comes next...
If I keep this code, the first hugepage test case passes (64MB
buffer_size; 512MB THP), but all the following cases fail, as I
reported here:
https://lore.kernel.org/all/aEm6tuzy7WK12sMh@nvidia.com/
If I remove this code, the hugepage test cases fail starting from the
first one, with signal 11. But this time it is not because mmap()
overwrites _metadata->no_teardown; the mmap() call itself crashes...
And in both the failing (crashed) case and the passing case, the
top-level kernel function ksys_mmap_pgoff() returned successfully,
which seems to mean the crash happened inside libc?
Thanks
Nicolin