Message-ID: <aFwt6wjuDzbWM4_C@x1.local>
Date: Wed, 25 Jun 2025 13:12:11 -0400
From: Peter Xu <peterx@...hat.com>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: "Liam R. Howlett" <Liam.Howlett@...cle.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
linux-kernel@...r.kernel.org, linux-mm@...ck.org,
kvm@...r.kernel.org, Andrew Morton <akpm@...ux-foundation.org>,
Alex Williamson <alex.williamson@...hat.com>,
Zi Yan <ziy@...dia.com>, Alex Mastro <amastro@...com>,
David Hildenbrand <david@...hat.com>,
Nico Pache <npache@...hat.com>
Subject: Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED
mappings
On Wed, Jun 25, 2025 at 10:07:11AM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 24, 2025 at 08:48:45PM -0400, Peter Xu wrote:
> > > My feeling, and the reason I used the phrase "pgoff aligned address",
> > > is that the owner of the file should already ensure that for the large
> > > PTEs/folios:
> > > pgoff % 2**order == 0
> > > physical % 2**order == 0
> >
> > IMHO there shouldn't really be any hard requirement in mm that pgoff and
> > physical address space need to be aligned.. but I confess I don't have an
> > example driver that didn't do that in the linux tree.
>
> Well, maybe, but right now there does seem to be for
> THP/hugetlbfs/etc. It is a nice simple solution that exposes the
> alignment requirements to userspace if it wants to use MAP_FIXED.
>
> > > To me this just keeps things simpler. I guess if someone comes up with
> > > a case where they really can't get a pgoff alignment and really need a
> > > high order mapping then maybe we can add a new return field of some
> > > kind (pgoff adjustment?) but that is so weird I'd leave it to the
> > > future person to come and justify it.
> >
> > When looking more, I also found some special cased get_unmapped_area() that
> > may not be trivially converted into the new API even for CONFIG_MMU, namely:
> >
> > - io_uring_get_unmapped_area
> > - arena_get_unmapped_area (from bpf_map->ops->map_get_unmapped_area)
> >
> > I'll need to have some closer look tomorrow. If any of them cannot be 100%
> > safely converted to the new API, I'd also think we should not introduce the
> > new API, but reuse get_unmapped_area() until we know a way out.
>
> Oh yuk. It is trying to avoid the dcache flush on some kernel paths
> for virtually tagged cache systems.
>
> Arguably this fixup should not be in io_uring, but conveying the right
> information to the core code, and requesting a special flush
> avoidance mapping is not so easy.
IIUC it still makes sense for this to live in io_uring, because only the
io_uring subsystem knows what to align against. I don't yet see how generic
mm could do this; after all, generic mm doesn't know the address that
io_uring is using (from io_region_get_ptr()).
>
> But again I suspect the pgoff is the right solution.
>
> IIRC this is handled by forcing a few low virtual address bits to
> always match across all user mappings (the colour) via the pgoff. This
> way the userspace always uses the same cache tag and doesn't become
> cache incoherent. ie:
>
> user_addr % PAGE_SIZE*N == pgoff % PAGE_SIZE*N
>
> The issue is now the kernel is using the direct map and we can't force
After reading the two use cases, I mostly agree. One small thing to
mention: it may not be the direct map but vmap() (see io_region_init_ptr()).
> a random jumble of pages to have the right colours to match
> userspace. So the kernel has all those dcache flushes sprinkled about
> before it touches user mapped memory through the direct map as the
> kernel will use a different colour and cache tag.
>
> So.. if iouring selects a pgoff that automatically gives the right
> colour for the userspace mapping to also match the kernel mapping's
> colour then things should just work.
>
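To make the colour rule concrete, here is a minimal userspace sketch (not
kernel code) of the congruence that has to hold between the kernel mapping
and the user mapping. EXAMPLE_SHM_COLOUR, colour_of() and same_colour() are
hypothetical names used only for illustration, and the 4 MiB window is an
assumed value; the real SHM_COLOUR is defined per architecture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed illustrative colour window; the real constant is arch-specific. */
#define EXAMPLE_SHM_COLOUR 0x00400000UL

/* Offset of an address within one colour window (the "colour"). */
static inline uintptr_t colour_of(uintptr_t addr, uintptr_t colour_size)
{
	/* The mask trick below only works for a power-of-two window. */
	assert(colour_size && (colour_size & (colour_size - 1)) == 0);
	return addr & (colour_size - 1);
}

/* True if both mappings land on the same colour, i.e. cannot alias. */
static inline bool same_colour(uintptr_t kernel_va, uintptr_t user_va,
			       uintptr_t colour_size)
{
	return colour_of(kernel_va, colour_size) ==
	       colour_of(user_va, colour_size);
}
```

If same_colour() holds for the kernel VA and the address handed back to
userspace, both sides index the same cache lines and no extra dcache flush
should be needed for coherency, which is what the pgoff trick aims for.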
> Frankly I'm shocked that someone invested time in trying to make this
> work - the commit log says it was for parisc and only 2 years ago :(
>
> d808459b2e31 ("io_uring: Adjust mapping wrt architecture aliasing requirements")
>
> I thought such physically tagged cache systems were long ago dead and
> buried..
Yeah.. the internet says parisc hasn't shipped since 2005. Apparently
there are still people running io_uring on parisc systems, more or less.
This change seems to be required to make io_uring work on parisc or any
VIPT cache system.
>
> Shouldn't this entirely reject MAP_FIXED too?
It already does, see io_uring_get_unmapped_area() (from the parisc change):

	/*
	 * Do not allow to map to user-provided address to avoid breaking the
	 * aliasing rules. Userspace is not able to guess the offset address of
	 * kernel kmalloc()ed memory area.
	 */
	if (addr)
		return -EINVAL;

I don't know of anyone who would use MAP_FIXED with addr=0, so failing on
addr!=0 effectively rejects almost all MAP_FIXED callers already.

Side topic, but... logically speaking MAP_FIXED should really be fine when
!SHM_COLOUR. That commit should have broken MAP_FIXED for everyone on
io_uring, but I guess nobody really uses MAP_FIXED for io_uring fds..

It's also utterly confusing to set addr=ptr for parisc: ptr is a kernel VA,
not a user VA, so using it as the address hint will (AFAIU) 100% fail the
later STACK_SIZE checks.. IMHO we should really change this to:

diff --git a/io_uring/memmap.c b/io_uring/memmap.c
index 725dc0bec24c..1225a9393dc5 100644
--- a/io_uring/memmap.c
+++ b/io_uring/memmap.c
@@ -380,12 +380,10 @@ unsigned long io_uring_get_unmapped_area(struct file *filp, unsigned long addr,
 	 */
 	filp = NULL;
 	flags |= MAP_SHARED;
-	pgoff = 0;	/* has been translated to ptr above */
 #ifdef SHM_COLOUR
-	addr = (uintptr_t) ptr;
-	pgoff = addr >> PAGE_SHIFT;
+	pgoff = (uintptr_t)ptr >> PAGE_SHIFT;
 #else
-	addr = 0UL;
+	pgoff = 0;	/* has been translated to ptr above */
 #endif
 	return mm_get_unmapped_area(current->mm, filp, addr, len, pgoff, flags);
 }

And avoid the confusing "addr=ptr" setup. This might be too off-topic,
though.

Then I also looked at the other use case, the bpf arena, which doubles len
when requesting the VA range and then rounds the result up to 4G:

arena_get_unmapped_area():

	ret = mm_get_unmapped_area(current->mm, filp, addr, len * 2, 0, flags);
	...
	return round_up(ret, SZ_4G);

AFAIU, this is buggy.. at the very least we should check that the range
starting at round_up(ret, SZ_4G) still falls within the reserved
[ret, ret+2*len) region; otherwise AFAIU we can return an address that
might already be used by other VMAs..
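
A rough userspace sketch of the check I have in mind, only to illustrate
the arithmetic. EX_SZ_4G, ex_round_up() and rounded_range_fits() are
made-up names, and the real arena code may have extra conditions hidden in
the "..." above that this sketch knows nothing about:

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the kernel's SZ_4G, for illustration only. */
#define EX_SZ_4G (1ULL << 32)

/* Round x up to the next multiple of align (align must be a power of two). */
static inline uint64_t ex_round_up(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/*
 * True if [round_up(ret, 4G), +len) lies entirely inside the reserved
 * range [ret, ret + 2 * len). Overflow is ignored in this sketch.
 */
static bool rounded_range_fits(uint64_t ret, uint64_t len)
{
	uint64_t start = ex_round_up(ret, EX_SZ_4G);

	return start + len <= ret + 2 * len;
}
```

With len = 4G the rounded range always fits, but with a small len (say 1M)
and a ret just past a 4G boundary, the rounded-up address lands far outside
what was reserved, which is the case I am worried about.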

But in general that smells like a similar alignment issue, IIUC. So the new
API might be applicable there too.

Going back to the topic of this series - I think the new API would work for
io_uring and parisc too if it can return phys_pgoff. What parisc would
need is:

#ifdef SHM_COLOUR
	*phys_pgoff = io_region_get_ptr(..) >> PAGE_SHIFT;
#else
	*phys_pgoff = pgoff;
#endif

Here *phys_pgoff (or whatever it gets renamed to) would need to carry the
kernel VA offset (whether from the direct mapping or vmap()), to avoid
aliasing issues.
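
For illustration only, this is roughly how a caller could turn such a
returned phys_pgoff into a suitably placed user address. It is a userspace
sketch under my own assumptions: ex_align_to_pgoff() and EX_PAGE_SHIFT are
hypothetical, and nothing here is an existing mm interface.

```c
#include <stdint.h>

/* Assumed 4K pages, for illustration only. */
#define EX_PAGE_SHIFT 12

/*
 * Return the lowest address >= base whose page index is congruent to
 * phys_pgoff modulo 2^order, so VA and phys_pgoff share their low bits.
 */
static uint64_t ex_align_to_pgoff(uint64_t base, uint64_t phys_pgoff,
				  unsigned int order)
{
	uint64_t span = 1ULL << (order + EX_PAGE_SHIFT); /* bytes per unit */
	uint64_t want = (phys_pgoff << EX_PAGE_SHIFT) & (span - 1);
	uint64_t addr = (base & ~(span - 1)) + want;

	/* If we fell below base, move to the same offset in the next unit. */
	if (addr < base)
		addr += span;
	return addr;
}
```

The caller would have to search for len plus one extra span of free VA so
that the adjusted address still fits inside the range it found.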

Should I go and introduce the API with *phys_pgoff returned together, then?
I'll still need to scratch my head on how to properly define it, but at
least it would also decouple the vfio use case from the spec dependency.

Thanks,
--
Peter Xu