[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <yvcwdfgxnyet7rjf6lhnsypz72jmp5czfkb2muvgzcul53scn6@rkhqrfgdaxsw>
Date: Tue, 27 Aug 2024 12:01:28 -0400
From: "Liam R. Howlett" <Liam.Howlett@...cle.com>
To: Nam Cao <namcao@...utronix.de>
Cc: Dave Hansen <dave.hansen@...ux.intel.com>,
Andy Lutomirski <luto@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>,
Borislav Petkov <bp@...en8.de>, x86@...nel.org,
"H. Peter Anvin" <hpa@...or.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Vlastimil Babka <vbabka@...e.cz>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
linux-kernel@...r.kernel.org, linux-mm@...ck.org,
bigeasy@...utronix.de
Subject: Re: [PATCH] x86/mm/pat: Support splitting of virtual memory areas
* Nam Cao <namcao@...utronix.de> [240827 03:59]:
> On Mon, Aug 26, 2024 at 09:58:11AM -0400, Liam R. Howlett wrote:
> > * Nam Cao <namcao@...utronix.de> [240825 11:29]:
> > > When a virtual memory area (VMA) gets splitted, memtype_rbroot's entries
> > > are not updated. This causes confusion later on when the VMAs get
> > > un-mapped, because the address ranges of the splitted VMAs do not match the
> > > address range of the initial VMA.
> > >
> > > For example, if user does:
> > >
> > > fd = open("/some/pci/bar", O_RDWR);
> > > addr = mmap(0, 8192, PROT_READ, MAP_SHARED, fd, 0);
> > > mprotect(addr, 4096, PROT_READ | PROT_WRITE);
> > > munmap(p, 8192);
What is p? By the comments below, you mean addr here?
> > >
> > > with the physical address starting from 0xfd000000, the range
> > > (0xfd000000-0xfd002000) would be tracked with the mmap() call.
> > >
> > > After mprotect(), the initial range gets splitted into
> > > (0xfd000000-0xfd001000) and (0xfd001000-0xfd002000).
> > >
> > > Then, at munmap(), the first range does not match any entry in
> > > memtype_rbroot, and a message is seen in dmesg:
> > >
> > > x86/PAT: test:177 freeing invalid memtype [mem 0xfd000000-0xfd000fff]
> > >
> > > The second range still matches by accident, because matching only the end
> > > address is acceptable (to handle shrinking VMA, added by 2039e6acaf94
> > > (x86/mm/pat: Change free_memtype() to support shrinking case)).
> >
> > Does this need a fixes tag?
>
> Yes, it should have
> Fixes: 2e5d9c857d4e ("x86: PAT infrastructure patch")
> thanks for the reminder.
That commit is from 2008, is there a bug report on this issue?
>
> >
> > >
> > > Make sure VMA splitting is handled properly, by splitting the entries in
> > > memtype_rbroot.
> > >
> > > Signed-off-by: Nam Cao <namcao@...utronix.de>
> > > ---
> > > arch/x86/mm/pat/memtype.c | 59 ++++++++++++++++++++++++++++++
> > > arch/x86/mm/pat/memtype.h | 3 ++
> > > arch/x86/mm/pat/memtype_interval.c | 22 +++++++++++
> > > include/linux/pgtable.h | 6 +++
> > > mm/mmap.c | 8 ++++
> > > 5 files changed, 98 insertions(+)
> > >
...
> >
> > It is also a bit odd that you check VM_PFNMAP() here, then call a
> > function to check another flag?
>
> Right, this check is redundant, thanks for pointing it out.
>
> I stole this "style" from unmap_single_vma(), but I think the check is
> redundant there as well.
If you have identified a redundant check, can you please remove it with
a separate patch?
>
> >
> > > + err = track_pfn_split(vma, addr);
> > > + if (err)
> > > + goto out_vma_unlink;
> > > + }
> > > +
> >
> > I don't think the __split_vma() location is the best place to put this.
> > Can this be done through the vm_ops->may_split() that is called above?
>
> I don't think ->may_split() is a suitable place. Its name gives me the
> impression that it only checks whether it is okay to split the VMA, but not
> really does any splitting work. Also that function pointer can be
> overwritten by any driver.
It's a callback that takes the arguments you need and is called as long
as it exists. Your function would deny splitting if it failed, so it
may not split in that case.
Also, any driver that overwrites it should do what is necessary for PAT
then. I don't love the idea of using the vm_ops either, I just like it
better than dropping in flag checks and arch-specific code. I can see
issue with using the callback and drivers that may have their own vma
mapping that also use PAT, I guess.
> >
> > This is arch independent code that now has an x86 specific check, and
> > I'd like to keep __split_vma() out of the flag checking.
>
> I think these track_pfn_*() functions are meant to be arch-independent,
> it's just that only x86 implements it at the moment. For instance,
> untrack_pfn() and track_pfn_remap() are called in mm/ code.
>
Arch-independent wrappers that are only used by one arch are not
arch-independent. PAT has been around for ages and only exists for x86
and x86_64.
We just went through removing arch_unmap(), which was used just for ppc.
They cause problems for general mm changes and just get in the way. If
we can avoid them, we should.
memtype_interval.c doesn't have any knowledge of the vmas, so you have
this extraction layer in memtype.c that is being bypassed here for the
memtype_erase(); ensuring the start-end match or at least the end
matches.
So your comment about the second range still matching by accident is
misleading - it's not matched at all because you are searching for the
exact match or the end address being the same (which it isn't in your
interval tree).
Taking a step back here, you are splitting a range in an interval tree
to match a vma split, but you aren't splitting the range based on PAT
changing; you are splitting it based on the vma becoming two vmas.
Since VM_PFNMAP is in VM_SPECIAL, the splitting is never undone and will
continue to fragment the interval tree, so even if flags change back to
match each other there will always be two vams - and what changed may
not even be the PAT.
So the interval split should occur when the PAT changes and needs to be
tracked differently. This does not happen when the vma is split - it
happens when a vma is removed or when the PAT is changed.
And, indeed, for the mremap() shrinking case, you already support
finding a range by just the end and have an abstraction layer. The
problem here is that you don't check by the start - but you could. You
could make the change to memtype_erase() to search for the exact, end,
or start and do what is necessary to shrink off the front of a region as
well.
What I find very strange is that 2039e6acaf94 ("x86/mm/pat: Change
free_memtype() to support shrinking case") enables shrinking of
VM_PFNMAP, but doesn't allow shrinking the end address. Why is one
allowed and the other not allowed?
Thanks,
Liam
Powered by blists - more mailing lists