[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240903103616.i0GrRAfD@linutronix.de>
Date: Tue, 3 Sep 2024 12:36:16 +0200
From: Nam Cao <namcao@...utronix.de>
To: "Liam R. Howlett" <Liam.Howlett@...cle.com>,
Dave Hansen <dave.hansen@...ux.intel.com>,
Andy Lutomirski <luto@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
x86@...nel.org, "H. Peter Anvin" <hpa@...or.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Vlastimil Babka <vbabka@...e.cz>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
linux-kernel@...r.kernel.org, linux-mm@...ck.org,
bigeasy@...utronix.de
Subject: Re: [PATCH] x86/mm/pat: Support splitting of virtual memory areas
Sorry for the late reply, I was a bit busy, and needed some time to digest
your email.
On Tue, Aug 27, 2024 at 12:01:28PM -0400, Liam R. Howlett wrote:
> * Nam Cao <namcao@...utronix.de> [240827 03:59]:
> > On Mon, Aug 26, 2024 at 09:58:11AM -0400, Liam R. Howlett wrote:
> > > * Nam Cao <namcao@...utronix.de> [240825 11:29]:
> > > > When a virtual memory area (VMA) gets splitted, memtype_rbroot's entries
> > > > are not updated. This causes confusion later on when the VMAs get
> > > > un-mapped, because the address ranges of the splitted VMAs do not match the
> > > > address range of the initial VMA.
> > > >
> > > > For example, if user does:
> > > >
> > > > fd = open("/some/pci/bar", O_RDWR);
> > > > addr = mmap(0, 8192, PROT_READ, MAP_SHARED, fd, 0);
> > > > mprotect(addr, 4096, PROT_READ | PROT_WRITE);
> > > > munmap(p, 8192);
>
> What is p? By the comments below, you mean addr here?
Yes, it should be addr. Sorry about that.
>
> > > >
> > > > with the physical address starting from 0xfd000000, the range
> > > > (0xfd000000-0xfd002000) would be tracked with the mmap() call.
> > > >
> > > > After mprotect(), the initial range gets splitted into
> > > > (0xfd000000-0xfd001000) and (0xfd001000-0xfd002000).
> > > >
> > > > Then, at munmap(), the first range does not match any entry in
> > > > memtype_rbroot, and a message is seen in dmesg:
> > > >
> > > > x86/PAT: test:177 freeing invalid memtype [mem 0xfd000000-0xfd000fff]
> > > >
> > > > The second range still matches by accident, because matching only the end
> > > > address is acceptable (to handle shrinking VMA, added by 2039e6acaf94
> > > > (x86/mm/pat: Change free_memtype() to support shrinking case)).
> > >
> > > Does this need a fixes tag?
> >
> > Yes, it should have
> > Fixes: 2e5d9c857d4e ("x86: PAT infrastructure patch")
> > thanks for the reminder.
>
> That commit is from 2008, is there a bug report on this issue?
Not that I am aware of. I'm not entirely sure why, but I would guess due to
the combination of:
- This is not an issue for pages in RAM
- This only happens if VMAs are splitted
- The only user-visible effect is merely a pr_info(), and people may miss it.
I only encountered this issue while "trying to be smart" with mprotect() on
a portion of mmap()-ed device memory, I guess probably not many people do
that.
>
> >
> > >
> > > >
> > > > Make sure VMA splitting is handled properly, by splitting the entries in
> > > > memtype_rbroot.
> > > >
> > > > Signed-off-by: Nam Cao <namcao@...utronix.de>
> > > > ---
> > > > arch/x86/mm/pat/memtype.c | 59 ++++++++++++++++++++++++++++++
> > > > arch/x86/mm/pat/memtype.h | 3 ++
> > > > arch/x86/mm/pat/memtype_interval.c | 22 +++++++++++
> > > > include/linux/pgtable.h | 6 +++
> > > > mm/mmap.c | 8 ++++
> > > > 5 files changed, 98 insertions(+)
> > > >
> ...
>
> > >
> > > It is also a bit odd that you check VM_PFNMAP() here, then call a
> > > function to check another flag?
> >
> > Right, this check is redundant, thanks for pointing it out.
> >
> > I stole this "style" from unmap_single_vma(), but I think the check is
> > redundant there as well.
>
> If you have identified a redundant check, can you please remove it with
> a separate patch?
Sure.
>
> >
> > >
> > > > + err = track_pfn_split(vma, addr);
> > > > + if (err)
> > > > + goto out_vma_unlink;
> > > > + }
> > > > +
> > >
> > > I don't think the __split_vma() location is the best place to put this.
> > > Can this be done through the vm_ops->may_split() that is called above?
> >
> > I don't think ->may_split() is a suitable place. Its name gives me the
> > impression that it only checks whether it is okay to split the VMA, but not
> > really does any splitting work. Also that function pointer can be
> > overwritten by any driver.
>
> It's a callback that takes the arguments you need and is called as long
> as it exists. Your function would deny splitting if it failed, so it
> may not split in that case.
>
> Also, any driver that overwrites it should do what is necessary for PAT
> then. I don't love the idea of using the vm_ops either, I just like it
> better than dropping in flag checks and arch-specific code. I can see
> issue with using the callback and drivers that may have their own vma
> mapping that also use PAT, I guess.
Yeah I don't love this. You mentioned another approach below, which I
think would be the best (if it's possible). I will attempt that other
approach.
>
> > >
> > > This is arch independent code that now has an x86 specific check, and
> > > I'd like to keep __split_vma() out of the flag checking.
> >
> > I think these track_pfn_*() functions are meant to be arch-independent,
> > it's just that only x86 implements it at the moment. For instance,
> > untrack_pfn() and track_pfn_remap() are called in mm/ code.
> >
>
> Arch-independent wrappers that are only used by one arch are not
> arch-independent. PAT has been around for ages and only exists for x86
> and x86_64.
>
> We just went through removing arch_unmap(), which was used just for ppc.
> They cause problems for general mm changes and just get in the way. If
> we can avoid them, we should.
>
> memtype_interval.c doesn't have any knowledge of the vmas, so you have
> this extraction layer in memtype.c that is being bypassed here for the
> memtype_erase(); ensuring the start-end match or at least the end
> matches.
>
> So your comment about the second range still matching by accident is
> misleading - it's not matched at all because you are searching for the
> exact match or the end address being the same (which it isn't in your
> interval tree).
But the second range *does* match, because the end address match?
The second range is (0xfd001000-0xfd002000), which matches with
(0xfd000000-0xfd002000) in the interval tree.
Perhaps I should be clearer in the description..
>
> Taking a step back here, you are splitting a range in an interval tree
> to match a vma split, but you aren't splitting the range based on PAT
> changing; you are splitting it based on the vma becoming two vmas.
>
> Since VM_PFNMAP is in VM_SPECIAL, the splitting is never undone and will
> continue to fragment the interval tree, so even if flags change back to
> match each other there will always be two vams - and what changed may
> not even be the PAT.
Right, I did not consider this scenario.
>
> So the interval split should occur when the PAT changes and needs to be
> tracked differently. This does not happen when the vma is split - it
> happens when a vma is removed or when the PAT is changed.
>
> And, indeed, for the mremap() shrinking case, you already support
> finding a range by just the end and have an abstraction layer. The
> problem here is that you don't check by the start - but you could. You
> could make the change to memtype_erase() to search for the exact, end,
> or start and do what is necessary to shrink off the front of a region as
> well.
I thought about this solution initially, but since the interval tree allow
overlapping ranges, it can be tricky to determine the "best match" out
of the overlapping ranges. But I agree that this approach (if possible)
would be better than the current patch.
Let me think about this some more, and I will come back later.
>
> What I find very strange is that 2039e6acaf94 ("x86/mm/pat: Change
> free_memtype() to support shrinking case") enables shrinking of
> VM_PFNMAP, but doesn't allow shrinking the end address. Why is one
> allowed and the other not allowed?
Not really sure what you mean. I think you are ultimately asking why that
commit only matches end address, and not start address? That's because
mremap() may shrink a VMA from [start, end] to [start, new_end] (with
new_end < end). In that case, the range [new_end, end] would be removed
from the interval tree, and that commit wants to match [new_end, end] to
[start, end].
And I don't think mremap() can shrink [start, end] to [new_start, end]?
Thanks for sharing your thoughts.
Best regards,
Nam
Powered by blists - more mailing lists