Message-ID: <Zph051h5BsSFMpR6@yzhao56-desk.sh.intel.com>
Date: Thu, 18 Jul 2024 09:50:31 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Peter Xu <peterx@...hat.com>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>, Andrew Morton
	<akpm@...ux-foundation.org>, Alex Williamson <alex.williamson@...hat.com>,
	Jason Gunthorpe <jgg@...dia.com>, Al Viro <viro@...iv.linux.org.uk>, "Dave
 Hansen" <dave.hansen@...ux.intel.com>, Andy Lutomirski <luto@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>, Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>, "Kirill A .
 Shutemov" <kirill@...temov.name>, "x86@...nel.org" <x86@...nel.org>, "Tian,
 Kevin" <kevin.tian@...el.com>, Pei Li <peili.dev@...il.com>, "David
 Hildenbrand" <david@...hat.com>, David Wang <00107082@....com>, Bert
 Karwatzki <spasswolf@....de>, Sergey Senozhatsky <senozhatsky@...omium.org>
Subject: Re: [PATCH] mm/x86/pat: Only untrack the pfn range if unmap region

On Wed, Jul 17, 2024 at 10:15:24PM +0800, Peter Xu wrote:
> On Wed, Jul 17, 2024 at 09:38:34AM +0800, Yan Zhao wrote:
> > On Tue, Jul 16, 2024 at 03:01:50PM -0400, Peter Xu wrote:
> > > On Tue, Jul 16, 2024 at 05:13:29PM +0800, Yan Zhao wrote:
> > > > On Mon, Jul 15, 2024 at 10:29:59PM +0800, Peter Xu wrote:
> > > > > On Mon, Jul 15, 2024 at 03:08:58PM +0800, Yan Zhao wrote:
> > > > > > On Fri, Jul 12, 2024 at 10:42:44AM -0400, Peter Xu wrote:
...
> > > > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > > > index 4bcd79619574..f57cc304b318 100644
> > > > > > > --- a/mm/memory.c
> > > > > > > +++ b/mm/memory.c
> > > > > > > @@ -1827,9 +1827,6 @@ static void unmap_single_vma(struct mmu_gather *tlb,
> > > > > > >  	if (vma->vm_file)
> > > > > > >  		uprobe_munmap(vma, start, end);
> > > > > > >  
> > > > > > > -	if (unlikely(vma->vm_flags & VM_PFNMAP))
> > > > > > > -		untrack_pfn(vma, 0, 0, mm_wr_locked);
> > > > > > > -
> > > > > > Specifically in VFIO's case, it looks like it doesn't matter whether
> > > > > > untrack_pfn() is called here, since remap_pfn_range() is called neither in
> > > > > > mmap() nor in the fault handler, and therefore (vma->vm_flags & VM_PAT) is
> > > > > > always 0.
> > > > > 
> > > > > Right, with the current repo, but I'm thinking maybe we should have VM_PAT
> > > > > there..
> > > > Yes, I agree.
> > > > 
> > > > But, currently for VFIO, it cannot call io_remap_pfn_range() in the fault
> > > > handler since vm_flags_set() requires mmap lock held for write while
> > > > the fault handler can only hold mmap lock for read.
> > > > So, it relies on ioremap()/iounmap() to reserve/de-reserve memtypes,
> > > > without VM_PAT being set in vma.
> > > 
> > > Right, neither vm_flags_set() nor io_remap_pfn_range() should be called in
> > > a fault handler.  They should only be called in mmap() phase.
> > > 
> > > When you mentioned ioremap(), it's only for the VGA device, right?  I don't
> > > see it's used in the vfio-pci's fault handler.
> > Oh, it's pci_iomap() in the vfio-pci's fault handler, another version of
> > ioremap() :)
> 
> Right. If to be explicit, I think it's in mmap(), and looks like Alex has a
Yes, in mmap(). (The "fault handler" in the previous reply is a typo :))

> comment for that:
> 

> 	/*
> 	 * Even though we don't make use of the barmap for the mmap,
> 	 * we need to request the region and the barmap tracks that.
> 	 */
> 
> Maybe in 2012 that's a must?

I think calling ioremap()/pci_iomap() in mmap() is not a must. At least it's
not done in nvgrace_gpu_mmap().
But nvgrace_gpu_mmap() does call remap_pfn_range(), and Alex explained that
nvgrace-gpu "only exists on ARM platforms. memtype_reserve() only exists on
x86." [1].

[1] https://lore.kernel.org/all/20240529105012.39756143.alex.williamson@redhat.com/

> PS: when looking, that looks like a proper
> place to reuse vfio_pci_core_setup_barmap() also in the mmap() function.
Yes, looks so.
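I.e., in vfio_pci_core_mmap(), roughly (untested; just replacing the
open-coded region request + pci_iomap() with the existing helper):

	/*
	 * Even though we don't make use of the barmap for the mmap,
	 * we need to request the region and the barmap tracks that.
	 */
	ret = vfio_pci_core_setup_barmap(vdev, index);
	if (ret)
		return ret;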

> > 
> > > > 
> > > > > 
> > > > > See what reserve_pfn_range() does right now: it'll make sure only one owner
> > > > > reserve this area, e.g. memtype_reserve() will fail if it has already been
> > > > > reserved.  Then when succeeded as the first one to reserve the region,
> > > > > it'll make sure this mem type to also be synchronized to the kernel map
> > > > > (memtype_kernel_map_sync).
> > > > > 
> > > > > So I feel like when MMIO is disabled for a VFIO card, we may need to call
> > > > > reserve_pfn_range(), telling the kernel the memtype VFIO uses, even if the
> > > > > pgtable will be empty, and even if currently it's always UC- for now:
> > > > > 
> > > > > vfio_pci_core_mmap():
> > > > > 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> > > > > 
> > > > > Then the memtype will be reserved even if it cannot be used.  Otherwise I
> > > > > don't know whether there are side effects of the kernel identity mapping
> > > > > mismatching later with what's mapped in userspace via the vma, when MMIO
> > > > > gets enabled again: the pgtable starts to be injected with a different
> > > > > memtype that is specified only in the vma's pgprot.
> > > > Current VFIO relies on pci_iomap() to reserve MMIO pfns as UC-, thus
> > > > no VM_PAT in vmas.
> > > >  
> > > > Calling remap_pfn_range() in mmap() will cause problems for supporting
> > > > dynamic faulting. It looks like there's still a window even with an
> > > > immediate unmap after mmap().
> > > 
> > > It can be conditionally injected.  See Alex's commit (not yet posted
> > > anywhere, only used in our internal testing so far):
> > > 
> > > https://github.com/xzpeter/linux/commit/f432e0e46c34e493943034be4cb9d6eb638f57d1
> > >
> > I'm a bit confused by the code.
> > It looks like it will do the remap_pfn_range() in mmap() if the hardware is
> > ready, and will just set the vma flags if the hardware is not ready.
> > 
> > But if the hardware is not ready in mmap(), which code in the fault handler
> > will reserve pfn memtypes?
> 
> I thought I answered that below.. :)  I think we should use track_pfn_remap().

Ok. Then if we have two sets of pfns, we could:
1. Call remap_pfn_range() in mmap() for pfn set 1.
2. Export track_pfn_remap() and call it in mmap() for pfn set 2.
3. Unmap and call vmf_insert_pfn() in the fault handler to map pfn set 2.

However, I'm not sure we can properly untrack both pfn sets 1 and 2.

Would we export untrack_pfn() too? Or would you leave VFIO to reserve the
memtypes by itself via ioremap()/iounmap() (or their variants)?
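
Just to make sure we're on the same page, a rough sketch of steps 1-3 with a
hypothetical driver (foo_*, pfn1/pfn2 are made-up names; assumes
track_pfn_remap() gets exported; not tested):

	#include <linux/fs.h>
	#include <linux/mm.h>
	#include <linux/pgtable.h>

	/* Hypothetical device with two alternative backing resources. */
	struct foo_dev {
		unsigned long pfn1;	/* base pfn of set 1, mapped at mmap() time */
		unsigned long pfn2;	/* base pfn of set 2, mapped at fault time */
	};

	/* Step 3: after the driver zapped the old mapping, fault in pfn set 2. */
	static vm_fault_t foo_fault(struct vm_fault *vmf)
	{
		struct vm_area_struct *vma = vmf->vma;
		struct foo_dev *foo = vma->vm_private_data;
		unsigned long off = (vmf->address - vma->vm_start) >> PAGE_SHIFT;

		return vmf_insert_pfn(vma, vmf->address, foo->pfn2 + off);
	}

	static const struct vm_operations_struct foo_vm_ops = {
		.fault = foo_fault,
	};

	static int foo_mmap(struct file *file, struct vm_area_struct *vma)
	{
		struct foo_dev *foo = file->private_data;
		unsigned long size = vma->vm_end - vma->vm_start;
		int ret;

		vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
		vma->vm_private_data = foo;
		vma->vm_ops = &foo_vm_ops;

		/* Step 1: pfn set 1 is usable now; this reserves its memtype
		 * and sets VM_PAT. */
		ret = remap_pfn_range(vma, vma->vm_start, foo->pfn1, size,
				      vma->vm_page_prot);
		if (ret)
			return ret;

		/* Step 2: pfn set 2 is not mapped yet; only reserve its
		 * memtype up front (needs track_pfn_remap() exported). */
		return track_pfn_remap(vma, &vma->vm_page_prot, foo->pfn2,
				       vma->vm_start, size);
	}

At munmap() time, untrack_pfn() would then only drop one of the two
reservations, which is the part I'm unsure about.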

> 
> > 
> > > > 
> > > > Also, calling remap_pfn_range() in mmap() may not meet the requirement of
> > > > drivers to dynamically switch backend resources, e.g. as in
> > > > cxl_mmap_fault(), since one remap_pfn_range() cannot reserve memtypes for
> > > > all backend resources at once.
> > > > 
> > > > So, is there any way for the driver to reserve memtypes and set VM_PAT in
> > > > fault handler?
> > > 
> > > I must confess I'm not familiar with memtype stuff, and just started
> > > looking recently.
> > > 
> > > Per my reading so far, we have multiple ways of doing memtype
> > > reservations, and no matter which API we use (ioremap / track_pfn_remap /
> > > pci_iomap), the same memtype needs to be used, otherwise the reservation
> > > will fail.  Here ioremap / pci_iomap involve one more vmap, so that the
> > > virtual mapping is already present after the API returns.
> > Right, I understand in the same way :)
> > 
> > > 
> > > Then here IMHO track_pfn_remap() is the one that we should still use for
> > > page-level mmap() controls, because we don't want to map any MMIO range
> > > yet; however, we want to say "hey, this VMA maps something special" by
> > > reserving the memtype and setting VM_PAT.  We want the virtual mapping
> > > only later, during a fault().
> > > 
> > > In short, it looks to me like the right thing to do is to manually invoke
> > > track_pfn_remap() in vfio-pci's mmap() hook.
> > > 
> > > 	if (!vdev->pm_runtime_engaged && __vfio_pci_memory_enabled(vdev))
> > > 		ret = remap_pfn_range(vma, vma->vm_start, pfn,
> > > 				      req_len, vma->vm_page_prot);
> > > 	else
> > >                 // TODO: manually invoke track_pfn_remap() here
> > > 		vm_flags_set(vma, VM_IO | VM_PFNMAP |
> > > 				  VM_DONTEXPAND | VM_DONTDUMP);
> > What if we have to remap another set of pfns in the "else" case?
> 
> Use vmf_insert_pfn*(): these helpers do not reserve memtypes but only look
> them up.
> 
> > 
> > > 
> > > Then the vma is registered, and untrack_pfn() should be automatically done
> > > when vma unmaps (and that relies on this patch to not do that too early).
> > Well, I'm concerned by one use case.
> > 
> > 1. The driver calls remap_pfn_range() to reserve memtype for pfn1 + add
> >    VM_PAT flag.
> > 2. Then unmap_single_vma() is called. With this patch, the pfn1 memtype is
> >    still reserved.
> > 3. The fault handler calls vmf_insert_pfn() for another pfn2.
> > 4. unmap_vmas() is called. However, untrack_pfn() will only find pfn2
> >    instead of the previous pfn1.
> 
> It depends on what your exact question is. If it's about the pgtable: I don't
> think munmap() requires PFNMAP pgtables to always exist, and it'll simply
> skip none entries.
> 
> And if it's about PAT tracking: that is exactly what I'm talking about
> below..  where I think untracking shouldn't rely on pgtables.

> I'll copy you too if I prepare some patches for the latter.  With that
> patchset (plus this patch) it should fix David Wang's issue and similar ones,
> AFAICT.
Thank you!

> 
> > 
> > 
> > > From that POV, I still think this patch does the right thing and should be
> > > merged, even if we probably have one more issue to fix as David Wang
> > > reported..
> > > 
> > > I'll need to think about how to do that right, as I think that'll be needed
> > > as long as pfnmaps support fault()s: it means at munmap() time the pgtable
> > > may not be present, and PAT cannot rely on walking the pgtable to know the
> > > base PFN anymore.
> > > 
> > > Thanks,
> > > 
> > > -- 
> > > Peter Xu
> > > 
> > 
> 
> -- 
> Peter Xu
> 
> 
