Date: Wed, 22 May 2024 13:43:06 -0600
From: Alex Williamson <alex.williamson@...hat.com>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: Andrew Jones <ajones@...tanamicro.com>, Yan Zhao <yan.y.zhao@...el.com>,
 kvm@...r.kernel.org, linux-kernel@...r.kernel.org, kevin.tian@...el.com,
 yishaih@...dia.com, shameerali.kolothum.thodi@...wei.com, Peter Xu
 <peterx@...hat.com>
Subject: Re: [PATCH] vfio/pci: take mmap write lock for io_remap_pfn_range

On Wed, 22 May 2024 15:30:46 -0300
Jason Gunthorpe <jgg@...dia.com> wrote:

> On Wed, May 22, 2024 at 11:50:06AM -0600, Alex Williamson wrote:
> > I'm not sure if there are any outstanding blockers on Peter's side, but
> > this seems like a good route from the vfio side.  If we're seeing this
> > now without lockdep, we might need to bite the bullet and take the hit
> > with vmf_insert_pfn() while the pmd/pud path learn about pfnmaps.  
> 
> There is another alternative...
> 
> Ideally we wouldn't use the fault handler.
> 
> Instead when the MMIO becomes available again we'd iterate over all
> the VMA's and do remap_pfn_range(). When the MMIO becomes unavailable
> we do unmap_mapping_range() and remove it. This whole thing is
> synchronous and the fault handler should simply trigger SIGBUS if
> userspace races things.
> 
> unmap_mapping_range() is easy, but the remap side doesn't have a
> helper today..
> 
> Something grotesque like this perhaps?
> 
> 	while (1) {
> 		struct mm_struct *cur_mm = NULL;
> 
> 		/* Find one mm that still has an unpopulated VMA */
> 		i_mmap_lock_read(mapping);
> 		vma_interval_tree_foreach(vma, &mapping->i_mmap, 0, ULONG_MAX) {
> 			if (vma_populated(vma))
> 				continue;
> 
> 			cur_mm = vma->vm_mm;
> 			mmgrab(cur_mm);
> 			break;
> 		}
> 		i_mmap_unlock_read(mapping);
> 
> 		if (!cur_mm)
> 			return;
> 
> 		/* Remap every VMA belonging to that mm under its mmap lock */
> 		mmap_write_lock(cur_mm);
> 		i_mmap_lock_read(mapping);
> 		vma_interval_tree_foreach(vma, &mapping->i_mmap, 0, ULONG_MAX) {
> 			if (vma->vm_mm == cur_mm)
> 				vfio_remap_vma(vma);
> 		}
> 		i_mmap_unlock_read(mapping);
> 		mmap_write_unlock(cur_mm);
> 		mmdrop(cur_mm);
> 	}
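
In this synchronous scheme the fault handler degenerates to an
unconditional SIGBUS; a minimal sketch of that end (hypothetical wiring,
not from any posted patch):

	static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
	{
		/*
		 * Mappings are populated when MMIO is enabled and torn
		 * down with unmap_mapping_range() when it is disabled,
		 * so a fault can only mean userspace raced a disable.
		 */
		return VM_FAULT_SIGBUS;
	}

	static const struct vm_operations_struct vfio_pci_mmap_ops = {
		.fault = vfio_pci_mmap_fault,
	};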

Yes, I've played with something similar; it's a pretty ugly and painful
path versus lazily faulting.

> I'm pretty sure we need to hold the mmap_write lock to do the
> remap_pfn..
> 
> vma_populated() would have to do an RCU read of the page table to
> check if page 0 is present.
> 
> Also there is a race in mmap if you call remap_pfn_range() from the
> mmap fops and also use unmap_mapping_range(). mmap_region() does
> call_mmap() before it does vma_link_file() which gives a window where
> the VMA is populated but invisible to unmap_mapping_range(). We'd need
> to add another fop to call after vma_link_file() to populate the mmap
> or rely exclusively on the fault handler to populate.
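
One possible shape for that vma_populated() helper, built on
follow_pte() rather than a bare RCU walk (follow_pte() takes the pte
lock, and its signature has changed across kernel versions, so this is
a sketch, not tree-ready code):

	/* Hypothetical helper: is the first page of this VMA mapped? */
	static bool vma_populated(struct vm_area_struct *vma)
	{
		spinlock_t *ptl;
		pte_t *ptep;

		/* follow_pte() succeeds only for a present pte, returned locked */
		if (follow_pte(vma->vm_mm, vma->vm_start, &ptep, &ptl))
			return false;

		pte_unmap_unlock(ptep, ptl);
		return true;
	}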

Thanks, the branch I linked added back remap_pfn_range() on mmap,
conditional on locking/enabled, but I've predominantly been testing
with that disabled to exercise the pmd/pud faults.  I would have missed
the gap vs unmap_mapping_range().

It's hard to beat the faulting mechanism from a simplicity standpoint,
so long as the mm folks don't balk at pfnmap faults.  I think the vma
address space walking ends up needing tracking flags to avoid redundant
mappings, which gets complicated and adds to the ugliness.  The mm
handles all that if we only use faulting, and we get a simple test on
fault to proceed or trigger a SIGBUS.  Using pmd/pud faults should have
some efficiency benefits as well.
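
For contrast with the synchronous sketch above, a rough sketch of that
test-on-fault path (simplified: the real vfio code derives the pfn from
the BAR index encoded in vm_pgoff, and the memory_lock/vm_private_data
usage here mirrors but does not reproduce the actual driver):

	static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
	{
		struct vm_area_struct *vma = vmf->vma;
		struct vfio_pci_core_device *vdev = vma->vm_private_data;
		unsigned long pfn = vma->vm_pgoff +
				    ((vmf->address - vma->vm_start) >> PAGE_SHIFT);
		vm_fault_t ret;

		/* Simple test on fault: proceed or trigger SIGBUS */
		down_read(&vdev->memory_lock);
		if (__vfio_pci_memory_enabled(vdev))
			ret = vmf_insert_pfn(vma, vmf->address, pfn);
		else
			ret = VM_FAULT_SIGBUS;
		up_read(&vdev->memory_lock);

		return ret;
	}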

Thanks,

Alex

