Message-ID: <n3ex3s3requwlub5lmwl37ia2uqnl23ngic2icvi5qqozrpts7@mwrd5ucdi5mo>
Date: Mon, 12 Jan 2026 11:23:33 +1100
From: Alistair Popple <apopple@...dia.com>
To: Bjorn Helgaas <helgaas@...nel.org>
Cc: Hou Tao <houtao@...weicloud.com>, linux-kernel@...r.kernel.org, 
	linux-pci@...r.kernel.org, linux-mm@...ck.org, linux-nvme@...ts.infradead.org, 
	Bjorn Helgaas <bhelgaas@...gle.com>, Logan Gunthorpe <logang@...tatee.com>, 
	Leon Romanovsky <leonro@...dia.com>, Greg Kroah-Hartman <gregkh@...uxfoundation.org>, 
	Tejun Heo <tj@...nel.org>, "Rafael J . Wysocki" <rafael@...nel.org>, 
	Danilo Krummrich <dakr@...nel.org>, Andrew Morton <akpm@...ux-foundation.org>, 
	David Hildenbrand <david@...nel.org>, Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, 
	Keith Busch <kbusch@...nel.org>, Jens Axboe <axboe@...nel.dk>, Christoph Hellwig <hch@....de>, 
	Sagi Grimberg <sagi@...mberg.me>, houtao1@...wei.com
Subject: Re: [PATCH 01/13] PCI/P2PDMA: Release the per-cpu ref of pgmap when
 vm_insert_page() fails

On 2026-01-12 at 11:12 +1100, Alistair Popple <apopple@...dia.com> wrote...
> On 2026-01-12 at 10:21 +1100, Alistair Popple <apopple@...dia.com> wrote...
> > On 2026-01-10 at 02:03 +1100, Bjorn Helgaas <helgaas@...nel.org> wrote...
> > > On Fri, Jan 09, 2026 at 11:41:51AM +1100, Alistair Popple wrote:
> > > > On 2026-01-09 at 02:55 +1100, Bjorn Helgaas <helgaas@...nel.org> wrote...
> > > > > On Thu, Jan 08, 2026 at 02:23:16PM +1100, Alistair Popple wrote:
> > > > > > On 2025-12-20 at 15:04 +1100, Hou Tao <houtao@...weicloud.com> wrote...
> > > > > > > From: Hou Tao <houtao1@...wei.com>
> > > > > > > 
> > > > > > > When vm_insert_page() fails in p2pmem_alloc_mmap(), p2pmem_alloc_mmap()
> > > > > > > doesn't invoke percpu_ref_put() to free the per-cpu ref of pgmap
> > > > > > > acquired after gen_pool_alloc_owner(), and memunmap_pages() will hang
> > > > > > > forever when trying to remove the PCIe device.
> > > > > > > 
> > > > > > > Fix it by adding the missed percpu_ref_put().
> > > > ...
> > > 
> > > > > Looking at this again, I'm confused about why in the normal, non-error
> > > > > case, we do the percpu_ref_tryget_live_rcu(ref), followed by another
> > > > > percpu_ref_get(ref) for each page, followed by just a single
> > > > > percpu_ref_put() at the exit.
> > > > > 
> > > > > So we do ref_get() "1 + number of pages" times but we only do a single
> > > > > ref_put().  Is there a loop of ref_put() for each page elsewhere?
> > > > 
> > > > Right, the per-page ref_put() happens when the page is freed (ie. the struct
> > > > page refcount drops to zero) - in this case free_zone_device_folio() will call
> > > > p2pdma_folio_free() which has the corresponding percpu_ref_put().
> > > 
> > > I don't see anything that looks like a loop to call ref_put() for each
> > > page in free_zone_device_folio() or in p2pdma_folio_free(), but this
> > > is all completely out of my range, so I'll take your word for it :)  
> > 
> > That's brave :-)
> > 
> > What happens is the core mm takes over managing the page lifetime once
> > vm_insert_page() has been (successfully) called to map the page:
> > 
> > 	VM_WARN_ON_ONCE_PAGE(!page_ref_count(page), page);
> > 	set_page_count(page, 1);
> > 	ret = vm_insert_page(vma, vaddr, page);
> > 	if (ret) {
> > 		gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
> > 		return ret;
> > 	}
> > 	percpu_ref_get(ref);
> > 	put_page(page);
> > 
> > In the above sequence vm_insert_page() takes a page ref for each page it maps
> > into the user page tables with folio_get(). This reference is dropped when the
> > user page table entry is removed, typically by the loop in zap_pte_range().
> > 
> > Normally the user page table mapping is the only thing holding a reference so
> > it ends up calling folio_put()->free_zone_device_folio()->...->ref_put() one page
> > at a time as the PTEs are removed from the page tables. At least that's what
> > happens conceptually - the TLB batching code makes it hard to actually see where
> > the folio_put() is called in this sequence.
> > 
> > Note the extra set_page_count(page, 1) and put_page(page) in the above sequence
> > are just there to make vm_insert_page() happy - it complains if you try to
> > insert a page with a zero page ref.
> > 
> > And looking at that sequence there is another minor bug - in the failure
> > path we are exiting the loop with the failed page ref count set to
> > 1 from set_page_count(page, 1). That needs to be reset to zero with
> > set_page_count(page, 0) to avoid the VM_WARN_ON_ONCE_PAGE() if the page gets
> > reused. I will send a fix for that.
> 
> Actually the whole failure path above seems wrong to me - we
> free the entire allocation with gen_pool_free() even though
> vm_insert_page() may have succeeded in mapping some pages. AFAICT the
> generic VFS mmap code will call unmap_region() to undo any partial
> mapping (see __mmap_new_file_vma) but that should end up calling
> folio_put()->zone_free_device_range()->p2pdma_folio_free()->gen_pool_free_owner()
> for the mapped pages even though we've already freed the entire pool.

Oh never mind, I hit send too soon. Ignore the above paragraph - I hadn't noticed
that kaddr/len get updated at the end of the loop to account for the successful
mappings, so gen_pool_free() only frees the part that was never mapped.
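
For anyone following along, the accounting discussed in this thread can be
sketched as a small userspace model. This is only an illustration under stated
assumptions, not the kernel code: struct model, model_mmap() and model_unmap()
are hypothetical names, a plain counter stands in for the pgmap percpu ref, and
the pool is tracked as a count of pages rather than addresses. It shows the
"1 + number of pages" gets, the single put at exit, the per-page put when a
page is freed, and the error path freeing only the unmapped remainder.

```c
/* Userspace model only: none of these names are kernel APIs. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model {
	long pgmap_refs;   /* stands in for the pgmap percpu ref */
	size_t pool_free;  /* pages returned to the modelled pool */
	size_t mapped;     /* pages currently mapped into user space */
};

/*
 * Model of the mmap loop. fail_at is the index of the page whose insert
 * fails (pass npages for "no failure"). put_on_error selects whether the
 * error path drops the reference taken at entry, which is what the patch
 * under discussion adds.
 */
static int model_mmap(struct model *m, size_t npages, size_t fail_at,
		      bool put_on_error)
{
	size_t remaining = npages;	/* plays the role of len */

	m->pgmap_refs++;		/* reference taken at entry */
	for (size_t i = 0; i < npages; i++) {
		if (i == fail_at) {
			/* free only the part that was never mapped */
			m->pool_free += remaining;
			if (put_on_error)
				m->pgmap_refs--;
			return -1;
		}
		m->pgmap_refs++;	/* per-page reference */
		m->mapped++;
		remaining--;		/* kaddr/len advance past this page */
	}
	m->pgmap_refs--;		/* the single put at exit */
	return 0;
}

/*
 * Model of unmapping: each mapped page is freed exactly once, which drops
 * its per-page reference and returns that page to the pool.
 */
static void model_unmap(struct model *m)
{
	while (m->mapped > 0) {
		assert(m->pgmap_refs > 0);
		m->mapped--;
		m->pgmap_refs--;
		m->pool_free++;
	}
}
```

Calling model_mmap() with a failing page and put_on_error set to false leaves
one reference behind after model_unmap(), which corresponds to the hang in
memunmap_pages() described in the commit message. With put_on_error set to true
the count returns to zero, and in both cases every page ends up back in the
pool exactly once.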

> >  - Alistair
> > 
> > > Bjorn
> > 
> 
