linux-kernel - Re: [PATCH 10/12] fs/dax: Properly refcount fs dax pages

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87a5eon8v3.fsf@nvdebian.thelocal>
Date: Mon, 28 Oct 2024 15:24:49 +1100
From: Alistair Popple <apopple@...dia.com>
To: Dan Williams <dan.j.williams@...el.com>
Cc: linux-mm@...ck.org, vishal.l.verma@...el.com, dave.jiang@...el.com,
 logang@...tatee.com, bhelgaas@...gle.com, jack@...e.cz, jgg@...pe.ca,
 catalin.marinas@....com, will@...nel.org, mpe@...erman.id.au,
 npiggin@...il.com, dave.hansen@...ux.intel.com, ira.weiny@...el.com,
 willy@...radead.org, djwong@...nel.org, tytso@....edu,
 linmiaohe@...wei.com, david@...hat.com, peterx@...hat.com,
 linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
 linux-arm-kernel@...ts.infradead.org, linuxppc-dev@...ts.ozlabs.org,
 nvdimm@...ts.linux.dev, linux-cxl@...r.kernel.org,
 linux-fsdevel@...r.kernel.org, linux-ext4@...r.kernel.org,
 linux-xfs@...r.kernel.org, jhubbard@...dia.com, hch@....de,
 david@...morbit.com
Subject: Re: [PATCH 10/12] fs/dax: Properly refcount fs dax pages


Dan Williams <dan.j.williams@...el.com> writes:

> Alistair Popple wrote:
> [..]
>>> I'm not really following this scenario, or at least how it relates to
>> >> the comment above. If the page is pinned for DMA it will have taken a
>> >> refcount on it and so the page won't be considered free/idle per
>> >> dax_wait_page_idle() or any of the other mm code.
>> >
>> > [ tl;dr: I think we're ok, analysis below, but I did talk myself into
>> > the proposed dax_busy_page() changes indeed being broken and needing to
>> > remain checking for refcount > 1, not > 0 ]
>> >
>> > It's not the mm code I am worried about. It's the filesystem block
>> > allocator staying in-sync with the allocation state of the page.
>> >
>> > fs/dax.c is charged with converting idle storage blocks to pfns to
>> > mapped folios. Once they are mapped, DMA can pin the folio, but nothing
>> > in fs/dax.c pins the mapping. In the pagecache case the page reference
>> > is sufficient to keep the DMA-busy page from being reused. In the dax
>> > case something needs to arrange for DMA to be idle before
>> > dax_delete_mapping_entry().
>> 
>> Ok. How does that work today? My current mental model is that something
>> has to call dax_layout_busy_page() whilst holding the correct locks to
>> prevent a new mapping being established prior to calling
>> dax_delete_mapping_entry(). Is that correct?
>
> Correct. dax_delete_mapping_entry() is invoked by the filesystem with
> inode locks held. See xfs_file_fallocate() where it takes the lock,
> calls xfs_break_layouts() and if that succeeds performs
> xfs_file_free_space() with the lock held.

Thanks for confirming. I've broken it enough times during development of
this that I thought I was correct but the confirmation is nice.

> xfs_file_free_space() triggers dax_delete_mapping_entry() with knowledge
> that the mapping cannot be re-established until the lock is dropped.
>
>> > However, looking at XFS it indeed makes that guarantee. First it does
>> > xfs_break_dax_layouts() then it does truncate_inode_pages() =>
>> > dax_delete_mapping_entry().
>> >
>> > It follows that that the DMA-idle condition still needs to look for the
>> > case where the refcount is > 1 rather than 0 since refcount == 1 is the
>> > page-mapped-but-DMA-idle condition.
>> 
>> Sorry, but I'm still not following this line of reasoning. If the
>> refcount == 1 the page is either mapped xor DMA-busy.
>
> No, my expectation is the refcount is 1 while the page has a mapping
> entry, analagous to an idle / allocated page cache page, and the
> refcount is 2 or more for DMA, get_user_pages(), or any page walker that
> takes a transient page pin.

Argh, I think we may have been talking past each other. By "mapped" I
was thinking of folio_mapped() == true. Ie. page->mapcount >= 1 due to
having page table entries. I suspect you're talking about DAX page-cache
entries here?

The way the series currently works the DAX page-cache does not hold a
reference on the page. Whether or not that is a good idea (or even
valid/functionally correct) is a reasonable question and where I think
this discussion is heading (see below).

>> is enough to conclude that the page cannot be reused because it is
>> either being accessed from userspace via a CPU mapping or from some
>> device DMA or some other in kernel user.
>
> Userspace access is not a problem, that access can always be safely
> revoked by unmapping the page, and that's what dax_layout_busy_page()
> does to force a fault and re-taking the inode + mmap locks so that the
> truncate path knows it has temporary exclusive access to the page, pfn,
> and storage-block association.

Right.

>> The current proposal is that dax_busy_page() returns true if refcount >=
>> 1, and dax_wait_page_idle() will wait until the refcount ==
>> 0. dax_busy_page() will try and force the refcount == 0 by unmapping it,
>> but obviously can't force other pinners to release their reference hence
>> the need to wait. Callers should already be holding locks to ensure new
>> mappings can't be established and hence can't become DMA-busy after the
>> unmap.
>
> Am I missing a page_ref_dec() somewhere? Are you saying that
> dax_layout_busy_page() will find entries with ->mapping non-NULL and
> refcount == 0?

No, ->mapping gets set to NULL when the page is freed in
free_zone_device_folio() but I think the mapping->i_pages XArray will
still contain references to the page with a zero
refcount. Ie. truncate_inode_pages_range() will still find them and call
truncate_inode_pages_range().

> [..]
>> >> >> @@ -1684,14 +1663,21 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
>> >> >>  	if (dax_fault_is_synchronous(iter, vmf->vma))
>> >> >>  		return dax_fault_synchronous_pfnp(pfnp, pfn);
>> >> >>  
>> >> >> -	/* insert PMD pfn */
>> >> >> +	page = pfn_t_to_page(pfn);
>> >> >
>> >> > I think this is clearer if dax_insert_entry() returns folios with an
>> >> > elevated refrence count that is dropped when the folio is invalidated
>> >> > out of the mapping.
>> >> 
>> >> I presume this comment is for the next line:
>> >> 
>> >> +	page_ref_inc(page);
>> >>  
>> >> I can move that into dax_insert_entry(), but we would still need to
>> >> drop it after calling vmf_insert_*() to ensure we get the 1 -> 0
>> >> transition when the page is unmapped and therefore
>> >> freed. Alternatively we can make it so vmf_insert_*() don't take
>> >> references on the page, and instead ownership of the reference is
>> >> transfered to the mapping. Personally I prefered having those
>> >> functions take their own reference but let me know what you think.
>> >
>> > Oh, the model I was thinking was that until vmf_insert_XXX() succeeds
>> > then the page was never allocated because it was never mapped. What
>> > happens with the code as proposed is that put_page() triggers page-free
>> > semantics on vmf_insert_XXX() failures, right?
>> 
>> Right. And actually that means I can't move the page_ref_inc(page) into
>> what will be called dax_create_folio(), because an entry may have been
>> created previously that had a failed vmf_insert_XXX() which will
>> therefore have a zero refcount folio associated with it.
>
> I would expect a full cleanup on on vmf_insert_XXX() failure, not
> leaving a zero-referenced entry.
>
>> But I think that model is wrong. I think the model needs to be the page
>> gets allocated when the entry is first created (ie. when
>> dax_create_folio() is called). A subsequent free (ether due to
>> vmf_insert_XXX() failing or the page being unmapped or becoming
>> DMA-idle) should then delete the entry.
>>
>> I think that makes the semantics around dax_busy_page() nicer as well -
>> no need for the truncate to have a special path to call
>> dax_delete_mapping_entry().
>
> I agree it would be lovely if the final put could clean up the mapping
> entry and not depend on truncate_inode_pages_range() to do that.

I think I'm understanding you better now, thanks for your patience. I
think the problem here is most filesystems tend to basically do the
following:

1. Call some fs-specific version of break_dax_layouts() which:
   a) unmaps all the pages from the page-tables via
      dax_layout_busy_page()
   b) waits for DMA[1] to complete by looking at page refcounts

2. Removes DAX page-cache entries by calling
   truncate_inode_pages_range() or some equivalent.

In this series this works because the DAX page-cache doesn't hold a page
reference nor does it call dax_delete_mapping_entry() on free - it
relies on the truncate code to do that. So I think I understand your
original comment now:

>> > It follows that that the DMA-idle condition still needs to look for the
>> > case where the refcount is > 1 rather than 0 since refcount == 1 is the
>> > page-mapped-but-DMA-idle condition.

Because if the DAX page-cache holds a reference the refcount won't go to
zero until dax_delete_mapping_entry() is called. However this interface
seems really strange to me - filesystems call
dax_layout_busy_page()/dax_wait_page_idle() to make sure both user-space
and DMA[1] have finished with the page, but not the DAX code which still
has references in it's page-cache.

Is there some reason for this? In order words why can't the interface to
the filesystem be something like calling dax_break_layouts() which
ensures everything, including core FS DAX code, has finished with the
page(s) in question? I don't see why that wouldn't work for at least
EXT4 and XFS (FUSE seemed a bit different but I haven't dug too deeply).

If we could do that dax_break_layouts() would essentially:
1. unmap userspace via eg. unmap_mapping_pages() to drive the refcount
   down.
2. delete the DAX page-cache entry to remove its refcount.
3. wait for DMA to complete by waiting for the refcount to hit zero.

The problem with the filesystem truncate code at the moment is steps 2
and 3 are reversed so step 3 has to wait for a refcount of 1 as you
pointed out previously. But does that matter? Are there ever cases when
a filesystem needs to wait for the page to be idle but maintain it's DAX
page-cache entry?

I may be missing something though, because I was having trouble getting
this scheme to actually work today.

[1] - Where "DMA" means any unknown page reference

> ...but I do not immediately see how to get there when block, pfn, and
> page are so tightly coupled with dax. That's a whole new project to
> introduce that paradigm, no? The page cache case gets away with
> it by safely disconnecting the pfn+page from the block and then letting
> DMA final put_page() take its time.

Oh of course, thanks for pointing out the difference there.

>> > There is no need to invoke the page-free / final-put path on
>> > vmf_insert_XXX() error because the storage-block / pfn never actually
>> > transitioned into a page / folio.
>> 
>> It's not mapping a page/folio that transitions a pfn into a page/folio
>> it is the allocation of the folio that happens in dax_create_folio()
>> (aka. dax_associate_new_entry()). So we need to delete the entry (as
>> noted above I don't do that currently) if the insertion fails.
>
> Yeah, deletion on insert failure makes sense.
>
> [..]
>> >> >> @@ -519,21 +529,3 @@ void zone_device_page_init(struct page *page)
>> >> >>  	lock_page(page);
>> >> >>  }
>> >> >>  EXPORT_SYMBOL_GPL(zone_device_page_init);
>> >> >> -
>> >> >> -#ifdef CONFIG_FS_DAX
>> >> >> -bool __put_devmap_managed_folio_refs(struct folio *folio, int refs)
>> >> >> -{
>> >> >> -	if (folio->pgmap->type != MEMORY_DEVICE_FS_DAX)
>> >> >> -		return false;
>> >> >> -
>> >> >> -	/*
>> >> >> -	 * fsdax page refcounts are 1-based, rather than 0-based: if
>> >> >> -	 * refcount is 1, then the page is free and the refcount is
>> >> >> -	 * stable because nobody holds a reference on the page.
>> >> >> -	 */
>> >> >> -	if (folio_ref_sub_return(folio, refs) == 1)
>> >> >> -		wake_up_var(&folio->_refcount);
>> >> >> -	return true;
>> >> >
>> >> > It follow from the refcount disvussion above that I think there is an
>> >> > argument to still keep this wakeup based on the 2->1 transitition.
>> >> > pagecache pages are refcount==1 when they are dma-idle but still
>> >> > allocated. To keep the same semantics for dax a dax_folio would have an
>> >> > elevated refcount whenever it is referenced by mapping entry.
>> >> 
>> >> I'm not sold on keeping it as it doesn't seem to offer any benefit
>> >> IMHO. I know both Jason and Christoph were keen to see it go so it be
>> >> good to get their feedback too. Also one of the primary goals of this
>> >> series was to refcount the page normally so we could remove the whole
>> >> "page is free with a refcount of 1" semantics.
>> >
>> > The page is still free at refcount 0, no argument there. But, by
>> > introducing a new "page refcount is elevated while mapped" (as it
>> > should), it follows that "page is DMA idle at refcount == 1", right?
>> 
>> No. The page is either mapped xor DMA-busy - ie. not free. If we want
>> (need?) to tell the difference we can use folio_maybe_dma_pinned(),
>> assuming the driver doing DMA has called pin_user_pages() as it should.
>> 
>> That said I'm not sure why we care about the distinction between
>> DMA-idle and mapped? If the page is not free from the mm perspective the
>> block can't be reallocated by the filesystem.
>
> "can't be reallocated", what enforces that in your view? I am hoping it
> is something I am overlooking.
>
> In my view the filesystem has no idea of this page-to-block
> relationship. All it knows is that when it wants to destroy the
> page-to-block association, dax notices and says "uh, oh, this is my last
> chance to make sure the block can go back into the fs allocation pool so
> I need to wait for the mm to say that the page is exclusive to me (dax
> core) before dax_delete_mapping_entry() destroys the page-to-block
> association and the fs reclaims the allocation".
>
>> > Otherwise, the current assumption that fileystems can have
>> > dax_layout_busy_page_range() poll on the state of the pfn in the mapping
>> > is broken because page refcount == 0 also means no page to mapping
>> > association.
>> 
>> And also means nothing from the mm (userspace mapping, DMA-busy, etc.)
>> is using the page so the page isn't busy and is free to be reallocated
>> right?
>
> Lets take the 'map => start dma => truncate => end dma' scenario.
>
> At the 'end dma' step, how does the filesystem learn that the block that
> it truncated, potentially hours ago, is now a free block? The filesystem
> thought it reclaimed the block when truncate completed. I.e. dax says,
> thou shalt 'end dma' => 'truncate' in all cases.

Agreed, but I don't think I was suggesting we change that. I agree DAX
has to ensure 'end dma' happens before truncate completes.

> Note "dma" can be replaced with "any non dax core page_ref".