linux-kernel - Re: [PATCH 10/12] fs/dax: Properly refcount fs dax pages

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <6720428aa1fcc_bc6a929439@dwillia2-xfh.jf.intel.com.notmuch>
Date: Mon, 28 Oct 2024 19:03:54 -0700
From: Dan Williams <dan.j.williams@...el.com>
To: Alistair Popple <apopple@...dia.com>, Dan Williams
	<dan.j.williams@...el.com>
CC: <linux-mm@...ck.org>, <vishal.l.verma@...el.com>, <dave.jiang@...el.com>,
	<logang@...tatee.com>, <bhelgaas@...gle.com>, <jack@...e.cz>, <jgg@...pe.ca>,
	<catalin.marinas@....com>, <will@...nel.org>, <mpe@...erman.id.au>,
	<npiggin@...il.com>, <dave.hansen@...ux.intel.com>, <ira.weiny@...el.com>,
	<willy@...radead.org>, <djwong@...nel.org>, <tytso@....edu>,
	<linmiaohe@...wei.com>, <david@...hat.com>, <peterx@...hat.com>,
	<linux-doc@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
	<linux-arm-kernel@...ts.infradead.org>, <linuxppc-dev@...ts.ozlabs.org>,
	<nvdimm@...ts.linux.dev>, <linux-cxl@...r.kernel.org>,
	<linux-fsdevel@...r.kernel.org>, <linux-ext4@...r.kernel.org>,
	<linux-xfs@...r.kernel.org>, <jhubbard@...dia.com>, <hch@....de>,
	<david@...morbit.com>
Subject: Re: [PATCH 10/12] fs/dax: Properly refcount fs dax pages

Alistair Popple wrote:
[..]

> >> > It follows that that the DMA-idle condition still needs to look for the
> >> > case where the refcount is > 1 rather than 0 since refcount == 1 is the
> >> > page-mapped-but-DMA-idle condition.
> 
> Because if the DAX page-cache holds a reference the refcount won't go to
> zero until dax_delete_mapping_entry() is called. However this interface
> seems really strange to me - filesystems call
> dax_layout_busy_page()/dax_wait_page_idle() to make sure both user-space
> and DMA[1] have finished with the page, but not the DAX code which still
> has references in it's page-cache.

First, I appreciate the clarification that I was mixing up "mapped"
(elevated map count) with, for lack of a better term, "tracked" (mapping
entry valid).

So, to repeat back to you what I understand now, the proposal is to
attempt to allow _count==0 as the DMA idle condition, but still have the
final return of the block to the allocator (fs allocator) occur after
dax_delete_mapping_entry().

> Is there some reason for this? In order words why can't the interface to
> the filesystem be something like calling dax_break_layouts() which
> ensures everything, including core FS DAX code, has finished with the
> page(s) in question? I don't see why that wouldn't work for at least
> EXT4 and XFS (FUSE seemed a bit different but I haven't dug too deeply).
> 
> If we could do that dax_break_layouts() would essentially:
> 1. unmap userspace via eg. unmap_mapping_pages() to drive the refcount
>    down.

Am I missing where unmap_mapping_pages() drops the _count? I can see
where it drops _mapcount. I don't think that matters for the proposal,
but that's my last gap in tracking the proposed refcount model.

> 2. delete the DAX page-cache entry to remove its refcount.
> 3. wait for DMA to complete by waiting for the refcount to hit zero.
> 
> The problem with the filesystem truncate code at the moment is steps 2
> and 3 are reversed so step 3 has to wait for a refcount of 1 as you
> pointed out previously. But does that matter? Are there ever cases when
> a filesystem needs to wait for the page to be idle but maintain it's DAX
> page-cache entry?

No, not that I can think of. The filesystem just cares that the page was
seen as part of the file at some point and that it is holding locks to
keep the block associated with that page allocated to the file until it
can complete its operation.

I think what we are talking about is a pfn-state not a page state. If
the block-pfn-page lifecycle from allocation to free is deconstructed as:

    block free
    block allocated
    pfn untracked
    pfn tracked
    page free
    page busy
    page free
    pfn untracked
    block free

...then I can indeed see cases where there is pfn metadata live even
though the page is free.

So I think I was playing victim to the current implementation that
assumes that "pfn tracked" means the page is allocated and that
pfn_to_folio(pfn)->mapping is valid and not NULL.

All this to say I am at least on the same page as you that _count == 0
can be used as the page free state even if the pfn tracking goes through
delayed cleanup.

However, if vmf_insert_XXX is increasing _count then, per my
unmap_mapping_pages() question above, I think dax_wait_page_idle() needs
to call try_to_unmap() to drop that _count, right? Similar observation
for the memory_failure_dev_pagemap() path, I think that path only calls
unmap_mapping_range() not try_to_unmap() and leaves _count elevated.

Lastly walking through the code again I think this fix is valid today:

diff --git a/fs/dax.c b/fs/dax.c
index fcbe62bde685..48f2c85690e1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -660,7 +660,7 @@ struct page *dax_layout_busy_page_range(struct address_space *mapping,
        pgoff_t end_idx;
        XA_STATE(xas, &mapping->i_pages, start_idx);
 
-       if (!dax_mapping(mapping) || !mapping_mapped(mapping))
+       if (!dax_mapping(mapping))
                return NULL;
 
        /* If end == LLONG_MAX, all pages from start to till end of file */


...because unmap_mapping_pages() will mark the mapping as unmapped even
though there are "pfn tracked + page busy" entries to clean up.

Appreciate you grappling this with me!