linux-ext4 - Re: [PATCH 10/12] fs/dax: Properly refcount fs dax pages

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <871q06c4z7.fsf@nvdebian.thelocal>
Date: Thu, 24 Oct 2024 18:52:23 +1100
From: Alistair Popple <apopple@...dia.com>
To: Dan Williams <dan.j.williams@...el.com>
Cc: linux-mm@...ck.org, vishal.l.verma@...el.com, dave.jiang@...el.com,
 logang@...tatee.com, bhelgaas@...gle.com, jack@...e.cz, jgg@...pe.ca,
 catalin.marinas@....com, will@...nel.org, mpe@...erman.id.au,
 npiggin@...il.com, dave.hansen@...ux.intel.com, ira.weiny@...el.com,
 willy@...radead.org, djwong@...nel.org, tytso@....edu,
 linmiaohe@...wei.com, david@...hat.com, peterx@...hat.com,
 linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
 linux-arm-kernel@...ts.infradead.org, linuxppc-dev@...ts.ozlabs.org,
 nvdimm@...ts.linux.dev, linux-cxl@...r.kernel.org,
 linux-fsdevel@...r.kernel.org, linux-ext4@...r.kernel.org,
 linux-xfs@...r.kernel.org, jhubbard@...dia.com, hch@....de,
 david@...morbit.com
Subject: Re: [PATCH 10/12] fs/dax: Properly refcount fs dax pages


Dan Williams <dan.j.williams@...el.com> writes:

> Alistair Popple wrote:

[...]

>> @@ -318,85 +323,58 @@ static unsigned long dax_end_pfn(void *entry)
>>   */
>>  #define for_each_mapped_pfn(entry, pfn) \
>>  	for (pfn = dax_to_pfn(entry); \
>> -			pfn < dax_end_pfn(entry); pfn++)
>> +		pfn < dax_end_pfn(entry); pfn++)
>>  
>> -static inline bool dax_page_is_shared(struct page *page)
>> +static void dax_device_folio_init(struct folio *folio, int order)
>>  {
>> -	return page->mapping == PAGE_MAPPING_DAX_SHARED;
>> -}
>> +	int orig_order = folio_order(folio);
>> +	int i;
>>  
>> -/*
>> - * Set the page->mapping with PAGE_MAPPING_DAX_SHARED flag, increase the
>> - * refcount.
>> - */
>> -static inline void dax_page_share_get(struct page *page)
>> -{
>> -	if (page->mapping != PAGE_MAPPING_DAX_SHARED) {
>> -		/*
>> -		 * Reset the index if the page was already mapped
>> -		 * regularly before.
>> -		 */
>> -		if (page->mapping)
>> -			page->share = 1;
>> -		page->mapping = PAGE_MAPPING_DAX_SHARED;
>> -	}
>> -	page->share++;
>> -}
>> +	if (orig_order != order) {
>> +		struct dev_pagemap *pgmap = page_dev_pagemap(&folio->page);
>
> Was there a discussion I missed about why the conversion to typical
> folios allows the page->share accounting to be dropped.

The problem with keeping it is we now treat DAX pages as "normal"
pages according to vm_normal_page(). As such we use the normal paths
for unmapping pages.

Specifically page->share accounting relies on PAGE_MAPPING_DAX_SHARED
aka PAGE_MAPPING_ANON which causes folio_test_anon(), PageAnon(),
etc. to return true leading to all sorts of issues in at least the
unmap paths.

There hasn't been a previous discussion on this, but given this is
only used to print warnings it seemed easier to get rid of it. I
probably should have called that out more clearly in the commit
message though.

> I assume this is because the page->mapping validation was dropped, which
> I think might be useful to keep at least for one development cycle to
> make sure this conversion is not triggering any of the old warnings.
>
> Otherwise, the ->share field of 'struct page' can also be cleaned up.

Yes, we should also clean up the ->share field, unless you have an
alternate suggestion to solve the above issue.

>> -static inline unsigned long dax_page_share_put(struct page *page)
>> -{
>> -	return --page->share;
>> -}
>> +		for (i = 0; i < (1UL << orig_order); i++) {
>> +			struct page *page = folio_page(folio, i);
>>  
>> -/*
>> - * When it is called in dax_insert_entry(), the shared flag will indicate that
>> - * whether this entry is shared by multiple files.  If so, set the page->mapping
>> - * PAGE_MAPPING_DAX_SHARED, and use page->share as refcount.
>> - */
>> -static void dax_associate_entry(void *entry, struct address_space *mapping,
>> -		struct vm_area_struct *vma, unsigned long address, bool shared)
>> -{
>> -	unsigned long size = dax_entry_size(entry), pfn, index;
>> -	int i = 0;
>> +			ClearPageHead(page);
>> +			clear_compound_head(page);
>>  
>> -	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
>> -		return;
>> -
>> -	index = linear_page_index(vma, address & ~(size - 1));
>> -	for_each_mapped_pfn(entry, pfn) {
>> -		struct page *page = pfn_to_page(pfn);
>> +			/*
>> +			 * Reset pgmap which was over-written by
>> +			 * prep_compound_page().
>> +			 */
>> +			page_folio(page)->pgmap = pgmap;
>>  
>> -		if (shared) {
>> -			dax_page_share_get(page);
>> -		} else {
>> -			WARN_ON_ONCE(page->mapping);
>> -			page->mapping = mapping;
>> -			page->index = index + i++;
>> +			/* Make sure this isn't set to TAIL_MAPPING */
>> +			page->mapping = NULL;
>>  		}
>>  	}
>> +
>> +	if (order > 0) {
>> +		prep_compound_page(&folio->page, order);
>> +		if (order > 1)
>> +			INIT_LIST_HEAD(&folio->_deferred_list);
>> +	}
>>  }
>>  
>> -static void dax_disassociate_entry(void *entry, struct address_space *mapping,
>> -		bool trunc)
>> +static void dax_associate_new_entry(void *entry, struct address_space *mapping,
>> +				pgoff_t index)
>
> Lets call this dax_create_folio(), to mirror filemap_create_folio() and
> have it transition the folio refcount from 0 to 1 to indicate that it is
> allocated.
>
> While I am not sure anything requires that, it seems odd that page cache
> pages have an elevated refcount at map time and dax pages do not.

The refcount gets elevated further up the call stack, but I agree it
would be clearer to move it here.

> It does have implications for the dax dma-idle tracking thought, see
> below.
>
>>  {
>> -	unsigned long pfn;
>> +	unsigned long order = dax_entry_order(entry);
>> +	struct folio *folio = dax_to_folio(entry);
>>  
>> -	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
>> +	if (!dax_entry_size(entry))
>>  		return;
>>  
>> -	for_each_mapped_pfn(entry, pfn) {
>> -		struct page *page = pfn_to_page(pfn);
>> -
>> -		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
>> -		if (dax_page_is_shared(page)) {
>> -			/* keep the shared flag if this page is still shared */
>> -			if (dax_page_share_put(page) > 0)
>> -				continue;
>> -		} else
>> -			WARN_ON_ONCE(page->mapping && page->mapping != mapping);
>> -		page->mapping = NULL;
>> -		page->index = 0;
>> -	}
>> +	/*
>> +	 * We don't hold a reference for the DAX pagecache entry for the
>> +	 * page. But we need to initialise the folio so we can hand it
>> +	 * out. Nothing else should have a reference either.
>> +	 */
>> +	WARN_ON_ONCE(folio_ref_count(folio));
>
> Per above I would feel more comfortable if we kept the paranoia around
> to ensure that all the pages in this folio have dropped all references
> and cleared ->mapping and ->index.
>
> That paranoia can be placed behind a CONFIG_DEBUB_VM check, and we can
> delete in a follow-on development cycle, but in the meantime it helps to
> prove the correctness of the conversion.

I'm ok with paranoia, but as noted above the issue is that at a minimum
page->mapping (and probably index) now needs to be valid for any code
that might walk the page tables.

> [..]
>> @@ -1189,11 +1165,14 @@ static vm_fault_t dax_load_hole(struct xa_state *xas, struct vm_fault *vmf,
>>  	struct inode *inode = iter->inode;
>>  	unsigned long vaddr = vmf->address;
>>  	pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr));
>> +	struct page *page = pfn_t_to_page(pfn);
>>  	vm_fault_t ret;
>>  
>>  	*entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, DAX_ZERO_PAGE);
>>  
>> -	ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
>> +	page_ref_inc(page);
>> +	ret = dax_insert_pfn(vmf, pfn, false);
>> +	put_page(page);
>
> Per above I think it is problematic to have pages live in the system
> without a refcount.

I'm a bit confused by this - the pages have a reference taken on them
when they are mapped. They only live in the system without a refcount
when the mm considers them free (except for the bit between getting
created in dax_associate_entry() and actually getting mapped but as
noted I will fix that).

> One scenario where this might be needed is invalidate_inode_pages() vs
> DMA. The invaldation should pause and wait for DMA pins to be dropped
> before the mapping xarray is cleaned up and the dax folio is marked
> free.

I'm not really following this scenario, or at least how it relates to
the comment above. If the page is pinned for DMA it will have taken a
refcount on it and so the page won't be considered free/idle per
dax_wait_page_idle() or any of the other mm code.

> I think this may be a gap in the current code. I'll attempt to write a
> test for this to check.

Ok, let me know if you come up with anything there as it might help
explain the problem more clearly.

> [..]
>> @@ -1649,9 +1627,10 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
>>  	loff_t pos = (loff_t)xas->xa_index << PAGE_SHIFT;
>>  	bool write = iter->flags & IOMAP_WRITE;
>>  	unsigned long entry_flags = pmd ? DAX_PMD : 0;
>> -	int err = 0;
>> +	int ret, err = 0;
>>  	pfn_t pfn;
>>  	void *kaddr;
>> +	struct page *page;
>>  
>>  	if (!pmd && vmf->cow_page)
>>  		return dax_fault_cow_page(vmf, iter);
>> @@ -1684,14 +1663,21 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
>>  	if (dax_fault_is_synchronous(iter, vmf->vma))
>>  		return dax_fault_synchronous_pfnp(pfnp, pfn);
>>  
>> -	/* insert PMD pfn */
>> +	page = pfn_t_to_page(pfn);
>
> I think this is clearer if dax_insert_entry() returns folios with an
> elevated refrence count that is dropped when the folio is invalidated
> out of the mapping.

I presume this comment is for the next line:

+	page_ref_inc(page);
 
I can move that into dax_insert_entry(), but we would still need to
drop it after calling vmf_insert_*() to ensure we get the 1 -> 0
transition when the page is unmapped and therefore
freed. Alternatively we can make it so vmf_insert_*() don't take
references on the page, and instead ownership of the reference is
transfered to the mapping. Personally I prefered having those
functions take their own reference but let me know what you think.

> [..]
>> @@ -519,21 +529,3 @@ void zone_device_page_init(struct page *page)
>>  	lock_page(page);
>>  }
>>  EXPORT_SYMBOL_GPL(zone_device_page_init);
>> -
>> -#ifdef CONFIG_FS_DAX
>> -bool __put_devmap_managed_folio_refs(struct folio *folio, int refs)
>> -{
>> -	if (folio->pgmap->type != MEMORY_DEVICE_FS_DAX)
>> -		return false;
>> -
>> -	/*
>> -	 * fsdax page refcounts are 1-based, rather than 0-based: if
>> -	 * refcount is 1, then the page is free and the refcount is
>> -	 * stable because nobody holds a reference on the page.
>> -	 */
>> -	if (folio_ref_sub_return(folio, refs) == 1)
>> -		wake_up_var(&folio->_refcount);
>> -	return true;
>
> It follow from the refcount disvussion above that I think there is an
> argument to still keep this wakeup based on the 2->1 transitition.
> pagecache pages are refcount==1 when they are dma-idle but still
> allocated. To keep the same semantics for dax a dax_folio would have an
> elevated refcount whenever it is referenced by mapping entry.

I'm not sold on keeping it as it doesn't seem to offer any benefit
IMHO. I know both Jason and Christoph were keen to see it go so it be
good to get their feedback too. Also one of the primary goals of this
series was to refcount the page normally so we could remove the whole
"page is free with a refcount of 1" semantics.

  - Alistair