Message-ID: <7vejbjs7btkof4iguvn3nqvozxqpnzbymxbumd7pant4zi4ac4@3ozuzfzsm5tp>
Date: Thu, 30 Jan 2025 11:27:37 +1100
From: Alistair Popple <apopple@...dia.com>
To: David Hildenbrand <david@...hat.com>
Cc: linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org,
dri-devel@...ts.freedesktop.org, linux-mm@...ck.org, nouveau@...ts.freedesktop.org,
Andrew Morton <akpm@...ux-foundation.org>, Jérôme Glisse <jglisse@...hat.com>,
Jonathan Corbet <corbet@....net>, Alex Shi <alexs@...nel.org>, Yanteng Si <si.yanteng@...ux.dev>,
Karol Herbst <kherbst@...hat.com>, Lyude Paul <lyude@...hat.com>,
Danilo Krummrich <dakr@...nel.org>, David Airlie <airlied@...il.com>,
Simona Vetter <simona@...ll.ch>, "Liam R. Howlett" <Liam.Howlett@...cle.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
Pasha Tatashin <pasha.tatashin@...een.com>, Peter Xu <peterx@...hat.com>, Jason Gunthorpe <jgg@...dia.com>
Subject: Re: [PATCH v1 4/4] mm/memory: document restore_exclusive_pte()
On Wed, Jan 29, 2025 at 12:58:02PM +0100, David Hildenbrand wrote:
> Let's document how this function is to be used, and why the requirement
> for the folio lock might maybe be dropped in the future.
Sorry, only just catching up on your other thread. The folio lock was to ensure
the GPU got a chance to make forward progress by mapping the page. Without it
the CPU could immediately invalidate the entry before the GPU had a chance to
retry the fault.
Obviously, performance-wise such thrashing is terrible and should really be
avoided by userspace, but the lock at least allowed such programs to
complete.
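
For reference, the driver side looks roughly like the below (hand-wavy
sketch rather than the actual nouveau code; mm, addr, my_gpu_dev and
my_gpu_map_page() are all placeholders):

	struct page *page = NULL;
	int ret;

	mmap_read_lock(mm);
	/*
	 * Convert the CPU mapping to a device-exclusive entry. On success
	 * the page is returned locked and with a reference held.
	 */
	ret = make_device_exclusive_range(mm, addr, addr + PAGE_SIZE,
					  &page, my_gpu_dev);
	mmap_read_unlock(mm);
	if (ret <= 0 || !page)
		return -EBUSY;

	/*
	 * Program the GPU page tables and let the GPU retry its fault while
	 * we still hold the page lock: a concurrent CPU fault has to wait
	 * for the lock before restore_exclusive_pte() can rip the entry out
	 * again, so the GPU is guaranteed to make forward progress.
	 */
	my_gpu_map_page(my_gpu_dev, addr, page);

	unlock_page(page);
	put_page(page);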
> Signed-off-by: David Hildenbrand <david@...hat.com>
> ---
> mm/memory.c | 25 +++++++++++++++++++++++++
> 1 file changed, 25 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 46956994aaff..caaae8df11a9 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -718,6 +718,31 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
> }
> #endif
>
> +/**
> + * restore_exclusive_pte - Restore a device-exclusive entry
> + * @vma: VMA covering @address
> + * @folio: the mapped folio
> + * @page: the mapped folio page
> + * @address: the virtual address
> + * @ptep: PTE pointer into the locked page table mapping the folio page
> + * @orig_pte: PTE value at @ptep
> + *
> + * Restore a device-exclusive non-swap entry to an ordinary present PTE.
> + *
> + * The folio and the page table must be locked, and MMU notifiers must have
> + * been called to invalidate any (exclusive) device mappings. In case of
> + * fork(), MMU_NOTIFY_PROTECTION_PAGE is triggered, and in case of a page
> + * fault MMU_NOTIFY_EXCLUSIVE is triggered.
> + *
> + * Locking the folio makes sure that anybody who just converted the PTE to
> + * a device-private entry can map it into the device, before unlocking it; so
> + * the folio lock prevents concurrent conversion to device-exclusive.
I don't quite follow this - a concurrent conversion would already fail
because the GUP in make_device_exclusive_range() would most likely cause
an unexpected reference during the migration. And if a migration entry
has already been installed for the device private PTE conversion then
make_device_exclusive_range() will skip it as a non-present entry anyway.
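
That is, a racing migration to device-private memory would already notice
the unexpected reference and back off, something along these lines (sketch
only; try_migrate_to_vram() and drm_dev are made up, and the locking and the
actual copy are omitted):

static int try_migrate_to_vram(struct vm_area_struct *vma, unsigned long addr,
			       void *drm_dev)
{
	unsigned long src_pfn = 0, dst_pfn = 0;
	struct migrate_vma args = {
		.vma		= vma,
		.start		= addr,
		.end		= addr + PAGE_SIZE,
		.src		= &src_pfn,
		.dst		= &dst_pfn,
		.pgmap_owner	= drm_dev,
		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
	};
	int ret;

	ret = migrate_vma_setup(&args);
	if (ret)
		return ret;

	/*
	 * The extra reference taken by the concurrent GUP makes the page
	 * fail isolation, so MIGRATE_PFN_MIGRATE is already cleared here
	 * and no device-private entry gets installed for it.
	 */
	if (!(src_pfn & MIGRATE_PFN_MIGRATE)) {
		migrate_vma_finalize(&args);
		return -EBUSY;
	}

	/* ... allocate device memory, copy, migrate_vma_pages() ... */
	migrate_vma_finalize(&args);
	return 0;
}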
However, s/device-private/device-exclusive/ makes sense - the intent was to
allow the device to map the page before a call to restore_exclusive_pte()
(i.e. a CPU fault) could convert it back to a normal PTE.
> + * TODO: the folio lock does not protect against all cases of concurrent
> + * page table modifications (e.g., MADV_DONTNEED, mprotect), so device drivers
> + * must already use MMU notifiers to sync against any concurrent changes
Right. It's expected that drivers use MMU notifiers to keep their page tables
in sync, the same as for hmm_range_fault().
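
I.e. the usual interval-notifier pattern, roughly (sketch, not actual driver
code; struct my_gpu_bind and my_gpu_unmap_range() are made up):

struct my_gpu_bind {
	struct mmu_interval_notifier notifier;
	/* ... GPU page-table handles, locks, ... */
};

static bool my_gpu_invalidate(struct mmu_interval_notifier *mni,
			      const struct mmu_notifier_range *range,
			      unsigned long cur_seq)
{
	struct my_gpu_bind *bind =
		container_of(mni, struct my_gpu_bind, notifier);

	if (!mmu_notifier_range_blockable(range))
		return false;

	mmu_interval_set_seq(mni, cur_seq);
	/*
	 * Zap the GPU mappings for the invalidated range; this is what
	 * keeps the device in sync across MADV_DONTNEED, mprotect, munmap,
	 * ... just as for hmm_range_fault() users.
	 */
	my_gpu_unmap_range(bind, range->start, range->end);
	return true;
}

static const struct mmu_interval_notifier_ops my_gpu_notifier_ops = {
	.invalidate = my_gpu_invalidate,
};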
> + * Maybe the requirement for the folio lock can be dropped in the future.
> + */
> static void restore_exclusive_pte(struct vm_area_struct *vma,
> struct folio *folio, struct page *page, unsigned long address,
> pte_t *ptep, pte_t orig_pte)
> --
> 2.48.1
>