lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <cfc4f8ac-80c4-472f-85fc-36ffcd212441@redhat.com>
Date: Thu, 30 Jan 2025 10:37:06 +0100
From: David Hildenbrand <david@...hat.com>
To: Alistair Popple <apopple@...dia.com>
Cc: linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org,
 dri-devel@...ts.freedesktop.org, linux-mm@...ck.org,
 nouveau@...ts.freedesktop.org, Andrew Morton <akpm@...ux-foundation.org>,
 Jérôme Glisse <jglisse@...hat.com>,
 Jonathan Corbet <corbet@....net>, Alex Shi <alexs@...nel.org>,
 Yanteng Si <si.yanteng@...ux.dev>, Karol Herbst <kherbst@...hat.com>,
 Lyude Paul <lyude@...hat.com>, Danilo Krummrich <dakr@...nel.org>,
 David Airlie <airlied@...il.com>, Simona Vetter <simona@...ll.ch>,
 "Liam R. Howlett" <Liam.Howlett@...cle.com>,
 Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
 Pasha Tatashin <pasha.tatashin@...een.com>, Peter Xu <peterx@...hat.com>,
 Jason Gunthorpe <jgg@...dia.com>
Subject: Re: [PATCH v1 4/4] mm/memory: document restore_exclusive_pte()

On 30.01.25 01:27, Alistair Popple wrote:
> On Wed, Jan 29, 2025 at 12:58:02PM +0100, David Hildenbrand wrote:
>> Let's document how this function is to be used, and why the requirement
>> for the folio lock might maybe be dropped in the future.
> 
> Sorry, only just catching up on your other thread. The folio lock was to ensure
> the GPU got a chance to make forward progress by mapping the page. Without it
> the CPU could immediately invalidate the entry before the GPU had a chance to
> retry the fault.
 > > Obviously performance wise having such thrashing is terrible, so should
> really be avoided by userspace, but the lock at least allowed such programs
> to complete.

Thanks for the clarification. So it's relevant that the MMU notifier in 
remove_device_exclusive_entry() is sent after taking the folio lock.

However, as soon as we drop the folio lock, 
remove_device_exclusive_entry() will become active, lock the folio and 
trigger the MMU notifier.

So the time it is actually mapped into the device is rather

> 
>> Signed-off-by: David Hildenbrand <david@...hat.com>
>> ---
>>   mm/memory.c | 25 +++++++++++++++++++++++++
>>   1 file changed, 25 insertions(+)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 46956994aaff..caaae8df11a9 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -718,6 +718,31 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
>>   }
>>   #endif
>>   
>> +/**
>> + * restore_exclusive_pte - Restore a device-exclusive entry
>> + * @vma: VMA covering @address
>> + * @folio: the mapped folio
>> + * @page: the mapped folio page
>> + * @address: the virtual address
>> + * @ptep: PTE pointer into the locked page table mapping the folio page
>> + * @orig_pte: PTE value at @ptep
>> + *
>> + * Restore a device-exclusive non-swap entry to an ordinary present PTE.
>> + *
>> + * The folio and the page table must be locked, and MMU notifiers must have
>> + * been called to invalidate any (exclusive) device mappings. In case of
>> + * fork(), MMU_NOTIFY_PROTECTION_PAGE is triggered, and in case of a page
>> + * fault MMU_NOTIFY_EXCLUSIVE is triggered.
>> + *
>> + * Locking the folio makes sure that anybody who just converted the PTE to
>> + * a device-private entry can map it into the device, before unlocking it; so
>> + * the folio lock prevents concurrent conversion to device-exclusive.
> 
> I don't quite follow this - a concurrent conversion would already fail
> because the GUP in make_device_exclusive_range() would most likely cause
> an unexpected reference during the migration. And if a migration entry
> has already been installed for the device private PTE conversion then
> make_device_exclusive_range() will skip it as a non-present entry anyway.

Sorry, I meant "device-exclusive", so migration is not a concern.

> 
> However s/device-private/device-exclusive/ makes sense - the intent was to allow
> the device to map it before a call to restore_exclusive_pte() (ie. a CPU fault)
> could convert it back to a normal PTE.
> 
>> + * TODO: the folio lock does not protect against all cases of concurrent
>> + * page table modifications (e.g., MADV_DONTNEED, mprotect), so device drivers
>> + * must already use MMU notifiers to sync against any concurrent changes
> 
> Right. It's expected drivers are using MMU notifiers to keep page tables in
> sync, same as for hmm_range_fault().

Let me try to rephrase it given that the folio lock is purely to 
guarantee forward-progress, not for correctness; that's what MMU 
notifiers must be used for.

-- 
Cheers,

David / dhildenb


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ