lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d7c54bbf-ed99-43c7-99a7-a9be70071b4a@redhat.com>
Date: Mon, 16 Jun 2025 18:13:04 +0200
From: David Hildenbrand <david@...hat.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: Dev Jain <dev.jain@....com>, akpm@...ux-foundation.org,
 Liam.Howlett@...cle.com, vbabka@...e.cz, jannh@...gle.com, pfalcato@...e.de,
 linux-mm@...ck.org, linux-kernel@...r.kernel.org, peterx@...hat.com,
 ryan.roberts@....com, mingo@...nel.org, libang.li@...group.com,
 maobibo@...ngson.cn, zhengqi.arch@...edance.com, baohua@...nel.org,
 anshuman.khandual@....com, willy@...radead.org, ioworker0@...il.com,
 yang@...amperecomputing.com, baolin.wang@...ux.alibaba.com, ziy@...dia.com,
 hughd@...gle.com
Subject: Re: [PATCH v4 2/2] mm: Optimize mremap() by PTE batching

[...]

>>>    			}
>>> -			set_pte_at(mm, new_addr, new_ptep, pte);
>>> +			set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
>>
>>
>> What I dislike is that some paths work on a single PTE, and we implicitly have to know
>> that they don't apply for !pte_present.
> 
> I hate any kind of implicit knowledge like this.
> 
>>
>> Like
>> 	if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
> 
> I also despise [with words I cannot use on-list] how uffd is implemented.
> 
> It's _nothing but_ ad-hoc stuff like this spawned all around the place.
> 
> It's hateful.
> 
>>
>> Will not get batched yet. And that is hidden inside the pte_marker_uffd_wp check ...
>>
>> Should we properly separate both paths (present vs. !present), and while at it, do
>> some more cleanups? I'm thinking of the following on top (only compile-tested)
> 
> I'd like to see that, but I think maybe better as a follow up series?

I'd do the split as a cleanup patch upfront.

Then add the batching for the pte_present() case.

> 
> On the other hand, this does improve this quite a bit. Could also be another
> patch in the series.
> 
>>
>>
>> diff --git a/mm/mremap.c b/mm/mremap.c
>> index 18b215521adae..b88abf02b34e0 100644
>> --- a/mm/mremap.c
>> +++ b/mm/mremap.c
>> @@ -155,21 +155,6 @@ static void drop_rmap_locks(struct vm_area_struct *vma)
>>                  i_mmap_unlock_write(vma->vm_file->f_mapping);
>>   }
>> -static pte_t move_soft_dirty_pte(pte_t pte)
>> -{
>> -       /*
>> -        * Set soft dirty bit so we can notice
>> -        * in userspace the ptes were moved.
>> -        */
>> -#ifdef CONFIG_MEM_SOFT_DIRTY
>> -       if (pte_present(pte))
>> -               pte = pte_mksoft_dirty(pte);
>> -       else if (is_swap_pte(pte))
>> -               pte = pte_swp_mksoft_dirty(pte);
>> -#endif
>> -       return pte;
>> -}
>> -
>>   static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
>>                  pte_t *ptep, pte_t pte, int max_nr)
>>   {
>> @@ -260,7 +245,6 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>                  VM_WARN_ON_ONCE(!pte_none(*new_ptep));
>>                  nr_ptes = 1;
>> -               max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
>>                  old_pte = ptep_get(old_ptep);
>>                  if (pte_none(old_pte))
>>                          continue;
>> @@ -277,24 +261,34 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>                   * flushed.
>>                   */
>>                  if (pte_present(old_pte)) {
>> +                       max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
>>                          nr_ptes = mremap_folio_pte_batch(vma, old_addr, old_ptep,
>>                                                           old_pte, max_nr_ptes);
>>                          force_flush = true;
>> -               }
>> -               pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);
>> -               pte = move_pte(pte, old_addr, new_addr);
>> -               pte = move_soft_dirty_pte(pte);
>> -
>> -               if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
>> -                       pte_clear(mm, new_addr, new_ptep);
>> -               else {
>> -                       if (need_clear_uffd_wp) {
>> -                               if (pte_present(pte))
>> -                                       pte = pte_clear_uffd_wp(pte);
>> -                               else if (is_swap_pte(pte))
>> +
>> +                       pte = get_and_clear_full_ptes(mm, old_addr, old_ptep,
>> +                                                     nr_ptes, 0);
>> +                       /*
>> +                        * Moving present PTEs requires special care on some
>> +                        * archs.
>> +                        */
>> +                       pte = move_pte(pte, old_addr, new_addr);
> 
> I guess we're good with only doing this in pte_present() case because the only
> arch that implements this, sparc, does a present check anyway.

Yes, we could then remove the call from pte_present(), see below.

> 
>> +                       /* make userspace aware that this pte moved. */
>> +                       pte = pte_mksoft_dirty(pte);
>> +                       if (need_clear_uffd_wp)
>> +                               pte = pte_clear_uffd_wp(pte);
>> +                       set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
>> +               } else if (need_clear_uffd_wp && pte_marker_uffd_wp(pte)) {
>> +                       pte_clear(mm, old_addr, old_ptep);
> 
> Same comment as below re: pte_clear().
> 
> I see you've dropped pte_clear(mm, new_addr, new_ptep) which I guess is
> purposefully?
> 
> I do think that it is pointless yes.

Yeah, that's what I raised below.

> 
>> +               } else {
>> +                       pte_clear(mm, old_addr, old_ptep);
> 
> I guess this is intended to replace ptep_get_and_clear_full_ptes() above in the
> single PTE case... no?  Is this sufficient?
> 
> In the original code we'd always do ptep_get_and_clear().
> 
> I think the key difference is page_table_check_pte_clear().
> 
> I notice, hilariously, that there is a ptep_clear() that _does_ call this. What
> a mess.

ptep_get_and_clear() is only relevant when something is present and 
could change concurrently (especially A/D bits managed by HW).

We already obtained the pte, it's not present, and now just want to 
clear it.

> 
> 
>> +                       if (is_swap_pte(pte)) {
>> +                               if (need_clear_uffd_wp)
>>                                          pte = pte_swp_clear_uffd_wp(pte);
>> +                               /* make userspace aware that this pte moved. */
>> +                               pte = pte_swp_mksoft_dirty(pte);
>>                          }
>> -                       set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
>> +                       set_pte_at(mm, new_addr, new_ptep, pte);
>>                  }
>>          }
>>
>>
>> Note that I don't know why we had the existing
>>
>> -               if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
>> -                       pte_clear(mm, new_addr, new_ptep);
>>
>>
>> I thought we would always expect that the destination pte is already pte_none() ?
> 
> I think this is because we already did the move_pte() call in the original code
> before checking this:

Oh, maybe that's why.

> 
> 		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);
> 		pte = move_pte(pte, old_addr, new_addr);
> 		pte = move_soft_dirty_pte(pte);
> 
> 		if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
> 			pte_clear(mm, new_addr, new_ptep);
> 
> But maybe it's because there was a presumption move_pte() would like you know,
> move a PTE entry? Which it doesn't, it - only on SPARC - does a hook to flush
> the dcache.

I wish we could remove move_pte() completely. Or just notify about the 
move of a present pte ... because it doesn't ever modify the pte val.

So renaming it to "arch_notify_move_pte()" or sth. like that that is a 
void function might even be better.

... I think we could go one step further if we already have the folio: 
we could call it arch_notify_move_folio_pte(), and simplify the sparc 
implementation ...

Anyhow, the move_pte() cleanup can be done separately. Splitting the 
present from !present case here should probably be done as a cleanup 
upfront.

-- 
Cheers,

David / dhildenb


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ