lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <ff024919-a58e-419c-a03d-73da6505caf8@redhat.com>
Date: Tue, 1 Apr 2025 21:33:06 +0200
From: David Hildenbrand <david@...hat.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: Jann Horn <jannh@...gle.com>, Andrew Morton <akpm@...ux-foundation.org>,
 Vlastimil Babka <vbabka@...e.cz>, "Liam R . Howlett"
 <Liam.Howlett@...cle.com>, Suren Baghdasaryan <surenb@...gle.com>,
 Matthew Wilcox <willy@...radead.org>, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 1/7] mm/mremap: introduce more mergeable mremap via
 MREMAP_RELOCATE_ANON

On 31.03.25 16:50, Lorenzo Stoakes wrote:
> On Sun, Mar 23, 2025 at 01:49:07PM +0100, David Hildenbrand wrote:
>>>>
>>>> c)  In -next, there is now be the option to use folio lock +
>>>> folio_maybe_mapped_shared() == false. But it doesn't tell you into how many
>>>> VMAs a large folio is mapped into.
>>>>
>>>> In the following case:
>>>>
>>>> [       folio     ]
>>>> [ VMA#1 ] [ VMA#2 ]
>>>>
>>>> c) would not tell you if you are fine modifying the folio when moving VMA#2.
>>>
>>> Right, I feel like prior checks made should assert this is not the case,
>>> however?  But mapcount check should be a last ditch assurance?
>>
>> Something nice might be hiding in c) that might be able to handle a single
>> folio being covered by multiple vmas.
>>
>> I was thinking about the following:
>>
>> [       folio0     ]
>> [       VMA#0      ]
>>
>> Then we do a partial (old-school) mremap()
>>
>> [ folio0 ]               [ folio0 ]
>> [ VMA#1  ]               [ VMA#2  ]
>>
>> To then extend VMA#1 and fault in pages
>>
>> [ folio0 ][ folio1 ]         [ folio0 ]
>> [      VMA#1       ]         [ VMA#2  ]
>>
>> If that is possible (did not try!, maybe something prevents us from
>> extending VMA#1) mremap(MREMAP_RELOCATE_ANON) of VMA#1  / VMA#2 cannot work.
>>
>> We'd have to detect that scenario (partial mremap). You might be doing that
>> with the anon-vma magic, something different might be: Assume we flag large
>> folios if they were partially mremapped in any process.
> 
> Do we have spare folio flags? :)) I always lose track of the situation with this
> and Matthew's levels of tolerance for it :P

For large folios yes (currently stored in page[1].flags)!

> 
>>
>> Then (with folio lock only)
>>
>> 1) folio_maybe_mapped_shared() == false: mapped into single process
>> 2) folio_maybe_partially_mremaped() == false: not scattered in virtual
>>     address space
>>
>> It would be sufficient to check if the folio fully falls into the memap()
>> range to decide if we can adjust the folio index etc.
>>
>> We *might* be able to use that in the COW-reuse path for large folios to
>> perform a folio_move_anon_rmap(), which we currently only perform for small
>> folios / PMD-mapped folios (single mapping). Not sure yet if actually
>> multiple VMAs are involved.
> 
> Interesting... this is the wp_can_reuse_anon_folio() stuff? I'll have a look
> into that!

Yes. For small folios / PMD-mapped THPs it's easy. I readded it in

Author: David Hildenbrand <david@...hat.com>
Date:   Mon May 9 18:20:43 2022 -0700

     mm/rmap: use page_move_anon_rmap() when reusing a mapped PageAnon() page exclusively
     

We'd need something along the for PTE-mapped large folios like (todo):

diff --git a/mm/memory.c b/mm/memory.c
index a838c8c44bfdc..df18fcfe29aab 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3793,6 +3793,15 @@ static bool __wp_can_reuse_large_anon_folio(struct folio *folio,
         exclusive = true;
  unlock:
         folio_unlock_large_mapcount(folio);
+
+       if (exclusive && (folio->mapping == vma->anon_vma | PAGE_MAPPING_ANON) &&
+           todo_check_folio_in_vma(folio, vma) &&
+           folio_trylock(folio)) {
+               folio_move_anon_rmap(folio, vma);
+               folio_unlock(folio);
+       }
+out:
         return exclusive;
  }

Whereby a "partially mremapped" indication could come in handy.

Note that I never learned how important that optimization actually is in practice, so
I didn't care too much for now for PTE-mapped large folios.

> 
> I'm concerned about partial cases moreso though, e.g.:
> 
>       mremap this
>      <----------->
> [       folio0     ]
> [       VMA#0      ]

Right, there is no easy way to not split I think. We should just try to split and if it fails, too bad.

> 
> I mean, I'm leaning more towards just breaking up the folio, especialy if we
> consider a case like a biiig range:
> 
>                         mremap this
>      <--------------------------------------------------->
> [ folio0 ][ folio1 ][ folio2 ][ folio3 ][ folio4 ][ folio5 ] (say order-9 each)
> [                           VMA#0                          ]
> 
> Then at this point, refusing to do the whole thing seems maybe a bad idea, at
> which point splitting the folios for folio0, 5 might be sensible.
> 
> I guess a user is saying 'please, I really care about merging' so might well be
> willing to tolerate losing some of the huge page benefits, at least at the edges
> here.

Yes. In uffdio_move we simply split all PTE-mapped large folios, because this way the
folio_move_anon_rmap() is way easier to handle (and we could always try optimizing it later
if really required). Once we have khugeapged with mthp support in place, it might just fix
it up for us later.

-- 
Cheers,

David / dhildenb


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ