lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aFENQMQUDx4GeMuc@hyeyoo>
Date: Tue, 17 Jun 2025 15:37:52 +0900
From: Harry Yoo <harry.yoo@...cle.com>
To: David Hildenbrand <david@...hat.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
        "Liam R . Howlett" <Liam.Howlett@...cle.com>,
        Suren Baghdasaryan <surenb@...gle.com>,
        Matthew Wilcox <willy@...radead.org>, Pedro Falcato <pfalcato@...e.de>,
        Rik van Riel <riel@...riel.com>, Zi Yan <ziy@...dia.com>,
        Baolin Wang <baolin.wang@...ux.alibaba.com>,
        Nico Pache <npache@...hat.com>, Ryan Roberts <ryan.roberts@....com>,
        Dev Jain <dev.jain@....com>, Jakub Matena <matenajakub@...il.com>,
        Wei Yang <richard.weiyang@...il.com>, Barry Song <baohua@...nel.org>,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 01/11] mm/mremap: introduce more mergeable mremap via
 MREMAP_RELOCATE_ANON

On Mon, Jun 16, 2025 at 10:58:28PM +0200, David Hildenbrand wrote:
> On 09.06.25 15:26, Lorenzo Stoakes wrote:
> > When mremap() moves a mapping around in memory, it goes to great lengths to
> > avoid having to walk page tables as this is expensive and
> > time-consuming.
> > 
> > Rather, if the VMA was faulted (that is vma->anon_vma != NULL), the virtual
> > page offset stored in the VMA at vma->vm_pgoff will remain the same, as
> > well all the folio indexes pointed at the associated anon_vma object.
> > 
> > This means the VMA and page tables can simply be moved and this affects the
> > change (and if we can move page tables at a higher page table level, this
> > is even faster).
> > 
> > While this is efficient, it does lead to big problems with VMA merging - in
> > essence it causes faulted anonymous VMAs to not be mergeable under many
> > circumstances once moved.
> > 
> > This is limiting and leads to both a proliferation of unreclaimable,
> > unmovable kernel metadata (VMAs, anon_vma's, anon_vma_chain's) and has an
> > impact on further use of mremap(), which has a requirement that the VMA
> > moved (which can also be a partial range within a VMA) may span only a
> > single VMA.
> > 
> > This makes the mergeability or not of VMAs in effect a uAPI concern.
> > 
> > In some use cases, users may wish to accept the overhead of actually going
> > to the trouble of updating VMAs and folios to affect mremap() moves. Let's
> > provide them with the choice.
> > 
> > This patch add a new MREMAP_RELOCATE_ANON flag to do just that, which
> > attempts to perform such an operation. If it is unable to do so, it cleanly
> > falls back to the usual method.
> > 
> > It carefully takes the rmap locks such that at no time will a racing rmap
> > user encounter incorrect or missing VMAs.
> > 
> > It is also designed to interact cleanly with the existing mremap() error
> > fallback mechanism (inverting the remap should the page table move fail).
> > 
> > Also, if we could merge cleanly without such a change, we do so, avoiding
> > the overhead of the operation if it is not required.
> > 
> > In the instance that no merge may occur when the move is performed, we
> > still perform the folio and VMA updates to ensure that future mremap() or
> > mprotect() calls will result in merges.
> > 
> > In this implementation, we simply give up if we encounter large folios. A
> > subsequent commit will extend the functionality to allow for these cases.
> > 
> > We restrict this flag to purely anonymous memory only.
> > 
> > we separate out the vma_had_uncowed_parents() helper function for checking
> > in should_relocate_anon() and introduce a new function
> > vma_maybe_has_shared_anon_folios() which combines a check against this and
> > any forked child anon_vma's.
> > 
> > We carefully check for pinned folios in case a caller who holds a pin might
> > make assumptions about index, mapping fields which we are about to
> > manipulate.
> 
> Som quick feedback, I did not yet digest everything.
> 
> > @@ -1134,6 +1380,67 @@ static void unmap_source_vma(struct vma_remap_struct *vrm)
> >   	pmc.new = new_vma;
> > +	if (relocate_anon) {
> > +		lock_new_anon_vma(new_vma);
> > +		pmc.relocate_locked = new_vma;
> > +
> > +		if (!relocate_anon_folios(&pmc, /* undo= */false)) {
> > +			unsigned long start = new_vma->vm_start;
> > +			unsigned long size = new_vma->vm_end - start;
> > +
> > +			/* Undo if fails. */
> > +			relocate_anon_folios(&pmc, /* undo= */true);
> 
> You'd assume this cannot fail, but I think it can: imagine concurrent
> GUP-fast ...

Oops, that sounds really bad.

> I really wish we can find a way to not require the fallback.

Maybe split the VMA at the point where it fails, instead of undo?

-- 
Cheers,
Harry / Hyeonggon

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ