lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <b6ba1a37-e26a-47d8-aa56-91b8bbc70220@lucifer.local>
Date: Tue, 17 Jun 2025 10:52:14 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Harry Yoo <harry.yoo@...cle.com>
Cc: David Hildenbrand <david@...hat.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
        "Liam R . Howlett" <Liam.Howlett@...cle.com>,
        Suren Baghdasaryan <surenb@...gle.com>,
        Matthew Wilcox <willy@...radead.org>, Pedro Falcato <pfalcato@...e.de>,
        Rik van Riel <riel@...riel.com>, Zi Yan <ziy@...dia.com>,
        Baolin Wang <baolin.wang@...ux.alibaba.com>,
        Nico Pache <npache@...hat.com>, Ryan Roberts <ryan.roberts@....com>,
        Dev Jain <dev.jain@....com>, Jakub Matena <matenajakub@...il.com>,
        Wei Yang <richard.weiyang@...il.com>, Barry Song <baohua@...nel.org>,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 01/11] mm/mremap: introduce more mergeable mremap via
 MREMAP_RELOCATE_ANON

On Tue, Jun 17, 2025 at 03:37:52PM +0900, Harry Yoo wrote:
> On Mon, Jun 16, 2025 at 10:58:28PM +0200, David Hildenbrand wrote:
> > On 09.06.25 15:26, Lorenzo Stoakes wrote:
> > > When mremap() moves a mapping around in memory, it goes to great lengths to
> > > avoid having to walk page tables as this is expensive and
> > > time-consuming.
> > >
> > > Rather, if the VMA was faulted (that is vma->anon_vma != NULL), the virtual
> > > page offset stored in the VMA at vma->vm_pgoff will remain the same, as
> > > well all the folio indexes pointed at the associated anon_vma object.
> > >
> > > This means the VMA and page tables can simply be moved and this affects the
> > > change (and if we can move page tables at a higher page table level, this
> > > is even faster).
> > >
> > > While this is efficient, it does lead to big problems with VMA merging - in
> > > essence it causes faulted anonymous VMAs to not be mergeable under many
> > > circumstances once moved.
> > >
> > > This is limiting and leads to both a proliferation of unreclaimable,
> > > unmovable kernel metadata (VMAs, anon_vma's, anon_vma_chain's) and has an
> > > impact on further use of mremap(), which has a requirement that the VMA
> > > moved (which can also be a partial range within a VMA) may span only a
> > > single VMA.
> > >
> > > This makes the mergeability or not of VMAs in effect a uAPI concern.
> > >
> > > In some use cases, users may wish to accept the overhead of actually going
> > > to the trouble of updating VMAs and folios to affect mremap() moves. Let's
> > > provide them with the choice.
> > >
> > > This patch add a new MREMAP_RELOCATE_ANON flag to do just that, which
> > > attempts to perform such an operation. If it is unable to do so, it cleanly
> > > falls back to the usual method.
> > >
> > > It carefully takes the rmap locks such that at no time will a racing rmap
> > > user encounter incorrect or missing VMAs.
> > >
> > > It is also designed to interact cleanly with the existing mremap() error
> > > fallback mechanism (inverting the remap should the page table move fail).
> > >
> > > Also, if we could merge cleanly without such a change, we do so, avoiding
> > > the overhead of the operation if it is not required.
> > >
> > > In the instance that no merge may occur when the move is performed, we
> > > still perform the folio and VMA updates to ensure that future mremap() or
> > > mprotect() calls will result in merges.
> > >
> > > In this implementation, we simply give up if we encounter large folios. A
> > > subsequent commit will extend the functionality to allow for these cases.
> > >
> > > We restrict this flag to purely anonymous memory only.
> > >
> > > we separate out the vma_had_uncowed_parents() helper function for checking
> > > in should_relocate_anon() and introduce a new function
> > > vma_maybe_has_shared_anon_folios() which combines a check against this and
> > > any forked child anon_vma's.
> > >
> > > We carefully check for pinned folios in case a caller who holds a pin might
> > > make assumptions about index, mapping fields which we are about to
> > > manipulate.
> >
> > Som quick feedback, I did not yet digest everything.
> >
> > > @@ -1134,6 +1380,67 @@ static void unmap_source_vma(struct vma_remap_struct *vrm)
> > >   	pmc.new = new_vma;
> > > +	if (relocate_anon) {
> > > +		lock_new_anon_vma(new_vma);
> > > +		pmc.relocate_locked = new_vma;
> > > +
> > > +		if (!relocate_anon_folios(&pmc, /* undo= */false)) {
> > > +			unsigned long start = new_vma->vm_start;
> > > +			unsigned long size = new_vma->vm_end - start;
> > > +
> > > +			/* Undo if fails. */
> > > +			relocate_anon_folios(&pmc, /* undo= */true);
> >
> > You'd assume this cannot fail, but I think it can: imagine concurrent
> > GUP-fast ...
>
> Oops, that sounds really bad.

I don't think it's quite as bad as it sounds. Let's reserve judgment until we've
fully analysed this and considered different approaches :)

>
> > I really wish we can find a way to not require the fallback.
>
> Maybe split the VMA at the point where it fails, instead of undo?

I don't think this is actually possible without major rework as we've separated
the VMA and folio, page table parts of the operation.

Let me put thoughts on this in reply to David so we don't split conversation
(pun intended ;) I think we have other options also.

>
> --
> Cheers,
> Harry / Hyeonggon

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ