Message-ID: <d63d00f2-2302-44ca-a6b1-b25d6d163628@lucifer.local>
Date: Thu, 29 May 2025 16:39:09 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Vlastimil Babka <vbabka@...e.cz>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Alexander Viro <viro@...iv.linux.org.uk>,
Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>,
"Liam R . Howlett" <Liam.Howlett@...cle.com>,
Jann Horn <jannh@...gle.com>, Pedro Falcato <pfalcato@...e.de>,
David Hildenbrand <david@...hat.com>, Xu Xin <xu.xin16@....com.cn>,
Chengming Zhou <chengming.zhou@...ux.dev>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
Stefan Roesch <shr@...kernel.io>
Subject: Re: [PATCH v2 3/4] mm: prevent KSM from completely breaking VMA
merging
On Thu, May 29, 2025 at 04:50:16PM +0200, Vlastimil Babka wrote:
> On 5/21/25 20:20, Lorenzo Stoakes wrote:
> > If a user wishes to enable KSM mergeability for an entire process and all
> > fork/exec'd processes that come after it, they use the prctl()
> > PR_SET_MEMORY_MERGE operation.
> >
> > This defaults all newly mapped VMAs to have the VM_MERGEABLE VMA flag set
> > (in order to indicate they are KSM mergeable), as well as setting this flag
> > for all existing VMAs.
> >
> > However it also entirely and completely breaks VMA merging for the process
> > and all forked (and fork/exec'd) processes.
>
> I think merging due to e.g. mprotect() should still work, but for new VMAs,
> yeah.
Yes for new VMAs. I'll update the cover letter subject line + commit
messages accordingly... I may have over-egged the pudding, but it's still
really serious.
But you're right, this is misleading - it's not _all_ merging, it's just
_all merging of new mappings, ever_. Which is still, you know, not great :)
>
> > This is because when a new mapping is proposed, the flags specified will
> > never have VM_MERGEABLE set. However all adjacent VMAs will already have
> > VM_MERGEABLE set, rendering VMAs unmergeable by default.
> >
> > To work around this, we try to set the VM_MERGEABLE flag prior to
> > attempting a merge. In the case of brk() this can always be done.
> >
> > However on mmap() things are more complicated - while KSM is not supported
> > for file-backed mappings, it is supported for MAP_PRIVATE file-backed
>
> ^ insert "shared" to make it obvious?
Good spot, this is confusing as-is, will fixup on respin.
>
> > mappings.
> >
> > And these mappings may have deprecated .mmap() callbacks specified which
> > could, in theory, adjust flags and thus KSM eligibility.
>
> Right, however your can_set_ksm_flags_early() isn't testing exactly that?
> More on that there.
It's testing to see whether we are in a known case where we can go ahead
and set VM_MERGEABLE, because either we know .mmap() can't change _KSM_
merge eligibility, or it won't be invoked so can't hurt us.
Realistically almost certainly the only cases where this applies are ones
where VM_PFNMAP + friends are set, which a bunch of drivers do.
David actually proposes to stop disallowing this for KSM, at which point we
can drop this function anyway.
But that's best done as a follow-up.
>
> > This is unlikely to cause an issue on merge, as any adjacent file-backed
> > mappings would already have the same post-.mmap() callback attributes, and
> > thus would naturally not be merged.
>
> I'm getting a bit lost as two kinds of merging have to be discussed. If the
> VMAs around have the same attributes, they would be VMA-merged, no?
The overloading of this term is very annoying.
But yeah I need to drop this bit, the VMA mergeability isn't really
applicable - I'll explain why...
My concern was that you'd set VM_MERGEABLE, then attempt a merge, and get
merged with an adjacent VMA.
But _later_ the .mmap() hook, had the merge not occurred, would have set
some flags that would have made the prior merge invalid (oopsy!).
However this isn't correct.
The vma->vm_file would need to be the same for both, and therefore any
adjacent VMA would already have had .mmap() called and had its VMA flags
changed.
And therefore TL;DR I should drop this bit from the commit message...
>
> > But for the purposes of establishing a VMA as KSM-eligible (as well as
> > initially scanning the VMA), this is potentially very problematic.
>
> This part I understand as we have to check if we can add VM_MERGEABLE after
> mmap() has adjusted the flags, as it might have an effect on the result of
> ksm_compatible()?
Yes.
>
> > So we check to determine whether this at all possible. If not, we set
> > VM_MERGEABLE prior to the merge attempt on mmap(), otherwise we retain the
> > previous behaviour.
> >
> > When .mmap_prepare() is more widely used, we can remove this precaution.
> >
> > While this doesn't quite cover all cases, it covers a great many (all
> > anonymous memory, for instance), meaning we should already see a
> > significant improvement in VMA mergeability.
> >
> > When it comes to file-backed mappings (other than shmem), we are really
> > only interested in MAP_PRIVATE mappings, which have an available anon
> > page by default. The VM_SPECIAL restriction therefore makes less sense
> > for KSM.
> >
> > In a future series we therefore intend to remove this limitation, which
> > ought to simplify this implementation. However it makes sense to defer
> > doing so until a later stage so we can first address this mergeability
> > issue.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
> > Fixes: d7597f59d1d3 ("mm: add new api to enable ksm per process") # please no backport!
> > Reviewed-by: Chengming Zhou <chengming.zhou@...ux.dev>
>
> <snip>
>
> > +/*
> > + * Are we guaranteed no driver can change state such as to preclude KSM merging?
> > + * If so, let's set the KSM mergeable flag early so we don't break VMA merging.
> > + *
> > + * This is applicable when PR_SET_MEMORY_MERGE has been set on the mm_struct via
> > + * prctl() causing newly mapped VMAs to have the KSM mergeable VMA flag set.
> > + *
> > + * If this is not the case, then we set the flag after considering mergeability,
>
> ^ "VMA"
>
> > + * which will prevent mergeability as, when PR_SET_MEMORY_MERGE is set, a new
>
> ^ "VMA"
>
> > + * VMA will not have the KSM mergeability VMA flag set, but all other VMAs will,
> > + * preventing any merge.
>
> ^ "VMA"
>
> tedious I know, but more obvious, IMHO
Ack will fixup.
>
> > + */
> > +static bool can_set_ksm_flags_early(struct mmap_state *map)
> > +{
> > + struct file *file = map->file;
> > +
> > + /* Anonymous mappings have no driver which can change them. */
> > + if (!file)
> > + return true;
> > +
> > + /* shmem is safe. */
> > + if (shmem_file(file))
> > + return true;
> > +
> > + /*
> > + * If .mmap_prepare() is specified, then the driver will have already
> > + * manipulated state prior to updating KSM flags.
> > + */
> > + if (file->f_op->mmap_prepare)
> > + return true;
> > +
> > + return false;
>
> So back to my reply in the commit log, why test for mmap_prepare and
> otherwise assume false, and not instead test for f_op->mmap which would
> result in false, and otherwise return true? Or am I assuming wrong that
> there are f_ops that have neither of those two callbacks?
Because shmem has .mmap() set but we know it's safe.
I mean, we should probably put the mmap_prepare check before shmem_file()
to make things clearer.
We plan to drop this function soon (see above) anyway. Just being mighty
cautious.
>
> > +}
> > +
> > static unsigned long __mmap_region(struct file *file, unsigned long addr,
> > unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
> > struct list_head *uf)
> > @@ -2595,6 +2633,7 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
> > bool have_mmap_prepare = file && file->f_op->mmap_prepare;
> > VMA_ITERATOR(vmi, mm, addr);
> > MMAP_STATE(map, mm, &vmi, addr, len, pgoff, vm_flags, file);
> > + bool check_ksm_early = can_set_ksm_flags_early(&map);
> >
> > error = __mmap_prepare(&map, uf);
> > if (!error && have_mmap_prepare)
> > @@ -2602,6 +2641,9 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
> > if (error)
> > goto abort_munmap;
> >
> > + if (check_ksm_early)
> > + update_ksm_flags(&map);
> > +
> > /* Attempt to merge with adjacent VMAs... */
> > if (map.prev || map.next) {
> > VMG_MMAP_STATE(vmg, &map, /* vma = */ NULL);
> > @@ -2611,6 +2653,9 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
> >
> > /* ...but if we can't, allocate a new VMA. */
> > if (!vma) {
> > + if (!check_ksm_early)
> > + update_ksm_flags(&map);
> > +
> > error = __mmap_new_vma(&map, &vma);
> > if (error)
> > goto unacct_error;
> > @@ -2713,6 +2758,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > * Note: This happens *after* clearing old mappings in some code paths.
> > */
> > flags |= VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags;
> > + flags = ksm_vma_flags(mm, NULL, flags);
> > if (!may_expand_vm(mm, flags, len >> PAGE_SHIFT))
> > return -ENOMEM;
> >
> > @@ -2756,7 +2802,6 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >
> > mm->map_count++;
> > validate_mm(mm);
> > - ksm_add_vma(vma);
> > out:
> > perf_event_mmap(vma);
> > mm->total_vm += len >> PAGE_SHIFT;
> > --
> > 2.49.0
>