[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <38ed372a-4b27-498e-bb3b-f95792bbbe27@lucifer.local>
Date: Thu, 24 Jul 2025 20:07:20 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Jeff Xu <jeffxu@...omium.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
"Liam R . Howlett" <Liam.Howlett@...cle.com>,
David Hildenbrand <david@...hat.com>, Vlastimil Babka <vbabka@...e.cz>,
Jann Horn <jannh@...gle.com>, Pedro Falcato <pfalcato@...e.de>,
linux-mm@...ck.org, linux-kernel@...r.kernel.org,
Kees Cook <kees@...nel.org>, linux-hardening@...r.kernel.org
Subject: Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
On Thu, Jul 24, 2025 at 11:39:05AM -0700, Jeff Xu wrote:
> Hi Lorenzo,
>
> This change has two parts: a non-functional refactoring work of moving
> function from mseal.c to madvise.c, and a functional change to the
> behavior of madvise under mseal. Would you consider splitting the
> change into two parts? which seems to be a common practice at
> supplying patches in lkml.
No I won't do that.
it's a very small change, and it was intentionally bundled so we correct
the oddly implemented version while moving.
>
> On Wed, Jul 16, 2025 at 10:38 AM Lorenzo Stoakes
> <lorenzo.stoakes@...cle.com> wrote:
> >
> > The madvise() logic is inexplicably performed in mm/mseal.c - this ought
> > to be located in mm/madvise.c.
> >
> This is part one of a non-functional refactoring work to move a
> function from mseal.c to madvise.c.
>
> There are two main reasons why I initially wanted to keep all
> mseal-related functions in mseal.c:
>
> 1 Keeping all mseal related logic in mseal.c makes it easier for
> developers to find all the impacted areas of mseal.
That's very silly, sorry but no.
You're putting stuff entirely specific to madvise() away from madvise(),
to bundle things up for no really good reason.
Again, you have a desire to do things that are at odds with how everything
else in mm is done.
> 2 mseal is not supported in 32 bits, and mseal.c is excluded from 32
> bits build (see makefile).This would prevent accidentally using mseal
> in code paths shared between 32-bit and 64-bit architectures. It also
> avoids adding #Ifdef in the .c file, which is recommended by the mm
> coding standard (I received comments about avoiding #ifdef in .c in
> the past).
You're doing something strictly worse by putting madvise() stuff to bitrot
in another file.
It's always a trade-off.
There's no set 'coding standard', there's what maintainers accept.
>
> Point 2 can go aways if 32 bits mseal support is coming soon, same as
> my other comments for patch 1/5.
But that's not the reason, as commit message states.
>
> Point 1 remains. However, I want to focus on the functional change
> part of this patch, rather than the other aspects.
No point 1 is dimsissed as is point 2.
>
> > Additionally can_modify_vma_madv() is inconsistently named and, in
> > combination with is_ro_anon(), is very confusing logic.
> >
> > Put a static function in mm/madvise.c instead - can_madvise_modify() -
> > that spells out exactly what's happening. Also explicitly check for an
> > anon VMA.
> >
> > Also add commentary to explain what's going on.
> >
> > Essentially - we disallow discarding of data in mseal()'d mappings in
> > instances where the user couldn't otherwise write to that data.
> >
> > Shared mappings are always backed, so no discard will actually truly
> > discard the data. Read-only anonymous and MAP_PRIVATE file-backed
> > mappings are the ones we are interested in.
> >
> > We make a change to the logic here to correct a mistake - we must disallow
> > discard of read-only MAP_PRIVATE file-backed mappings, which previously we
> > were not.
> >
> > The justification for this change is to account for the case where:
> >
> > 1. A MAP_PRIVATE R/W file-backed mapping is established.
> > 2. The mapping is written to, which backs it with anonymous memory.
> > 3. The mapping is mprotect()'d read-only.
> > 4. The mapping is mseal()'d.
> >
> > If we were to now allow discard of this data, it would mean mseal() would
> > not prevent the unrecoverable discarding of data and it was thus violate
> > the semantics of sealed VMAs.
> >
> This is the functional change and the most important area that I would
> like to discuss in this patch series.
OK.
>
> Since Jann Horn first highlighted [1] the problematic behavior of
> destructive madvise for anonymous mapping during initial discussions
> of mseal(), the proper solution has been open to discussion and
> exploration. Linus Torvalds has expressed interest [2] in fixing this
> within madvise itself, without requiring mseal, and I copied it here
> for ease of reference:
>
> “Hmm. I actually would be happier if we just made that change in
> general. Maybe even without sealing, but I agree that it *definitely*
> makes sense in general as a sealing thing.”
>
> After mseal was merged, Pedro Falcato raised a valid concern regarding
> file-backed private mappings. The issue is that these mappings can be
> written to, and subsequently changed to read-only, which is precisely
> the problem this patch aims to fix. I attempted to address this in
> [3]. However, that patch was rejected due to the introduction of a new
> vm_flags, which was a valid rejection as it wasn't the ideal solution.
> Nevertheless, it sparked an interesting discussion, with me raising a
> point that userspace might use this feature to free up RAM for
> file-backed private mapping that is never written to, and input about
> this topic from Vlastimil Babka [4] is about MADV_COLD/MADV_PAGEOUT.
OK not sure what point you're making yet?
>
> A detail about userspace calling those madvise for file-backed private
> mapping. Previously, I added extra logging in the kernel, and observed
> many instances of those calls during Android phone startup, although
> I haven’t delved into the reason behind why user space calls those,
> e.g. if it is from an individual app or Android platform.
Hang on, sorry, are you saying that you intentionally allowed destruction
of mseal()'d VMAs to serve android?
I hope I'm misunderstanding you here.
Either way I don't know what your point is? Don't mseal them if you want to
perform destructive operations on them?
You have to argue as to why this change is not valid in _upstream_ linux.
>
> Incidentally, recently while I was studying selinux code, particularly
> exemod permission [5] , I learned that selinux utilizes vma->anon_vma
> to identify file-backed mappings that have been written to. Jann
> pointed out to me that the kernel creates a COW mapping when a private
> file-backed mapping is written. Now I'm wondering if this could be the
> answer to our problem. We could try having destructive madvise check
> vma->anon_vma and reject the call if it's true. I haven't had a chance
> to test this theory yet, though.
Umm what? Why? What are you talking about?
A MAP_PRIVATE mapping will not have VM_SHARED set. This is why I changed
the check to this.
I really don't understand what point you're trying to make here.
The check I've provided works correctly to disallow the previously
_incorrectly allowed_ MAP_PRIVATE mapping of a file-backed mapping.
This was something you missed, and is now corrected.
>
> To summarize all the discussion points so far:
> 1. It's questionable behavior for madvise to allow destructive
> behavior for read-only anonymous mappings, regardless of mseal state.
Umm, ok. Well maybe, but it's essentially uAPI at this point. This feels
irrelevant to this series.
> 2. We could potentially fix point 1 within madvise itself, without
> involving mseal, as Linus desires.
No, we will not do that.
> 3. Android userspace uses destructive madvise to free up RAM, but I
> need to take a closer look at the patterns and usage to understand why
> they do that.
OK so what?
> 4. We could ask applications to switch to non-destructive madvise,
> like MADV_COLD or MADV_PAGEOUT. Or, another option is that we could
> switch the kernel to use non-destructive madvise implicitly for
> destructive madvise in suitable situations.
Umm what? I don't understand your point.
> 5. We could investigate more based on vma->anon_vma
I have no idea what you mean by this. I am an rmap maintainer and have
worked extensively with anon_vma, what's the point exactly?
>
> Link: https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com/
> [1]
> Link: https://lore.kernel.org/lkml/CAHk-=wiVhHmnXviy1xqStLRozC4ziSugTk=1JOc8ORWd2_0h7g@mail.gmail.com/
> [2]
> Link: https://lore.kernel.org/all/20241017005105.3047458-2-jeffxu@chromium.org/
> [3]
> Link: https://lore.kernel.org/all/8f68ad82-2f60-49f8-b150-0cf183c9cc71@suse.cz/
> [4]
> Link: https://elixir.bootlin.com/linux/v6.15.7/source/security/selinux/hooks.c#L3878
> [5]
>
> > Finally, update the mseal tests, which were asserting previously that a
> > read-only MAP_PRIVATE file-backed mapping could be discarded.
> >
> The test you are updating is not intended for the scenario this patch
> is aimed to fix: i.e. the scenario:
> 1. A MAP_PRIVATE R/W file-backed mapping is established.
> 2. The mapping is written to, which backs it with anonymous memory.
> 3. The mapping is mprotect()'d read-only.
> 4. The mapping is mseal()'d.
>
> The test case doesn't include steps 1, 2, and 3, is it possible to
> keep the existing one and create a new test case? But I don't think
> that's the main discussion point. We can revisit test cases once we've
> fully discussed the design.
I do not know why you put this here. Can you please put review along side
the code you're reviewing.
You're making my life unnecessarily hard here.
But yes, you're right, I messed up the test here, I'll send a fix-patch.
Incidentally, it seems like the test _explicitly_ asserted that this
behaviour was the opposite of what it should be which makes me think again
this might be intentional... Concerning.
>
> Thanks and regards,
> -Jeff
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
> > Reviewed-by: Liam R. Howlett <Liam.Howlett@...cle.com>
> > Reviewed-by: Pedro Falcato <pfalcato@...e.de>
> > Acked-by: David Hildenbrand <david@...hat.com>
> > ---
> > mm/madvise.c | 63 ++++++++++++++++++++++++-
> > mm/mseal.c | 49 -------------------
> > mm/vma.h | 7 ---
> > tools/testing/selftests/mm/mseal_test.c | 3 +-
> > 4 files changed, 63 insertions(+), 59 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 2bf80989d5b6..dc3d8497b0f4 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -19,6 +19,7 @@
> > #include <linux/sched.h>
> > #include <linux/sched/mm.h>
> > #include <linux/mm_inline.h>
> > +#include <linux/mmu_context.h>
> > #include <linux/string.h>
> > #include <linux/uio.h>
> > #include <linux/ksm.h>
> > @@ -1255,6 +1256,66 @@ static long madvise_guard_remove(struct madvise_behavior *madv_behavior)
> > &guard_remove_walk_ops, NULL);
> > }
> >
> > +#ifdef CONFIG_64BIT
> > +/* Does the madvise operation result in discarding of mapped data? */
> > +static bool is_discard(int behavior)
> > +{
> > + switch (behavior) {
> > + case MADV_FREE:
> > + case MADV_DONTNEED:
> > + case MADV_DONTNEED_LOCKED:
> > + case MADV_REMOVE:
> > + case MADV_DONTFORK:
> > + case MADV_WIPEONFORK:
> > + case MADV_GUARD_INSTALL:
> > + return true;
> > + }
> > +
> > + return false;
> > +}
> > +
> > +/*
> > + * We are restricted from madvise()'ing mseal()'d VMAs only in very particular
> > + * circumstances - discarding of data from read-only anonymous SEALED mappings.
> > + *
> > + * This is because users cannot trivally discard data from these VMAs, and may
> > + * only do so via an appropriate madvise() call.
> > + */
> > +static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
> > +{
> > + struct vm_area_struct *vma = madv_behavior->vma;
> > +
> > + /* If the VMA isn't sealed we're good. */
> > + if (can_modify_vma(vma))
> > + return true;
> > +
> > + /* For a sealed VMA, we only care about discard operations. */
> > + if (!is_discard(madv_behavior->behavior))
> > + return true;
> > +
> > + /*
> > + * But shared mappings are fine, as dirty pages will be written to a
> > + * backing store regardless of discard.
> > + */
> > + if (vma->vm_flags & VM_SHARED)
> > + return true;
> > +
> > + /* If the user could write to the mapping anyway, then this is fine. */
> > + if ((vma->vm_flags & VM_WRITE) &&
> > + arch_vma_access_permitted(vma, /* write= */ true,
> > + /* execute= */ false, /* foreign= */ false))
> > + return true;
> > +
> > + /* Otherwise, we are not permitted to perform this operation. */
> > + return false;
> > +}
> > +#else
> > +static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
> > +{
> > + return true;
> > +}
> > +#endif
> > +
> > /*
> > * Apply an madvise behavior to a region of a vma. madvise_update_vma
> > * will handle splitting a vm area into separate areas, each area with its own
> > @@ -1268,7 +1329,7 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
> > struct madvise_behavior_range *range = &madv_behavior->range;
> > int error;
> >
> > - if (unlikely(!can_modify_vma_madv(madv_behavior->vma, behavior)))
> > + if (unlikely(!can_madvise_modify(madv_behavior)))
> > return -EPERM;
> >
> > switch (behavior) {
> > diff --git a/mm/mseal.c b/mm/mseal.c
> > index c27197ac04e8..1308e88ab184 100644
> > --- a/mm/mseal.c
> > +++ b/mm/mseal.c
> > @@ -11,7 +11,6 @@
> > #include <linux/mman.h>
> > #include <linux/mm.h>
> > #include <linux/mm_inline.h>
> > -#include <linux/mmu_context.h>
> > #include <linux/syscalls.h>
> > #include <linux/sched.h>
> > #include "internal.h"
> > @@ -21,54 +20,6 @@ static inline void set_vma_sealed(struct vm_area_struct *vma)
> > vm_flags_set(vma, VM_SEALED);
> > }
> >
> > -static bool is_madv_discard(int behavior)
> > -{
> > - switch (behavior) {
> > - case MADV_FREE:
> > - case MADV_DONTNEED:
> > - case MADV_DONTNEED_LOCKED:
> > - case MADV_REMOVE:
> > - case MADV_DONTFORK:
> > - case MADV_WIPEONFORK:
> > - case MADV_GUARD_INSTALL:
> > - return true;
> > - }
> > -
> > - return false;
> > -}
> > -
> > -static bool is_ro_anon(struct vm_area_struct *vma)
> > -{
> > - /* check anonymous mapping. */
> > - if (vma->vm_file || vma->vm_flags & VM_SHARED)
> > - return false;
> > -
> > - /*
> > - * check for non-writable:
> > - * PROT=RO or PKRU is not writeable.
> > - */
> > - if (!(vma->vm_flags & VM_WRITE) ||
> > - !arch_vma_access_permitted(vma, true, false, false))
> > - return true;
> > -
> > - return false;
> > -}
> > -
> > -/*
> > - * Check if a vma is allowed to be modified by madvise.
> > - */
> > -bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior)
> > -{
> > - if (!is_madv_discard(behavior))
> > - return true;
> > -
> > - if (unlikely(!can_modify_vma(vma) && is_ro_anon(vma)))
> > - return false;
> > -
> > - /* Allow by default. */
> > - return true;
> > -}
> > -
> > static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > struct vm_area_struct **prev, unsigned long start,
> > unsigned long end, vm_flags_t newflags)
> > diff --git a/mm/vma.h b/mm/vma.h
> > index acdcc515c459..85db5e880fcc 100644
> > --- a/mm/vma.h
> > +++ b/mm/vma.h
> > @@ -577,8 +577,6 @@ static inline bool can_modify_vma(struct vm_area_struct *vma)
> > return true;
> > }
> >
> > -bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior);
> > -
> > #else
> >
> > static inline bool can_modify_vma(struct vm_area_struct *vma)
> > @@ -586,11 +584,6 @@ static inline bool can_modify_vma(struct vm_area_struct *vma)
> > return true;
> > }
> >
> > -static inline bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior)
> > -{
> > - return true;
> > -}
> > -
> > #endif
> >
> > #if defined(CONFIG_STACK_GROWSUP)
> > diff --git a/tools/testing/selftests/mm/mseal_test.c b/tools/testing/selftests/mm/mseal_test.c
> > index 005f29c86484..34c042da4de2 100644
> > --- a/tools/testing/selftests/mm/mseal_test.c
> > +++ b/tools/testing/selftests/mm/mseal_test.c
> > @@ -1712,7 +1712,6 @@ static void test_seal_discard_ro_anon_on_filebacked(bool seal)
> > unsigned long size = 4 * page_size;
> > int ret;
> > int fd;
> > - unsigned long mapflags = MAP_PRIVATE;
> >
> > fd = memfd_create("test", 0);
> > FAIL_TEST_IF_FALSE(fd > 0);
> > @@ -1720,7 +1719,7 @@ static void test_seal_discard_ro_anon_on_filebacked(bool seal)
> > ret = fallocate(fd, 0, 0, size);
> > FAIL_TEST_IF_FALSE(!ret);
> >
> > - ptr = mmap(NULL, size, PROT_READ, mapflags, fd, 0);
> > + ptr = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
> > FAIL_TEST_IF_FALSE(ptr != MAP_FAILED);
> >
> > if (seal) {
> > --
> > 2.50.1
> >
Powered by blists - more mailing lists