linux-kernel - Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <e8c459cb-c8b8-4c34-8f94-c8918bef582f@lucifer.local>
Date: Tue, 20 May 2025 21:39:37 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Shakeel Butt <shakeel.butt@...ux.dev>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
        "Liam R . Howlett" <Liam.Howlett@...cle.com>,
        David Hildenbrand <david@...hat.com>, Vlastimil Babka <vbabka@...e.cz>,
        Jann Horn <jannh@...gle.com>, Arnd Bergmann <arnd@...db.de>,
        Christian Brauner <brauner@...nel.org>, linux-mm@...ck.org,
        linux-arch@...r.kernel.org, linux-kernel@...r.kernel.org,
        SeongJae Park <sj@...nel.org>, Usama Arif <usamaarif642@...il.com>
Subject: Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

On Tue, May 20, 2025 at 12:49:24PM -0700, Shakeel Butt wrote:
> On Tue, May 20, 2025 at 07:45:43PM +0100, Lorenzo Stoakes wrote:
> > On Tue, May 20, 2025 at 11:25:05AM -0700, Shakeel Butt wrote:
> > > On Mon, May 19, 2025 at 09:52:37PM +0100, Lorenzo Stoakes wrote:
> > > > REVIEWERS NOTES:
> > > > ================
> > > >
> > > > This is a VERY EARLY version of the idea, it's relatively untested, and I'm
> > > > 'putting it out there' for feedback. Any serious version of this will add a
> > > > bunch of self-tests to assert correct behaviour and I will more carefully
> > > > confirm everything's working.
> > > >
> > > > This is based on discussion arising from Usama's series [0], SJ's input on
> > > > the thread around process_madvise() behaviour [1] (and a subsequent
> > > > response by me [2]) and prior discussion about a new madvise() interface
> > > > [3].
> > > >
> > > > [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/
> > > > [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/
> > > > [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/
> > > > [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
> > > >
> > > > ================
> > > >
> > > > Currently, we are rather restricted in how madvise() operations
> > > > proceed. While effort has been put in to expanding what process_madvise()
> > > > can do (that is - unrestricted application of advice to the local process
> > > > alongside recent improvements on the efficiency of TLB operations over
> > > > these batvches), we are still constrained by existing madvise() limitations
> > > > and default behaviours.
> > > >
> > > > This series makes use of the currently unused flags field in
> > > > process_madvise() to provide more flexiblity.
> > > >
> > > > It introduces four flags:
> > > >
> > > > 1. PMADV_SKIP_ERRORS
> > > >
> > > > Currently, when an error arises applying advice in any individual VMA
> > > > (keeping in mind that a range specified to madvise() or as part of the
> > > > iovec passed to process_madvise()), the operation stops where it is and
> > > > returns an error.
> > > >
> > > > This might not be the desired behaviour of the user, who may wish instead
> > > > for the operation to be 'best effort'. By setting this flag, that behaviour
> > > > is obtained.
> > > >
> > > > Since process_madvise() would trivially, if skipping errors, simply return
> > > > the input vector size, we instead return the number of entries in the
> > > > vector which completed successfully without error.
> > > >
> > > > The PMADV_SKIP_ERRORS flag implicitly implies PMADV_NO_ERROR_ON_UNMAPPED.
> > > >
> > > > 2. PMADV_NO_ERROR_ON_UNMAPPED
> > > >
> > > > Currently madvise() has the peculiar behaviour of, if the range specified
> > > > to it contains unmapped range(s), completing the full operation, but
> > > > ultimately returning -ENOMEM.
> > > >
> > > > In the case of process_madvise(), this is fatal, as the operation will stop
> > > > immediately upon this occurring.
> > > >
> > > > By setting PMADV_NO_ERROR_ON_UNMAPPED, the user can indicate that it wishes
> > > > unmapped areas to simply be entirely ignored.
> > >
> > > Why do we need PMADV_NO_ERROR_ON_UNMAPPED explicitly and why
> > > PMADV_SKIP_ERRORS is not enough? I don't see a need for
> > > PMADV_NO_ERROR_ON_UNMAPPED. Do you envision a use-case where
> > > PMADV_NO_ERROR_ON_UNMAPPED makes more sense than PMADV_SKIP_ERRORS?
> >
> > I thought I already explained this above:
> >
> > 	"In the case of process_madvise(), this is fatal, as the operation
> > 	 will stop immediately upon this occurring."
> >
> > This is somewhat bizarre behaviour. You specify multiple vector entries
> > spanning different ranges, but one spans some unmapped space and now the
> > whole process_madvise() operation grinds to a halt, except the vector entry
> > containing ranges including unmapped space is completed.
> >
> > This is strange behaviour, and it makes sense to me to optionally disable
> > this.
> >
> > If you were looping around doing an madvise(), this would _not_ occur, you
> > could filter out the -ENOMEM's. It's a really weird peculiarity in
> > process_madvise().
> >
> > Moreover, you might not want an error reported, that possibly duplicates
> > _real_ -ENOMEM errors, when you simply encounter unmapped addresses.
> >
> > Finally, if you perform an operation across the entire address space as
> > proposed you may wish to stop on actual error but not on the (inevitable at
> > least in 64-bit space) gaps you'll encounter.
>
> So, we *may* wish to stop on actual error, do you have a more concrete
> example? We should not add an API on a case which may be needed. We can
> always add stuff later when the actual concrete use-cases come up.

I feel like I just gave a concrete example?

It's useful to not have to be absolutely sure that the range specified
includes no unmapped ranges.

It's required for the MADV_[NO]HUGEPAGE use case proposed, specifically
applying the operation to the entire address space.

But I think it might make this a lot more concrete when I write tests - as
these will 'concretise' the interface and provide examples.

We can also choose to 'hide' this from users and add it back in as you say.

Perhaps worth just waiting for a respin with tests to see this in action.

>
> >
> > >
> > > >
> > > > 3. PMADV_SET_FORK_EXEC_DEFAULT
> > > >
> > > > It may be desirable for a user to specify that all VMAs mapped in a process
> > > > address space default to having an madvise() behaviour established by
> > > > default, in such a fashion as that this persists across fork/exec.
> > > >
> > > > Since this is a very powerful option that would make no sense for many
> > > > advice modes, we explicitly only permit known-safe flags here (currently
> > > > MADV_HUGEPAGE and MADV_NOHUGEPAGE only).
> > >
> > > Other flags seems general enough but this one is just weird. This is
> > > exactly the scenario for prctl() like interface. You are trying to make
> > > process_madvise() like prctl() and I can see process_madvise() would be
> > > included in all the hate that prctl() receives.
> >
> > I'm really not sure what you mean. prctl() has no rhyme nor reason, so not
> > sure what a 'prctl() like interface' means here, and you're not explaining
> > what you mean by that.
>
> I meant it applies a property at the task or process level and has
> examples where those properties are inherited to children.

But de-facto several prctl() interfaces do not do this, and mm interfaces
like mlockall() for instance do do this?

_In practice_ prctl() is a random hodge-podge of stuff, subject to bitrot.

>
> >
> > Presumably you mean you find this odd as you feel it sits outside the realm
> > of madvise() behaviour.
>
> The persistence across exec seems weird.

OK I'm not quite sure how to quantify 'weird'?

As I argue below, the idea here is we're doing 'madvise by default'. So you
can either have prctl() invoke madvise() for some stuff, and then establish
some 'madvise by default' logic, or we can do it the other way, by doing
_as much as possible_ madvise() stuff in madvise, and add the
default-across-exec there as a highly controlled, very clear flag.

I continue to believe the latter is cleaner, more maintainable, and less
subject to bitrot.

And I would argue invoking madvise() from prctl() is similarly odd (though
pretty much everything that happens in prctl() is, by de-facto definition,
sort of odd :)

>
> >
> > But I'd suggest it does not - the idea is to align _everything_ with
> > madvise(). Rather than adding an entirely arbitrary function in prctl(), we
> > are going the other way - keeping everything relating to madvise()-like
> > modification of memory _in mm_ and _in madvise()_, rather than bitrotting
> > away in kernel/sys.c.
>
> The above para seems like you are talking about code which can be moved
> to mm.
>
> >
> > So we get alignment in the fact that we're saying 'we establish a _default_
> > madvise flag for a process'.
>
> I think this is an important point. So, we want to introduce a way to
> set a process level property which can be inherited through fork/exec.
> With that in mind, is process_madvise() (or even madvise()) really a
> good interface for it? There is no need for address ranges for such
> use-case.
>
> >
> > We restrict initially to VM_HUGEPAGE and VM_NOHUGEPAGE to a. achieve what
> > you guys at meta want while also opening the door to doing so in future if
> > it makes sense to.
>
> Please drop the "you guys at meta". We should be aiming for what is good
> for all (or most) linux users. Whatever is done here will be
> incorporated in systemd which will be used very widely.

...!

I've spent several hours doing review of Usama's series and proposing this
idea precisely to serve the community. I would ask you to please respect
that.

The point I'm making here re: you guys, is that we can both serve the
community and solve your problem - because aiming at both is _the only way
this change will get merged_.

I am doing my absolute best to try to reach this end.

Re: systemd, I'm not sure what you mean - has there been an indication that
they will using this? I'm not sure they make use of every prctl() interface
do they?

>
> >
> > This couldn't be more different than putting some arbitrary code relating
> > to mm in the 'netherrealm' of prctl().
> >
> >
> > >
> > > Let me ask in a different way. Eventually we want to be in a state where
> > > hugepages works out of the box for all workloads. In that state what
> > > would the need for this flag unless you have use-cases other than
> > > hugepages. To me, there is a general consensus that prctl is a hacky
> > > interface, so having some intermediate solution through prctl until
> > > hugepages are good out of the box seems more reasonable.
> >
> > No, fundamentally disagree. We already have MADV_[NO]HUGEPAGE. This will
> > have to be supported. In a future where things are automatic, these may be
> > no-ops in 'auto' mode.
> >
> > But the requirement to have these flags will still exist, the requirement
> > to do madvise() operations will still exist, we're simply expanding this
> > functionality.
> >
> > The problem arises the other way around when we shovel mm stuff in
> > kernel/sys.c.
>
> I think you mixing the location of the code and the interface which will
> remain relevant long term. I don't see process_madvise (or madvise) good
> interface for this specific functionality (i.e. process level property
> that gets inherited through fork/exec). Now we can add a new syscall for
> this but to me, particularly for hugepage, this functionality is needed
> temporarily until hugepages are good out of the box. However if there is
> a use-case other than hugepages then yes, why not a new syscall.
>

I disagree on several levels. Firstly of course 'process' is a vague term
here, you mean address space, rather mm_struct.

And really you're saying 'it's ok to associate mm_strut and madvise
specific stuff with prctl() but not ok to associate mm_struct stuff with
madvise code' which not quite compelling to me.

Overall you can view this approach as - 'we are making an madvise() flag a
default, optionally'.

And this justifies us therefore doing this via an madvise() API call.

It also further allows us to expand the capabilities of madvise() for free
- we can address long-standing issues around what is possible with these
system calls while also providing this interface. The idea is it's win/win.

Otherwise we're simply doing a bunch of madvise() stuff from prctl() in the
same way that PR_SET_VMA_ANON_NAME for instance (hey - that's a VMA-level
feature! What's that doing in prctl() :) - where we have a bunch of mm code
living over there, and we have to export mm-internal stuff to kernel/sys.c
and it's a mess.

As to the temporary nature of this stuff, you seem to have disregarded what
I've said here rather in relation to the persistence of the
MADV_[NO]HUGEPAGE flags which are required as uAPI. They therefore must
remain, and thus I don't find the argument compelling re: prctl() being
better suited. It seems more likely things will bitrot there?