linux-kernel - Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <uxhvhja5igs5cau7tomk56wit65lh7ooq7i5xsdzyqsv5ikavm@kiwe26ioxl3t>
Date: Tue, 20 May 2025 12:49:24 -0700
From: Shakeel Butt <shakeel.butt@...ux.dev>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, 
	"Liam R . Howlett" <Liam.Howlett@...cle.com>, David Hildenbrand <david@...hat.com>, 
	Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>, Arnd Bergmann <arnd@...db.de>, 
	Christian Brauner <brauner@...nel.org>, linux-mm@...ck.org, linux-arch@...r.kernel.org, 
	linux-kernel@...r.kernel.org, SeongJae Park <sj@...nel.org>, 
	Usama Arif <usamaarif642@...il.com>
Subject: Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

On Tue, May 20, 2025 at 07:45:43PM +0100, Lorenzo Stoakes wrote:
> On Tue, May 20, 2025 at 11:25:05AM -0700, Shakeel Butt wrote:
> > On Mon, May 19, 2025 at 09:52:37PM +0100, Lorenzo Stoakes wrote:
> > > REVIEWERS NOTES:
> > > ================
> > >
> > > This is a VERY EARLY version of the idea, it's relatively untested, and I'm
> > > 'putting it out there' for feedback. Any serious version of this will add a
> > > bunch of self-tests to assert correct behaviour and I will more carefully
> > > confirm everything's working.
> > >
> > > This is based on discussion arising from Usama's series [0], SJ's input on
> > > the thread around process_madvise() behaviour [1] (and a subsequent
> > > response by me [2]) and prior discussion about a new madvise() interface
> > > [3].
> > >
> > > [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/
> > > [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/
> > > [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/
> > > [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
> > >
> > > ================
> > >
> > > Currently, we are rather restricted in how madvise() operations
> > > proceed. While effort has been put in to expanding what process_madvise()
> > > can do (that is - unrestricted application of advice to the local process
> > > alongside recent improvements on the efficiency of TLB operations over
> > > these batvches), we are still constrained by existing madvise() limitations
> > > and default behaviours.
> > >
> > > This series makes use of the currently unused flags field in
> > > process_madvise() to provide more flexiblity.
> > >
> > > It introduces four flags:
> > >
> > > 1. PMADV_SKIP_ERRORS
> > >
> > > Currently, when an error arises applying advice in any individual VMA
> > > (keeping in mind that a range specified to madvise() or as part of the
> > > iovec passed to process_madvise()), the operation stops where it is and
> > > returns an error.
> > >
> > > This might not be the desired behaviour of the user, who may wish instead
> > > for the operation to be 'best effort'. By setting this flag, that behaviour
> > > is obtained.
> > >
> > > Since process_madvise() would trivially, if skipping errors, simply return
> > > the input vector size, we instead return the number of entries in the
> > > vector which completed successfully without error.
> > >
> > > The PMADV_SKIP_ERRORS flag implicitly implies PMADV_NO_ERROR_ON_UNMAPPED.
> > >
> > > 2. PMADV_NO_ERROR_ON_UNMAPPED
> > >
> > > Currently madvise() has the peculiar behaviour of, if the range specified
> > > to it contains unmapped range(s), completing the full operation, but
> > > ultimately returning -ENOMEM.
> > >
> > > In the case of process_madvise(), this is fatal, as the operation will stop
> > > immediately upon this occurring.
> > >
> > > By setting PMADV_NO_ERROR_ON_UNMAPPED, the user can indicate that it wishes
> > > unmapped areas to simply be entirely ignored.
> >
> > Why do we need PMADV_NO_ERROR_ON_UNMAPPED explicitly and why
> > PMADV_SKIP_ERRORS is not enough? I don't see a need for
> > PMADV_NO_ERROR_ON_UNMAPPED. Do you envision a use-case where
> > PMADV_NO_ERROR_ON_UNMAPPED makes more sense than PMADV_SKIP_ERRORS?
> 
> I thought I already explained this above:
> 
> 	"In the case of process_madvise(), this is fatal, as the operation
> 	 will stop immediately upon this occurring."
> 
> This is somewhat bizarre behaviour. You specify multiple vector entries
> spanning different ranges, but one spans some unmapped space and now the
> whole process_madvise() operation grinds to a halt, except the vector entry
> containing ranges including unmapped space is completed.
> 
> This is strange behaviour, and it makes sense to me to optionally disable
> this.
> 
> If you were looping around doing an madvise(), this would _not_ occur, you
> could filter out the -ENOMEM's. It's a really weird peculiarity in
> process_madvise().
> 
> Moreover, you might not want an error reported, that possibly duplicates
> _real_ -ENOMEM errors, when you simply encounter unmapped addresses.
> 
> Finally, if you perform an operation across the entire address space as
> proposed you may wish to stop on actual error but not on the (inevitable at
> least in 64-bit space) gaps you'll encounter.

So, we *may* wish to stop on actual error, do you have a more concrete
example? We should not add an API on a case which may be needed. We can
always add stuff later when the actual concrete use-cases come up.

> 
> >
> > >
> > > 3. PMADV_SET_FORK_EXEC_DEFAULT
> > >
> > > It may be desirable for a user to specify that all VMAs mapped in a process
> > > address space default to having an madvise() behaviour established by
> > > default, in such a fashion as that this persists across fork/exec.
> > >
> > > Since this is a very powerful option that would make no sense for many
> > > advice modes, we explicitly only permit known-safe flags here (currently
> > > MADV_HUGEPAGE and MADV_NOHUGEPAGE only).
> >
> > Other flags seems general enough but this one is just weird. This is
> > exactly the scenario for prctl() like interface. You are trying to make
> > process_madvise() like prctl() and I can see process_madvise() would be
> > included in all the hate that prctl() receives.
> 
> I'm really not sure what you mean. prctl() has no rhyme nor reason, so not
> sure what a 'prctl() like interface' means here, and you're not explaining
> what you mean by that.

I meant it applies a property at the task or process level and has
examples where those properties are inherited to children.

> 
> Presumably you mean you find this odd as you feel it sits outside the realm
> of madvise() behaviour.

The persistence across exec seems weird.

> 
> But I'd suggest it does not - the idea is to align _everything_ with
> madvise(). Rather than adding an entirely arbitrary function in prctl(), we
> are going the other way - keeping everything relating to madvise()-like
> modification of memory _in mm_ and _in madvise()_, rather than bitrotting
> away in kernel/sys.c.

The above para seems like you are talking about code which can be moved
to mm.

> 
> So we get alignment in the fact that we're saying 'we establish a _default_
> madvise flag for a process'.

I think this is an important point. So, we want to introduce a way to
set a process level property which can be inherited through fork/exec.
With that in mind, is process_madvise() (or even madvise()) really a
good interface for it? There is no need for address ranges for such
use-case.

> 
> We restrict initially to VM_HUGEPAGE and VM_NOHUGEPAGE to a. achieve what
> you guys at meta want while also opening the door to doing so in future if
> it makes sense to.

Please drop the "you guys at meta". We should be aiming for what is good
for all (or most) linux users. Whatever is done here will be
incorporated in systemd which will be used very widely.

> 
> This couldn't be more different than putting some arbitrary code relating
> to mm in the 'netherrealm' of prctl().
> 
> 
> >
> > Let me ask in a different way. Eventually we want to be in a state where
> > hugepages works out of the box for all workloads. In that state what
> > would the need for this flag unless you have use-cases other than
> > hugepages. To me, there is a general consensus that prctl is a hacky
> > interface, so having some intermediate solution through prctl until
> > hugepages are good out of the box seems more reasonable.
> 
> No, fundamentally disagree. We already have MADV_[NO]HUGEPAGE. This will
> have to be supported. In a future where things are automatic, these may be
> no-ops in 'auto' mode.
> 
> But the requirement to have these flags will still exist, the requirement
> to do madvise() operations will still exist, we're simply expanding this
> functionality.
> 
> The problem arises the other way around when we shovel mm stuff in
> kernel/sys.c.

I think you mixing the location of the code and the interface which will
remain relevant long term. I don't see process_madvise (or madvise) good
interface for this specific functionality (i.e. process level property
that gets inherited through fork/exec). Now we can add a new syscall for
this but to me, particularly for hugepage, this functionality is needed
temporarily until hugepages are good out of the box. However if there is
a use-case other than hugepages then yes, why not a new syscall.