lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5604190c-3309-4cb8-b746-2301615d933c@lucifer.local>
Date: Tue, 20 May 2025 19:45:43 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Shakeel Butt <shakeel.butt@...ux.dev>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
        "Liam R . Howlett" <Liam.Howlett@...cle.com>,
        David Hildenbrand <david@...hat.com>, Vlastimil Babka <vbabka@...e.cz>,
        Jann Horn <jannh@...gle.com>, Arnd Bergmann <arnd@...db.de>,
        Christian Brauner <brauner@...nel.org>, linux-mm@...ck.org,
        linux-arch@...r.kernel.org, linux-kernel@...r.kernel.org,
        SeongJae Park <sj@...nel.org>, Usama Arif <usamaarif642@...il.com>
Subject: Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

On Tue, May 20, 2025 at 11:25:05AM -0700, Shakeel Butt wrote:
> On Mon, May 19, 2025 at 09:52:37PM +0100, Lorenzo Stoakes wrote:
> > REVIEWERS NOTES:
> > ================
> >
> > This is a VERY EARLY version of the idea, it's relatively untested, and I'm
> > 'putting it out there' for feedback. Any serious version of this will add a
> > bunch of self-tests to assert correct behaviour and I will more carefully
> > confirm everything's working.
> >
> > This is based on discussion arising from Usama's series [0], SJ's input on
> > the thread around process_madvise() behaviour [1] (and a subsequent
> > response by me [2]) and prior discussion about a new madvise() interface
> > [3].
> >
> > [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/
> > [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/
> > [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/
> > [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
> >
> > ================
> >
> > Currently, we are rather restricted in how madvise() operations
> > proceed. While effort has been put in to expanding what process_madvise()
> > can do (that is - unrestricted application of advice to the local process
> > alongside recent improvements on the efficiency of TLB operations over
> > these batvches), we are still constrained by existing madvise() limitations
> > and default behaviours.
> >
> > This series makes use of the currently unused flags field in
> > process_madvise() to provide more flexiblity.
> >
> > It introduces four flags:
> >
> > 1. PMADV_SKIP_ERRORS
> >
> > Currently, when an error arises applying advice in any individual VMA
> > (keeping in mind that a range specified to madvise() or as part of the
> > iovec passed to process_madvise()), the operation stops where it is and
> > returns an error.
> >
> > This might not be the desired behaviour of the user, who may wish instead
> > for the operation to be 'best effort'. By setting this flag, that behaviour
> > is obtained.
> >
> > Since process_madvise() would trivially, if skipping errors, simply return
> > the input vector size, we instead return the number of entries in the
> > vector which completed successfully without error.
> >
> > The PMADV_SKIP_ERRORS flag implicitly implies PMADV_NO_ERROR_ON_UNMAPPED.
> >
> > 2. PMADV_NO_ERROR_ON_UNMAPPED
> >
> > Currently madvise() has the peculiar behaviour of, if the range specified
> > to it contains unmapped range(s), completing the full operation, but
> > ultimately returning -ENOMEM.
> >
> > In the case of process_madvise(), this is fatal, as the operation will stop
> > immediately upon this occurring.
> >
> > By setting PMADV_NO_ERROR_ON_UNMAPPED, the user can indicate that it wishes
> > unmapped areas to simply be entirely ignored.
>
> Why do we need PMADV_NO_ERROR_ON_UNMAPPED explicitly and why
> PMADV_SKIP_ERRORS is not enough? I don't see a need for
> PMADV_NO_ERROR_ON_UNMAPPED. Do you envision a use-case where
> PMADV_NO_ERROR_ON_UNMAPPED makes more sense than PMADV_SKIP_ERRORS?

I thought I already explained this above:

	"In the case of process_madvise(), this is fatal, as the operation
	 will stop immediately upon this occurring."

This is somewhat bizarre behaviour. You specify multiple vector entries
spanning different ranges, but one spans some unmapped space and now the
whole process_madvise() operation grinds to a halt, except the vector entry
containing ranges including unmapped space is completed.

This is strange behaviour, and it makes sense to me to optionally disable
this.

If you were looping around doing an madvise(), this would _not_ occur, you
could filter out the -ENOMEM's. It's a really weird peculiarity in
process_madvise().

Moreover, you might not want an error reported, that possibly duplicates
_real_ -ENOMEM errors, when you simply encounter unmapped addresses.

Finally, if you perform an operation across the entire address space as
proposed you may wish to stop on actual error but not on the (inevitable at
least in 64-bit space) gaps you'll encounter.

>
> >
> > 3. PMADV_SET_FORK_EXEC_DEFAULT
> >
> > It may be desirable for a user to specify that all VMAs mapped in a process
> > address space default to having an madvise() behaviour established by
> > default, in such a fashion as that this persists across fork/exec.
> >
> > Since this is a very powerful option that would make no sense for many
> > advice modes, we explicitly only permit known-safe flags here (currently
> > MADV_HUGEPAGE and MADV_NOHUGEPAGE only).
>
> Other flags seems general enough but this one is just weird. This is
> exactly the scenario for prctl() like interface. You are trying to make
> process_madvise() like prctl() and I can see process_madvise() would be
> included in all the hate that prctl() receives.

I'm really not sure what you mean. prctl() has no rhyme nor reason, so not
sure what a 'prctl() like interface' means here, and you're not explaining
what you mean by that.

Presumably you mean you find this odd as you feel it sits outside the realm
of madvise() behaviour.

But I'd suggest it does not - the idea is to align _everything_ with
madvise(). Rather than adding an entirely arbitrary function in prctl(), we
are going the other way - keeping everything relating to madvise()-like
modification of memory _in mm_ and _in madvise()_, rather than bitrotting
away in kernel/sys.c.

So we get alignment in the fact that we're saying 'we establish a _default_
madvise flag for a process'.

We restrict initially to VM_HUGEPAGE and VM_NOHUGEPAGE to a. achieve what
you guys at meta want while also opening the door to doing so in future if
it makes sense to.

This couldn't be more different than putting some arbitrary code relating
to mm in the 'netherrealm' of prctl().


>
> Let me ask in a different way. Eventually we want to be in a state where
> hugepages works out of the box for all workloads. In that state what
> would the need for this flag unless you have use-cases other than
> hugepages. To me, there is a general consensus that prctl is a hacky
> interface, so having some intermediate solution through prctl until
> hugepages are good out of the box seems more reasonable.

No, fundamentally disagree. We already have MADV_[NO]HUGEPAGE. This will
have to be supported. In a future where things are automatic, these may be
no-ops in 'auto' mode.

But the requirement to have these flags will still exist, the requirement
to do madvise() operations will still exist, we're simply expanding this
functionality.

The problem arises the other way around when we shovel mm stuff in
kernel/sys.c.

>
> >
> > 4. PMADV_ENTIRE_ADDRESS_SPACE
> >
> > It can be annoying, should a user wish to apply madvise() to all VMAs in an
> > address space, to have to add a singular large entry to the input iovec.
> >
> > So provide sugar to permit this - PMADV_ENTIRE_ADDRESS_SPACE. If specified,
> > we expect the user to pass NULL and -1 to the vec and vlen parameters
> > respectively so they explicitly acknowledge that these will be ignored,
> > e.g.:
> >
> > 	process_madvise(PIDFD_SELF, NULL, -1, MADV_HUGEPAGE,
> > 			PMADV_ENTIRE_ADDRESS_SPACE | PMADV_SKIP_ERRORS);
> >
>
> I still don't see a need for this flag. Why not the following?
>
> process_madvise(PIDFD_SELF, NULL, -1, advise, PMADV_SKIP_ERRORS)?

I don't think iovec=NULL, vlen=-1 means 'everything' in any other interface
anywhere else in the kernel? Why would we reasonably expect anybody to
know/assume that here?

It's surely far better to very explicitly say as much? The idea with NULL,
-1 is that you're making sure the user knows any provided iovec, vlen will
be ignored, and better to just disallow the user providing those at all
(similar to mmap requiring -1 for fd for anon mappings).

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ