linux-kernel - Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8417a42c-102c-4f35-8461-842343d7df23@lucifer.local>
Date: Tue, 20 May 2025 20:21:21 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Usama Arif <usamaarif642@...il.com>
Cc: David Hildenbrand <david@...hat.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        "Liam R . Howlett" <Liam.Howlett@...cle.com>,
        Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
        Arnd Bergmann <arnd@...db.de>, Christian Brauner <brauner@...nel.org>,
        linux-mm@...ck.org, linux-arch@...r.kernel.org,
        linux-kernel@...r.kernel.org, SeongJae Park <sj@...nel.org>,
        Johannes Weiner <hannes@...xchg.org>,
        Shakeel Butt <shakeel.butt@...ux.dev>, Zi Yan <ziy@...dia.com>
Subject: Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour

On Tue, May 20, 2025 at 07:24:04PM +0100, Usama Arif wrote:
>
>
> On 20/05/2025 18:47, Lorenzo Stoakes wrote:
> > On Tue, May 20, 2025 at 05:28:35PM +0200, David Hildenbrand wrote:
> >> On 19.05.25 22:52, Lorenzo Stoakes wrote:
> >>> REVIEWERS NOTES:
> >>> ================
> >>>
> >>> This is a VERY EARLY version of the idea, it's relatively untested, and I'm
> >>> 'putting it out there' for feedback. Any serious version of this will add a
> >>> bunch of self-tests to assert correct behaviour and I will more carefully
> >>> confirm everything's working.
> >>>
> >>> This is based on discussion arising from Usama's series [0], SJ's input on
> >>> the thread around process_madvise() behaviour [1] (and a subsequent
> >>> response by me [2]) and prior discussion about a new madvise() interface
> >>> [3].
> >>>
> >>> [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/
> >>> [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/
> >>> [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/
> >>> [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
> >>>
> >>> ================
> >>>
> >>> Currently, we are rather restricted in how madvise() operations
> >>> proceed. While effort has been put in to expanding what process_madvise()
> >>> can do (that is - unrestricted application of advice to the local process
> >>> alongside recent improvements on the efficiency of TLB operations over
> >>> these batvches), we are still constrained by existing madvise() limitations
> >>> and default behaviours.
> >>>
> >>> This series makes use of the currently unused flags field in
> >>> process_madvise() to provide more flexiblity.
> >>>
> >>
> >> In general, sounds like an interesting approach.
> >
> > Thanks!
> >
> > If you agree this is workable, then I'll go ahead and put some more effort
> > into writing tests etc. on the next respin.
> >
>
> So the prctl and process_madvise patches both are trying to accomplish a
> similar end goal.
>
> Would it make sense to discuss what would be the best way forward before we
> continue developing the solutions? If we are not at that stage and a clear
> picture has not formed yet, happy to continue refining the solutions.
> But just thought I would check.

As stated in the thread (I went out of my way to link everything above),
and stated with you cc'd in every discussion (I think!), this idea arose as
a concept that came out of my brainstorming whether we could better align
this concept with madvise().

This arose because of discussions had in-thread where we agreed it made
most sense to have this feature in effect perform a 'default madvise()'.

At this point, we are essentially _duplicating what madvise does_ in the
prctl() with your approach, only now all of the code that does that is
bitrotting in kernel/sys.c.

This approach was an attempt to avoid that.

It started as a 'crazy idea' I was throwing out there, as an RFC. The idea
was we could compare and contrast with the prctl() RFC.

Obviously this is gaining some traction now as a concept and your respin
was a little rushed out so needs rework.

So, apologies if it seems this is an attempt to supercede your series or
such, it truly wasn't intended that way, rather I have been working 12 hour
days trying to find the best way possible to _make what you want happen_
while also doing what's best for mm and madvise (among many other tasks of
course :)

So the idea is we can:

a. solve long-standing problems with madvise() that prevent it being used
   for certain things (e.g. the error on gaps which breaks
   process_madvise() and hides real errors)

b. Also provide this functionality in a way that the absolute most minimal
   delta from existing logic...

c. ...While keeping all this logic in mm and avoiding bitrot in
   kernel/sys.c.

So obviously my view is that this approach is superior to the prctl() one.

However the intent was that you would probably take a little longer in
spinning up an RFC, and then we could compare the two approaches, if people
didn't reject my 'crazy idea' :)

Obviously you respan your (kinda ;) RFC way quicker than expected, and then
there were a ton of bugs, which added workload and made it perhaps a little
more pressing as to deciding which should be pursued.

I would suggest holding off on your series until we see whether this one
works on this basis. But obviously this is simply my point of view.

To be clear however, this was not a planned series of events...

I mean equally, we are seeing several other series from Yafang, Nico and
Dev in relation to [m]THP and even a obliquely-related series from Barry,
so it seems we are in contention across many planes here :)

>
> I feel like changing process_madvise which was designed to work on an array
> of iovec structures to have flags to skip errors and ignore the iovec
> makes it function similar to a prctl call is not the right approach.
> IMHO, prctl is a more direct solution to this.

I don't know what you mean by 'function similar to a prctl call', or what
you mean by it being a more direct solution.

The problem with prctl() is there is no pattern, it's a dumping ground for
arbitrary stuff and a prime place for bitrot. It also means we give up
maintainership of mm-specific code.

It encourages misalignment with other interfaces and APIs.

What is being discussed here is an effort to _better align what you want
with an existing interface_ - if we treat this as a 'default madvise()'
plus having additional madvise functionality (apply to all, ignoring errors
e.g.) - and put all this code _alongside existing madvise code_ - this
makes this vastly more maintainable, futureproof and robust.

And keep in mind the proposed flags are _flags_ not default behaviours,
ones which we will carefully restrict to known flags, starting with the
flags you guys need.

>
> I know that Lonenzo doesn't like prctl and wants to unify this in process_madvise.
> But if in the end, we want to have a THP auto way which is truly transparent,
> would it not be better to just have this as prctl and not change the madvise
> structure? Maybe in a few years we wont need any of this, and it will be lost
> in prctl and madvise wouldn't have changed for this?

This really sounds a lot like your colleague Shakeel's objection, so I
don't want to duplicate my response too much, but as I said to him, at this
stage we would set THP mode to 'auto', but still have to support
MADV_[NO]HUGEPAGE.

We may then wish to set these as no-ops in auto mode right? But they'd
still have to exist for uAPI reasons.

So would it be better to do this in mm/madvise.c alongside literally all
other madvise() code and the existing handling for MADV_[NO]HUGEPAGE, or
would it be better to throw it in kernel/sys.c and hope people remember to
update it?

I think it's pretty clear what the answer to that is.

>
> Again, this is just to have a discussion (and not an aggressive argument :)),
> and would love to get feedback from everyone in the community.
> If its too early to have this discussion, its completely fine and we can
> still keep developing the RFCs :)

I try hard never to be aggressive, I am firm when I feel it is appropriate
to be so (it's important to push back when necessarily I feel), but I
always try to maintain civility as well as I can.

Of course I am imperfect, so apologies if it comes across any other way, I
really do try very hard to maintain a high standard of professionalism
on-list.

Again as I said, I really did not intend to supercede or interfere with
your series, this was just unfortunate timing and throwing out ideas to
reach the best solution.

My objection to prctl() is due to bit-rot, mm code in inappropriate places,
the fact it by nature lends itself to violating conventions and practices
that exist in other mm APIs, not some subjective sense of evil.

It is historically a place where 'things that people don't really care
about but don't quite want to NACK' go also, and to me suggests that the
problem hasn't necessarily been thought through to see how it might be
implemented by extending existing APIs or finding ways to achieve the same
thing that better align with existing intefaces.

To be clear - this concept is _all_ ultimately a product of your series and
your ideas leading the discussion within the community to this point where
a potential alternative has arisen - without which this would not have
occurred.

So the idea here is to simply explore what the best solution is that aligns
with what best serves the community.

>
> Thanks!
> Usama

Thanks, Lorenzo