Message-ID: <955fd396-10b1-48cb-977d-74f3e158b1cd@redhat.com>
Date: Mon, 26 May 2025 14:57:17 +0200
From: David Hildenbrand <david@...hat.com>
To: Shakeel Butt <shakeel.butt@...ux.dev>
Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
Andrew Morton <akpm@...ux-foundation.org>,
"Liam R . Howlett" <Liam.Howlett@...cle.com>,
Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
Arnd Bergmann <arnd@...db.de>, Christian Brauner <brauner@...nel.org>,
linux-mm@...ck.org, linux-arch@...r.kernel.org,
linux-kernel@...r.kernel.org, SeongJae Park <sj@...nel.org>,
Usama Arif <usamaarif642@...il.com>
Subject: Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
>>
>> To summarize my current view:
>>
>> 1) eBPF: most people are not fans of that, and I agree, at least
>> for this purpose. If we were talking about making better *placement*
>> decisions using eBPF, it would be a different story.
>
> By placement decisions, do you mean placement between memory
> tiers/nodes or something else?
More like: which size to place, but it could be extended to other
policies, maybe.
Assume we have a page fault and have to decide which folio size to place.
For a process that we really want to use THPs (VM_HUGEPAGE?), we could
use the largest free folio possible.
For a process that we don't want to spend valuable THPs on (VM_HUGEPAGE
not set?), we could use the smallest free folio possible.
Such a policy might be encoded in an eBPF program, I assume.
The hints (prioritize regions/processes, deprioritize
regions/processes), such as VM_HUGEPAGE, would be inputs into such a program.
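To make the size-placement idea concrete, here is a minimal user-space sketch in C. It is neither kernel nor eBPF code, and it only illustrates the decision described above under stated assumptions: pick_folio_order(), MAX_ORDER_SKETCH, and the wants_thp flag are hypothetical names, with wants_thp standing in for a hint such as VM_HUGEPAGE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical upper bound: order 9 is 2 MiB with 4 KiB base pages. */
#define MAX_ORDER_SKETCH 9

/*
 * Hypothetical placement policy.
 *
 * free_orders[i] is true if a free folio of order i is available.
 * wants_thp stands in for a hint such as VM_HUGEPAGE being set.
 *
 * Returns the chosen folio order, or -1 if no folio is free.
 */
int pick_folio_order(const bool free_orders[MAX_ORDER_SKETCH + 1],
                     bool wants_thp)
{
    assert(free_orders != NULL);

    if (wants_thp) {
        /* Prioritized: take the largest free folio. */
        for (int order = MAX_ORDER_SKETCH; order >= 0; order--)
            if (free_orders[order])
                return order;
        return -1;
    }

    /* Not prioritized: take the smallest free folio. */
    for (int order = 0; order <= MAX_ORDER_SKETCH; order++)
        if (free_orders[order])
            return order;
    return -1;
}
```

With free folios at orders 0, 4, and 9, the sketch picks order 9 for a prioritized process and order 0 otherwise. A real policy would presumably weigh more inputs than a single flag.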
>
>>
>> 2) prctl(): the unloved child, and I can understand why. Maybe now is
>> the right time to stop adding new MM things that feel weird in there.
>> Maybe we should already have done that with the KSM toggle (guess who
>> was involved in that ;) ).
>
> At the moment systemd is the user I know of and I think it would be very
> easy to migrate it to whatever new thing we decide here.
Agreed.
>
>>
>> 3) process_madvise(): I think it's an interesting extension, but
>> probably we should just have something that applies to the whole
>> address space naturally. At least my take for now.
>>
>> 4) new syscall: worth exploring how it would look. I'm especially
>> interested in flag options (e.g., SET_DEFAULT_EXEC) and how we could
>> make them only apply to selected controls.
>
> Was there any previous discussion of SET_DEFAULT_EXEC? This is the
> first time I am hearing about it.
I think it evolved in the discussion here from PMADV_SET_FORK_EXEC_DEFAULT.
>
> Overall I agree with your assessment and thus I was requesting to at
> least discuss the new syscall option as well.
Yes.
I am still not sure if having a new "process" [1] mode would be a
reasonable alternative to setting the VM_HUGEPAGE/VM_NOHUGEPAGE default.
Assuming we had a "process" mode, we could (a) set the policy
per-process using the new syscall we discuss here, (b) add an option to
set the policy to use for the exec child, and (c) maybe add an option to
seal the policy (depending on who is allowed to set the policy in the
first place).
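Purely as an illustration of how (a), (b), and (c) could fit together, here is a small user-space model in C. Nothing in it is an existing or proposed kernel interface: the struct, the enum values, and the function names are all hypothetical. Whether a seal would carry over to the exec child is an open question; the sketch simply assumes it does.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical policy values; the real set of modes is undecided. */
enum thp_mode_sketch { THP_MODE_DEFAULT, THP_MODE_ENABLE, THP_MODE_DISABLE };

/* Hypothetical per-process state for a "process" mode. */
struct proc_thp_policy {
    enum thp_mode_sketch cur;     /* (a) policy of this process */
    enum thp_mode_sketch on_exec; /* (b) policy the exec child starts with */
    bool sealed;                  /* (c) no further changes once set */
};

/* (a) Set the policy of the running process; fails once sealed. */
int policy_set(struct proc_thp_policy *p, enum thp_mode_sketch mode)
{
    if (p->sealed)
        return -EPERM;
    p->cur = mode;
    return 0;
}

/* (b) Set the policy to apply to the exec child; fails once sealed. */
int policy_set_exec(struct proc_thp_policy *p, enum thp_mode_sketch mode)
{
    if (p->sealed)
        return -EPERM;
    p->on_exec = mode;
    return 0;
}

/* (c) Seal the policy against later changes. */
void policy_seal(struct proc_thp_policy *p)
{
    p->sealed = true;
}

/* Model of exec: the child starts with the parent's on_exec policy.
 * Assumption: the seal is inherited. */
struct proc_thp_policy policy_exec(const struct proc_thp_policy *parent)
{
    struct proc_thp_policy child = {
        .cur = parent->on_exec,
        .on_exec = parent->on_exec,
        .sealed = parent->sealed,
    };
    return child;
}
```

In this model, changing cur on a running process leaves any per-VMA hints untouched, because the policy lives next to them rather than replacing them.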
On the + side, we don't lose hints/instructions from the app
(VM_HUGEPAGE/VM_NOHUGEPAGE) when changing the policy on an already
running process.
The problem I see with the "process" policy is that people might want
different "default" policies for processes, which means that we will
have to add yet another toggle.
How I hate THP toggles. :)
[1] https://lore.kernel.org/all/CALOAHbB-KQ4+z-Lupv7RcxArfjX7qtWcrboMDdT4LdpoTXOMyw@mail.gmail.com/
--
Cheers,
David / dhildenb