[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <803a927e-1c36-4934-9cda-a700acdaad0d@lucifer.local>
Date: Mon, 21 Jul 2025 14:15:54 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: David Hildenbrand <david@...hat.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org,
linux-fsdevel@...r.kernel.org, linux-doc@...r.kernel.org,
Jonathan Corbet <corbet@....net>,
Andrew Morton <akpm@...ux-foundation.org>, Zi Yan <ziy@...dia.com>,
Baolin Wang <baolin.wang@...ux.alibaba.com>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>,
Nico Pache <npache@...hat.com>, Ryan Roberts <ryan.roberts@....com>,
Dev Jain <dev.jain@....com>, Barry Song <baohua@...nel.org>,
Vlastimil Babka <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>,
Suren Baghdasaryan <surenb@...gle.com>, Michal Hocko <mhocko@...e.com>,
Usama Arif <usamaarif642@...il.com>, SeongJae Park <sj@...nel.org>,
Jann Horn <jannh@...gle.com>, Yafang Shao <laoar.shao@...il.com>,
Matthew Wilcox <willy@...radead.org>
Subject: Re: [PATCH POC] prctl: extend PR_SET_THP_DISABLE to optionally
exclude VM_HUGEPAGE
On Mon, Jul 21, 2025 at 01:45:18PM +0200, David Hildenbrand wrote:
> > >
> > > (2) Stay at THP=none, but have "madvise" or "always" behavior for
> > > selected workloads.
> > >
> > > (3) Switch from THP=madvise to THP=always, but keep the old behavior
> > > (THP only when advised) for selected workloads.
> > >
> > > (4) Stay at THP=madvise, but have "always" behavior for selected
> > > workloads.
> > >
> > > In essence, (2) can be emulated through (1), by setting THP!=none while
> > > disabling THPs for all processes that don't want THPs. It requires
> > > configuring all workloads, but that is a user-space problem to sort out.
> >
> > NIT: Delete 'In essence' here.
>
> I wanted "something" there to not make it look like the list keeps going on
> in a weird way ;)
I mean it's not a big deal :P just have memories of English teachers telling me
off for repetition of such phrases...
> > > While we could emulate (3)+(4) through (1)+(2) by simply disabling THPs
> > > completely for these processes, this scares many THPs in our system
> > > because they could no longer get allocated where they used to be allocated
> > > for: regions flagged as VM_HUGEPAGE. Apparently, that imposes a
> > > problem for relevant workloads, because "not THPs" is certainly worse
> > > than "THPs only when advised".
> >
> > I don't know what you mean by 'scares' many THPs? :P
>
> They are very afraid of not getting allocated :)
Haha they sound cute!
> >
> > > of having a system-wide config to change PR_SET_THP_DISABLE behavior, but
> > > I just don't like the semantics.
> >
> > What do you mean?
>
> Kconfig option to change the behavior etc. In summary, I don't want to go
> down that path, it all gets weird.
Yeah please don't! :>)
> > > Happy to hear naming suggestions for "PR_THP_DISABLE_EXCEPT_ADVISED" where
> > > we essentially want to say "leave advised regions alone" -- "keep THP
> > > enabled for advised regions",
> >
> > Seems OK to me. I guess the one point of confusion could be people being
> > confused between the THP toggle 'madvise' vs. actually having MADV_HUGEPAGE
> > set, but that's moot, because 'madvise' mode only enables THP if the region
> > has had MADV_HUGEPAGE set.
>
> Right, whatever ends up setting VM_HUGEPAGE.
Yeah this naming is fine iMO.
>
> >
> > >
> > > The only thing I really dislike about this is using another MMF_* flag,
> > > but well, no way around it -- and seems like we could easily support
> > > more than 32 if we want to, or storing this thp information elsewhere.
> >
> > Yes my exact thoughts. But I will be adding a series to change this for VMA
> > flags soon enough, and can potentially do mm flags at the same time...
> >
> > So this shouldn't in the end be as much of a problem.
> >
> > Maybe it's worth holding off on this until I've done that? But at any rate
> > I intend to do those changes next cycle, and this will be a next cycle
> > thing at the earliest anyway.
>
> I don't think this series must be blocked by that. Using a bitmap instead of
> a single "unsigned long" should be fairly easy later -- I did not identify
> any big blockers.
Yeah that's fine. And I don't know when I will get the bitmap changes done, so
let's not block this with that!
> > > This is *completely* untested and might be utterly broken. It merely
> > > serves as a PoC of what I think could be done. If this ever goes upstream,
> > > we need some kselftests for it, and extensive tests.
> >
> > Well :) I mean we should definitely try this out in anger and it _MUST_
> > have self tests and put under some pressure.
> >
> > Usama, can you attack this and see?
>
> Yes, that's what I am hoping for.
Cool. And of course Usama is best placed to experiment with this approach,
as he can experiment with workloads relevant to this requirement.
>
> >
> > >
> > > [1] https://lore.kernel.org/r/20250507141132.2773275-1-usamaarif642@gmail.com
> > > [2] https://lkml.kernel.org/r/20250515133519.2779639-2-usamaarif642@gmail.com
> > > [3] https://lore.kernel.org/r/cover.1747686021.git.lorenzo.stoakes@oracle.com
> > > [4] https://lkml.kernel.org/r/85778a76-7dc8-4ea8-8827-acb45f74ee05@lucifer.local
> > > [5] https://lkml.kernel.org/r/20250608073516.22415-1-laoar.shao@gmail.com
> > > [6] https://lore.kernel.org/r/CAG48ez3-7EnBVEjpdoW7z5K0hX41nLQN5Wb65Vg-1p8DdXRnjg@mail.gmail.com
> > >
> > > ---
> > > Documentation/filesystems/proc.rst | 5 +--
> > > fs/proc/array.c | 2 +-
> > > include/linux/huge_mm.h | 20 ++++++++---
> > > include/linux/mm_types.h | 13 +++----
> > > include/uapi/linux/prctl.h | 7 ++++
> > > kernel/sys.c | 58 +++++++++++++++++++++++-------
> > > mm/khugepaged.c | 2 +-
> > > 7 files changed, 78 insertions(+), 29 deletions(-)
> > >
> > > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> > > index 2971551b72353..915a3e44bc120 100644
> > > --- a/Documentation/filesystems/proc.rst
> > > +++ b/Documentation/filesystems/proc.rst
> > > @@ -291,8 +291,9 @@ It's slow but very precise.
> > > HugetlbPages size of hugetlb memory portions
> > > CoreDumping process's memory is currently being dumped
> > > (killing the process may lead to a corrupted core)
> > > - THP_enabled process is allowed to use THP (returns 0 when
> > > - PR_SET_THP_DISABLE is set on the process
> > > + THP_enabled process is allowed to use THP (returns 0 when
> > > + PR_SET_THP_DISABLE is set on the process to disable
> > > + THP completely, not just partially)
> >
> > Hmm but this means we have no way of knowing if it's set for partial
>
> Yes. I briefly thought about indicating another member, but then I thought
> (a) it's ugly and (b) "who cares".
>
> I also thought about just printing "partial" instead of "1", but not sure if
> that would break any parser.
Hm and >1 could break a user who expects this to be 0, 1. We can always add
a new entry if needed.
> > > {
> > > + /* Are THPs disabled for this VMA? */
> > > + if (vm_flags & VM_NOHUGEPAGE)
> > > + return true;
> > > + /* Are THPs disabled for all VMAs in the whole process? */
> > > + if (test_bit(MMF_DISABLE_THP_COMPLETELY, &vma->vm_mm->flags))
> > > + return true;
> > > /*
> > > - * Explicitly disabled through madvise or prctl, or some
> > > - * architectures may disable THP for some mappings, for
> > > - * example, s390 kvm.
> > > + * Are THPs disabled only for VMAs where we didn't get an explicit
> > > + * advise to use them?
> >
> > Probably fine to drop the rather pernickety reference to s390 kvm here, I
> > mean we don't need to spell out massively specific details in a general
> > handler.
>
> No strong opinion.
I mean what I'm saying is this is fine :P Got no problem wtih removing this
bit of the comment.
>
> >
> > > */
> > > - return (vm_flags & VM_NOHUGEPAGE) ||
> > > - test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags);
> > > + if (vm_flags & VM_HUGEPAGE)
> > > + return false;
> > > + return test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, &vma->vm_mm->flags);
> > > }
> > >
> > > static inline bool thp_disabled_by_hw(void)
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index 1ec273b066915..a999f2d352648 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -1743,19 +1743,16 @@ enum {
> > > #define MMF_VM_MERGEABLE 16 /* KSM may merge identical pages */
> > > #define MMF_VM_HUGEPAGE 17 /* set when mm is available for khugepaged */
> > >
> > > -/*
> > > - * This one-shot flag is dropped due to necessity of changing exe once again
> > > - * on NFS restore
> > > - */
> > > -//#define MMF_EXE_FILE_CHANGED 18 /* see prctl_set_mm_exe_file() */
> > > +#define MMF_HUGE_ZERO_PAGE 18 /* mm has ever used the global huge zero page */
> > >
> > > #define MMF_HAS_UPROBES 19 /* has uprobes */
> > > #define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */
> > > #define MMF_OOM_SKIP 21 /* mm is of no interest for the OOM killer */
> > > #define MMF_UNSTABLE 22 /* mm is unstable for copy_from_user */
> > > -#define MMF_HUGE_ZERO_PAGE 23 /* mm has ever used the global huge zero page */
> > > -#define MMF_DISABLE_THP 24 /* disable THP for all VMAs */
> > > -#define MMF_DISABLE_THP_MASK (1 << MMF_DISABLE_THP)
> > > +#define MMF_DISABLE_THP_EXCEPT_ADVISED 23 /* no THP except for VMAs with VM_HUGEPAGE */
> > > +#define MMF_DISABLE_THP_COMPLETELY 24 /* no THP for all VMAs */
> > > +#define MMF_DISABLE_THP_MASK ((1 << MMF_DISABLE_THP_COMPLETELY) |\
> > > + (1 << MMF_DISABLE_THP_EXCEPT_ADVISED))
> >
> > It feels a bit sigh to have to use up low-supply mm flags for this. But
> > again, I should be attacking this shortage soon enough.
> >
> > Are we not going ahead with Barry's series that was going to use one of
> > these in the end?
>
> Whoever gets acked first ;)
;)
>
> >
> > > #define MMF_OOM_REAP_QUEUED 25 /* mm was queued for oom_reaper */
> > > #define MMF_MULTIPROCESS 26 /* mm is shared between processes */
> > > /*
> > > diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> > > index 43dec6eed559a..1949bb9270d48 100644
> > > --- a/include/uapi/linux/prctl.h
> > > +++ b/include/uapi/linux/prctl.h
> > > @@ -177,7 +177,14 @@ struct prctl_mm_map {
> > >
> > > #define PR_GET_TID_ADDRESS 40
> > >
> > > +/*
> > > + * Flags for PR_SET_THP_DISABLE are only applicable when disabling. Bit 0
> > > + * is reserved, so PR_GET_THP_DISABLE can return 1 when no other flags were
> > > + * specified for PR_SET_THP_DISABLE.
> > > + */
> >
> > Probably worth specifying that you're just returning the flags here.
>
> Yes.
>
> Thanks!
Cheers!
>
> --
> Cheers,
>
> David / dhildenb
>
Cheers, Lorenzo
Powered by blists - more mailing lists