lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CALOAHbAvQDee2=5vsDqj77g5gAGdGpXFBbsC7tpKnCYEDZS3vw@mail.gmail.com>
Date: Fri, 9 May 2025 17:43:10 +0800
From: Yafang Shao <laoar.shao@...il.com>
To: David Hildenbrand <david@...hat.com>
Cc: Johannes Weiner <hannes@...xchg.org>, Usama Arif <usamaarif642@...il.com>, Zi Yan <ziy@...dia.com>, 
	Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org, shakeel.butt@...ux.dev, 
	riel@...riel.com, baolin.wang@...ux.alibaba.com, lorenzo.stoakes@...cle.com, 
	Liam.Howlett@...cle.com, npache@...hat.com, ryan.roberts@....com, 
	linux-kernel@...r.kernel.org, kernel-team@...a.com
Subject: Re: [PATCH 0/1] prctl: allow overriding system THP policy to always

On Fri, May 9, 2025 at 5:31 PM David Hildenbrand <david@...hat.com> wrote:
>
> On 09.05.25 11:24, Yafang Shao wrote:
> > On Fri, May 9, 2025 at 1:13 PM Johannes Weiner <hannes@...xchg.org> wrote:
> >>
> >> On Fri, May 09, 2025 at 10:15:08AM +0800, Yafang Shao wrote:
> >>> On Fri, May 9, 2025 at 12:04 AM Usama Arif <usamaarif642@...il.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 08/05/2025 06:41, Yafang Shao wrote:
> >>>>> On Thu, May 8, 2025 at 12:09 AM Usama Arif <usamaarif642@...il.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 07/05/2025 16:57, Zi Yan wrote:
> >>>>>>> On 7 May 2025, at 11:12, Usama Arif wrote:
> >>>>>>>
> >>>>>>>> On 07/05/2025 15:57, Zi Yan wrote:
> >>>>>>>>> +Yafang, who is also looking at changing THP config at cgroup/container level.
> >>>>>
> >>>>> Thanks
> >>>>>
> >>>>>>>>>
> >>>>>>>>> On 7 May 2025, at 10:00, Usama Arif wrote:
> >>>>>>>>>
> >>>>>>>>>> Allowing override of global THP policy per process allows workloads
> >>>>>>>>>> that have shown to benefit from hugepages to do so, without regressing
> >>>>>>>>>> workloads that wouldn't benefit. This will allow such types of
> >>>>>>>>>> workloads to be run/stacked on the same machine.
> >>>>>>>>>>
> >>>>>>>>>> It also helps in rolling out hugepages in hyperscaler configurations
> >>>>>>>>>> for workloads that benefit from them, where a single THP policy is
> >>>>>>>>>> likely to be used across the entire fleet, and prctl will help override it.
> >>>>>>>>>>
> >>>>>>>>>> An advantage of doing it via prctl vs creating a cgroup specific
> >>>>>>>>>> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is
> >>>>>>>>>> that this will work even when there are no cgroups present, and my
> >>>>>>>>>> understanding is there is a strong preference of cgroups controls being
> >>>>>>>>>> hierarchical which usually means them having a numerical value.
> >>>>>>>>>
> >>>>>>>>> Hi Usama,
> >>>>>>>>>
> >>>>>>>>> Do you mind giving an example on how to change THP policy for a set of
> >>>>>>>>> processes running in a container (under a cgroup)?
> >>>>>>>>
> >>>>>>>> Hi Zi,
> >>>>>>>>
> >>>>>>>> In our case, we create the processes in the cgroup via systemd. The way we will enable THP=always
> >>>>>>>> for processes in a cgroup is in the same way we enable KSM for the cgroup.
> >>>>>>>> The change in systemd would be very similar to the line in [1], where we would set prctl PR_SET_THP_ALWAYS
> >>>>>>>> in exec-invoke.
> >>>>>>>> This is at the start of the process, but you would already know at the start of the process
> >>>>>>>> whether you want THP=always for it or not.
> >>>>>>>>
> >>>>>>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
> >>>>>>>
> >>>>>>> You also need to add a new systemd.directives, e.g., MemoryTHP, to
> >>>>>>> pass the THP enablement or disablement info from a systemd config file.
> >>>>>>> And if you find those processes do not benefit from using THPs,
> >>>>>>> you can just change the new "MemoryTHP" config and restart the processes.
> >>>>>>>
> >>>>>>> Am I getting it? Thanks.
> >>>>>>>
> >>>>>>
> >>>>>> Yes, thats right. They would exactly the same as what we (Meta) do
> >>>>>> for KSM. So have MemoryTHP similar to MemroryKSM [1] and if MemoryTHP is set,
> >>>>>> the ExecContext->memory_thp would be set similar to memory_ksm [2], and when
> >>>>>> that is set, the prctl will be called at exec_invoke of the process [3].
> >>>>>>
> >>>>>> The systemd changes should be quite simple to do.
> >>>>>>
> >>>>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/man/systemd.exec.xml#L1978
> >>>>>> [2] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/dbus-execute.c#L2151
> >>>>>> [3] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
> >>>>>
> >>>>> This solution carries a risk: since prctl() does not require any
> >>>>> capabilities, the task itself could call it and override your memory
> >>>>> policy. While we could enforce CAP_SYS_RESOURCE to restrict this, that
> >>>>> capability is typically enabled by default in containers, leaving them
> >>>>> still vulnerable.
> >>>>>
> >>>>> This approach might work for Kubernetes/container environments, but it
> >>>>> would require substantial code changes to implement securely.
> >>>>>
> >>>>
> >>>> You can already change the memory policy with prctl, for e.g. PR_SET_THP_DISABLE
> >>>> already exists and the someone could use this to slow the process down. So the
> >>>> approach this patch takes shouldn't be anymore of a security fix then what is already
> >>>> exposed by the kernel. I think as you mentioned, if prctl is an issue CAP_SYS_RESOURCE
> >>>> should be used to restrict this.
> >>>
> >>> I believe we should at least require CAP_SYS_RESOURCE to enable THP,
> >>> since it overrides global system settings. Alternatively,
> >>> CAP_SYS_ADMIN might be even more appropriate, though I'm not entirely
> >>> certain.
> >>
> >> Hm, could you verbalize a concrete security concern?
> >>
> >> I've never really looked at the global settings as a hard policy, more
> >> as picking a default for the workloads in the system. It's usually
> >> `madvise' or `always', and MADV_HUGEPAGE and MADV_NOHUGEPAGE have long
> >> existed to give applications the ability to refine the global choice.
> >>
> >> The prctl should probably respect `never' for consistency, but beyond
> >> that I don't really see the concern, or how this would allow something
> >> that isn't already possible.
> >
> > I would interpret the always, madvise, and never options as follows:
> > - always
> >    The sysadmin strongly recommends using THP. If a user does not
> > want to use it, they must explicitly disable it.
> > - madvise
> >   The sysadmin gently encourages the use of THP, but it is only
> > enabled when explicitly requested by the application.
> > - never
> >    The sysadmin discourages the use of THP, and "its use is only permitted
> > with explicit approval" .
>
> "never" so far means "no thps, no exceptions". We've had serious THP
> issues in the past, where our workaround until we sorted out the issue
> for affected customers was to force-disable THPs on that system during boot.

Right, that reflects the current behavior. What we aim to enhance is
by adding the requirement that "its use is only permitted with
explicit approval."

This would be a small but meaningful step toward making THP truly
usable in a transparent way in production environments.

-- 
Regards
Yafang

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ