lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <2fd7f80c-2b13-4478-900a-d65547586db3@gmail.com>
Date: Tue, 10 Jun 2025 16:30:43 +0100
From: Usama Arif <usamaarif642@...il.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 David Hildenbrand <david@...hat.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
 Shakeel Butt <shakeel.butt@...ux.dev>,
 "Liam R . Howlett" <Liam.Howlett@...cle.com>,
 Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
 Arnd Bergmann <arnd@...db.de>, Christian Brauner <brauner@...nel.org>,
 SeongJae Park <sj@...nel.org>, Mike Rapoport <rppt@...nel.org>,
 Johannes Weiner <hannes@...xchg.org>, Barry Song <21cnbao@...il.com>,
 linux-mm@...ck.org, linux-arch@...r.kernel.org,
 linux-kernel@...r.kernel.org, linux-api@...r.kernel.org,
 Pedro Falcato <pfalcato@...e.de>, Matthew Wilcox <willy@...radead.org>
Subject: Re: [DISCUSSION] proposed mctl() API



On 10/06/2025 16:17, Lorenzo Stoakes wrote:
> On Tue, Jun 10, 2025 at 04:03:07PM +0100, Usama Arif wrote:
>>
>>
>> On 30/05/2025 14:10, Lorenzo Stoakes wrote:
>>> On Thu, May 29, 2025 at 06:21:55PM +0100, Usama Arif wrote:
>>>>
>>>>
>>>> My knowledge is security is limited, so please bare with me, but I actually
>>>> didn't understand the security issue and the need for CAP_SYS_ADMIN for
>>>> doing VM_(NO)HUGEPAGE.
>>>>
>>>> A process can already madvise its own VMAs, and this is just doing that
>>>> for the entire process. And VM_INIT_DEF_MASK is already set to VM_NOHUGEPAGE
>>>> so it will be inherited by the parent. Just adding VM_HUGEPAGE shouldnt be
>>>> a issue? Inheriting MMF_VM_HUGEPAGE will mean that khugepaged would enter
>>>> for that process as well, which again doesnt seem like a security issue
>>>> to me.
>>>
>>> W.R.T. the current process, the Issue is one Jann raised, in relation to
>>> propagation of behaviour to privileged (e.g. setuid) processes.
>>>
>>
>> But what is the actual security issue of having hugepages (or not having them) when
>> the process is running with setuid?
> 
> Speak to Jann about this. Security isn't my area. He gave feedback on this,
> which is why I raised it, if you search through previous threads you can find
> it.
> 

Yes, he is in CC here as well. I have read it in the previous thread. Just raising it
here as it was mentioned here :)

>>
>> I know the cgroup proposal has been shot down, but lets imagine if this was a cgroup
>> setting, similar to the other memory controls we have, for e.g. memory.swap.{max,high,peak}.
>>
>> We can chown the cgroup so that the property is set by unprivileged process.
>>
>> Having the process swap with setuid when the unprivileged process has swap disabled
>> in the cgroup is not the right behaviour. What currently happens is that the process
>> after obtaining the higher privilege level doesn't swap as well.
>>
>> Similarly for hugepages, if it was a cgroup level setting, having the process give
>> hugepages always with setuid when the unprivileged user had it disabled it or vice versa
>> would not be the right behaviour.
>>
>> Another example is PR_SET_MEMORY_MERGE, setuid does not change how it works as far as
>> I can tell.
>>
>> So madlibs I dont see what the security issue is and why we would need to elevate privileges
>> to do this.
>>
>>> W.R.T. remote processes, obviously we want to make sure we are permitted to do
>>> so.
>>>
>>
>> I know that this needs to be future proof. But I don't actually know of a real world
>> usecase where we want to do any of these things for remote processes.
>> Whether its the existing per process changes like PR_SET_MEMORY_MERGE for KSM and
>> PR_SET_THP_DISABLE for THP or the newer proposals of PR_DEFAULT_MADV_(NO)HUGEPAGE
>> or Barrys proposal.
>> All of them are for the process itself (and its children by fork+exec) and not for
>> remote processes. As we try to make our changes usecase driven, I think we should
>> not add support for remote processes (which is another reason why I think this might
>> sit better in prctl).
> 
> I'm extremely confused as to why you think this propoal is predicated upon
> remote process manipulation? It was simply suggested as a possibility for
> increased flexibility.
> 
> We can just remove this parameter no?
> 

Sure.

> It is entirely orthogonal to the prctl() stuff.
> 
> Overall at this point I share Matthew's point of view on this - we shouldn't be
> doing any of this upstream.

As I replied to Matthew in [1], it would be amazing if it was not needed, but thats not
how it works in the medium term and I dont think it will work even in the long term.
I will paste my answer from [1] below as well:

If we have 2 workloads on the same server, For e.g. one is database where THPs 
just dont do well, but the other one is AI where THPs do really well. How
will the kernel monitor that the database workload is performing worse
and the AI one isnt?

I added THP shrinker to hopefully try and do this automatically, and it does
really help. But unfortunately it is not a complete solution.
There are severely memory bound workloads where even a tiny increase
in memory will lead to an OOM. And if you colocate the container thats running
that workload with one in which we will benefit with THPs, we unfortunately
can't just rely on the system doing the right thing.

It would be awesome if THPs are truly transparent and don't require
any input, but unfortunately I don't think that there is a solution
for this with just kernel monitoring.

This is just a big hint from the user. If the global system policy is madvise
and the workload owner has done their own benchmarks and see benefits
with always, they set DEFAULT_MADV_HUGEPAGE for the process to optin as "always".
If the global system policy is always and the workload owner has done their own 
benchmarks and see worse results with always, they set DEFAULT_MADV_NOHUGEPAGE for 
the process to optin as "madvise". 

[1] https://lore.kernel.org/all/162c14e6-0b16-4698-bd76-735037ea0d73@gmail.com/


I havent seen activity on this thread over the past week, but I was hoping
we can reach a consensus on which approach to use, prctl or mctl.
If its mctl and if you don't think this should be done, please let me know
if you would like me to work on this instead. This is a valid big realworld
usecase that is a real blocker for deploying THPs in workloads in servers.

Thanks!
Usama

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ