Message-ID: <a49d277e-128c-4853-bdeb-3a94134acbf6@redhat.com>
Date: Tue, 18 Feb 2025 18:14:00 +0100
From: David Hildenbrand <david@...hat.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: Vlastimil Babka <vbabka@...e.cz>,
 Andrew Morton <akpm@...ux-foundation.org>,
 Suren Baghdasaryan <surenb@...gle.com>,
 "Liam R . Howlett" <Liam.Howlett@...cle.com>,
 Matthew Wilcox <willy@...radead.org>, "Paul E . McKenney"
 <paulmck@...nel.org>, Jann Horn <jannh@...gle.com>, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org, Shuah Khan <shuah@...nel.org>,
 linux-kselftest@...r.kernel.org, linux-api@...r.kernel.org,
 John Hubbard <jhubbard@...dia.com>, Juan Yescas <jyescas@...gle.com>,
 Kalesh Singh <kaleshsingh@...gle.com>
Subject: Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem
 mappings

On 18.02.25 17:43, Lorenzo Stoakes wrote:
> On Tue, Feb 18, 2025 at 04:20:18PM +0100, David Hildenbrand wrote:
>>> Right yeah that'd be super weird. And I don't want to add that logic.
>>>
>>>> Also not sure what happens if one does an mlock()/mlockall() after
>>>> already installing PTE markers.
>>>
>>> The existing logic already handles non-present cases by skipping them, in
>>> mlock_pte_range():
>>>
>>> 	for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
>>> 		ptent = ptep_get(pte);
>>> 		if (!pte_present(ptent))
>>> 			continue;
>>>
>>> 		...
>>> 	}
>>
>> I *think* that code only updates already-mapped folios, to properly call
>> mlock_folio()/munlock_folio().
> 
> Guard regions _are_ 'already mapped' :) so it leaves them in place.

"mapped folios" -- there is no folio mapped. Yes, the VMA is in place.

> 
> do_mlock() -> apply_vma_lock_flags() -> mlock_fixup() -> mlock_vma_pages_range()
> implies this will be invoked.

Yes, to process any already-mapped folios; population then continues 
later.

> 
>>
>> It is not the code that populates pages on mlock()/mlockall(). I think all
>> that goes via mm_populate()/__mm_populate(), where "ordinary GUP" should
>> apply.
> 
> OK I want to correct what I said earlier.
> 
> Installing a guard region then attempting mlock() will result in an error. The
> populate will -EFAULT and stop at the guard region, which causes mlock() to
> error out.

Right, that's my expectation.

> 
> This is a partial failure, so the VMA is split and has VM_LOCKED applied, but
> the populate halts at the guard region.
> 
> This is ok as per previous discussion on aggregate operation failure, there can
> be no expectation of 'unwinding' of partially successful operations that form
> part of a requested aggregate one.
> 
> However, given there's stuff to clean up, and on error a user _may_ wish to then
> remove guard regions and try again, I guess there's no harm in keeping the code
> as it is where we allow MADV_GUARD_REMOVE even if VM_LOCKED is in place.

Likely yes; it's all weird code.
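
As a concrete reference for the failure-and-cleanup sequence described 
above, here is a minimal userspace sketch. It assumes a kernel with this 
series applied; the MADV_GUARD_* fallback defines mirror the uapi values 
the series adds, and the exact errno is deliberately not asserted:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102	/* per the uapi values added by this series */
#define MADV_GUARD_REMOVE  103
#endif

int main(void)
{
	long psize = sysconf(_SC_PAGESIZE);
	char *p = mmap(NULL, 4 * psize, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	/* Install a guard region covering the second page. */
	if (madvise(p + psize, psize, MADV_GUARD_INSTALL))
		perror("MADV_GUARD_INSTALL");

	/* Population stops at the guard region, so mlock() errors out. */
	if (mlock(p, 4 * psize))
		printf("mlock failed as expected: %s\n", strerror(errno));

	/* The cleanup path discussed above: remove the guards, retry. */
	if (madvise(p + psize, psize, MADV_GUARD_REMOVE))
		perror("MADV_GUARD_REMOVE");
	if (mlock(p, 4 * psize) == 0)
		printf("mlock succeeded after guard removal\n");

	munmap(p, 4 * psize);
	return 0;
}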

> 
>>
>> See populate_vma_page_range(), especially also the VM_LOCKONFAULT handling.
> 
> Yeah that code is horrible, you just reminded me of it... 'rightly or wrongly'
> yeah wrongly, very wrongly...
> 
>>
>>>
>>> Which covers off guard regions. Removing the guard regions after this will
>>> leave you in a weird situation where these entries will be zapped... maybe
>>> we need a patch to make MADV_GUARD_REMOVE check VM_LOCKED and in this case
>>> also populate?
>>
>> Maybe? Or we say that it behaves like MADV_DONTNEED_LOCKED.
> 
> See above, no we should not :P this is only good for cleanup after mlock()
> failure, although no sane program should really be trying to do this, a sane
> program would give up here (and it's a _programmatic error_ to try to mlock() a
> range with guard regions).
>> Some apps use mlockall(), and it might be nice to just be able to use
>> guard pages as if "Nothing happened".
> 
> Sadly I think not given above :P

QEMU, for example, will issue an mlockall(MCL_CURRENT | MCL_FUTURE) when 
requested, and then exit() if it fails.

Assuming glibc or any library uses guard regions, QEMU would have no real 
way of figuring that out or of instructing the offending libraries to 
disable them, at least for now ...

... making RT VMs less usable if any library uses guard regions. :(

There is upcoming support for MCL_ONFAULT in QEMU [1] (see below).

[1] https://lkml.kernel.org/r/20250212173823.214429-3-peterx@redhat.com
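
For context, the QEMU-style pattern being described is roughly the 
following. This is a hypothetical sketch: enable_mem_lock() is an 
illustrative name rather than QEMU's actual code, and whether a 
library-installed guard region elsewhere in the address space makes the 
call fail is precisely the concern raised here, not something this 
snippet asserts:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

static void enable_mem_lock(void)
{
	/* Lock everything, current and future, when the user asks for it. */
	if (mlockall(MCL_CURRENT | MCL_FUTURE)) {
		fprintf(stderr, "locking memory failed: %s\n", strerror(errno));
		exit(EXIT_FAILURE);	/* treat failure as fatal */
	}
}

int main(void)
{
	enable_mem_lock();
	/* ... rest of the application ... */
	return 0;
}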

> 
>>
>> E.g., QEMU has the option to use mlockall().
>>
>>>
>>> Then again we're currently asymmetric as you can add them _before_
>>> mlock()'ing...
>>
>> Right.
>>
>> --
>> Cheers,
>>
>> David / dhildenb
>>
> 
> I think the _LOCKED idea is therefore kaput, because it just won't work
> properly because populating guard regions fails.

Right, I think basic VM_LOCKED is out of the picture. VM_LOCKONFAULT 
might be interesting, because we are skipping the population stage.
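
To make the lock-on-fault distinction concrete, a minimal sketch using 
mlock2(MLOCK_ONFAULT) (glibc 2.27+). It assumes that installing a guard 
region before locking remains permitted, as it is today; whether 
install/remove should then also be allowed on VMAs that already have 
VM_LOCKONFAULT set is the open question:

#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102	/* per the uapi values added by this series */
#endif

int main(void)
{
	long psize = sysconf(_SC_PAGESIZE);
	char *p = mmap(NULL, 4 * psize, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	/* Guard region installed before locking, as is permitted today. */
	if (madvise(p + psize, psize, MADV_GUARD_INSTALL))
		perror("MADV_GUARD_INSTALL");

	/*
	 * Lock on fault: VM_LOCKED | VM_LOCKONFAULT is set, but nothing is
	 * populated up front, so there is no fault-in pass to trip over
	 * the guard region.
	 */
	if (mlock2(p, 4 * psize, MLOCK_ONFAULT))
		printf("mlock2(MLOCK_ONFAULT): %s\n", strerror(errno));
	else
		printf("locked on-fault with a guard region in the range\n");

	munmap(p, 4 * psize);
	return 0;
}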

> 
> It fails because it tries to 'touch' the memory, but 'touching' guard
> region memory causes a segfault. This kind of breaks the idea of
> mlock()'ing guard regions.
> 
> I think adding workarounds to make this possible in any way is not really
> worth it (and would probably be pretty gross).
> 
> We already document that 'mlock()ing lightweight guard regions will fail'
> as per man page so this is all in line with that.

Right, and I claim that supporting VM_LOCKONFAULT might likely be as 
easy as allowing install/remove of guard regions when that flag is set.

-- 
Cheers,

David / dhildenb

