linux-kernel - Re: [PATCH v4 0/5] implement lightweight guard pages

Message-ID: <278393de-2729-4ed0-822c-87f33c7ce27e@redhat.com>
Date: Wed, 19 Mar 2025 15:52:56 +0100
From: David Hildenbrand <david@...hat.com>
To: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@...onical.com>,
 Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: James.Bottomley@...senpartnership.com, Liam.Howlett@...cle.com,
 akpm@...ux-foundation.org, arnd@...nel.org, brauner@...nel.org,
 chris@...kel.net, deller@....de, hch@...radead.org, jannh@...gle.com,
 jcmvbkbc@...il.com, jeffxu@...omium.org, jhubbard@...dia.com,
 linux-api@...r.kernel.org, linux-kernel@...r.kernel.org, linux-mm@...ck.org,
 mattst88@...il.com, muchun.song@...ux.dev, paulmck@...nel.org,
 richard.henderson@...aro.org, shuah@...nel.org, sidhartha.kumar@...cle.com,
 surenb@...gle.com, tsbogend@...ha.franken.de, vbabka@...e.cz,
 willy@...radead.org, criu@...ts.linux.dev, Andrei Vagin <avagin@...il.com>,
 Pavel Tikhomirov <ptikhomirov@...tuozzo.com>
Subject: Re: [PATCH v4 0/5] implement lightweight guard pages

On 19.03.25 15:50, Alexander Mikhalitsyn wrote:
> On Mon, Oct 28, 2024 at 02:13:26PM +0000, Lorenzo Stoakes wrote:
>> Userland library functions such as allocators and threading implementations
>> often require regions of memory to act as 'guard pages' - mappings which,
>> when accessed, result in a fatal signal being sent to the accessing
>> process.
>>
>> The current means by which these are implemented is via a PROT_NONE mmap()
>> mapping, which provides the required semantics but incurs the overhead of
>> a VMA for each such region.
>>
>> With a great many processes and threads, this can rapidly add up and incur
>> a significant memory penalty. It also has the added problem of preventing
>> merges that might otherwise be permitted.
>>
>> This series takes a different approach - an idea suggested by Vlastimil
>> Babka (and before him David Hildenbrand and Jann Horn - perhaps more - the
>> provenance becomes a little tricky to ascertain after this - please forgive
>> any omissions!)  - rather than locating the guard pages at the VMA layer,
>> instead placing them in page tables mapping the required ranges.
>>
>> Early testing of the prototype version of this code suggests a 5 times
>> speed up in memory mapping invocations (in conjunction with use of
>> process_madvise()) and a 13% reduction in VMAs on an entirely idle android
>> system and unoptimised code.
>>
>> We expect with optimisation and a loaded system with a larger number of
>> guard pages this could significantly increase, but in any case these
>> numbers are encouraging.
>>
>> This way, rather than having separate VMAs specifying which parts of a
>> range are guard pages, instead we have a VMA spanning the entire range of
>> memory a user is permitted to access and including ranges which are to be
>> 'guarded'.
>>
>> After mapping this, a user can specify which parts of the range should
>> result in a fatal signal when accessed.
>>
>> By restricting the ability to specify guard pages to memory mapped by
>> existing VMAs, we can rely on the mappings being torn down when the
>> mappings are ultimately unmapped and everything works simply as if the
>> memory were not faulted in, from the point of view of the containing VMAs.
>>
>> This mechanism in effect poisons memory ranges similar to hardware memory
>> poisoning, only it is an entirely software-controlled form of poisoning.
>>
>> The mechanism is implemented via madvise() behaviour - MADV_GUARD_INSTALL
>> which installs page table-level guard page markers - and
>> MADV_GUARD_REMOVE - which clears them.
>>
>> Guard markers can be installed across multiple VMAs and any existing
>> mappings will be cleared, that is zapped, before installing the guard page
>> markers in the page tables.
>>
>> There is no concept of 'nested' guard markers, multiple attempts to install
>> guard markers in a range will, after the first attempt, have no effect.
>>
>> Importantly, removing guard markers over a range that contains both guard
>> markers and ordinary backed memory has no effect on anything but the guard
>> markers (including leaving huge pages un-split), so a user can safely
>> remove guard markers over a range of memory leaving the rest intact.
>>
>> The actual mechanism by which the page table entries are specified makes
>> use of existing logic - PTE markers, which are used for the userfaultfd
>> UFFDIO_POISON mechanism.
>>
>> Unfortunately PTE_MARKER_POISONED is not suited for the guard page
>> mechanism as it results in VM_FAULT_HWPOISON semantics in the fault
>> handler, so we add our own specific PTE_MARKER_GUARD and adapt existing
>> logic to handle it.
>>
>> We also extend the generic page walk mechanism to allow for installation of
>> PTEs (carefully restricted to memory management logic only to prevent
>> unwanted abuse).
>>
>> We ensure that zapping performed by MADV_DONTNEED and MADV_FREE do not
>> remove guard markers, nor does forking (except when VM_WIPEONFORK is
>> specified for a VMA which implies a total removal of memory
>> characteristics).
>>
>> It's important to note that the guard page implementation is emphatically
>> NOT a security feature, so a user can remove the markers if they wish. We
>> simply implement it in such a way as to provide the least surprising
>> behaviour.
>>
>> An extensive set of self-tests is provided which ensures behaviour is as
>> expected and additionally self-documents the expected behaviour of guard
>> ranges.
> 
> Dear Lorenzo,
> Dear colleagues,
> 
> sorry about raising an old thread.
> 
> It looks like this feature is now used in glibc [1]. And we noticed failures in CRIU [2]
> CI on Fedora Rawhide userspace. Now a question is how we can properly detect such
> "guarded" pages from user space. As I can see from MADV_GUARD_INSTALL implementation,
> it does not modify VMA flags anyhow, but only page tables. It means that /proc/<pid>/maps
> and /proc/<pid>/smaps interfaces are useless in this case. (Please, correct me if I'm missing
> anything here.)
> 
> I wonder if you have any ideas / suggestions regarding Checkpoint/Restore here. We (CRIU devs) are happy
> to develop some patches to bring some uAPI to expose MADV_GUARDs, but before going into this we decided
> to raise this question in LKML.


See [1] and [2]

[1] https://lkml.kernel.org/r/cover.1740139449.git.lorenzo.stoakes@oracle.com
[2] https://lwn.net/Articles/1011366/
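
For illustration only, here is a minimal userspace sketch of the two sides
of the question quoted above: installing a guard marker with madvise() and
then asking the kernel whether a page carries one. It rests on assumptions
that are not stated in this thread: that MADV_GUARD_INSTALL and
MADV_GUARD_REMOVE have the asm-generic values 102 and 103, and that
/proc/<pid>/pagemap reports a guard region in bit 58 of an entry, which is
how I read the kernel's pagemap documentation for recent kernels. The helper
names guard_install(), guard_remove() and page_is_guard() are hypothetical.
On kernels without either feature the calls fail or the bit simply reads
as zero.

```c
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed asm-generic UAPI values; older libc headers may lack them. */
#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102
#endif
#ifndef MADV_GUARD_REMOVE
#define MADV_GUARD_REMOVE 103
#endif

/* Assumption: bit 58 of a pagemap entry flags a guard region. */
#define PM_GUARD_REGION_BIT 58

/*
 * Install guard markers over [addr, addr + len).
 * Returns 0 on success, -1 with errno set (typically EINVAL on kernels
 * that do not know the advice value).
 */
static int guard_install(void *addr, size_t len)
{
	return madvise(addr, len, MADV_GUARD_INSTALL);
}

/* Clear guard markers over [addr, addr + len); other memory is untouched. */
static int guard_remove(void *addr, size_t len)
{
	return madvise(addr, len, MADV_GUARD_REMOVE);
}

/*
 * Query /proc/self/pagemap for the page containing addr.
 * Returns 1 if the guard bit is set, 0 if it is clear, -1 if the pagemap
 * entry could not be read.
 */
static int page_is_guard(const void *addr)
{
	long page_size = sysconf(_SC_PAGESIZE);
	uint64_t entry = 0;
	off_t offset;
	ssize_t n;
	int fd;

	assert(page_size > 0);

	/* One 64-bit entry per virtual page, indexed by virtual page number. */
	offset = (off_t)((uintptr_t)addr / (uintptr_t)page_size) *
		 (off_t)sizeof(entry);

	fd = open("/proc/self/pagemap", O_RDONLY);
	if (fd < 0)
		return -1;

	n = pread(fd, &entry, sizeof(entry), offset);
	close(fd);
	if (n != (ssize_t)sizeof(entry))
		return -1;

	return (int)((entry >> PM_GUARD_REGION_BIT) & 1);
}
```

A caller would mmap() an anonymous read/write region, call guard_install()
on the page(s) to protect, and later use page_is_guard() on a page-aligned
address to see whether a marker is reported. A checkpoint tool would read
another process's /proc/<pid>/pagemap in the same way, subject to the usual
permission checks on that file.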


-- 
Cheers,

David / dhildenb