[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <bddba849-b8ff-4a5c-acd1-51fa761e0402@lucifer.local>
Date: Fri, 18 Oct 2024 22:30:02 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Vlastimil Babka <vbabka@...e.cz>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Suren Baghdasaryan <surenb@...gle.com>,
"Liam R . Howlett" <Liam.Howlett@...cle.com>,
Matthew Wilcox <willy@...radead.org>,
"Paul E . McKenney" <paulmck@...nel.org>, Jann Horn <jannh@...gle.com>,
David Hildenbrand <david@...hat.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Muchun Song <muchun.song@...ux.dev>,
Richard Henderson <richard.henderson@...aro.org>,
Ivan Kokshaysky <ink@...assic.park.msu.ru>,
Matt Turner <mattst88@...il.com>,
Thomas Bogendoerfer <tsbogend@...ha.franken.de>,
"James E . J . Bottomley" <James.Bottomley@...senpartnership.com>,
Helge Deller <deller@....de>, Chris Zankel <chris@...kel.net>,
Max Filippov <jcmvbkbc@...il.com>, Arnd Bergmann <arnd@...db.de>,
linux-alpha@...r.kernel.org, linux-mips@...r.kernel.org,
linux-parisc@...r.kernel.org, linux-arch@...r.kernel.org,
Shuah Khan <shuah@...nel.org>, Christian Brauner <brauner@...nel.org>,
linux-kselftest@...r.kernel.org,
Sidhartha Kumar <sidhartha.kumar@...cle.com>,
Jeff Xu <jeffxu@...omium.org>, Christoph Hellwig <hch@...radead.org>,
Linux API <linux-api@...r.kernel.org>
Subject: Re: [PATCH 0/4] implement lightweight guard pages
On Fri, Oct 18, 2024 at 05:17:56PM +0100, Lorenzo Stoakes wrote:
> On Fri, Oct 18, 2024 at 06:10:37PM +0200, Vlastimil Babka wrote:
> > +CC linux-api (also should on future revisions)
> >
>
> They're cc'd :) assuming Linux API <linux-api@...r.kernel.org> is correct
> right?
As discussed on IRC, no I was being a little slow here and hadn't realised
you'd added them, apologies!
Will add them on future respins, sorry guys :)
>
> > On 10/17/24 22:42, Lorenzo Stoakes wrote:
> > > Userland library functions such as allocators and threading implementations
> > > often require regions of memory to act as 'guard pages' - mappings which,
> > > when accessed, result in a fatal signal being sent to the accessing
> > > process.
> > >
> > > The current means by which these are implemented is via a PROT_NONE mmap()
> > > mapping, which provides the required semantics however incur an overhead of
> > > a VMA for each such region.
> > >
> > > With a great many processes and threads, this can rapidly add up and incur
> > > a significant memory penalty. It also has the added problem of preventing
> > > merges that might otherwise be permitted.
> > >
> > > This series takes a different approach - an idea suggested by Vlasimil
> > > Babka (and before him David Hildenbrand and Jann Horn - perhaps more - the
> > > provenance becomes a little tricky to ascertain after this - please forgive
> > > any omissions!) - rather than locating the guard pages at the VMA layer,
> > > instead placing them in page tables mapping the required ranges.
> > >
> > > Early testing of the prototype version of this code suggests a 5 times
> > > speed up in memory mapping invocations (in conjunction with use of
> > > process_madvise()) and a 13% reduction in VMAs on an entirely idle android
> > > system and unoptimised code.
> > >
> > > We expect with optimisation and a loaded system with a larger number of
> > > guard pages this could significantly increase, but in any case these
> > > numbers are encouraging.
> > >
> > > This way, rather than having separate VMAs specifying which parts of a
> > > range are guard pages, instead we have a VMA spanning the entire range of
> > > memory a user is permitted to access and including ranges which are to be
> > > 'guarded'.
> > >
> > > After mapping this, a user can specify which parts of the range should
> > > result in a fatal signal when accessed.
> > >
> > > By restricting the ability to specify guard pages to memory mapped by
> > > existing VMAs, we can rely on the mappings being torn down when the
> > > mappings are ultimately unmapped and everything works simply as if the
> > > memory were not faulted in, from the point of view of the containing VMAs.
> > >
> > > This mechanism in effect poisons memory ranges similar to hardware memory
> > > poisoning, only it is an entirely software-controlled form of poisoning.
> > >
> > > Any poisoned region of memory is also able to 'unpoisoned', that is, to
> > > have its poison markers removed.
> > >
> > > The mechanism is implemented via madvise() behaviour - MADV_GUARD_POISON
> > > which simply poisons ranges - and MADV_GUARD_UNPOISON - which clears this
> > > poisoning.
> > >
> > > Poisoning can be performed across multiple VMAs and any existing mappings
> > > will be cleared, that is zapped, before installing the poisoned page table
> > > mappings.
> > >
> > > There is no concept of 'nested' poisoning, multiple attempts to poison a
> > > range will, after the first poisoning, have no effect.
> > >
> > > Importantly, unpoisoning of poisoned ranges has no effect on non-poisoned
> > > memory, so a user can safely unpoison a range of memory and clear only
> > > poison page table mappings leaving the rest intact.
> > >
> > > The actual mechanism by which the page table entries are specified makes
> > > use of existing logic - PTE markers, which are used for the userfaultfd
> > > UFFDIO_POISON mechanism.
> > >
> > > Unfortunately PTE_MARKER_POISONED is not suited for the guard page
> > > mechanism as it results in VM_FAULT_HWPOISON semantics in the fault
> > > handler, so we add our own specific PTE_MARKER_GUARD and adapt existing
> > > logic to handle it.
> > >
> > > We also extend the generic page walk mechanism to allow for installation of
> > > PTEs (carefully restricted to memory management logic only to prevent
> > > unwanted abuse).
> > >
> > > We ensure that zapping performed by, for instance, MADV_DONTNEED, does not
> > > remove guard poison markers, nor does forking (except when VM_WIPEONFORK is
> > > specified for a VMA which implies a total removal of memory
> > > characteristics).
> > >
> > > It's important to note that the guard page implementation is emphatically
> > > NOT a security feature, so a user can remove the poisoning if they wish. We
> > > simply implement it in such a way as to provide the least surprising
> > > behaviour.
> > >
> > > An extensive set of self-tests are provided which ensure behaviour is as
> > > expected and additionally self-documents expected behaviour of poisoned
> > > ranges.
> > >
> > > Suggested-by: Vlastimil Babka <vbabka@...e.cz>
> >
> > Please fix the domain typo (also in patch 3 :)
> >
>
> Damnnn it! I can't believe I left that in. Sorry about that! Will fix on
> respin.
>
> Hopefully not to suse.cs ;)
>
> > Thanks for implementing this,
> > Vlastimil
>
> Thanks!
>
> >
> > > Suggested-by: Jann Horn <jannh@...gle.com>
> > > Suggested-by: David Hildenbrand <david@...hat.com>
> > >
> > > v1
> > > * Un-RFC'd as appears no major objections to approach but rather debate on
> > > implementation.
> > > * Fixed issue with arches which need mmu_context.h and
> > > tlbfush.h. header imports in pagewalker logic to be able to use
> > > update_mmu_cache() as reported by the kernel test bot.
> > > * Added comments in page walker logic to clarify who can use
> > > ops->install_pte and why as well as adding a check_ops_valid() helper
> > > function, as suggested by Christoph.
> > > * Pass false in full parameter in pte_clear_not_present_full() as suggested
> > > by Jann.
> > > * Stopped erroneously requiring a write lock for the poison operation as
> > > suggested by Jann and Suren.
> > > * Moved anon_vma_prepare() to the start of madvise_guard_poison() to be
> > > consistent with how this is used elsewhere in the kernel as suggested by
> > > Jann.
> > > * Avoid returning -EAGAIN if we are raced on page faults, just keep looping
> > > and duck out if a fatal signal is pending or a conditional reschedule is
> > > needed, as suggested by Jann.
> > > * Avoid needlessly splitting huge PUDs and PMDs by specifying
> > > ACTION_CONTINUE, as suggested by Jann.
> > >
> > > RFC
> > > https://lore.kernel.org/all/cover.1727440966.git.lorenzo.stoakes@oracle.com/
> > >
> > > Lorenzo Stoakes (4):
> > > mm: pagewalk: add the ability to install PTEs
> > > mm: add PTE_MARKER_GUARD PTE marker
> > > mm: madvise: implement lightweight guard page mechanism
> > > selftests/mm: add self tests for guard page feature
> > >
> > > arch/alpha/include/uapi/asm/mman.h | 3 +
> > > arch/mips/include/uapi/asm/mman.h | 3 +
> > > arch/parisc/include/uapi/asm/mman.h | 3 +
> > > arch/xtensa/include/uapi/asm/mman.h | 3 +
> > > include/linux/mm_inline.h | 2 +-
> > > include/linux/pagewalk.h | 18 +-
> > > include/linux/swapops.h | 26 +-
> > > include/uapi/asm-generic/mman-common.h | 3 +
> > > mm/hugetlb.c | 3 +
> > > mm/internal.h | 6 +
> > > mm/madvise.c | 168 ++++
> > > mm/memory.c | 18 +-
> > > mm/mprotect.c | 3 +-
> > > mm/mseal.c | 1 +
> > > mm/pagewalk.c | 200 ++--
> > > tools/testing/selftests/mm/.gitignore | 1 +
> > > tools/testing/selftests/mm/Makefile | 1 +
> > > tools/testing/selftests/mm/guard-pages.c | 1168 ++++++++++++++++++++++
> > > 18 files changed, 1564 insertions(+), 66 deletions(-)
> > > create mode 100644 tools/testing/selftests/mm/guard-pages.c
> > >
> > > --
> > > 2.46.2
> >
Powered by blists - more mailing lists