linux-kernel - Re: [RFC 00/14] Dynamic Kernel Stacks


Message-ID: <0EE22907-1D81-4FA6-823B-13F7A94D3F85@zytor.com>
Date: Sat, 16 Mar 2024 17:47:18 -0700
From: "H. Peter Anvin" <hpa@...or.com>
To: Matthew Wilcox <willy@...radead.org>
CC: Pasha Tatashin <pasha.tatashin@...een.com>, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org, akpm@...ux-foundation.org, x86@...nel.org,
        bp@...en8.de, brauner@...nel.org, bristot@...hat.com,
        bsegall@...gle.com, dave.hansen@...ux.intel.com, dianders@...omium.org,
        dietmar.eggemann@....com, eric.devolder@...cle.com, hca@...ux.ibm.com,
        hch@...radead.org, jacob.jun.pan@...ux.intel.com, jgg@...pe.ca,
        jpoimboe@...nel.org, jroedel@...e.de, juri.lelli@...hat.com,
        kent.overstreet@...ux.dev, kinseyho@...gle.com,
        kirill.shutemov@...ux.intel.com, lstoakes@...il.com, luto@...nel.org,
        mgorman@...e.de, mic@...ikod.net, michael.christie@...cle.com,
        mingo@...hat.com, mjguzik@...il.com, mst@...hat.com, npiggin@...il.com,
        peterz@...radead.org, pmladek@...e.com, rick.p.edgecombe@...el.com,
        rostedt@...dmis.org, surenb@...gle.com, tglx@...utronix.de,
        urezki@...il.com, vincent.guittot@...aro.org, vschneid@...hat.com
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On March 14, 2024 12:43:06 PM PDT, Matthew Wilcox <willy@...radead.org> wrote:
>On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
>> Second, non-dynamic kernel memory is one of the core design decisions in
>> Linux from early on. This means there are a lot of deeply embedded assumptions
>> which would have to be untangled.
>
>I think there are other ways of getting the benefit that Pasha is seeking
>without moving to dynamically allocated kernel memory.  One icky thing
>that XFS does is punt work over to a kernel thread in order to use more
>stack!  That breaks a number of things including lockdep (because the
>kernel thread doesn't own the lock, the thread waiting for the kernel
>thread owns the lock).
>
>If we had segmented stacks, XFS could say "I need at least 6kB of stack",
>and if less than that was available, we could allocate a temporary
>stack and switch to it.  I suspect Google would also be able to use this
>API for their rare cases when they need more than 8kB of kernel stack.
>Who knows, we might all be able to use such a thing.
>
>I'd been thinking about this from the point of view of allocating more
>stack elsewhere in kernel space, but combining what Pasha has done here
>with this idea might lead to a hybrid approach that works better; allocate
>32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
>rely on people using this "I need more stack" API correctly, and free the
>excess pages on return to userspace.  No complicated "switch stacks" API
>needed, just an "ensure we have at least N bytes of stack remaining" API.

This is basically what stack probes do. They provide a very cheap "API" that goes via the #PF (not #DF!) path in the slow case, synchronously and at a well-defined point, yet is virtually free in the common case. As a side benefit, they can be compiler-generated, as some operating systems require them.
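
For illustration only, here is a minimal user-space sketch of what a stack probe amounts to: before using a region of stack, touch one byte in each page of it, from the top down, so that any fault happens right there rather than at some arbitrary later access. The function name probe_region_down and the constant SKETCH_PAGE_SIZE are hypothetical, the 4096-byte page size is an assumption, and this is neither kernel code nor the sequence any particular compiler emits (GCC's -fstack-clash-protection and the Windows __chkstk helper are real-world examples of compiler-generated probes).

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size for this sketch; real probes use the platform's value. */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Touch one byte in every page of the `bytes`-sized region that ends just
 * below `top`, walking downward the way the stack grows on x86. Each touch
 * is a read-modify-write that leaves the byte unchanged, so in the common
 * case (page already mapped) it costs one memory access per page. If a page
 * is not mapped, the fault is taken here, synchronously, at a well-defined
 * point. Returns the number of probes performed.
 */
static size_t probe_region_down(volatile char *top, size_t bytes)
{
    size_t probes = 0;
    size_t offset = 0;

    assert(top != NULL);

    /* One probe per full page. */
    while (bytes - offset >= SKETCH_PAGE_SIZE) {
        offset += SKETCH_PAGE_SIZE;
        top[-(ptrdiff_t)offset] |= 0;
        probes++;
    }

    /* One final probe at the lowest address if a partial page remains. */
    if (offset < bytes) {
        top[-(ptrdiff_t)bytes] |= 0;
        probes++;
    }

    return probes;
}
```

A caller would pass the current top of the region it is about to use and the number of bytes it needs, for example probe_region_down(buf + sizeof buf, sizeof buf) for a local buffer. In a real implementation the probe would be emitted in the function prologue against the stack pointer itself.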