Message-ID: <A5238E27-88DB-4758-B630-22F17501AFD5@zytor.com>
Date: Thu, 14 Mar 2024 21:17:27 -0700
From: "H. Peter Anvin" <hpa@...or.com>
To: Pasha Tatashin <pasha.tatashin@...een.com>,
Matthew Wilcox <willy@...radead.org>
CC: Kent Overstreet <kent.overstreet@...ux.dev>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, akpm@...ux-foundation.org, x86@...nel.org,
bp@...en8.de, brauner@...nel.org, bristot@...hat.com,
bsegall@...gle.com, dave.hansen@...ux.intel.com, dianders@...omium.org,
dietmar.eggemann@....com, eric.devolder@...cle.com, hca@...ux.ibm.com,
hch@...radead.org, jacob.jun.pan@...ux.intel.com, jgg@...pe.ca,
jpoimboe@...nel.org, jroedel@...e.de, juri.lelli@...hat.com,
kinseyho@...gle.com, kirill.shutemov@...ux.intel.com,
lstoakes@...il.com, luto@...nel.org, mgorman@...e.de, mic@...ikod.net,
michael.christie@...cle.com, mingo@...hat.com, mjguzik@...il.com,
mst@...hat.com, npiggin@...il.com, peterz@...radead.org,
pmladek@...e.com, rick.p.edgecombe@...el.com, rostedt@...dmis.org,
surenb@...gle.com, tglx@...utronix.de, urezki@...il.com,
vincent.guittot@...aro.org, vschneid@...hat.com
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks
On March 14, 2024 8:13:56 PM PDT, Pasha Tatashin <pasha.tatashin@...een.com> wrote:
>On Thu, Mar 14, 2024 at 3:57 PM Matthew Wilcox <willy@...radead.org> wrote:
>>
>> On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
>> > On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
>> > > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
>> > > > Second, non-dynamic kernel memory is one of the core design decisions in
>> > > > Linux from early on. This means there are a lot of deeply embedded assumptions
>> > > > which would have to be untangled.
>> > >
>> > > I think there are other ways of getting the benefit that Pasha is seeking
>> > > without moving to dynamically allocated kernel memory. One icky thing
>> > > that XFS does is punt work over to a kernel thread in order to use more
>> > > stack! That breaks a number of things including lockdep (because the
>> > > kernel thread doesn't own the lock; the thread waiting for the kernel
>> > > thread owns the lock).
>> > >
>> > > If we had segmented stacks, XFS could say "I need at least 6kB of stack",
>> > > and if less than that was available, we could allocate a temporary
>> > > stack and switch to it. I suspect Google would also be able to use this
>> > > API for their rare cases when they need more than 8kB of kernel stack.
>> > > Who knows, we might all be able to use such a thing.
>> > >
>> > > I'd been thinking about this from the point of view of allocating more
>> > > stack elsewhere in kernel space, but combining what Pasha has done here
>> > > with this idea might lead to a hybrid approach that works better; allocate
>> > > 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
>> > > rely on people using this "I need more stack" API correctly, and free the
>> > > excess pages on return to userspace. No complicated "switch stacks" API
>> > > needed, just an "ensure we have at least N bytes of stack remaining" API.
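As a rough, non-authoritative sketch of the kind of interface being discussed (kernel context assumed; everything below other than current_stack_pointer and current is hypothetical -- the per-task record of how much of the vmap area is currently backed and the grow helper are assumptions, not existing kernel symbols):

/*
 * Hypothetical sketch only -- none of these helpers exist in the kernel.
 * Assumes each kernel stack lives in a 32kB vmap area of which only the
 * top pages are backed by memory, as proposed above.
 */
static inline int ensure_stack_space(size_t bytes)
{
	unsigned long sp = current_stack_pointer;
	unsigned long lowest_backed = current->stack_backed_bottom;	/* hypothetical field */

	/* Enough already-backed stack below the current stack pointer? */
	if (sp - lowest_backed >= bytes)
		return 0;

	/* Otherwise back more pages of the vmap area for this stack. */
	return grow_task_stack(current, bytes);		/* hypothetical helper */
}

Callers that know they are entering a deep path (XFS, a kvm-vcpu loop, etc.) would call this up front, and the extra pages would be freed again on return to userspace.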
>
>I like this approach! I think we could also consider having permanent
>big stacks for some kernel-only threads like kvm-vcpu. A cooperative
>stack increase framework could work well and wouldn't negatively
>impact the performance of context switching. However, thorough
>analysis would be necessary to proactively identify potential stack
>overflow situations.
>
>> > Why would we need an "I need more stack" API? Pasha's approach seems
>> > like everything we need for what you're talking about.
>>
>> Because double faults are hard, possibly impossible, and the FRED approach
>> Peter described has extra overhead? This was all described up-thread.
>
>Handling faults in #DF is possible. It requires code inspection to
>handle race conditions such as the one shown by tglx. However, as
>Andy pointed out, this is not supported by the SDM, as #DF is an abort
>context (yet we return from it because of ESPFIX64, so return is
>possible).
>
>My question, however, is this: if we ignore the memory savings and
>consider only the reliability aspect of this feature, what is better:
>unconditionally crashing the machine because a guard page was reached,
>or printing a huge warning with backtrace information about the
>offending stack, handling the fault, and surviving? I know that
>historically Linus preferred WARN() to BUG() [1]. But this is a
>somewhat different scenario compared to a simple BUG vs WARN choice.
>
>Pasha
>
>[1] https://lore.kernel.org/all/Pine.LNX.4.44.0209091832160.1714-100000@home.transmeta.com
>
From a reliability point of view, it is better to die than to proceed with possible data loss. The latter is extremely serious.
However, the one way this could be made to work would be with stack probes, which could be compiler-inserted. The point is that you touch an offset below the stack pointer large enough to cover not only the maximum amount of stack the function needs, but also an additional margin that leaves enough space to safely take the #PF on the remaining stack.
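A minimal illustration of that idea (hand-written here, whereas in practice the compiler would emit it in the function prologue; the margin value is an assumption, not an existing kernel constant):

/*
 * Illustrative sketch only -- kernel context assumed.  frame_size is the
 * maximum stack this function will use; STACK_PROBE_MARGIN (an assumed
 * value) leaves enough room to take and handle the #PF safely.
 */
#define STACK_PROBE_MARGIN	1024

static __always_inline void stack_probe(unsigned long frame_size)
{
	volatile char *probe;

	/* Touch one byte below everything this function will need. */
	probe = (volatile char *)(current_stack_pointer - frame_size -
				  STACK_PROBE_MARGIN);
	(void)*probe;	/* any missing page faults in here, early */
}

If the probed address is not yet backed, the #PF is taken while the stack pointer is still well inside known-good stack, so the fault handler has room to run.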