Message-ID: <39F17EC4-7844-4111-BF7D-FFC97B05D9FA@zytor.com>
Date: Thu, 14 Mar 2024 20:39:36 -0700
From: "H. Peter Anvin" <hpa@...or.com>
To: Pasha Tatashin <pasha.tatashin@...een.com>,
Matthew Wilcox <willy@...radead.org>
CC: Kent Overstreet <kent.overstreet@...ux.dev>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, akpm@...ux-foundation.org, x86@...nel.org,
bp@...en8.de, brauner@...nel.org, bristot@...hat.com,
bsegall@...gle.com, dave.hansen@...ux.intel.com, dianders@...omium.org,
dietmar.eggemann@....com, eric.devolder@...cle.com, hca@...ux.ibm.com,
hch@...radead.org, jacob.jun.pan@...ux.intel.com, jgg@...pe.ca,
jpoimboe@...nel.org, jroedel@...e.de, juri.lelli@...hat.com,
kinseyho@...gle.com, kirill.shutemov@...ux.intel.com,
lstoakes@...il.com, luto@...nel.org, mgorman@...e.de, mic@...ikod.net,
michael.christie@...cle.com, mingo@...hat.com, mjguzik@...il.com,
mst@...hat.com, npiggin@...il.com, peterz@...radead.org,
pmladek@...e.com, rick.p.edgecombe@...el.com, rostedt@...dmis.org,
surenb@...gle.com, tglx@...utronix.de, urezki@...il.com,
vincent.guittot@...aro.org, vschneid@...hat.com
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks
On March 14, 2024 8:13:56 PM PDT, Pasha Tatashin <pasha.tatashin@...een.com> wrote:
>On Thu, Mar 14, 2024 at 3:57 PM Matthew Wilcox <willy@...radead.org> wrote:
>>
>> On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
>> > On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
>> > > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
>> > > > Second, non-dynamic kernel memory is one of the core design decisions in
>> > > > Linux from early on. This means there are a lot of deeply embedded assumptions
>> > > > which would have to be untangled.
>> > >
>> > > I think there are other ways of getting the benefit that Pasha is seeking
>> > > without moving to dynamically allocated kernel memory. One icky thing
>> > > that XFS does is punt work over to a kernel thread in order to use more
>> > > stack! That breaks a number of things including lockdep (because the
>> > > kernel thread doesn't own the lock, the thread waiting for the kernel
>> > > thread owns the lock).
>> > >
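
(For reference, the punt-to-a-kthread pattern described above looks
roughly like the sketch below; punt_ctx and do_deep_stack_work() are
made-up names, not the actual XFS code.)

#include <linux/workqueue.h>
#include <linux/completion.h>

struct punt_ctx {
        struct work_struct work;
        struct completion done;
        /* ... whatever state the deep call chain needs ... */
};

static void punt_worker(struct work_struct *work)
{
        struct punt_ctx *ctx = container_of(work, struct punt_ctx, work);

        do_deep_stack_work(ctx);        /* runs on the kworker's own stack */
        complete(&ctx->done);
}

static void punt_for_stack(struct punt_ctx *ctx)
{
        INIT_WORK(&ctx->work, punt_worker);
        init_completion(&ctx->done);
        queue_work(system_unbound_wq, &ctx->work);
        /*
         * The original task sleeps here while still holding its locks,
         * which is why lockdep gets confused: the lock owner is merely
         * waiting, while the kworker is the one actually using the lock.
         */
        wait_for_completion(&ctx->done);
}
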
>> > > If we had segmented stacks, XFS could say "I need at least 6kB of stack",
>> > > and if less than that was available, we could allocate a temporary
>> > > stack and switch to it. I suspect Google would also be able to use this
>> > > API for their rare cases when they need more than 8kB of kernel stack.
>> > > Who knows, we might all be able to use such a thing.
>> > >
>> > > I'd been thinking about this from the point of view of allocating more
>> > > stack elsewhere in kernel space, but combining what Pasha has done here
>> > > with this idea might lead to a hybrid approach that works better; allocate
>> > > 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
>> > > rely on people using this "I need more stack" API correctly, and free the
>> > > excess pages on return to userspace. No complicated "switch stacks" API
>> > > needed, just an "ensure we have at least N bytes of stack remaining" API.
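
(A rough sketch of what such an "ensure we have at least N bytes of
stack" check could look like on top of that layout.  Both ->stack_low
and grow_task_stack() are invented for illustration, and the snippet
assumes an architecture that provides current_stack_pointer, as x86
does.)

#include <linux/sched.h>

/*
 * Hypothetical "make sure at least @bytes of stack are usable" hook for
 * the 32kB-vmap-area / 12kB-populated layout described above.
 * ->stack_low would track the lowest currently-populated stack address;
 * grow_task_stack() would back more of the reserved vmap area with
 * pages, which are freed again on return to userspace.
 */
static inline int ensure_stack(unsigned long bytes)
{
        unsigned long sp = current_stack_pointer;

        if (sp - current->stack_low >= bytes)
                return 0;               /* enough backed headroom */

        return grow_task_stack(current, bytes);
}

(A deep path like the XFS one above could then call
ensure_stack(6 * 1024) before diving in, instead of punting the work to
a kthread.)
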
>
>I like this approach! I think we could also consider having permanent
>big stacks for some kernel-only threads like kvm-vcpu. A cooperative
>stack increase framework could work well and wouldn't negatively
>impact the performance of context switching. However, thorough
>analysis would be necessary to proactively identify potential stack
>overflow situations.
>
>> > Why would we need an "I need more stack" API? Pasha's approach seems
>> > like everything we need for what you're talking about.
>>
>> Because double faults are hard, possibly impossible, and the FRED approach
>> Peter described has extra overhead? This was all described up-thread.
>
>Handling faults in #DF is possible. It requires code inspection to
>handle race conditions such as the one tglx showed. However, as Andy
>pointed out, this is not supported by the SDM, since #DF is an abort
>context (yet we already return from it because of ESPFIX64, so a
>return is possible).
>
>My question, however, concerns the case where we ignore the memory
>savings and consider only the reliability aspect of this feature.
>Which is better: unconditionally crashing the machine because a guard
>page was reached, or printing a huge warning with backtrace
>information about the offending stack, handling the fault, and
>surviving? I know that historically Linus has preferred WARN() to
>BUG() [1]. But this is a somewhat different scenario from a simple
>BUG vs. WARN choice.
>
>Pasha
>
>[1] https://lore.kernel.org/all/Pine.LNX.4.44.0209091832160.1714-100000@home.transmeta.com
>
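
(On the crash-vs-survive question above, the choice is roughly between
the two behaviours sketched here; handle_stack_fault(),
grow_kernel_stack() and CONFIG_DYNAMIC_STACK are illustrative names,
not the real x86 fault path.)

#include <linux/bug.h>
#include <linux/kernel.h>
#include <linux/ptrace.h>
#include <linux/sched.h>

static void handle_stack_fault(struct pt_regs *regs, unsigned long addr)
{
        if (IS_ENABLED(CONFIG_DYNAMIC_STACK)) {
                /* Warn loudly, back the faulting address with a page, survive. */
                WARN(1, "kernel stack running low, mapping extra page at %#lx\n",
                     addr);
                if (!grow_kernel_stack(current, addr))
                        return;         /* fault handled, the task lives on */
        }

        /* Today's behaviour: hitting the guard page is fatal. */
        panic("kernel stack overflow at %#lx\n", addr);
}
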
The real issue with using #DF is that if the event that caused it was asynchronous, you could lose the event.