linux-kernel - Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks


Message-ID: <CA+CK2bBU2zwu_V6hpOonswyuft5gWQh1H9tBbYP8efLRRAAdQQ@mail.gmail.com>
Date: Mon, 11 Mar 2024 20:08:16 -0400
From: Pasha Tatashin <pasha.tatashin@...een.com>
To: Andy Lutomirski <luto@...nel.org>
Cc: Linux Kernel Mailing List <linux-kernel@...r.kernel.org>, linux-mm@...ck.org, 
	Andrew Morton <akpm@...ux-foundation.org>, "the arch/x86 maintainers" <x86@...nel.org>, 
	Borislav Petkov <bp@...en8.de>, Christian Brauner <brauner@...nel.org>, bristot@...hat.com, 
	Ben Segall <bsegall@...gle.com>, Dave Hansen <dave.hansen@...ux.intel.com>, dianders@...omium.org, 
	dietmar.eggemann@....com, eric.devolder@...cle.com, hca@...ux.ibm.com, 
	"hch@...radead.org" <hch@...radead.org>, "H. Peter Anvin" <hpa@...or.com>, 
	Jacob Pan <jacob.jun.pan@...ux.intel.com>, Jason Gunthorpe <jgg@...pe.ca>, jpoimboe@...nel.org, 
	Joerg Roedel <jroedel@...e.de>, juri.lelli@...hat.com, 
	Kent Overstreet <kent.overstreet@...ux.dev>, kinseyho@...gle.com, 
	"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>, lstoakes@...il.com, mgorman@...e.de, 
	mic@...ikod.net, michael.christie@...cle.com, Ingo Molnar <mingo@...hat.com>, 
	mjguzik@...il.com, "Michael S. Tsirkin" <mst@...hat.com>, Nicholas Piggin <npiggin@...il.com>, 
	"Peter Zijlstra (Intel)" <peterz@...radead.org>, Petr Mladek <pmladek@...e.com>, 
	Rick P Edgecombe <rick.p.edgecombe@...el.com>, Steven Rostedt <rostedt@...dmis.org>, 
	Suren Baghdasaryan <surenb@...gle.com>, Thomas Gleixner <tglx@...utronix.de>, 
	Uladzislau Rezki <urezki@...il.com>, vincent.guittot@...aro.org, vschneid@...hat.com
Subject: Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks

> >> There are some other options: you could pre-map
> >
> > Pre-mapping would be expensive. It would mean pre-mapping the dynamic
> > pages for every scheduled thread, and we'd still need to check the
> > access bit every time a thread leaves the CPU.
>
> That's a write to four consecutive words in memory, with no locking required.

You convinced me, this might not be that bad. At thread creation time
we will save the locations of the thread's unmapped PTEs, and set them
on every schedule. There is a slight increase in scheduling cost, but
perhaps it is not as bad as I initially thought. This approach,
however, makes the dynamic stack feature much safer, and it can easily
be extended to all arches that support access/dirty bit tracking.
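
To make the idea concrete, here is a rough user-space sketch of what I have
in mind. It is not code from the series: struct dyn_stack, dyn_stack_premap()
and dyn_stack_unmap() are hypothetical names, four dynamic pages per stack is
an assumption, and the PTEs are modelled as plain 64-bit words. A real
implementation would also need TLB flushing when unmapping, which is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DYN_STACK_PAGES 4           /* assumed: four dynamic pages per stack */
#define PTE_PRESENT  (1ull << 0)    /* x86 present bit */
#define PTE_ACCESSED (1ull << 5)    /* x86 accessed bit, set by hardware */

/* Hypothetical per-thread bookkeeping, filled in at thread creation. */
struct dyn_stack {
	uint64_t *pte[DYN_STACK_PAGES]; /* saved locations of the dynamic PTEs */
	bool backed[DYN_STACK_PAGES];   /* page is now permanently owned */
};

/* Schedule-in: point every not-yet-backed PTE at a spare page. */
static void dyn_stack_premap(struct dyn_stack *ds,
			     const uint64_t spare_pfn[DYN_STACK_PAGES])
{
	assert(ds && spare_pfn);
	for (int i = 0; i < DYN_STACK_PAGES; i++)
		if (!ds->backed[i])
			*ds->pte[i] = (spare_pfn[i] << 12) | PTE_PRESENT;
}

/*
 * Schedule-out: a page whose accessed bit is set was touched and stays
 * with the thread; untouched pages are unmapped so the spare pages can
 * be reused. Returns how many spare pages were consumed.
 */
static int dyn_stack_unmap(struct dyn_stack *ds)
{
	int used = 0;

	for (int i = 0; i < DYN_STACK_PAGES; i++) {
		if (ds->backed[i])
			continue;
		if (*ds->pte[i] & PTE_ACCESSED) {
			ds->backed[i] = true;
			used++;
		} else {
			*ds->pte[i] = 0;
		}
	}
	return used;
}
```

The schedule-in side is only a few stores, and the schedule-out side is
a few loads plus a store per untouched page, with no locking in either
direction.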

>
> > Dynamic thread faults
> > should be considered rare events and thus shouldn't significantly
> > affect the performance of normal context switch operations. With 8K
> > stacks, we might encounter only 0.00001% of stacks requiring an extra
> > page, and even fewer needing 16K.
>
> Well yes, but if you crash 0.0001% of the time due to the microcode not liking you, you lose. :)
>
> >
> >> Also, I think the whole memory allocation concept in this whole series is a bit odd. Fundamentally, we *can't* block on these stack faults -- we may be in a context where blocking will deadlock. We may be in the page allocator. Panicking due to kernel stack allocation would be very unpleasant.
> >
> > We never block during handling stack faults. There's a per-CPU page
> > pool, guaranteeing availability for the faulting thread. The thread
> > simply takes pages from this per-CPU data structure and refills the
> > pool when leaving the CPU. The faulting routine is efficient,
> > requiring a fixed number of loads without any locks, stalling, or even
> > cmpxchg operations.
>
> You can't block when scheduling, either.  What if you can't refill the pool?

Why can't we (I am not a scheduler guy)? IRQs are not yet disabled, so
what prevents us from blocking while the old process has not yet been
removed from the CPU?
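
For reference, the per-CPU pool scheme quoted above could be sketched
roughly as follows. This is a simplified user-space model, not the code
from the series: struct stack_page_pool, pool_take(), pool_refill() and
the pool size of three pages are all hypothetical. The fault path only
pops from the pool; the refill path reports a shortfall instead of
blocking, which is exactly the case being asked about.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_PAGES 3    /* assumed pool size, for illustration only */

/* Hypothetical per-CPU pool of spare stack pages. */
struct stack_page_pool {
	void *page[POOL_PAGES];
	int count;
};

/*
 * Fault path: pop one page. No locks, no cmpxchg, a fixed number of
 * loads and stores. Returns NULL only if the pool was not refilled.
 */
static void *pool_take(struct stack_page_pool *p)
{
	assert(p->count >= 0 && p->count <= POOL_PAGES);
	if (p->count == 0)
		return NULL;
	return p->page[--p->count];
}

/*
 * Refill path, run when the thread leaves the CPU. Takes whatever
 * pages the caller managed to allocate and returns how many slots are
 * still empty. A non-zero result is the "can't refill" case.
 */
static int pool_refill(struct stack_page_pool *p, void *const *fresh,
		       int nfresh)
{
	int i = 0;

	while (p->count < POOL_PAGES && i < nfresh)
		p->page[p->count++] = fresh[i++];
	return POOL_PAGES - p->count;
}
```

What to do when pool_refill() returns non-zero is the open question:
either the refill is allowed to block at that point, or the next thread
runs on this CPU with a partially filled pool.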