linux-kernel - Re: [RFC 00/14] Dynamic Kernel Stacks

Message-ID: <CA+CK2bBr2wH4=L39ZthRPUnAjVxMqt80bsZj0NPx9xdH=_Mn0Q@mail.gmail.com>
Date: Mon, 11 Mar 2024 14:58:45 -0400
From: Pasha Tatashin <pasha.tatashin@...een.com>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org, 
	akpm@...ux-foundation.org, x86@...nel.org, bp@...en8.de, brauner@...nel.org, 
	bristot@...hat.com, bsegall@...gle.com, dave.hansen@...ux.intel.com, 
	dianders@...omium.org, dietmar.eggemann@....com, eric.devolder@...cle.com, 
	hca@...ux.ibm.com, hch@...radead.org, hpa@...or.com, 
	jacob.jun.pan@...ux.intel.com, jgg@...pe.ca, jpoimboe@...nel.org, 
	jroedel@...e.de, juri.lelli@...hat.com, kent.overstreet@...ux.dev, 
	kinseyho@...gle.com, kirill.shutemov@...ux.intel.com, lstoakes@...il.com, 
	luto@...nel.org, mgorman@...e.de, mic@...ikod.net, 
	michael.christie@...cle.com, mingo@...hat.com, mst@...hat.com, 
	npiggin@...il.com, peterz@...radead.org, pmladek@...e.com, 
	rick.p.edgecombe@...el.com, rostedt@...dmis.org, surenb@...gle.com, 
	tglx@...utronix.de, urezki@...il.com, vincent.guittot@...aro.org, 
	vschneid@...hat.com
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On Mon, Mar 11, 2024 at 1:09 PM Mateusz Guzik <mjguzik@...il.com> wrote:
>
> On 3/11/24, Pasha Tatashin <pasha.tatashin@...een.com> wrote:
> > This is a follow-up to the LSF/MM proposal [1]. Please provide your
> > thoughts and comments about the dynamic kernel stacks feature. This is
> > a WIP that has not been tested beyond booting on some machines and
> > running the LKDTM thread exhaust tests. The series also lacks
> > selftests and documentation.
> >
> > This feature allows the kernel stack to grow dynamically, from 4KiB
> > up to THREAD_SIZE. The intent is to save memory on fleet machines.
> > Initial experiments show an average saving of 70-75% of kernel stack
> > memory.
> >
>

Hi Mateusz,

> Can you please elaborate on how this works? I have trouble figuring it
> out from a cursory reading of the patchset and commit messages. That
> aside, I would argue this should have been explained in the cover
> letter.

Sure, I answered your questions below.

> For example, say a thread takes a bunch of random locks (most notably
> spinlocks) and/or disables preemption, then pushes some stuff onto the
> stack which now faults. That is to say the fault can happen in rather
> arbitrary context.
>
> If any of the conditions described below are prevented in the first
> place, it really needs to be described how.
>
> That said, from top of my head:
> 1. what about faults when the thread holds a bunch of arbitrary locks
> or has preemption disabled? is the allocation lockless?

Each thread has a stack of 4 pages:
  - Pre-allocated page (1): always allocated and mapped at thread creation.
  - Dynamic pages (3): mapped on demand when the stack faults.
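To make the layout concrete, here is a rough user-space sketch (not code from the series) of the index arithmetic. It assumes 4KiB pages, a downward-growing stack whose highest page is the pre-allocated one, and that a fault maps every missing page between the faulting address and the pre-allocated page; `ds_page_index()`, `ds_pages_needed()` and the `DS_*` constants are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constants: 4 KiB pages, 4-page (16 KiB) stack. */
#define DS_PAGE_SIZE   4096UL
#define DS_STACK_PAGES 4
#define DS_STACK_SIZE  (DS_STACK_PAGES * DS_PAGE_SIZE)
/* Assumed: the stack grows down, so the top page is the pre-allocated one. */
#define DS_PREALLOC    (DS_STACK_PAGES - 1)

/* Index of the stack page holding @addr (0 = lowest address), or -1. */
static int ds_page_index(uintptr_t base, uintptr_t addr)
{
    if (addr < base || addr - base >= DS_STACK_SIZE)
        return -1;
    return (int)((addr - base) / DS_PAGE_SIZE);
}

/*
 * Number of dynamic pages a fault at @addr must map, assuming every
 * missing page from the faulting one up to the pre-allocated page is
 * filled in. Returns -1 if @addr is outside the stack.
 */
static int ds_pages_needed(uintptr_t base, uintptr_t addr,
                           const bool mapped[DS_STACK_PAGES])
{
    int idx = ds_page_index(base, addr);
    int needed = 0;

    if (idx < 0)
        return -1;
    for (int i = idx; i < DS_PREALLOC; i++)
        if (!mapped[i])
            needed++;
    return needed;
}
```

With only the top page mapped, a fault in page 0 needs all three dynamic pages at once, which is the rare multi-page case.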

A per-CPU data structure holds a reserve of 3 dynamic pages. These
pages back stack faults taken by the thread running on that CPU, even
in interrupt-disabled contexts. Typically only one page is needed, but
in the rare case where the thread accesses beyond that page, a single
fault may use up to all three. Because the pages belong to the CPU,
stack faults can be handled atomically, without conflicts from other
processes. Additionally, the stack's virtual address (VA) is
16K-aligned and one of its pages is always pre-allocated and mapped.
An aligned 16K stack never crosses the range covered by a single
page-table page, so the page table for the whole stack already exists
and no page table allocation is required during the fault.
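The per-CPU part can be sketched in user space as follows. This is an illustration under stated assumptions, not the series' implementation: x86-64 with 4KiB pages, where one page-table page holds 512 PTEs and so maps 2MiB; a plain array standing in for a real per-CPU variable; and hypothetical names `ds_reserve`, `ds_reserve_take()` and `ds_stack_in_one_pte_page()`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DS_PAGE_SIZE  4096UL
#define DS_STACK_SIZE (4 * DS_PAGE_SIZE)    /* 16 KiB, 16 KiB-aligned */
#define DS_PTE_SPAN   (512 * DS_PAGE_SIZE)  /* 2 MiB mapped by one PTE page */
#define DS_RESERVE    3                     /* dynamic pages kept per CPU */
#define DS_NR_CPUS    4                     /* stand-in for nr_cpu_ids */

struct ds_cpu_reserve {
    void *pages[DS_RESERVE];
    int   count;               /* valid entries in pages[] */
};

/* One reserve per CPU; in the kernel this would be a per-CPU variable. */
static struct ds_cpu_reserve ds_reserve[DS_NR_CPUS];

/*
 * Pop a page from @cpu's reserve. Assumed to run on @cpu with preemption
 * and interrupts disabled, so nothing else touches this reserve: no lock,
 * no allocation. Returns NULL when the reserve is empty.
 */
static void *ds_reserve_take(int cpu)
{
    struct ds_cpu_reserve *r = &ds_reserve[cpu];

    if (r->count == 0)
        return NULL;
    return r->pages[--r->count];
}

/*
 * True if [base, base + DS_STACK_SIZE) lies inside one 2 MiB region, i.e.
 * all of the stack's PTEs sit in the same page-table page. Any 16 KiB-aligned
 * base satisfies this, so once the pre-allocated page is mapped, mapping a
 * dynamic page only fills in an existing PTE slot.
 */
static bool ds_stack_in_one_pte_page(uintptr_t base)
{
    return base / DS_PTE_SPAN == (base + DS_STACK_SIZE - 1) / DS_PTE_SPAN;
}
```

The check also shows why alignment matters: a 16KiB stack at an unaligned base such as 0x1fe000 crosses the 2MiB boundary at 0x200000 and would need a second page-table page.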

When a thread is switched off the CPU in normal kernel mode, we check a
flag to see whether it has taken any stack faults. If so, we charge the
thread for the new stack pages and refill the per-CPU data structure
with any missing pages.
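The switch-out step might look roughly like this. `ds_switch_out()`, the `new_pages` counter and `charged_bytes` (a stand-in for real memory-cgroup accounting) are hypothetical, and `malloc()` stands in for the kernel page allocator; the allocator is passed in so that a failing one can be simulated.

```c
#include <stdbool.h>
#include <stdlib.h>

#define DS_PAGE_SIZE 4096
#define DS_RESERVE   3

struct ds_cpu_reserve {
    void *pages[DS_RESERVE];
    int   count;
};

struct ds_thread {
    bool faulted;       /* set by the stack-fault path */
    int  new_pages;     /* dynamic pages mapped since the last switch-out */
    long charged_bytes; /* hypothetical stand-in for memcg accounting */
};

/* Hypothetical allocator hook; the kernel would use its page allocator. */
typedef void *(*ds_alloc_fn)(void);

static void *ds_alloc_page(void)
{
    return malloc(DS_PAGE_SIZE);
}

/*
 * Called when @t leaves the CPU in a context where sleeping/allocating is
 * allowed. Returns how many pages are still missing from the reserve.
 */
static int ds_switch_out(struct ds_thread *t, struct ds_cpu_reserve *r,
                         ds_alloc_fn alloc)
{
    if (!t->faulted)
        return DS_RESERVE - r->count;  /* fast path: just a flag check */

    /* Charge the thread for the pages its stack grew by. */
    t->charged_bytes += (long)t->new_pages * DS_PAGE_SIZE;
    t->new_pages = 0;
    t->faulted = false;

    /* Refill the per-CPU reserve; stop early if allocation fails. */
    while (r->count < DS_RESERVE) {
        void *p = alloc();
        if (!p)
            break;
        r->pages[r->count++] = p;
    }
    return DS_RESERVE - r->count;
}
```

Threads that never faulted only pay for the flag check, which keeps the common context-switch path cheap.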

> 2. what happens if there is no memory from which to map extra pages in
> the first place? you may be in a position where you can't go off cpu

If the per-CPU data structure cannot be refilled and a thread then
takes a stack fault, we print a message reporting a critical stack
fault and trigger a system-wide panic, similar to a guard page access
violation.
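A minimal sketch of that failure path, assuming the fault handler never allocates and only draws from the per-CPU pages; `ds_stack_fault()` and `enum ds_fault_result` are hypothetical, and the `fprintf()` plus `DS_FAULT_FATAL` return stand in for the kernel's critical-fault message and panic.

```c
#include <stdio.h>

#define DS_RESERVE 3

struct ds_cpu_reserve {
    void *pages[DS_RESERVE];
    int   count;
};

enum ds_fault_result {
    DS_FAULT_HANDLED,   /* page(s) taken from the per-CPU reserve */
    DS_FAULT_FATAL,     /* reserve too small: the kernel would panic here */
};

/*
 * Handle a stack fault that needs @needed dynamic pages. No allocation is
 * attempted: if the reserve was not refilled at the last switch-out, the
 * fault is unrecoverable.
 */
static enum ds_fault_result ds_stack_fault(struct ds_cpu_reserve *r, int needed)
{
    if (needed > r->count) {
        /* Stand-in for the critical stack fault message + panic. */
        fprintf(stderr, "critical stack fault: need %d page(s), %d left\n",
                needed, r->count);
        return DS_FAULT_FATAL;
    }
    r->count -= needed;  /* the pages would be mapped into the stack here */
    return DS_FAULT_HANDLED;
}
```

Returning a status instead of calling panic() directly only serves to keep the sketch testable in user space.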

Pasha