linux-kernel - Re: [RFC 00/14] Dynamic Kernel Stacks

Message-ID: <CA+CK2bC+bgOfohCEEW7nwAdakVmzg=RhUjjw=+Rw3wFALnOq-Q@mail.gmail.com>
Date: Tue, 12 Mar 2024 15:45:44 -0400
From: Pasha Tatashin <pasha.tatashin@...een.com>
To: "H. Peter Anvin" <hpa@...or.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org, 
	akpm@...ux-foundation.org, x86@...nel.org, bp@...en8.de, brauner@...nel.org, 
	bristot@...hat.com, bsegall@...gle.com, dave.hansen@...ux.intel.com, 
	dianders@...omium.org, dietmar.eggemann@....com, eric.devolder@...cle.com, 
	hca@...ux.ibm.com, hch@...radead.org, jacob.jun.pan@...ux.intel.com, 
	jgg@...pe.ca, jpoimboe@...nel.org, jroedel@...e.de, juri.lelli@...hat.com, 
	kent.overstreet@...ux.dev, kinseyho@...gle.com, 
	kirill.shutemov@...ux.intel.com, lstoakes@...il.com, luto@...nel.org, 
	mgorman@...e.de, mic@...ikod.net, michael.christie@...cle.com, 
	mingo@...hat.com, mjguzik@...il.com, mst@...hat.com, npiggin@...il.com, 
	peterz@...radead.org, pmladek@...e.com, rick.p.edgecombe@...el.com, 
	rostedt@...dmis.org, surenb@...gle.com, tglx@...utronix.de, urezki@...il.com, 
	vincent.guittot@...aro.org, vschneid@...hat.com
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On Tue, Mar 12, 2024 at 1:19 PM H. Peter Anvin <hpa@...or.com> wrote:
>
>
>
> On 3/11/24 09:46, Pasha Tatashin wrote:
> > This is a follow-up to the LSF/MM proposal [1]. Please provide your
> > thoughts and comments about the dynamic kernel stacks feature. This is
> > a WIP that has not been tested beyond booting on some machines and
> > running the LKDTM thread exhaust tests. The series also lacks selftests
> > and documentation.
> >
> > This feature allows the kernel stack to grow dynamically, from 4KiB up
> > to THREAD_SIZE. The intent is to save memory on fleet machines. Initial
> > experiments show an average saving of 70-75% of the kernel stack
> > memory.
> >
> > The average depth of a kernel thread depends on the workload, profiling,
> > virtualization, compiler optimizations, and driver implementations.
> > However, the table below shows the amount of kernel stack memory before
> > vs. after on idling freshly booted machines:
> >
> > CPU           #Cores #Stacks  BASE(kb) Dynamic(kb)   Saving
> > AMD Genoa        384    5786    92576       23388    74.74%
> > Intel Skylake    112    3182    50912       12860    74.74%
> > AMD Rome         128    3401    54416       14784    72.83%
> > AMD Rome         256    4908    78528       20876    73.42%
> > Intel Haswell     72    2644    42304       10624    74.89%
> >
> > Some workloads that have millions of threads can benefit
> > significantly from this feature.
> >
>
> Ok, first of all, talking about "kernel memory" here is misleading.

Hi Peter,

I re-read my cover letter, and I do not see where "kernel memory" is
mentioned. We are talking about kernel stack overhead, which is
proportional to the user workload, as every active thread has an
associated kernel stack. The idea is to save memory by not
pre-allocating all pages of kernel stacks, and instead treating the
remaining pages as a safeguard for when a stack actually becomes deep.
The goal is a solution that handles the rare deeper stacks only when
needed. This could be done through faulting on the supported hardware
(as proposed in this series), or via pre-mapping on every schedule
event and checking the access when the thread goes off CPU (as proposed
by Andy Lutomirski to avoid double faults on x86).
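
As a rough illustration of the second option, here is a toy model of
pre-map on schedule-in and check-access on schedule-out. It is only a
sketch and not code from the series: the ToyStack class, its method
names, the 16KiB THREAD_SIZE and the page accounting are all assumptions
made for the example.

```python
PAGE_KIB = 4            # x86 base page size
THREAD_SIZE_KIB = 16    # assumed THREAD_SIZE, not taken from the series
PAGES_PER_STACK = THREAD_SIZE_KIB // PAGE_KIB


class ToyStack:
    """Hypothetical model of one thread's dynamically backed kernel stack."""

    def __init__(self):
        self.owned = 1      # pages kept by the thread; starts at 4KiB
        self.borrowed = 0   # pages pre-mapped only while the thread is on CPU

    def schedule_in(self):
        # Pre-map: borrow pages so the whole THREAD_SIZE range is backed
        # and no stack fault can happen while the thread runs.
        self.borrowed = PAGES_PER_STACK - self.owned

    def schedule_out(self, pages_touched):
        # Check access: keep the borrowed pages that were actually used and
        # hand the untouched ones back. Returns the number handed back.
        if not 1 <= pages_touched <= PAGES_PER_STACK:
            raise ValueError("pages_touched out of range")
        self.owned = max(self.owned, pages_touched)
        self.borrowed = 0
        return PAGES_PER_STACK - self.owned
```

In this model a thread that never goes deeper than its first page keeps
owning 4KiB across any number of context switches, while a thread that
once touched three pages owns 12KiB from then on. The work done in
schedule_in() and schedule_out() stands in for the extra context-switch
cost of this approach.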

In other words, this feature is only about one very specific type of
kernel memory that is not even directly mapped (the feature requires
vmapped stacks).
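
To put that one type of memory into numbers, the savings column of the
table quoted above can be reproduced with the short sketch below. It
assumes a THREAD_SIZE of 16KiB per stack; that value is not stated in
the cover letter, but it matches BASE divided by #Stacks for every row.
The function names are only for illustration.

```python
THREAD_SIZE_KIB = 16  # assumption: BASE / #Stacks is 16 for every row

# (CPU, cores, stacks, dynamic KiB), copied from the quoted table
ROWS = [
    ("AMD Genoa", 384, 5786, 23388),
    ("Intel Skylake", 112, 3182, 12860),
    ("AMD Rome", 128, 3401, 14784),
    ("AMD Rome", 256, 4908, 20876),
    ("Intel Haswell", 72, 2644, 10624),
]


def base_kib(stacks):
    """Memory used when every stack is fully pre-allocated."""
    return stacks * THREAD_SIZE_KIB


def saving_percent(base, dynamic):
    """Share of the pre-allocated stack memory that is no longer used."""
    return 100.0 * (base - dynamic) / base


for cpu, cores, stacks, dynamic in ROWS:
    base = base_kib(stacks)
    print(f"{cpu:<14} {cores:>4} {base:>6} {dynamic:>6} "
          f"{saving_percent(base, dynamic):6.2f}%")
```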

> Unless your threads are spending nearly all their time sleeping, the
> threads will occupy stack and TLS memory in user space as well.

Can you please elaborate: what data is contained in the kernel stack
when a thread is in user space? My series requires thread_info not to
be on the stack, by depending on THREAD_INFO_IN_TASK.

> Second, non-dynamic kernel memory is one of the core design decisions in
> Linux from early on. This means there are a lot of deeply embedded
> assumptions which would have to be untangled.
>
> Linus would, of course, be the real authority on this, but if someone
> would ask me what the fundamental design philosophies of the Linux
> kernel are -- the design decisions which make Linux Linux, if you will
> -- I would say:
>
>         1. Non-dynamic kernel memory
>         2. Permanent mapping of physical memory

Points one and two are correlated. Given that all memory is directly
mapped, the kernel core cannot be relocatable, swappable, faultable,
etc.

>         3. Kernel API modeled closely after the POSIX API
>            (no complicated user space layers)
>         4. Fast system call entry/exit (a necessity for a
>            kernel API based on simple system calls)
>         5. Monolithic (but modular) kernel environment
>            (not cross-privilege, coroutine or message passing)
>
> Third, *IF* this is something that should be done (and I personally
> strongly suspect it should not), at least on x86-64 it probably should
> be for FRED hardware only. With FRED, it is possible to set the #PF
> event stack level to 1, which will cause an automatic stack switch for
> #PF in kernel space (only). However, even in kernel space, #PF can sleep
> if it references a user space page, in which case it would have to be
> demoted back onto the ring 0 stack (there are multiple ways of doing
> that, but it does entail an overhead.)

My understanding is that with the proposed approach only the use of
double faults is ruled out. Pre-map/check-access could still work, even
though it would add some cost to context switching.

Thank you,
Pasha