linux-kernel - Re: [RFC 00/14] Dynamic Kernel Stacks

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CA+CK2bATJ7_EUszq4nr0AuZXG76nUhDs9osbxPUs=mLPFtW8Zg@mail.gmail.com>
Date: Sun, 17 Mar 2024 10:19:10 -0400
From: Pasha Tatashin <pasha.tatashin@...een.com>
To: Matthew Wilcox <willy@...radead.org>
Cc: "H. Peter Anvin" <hpa@...or.com>, Kent Overstreet <kent.overstreet@...ux.dev>, 
	linux-kernel@...r.kernel.org, linux-mm@...ck.org, akpm@...ux-foundation.org, 
	x86@...nel.org, bp@...en8.de, brauner@...nel.org, bristot@...hat.com, 
	bsegall@...gle.com, dave.hansen@...ux.intel.com, dianders@...omium.org, 
	dietmar.eggemann@....com, eric.devolder@...cle.com, hca@...ux.ibm.com, 
	hch@...radead.org, jacob.jun.pan@...ux.intel.com, jgg@...pe.ca, 
	jpoimboe@...nel.org, jroedel@...e.de, juri.lelli@...hat.com, 
	kinseyho@...gle.com, kirill.shutemov@...ux.intel.com, lstoakes@...il.com, 
	luto@...nel.org, mgorman@...e.de, mic@...ikod.net, 
	michael.christie@...cle.com, mingo@...hat.com, mjguzik@...il.com, 
	mst@...hat.com, npiggin@...il.com, peterz@...radead.org, pmladek@...e.com, 
	rick.p.edgecombe@...el.com, rostedt@...dmis.org, surenb@...gle.com, 
	tglx@...utronix.de, urezki@...il.com, vincent.guittot@...aro.org, 
	vschneid@...hat.com
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks

On Sat, Mar 16, 2024 at 8:41 PM Matthew Wilcox <willy@...radead.org> wrote:
>
> On Sat, Mar 16, 2024 at 03:17:57PM -0400, Pasha Tatashin wrote:
> > Expanding on Mathew's idea of an interface for dynamic kernel stack
> > sizes, here's what I'm thinking:
> >
> > - Kernel Threads: Create all kernel threads with a fully populated
> > THREAD_SIZE stack.  (i.e. 16K)
> > - User Threads: Create all user threads with THREAD_SIZE kernel stack
> > but only the top page mapped. (i.e. 4K)
> > - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> > three additional pages from the per-CPU stack cache. This function is
> > called early in kernel entry points.
> > - exit_to_user_mode(): Unmap the extra three pages and return them to
> > the per-CPU cache. This function is called late in the kernel exit
> > path.
> >
> > Both of the above hooks are called with IRQ disabled on all kernel
> > entries whether through interrupts and syscalls, and they are called
> > early/late enough that 4K is enough to handle the rest of entry/exit.
>
> At what point do we replenish the per-CPU stash of pages?  If we're
> 12kB deep in the stack and call mutex_lock(), we can be scheduled out,
> and then the new thread can make a syscall.  Do we just assume that
> get_free_page() can sleep at kernel entry (seems reasonable)?  I don't
> think this is an infeasible problem, I'd just like it to be described.

Once irq is enabled it is perfectly OK to sleep and wait for the stack
pages to become available.

The following user entries that enable interrupts:
do_user_addr_fault()
   local_irq_enable()

do_syscall_64()
  syscall_enter_from_user_mode()
    local_irq_enable()

__do_fast_syscall_32()
  syscall_enter_from_user_mode_prepare()
    local_irq_enable()

exc_debug_user()
  local_irq_enable()

do_int3_user()
  cond_local_irq_enable()

With those it is perfectly OK to sleep and wait for the page to become
available when we are in a situation where the per-cpu cache is empty,
and alloc_page(GFP_NOWAIT) does not succeed.

The other interrupts from userland never enable IRQs. We can have
3-pages per-cpu reserved for handling specifically IRQ-never enable
cases, as there cannot be more than one ever needed.

Pasha