linux-kernel - Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CA+CK2bAzJuCe06g_TEOh3B-hK+dXfUaGaOSTgzyxkN4zqpSU_A@mail.gmail.com>
Date: Wed, 13 Mar 2024 11:28:26 -0400
From: Pasha Tatashin <pasha.tatashin@...een.com>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org, 
	akpm@...ux-foundation.org, x86@...nel.org, bp@...en8.de, brauner@...nel.org, 
	bristot@...hat.com, bsegall@...gle.com, dave.hansen@...ux.intel.com, 
	dianders@...omium.org, dietmar.eggemann@....com, eric.devolder@...cle.com, 
	hca@...ux.ibm.com, hch@...radead.org, hpa@...or.com, 
	jacob.jun.pan@...ux.intel.com, jgg@...pe.ca, jpoimboe@...nel.org, 
	jroedel@...e.de, juri.lelli@...hat.com, kent.overstreet@...ux.dev, 
	kinseyho@...gle.com, kirill.shutemov@...ux.intel.com, lstoakes@...il.com, 
	luto@...nel.org, mgorman@...e.de, mic@...ikod.net, 
	michael.christie@...cle.com, mingo@...hat.com, mjguzik@...il.com, 
	mst@...hat.com, npiggin@...il.com, peterz@...radead.org, pmladek@...e.com, 
	rick.p.edgecombe@...el.com, rostedt@...dmis.org, surenb@...gle.com, 
	urezki@...il.com, vincent.guittot@...aro.org, vschneid@...hat.com
Subject: Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks

On Wed, Mar 13, 2024 at 9:43 AM Pasha Tatashin
<pasha.tatashin@...een.com> wrote:
>
> On Wed, Mar 13, 2024 at 6:23 AM Thomas Gleixner <tglx@...utronix.de> wrote:
> >
> > On Mon, Mar 11 2024 at 16:46, Pasha Tatashin wrote:
> > > @@ -413,6 +413,9 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
> > >       }
> > >  #endif
> > >
> > > +     if (dynamic_stack_fault(current, address))
> > > +             return;
> > > +
> > >       irqentry_nmi_enter(regs);
> > >       instrumentation_begin();
> > >       notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);
> > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > > index d6375b3c633b..651c558b10eb 100644
> > > --- a/arch/x86/mm/fault.c
> > > +++ b/arch/x86/mm/fault.c
> > > @@ -1198,6 +1198,9 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
> > >       if (is_f00f_bug(regs, hw_error_code, address))
> > >               return;
> > >
> > > +     if (dynamic_stack_fault(current, address))
> > > +             return;
> >
> > T1 schedules out with stack used close to the fault boundary.
> >
> >    switch_to(T2)
> >
> > Now T1 schedules back in
> >
> >    switch_to(T1)
> >      __switch_to_asm()
> >         ...
> >         switch_stacks()         <- SP on T1 stack
> > !        ...
> > !       jmp __switch_to()
> > !    __switch_to()
> > !       ...
> > !       raw_cpu_write(pcpu_hot.current_task, next_p);
> >
> > After switching SP to T1's stack and up to the point where
> > pcpu_hot.current_task (aka current) is updated to T1 a stack fault will
> > invoke dynamic_stack_fault(T2, address) which will return false here:
> >
> >         /* check if address is inside the kernel stack area */
> >         stack = (unsigned long)tsk->stack;
> >         if (address < stack || address >= stack + THREAD_SIZE)
> >                 return false;
> >
> > because T2's stack does obviously not cover the faulting address on T1's
> > stack. As a consequence double fault will panic the machine.
>
> Hi Thomas,
>
> Thank you, you are absolutely right, we can't trust "current" in the
> fault handler.
>
> We can change dynamic_stack_fault() to only accept fault_address as an
> argument, and let it determine the right task_struct pointer
> internally.
>
> Let's modify dynamic_stack_fault() to accept only the fault_address.
> It can then determine the correct task_struct pointer internally.
>
> Here's a potential solution that is fast, avoids locking, and ensures atomicity:
>
> 1. Kernel Stack VA Space
> Dedicate a virtual address range ([KSTACK_START_VA - KSTACK_END_VA])
> exclusively for kernel stacks. This simplifies validation of faulting
> addresses to be part of a stack.
>
> 2. Finding the faulty task
> - Use ALIGN(fault_address, THREAD_SIZE) to calculate the end of the
> topmost stack page (since stack addresses are aligned to THREAD_SIZE).
> - Store the task_struct pointer as the last word on this topmost page,
> that is always present as it is a pre-allcated stack page.
>
> 3. Stack Padding
> Increase padding to 8 bytes on x86_64 (TOP_OF_KERNEL_STACK_PADDING 8)
> to accommodate the task_struct pointer.

Alternatively, do not even look-up the task_struct in
dynamic_stack_fault(), but only install the mapping to the faulting
address, store va in the per-cpu array, and handle the rest in
dynamic_stack() during context switching. At that time spin locks can
be taken, and we can do a find_vm_area(addr) call.

This way, we would not need to modify TOP_OF_KERNEL_STACK_PADDING to
keep task_struct in there.

Pasha