linux-kernel - RE: [RFC 00/14] Dynamic Kernel Stacks


Message-ID: <4b542b49b2994c9d8c4c73b9e3b42dde@AcuMS.aculab.com>
Date: Mon, 18 Mar 2024 15:53:07 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Pasha Tatashin' <pasha.tatashin@...een.com>, Matthew Wilcox
	<willy@...radead.org>
CC: "H. Peter Anvin" <hpa@...or.com>, Kent Overstreet
	<kent.overstreet@...ux.dev>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "linux-mm@...ck.org" <linux-mm@...ck.org>,
	"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>, "x86@...nel.org"
	<x86@...nel.org>, "bp@...en8.de" <bp@...en8.de>, "brauner@...nel.org"
	<brauner@...nel.org>, "bristot@...hat.com" <bristot@...hat.com>,
	"bsegall@...gle.com" <bsegall@...gle.com>, "dave.hansen@...ux.intel.com"
	<dave.hansen@...ux.intel.com>, "dianders@...omium.org"
	<dianders@...omium.org>, "dietmar.eggemann@....com"
	<dietmar.eggemann@....com>, "eric.devolder@...cle.com"
	<eric.devolder@...cle.com>, "hca@...ux.ibm.com" <hca@...ux.ibm.com>,
	"hch@...radead.org" <hch@...radead.org>, "jacob.jun.pan@...ux.intel.com"
	<jacob.jun.pan@...ux.intel.com>, "jgg@...pe.ca" <jgg@...pe.ca>,
	"jpoimboe@...nel.org" <jpoimboe@...nel.org>, "jroedel@...e.de"
	<jroedel@...e.de>, "juri.lelli@...hat.com" <juri.lelli@...hat.com>,
	"kinseyho@...gle.com" <kinseyho@...gle.com>,
	"kirill.shutemov@...ux.intel.com" <kirill.shutemov@...ux.intel.com>,
	"lstoakes@...il.com" <lstoakes@...il.com>, "luto@...nel.org"
	<luto@...nel.org>, "mgorman@...e.de" <mgorman@...e.de>, "mic@...ikod.net"
	<mic@...ikod.net>, "michael.christie@...cle.com"
	<michael.christie@...cle.com>, "mingo@...hat.com" <mingo@...hat.com>,
	"mjguzik@...il.com" <mjguzik@...il.com>, "mst@...hat.com" <mst@...hat.com>,
	"npiggin@...il.com" <npiggin@...il.com>, "peterz@...radead.org"
	<peterz@...radead.org>, "pmladek@...e.com" <pmladek@...e.com>,
	"rick.p.edgecombe@...el.com" <rick.p.edgecombe@...el.com>,
	"rostedt@...dmis.org" <rostedt@...dmis.org>, "surenb@...gle.com"
	<surenb@...gle.com>, "tglx@...utronix.de" <tglx@...utronix.de>,
	"urezki@...il.com" <urezki@...il.com>, "vincent.guittot@...aro.org"
	<vincent.guittot@...aro.org>, "vschneid@...hat.com" <vschneid@...hat.com>
Subject: RE: [RFC 00/14] Dynamic Kernel Stacks

From: Pasha Tatashin
> Sent: 18 March 2024 15:31
> 
> On Mon, Mar 18, 2024 at 11:19 AM Matthew Wilcox <willy@...radead.org> wrote:
> >
> > On Mon, Mar 18, 2024 at 11:09:47AM -0400, Pasha Tatashin wrote:
> > > The TLB load is going to be exactly the same as today, we already use
> > > small pages for VMA mapped stacks. We won't need to have extra
> > > flushing either, the mappings are in the kernel space, and once pages
> > > are removed from the page table, no one is going to access that VA
> > > space until that thread enters the kernel again. We will need to
> > > invalidate the VA range only when the pages are mapped, and only on
> > > the local cpu.
> >
> > No; we can pass pointers to our kernel stack to other threads.  The
> > obvious one is a mutex; we put a mutex_waiter on our own stack and
> > add its list_head to the mutex's waiter list.  I'm sure you can
> > think of many other places we do this (eg wait queues, poll(), select(),
> > etc).
> 
> Hm, it means that the thread is sleeping in the kernel space, and has its
> stack pages mapped and invalidated on the local CPU, but access from
> a remote CPU to those stack pages would be problematic.
> 
> I think we still won't need IPI, but VA-range invalidation is actually
> needed on unmaps, and should happen during context switch so every
> time we go off-cpu. Therefore, what Brian/Andy have suggested makes
> more sense than the kernel enter/exit paths.

I think you'll need to broadcast an invalidate.
Consider:
CPU A: task allocates extra pages and adds something to some list.
CPU B: accesses that data and maybe modifies it.
	The page-table walk loads the translation into CPU B's TLB.
CPU A: task detects the modification, removes the item from the list,
	collapses the stack back and sleeps.
	Stack pages freed.
CPU A: task wakes up (on the same cpu for simplicity).
	Goes down a deep stack and puts an item on a list.
	Different physical pages are allocated.
CPU B: accesses the associated KVA.
	It had better not have a stale TLB entry cached for it.

Doesn't that need an IPI?
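
The sequence above can be replayed in a toy model. This is only a sketch,
not kernel code: the ToyMMU class, the run() helper, the KVA value and the
"phys-1"/"phys-2" names are all made up for illustration. It assumes one
shared page table and one private TLB per CPU, and compares a local-only
invalidate on unmap with an invalidate that reaches every CPU.

```python
class ToyMMU:
    """Toy model (hypothetical): one shared page table, one private TLB per CPU."""

    def __init__(self, cpus):
        self.page_table = {}                  # KVA -> physical page
        self.tlb = {cpu: {} for cpu in cpus}  # per-CPU cached translations

    def map(self, kva, phys):
        # Install a translation in the shared page table.
        self.page_table[kva] = phys

    def unmap(self, kva, flush_cpus):
        # Remove the translation and invalidate it only on the listed CPUs.
        del self.page_table[kva]
        for cpu in flush_cpus:
            self.tlb[cpu].pop(kva, None)

    def access(self, cpu, kva):
        # A TLB miss walks the page table and caches the result;
        # a TLB hit never looks at the page table again.
        if kva not in self.tlb[cpu]:
            self.tlb[cpu][kva] = self.page_table[kva]
        return self.tlb[cpu][kva]


def run(flush_cpus):
    """Replay the CPU A / CPU B sequence; return the page CPU B ends up using."""
    mmu = ToyMMU(["A", "B"])
    kva = 0xFFFFC90000004000     # arbitrary example address (assumption)
    mmu.map(kva, "phys-1")       # CPU A: stack grows, item goes on a list
    mmu.access("B", kva)         # CPU B: touches the item, TLB now holds phys-1
    mmu.unmap(kva, flush_cpus)   # CPU A: stack collapses, pages freed
    mmu.map(kva, "phys-2")       # CPU A: stack grows again with different pages
    return mmu.access("B", kva)  # CPU B: touches the same KVA again


local_only = run(["A"])          # invalidate on the unmapping CPU only
broadcast = run(["A", "B"])      # invalidate on every CPU
print("local-only invalidate:", local_only)
print("broadcast invalidate: ", broadcast)
```

With the local-only invalidate CPU B keeps using the freed page; only the
variant that also reaches CPU B picks up the new mapping.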

Freeing the pages is much harder than allocating them.

	David
