[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250219105503.GKZ7W4h6QW1CNj48U9@fat_crate.local>
Date: Wed, 19 Feb 2025 11:55:03 +0100
From: Borislav Petkov <bp@...en8.de>
To: Brendan Jackman <jackmanb@...gle.com>
Cc: Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>,
Dave Hansen <dave.hansen@...ux.intel.com>,
"H. Peter Anvin" <hpa@...or.com>, Andy Lutomirski <luto@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Richard Henderson <richard.henderson@...aro.org>,
Matt Turner <mattst88@...il.com>, Vineet Gupta <vgupta@...nel.org>,
Russell King <linux@...linux.org.uk>,
Catalin Marinas <catalin.marinas@....com>,
Will Deacon <will@...nel.org>, Guo Ren <guoren@...nel.org>,
Brian Cain <bcain@...cinc.com>, Huacai Chen <chenhuacai@...nel.org>,
WANG Xuerui <kernel@...0n.name>,
Geert Uytterhoeven <geert@...ux-m68k.org>,
Michal Simek <monstr@...str.eu>,
Thomas Bogendoerfer <tsbogend@...ha.franken.de>,
Dinh Nguyen <dinguyen@...nel.org>, Jonas Bonn <jonas@...thpole.se>,
Stefan Kristiansson <stefan.kristiansson@...nalahti.fi>,
Stafford Horne <shorne@...il.com>,
"James E.J. Bottomley" <James.Bottomley@...senpartnership.com>,
Helge Deller <deller@....de>, Michael Ellerman <mpe@...erman.id.au>,
Nicholas Piggin <npiggin@...il.com>,
Christophe Leroy <christophe.leroy@...roup.eu>,
Naveen N Rao <naveen@...nel.org>,
Madhavan Srinivasan <maddy@...ux.ibm.com>,
Paul Walmsley <paul.walmsley@...ive.com>,
Palmer Dabbelt <palmer@...belt.com>,
Albert Ou <aou@...s.berkeley.edu>,
Heiko Carstens <hca@...ux.ibm.com>,
Vasily Gorbik <gor@...ux.ibm.com>,
Alexander Gordeev <agordeev@...ux.ibm.com>,
Christian Borntraeger <borntraeger@...ux.ibm.com>,
Sven Schnelle <svens@...ux.ibm.com>,
Yoshinori Sato <ysato@...rs.sourceforge.jp>,
Rich Felker <dalias@...c.org>,
John Paul Adrian Glaubitz <glaubitz@...sik.fu-berlin.de>,
"David S. Miller" <davem@...emloft.net>,
Andreas Larsson <andreas@...sler.com>,
Richard Weinberger <richard@....at>,
Anton Ivanov <anton.ivanov@...bridgegreys.com>,
Johannes Berg <johannes@...solutions.net>,
Chris Zankel <chris@...kel.net>, Max Filippov <jcmvbkbc@...il.com>,
Arnd Bergmann <arnd@...db.de>,
Andrew Morton <akpm@...ux-foundation.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>,
Uladzislau Rezki <urezki@...il.com>,
Christoph Hellwig <hch@...radead.org>,
Masami Hiramatsu <mhiramat@...nel.org>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Mike Rapoport <rppt@...nel.org>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Namhyung Kim <namhyung@...nel.org>,
Mark Rutland <mark.rutland@....com>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Jiri Olsa <jolsa@...nel.org>, Ian Rogers <irogers@...gle.com>,
Adrian Hunter <adrian.hunter@...el.com>,
Dennis Zhou <dennis@...nel.org>, Tejun Heo <tj@...nel.org>,
Christoph Lameter <cl@...ux.com>,
Sean Christopherson <seanjc@...gle.com>,
Paolo Bonzini <pbonzini@...hat.com>,
Ard Biesheuvel <ardb@...nel.org>,
Josh Poimboeuf <jpoimboe@...nel.org>,
Pawan Gupta <pawan.kumar.gupta@...ux.intel.com>, x86@...nel.org,
linux-kernel@...r.kernel.org, linux-alpha@...r.kernel.org,
linux-snps-arc@...ts.infradead.org,
linux-arm-kernel@...ts.infradead.org, linux-csky@...r.kernel.org,
linux-hexagon@...r.kernel.org, loongarch@...ts.linux.dev,
linux-m68k@...ts.linux-m68k.org, linux-mips@...r.kernel.org,
linux-openrisc@...r.kernel.org, linux-parisc@...r.kernel.org,
linuxppc-dev@...ts.ozlabs.org, linux-riscv@...ts.infradead.org,
linux-s390@...r.kernel.org, linux-sh@...r.kernel.org,
sparclinux@...r.kernel.org, linux-um@...ts.infradead.org,
linux-arch@...r.kernel.org, linux-mm@...ck.org,
linux-trace-kernel@...r.kernel.org,
linux-perf-users@...r.kernel.org, kvm@...r.kernel.org,
linux-efi@...r.kernel.org, Ofir Weisse <oweisse@...gle.com>,
Junaid Shahid <junaids@...gle.com>
Subject: Re: [PATCH RFC v2 03/29] mm: asi: Introduce ASI core API
On Fri, Jan 10, 2025 at 06:40:29PM +0000, Brendan Jackman wrote:
> Subject: Re: [PATCH RFC v2 03/29] mm: asi: Introduce ASI core API
x86/asi: ...
> Introduce core API for Address Space Isolation (ASI). Kernel address
> space isolation provides the ability to run some kernel
> code with a restricted kernel address space.
>
> There can be multiple classes of such restricted kernel address spaces
> (e.g. KPTI, KVM-PTI etc.). Each ASI class is identified by an index.
> The ASI class can register some hooks to be called when
> entering/exiting the restricted address space.
>
> Currently, there is a fixed maximum number of ASI classes supported.
> In addition, each process can have at most one restricted address space
> from each ASI class. Neither of these are inherent limitations and
> are merely simplifying assumptions for the time being.
>
> To keep things simpler for the time being, we disallow context switches
Please use passive voice in your commit message: no "we" or "I", etc,
and describe your changes in imperative mood.
Also, see section "Changelog" in
Documentation/process/maintainer-tip.rst
> within the restricted address space. In the future, we would be able to
> relax this limitation for the case of context switches to different
> threads within the same process (or to the idle thread and back).
>
> Note that this doesn't really support protecting sibling VM guests
> within the same VMM process from one another. From first principles
> it seems unlikely that anyone who cares about VM isolation would do
> that, but there could be a use-case to think about. In that case need
> something like the OTHER_MM logic might be needed, but specific to
> intra-process guest separation.
>
> [0]:
> https://lore.kernel.org/kvm/1562855138-19507-1-git-send-email-alexandre.chartre@oracle.com
>
> Notes about RFC-quality implementation details:
>
> - Ignoring checkpatch.pl AVOID_BUG.
> - The dynamic registration of classes might be pointless complexity.
> This was kept from RFCv1 without much thought.
> - The other-mm logic is also perhaps overly complex, suggestions are
> welcome for how best to tackle this (or we could just forget about
> it for the moment, and rely on asi_exit() happening in process
> switch).
> - The taint flag definitions would probably be clearer with an enum or
> something.
>
> Checkpatch-args: --ignore=AVOID_BUG,COMMIT_LOG_LONG_LINE,EXPORT_SYMBOL
> Co-developed-by: Ofir Weisse <oweisse@...gle.com>
> Signed-off-by: Ofir Weisse <oweisse@...gle.com>
> Co-developed-by: Junaid Shahid <junaids@...gle.com>
> Signed-off-by: Junaid Shahid <junaids@...gle.com>
> Signed-off-by: Brendan Jackman <jackmanb@...gle.com>
> ---
> arch/x86/include/asm/asi.h | 208 +++++++++++++++++++++++
> arch/x86/include/asm/processor.h | 8 +
> arch/x86/mm/Makefile | 1 +
> arch/x86/mm/asi.c | 350 +++++++++++++++++++++++++++++++++++++++
> arch/x86/mm/init.c | 3 +-
> arch/x86/mm/tlb.c | 1 +
> include/asm-generic/asi.h | 67 ++++++++
> include/linux/mm_types.h | 7 +
> kernel/fork.c | 3 +
> kernel/sched/core.c | 9 +
> mm/init-mm.c | 4 +
> 11 files changed, 660 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
> new file mode 100644
> index 0000000000000000000000000000000000000000..7cc635b6653a3970ba9dbfdc9c828a470e27bd44
> --- /dev/null
> +++ b/arch/x86/include/asm/asi.h
> @@ -0,0 +1,208 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_X86_ASI_H
> +#define _ASM_X86_ASI_H
> +
> +#include <linux/sched.h>
> +
> +#include <asm-generic/asi.h>
> +
> +#include <asm/pgtable_types.h>
> +#include <asm/percpu.h>
> +#include <asm/processor.h>
> +
> +#ifdef CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION
> +
> +/*
> + * Overview of API usage by ASI clients:
> + *
> + * Setup: First call asi_init() to create a domain. At present only one domain
> + * can be created per mm per class, but it's safe to asi_init() this domain
> + * multiple times. For each asi_init() call you must call asi_destroy() AFTER
> + * you are certain all CPUs have exited the restricted address space (by
> + * calling asi_exit()).
> + *
> + * Runtime usage:
> + *
> + * 1. Call asi_enter() to switch to the restricted address space. This can't be
> + * from an interrupt or exception handler and preemption must be disabled.
> + *
> + * 2. Execute untrusted code.
> + *
> + * 3. Call asi_relax() to inform the ASI subsystem that untrusted code execution
> + * is finished. This doesn't cause any address space change. This can't be
> + * from an interrupt or exception handler and preemption must be disabled.
> + *
> + * 4. Either:
> + *
> + * a. Go back to 1.
> + *
> + * b. Call asi_exit() before returning to userspace. This immediately
> + * switches to the unrestricted address space.
So only from reading this, it does sound weird. Maybe the code does it
differently - I'll see soon - but this basically says:
I asi_enter(), do something, asi_relax() and then I decide to do something
more and to asi_enter() again!? And then I can end it all with a *single*
asi_exit() call?
Hm, definitely weird API. Why?
/*
* Leave the "tense" state if we are in it, i.e. end the critical section. We
* will stay relaxed until the next asi_enter.
*/
void asi_relax(void);
Yeah, so there's no API functions balance between enter() and relax()...
> + *
> + * The region between 1 and 3 is called the "ASI critical section". During the
> + * critical section, it is a bug to access any sensitive data, and you mustn't
> + * sleep.
> + *
> + * The restriction on sleeping is not really a fundamental property of ASI.
> + * However for performance reasons it's important that the critical section is
> + * absolutely as short as possible. So the ability to do sleepy things like
> + * taking mutexes oughtn't to confer any convenience on API users.
> + *
> + * Similarly to the issue of sleeping, the need to asi_exit in case 4b is not a
> + * fundamental property of the system but a limitation of the current
> + * implementation. With further work it is possible to context switch
> + * from and/or to the restricted address space, and to return to userspace
> + * directly from the restricted address space, or _in_ it.
> + *
> + * Note that the critical section only refers to the direct execution path from
> + * asi_enter to asi_relax: it's fine to access sensitive data from exceptions
> + * and interrupt handlers that occur during that time. ASI will re-enter the
> + * restricted address space before returning from the outermost
> + * exception/interrupt.
> + *
> + * Note: ASI does not modify KPTI behaviour; when ASI and KPTI run together
> + * there are 2+N address spaces per task: the unrestricted kernel address space,
> + * the user address space, and one restricted (kernel) address space for each of
> + * the N ASI classes.
> + */
> +
> +/*
> + * ASI uses a per-CPU tainting model to track what mitigation actions are
> + * required on domain transitions. Taints exist along two dimensions:
> + *
> + * - Who touched the CPU (guest, unprotected kernel, userspace).
> + *
> + * - What kind of state might remain: "data" means there might be data owned by
> + * a victim domain left behind in a sidechannel. "Control" means there might
> + * be state controlled by an attacker domain left behind in the branch
> + * predictor.
> + *
> + * In principle the same domain can be both attacker and victim, thus we have
> + * both data and control taints for userspace, although there's no point in
> + * trying to protect against attacks from the kernel itself, so there's no
> + * ASI_TAINT_KERNEL_CONTROL.
> + */
> +#define ASI_TAINT_KERNEL_DATA ((asi_taints_t)BIT(0))
> +#define ASI_TAINT_USER_DATA ((asi_taints_t)BIT(1))
> +#define ASI_TAINT_GUEST_DATA ((asi_taints_t)BIT(2))
> +#define ASI_TAINT_OTHER_MM_DATA ((asi_taints_t)BIT(3))
> +#define ASI_TAINT_USER_CONTROL ((asi_taints_t)BIT(4))
> +#define ASI_TAINT_GUEST_CONTROL ((asi_taints_t)BIT(5))
> +#define ASI_TAINT_OTHER_MM_CONTROL ((asi_taints_t)BIT(6))
> +#define ASI_NUM_TAINTS 6
> +static_assert(BITS_PER_BYTE * sizeof(asi_taints_t) >= ASI_NUM_TAINTS);
Why is this a typedef at all to make the code more unreadable than it needs to
be? Why not a simple unsigned int or char or whatever you need?
> +
> +#define ASI_TAINTS_CONTROL_MASK \
> + (ASI_TAINT_USER_CONTROL | ASI_TAINT_GUEST_CONTROL | ASI_TAINT_OTHER_MM_CONTROL)
> +
> +#define ASI_TAINTS_DATA_MASK \
> + (ASI_TAINT_KERNEL_DATA | ASI_TAINT_USER_DATA | ASI_TAINT_OTHER_MM_DATA)
> +
> +#define ASI_TAINTS_GUEST_MASK (ASI_TAINT_GUEST_DATA | ASI_TAINT_GUEST_CONTROL)
> +#define ASI_TAINTS_USER_MASK (ASI_TAINT_USER_DATA | ASI_TAINT_USER_CONTROL)
> +
> +/* The taint policy tells ASI how a class interacts with the CPU taints */
> +struct asi_taint_policy {
> + /*
> + * What taints would necessitate a flush when entering the domain, to
> + * protect it from attack by prior domains?
> + */
> + asi_taints_t prevent_control;
So if those necessitate a flush, why isn't this var called "taints_to_flush"
or whatever which exactly explains what it is?
> + /*
> + * What taints would necessetate a flush when entering the domain, to
+ * What taints would necessetate a flush when entering the domain, to
Unknown word [necessetate] in comment.
Suggestions: ['necessitate',
Spellchecker please. Go over your whole set.
> + * protect former domains from attack by this domain?
> + */
> + asi_taints_t protect_data;
Same.
> + /* What taints should be set when entering the domain? */
> + asi_taints_t set;
So "required_taints" or so... hm?
> +};
> +
> +/*
> + * An ASI domain (struct asi) represents a restricted address space. The
no need for "(struct asi)" - it is right below :).
> + * unrestricted address space (and user address space under PTI) are not
> + * represented as a domain.
> + */
> +struct asi {
> + pgd_t *pgd;
> + struct mm_struct *mm;
> + int64_t ref_count;
> + enum asi_class_id class_id;
> +};
> +
> +DECLARE_PER_CPU_ALIGNED(struct asi *, curr_asi);
Or simply "asi" - this per-CPU var will be so prominent so that when you do
"per_cpu(asi)" you know what exactly it is
> +
> +void asi_init_mm_state(struct mm_struct *mm);
> +
> +int asi_init_class(enum asi_class_id class_id, struct asi_taint_policy *taint_policy);
> +void asi_uninit_class(enum asi_class_id class_id);
"uninit", meh. "exit" perhaps? or "destroy"?
And you have "asi_destroy" already so I guess you can do:
asi_class_init()
asi_class_destroy()
to have the namespace correct.
> +const char *asi_class_name(enum asi_class_id class_id);
> +
> +int asi_init(struct mm_struct *mm, enum asi_class_id class_id, struct asi **out_asi);
> +void asi_destroy(struct asi *asi);
> +
> +/* Enter an ASI domain (restricted address space) and begin the critical section. */
> +void asi_enter(struct asi *asi);
> +
> +/*
> + * Leave the "tense" state if we are in it, i.e. end the critical section. We
> + * will stay relaxed until the next asi_enter.
> + */
> +void asi_relax(void);
> +
> +/* Immediately exit the restricted address space if in it */
> +void asi_exit(void);
> +
> +/* The target is the domain we'll enter when returning to process context. */
> +static __always_inline struct asi *asi_get_target(struct task_struct *p)
> +{
> + return p->thread.asi_state.target;
> +}
> +
> +static __always_inline void asi_set_target(struct task_struct *p,
> + struct asi *target)
> +{
> + p->thread.asi_state.target = target;
> +}
> +
> +static __always_inline struct asi *asi_get_current(void)
> +{
> + return this_cpu_read(curr_asi);
> +}
> +
> +/* Are we currently in a restricted address space? */
> +static __always_inline bool asi_is_restricted(void)
> +{
> + return (bool)asi_get_current();
> +}
> +
> +/* If we exit/have exited, can we stay that way until the next asi_enter? */
> +static __always_inline bool asi_is_relaxed(void)
> +{
> + return !asi_get_target(current);
> +}
> +
> +/*
> + * Is the current task in the critical section?
> + *
> + * This is just the inverse of !asi_is_relaxed(). We have both functions in order to
> + * help write intuitive client code. In particular, asi_is_tense returns false
> + * when ASI is disabled, which is judged to make user code more obvious.
> + */
> +static __always_inline bool asi_is_tense(void)
> +{
> + return !asi_is_relaxed();
> +}
So can we tone down the silly helpers above? You don't really need
asi_is_tense() for example. It is still very intuitive if I read
if (!asi_is_relaxed())
...
> +
> +static __always_inline pgd_t *asi_pgd(struct asi *asi)
> +{
> + return asi ? asi->pgd : NULL;
> +}
> +
> +#define INIT_MM_ASI(init_mm) \
> + .asi_init_lock = __MUTEX_INITIALIZER(init_mm.asi_init_lock),
> +
> +void asi_handle_switch_mm(void);
> +
> +#endif /* CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION */
> +
> +#endif
Splitting the patch here and will continue with the next one as this one is
kinda big for one mail.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
Powered by blists - more mailing lists