linux-kernel - Re: [RFC PATCH 00/47] Address Space Isolation for KVM

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <91dd5f0a-61da-074d-42ed-bf0886f617d9@oracle.com>
Date:   Wed, 16 Mar 2022 22:34:45 +0100
From:   Alexandre Chartre <alexandre.chartre@...cle.com>
To:     Junaid Shahid <junaids@...gle.com>, linux-kernel@...r.kernel.org
Cc:     kvm@...r.kernel.org, pbonzini@...hat.com, jmattson@...gle.com,
        pjt@...gle.com, oweisse@...gle.com, rppt@...ux.ibm.com,
        dave.hansen@...ux.intel.com, peterz@...radead.org,
        tglx@...utronix.de, luto@...nel.org, linux-mm@...ck.org,
        alexandre.chartre@...cle.com
Subject: Re: [RFC PATCH 00/47] Address Space Isolation for KVM


Hi Junaid,

On 2/23/22 06:21, Junaid Shahid wrote:
> This patch series is a proof-of-concept RFC for an end-to-end implementation of
> Address Space Isolation for KVM. It has similar goals and a somewhat similar
> high-level design as the original ASI patches from Alexandre Chartre
> ([1],[2],[3],[4]), but with a different underlying implementation. This also
> includes several memory management changes to help with differentiating between
> sensitive and non-sensitive memory and mapping of non-sensitive memory into the
> ASI restricted address spaces.
> 
> This RFC is intended as a demonstration of what a full ASI implementation for
> KVM could look like, not necessarily as a direct proposal for what might
> eventually be merged. In particular, these patches do not yet implement KPTI on
> top of ASI, although the framework is generic enough to be able to support it.
> Similarly, these patches do not include non-sensitive annotations for data
> structures that did not get frequently accessed during execution of our test
> workloads, but the framework is designed such that new non-sensitive memory
> annotations can be added trivially.
> 
> The patches apply on top of Linux v5.16. These patches are also available via
> gerrit at https://linux-review.googlesource.com/q/topic:asi-rfc.
>
Sorry for the late answer, and thanks for investigating possible ASI
implementations. I have to admit I put ASI on the back-burner for
a while because I am more and more wondering if the complexity of
ASI is worth the benefit, especially given challenges to effectively
exploit flaws that ASI is expected to mitigate, in particular when VMs
are running on dedicated cpu cores, or when core scheduling is used.
So I have been looking at a more simplistic approach (see below, A
Possible Alternative to ASI).

But first, your implementation confirms that KVM-ASI can be broken up
into different parts: pagetable management, ASI core and sibling cpus
synchronization.

Pagetable Management
====================
For ASI, we need to build a pagetable with a subset of the kernel
pagetable mappings. Your solution is interesting as it is provides
a broad solution and also works well with dynamic allocations (while
my approach to copy mappings had several limitations). The drawback
is the extend of your changes which spread over all the mm code
(while the simple solution to copy mappings can be done with a few
self-contained independent functions).

ASI Core
========

KPTI
----
Implementing KPTI with ASI is possible but this is not straight forward.
This requires some special handling in particular in the assembly kernel
entry/exit code for syscall, interrupt and exception (see ASI RFC v4 [4]
as an example) because we are also switching privilege level in addition
of switching the pagetable. So this might be something to consider early
in your implementation to ensure it is effectively compatible with KPTI.

Going beyond KPTI (with a KPTI-next) and trying to execute most
syscalls/interrupts without switching to the full kernel address space
is more challenging, because it would require much more kernel mapping
in the user pagetable, and this would basically defeat the purpose of
KPTI. You can refer to discussions about the RFC to defer CR3 switch
to C code [7] which was an attempt to just reach the kernel entry C
code with a KPTI pagetable.

Interrupts/Exceptions
---------------------
As long as interrupts/exceptions are not expected to be processed with
ASI, it is probably better to explicitly exit ASI before processing an
interrupt/exception, otherwise you will have an extra overhead on each
interrupt/exception to take a page fault and then exit ASI.

This is particularily true if you have want to have KPTI use ASI, and
in that case the ASI exit will need to be done early in the interrupt
and exception assembly entry code.

ASI Hooks
---------
ASI hooks are certainly a good idea to perform specific actions on ASI
enter or exit. However, I am not sure they are appropriate places for cpus
stunning with KVM-ASI. That's because cpus stunning doesn't need to be
done precisely when entering and exiting ASI, and it probably shouldn't be
done there: it should be done right before VMEnter and right after VMExit
(see below).

Sibling CPUs Synchronization
============================
KVM-ASI requires the synchronization of sibling CPUs from the same CPU
core so that when a VM is running then sibling CPUs are running with the
ASI associated with this VM (or an ASI compatible with the VM, depending
on how ASI is defined). That way the VM can only spy on data from ASI
and won't be able to access any sensitive data.

So, right before entering a VM, KVM should ensures that sibling CPUs are
using ASI. If a sibling CPU is not using ASI then KVM can either wait for
that sibling to run ASI, or force it to use ASI (or to become idle).
This behavior should be enforced as long as any sibling is running the
VM. When all siblings are not running the VM then other siblings can run
any code (using ASI or not).

I would be interesting to see the code you use to achieve this, because
I don't get how this is achieved from the description of your sibling
hyperthread stun and unstun mechanism.

Note that this synchronization is critical for ASI to work, in particular
when entering the VM, we need to be absolutely sure that sibling CPUs are
effectively using ASI. The core scheduling sibling stunning code you
referenced [6] uses a mechanism which is fine for userspace synchronization
(the delivery of the IPI forces the sibling to immediately enter the kernel)
but this won't work for ASI as the delivery of the IPI won't guarantee that
the sibling as enter ASI yet. I did some experiments that show that data
will leak if siblings are not perfectly synchronized.

A Possible Alternative to ASI?
=============================
ASI prevents access to sensitive data by unmapping them. On the other
hand, the KVM code somewhat already identifies access to sensitive data
as part of the L1TF/MDS mitigation, and when KVM is about to access
sensitive data then it sets l1tf_flush_l1d to true (so that L1D gets
flushed before VMEnter).

With KVM knowing when it accesses sensitive data, I think we can provide
the same mitigation as ASI by simply allowing KVM code which doesn't
access sensitive data to be run concurrently with a VM. This can be done
by tagging the kernel thread when it enters KVM code which doesn't
access sensitive data, and untagging the thread right before it accesses
sensitive data. And when KVM is about to do a VMEnter then we synchronize
siblings CPUs so that they run threads with the same tag. Sounds familar?
Yes, because that's similar to core scheduling but inside the kernel
(let's call it "kernel core scheduling").

I think the benefit of this approach would be that it should be much
simpler to implement and less invasive than ASI, and it doesn't preclude
to later do ASI: ASI can be done in addition and provide an extra level
of mitigation in case some sensitive is still accessed by KVM. Also it
would provide the critical sibling CPU synchronization mechanism that
we also need with ASI.

I did some prototyping to implement this kernel core scheduling a while
ago (and then get diverted on other stuff) but so far performances have
been abyssal especially when doing a strict synchronization between
sibling CPUs. I am planning go back and do more investigations when I
have cycles but probably not that soon.


alex.

[4] https://lore.kernel.org/lkml/20200504144939.11318-1-alexandre.chartre@oracle.com
[6] https://lore.kernel.org/lkml/20200815031908.1015049-1-joel@joelfernandes.org
[7] https://lore.kernel.org/lkml/20201109144425.270789-1-alexandre.chartre@oracle.com


> Background
> ==========
> Address Space Isolation is a comprehensive security mitigation for several types
> of speculative execution attacks. Even though the kernel already has several
> speculative execution vulnerability mitigations, some of them can be quite
> expensive if enabled fully e.g. to fully mitigate L1TF using the existing
> mechanisms requires doing an L1 cache flush on every single VM entry as well as
> disabling hyperthreading altogether. (Although core scheduling can provide some
> protection when hyperthreading is enabled, it is not sufficient by itself to
> protect against all leaks unless sibling hyperthread stunning is also performed
> on every VM exit.) ASI provides a much less expensive mitigation for such
> vulnerabilities while still providing an almost similar level of protection.
> 
> There are a couple of basic insights/assumptions behind ASI:
> 
> 1. Most execution paths in the kernel (especially during virtual machine
> execution) access only memory that is not particularly sensitive even if it were
> to get leaked to the executing process/VM (setting aside for a moment what
> exactly should be considered sensitive or non-sensitive).
> 2. Even when executing speculatively, the CPU can generally only bring memory
> that is mapped in the current page tables into its various caches and internal
> buffers.
> 
> Given these, the idea of using ASI to thwart speculative attacks is that we can
> execute the kernel using a restricted set of page tables most of the time and
> switch to the full unrestricted kernel address space only when the kernel needs
> to access something that is not mapped in the restricted address space. And we
> keep track of when a switch to the full kernel address space is done, so that
> before returning back to the process/VM, we can switch back to the restricted
> address space. In the paths where the kernel is able to execute entirely while
> remaining in the restricted address space, we can skip other mitigations for
> speculative execution attacks (such as L1 cache / micro-arch buffer flushes,
> sibling hyperthread stunning etc.). Only in the cases where we do end up
> switching the page tables, we perform these more expensive mitigations. Assuming
> that happens relatively infrequently, the performance can be significantly
> better compared to performing these mitigations all the time.
> 
> Please note that although we do have a sibling hyperthread stunning
> implementation internally, which is fully integrated with KVM-ASI, it is not
> included in this RFC for the time being. The earlier upstream proposal for
> sibling stunning [6] could potentially be integrated into an upstream ASI
> implementation.
> 
> Basic concepts
> ==============
> Different types of restricted address spaces are represented by different ASI
> classes. For instance, KVM-ASI is an ASI class used during VM execution. KPTI
> would be another ASI class. An ASI instance (struct asi) represents a single
> restricted address space. There is a separate ASI instance for each untrusted
> context (e.g. a userspace process, a VM, or even a single VCPU etc.) Note that
> there can be multiple untrusted security contexts (and thus multiple restricted
> address spaces) within a single process e.g. in the case of VMs, the userspace
> process is a different security context than the guest VM, and in principle,
> even each VCPU could be considered a separate security context (That would be
> primarily useful for securing nested virtualization).
> 
> In this RFC, a process can have at most one ASI instance of each class, though
> this is not an inherent limitation and multiple instances of the same class
> should eventually be supported. (A process can still have ASI instances of
> different classes e.g. KVM-ASI and KPTI.) In fact, in principle, it is not even
> entirely necessary to tie an ASI instance to a process. That is just a
> simplification for the initial implementation.
> 
> An asi_enter operation switches into the restricted address space represented by
> the given ASI instance. An asi_exit operation switches to the full unrestricted
> kernel address space. Each ASI class can provide hooks to be executed during
> these operations, which can be used to perform speculative attack mitigations
> relevant to that class. For instance, the KVM-ASI hooks would perform a
> sibling-hyperthread-stun operation in the asi_exit hook, and L1-flush/MDS-clear
> and sibling-hyperthread-unstun operations in the asi_enter hook. On the other
> hand, the hooks for the KPTI class would be NO-OP, since the switching of the
> page tables is enough mitigation in that case.
> 
> If the kernel attempts to access memory that is not mapped in the currently
> active ASI instance, the page fault handler automatically performs an asi_exit
> operation. This means that except for a few critical pieces of memory, leaving
> something out of an unrestricted address space will result in only a performance
> hit, rather than a catastrophic failure. The kernel can also perform explicit
> asi_exit operations in some paths as needed.
> 
> Apart from the page fault handler, other exceptions and interrupts (even NMIs)
> do not automatically cause an asi_exit and could potentially be executed
> completely within a restricted address space if they don't end up accessing any
> sensitive piece of memory.
> 
> The mappings within a restricted address space are always a subset of the full
> kernel address space and each mapping is always the same as the corresponding
> mapping in the full kernel address space. This is necessary because we could
> potentially end up performing an asi_exit at any point.
> 
> Although this RFC only includes an implementation of the KVM-ASI class, a KPTI
> class could also be implemented on top of the same infrastructure. Furthermore,
> in the future we could also implement a KPTI-Next class that actually uses the
> ASI model for userspace processes i.e. mapping non-sensitive kernel memory in
> the restricted address space and trying to execute most syscalls/interrupts
> without switching to the full kernel address space, as opposed to the current
> KPTI which requires an address space switch on every kernel/user mode
> transition.
> 
> Memory classification
> =====================
> We divide memory into three categories.
> 
> 1. Sensitive memory
> This is memory that should never get leaked to any process or VM. Sensitive
> memory is only mapped in the unrestricted kernel page tables. By default, all
> memory is considered sensitive unless specifically categorized otherwise.
> 
> 2. Globally non-sensitive memory
> This is memory that does not present a substantial security threat even if it
> were to get leaked to any process or VM in the system. Globally non-sensitive
> memory is mapped in the restricted address spaces for all processes.
> 
> 3. Locally non-sensitive memory
> This is memory that does not present a substantial security threat if it were to
> get leaked to the currently running process or VM, but would present a security
> issue if it were to get leaked to any other process or VM in the system.
> Examples include userspace memory (or guest memory in the case of VMs) or kernel
> structures containing userspace/guest register context etc. Locally
> non-sensitive memory is mapped only in the restricted address space of a single
> process.
> 
> Various mechanisms are provided to annotate different types of memory (static,
> buddy allocator, slab, vmalloc etc.) as globally or locally non-sensitive. In
> addition, the ASI infrastructure takes care to ensure that different classes of
> memory do not share the same physical page. This includes separation of
> sensitive, globally non-sensitive and locally non-sensitive memory into
> different pages and also separation of locally non-sensitive memory for
> different processes into different pages as well.
> 
> What exactly should be considered non-sensitive (either globally or locally) is
> somewhat open-ended. Some things are clearly sensitive or non-sensitive, but
> many things also fall into a gray area, depending on how paranoid one wants to
> be. For this proof of concept, we have generally treated such things as
> non-sensitive, though that may not necessarily be the ideal classification in
> each case. Similarly, there is also a gray area between globally and locally
> non-sensitive classifications in some cases, and in those cases this RFC has
> mostly erred on the side of marking them as locally non-sensitive, even though
> many of those cases could likely be safely classified as globally non-sensitive.
> 
> Although this implementation includes fairly extensive support for marking most
> types of dynamically allocated memory as locally non-sensitive, it is possibly
> feasible, at least for KVM-ASI, to get away with a simpler implementation (such
> as [5]), if we are very selective about what memory we treat as locally
> non-sensitive (as opposed to globally non-sensitive). Nevertheless, the more
> general mechanism is included in this proof of concept as an illustration for
> what could be done if we really needed to treat any arbitrary kernel memory as
> locally non-sensitive.
> 
> It is also possible to have ASI classes that do not utilize the above described
> infrastructure and instead manage all the memory mappings inside the restricted
> address space on their own.
> 
> 
> References
> ==========
> [1] https://lore.kernel.org/lkml/1557758315-12667-1-git-send-email-alexandre.chartre@oracle.com
> [2] https://lore.kernel.org/lkml/1562855138-19507-1-git-send-email-alexandre.chartre@oracle.com
> [3] https://lore.kernel.org/lkml/1582734120-26757-1-git-send-email-alexandre.chartre@oracle.com
> [4] https://lore.kernel.org/lkml/20200504144939.11318-1-alexandre.chartre@oracle.com
> [5] https://lore.kernel.org/lkml/20190612170834.14855-1-mhillenb@amazon.de
> [6] https://lore.kernel.org/lkml/20200815031908.1015049-1-joel@joelfernandes.org
>