linux-kernel - Re: [RFC PATCH 00/47] Address Space Isolation for KVM

Message-ID: <0813c9da-f91d-317e-2eda-f2ed0b95385f@oracle.com>
Date:   Fri, 8 Apr 2022 10:52:12 +0200
From:   Alexandre Chartre <alexandre.chartre@...cle.com>
To:     Junaid Shahid <junaids@...gle.com>, linux-kernel@...r.kernel.org
Cc:     kvm@...r.kernel.org, pbonzini@...hat.com, jmattson@...gle.com,
        pjt@...gle.com, oweisse@...gle.com, rppt@...ux.ibm.com,
        dave.hansen@...ux.intel.com, peterz@...radead.org,
        tglx@...utronix.de, luto@...nel.org, linux-mm@...ck.org
Subject: Re: [RFC PATCH 00/47] Address Space Isolation for KVM


On 3/23/22 20:35, Junaid Shahid wrote:
> On 3/22/22 02:46, Alexandre Chartre wrote:
>> 
>> On 3/18/22 00:25, Junaid Shahid wrote:
>>> 
>>> I agree that it is not secure to run one sibling in the
>>> unrestricted kernel address space while the other sibling is
>>> running in an ASI restricted address space, without doing a cache
>>> flush before re-entering the VM. However, I think that avoiding
>>> this situation does not require doing a sibling stun operation
>>> immediately after VM Exit. The way we avoid it is as follows.
>>> 
>>> First, we always use ASI in conjunction with core scheduling.
>>> This means that if HT0 is running a VCPU thread, then HT1 will be
>>> running either a VCPU thread of the same VM or the Idle thread.
>>> If it is running a VCPU thread, then if/when that thread takes a
>>> VM Exit, it will also be running in the same ASI restricted
>>> address space. For the idle thread, we have created another ASI
>>> Class, called Idle-ASI, which maps only globally non-sensitive
>>> kernel memory. The idle loop enters this ASI address space.
>>> 
>>> This means that when HT0 does a VM Exit, HT1 will either be
>>> running the guest code of a VCPU of the same VM, or it will be
>>> running kernel code in either a KVM-ASI or the Idle-ASI address
>>> space. (If HT1 is already running in the full kernel address
>>> space, that would imply that it had previously done an ASI Exit,
>>> which would have triggered a stun_sibling, which would have
>>> already caused HT0 to exit the VM and wait in the kernel).
>> 
>> Note that using core scheduling (or not) is a detail, what is
>> important is whether HT are running with ASI or not. Running core
>> scheduling will just improve chances to have all siblings run ASI
>> at the same time and so improve ASI performances.
>> 
>> 
>>> If HT1 now does an ASI Exit, that will trigger the
>>> stun_sibling() operation in its pre_asi_exit() handler, which
>>> will set the state of the core/HT0 to Stunned (and possibly send
>>> an IPI too, though that will be ignored if HT0 was already in
>>> kernel mode). Now when HT0 tries to re-enter the VM, since its
>>> state is set to Stunned, it will just wait in a loop until HT1
>>> does an unstun_sibling() operation, which it will do in its
>>> post_asi_enter handler the next time it does an ASI Enter (which
>>> would be either just before VM Enter if it was KVM-ASI, or in the
>>> next iteration of the idle loop if it was Idle-ASI). In either
>>> case, HT1's post_asi_enter() handler would also do a
>>> flush_sensitive_cpu_state operation before the unstun_sibling(), 
>>> so when HT0 gets out of its wait-loop and does a VM Enter, there
>>> will not be any sensitive state left.
>>> 
>>> One thing that probably was not clear from the patch, is that
>>> the stun state check and wait-loop is still always executed
>>> before VM Enter, even if no ASI Exit happened in that execution.
>>> 
>> 
>> So if I understand correctly, you have following sequence:
>> 
>> 0 - Initially state is set to "stunned" for all cpus (i.e. a cpu
>> should wait before VMEnter)
>> 
>> 1 - After ASI Enter: Set sibling state to "unstunned" (i.e. sibling
>> can do VMEnter)
>> 
>> 2 - Before VMEnter : wait while my state is "stunned"
>> 
>> 3 - Before ASI Exit : Set sibling state to "stunned" (i.e. sibling
>> should wait before VMEnter)
>> 
>> I have tried this kind of implementation, and the problem is with
>> step 2 (wait while my state is "stunned"); how do you wait exactly?
>> You can't just do an active wait otherwise you have all kind of
>> problems (depending if you have interrupts enabled or not)
>> especially as you don't know how long you have to wait for (this
>> depends on what the other cpu is doing).
> 
> In our stunning implementation, we do an active wait with interrupts 
> enabled and with a need_resched() check to decide when to bail out
> to the scheduler (plus we also make sure that we re-enter ASI at the
> end of the wait in case some interrupt exited ASI). What kind of
> problems have you run into with an active wait, besides wasted CPU
> cycles?

If you wait with interrupts enabled, then there is a window after the
wait and before interrupts get disabled where a cpu can get an interrupt
and exit ASI while the sibling is entering the VM. Also, after a cpu has
passed the wait and has disabled interrupts, it can't be notified if the
sibling has exited ASI:

T+01 - cpu A and B enter ASI - interrupts are enabled
T+02 - cpu A and B pass the wait because both are using ASI - interrupts are enabled
T+03 - cpu A gets an interrupt
T+04 - cpu B disables interrupts
T+05 - cpu A exits ASI and processes the interrupt
T+06 - cpu B enters VM  => cpu B runs VM while cpu A is not using ASI
T+07 - cpu B exits VM
T+08 - cpu B exits ASI
T+09 - cpu A returns from interrupt
T+10 - cpu A disables interrupts and enters VM => cpu A runs VM while cpu B is not using ASI
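
To make the window concrete, here is a minimal Python sketch that replays
the timeline above on a toy model of the stun state. It is not code from
the patch series: the Cpu and Core classes and all method names are
hypothetical, interrupts are not modelled (T+03, T+04 and T+09 change no
state here), and the only rule checked is "a cpu must not enter the VM
while its sibling is outside ASI".

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    name: str
    in_asi: bool = False
    in_vm: bool = False
    stunned: bool = True        # step 0: every cpu starts stunned
    passed_wait: bool = False   # set once the pre-VMEnter wait succeeded


class Core:
    """Toy model of two hyperthreads sharing one core (hypothetical names)."""

    def __init__(self):
        self.cpus = {"A": Cpu("A"), "B": Cpu("B")}
        self.violations = []

    def sibling(self, name):
        return self.cpus["B" if name == "A" else "A"]

    def asi_enter(self, name):
        self.cpus[name].in_asi = True
        self.sibling(name).stunned = False   # step 1: unstun the sibling

    def asi_exit(self, name):
        self.sibling(name).stunned = True    # step 3: stun the sibling
        self.cpus[name].in_asi = False

    def wait_for_unstun(self, name):
        # step 2: the wait succeeds only while this cpu is not stunned
        cpu = self.cpus[name]
        cpu.passed_wait = not cpu.stunned
        return cpu.passed_wait

    def vm_enter(self, name, when):
        cpu = self.cpus[name]
        if not cpu.passed_wait:
            raise RuntimeError(f"cpu {name} has not passed the wait")
        cpu.passed_wait = False
        cpu.in_vm = True
        # The stun state is NOT re-checked here: this is the race window.
        if not self.sibling(name).in_asi:
            self.violations.append((when, name))

    def vm_exit(self, name):
        self.cpus[name].in_vm = False


def replay_timeline():
    core = Core()
    core.asi_enter("A")                      # T+01
    core.asi_enter("B")
    core.wait_for_unstun("A")                # T+02: both pass the wait
    core.wait_for_unstun("B")
    core.asi_exit("A")                       # T+05: B is stunned too late
    core.vm_enter("B", "T+06")               # sibling A is outside ASI
    core.vm_exit("B")                        # T+07
    core.asi_exit("B")                       # T+08
    core.vm_enter("A", "T+10")               # sibling B is outside ASI
    return core.violations


if __name__ == "__main__":
    for when, name in replay_timeline():
        print(f"{when}: cpu {name} entered the VM with its sibling outside ASI")
```

In this model both VM entries are flagged, because the stun flag set at
T+05 and T+08 is only consulted during the wait, which both cpus had
already passed at T+02.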


> In any case, the specific stunning mechanism is orthogonal to ASI.
> This implementation of ASI can be integrated with different stunning
> implementations. The "kernel core scheduling" that you proposed is
> also an alternative to stunning and could be similarly integrated
> with ASI.

Yes, but for ASI to be useful with KVM in preventing data leaks, you need
a fully functional and reliable stunning mechanism, otherwise ASI is
useless. That's why I think it is better to first focus on having an
effective stunning mechanism and then implement ASI.


alex.