Message-ID: <79529592-5d60-2a41-fbb6-4a5f8279f998@amazon.com>
Date: Fri, 17 Apr 2020 14:35:38 +0200
From: Alexander Graf <graf@...zon.com>
To: Peter Zijlstra <peterz@...radead.org>,
Joel Fernandes <joel@...lfernandes.org>
CC: vpillai <vpillai@...italocean.com>,
Nishanth Aravamudan <naravamudan@...italocean.com>,
Julien Desfossez <jdesfossez@...italocean.com>,
Tim Chen <tim.c.chen@...ux.intel.com>, <mingo@...nel.org>,
<tglx@...utronix.de>, <pjt@...gle.com>,
<torvalds@...ux-foundation.org>, <linux-kernel@...r.kernel.org>,
<fweisbec@...il.com>, <keescook@...omium.org>,
<kerrnel@...gle.com>, "Phil Auld" <pauld@...hat.com>,
Aaron Lu <aaron.lwe@...il.com>,
Aubrey Li <aubrey.intel@...il.com>,
<aubrey.li@...ux.intel.com>,
Valentin Schneider <valentin.schneider@....com>,
Mel Gorman <mgorman@...hsingularity.net>,
"Pawan Gupta" <pawan.kumar.gupta@...ux.intel.com>,
Paolo Bonzini <pbonzini@...hat.com>
Subject: Re: [RFC PATCH 00/13] Core scheduling v5
On 17.04.20 13:12, Peter Zijlstra wrote:
> On Wed, Apr 15, 2020 at 12:32:20PM -0400, Joel Fernandes wrote:
>> On Tue, Apr 14, 2020 at 04:21:52PM +0200, Peter Zijlstra wrote:
>>> On Wed, Mar 04, 2020 at 04:59:50PM +0000, vpillai wrote:
>>>> TODO
>>>> ----
>>>> - Work on merging patches that are ready to be merged
>>>> - Decide on the API for exposing the feature to userland
>>>> - Experiment with adding synchronization points in VMEXIT to mitigate
>>>> the VM-to-host-kernel leaking
>>>
>>> VMEXIT is too late, you need to hook irq_enter(), which is what makes
>>> the whole thing so horrible.
>>
>> We came up with a patch to do this as well. Currently testing it more and it
>> looks clean, will share it soon.
>
> Thomas said we actually first do VMEXIT, and then enable interrupts. So
> the VMEXIT thing should actually work, and that is indeed much saner
> than sticking it in irq_enter().
If we first kick out the sibling HT for every #VMEXIT, performance will
be abysmal, no?
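
To make the cost concrete, a naive version would have to stall the
sibling around every single exit, something like the sketch below. The
helper names are made up for illustration, this is not code from the
series:

static void handle_vmexit_synced(struct kvm_vcpu *vcpu)
{
        /*
         * IPI the sibling and spin until it has left guest/user mode.
         * Doing this on every #VMEXIT, including the high-frequency
         * ones (MSR accesses, APIC, ...), is the "big hammer".
         */
        core_pause_sibling();

        /* ... regular exit handling, now hidden from the sibling ... */

        core_resume_sibling();
}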
I know of a few options to make this work without the big hammer:
1) Leave interrupts disabled on "fast-path" exits. This becomes very
hard to reason about very quickly.
2) Patch the IRQ handlers (or build something more generic that
installs a trampoline on all IRQ handler installations)
3) Ignore IRQ data exposure (what could possibly go wrong, it's not
like your IRQ handler reads secret data from the network, right)
4) Create a "safe" page table which runs with HT enabled. Any access
outside of the "safe" zone disables the sibling and switches to the
"full" kernel page table. This should prevent any secret data from
being fetched into caches/core buffers.
5) Create a KVM specific "safe zone": Keep improving the ASI patches
and make only the ASI environment safe for HT, everything else not.
Has there been any progress on 4? It sounded like the most generic
option ...
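
For completeness, the way I read 4), the fault path would look roughly
like below. All names are made up, this is only a sketch of the idea:

static void safe_zone_fault(struct pt_regs *regs, unsigned long addr)
{
        if (fault_in_safe_zone(addr))
                return;         /* handled within the "safe" mapping */

        /*
         * Access outside the safe zone: kick the sibling first, then
         * switch to the full kernel page table and retry, so nothing
         * secret is ever fetched while the sibling runs untrusted code.
         */
        core_kick_sibling();
        load_full_kernel_cr3();
}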
>
> It does however put yet more nails in the out-of-tree hypervisors.
>
>>>> - Investigate the source of the overhead even when no tasks are tagged:
>>>> https://lkml.org/lkml/2019/10/29/242
>>>
>>> - explain why we're all still doing this ....
>>>
>>> Seriously, what actual problems does it solve? The patch-set still isn't
>>> L1TF complete and afaict it does exactly nothing for MDS.
>>
>> The L1TF incompleteness is because of cross-HT attack from Guest vCPU
>> attacker to an interrupt/softirq executing on the other sibling correct? The
>> IRQ enter pausing the other sibling should fix that (which we will share in
>> a future series revision after adequate testing).
>
> Correct, the vCPU still running can glean host (kernel) state from the
> sibling handling the interrupt in the host kernel.
>
>>> Like I've written many times now, back when the world was simpler and
>>> all we had to worry about was L1TF, core-scheduling made some sense, but
>>> how does it make sense today?
>>
>> For ChromeOS we're planning to tag each and every task separately except for
>> trusted processes, so we are isolating untrusted tasks even from each other.
>>
>> Sorry if this sounds like pushing my use case, but we do get parallelism
>> advantage for the trusted tasks while still solving all security issues (for
>> ChromeOS). I agree that cross-HT user <-> kernel MDS is still an issue if
>> untrusted (tagged) tasks execute together on the same core, but we are not
>> planning to do that on our setup at least.
>
> That doesn't completely solve things I think. Even if you run all
> untrusted tasks as core exclusive, you still have a problem of them vs
> interrupts on the other sibling.
>
> You need to somehow arrange all interrupts to the core happen on the
> same sibling that runs your untrusted task, such that the VERW on
> return-to-userspace works as intended.
>
> I suppose you can try and play funny games with interrupt routing tied
> to the force-idle state, but I'm dreading what that'll look like. Or
> were you going to handle this from your irq_enter() thing too?
I'm not sure I follow. We have thread-local interrupts (timers, IPIs)
and device interrupts (network, block, etc.).
Thread-local ones shouldn't transfer too much knowledge, so I'd be
inclined to say we can just ignore that attack vector.
Device interrupts we can easily route to HT0. If we now make "core
exclusive" a synonym for "always run on HT0", we can guarantee that they
always land on the same CPU, no?
Then you don't need to hook into any idle state tracking, because you
always know which CPU is the "safe" one to schedule tasks on and route
interrupts to.
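
Routing itself is not the hard part, the affinity API is already
there. A minimal sketch of the policy, with the "HT0" convention being
the only assumption here:

#include <linux/interrupt.h>
#include <linux/cpumask.h>
#include <linux/topology.h>

/* Pin a device IRQ to the first HT sibling of the core that should
 * handle it, so device interrupts always land on HT0 of that core. */
static int route_irq_to_ht0(unsigned int irq, int cpu)
{
        int ht0 = cpumask_first(topology_sibling_cpumask(cpu));

        return irq_set_affinity(irq, cpumask_of(ht0));
}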
Alex
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879