linux-kernel - Re: [RFC PATCH v3 00/16] Core scheduling v3

Message-ID: <20190829143821.GX2369@hirez.programming.kicks-ass.net>
Date:   Thu, 29 Aug 2019 16:38:21 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     Phil Auld <pauld@...hat.com>
Cc:     Matthew Garrett <mjg59@...f.ucam.org>,
        Vineeth Remanan Pillai <vpillai@...italocean.com>,
        Nishanth Aravamudan <naravamudan@...italocean.com>,
        Julien Desfossez <jdesfossez@...italocean.com>,
        Tim Chen <tim.c.chen@...ux.intel.com>, mingo@...nel.org,
        tglx@...utronix.de, pjt@...gle.com, torvalds@...ux-foundation.org,
        linux-kernel@...r.kernel.org, subhra.mazumdar@...cle.com,
        fweisbec@...il.com, keescook@...omium.org, kerrnel@...gle.com,
        Aaron Lu <aaron.lwe@...il.com>,
        Aubrey Li <aubrey.intel@...il.com>,
        Valentin Schneider <valentin.schneider@....com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Pawan Gupta <pawan.kumar.gupta@...ux.intel.com>,
        Paolo Bonzini <pbonzini@...hat.com>
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Thu, Aug 29, 2019 at 10:30:51AM -0400, Phil Auld wrote:
> On Wed, Aug 28, 2019 at 06:01:14PM +0200 Peter Zijlstra wrote:
> > On Wed, Aug 28, 2019 at 11:30:34AM -0400, Phil Auld wrote:
> > > On Tue, Aug 27, 2019 at 11:50:35PM +0200 Peter Zijlstra wrote:
> > 
> > > > And given MDS, I'm still not entirely convinced it all makes sense. If
> > > > it were just L1TF, then yes, but now...
> > > 
> > > I was thinking MDS is really the reason for this. L1TF has mitigations but
> > > the only current mitigation for MDS for smt is ... nosmt. 
> > 
> > L1TF has no known mitigation that is SMT safe. The moment you have
> > something in your L1, the other sibling can read it using L1TF.
> > 
> > The nice thing about L1TF is that only (malicious) guests can exploit
> > it, and therefore the synchronization context is VMM. And it so happens
> > that VMEXITs are 'rare' (and already expensive and thus lots of effort
> > has already gone into avoiding them).
> > 
> > If you don't use VMs, you're good and SMT is not a problem.
> > 
> > If you do use VMs (and do/can not trust them), _then_ you need
> > core-scheduling; and in that case, the implementation under discussion
> > misses things like synchronization on VMEXITs due to interrupts and
> > things like that.
> > 
> > But under the assumption that VMs don't generate high scheduling rates,
> > it can work.
> > 
> > > The current core scheduler implementation, I believe, still has (theoretical?) 
> > > holes involving interrupts, once/if those are closed it may be even less 
> > > attractive.
> > 
> > No; so MDS leaks anything the other sibling (currently) does, this makes
> > _any_ privilege boundary a synchronization context.
> > 
> > Worse still, the exploit doesn't require a VM at all, any other task can
> > get to it.
> > 
> > That means you get to sync the siblings on lovely things like system
> > call entry and exit, along with VMM and anything else that one would
> > consider a privilege boundary. Now, system calls are not rare, they
> > are really quite common in fact. Trying to sync up siblings at the rate
> > of system calls is utter madness.
> > 
> > So under MDS, SMT is completely hosed. If you use VMs exclusively, then
> > it _might_ work because a 'pure' host doesn't schedule that often
> > (maybe, same assumption as for L1TF).
> > 
> > Now, there have been proposals of moving the privilege boundary further
> > into the kernel. Just like PTI exposes the entry stack and code to
> > Meltdown, the thinking is, let's expose more. By moving the priv boundary
> > the hope is that we can do lots of common system calls without having to
> > sync up -- lots of details are 'pending'.
> 
> 
> Thanks for clarifying. My understanding is (somewhat) less fuzzy now. :)
> 
> I think, though, that you were basically agreeing with me that the current 
> core scheduler does not close the holes, or am I reading that wrong?

Agreed; the missing bits for L1TF are ugly but doable (I've actually
done them before, Tim has that _somewhere_), but I've not seen a
'workable' solution for MDS yet.
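
For illustration only (this is not from the thread): a back-of-the-envelope
sketch of the rate argument quoted above, i.e. why syncing siblings on
VMEXITs might be tolerable while syncing on every system call is not. The
function name, the per-sync cost and both rates are hypothetical assumptions
chosen to show the shape of the problem, not measurements.

```python
def sync_overhead(crossings_per_sec: float, sync_cost_us: float) -> float:
    """Fraction of each second a sibling spends in boundary synchronization.

    Capped at 1.0, meaning the sibling does nothing but synchronize.
    """
    return min(1.0, crossings_per_sec * sync_cost_us / 1_000_000)


# All three numbers below are assumptions for illustration, not measurements.
SYNC_COST_US = 2.0       # assumed cost of one sibling rendezvous, in microseconds
VMEXIT_RATE = 5_000      # assumed 'rare' VMEXITs per second (L1TF: the VMM is the only boundary)
SYSCALL_RATE = 400_000   # assumed syscall-heavy task (MDS: every privilege boundary counts)

vmexit_overhead = sync_overhead(VMEXIT_RATE, SYNC_COST_US)
syscall_overhead = sync_overhead(SYSCALL_RATE, SYNC_COST_US)

print(f"sync on VMEXIT only:    {vmexit_overhead:.1%} of sibling time")
print(f"sync on every syscall:  {syscall_overhead:.1%} of sibling time")
```

Under these assumed numbers the overhead scales linearly with the rate of
boundary crossings, which is the point being made: the cost depends on
how often a privilege boundary is crossed, not on the cost of any single
sync.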