linux-kernel - Re: [PATCH v3] x86/speculation, KVM: only IBPB for switch_mm_always

Message-ID: <Ym0GcKhPZxkcMCYp@zn.tnic>
Date:   Sat, 30 Apr 2022 11:50:40 +0200
From:   Borislav Petkov <bp@...en8.de>
To:     Sean Christopherson <seanjc@...gle.com>
Cc:     Jon Kohler <jon@...anix.com>, Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        "x86@...nel.org" <x86@...nel.org>,
        "H. Peter Anvin" <hpa@...or.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Wanpeng Li <wanpengli@...cent.com>,
        Jim Mattson <jmattson@...gle.com>,
        Joerg Roedel <joro@...tes.org>,
        Josh Poimboeuf <jpoimboe@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Balbir Singh <sblbir@...zon.com>,
        Kim Phillips <kim.phillips@....com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Kees Cook <keescook@...omium.org>,
        Waiman Long <longman@...hat.com>
Subject: Re: [PATCH v3] x86/speculation, KVM: only IBPB for
 switch_mm_always_ibpb on vCPU load

On Fri, Apr 29, 2022 at 11:23:32PM +0000, Sean Christopherson wrote:
> The host kernel is protected via RETPOLINE and by flushing the RSB immediately
> after VM-Exit.

Ah, right.

> I don't know definitively.  My guess is that IBPB is far too costly to do on every
> exit, and so the onus was put on userspace to recompile with RETPOLINE.  What I
> don't know is why it wasn't implemented as an opt-out feature.

Or, we could add the same logic on the exit path as in cond_mitigation()
and test for LAST_USER_MM_IBPB when the host has selected
switch_mm_cond_ibpb and thus allows for certain guests to be
protected...
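
A rough userspace sketch of the check being referred to, as one possible
reading of the idea. It mirrors the conditional test in cond_mitigation():
a barrier is issued only when the two contexts differ and at least one of
them has asked for IBPB protection. All sketch_* names and SKETCH_MM_IBPB
are hypothetical stand-ins, the counter replaces the real barrier, and how
the "previous" and "next" contexts would map onto a VM-Exit is left open
here, as it is in the discussion.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bit, modelled on the kernel's LAST_USER_MM_IBPB. */
#define SKETCH_MM_IBPB 0x1UL

/* Counts barriers issued; stands in for the real IBPB write. */
static unsigned int sketch_ibpb_count;

static void sketch_ibpb(void)
{
	sketch_ibpb_count++;
}

/*
 * Pack a context identity and its "wants IBPB" bit into one word.
 * Assumes the identity is aligned so that bit 0 is free.
 */
static uintptr_t sketch_mm_spec(uintptr_t mm, bool wants_ibpb)
{
	return mm | (wants_ibpb ? SKETCH_MM_IBPB : 0);
}

/*
 * Conditional flush: only when the contexts differ and either side
 * carries the IBPB bit. Returns true if a barrier was issued.
 */
static bool sketch_cond_ibpb(uintptr_t prev_spec, uintptr_t next_spec)
{
	if (next_spec != prev_spec &&
	    ((next_spec | prev_spec) & SKETCH_MM_IBPB)) {
		sketch_ibpb();
		return true;
	}
	return false;
}
```

With this shape, a switch between two contexts that both left the bit
clear costs nothing, which is the point of the conditional mode.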

Although that use case sounds kinda meh: AFAIU, the scenario here
would be protecting the guest from a malicious kernel. I guess this
might matter for encrypted guests tho.

> I'll write up the bits I have my head wrapped around.

That's nice, thanks!

> I don't know of any actual examples.  But, it's trivially easy to create multiple
> VMs in a single process, and so proving the negative that no one runs multiple VMs
> in a single address space is essentially impossible.
> 
> The container thing is just one scenario I can think of where userspace might
> actually benefit from sharing an address space, e.g. it would allow backing the
> image for large number of VMs with a single set of read-only VMAs.

Why I keep harping about this: so let's say we eventually add something
and then months, years from now we cannot find out anymore why that
thing was added. We will likely remove it or start wasting time figuring
out why that thing was added in the first place.

This very question keeps popping up almost on a daily basis during
refactoring so I'd like for us to be better at documenting *why* we're
picking a certain solution or function or whatever.

And this is doubly important when it comes to the hw mitigations because
if you look at the problem space and all the possible ifs and whens and
but(t)s, your head will spin in no time.

So I'm really sceptical when there's not even a single actual use case
for a proposed change.

So Jon said something about oversubscription and a lot of vCPU
switching. That should be there in the docs as the use case,
explaining why IBPB during vCPU switches is redundant and can be
dropped.

> I truly have no idea, which is part of the reason I brought it up in the first
> place.  I'd have happily just whacked KVM's IBPB entirely, but it seemed prudent
> to preserve the existing behavior if someone went out of their way to enable
> switch_mm_always_ibpb.

So let me try to understand this use case: you have a guest and a bunch
of vCPUs which belong to it. And that guest gets switched between those
vCPUs and KVM does IBPB flushes between those vCPUs.

So either I'm missing something - which is possible - or that
"protection" doesn't make any sense - it is all within the same guest!
In that case the existing behavior was silly to begin with, so we might
just as well kill it.

> Yes, or do it iff switch_mm_always_ibpb is enabled to maintain "compatibility".

Yap, and I'm questioning even the smallest shred of reasoning for having
that IBPB flush there *at* *all*.

And here's the thing with documenting all that: we will document and
say, IBPB flushes between vCPUs are nonsensical. Then, when someone comes
later and says, "but but, it makes sense because of X" and we hadn't
thought about X at the time, we will change it and document it again and
this way you'll have everything explicit there, how we arrived at the
current situation and be able to always go, "ah, ok, that's why we did
it."

I hope I'm making some sense...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette