linux-kernel - Re: [PATCH] x86/tsx: fix KVM guest live migration for tsx=on

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <28C45B75-7FE3-4C79-9A29-F929AF9BC5A8@nutanix.com>
Date:   Tue, 12 Apr 2022 16:08:32 +0000
From:   Jon Kohler <jon@...anix.com>
To:     Dave Hansen <dave.hansen@...el.com>
CC:     Jon Kohler <jon@...anix.com>, Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        "x86@...nel.org" <x86@...nel.org>,
        "H. Peter Anvin" <hpa@...or.com>, Tony Luck <tony.luck@...el.com>,
        Andi Kleen <ak@...ux.intel.com>,
        Pawan Gupta <pawan.kumar.gupta@...ux.intel.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Borislav Petkov <bp@...e.de>,
        Neelima Krishnan <neelima.krishnan@...el.com>,
        "kvm @ vger . kernel . org" <kvm@...r.kernel.org>
Subject: Re: [PATCH] x86/tsx: fix KVM guest live migration for tsx=on

> On Apr 12, 2022, at 11:54 AM, Dave Hansen <dave.hansen@...el.com> wrote:
> 
> On 4/12/22 06:36, Jon Kohler wrote:
>> So my theory here is to extend the logical effort of the microcode driven
>> automatic disablement as well as the tsx=auto automatic disablement and
>> have tsx=on force abort all transactions on X86_BUG_TAA SKUs, but leave
>> the CPU features enumerated to maintain live migration.
>> 
>> This would still leave TSX totally good on Ice Lake / non-buggy systems.
>> 
>> If it would help, I'm working up an RFC patch, and we could discuss there?
> 
> Sure.  But, it sounds like you really want a new tdx=something rather
> than to muck with tsx=on behavior.  Surely someone else will come along
> and complain that we broke their TDX setup if we change its behavior.

Good point, there will always be a squeaky wheel. I’ll work that into the RFC,
I’ll do something like tsx=compat and see how it shapes up. 

To be fair though, this commit I’m patching with this series would break
setups as they apply 5.14+ and the microcode update, but you have a 
good point for certain.

> 
> Maybe you should just pay the one-time cost and move your whole fleet
> over to tsx=off if you truly believe nobody is using it.
> 

Trust me, I’d love to do that; however:
We’ve thousands of hosts across thousands of unique customers,
which aren't managed as a centralized service (customers manage them directly),
so doing that would require each individual customer to organize a full power
cycle for all of their VMs prior to an upgrade to tsx=off hosts.

That said, we are marching in that direction, we're shipping a control plane
update that will mask HLE and RTM after power cycles, but that requires
customers to apply that control plane update, then power cycle everything. Just
means that we've begun the feature deprecation now, it will take years to fully
bleed off without having customers to micro manage full power cycles.