lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZW6FRBnOwYV-UCkY@google.com>
Date:   Mon, 4 Dec 2023 18:04:52 -0800
From:   Sean Christopherson <seanjc@...gle.com>
To:     Dave Hansen <dave.hansen@...el.com>
Cc:     Kai Huang <kai.huang@...el.com>,
        "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "rafael@...nel.org" <rafael@...nel.org>,
        Chao Gao <chao.gao@...el.com>, Tony Luck <tony.luck@...el.com>,
        "david@...hat.com" <david@...hat.com>,
        "bagasdotme@...il.com" <bagasdotme@...il.com>,
        "ak@...ux.intel.com" <ak@...ux.intel.com>,
        "kirill.shutemov@...ux.intel.com" <kirill.shutemov@...ux.intel.com>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "pbonzini@...hat.com" <pbonzini@...hat.com>,
        "tglx@...utronix.de" <tglx@...utronix.de>,
        Isaku Yamahata <isaku.yamahata@...el.com>,
        "nik.borisov@...e.com" <nik.borisov@...e.com>,
        "hpa@...or.com" <hpa@...or.com>,
        "sagis@...gle.com" <sagis@...gle.com>,
        "imammedo@...hat.com" <imammedo@...hat.com>,
        "peterz@...radead.org" <peterz@...radead.org>,
        "bp@...en8.de" <bp@...en8.de>, Len Brown <len.brown@...el.com>,
        "sathyanarayanan.kuppuswamy@...ux.intel.com" 
        <sathyanarayanan.kuppuswamy@...ux.intel.com>,
        Ying Huang <ying.huang@...el.com>,
        Dan J Williams <dan.j.williams@...el.com>,
        "x86@...nel.org" <x86@...nel.org>
Subject: Re: [PATCH v15 22/23] x86/mce: Improve error log of kernel space TDX
 #MC due to erratum

On Mon, Dec 04, 2023, Dave Hansen wrote:
> On 12/4/23 15:24, Huang, Kai wrote:
> > On Mon, 2023-12-04 at 14:04 -0800, Hansen, Dave wrote:
> ...
> > In ancient time KVM used to immediately enable VMX when it is loaded, but later
> > it was changed to only enable VMX when there's active VM because of the above
> > reason.
> > 
> > See commit 10474ae8945ce ("KVM: Activate Virtualization On Demand").

Huh, I always just assumed it was some backwards thinking about enabling VMX/SVM
being "dangerous" or something.

> Fine.  This doesn't need to change ... until you load TDX.  Once you
> initialize the TDX module, no more out-of-tree VMMs for you.

It's not just out-of-tree hypervisors, which IMO should be little more than an
afterthought.  The other more important issue is that being post-VMXON blocks INIT,
i.e. VMX needs to be disabled before reboot, suspend, etc.  Forcing kvm_usage_count
would work, but it would essentially turn "graceful" reboots, i.e. reboots where
the host isn't running VMs and thus VMX is already disabled.  Having VMX be enabled
so long as KVM is loaded would turn all reboots into the "oh crap, the system is
rebooting, quick do VMXOFF!" variety.

> That doesn't seem too insane.  This is yet *ANOTHER* reason that doing
> dynamic TDX module initialization is a good idea.
> 
> >> It's not wrong to say that TDX is a KVM user.  If KVm wants
> >> 'kvm_usage_count' to go back to 0, it can shut down the TDX module.  Then
> >> there's no PAMT to worry about.
> >>
> >> The shutdown would be something like:
> >>
> >>       1. TDX module shutdown
> >>       2. Deallocate/Convert PAMT
> >>       3. vmxoff
> >>
> >> Then, no SEAMCALL failure because of vmxoff can cause a PAMT-induced #MC
> >> to be missed.
> > 
> > The limitation is once the TDX module is shutdown, it cannot be initialized
> > again unless it is runtimely updated.
> > 
> > Long-termly, if we go this design then there might be other problems when other
> > kernel components are using TDX.  For example, the VT-d driver will need to be
> > changed to support TDX-IO, and it will need to enable TDX module much earlier
> > than KVM to do some initialization.  It might need to some TDX work (e.g.,
> > cleanup) while KVM is unloaded.  I am not super familiar with TDX-IO but looks
> > we might have some problem here if we go with such design.
> 
> The burden for who does vmxon will simply need to change from KVM itself
> to some common code that KVM depends on.  Probably not dissimilar to
> those nutty (sorry folks, just calling it as I see 'em) multi-KVM module

You misspelled "amazing" ;-)

> patches that are floating around.

Joking aside, why shove TDX module ownership into KVM?  It honestly sounds like
a terrible fit, even without the whole TDX-IO mess.  KVM state is largely ephemeral,
in the sense that loading and unloading kvm.ko doesn't allocate/free much memory
or do all that much initialization or teardown.

TDX on the other hand is quite different.  IIRC the PAMT is hundreds of MiB, maybe
over a GiB in most expected use cases?  And also IIRC, TDH.SYS.INIT is rather
long running operation, blocks IRQs, NMIs, (SMIs?), etc.

So rather than shove TDX ownership into KVM and force KVM to figure out how to
manage the TDX module, why not do what us nutty people are suggesting and move
hardware enabling and TDX-module management into a dedicated base module (bonus
points if you call it vac.ko ;-) ).

Alternatively, we could have a dedicated kernel module for TDX, e.g. tdx.ko, and
then have tdx.ko and kvm.ko depend on vac.ko.  But I think that ends up being
quite gross and unnecessary, e.g. in such a setup, kvm-intel.ko ideally wouldn't
take a hard dependency on tdx.ko, as auto-loading tdx.ko would defeat some of the
purpose of the split, and KVM shouldn't fail to load just because TDX isn't supported.
But that'd mean conditionally doing request_module("tdx") or whatever and would
create other conundrums.

(Oof, typing that out made me realize that KVM depends on the PSP driver if
CONFIG_KVM_AMD_SEV=y, even if if the platform owner has no intention of ever using
SEV/SEV-ES.  IIUC, it works because sp_mod_init() just registers a driver, i.e.
doesn't fail out of there's no PSP.  That's kinda gross).

Anyways, vac.ko provides an API to grab a reference to the TDX module, e.g. the
"create a VM" API gets extended to say "create a VM of the TDX variety", and then
vac.ko manages its refcounts to VMX and TDX accordingly.  And KVM obviously keeps
its existing behavior of getting and putting references for each VM.

That way userspace gets to decide when to (un)load tdx.ko without needing to add
a KVM module param or whatever to allow forcefully unloading tdx.ko (which would
be bizarre and probably quite difficult to implement correctly), and unloading
kvm-intel.ko wouldn't require unloading the TDX module.

The end behavior might not be all that different in the short term, but it would
give us more options, e.g. for this erratum, it would be quite easy for vac.ko to
let usersepace choose between keeping VMX "on" (while the TDX module is loaded)
and potentially having imperfect #MC messages.

And out-of-tree hypervisors could even use vac.ko's exported APIs to manage hardware
enabling if they so choose.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ