linux-kernel - Re: [PATCH v1 1/3] x86/tdx: Check for TDX partitioning during early TDX init

Message-ID: <8747ed90-72b8-49bf-8df7-5c5f06056fe2@linux.microsoft.com>
Date:   Mon, 4 Dec 2023 14:39:03 +0100
From:   Jeremi Piotrowski <jpiotrowski@...ux.microsoft.com>
To:     "Reshetova, Elena" <elena.reshetova@...el.com>,
        Borislav Petkov <bp@...en8.de>
Cc:     "linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
        "stefan.bader@...onical.com" <stefan.bader@...onical.com>,
        "tim.gardner@...onical.com" <tim.gardner@...onical.com>,
        "roxana.nicolescu@...onical.com" <roxana.nicolescu@...onical.com>,
        "cascardo@...onical.com" <cascardo@...onical.com>,
        "kys@...rosoft.com" <kys@...rosoft.com>,
        "haiyangz@...rosoft.com" <haiyangz@...rosoft.com>,
        "wei.liu@...nel.org" <wei.liu@...nel.org>,
        "sashal@...nel.org" <sashal@...nel.org>,
        "stable@...r.kernel.org" <stable@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "H. Peter Anvin" <hpa@...or.com>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Ingo Molnar <mingo@...hat.com>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        Michael Kelley <mhkelley58@...il.com>,
        Nikolay Borisov <nik.borisov@...e.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Tom Lendacky <thomas.lendacky@....com>,
        "x86@...nel.org" <x86@...nel.org>,
        "Cui, Dexuan" <decui@...rosoft.com>
Subject: Re: [PATCH v1 1/3] x86/tdx: Check for TDX partitioning during early
 TDX init

On 30/11/2023 09:31, Reshetova, Elena wrote:
> 
>> On Thu, Nov 30, 2023 at 07:08:00AM +0000, Reshetova, Elena wrote:
>>> ...
>>> 3. Normal TDX 1.0 guest that is unaware that it runs in partitioned
>>>    environment
>>> 4. and so on
>>
>> There's a reason I call it a virt zoo.
>>
>>> I don’t know if the AMD architecture would support this whole spectrum of
>>> guests though.
>>
>> I hear threats...
> 
> No threats whatsoever, I just truly don’t know the details of the SEV architecture
> on this and how it is envisioned to operate under this nesting scenario.
> I raised this point to see if we can build a common understanding
> on this. My personal understanding (please correct me) was that SEV
> would also allow different types of L2 guests, so I think we should be
> aligning on this.

I don't think SNP allows the level of freedom you describe. But regardless
of the possibilities, I can't think of users of this technology actually
being interested in running all of these options. Would love to hear someone
speak up.

I think of VMPLs provided by SNP and the TD-partitioning L1/L2 scheme as
the equivalent of ARM's trustzone/EL3 concept. It lets you separate high
privilege operations into a hardware isolated context. In this case it's
within the same confidential computing boundary. That's our use case.
 
> 
>>
>>> Instead we should have a flexible way for the L2 guest to discover
>>> the virt environment it runs in (as modelled by L1 VMM) and the
>>> baseline should not assume it is a TDX or SEV guest, but assume
>>> this is some special virt guest (or legacy guest, whatever approach
>>> is cleaner) and expose additional interfaces to it.
>>
>> You can do flexible all you want but all that guest zoo is using the
>> kernel. The same code base which boots on gazillion incarnations of real
>> hardware. And we have trouble keeping that code base clean already.
> 
> Fully agree, I wasn’t objecting to this. What I was objecting to is making
> explicit assumptions about what the L2 guest under TDX partitioning is.
> 

That's fair, my intention was to have this simple logic (if td-partitioning,
then this and this is given) until a different user of TD-partitioning comes
along and then we figure out which parts generalize.

>>
>> Now, all those weird guests come along, they're more or less
>> "compatible" but not fully. So they have to do an exception here,
>> disable some feature there which they don't want/support/cannot/bla. Or
>> they use a paravisor which does *some* of the work for them so that
>> needs to be accommodated too.
>>
>> And so they start sprinkling all those "differences" around the
>> kernel. And turn it into an unmaintainable mess. We've been here before
>> - last time it was called "if (XEN)"... and we're already getting there
>> again only with two L1 encrypted guest technologies. I'm currently
>> working on trimming down some of the SEV mess we've already added...
>>
>> So - and I've said this a bunch of times already - whatever guest type
>> it is, its interaction with the main kernel better be properly designed
>> and abstracted away so that it doesn't turn into a mess.
> 
> Yes, agreed. So what are our options and overall strategy on this?
> We can try to push as much complexity as possible into the L1 VMM in this
> scenario to keep the guest kernel almost free from these sprinkled differences.
> After all, the L1 VMM can emulate whatever it wants for the guest.
> We can also see if there is a true need to add another virtualization
> abstraction here, i.e. "nested encrypted guest". But to justify this one
> we need to have use cases/scenarios where the L1 VMM actually cannot run
> the L2 guest (legacy or TDX enabled) as it is.
> @Jeremi Piotrowski do you have such use cases/scenarios you can describe?
> Any other options we should be considering as an overall strategy?

Just taking a step back: we're big SNP and TDX users. The only kind of guest
that meets our users' needs on both SNP and TDX and that we support across
our server fleet is closest to what you listed as 2:
"guest with a CoCo security module (paravisor) and targeted CoCo enlightenments".

We're aligned on the need to push complexity out of the kernel, which is exactly
what has happened (also across vendors): the guest is mostly unconcerned by the
differences between TDX and SNP (except for the notification hypercall in the I/O path),
does not need all the changes in the early boot code that TDX/SNP have forced,
switches page visibility with the same hypercall for both, etc.

I'm not aware of use cases for fully legacy guests, and my guess is they would suffer
from excessive overhead.

I am also not aware of use cases for "pretending to be a TDX 1.0 guest". Doing that
removes opportunities to share kernel code with normal guests and SNP guests on Hyper-V.
I'd also like to point out something that Michael wrote here[1] regarding paravisor
interfaces:
"But it seems like any kind of (forwarding) scheme needs to be a well-defined contract
that would work for both TDX and SEV-SNP."

[1]: https://lore.kernel.org/lkml/SN6PR02MB415717E09C249A31F2A4E229D4BCA@SN6PR02MB4157.namprd02.prod.outlook.com/

Another thing to note: in SNP you *need* to interact with VMPL0 (~L1 VMM) when
running at other VMPLs (e.g. pvalidate and rmpadjust are only possible at VMPL0), so
the kernel needs to cooperate with VMPL0 to operate. From skimming the TD-part
spec, I'm not sure how "supporting fast-path I/O" would be possible while also supporting
a "TDX 1.0 guest" with no TD-part awareness (if you need to trap every TDVMCALL, then
that's not OK).

Now back to the topic at hand: I think what's needed is to stop treating
X86_FEATURE_TDX_GUEST as an all-or-nothing thing. Split out the individual
enlightenments into separate CC attributes and allow them to be selected without
requiring you to buy the whole zoo. I don't think we need a "nested encrypted guest"
abstraction.
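
To make the idea concrete, here is a minimal userspace sketch, not kernel code.
Every name in it (demo_cc_attr, demo_guest, demo_cc_has, demo_cc_count) is made up
for illustration and is not one of the kernel's existing CC attributes. Which
attributes a paravisor-backed guest would really select is also an assumption,
loosely based on the description above (it keeps the I/O notification path,
skips the early boot changes, and uses a common hypercall for page visibility).

```c
#include <stdbool.h>

/*
 * Hypothetical, illustration-only attributes. These are NOT entries of the
 * kernel's enum cc_attr; they only stand in for "individual enlightenments".
 */
enum demo_cc_attr {
	DEMO_CC_ATTR_TDX_PAGE_CONVERSION, /* guest flips page visibility via TDX calls */
	DEMO_CC_ATTR_TDVMCALL_IO,         /* I/O notifications go through TDVMCALL */
	DEMO_CC_ATTR_EARLY_BOOT_CHANGES,  /* TDX-specific early boot handling */
	DEMO_CC_ATTR_COUNT
};

#define DEMO_BIT(a) (1u << (a))

/* A guest flavour is described by the set of attributes it opts into. */
struct demo_guest {
	const char *name;
	unsigned int attrs; /* bit n set => attribute n is enabled */
};

/* All-or-nothing model: a plain TDX guest enables every attribute. */
const struct demo_guest demo_plain_tdx = {
	.name  = "plain TDX guest",
	.attrs = DEMO_BIT(DEMO_CC_ATTR_TDX_PAGE_CONVERSION) |
		 DEMO_BIT(DEMO_CC_ATTR_TDVMCALL_IO) |
		 DEMO_BIT(DEMO_CC_ATTR_EARLY_BOOT_CHANGES),
};

/* Split model (assumed selection): an L2 guest behind a paravisor picks a subset. */
const struct demo_guest demo_paravisor_l2 = {
	.name  = "L2 guest behind a paravisor",
	.attrs = DEMO_BIT(DEMO_CC_ATTR_TDVMCALL_IO),
};

/* Query a single attribute instead of one global "is this a TDX guest" flag. */
bool demo_cc_has(const struct demo_guest *g, enum demo_cc_attr attr)
{
	return (g->attrs & DEMO_BIT(attr)) != 0;
}

/* Count how many attributes a guest flavour has enabled. */
unsigned int demo_cc_count(const struct demo_guest *g)
{
	unsigned int n = 0;

	for (int a = 0; a < DEMO_CC_ATTR_COUNT; a++)
		if (demo_cc_has(g, (enum demo_cc_attr)a))
			n++;
	return n;
}
```

With this shape, a call site asks demo_cc_has(guest, DEMO_CC_ATTR_TDVMCALL_IO)
rather than testing a single feature flag, so a guest that needs only one
enlightenment does not drag in the rest.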

Jeremi

> 
> Best Regards,
> Elena.
> 
>>
>> Thx.
>>
>> --
>> Regards/Gruss,
>>     Boris.
>>
>> https://people.kernel.org/tglx/notes-about-netiquette