linux-kernel - RE: [EXTERNAL] Re: "Paravisor" Feature Enumeration

lists.openwall.net: Open Source and information security mailing list archives
Message-ID: <CH8PR21MB5222D4771880715FC8104835CA87A@CH8PR21MB5222.namprd21.prod.outlook.com>
Date: Tue, 6 Jan 2026 23:01:10 +0000
From: Jon Lange <jlange@...rosoft.com>
To: Andrew Cooper <andrew.cooper3@...rix.com>, Dave Hansen
	<dave.hansen@...el.com>
CC: "Williams, Dan J" <dan.j.williams@...el.com>, Sean Christopherson
	<seanjc@...gle.com>, Paolo Bonzini <pbonzini@...hat.com>, John Starks
	<John.Starks@...rosoft.com>, Will Deacon <will@...nel.org>, Mark Rutland
	<mark.rutland@....com>, "linux-coco@...ts.linux.dev"
	<linux-coco@...ts.linux.dev>, LKML <linux-kernel@...r.kernel.org>,
	"Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
Subject: RE: [EXTERNAL] Re: "Paravisor" Feature Enumeration

Andrew wrote:

> So overall, we're wanting the paravisor to be able to express "You're in
> a confidential VM, but you're not in charge" to the OS.

That is a great way to summarize the goal here.

> In your first example, when you say "attestation report", do you mean of
> the whole encrypted VM, or only the "OS" part of it?  After all, a
> paravisor could be running multiple OSes.

No, a paravisor can only run a single OS.  This is the key defining difference between a paravisor and a nested hypervisor.  This arises out of necessity from the confidential multi-privilege architectures that exist today; there is no architectural support for managing multiple guests.  So you can think of the paravisor as the entity that provides virtualization services to the single OS.

> What you're really describing is "just another hypervisor".  So really,
> on x86, the paravisor (which does control CPUID in this scenario) ought
> to hide the outer data, advertise itself at 0x4000_0000, and Linux wants
> a new paravirt mode for this new kind of virtual platform, which is
> probably not going to be very different from a typical KVM/XenHVM/HyperV
> guest today.

This is the reason that I find it so attractive to embed this in the virtualization driver.  In the case of the Hyper-V paravisor, the paravisor exposes the same Hyper-V interface as the Hyper-V hypervisor does, including all of its synthetic CPUID leaves, synthetic MSRs, and hypercalls.  As you suggest, the OS will boot up, completely unaware that it is running in a confidential VM (because the paravisor hides SNP/TDX/RME) and at some point, when it is discovering the presence of what it thinks of as the "hypervisor", the "hypervisor" (which is the paravisor in this context) can just advertise its unique presence in its own dialect.  Hyper-V is already capable of doing this through a hypervisor feature enumeration called the "isolation configuration".  I think you are arguing the same point that I am increasingly coming to believe: the existing hypervisor interfaces are adequate to express this configuration.  In that case, the challenge before us now is how to teach the kernel that "paravisor mode" is meaningful so that state can be advertised across the system for use by those components that need to know (attestation and TDISP, in my examples).  But if this is a configuration that is enumerated by the virtualization driver, then it can live in neither device tree nor ACPI, because those are passed into the kernel rather than generated by it.
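To make the "own dialect" point concrete, here is a rough C sketch of how a guest might decode Hyper-V's isolation configuration from raw CPUID register values. The leaf number (0x4000000C), the EAX bit 0 "paravisor present" flag, and the EBX[3:0] isolation type with its values are recalled from Linux's Hyper-V TLFS header and should be treated as assumptions to check against the TLFS; decode_isolation_config() and in_paravisor_mode() are hypothetical helpers, not existing kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed leaf number of Hyper-V's isolation configuration (verify vs. TLFS). */
#define HV_CPUID_ISOLATION_CONFIG 0x4000000Cu

/* Assumed bit layout, modelled on Linux's hyperv-tlfs.h definitions. */
#define HV_PARAVISOR_PRESENT   (1u << 0) /* EAX bit 0 */
#define HV_ISOLATION_TYPE_MASK 0xFu      /* EBX bits 3:0 */

enum hv_isolation_type {
    HV_ISOLATION_TYPE_NONE = 0,
    HV_ISOLATION_TYPE_VBS  = 1,
    HV_ISOLATION_TYPE_SNP  = 2,
    HV_ISOLATION_TYPE_TDX  = 3,
};

/* Every isolation type value must fit in the EBX[3:0] field. */
static_assert(HV_ISOLATION_TYPE_TDX <= HV_ISOLATION_TYPE_MASK,
              "isolation type does not fit in EBX[3:0]");

/* Hypothetical decoded view of the leaf. */
struct isolation_config {
    bool paravisor_present;
    enum hv_isolation_type type;
};

/* Decode raw EAX/EBX values returned for HV_CPUID_ISOLATION_CONFIG. */
static struct isolation_config decode_isolation_config(uint32_t eax, uint32_t ebx)
{
    struct isolation_config cfg;

    cfg.paravisor_present = (eax & HV_PARAVISOR_PRESENT) != 0;
    cfg.type = (enum hv_isolation_type)(ebx & HV_ISOLATION_TYPE_MASK);
    return cfg;
}

/*
 * "Paravisor mode": the VM is isolated, but a paravisor owns the
 * confidential trust boundary, so the SNP/TDX CPUID bits are hidden
 * from the guest and it must not manage private memory itself.
 */
static bool in_paravisor_mode(struct isolation_config cfg)
{
    return cfg.paravisor_present && cfg.type != HV_ISOLATION_TYPE_NONE;
}
```

Keeping the decode separate from the CPUID instruction itself means the same logic can be fed from a live leaf or from a test fixture.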

> Do you foresee a need to pass anything other than "here's a handful of
> services that are available to you"?

Assuming we move past the question of "are we in paravisor mode", something that is less clear to me is how components like the attestation driver know how to consume the confidential services that exist.  A fully enlightened OS that knows that it is in charge also knows that it has direct access to all of the platform services that support confidentiality (whether it's specific SNP ABI calls, or TDG.* TDCALL leaves, or GHCB/GHCI interaction, or whatever).  But when running behind a paravisor, some of that access might be restricted, and it might not be possible for the existing drivers to work without modification.  Since none of these paravisor support services have been built yet, it's hard for me to predict what kinds of differences need to exist in these drivers between paravisor mode and fully enlightened mode - it might turn out to be none at all.  I suspect that we're going to have to just try to build something and see where the problems lie in practice, and that will inform how much additional information might need to flow (which might go beyond "these services are available" to "here's how you access them").  I don't think it's very productive to speculate on specifics until we have code to point to, but this is a potential problem worth acknowledging.

My hope is to try to spend some time on supporting attestation with a paravisor in the next several months, but I don't know when I'll be able to set aside the time.  So somebody other than me might end up blazing the trail.

-Jon

-----Original Message-----
From: Andrew Cooper <andrew.cooper3@...rix.com> 
Sent: Tuesday, January 6, 2026 2:39 PM
To: Jon Lange <jlange@...rosoft.com>; Dave Hansen <dave.hansen@...el.com>
Cc: Andrew Cooper <andrew.cooper3@...rix.com>; Williams, Dan J <dan.j.williams@...el.com>; Sean Christopherson <seanjc@...gle.com>; Paolo Bonzini <pbonzini@...hat.com>; John Starks <John.Starks@...rosoft.com>; Will Deacon <will@...nel.org>; Mark Rutland <mark.rutland@....com>; linux-coco@...ts.linux.dev; LKML <linux-kernel@...r.kernel.org>; Edgecombe, Rick P <rick.p.edgecombe@...el.com>
Subject: Re: [EXTERNAL] Re: "Paravisor" Feature Enumeration

On 06/01/2026 2:12 am, Jon Lange wrote:
> Andrew wrote:
>
>> Are we saying that, inside an opaque blob that a customer provides to a CSP to run we might have:
>> * a paravisor and an unaware OS, or
>> * svsm and a fully-aware OS, or
>> * something in-between these two.
>> and we're looking a way to describe which piece of the interior stack owns which capability/service?
>> I think the discussion would benefit greatly from having a couple of concrete examples of data this wants to hold,
>> and how it is to be used at different levels of the interior software stack.
> Here are two examples.  In both examples, the OS is running behind a paravisor but I wouldn't term it an "unaware OS".  Rather, the paravisor is present because of the set of services it provides, and it is running in paravisor mode (not SVSM mode) because the implementation benefits from taking full management responsibility for the confidential trust boundary (e.g. determination of when/how to validate/accept pages).  In such a configuration, where the paravisor has management responsibility for the confidential trust boundary, all of the enlightenments in the guest OS for managing confidentiality state must be suppressed.  The straightforward way to do this is for the paravisor to suppress the confidential VM enumeration information visible to the guest OS (the "SNP available" CPUID bit, or the "TDX active" bit, for example).
>
> Note that this occurs out of necessity because we can't have the paravisor and the guest OS fighting over who has the right/responsibility to execute PVALIDATE, or TDG.MEM.PAGE.ACCEPT, or whatever.  The kernel today only has two concepts of its execution mode: either it is a confidential VM, in which case it takes full responsibility, or it is not a confidential VM, in which case it ignores the responsibility.  When a paravisor (not SVSM) is active, we have to operate in the second mode because the first mode would provoke precisely the conflict we're trying to avoid. 
>
> First example: a confidential VM running under a paravisor wants to obtain an attestation report for itself to pass to a third party to vouch for the fact that it is a confidential VM.  Assume in this example that the relying party is aware of the paravisor and the paravisor's measurements, so the evidence provided in such an attestation report can successfully be verified as authentic.  In order for this to be possible, the kernel has to know that it's running in a confidential VM in a mode where attestation reports are available but where the responsibility for confidential memory state management is suppressed.  This is a third state beyond the two states described above.  This isn't just a userspace problem because access to the attestation service is mediated by a kernel-mode driver that needs to know how to configure itself (such configuration today is based on CPUID and not on ACPI).
>
> Second example: a confidential VM running under a paravisor determines that one of the devices available to it is a TDISP device that requires the OS - not the paravisor - to perform the operations required to configure the device, to obtain and verify its attestation information, and to consent to activating the device in the TDISP RUN state.  In order for the OS to be able to execute that sequence, it has to know that it is running as a confidential VM so it knows that TDISP configuration may be necessary.

Thank you - that is helpful.

So overall, we're wanting the paravisor to be able to express "You're in
a confidential VM, but you're not in charge" to the OS.
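One way to picture the "third state" from the first example is as a mode enum plus the responsibilities each mode carries. Everything below is a hypothetical model, not kernel code: the names are invented for illustration, and the paravisor row assumes, as in the two examples quoted above, that the paravisor offers attestation and leaves TDISP setup to the OS. A real paravisor might expose a different set.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical guest execution modes. The kernel today only
 * distinguishes the first two.
 */
enum cvm_mode {
    CVM_MODE_NONE,        /* not a confidential VM */
    CVM_MODE_ENLIGHTENED, /* CVM; the OS owns the trust boundary */
    CVM_MODE_PARAVISOR,   /* CVM; a paravisor owns the trust boundary */
};

/* Hypothetical per-mode responsibilities. */
struct cvm_caps {
    bool manages_private_memory;  /* e.g. PVALIDATE / TDG.MEM.PAGE.ACCEPT */
    bool may_request_attestation; /* attestation reports reachable by the OS */
    bool may_configure_tdisp;     /* OS drives TDISP device setup */
};

static struct cvm_caps cvm_caps_for(enum cvm_mode mode)
{
    switch (mode) {
    case CVM_MODE_NONE:
        return (struct cvm_caps){ false, false, false };
    case CVM_MODE_ENLIGHTENED:
        return (struct cvm_caps){ true, true, true };
    case CVM_MODE_PARAVISOR:
        /* "You're in a confidential VM, but you're not in charge." */
        return (struct cvm_caps){ false, true, true };
    }
    assert(!"unknown cvm_mode");
    return (struct cvm_caps){ false, false, false };
}
```

The point of the sketch is that paravisor mode is not a subset of either existing mode: it drops memory management like the non-CVM case while keeping the confidential services of the enlightened case.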

Hiding the SNP / TDX bit is of course necessary.  They have well defined
meanings which the OS cannot use when it's not in charge.

In your first example, when you say "attestation report", do you mean of
the whole encrypted VM, or only the "OS" part of it?  After all, a
paravisor could be running multiple OSes.

Whichever it is, this is clearly a service provided by the paravisor,
with some kind of API that's going to be of the form "execute
VM(M)CALL/etc with these regs".  TDISP also involves CPU-initiated
actions, some of which may need a paravisor API.


What you're really describing is "just another hypervisor".  So really,
on x86, the paravisor (which does control CPUID in this scenario) ought
to hide the outer data, advertise itself at 0x4000_0000, and Linux wants
a new paravirt mode for this new kind of virtual platform, which is
probably not going to be very different from a typical KVM/XenHVM/HyperV
guest today.

Anything else, and it seems like you're just re-inventing the wheel but
a little more square.
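For reference, the 0x4000_0000 base leaf returns the highest supported leaf of the range in EAX and a 12-byte vendor signature in EBX, ECX and EDX. A minimal C sketch of reassembling that signature follows. hv_signature(), hv_signature_is() and read_hv_signature() are hypothetical helpers; the live read relies on the __cpuid macro from GCC/Clang's <cpuid.h> and is therefore compiled only on x86.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Base of the hypervisor CPUID range (0x4000_0000). */
#define HYPERVISOR_CPUID_BASE 0x40000000u

/*
 * Reassemble the 12-byte vendor signature from EBX, ECX, EDX of the base
 * leaf. Each register carries four ASCII bytes, lowest byte first.
 */
static void hv_signature(uint32_t ebx, uint32_t ecx, uint32_t edx, char out[13])
{
    const uint32_t regs[3] = { ebx, ecx, edx };

    for (int r = 0; r < 3; r++)
        for (int b = 0; b < 4; b++)
            out[r * 4 + b] = (char)((regs[r] >> (8 * b)) & 0xFFu);
    out[12] = '\0';
}

/* Hypothetical convenience check against an expected signature string. */
static bool hv_signature_is(uint32_t ebx, uint32_t ecx, uint32_t edx,
                            const char *expected)
{
    char sig[13];

    assert(expected != NULL);
    hv_signature(ebx, ecx, edx, sig);
    return strcmp(sig, expected) == 0;
}

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>

/*
 * Read the live signature. Only meaningful when the "hypervisor present"
 * bit, CPUID.1:ECX[31], is set.
 */
static void read_hv_signature(char out[13])
{
    unsigned int eax, ebx, ecx, edx;

    __cpuid(HYPERVISOR_CPUID_BASE, eax, ebx, ecx, edx);
    (void)eax; /* highest leaf in the range; unused here */
    hv_signature(ebx, ecx, edx, out);
}
#endif
```

Hyper-V reports "Microsoft Hv" here; a paravisor following this suggestion would advertise its own signature instead, and a guest's existing hypervisor detection would find it without changes.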

Do you foresee a need to pass anything other than "here's a handful of
services that are available to you"?  An ACPI table might be an
approach, but this seems like it could be a leaf or two and nothing more.
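As an illustration of "a leaf or two", here is a sketch of a services bitmap that a paravisor could expose inside its own CPUID range. Everything here is hypothetical: the leaf offset, the bit assignments and the helper names come from no specification. The only borrowed convention is that EAX of the base leaf reports the highest valid leaf, so a guest must check it before trusting anything beyond the base.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * HYPOTHETICAL services leaf, at a fixed offset from whatever base
 * (0x4000_0000, 0x4000_0100, ...) the paravisor's signature was found at.
 * No specification defines this offset or these bits.
 */
#define PV_CPUID_SERVICES_OFFSET 0x10u
#define PV_SVC_ATTESTATION       (1u << 0) /* attestation reports via the paravisor */
#define PV_SVC_TDISP             (1u << 1) /* OS drives TDISP device setup itself */

/* Keep the leaf inside the (assumed) 0x100-leaf window per hypervisor base. */
static_assert(PV_CPUID_SERVICES_OFFSET < 0x100u, "services leaf outside window");

struct pv_services {
    bool attestation;
    bool tdisp;
};

static uint32_t pv_services_leaf(uint32_t base)
{
    return base + PV_CPUID_SERVICES_OFFSET;
}

/*
 * Decode the services bitmap. 'max_leaf' is EAX of the base leaf; if the
 * services leaf lies beyond it, treat every service as absent.
 */
static struct pv_services pv_decode_services(uint32_t base, uint32_t max_leaf,
                                             uint32_t services_eax)
{
    struct pv_services s = { false, false };

    if (pv_services_leaf(base) > max_leaf)
        return s;
    s.attestation = (services_eax & PV_SVC_ATTESTATION) != 0;
    s.tdisp       = (services_eax & PV_SVC_TDISP) != 0;
    return s;
}
```

Treating a missing leaf or a clear bit as "not offered" lets older guests and older paravisors interoperate without any extra negotiation.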


There's no common enumeration scheme between different architectures,
but I'm a firm believer that things ought to be enumerated in the
typical way for the architecture/platform.  This means CPUID on x86, and
things like devicetree on ARM.  It's slightly ugly duplicating
information, but it's less ugly than shoehorning a non-typical
enumeration scheme into an existing infrastructure.

~Andrew